GCP

1

How do I set up GCP for free?

To access Google Cloud Platform (GCP) services for free, you can use the following offerings:

1. Google Cloud Free Program:

  • $300 Free Credit: New users receive $300 in credits, valid for 90 days, to explore and evaluate GCP services.

  • Always Free Tier: Access select GCP products with monthly usage limits at no cost, such as:

    • Compute Engine: 1 non-preemptible e2-micro VM instance per month in specific regions.

    • Cloud Storage: 5 GB of Regional Storage per month.

    • BigQuery: 1 TB of querying data per month.

    • Cloud Functions: 2 million invocations per month.

    • Cloud Pub/Sub: 10 GB of messages per month.

    • Cloud Shell: 50 hours per week.

    • Secret Manager: 10,000 access operations per month.

    • Vision AI: 1,000 units per month.

    • Translate API: 500,000 characters per month.

    • Natural Language API: 5,000 units per month.

    • AutoML Translation: 500,000 translated characters per month.

    • Video Intelligence API: 1,000 units per month.

    • Workflows: 5,000 free internal steps per month.

    • Cloud Source Repositories: Multiple private Git repositories hosted on Google Cloud, with free access for up to five users.

    • Google Cloud Marketplace: Click-to-deploy, production-grade solutions from Google Cloud partners with free trials of select apps and services.

    • Secret Manager: Securely store API keys, passwords, certificates, and other sensitive data with 6 secret versions per month.

2. Google Maps Platform Monthly Credit:

  • Receive a recurring $200 monthly credit for Google Maps Platform services.

Eligibility:

  • You must be new to GCP and must not have previously signed up for the Free Trial.

  • A valid credit card or other payment method is required for account verification, but you won't be charged unless you upgrade to a paid account.

Getting Started:

  1. Visit cloud.google.com and click "Get started for free."

  2. Follow the prompts to create your GCP account and set up billing information.

  3. Access the Google Cloud Console to begin exploring services.

Important Considerations:

  • Usage Monitoring: Regularly monitor your usage to stay within free tier limits and avoid unexpected charges.

  • Billing Alerts: Set up billing alerts to notify you when approaching usage limits.

  • Service Availability: Some services may have regional restrictions or varying free tier limits.


2

Scenarios:

Here are five hands-on scenarios with dummy datasets for you to practice, aligned with the job requirements:


Scenario 1: Data Ingestion and Processing with GCP Dataflow

Objective: Ingest a stream of transaction data from a Pub/Sub topic, process it using Apache Beam on Dataflow, and store the cleaned data in Cloud Storage.

Dataset:

Code:
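A minimal Apache Beam sketch, assuming a Pub/Sub topic projects/my-project/topics/transactions that publishes JSON records with transaction_id, amount, and currency fields, and a bucket gs://my-bucket (all names are placeholders). It drops records with a missing amount and writes windowed CSV shards to Cloud Storage:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_and_clean(message):
    # message is the raw Pub/Sub payload (bytes) holding one JSON transaction
    record = json.loads(message.decode("utf-8"))
    if record.get("amount") is None:  # drop incomplete records
        return []
    return [f'{record["transaction_id"]},{record["amount"]},{record["currency"]}']

options = PipelineOptions(streaming=True)  # add --runner=DataflowRunner etc. on the CLI

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/transactions")
     | "CleanToCsv" >> beam.FlatMap(parse_and_clean)
     | "Window" >> beam.WindowInto(beam.window.FixedWindows(300))  # 5-minute output files
     | "Write" >> beam.io.WriteToText("gs://my-bucket/clean/transactions"))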


Scenario 2: Data Warehousing and Analytics with BigQuery

Objective: Load cleaned transaction data from Cloud Storage into BigQuery, then run SQL queries to find the total transaction amounts per currency.

Dataset: Same as Scenario 1 (cleaned CSV from Cloud Storage).

Code:
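A sketch using the google-cloud-bigquery client; the project, dataset, table, and bucket path are placeholders, and credentials come from Application Default Credentials:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.transactions"

# Load the cleaned CSV produced in Scenario 1
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
client.load_table_from_uri(
    "gs://my-bucket/clean/transactions*", table_id, job_config=job_config
).result()

# Total transaction amount per currency
sql = f"""
    SELECT currency, SUM(amount) AS total_amount
    FROM `{table_id}`
    GROUP BY currency
    ORDER BY total_amount DESC
"""
for row in client.query(sql).result():
    print(row.currency, row.total_amount)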


Scenario 3: ETL/ELT Development with Apache Airflow

Objective: Create a DAG in Apache Airflow to perform a daily ETL process for customer data.

Dataset:

Code:
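A minimal Airflow 2.x sketch; the file paths, column names, and extract/load logic are placeholders for whatever the customer dataset actually looks like:

from datetime import datetime
import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # In practice this might pull from an API or a source database
    pd.read_csv("/tmp/customers_raw.csv").to_csv("/tmp/customers_extracted.csv", index=False)

def transform():
    df = pd.read_csv("/tmp/customers_extracted.csv")
    df = df.dropna(subset=["email"]).drop_duplicates("customer_id")
    df.to_csv("/tmp/customers_clean.csv", index=False)

def load():
    # Placeholder: upload to Cloud Storage or BigQuery here
    print(pd.read_csv("/tmp/customers_clean.csv").head())

with DAG(
    dag_id="daily_customer_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3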


Scenario 4: Performance Optimization for BigQuery

Objective: Optimize a query that calculates the average transaction amount per customer while minimizing costs.

Dataset: Use cleaned transaction data loaded in Scenario 2.

Code:
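A sketch of the optimized query (table name and partition column are placeholders); it selects only the needed columns and prunes partitions, matching the tips below, and uses a dry run to estimate scanned bytes before paying for the query:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT customer_id, AVG(amount) AS avg_amount
    FROM `my-project.my_dataset.transactions`
    WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)  -- prune partitions
    GROUP BY customer_id
"""
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)
print(job.total_bytes_processed, "bytes would be scanned")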

Optimization Tips:

  • Use clustered tables on customer_id.

  • Partition tables by timestamp.

  • Avoid SELECT *; specify only required fields.


Scenario 5: Data Modeling for Machine Learning

Objective: Design a schema for customer lifetime value (CLV) prediction, train a simple ML model using Python, and save it to GCP.

Dataset:

Code:
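A minimal scikit-learn sketch with a tiny inlined dummy dataset; the feature columns and the bucket name are assumptions:

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression
from google.cloud import storage

# Dummy purchase-history features per customer
df = pd.DataFrame({
    "num_orders":      [3, 10, 1, 7, 5],
    "avg_order_value": [40.0, 25.5, 80.0, 33.2, 50.1],
    "tenure_days":     [120, 400, 30, 365, 200],
    "clv":             [350.0, 900.0, 90.0, 700.0, 520.0],  # target
})

model = LinearRegression().fit(df[["num_orders", "avg_order_value", "tenure_days"]], df["clv"])
joblib.dump(model, "clv_model.joblib")

# Upload the trained model to Cloud Storage (placeholder bucket)
storage.Client().bucket("my-bucket").blob("models/clv_model.joblib").upload_from_filename("clv_model.joblib")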

These scenarios align with the job requirements and will help you practice hands-on with Python and GCP tools.


3

Here are five more real-time scenarios for each category, with Python examples and dummy datasets:


Data Ingestion and Processing

Scenario 1: Real-Time Stock Price Stream Processing

Objective: Process stock price data from a Pub/Sub topic in real-time and calculate average price per stock.

Dataset:

Code:
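A Beam sketch (topic name and message fields are assumptions) that averages prices per symbol over one-minute windows:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def to_symbol_price(message):
    record = json.loads(message.decode("utf-8"))  # e.g. {"symbol": "GOOG", "price": 176.3}
    return record["symbol"], float(record["price"])

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | beam.io.ReadFromPubSub(topic="projects/my-project/topics/stock-prices")
     | beam.Map(to_symbol_price)
     | beam.WindowInto(beam.window.FixedWindows(60))        # 1-minute windows
     | beam.CombinePerKey(beam.combiners.MeanCombineFn())   # average price per symbol
     | beam.Map(print))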


Scenario 2: IoT Sensor Data Processing

Objective: Ingest IoT sensor data for temperature monitoring and filter abnormal readings.

Dataset:

Code:
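A sketch using a plain Pub/Sub subscriber instead of a full pipeline; the project, subscription, payload fields, and temperature threshold are assumptions:

import json
from google.cloud import pubsub_v1

THRESHOLD_C = 45.0  # assumed cutoff for an "abnormal" reading

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "iot-temperature-sub")

def callback(message):
    reading = json.loads(message.data.decode("utf-8"))  # e.g. {"device_id": "d1", "temp_c": 48.2}
    if reading["temp_c"] > THRESHOLD_C:
        print("Abnormal reading:", reading)
    message.ack()

streaming_pull = subscriber.subscribe(subscription, callback=callback)
try:
    streaming_pull.result(timeout=60)  # listen for one minute in this sketch
except Exception:
    streaming_pull.cancel()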


Scenario 3: Social Media Data Processing

Objective: Extract hashtags from a Twitter stream and store them in Cloud Storage.

Dataset:

Code:
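A sketch with a hard-coded dummy tweet list standing in for the stream; the bucket name is a placeholder:

import re
from google.cloud import storage

# Dummy tweets; a real pipeline would read these from Pub/Sub or the Twitter API
tweets = [
    "Loving the new #GCP #Dataflow features!",
    "Morning run done #fitness",
]

hashtags = [tag for text in tweets for tag in re.findall(r"#\w+", text)]

bucket = storage.Client().bucket("my-bucket")  # placeholder bucket
bucket.blob("social/hashtags.txt").upload_from_string("\n".join(hashtags))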


Scenario 4: Error Log Monitoring

Objective: Monitor logs for errors and send alerts for critical issues.

Dataset:

Code:
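A minimal sketch over dummy log lines; the alerting function is a placeholder for email, Slack, or Pub/Sub publishing:

# Dummy log lines; in practice these might come from Cloud Logging or a log file
logs = [
    "2024-01-01 10:00:01 INFO service started",
    "2024-01-01 10:00:05 ERROR payment gateway timeout",
    "2024-01-01 10:00:09 CRITICAL database unreachable",
]

def send_alert(line):
    # Placeholder: swap in a real notification channel here
    print("ALERT:", line)

for line in logs:
    if " ERROR " in line or " CRITICAL " in line:
        send_alert(line)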


Scenario 5: Weather Data Processing

Objective: Aggregate hourly weather data into daily summaries.

Dataset:

Code:
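A pandas sketch over a generated dummy dataset of hourly temperature readings:

import pandas as pd

# Dummy hourly readings for two days
hourly = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=48, freq="h"),
    "temp_c": [20 + (i % 24) * 0.3 for i in range(48)],
})

# Roll the hourly readings up into daily min/max/mean summaries
daily = (hourly.set_index("timestamp")
               .resample("D")["temp_c"]
               .agg(["min", "max", "mean"]))
print(daily)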


Data Warehousing and Analytics

Scenario 1: Sales Analytics

Objective: Calculate total sales by product category from BigQuery data.

Code:
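A sketch with the BigQuery client; the table and column names are assumptions:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT category, SUM(sale_amount) AS total_sales
    FROM `my-project.retail.sales`   -- placeholder table
    GROUP BY category
    ORDER BY total_sales DESC
"""
print(client.query(sql).to_dataframe())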


Scenario 2: Customer Retention Analysis

Objective: Analyze customer churn rates using BigQuery.

Code:
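A sketch that assumes a customers table with a signup_date DATE column and a BOOL is_churned flag (both placeholders):

from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT
      FORMAT_DATE('%Y-%m', signup_date) AS cohort_month,
      COUNTIF(is_churned) / COUNT(*) AS churn_rate
    FROM `my-project.crm.customers`   -- placeholder table
    GROUP BY cohort_month
    ORDER BY cohort_month
"""
print(client.query(sql).to_dataframe())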


Scenario 3: Inventory Optimization

Objective: Identify products with low stock in BigQuery.

Code:
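A sketch with an assumed inventory table and an assumed "low stock" threshold of 10 units:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT product_id, product_name, stock_qty
    FROM `my-project.retail.inventory`   -- placeholder table
    WHERE stock_qty < 10                 -- assumed threshold
    ORDER BY stock_qty
"""
print(client.query(sql).to_dataframe())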


Scenario 4: Marketing Campaign Effectiveness

Objective: Analyze the click-through rates for email campaigns.

Code:
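A sketch assuming a campaign table with clicks and emails_sent counts per row (placeholder names):

from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT
      campaign_id,
      SUM(clicks) / SUM(emails_sent) AS click_through_rate
    FROM `my-project.marketing.email_campaigns`   -- placeholder table
    GROUP BY campaign_id
    ORDER BY click_through_rate DESC
"""
print(client.query(sql).to_dataframe())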


Scenario 5: Fraud Detection

Objective: Flag transactions exceeding a specific threshold.

Code:
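A sketch with an assumed transactions table and an assumed flagging threshold of 10,000:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT transaction_id, customer_id, amount
    FROM `my-project.payments.transactions`   -- placeholder table
    WHERE amount > 10000                      -- assumed threshold
    ORDER BY amount DESC
"""
print(client.query(sql).to_dataframe())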


The ETL/ELT Development and Data Modeling scenarios follow below:


ETL/ELT Development

Scenario 1: Data Cleaning with Airflow

Objective: Remove duplicates and null values from a dataset and save cleaned data to Cloud Storage.

Code:
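A single-task Airflow sketch; the input path, bucket, and cleaning rules are placeholders:

from datetime import datetime
import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from google.cloud import storage

def clean_and_upload():
    df = pd.read_csv("/tmp/raw_data.csv")   # placeholder input file
    df = df.drop_duplicates().dropna()      # remove duplicates and null rows
    storage.Client().bucket("my-bucket").blob("clean/data.csv") \
        .upload_from_string(df.to_csv(index=False))

with DAG(
    dag_id="data_cleaning",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="clean_and_upload", python_callable=clean_and_upload)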


Scenario 2: Data Transformation for Machine Learning

Objective: Add day-of-week and month columns to timestamps in the dataset.

Code:
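A pandas sketch over a tiny inlined dummy dataset:

import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-05 10:00", "2024-02-14 18:30"]),
    "amount": [120.0, 80.5],
})

# Derived calendar features commonly used as ML inputs
df["day_of_week"] = df["timestamp"].dt.day_name()
df["month"] = df["timestamp"].dt.month
print(df)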


Scenario 3: Real-Time ETL for E-Commerce

Objective: Calculate hourly revenue for incoming orders and save to BigQuery.

Code:
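A batch-style sketch that aggregates dummy orders by hour and loads them into BigQuery; the destination table is a placeholder (a true real-time version would use Dataflow as in the ingestion scenarios):

import pandas as pd
from google.cloud import bigquery

orders = pd.DataFrame({
    "order_time": pd.to_datetime(["2024-01-01 10:05", "2024-01-01 10:40", "2024-01-01 11:10"]),
    "amount": [25.0, 40.0, 15.5],
})

hourly_revenue = (orders.set_index("order_time")
                        .resample("h")["amount"].sum()
                        .reset_index(name="revenue"))

client = bigquery.Client()
client.load_table_from_dataframe(
    hourly_revenue, "my-project.ecommerce.hourly_revenue"   # placeholder table
).result()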


Scenario 4: Log Processing

Objective: Convert raw logs into structured JSON.

Code:
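A sketch that parses dummy access-log lines with a regular expression; the log format is an assumption:

import json
import re

raw_logs = [
    '127.0.0.1 - - [01/Jan/2024:10:00:01] "GET /home HTTP/1.1" 200',
    '10.0.0.5 - - [01/Jan/2024:10:00:03] "POST /login HTTP/1.1" 401',
]

pattern = re.compile(
    r'(?P<ip>\S+) - - \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+)')

structured = [pattern.match(line).groupdict() for line in raw_logs]
print(json.dumps(structured, indent=2))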


Scenario 5: Social Media Analytics

Objective: Ingest Facebook ad performance data, calculate cost per click (CPC), and save it.

Code:
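A pandas sketch over a dummy ad-performance export; a real pipeline would pull these rows from the Marketing API:

import pandas as pd

ads = pd.DataFrame({
    "campaign": ["spring_sale", "retargeting"],
    "spend": [150.0, 90.0],
    "clicks": [300, 450],
})

ads["cpc"] = ads["spend"] / ads["clicks"]          # cost per click
ads.to_csv("ad_performance_with_cpc.csv", index=False)  # or load into BigQuery
print(ads)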


Data Modeling

Scenario 1: Customer Lifetime Value (CLV)

Objective: Train a model to predict CLV based on purchase history.

Code:
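A scikit-learn sketch with dummy per-customer features aggregated from purchase history; the feature set is an assumption:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

data = pd.DataFrame({
    "num_orders":      [3, 10, 1, 7, 5, 2, 8, 6],
    "avg_order_value": [40, 26, 80, 33, 50, 22, 45, 38],
    "days_since_last": [12, 3, 90, 7, 20, 60, 5, 15],
    "clv":             [350, 900, 90, 700, 520, 150, 780, 610],  # target
})

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns="clv"), data["clv"], test_size=0.25, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
print("R^2 on held-out customers:", model.score(X_test, y_test))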


Scenario 2: Recommendation System

Objective: Build a user-product interaction matrix for collaborative filtering.

Code:
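A sketch that pivots dummy interactions into a user-product matrix and computes user-user cosine similarity as a collaborative-filtering starting point:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

interactions = pd.DataFrame({
    "user_id":    ["u1", "u1", "u2", "u3", "u3", "u3"],
    "product_id": ["p1", "p2", "p1", "p2", "p3", "p1"],
    "rating":     [5, 3, 4, 2, 5, 1],
})

# Rows = users, columns = products, 0 where there was no interaction
matrix = interactions.pivot_table(index="user_id", columns="product_id",
                                  values="rating", fill_value=0)

similarity = pd.DataFrame(cosine_similarity(matrix),
                          index=matrix.index, columns=matrix.index)
print(similarity.round(2))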


Scenario 3: Demand Forecasting

Objective: Train a time series model to forecast product demand.

Code:
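A simple lag-feature sketch on generated daily demand; a production model might use ARIMA or Prophet instead, and the weekly pattern in the dummy data is an assumption:

import pandas as pd
from sklearn.linear_model import LinearRegression

demand = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=30, freq="D"),
    "units": [20 + i + (5 if i % 7 in (5, 6) else 0) for i in range(30)],
})

# Lag features: yesterday's demand and demand one week ago
demand["lag_1"] = demand["units"].shift(1)
demand["lag_7"] = demand["units"].shift(7)
train = demand.dropna()

model = LinearRegression().fit(train[["lag_1", "lag_7"]], train["units"])
next_day = model.predict([[demand["units"].iloc[-1], demand["units"].iloc[-7]]])
print("Forecast for the next day:", next_day[0])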


Scenario 4: Fraud Detection

Objective: Use historical data to detect anomalous transactions.

Code:
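An unsupervised sketch using IsolationForest on dummy transactions; the features and contamination rate are assumptions:

import pandas as pd
from sklearn.ensemble import IsolationForest

transactions = pd.DataFrame({
    "amount": [25, 30, 27, 22, 5000, 31, 28, 26, 7200, 24],
    "hour":   [10, 11, 12, 13, 3, 14, 15, 9, 2, 16],
})

# contamination is the assumed share of anomalous transactions
model = IsolationForest(contamination=0.2, random_state=42).fit(transactions)
transactions["is_anomaly"] = model.predict(transactions) == -1
print(transactions[transactions["is_anomaly"]])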


Scenario 5: Churn Prediction

Objective: Train a classification model to predict customer churn.

Code:
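A logistic-regression sketch on dummy customer features; the columns are assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

customers = pd.DataFrame({
    "tenure_months":   [1, 24, 3, 36, 6, 48, 2, 30],
    "monthly_spend":   [20, 70, 25, 90, 30, 110, 22, 80],
    "support_tickets": [5, 0, 4, 1, 3, 0, 6, 1],
    "churned":         [1, 0, 1, 0, 1, 0, 1, 0],  # target
})

X_train, X_test, y_train, y_test = train_test_split(
    customers.drop(columns="churned"), customers["churned"],
    test_size=0.25, random_state=42, stratify=customers["churned"])

model = LogisticRegression().fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))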

These examples should help you practice and implement real-world use cases! Let me know if you need further assistance.


GCP Open-Source Alternatives

If you're looking for an open-source tool that can perform SQL queries on CSV files locally, similar to BigQuery, there are several options you can consider. Here are a few popular ones:

  1. DuckDB:

    • Description: DuckDB is an in-process SQL OLAP database management system. It is designed to be fast and easy to use, and it supports SQL queries on CSV files directly.

    • Features:

      • In-process analytics

      • Support for SQL queries

      • Fast performance

      • Easy integration with Python, R, and other languages

    • Installation: pip install duckdb

    • Usage Example:
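A minimal sketch; the CSV path and column names are placeholders:

import duckdb

# Query a CSV file directly with SQL, BigQuery-style, but fully local
con = duckdb.connect()  # in-memory database
result = con.execute("""
    SELECT currency, SUM(amount) AS total_amount
    FROM 'transactions.csv'        -- placeholder CSV path
    GROUP BY currency
    ORDER BY total_amount DESC
""").df()
print(result)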

  2. SQLite:

    • Description: SQLite is a self-contained, serverless, zero-configuration, transactional SQL database engine. It can be used to perform SQL queries on CSV files with some additional setup.

    • Features:

      • Lightweight and easy to use

      • Supports SQL queries

      • No server setup required

    • Installation: the sqlite3 module ships with the Python standard library; no separate install is needed.

    • Usage Example:
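A sketch showing the "additional setup": SQLite cannot query a CSV directly, so the file is loaded into a table first via pandas (file and column names are placeholders):

import sqlite3
import pandas as pd

df = pd.read_csv("transactions.csv")      # placeholder CSV path
con = sqlite3.connect(":memory:")
df.to_sql("transactions", con, index=False)

result = pd.read_sql_query(
    "SELECT currency, SUM(amount) AS total_amount FROM transactions GROUP BY currency", con)
print(result)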

  3. Pandas:

    • Description: Pandas is a powerful data manipulation library in Python. While it is not a database, it can perform SQL-like operations on CSV files using its built-in functions.

    • Features:

      • Easy to use for data manipulation

      • Supports SQL-like operations

      • Integrates well with other Python libraries

    • Installation: pip install pandas

    • Usage Example:
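A sketch of the SQL-like GROUP BY expressed with pandas operations (file and column names are placeholders):

import pandas as pd

df = pd.read_csv("transactions.csv")      # placeholder CSV path

totals = (df.groupby("currency", as_index=False)["amount"]
            .sum()
            .rename(columns={"amount": "total_amount"})
            .sort_values("total_amount", ascending=False))
print(totals)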

  4. Apache Drill:

    • Description: Apache Drill is a schema-free SQL query engine for Hadoop, NoSQL, and cloud storage. It can query CSV files directly.

    • Features:

      • Supports SQL queries

      • Can query various data sources

      • No need for schema definition

    • Installation:

      • Download and install Apache Drill from the official website.

    • Usage Example:
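A sketch that submits a query over Drill's REST API, assuming Drill is running locally in embedded mode with the REST endpoint on port 8047 and the dfs storage plugin enabled; the file path is a placeholder:

import requests

payload = {
    "queryType": "SQL",
    "query": "SELECT COUNT(*) AS row_count FROM dfs.`/tmp/transactions.csv`",
}
response = requests.post("http://localhost:8047/query.json", json=payload)
print(response.json())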

Each of these tools has its own strengths and use cases, so you can choose the one that best fits your needs. If you need more advanced features like those offered by BigQuery, DuckDB or Apache Drill might be the best choices. For simpler tasks, Pandas or SQLite could be sufficient.


Assignment 1

Dataset:

Basic Assignments

  1. List all unique values in the make column.

  2. Count the number of cars with num-of-doors as four and two.

  3. Find the mean, median, and standard deviation of the price column.

  4. Identify the number of cars for each fuel-type.

  5. Filter rows where horsepower is greater than 150.

  6. List all rows where the body-style is hatchback.

  7. Sort the data by price in ascending order.

  8. Display only the first 10 rows of the dataset.

  9. Count the number of unique values in the engine-type column.

  10. Group the data by fuel-type and calculate the average engine-size.

  11. Identify the minimum and maximum city-mpg.

  12. Filter cars with a compression-ratio above 10.

  13. Retrieve cars where the width is greater than the median width.

  14. Find the number of cars for each drive-wheels category.

  15. Count how many cars have a height above the 90th percentile.

  16. Filter rows where the engine-location is rear.

  17. Identify the average price of cars grouped by make.

  18. Count the total number of cars with missing values in any column.

  19. Find cars with the top 5 lowest curb-weight.

  20. Retrieve all rows where aspiration is turbo.


Intermediate Assignments

  1. Identify the make with the highest number of models in the dataset.

  2. Group the cars by body-style and calculate the sum of curb-weight for each group.

  3. Find the car with the highest compression-ratio for each fuel-type.

  4. Replace missing values in the normalized-losses column with its mean.

  5. Create a derived column weight-category based on curb-weight being light (< 2500), medium (2500–3500), or heavy (> 3500).

  6. Calculate the ratio of horsepower to engine-size for each car.

  7. Find cars where the price is within 10% of the mean price.

  8. Group by make and count the number of unique body-style values.

  9. Identify cars with both city-mpg and highway-mpg above their respective means.

  10. Sort the dataset first by fuel-type and then by price.

  11. Calculate the average price for each engine-type and sort the result in descending order.

  12. Drop rows where num-of-doors has missing values.

  13. Create a new column is-luxury (True if price > 25,000, else False).

  14. Find cars where the bore is between 2.5 and 3.5, inclusive.

  15. Identify the top 3 body-style categories with the highest average price.

  16. Find the most common value in the fuel-system column.

  17. Calculate the difference between highway-mpg and city-mpg for each car.

  18. Retrieve all cars that have the same engine-type and fuel-type.

  19. Find all cars where the make appears only once in the dataset.

  20. Calculate the range (max-min) of length for each drive-wheels.


Hard Assignments

  1. Normalize the price column to be between 0 and 1 using Min-Max scaling.

  2. Create a column efficiency as (city-mpg + highway-mpg) / 2 and rank cars by efficiency.

  3. Identify the car with the highest horsepower for each combination of body-style and fuel-type.

  4. For each make, calculate the standard deviation of price and identify the make with the highest variability.

  5. Identify outliers in the engine-size column using the IQR method.

  6. Simulate a join: group data by make and calculate the average price for each fuel-type.

  7. Calculate the weighted average performance-index using horsepower (50%), engine-size (30%), and compression-ratio (20%).

  8. Write a query to find cars priced in the top 10% for their fuel-type.

  9. Find the correlation matrix for numeric columns in the dataset.

  10. Identify the top 5 cars with the highest power-to-weight ratio.

  11. Create quartiles for the price column and group cars into these quartiles.

  12. Find the most common fuel-system for each drive-wheels category.

  13. Calculate the median curb-weight for each engine-location.

  14. Rank all cars based on compression-ratio and assign a percentile rank.

  15. Calculate the average price for cars where num-of-cylinders is greater than 6.

  16. Find the car with the maximum difference between highway-mpg and city-mpg.

  17. Retrieve all cars where make is among the top 3 makes with the highest average price.

  18. Analyze and visualize the relationship between length, width, and price.

  19. Calculate the weighted-price of each car, assuming weights of 0.6 for engine-size and 0.4 for horsepower.

  20. Create a derived column luxury-score as a weighted sum of price, curb-weight, and horsepower. Rank all cars by this score.


Assignment 2 - Ecommerce Data

Dataset:

Basic Assignments

  1. Create a table for storing user details with columns for User_ID, Name, and Email.

  2. Insert 5 sample rows into the users table.

  3. Write a query to fetch all rows from the users table.

  4. Select only the Name and Email columns from the users table.

  5. Filter rows where the Name contains the letter "a".

  6. Use ORDER BY to sort the users table alphabetically by Name.

  7. Add a column Age to the users table and update it with random values.

  8. Delete the row where the Name is "John".

  9. Find the total number of users in the users table.

  10. Use an alias to rename the Email column in the output.

  11. Create another table for storing products with columns for Product_ID, Category, and Price.

  12. Insert 10 sample products into the products table.

  13. Write a query to fetch all products priced above 500.

  14. Use the LIMIT keyword to display only the first 3 rows from the products table.

  15. Find the average price of all products in the products table.

  16. Write a query to find the MAX and MIN price from the products table.

  17. Add a column Stock to the products table and update it with random values.

  18. Select all products where Stock is greater than 10.

  19. Write a query using CASE to categorize products into "Expensive" or "Cheap" based on a Price threshold of 1000.

  20. Use a subquery to find products with a price above the average price.

  21. Drop the Age column from the users table.

  22. Count the number of unique categories in the products table.

  23. Write a query to find products priced between 500 and 1500.

  24. Create a new table to store sales data with Sale_ID, Product_ID, and Quantity.

  25. Insert sample rows into the sales table.

  26. Find the total quantity sold for each Product_ID.

  27. Write a query to display all rows where Category starts with the letter "E".

  28. Create a view to show products with Stock greater than 50.

  29. Delete all rows from the users table.

  30. Drop the sales table.


Intermediate Assignments

  1. Write an INNER JOIN query to find products purchased by users.

  2. Use a LEFT JOIN to display all users and their purchases, including users with no purchases.

  3. Write a query using GROUP BY to find the total sales for each Category.

  4. Create a temporary table in DuckDB to store aggregated sales data.

  5. Use a HAVING clause to filter product categories with total sales above 10,000.

  6. Use DuckDB’s STRUCT data type to store user addresses in a single column.

  7. Write a query to calculate the running total of sales using WINDOW functions.

  8. Use RANK to rank products based on their sales quantity.

  9. Write a query to generate a pivot table that shows sales for each product by month.

  10. Create a stored procedure in DuckDB to calculate discounts dynamically.

  11. Write a query to find duplicate rows in the users table.

  12. Use DISTINCT ON to fetch the first row for each user based on the latest Purchase_Date.

  13. Write a query using COALESCE to replace null values in the users table.

  14. Use UNION to combine results from two queries: all users and all customers.

  15. Write a query to find users who didn’t purchase anything using a NOT EXISTS clause.

  16. Create an index on the users table to optimize search for User_ID.

  17. Use the JSON extension in DuckDB to parse and query a JSON column.

  18. Write a query to generate a cumulative percentage of sales for each category.

  19. Create a table partitioned by Category and load products data into it.

  20. Use a FULL OUTER JOIN to combine sales and users data.

  21. Write a query to calculate the median of product prices using DuckDB’s functions.

  22. Load data from a CSV file into DuckDB using the COPY command.

  23. Write a query to calculate the daily sales average for the past 7 days.

  24. Implement a query to group data into weekly buckets.

  25. Use a recursive CTE to generate a series of numbers from 1 to 100.

  26. Write a query using DuckDB’s ARRAY data type to store multiple tags for products.

  27. Create a trigger in DuckDB to log changes to the users table.

  28. Write a query to export results into a Parquet file.

  29. Use a CASE statement inside a JOIN to match complex conditions.

  30. Write a query to load Parquet files directly from a cloud storage URL into DuckDB.


Hard Assignments

  1. Write a query to find products frequently bought together using sales data.

  2. Create a materialized view for monthly sales trends.

  3. Use DuckDB’s STRING_AGG function to concatenate all product names by category.

  4. Write a query to perform text search for product descriptions.

  5. Implement a rolling average of sales for the last 30 days using WINDOW functions.

  6. Optimize a query using EXPLAIN and analyze the output.

  7. Create a table using DuckDB’s TEMPORAL TABLE feature for time-series analysis.

  8. Write a query to perform a geospatial join using DuckDB’s GIS extension.

  9. Use MAP data types in DuckDB to store key-value pairs for product attributes.

  10. Write a query to fetch the top 3 products by sales for each category.

  11. Implement a custom function in DuckDB using SQL.

  12. Write a query to detect anomalies in sales data using statistical functions.

  13. Create a query that splits large tables into partitions and processes them incrementally.

  14. Write a recursive CTE to calculate the Fibonacci sequence.

  15. Use EXCEPT to find records in one table but not another.

  16. Write a query to dynamically create tables based on product categories.

  17. Perform sentiment analysis on product reviews using DuckDB’s text-processing capabilities.

  18. Load, transform, and query large datasets in DuckDB memory.

  19. Use the WITH clause to create multiple subqueries and combine their results.

  20. Implement a time-series decomposition of sales data.

  21. Write a query to fetch the longest sequence of consecutive sales days.

  22. Use ARRAY_AGG to create arrays of products purchased by each user.

  23. Perform a join across Parquet and CSV files in DuckDB.

  24. Use DuckDB’s built-in connectors to query directly from SQLite databases.

  25. Write a query to calculate the 90th percentile of product prices.

  26. Use MERGE to update a sales table with new data conditionally.

  27. Create a pipeline to process and query streaming data in DuckDB.

  28. Write a query to calculate the Levenshtein distance between two strings.

  29. Use DuckDB’s Python API to integrate SQL queries with a Python workflow.

  30. Implement a query to calculate dynamic cohort retention rates.


Assignment 3 - Shoe Price Data

Dataset:

Basic Assignments (Foundational SQL and Table Operations)

  1. Create a table named shoes with columns: brand, color, size, price, offer_price.

  2. Insert 10 sample rows into the shoes table.

  3. Write a query to fetch all rows from the shoes table.

  4. Select only the brand and price columns from the shoes table.

  5. Write a query to filter rows where price is greater than 1000.

  6. Use the ORDER BY clause to sort the shoes table by price in descending order.

  7. Add a new column discount_percentage to the shoes table.

  8. Update the discount_percentage column using a formula: 100 * (price - offer_price) / price.

  9. Delete all rows where price is less than 500.

  10. Find the total number of shoes available in the table using the COUNT function.

  11. Write a query to find the maximum and minimum offer_price in the table.

  12. Use the DISTINCT keyword to find all unique brand values.

  13. Write a query to group shoes by brand and calculate the average price for each brand.

  14. Add a new column category and update it with "Premium" if price > 2000, otherwise "Budget".

  15. Drop the category column from the table.

  16. Create a new table named users with columns: user_id, name, email.

  17. Insert sample data into the users table.

  18. Write a query to fetch all rows from the users table.

  19. Find the total number of unique email domains from the email column.

  20. Use a subquery to find shoes priced above the average price.

  21. Create a temporary table to store shoes with a discount percentage greater than 30%.

  22. Write a query using the LIMIT clause to fetch the top 5 most expensive shoes.

  23. Write a query to find all shoes where the brand starts with the letter "A".

  24. Create a view to show only shoes with size > 8.

  25. Write a query to calculate the total revenue from offer_price.

  26. Write a query to replace null values in any column with default values using COALESCE.

  27. Rename the offer_price column to discounted_price.

  28. Drop the shoes table from the database.

  29. Create a new table with a composite primary key (e.g., order_id, product_id).

  30. Use the EXPLAIN command to analyze the query plan for a basic SELECT query.


Intermediate Assignments (Joins, Aggregations, and Window Functions)

  1. Write an INNER JOIN query between shoes and users using a hypothetical user_shoes table.

  2. Write a LEFT JOIN to show all users and the shoes they purchased, even if they haven't bought any.

  3. Use a RIGHT JOIN to fetch all shoes and the corresponding buyers.

  4. Write a FULL OUTER JOIN query to show all shoes and users, regardless of a purchase relationship.

  5. Write a query using a WINDOW FUNCTION to calculate the cumulative revenue from offer_price.

  6. Use RANK to rank shoes based on offer_price within each brand.

  7. Write a query using PARTITION BY to calculate the average price for each brand.

  8. Create a pivot table to show the count of shoes available in each color by size.

  9. Write a recursive CTE to calculate factorial numbers up to 10.

  10. Use a Common Table Expression (CTE) to find the top 3 most expensive shoes in each brand.

  11. Write a query to calculate the percentage of total revenue contributed by each brand.

  12. Use the WITH clause to create multiple subqueries and combine their results.

  13. Write a query to find the average and median price for each brand.

  14. Use ARRAY_AGG to create an array of all colors available for each brand.

  15. Write a query to count the total number of shoes sold for each size using a hypothetical sales table.

  16. Use CASE in a query to classify shoes into "High Discount" and "Low Discount" categories.

  17. Write a query to filter rows using the HAVING clause after a GROUP BY.

  18. Implement a query that finds products purchased in the last 30 days using the CURRENT_DATE function.

  19. Write a query to export the results of a query into a CSV file using DuckDB.

  20. Write a query to load a CSV file into DuckDB using the COPY command.

  21. Use COALESCE to replace any null values in the size column with a default value of 9.

  22. Write a query to find all shoes whose offer_price falls within the 80th percentile of prices.

  23. Create an index on the price column of the shoes table.

  24. Use the JSON extension in DuckDB to parse a JSON column storing additional attributes.

  25. Write a query to calculate a rolling average of offer_price over the past 3 rows.

  26. Write a query to find duplicate rows in the users table based on email.

  27. Write a query to combine results of two tables using UNION.

  28. Use a NOT EXISTS clause to find users who haven’t purchased any shoes.

  29. Create a trigger to automatically calculate and update discount_percentage after inserting a row.

  30. Write a query to back up the shoes table into a Parquet file.


Hard Assignments (Advanced SQL, Analytics, and DuckDB Features)

  1. Write a query to calculate the retention rate of users over different cohorts using hypothetical purchase data.

  2. Use MAP data types in DuckDB to store and query key-value pairs for product attributes.

  3. Write a query to perform geospatial analysis using DuckDB’s GIS extension.

  4. Implement a sentiment analysis pipeline by querying product reviews stored as text in a DuckDB table.

  5. Use EXCEPT to find shoes that are available in one dataset but not in another.

  6. Write a query to dynamically create partitions based on brand and size.

  7. Perform a time-series analysis on a hypothetical sales dataset to find monthly sales trends.

  8. Write a query to fetch the longest sequence of consecutive days with sales.

  9. Use a recursive CTE to calculate the nth Fibonacci number.

  10. Write a query using STRING_AGG to concatenate all product names for each brand.

  11. Use DuckDB’s Parquet reader to query a dataset stored in Parquet format without importing it.

  12. Perform a full-text search to find products whose descriptions contain specific keywords.

  13. Write a query to calculate the Levenshtein distance between two strings in a name column.

  14. Create a pipeline to process and query streaming data in DuckDB.

  15. Use DuckDB’s Python API to dynamically load data and perform complex queries.

  16. Write a query to identify outliers in the price column using statistical methods.

  17. Implement a query to find products frequently purchased together using association rules.

  18. Write a query to detect anomalies in sales data using DuckDB’s statistical functions.

  19. Use MERGE to synchronize two tables based on specific conditions.

  20. Write a query to calculate the Gini coefficient of revenue distribution across brands.

  21. Perform a cross-database query in DuckDB to combine data from SQLite and CSV.

  22. Use DuckDB’s TEMPORAL features to analyze sales over specific time intervals.

  23. Implement a query to calculate weighted averages of discounts for each brand.

  24. Write a query to rank products within deciles based on offer_price.

  25. Perform a K-means clustering analysis on numerical data stored in DuckDB.

  26. Write a query to calculate the Sharpe ratio for sales trends in DuckDB.

  27. Implement a query to calculate the 95th percentile confidence interval for average sales.

  28. Use DuckDB’s built-in connectors to directly query cloud-stored Parquet files.

  29. Perform a geospatial join to match products to their nearest warehouse location.

  30. Write a DuckDB query to integrate and analyze multi-format datasets (CSV, Parquet, JSON).

