GCP
How to set up GCP for free?
To access Google Cloud Platform (GCP) services for free, you can utilize the following offerings:
1. Google Cloud Free Program:
$300 Free Credit: New users receive $300 in credits, valid for 90 days, to explore and evaluate GCP services.
Always Free Tier: Access select GCP products with monthly usage limits at no cost, such as:
Compute Engine: 1 non-preemptible e2-micro VM instance per month in specific regions.
Cloud Storage: 5 GB of Regional Storage per month.
BigQuery: 1 TB of query processing per month.
Cloud Functions: 2 million invocations per month.
Cloud Pub/Sub: 10 GB of messages per month.
Cloud Shell: 50 hours per week.
Secret Manager: Securely store API keys, passwords, certificates, and other sensitive data, with 6 active secret versions and 10,000 access operations per month.
Vision AI: 1,000 units per month.
Translate API: 500,000 characters per month.
Natural Language API: 5,000 units per month.
AutoML Translation: 500,000 translated characters per month.
Video Intelligence API: 1,000 units per month.
Workflows: 5,000 free internal steps per month.
Cloud Source Repositories: Multiple private Git repositories hosted on Google Cloud, free access for up to five users.
Google Cloud Marketplace: Click-to-deploy, production-grade solutions from Google Cloud partners with free trials of select apps and services.
2. Google Maps Platform Monthly Credit:
Receive a recurring $200 monthly credit for Google Maps Platform services.
Eligibility:
You must be new to GCP and must not have previously signed up for the Free Trial.
A valid credit card or other payment method is required for account verification, but you won't be charged unless you upgrade to a paid account.
Getting Started:
Visit the Google Cloud Free Program page.
Click on "Get started for free."
Follow the prompts to create your GCP account and set up billing information.
Access the Google Cloud Console to begin exploring services.
Important Considerations:
Usage Monitoring: Regularly monitor your usage to stay within free tier limits and avoid unexpected charges.
Billing Alerts: Set up billing alerts to notify you when approaching usage limits.
Service Availability: Some services may have regional restrictions or varying free tier limits.
Scenarios:
Here are five realistic scenarios with dummy datasets to help you practice hands-on and align with typical job requirements:
Scenario 1: Data Ingestion and Processing with GCP Dataflow
Objective: Ingest a stream of transaction data from a Pub/Sub topic, process it using Apache Beam on Dataflow, and store the cleaned data in Cloud Storage.
Dataset:
Code:
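A minimal Apache Beam sketch of this pipeline, assuming JSON messages with transaction_id, amount, and currency fields; the project, topic, and bucket names are placeholders:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def clean(raw):
    # Parse the Pub/Sub payload and drop malformed or incomplete records.
    txn = json.loads(raw.decode("utf-8"))
    if txn.get("transaction_id") and txn.get("amount") is not None:
        txn["amount"] = round(float(txn["amount"]), 2)
        yield txn

options = PipelineOptions(streaming=True)  # add --runner=DataflowRunner, --project, --region to run on Dataflow
with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/transactions")  # placeholder topic
     | "Clean" >> beam.FlatMap(clean)
     | "Window" >> beam.WindowInto(beam.window.FixedWindows(60))  # required before writing from a streaming source
     | "ToCsv" >> beam.Map(lambda t: f'{t["transaction_id"]},{t["amount"]},{t.get("currency", "USD")}')
     | "Write" >> beam.io.WriteToText("gs://my-bucket/cleaned/transactions"))  # placeholder bucket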
Scenario 2: Data Warehousing and Analytics with BigQuery
Objective: Load cleaned transaction data from Cloud Storage into BigQuery, then run SQL queries to find the total transaction amounts per currency.
Dataset: Same as Scenario 1 (cleaned CSV from Cloud Storage).
Code:
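A sketch using the google-cloud-bigquery client; the table ID and bucket path are illustrative:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.sales.transactions"  # hypothetical table

# Load the cleaned CSV files produced in Scenario 1.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,
)
client.load_table_from_uri("gs://my-bucket/cleaned/transactions*", table_id, job_config=job_config).result()

# Total transaction amount per currency.
query = f"""
    SELECT currency, SUM(amount) AS total_amount
    FROM `{table_id}`
    GROUP BY currency
    ORDER BY total_amount DESC
"""
for row in client.query(query).result():
    print(row.currency, row.total_amount)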
Scenario 3: ETL/ELT Development with Apache Airflow
Objective: Create a DAG in Apache Airflow to perform a daily ETL process for customer data.
Dataset:
Code:
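One possible DAG, assuming local staging paths and a pandas-based transform; in a real deployment the load step would write to BigQuery or Cloud Storage instead:

from datetime import datetime
import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Pull the day's raw customer file (path is hypothetical).
    pd.read_csv("/tmp/customers_raw.csv").to_parquet("/tmp/customers_raw.parquet")

def transform():
    df = pd.read_parquet("/tmp/customers_raw.parquet")
    df = df.dropna(subset=["customer_id"]).drop_duplicates("customer_id")
    df.to_parquet("/tmp/customers_clean.parquet")

def load():
    pd.read_parquet("/tmp/customers_clean.parquet").to_csv("/tmp/customers_final.csv", index=False)

with DAG("daily_customer_etl", start_date=datetime(2024, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    (PythonOperator(task_id="extract", python_callable=extract)
     >> PythonOperator(task_id="transform", python_callable=transform)
     >> PythonOperator(task_id="load", python_callable=load))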
Scenario 4: Performance Optimization for BigQuery
Objective: Optimize a query that calculates the average transaction amount per customer while minimizing costs.
Dataset: Use cleaned transaction data loaded in Scenario 2.
Code:
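A sketch of the optimization applied via the BigQuery client, assuming the table from Scenario 2 has timestamp and customer_id columns (names are illustrative):

from google.cloud import bigquery

client = bigquery.Client()

# One-time setup: rebuild the table partitioned by day and clustered by customer.
client.query("""
    CREATE TABLE IF NOT EXISTS `my-project.sales.transactions_opt`
    PARTITION BY DATE(timestamp)
    CLUSTER BY customer_id AS
    SELECT * FROM `my-project.sales.transactions`
""").result()

# Prune partitions and name only the columns you need.
job = client.query("""
    SELECT customer_id, AVG(amount) AS avg_amount
    FROM `my-project.sales.transactions_opt`
    WHERE DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    GROUP BY customer_id
""")
job.result()
print("Bytes processed:", job.total_bytes_processed)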
Optimization Tips:
Use clustered tables on customer_id.
Partition tables by timestamp.
Avoid SELECT *; specify only required fields.
Scenario 5: Data Modeling for Machine Learning
Objective: Design a schema for customer lifetime value (CLV) prediction, train a simple ML model using Python, and save it to GCP.
Dataset:
Code:
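A minimal end-to-end sketch: a flat feature table (one row per customer), a scikit-learn regressor, and an upload of the pickled model to a hypothetical Cloud Storage bucket. The features and values are dummy data:

import pickle
import pandas as pd
from sklearn.linear_model import LinearRegression
from google.cloud import storage

# Assumed schema: aggregated purchase history per customer.
df = pd.DataFrame({
    "total_orders": [5, 12, 3, 20, 8],
    "avg_order_value": [40.0, 55.5, 30.0, 70.0, 45.0],
    "tenure_days": [200, 400, 90, 700, 300],
    "clv": [800, 2600, 350, 5200, 1400],  # dummy target
})
model = LinearRegression().fit(df[["total_orders", "avg_order_value", "tenure_days"]], df["clv"])

# Serialize the model and push it to Cloud Storage (bucket name is a placeholder).
with open("/tmp/clv_model.pkl", "wb") as f:
    pickle.dump(model, f)
storage.Client().bucket("my-ml-bucket").blob("models/clv_model.pkl").upload_from_filename("/tmp/clv_model.pkl")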
These scenarios align with the job requirements and will help you practice hands-on with Python and GCP tools.
Here are five more realistic scenarios for each category, with Python examples and dummy datasets:
Data Ingestion and Processing
Scenario 1: Real-Time Stock Price Stream Processing
Objective: Process stock price data from a Pub/Sub topic in real-time and calculate average price per stock.
Dataset:
Code:
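A streaming sketch, assuming JSON messages like {"symbol": "GOOG", "price": 182.5} on a hypothetical topic:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | beam.io.ReadFromPubSub(topic="projects/my-project/topics/stock-prices")  # placeholder topic
     | beam.Map(lambda b: json.loads(b.decode("utf-8")))
     | beam.WindowInto(beam.window.FixedWindows(60))  # average over 1-minute windows
     | beam.Map(lambda t: (t["symbol"], float(t["price"])))
     | beam.combiners.Mean.PerKey()
     | beam.Map(print))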
Scenario 2: IoT Sensor Data Processing
Objective: Ingest IoT sensor data for temperature monitoring and filter abnormal readings.
Dataset:
Code:
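A batch sketch using an inline dummy dataset; the 60 °C cutoff is an assumption:

import apache_beam as beam

readings = [
    {"sensor_id": "s1", "temp_c": 21.5},
    {"sensor_id": "s2", "temp_c": 85.0},  # abnormal
    {"sensor_id": "s3", "temp_c": 19.8},
]
with beam.Pipeline() as p:
    (p
     | beam.Create(readings)
     | beam.Filter(lambda r: r["temp_c"] > 60.0)  # keep only abnormal readings
     | beam.Map(print))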
Scenario 3: Social Media Data Processing
Objective: Extract hashtags from a Twitter stream and store them in Cloud Storage.
Dataset:
Code:
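A sketch that extracts hashtags with a regex; the tweets are inline dummy data and the bucket path is a placeholder:

import re
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create(["Loving #GCP and #DataEngineering today", "no tags here"])
     | beam.FlatMap(lambda tweet: re.findall(r"#\w+", tweet))
     | beam.io.WriteToText("gs://my-bucket/hashtags/part"))  # use a local path to test offline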
Scenario 4: Error Log Monitoring
Objective: Monitor logs for errors and send alerts for critical issues.
Dataset:
Code:
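A sketch that flags ERROR and CRITICAL lines from dummy logs; in production the alert step might publish to Pub/Sub or a paging channel instead of printing:

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create(["INFO started", "ERROR disk full", "CRITICAL db down"])
     | beam.Filter(lambda line: line.startswith(("ERROR", "CRITICAL")))
     | beam.Map(lambda line: f"ALERT: {line}")
     | beam.Map(print))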
Scenario 5: Weather Data Processing
Objective: Aggregate hourly weather data into daily summaries.
Dataset:
Code:
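A pandas sketch that rolls hourly readings up into daily min/max/mean summaries; the 48 hours of readings are dummy values:

import pandas as pd

hourly = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=48, freq="h"),
    "temp_c": [10 + (i % 24) * 0.5 for i in range(48)],  # dummy readings
})
daily = hourly.set_index("ts").resample("D")["temp_c"].agg(["min", "max", "mean"])
print(daily)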
Data Warehousing and Analytics
Scenario 1: Sales Analytics
Objective: Calculate total sales by product category from BigQuery data.
Code:
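A sketch with the BigQuery client; the dataset, table, and column names are assumptions:

from google.cloud import bigquery

query = """
    SELECT category, SUM(amount) AS total_sales
    FROM `my-project.sales.orders`  -- hypothetical table
    GROUP BY category
    ORDER BY total_sales DESC
"""
for row in bigquery.Client().query(query).result():
    print(row.category, row.total_sales)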
Scenario 2: Customer Retention Analysis
Objective: Analyze customer churn rates using BigQuery.
Code:
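One simple definition of churn (no order in the last 90 days); the table and column names are assumptions:

from google.cloud import bigquery

query = """
    SELECT COUNTIF(last_order_date < DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)) / COUNT(*) AS churn_rate
    FROM `my-project.crm.customers`  -- hypothetical table
"""
print(list(bigquery.Client().query(query).result()))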
Scenario 3: Inventory Optimization
Objective: Identify products with low stock in BigQuery.
Code:
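A sketch flagging products below an assumed reorder threshold of 10 units; names are illustrative:

from google.cloud import bigquery

query = """
    SELECT product_id, stock
    FROM `my-project.inventory.products`  -- hypothetical table
    WHERE stock < 10
    ORDER BY stock
"""
for row in bigquery.Client().query(query).result():
    print(row.product_id, row.stock)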
Scenario 4: Marketing Campaign Effectiveness
Objective: Analyze the click-through rates for email campaigns.
Code:
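A sketch computing CTR as clicks over emails sent; SAFE_DIVIDE avoids division-by-zero errors. Table and column names are assumptions:

from google.cloud import bigquery

query = """
    SELECT campaign_id, SAFE_DIVIDE(SUM(clicks), SUM(emails_sent)) AS ctr
    FROM `my-project.marketing.campaign_stats`  -- hypothetical table
    GROUP BY campaign_id
    ORDER BY ctr DESC
"""
for row in bigquery.Client().query(query).result():
    print(row.campaign_id, row.ctr)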
Scenario 5: Fraud Detection
Objective: Flag transactions exceeding a specific threshold.
Code:
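A sketch using an assumed threshold of 10,000 per transaction; names are illustrative:

from google.cloud import bigquery

query = """
    SELECT transaction_id, customer_id, amount
    FROM `my-project.sales.transactions`  -- hypothetical table
    WHERE amount > 10000
"""
for row in bigquery.Client().query(query).result():
    print(dict(row))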
Here is the missing code for the ETL/ELT Development and Data Modeling scenarios:
ETL/ELT Development
Scenario 1: Data Cleaning with Airflow
Objective: Remove duplicates and null values from a dataset and save cleaned data to Cloud Storage.
Code:
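A sketch of the cleaning callable; the paths and bucket name are placeholders. In Airflow it would be wired up as PythonOperator(task_id="clean", python_callable=clean_and_upload):

import pandas as pd
from google.cloud import storage

def clean_and_upload():
    df = pd.read_csv("/tmp/raw.csv")  # hypothetical staging path
    df = df.drop_duplicates().dropna()  # the cleaning rules from the objective
    df.to_csv("/tmp/clean.csv", index=False)
    storage.Client().bucket("my-bucket").blob("clean/clean.csv").upload_from_filename("/tmp/clean.csv")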
Scenario 2: Data Transformation for Machine Learning
Objective: Add day-of-week and month columns to timestamps in the dataset.
Code:
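A pandas sketch with two dummy timestamps:

import pandas as pd

df = pd.DataFrame({"ts": pd.to_datetime(["2024-01-15 10:00", "2024-02-20 18:30"])})
df["day_of_week"] = df["ts"].dt.day_name()  # e.g., "Monday"
df["month"] = df["ts"].dt.month
print(df)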
Scenario 3: Real-Time ETL for E-Commerce
Objective: Calculate hourly revenue for incoming orders and save to BigQuery.
Code:
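A batch-style sketch: resample dummy orders into hourly revenue, then load the result into an assumed BigQuery table (load_table_from_dataframe requires pyarrow):

import pandas as pd
from google.cloud import bigquery

orders = pd.DataFrame({
    "order_ts": pd.to_datetime(["2024-01-01 10:05", "2024-01-01 10:40", "2024-01-01 11:10"]),
    "revenue": [20.0, 35.0, 15.0],  # dummy values
})
hourly = orders.set_index("order_ts").resample("h")["revenue"].sum().reset_index()
bigquery.Client().load_table_from_dataframe(hourly, "my-project.sales.hourly_revenue").result()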
Scenario 4: Log Processing
Objective: Convert raw logs into structured JSON.
Code:
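A sketch assuming a "timestamp level message" log layout; a real log format would need its own regex:

import json
import re

LOG_RE = re.compile(r"(?P<ts>\S+)\s+(?P<level>\w+)\s+(?P<message>.*)")
for line in ["2024-01-01T10:00:00 ERROR disk full", "2024-01-01T10:01:30 INFO started"]:
    match = LOG_RE.match(line)
    if match:
        print(json.dumps(match.groupdict()))  # structured JSON per log line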
Scenario 5: Social Media Analytics
Objective: Ingest Facebook ad performance data, calculate cost per click (CPC), and save it.
Code:
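A pandas sketch with dummy ad rows; zero-click ads are mapped to NaN rather than dividing by zero:

import numpy as np
import pandas as pd

ads = pd.DataFrame({
    "ad_id": ["a1", "a2", "a3"],
    "spend": [120.0, 45.0, 60.0],
    "clicks": [300, 0, 150],  # dummy values
})
ads["cpc"] = ads["spend"] / ads["clicks"].replace(0, np.nan)
ads.to_csv("/tmp/ad_cpc.csv", index=False)  # hypothetical output path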
Data Modeling
Scenario 1: Customer Lifetime Value (CLV)
Objective: Train a model to predict CLV based on purchase history.
Code:
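A sketch with a dummy feature table and a random-forest regressor; the feature choice is illustrative:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "orders": [3, 10, 1, 7, 15, 2, 9, 4],
    "avg_value": [25, 60, 15, 45, 80, 20, 55, 30],
    "clv": [200, 1500, 50, 700, 3000, 90, 1200, 350],  # dummy target
})
X_train, X_test, y_train, y_test = train_test_split(
    df[["orders", "avg_value"]], df["clv"], test_size=0.25, random_state=42)
model = RandomForestRegressor(random_state=42).fit(X_train, y_train)
print("R^2 on held-out rows:", model.score(X_test, y_test))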
Scenario 2: Recommendation System
Objective: Build a user-product interaction matrix for collaborative filtering.
Code:
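A sketch building the matrix with a pandas pivot over dummy interactions:

import pandas as pd

interactions = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u3"],
    "product": ["p1", "p2", "p1", "p3"],
    "rating": [5, 3, 4, 2],  # dummy ratings
})
matrix = interactions.pivot_table(index="user", columns="product", values="rating", fill_value=0)
print(matrix)  # rows: users, columns: products; input to collaborative filtering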
Scenario 3: Demand Forecasting
Objective: Train a time series model to forecast product demand.
Code:
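A deliberately simple baseline (trailing moving average) rather than a full time-series model, over a dummy daily series:

import pandas as pd

demand = pd.Series(
    [100, 120, 130, 125, 140, 150, 160],  # dummy daily demand
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)
forecast = demand.rolling(3).mean().iloc[-1]  # forecast tomorrow as the trailing 3-day mean
print(f"Next-day demand forecast: {forecast:.1f}")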
Scenario 4: Fraud Detection
Objective: Use historical data to detect anomalous transactions.
Code:
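A sketch using scikit-learn's IsolationForest on dummy amounts; the contamination rate is an assumption:

import pandas as pd
from sklearn.ensemble import IsolationForest

txns = pd.DataFrame({"amount": [20, 35, 25, 30, 5000, 28, 22, 4800]})  # dummy values
txns["anomaly"] = IsolationForest(contamination=0.25, random_state=42).fit_predict(txns[["amount"]])
print(txns[txns["anomaly"] == -1])  # -1 marks suspected outliers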
Scenario 5: Churn Prediction
Objective: Train a classification model to predict customer churn.
Code:
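A sketch with dummy features and labels, using logistic regression:

import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "tenure_months": [1, 24, 3, 36, 2, 48, 6, 60],
    "support_tickets": [5, 0, 4, 1, 6, 0, 3, 1],
    "churned": [1, 0, 1, 0, 1, 0, 1, 0],  # dummy labels
})
model = LogisticRegression().fit(df[["tenure_months", "support_tickets"]], df["churned"])
print("Churn probability:", model.predict_proba([[4, 2]])[0][1])  # a new customer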
These examples should help you practice and implement real-world use cases! Let me know if you need further assistance.
GCP OpenSource Alternate
If you're looking for an open-source tool that can perform SQL queries on CSV files locally, similar to BigQuery, there are several options you can consider. Here are a few popular ones:
DuckDB:
Description: DuckDB is an in-process SQL OLAP database management system. It is designed to be fast and easy to use, and it supports SQL queries on CSV files directly.
Features:
In-process analytics
Support for SQL queries
Fast performance
Easy integration with Python, R, and other languages
Installation:
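DuckDB is typically installed with pip:

pip install duckdb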
Usage Example:
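A minimal sketch querying a hypothetical cars.csv in place:

import duckdb

con = duckdb.connect()  # in-memory database
# DuckDB can query a CSV file directly and infer its schema.
for row in con.execute("SELECT make, AVG(price) AS avg_price FROM 'cars.csv' GROUP BY make").fetchall():
    print(row)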
SQLite:
Description: SQLite is a self-contained, serverless, zero-configuration, transactional SQL database engine. It can be used to perform SQL queries on CSV files with some additional setup.
Features:
Lightweight and easy to use
Supports SQL queries
No server setup required
Installation:
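Python's sqlite3 module ships with the standard library, so no separate installation is usually needed.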
Usage Example:
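A sketch that loads a hypothetical cars.csv into an in-memory table, then queries it with SQL:

import csv
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cars (make TEXT, price REAL)")
with open("cars.csv") as f:  # hypothetical file
    rows = [(r["make"], float(r["price"])) for r in csv.DictReader(f)]
con.executemany("INSERT INTO cars VALUES (?, ?)", rows)
for row in con.execute("SELECT make, AVG(price) FROM cars GROUP BY make"):
    print(row)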
Pandas:
Description: Pandas is a powerful data manipulation library in Python. While it is not a database, it can perform SQL-like operations on CSV files using its built-in functions.
Features:
Easy to use for data manipulation
Supports SQL-like operations
Integrates well with other Python libraries
Installation:
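Pandas is typically installed with pip:

pip install pandas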
Usage Example:
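A sketch of SQL-like filtering and aggregation on a hypothetical cars.csv:

import pandas as pd

df = pd.read_csv("cars.csv")  # hypothetical file
expensive = df[df["price"] > 20000]               # like WHERE price > 20000
avg_by_make = df.groupby("make")["price"].mean()  # like GROUP BY make
print(avg_by_make.sort_values(ascending=False).head())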
Apache Drill:
Description: Apache Drill is a schema-free SQL query engine for Hadoop, NoSQL, and cloud storage. It can query CSV files directly.
Features:
Supports SQL queries
Can query various data sources
No need for schema definition
Installation:
Download and install Apache Drill from the official website.
Usage Example:
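In Drill's embedded shell (started with the drill-embedded script), CSV files can be queried in place through the dfs storage plugin; whether columns appear by name or as a generic columns array depends on the plugin's header settings, so treat this as a sketch:

SELECT * FROM dfs.`/tmp/cars.csv` LIMIT 10;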
Each of these tools has its own strengths and use cases, so you can choose the one that best fits your needs. If you need more advanced features like those offered by BigQuery, DuckDB or Apache Drill might be the best choices. For simpler tasks, Pandas or SQLite could be sufficient.
Assignment 1
Dataset:
Basic Assignments
1. List all unique values in the make column.
2. Count the number of cars with num-of-doors as four and two.
3. Find the mean, median, and standard deviation of the price column.
4. Identify the number of cars for each fuel-type.
5. Filter rows where horsepower is greater than 150.
6. List all rows where the body-style is hatchback.
7. Sort the data by price in ascending order.
8. Display only the first 10 rows of the dataset.
9. Count the number of unique values in the engine-type column.
10. Group the data by fuel-type and calculate the average engine-size.
11. Identify the minimum and maximum city-mpg.
12. Filter cars with a compression-ratio above 10.
13. Retrieve cars where the width is greater than the median width.
14. Find the number of cars for each drive-wheels category.
15. Count how many cars have a height above the 90th percentile.
16. Filter rows where the engine-location is rear.
17. Identify the average price of cars grouped by make.
18. Count the total number of cars with missing values in any column.
19. Find cars with the top 5 lowest curb-weight.
20. Retrieve all rows where aspiration is turbo.
Intermediate Assignments
1. Identify the make with the highest number of models in the dataset.
2. Group the cars by body-style and calculate the sum of curb-weight for each group.
3. Find the car with the highest compression-ratio for each fuel-type.
4. Replace missing values in the normalized-losses column with its mean.
5. Create a derived column weight-category based on curb-weight being light (< 2500), medium (2500–3500), or heavy (> 3500).
6. Calculate the ratio of horsepower to engine-size for each car.
7. Find cars where the price is within 10% of the mean price.
8. Group by make and count the number of unique body-style values.
9. Identify cars with both city-mpg and highway-mpg above their respective means.
10. Sort the dataset first by fuel-type and then by price.
11. Calculate the average price for each engine-type and sort the result in descending order.
12. Drop rows where num-of-doors has missing values.
13. Create a new column is-luxury (True if price > 25,000, else False).
14. Find cars where the bore is between 2.5 and 3.5, inclusive.
15. Identify the top 3 body-style categories with the highest average price.
16. Find the most common value in the fuel-system column.
17. Calculate the difference between highway-mpg and city-mpg for each car.
18. Retrieve all cars that have the same engine-type and fuel-type.
19. Find all cars where the make appears only once in the dataset.
20. Calculate the range (max - min) of length for each drive-wheels.
Hard Assignments
1. Normalize the price column to be between 0 and 1 using Min-Max scaling.
2. Create a column efficiency as (city-mpg + highway-mpg) / 2 and rank cars by efficiency.
3. Identify the car with the highest horsepower for each combination of body-style and fuel-type.
4. For each make, calculate the standard deviation of price and identify the make with the highest variability.
5. Identify outliers in the engine-size column using the IQR method.
6. Simulate a join: group data by make and calculate the average price for each fuel-type.
7. Calculate the weighted average performance-index using horsepower (50%), engine-size (30%), and compression-ratio (20%).
8. Write a query to find cars priced in the top 10% for their fuel-type.
9. Find the correlation matrix for numeric columns in the dataset.
10. Identify the top 5 cars with the highest power-to-weight ratio.
11. Create quartiles for the price column and group cars into these quartiles.
12. Find the most common fuel-system for each drive-wheels category.
13. Calculate the median curb-weight for each engine-location.
14. Rank all cars based on compression-ratio and assign a percentile rank.
15. Calculate the average price for cars where num-of-cylinders is greater than 6.
16. Find the car with the maximum difference between highway-mpg and city-mpg.
17. Retrieve all cars where make is among the top 3 makes with the highest average price.
18. Analyze and visualize the relationship between length, width, and price.
19. Calculate the weighted-price of each car, assuming weights of 0.6 for engine-size and 0.4 for horsepower.
20. Create a derived column luxury-score as a weighted sum of price, curb-weight, and horsepower. Rank all cars by this score.
Assignment 2 - Ecommerce Data
Dataset:
Basic Assignments
1. Create a table for storing user details with columns for User_ID, Name, and Email.
2. Insert 5 sample rows into the users table.
3. Write a query to fetch all rows from the users table.
4. Select only the Name and Email columns from the users table.
5. Filter rows where the Name contains the letter "a".
6. Use ORDER BY to sort the users table alphabetically by Name.
7. Add a column Age to the users table and update it with random values.
8. Delete the row where the Name is "John".
9. Find the total number of users in the users table.
10. Use an alias to rename the Email column in the output.
11. Create another table for storing products with columns for Product_ID, Category, and Price.
12. Insert 10 sample products into the products table.
13. Write a query to fetch all products priced above 500.
14. Use the LIMIT keyword to display only the first 3 rows from the products table.
15. Find the average price of all products in the products table.
16. Write a query to find the MAX and MIN price from the products table.
17. Add a column Stock to the products table and update it with random values.
18. Select all products where Stock is greater than 10.
19. Write a query using CASE to categorize products into "Expensive" or "Cheap" based on a Price threshold of 1000.
20. Use a subquery to find products with a price above the average price.
21. Drop the Age column from the users table.
22. Count the number of unique categories in the products table.
23. Write a query to find products priced between 500 and 1500.
24. Create a new table to store sales data with Sale_ID, Product_ID, and Quantity.
25. Insert sample rows into the sales table.
26. Find the total quantity sold for each Product_ID.
27. Write a query to display all rows where Category starts with the letter "E".
28. Create a view to show products with Stock greater than 50.
29. Delete all rows from the users table.
30. Drop the sales table.
Intermediate Assignments
1. Write an INNER JOIN query to find products purchased by users.
2. Use a LEFT JOIN to display all users and their purchases, including users with no purchases.
3. Write a query using GROUP BY to find the total sales for each Category.
4. Create a temporary table in DuckDB to store aggregated sales data.
5. Use a HAVING clause to filter product categories with total sales above 10,000.
6. Use DuckDB’s STRUCT data type to store user addresses in a single column.
7. Write a query to calculate the running total of sales using WINDOW functions.
8. Use RANK to rank products based on their sales quantity.
9. Write a query to generate a pivot table that shows sales for each product by month.
10. Create a stored procedure in DuckDB to calculate discounts dynamically.
11. Write a query to find duplicate rows in the users table.
12. Use DISTINCT ON to fetch the first row for each user based on the latest Purchase_Date.
13. Write a query using COALESCE to replace null values in the users table.
14. Use UNION to combine results from two queries: all users and all customers.
15. Write a query to find users who didn’t purchase anything using a NOT EXISTS clause.
16. Create an index on the users table to optimize search for User_ID.
17. Use the JSON extension in DuckDB to parse and query a JSON column.
18. Write a query to generate a cumulative percentage of sales for each category.
19. Create a table partitioned by Category and load products data into it.
20. Use a FULL OUTER JOIN to combine sales and users data.
21. Write a query to calculate the median of product prices using DuckDB’s functions.
22. Load data from a CSV file into DuckDB using the COPY command.
23. Write a query to calculate the daily sales average for the past 7 days.
24. Implement a query to group data into weekly buckets.
25. Use a recursive CTE to generate a series of numbers from 1 to 100.
26. Write a query using DuckDB’s ARRAY data type to store multiple tags for products.
27. Create a trigger in DuckDB to log changes to the users table.
28. Write a query to export results into a Parquet file.
29. Use a CASE statement inside a JOIN to match complex conditions.
30. Write a query to load Parquet files directly from a cloud storage URL into DuckDB.
Hard Assignments
1. Write a query to find products frequently bought together using sales data.
2. Create a materialized view for monthly sales trends.
3. Use DuckDB’s STRING_AGG function to concatenate all product names by category.
4. Write a query to perform text search for product descriptions.
5. Implement a rolling average of sales for the last 30 days using WINDOW functions.
6. Optimize a query using EXPLAIN and analyze the output.
7. Create a table using DuckDB’s TEMPORAL TABLE feature for time-series analysis.
8. Write a query to perform a geospatial join using DuckDB’s GIS extension.
9. Use MAP data types in DuckDB to store key-value pairs for product attributes.
10. Write a query to fetch the top 3 products by sales for each category.
11. Implement a custom function in DuckDB using SQL.
12. Write a query to detect anomalies in sales data using statistical functions.
13. Create a query that splits large tables into partitions and processes them incrementally.
14. Write a recursive CTE to calculate the Fibonacci sequence.
15. Use EXCEPT to find records in one table but not another.
16. Write a query to dynamically create tables based on product categories.
17. Perform sentiment analysis on product reviews using DuckDB’s text-processing capabilities.
18. Load, transform, and query large datasets in DuckDB memory.
19. Use the WITH clause to create multiple subqueries and combine their results.
20. Implement a time-series decomposition of sales data.
21. Write a query to fetch the longest sequence of consecutive sales days.
22. Use ARRAY_AGG to create arrays of products purchased by each user.
23. Perform a join across Parquet and CSV files in DuckDB.
24. Use DuckDB’s built-in connectors to query directly from SQLite databases.
25. Write a query to calculate the 90th percentile of product prices.
26. Use MERGE to update a sales table with new data conditionally.
27. Create a pipeline to process and query streaming data in DuckDB.
28. Write a query to calculate the Levenshtein distance between two strings.
29. Use DuckDB’s Python API to integrate SQL queries with a Python workflow.
30. Implement a query to calculate dynamic cohort retention rates.
Assignment 3 - Shoe Price Data
Dataset:
Basic Assignments (Foundational SQL and Table Operations)
1. Create a table named shoes with columns: brand, color, size, price, offer_price.
2. Insert 10 sample rows into the shoes table.
3. Write a query to fetch all rows from the shoes table.
4. Select only the brand and price columns from the shoes table.
5. Write a query to filter rows where price is greater than 1000.
6. Use the ORDER BY clause to sort the shoes table by price in descending order.
7. Add a new column discount_percentage to the shoes table.
8. Update the discount_percentage column using a formula: 100 * (price - offer_price) / price.
9. Delete all rows where price is less than 500.
10. Find the total number of shoes available in the table using the COUNT function.
11. Write a query to find the maximum and minimum offer_price in the table.
12. Use the DISTINCT keyword to find all unique brand values.
13. Write a query to group shoes by brand and calculate the average price for each brand.
14. Add a new column category and update it with "Premium" if price > 2000, otherwise "Budget".
15. Drop the category column from the table.
16. Create a new table named users with columns: user_id, name, email.
17. Insert sample data into the users table.
18. Write a query to fetch all rows from the users table.
19. Find the total number of unique email domains from the email column.
20. Use a subquery to find shoes priced above the average price.
21. Create a temporary table to store shoes with a discount percentage greater than 30%.
22. Write a query using the LIMIT clause to fetch the top 5 most expensive shoes.
23. Write a query to find all shoes where the brand starts with the letter "A".
24. Create a view to show only shoes with size > 8.
25. Write a query to calculate the total revenue from offer_price.
26. Write a query to replace null values in any column with default values using COALESCE.
27. Rename the offer_price column to discounted_price.
28. Drop the shoes table from the database.
29. Create a new table with a composite primary key (e.g., order_id, product_id).
30. Use the EXPLAIN command to analyze the query plan for a basic SELECT query.
Intermediate Assignments (Joins, Aggregations, and Window Functions)
1. Write an INNER JOIN query between shoes and users using a hypothetical user_shoes table.
2. Write a LEFT JOIN to show all users and the shoes they purchased, even if they haven't bought any.
3. Use a RIGHT JOIN to fetch all shoes and the corresponding buyers.
4. Write a FULL OUTER JOIN query to show all shoes and users, regardless of a purchase relationship.
5. Write a query using a WINDOW FUNCTION to calculate the cumulative revenue from offer_price.
6. Use RANK to rank shoes based on offer_price within each brand.
7. Write a query using PARTITION BY to calculate the average price for each brand.
8. Create a pivot table to show the count of shoes available in each color by size.
9. Write a recursive CTE to calculate factorial numbers up to 10.
10. Use a Common Table Expression (CTE) to find the top 3 most expensive shoes in each brand.
11. Write a query to calculate the percentage of total revenue contributed by each brand.
12. Use the WITH clause to create multiple subqueries and combine their results.
13. Write a query to find the average and median price for each brand.
14. Use ARRAY_AGG to create an array of all colors available for each brand.
15. Write a query to count the total number of shoes sold for each size using a hypothetical sales table.
16. Use CASE in a query to classify shoes into "High Discount" and "Low Discount" categories.
17. Write a query to filter rows using the HAVING clause after a GROUP BY.
18. Implement a query that finds products purchased in the last 30 days using the CURRENT_DATE function.
19. Write a query to export the results of a query into a CSV file using DuckDB.
20. Write a query to load a CSV file into DuckDB using the COPY command.
21. Use COALESCE to replace any null values in the size column with a default value of 9.
22. Write a query to find all shoes whose offer_price falls within the 80th percentile of prices.
23. Create an index on the price column of the shoes table.
24. Use the JSON extension in DuckDB to parse a JSON column storing additional attributes.
25. Write a query to calculate a rolling average of offer_price over the past 3 rows.
26. Write a query to find duplicate rows in the users table based on email.
27. Write a query to combine results of two tables using UNION.
28. Use a NOT EXISTS clause to find users who haven’t purchased any shoes.
29. Create a trigger to automatically calculate and update discount_percentage after inserting a row.
30. Write a query to back up the shoes table into a Parquet file.
Hard Assignments (Advanced SQL, Analytics, and DuckDB Features)
1. Write a query to calculate the retention rate of users over different cohorts using hypothetical purchase data.
2. Use MAP data types in DuckDB to store and query key-value pairs for product attributes.
3. Write a query to perform geospatial analysis using DuckDB’s GIS extension.
4. Implement a sentiment analysis pipeline by querying product reviews stored as text in a DuckDB table.
5. Use EXCEPT to find shoes that are available in one dataset but not in another.
6. Write a query to dynamically create partitions based on brand and size.
7. Perform a time-series analysis on a hypothetical sales dataset to find monthly sales trends.
8. Write a query to fetch the longest sequence of consecutive days with sales.
9. Use a recursive CTE to calculate the nth Fibonacci number.
10. Write a query using STRING_AGG to concatenate all product names for each brand.
11. Use DuckDB’s Parquet reader to query a dataset stored in Parquet format without importing it.
12. Perform a full-text search to find products whose descriptions contain specific keywords.
13. Write a query to calculate the Levenshtein distance between two strings in a name column.
14. Create a pipeline to process and query streaming data in DuckDB.
15. Use DuckDB’s Python API to dynamically load data and perform complex queries.
16. Write a query to identify outliers in the price column using statistical methods.
17. Implement a query to find products frequently purchased together using association rules.
18. Write a query to detect anomalies in sales data using DuckDB’s statistical functions.
19. Use MERGE to synchronize two tables based on specific conditions.
20. Write a query to calculate the Gini coefficient of revenue distribution across brands.
21. Perform a cross-database query in DuckDB to combine data from SQLite and CSV.
22. Use DuckDB’s TEMPORAL features to analyze sales over specific time intervals.
23. Implement a query to calculate weighted averages of discounts for each brand.
24. Write a query to rank products within deciles based on offer_price.
25. Perform a K-means clustering analysis on numerical data stored in DuckDB.
26. Write a query to calculate the Sharpe ratio for sales trends in DuckDB.
27. Implement a query to calculate the 95th percentile confidence interval for average sales.
28. Use DuckDB’s built-in connectors to directly query cloud-stored Parquet files.
29. Perform a geospatial join to match products to their nearest warehouse location.
30. Write a DuckDB query to integrate and analyze multi-format datasets (CSV, Parquet, JSON).