GCP
How to set up GCP for free?
To access Google Cloud Platform (GCP) services for free, you can utilize the following offerings:
1. Google Cloud Free Program:
$300 Free Credit: New users receive $300 in credits, valid for 90 days, to explore and evaluate GCP services.
Always Free Tier: Access select GCP products with monthly usage limits at no cost, such as:
Compute Engine: 1 non-preemptible e2-micro VM instance per month in specific regions.
Cloud Storage: 5 GB of Regional Storage per month.
BigQuery: 1 TB of query processing per month.
Cloud Functions: 2 million invocations per month.
Cloud Pub/Sub: 10 GB of messages per month.
Cloud Shell: 50 hours per week.
Secret Manager: Securely store API keys, passwords, certificates, and other sensitive data, with 6 active secret versions and 10,000 access operations per month.
Vision AI: 1,000 units per month.
Translate API: 500,000 characters per month.
Natural Language API: 5,000 units per month.
AutoML Translation: 500,000 translated characters per month.
Video Intelligence API: 1,000 units per month.
Workflows: 5,000 free internal steps per month.
Cloud Source Repositories: Multiple private Git repositories hosted on Google Cloud, free access for up to five users.
Google Cloud Marketplace: Click-to-deploy, production-grade solutions from Google Cloud partners with free trials of select apps and services.
2. Google Maps Platform Monthly Credit:
Receive a recurring $200 monthly credit for Google Maps Platform services.
Eligibility:
You must be new to GCP and must not have previously signed up for the Free Trial.
A valid credit card or other payment method is required for account verification, but you won't be charged unless you upgrade to a paid account.
Getting Started:
Visit the Google Cloud Free Program page.
Click on "Get started for free."
Follow the prompts to create your GCP account and set up billing information.
Access the Google Cloud Console to begin exploring services.
Important Considerations:
Usage Monitoring: Regularly monitor your usage to stay within free tier limits and avoid unexpected charges.
Billing Alerts: Set up billing alerts to notify you when approaching usage limits.
Service Availability: Some services may have regional restrictions or varying free tier limits.
Scenarios:
Here are five realistic scenarios with dummy datasets to help you practice hands-on and align with typical job requirements:
Scenario 1: Data Ingestion and Processing with GCP Dataflow
Objective: Ingest a stream of transaction data from a Pub/Sub topic, process it using Apache Beam on Dataflow, and store the cleaned data in Cloud Storage.
Dataset:
Code:
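A minimal Apache Beam sketch of this pipeline, assuming JSON messages with transaction_id, amount, and currency fields; the project, topic, and bucket names are placeholders:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def clean(raw):
    # Parse the Pub/Sub payload and drop malformed or incomplete records.
    txn = json.loads(raw.decode("utf-8"))
    if txn.get("transaction_id") and txn.get("amount") is not None:
        txn["amount"] = round(float(txn["amount"]), 2)
        yield txn

options = PipelineOptions(streaming=True)  # add --runner=DataflowRunner, --project, --region to run on Dataflow
with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/transactions")  # placeholder topic
     | "Clean" >> beam.FlatMap(clean)
     | "Window" >> beam.WindowInto(beam.window.FixedWindows(60))  # required before writing from a streaming source
     | "ToCsv" >> beam.Map(lambda t: f'{t["transaction_id"]},{t["amount"]},{t.get("currency", "USD")}')
     | "Write" >> beam.io.WriteToText("gs://my-bucket/cleaned/transactions"))  # placeholder bucket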
Scenario 2: Data Warehousing and Analytics with BigQuery
Objective: Load cleaned transaction data from Cloud Storage into BigQuery, then run SQL queries to find the total transaction amounts per currency.
Dataset: Same as Scenario 1 (cleaned CSV from Cloud Storage).
Code:
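A sketch using the google-cloud-bigquery client; the table ID and bucket path are illustrative:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.sales.transactions"  # hypothetical table

# Load the cleaned CSV files produced in Scenario 1.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,
)
client.load_table_from_uri("gs://my-bucket/cleaned/transactions*", table_id, job_config=job_config).result()

# Total transaction amount per currency.
query = f"""
    SELECT currency, SUM(amount) AS total_amount
    FROM `{table_id}`
    GROUP BY currency
    ORDER BY total_amount DESC
"""
for row in client.query(query).result():
    print(row.currency, row.total_amount)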
Scenario 3: ETL/ELT Development with Apache Airflow
Objective: Create a DAG in Apache Airflow to perform a daily ETL process for customer data.
Dataset:
Code:
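One possible DAG, assuming local staging paths and a pandas-based transform; in a real deployment the load step would write to BigQuery or Cloud Storage instead:

from datetime import datetime
import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Pull the day's raw customer file (path is hypothetical).
    pd.read_csv("/tmp/customers_raw.csv").to_parquet("/tmp/customers_raw.parquet")

def transform():
    df = pd.read_parquet("/tmp/customers_raw.parquet")
    df = df.dropna(subset=["customer_id"]).drop_duplicates("customer_id")
    df.to_parquet("/tmp/customers_clean.parquet")

def load():
    pd.read_parquet("/tmp/customers_clean.parquet").to_csv("/tmp/customers_final.csv", index=False)

with DAG("daily_customer_etl", start_date=datetime(2024, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    (PythonOperator(task_id="extract", python_callable=extract)
     >> PythonOperator(task_id="transform", python_callable=transform)
     >> PythonOperator(task_id="load", python_callable=load))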
Scenario 4: Performance Optimization for BigQuery
Objective: Optimize a query that calculates the average transaction amount per customer while minimizing costs.
Dataset: Use cleaned transaction data loaded in Scenario 2.
Code:
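A sketch of the optimization applied via the BigQuery client, assuming the table from Scenario 2 has timestamp and customer_id columns (names are illustrative):

from google.cloud import bigquery

client = bigquery.Client()

# One-time setup: rebuild the table partitioned by day and clustered by customer.
client.query("""
    CREATE TABLE IF NOT EXISTS `my-project.sales.transactions_opt`
    PARTITION BY DATE(timestamp)
    CLUSTER BY customer_id AS
    SELECT * FROM `my-project.sales.transactions`
""").result()

# Prune partitions and name only the columns you need.
job = client.query("""
    SELECT customer_id, AVG(amount) AS avg_amount
    FROM `my-project.sales.transactions_opt`
    WHERE DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    GROUP BY customer_id
""")
job.result()
print("Bytes processed:", job.total_bytes_processed)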
Optimization Tips:
Use clustered tables on customer_id.
Partition tables by timestamp.
Avoid SELECT *; specify only required fields.
Scenario 5: Data Modeling for Machine Learning
Objective: Design a schema for customer lifetime value (CLV) prediction, train a simple ML model using Python, and save it to GCP.
Dataset:
Code:
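A minimal end-to-end sketch: a flat feature table (one row per customer), a scikit-learn regressor, and an upload of the pickled model to a hypothetical Cloud Storage bucket. The features and values are dummy data:

import pickle
import pandas as pd
from sklearn.linear_model import LinearRegression
from google.cloud import storage

# Assumed schema: aggregated purchase history per customer.
df = pd.DataFrame({
    "total_orders": [5, 12, 3, 20, 8],
    "avg_order_value": [40.0, 55.5, 30.0, 70.0, 45.0],
    "tenure_days": [200, 400, 90, 700, 300],
    "clv": [800, 2600, 350, 5200, 1400],  # dummy target
})
model = LinearRegression().fit(df[["total_orders", "avg_order_value", "tenure_days"]], df["clv"])

# Serialize the model and push it to Cloud Storage (bucket name is a placeholder).
with open("/tmp/clv_model.pkl", "wb") as f:
    pickle.dump(model, f)
storage.Client().bucket("my-ml-bucket").blob("models/clv_model.pkl").upload_from_filename("/tmp/clv_model.pkl")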
These scenarios align with the job requirements and will help you practice hands-on with Python and GCP tools.
Here are five more realistic scenarios for each category, with Python examples and dummy datasets:
Data Ingestion and Processing
Scenario 1: Real-Time Stock Price Stream Processing
Objective: Process stock price data from a Pub/Sub topic in real-time and calculate average price per stock.
Dataset:
Code:
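A streaming sketch, assuming JSON messages like {"symbol": "GOOG", "price": 182.5} on a hypothetical topic:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | beam.io.ReadFromPubSub(topic="projects/my-project/topics/stock-prices")  # placeholder topic
     | beam.Map(lambda b: json.loads(b.decode("utf-8")))
     | beam.WindowInto(beam.window.FixedWindows(60))  # average over 1-minute windows
     | beam.Map(lambda t: (t["symbol"], float(t["price"])))
     | beam.combiners.Mean.PerKey()
     | beam.Map(print))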
Scenario 2: IoT Sensor Data Processing
Objective: Ingest IoT sensor data for temperature monitoring and filter abnormal readings.
Dataset:
Code:
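A batch sketch using an inline dummy dataset; the 60 °C cutoff is an assumption:

import apache_beam as beam

readings = [
    {"sensor_id": "s1", "temp_c": 21.5},
    {"sensor_id": "s2", "temp_c": 85.0},  # abnormal
    {"sensor_id": "s3", "temp_c": 19.8},
]
with beam.Pipeline() as p:
    (p
     | beam.Create(readings)
     | beam.Filter(lambda r: r["temp_c"] > 60.0)  # keep only abnormal readings
     | beam.Map(print))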
Scenario 3: Social Media Data Processing
Objective: Extract hashtags from a Twitter stream and store them in Cloud Storage.
Dataset:
Code:
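A sketch that extracts hashtags with a regex; the tweets are inline dummy data and the bucket path is a placeholder:

import re
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create(["Loving #GCP and #DataEngineering today", "no tags here"])
     | beam.FlatMap(lambda tweet: re.findall(r"#\w+", tweet))
     | beam.io.WriteToText("gs://my-bucket/hashtags/part"))  # use a local path to test offline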
Scenario 4: Error Log Monitoring
Objective: Monitor logs for errors and send alerts for critical issues.
Dataset:
Code:
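A sketch that flags ERROR and CRITICAL lines from dummy logs; in production the alert step might publish to Pub/Sub or a paging channel instead of printing:

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create(["INFO started", "ERROR disk full", "CRITICAL db down"])
     | beam.Filter(lambda line: line.startswith(("ERROR", "CRITICAL")))
     | beam.Map(lambda line: f"ALERT: {line}")
     | beam.Map(print))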
Scenario 5: Weather Data Processing
Objective: Aggregate hourly weather data into daily summaries.
Dataset:
Code:
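A pandas sketch that rolls hourly readings up into daily min/max/mean summaries; the 48 hours of readings are dummy values:

import pandas as pd

hourly = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=48, freq="h"),
    "temp_c": [10 + (i % 24) * 0.5 for i in range(48)],  # dummy readings
})
daily = hourly.set_index("ts").resample("D")["temp_c"].agg(["min", "max", "mean"])
print(daily)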
Data Warehousing and Analytics
Scenario 1: Sales Analytics
Objective: Calculate total sales by product category from BigQuery data.
Code:
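A sketch with the BigQuery client; the dataset, table, and column names are assumptions:

from google.cloud import bigquery

query = """
    SELECT category, SUM(amount) AS total_sales
    FROM `my-project.sales.orders`  -- hypothetical table
    GROUP BY category
    ORDER BY total_sales DESC
"""
for row in bigquery.Client().query(query).result():
    print(row.category, row.total_sales)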
Scenario 2: Customer Retention Analysis
Objective: Analyze customer churn rates using BigQuery.
Code:
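One simple definition of churn (no order in the last 90 days); the table and column names are assumptions:

from google.cloud import bigquery

query = """
    SELECT COUNTIF(last_order_date < DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)) / COUNT(*) AS churn_rate
    FROM `my-project.crm.customers`  -- hypothetical table
"""
print(list(bigquery.Client().query(query).result()))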
Scenario 3: Inventory Optimization
Objective: Identify products with low stock in BigQuery.
Code:
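A sketch flagging products below an assumed reorder threshold of 10 units; names are illustrative:

from google.cloud import bigquery

query = """
    SELECT product_id, stock
    FROM `my-project.inventory.products`  -- hypothetical table
    WHERE stock < 10
    ORDER BY stock
"""
for row in bigquery.Client().query(query).result():
    print(row.product_id, row.stock)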
Scenario 4: Marketing Campaign Effectiveness
Objective: Analyze the click-through rates for email campaigns.
Code:
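A sketch computing CTR as clicks over emails sent; SAFE_DIVIDE avoids division-by-zero errors. Table and column names are assumptions:

from google.cloud import bigquery

query = """
    SELECT campaign_id, SAFE_DIVIDE(SUM(clicks), SUM(emails_sent)) AS ctr
    FROM `my-project.marketing.campaign_stats`  -- hypothetical table
    GROUP BY campaign_id
    ORDER BY ctr DESC
"""
for row in bigquery.Client().query(query).result():
    print(row.campaign_id, row.ctr)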
Scenario 5: Fraud Detection
Objective: Flag transactions exceeding a specific threshold.
Code:
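A sketch using an assumed threshold of 10,000 per transaction; names are illustrative:

from google.cloud import bigquery

query = """
    SELECT transaction_id, customer_id, amount
    FROM `my-project.sales.transactions`  -- hypothetical table
    WHERE amount > 10000
"""
for row in bigquery.Client().query(query).result():
    print(dict(row))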
Here is the missing code for the ETL/ELT Development and Data Modeling scenarios:
ETL/ELT Development
Scenario 1: Data Cleaning with Airflow
Objective: Remove duplicates and null values from a dataset and save cleaned data to Cloud Storage.
Code:
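A sketch of the cleaning callable; the paths and bucket name are placeholders. In Airflow it would be wired up as PythonOperator(task_id="clean", python_callable=clean_and_upload):

import pandas as pd
from google.cloud import storage

def clean_and_upload():
    df = pd.read_csv("/tmp/raw.csv")  # hypothetical staging path
    df = df.drop_duplicates().dropna()  # the cleaning rules from the objective
    df.to_csv("/tmp/clean.csv", index=False)
    storage.Client().bucket("my-bucket").blob("clean/clean.csv").upload_from_filename("/tmp/clean.csv")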
Scenario 2: Data Transformation for Machine Learning
Objective: Add day-of-week and month columns to timestamps in the dataset.
Code:
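A pandas sketch with two dummy timestamps:

import pandas as pd

df = pd.DataFrame({"ts": pd.to_datetime(["2024-01-15 10:00", "2024-02-20 18:30"])})
df["day_of_week"] = df["ts"].dt.day_name()  # e.g., "Monday"
df["month"] = df["ts"].dt.month
print(df)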
Scenario 3: Real-Time ETL for E-Commerce
Objective: Calculate hourly revenue for incoming orders and save to BigQuery.
Code:
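A batch-style sketch: resample dummy orders into hourly revenue, then load the result into an assumed BigQuery table (load_table_from_dataframe requires pyarrow):

import pandas as pd
from google.cloud import bigquery

orders = pd.DataFrame({
    "order_ts": pd.to_datetime(["2024-01-01 10:05", "2024-01-01 10:40", "2024-01-01 11:10"]),
    "revenue": [20.0, 35.0, 15.0],  # dummy values
})
hourly = orders.set_index("order_ts").resample("h")["revenue"].sum().reset_index()
bigquery.Client().load_table_from_dataframe(hourly, "my-project.sales.hourly_revenue").result()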
Scenario 4: Log Processing
Objective: Convert raw logs into structured JSON.
Code:
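A sketch assuming a "timestamp level message" log layout; a real log format would need its own regex:

import json
import re

LOG_RE = re.compile(r"(?P<ts>\S+)\s+(?P<level>\w+)\s+(?P<message>.*)")
for line in ["2024-01-01T10:00:00 ERROR disk full", "2024-01-01T10:01:30 INFO started"]:
    match = LOG_RE.match(line)
    if match:
        print(json.dumps(match.groupdict()))  # structured JSON per log line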
Scenario 5: Social Media Analytics
Objective: Ingest Facebook ad performance data, calculate cost per click (CPC), and save it.
Code:
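A pandas sketch with dummy ad rows; zero-click ads are mapped to NaN rather than dividing by zero:

import numpy as np
import pandas as pd

ads = pd.DataFrame({
    "ad_id": ["a1", "a2", "a3"],
    "spend": [120.0, 45.0, 60.0],
    "clicks": [300, 0, 150],  # dummy values
})
ads["cpc"] = ads["spend"] / ads["clicks"].replace(0, np.nan)
ads.to_csv("/tmp/ad_cpc.csv", index=False)  # hypothetical output path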
Data Modeling
Scenario 1: Customer Lifetime Value (CLV)
Objective: Train a model to predict CLV based on purchase history.
Code:
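A sketch with a dummy feature table and a random-forest regressor; the feature choice is illustrative:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "orders": [3, 10, 1, 7, 15, 2, 9, 4],
    "avg_value": [25, 60, 15, 45, 80, 20, 55, 30],
    "clv": [200, 1500, 50, 700, 3000, 90, 1200, 350],  # dummy target
})
X_train, X_test, y_train, y_test = train_test_split(
    df[["orders", "avg_value"]], df["clv"], test_size=0.25, random_state=42)
model = RandomForestRegressor(random_state=42).fit(X_train, y_train)
print("R^2 on held-out rows:", model.score(X_test, y_test))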
Scenario 2: Recommendation System
Objective: Build a user-product interaction matrix for collaborative filtering.
Code:
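A sketch building the matrix with a pandas pivot over dummy interactions:

import pandas as pd

interactions = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u3"],
    "product": ["p1", "p2", "p1", "p3"],
    "rating": [5, 3, 4, 2],  # dummy ratings
})
matrix = interactions.pivot_table(index="user", columns="product", values="rating", fill_value=0)
print(matrix)  # rows: users, columns: products; input to collaborative filtering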
Scenario 3: Demand Forecasting
Objective: Train a time series model to forecast product demand.
Code:
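A deliberately simple baseline (trailing moving average) rather than a full time-series model, over a dummy daily series:

import pandas as pd

demand = pd.Series(
    [100, 120, 130, 125, 140, 150, 160],  # dummy daily demand
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)
forecast = demand.rolling(3).mean().iloc[-1]  # forecast tomorrow as the trailing 3-day mean
print(f"Next-day demand forecast: {forecast:.1f}")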
Scenario 4: Fraud Detection
Objective: Use historical data to detect anomalous transactions.
Code:
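A sketch using scikit-learn's IsolationForest on dummy amounts; the contamination rate is an assumption:

import pandas as pd
from sklearn.ensemble import IsolationForest

txns = pd.DataFrame({"amount": [20, 35, 25, 30, 5000, 28, 22, 4800]})  # dummy values
txns["anomaly"] = IsolationForest(contamination=0.25, random_state=42).fit_predict(txns[["amount"]])
print(txns[txns["anomaly"] == -1])  # -1 marks suspected outliers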
Scenario 5: Churn Prediction
Objective: Train a classification model to predict customer churn.
Code:
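A sketch with dummy features and labels, using logistic regression:

import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "tenure_months": [1, 24, 3, 36, 2, 48, 6, 60],
    "support_tickets": [5, 0, 4, 1, 6, 0, 3, 1],
    "churned": [1, 0, 1, 0, 1, 0, 1, 0],  # dummy labels
})
model = LogisticRegression().fit(df[["tenure_months", "support_tickets"]], df["churned"])
print("Churn probability:", model.predict_proba([[4, 2]])[0][1])  # a new customer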
These examples should help you practice and implement real-world use cases! Let me know if you need further assistance.
GCP OpenSource Alternate
If you're looking for an open-source tool that can perform SQL queries on CSV files locally, similar to BigQuery, there are several options you can consider. Here are a few popular ones:
DuckDB:
Description: DuckDB is an in-process SQL OLAP database management system. It is designed to be fast and easy to use, and it supports SQL queries on CSV files directly.
Features:
In-process analytics
Support for SQL queries
Fast performance
Easy integration with Python, R, and other languages
Installation:
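DuckDB is typically installed with pip:

pip install duckdb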
Usage Example:
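A minimal sketch querying a hypothetical cars.csv in place:

import duckdb

con = duckdb.connect()  # in-memory database
# DuckDB can query a CSV file directly and infer its schema.
for row in con.execute("SELECT make, AVG(price) AS avg_price FROM 'cars.csv' GROUP BY make").fetchall():
    print(row)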
SQLite:
Description: SQLite is a self-contained, serverless, zero-configuration, transactional SQL database engine. It can be used to perform SQL queries on CSV files with some additional setup.
Features:
Lightweight and easy to use
Supports SQL queries
No server setup required
Installation:
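Python's sqlite3 module ships with the standard library, so no separate installation is usually needed.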
Usage Example:
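A sketch that loads a hypothetical cars.csv into an in-memory table, then queries it with SQL:

import csv
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cars (make TEXT, price REAL)")
with open("cars.csv") as f:  # hypothetical file
    rows = [(r["make"], float(r["price"])) for r in csv.DictReader(f)]
con.executemany("INSERT INTO cars VALUES (?, ?)", rows)
for row in con.execute("SELECT make, AVG(price) FROM cars GROUP BY make"):
    print(row)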
Pandas:
Description: Pandas is a powerful data manipulation library in Python. While it is not a database, it can perform SQL-like operations on CSV files using its built-in functions.
Features:
Easy to use for data manipulation
Supports SQL-like operations
Integrates well with other Python libraries
Installation:
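Pandas is typically installed with pip:

pip install pandas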
Usage Example:
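A sketch of SQL-like filtering and aggregation on a hypothetical cars.csv:

import pandas as pd

df = pd.read_csv("cars.csv")  # hypothetical file
expensive = df[df["price"] > 20000]               # like WHERE price > 20000
avg_by_make = df.groupby("make")["price"].mean()  # like GROUP BY make
print(avg_by_make.sort_values(ascending=False).head())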
Apache Drill:
Description: Apache Drill is a schema-free SQL query engine for Hadoop, NoSQL, and cloud storage. It can query CSV files directly.
Features:
Supports SQL queries
Can query various data sources
No need for schema definition
Installation:
Download and install Apache Drill from the official website.
Usage Example:
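In Drill's embedded shell (started with the drill-embedded script), CSV files can be queried in place through the dfs storage plugin; whether columns appear by name or as a generic columns array depends on the plugin's header settings, so treat this as a sketch:

SELECT * FROM dfs.`/tmp/cars.csv` LIMIT 10;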
Each of these tools has its own strengths and use cases, so you can choose the one that best fits your needs. If you need more advanced features like those offered by BigQuery, DuckDB or Apache Drill might be the best choices. For simpler tasks, Pandas or SQLite could be sufficient.
Assignment 1
Dataset:
Basic Assignments
1. List all unique values in the make column.
2. Count the number of cars with num-of-doors as four and two.
3. Find the mean, median, and standard deviation of the price column.
4. Identify the number of cars for each fuel-type.
5. Filter rows where horsepower is greater than 150.
6. List all rows where the body-style is hatchback.
7. Sort the data by price in ascending order.
8. Display only the first 10 rows of the dataset.
9. Count the number of unique values in the engine-type column.
10. Group the data by fuel-type and calculate the average engine-size.
11. Identify the minimum and maximum city-mpg.
12. Filter cars with a compression-ratio above 10.
13. Retrieve cars where the width is greater than the median width.
14. Find the number of cars for each drive-wheels category.
15. Count how many cars have a height above the 90th percentile.
16. Filter rows where the engine-location is rear.
17. Identify the average price of cars grouped by make.
18. Count the total number of cars with missing values in any column.
19. Find cars with the top 5 lowest curb-weight.
20. Retrieve all rows where aspiration is turbo.
Intermediate Assignments
1. Identify the make with the highest number of models in the dataset.
2. Group the cars by body-style and calculate the sum of curb-weight for each group.
3. Find the car with the highest compression-ratio for each fuel-type.
4. Replace missing values in the normalized-losses column with its mean.
5. Create a derived column weight-category based on curb-weight being light (< 2500), medium (2500–3500), or heavy (> 3500).
6. Calculate the ratio of horsepower to engine-size for each car.
7. Find cars where the price is within 10% of the mean price.
8. Group by make and count the number of unique body-style values.
9. Identify cars with both city-mpg and highway-mpg above their respective means.
10. Sort the dataset first by fuel-type and then by price.
11. Calculate the average price for each engine-type and sort the result in descending order.
12. Drop rows where num-of-doors has missing values.
13. Create a new column is-luxury (True if price > 25,000, else False).
14. Find cars where the bore is between 2.5 and 3.5, inclusive.
15. Identify the top 3 body-style categories with the highest average price.
16. Find the most common value in the fuel-system column.
17. Calculate the difference between highway-mpg and city-mpg for each car.
18. Retrieve all cars that have the same engine-type and fuel-type.
19. Find all cars where the make appears only once in the dataset.
20. Calculate the range (max - min) of length for each drive-wheels.
Hard Assignments
1. Normalize the price column to be between 0 and 1 using Min-Max scaling.
2. Create a column efficiency as (city-mpg + highway-mpg) / 2 and rank cars by efficiency.
3. Identify the car with the highest horsepower for each combination of body-style and fuel-type.
4. For each make, calculate the standard deviation of price and identify the make with the highest variability.
5. Identify outliers in the engine-size column using the IQR method.
6. Simulate a join: group data by make and calculate the average price for each fuel-type.
7. Calculate the weighted average performance-index using horsepower (50%), engine-size (30%), and compression-ratio (20%).
8. Write a query to find cars priced in the top 10% for their fuel-type.
9. Find the correlation matrix for numeric columns in the dataset.
10. Identify the top 5 cars with the highest power-to-weight ratio.
11. Create quartiles for the price column and group cars into these quartiles.
12. Find the most common fuel-system for each drive-wheels category.
13. Calculate the median curb-weight for each engine-location.
14. Rank all cars based on compression-ratio and assign a percentile rank.
15. Calculate the average price for cars where num-of-cylinders is greater than 6.
16. Find the car with the maximum difference between highway-mpg and city-mpg.
17. Retrieve all cars where make is among the top 3 makes with the highest average price.
18. Analyze and visualize the relationship between length, width, and price.
19. Calculate the weighted-price of each car, assuming weights of 0.6 for engine-size and 0.4 for horsepower.
20. Create a derived column luxury-score as a weighted sum of price, curb-weight, and horsepower. Rank all cars by this score.
Assignment 2 - Ecommerce Data
Dataset:
Basic Assignments
1. Create a table for storing user details with columns for User_ID, Name, and Email.
2. Insert 5 sample rows into the users table.
3. Write a query to fetch all rows from the users table.
4. Select only the Name and Email columns from the users table.
5. Filter rows where the Name contains the letter "a".
6. Use ORDER BY to sort the users table alphabetically by Name.
7. Add a column Age to the users table and update it with random values.
8. Delete the row where the Name is "John".
9. Find the total number of users in the users table.
10. Use an alias to rename the Email column in the output.
11. Create another table for storing products with columns for Product_ID, Category, and Price.
12. Insert 10 sample products into the products table.
13. Write a query to fetch all products priced above 500.
14. Use the LIMIT keyword to display only the first 3 rows from the products table.
15. Find the average price of all products in the products table.
16. Write a query to find the MAX and MIN price from the products table.
17. Add a column Stock to the products table and update it with random values.
18. Select all products where Stock is greater than 10.
19. Write a query using CASE to categorize products into "Expensive" or "Cheap" based on a Price threshold of 1000.
20. Use a subquery to find products with a price above the average price.
21. Drop the Age column from the users table.
22. Count the number of unique categories in the products table.
23. Write a query to find products priced between 500 and 1500.
24. Create a new table to store sales data with Sale_ID, Product_ID, and Quantity.
25. Insert sample rows into the sales table.
26. Find the total quantity sold for each Product_ID.
27. Write a query to display all rows where Category starts with the letter "E".
28. Create a view to show products with Stock greater than 50.
29. Delete all rows from the users table.
30. Drop the sales table.
Intermediate Assignments
1. Write an INNER JOIN query to find products purchased by users.
2. Use a LEFT JOIN to display all users and their purchases, including users with no purchases.
3. Write a query using GROUP BY to find the total sales for each Category.
4. Create a temporary table in DuckDB to store aggregated sales data.
5. Use a HAVING clause to filter product categories with total sales above 10,000.
6. Use DuckDB’s STRUCT data type to store user addresses in a single column.
7. Write a query to calculate the running total of sales using WINDOW functions.
8. Use RANK to rank products based on their sales quantity.
9. Write a query to generate a pivot table that shows sales for each product by month.
10. Create a stored procedure in DuckDB to calculate discounts dynamically.
11. Write a query to find duplicate rows in the users table.
12. Use DISTINCT ON to fetch the first row for each user based on the latest Purchase_Date.
13. Write a query using COALESCE to replace null values in the users table.
14. Use UNION to combine results from two queries: all users and all customers.
15. Write a query to find users who didn’t purchase anything using a NOT EXISTS clause.
16. Create an index on the users table to optimize search for User_ID.
17. Use the JSON extension in DuckDB to parse and query a JSON column.
18. Write a query to generate a cumulative percentage of sales for each category.
19. Create a table partitioned by Category and load products data into it.
20. Use a FULL OUTER JOIN to combine sales and users data.
21. Write a query to calculate the median of product prices using DuckDB’s functions.
22. Load data from a CSV file into DuckDB using the COPY command.
23. Write a query to calculate the daily sales average for the past 7 days.
24. Implement a query to group data into weekly buckets.
25. Use a recursive CTE to generate a series of numbers from 1 to 100.
26. Write a query using DuckDB’s ARRAY data type to store multiple tags for products.
27. Create a trigger in DuckDB to log changes to the users table.
28. Write a query to export results into a Parquet file.
29. Use a CASE statement inside a JOIN to match complex conditions.
30. Write a query to load Parquet files directly from a cloud storage URL into DuckDB.
Hard Assignments
1. Write a query to find products frequently bought together using sales data.
2. Create a materialized view for monthly sales trends.
3. Use DuckDB’s STRING_AGG function to concatenate all product names by category.
4. Write a query to perform text search for product descriptions.
5. Implement a rolling average of sales for the last 30 days using WINDOW functions.
6. Optimize a query using EXPLAIN and analyze the output.
7. Create a table using DuckDB’s TEMPORAL TABLE feature for time-series analysis.
8. Write a query to perform a geospatial join using DuckDB’s GIS extension.
9. Use MAP data types in DuckDB to store key-value pairs for product attributes.
10. Write a query to fetch the top 3 products by sales for each category.
11. Implement a custom function in DuckDB using SQL.
12. Write a query to detect anomalies in sales data using statistical functions.
13. Create a query that splits large tables into partitions and processes them incrementally.
14. Write a recursive CTE to calculate the Fibonacci sequence.
15. Use EXCEPT to find records in one table but not another.
16. Write a query to dynamically create tables based on product categories.
17. Perform sentiment analysis on product reviews using DuckDB’s text-processing capabilities.
18. Load, transform, and query large datasets in DuckDB memory.
19. Use the WITH clause to create multiple subqueries and combine their results.
20. Implement a time-series decomposition of sales data.
21. Write a query to fetch the longest sequence of consecutive sales days.
22. Use ARRAY_AGG to create arrays of products purchased by each user.
23. Perform a join across Parquet and CSV files in DuckDB.
24. Use DuckDB’s built-in connectors to query directly from SQLite databases.
25. Write a query to calculate the 90th percentile of product prices.
26. Use MERGE to update a sales table with new data conditionally.
27. Create a pipeline to process and query streaming data in DuckDB.
28. Write a query to calculate the Levenshtein distance between two strings.
29. Use DuckDB’s Python API to integrate SQL queries with a Python workflow.
30. Implement a query to calculate dynamic cohort retention rates.
Assignment 3 - Shoe Price Data
Dataset:
Basic Assignments (Foundational SQL and Table Operations)
1. Create a table named shoes with columns: brand, color, size, price, offer_price.
2. Insert 10 sample rows into the shoes table.
3. Write a query to fetch all rows from the shoes table.
4. Select only the brand and price columns from the shoes table.
5. Write a query to filter rows where price is greater than 1000.
6. Use the ORDER BY clause to sort the shoes table by price in descending order.
7. Add a new column discount_percentage to the shoes table.
8. Update the discount_percentage column using a formula: 100 * (price - offer_price) / price.
9. Delete all rows where price is less than 500.
10. Find the total number of shoes available in the table using the COUNT function.
11. Write a query to find the maximum and minimum offer_price in the table.
12. Use the DISTINCT keyword to find all unique brand values.
13. Write a query to group shoes by brand and calculate the average price for each brand.
14. Add a new column category and update it with "Premium" if price > 2000, otherwise "Budget".
15. Drop the category column from the table.
16. Create a new table named users with columns: user_id, name, email.
17. Insert sample data into the users table.
18. Write a query to fetch all rows from the users table.
19. Find the total number of unique email domains from the email column.
20. Use a subquery to find shoes priced above the average price.
21. Create a temporary table to store shoes with a discount percentage greater than 30%.
22. Write a query using the LIMIT clause to fetch the top 5 most expensive shoes.
23. Write a query to find all shoes where the brand starts with the letter "A".
24. Create a view to show only shoes with size > 8.
25. Write a query to calculate the total revenue from offer_price.
26. Write a query to replace null values in any column with default values using COALESCE.
27. Rename the offer_price column to discounted_price.
28. Drop the shoes table from the database.
29. Create a new table with a composite primary key (e.g., order_id, product_id).
30. Use the EXPLAIN command to analyze the query plan for a basic SELECT query.
Intermediate Assignments (Joins, Aggregations, and Window Functions)
1. Write an INNER JOIN query between shoes and users using a hypothetical user_shoes table.
2. Write a LEFT JOIN to show all users and the shoes they purchased, even if they haven't bought any.
3. Use a RIGHT JOIN to fetch all shoes and the corresponding buyers.
4. Write a FULL OUTER JOIN query to show all shoes and users, regardless of a purchase relationship.
5. Write a query using a WINDOW FUNCTION to calculate the cumulative revenue from offer_price.
6. Use RANK to rank shoes based on offer_price within each brand.
7. Write a query using PARTITION BY to calculate the average price for each brand.
8. Create a pivot table to show the count of shoes available in each color by size.
9. Write a recursive CTE to calculate factorial numbers up to 10.
10. Use a Common Table Expression (CTE) to find the top 3 most expensive shoes in each brand.
11. Write a query to calculate the percentage of total revenue contributed by each brand.
12. Use the WITH clause to create multiple subqueries and combine their results.
13. Write a query to find the average and median price for each brand.
14. Use ARRAY_AGG to create an array of all colors available for each brand.
15. Write a query to count the total number of shoes sold for each size using a hypothetical sales table.
16. Use CASE in a query to classify shoes into "High Discount" and "Low Discount" categories.
17. Write a query to filter rows using the HAVING clause after a GROUP BY.
18. Implement a query that finds products purchased in the last 30 days using the CURRENT_DATE function.
19. Write a query to export the results of a query into a CSV file using DuckDB.
20. Write a query to load a CSV file into DuckDB using the COPY command.
21. Use COALESCE to replace any null values in the size column with a default value of 9.
22. Write a query to find all shoes whose offer_price falls within the 80th percentile of prices.
23. Create an index on the price column of the shoes table.
24. Use the JSON extension in DuckDB to parse a JSON column storing additional attributes.
25. Write a query to calculate a rolling average of offer_price over the past 3 rows.
26. Write a query to find duplicate rows in the users table based on email.
27. Write a query to combine results of two tables using UNION.
28. Use a NOT EXISTS clause to find users who haven’t purchased any shoes.
29. Create a trigger to automatically calculate and update discount_percentage after inserting a row.
30. Write a query to back up the shoes table into a Parquet file.
Hard Assignments (Advanced SQL, Analytics, and DuckDB Features)
1. Write a query to calculate the retention rate of users over different cohorts using hypothetical purchase data.
2. Use MAP data types in DuckDB to store and query key-value pairs for product attributes.
3. Write a query to perform geospatial analysis using DuckDB’s GIS extension.
4. Implement a sentiment analysis pipeline by querying product reviews stored as text in a DuckDB table.
5. Use EXCEPT to find shoes that are available in one dataset but not in another.
6. Write a query to dynamically create partitions based on brand and size.
7. Perform a time-series analysis on a hypothetical sales dataset to find monthly sales trends.
8. Write a query to fetch the longest sequence of consecutive days with sales.
9. Use a recursive CTE to calculate the nth Fibonacci number.
10. Write a query using STRING_AGG to concatenate all product names for each brand.
11. Use DuckDB’s Parquet reader to query a dataset stored in Parquet format without importing it.
12. Perform a full-text search to find products whose descriptions contain specific keywords.
13. Write a query to calculate the Levenshtein distance between two strings in a name column.
14. Create a pipeline to process and query streaming data in DuckDB.
15. Use DuckDB’s Python API to dynamically load data and perform complex queries.
16. Write a query to identify outliers in the price column using statistical methods.
17. Implement a query to find products frequently purchased together using association rules.
18. Write a query to detect anomalies in sales data using DuckDB’s statistical functions.
19. Use MERGE to synchronize two tables based on specific conditions.
20. Write a query to calculate the Gini coefficient of revenue distribution across brands.
21. Perform a cross-database query in DuckDB to combine data from SQLite and CSV.
22. Use DuckDB’s TEMPORAL features to analyze sales over specific time intervals.
23. Implement a query to calculate weighted averages of discounts for each brand.
24. Write a query to rank products within deciles based on offer_price.
25. Perform a K-means clustering analysis on numerical data stored in DuckDB.
26. Write a query to calculate the Sharpe ratio for sales trends in DuckDB.
27. Implement a query to calculate the 95th percentile confidence interval for average sales.
28. Use DuckDB’s built-in connectors to directly query cloud-stored Parquet files.
29. Perform a geospatial join to match products to their nearest warehouse location.
30. Write a DuckDB query to integrate and analyze multi-format datasets (CSV, Parquet, JSON).