If you want to transition from a spreadsheet user to a full-stack data professional, you need to understand how different tools handle the exact same problem.
In this interactive portfolio project—dubbed The Rosetta Stone Project—you will build an automated data pipeline using four different approaches: Excel exploration, Power Query automation, SQL translation, and Python scripting. Read the scenario, download the provided datasets, and try to complete the tasks on your own. When you are ready, click the "Show Answer" toggles to verify your work!
🏢 The Scenario: RetailCo's Data Problem
You are a BI Analyst at RetailCo, a multi-regional retail company. The sales team currently consolidates monthly sales reports from three regional offices (North, South, and East) by hand, using copy-paste and VLOOKUP formulas in Excel. This process takes 6 hours every month and is extremely error-prone.
The VP of Sales has asked you to build a repeatable, automated data pipeline that combines all regional sales data, validates data quality, and produces clean datasets ready for Power BI dashboards.
🗂️ Step 1: Download the Project Datasets
To follow along with this data pipeline tutorial, download the five raw data files used for this project below:
- Sales_North.xlsx (~500 orders)
- Sales_South.xlsx (~450 orders)
- Sales_East.csv (~550 orders)
- Customers.csv (150 customers)
- Products.csv (80 products)
🛠️ Stage 1: Manual Excel Exploration
Goal: Understand the data manually before automating.
Tasks:
- Open the files and document the date formats and product code formatting issues.
- Find out how many duplicate CustomerIDs exist in the Customers.csv file.
- In Sales_North.xlsx, write a VLOOKUP to bring in the CustomerName. How many #N/A errors appear?
- Create a Pivot Table on Sales_North showing the Sum of (Quantity * UnitPrice).
✅ Answer
Expected Issues:
- Sales_North: ~15 invalid CustomerIDs (not found in the master list).
- Sales_South: Leading/trailing spaces in ProductCodes.
- Sales_East: Date format mismatch (MM/DD/YYYY vs DD/MM/YYYY).
- Products: Inconsistent casing (uppercase/lowercase).
Duplicates: There should be 5 duplicate CustomerIDs in your master file.
VLOOKUP Test: With the customer list loaded into a sheet named Customers, the formula =VLOOKUP(C2, Customers!A:B, 2, FALSE) should yield approximately 15 #N/A errors corresponding to the invalid IDs.
Why do this? Manual Excel exploration is crucial before building an automated pipeline to identify edge cases, formatting errors, and missing relationships that will break your automated code later.
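If you want to cross-check the duplicate count outside Excel, pandas can do it in a few lines. The sketch below uses a tiny hypothetical stand-in for Customers.csv (the real file has ~150 rows); the column names follow the scenario.

```python
import pandas as pd

# Tiny stand-in for Customers.csv -- illustrative data only
customers = pd.DataFrame({
    'CustomerID':   ['C001', 'C002', 'C002', 'C003', 'C003'],
    'CustomerName': ['Acme', 'Beta', 'Beta', 'Gamma', 'Gamma'],
})

# keep=False flags every copy of a repeated ID, not just the later ones
dup_mask = customers['CustomerID'].duplicated(keep=False)
n_dup_ids = customers.loc[dup_mask, 'CustomerID'].nunique()
print(n_dup_ids)  # 2 duplicated IDs in this toy frame
```

Point the same two lines at the real Customers.csv and the answer should match the count you found by hand.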
⚙️ Stage 2: Power Query Automation
Goal: Create a refreshable, repeatable data pipeline using Power Query.
Tasks:
- Load all three sales files. Add a custom "Region" column to each, fix the date types, and use Text.Trim on the ProductCodes. Append them into a single query.
- Load and clean the master files (remove duplicates from Customers; UPPERCASE and TRIM ProductCodes in Products).
- Merge the combined sales table with Customers and Products using Left Outer Joins.
- Add calculated columns: LineTotal = Quantity * UnitPrice and Profit = LineTotal - (Quantity * UnitCost).
✅ Answer
Check your row counts:
- Appended Sales Query: You should have approximately 1,500 rows.
- Deduplicated Customers: You should have 150 unique IDs remaining.
Data Quality Check via Reference Query:
Instead of duplicating the final query, you should create a Reference Query to filter for rows where [CustomerName] = null OR [ProductName] = null.
Why use a Reference Query? A reference query builds on the final output of the combined query. If you update the logic in the main combined query, the reference query inherits those changes automatically.
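The same null-match quality check translates directly to pandas, where `merge(indicator=True)` labels each row by join outcome. The frames below are toy data standing in for the real files; the column names are taken from the scenario.

```python
import pandas as pd

# Toy stand-ins for the combined sales and the Customers master
sales = pd.DataFrame({'OrderID': [1, 2, 3],
                      'CustomerID': ['C001', 'C002', 'C999']})
customers = pd.DataFrame({'CustomerID': ['C001', 'C002'],
                          'CustomerName': ['Acme', 'Beta']})

merged = sales.merge(customers, on='CustomerID', how='left', indicator=True)
# Rows whose CustomerID had no match in the master list
bad_rows = merged[merged['_merge'] == 'left_only']
print(bad_rows['OrderID'].tolist())  # [3]
```

This is the pandas equivalent of Power Query's reference-query filter: the orphaned transactions surface instead of silently carrying nulls downstream.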
🖥️ Stage 3: SQL Translation
Goal: Translate the Power Query pipeline into SQL queries for database environments.
Tasks:
- Write a SQL query using UNION ALL to combine the North, South, and East tables.
- Write a comprehensive SELECT statement that joins the combined sales table with the master tables, cleans the ProductCode on the fly, and calculates LineTotal and Profit.
✅ Answer
```sql
-- 1. Combine Sales Tables
CREATE TABLE sales_combined AS
SELECT *, 'North' AS Region FROM sales_north
UNION ALL
SELECT *, 'South' AS Region FROM sales_south
UNION ALL
SELECT *, 'East' AS Region FROM sales_east;

-- 2. Main Fact Table with Joins & Calculations
SELECT
    s.OrderID,
    s.OrderDate,
    EXTRACT(YEAR FROM s.OrderDate) AS Year,
    EXTRACT(MONTH FROM s.OrderDate) AS Month,
    s.Region,
    s.CustomerID,
    c.CustomerName,
    TRIM(UPPER(s.ProductCode)) AS ProductCode,
    p.ProductName,
    s.Quantity,
    s.UnitPrice,
    s.Quantity * s.UnitPrice AS LineTotal,
    (s.Quantity * s.UnitPrice) - (s.Quantity * p.UnitCost) AS Profit
FROM sales_combined s
LEFT JOIN customers c ON s.CustomerID = c.CustomerID
LEFT JOIN products p ON TRIM(UPPER(s.ProductCode)) = p.ProductCode;
```

Why LEFT JOIN instead of INNER JOIN? We use a LEFT JOIN to ensure every single sales transaction is preserved, even if the CustomerID or ProductCode is invalid. If we used an INNER JOIN, transactions with bad master data would silently drop out of our pipeline, making the data look cleaner than it really is and hiding critical data quality issues from the business.
🐍 Stage 4: Python (pandas) Automation
Goal: Create a fully automated Python script that reproduces the entire pipeline end to end.
Tasks: Write a Python script using the pandas library to replicate the logic. Read the Excel/CSV files, standardize strings, handle dates, join the dataframes, compute columns, and export the clean data to a new CSV file.
✅ Answer
```python
import pandas as pd

# 1. Load and Combine Sales (North/South are .xlsx, East is .csv)
region_files = {'North': 'Sales_North.xlsx',
                'South': 'Sales_South.xlsx',
                'East': 'Sales_East.csv'}
all_sales = []
for region, path in region_files.items():
    df = pd.read_csv(path) if path.endswith('.csv') else pd.read_excel(path)
    df['Region'] = region
    df['OrderDate'] = pd.to_datetime(df['OrderDate'], errors='coerce')
    df['ProductCode'] = df['ProductCode'].str.strip().str.upper()  # Clean formatting
    all_sales.append(df)
sales = pd.concat(all_sales, ignore_index=True)

# 2. Load Masters
customers = pd.read_csv('Customers.csv').drop_duplicates(subset='CustomerID', keep='first')  # Deduplicate
products = pd.read_csv('Products.csv')
products['ProductCode'] = products['ProductCode'].str.strip().str.upper()

# 3. Merge and Calculate
merged = sales.merge(customers, on='CustomerID', how='left')
merged = merged.merge(products, on='ProductCode', how='left')
merged['LineTotal'] = merged['Quantity'] * merged['UnitPrice']
merged['Profit'] = merged['LineTotal'] - (merged['Quantity'] * merged['UnitCost'])

# 4. Generate Final Outputs (clean data only)
clean_df = merged[merged['CustomerName'].notna() & merged['ProductName'].notna()]
clean_df.to_csv('Sales_Consolidated.csv', index=False)
```
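As an optional extension, you could also export the rows that failed a lookup instead of just discarding them, mirroring the data-quality reference query from Stage 2. The snippet below is a sketch using a toy stand-in for the merged frame; the file name Sales_Exceptions.csv is an assumption, not part of the project spec.

```python
import pandas as pd

# Toy stand-in for the pipeline's merged frame -- illustrative data only
merged = pd.DataFrame({
    'OrderID': [1, 2, 3],
    'CustomerName': ['Acme', None, 'Beta'],
    'ProductName': ['Widget', 'Gadget', None],
})

# Rows that failed either lookup, exported for manual review
rejects = merged[merged['CustomerName'].isna() | merged['ProductName'].isna()]
rejects.to_csv('Sales_Exceptions.csv', index=False)
print(len(rejects))  # 2 rows need review
```

Shipping the exceptions file alongside Sales_Consolidated.csv gives the business a concrete list of bad IDs to fix at the source.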
Automation Advantage: Python allows true automation. You can set up Windows Task Scheduler (or a cron job on Linux/Mac) to run this .py file every morning at 6 AM. Unlike Power Query in Excel, which requires a user to open the workbook and click "Refresh" (or a Power BI gateway to schedule refreshes), Python runs silently in the background.
🧠 Stage 5: Comparison Analysis & Reflection
A great data analyst doesn't just know how to use a tool; they know when to use it.
- Excel: Best for quick, visual exploration and finding edge cases.
- Power Query: Best for refreshable pipelines maintained directly by business users and for Power BI integration.
- SQL: Best for large relational database data and handling millions of rows easily.
- Python (pandas): Best for complex logic, full system automation, and handling large files from diverse APIs and network drives.
Conclusion: Excel is for exploring; Power Query is for refreshable BI pipelines; SQL is for scale; Python is for automation. Mastering all four makes you a versatile full-stack data professional!