If you want to transition from a spreadsheet user to a full-stack data professional, you need to understand how different tools handle the exact same problem.
In this interactive portfolio project—dubbed The Rosetta Stone Project—you will build an automated data pipeline using four different approaches: Excel exploration, Power Query automation, SQL translation, and Python scripting. Read the scenario, download the provided datasets, and try to complete the tasks on your own. When you are ready, click the "Show Answer" toggles to verify your work!
🏢 The Scenario: RetailCo's Data Problem
You are a BI Analyst at RetailCo, a multi-regional retail company. The sales team currently consolidates monthly sales reports from three regional offices (North, South, and East) by hand, using copy-paste and VLOOKUP formulas in Excel. This process takes 6 hours every month and is extremely error-prone.
The VP of Sales has asked you to build a repeatable, automated data pipeline that combines all regional sales data, validates data quality, and produces clean datasets ready for Power BI dashboards.
🗂️ Step 1: Download the Project Datasets
To follow along with this data pipeline tutorial, download the five raw data files used for this project below:
- Sales_North.xlsx (~500 orders)
- Sales_South.xlsx (~450 orders)
- Sales_East.csv (~550 orders)
- Customers.csv (150 customers)
- Products.csv (80 products)
🛠️ Stage 1: Manual Excel Exploration
Goal: Understand the data manually before automating.
Tasks:
- Open the files and document the date formats and product code formatting issues.
- Find out how many duplicate CustomerIDs exist in the Customers.csv file.
- In Sales_North.xlsx, write a VLOOKUP to bring in the CustomerName. How many #N/A errors appear?
- Create a Pivot Table on Sales_North showing the Sum of (Quantity * UnitPrice).
✅ Answer
Expected Issues:
- Sales_North: ~15 invalid CustomerIDs (not found in the master list).
- Sales_South: Leading/trailing spaces in ProductCodes.
- Sales_East: Date format mismatch (MM/DD/YYYY vs DD/MM/YYYY).
- Products: Inconsistent casing (uppercase/lowercase).
Duplicates: There should be 5 duplicate CustomerIDs in your master file.
VLOOKUP Test: With the customer list loaded into a sheet named Customers, the formula =VLOOKUP(C2, Customers!A:B, 2, FALSE) should yield approximately 15 #N/A errors corresponding to the invalid IDs.
Why do this? Manual Excel exploration is crucial before building an automated pipeline to identify edge cases, formatting errors, and missing relationships that will break your automated code later.
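If you want to cross-check the duplicate count outside Excel, pandas can do it in a few lines. The sketch below uses a tiny hypothetical stand-in for Customers.csv (the real file has ~150 rows); the column names follow the scenario.

```python
import pandas as pd

# Tiny stand-in for Customers.csv -- illustrative data only
customers = pd.DataFrame({
    'CustomerID':   ['C001', 'C002', 'C002', 'C003', 'C003'],
    'CustomerName': ['Acme', 'Beta', 'Beta', 'Gamma', 'Gamma'],
})

# keep=False flags every copy of a repeated ID, not just the later ones
dup_mask = customers['CustomerID'].duplicated(keep=False)
n_dup_ids = customers.loc[dup_mask, 'CustomerID'].nunique()
print(n_dup_ids)  # 2 duplicated IDs in this toy frame
```

Point the same two lines at the real Customers.csv and the answer should match the count you found by hand.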
⚙️ Stage 2: Power Query Automation
Goal: Create a refreshable, repeatable data pipeline using Power Query.
Tasks:
- Load all three sales files. Add a custom "Region" column to each, fix the date types, and use Text.Trim on the ProductCodes. Append them into a single query.
- Load and clean the master files (remove duplicates from Customers; UPPERCASE and TRIM ProductCodes in Products).
- Merge the combined sales table with Customers and Products using Left Outer Joins.
- Add calculated columns: LineTotal = Quantity * UnitPrice and Profit = LineTotal - (Quantity * UnitCost).
✅ Answer
Check your row counts:
- Appended Sales Query: You should have approximately 1,500 rows.
- Deduplicated Customers: You should have 150 unique IDs remaining.
Data Quality Check via Reference Query:
Instead of duplicating the final query, you should create a Reference Query to filter for rows where [CustomerName] = null OR [ProductName] = null.
Why use a Reference Query? A reference query builds on the final output of the combined query. If you update the logic in the main combined query, the reference query inherits those changes automatically.
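The same null-match quality check translates directly to pandas, where `merge(indicator=True)` labels each row by join outcome. The frames below are toy data standing in for the real files; the column names are taken from the scenario.

```python
import pandas as pd

# Toy stand-ins for the combined sales and the Customers master
sales = pd.DataFrame({'OrderID': [1, 2, 3],
                      'CustomerID': ['C001', 'C002', 'C999']})
customers = pd.DataFrame({'CustomerID': ['C001', 'C002'],
                          'CustomerName': ['Acme', 'Beta']})

merged = sales.merge(customers, on='CustomerID', how='left', indicator=True)
# Rows whose CustomerID had no match in the master list
bad_rows = merged[merged['_merge'] == 'left_only']
print(bad_rows['OrderID'].tolist())  # [3]
```

This is the pandas equivalent of Power Query's reference-query filter: the orphaned transactions surface instead of silently carrying nulls downstream.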
🖥️ Stage 3: SQL Translation
Goal: Translate the Power Query pipeline into SQL queries for database environments.
Tasks:
- Write a SQL query using UNION ALL to combine the North, South, and East tables.
- Write a comprehensive SELECT statement that joins the combined sales table with the master tables, cleans the ProductCode on the fly, and calculates LineTotal and Profit.
✅ Answer
```sql
-- 1. Combine Sales Tables
CREATE TABLE sales_combined AS
SELECT *, 'North' AS Region FROM sales_north
UNION ALL
SELECT *, 'South' AS Region FROM sales_south
UNION ALL
SELECT *, 'East' AS Region FROM sales_east;

-- 2. Main Fact Table with Joins & Calculations
SELECT
    s.OrderID,
    s.OrderDate,
    EXTRACT(YEAR FROM s.OrderDate) AS Year,
    EXTRACT(MONTH FROM s.OrderDate) AS Month,
    s.Region,
    s.CustomerID,
    c.CustomerName,
    TRIM(UPPER(s.ProductCode)) AS ProductCode,
    p.ProductName,
    s.Quantity,
    s.UnitPrice,
    s.Quantity * s.UnitPrice AS LineTotal,
    (s.Quantity * s.UnitPrice) - (s.Quantity * p.UnitCost) AS Profit
FROM sales_combined s
LEFT JOIN customers c ON s.CustomerID = c.CustomerID
LEFT JOIN products p ON TRIM(UPPER(s.ProductCode)) = p.ProductCode;
```

Why LEFT JOIN instead of INNER JOIN? We use a LEFT JOIN to ensure every single sales transaction is preserved, even if the CustomerID or ProductCode is invalid. If we used an INNER JOIN, transactions with bad master data would silently drop out of our pipeline, making the data look cleaner than it really is and hiding critical data quality issues from the business.
🐍 Stage 4: Python (pandas) Automation
Goal: Create a fully automated Python script that reproduces the entire pipeline end to end.
Tasks: Write a Python script using the pandas library to replicate the logic. Read the Excel/CSV files, standardize strings, handle dates, join the dataframes, compute columns, and export the clean data to a new CSV file.
✅ Answer
```python
import pandas as pd

# 1. Load and Combine Sales (North/South are .xlsx, East is .csv)
region_files = {'North': 'Sales_North.xlsx',
                'South': 'Sales_South.xlsx',
                'East': 'Sales_East.csv'}
all_sales = []
for region, path in region_files.items():
    df = pd.read_csv(path) if path.endswith('.csv') else pd.read_excel(path)
    df['Region'] = region
    df['OrderDate'] = pd.to_datetime(df['OrderDate'], errors='coerce')
    df['ProductCode'] = df['ProductCode'].str.strip().str.upper()  # Clean formatting
    all_sales.append(df)
sales = pd.concat(all_sales, ignore_index=True)

# 2. Load Masters
customers = pd.read_csv('Customers.csv').drop_duplicates(subset='CustomerID', keep='first')  # Deduplicate
products = pd.read_csv('Products.csv')
products['ProductCode'] = products['ProductCode'].str.strip().str.upper()

# 3. Merge and Calculate
merged = sales.merge(customers, on='CustomerID', how='left')
merged = merged.merge(products, on='ProductCode', how='left')
merged['LineTotal'] = merged['Quantity'] * merged['UnitPrice']
merged['Profit'] = merged['LineTotal'] - (merged['Quantity'] * merged['UnitCost'])

# 4. Generate Final Outputs (clean data only)
clean_df = merged[merged['CustomerName'].notna() & merged['ProductName'].notna()]
clean_df.to_csv('Sales_Consolidated.csv', index=False)
```
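As an optional extension, you could also export the rows that failed a lookup instead of just discarding them, mirroring the data-quality reference query from Stage 2. The snippet below is a sketch using a toy stand-in for the merged frame; the file name Sales_Exceptions.csv is an assumption, not part of the project spec.

```python
import pandas as pd

# Toy stand-in for the pipeline's merged frame -- illustrative data only
merged = pd.DataFrame({
    'OrderID': [1, 2, 3],
    'CustomerName': ['Acme', None, 'Beta'],
    'ProductName': ['Widget', 'Gadget', None],
})

# Rows that failed either lookup, exported for manual review
rejects = merged[merged['CustomerName'].isna() | merged['ProductName'].isna()]
rejects.to_csv('Sales_Exceptions.csv', index=False)
print(len(rejects))  # 2 rows need review
```

Shipping the exceptions file alongside Sales_Consolidated.csv gives the business a concrete list of bad IDs to fix at the source.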
Automation Advantage: Python allows true automation. You can set up Windows Task Scheduler (or a cron job on Linux/Mac) to run this .py file every morning at 6 AM. Unlike Power Query in Excel, which requires a user to open the workbook and click "Refresh" (or a Power BI gateway to schedule refreshes), Python runs silently in the background.
🧠 Stage 5: Comparison Analysis & Reflection
A great data analyst doesn't just know how to use a tool; they know when to use it.
- Excel: Best for quick, visual exploration and finding edge cases.
- Power Query: Best for refreshable pipelines maintained directly by business users and for Power BI integration.
- SQL: Best for large relational database data and handling millions of rows easily.
- Python (pandas): Best for complex logic, full system automation, and handling large files from diverse APIs and network drives.
Conclusion: Excel is for exploring; Power Query is for refreshable BI pipelines; SQL is for scale; Python is for automation. Mastering all four makes you a versatile full-stack data professional!