How to Automate PDF to Excel Data Extraction Using Python

If you work with data, you already know the frustration of receiving tabular data locked inside a PDF file. Manually copying and pasting PDF tables into Excel is not just tedious; it is prone to human error and consumes valuable hours that could be spent on actual analysis and dashboard building.

Whether you are preparing datasets for Power BI or running financial reconciliations, automating this process is a massive time-saver.

Today, we are going to look at a completely free, offline Python script that extracts tables from multiple PDFs and compiles them neatly into a single Excel file.


Why Use This Python Approach?

  • 100% Free and Offline: No need for paid SaaS tools or to upload sensitive documents to third-party websites.
  • Batch Processing: It loops through an entire folder of PDFs at once.
  • Smart Fallback: If visual gridlines are missing in the PDF, the script intelligently reads the text spacing to rebuild the table structure.
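
To make the fallback concrete, here is a minimal sketch of how splitting on runs of two or more spaces rebuilds columns from plain text (the sample line is invented for demonstration):

```python
import re

# A text-only PDF line with columns separated by wide gaps (hypothetical sample)
line = "ACME Corp    2024-01-15      1,250.00"

# Split wherever two or more consecutive spaces appear
cells = re.split(r'\s{2,}', line.strip())
print(cells)  # ['ACME Corp', '2024-01-15', '1,250.00']
```

Note that the single space inside "ACME Corp" survives, because only runs of two or more spaces are treated as column boundaries.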

Prerequisites

Before we begin, you will need Python installed on your computer. You also need to install a few required libraries. Open your command prompt or terminal and run the following command:

pip install pandas pdfplumber openpyxl

(Note: openpyxl is required by Pandas to write data to .xlsx files).

Step-by-Step Guide to Using the Extractor

Step 1: Set Up Your Project Folder

Create a new, empty folder on your computer. This will be your working directory. Place all the PDF files you want to extract data from directly into this folder.

Step 2: Save the Python Script

Inside that same folder, create a new file named pdf_extractor.py. Copy the Python code provided below and paste it into this file.

Step 3: Run the Script

Open your terminal or command prompt, navigate to your folder, and run the script:

python pdf_extractor.py

Step 4: Access Your Data

The terminal will show you the progress as it analyzes the table structure of each PDF. Once finished, a new file named Offline_Table_Extraction.xlsx will appear in your folder. Open it, and you will find all your PDF data neatly organized into columns, with the first column indicating the source file name for easy tracking!


The Python Code

Here is the complete script that powers this automation:

import os
import pandas as pd
import pdfplumber
import re

# Looks in the folder where this script is saved
CURRENT_FOLDER = '.'
OUTPUT_FILE = 'Offline_Table_Extraction.xlsx'

def process_pdfs():
    print("Starting offline, unlimited table extraction...")
    all_data = []
    max_columns = 0
    
    # Loop through all files in the current folder
    for filename in os.listdir(CURRENT_FOLDER):
        if filename.lower().endswith('.pdf'):
            pdf_path = os.path.join(CURRENT_FOLDER, filename)
            print(f"Analyzing table structure of {filename}...")
            
            try:
                with pdfplumber.open(pdf_path) as pdf:
                    for page in pdf.pages:
                        # This built-in method visually detects tables and grids
                        tables = page.extract_tables()
                        
                        if tables:
                            for table in tables:
                                for row in table:
                                    # Clean up formatting (removes random line breaks inside cells)
                                    clean_row = [str(cell).replace('\n', ' ').strip() if cell else '' for cell in row]
                                    
                                    # Add the source filename to the beginning of the row
                                    clean_row.insert(0, filename)
                                    all_data.append(clean_row)
                                    
                                    # Keep track of the widest table so our Excel sheet formats correctly
                                    if len(clean_row) > max_columns:
                                        max_columns = len(clean_row)
                        else:
                            # FALLBACK: If no visual gridlines exist, use text spacing to build the table
                            text = page.extract_text()
                            if text:
                                for line in text.split('\n'):
                                    if line.strip():
                                        # Split the text into columns wherever there are 2 or more spaces
                                        clean_row = re.split(r'\s{2,}', line.strip())
                                        clean_row.insert(0, filename)
                                        all_data.append(clean_row)
                                        
                                        if len(clean_row) > max_columns:
                                            max_columns = len(clean_row)
            except Exception as e:
                print(f"Error processing {filename}: {e}")

    if not all_data:
        print("\nNo data could be extracted from any PDFs. They might be scanned images.")
        return

    # Ensure all rows have the exact same number of columns so Pandas doesn't crash
    for row in all_data:
        while len(row) < max_columns:
            row.append('')

    # Create generic column headers (Column 1, Column 2, etc.)
    columns = ['Source File'] + [f'Column {i}' for i in range(1, max_columns)]
    
    # Convert to Excel
    df = pd.DataFrame(all_data, columns=columns)
    df.to_excel(OUTPUT_FILE, index=False)
    print(f"\nSuccess! All tables extracted completely free and saved to '{OUTPUT_FILE}'")

if __name__ == "__main__":
    process_pdfs()
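
The padding step near the end of the script deserves a closer look: pandas raises a ValueError if the rows passed to DataFrame do not all have the same length, so every row is padded with empty strings up to the widest one. A minimal sketch of that normalization, using invented sample rows:

```python
# Rows of unequal length, as produced by tables of mixed widths (hypothetical data)
all_data = [
    ['a.pdf', 'Name', 'Qty', 'Price'],
    ['a.pdf', 'Widget', '3'],          # one cell short
    ['b.pdf', 'Total'],                # two cells short
]

max_columns = max(len(row) for row in all_data)

# Pad every row with empty strings so all rows share the same width
for row in all_data:
    while len(row) < max_columns:
        row.append('')

print([len(row) for row in all_data])  # [4, 4, 4]
```

After padding, every row has exactly max_columns cells, which is what lets the script build the DataFrame in one call.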

Wrapping Up

By utilizing libraries like pdfplumber and pandas, we can transform a rigid PDF format into flexible Excel data in seconds. This script acts as a perfect data-cleaning step before importing your files into Power BI or performing deeper business analysis.
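Because the script emits generic headers (Column 1, Column 2, and so on), a common follow-up step before loading into Power BI is to promote the first extracted row to proper column names. Here is a hedged sketch in pandas; the sample frame is invented, and you may need to adapt the row index if your first row is not the header row:

```python
import pandas as pd

# A small frame mimicking the script's output (hypothetical data)
df = pd.DataFrame(
    [['a.pdf', 'Name', 'Qty'],
     ['a.pdf', 'Widget', '3']],
    columns=['Source File', 'Column 1', 'Column 2'],
)

# Promote the first data row (the original table's header row) to column names
header = df.iloc[0]
df = df.iloc[1:].reset_index(drop=True)
df.columns = ['Source File'] + list(header[1:])

print(list(df.columns))  # ['Source File', 'Name', 'Qty']
```

If your PDFs come from several different report layouts, filter by the Source File column first so you only promote headers within one layout at a time.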

The tedious manual copy-pasting is over.

Reclaim your hours and accelerate your workflow.
