Intro

True story: the town I live in recently had some issues around the town budget. As I started to get involved, I wanted to look at the data provided by our Board of Finance. To my misfortune, I found myself trying to analyze data locked in a PDF. Trying to do any sort of analysis on it was time-consuming and frustrating. To make the data useful I needed to extract it from the PDF into something more flexible; in this case a simple CSV will do. Here is a sample of what we are working with:

Tooling

For this project we are going to use Pandas and Tabula (via the tabula-py wrapper). I also want to note I am using Python 3.12.2 for this project. To get Tabula up and running you will need to have Java 8+ installed as well. From there it's fairly straightforward.

pip install pandas
pip install tabula-py

Setup

First we are going to create a function that uses Tabula to extract data from our source PDF. I was pleasantly surprised by how simple this is to implement. Behold the ease of use:

import pandas as pd
import tabula

def extract_tables_from_pages(pdf_path, page):
    # Extract any tables Tabula finds on the specified page
    tables = tabula.read_pdf(pdf_path, pages=page, multiple_tables=True)

    # Check if the result is empty
    if not tables:
        return None

    # Concatenate all tables from the page into one DataFrame
    df = pd.concat(tables, ignore_index=True)

    return df
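If Tabula finds more than one table on a page, `pd.concat` with `ignore_index=True` simply stacks them and renumbers the rows from zero. A quick pandas-only sketch of that behavior, using two small hand-made frames in place of Tabula's output:

```python
import pandas as pd

# Two small DataFrames standing in for tables tabula.read_pdf might return
t1 = pd.DataFrame({'Description': ['Police', 'Fire'], 'FY25Budget': [100, 200]})
t2 = pd.DataFrame({'Description': ['Schools'], 'FY25Budget': [300]})

# ignore_index=True renumbers the combined rows 0..n-1 instead of
# keeping each table's own index
combined = pd.concat([t1, t2], ignore_index=True)
print(combined)
```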

Now that we have our function to extract data, let's handle the rest of the groundwork:

  • Create an empty file with our headers
  • Set the page range of the PDF that our data is in
  • Initialize an empty DataFrame

# Create file with headers
headers = ['Description', 'FY24Budget', 'FY25Budget', 'Change', 'ChangePercent']
df = pd.DataFrame(columns=headers)
csv_file_path = 'budget_data.csv'
df.to_csv(csv_file_path, index=False)

# Set page range
start_page = 93
end_page = 98

# Initialize an empty DataFrame
df = pd.DataFrame()
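Writing the header row once up front and then appending page by page with `mode='a'` and `header=False` is what keeps the CSV from repeating the headers. A minimal sketch of that append pattern, using a throwaway file name:

```python
import os
import tempfile

import pandas as pd

# Throwaway file for illustration
csv_path = os.path.join(tempfile.gettempdir(), 'demo_append.csv')

# Write just the header row once
pd.DataFrame(columns=['Description', 'FY25Budget']).to_csv(csv_path, index=False)

# Append data without repeating the header
page_df = pd.DataFrame({'Description': ['Police'], 'FY25Budget': [100]})
page_df.to_csv(csv_path, mode='a', index=False, header=False)

result = pd.read_csv(csv_path)
print(result)
```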

With that out of the way, let's put it to use:

  • Call the function we created earlier, load the data into a DataFrame, and add the proper column headers to it
  • The sample PDF has a bunch of summary rows on the last page of data, so we check if it's the last page and remove those rows if it is
  • Since this PDF has a lot of white space, Tabula returns several columns that are meaningless; we need to drop all of those
  • The last modification we need to make is removing whitespace from the columns with data we need
  • In the end we write the DataFrame to a CSV and we are all set

pdf_path = 'C:\\Board_of_Finance_Budget.pdf'

for page in range(start_page, end_page + 1):
    # Extract tables from the current page
    df = extract_tables_from_pages(pdf_path, page)

    # Skip the page before doing any work if nothing was found
    if df is None:
        print(f'No table found on page {page}')
        continue

    # Tabula treats the first data row as the header, so push the
    # header values back in as a regular row
    new_row = pd.DataFrame([df.columns], columns=df.columns)
    df = pd.concat([new_row, df]).reset_index(drop=True)

    # Remove the 6 summary rows if this is the last page
    if page == end_page:
        df = df.iloc[:-6]

    # Drop the meaningless whitespace columns; positions shift after
    # each drop, so this removes the original 2nd, 4th, and 6th columns
    df.drop(df.columns[1], axis=1, inplace=True)
    df.drop(df.columns[2], axis=1, inplace=True)
    df.drop(df.columns[3], axis=1, inplace=True)

    # Remove space characters in the 2nd and 3rd remaining columns
    df.iloc[:, 1] = df.iloc[:, 1].astype(str).str.replace(' ', '')
    df.iloc[:, 2] = df.iloc[:, 2].astype(str).str.replace(' ', '')

    # Print the DataFrame
    print(df)

    # Append this page's rows to the CSV
    df.to_csv(csv_file_path, mode='a', index=False, header=False)
    print(f"DataFrame saved to {csv_file_path}")
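One follow-up worth noting: even after the whitespace cleanup, the budget columns land in the CSV as text, since the PDF renders numbers with thousands separators. A short sketch of the final cleanup before analysis, using made-up figures in place of the real budget rows:

```python
import pandas as pd

# Hypothetical sample mirroring the cleaned CSV layout
df = pd.DataFrame({
    'Description': ['Police', 'Fire'],
    'FY24Budget': ['1,200,000', '850,500'],
    'FY25Budget': ['1,260,000', '875,000'],
})

# Strip the thousands separators and convert to real numbers
for col in ['FY24Budget', 'FY25Budget']:
    df[col] = pd.to_numeric(df[col].str.replace(',', ''))

# Now arithmetic works, e.g. recomputing the year-over-year change
df['Change'] = df['FY25Budget'] - df['FY24Budget']
print(df)
```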

Wrapping Up

With a small amount of effort we were able to extract several pages of detailed data from a PDF and write it out to a CSV. I am very excited about the flexibility Tabula brings to the table. For your needs you may be able to get away with less transformation than I had to do. I also found in my research that if a table in a PDF has lines outlining its columns, the extraction is much easier. Let me know in the comments how you plan to use Tabula to extract data.