Developers often need to extract data from documents in various formats to get valuable business information or to automate simple tasks. Although extracting data from text, Excel, and CSV files is relatively straightforward, the task becomes challenging when working with PDF files. Extracting a table embedded in a PDF or values from unstructured text can pose difficulties since PDFs lack a well-defined data hierarchy and may contain various data types.
PDF conversion involves converting the data in a PDF file into a format that you can interact with, such as a text file, Microsoft Word, Microsoft Excel, or JSON. This article focuses on converting PDF files to Excel. You can convert a PDF to Excel using either a one-to-one or many-to-one copy. A one-to-one copy copies all content from the PDF file to an Excel sheet and maintains the same visual layout, but it offers limited flexibility in working with the data. A many-to-one copy extracts data from the PDF file and inserts it into specific cells on the spreadsheet in a structured format, making the data more useful for analysis.
Converting PDF files to Excel makes the content of documents and forms more accessible to spreadsheets’ data wrangling (filtering/querying) capabilities. It also makes it possible to validate the data quickly and, if necessary, create visualizations. This tutorial explains how to convert PDF files to Excel files. You’ll learn how to use Sensible to create a custom document type and configuration and how to extract and convert data in Python using the Sensible API.
How to Convert a PDF to Excel Using Python
The tutorial uses weekly situation report PDFs from the WHO on COVID-19 to show you how to convert a PDF to Excel using Python. You’ll use the Sensible API to configure a scraper to extract information about new cases, deaths, and so on from a table in the PDF over a 28-day period and then save that data in an Excel file.
Before continuing, make sure that you have the following set up:
- Python 3.10 installed on your computer
- A code editor like Visual Studio Code
- A browser with an internet connection
To begin, make a project directory and launch a terminal within it. Run the following commands in the terminal to create and activate a Python virtual environment:
This library will communicate with the Sensible API via HTTP requests.
Next, open your browser and navigate to the Sensible user registration page. Enter the requested details into the form on the page and click the Create account button, then enter the verification code you received via email:
Sign in to your account, enter the additional detail requested, and click Submit to access the Sensible web app:
Creating Your API Key
To create your Sensible API key, first click the profile icon in the top right corner of the page. This action will open a drop down menu. From this menu, select the Account settings option. On the page that loads, create your API key by clicking the Create API key button which opens a dialogue box. Enter a description for the API key and your account password in the respective fields shown in the image below. Then, click the Create button.
Finally, click the Clipboard Icon shown below to copy your API key.
To use an existing Sensible API key instead, scroll horizontally in the “API keys” section and click the three dots menu button. Then, click the Retrieve Key option, enter your password, and copy your API key.
You will use this API key in the next few steps as you write and run the Python script.
Creating a Configuration
Configurations are JSON files that allow you to extract data from documents using SenseML, Sensible’s query language.
To create a configuration, first, return to the Sensible web app and navigate to the Document Types page. Then, click New document type to open a dialog box. Enter the name of the document type as shown below, leave all other options in their default state, and click Create:
This creates and opens the document type. The Configurations tab is opened by default within the document type. To make a configuration, click the Create configuration button, give it a name, and then click the Create button:
Next, navigate to the Reference documents tab. Here, you must include one or more good examples of the types of documents that are expected when this document type is used. Download the PDF of the COVID-19 situation report for March 8, 2023.
Click the Upload document button, attach the report, and select the configuration created earlier under the Associated configuration field. Then, click the Upload button:
After that, open the configuration editor by clicking any part of the reference document. Enter the following configuration in the left pane of this screen to extract the desired content from the document:
Click the Publish button on the top right, and then click Publish to development in the window that appears:
Take note that the environment is set to Development. This will be referenced in the code. You now have a configuration that can be used to extract data from any similar PDF file.
Extracting Data with Python
To test your configuration, create a Python script and add the following lines of code to import the required libraries and define the document type, environment, and document name as variables:
You have now successfully extracted data from a PDF file and converted it to an Excel file. This tutorial’s configuration, source code, and files can be found in this GitHub repository.
In this tutorial, you converted PDF files to Excel and used the Sensible web app to create a custom document type as well as a configuration to extract relevant data. Finally, you used Python to interact with the Sensible API to extract and convert data from PDF to Excel.
Sensible is a developer-focused document orchestration platform. It allows you to use fine-grained configurations to extract data from structured and unstructured documents. Sensible accomplishes this through the use of SenseML, a JSON-formatted query language. This query language employs machine learning techniques such as natural language processing and optical character recognition to ensure that desired data can be extracted regardless of its format. Try out Sensible today.