To say that the PDF has historically been a challenge for data integration is an understatement; the format has been described as "where documents go to die." Data contained in PDFs is unstructured, making it far harder to integrate into your system than data delivered through an API. As a result, many organizations turn to manual data entry to ingest PDFs. This approach, however, comes with a host of potential issues: compromised data quality, increased costs, and data entry delays that can last for hours or even days.
While optical character recognition (OCR) tools can extract text from PDFs, they merely extract text. They don't organize that text into specific data fields, so it isn't readily usable. This leaves developers with the daunting task of parsing the extracted text themselves, which isn't much better than manual entry in the first place.
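To make that parsing burden concrete, here is a minimal sketch of what post-OCR code tends to look like. The raw text, the invoice values, the field labels, and the regular expressions are all illustrative assumptions, not output from any particular OCR tool.

```python
import re

# Hypothetical raw text, as an OCR tool might return it for an invoice.
RAW_TEXT = """Williamson Gardening
Invoice No: 1042
Date: 03/15/2023
Bill To: John Doe & Co
Total: $1,250.00
"""

# One hand-written pattern per field; each one breaks if the wording or
# layout of the document changes.
PATTERNS = {
    "invoice_number": r"Invoice No:\s*(\d+)",
    "invoice_date": r"Date:\s*([\d/]+)",
    "client": r"Bill To:\s*(.+)",
    "total": r"Total:\s*\$([\d,.]+)",
}


def parse_invoice_text(text):
    """Return a dict of field name -> matched string (None if not found)."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        fields[name] = match.group(1).strip() if match else None
    return fields


print(parse_invoice_text(RAW_TEXT))
```

Every new vendor or layout change means another round of pattern writing and testing, which is the maintenance cost the rest of this article tries to avoid.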
Developer-first platforms, like Sensible, offer an alternative solution. They make accessing the data in PDF documents as easy as calling an API. Sensible’s document query language, SenseML, eliminates the complexities of PDF parsing. In a few seconds, users can extract data from digital and scanned PDFs and seamlessly ingest it into their workflows.
In this article, you'll learn how to use Python to extract structured data from PDFs with Sensible. After completing this tutorial, you should be able to apply the same approach to your own documents.
Extract Documents in Python Using Sensible
Sensible makes extracting data from PDF files a breeze. In this tutorial, you'll:
- Create a document configuration in the Sensible app using SenseML, Sensible’s document query language.
- Parse a simple invoice file to extract key information from the document, like the client’s name, invoice date, and total price.
- Obtain an endpoint from the Sensible app, then use Python and the Sensible API to extract data from multiple PDF invoices from the same vendor.
Before you get started, make sure you have the following:
After creating a Sensible account, sign in to your Sensible dashboard. To follow along with this tutorial, you need to create a new document type in your Sensible account. A document type is a collection of reference documents and SenseML queries (stored as configurations) to help you extract predefined data fields from PDFs. Sensible’s Configuration library contains configurations for extracting data from hundreds of the most popular documents.
In this tutorial, you’ll learn how to parse invoices for a fictitious gardening company called Williamson Gardening. The following is an example invoice from this company:
The extraction configuration contains SenseML queries that extract structured data from your documents. The reference document serves as the source for extracting the data while you write the configuration in the next step.
Creating a Configuration
Click your new configuration to edit it. This opens Sensible’s visual editor, a user-friendly interface that extracts data in response to queries written in natural language. Based on your input, the interface automatically generates SenseML queries to retrieve the necessary information.
It’s a simple and effective way to extract data, but understanding how to construct SenseML queries directly offers greater precision for your extraction. To work with SenseML directly, click Switch to SenseML.
A new screen opens with three panes: one for writing the configuration using SenseML, one for viewing the document, and one for showing the data extraction results.
Writing in SenseML, you’ll define how to find and extract data from the document, as well as the structure of the extracted data. SenseML is a bit like GraphQL in that way, but for querying elements of a document.
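As a rough sketch of that idea, the snippet below builds a one-field configuration as a Python dictionary and serializes it to JSON, the format SenseML configurations are written in. The field ID `invoice_date` and the anchor text `Date` are assumptions about the sample invoice, and the `label` method with its `position` parameter is an assumption to verify against the SenseML reference.

```python
import json

# Assumed SenseML layout: a top-level "fields" list, where each field has an
# "id" (the output key), an "anchor" (text to locate in the document), and a
# "method" (how to read the value relative to the anchor).
config = {
    "fields": [
        {
            "id": "invoice_date",  # hypothetical output key
            "type": "date",        # requested output type
            "anchor": "Date",      # assumed label text on the invoice
            "method": {"id": "label", "position": "right"},
        }
    ]
}

# Serialize to JSON so it can be pasted into the SenseML editor pane.
senseml = json.dumps(config, indent=2)
print(senseml)
```

The query describes both where the value lives (to the right of an anchor) and what the output should look like (a field named `invoice_date`), which is the sense in which SenseML resembles GraphQL.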
Some of the key components of SenseML are:
The following example demonstrates how you can use this powerful method to extract the invoice date:
This query should produce the following results:
The output of this query on the sample document is represented below:
For example, the following query uses SenseML to find the client in the invoice:
Sensible correctly identifies the client as John Doe & Co and its data type as a string.
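In Python, a typed result like this can be handled as a nested dictionary. The sketch below assumes each extracted field is an object with `type` and `value` keys; the field ID `client` and the helper `field_value` are hypothetical names used for illustration.

```python
# Assumed shape of one extracted field; "client" is a hypothetical field ID.
parsed_document = {
    "client": {"type": "string", "value": "John Doe & Co"},
}


def field_value(parsed, field_id, expected_type=None):
    """Return a field's value, optionally checking its reported type."""
    field = parsed.get(field_id)
    if field is None:
        return None  # the query found nothing for this field
    if expected_type is not None and field.get("type") != expected_type:
        raise TypeError(
            f"{field_id}: expected {expected_type}, got {field.get('type')}"
        )
    return field["value"]


print(field_value(parsed_document, "client", expected_type="string"))
```

Checking the reported type at the point of use catches a misconfigured query early, before a wrongly typed value reaches downstream code.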
You can also specify your data type in SenseML to ensure type safety. The example below queries the document for the invoice number, specifying the data type:
This query produces the following output:
Deploying the Configuration
You can combine all the queries into a single file to produce the following configuration:
After this, click Publish to deploy this configuration to a dev environment. There, you can interact with the configuration via the API.
Take note of the extraction endpoint, as you’ll use it later.
Retrieving Your Sensible API Key
To get your Sensible API key, navigate to the account page. Click the reveal icon to view and copy the key.
Writing the Python Code
By this point, you've written your SenseML configuration in Sensible to extract data from your sample invoice PDF. Now you can start writing code to extract data from documents in the same format as the sample document. You can download a test document here.
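One way to structure that code is sketched below, using only the Python standard library. It assumes the extraction endpoint you noted earlier accepts a POST request with the raw PDF bytes as the body, authenticates with a bearer token, and returns JSON whose `parsed_document` key holds the extracted fields; confirm each of these against the Sensible API reference. The function names and the `invoices` folder are hypothetical.

```python
import json
import urllib.request
from pathlib import Path


def build_extraction_request(endpoint, api_key, pdf_bytes):
    """Build a POST request that sends raw PDF bytes to the extraction endpoint."""
    return urllib.request.Request(
        endpoint,
        data=pdf_bytes,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed bearer-token auth
            "Content-Type": "application/pdf",
        },
    )


def extract_document(endpoint, api_key, pdf_path):
    """Send one PDF and return the decoded JSON response as a dict."""
    request = build_extraction_request(
        endpoint, api_key, Path(pdf_path).read_bytes()
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def extract_folder(endpoint, api_key, folder):
    """Extract every PDF in a folder; returns {filename: parsed_document}."""
    results = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        body = extract_document(endpoint, api_key, pdf_path)
        # "parsed_document" is the assumed key holding the extracted fields.
        results[pdf_path.name] = body.get("parsed_document")
    return results


# Example usage (needs network access, a real endpoint, and a real API key):
#   results = extract_folder(endpoint, api_key, "invoices")
#   print(json.dumps(results, indent=2))
```

Keeping request construction separate from sending makes the code easy to check without network access. Reading the API key from an environment variable rather than hardcoding it keeps the key out of version control.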
You can review the complete codebase with sample documents in this tutorial’s GitHub repo.
By this point, you’ve seen how Sensible automates data extraction from digital and scanned PDFs. You learned how to extract structured data from PDFs using Sensible in Python. You constructed SenseML queries to create configurations that target and extract specific structured data, eliminating the need for manual data entry.
Sensible provides a powerful text extraction solution, allowing for faster, more accurate data analysis and efficient workflows. It not only supports a large number of file formats but also provides third-party integrations and enterprise-level security. With its user-friendly interface and robust feature set, Sensible is a document orchestration platform made for developers and an excellent choice for simplifying your document processing workflow.