How to use GPT-4 to parse free-text documents

Updated on April 16, 2024

Author: Frances Elliott

Introduction

Some documents, such as leases and other contracts, bury key information in paragraphs of legalese or other unstructured text. This historically represented a significant challenge for data extraction, since finding the target data required sophisticated natural language processing (NLP) techniques that, even at their best, weren't particularly reliable.

In 2020, OpenAI made a significant leap forward in generative language models (a subdiscipline of NLP) with their GPT-3 model, which generates text that is difficult in many situations to distinguish from human-authored text. Most of the publicity around GPT has focused on its creative applications, such as completing stories, writing code, or finishing your emails.

At Sensible, we tackled the more mundane but still challenging task of using large language models (LLMs) such as GPT-4 Vision, GPT-4, GPT-3, and GPT-3.5 Turbo to turn unstructured free text into structured data in a business context.

Sensible's free-text Query Group method is a great way to apply LLMs to real-world business uses. Let's dig into how to use this method to pull structured rent data out of a lease, with no prior knowledge about the exact wording used in the lease.

Given paragraphs like these:

[Image: lease section about rents and charges, from the example lease]

Sensible can extract information like this:

{
  "rent_computed": [
    {
      "rent_in_dollars": "$895.00",
      "payment_time_period": "month"
    }
  ]
}

To get such slick output, you'd historically put in a lot of work training machine learning (ML) algorithms with sample documents. But not now! You get this extraction out of the box, because GPT models are already trained on a huge body of text. GPT-3, for example, was trained on a large crawl of the public web plus English Wikipedia.

So what, exactly, do you need to do to go from unstructured, natural-language documents to this structured data? First, to stay within LLM token limits, you need to narrow down the document to just a snippet that contains the target information. Then, you need to prompt the LLM to extract the target information from that snippet.

Fortunately, with the Sensible app, this multi-step process is easy. Sensible automatically scores chunks of the document based on your queries to find the most likely location, or context, for your data. Then the app can even automatically generate LLM prompts to extract the most interesting facts in the document page you’re currently viewing. All this in a few clicks. Let’s walk through it.
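To make the two steps concrete, here is a minimal Python sketch of the general idea: score chunks of a document against a query, then build a prompt from the best-scoring chunk. It is not Sensible's actual scoring method, which this article doesn't describe. The keyword-overlap scoring, the function names, the stopword list, and the sample lease text are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stopword list so that common words don't dominate the score.
STOPWORDS = {"a", "an", "the", "is", "of", "in", "to", "what"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return its alphabetic words, minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def split_into_chunks(document: str) -> list[str]:
    """Treat each blank-line-separated paragraph as one chunk."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def score_chunk(chunk: str, query: str) -> int:
    """Count how often the query's distinct words occur in the chunk."""
    counts = Counter(tokenize(chunk))
    return sum(counts[word] for word in set(tokenize(query)))


def best_chunk(document: str, query: str) -> str:
    """Return the highest-scoring chunk (the first one wins a tie)."""
    return max(split_into_chunks(document), key=lambda c: score_chunk(c, query))


def build_prompt(context: str, query: str) -> str:
    """Combine the selected context and the query into one LLM prompt."""
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Made-up lease text, for illustration only.
LEASE = """The Tenant shall keep the premises clean and in good repair.

The Tenant agrees to pay rent of $895.00 per month, due on the first day of each month.

Pets are not permitted without written consent of the Landlord."""

QUERY = "What is the monthly rent amount?"

context = best_chunk(LEASE, QUERY)
print(build_prompt(context, QUERY))
```

Only the selected paragraph, not the whole lease, ends up in the prompt, which is what keeps the request within the model's token limit. A production system would likely use a more robust relevance measure than raw word overlap.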

Transforming unstructured into structured text with Sensible

Prerequisites

To follow along:

Sign up for a Sensible account
Download the example PDF: Download link

Auto-extract data

Take the following steps to extract data from the lease:

Click New document type.
Select the example document you just downloaded.

On upload, Sensible automatically extracts important information from the lease for you.

You can edit the automatically generated queries, auto-generate more queries, or manually author your own. For more information, see Recommended Query Groups.

Test the extraction template with a second document

The auto-generated queries, shown in the right pane of the app, make up an extraction template that you can use to extract data from other lease documents. To try it out:

Publish your template by selecting Publish configuration > Publish to production.

Download the second example document: Download link

Upload the second example document by clicking Add file.

Note that the extracted data for the auto-generated queries updates to reflect the new document.

Now that you’ve published the extraction template, you can integrate it into your own application and run these queries against lease documents in volume using the Sensible API, SDK, or bulk-upload UI.
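Once the data is extracted, downstream code can treat it like any other structured record. The following Python sketch assumes the simplified output shape shown in the introduction; an actual API response may wrap these fields differently, so check the Sensible docs before relying on it. The helper names and the period-to-year mapping are hypothetical.

```python
import json
from decimal import Decimal

# Output shape copied from the rent example in the introduction.
EXTRACTION = """
{
  "rent_computed": [
    {
      "rent_in_dollars": "$895.00",
      "payment_time_period": "month"
    }
  ]
}
"""

# Hypothetical mapping from a payment period to payments per year.
PERIODS_PER_YEAR = {"week": 52, "month": 12, "year": 1}


def parse_dollars(text: str) -> Decimal:
    """Convert a string such as "$1,250.00" into a Decimal amount."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


def annual_rent(extraction: dict) -> Decimal:
    """Annualize the first rent row in the extracted data."""
    row = extraction["rent_computed"][0]
    amount = parse_dollars(row["rent_in_dollars"])
    return amount * PERIODS_PER_YEAR[row["payment_time_period"]]


result = json.loads(EXTRACTION)
print(annual_rent(result))  # prints 10740.00
```

Using Decimal rather than float avoids rounding surprises when the amounts feed into billing or accounting calculations.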

Advanced extractions

You’re not limited to auto-generated queries. You can author your own LLM prompts to extract not only short facts, but also tables and complex lists. You can extract from non-text images embedded in documents using multimodal LLMs such as GPT-4 Vision. And if an LLM can’t extract the data you’re looking for, you can always fall back to Sensible’s layout-based extraction methods.

Try it for free

Explore our prebuilt open-source library for extracting from common business documents, check out our docs, and sign up for a free account to start extracting and transforming data from your own documents.
