Some documents, such as leases and other contracts, bury key information in paragraphs of legalese or other unstructured text. This historically represented a significant challenge for data extraction, since finding the target data required sophisticated natural language processing (NLP) techniques that, even at their best, weren't particularly reliable.
In 2020, OpenAI made a significant leap forward in generative language models (a subdiscipline of NLP) with their GPT-3 model, which generates text that is difficult in many situations to distinguish from human-authored text. Most of the publicity around GPT-3 has focused on its creative applications, such as completing stories, writing code, or finishing your emails.
At Sensible, we tackled the more mundane but still challenging task of using GPT-3 to summarize unstructured free text into structured data in a business context. Our biggest constraint was GPT-3's prompt size: GPT-3 accepts only about 1,500 words as a prompt for generating new text. For this reason, you can't simply dump long documents like leases into GPT-3 to get back, say, the rent in dollars. Instead, you need to (1) find the snippet in the lease document that most likely discusses rent and (2) feed that snippet to GPT-3.
Transforming unstructured text into structured data with Sensible
As it happens, Sensible's free-text Topic and Summarizer methods are the perfect combination to apply GPT-3 to real-world business uses. Let's dig into how to use these methods to pull structured rent data out of a lease, with no prior knowledge about the exact wording used in the lease.
With the Summarizer method, given paragraphs like these:
Sensible can extract information like this:
To get such slick output, you'd historically put in a lot of work training machine learning (ML) algorithms with sample documents. But not now! You get this extraction out of the box, because GPT-3 is already trained on a ton of documents – as much of the Internet as it could grab, including all of Wikipedia.
So what, exactly, do you need to do to go from unstructured, natural-language documents to this structured data? Let's dive in:
Step 1: Find the paragraph with the rent topic
Getting structured information out of a sentence like "lessee shall pay 895.00 dollars per month for rent" is only half the battle. First, GPT-3 requires you to narrow the document down to just the few lines that contain the sentence. So how do you consistently locate the lines containing the rent amount and the rent payment frequency in order to extract this information?
Sensible uses the Topic method:
This code sample tells Sensible to find paragraphs with keywords associated with rent payment. You gather these keywords manually by looking at a variety of lease documents. Sensible then uses a bag-of-words approach paired with document layout inference to find the paragraphs concerned with the topic of rent payment. For example, the single preceding code sample can extract a variety of paragraphs. It can output something like this:
as well as something like this:
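To make the bag-of-words idea from this step concrete, here is a minimal Python sketch. It is not Sensible's implementation: the keyword list `RENT_TERMS`, the helper names, and the sample lease text are all assumptions made for illustration, and the sketch ignores the document layout inference that Sensible pairs with keyword matching. It simply scores each blank-line-separated paragraph by how many topic keywords it contains and returns the best match.

```python
import re
from collections import Counter

# Hypothetical keyword list, gathered by reading a variety of sample leases.
RENT_TERMS = ["rent", "pay", "monthly", "due", "lessee", "tenant"]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score_paragraph(paragraph, terms):
    """Count how many tokens in the paragraph match a topic keyword."""
    counts = Counter(tokenize(paragraph))
    return sum(counts[term] for term in terms)


def find_topic_paragraph(document, terms):
    """Return the blank-line-separated paragraph with the highest keyword score."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    return max(paragraphs, key=lambda p: score_paragraph(p, terms))


# Made-up lease text for illustration only.
lease = (
    "This agreement is made between the lessor and the lessee.\n\n"
    "Lessee shall pay 895.00 dollars per month for rent, "
    "due on the first day of each month.\n\n"
    "Pets are not permitted on the premises."
)
best = find_topic_paragraph(lease, RENT_TERMS)
print(best)
```

Because the score depends only on which words appear, not on their order or exact phrasing, the same keyword list can match differently worded rent paragraphs across leases.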
Step 2: Summarize the rent topic
Now that you've found the key lines using the Topic method, feed them to the Summarizer method to get structured data, such as the dollar amount and the rent term.
You configure the Summarizer method with instructions and just a few short samples, so GPT-3 knows what data, or fields, you want to extract from lines about rent:
With the preceding code sample, you might feel like you're talking to Sensible as you would to a human. You're saying, "list the rents, how often the rent must be paid, and when the rent is due." Then you're showing how to extract those values from a couple of sample sentences. In the last prompt, you're saying, "For this sample, there's no rent amount in the paragraph, so return the phrase 'not found' for rent_in_dollars."
And that's it. GPT-3 semantically analyzes the fields you want to extract and their relationships to the prompts you provide. The next time you feed Sensible some key sentences from an actual document, it'll extract values for rent_in_dollars, payment_time_period, and payment_due.
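The instructions-plus-samples pattern described in this step is often called few-shot prompting. The following Python sketch shows one way such a prompt could be assembled and how a model's reply could be parsed back into fields. It is an illustration under stated assumptions, not Sensible's configuration format: the `Text:` and `field: value` layout, the helper names, and the second sample sentence are hypothetical. Only the three field names, the instruction wording, and the "not found" convention come from the description above. The sketch doesn't call any model; it only builds and parses strings.

```python
FIELDS = ["rent_in_dollars", "payment_time_period", "payment_due"]

INSTRUCTIONS = (
    "List the rents, how often the rent must be paid, and when the rent is due."
)

# Hypothetical worked examples. The last one demonstrates the "not found" convention.
EXAMPLES = [
    (
        "Lessee shall pay 895.00 dollars per month for rent, due on the first of the month.",
        {
            "rent_in_dollars": "895.00",
            "payment_time_period": "month",
            "payment_due": "first of the month",
        },
    ),
    (
        "Rent is payable weekly, every Friday.",
        {
            "rent_in_dollars": "not found",
            "payment_time_period": "week",
            "payment_due": "every Friday",
        },
    ),
]


def format_answer(values):
    """Render one example's fields as 'name: value' lines, in a fixed order."""
    return "\n".join(f"{field}: {values[field]}" for field in FIELDS)


def build_prompt(snippet):
    """Combine the instructions, the worked examples, and the new snippet."""
    parts = [INSTRUCTIONS]
    for text, values in EXAMPLES:
        parts.append(f"Text: {text}\n{format_answer(values)}")
    # End with the new snippet so the model continues the established pattern.
    parts.append(f"Text: {snippet}\n")
    return "\n\n".join(parts)


def parse_completion(completion):
    """Turn 'name: value' lines into a dict; missing fields default to 'not found'."""
    result = {field: "not found" for field in FIELDS}
    for line in completion.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() in result:
            result[name.strip()] = value.strip()
    return result


print(build_prompt("Tenant agrees to pay $1,200 each month, due on the 5th."))
```

Keeping the examples in a fixed field order makes the reply easier to parse, and defaulting absent fields to "not found" mirrors the convention shown in the last sample.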
Step 3: Tie it all together
Now you can chain the Summarizer and Topic methods together:
- The Topic method narrows down a long document to a short free-text snippet using keywords.
- The Summarizer method takes the snippet from the Topic method, and extracts structured information using GPT-3.
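The two-stage flow in the list above can be sketched in a few lines of Python. This is a simplified stand-in rather than Sensible's implementation: `find_snippet`, `extract_rent`, and `fake_complete` are hypothetical names, the prompt wording and lease text are made up, and `complete` is assumed to be any function you supply that sends a prompt to a language model and returns its text. The word-budget check reflects the roughly 1,500-word prompt limit mentioned earlier.

```python
import re


def find_snippet(document, terms):
    """Topic step: pick the paragraph containing the most topic keywords."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]

    def score(paragraph):
        words = re.findall(r"[a-z]+", paragraph.lower())
        return sum(words.count(term) for term in terms)

    return max(paragraphs, key=score)


def extract_rent(document, terms, complete, max_words=1500):
    """Chain both steps. `complete` maps a prompt string to the model's reply text."""
    snippet = find_snippet(document, terms)
    prompt = f"List the rent and how often it must be paid.\n\nText: {snippet}\n"
    # Guard against prompts that exceed the model's approximate word budget.
    if len(prompt.split()) > max_words:
        raise ValueError("prompt exceeds the word budget")
    fields = {}
    for line in complete(prompt).splitlines():
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


def fake_complete(prompt):
    """Stand-in for a real model call, so the sketch runs offline."""
    return "rent_in_dollars: 895.00\npayment_time_period: month"


# Made-up lease text for illustration only.
lease = (
    "Pets are not permitted.\n\n"
    "Lessee shall pay 895.00 dollars per month for rent."
)
print(extract_rent(lease, ["rent", "pay", "lessee"], fake_complete))
```

Passing the model call in as a function keeps the chaining logic testable without network access; in real use you would replace `fake_complete` with a call to your model of choice.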
At a high level, let's say your input is an example lease with this page:
You chain together the Topic method and the Summarizer method like this:
And you get output like this:
Try it for free