How to use GPT-3 to parse free-text documents

Frances Elliott
Thursday, June 30, 2022

Extract structured data from natural-language, free-text documents like leases and legal contracts with Sensible using GPT-3.

Introduction

Some documents, such as leases and other contracts, bury key information in paragraphs of legalese or other unstructured text. This historically represented a significant challenge for data extraction, since finding the target data required sophisticated natural language processing (NLP) techniques that, even at their best, weren't particularly reliable.

In 2020, OpenAI made a significant leap forward in generative language models (a subdiscipline of NLP) with their GPT-3 model, which generates text that is difficult in many situations to distinguish from human-authored text. Most of the publicity around GPT-3 has focused on its creative applications, such as completing stories, writing code, or finishing your emails.

At Sensible, we tackled the more mundane but still challenging task of using GPT-3 to summarize unstructured free text into structured data in a business context. Our biggest constraint was GPT-3's prompt size. GPT-3 accepts only about 1,500 words as a prompt for generating new text. For this reason, you can't simply dump a long document like a lease into GPT-3 to get back, say, the rent in dollars. Instead, you need to (1) find the snippet in the lease that most likely discusses rent and (2) feed that snippet to GPT-3.
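To make the size constraint concrete, here is a minimal Python sketch that checks whether a piece of text fits the rough word budget mentioned above. The `fits_in_prompt` helper and the `MAX_PROMPT_WORDS` constant are hypothetical names for illustration, not part of Sensible or the OpenAI API. The real limit is counted in tokens rather than words, so treat the figure as an approximation.

```python
# Rough prompt budget, taken from the "about 1,500 words" figure above.
# Assumption: the real limit is measured in tokens, so this is only a proxy.
MAX_PROMPT_WORDS = 1500


def fits_in_prompt(text: str, max_words: int = MAX_PROMPT_WORDS) -> bool:
    """Return True if the whitespace-delimited word count is within budget."""
    return len(text.split()) <= max_words


snippet = "Lessee shall pay 895.00 dollars per month for rent."
full_lease = " ".join(["word"] * 12000)  # stand-in for a long lease

print(fits_in_prompt(snippet))
print(fits_in_prompt(full_lease))
```

A short snippet passes the check, while a stand-in for a full lease does not, which is why the document has to be narrowed down first.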

Transforming unstructured into structured text with Sensible

As it happens, Sensible's free-text Topic and Summarizer methods are the perfect combination to apply GPT-3 to real-world business uses. Let's dig into how to use these methods to pull structured rent data out of a lease, with no prior knowledge about the exact wording used in the lease.

With the Summarizer method, given paragraphs like these:

Lease section about rents and charges

Sensible can extract information like this:

{
	"rent_computed": [{
		"rent_in_dollars": "$895.00",
		"payment_time_period": "month"
	}]
}

To get such slick output, you'd historically have had to put in a lot of work training machine learning (ML) algorithms on sample documents. But not now! You get this extraction out of the box, because GPT-3 is already trained on a ton of documents – as much of the Internet as it could grab, including all of Wikipedia.

So what, exactly, do you need to do to go from unstructured, natural-language documents to this structured data? Let's dive in:

Step 1: Find the paragraph with the rent topic

Getting structured information out of a sentence like "lessee shall pay 895.00 dollars per month for rent" is only half the battle. First, GPT-3 requires you to narrow the document down to just the few lines containing that sentence. So how do you consistently locate the lines containing the rent amount and the payment frequency in order to extract this information?

Sensible uses the Topic method: 


{
  "fields": [
    {
      "id": "rent_topic_paragraphs",
      "anchor": {
        "match": {
          "type": "first"
        }
      },
      "method": {
        "id": "topic",
        "numParagraphs": 2,
        "terms": [
          "pay",
          "lessee",
          "rent",
          "dollars"
        ]
      }
    }
  ]
}

This code sample tells Sensible to find paragraphs with keywords associated with rent payment. You gather these keywords manually by looking at a variety of lease documents. Sensible then uses a bag-of-words approach paired with document layout inference to find the paragraphs concerned with the topic of rent payment. For example, this single configuration can extract differently worded paragraphs from different leases. It can output something like this:

{
	"rent_topic_paragraphs": {
		"type": "string",
		"value": "1. 2 RENTS AND CHARGES Lessee shall pay 895.00 dollars per month for rent. The first month's rent and/or prorated rent amount shall be due prior to move-in. 2. Late fees will be assessed for any payment overdue by 5 days. "
	}
}

as well as something like this:

{
	"rent_topic_paragraphs": {
		"type": "string",
		"value": "RENT. The rent to be paid by the Tenant to the Landlord throughout the term of this Agreement is to be made in monthly installments of $1,100 (\"Rent \") and shall be due on the first day of each month (\"Due Date \"). For the first month of occupancy, the Tenant must pay prorated amount, if the move-in date is after the 10th of the month, plus the security deposit to establish residency."
	}
}
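The bag-of-words idea behind the Topic method can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Sensible's implementation: it ignores the layout inference mentioned above, the helper names `tokenize`, `score_paragraph`, and `top_paragraphs` are hypothetical, and it assumes `num_paragraphs` simply means "keep the highest-scoring paragraphs."

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def score_paragraph(paragraph: str, terms: list[str]) -> int:
    """Count how many times any of the keywords occurs in the paragraph."""
    counts = Counter(tokenize(paragraph))
    return sum(counts[term.lower()] for term in terms)


def top_paragraphs(paragraphs: list[str], terms: list[str], num_paragraphs: int = 2) -> list[str]:
    """Return the highest-scoring paragraphs (assumed meaning of numParagraphs)."""
    ranked = sorted(paragraphs, key=lambda p: score_paragraph(p, terms), reverse=True)
    return ranked[:num_paragraphs]


terms = ["pay", "lessee", "rent", "dollars"]
paragraphs = [
    "Pets are not allowed on the premises.",
    "Lessee shall pay 895.00 dollars per month for rent.",
    "Late fees will be assessed for any payment overdue by 5 days.",
    "The first month's rent shall be due prior to move-in.",
]
best = top_paragraphs(paragraphs, terms)
print(best)
```

Because the match is on exact words, a misspelled keyword contributes nothing to the score, so it pays to check the terms against the wording in real leases.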

Step 2: Summarize the rent topic

Now that you've found the key lines using the Topic method, feed them to the Summarizer method to get structured data out, such as the dollar amount and the payment period.

First, you configure the Summarizer method with just a few short samples, so GPT-3 knows what data, or fields, you want to extract from lines about rent:

{
	"id": "rent_computed",
	"method": {
		"id": "summarizer",
		"source_id": "rent_topic_paragraphs",
		"fields": [
			"rent_in_dollars",
			"payment_time_period"
		],
		"samples": [{
				"prompt": "Rent 8. Subject to the provisions of this short-term Lease, the rent for the Property is $234.00 each and every week (the \"Rent\").",
				"values": [
					"$234.00",
					"week"
				]
			},
			{
				"prompt": "Rent for this commercial property is due in advance on the 1st day of the quarter, at $20,125.00 per quarter, beginning on November 15, 2015, payable to Owner/Agent at 123 Main Blvd., Sacramento, CA 95864. Payments made in person may be delivered to Owner/Agent between the hours of 24/Z.",
				"values": [
					"$20,125.00",
					"quarter"
				]
			},
			{
				"prompt": "Lessee must pay rents biweekly. For the dollar amount due, see addendum A.",
				"values": [
					"not found",
					"biweekly"
				]
			}
		]
	}
}

With the preceding code sample, you might feel like you're talking to Sensible as you would to a human. You're saying, "I want to find values for the fields rent_in_dollars and payment_time_period. Here are the values I found in a couple of example sentences; now you do the same." In the last prompt, you're saying, "For this sample, there's no rent amount in the paragraph, so return the phrase 'not found' for rent_in_dollars. For payment_time_period, return 'biweekly'."

And that's it. Sensible semantically analyzes the fields you want to extract and their relationships to the prompts you provide. The next time you feed Sensible some key sentences from an actual document, it'll extract values for rent_in_dollars and payment_time_period.
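To give a feel for how samples like these can steer a generative model, here is a minimal Python sketch of few-shot prompt assembly. This article doesn't describe the prompt format Sensible actually sends to GPT-3, so the `Text:` and `field: value` layout and the `build_few_shot_prompt` helper are assumptions made purely for illustration.

```python
def build_few_shot_prompt(fields, samples, snippet):
    """Assemble worked examples plus the new snippet into one prompt string."""
    blocks = []
    for sample in samples:
        lines = ["Text: " + sample["prompt"]]
        # Pair each field name with the value the author supplied for it.
        for field, value in zip(fields, sample["values"]):
            lines.append(f"{field}: {value}")
        blocks.append("\n".join(lines))
    # End with the real snippet and an open first field for the model to complete.
    blocks.append("Text: " + snippet + "\n" + fields[0] + ":")
    return "\n\n".join(blocks)


fields = ["rent_in_dollars", "payment_time_period"]
samples = [
    {"prompt": "The rent for the Property is $234.00 each and every week.",
     "values": ["$234.00", "week"]},
    {"prompt": "Lessee must pay rents biweekly. For the dollar amount due, see addendum A.",
     "values": ["not found", "biweekly"]},
]
prompt = build_few_shot_prompt(
    fields, samples, "Lessee shall pay 895.00 dollars per month for rent."
)
print(prompt)
```

The worked examples show the model the pattern to follow, including the "not found" convention, and the open field at the end invites it to continue that pattern for the new snippet.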

Step 3: Tie it all together

Now you can chain the Summarizer and Topic methods together:

  1. The Topic method narrows down a long document to a short free-text snippet using keywords. 
  2. The Summarizer method takes the snippet from the Topic method, and extracts structured information using GPT-3.

At a high level, let's say your input is a lease with this page:

You chain the Topic and Summarizer methods together like this:

{
  "fields": [
    {
      "id": "rent_topic_paragraphs",
      "anchor": {
        "match": {
          "type": "first"
        }
      },
      "method": {
        "id": "topic",
        "numParagraphs": 2,
        "terms": [
          "pay",
          "lessee",
          "rent",
          "dollars"
        ]
      }
    }
  ],
  "computed_fields": [
    {
      "id": "rent_computed",
      "method": {
        "id": "summarizer",
        "source_id": "rent_topic_paragraphs",
        "fields": [
          "rent_in_dollars",
          "payment_time_period"
        ],
        "samples": [
          {
            "prompt": "Rent 8. Subject to the provisions of this short-term Lease, the rent for the Property is $234.00 each and every week (the \"Rent\").",
            "values": [
              "$234.00",
              "week"
            ]
          },
          {
            "prompt": "Rent for this commercial property is due in advance on the 1st day of the quarter, at $20,125.00 per quarter, beginning on November 15, 2015, payable to Owner/Agent at 123 Main Blvd., Sacramento, CA 95864. Payments made in person may be delivered to Owner/Agent between the hours of 24/Z.",
            "values": [
              "$20,125.00",
              "quarter"
            ]
          },
          {
            "prompt": "Lessee must pay rents biweekly. For the dollar amount due, see addendum A.",
            "values": [
              "not found",
              "biweekly"
            ]
          }
        ]
      }
    }
  ]
}

And you get output like this:

{
  "rent_topic_paragraphs": {
    "type": "string",
    "value": "Lessee shall pay 895.00 dollars per month for rent. The first month's rent and/or prorated rent amount shall be due prior to move-in. For any move in date that is after the 15th of the month, Tenant must pay a full month of rent in order to gain possession of the home. The prorated rent amount will be due the second month of lease. Every month thereafter, Lessee must pay rent on or before the 1st day of each month with 5 days of grace period. The following late fees will apply for payments made after the grace period:"
  },
  "rent_computed": [
    {
      "rent_in_dollars": "895.00",
      "payment_time_period": "month"
    }
  ]
}
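Once you have output shaped like this, downstream code usually wants a number rather than a string. The following Python sketch shows one way to normalize the extracted values. The `parse_rent` helper is hypothetical, and it assumes the "not found" convention from the samples and that amounts may or may not carry a dollar sign or thousands separators.

```python
import json
from decimal import Decimal

# Abbreviated copy of the extraction output shown above.
output = json.loads("""
{
  "rent_computed": [
    {"rent_in_dollars": "895.00", "payment_time_period": "month"}
  ]
}
""")


def parse_rent(extraction):
    """Return (amount, period); amount is None when the model found no value."""
    row = extraction["rent_computed"][0]
    raw = row["rent_in_dollars"]
    if raw == "not found":
        amount = None
    else:
        # Strip the dollar sign and thousands separators before converting.
        amount = Decimal(raw.replace("$", "").replace(",", ""))
    return amount, row["payment_time_period"]


amount, period = parse_rent(output)
print(amount, period)
```

Using `Decimal` instead of `float` avoids rounding surprises when you later add up or compare currency amounts.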

Try it for free

To get structured data out of your unstructured documents, sign up for a free Sensible trial today.