How to extract data from rent rolls with LLMs and Sensible

Updated on May 17, 2024

Author: Frances Elliott

In the real estate industry, rent rolls are key documents used for valuing properties and for evaluating their commercial health. For example, high rents, low vacancy, and long tenure indicate good health; low rents, high vacancy, and short tenure indicate poor health. Companies in the prop tech space need this sort of data to build solutions such as automated rent collection and billing, rent trend analytics, and property ROI analytics. However, they often lack access to rent rolls in any format other than PDFs, which makes data extraction a potentially difficult problem.

Enter Sensible, which offers intelligent document automation. With Sensible you can easily extract key information out of documents using SenseML, Sensible’s query language. SenseML uses a combination of layout-based rules and LLM prompts to extract from the full spectrum of free-form to structured documents. We’ve written a library of open-source SenseML configurations, so you don’t need to write queries from scratch for common documents. From there, the document data is accessible via Sensible’s API, SDK, app, or 5,000 other software integrations thanks to Zapier.

What we'll cover

This blog post briefly walks you through configuring extractions for rent rolls. By the end, you’ll know a few methods for extracting document data using our query language, and you’ll be on your way to extracting any data you choose using our documentation or our prebuilt open-source configurations.

Write document extraction queries with SenseML

Let's extract data from a rent roll. Here's an example of a rent roll PDF with redacted or dummy data:

To extract from this document, take the following prerequisite steps:

Sign up for a Sensible account
Add prebuilt extraction support for rent rolls to your Sensible account. To add support, follow the steps in Out-of-the-box extractions and select proptech.

Our configurations for rent rolls are comprehensive. To keep the example in this post simple, let's just extract:

Total units, total rent, and % occupied
Apartment complex name
Details about each apartment unit, such as the occupant’s name and their monthly rent

We’ll also write some logic to test the monthly rent amounts, to verify that the extraction is working properly.

Extract clustered facts: total units and total rent

Since rent rolls are documents with highly variable layouts, let’s use LLM-based methods to extract the data. By prompting the LLM with short descriptions such as "grand total occupied units", you’ll extract facts as structured data. To improve accuracy and performance, you’ll group facts that always appear together in a cluster in documents.

See the following screenshot for an overview of how to configure a group of LLM prompts that extract a cluster of co-located facts. In this case, they’re on page 18 of the example document:

You can also view this data in JSON view:

To configure the LLM prompts as shown in the preceding screenshot:

Navigate to the prop tech document type you created in a previous step. This document type contains everything you need to extract from rent rolls.
For the purposes of this tutorial, you’ll create a blank test configuration in the document type. Click Create configuration and name it test_rents.
Click the configuration you created to edit it.
Switch to the JSON editor view by clicking Switch to SenseML. The app displays an example rent roll in the middle pane and the empty configuration in the left pane.
Paste the following code into the left pane of the Sensible app.


{
  "fields": [
    {
      "method": {
        /* group queries if and only if the targeted
         facts are always co-located within a page or two
         in the document. Grouping queries improves LLM
         performance and accuracy.
         */
        "id": "queryGroup",
        "queries": [
          {
            "id": "grand_total_sqft_percent",
            "description": "grand total occupied sqft percent",
            "type": "string"
          },
          {
            "id": "grand_total_units",
            "description": "grand total occupied units",
            "type": "string"
          },
          {
            "id": "grand_total_rent",
            "description": "grand total occupied monthly base rent",
            "type": "string"
          }
        ]
      }
    }
  ]
}

You'll get this output in the right pane:


{
  "grand_total_sqft_percent": {
    "value": "94.8%",
    "type": "string",
    "confidenceSignal": "confident_answer"
  },
  "grand_total_units": {
    "value": "168",
    "type": "string",
    "confidenceSignal": "confident_answer"
  },
  "grand_total_rent": {
    "value": "140,379.00",
    "type": "string",
    "confidenceSignal": "confident_answer"
  }
}

In the preceding output, the confidenceSignal is a more nuanced alternative to confidence scores that indicates whether the LLM judges its own answer to be correct.
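
Because the queries above return strings, downstream code usually needs to convert them to numbers and decide what to do with answers the LLM was less sure about. The following is a minimal Python sketch of that post-processing, not part of Sensible's SDK. It assumes the output has the shape shown above, that a field with no answer comes back as null, and that any confidenceSignal other than confident_answer should be routed to manual review. The helper names to_decimal and summarize are hypothetical.

```python
from decimal import Decimal

# Sample output copied from the right-pane JSON shown above.
extraction = {
    "grand_total_sqft_percent": {
        "value": "94.8%", "type": "string", "confidenceSignal": "confident_answer"
    },
    "grand_total_units": {
        "value": "168", "type": "string", "confidenceSignal": "confident_answer"
    },
    "grand_total_rent": {
        "value": "140,379.00", "type": "string", "confidenceSignal": "confident_answer"
    },
}


def to_decimal(text):
    """Strip thousands separators and a trailing percent sign, then parse."""
    return Decimal(text.replace(",", "").rstrip("%").strip())


def summarize(fields):
    """Return (parsed_values, ids_needing_review).

    Assumption: a missing answer is None, and anything other than
    "confident_answer" is treated as needing human review.
    """
    parsed, needs_review = {}, []
    for field_id, field in fields.items():
        if field is None or field.get("confidenceSignal") != "confident_answer":
            needs_review.append(field_id)
            continue
        parsed[field_id] = to_decimal(field["value"])
    return parsed, needs_review


parsed, needs_review = summarize(extraction)
print(parsed)
print(needs_review)
```

Decimal is used instead of float so that currency amounts such as the total rent keep their exact cents.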

Extract a standalone fact: apartment complex name

In unstructured documents, some facts aren’t consistently co-located with other facts. For example, the apartment complex name in rent rolls lacks a pattern of co-located facts. To handle this, let’s put it in a single-query group.

See the following screenshot for an overview of how to extract the apartment name:

To try this out yourself, paste the following query, or "field", into the left pane of the Sensible app in the fields array:


{
      "method": {
        /* if a fact doesn't consistently occur near other facts,
        target it in a single-member group */
        "id": "queryGroup",
        "queries": [
          {
            "id": "apartment_name",
            "description": "apartment complex name",
            "type": "string"
          }
        ]
      }
    },

Since the apartment name is redacted in the example document, you’ll get back the text "LLC".

Extract repeating data: lists of rent details

In the example document, there’s a list of rent details. For each unit in the apartment complex, the document lists details such as the unit number, type, occupants, and market rent. To extract this repeating data, use the List method. The List method describes the list’s overall contents (rent_roll_details) and each item that repeats in the list (unit, name / occupant, etc).

See the following screenshot for an overview of extracting the rent details mentioned in the rent roll:

Click Show full output to see the full list:

To view the same data as JSON, click Switch to SenseML:

To try this out yourself, paste the following query, or "field", into the left pane of the Sensible app in the fields array:


{
      /* the id is a user-friendly name for the target list */
      "id": "rent_roll_details",
      "method": {
        "id": "list",
        /* overall description of list's contents */
        "description": "rent roll details",
        /* for long lists, use `thorough` to specify an LLM model
           that's slower but more accurate */
        "llmEngine": "thorough",
        /* each recurring item in the list is a 'property' */
        "properties": [
          {
            /* for each item in the list, provide a user-friendly ID and 
               description of the data you want to extract
               and optional instructions to filter or reformat the data */
            "id": "unit",
            "description": "unit",
            "type": "string"
          },
          {
            "id": "name",
            "description": "name / occupant",
            "type": "string"
          },
          {
            "id": "rent roll_start",
            "description": "rent roll start / rent start",
            /* optional: target data is a date. Reformats the source date
               in the document to ISO standard */
            "type": "date"
          },
          {
            "id": "rent roll_rent",
            /* give instructions for handling incorrectly formatted
               whitespaces in the document,
               for example, read `3 100.45` as `3100.45` */
            "description": "rent roll rent. ignore whitespaces in number",
            "type": "number"
          },
          {
            "id": "sqft",
            "description": "sqft",
            "type": "string"
          },
          {
            "id": "rent roll_end",
            "description": "rent roll end / expiration",
            "type": "date"
          }
        ]
      }
    },

NOTE: The List method can take several minutes to return results when you set the LLM Engine parameter to thorough.

You’ll get output like the following (truncated):


{
  "rent_roll_details": {
    "columns": [
      {
        "id": "unit",
        "values": [
          {
            "value": "1",
            "type": "string"
          },
          {
            "value": "2",
            "type": "string"
          },
          [...]
         
        {
        "id": "name",
        "values": [
          {
            "value": "Maria",
            "type": "string"
          },
          {
            "value": "Darwin",
            "type": "string"
          },
          [...]
        {
        "id": "rent roll_start",
        "values": [
          {
            "source": "11/21/17",
            "value": "2017-11-21T00:00:00.000Z",
            "type": "date"
          },
          {
            "source": "08/31/18",
            "value": "2018-08-31T00:00:00.000Z",
            "type": "date"
          },
       [...]
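
The list output above is organized by column, with one values array per property. If you want one record per apartment unit instead, you can pivot it. The following is a rough Python sketch under stated assumptions: it uses an abbreviated copy of the output above, assumes every column has the same number of values, and assumes a cell with no answer is null. The function name columns_to_rows is hypothetical and isn't part of Sensible's API.

```python
# Abbreviated sample shaped like the truncated output shown above.
rent_roll_details = {
    "columns": [
        {
            "id": "unit",
            "values": [
                {"value": "1", "type": "string"},
                {"value": "2", "type": "string"},
            ],
        },
        {
            "id": "name",
            "values": [
                {"value": "Maria", "type": "string"},
                {"value": "Darwin", "type": "string"},
            ],
        },
        {
            "id": "rent roll_start",
            "values": [
                {"source": "11/21/17", "value": "2017-11-21T00:00:00.000Z", "type": "date"},
                {"source": "08/31/18", "value": "2018-08-31T00:00:00.000Z", "type": "date"},
            ],
        },
    ]
}


def columns_to_rows(table):
    """Pivot column-oriented list output into one dict per list item."""
    ids = [column["id"] for column in table["columns"]]
    value_lists = [column["values"] for column in table["columns"]]
    rows = []
    # zip(*...) walks the columns in parallel, one list item at a time.
    for cells in zip(*value_lists):
        rows.append({
            column_id: (cell["value"] if cell is not None else None)
            for column_id, cell in zip(ids, cells)
        })
    return rows


rows = columns_to_rows(rent_roll_details)
for row in rows:
    print(row)
```

Note that zip stops at the shortest column, so if the columns can differ in length you would want to check their lengths before pivoting.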

Transform extracted data: Validate rent amounts

In the example document, there are data-entry errors. See the following screenshots for examples of these typos:

In previous steps, you prompted the LLM to handle typos with the instruction "ignore whitespaces in number". However, LLM output is nondeterministic, so the LLM can still interpret a typo like 3 768,43 as the number 3. Since it’s unlikely that an occupant has a monthly rent of $3, let’s validate that every extracted rent is at or above a reasonable baseline, say $100. If the rent amount is null, let’s return "rent not found" instead.

To try this out yourself, paste the following query, or "field", into the left pane of the Sensible app in the fields array:


{
            "id": "is_rent_over_100_dollars",
            "method": {
              "id": "customComputation",
              "jsonLogic": {
                "if": [
                  /* check the rent amount exists (is non-null) */
                  {
                    "exists": [
                      {
                        "var": "rent roll_rent.value"
                      }
                    ]
                  },
                  /* if it's non-null, return true if the rent value is 
                     greater than or equal to 100 */
                  {
                    ">=": [
                      {
                        "var": "rent roll_rent.value"
                      },
                      100
                    ]
                  },
                  /* if the rent value is null, return 'rent not found' */
                  "rent not found"
                ]
              }
            }
          }

Switch back to Sensible Instruct to view the output as a table:

All the rents in the preceding screenshot returned true for is_rent_over_100_dollars. You can then write validations that return error messages on document extractions if a field returns false for this condition.
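
To make the three branches of the JsonLogic check easier to reason about, here is a minimal Python sketch of the same decision. It isn't how Sensible evaluates the rule; it only mirrors the logic. The function name reuses the field id for readability, the baseline parameter is an assumption added for illustration, and the sample rents are hypothetical values rather than data from the example document.

```python
def is_rent_over_100_dollars(rent, baseline=100):
    """Mirror the JsonLogic above.

    - rent is None  -> "rent not found"
    - rent >= 100   -> True
    - otherwise     -> False (likely a misread typo, such as 3 for "3 768,43")
    """
    if rent is None:
        return "rent not found"
    return rent >= baseline


# Hypothetical extracted rents: a plausible rent, a misread typo, a missing value.
sample_rents = [1250.0, 3, None]
results = [is_rent_over_100_dollars(rent) for rent in sample_rents]
print(results)
```

A rent that returns False here is a candidate for manual review, since the check can't tell a genuine low rent from a misread number.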

Test the extraction template with a second document

You can use the extraction queries, or fields, you authored in previous steps to extract from other documents. To try it out:

Publish your template by selecting Publish configuration > Publish to production:

Download the second example document: Download link

Upload the second example document by clicking Add file in the Sensible Instruct editor view:

Note that the extracted data in the right pane updates to reflect the new document:

Extract from your documents

Congratulations, you’ve learned some key methods for extracting structured data from rent rolls. To start extracting from your own rent roll documents:

Use our pre-built support for rent rolls to extract more comprehensive data than covered in this tutorial. To explore the support, open the rent_rolls configuration, and start uploading your own documents to test against this config.
Integrate rent roll document extractions in volume using the Sensible API, SDK, or bulk-upload UI.

Advanced extractions

We offer advanced configuration for LLM prompts, so you can extract facts, lists, and tables from even the trickiest document. You can extract from non-text images embedded in documents using multimodal LLMs such as GPT-4 Vision. And if an LLM can’t extract the data you’re looking for, you can always fall back to Sensible’s layout-based, deterministic extraction methods.

Try it out for free

There's more extraction power for you to uncover. Sign up for an account (no credit card required), check out our prebuilt configs in our open-source library, and peruse our docs to start extracting data from your own documents.
