Multimodal Engine for complex document extraction

Sensible’s new Multimodal Engine uses LLMs to extract data from non-text and partial-text images embedded in a document, including pictures, charts, graphs, and handwriting. The parameter also improves extraction accuracy for documents with challenging layouts, such as overlapping lines, non-standard checkboxes, and signatures. With the Multimodal Engine, you can extract structured data from previously inaccessible sources within a document, such as details about the elements of a non-text image, adding a powerful new automation tool to your document processing toolset.

The Multimodal Engine parameter sends an image of the document region containing the target data to a multimodal LLM, allowing you to ask questions about non-text and partial-text images. As with query groups, Sensible automatically selects a relevant excerpt and surrounding context from the document, based on your natural language queries, and sends it as an image to the multimodal LLM. Alternatively, you can set an anchor and use Region parameters to define the image’s location deterministically.
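Both modes are set on the same multimodalEngine parameter. The snippets below are minimal sketches drawn from the full configurations later in this post; the numeric dimensions and offsets are illustrative rather than recommended values.

Automatic region selection, where Sensible chooses the excerpt based on your queries:

"multimodalEngine": {
  "region": "automatic"
}

An explicit region, positioned relative to the field's anchor:

"multimodalEngine": {
  "region": {
    "start": "below",
    "width": 7.6,
    "height": 1.75,
    "offsetX": -1.3,
    "offsetY": -0.2
  }
}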

Here are two ways to use Sensible’s Multimodal Engine parameter:

Extract data from images embedded in a document

The Multimodal Engine parameter can extract facts from, or about, an image, or interpret charts and graphs within the context of a query group. Using the following image from a property’s offering memorandum as an example, you can return structured data about the building’s characteristics, including exterior material, number of stories, and the presence of trees, as well as facts from the community amenities text box, like ownership updates.

After enabling the Multimodal Engine parameter, use the following configuration to extract data about the building's characteristics:

{
  "fields": [
    {
      "method": {
        "id": "queryGroup",
        "chunkSize": 1,
        "chunkCount": 2,
        "multimodalEngine": {
          "region": "automatic"
        },
        "queries": [
          {
            "id": "trees_present",
            "description": "are there trees on the property? respond true or false",
            "type": "string"
          },
          {
            "id": "multistory",
            "description": "are the buildings multistory? return true or false",
            "type": "string"
          },
          {
            "id": "community_amenities",
            "description": "give one example of a community amenity listed",
            "type": "string"
          },
          {
            "id": "exterior",
            "description": "what is the exterior of the building made of (walls, not roof)?",
            "type": "string"
          }
        ]
      }
    }
  ]
}

The configuration returns the following output:

{
  "trees_present": {
    "value": "true",
    "type": "string"
  },
  "multistory": {
    "value": "true",
    "type": "string"
  },
  "community_amenities": {
    "value": "Gated perimeter with key card access",
    "type": "string"
  },
  "exterior": {
    "value": "Brick",
    "type": "string"
  }
}

Other uses for extracting data from non-text images include corroborating insurance claims against submitted damage photos, or extracting data directly from visual charts in a financial report.
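For instance, a chart-focused configuration might look like the following sketch. It’s hypothetical: the field IDs and queries are invented for illustration and assume the financial report contains a quarterly revenue bar chart, but the structure mirrors the query group example above.

{
  "fields": [
    {
      "method": {
        "id": "queryGroup",
        "multimodalEngine": {
          "region": "automatic"
        },
        "queries": [
          {
            "id": "highest_revenue_quarter",
            "description": "according to the revenue bar chart, which quarter shows the highest revenue?",
            "type": "string"
          },
          {
            "id": "revenue_trend",
            "description": "is revenue in the chart trending up or down across the year? respond up or down",
            "type": "string"
          }
        ]
      }
    }
  ]
}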

Extract data from documents with complex and imprecise layouts

Previously, document formatting issues like overlapping lines, lines between lines, checkboxes, and handwriting made it difficult to reliably extract data. With the new Multimodal Engine parameter, Sensible sends an image of the relevant region to the LLM, which uses context to process the region holistically and extract data much as a human would. In the following example, the handwritten form contains imprecise pen marks and checkboxes, as well as some line overlap. After you define the specific region of the form to extract, Sensible sends the image to the multimodal LLM, which accurately extracts the data despite the formatting issues.

After enabling the Multimodal Engine parameter and defining a custom extraction region, use the following configuration to extract the handwritten responses:

{
  "preprocessors": [
    {
      "type": "nlp",
      "confidenceSignals": true
    }
  ],
  "fields": [
    {
      "method": {
        "id": "queryGroup",
        "multimodalEngine": {
          "region": {
            "start": "below",
            "width": 7.6,
            "height": 1.75,
            "offsetX": -1.3,
            "offsetY": -0.2
          }
        },
        "queries": [
          {
            "id": "ownership_type",
            "description": "What is the type of ownership?",
            "type": "string"
          },
          {
            "id": "owner_name",
            "description": "What is the full name of the owner?",
            "type": "string"
          }
        ]
      }
    }
  ]
}

The configuration returns the following output:

{
  "ownership_type": {
    "value": "Natural Person(s)",
    "type": "string"
  },
  "owner_name": {
    "value": "Kyle Murray",
    "type": "string"
  }
}

The new Multimodal Engine parameter opens up new possibilities for extracting structured data from non-text and partial-text images and improves extraction accuracy for challenging or complex layouts, enhancing your ability to fully automate data extraction from a wider range of documents.

Try Multimodal Engine support in the LLM Query Group method.
