At Sensible we've used large language models (LLMs) to transform documents into structured data since the developer preview of GPT-3. In that time we've developed a set of best practices for document question answering that complement the basic chunking and embedding scoring approach well-represented in frameworks like LangChain.
In particular we're focused on how to do document question answering at scale across a wide range of unknown document layouts. This differs from a scenario where you're chatting with a particular PDF and can try several prompt variants interactively to get what you want. Instead, we need to create robust prompts and chunking strategies that are as invariant as possible in the face of variability between documents.
These techniques are particularly useful in mature industries where documents function as de facto API calls between companies. We've seen customers across several verticals (insurance, logistics, real estate, and financial services) realize significant operational efficiency gains via LLM-powered document automation.
Let's dig into a few areas of optimization for document question answering with LLMs: chunking, layout preservation, cost optimization, and confidence scores.
Many documents contain more content than can fit in the 4-8k token contexts that the base GPT models provide. And even in cases where a document fits into a given context, you may want to avoid doing so to save on cost and execution time. As a result it’s standard practice to split a document or multiple documents into "chunks," calculate a similarity score between those chunks and your query using embeddings, and then use the highest-scoring chunks as context for your query.
In a basic case, your final prompt to the LLM has the following shape:
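A minimal sketch of that shape in Python (the instruction wording and the sample chunk are illustrative assumptions, not our production prompt):

```python
def build_prompt(chunk: str, question: str) -> str:
    """Assemble a basic question-answering prompt from a top-scoring chunk."""
    return (
        "Answer the question using only the context below. "
        'If the answer is not in the context, reply "Unknown".\n\n'
        f"Context:\n{chunk}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt(
    chunk="Lessee: Acme Corp, 123 Main St, Springfield",
    question="What is the name of the lessee?",
)
```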
Note that in this example and those below we're showing only one chunk as context, but typically you might use several.
In the context of answering questions about documents, we have many tools at our disposal to improve this core approach.
In many cases we know the kind of documents that we're pulling our chunks from, and we can pass that information along to the LLM. This can situate the chunk text more concretely in a real world scenario and lead to higher quality answers. With a context description your prompt will look as follows:
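A sketch of the same prompt with a context description prepended (again, the exact phrasing is an illustrative choice):

```python
def build_prompt(chunk: str, question: str, context_description: str) -> str:
    """Situate the chunk in a real-world document type before asking."""
    return (
        f"{context_description}\n\n"
        "Answer the question using only the context below.\n\n"
        f"Context:\n{chunk}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt(
    chunk="Lessee: Acme Corp, 123 Main St, Springfield",
    question="What is the name of the lessee?",
    context_description=(
        "The context below is an excerpt from a commercial lease agreement."
    ),
)
```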
Suppose you know that the information you're looking for in a document tends to appear on the first page, and so in your question you say, "Give me the address of the lessee named at the top of the first page." With the default prompt above, the LLM has no information about pages or the position of text within those pages, but we do have access to that information when creating the chunks.
By including this information in the chunk itself we both cause our chunk scoring to be position-aware, and give the LLM position context for question answering. With page hinting your prompt now looks as follows:
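A sketch of page hinting, where each chunk carries its page position as plain text so both the embedding and the LLM see it (the phrasing is our illustrative choice):

```python
def page_hinted_chunk(page_number: int, total_pages: int, text: str) -> str:
    """Prefix a chunk with its page position so that both embedding-based
    chunk scoring and the LLM can reason about where the text appears."""
    return (
        f"The following text is from page {page_number} "
        f"of {total_pages}:\n{text}"
    )

chunk = page_hinted_chunk(1, 12, "Lessee: Acme Corp\nLessor: Jones Properties LLC")
```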
Decoupling chunk scoring from question answering
The more chunks you have, the harder it is to identify the correct chunks to pass to your model. There could be many portions of a long document that have significant semantic overlap with your question, yet don't answer the question. Think again of that commercial lease agreement — the document defines various legal terms as they relate to the lessor and lessee throughout, and so a question like "What is the name of the lessee?" might pull chunks from parts of the document that heavily reference the lessee but don't contain the name of the lessee.
As shown above, we could use page hinting to attempt to solve this, since we know that the lessee name will typically appear near the beginning of the lease. But we've still identified a notable tension in the standard approach to this problem: we're using the same question text for both chunk scoring and question answering.
In cases where this tension leads to bad chunk scoring results (often for longer documents, or for complex questions) the best practice is to decouple your chunk-scoring text from your question text. You create a custom snippet of chunk-scoring text, embed that, and then use similarity measurements with that embedding to select your top chunks.
This opens up the ability to create structural and semantic overlap for your chunk scoring. For example, if you know that the policy number you're looking for often shows up in a label/value relationship near an effective and expiration date, you could use the following chunk-scoring text:
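For instance, a hypothetical chunk-scoring snippet can mimic the label/value structure you expect. The sketch below pairs it with a toy bag-of-words "embedding" purely so the example runs offline; in practice you'd use a real embedding model for both the scoring text and the chunks:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words stand-in for a real embedding model
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Chunk-scoring text mimicking the label/value structure we expect;
# the values themselves are placeholders.
scoring_text = (
    "Policy number: 123-4567890  "
    "Effective date: 01/01/2023  Expiration date: 01/01/2024"
)

chunks = [
    "Policy number: ABC-555  Effective date: 03/15/2023  Expiration date: 03/15/2024",
    "This section defines the obligations of the insured party under the policy terms.",
]

# Score chunks against the scoring text, not against the question
target = embed(scoring_text)
best = max(chunks, key=lambda c: cosine(embed(c), target))
```

The question itself ("What is the policy number?") is then used only in the final prompt, against the winning chunk.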
Similarly in a commercial lease agreement case, you could simply use the paragraph where the lessor and lessee are defined from an existing commercial lease agreement as your chunk-scoring text. For example:
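A hypothetical example of such a paragraph, with placeholder party names and dates:

```python
# Chunk-scoring text lifted from the definitions paragraph of an existing
# lease; the parties and dates here are illustrative placeholders.
scoring_text = (
    'This Lease Agreement is entered into as of January 1, 2023, by and '
    'between Jones Properties LLC ("Lessor") and Acme Corp ("Lessee").'
)
```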
Splitting a document into chunks involves several implicit decisions: how big are the chunks, how much do they overlap with one another, and how many of the top scoring chunks do I present to the LLM?
We've settled on sizing chunks in terms of pages (e.g., half a page per chunk or two pages per chunk) rather than a fixed number of tokens. With page-based sizing, it's easier to bring your intuitions about the documents you're parsing to bear.
You might know that the data you seek is typically contained in one cover page, so you can have a single page per chunk and take only the top-scoring page. Similarly you might know that your target data are spread over several pages, but only occupy a small portion of those pages. In that case you can use a quarter- or half-page chunk size and a larger number of chunks.
Similarly with overlap, you may know that your document layout doesn't flow across page boundaries and zero overlap is appropriate alongside full-page chunks. Typically though some amount of overlap is useful in order to prevent relevant data from splitting across chunks.
What page-based sizing doesn't guarantee is predictable token counts on a per-chunk basis, so you need to check each chunk's token count and adjust the number of chunks you use accordingly. In general, as a cost optimization and to simplify the prompt, using a smaller number of top chunks is preferable.
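The sizing and overlap decisions above can be sketched as a page-based chunker, with line counts standing in for fractions of a page (`pages_per_chunk` and `overlap_fraction` are the knobs discussed above):

```python
def page_chunks(pages, pages_per_chunk=0.5, overlap_fraction=0.25):
    """Split a document into chunks sized in (fractions of) pages.

    Line counts approximate fractional pages, and consecutive chunks
    overlap so label/value pairs straddling a boundary stay together.
    """
    lines = [line for page in pages for line in page.splitlines()]
    lines_per_page = max(1, len(lines) // max(1, len(pages)))
    size = max(1, int(lines_per_page * pages_per_chunk))
    step = max(1, int(size * (1 - overlap_fraction)))
    chunks = []
    for start in range(0, len(lines), step):
        window = lines[start:start + size]
        if window:
            chunks.append("\n".join(window))
        if start + size >= len(lines):
            break
    return chunks

# Two 4-line "pages", half-page chunks, 25% overlap
chunks = page_chunks(["a\nb\nc\nd", "e\nf\ng\nh"])
```

After chunking, count tokens per chunk (e.g., with a tokenizer like tiktoken) before deciding how many top chunks fit your budget, since page-based sizing doesn't bound token counts.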
OpenAI's GPT family of LLMs is not intrinsically layout aware. The models take a string as context, produce a string as output, and accept no explicit metadata, such as bounding boxes, that situates the components of the string in 2D space. In contrast, models such as LayoutLM, Donut, and ERNIE-Layout explicitly represent 2D bounding boxes in their context.
In principle retaining document layout information should lead to better results for extracting data from documents, since spatial information is so central to how people interpret the information in documents. In practice it's a bit more complicated.
In our testing, the GPT APIs significantly outperform layout-aware models like LayoutLM in zero-shot question answering. To achieve this, however, we embed layout information into the GPT context string via intelligent bounding-box-to-whitespace conversion. Specifically, we perform a human-reading-order sort on the lines of each page of the document we're extracting from, and then insert newlines and tab stops to preserve approximate spatial relationships between those lines.
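A simplified sketch of that conversion, assuming each OCR line arrives as a `(text, x, y)` tuple in points, and using runs of spaces rather than literal tab stops (`char_width` and `line_height` are rough assumptions about the type size):

```python
def layout_to_text(ocr_lines, char_width=6.0, line_height=12.0):
    """Render OCR lines as a string approximating the page's 2D layout.

    Lines are sorted into human reading order (top-to-bottom, then
    left-to-right), and horizontal offsets become runs of spaces so
    label/value columns stay visually aligned in the flat string.
    """
    rows = {}
    for text, x, y in sorted(ocr_lines, key=lambda l: (round(l[2] / line_height), l[1])):
        row = round(y / line_height)      # vertical position -> text row
        col = int(x / char_width)         # horizontal position -> text column
        existing = rows.get(row, "")
        pad = max(col - len(existing), 1 if existing else 0)
        rows[row] = existing + " " * pad + text
    return "\n".join(rows[r] for r in sorted(rows))

page = layout_to_text([
    ("Name:", 0.0, 0.0),
    ("Acme Corp", 60.0, 0.0),
    ("Date:", 0.0, 12.0),
    ("01/01/2023", 60.0, 12.0),
])
```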
This preprocessing step is most impactful when you have semi-structured documents with label/value pairs, tables, and other forms of semantic whitespace.
At 3¢ per 1k tokens for GPT-4 prompting, it's pretty easy to run up a tab if you have many questions about many documents. The 6¢ per 1k for completions is less painful, given that question-answering completions are typically quite short. In some other use cases, however, such as extracting tables from documents, the completion load can be a factor as well.
For question answering, you save the most money by doing an excellent job at the chunking step, so that you can confidently minimize the amount of context you need to feed the model. For example, if you have a twelve-page document that contains one page of label/value data, then by using chunk scoring text effectively (mimicking the structure of your target page) alongside a one-page, non-overlapping chunk size, you'll essentially guarantee that your top chunk is the one you need for question answering.
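A back-of-the-envelope comparison for that twelve-page example, assuming roughly 800 tokens per page and the GPT-4 prompt pricing cited above:

```python
TOKENS_PER_PAGE = 800        # rough assumption for a dense page of text
PROMPT_COST_PER_1K = 0.03    # GPT-4 prompt pricing cited above (USD)

def prompt_cost(pages_of_context: int) -> float:
    """Approximate prompt cost for a given number of pages of context."""
    return pages_of_context * TOKENS_PER_PAGE / 1000 * PROMPT_COST_PER_1K

whole_document = prompt_cost(12)  # feed the entire twelve-page document
one_chunk = prompt_cost(1)        # feed only the top-scoring page
```

Under these assumptions the one-chunk prompt is roughly a twelfth of the cost per question, which compounds quickly across many documents.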
The other opportunity we've seen for cost optimization is in table extraction. Here we're feeding a table of an unknown format into the LLM, potentially spanning multiple pages, and then asking the LLM to reorganize the table to match a new set of column descriptions. This can be quite powerful, particularly when you have columns in the original table with multiple data elements that you want to separate out (e.g., you can ask the LLM to rewrite a column with both square feet and cost per square foot in a commercial real estate document as two columns in an extracted table). The downside is that your completion token count will be roughly equal to your prompt token count, which can quickly get expensive for large tables.
In this case you can optimize cost by asking the LLM to generate a concordance between the source and target column headers and then using those to rewrite the table outside the LLM. This approach misses out on some of the power an LLM might provide for column splitting or other data cleaning, but significantly reduces the cost of just getting to a table in your desired schema.
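A sketch of the concordance approach: the LLM is asked once for a source-to-target header mapping, and every row is then rewritten locally (the headers below are hypothetical):

```python
def rewrite_table(rows, concordance):
    """Rewrite a table into a target schema using a header concordance.

    `concordance` maps source header -> target header; it's the small,
    one-shot mapping you'd ask the LLM to produce, after which the rows
    are rewritten outside the LLM instead of re-emitted token by token.
    """
    return [
        {target: row.get(source, "") for source, target in concordance.items()}
        for row in rows
    ]

# Hypothetical headers from a commercial real estate rent roll
rows = [{"Sq. Ft.": "1,200", "Rate/SF": "$32.00"}]
concordance = {"Sq. Ft.": "square_feet", "Rate/SF": "cost_per_square_foot"}
rewritten = rewrite_table(rows, concordance)
```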
One resounding request we've had from our customers regarding our LLM-based extraction methods is for confidence scores. OCR and some classic ML models typically output a number between 0 and 1 to indicate the confidence of the model.
There is no natural analog to this confidence number when working with LLMs, however. One approach to this issue is to ask the LLM to generate a real-valued number representing its confidence. In practice this results in a misleading sense of specificity. LLMs aren't truly using an internal representation that smoothly samples possible values between 0 and 1. Instead their output is essentially qualitative, which is a bad match for a quantitative metric. In addition, LLM-provided confidence scores are inherently biased. The LLM attempts to find the best possible result and then overestimates its level of confidence.
As a more principled alternative to confidence scores, we've settled on an approach we call confidence signals. We ask the LLM whether any common sources of uncertainty are present in the prompt and its answer. In practice, the LLM correctly identifies uncertainties, for example when we've asked for a single data point and there are multiple data points in the context, or when we've asked for data that simply aren't present. These are the five confidence signals that we track at Sensible:
- No answer found in the context
- The answer is partial or otherwise incomplete
- Multiple answer candidates are present in the context
- The query is too ambiguous to provide a confident answer
- There are no sources of uncertainty for the answer (this is the fully confident case)
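One way to operationalize these signals is to have the LLM append exactly one signal token to its completion and parse it off afterward (the enum values and instruction wording below are illustrative assumptions, not Sensible's exact implementation):

```python
from enum import Enum

class ConfidenceSignal(Enum):
    NO_ANSWER = "no_answer_found"
    PARTIAL_ANSWER = "partial_answer"
    MULTIPLE_CANDIDATES = "multiple_candidates"
    AMBIGUOUS_QUERY = "ambiguous_query"
    CONFIDENT = "confident"

# Appended to the question-answering prompt
SIGNAL_INSTRUCTION = (
    "After your answer, on a new line, write exactly one of: "
    + ", ".join(s.value for s in ConfidenceSignal)
)

def parse_signal(completion: str) -> ConfidenceSignal:
    """Pull the trailing confidence signal off an LLM completion."""
    return ConfidenceSignal(completion.strip().splitlines()[-1].strip())

signal = parse_signal("Acme Corp\nmultiple_candidates")
```

Constraining the model to a small closed vocabulary like this avoids the false precision of a 0-to-1 score while still being machine-checkable.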
These signals can help refine your prompting or context selection so that you receive a more certain answer to your query.
Our commitment at Sensible is to create a bridge between documents and software, and we believe in the power of large language models to automate operations. We're not just theorizing – we've helped many businesses across insurance, logistics, real estate, and financial services realize significant efficiency gains with our document extraction techniques. These best practices can take you beyond the basics, ensuring you effectively scale your document question answering across varied layouts. If you want to nerd out on these topics with us, just drop us a line. Sign up for a free Sensible account to see the above work in action or learn more here.