We’re excited to announce the launch of Sensible’s confidence signals for our natural language methods. Confidence signals gauge the accuracy of LLM extractions, much as confidence scores do for traditional machine learning models.
LLM extractions carry common sources of uncertainty: multiple possible answers, partial or incomplete answers, uncertain answers, or no answer at all when the document lacks the necessary context. Sensible’s confidence signals now identify these sources of uncertainty, enabling you to refine your LLM prompting and achieve better extraction results.
Why confidence scores aren’t ideal for LLMs
The main purpose of any confidence indicator, be it a quantitative score or a qualitative signal, is to highlight potential uncertainties for human review. Sensible's confidence signals help you understand how the LLM interprets your prompts and suggest improvements to achieve more accurate results.
How Sensible’s confidence signals work
To generate confidence signals, Sensible adds an uncertainties property to each extractable field. This property prompts the LLM to assess its confidence in an answer against a defined set of considerations:
- Partial answer found: an answer is produced, but the LLM isn’t confident that it fully addresses your query
- Multiple answers found: an answer is produced, but the LLM has identified multiple answers that could work
- No answer found, query too ambiguous: the LLM is unable to identify an answer because of the prompt’s ambiguity
- Answer found: the LLM is confident about the produced answer and can reliably reproduce the extraction across varying document types
- No answer found: an answer cannot be produced from the context
If the model reports any uncertainty, Sensible returns the corresponding confidence signal with the extraction. From there, you can manually review the extraction and refine the prompt as needed. Over time, confidence signals help you create more robust prompts, increasing extraction accuracy and reducing the need for human review.
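The review workflow above can be sketched in a few lines of code. This is a minimal, hypothetical example: the field shape (`value`, `confidence_signal`) and the signal strings are illustrative assumptions, not Sensible’s exact API schema.

```python
# Hypothetical sketch: route extracted fields to human review based on
# confidence signals. Field names and signal strings are assumptions
# for illustration, not Sensible's documented response format.

# Signals that suggest a human should look at the extraction.
NEEDS_REVIEW = {
    "partial_answer_found",
    "multiple_answers_found",
    "no_answer_found_query_too_ambiguous",
    "no_answer_found",
}

def triage(fields: dict) -> tuple[dict, dict]:
    """Split extracted fields into auto-accepted and flagged-for-review."""
    accepted, flagged = {}, {}
    for name, field in fields.items():
        signal = field.get("confidence_signal", "answer_found")
        if signal in NEEDS_REVIEW:
            flagged[name] = field
        else:
            accepted[name] = field
    return accepted, flagged

# Example extraction payload (hypothetical).
extraction = {
    "policy_number": {"value": "PN-1234", "confidence_signal": "answer_found"},
    "renewal_date": {"value": "2023-06-01", "confidence_signal": "multiple_answers_found"},
}
accepted, flagged = triage(extraction)
print(sorted(accepted))  # ['policy_number']
print(sorted(flagged))   # ['renewal_date']
```

Only the flagged fields go to a reviewer; reviewing which signals recur for a given field tells you which prompts to rewrite first.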