Migrating off deprecated OpenAI models in a production system

Updated on
December 20, 2023
min read
Migrating off deprecated OpenAI models in a production system
Table of contents
Turn documents into structured data
LLM document extraction for developers
Get started free
Share this post

OpenAI announced in July that it would deprecate a large swath of its legacy completion and embedding models in January 2024. While we primarily use non-legacy GPT-3.5 and GPT-4 completion variants at Sensible, our core query method uses text-davinci-003.

Determining the appropriate replacement model is not a cut-and-dry process. OpenAI's guidance is to replace its InstructGPT completion models, like text-davinci-003, with gpt-3.5-turbo-instruct. This is not the path we took after carefully evaluating multiple replacement models for accuracy.

Model evaluation

To gauge the accuracy of any engine change at Sensible, we have a library of what we call goldens: many hundreds of PDFs, extraction logic using our document query language, SenseML (both LLM-based and layout-based), and the expected output from that combination. For many changes to the engine code we expect no change to our outputs, and we block PRs where our goldens change without an explicit commit to modify their expected output.

In the case of an OpenAI model swap, we do expect some changes to the goldens, but we want them to be an overall improvement while minimizing their absolute number. When we test a new model we run our entire golden set against this new model and evaluate any changes to our output. We can then calculate improvements, neutral changes, and regressions.

To evaluate candidate replacement models, we test model/prompt pairs to find the set that best achieves the above goals. In this process our prompt changed significantly from its text-davinci-003 version and adopted some new best practices in prompting the gpt-3.5 models (e.g., thinking step-by-step).

Previous prompt

Answer the question as truthfully and concisely as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."

Updated prompt

You are a helpful assistant that retrieves answers from documents for users. Your responses consist of only the answer, with no other comments, description or explanations. Please follow the below steps carefully:          

1. Analyze the query and document context provided thoroughly.         
2. Extract and return the answer for the query from the document context and return it concisely without any modifications from the document. If no answer is found then simply return "I don't know".

In concert with the above prompt changes, we found that gpt-3.5-turbo-0613 is the best-performing model for replacing text-davinci-003 for our use case. The gpt-4-turbo series models are also promising but are currently too slow for our core query method (we do use gpt-4-turbo in the new thorough setting for our list method).

Confidence signals

We also needed to port confidence signals, our LLM-powered version of confidence scores, to the new model. Confidence signals are categorical indications of uncertainty in the model's response (e.g., the answer may be incomplete). With text-davinci-003 we bundled the confidence signal prompt with the core question-answering prompt to get both the signal and answer in a single completion call, which optimized for both speed and cost.

Using our same golden set, we saw significant regressions in confidence signal performance when using a bundled prompt with the gpt-3.5 models. As a result we've split the core question answering and the confidence signal into two separate completions. This is slower and more costly relative to a single gpt-3.5 call, but given that the transition from text-davinci-003 to gpt-3.5 already realizes some cost savings and speed improvements, we're comfortable spending some of those gains on higher quality results.

Splitting the completions has some ancillary benefits. First, we can use a different model for the confidence signal evaluation (currently gpt-3.5-turbo-1106), which reduces model bias in the joint response. Second, with the combined approach we required JSON output from the model. Now our question answering is formatless, which gives our users more flexibility to customize their return values.

This change also opened up a new category of confidence signal: incorrect answer. Whereas before the same LLM completion was providing the answer and assessing it, now the confidence signal completion sees the answer in place with its context and can judge that the question answering completion was incorrect.

A final benefit of this switch is that we now see no result jitter when turning confidence signals on and off. Before enabling confidence signals would increase the cognitive load, so to speak, of the prompt. Now it has no impact on the core question answering.

LLMs are a powerful tool for turning documents into structured data. At Sensible we're focused on providing developers with state-of-the-art LLM-powered document processing and automation by implementing a wide range of best practices , staying current with model advancements, and smoothly managing transitions across models.

LLM document extraction for developers
Turn documents into structured data
Get started free
Share this post

Turn documents into structured data

Document Automation for Developers

Stop relying on manual data entry. With Sensible, claim back valuable time, your ops team will thank you, and you can deliver a superior user experience. It’s a win-win.

Start ingesting documents with just a few lines of code. Add document automation to your product in minutes, not months.