Fine-tuning
Learn how to customize a model for your application.
Introduction
This guide is intended for users of the new OpenAI fine-tuning API. If you are a legacy fine-tuning user, please refer to our legacy fine-tuning guide.
Fine-tuning lets you get more out of the models available through the API by providing:
- Higher quality results than prompting
- Ability to train on more examples than can fit in a prompt
- Token savings due to shorter prompts
- Lower latency requests
GPT models have been pre-trained on a vast amount of text. To use the models effectively, we include instructions and sometimes several examples in a prompt. Using demonstrations to show how to perform a task is often called "few-shot learning."
Fine-tuning improves on few-shot learning by training on many more examples than can fit in the prompt, letting you achieve better results on a wide number of tasks. Once a model has been fine-tuned, you won't need to provide as many examples in the prompt. This saves costs and enables lower-latency requests.
At a high level, fine-tuning involves the following steps:
- Prepare and upload training data
- Train a new fine-tuned model
- Use your fine-tuned model
Visit our pricing page to learn more about how fine-tuned model training and usage are billed.
What models can be fine-tuned?
We are working on enabling fine-tuning for GPT-4 and expect this feature to be available later this year.
Fine-tuning is currently available for the following models:
gpt-3.5-turbo-0613
(recommended)babbage-002
davinci-002
We expect
gpt-3.5-turbo
to be the right model for most users in terms of results and ease of use, unless you are migrating a legacy fine-tuned model.
When to use fine-tuning
Fine-tuning GPT models can make them better for specific applications, but it requires a careful investment of time and effort. We recommend first attempting to get good results with prompt engineering, prompt chaining (breaking complex tasks into multiple prompts), and function calling, with the key reasons being:
- There are many tasks for which our models may initially appear to not perform well at, but with better prompting we can achieve much better results and potentially not need to be fine-tune
- Iterating over prompts and other tactics has a much faster feedback loop than iterating with fine-tuning, which requires creating datasets and running training jobs
- In cases where fine-tuning is still necessary, initial prompt engineering work is not wasted - we typically see best results when using a good prompt in the fine-tuning data (or combining prompt chaining / tool use with fine-tuning)
Our GPT best practices guide provides a background on some of the most effective strategies and tactics for getting better performance without fine-tuning. You may find it helpful to iterate quickly on prompts in our playground.
Common use cases
Some common use cases where fine-tuning can improve results:
- Setting the style, tone, format, or other qualitative aspects
- Improving reliability at producing a desired output
- Correcting failures to follow complex prompts
- Handling many edge cases in specific ways
- Performing a new skill or task that’s hard to articulate in a prompt
One high-level way to think about these cases is when it’s easier to "show, not tell". In the sections to come, we will explore how to set up data for fine-tuning and various examples where fine-tuning improves the performance over the baseline model.
Another scenario where fine-tuning is effective is in reducing costs and / or latency, by replacing GPT-4 or by utilizing shorter prompts, without sacrificing quality. If you can achieve good results with GPT-4, you can often reach similar quality with a fine-tuned
gpt-3.5-turbo
model by fine-tuning on the GPT-4 completions, possibly with a shortened instruction prompt.
Preparing your dataset
Once you have determined that fine-tuning is the right solution (i.e. you’ve optimized your prompt as far as it can take you and identified problems that the model still has), you’ll need to prepare data for training the model. You should create a diverse set of demonstration conversations that are similar to the conversations you will ask the model to respond to at inference time in production.
Each example in the dataset should be a conversation in the same format as our Chat completions API, specifically a list of messages where each message has a role, content, and optional name. At least some of the training examples should directly target cases where the prompted model is not behaving as desired, and the provided assistant messages in the data should be the ideal responses you want the model to provide.
Example format
In this example, our goal is to create a chatbot that occasionally gives sarcastic responses, these are three training examples (conversations) we could create for a dataset:
1
2
3
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
We do not currently support function calling examples but are working to enable this.
The conversational chat format is required to fine-tune
gpt-3.5-turbo
. For
babbage-002
and
davinci-002
, you can follow the prompt completion pair format used for legacy fine-tuning as shown below.
1
2
3
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
Crafting prompts
We generally recommend taking the set of instructions and prompts that you found worked best for the model prior to fine-tuning, and including them in every training example. This should let you reach the best and most general results, especially if you have relatively few (e.g. under a hundred) training examples.
If you would like to shorten the instructions or prompts that are repeated in every example to save costs, keep in mind that the model will likely behave as if those instructions were included, and it may be hard to get the model to ignore those "baked-in" instructions at inference time.
It may take more training examples to arrive at good results, as the model has to learn entirely through demonstration and without guided instructions.
Example count recommendations
To fine-tune a model, you are required to provide at least 10 examples. We typically see clear improvements from fine-tuning on 50 to 100 training examples with
gpt-3.5-turbo
but the right number varies greatly based on the exact use case.
We recommend starting with 50 well-crafted demonstrations and seeing if the model shows signs of improvement after fine-tuning. In some cases that may be sufficient, but even if the model is not yet production quality, clear improvements are a good sign that providing more data will continue to improve the model. No improvement suggests that you may need to rethink how to set up the task for the model or restructure the data before scaling beyond a limited example set.
Train and test splits
After collecting the initial dataset, we recommend splitting it into a training and test portion. When submitting a fine-tuning job with both training and test files, we will provide statistics on both during the course of training. These statistics will be your initial signal of how much the model is improving. Additionally, constructing a test set early on will be useful in making sure you are able to evaluate the model after training, by generating samples on the test set.
Token limits
Each training example is limited to 4096 tokens. Examples longer than this will be truncated to the first 4096 tokens when training. To be sure that your entire training example fits in context, consider checking that the total token counts in the message contents are under 4,000. The maximum number of total tokens trained per job is 50 million tokens (
tokens_in_dataset * n_epochs
).
You can compute token counts using our counting tokens notebook from the OpenAI cookbook.
Estimate costs
In order to estimate the cost of a fine-tuning job, please refer to the pricing page for details on the cost per 1k tokens. To estimate the costs for a specific fine-tuning job, use the following formula:
base cost per 1k tokens * number of tokens in the input file * number of epochs trained
For a training file with 100,000 tokens trained over 3 epochs, the expected cost would be ~$2.40.
Check data formatting
Once you have compiled a dataset and before you create a fine-tuning job, it is important to check the data formatting. To do this, we created a simple Python script which you can use to find potential errors, review token counts, and estimate the cost of a fine-tuning job.
Data formatting script
Once you have the data validated, the file needs to be uploaded in order to be used with a fine-tuning jobs:
1
2
3
4
5
6
7
8
import openai
import os
openai.api_key = os.getenv("OPENAI_API_KEY")
openai.File.create(
file=open("mydata.jsonl", "rb"),
purpose='fine-tune'
)
Create a fine-tuned model
After ensuring you have the right amount and structure for your dataset, and have uploaded the file, the next step is to create a fine-tuning job.
Start your fine-tuning job using the OpenAI SDK:
openai.FineTuningJob.create(training_file="file-abc123", model="gpt-3.5-turbo")
model
is the name of the model you're starting from (
gpt-3.5-turbo
,
babbage-002
, or
davinci-002
). You can customize your fine-tuned model's name using the suffix parameter.
After you've started a fine-tuning job, it may take some time to complete. Your job may be queued behind other jobs in our system, and training a model can take minutes or hours depending on the model and dataset size. After the model training is completed, the user who created the fine-tuning job will receive an email confirmation.
In addition to creating a fine-tuning job, you can also list existing jobs, retrieve the status of a job, or cancel a job.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
# List 10 fine-tuning jobs
openai.FineTuningJob.list(limit=10)
# Retrieve the state of a fine-tune
openai.FineTuningJob.retrieve("ft-abc123")
# Cancel a job
openai.FineTuningJob.cancel("ft-abc123")
# List up to 10 events from a fine-tuning job
openai.FineTuningJob.list_events(id="ft-abc123", limit=10)
# Delete a fine-tuned model (must be an owner of the org the model was created in)
openai.Model.delete("ft-abc123")
Use a fine-tuned model
When a job has succeeded, you will see the
fine_tuned_model
field populated with the name of the model when you retrieve the job details. You may now specify this model as a parameter to in the Chat completions (for
gpt-3.5-turbo
) or legacy Completions API (for
babbage-002
and
davinci-002
), and make requests to it using the Playground.
After your job is completed, the model should be available right away for inference use. In some cases, it may take several minutes for your model to become ready to handle requests. If requests to your model time out or the model name cannot be found, it is likely because your model is still being loaded. If this happens, try again in a few minutes.
1
2
3
4
5
6
7
8
9
completion = openai.ChatCompletion.create(
model="ft:gpt-3.5-turbo:my-org:custom_suffix:id",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
]
)
print(completion.choices[0].message)
You can start making requests by passing the model name as shown above and in our GPT guide.
Analyzing your fine-tuned model
We provide the following training metrics computed over the course of training: training loss, training token accuracy, test loss, and test token accuracy. These statistics are meant to provide a sanity check that training went smoothly (loss should decrease, token accuracy should increase).
However we think that evaluating samples from the fine-tuned model provides the most relevant sense of model quality. We recommend generating samples from both the base model and the fine-tuned model on a test set, and comparing the samples side by side. The test set should ideally include the full distribution of inputs that you might send to the model for inference. If manual evaluation is too time-consuming, consider using our Evals library for how to use GPT-4 to perform evaluations.
Iterating on data quality
If the results from a fine-tuning job are not as good as you expected, consider the following ways to adjust the training dataset:
- Collect examples to target remaining issues - If the model still isn’t good at certain aspects, add training examples that directly show the model how to do these aspects correctly
- Scrutinize existing examples for issues - If your model has grammar, logic, or style issues, check if your data has any of the same issues. For instance, if the model now says "I will schedule this meeting for you" (when it shouldn’t), see if existing examples teach the model to say it can do new things that it can’t do
- Consider the balance and diversity of data - If 60% of the assistant responses in the data says "I cannot answer this", but at inference time only 5% of responses should say that, you will likely get an overabundance of refusals
- Make sure your training examples contain all of the information needed for the response - If we want the model to compliment a user based on their personal traits and a training example includes assistant compliments for traits not found in the preceding conversation, the model may learn to hallucinate information
- Look at the agreement / consistency in the training examples - If multiple people created the training data, it’s likely that model performance will be limited by the level of agreement / consistency between people. For instance, in a text extraction task, if people only agreed on 70% of extracted snippets, the model would likely not be able to do better than this
- Make sure your all of your training examples are in the same format, as expected for inference
Iterating on data quantity
Once you’re satisfied with the quality and distribution of the examples, you can consider scaling up the number of training examples. This tends to help the model learn the task better, especially around possible "edge cases". We expect a similar amount of improvement every time you double the number of training examples. You can loosely estimate the expected quality gain from increasing the training data size by:
- Fine-tuning on your current dataset
- Fine-tuning on half of your current dataset
- Observing the quality gap between the two
In general, if you have to make a trade-off, a smaller amount of high-quality data is generally more effective than a larger amount of low-quality data.
Iterating on hyperparameters
We allow you to specify the number of epochs to fine-tune a model for. We recommend initially training without specifying the number of epochs, allowing us to pick a default for you based on dataset size, then adjusting if you observe the following:
- If the model does not follow the training data as much as expected increase the number by 1 or 2 epochs - This is more common for tasks for which there is a single ideal completion (or a small set of ideal completions which are similar). Some examples include classification, entity extraction, or structured parsing. These are often tasks for which you can compute a final accuracy metric against a reference answer.
- If the model becomes less diverse than expected decrease the number by 1 or 2 epochs - This is more common for tasks for which there are a wide range of possible good completions
Fine-tuning examples
Now that we have explored the basics of the fine-tuning API, let’s look at going through the fine-tuning lifecycle for a few different use cases.
Style and tone
Structured output
Migration of legacy models
For users migrating from
/v1/fine-tunes
to the updated
/v1/fine_tuning/jobs
API and newer models, the main difference you can expect is the updated API. The legacy prompt completion pair data format has been retained for the updated
babbage-002
and
davinci-002
models to ensure a smooth transition. The new models will support fine-tuning with 4k token context and have a knowledge cutoff of September 2021.
For most tasks, you should expect to get better performance from
gpt-3.5-turbo
than from the GPT base models.
FAQ
When should I use fine-tuning vs embeddings with retrieval?
Embeddings with retrieval is best suited for cases when you need to have a large database of documents with relevant context and information.
By default OpenAI’s models are trained to be helpful generalist assistants. Fine-tuning can be used to make a model which is narrowly focused, and exhibits specific ingrained behavior patterns. Retrieval strategies can be used to make new information available to a model by providing it with relevant context before generating its response. Retrieval strategies are not an alternative to fine-tuning and can in fact be complementary to it.
When can I fine-tune GPT-4 or GPT-3.5-Turbo-16k?
We plan to release support for fine-tuning both of these models later this year.
How do I know if my fine-tuned model is actually better than the base model?
We recommend generating samples from both the base model and the fine-tuned model on a test set of chat conversations, and comparing the samples side by side. For more comprehensive evaluations, consider using the OpenAI evals framework to create an eval specific to your use case.
Can I continue fine-tuning a model that has already been fine-tuned?
No, we do not currently support continuing the fine-tuning process once a job has finished. We plan to support this in the near future.
How can I estimate the cost of fine-tuning a model?
Please refer to the estimate cost section above.
Does the new fine-tuning endpoint still work with Weights & Biases for tracking metrics?
No, we do not currently support this integration but are working to enable it in the near future.
How many fine-tuning jobs can I have running at once?
Please refer to our rate limit guide for the most up to date information on the limits.
版权归原作者 javastart 所有, 如有侵权,请联系我们删除。