ChatGPT | Fine-Tuning OpenAI Models on Your Own Data

Introduction

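Fine-tuning lets developers get more out of the models available through the API:

Higher-quality results than prompt design
The ability to train on more examples than can fit in a prompt
Token savings from shorter prompts
Lower-latency requests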

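GPT-3 has been pre-trained on a vast amount of text from the open internet.
When given a prompt containing just a few examples, it can often intuit what task you are trying to perform and generate a plausible completion. This is often called "few-shot learning".
Fine-tuning improves on few-shot learning by training on many more examples than can fit in a prompt, letting you achieve better results on a wide range of tasks.
Once a model has been fine-tuned, you no longer need to provide examples in the prompt. This saves cost and enables lower-latency requests.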

Fine-tuning involves the following steps:

Prepare and upload training data
Create a fine-tuned model
Use the fine-tuned model

1. Prepare and upload training data

(1) Install the openai CLI

pip install --upgrade openai
export OPENAI_API_KEY="<OPENAI_API_KEY>"

(2) Prepare the training data
The data must be in JSON Lines format, with one prompt-completion pair per line. For example:

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
...
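
As a minimal sketch of producing a file in this format, the snippet below writes a few prompt-completion pairs using Python's standard json module. The example pairs, the helper names write_jsonl and read_jsonl, and the file name train.jsonl are made up for illustration.

```python
import json

# Hypothetical example pairs; replace with your own data.
examples = [
    {"prompt": "Great pitching in the ninth inning", "completion": "baseball"},
    {"prompt": "He scored on the power play", "completion": "hockey"},
]

def write_jsonl(records, path):
    """Write one JSON object per line (the JSON Lines format)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

write_jsonl(examples, "train.jsonl")
```

Reading the file back with read_jsonl("train.jsonl") is a quick way to confirm that every line parses as a standalone JSON object.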

(3) Validate the data
Before fine-tuning, you can use the CLI tool to check that the data is well-formed:

openai tools fine_tunes.prepare_data -f <LOCAL_FILE>

2. Create a fine-tuned model

Once the steps above are complete, run:

openai api fine_tunes.create -t <TRAIN_FILE_ID_OR_PATH> -m <BASE_MODEL>

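This command performs the following steps internally:

Uploads the file using the files API (or uses an already-uploaded file)
Creates a fine-tune job
Streams events until the job is done (this often takes minutes, but can take hours if there are many jobs in the queue or your dataset is large)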

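Once a fine-tune job has started, it may take some time to complete. Your job may be queued behind other jobs in the system, and training can take minutes or hours depending on the model and the size of the dataset. If the event stream is interrupted for any reason, you can resume it by running: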

openai api fine_tunes.follow -i <YOUR_FINE_TUNE_JOB_ID>

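When the job is done, it should display the name of the fine-tuned model.
In addition to creating a fine-tune job, you can also list existing jobs, retrieve the status of a job, or cancel a job.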

# List all fine-tune jobs
openai api fine_tunes.list

# Retrieve the status of a job
openai api fine_tunes.get -i <YOUR_FINE_TUNE_JOB_ID>

# Cancel a job
openai api fine_tunes.cancel -i <YOUR_FINE_TUNE_JOB_ID>

3. Use the fine-tuned model

When a job has succeeded, the fine_tuned_model field is populated with the name of the model.

Command line:

openai api completions.create -m <FINE_TUNED_MODEL> -p <YOUR_PROMPT>

curl:

curl https://api.openai.com/v1/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": YOUR_PROMPT, "model": FINE_TUNED_MODEL}'

Python API:

import openai
openai.Completion.create(
    model=FINE_TUNED_MODEL,
    prompt=YOUR_PROMPT)

4. Official example (fine-tuning a classifier)

We will fine-tune an ada classifier to distinguish between two sports: baseball and hockey.

Run:

from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import openai

categories = ['rec.sport.baseball', 'rec.sport.hockey']
sports_dataset = fetch_20newsgroups(subset='train', shuffle=True, random_state=42, categories=categories)

(1) Inspect the data

The newsgroup dataset can be loaded using sklearn. First, we look at the data itself:

Run:

print(sports_dataset['data'][0])

Output:

From: dougb@comm.mot.com (Doug Bank)
Subject: Re: Info needed for Cleveland tickets
Reply-To: dougb@ecs.comm.mot.com
Organization: Motorola Land Mobile Products Sector
Distribution: usa
Nntp-Posting-Host: 145.1.146.35
Lines: 17

In article <1993Apr1.234031.4950@leland.Stanford.EDU>, bohnert@leland.Stanford.EDU (matthew bohnert) writes:

|> I'm going to be in Cleveland Thursday, April 15 to Sunday, April 18.
|> Does anybody know if the Tribe will be in town on those dates, and
|> if so, who're they playing and if tickets are available?

The tribe will be in town from April 16 to the 19th.
There are ALWAYS tickets available! (Though they are playing Toronto,
and many Toronto fans make the trip to Cleveland as it is easier to
get tickets in Cleveland than in Toronto.  Either way, I seriously
doubt they will sell out until the end of the season.)

-- 
Doug Bank                       Private Systems Division
dougb@ecs.comm.mot.com          Motorola Communications Sector
dougb@nwu.edu                   Schaumburg, Illinois
dougb@casbah.acns.nwu.edu       708-576-8207

sports_dataset.target_names[sports_dataset['target'][0]]

Output:

'rec.sport.baseball'

Run:

len_all, len_baseball, len_hockey = len(sports_dataset.data), len([e for e in sports_dataset.target if e == 0]), len([e for e in sports_dataset.target if e == 1])
print(f"Total examples: {len_all}, Baseball examples: {len_baseball}, Hockey examples: {len_hockey}")

Output:

Total examples: 1197, Baseball examples: 597, Hockey examples: 600

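The sample shown above is from the baseball category. It is an email sent to a mailing list. In total there are 1197 examples, split almost evenly between the two sports (597 baseball, 600 hockey).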

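(2) Preprocess the data

We convert the dataset into a pandas DataFrame with a prompt column and a completion column. The prompt contains the email from the mailing list, and the completion is the name of the sport, either hockey or baseball. For demonstration purposes and faster fine-tuning, the dataset can be cut to 300 examples; in the code below that slice ([:300]) is commented out, so all 1197 examples are kept. In a real use case, more examples generally give better performance.

Run: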

import pandas as pd

labels = [sports_dataset.target_names[x].split('.')[-1] for x in sports_dataset['target']]
texts = [text.strip() for text in sports_dataset['data']]
df = pd.DataFrame(zip(texts, labels), columns = ['prompt','completion']) #[:300]
df.head()

Output:

   prompt                                               completion
0  From: dougb@comm.mot.com (Doug Bank)\nSubject:...    baseball
1  From: gld@cunixb.cc.columbia.edu (Gary L Dare)...    hockey
2  From: rudy@netcom.com (Rudy Wade)\nSubject: Re...    baseball
3  From: monack@helium.gas.uug.arizona.edu (david...    hockey
4  Subject: Let it be Known\nFrom: <ISSBTL@BYUVM....    baseball

Both baseball and hockey are single tokens. We save the dataset as a jsonl file.

df.to_json("sport2.jsonl", orient='records', lines=True)

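(3) The openai data preparation tool

We can now use the data preparation tool, which suggests a few improvements to the dataset before fine-tuning. Before launching the tool, we update the openai library to make sure we are using the latest version of the tool. We also pass the -q option, which automatically accepts all suggestions.

Run: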

!pip install --upgrade openai
!openai tools fine_tunes.prepare_data -f sport2.jsonl -q

Output:

Analyzing...

- Your file contains 1197 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- There are 11 examples that are very long. These are rows: [134, 200, 281, 320, 404, 595, 704, 838, 1113, 1139, 1174]
For conditional generation, and for classification the examples shouldn't be longer than 2048 tokens.
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts empty
- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details

Based on the analysis we will perform the following actions:
- [Recommended] Remove 11 long examples [Y/n]: Y
- [Recommended] Add a suffix separator `\n\n####\n\n` to all prompts [Y/n]: Y
- [Recommended] Add a whitespace character to the beginning of the completion [Y/n]: Y
- [Recommended] Would you like to split into training and validation set? [Y/n]: Y

Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified files to `sport2_prepared_train.jsonl` and `sport2_prepared_valid.jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "sport2_prepared_train.jsonl" -v "sport2_prepared_valid.jsonl" --compute_classification_metrics --classification_positive_class " baseball"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `\n\n####\n\n` for the model to start generating completions, rather than continuing with the prompt.
Once your model starts training, it'll approximately take 30.8 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.

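The tool is helpful: it suggests several improvements to the dataset and splits it into a training set and a validation set.
A suffix between the prompt and the completion is necessary to tell the model that the input text has ended and that it should now predict the class. Because the same separator is used in every example, the model learns that it is expected to predict either baseball or hockey after the separator.

A leading whitespace in the completion is useful because most word tokens are tokenized with a space prefix. The tool also recognized that this is likely a classification task, so it suggested splitting the dataset into training and validation sets. This makes it easy to measure the expected performance on new data.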
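
The three adjustments described above (separator suffix, leading space, train/validation split) can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the tool's actual implementation: the function name prepare, the 20% validation fraction, the fixed seed, and the separator written as "\n\n####\n\n" are all assumptions.

```python
import random

# Assumed separator, written with explicit newlines.
SEPARATOR = "\n\n####\n\n"

def prepare(records, separator=SEPARATOR, valid_fraction=0.2, seed=42):
    """Append the separator to prompts, prefix completions with a space,
    then shuffle and split into (train, valid) lists."""
    prepared = []
    for record in records:
        prompt = record["prompt"]
        if not prompt.endswith(separator):
            prompt += separator
        completion = record["completion"]
        if not completion.startswith(" "):
            completion = " " + completion
        prepared.append({"prompt": prompt, "completion": completion})
    rng = random.Random(seed)  # local RNG so the split is reproducible
    rng.shuffle(prepared)
    n_valid = max(1, int(len(prepared) * valid_fraction))
    return prepared[n_valid:], prepared[:n_valid]

# Hypothetical toy records standing in for the newsgroup emails.
records = [
    {"prompt": f"message {i}", "completion": "hockey" if i % 2 else "baseball"}
    for i in range(10)
]
train, valid = prepare(records)
```

Holding out the validation rows before training is what allows performance to be measured on examples the model has never seen.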

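(4) Fine-tuning

The tool suggests the command to run to train on the dataset. Since this is a classification task, we want to know how well the model generalizes on the provided validation set.
The tool suggests adding --compute_classification_metrics --classification_positive_class " baseball" so that classification metrics are computed.
We can copy the suggested command directly from the CLI tool. We additionally add -m ada to fine-tune the cheaper and faster ada model, which on classification use cases usually performs comparably to the slower and more expensive models.

Run: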

!openai api fine_tunes.create -t "sport2_prepared_train.jsonl" -v "sport2_prepared_valid.jsonl" --compute_classification_metrics --classification_positive_class " baseball" -m ada

Output:

Upload progress: 100%|████████████████████| 1.52M/1.52M [00:00<00:00, 1.81Mit/s]
Uploaded file from sport2_prepared_train.jsonl: file-Dxx2xJqyjcwlhfDHpZdmCXlF
Upload progress: 100%|███████████████████████| 388k/388k [00:00<00:00, 507kit/s]
Uploaded file from sport2_prepared_valid.jsonl: file-Mvb8YAeLnGdneSAFcfiVcgcN
Created fine-tune: ft-2zaA7qi0rxJduWQpdvOvmGn3
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2021-07-30 13:15:50] Created fine-tune: ft-2zaA7qi0rxJduWQpdvOvmGn3
[2021-07-30 13:15:52] Fine-tune enqueued. Queue number: 0
[2021-07-30 13:15:56] Fine-tune started
[2021-07-30 13:18:55] Completed epoch 1/4
[2021-07-30 13:20:47] Completed epoch 2/4
[2021-07-30 13:22:40] Completed epoch 3/4
[2021-07-30 13:24:31] Completed epoch 4/4
[2021-07-30 13:26:22] Uploaded model: ada:ft-openai-2023-04-05-12-26-20
[2021-07-30 13:26:27] Uploaded result file: file-6Ki9RqLQwkChGsr9CHcr1ncg
[2021-07-30 13:26:28] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Try out your fine-tuned model:

openai api completions.create -m ada:ft-openai-2023-04-05-12-26-20 -p <YOUR_PROMPT>

The model finished training in about ten minutes. Its name is ada:ft-openai-2023-04-05-12-26-20, and we can now use it for inference.
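
To make concrete what classification metrics for a positive class mean, here is a small, hedged sketch that computes accuracy, precision, and recall locally with " baseball" as the positive class. It is not how the API computes its metrics; the function name classification_metrics and the label lists are made up for illustration.

```python
def classification_metrics(y_true, y_pred, positive=" baseball"):
    """Compute accuracy, plus precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical labels: one baseball example is misclassified as hockey.
y_true = [" baseball", " hockey", " baseball", " hockey"]
y_pred = [" baseball", " hockey", " hockey", " hockey"]
metrics = classification_metrics(y_true, y_pred)
```

With these toy labels, the single missed baseball example lowers recall for the positive class while precision stays perfect.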

(5) Use the model

We can now call the model to get predictions.

Run:

test = pd.read_json('sport2_prepared_valid.jsonl', lines=True)
test.head()

Output:

   prompt                                               completion
0  From: gld@cunixb.cc.columbia.edu (Gary L Dare)...    hockey
1  From: smorris@venus.lerc.nasa.gov (Ron Morris ...    hockey
2  From: golchowy@alchemy.chem.utoronto.ca (Geral...    hockey
3  From: krattige@hpcc01.corp.hp.com (Kim Krattig...    baseball
4  From: warped@cs.montana.edu (Doug Dolven)\nSub...    baseball

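We need to use the same separator after the prompt as was used during fine-tuning, in this case \n\n####\n\n. Since this is classification, we want the temperature to be as low as possible, and we need only a one-token completion to determine the model's prediction.

Run: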

ft_model = 'ada:ft-openai-2023-04-05-12-26-20'
res = openai.Completion.create(model=ft_model, prompt=test['prompt'][0] + '\n\n####\n\n', max_tokens=1, temperature=0)
res['choices'][0]['text']

Output:

' hockey'
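
If a prompt may or may not already end with the separator (for example, prompts read back from a prepared file versus raw text), a small helper can normalize it before the request is sent. This is a sketch under assumptions: the helper name with_separator is hypothetical, and the separator is taken to be "\n\n####\n\n".

```python
# Assumed separator, written with explicit newlines.
SEPARATOR = "\n\n####\n\n"

def with_separator(prompt, separator=SEPARATOR):
    """Return the prompt ending in exactly one copy of the separator."""
    if not separator:
        return prompt  # nothing to normalize
    # Strip any trailing copies first so the separator is never duplicated.
    while prompt.endswith(separator):
        prompt = prompt[: -len(separator)]
    return prompt + separator

raw_prompt = "Who won the game last night?"
normalized = with_separator(raw_prompt)
```

The result can then be passed as the prompt argument, keeping the inference prompt in the same shape as the training prompts.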

To get the log probabilities, we can specify the logprobs parameter in the completion request.

res = openai.Completion.create(model=ft_model, prompt=test['prompt'][0] + '\n\n####\n\n', max_tokens=1, temperature=0, logprobs=2)
res['choices'][0]['logprobs']['top_logprobs'][0]
<OpenAIObject at 0x7fe114e435c8> JSON: {
  " baseball": -7.6311407,
  " hockey": -0.0006307676
}

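The model predicts hockey as far more likely than baseball, which is the correct prediction. By requesting logprobs, we can see the predicted (log) probability for each class.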
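
Log probabilities are easier to read once converted back to ordinary probabilities with the exponential function. The sketch below does this for the two values printed above and renormalizes over just those two tokens; the function name to_probabilities is made up for illustration.

```python
import math

# Log probabilities copied from the response shown above.
top_logprobs = {" baseball": -7.6311407, " hockey": -0.0006307676}

def to_probabilities(logprobs):
    """Exponentiate log probabilities and renormalize over the listed tokens."""
    raw = {token: math.exp(lp) for token, lp in logprobs.items()}
    total = sum(raw.values())
    return {token: p / total for token, p in raw.items()}

probs = to_probabilities(top_logprobs)
predicted = max(probs, key=probs.get)  # token with the highest probability
```

Here the model assigns well over 99% of the probability mass to " hockey", which matches the one-token completion it returned.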

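(6) Generalization

Interestingly, our fine-tuned classifier is quite versatile. Although it was trained on emails sent to different mailing lists, it also successfully classifies tweets.

Run: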

sample_hockey_tweet = """Thank you to the 
@Canes
 and all you amazing Caniacs that have been so supportive! You guys are some of the best fans in the NHL without a doubt! Really excited to start this new chapter in my career with the 
@DetroitRedWings
 !!"""
res = openai.Completion.create(model=ft_model, prompt=sample_hockey_tweet + '\n\n####\n\n', max_tokens=1, temperature=0, logprobs=2)
res['choices'][0]['text']

Output:

' hockey'

Run:

sample_baseball_tweet="""BREAKING: The Tampa Bay Rays are finalizing a deal to acquire slugger Nelson Cruz from the Minnesota Twins, sources tell ESPN."""
res = openai.Completion.create(model=ft_model, prompt=sample_baseball_tweet + '\n\n####\n\n', max_tokens=1, temperature=0, logprobs=2)
res['choices'][0]['text']

Output: