ChatDoctor: A Medical Chat Model Fine-tuned on LLaMA Model using Medical Domain Knowledge
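Recent large language models (LLMs) in the general domain, such as ChatGPT, have been remarkably successful at following instructions and producing human-like responses. These models are not tailored to the medical domain, however, so their answers are less accurate and they cannot give sound advice on medical diagnosis, medication, and similar questions.
This paper therefore collects several medical-domain datasets and fine-tunes Meta's LLaMA on them (ChatGPT, after all, is not open source). The approach is not limited to medicine; it can be extended to many other specialized domains.
In healthcare, fine-tuning a large model on doctor-patient conversation data can significantly advance its use in medicine. In regions where medical resources are scarce in particular, a chat doctor can support initial diagnosis and triage of patients, which can significantly improve the efficiency of the existing healthcare system.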
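Dataset
The training data combines a general-purpose dataset (for conversational ability) with a doctor-patient conversation dataset (for domain quality).
- The general-purpose dataset is, naturally, Stanford Alpaca: 52K instruction-following examples.
- The InstructorDoctor-205k dataset collected by the authors, which contains 5,000 generated doctor-patient conversations and 200,000 real ones, to ensure both accuracy and diversity when fine-tuning a large language model.
  - 5,000 generated doctor-patient conversations. These cover more than 700 diseases together with their symptoms, the required medical tests, and the recommended medications. To improve data and model quality, tuples from the disease database (disease name, symptoms, and so on) are fed to the ChatGPT API to automatically generate instruction and dialogue data, that is, conversations between a patient and a doctor.
  - 200,000 real doctor-patient conversations. The 5,000 generated conversations ensure accuracy, but their diversity is low. The authors therefore collected about 200,000 real doctor-patient conversations from the online medical Q&A site "Health Care Magic" (doctor and patient names are removed, and a language tool is used to correct grammatical errors in the answers).
A model fine-tuned on these 205k doctor-patient conversations is better at understanding patient needs and giving advice.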
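As a rough illustration of the generation step described above, the sketch below turns one disease-database tuple into a text prompt that could be sent to a chat model. The `DiseaseEntry` field names and the prompt wording are hypothetical, not taken from the ChatDoctor repository, and the medication value is a placeholder. No API call is made; only the prompt string is built.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DiseaseEntry:
    """One row of a disease database (field names are hypothetical)."""
    name: str
    symptoms: List[str]
    tests: List[str]
    medications: List[str]


def build_generation_prompt(entry: DiseaseEntry) -> str:
    """Render a database tuple as a prompt asking for a doctor-patient dialogue."""
    return (
        "Write a short conversation between a patient and a doctor.\n"
        f"Disease: {entry.name}\n"
        f"Symptoms: {', '.join(entry.symptoms)}\n"
        f"Medical tests: {', '.join(entry.tests)}\n"
        f"Medications: {', '.join(entry.medications)}\n"
        "The patient describes the symptoms; the doctor names the likely "
        "disease, the tests to run, and the medications to consider."
    )


# Values loosely follow the panic-disorder sample record in this post;
# the medication entry is a placeholder, not medical data.
entry = DiseaseEntry(
    name="panic disorder",
    symptoms=["sudden and frequent panic attacks"],
    tests=["electrocardiogram", "depression screen", "toxicology screen"],
    medications=["<medication placeholder>"],
)
prompt = build_generation_prompt(entry)
print(prompt)
```

Keeping the prompt builder separate from whatever client sends it makes the tuple-to-text mapping easy to inspect before any generation is run.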
Training process
Fine-tuning on InstructorDoctor-205k has the following key points:
- Trained on 6 A100 GPUs for 18 hours
- Batch size 192, learning rate 2×10^-5, 3 epochs
- Maximum sequence length of 512 tokens, warmup ratio 0.03, no weight decay
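To make these numbers concrete, here is a small sketch that stores the listed hyperparameters and derives a few quantities from them. It rests on several assumptions not stated in the post: that 192 is the global batch size across all 6 GPUs, that the run sees exactly 205,000 examples per epoch (whether the 52K Alpaca examples belong to the same run is not specified), and that warmup steps are rounded up. The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FinetuneConfig:
    """Hyperparameters as listed in the post."""
    num_gpus: int = 6
    global_batch_size: int = 192
    learning_rate: float = 2e-5
    num_epochs: int = 3
    max_seq_len: int = 512
    warmup_ratio: float = 0.03
    weight_decay: float = 0.0


def per_gpu_batch(cfg: FinetuneConfig) -> int:
    """Examples each GPU handles per optimizer step (assumes an even split)."""
    if cfg.global_batch_size % cfg.num_gpus != 0:
        raise ValueError("global batch size must be divisible by the GPU count")
    return cfg.global_batch_size // cfg.num_gpus


def schedule(cfg: FinetuneConfig, num_examples: int) -> Tuple[int, int, int]:
    """Return (steps per epoch, total steps, warmup steps)."""
    steps_per_epoch = math.ceil(num_examples / cfg.global_batch_size)
    total_steps = steps_per_epoch * cfg.num_epochs
    warmup_steps = math.ceil(total_steps * cfg.warmup_ratio)  # rounding is an assumption
    return steps_per_epoch, total_steps, warmup_steps


cfg = FinetuneConfig()
print(per_gpu_batch(cfg))      # 32
print(schedule(cfg, 205_000))  # (1068, 3204, 97)
```

How the 32 examples per GPU are split between micro-batch size and gradient accumulation is not given in the post, so the sketch leaves that choice open.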
The ChatDoctor code has been open-sourced, including the dataset! Here is what the data looks like (it keeps the Alpaca data format):
{
"instruction": "If you are a doctor, please answer the medical questions based on the patient's description.",
"input": "Doctor, I have been experiencing sudden and frequent panic attacks. I don't know what to do.",
"output": "Well, based on what you're telling me, it sounds like you may be suffering from panic disorder. The best course of action is to start with psychotherapy and mental health counseling. Additionally, we should conduct an electrocardiogram to make sure that there are no physical issues causing your panic attacks. We will also need to perform a depression screen and a toxicology screen to rule out any other underlying causes. Finally, I would recommend a comprehensive psychological and psychiatric evaluation and therapy to help manage your symptoms."
},
{
"instruction": "If you are a doctor, please answer the medical questions based on the patient's description.",
"input": "Doctor, I have been experiencing a hoarse voice for a few weeks now and it's not getting any better despite taking medication. What could be the problem?",
"output": "It's possible that you have a vocal cord polyp. To confirm this, we need to conduct tracheoscopy and laryngoscopy with a biopsy. We may also need to conduct an occupational therapy assessment such as speech therapy, other diagnostic procedures like an interview or consultation, physical therapy exercises, as well as diagnostic procedures on your nose, mouth, and pharynx to rule out any other underlying conditions. Additionally, we may need to evaluate your eyes and ears for any related conditions, such as ophthalmologic and otologic diagnosis and treatment."
},
{
"instruction": "If you are a doctor, please answer the medical questions based on the patient's description.",
"input": "Hi, Doctor. I think I might have Turner syndrome. ",
"output": "Okay, in that case, you will need to undergo a complete physical skin exam, an ultrasonography, and some other diagnostic procedures such as an interview, evaluation, and consultation. Additionally, we'll need to perform an echocardiography, examine your breast, and check your eyes and ears for any issues. We'll also do a depression screening to assess your mental health."
}
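Since the records keep the Alpaca format, one plausible way to turn a record into a training example is the prompt template published with Stanford Alpaca. The sketch below applies that template; that ChatDoctor uses exactly this template is an assumption, and `format_example` is a hypothetical helper name. The sample output is shortened to its first sentence for brevity.

```python
import json

# Prompt templates as published in the Stanford Alpaca repository.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def format_example(record: dict) -> tuple:
    """Split one Alpaca-style record into (prompt, target) strings."""
    user_input = record.get("input", "")
    # Records with an empty input use the shorter template.
    template = PROMPT_WITH_INPUT if user_input.strip() else PROMPT_NO_INPUT
    prompt = template.format(instruction=record["instruction"], input=user_input)
    return prompt, record["output"]


raw = """
{
  "instruction": "If you are a doctor, please answer the medical questions based on the patient's description.",
  "input": "Doctor, I have been experiencing sudden and frequent panic attacks. I don't know what to do.",
  "output": "Well, based on what you're telling me, it sounds like you may be suffering from panic disorder."
}
"""
record = json.loads(raw)
prompt, target = format_example(record)
print(prompt)
print(target)
```

During supervised fine-tuning the model is typically trained to produce `target` given `prompt`; whether the loss is masked on the prompt tokens is a separate design choice not covered here.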
Noting some other recent models here for a future post:
DoctorGLM https://github.com/xionghonglin/doctorglm
Huatuo-Llama-Med-Chinese https://github.com/SCIR-HI/Huatuo-Llama-Med-Chinese
visual-med-alpaca: https://github.com/cambridgeltl/visual-med-alpaca
Copyright belongs to the original author, 上杉翔二. If there is any infringement, please contact us for removal.