

Deep Learning Series 37: The CLIP Model

1. Model overview

Meaning: CLIP (Contrastive Language-Image Pre-training)
Git repository: https://github.com/openai/CLIP
Paper: https://arxiv.org/abs/2103.00020
Installation: pip install git+https://github.com/openai/CLIP.git
Alternatively, use the open-source reimplementation: pip install open_clip_torch
CLIP is pre-trained on 400 million image-text pairs collected from the web. Each text is treated as the label of its paired image, so an image classifier is pre-trained with natural-language supervision; training took roughly two weeks on 256 GPUs. The full model is about 350M; a distilled 48M version and, later, a 24M version were also produced.

1.1 Training procedure

(Figure: CLIP's contrastive pre-training, aligning matched image-text pairs in a shared embedding space)

  1. Two encoders process the text and image data separately: the text encoder is a Transformer, and the image encoder comes in two variants, a ResNet and a Vision Transformer (ViT);
  2. Each encoder's representation is linearly projected into a shared multi-modal embedding space (the projection matrices are learned jointly with the encoders);
  3. The cosine similarity between the two modalities is computed so that the N matched image-text pairs get the highest similarity and the mismatched pairs the lowest; concretely, a cross-entropy loss is computed once across each row and once across each column of the similarity matrix, and the two losses are averaged.

The pseudocode is shown below: T_f and I_f are the encoder outputs, W_t and W_i are the embedding (projection) parameters, and T_e and I_e are the multi-modal embeddings. Multiplying the two gives the logits matrix from the figure above, and the cross-entropy loss is computed against the diagonal targets (row i should match column i).
(Figure: the training pseudocode from the CLIP paper)
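Rendered as a short, runnable PyTorch sketch (the batch size, feature dimensions, and random features below are purely illustrative stand-ins for real encoder outputs):

import torch
import torch.nn.functional as F

n, d_i, d_t, d_e = 8, 768, 512, 512   # batch size and feature dimensions (illustrative)

I_f = torch.randn(n, d_i)             # image encoder outputs
T_f = torch.randn(n, d_t)             # text encoder outputs
W_i = torch.randn(d_i, d_e)           # learned image projection
W_t = torch.randn(d_t, d_e)           # learned text projection
t = torch.tensor(1 / 0.07).log()      # learned temperature (log-parameterized in CLIP)

# project into the joint multi-modal embedding space and L2-normalize
I_e = F.normalize(I_f @ W_i, dim=-1)
T_e = F.normalize(T_f @ W_t, dim=-1)

# scaled pairwise cosine similarities: an n x n logits matrix
logits = (I_e @ T_e.T) * t.exp()

# symmetric cross-entropy: once over the rows (per image), once over the columns (per text)
labels = torch.arange(n)
loss_i = F.cross_entropy(logits, labels)
loss_t = F.cross_entropy(logits.T, labels)
loss = (loss_i + loss_t) / 2
print(loss)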

1.2 API usage

The model is used as follows: first the image to be classified is passed through the image encoder to obtain its features. Then, for every label in the target dataset (or any labels you define yourself), a corresponding piece of text is constructed, e.g. the label dog becomes "A photo of a dog", and so on. The texts are passed through the text encoder, and the inner product between the image feature and each text feature is computed; the label whose text yields the largest inner product is the classification result for the image.
(Figure: zero-shot classification by matching the image against "A photo of a {label}" prompts)
Functions provided by the clip package:
clip.available_models()
clip.load(name, device=..., jit=False)
clip.tokenize(text: Union[str, List[str]], context_length=77)
Methods on the loaded model:
model.encode_image(image: Tensor)
model.encode_text(text: Tensor)
model(image: Tensor, text: Tensor): returns the cosine-similarity logits between every image and every text
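For example, a minimal end-to-end call of these functions could look like the following sketch (the image file dog.jpg and the two candidate captions are assumptions for illustration):

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)          # 1 x 3 x 224 x 224
text = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)  # 2 x 77

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)   # scaled cosine similarities
    probs = logits_per_image.softmax(dim=-1)                  # probabilities over the two captions

print(probs)  # the first entry should dominate if the image really shows a dog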

2. Usage examples

The general workflow is:

  1. Call clip.load(model_name) to obtain model and preprocess
  2. Call clip.tokenize to tokenize the text, then model.encode_text to get the text features
  3. Call preprocess on the image, then model.encode_image to get the image features
  4. Normalize both features and compute their cosine similarity

2.1 Similarity between the skimage sample images and their text descriptions

import os

import numpy as np
import torch
import matplotlib.pyplot as plt
import skimage
from PIL import Image
import clip

# Load the model
model, preprocess = clip.load("ViT-B/32")
model.cuda().eval()
input_resolution = model.visual.input_resolution
context_length = model.context_length
vocab_size = model.vocab_size
# Model parameters: 151,277,313
# Input resolution: 224
# Context length: 77
# Vocab size: 49408

# A text description for each of the skimage sample images we want to use
descriptions = {
    "page": "a page of text about segmentation",
    "chelsea": "a facial photo of a tabby cat",
    "astronaut": "a portrait of an astronaut with the American flag",
    "rocket": "a rocket standing on a launchpad",
    "motorcycle_right": "a red motorcycle standing in a garage",
    "camera": "a person looking at a camera on a tripod",
    "horse": "a black-and-white silhouette of a horse",
    "coffee": "a cup of coffee on a saucer"
}

# Collect the images and their descriptions
original_images = []
images = []
texts = []
for filename in [f for f in os.listdir(skimage.data_dir) if f.endswith(".png") or f.endswith(".jpg")]:
    name = os.path.splitext(filename)[0]
    if name not in descriptions:
        continue
    image = Image.open(os.path.join(skimage.data_dir, filename)).convert("RGB")
    original_images.append(image)
    images.append(preprocess(image))
    texts.append(descriptions[name])

image_input = torch.tensor(np.stack(images)).cuda()
text_tokens = clip.tokenize(["This is " + desc for desc in texts]).cuda()  # shape: 8 x 77

# Both encoders output 512-dimensional features
with torch.no_grad():
    image_features = model.encode_image(image_input).float()
    text_features = model.encode_text(text_tokens).float()

image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = text_features.cpu().numpy() @ image_features.cpu().numpy().T

The output is the cosine-similarity matrix between the eight text descriptions and the eight images (figure not reproduced here).
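As a rough sketch (not part of the original code), the matrix can be visualized as a simple heatmap; the original notebook draws a fancier version with the image thumbnails along the x-axis:

plt.figure(figsize=(10, 8))
plt.imshow(similarity, vmin=0.1, vmax=0.3)
plt.colorbar()
plt.yticks(range(len(texts)), texts, fontsize=10)
plt.xticks(range(len(texts)), [f"image {i}" for i in range(len(texts))], rotation=45)
plt.title("Cosine similarity between text and image features")
plt.show()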

2.2 Classifying the above images with the CIFAR-100 labels

# Load the CIFAR-100 dataset (only its class names are used here)
from torchvision.datasets import CIFAR100
cifar100 = CIFAR100(os.path.expanduser("~/.cache"), transform=preprocess, download=True)

# One text description per CIFAR-100 label
text_descriptions = [f"This is a photo of a {label}" for label in cifar100.classes]
text_tokens = clip.tokenize(text_descriptions).cuda()

with torch.no_grad():
    text_features = model.encode_text(text_tokens).float()
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Keep the five classes with the highest probability for each image
text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
top_probs, top_labels = text_probs.cpu().topk(5, dim=-1)

For each image, the result is the five CIFAR-100 labels with the highest probability (figure not reproduced here).
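To inspect the predictions without a plot, the top-5 labels can simply be printed (a small addition; the original notebook shows them as bar charts next to each image):

for i, name in enumerate(texts):
    print(f"Image {i} ({name}):")
    for prob, label_idx in zip(top_probs[i].tolist(), top_labels[i].tolist()):
        print(f"  {cifar100.classes[label_idx]:>16s}: {100 * prob:.2f}%")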

2.3 Predicting gender

classes = ['man', 'woman']
# Preprocess the image and build one prompt per class;
# move both to the GPU, since the model was loaded with model.cuda() above
image_input = preprocess(Image.open('man.jpg')).unsqueeze(0).cuda()
text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in classes]).cuda()

# Encode the features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

# Pick the label with the highest score
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(1)

# Print the result
print("\nTop predictions:\n")
print('classes:{} score:{:.2f}'.format(classes[indices.item()], values.item()))

3. Fine-tuning

The reference code is below.
image_caption_dataset loads the image-text pairs, and load_data uses image_caption_dataset to wrap them into a training DataLoader.
load_pretrian_model loads the model for training; jit must be set to False.
logits_per_image, logits_per_text = model(images, texts) gives the predictions, and the model is optimized with a cross-entropy loss against torch.arange(N).

import os

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import clip

device = 'cuda' if torch.cuda.is_available() else 'cpu'


class image_caption_dataset(Dataset):
    def __init__(self, df, preprocess):
        self.images = df["image"]
        self.caption = df["caption"]
        self.preprocess = preprocess

    def __len__(self):
        return len(self.caption)

    def __getitem__(self, idx):
        images = self.preprocess(Image.open(self.images[idx]))
        caption = self.caption[idx]
        return images, caption


def load_data(cup_path, cupnot_path, batch_size, preprocess):
    df = {'image': [], 'caption': []}
    cup_list = os.listdir(cup_path)
    cupnot_list = os.listdir(cupnot_path)

    # Use the directory name as the caption for every image it contains
    caption = cup_path.split('/')[-1]
    for img in cup_list:
        img_path = os.path.join(cup_path, img)
        df['image'].append(img_path)
        df['caption'].append(caption)

    caption = cupnot_path.split('/')[-1]
    for img in cupnot_list:
        img_path = os.path.join(cupnot_path, img)
        df['image'].append(img_path)
        df['caption'].append(caption)

    dataset = image_caption_dataset(df, preprocess)
    train_dataloader = DataLoader(dataset, batch_size=batch_size)
    return train_dataloader


def convert_models_to_fp32(model):
    for p in model.parameters():
        p.data = p.data.float()
        p.grad.data = p.grad.data.float()


def load_pretrian_model(model_path):
    model, preprocess = clip.load(model_path, device=device, jit=False)  # jit must be False for training
    if device == "cpu":
        model.float()
    else:
        clip.model.convert_weights(model)  # convert weights to fp16 for training on GPU
    return model, preprocess


def train(epoch, batch_size, learning_rate, cup_path, cupnot_path):
    # Load the model
    model, preprocess = load_pretrian_model('ViT-B/32')

    # Load the dataset
    train_dataloader = load_data(cup_path, cupnot_path, batch_size, preprocess)

    # Set up losses and optimizer
    loss_img = nn.CrossEntropyLoss().to(device)
    loss_txt = nn.CrossEntropyLoss().to(device)
    optimizer = optim.Adam(model.parameters(), lr=learning_rate, betas=(0.9, 0.98), eps=1e-6, weight_decay=0.2)

    for i in range(epoch):
        for batch in train_dataloader:
            list_image, list_txt = batch
            texts = clip.tokenize(list_txt).to(device)
            images = list_image.to(device)

            logits_per_image, logits_per_text = model(images, texts)

            # The i-th image matches the i-th text, so the targets are 0..N-1
            # (use the actual batch length in case the last batch is smaller)
            ground_truth = torch.arange(len(images), dtype=torch.long, device=device)

            # Symmetric cross-entropy loss and backward pass
            total_loss = (loss_img(logits_per_image, ground_truth) + loss_txt(logits_per_text, ground_truth)) / 2
            optimizer.zero_grad()
            total_loss.backward()

            if device == "cpu":
                optimizer.step()
            else:
                # Update in fp32, then convert back to fp16
                convert_models_to_fp32(model)
                optimizer.step()
                clip.model.convert_weights(model)

        print('[%d] loss: %.3f' % (i + 1, total_loss))

    torch.save(model, './model/model1.pkl')


def main():
    epoch = 100
    batch_size = 6
    learning_rate = 5e-5
    cup_path = './data/It is photo with cup'
    cupnot_path = './data/It is photo without cup'
    train(epoch, batch_size, learning_rate, cup_path, cupnot_path)


if __name__ == '__main__':
    main()
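Since the whole model object is saved with torch.save, it can later be reloaded for inference. A minimal sketch, assuming the save path used above and a hypothetical test image test.jpg (on recent PyTorch versions torch.load may additionally need weights_only=False):

import torch
import clip
from PIL import Image

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = torch.load('./model/model1.pkl', map_location=device)
model.eval()

# reload CLIP only to get the matching preprocess transform
_, preprocess = clip.load('ViT-B/32', device=device, jit=False)

image = preprocess(Image.open('test.jpg')).unsqueeze(0).to(device)
texts = clip.tokenize(['It is photo with cup', 'It is photo without cup']).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, texts)
    probs = logits_per_image.softmax(dim=-1)
print(probs)  # probability of "with cup" vs. "without cup"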

This article is reposted from: https://blog.csdn.net/kittyzc/article/details/125167223
Copyright belongs to the original author, IE06. In case of infringement, please contact us for removal.
