They recently used the open-source OpenCLIP codebase to train three very strong large-scale CLIP models: ViT-L/14, ViT-H/14, and ViT-g/14 (the ViT-g/14 run covered only about a third of the epochs). The ViT-H/14 model reaches 78.0% zero-shot top-1 accuracy on ImageNet and 73.4% zero-shot Recall@5 on MS COCO image retrieval (✪ω✪), which, as far as is known, as of this September...
Let's start with the model loader:

```python
# OpenAI CLIP: load a checkpoint by name or local path.
model, preprocess = clip.load("./models/ViT-L-14-336px.pt", device=device)

# OpenCLIP: select an architecture plus a pretrained tag.
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-H-14', device=device, pretrained="laion2b_s32b_b79k")
```
...
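For completeness, here is a minimal end-to-end sketch of the OpenCLIP path (the image path "example.jpg" and the prompt strings are placeholders):

```python
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model, its eval transform, and the matching tokenizer.
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-H-14', device=device, pretrained="laion2b_s32b_b79k")
tokenizer = open_clip.get_tokenizer('ViT-H-14')

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image
text = tokenizer(["a diagram", "a dog", "a cat"]).to(device)           # placeholder prompts

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity on L2-normalized features, softmax over the prompts.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # probability of each prompt matching the image
```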
OpenCLIP:

- Supports larger models (e.g. ViT-L/14, ViT-H/14) and optimized architectures.
- Can be trained on the LAION or DataComp datasets.
- Emphasizes large-scale multilingual and multi-task scenarios.

2. Differences between CLIPModel.from_pretrained and OpenCLIP loading

CLIP and OpenCLIP differ significantly in how models are loaded; a sketch follows below.
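A side-by-side sketch of the two loading paths; the Hugging Face repo id "laion/CLIP-ViT-H-14-laion2B-s32B-b79K" is used as an assumed example checkpoint and may need to be adjusted:

```python
# Hugging Face transformers: one Hub repo id provides config, weights and processor.
from transformers import CLIPModel, CLIPProcessor

hf_model = CLIPModel.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")  # assumed repo id
hf_processor = CLIPProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")

# OpenCLIP: an architecture name plus a pretrained tag select the checkpoint;
# the preprocessing transform and tokenizer are obtained separately.
import open_clip

oc_model, _, oc_preprocess = open_clip.create_model_and_transforms(
    'ViT-H-14', pretrained='laion2b_s32b_b79k')
oc_tokenizer = open_clip.get_tokenizer('ViT-H-14')
```

The practical difference: transformers keys everything on a single repo id and routes inputs through a CLIPProcessor, while OpenCLIP keys checkpoints by (architecture, pretrained tag) and hands back the image transform and tokenizer as separate objects.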
[Rows excerpted from the OpenCLIP zero-shot benchmark results: ViT-H-14-CLIPA (datacomp1b), ViT-bigG-14-CLIPA (datacomp1b), and ViT-H-14-CLIPA-336 (laion2b), each followed by model size/compute figures and a long list of per-dataset zero-shot scores whose column headers are not recoverable here.]
| Model | Training data | Resolution | # of samples seen | ImageNet zero-shot top-1 |
| --- | --- | --- | --- | --- |
| ViT-H/14-quickgelu (DFN) | DFN-5B | 224px | 39B | 83.4% |
| ViT-H-14-378-quickgelu (DFN) | DFN-5B | 378px | 44B | 84.4% |

Model cards with additional model-specific details can be found on the Hugging Face Hub under the OpenCLIP library tag: https://huggingface.co/models?library=open_clip. If you found this repository useful, please consider citing.
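As a usage note, a checkpoint from this table can typically be loaded either by its OpenCLIP pretrained tag or directly by its Hugging Face Hub repo id; the tag 'dfn5b' and the repo 'apple/DFN5B-CLIP-ViT-H-14-378' below are assumptions that should be checked against the model cards:

```python
import open_clip

# Option 1: architecture name + pretrained tag (assumed tag: 'dfn5b').
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-H-14-378-quickgelu', pretrained='dfn5b')

# Option 2: load directly from the Hugging Face Hub by repo id
# (assumed repo: 'apple/DFN5B-CLIP-ViT-H-14-378').
model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:apple/DFN5B-CLIP-ViT-H-14-378')
tokenizer = open_clip.get_tokenizer('hf-hub:apple/DFN5B-CLIP-ViT-H-14-378')
```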