ICLR2024 Spotlight: curation/training code, metadata, distribution and pre-trained models for MetaCLIP; CVPR 2024: MoDE: CLIP Data Experts via Clustering
This repository contains the code for MetaCLIP, described in the paper Demystifying CLIP Data, which formalizes CLIP data curation as a simple algorithm. The main contributions are:
We conclude that:
MetaCLIP is trained with face-blurred images.
@inproceedings{xu2023metaclip,
title={Demystifying CLIP Data},
author={Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer and Christoph Feichtenhofer},
journal={arXiv preprint arXiv:2309.16671},
year={2023}
}
@inproceedings{xu2024altogether,
title={Altogether: Image Captioning via Re-aligning Alt-text},
author={Hu Xu, Po-Yao Huang, Xiaoqing Ellen Tan, Ching-Feng Yeh, Jacob Kahn, Christine Jou, Gargi Ghosh, Omer Levy, Luke Zettlemoyer, Wen-tau Yih, Shang-Wen Li, Saining Xie, Christoph Feichtenhofer},
journal={arXiv preprint arXiv:2410.17251},
year={2024}
}
The pre-trained MetaCLIP models are available in Hugging Face Transformers and in OpenCLIP (or this repo):
# Hugging Face Transformers
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("facebook/metaclip-b32-400m")
model = AutoModel.from_pretrained("facebook/metaclip-b32-400m")

image = Image.open("docs/CLIP.png")
inputs = processor(text=["a diagram", "a dog", "a cat"], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
    text_probs = logits_per_image.softmax(dim=-1)
print("Label probs:", text_probs)
# OpenCLIP (or this repo)
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32-quickgelu', pretrained='metaclip_400m')  # for 2.5B use 'metaclip_fullcc' in OpenCLIP or 'metaclip_2_5b' in this repo

image = preprocess(Image.open("docs/CLIP.png")).unsqueeze(0)
text = open_clip.tokenize(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so the dot product below is a cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", text_probs)
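The last step of the OpenCLIP snippet above (L2-normalize, scale the cosine similarities by 100, then softmax) can be sketched without any model. This is only an illustration in plain Python: the 3-dimensional feature vectors are made up, and `label_probs` is a hypothetical helper, not part of OpenCLIP or this repo.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_probs(image_feature, text_features, scale=100.0):
    """Softmax over scaled cosine similarities between one image and several texts."""
    img = l2_normalize(image_feature)
    logits = []
    for t in text_features:
        txt = l2_normalize(t)
        cos = sum(a * b for a, b in zip(img, txt))
        logits.append(scale * cos)
    return softmax(logits)

# Made-up features: the image points mostly along the first text direction.
image_feature = [0.9, 0.1, 0.0]
text_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probs = label_probs(image_feature, text_features)
print("Label probs:", probs)
```

Because of the factor of 100, even a moderate gap in cosine similarity turns into a near one-hot distribution, which is why the printed probabilities are usually dominated by a single label.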
All MetaCLIP models adhere to the OpenAI CLIP training setup: we hope to bring back controlled experiments in the “CLIP era of ImageNet”. Specifically, we use OpenAI CLIP’s quickgelu activation for all model configs (which was missing in older versions of OpenCLIP, which mainly use nn.GELU instead). We add ViT-B-16-quickgelu, ViT-L-14-quickgelu, ViT-H-14-quickgelu and ViT-bigG-14-quickgelu in this repo.
model_name | pretrained | Data Card | # of Seen Pairs | Res. | GPUs | IN ZS Acc. |
---|---|---|---|---|---|---|
ViT-B-32-quickgelu | metaclip_400m | data card | 12.8B | 224 | 64 x V100 | 65.5 |
ViT-B-16-quickgelu | metaclip_400m | data card | 12.8B | 224 | 64 x V100 | 70.8 |
ViT-L-14-quickgelu | metaclip_400m | data card | 12.8B | 224 | 128 x V100 | 76.2 |
ViT-B-32-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 64 x V100 | 67.6 |
ViT-B-16-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 64 x V100 | 72.1 |
ViT-L-14-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 128 x V100 | 79.2 |
ViT-H-14-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 256 x A100 | 80.5 |
ViT-bigG-14-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 256 x A100 | 82.1 |
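To see why the `-quickgelu` suffix in the model names matters, here is a minimal sketch comparing the two activations in plain Python. It assumes the usual definitions: QuickGELU as `x * sigmoid(1.702 * x)` (as in OpenAI CLIP's released code) and exact GELU as `x * Φ(x)`. The function names are illustrative and not taken from this repo.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def quick_gelu(x):
    """Sigmoid-based approximation: x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  quick_gelu={quick_gelu(x):+.4f}")

# Largest absolute gap between the two on a grid over [-5, 5].
max_gap = max(abs(gelu(k / 100) - quick_gelu(k / 100)) for k in range(-500, 501))
print(f"max |gelu - quick_gelu| on [-5, 5]: {max_gap:.4f}")
```

The two functions are close but not identical, so weights trained with one activation should be loaded into a model config that uses the same one.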
This code is customized from OpenCLIP and will be maintained separately for research on MetaCLIP. The following command should install requirements for OpenCLIP and submitit=1.2.1 used by this repo:
conda create -n metaclip python=3.10 pytorch torchvision pytorch-cuda=11.7 tqdm ftfy braceexpand regex pandas submitit=1.2.1 \
-c pytorch-nightly \
-c nvidia \
-c conda-forge \
-c anaconda
MetaCLIP uses 500,000 queries as metadata to align the training data with the distribution of quality writing over Wikipedia/WordNet terms. This metadata also allows us to release the training data distribution of a released model as a data card.
We have a demo notebook to show how the proposed algorithm works.
CLIP curation can still help as online balancing (Table 6 in the paper). We wrap CLIP curation in two key functions: substring matching (recommended to run offline) and balancing (either offline or online; please check metaclip.balancing:main).
import json
import numpy as np
from metaclip.substr_matching import substr_matching
from metaclip.balancing import balance_sampling

with open("metadata.json") as f:
    metadata = json.load(f)
# entry counts for our 1.6B(pool) -> 400M(curated); please check balance_sampling:main and substr match and count on your own data.
with open("metaclip/entry_counts_400m.json") as f:
    entry_count_json = json.load(f)
entry_count = np.array([entry_count_json[entry] for entry in metadata], dtype=np.uint64)  # uint64 to be safe for scaling.

t = 20000
entry_count[entry_count < t] = t
entry_prob = t / entry_count

for text in ["jacksons chameleon", "battery plate"]:
    matched_entry_ids = substr_matching(text, metadata)  # this is for demo purpose that redo substr_matching; see metaclip/README.md.
    curation_prob = min(entry_prob[matched_entry_ids].sum(), 1.0)
    curated = balance_sampling(matched_entry_ids, entry_prob)
    print(f"[curation_prob={curation_prob:.3f}, curated={curated}] {text}")
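For readers without the repo installed, the two steps above can be approximated with the standard library only. This is a sketch under stated assumptions, not the repo's implementation: the `toy_*` helpers, the three-entry metadata list, and the counts are all made up, the matching is a naive scan, and the rule "keep a text if any matched entry passes its own coin flip" is one plausible reading of balancing.

```python
import random

def toy_substr_matching(text, metadata):
    """Return indices of metadata entries that occur as substrings of text (naive scan)."""
    return [i for i, entry in enumerate(metadata) if entry in text]

def toy_entry_prob(entry_counts, t):
    """Per-entry keep probability t / max(count, t): tail entries get 1.0, head entries are down-weighted."""
    return [t / max(c, t) for c in entry_counts]

def toy_balance_sampling(matched_entry_ids, entry_prob, rng):
    """Keep the text if any matched entry passes an independent coin flip (assumed rule)."""
    return any(rng.random() < entry_prob[i] for i in matched_entry_ids)

# Hypothetical metadata and pool-level match counts: one head entry, one tail entry, one in between.
metadata = ["photo", "chameleon", "battery plate"]
entry_counts = [2_000_000, 500, 30_000]
t = 20_000
entry_prob = toy_entry_prob(entry_counts, t)

rng = random.Random(0)
for text in ["photo of a dog", "jacksons chameleon", "a battery plate photo"]:
    ids = toy_substr_matching(text, metadata)
    kept = toy_balance_sampling(ids, entry_prob, rng)
    print(f"[matched={ids}, curated={kept}] {text}")
```

With `t = 20_000`, the tail entry (count 500) is always kept, while a text matching only the head entry (count 2,000,000) survives about 1% of the time.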
We release skeleton code for sub-string matching from CommonCrawl WAT or WARC files and for balancing. Check here for details.
A NumPy implementation of the algorithm, close to the one used in the paper, can be found at metaclip.pipeline.
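As a rough picture of what an offline run over a whole pool involves, the sketch below does two passes over a toy pool: first count matches per metadata entry, then sample each text with the per-entry probability `t / max(count, t)`. It is standard-library Python written for illustration only. `curate_pool`, the toy pool, and the choice to drop texts that match no entry are assumptions here, not a description of metaclip.pipeline.

```python
import random
from collections import Counter

def curate_pool(texts, metadata, t, seed=0):
    """Two-pass toy curation: count matches per entry, then balance-sample the pool."""
    rng = random.Random(seed)
    # Pass 1: naive substring matching and per-entry match counts over the pool.
    matches = [[i for i, e in enumerate(metadata) if e in text] for text in texts]
    counts = Counter(i for ids in matches for i in ids)
    # Pass 2: keep a text if any matched entry passes a coin flip with
    # probability t / max(count, t); unmatched texts are dropped (assumption).
    curated = []
    for text, ids in zip(texts, matches):
        if any(rng.random() < t / max(counts[i], t) for i in ids):
            curated.append(text)
    return curated, counts

# Made-up pool: one heavily repeated head entry, one tail entry, one unmatched text.
metadata = ["dog", "chameleon"]
pool = [f"dog photo {k}" for k in range(1000)] + ["a chameleon", "untitled image"]
curated, counts = curate_pool(pool, metadata, t=20)

n_dog = sum(1 for text in curated if "dog" in text)
print(f"pool={len(pool)} curated={len(curated)} dog_kept={n_dog} counts={dict(counts)}")
```

On this pool the head entry is cut from 1000 texts to roughly `t = 20`, while the single tail text is always kept, which is the flattening effect the threshold is meant to have.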
python submitit_openclip.py b32_400m
Please configure the corresponding training_data in run_configs_400m.py.
Consider starting from our code for building CLIP’s 500k metadata.
If you have any questions related to the code or the paper, feel free to email Hu Xu ([email protected]).
Please cite our paper (accepted by ICLR2024 as spotlight presentation) if MetaCLIP helps your work:
@inproceedings{xu2023metaclip,
title={Demystifying CLIP Data},
author={Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer and Christoph Feichtenhofer},
journal={arXiv preprint arXiv:2309.16671},
year={2023}
}
The training code is developed based on OpenCLIP, modified to the vanilla CLIP training setup.
The majority of MetaCLIP is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: open_clip is licensed under the https://github.com/mlfoundations/open_clip license.
We gratefully acknowledge the OpenCLIP team for the initial CLIP codebase and integration, and NielsRogge for the integration into Hugging Face.