Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Experience the CogVideoX-5B model online at 🤗 Huggingface Space or 🤖 ModelScope Space
📚 View the paper and user guide
📍 Visit QingYing and API Platform to experience larger-scale commercial video generation models.
- 2024/11/15: We released the CogVideoX1.5 model in the diffusers version. Only minor parameter adjustments are needed to continue using previous code.
- 2024/11/08: We have released the CogVideoX1.5 model. CogVideoX1.5 is an upgraded version of the open-source model CogVideoX.
- 2024/10/13: A more cost-effective fine-tuning framework for CogVideoX-5B that works with a single GPU has been released.
- 2024/10/10: We have updated our technical report. Please refer to the updated paper for details.
- 2024/10/09: We have publicly released technical documentation for CogVideoX fine-tuning.
- 2024/9/19: We have open-sourced the CogVideoX series image-to-video model CogVideoX-5B-I2V.
- 2024/9/19: The Caption model used in CogVideoX training to convert video data into text descriptions has been open-sourced.
- 2024/8/27: We have open-sourced a larger model in the CogVideoX series, CogVideoX-5B, and significantly lowered the inference threshold: CogVideoX-2B can now run on older GPUs such as the GTX 1080TI, and CogVideoX-5B on desktop GPUs such as the RTX 3060. Please strictly follow the requirements when installing dependencies.
- 2024/8/6: We have open-sourced the 3D Causal VAE used for CogVideoX-2B, which can reconstruct videos almost losslessly.
- 2024/8/6: We have open-sourced the first model of the CogVideoX series of video generation models, **CogVideoX-2B**.
- 2022/5/19: We have open-sourced the CogVideo video generation model (now available on the CogVideo branch). This is the first open-source large Transformer-based text-to-video generation model.

Jump to a specific section:
Before running the model, please refer to this guide to see how we use large models like
GLM-4 (or other comparable products, such as GPT-4) to optimize the prompt. This is crucial because the model is trained
with long prompts, and a good prompt directly impacts the quality of the generated video.
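As a concrete illustration of this workflow, here is a minimal sketch of prompt upsampling through an OpenAI-compatible chat client. The model name, system instruction, and helper function are illustrative assumptions, not the repository's official script:

```python
# Minimal sketch of prompt upsampling with an OpenAI-compatible chat API.
# The client setup, model name, and system instruction are illustrative;
# see the prompt-optimization guide for the exact workflow used by this repo.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; a GLM-4 endpoint can be used via base_url

SYSTEM_INSTRUCTION = (
    "You rewrite short video ideas into a single detailed English prompt "
    "describing subject, motion, scene, lighting, and camera movement."
)

def upsample_prompt(short_prompt: str, model: str = "gpt-4o") -> str:
    """Expand a terse user prompt into the long, descriptive form the model was trained on."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_INSTRUCTION},
            {"role": "user", "content": short_prompt},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()

print(upsample_prompt("a girl riding a bike"))
```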
Please make sure your Python version is between 3.10 and 3.12 (inclusive).
Follow the instructions in sat_demo: it contains the inference and fine-tuning code for the SAT weights. Building on
the CogVideoX model structure is recommended; innovative researchers can use this code for rapid experimentation and
development.
Please make sure your Python version is between 3.10 and 3.12 (inclusive).
pip install -r requirements.txt
Then follow diffusers_demo: a more detailed walkthrough of the inference code that explains the
significance of common parameters.
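For orientation, a minimal sketch of diffusers-based inference is shown below, assuming the CogVideoXPipeline available in recent diffusers releases; the prompt and parameter values are illustrative, and diffusers_demo remains the authoritative reference:

```python
# Minimal text-to-video sketch with diffusers; parameter values are illustrative.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Memory optimizations referenced in the table footnotes below; disable them for maximum speed.
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

prompt = "A panda playing a guitar by a quiet mountain lake at sunset."
video = pipe(
    prompt=prompt,
    num_inference_steps=50,
    guidance_scale=6.0,
    num_frames=49,
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```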
For more details on quantized inference, please refer
to diffusers-torchao. With Diffusers and TorchAO, quantized inference
is possible, which reduces memory usage and, in some cases when compiled, also speeds up inference. A full list of
memory and time benchmarks with various settings on A100 and H100 GPUs has been published
at diffusers-torchao.
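As a rough sketch of the approach (not the benchmarked configuration), int8 weight-only quantization of the transformer with torchao can look like the following; the checkpoint, settings, and compile flags are assumptions, and diffusers-torchao documents the exact recipes:

```python
# Sketch of int8 weight-only quantization of the CogVideoX transformer with torchao.
# FP8 variants and recommended settings are documented in diffusers-torchao.
import torch
from diffusers import CogVideoXPipeline
from torchao.quantization import quantize_, int8_weight_only

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Quantize the transformer weights in place to reduce memory usage.
quantize_(pipe.transformer, int8_weight_only())
pipe.to("cuda")

# Optional: compiling the transformer can recover or improve speed after quantization.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

video = pipe(
    "A lighthouse on a cliff during a storm, cinematic.",
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]
```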
To view the prompts corresponding to the gallery, please click here.
CogVideoX is an open-source version of the video generation model originating
from QingYing. The table below displays the list of video generation
models we currently offer, along with their foundational information.
Model Name | CogVideoX1.5-5B (Latest) | CogVideoX1.5-5B-I2V (Latest) | CogVideoX-2B | CogVideoX-5B | CogVideoX-5B-I2V |
---|---|---|---|---|---|
Release Date | November 8, 2024 | November 8, 2024 | August 6, 2024 | August 27, 2024 | September 19, 2024 |
Video Resolution | 1360 * 768 | Min(W, H) = 768; 768 ≤ Max(W, H) ≤ 1360; Max(W, H) % 16 = 0 | 720 * 480 | 720 * 480 | 720 * 480 |
Inference Precision | BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported | BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported | FP16* (recommended), BF16, FP32, FP8*, INT8; INT4 not supported | BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported | BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported |
Single GPU Memory Usage | SAT BF16: 76GB; diffusers BF16: from 10GB*; diffusers INT8 (torchao): from 7GB* | SAT BF16: 76GB; diffusers BF16: from 10GB*; diffusers INT8 (torchao): from 7GB* | SAT FP16: 18GB; diffusers FP16: from 4GB*; diffusers INT8 (torchao): from 3.6GB* | SAT BF16: 26GB; diffusers BF16: from 5GB*; diffusers INT8 (torchao): from 4.4GB* | SAT BF16: 26GB; diffusers BF16: from 5GB*; diffusers INT8 (torchao): from 4.4GB* |
Multi-GPU Memory Usage | BF16: 24GB* using diffusers | BF16: 24GB* using diffusers | FP16: 10GB* using diffusers | BF16: 15GB* using diffusers | BF16: 15GB* using diffusers |
Inference Speed (Step = 50, FP/BF16) | Single A100: ~1000 seconds (5-second video); Single H100: ~550 seconds (5-second video) | Single A100: ~1000 seconds (5-second video); Single H100: ~550 seconds (5-second video) | Single A100: ~90 seconds; Single H100: ~45 seconds | Single A100: ~180 seconds; Single H100: ~90 seconds | Single A100: ~180 seconds; Single H100: ~90 seconds |
Prompt Language | English* | English* | English* | English* | English* |
Prompt Token Limit | 224 Tokens | 224 Tokens | 226 Tokens | 226 Tokens | 226 Tokens |
Video Length | 5 or 10 seconds | 5 or 10 seconds | 6 seconds | 6 seconds | 6 seconds |
Frame Rate | 16 frames / second | 16 frames / second | 8 frames / second | 8 frames / second | 8 frames / second |
Position Encoding | 3d_rope_pos_embed | 3d_rope_pos_embed | 3d_sincos_pos_embed | 3d_rope_pos_embed | 3d_rope_pos_embed + learnable_pos_embed |
Download Link (Diffusers) | 🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel | 🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel | 🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel | 🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel | 🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel |
Download Link (SAT) | 🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel | 🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel | SAT | SAT | SAT |
Data Explanation
- The memory figures above were measured with the diffusers memory optimizations enabled, including pipe.enable_sequential_cpu_offload(), pipe.vae.enable_slicing(), and pipe.vae.enable_tiling(). Disabling them increases memory usage but improves speed.
- For multi-GPU inference, the enable_sequential_cpu_offload() optimization needs to be disabled.
- The CogVideoX-2B model was trained in FP16 precision, and all CogVideoX-5B models were trained in BF16 precision; we recommend running inference in the precision the model was trained in.
- Quantized inference is compatible with torch.compile, which can significantly improve inference speed. FP8 precision requires supporting hardware (NVIDIA H100 or newer) as well as the torch and torchao Python packages; CUDA 12.4 is recommended.
- Only the diffusers version of the model supports quantization.

We highly welcome contributions from the community and actively contribute to the open-source ecosystem. The following works have already been adapted for CogVideoX, and we invite everyone to use them:
A community project built on the diffusers version of the model, supporting more resolutions.

This open-source repository will guide developers to quickly get started with the basic usage and fine-tuning examples
of the CogVideoX open-source model.
Here are three projects that can be run directly on free Colab T4 instances:
This folder contains tools for model conversion, caption generation, and more.
The official repo for the paper CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers is on the CogVideo branch.
CogVideo is able to generate relatively high-frame-rate videos.
A 4-second clip of 32 frames is shown below.
The demo for CogVideo is at https://models.aminer.cn/cogvideo, where you can get
hands-on practice on text-to-video generation. The original input is in Chinese.
🌟 If you find our work helpful, please leave us a star and cite our paper.
@article{yang2024cogvideox,
title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
journal={arXiv preprint arXiv:2408.06072},
year={2024}
}
@article{hong2022cogvideo,
title={CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers},
author={Hong, Wenyi and Ding, Ming and Zheng, Wendi and Liu, Xinghan and Tang, Jie},
journal={arXiv preprint arXiv:2205.15868},
year={2022}
}
We welcome your contributions! You can click here for more information.
The code in this repository is released under the Apache 2.0 License.
The CogVideoX-2B model (including its corresponding Transformers module and VAE module) is released under
the Apache 2.0 License.
The CogVideoX-5B model (Transformers module, including I2V and T2V) is released under
the CogVideoX LICENSE.