Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
[中文主页] | [DJ-Cookbook] | [OperatorZoo] | [API] | [Awesome LLM Data]
Data-Juicer is a one-stop system to process text and multimodal data for and with foundation models (typically LLMs).
We provide a playground with a managed JupyterLab, so you can try Data-Juicer straight away in your browser! If you find Data-Juicer useful for your research or development, please kindly support us by starring the repo (you will then be notified of new releases) and citing our works.
Platform for AI of Alibaba Cloud (PAI) has cited our work and integrated Data-Juicer into its data processing products. PAI is an AI-native large model and AIGC engineering platform that provides dataset management, computing power management, model toolchain, model development, model training, model deployment, and AI asset management. For documentation on data processing, please refer to: PAI-Data Processing for Large Models.
Data-Juicer is being actively updated and maintained. We will periodically enhance it and add more features, data recipes, and datasets. We welcome you to join us (via issues, PRs, Slack channel, DingDing group, …) in promoting data-model co-development alongside research and applications of foundation models!
Systematic & Reusable: Empowering users with a systematic library of 100+ core OPs, 50+ reusable config recipes, and dedicated toolkits, designed to function independently of specific multimodal LLM datasets and processing pipelines. Supports data analysis, cleaning, and synthesis across pre-training, post-tuning, English, Chinese, and more scenarios.
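To give a sense of what a reusable config recipe looks like, here is a minimal, hedged sketch: the field names (`dataset_path`, `export_path`, `process`) and OP names follow the general shape of Data-Juicer recipes, but treat the exact keys and parameters as illustrative; consult the OperatorZoo and config docs for the authoritative schema.

```yaml
# Illustrative recipe sketch (field and OP names are assumptions, not
# a verbatim copy of a shipped recipe): clean an English text dataset.
project_name: 'demo-clean-en'
dataset_path: 'data/raw.jsonl'      # input samples, one JSON object per line
export_path: 'data/cleaned.jsonl'   # where processed samples are written

process:
  - whitespace_normalization_mapper:        # normalize odd whitespace
  - language_id_score_filter:               # keep confident English samples
      lang: 'en'
      min_score: 0.8
```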
User-Friendly & Extensible: Designed for simplicity and flexibility, with easy-start guides and a DJ-Cookbook containing plentiful demo usages. Feel free to implement your own OPs for customizable data processing.
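As a rough illustration of the mapper-style OP pattern, here is a minimal, self-contained sketch. The class and function names (`UppercaseMapper`, `run_pipeline`) are hypothetical and do not reflect Data-Juicer's real base classes or registry; see the How-to Guide for Developers for the actual OP interfaces.

```python
# Hypothetical sketch of a mapper-style OP: an object that transforms
# one sample dict at a time. Names are illustrative only; Data-Juicer's
# real OPs subclass its own base classes and register themselves.

class UppercaseMapper:
    """Toy OP: uppercases the 'text' field of each sample."""

    def process(self, sample: dict) -> dict:
        sample["text"] = sample["text"].upper()
        return sample


def run_pipeline(samples, ops):
    """Apply a list of mapper-style OPs to every sample, in order."""
    for op in ops:
        samples = [op.process(dict(s)) for s in samples]
    return samples


if __name__ == "__main__":
    data = [{"text": "hello data-juicer"}]
    out = run_pipeline(data, [UppercaseMapper()])
    print(out[0]["text"])  # HELLO DATA-JUICER
```

The point of the pattern is that each OP is independent of the dataset and of the other OPs, so recipes can freely compose them.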
Efficient & Robust: Providing performance-optimized parallel data processing (Aliyun-PAI/Ray/CUDA/OP Fusion) that runs faster with less resource usage, verified in large-scale production environments.
Effect-Proven & Sandbox: Supporting data-model co-development with rapid iteration through the sandbox laboratory, plus feedback loops and visualization, so that you can better understand and improve your data and models. Many effect-proven datasets and models have been derived from DJ, in scenarios such as pre-training, text-to-video, and image-to-text generation.
Data-Juicer is released under Apache License 2.0.
We are in a rapidly developing field and greatly welcome contributions of new
features, bug fixes, and better documentation. Please refer to
How-to Guide for Developers.
Data-Juicer is used across various foundation model applications and research initiatives, including industrial scenarios in Alibaba Tongyi and Alibaba Cloud's Platform for AI (PAI).
We look forward to your experiences, suggestions, and discussions on collaboration!
Data-Juicer thanks many community contributors and open-source projects, such as
Huggingface-Datasets, Bloom, RedPajama, Arrow, Ray, …
If you find Data-Juicer useful for your research or development, please kindly cite the following works: the 1.0 paper and the 2.0 paper.
@inproceedings{djv1,
title={Data-Juicer: A One-Stop Data Processing System for Large Language Models},
author={Daoyuan Chen and Yilun Huang and Zhijian Ma and Hesen Chen and Xuchen Pan and Ce Ge and Dawei Gao and Yuexiang Xie and Zhaoyang Liu and Jinyang Gao and Yaliang Li and Bolin Ding and Jingren Zhou},
booktitle={International Conference on Management of Data},
year={2024}
}
@article{djv2,
title={Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models},
author={Chen, Daoyuan and Huang, Yilun and Pan, Xuchen and Jiang, Nana and Wang, Haibin and Ge, Ce and Chen, Yushuo and Zhang, Wenhao and Ma, Zhijian and Zhang, Yilei and Huang, Jun and Lin, Wei and Li, Yaliang and Ding, Bolin and Zhou, Jingren},
journal={arXiv preprint arXiv:2501.14755},
year={2025}
}
(ICML’25 Spotlight) Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development
(CVPR’25) ImgDiff: Contrastive Data Synthesis for Vision Large Language Models
(Benchmark Data) HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data
(Benchmark Data) DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?
(Data Synthesis) Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data
(Data Synthesis) MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning?
(Data Scaling) BiMix: A Bivariate Data Mixing Law for Language Model Pretraining