Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
[中文主页] | [DJ-Cookbook] | [OperatorZoo] | [API] | [Awesome LLM Data]
Data-Juicer is a one-stop system to process text and multimodal data for and with foundation models (typically LLMs).
We provide a playground with a managed JupyterLab. Try Data-Juicer straight away in your browser! If you find Data-Juicer useful for your research or development, please kindly support us by starring it (so you are instantly notified of our new releases) and citing our works.
Platform for AI of Alibaba Cloud (PAI) has cited our work and integrated Data-Juicer into its data processing products. PAI is an AI Native large model and AIGC engineering platform that provides dataset management, computing power management, model tool chain, model development, model training, model deployment, and AI asset management. For documentation on data processing, please refer to: PAI-Data Processing for Large Models.
Data-Juicer is being actively updated and maintained. We will periodically enhance and add more features, data recipes and datasets. We welcome you to join us (via issues, PRs, Slack channel, DingDing group, …), in promoting data-model co-development along with research and applications of foundation models!
Systematic & Reusable: Empowering users with a systematic library of 100+ core OPs, 50+ reusable config recipes, and dedicated toolkits, designed to function independently of specific multimodal LLM datasets and processing pipelines. Supporting data analysis, cleaning, and synthesis in pre-training, post-tuning, English, Chinese, and more scenarios.
User-Friendly & Extensible: Designed for simplicity and flexibility, with easy-start guides and a DJ-Cookbook containing plentiful demo usages. Feel free to implement your own OPs for customizable data processing.
Efficient & Robust: Providing performance-optimized parallel data processing (Aliyun-PAI / Ray / CUDA / OP Fusion), faster with less resource usage, verified in large-scale production environments.
Effect-Proven & Sandbox: Supporting data-model co-development, enabling rapid iteration
through the sandbox laboratory, and providing features such as feedback loops and visualization, so that you can better understand and improve your data and models. Many effect-proven datasets and models have been derived from DJ, in scenarios such as pre-training, text-to-video and image-to-text generation.
Run the following commands to install the latest `data_juicer` version in editable mode:

```shell
cd <path_to_data_juicer>
pip install -v -e .
```

Some OPs rely on large or platform-specific third-party dependencies, so you can install optional dependencies as needed:

```shell
cd <path_to_data_juicer>
pip install -v -e .          # Install minimal dependencies, which support the basic functions
pip install -v -e .[tools]   # Install a subset of tools dependencies
```
The dependency options are listed below:
| Tag | Description |
|---|---|
| `.` or `.[mini]` | Install minimal dependencies for basic Data-Juicer. |
| `.[all]` | Install all dependencies except sandbox. |
| `.[sci]` | Install all dependencies for all OPs. |
| `.[dist]` | Install dependencies for distributed data processing. (Experimental) |
| `.[dev]` | Install dependencies for developing the package as contributors. |
| `.[tools]` | Install dependencies for dedicated tools, such as quality classifiers. |
| `.[sandbox]` | Install all dependencies for sandbox. |
As the number of OPs grows, the dependencies of all OPs become very heavy. Instead of installing all of them with `pip install -v -e .[sci]`, we provide two lighter alternatives:
Automatic Minimal Dependency Installation: During the execution of Data-Juicer, minimal dependencies will be automatically installed. This allows for immediate execution, but may potentially lead to dependency conflicts.
Manual Minimal Dependency Installation: To manually install minimal dependencies tailored to a specific execution configuration, run the following command:
```shell
# only for installation from source
python tools/dj_install.py --config path_to_your_data-juicer_config_file

# use command line tool
dj-install --config path_to_your_data-juicer_config_file
```
Run the following command to install the latest released `data_juicer` using `pip`:

```shell
pip install py-data-juicer
```

Note: only the basic APIs in `data_juicer` and two basic tools (data processing and analysis) are available in this way. If you want customizable and complete functions, we recommend you install `data_juicer` from source.

To use Docker, you can either pull our pre-built image from DockerHub:
```shell
docker pull datajuicer/data-juicer:<version_tag>
```
or run the following command to build the docker image including the latest `data-juicer` with the provided Dockerfile:

```shell
docker build -t datajuicer/data-juicer:<version_tag> .
```
The format of `<version_tag>` is like `v0.2.0`, which is the same as the release version tag.
Installation check:

```python
import data_juicer as dj
print(dj.__version__)
```
Before using video-related operators, FFmpeg should be installed and accessible via the `$PATH` environment variable.
You can install FFmpeg using package managers (e.g. `sudo apt install ffmpeg` on Debian/Ubuntu, `brew install ffmpeg` on OS X) or visit the official FFmpeg website.
Check whether your environment path is set correctly by running the `ffmpeg` command from the terminal.
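For example, the following quick check should print version and build information if FFmpeg is on your `$PATH`:

```shell
ffmpeg -version
```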
Run the `process_data.py` tool or the `dj-process` command line tool with your config as the argument to process your dataset:

```shell
# only for installation from source
python tools/process_data.py --config configs/demo/process.yaml

# use command line tool
dj-process --config configs/demo/process.yaml
```
Note: For some operators that involve third-party models or resources that are not stored locally on your computer, the first run might be slow because these OPs need to download the corresponding resources into a directory first.
The default download cache directory is `~/.cache/data_juicer`. Change the cache location by setting the shell environment variable `DATA_JUICER_CACHE_HOME` to another directory, and you can also change `DATA_JUICER_MODELS_CACHE` or `DATA_JUICER_ASSETS_CACHE` in the same way:

```shell
# cache home
export DATA_JUICER_CACHE_HOME="/path/to/another/directory"
# cache models
export DATA_JUICER_MODELS_CACHE="/path/to/another/directory/models"
# cache assets
export DATA_JUICER_ASSETS_CACHE="/path/to/another/directory/assets"
```

Note: When using operators with third-party models, it's necessary to declare the corresponding `mem_required` in the configuration file (you can refer to the settings in the `config_all.yaml` file). During runtime, Data-Juicer will control the number of processes based on memory availability and the memory requirements of the operator models to achieve better data processing efficiency. When running in CUDA environments, if the `mem_required` for an operator is not declared correctly, it could potentially lead to a CUDA Out of Memory issue.
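As a rough illustration of such a declaration, a recipe entry might look like the following sketch; the operator name, model, and memory value here are assumptions for illustration only, so consult `config_all.yaml` for the actual settings:

```yaml
process:
  - image_captioning_mapper:                   # example model-based OP (assumed)
      hf_img2seq: 'Salesforce/blip2-opt-2.7b'  # third-party model it loads (assumed)
      mem_required: '16GB'                     # declared memory need of this OP's model
```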
You can also apply operators within Python code:

```python
# ... init op & dataset ...

# Chained call style, supporting a single operator or an operator list
dataset = dataset.process(op)
dataset = dataset.process([op1, op2])
# Functional programming style for quick integration or script prototyping
dataset = op(dataset)
dataset = op.run(dataset)
```
We have now implemented multi-machine distributed data processing based on RAY. The corresponding demos can be run using the following commands:
```shell
# Run text data processing
python tools/process_data.py --config ./demos/process_on_ray/configs/demo.yaml

# Run video data processing
python tools/process_data.py --config ./demos/process_video_on_ray/configs/demo.yaml
```
Note: The deduplicator operators for RAY mode differ from the single-machine versions and are prefixed with `ray`, e.g. `ray_video_deduplicator` and `ray_document_deduplicator`.

Users can also opt not to use RAY and instead split the dataset to run on a cluster with Slurm. In this case, please use the default Data-Juicer without RAY.
Aliyun PAI-DLC supports the RAY framework, Slurm framework, etc. Users can directly create RAY jobs and Slurm jobs on the DLC cluster.
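For reference, a minimal sketch of a RAY-mode recipe is shown below. It assumes the `executor_type` and `ray_address` keys used in the bundled `demos/process_on_ray` config; treat the exact fields and paths as assumptions and adjust them to your cluster:

```yaml
project_name: 'ray-demo'                          # hypothetical recipe name
dataset_path: './demos/data/demo-dataset.jsonl'   # input dataset visible to all nodes
export_path: './outputs/ray-demo/processed.jsonl'

executor_type: 'ray'                              # switch from the default standalone executor (assumed key)
ray_address: 'auto'                               # address of the running Ray cluster (assumed key)

process:
  - whitespace_normalization_mapper:              # example OP; any recipe OP list works here
```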
Run the `analyze_data.py` tool or the `dj-analyze` command line tool with your config as the argument to analyze your dataset:

```shell
# only for installation from source
python tools/analyze_data.py --config configs/demo/analyzer.yaml

# use command line tool
dj-analyze --config configs/demo/analyzer.yaml

# you can also use auto mode to avoid writing a recipe. It will analyze a small
# part (e.g. 1000 samples, specified by argument `auto_num`) of your dataset
# with all Filters that produce stats.
dj-analyze --auto --dataset_path xx.jsonl [--auto_num 1000]
```
Two registries are used to mark the OPs relevant to analysis:

- `NON_STATS_FILTERS`: decorate Filters that DO NOT produce any stats.
- `TAGGING_OPS`: decorate OPs that DO produce tags/categories in the meta field.

Run the `app.py` tool to visualize your dataset in your browser:

```shell
streamlit run app.py
```
You can take the config file `config_all.yaml` as a reference, which includes all OPs and their default arguments. For more details, refer to `config_all.yaml`, the OP documents, and the advanced Build-Up Guide for developers. Config entries can also be specified or overridden on the command line, e.g.:

```shell
python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang=en
```

The basic config format and definition is shown below.
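A minimal sketch of such a recipe is given here for illustration; the field values are assumptions modeled on the demo configs, not a complete reference:

```yaml
# global arguments
project_name: 'demo-process'                           # hypothetical project name
dataset_path: './demos/data/demo-dataset.jsonl'        # path to the input dataset
export_path: './outputs/demo-process/processed.jsonl'  # where to export the result
np: 4                                                  # number of worker processes

# operator list: each entry is an OP name with its arguments
process:
  - language_id_score_filter:
      lang: 'en'        # keep samples identified as English
      min_score: 0.8    # with a language-identification confidence of at least 0.8
```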
The data sandbox laboratory (DJ-Sandbox) provides users with the best practices for continuously producing data recipes. It features low overhead, portability, and guidance.
The sandbox is run using the following commands by default, and for more information and details, please refer to the sandbox documentation.
```shell
python tools/sandbox_starter.py --config configs/demo/sandbox/sandbox.yaml
```
Preprocessing tools are provided in `tools/preprocess` for you to preprocess raw data into formats supported by Data-Juicer.
After pulling or building the docker image of `data-juicer`, you can run the commands or tools mentioned above using this docker image:

```shell
# run the data processing directly
docker run --rm \
  --privileged \
  --shm-size 256g \
  --network host \
  --gpus all \
  --name dj \
  -v <host_data_path>:<image_data_path> \
  -v ~/.cache/:/root/.cache/ \
  datajuicer/data-juicer:<version_tag> \
  dj-process --config /path/to/config.yaml
# --rm: remove the container after the processing finishes
# --name dj: name of the container
# -v ...: mount the data/config directory and the cache directory (recommended, to reuse caches and models)
# dj-process ...: similar data processing commands as above, run inside the image
```
```shell
# start the container in the background
docker run -dit \
  --privileged \
  --shm-size 256g \
  --network host \
  --gpus all \
  --rm \
  --name dj \
  -v <host_data_path>:<image_data_path> \
  -v ~/.cache/:/root/.cache/ \
  datajuicer/data-juicer:latest /bin/bash

# enter this container and then you can use data-juicer in editable mode
docker exec -it <container_id> bash
```
Data-Juicer is released under Apache License 2.0.
We are in a rapidly developing field and greatly welcome contributions of new features, bug fixes, and better documentation. Please refer to the How-to Guide for Developers.
Data-Juicer is used across various foundation model applications and research initiatives, such as industrial scenarios in Alibaba Tongyi and Alibaba Cloud's Platform for AI (PAI).
We look forward to more of your experience, suggestions, and discussions for collaboration!
Data-Juicer thanks many community contributors and open-source projects, such as
Huggingface-Datasets, Bloom, RedPajama, Arrow, Ray, …
If you find Data-Juicer useful for your research or development, please kindly cite the following works, our 1.0 paper and 2.0 paper:
```bibtex
@inproceedings{djv1,
  title={Data-Juicer: A One-Stop Data Processing System for Large Language Models},
  author={Daoyuan Chen and Yilun Huang and Zhijian Ma and Hesen Chen and Xuchen Pan and Ce Ge and Dawei Gao and Yuexiang Xie and Zhaoyang Liu and Jinyang Gao and Yaliang Li and Bolin Ding and Jingren Zhou},
  booktitle={International Conference on Management of Data},
  year={2024}
}

@article{djv2,
  title={Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models},
  author={Chen, Daoyuan and Huang, Yilun and Pan, Xuchen and Jiang, Nana and Wang, Haibin and Ge, Ce and Chen, Yushuo and Zhang, Wenhao and Ma, Zhijian and Zhang, Yilei and Huang, Jun and Lin, Wei and Li, Yaliang and Ding, Bolin and Zhou, Jingren},
  journal={arXiv preprint arXiv:2501.14755},
  year={2025}
}
```
- Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development
- ImgDiff: Contrastive Data Synthesis for Vision Large Language Models
- Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data
- BiMix: A Bivariate Data Mixing Law for Language Model Pretraining