Recycled data coming soon...
- Prerequisites
- Rephraser Training
- Rephraser Inference
- Quality Filtering
- Tokenization
- Pretraining
- Evaluation
- Citation
## Prerequisites

Our RL code is based on Open R1, and our pretraining code is based on DCLM. Please refer to their repos for more details.
The code is tested on Python 3.10.16. Install basic dependencies:

```bash
pip install -r requirements.txt
```

Run `setup.py` to download necessary files:

```bash
cd pretrain
python setup.py install
```

## Rephraser Training

Run `scripts/rl.sh`.
- We provide a 1000-example subset of the training data (`rl/1000_sample_low_score.jsonl`) for testing purposes. The full dataset can be found here.
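Before launching `scripts/rl.sh`, a quick sanity check that the sample subset is in place can be useful. This is a minimal sketch that only assumes the file path above; the fields printed depend on the dataset schema:

```bash
# Confirm the 1000-example subset exists and is valid JSONL.
wc -l rl/1000_sample_low_score.jsonl             # expect 1000 lines
head -n 1 rl/1000_sample_low_score.jsonl | python -m json.tool | head -n 20
```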
## Rephraser Inference

```bash
bash scripts/infer.sh 0 7
```

- `0` and `7` are the start and end indices of the data shards you want to process; change them based on your needs.
- We processed 600 shards in total, corresponding to 72B tokens.
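To split inference across several sequential jobs, one option (a sketch, not part of the provided scripts) is to loop over shard ranges; the chunk size of 8 below is arbitrary:

```bash
# Process shards 0-599 in chunks of 8 using the start/end interface above.
for start in $(seq 0 8 592); do
    end=$((start + 7))
    bash scripts/infer.sh "$start" "$end"
done
```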
## Quality Filtering

Run `scripts/filter.sh`:

- `source_ref_paths`: data pool path
- `output_dir`: filtered data dir
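A hypothetical invocation, assuming the two options above are exposed as command-line flags (check `scripts/filter.sh` for the actual interface); both paths are placeholders:

```bash
# Hypothetical flags and placeholder paths; verify against scripts/filter.sh.
bash scripts/filter.sh \
    --source_ref_paths /data/rephrased_pool \
    --output_dir /data/filtered
```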
## Tokenization

Please install Rust in your conda environment.
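One way to get Rust into the active conda environment is via the conda-forge package (installing it with rustup also works):

```bash
# Install the Rust toolchain (rustc, cargo) from conda-forge into the active env.
conda install -c conda-forge rust
```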
Run `scripts/tokenize.sh`:

- `input`: the original text data dir
- `output`: the tokenized data dir
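As with the filtering step, a hypothetical invocation assuming the two directories are passed as flags (the actual interface is defined in `scripts/tokenize.sh`); paths are placeholders:

```bash
# Hypothetical flags and placeholder paths; verify against scripts/tokenize.sh.
bash scripts/tokenize.sh \
    --input /data/filtered \
    --output /data/tokenized
```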
## Pretraining

Run `scripts/pretrain.sh`:

- `scale`: the DCLM running scale; the supported ones are in `training/configs`
- `data-config`: specifies the run name (`name`) and the tokenized data location (`manifest_url`); create a new one when you have a new dataset (a sketch of such a config follows below)
- `logs`: where to store the checkpoints
- `multiple-data-passes`: used to allow multiple epochs
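A minimal sketch of a data config with the two fields named above, assuming a DCLM-style JSON file; the file path, run name, and manifest location are all placeholders, and your DCLM version may require additional fields:

```bash
# Write a hypothetical data config; the path, name, and manifest location are placeholders.
cat > my_dataset.json <<'EOF'
{
    "name": "my_dataset",
    "manifest_url": "/data/tokenized/manifest.jsonl"
}
EOF
```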
## Evaluation

Run `scripts/eval.sh`:

- `method`: the generated checkpoint dir name
- `checkpoint`: the specific epoch you want to evaluate
- `model`: the model scale config in `training/open_lm_configs`
- `output-file`: where to store the evaluation result
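A hypothetical invocation, assuming the parameters above are passed as flags (check `scripts/eval.sh` for the actual interface); all values are placeholders:

```bash
# Hypothetical flags and placeholder values; verify against scripts/eval.sh.
bash scripts/eval.sh \
    --method my_run \
    --checkpoint epoch_1 \
    --model open_lm_1b \
    --output-file results/my_run_eval.json
```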
## Citation

If you find this work useful, please consider citing:
```bibtex
@article{yu2025repro,
  title={{RePro}: Training Language Models to Faithfully Recycle the Web for Pretraining},
  author={Yu, Zichun and Xiong, Chenyan},
  journal={ArXiv preprint},
  year={2025}
}
```
