Recycled data coming soon...
- Prerequisites
- Rephraser Training
- Rephraser Inference
- Quality Filtering
- Tokenization
- Pretraining
- Evaluation
- Citation
## Prerequisites

Our RL code is based on Open R1, and our pretraining code is based on DCLM. Please refer to their repos for more details.
The code is tested on Python 3.10.16. Install basic dependencies:

```bash
pip install -r requirements.txt
```

Run `setup.py` to download necessary files:

```bash
cd pretrain
python setup.py install
```

## Rephraser Training

Run `scripts/rl.sh`.
- We provide a 1000-example subset of the training data (`rl/1000_sample_low_score.jsonl`) for testing purposes. The full dataset can be found here.
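Before launching `scripts/rl.sh`, a quick sanity check that the sample subset is in place can be useful. This is a minimal sketch that only assumes the file path above; the fields printed depend on the dataset schema:

```bash
# Confirm the 1000-example subset exists and is valid JSONL.
wc -l rl/1000_sample_low_score.jsonl             # expect 1000 lines
head -n 1 rl/1000_sample_low_score.jsonl | python -m json.tool | head -n 20
```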
## Rephraser Inference

```bash
bash scripts/infer.sh 0 7
```

- `0` and `7` are the start and end indices of the data shards you want to process; change them based on your needs.
- We processed 600 shards in total, corresponding to 72B tokens.
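To split inference across several sequential jobs, one option (a sketch, not part of the provided scripts) is to loop over shard ranges; the chunk size of 8 below is arbitrary:

```bash
# Process shards 0-599 in chunks of 8 using the start/end interface above.
for start in $(seq 0 8 592); do
    end=$((start + 7))
    bash scripts/infer.sh "$start" "$end"
done
```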
## Quality Filtering

Run `scripts/filter.sh`:

- `source_ref_paths`: data pool path
- `output_dir`: filtered data dir
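A hypothetical invocation, assuming the two options above are exposed as command-line flags (check `scripts/filter.sh` for the actual interface); both paths are placeholders:

```bash
# Hypothetical flags and placeholder paths; verify against scripts/filter.sh.
bash scripts/filter.sh \
    --source_ref_paths /data/rephrased_pool \
    --output_dir /data/filtered
```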
## Tokenization

Please install Rust in your conda environment.
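One way to get Rust into the active conda environment is via the conda-forge package (installing it with rustup also works):

```bash
# Install the Rust toolchain (rustc, cargo) from conda-forge into the active env.
conda install -c conda-forge rust
```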
Run `scripts/tokenize.sh`:

- `input`: the original text data dir
- `output`: the tokenized data dir
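As with the filtering step, a hypothetical invocation assuming the two directories are passed as flags (the actual interface is defined in `scripts/tokenize.sh`); paths are placeholders:

```bash
# Hypothetical flags and placeholder paths; verify against scripts/tokenize.sh.
bash scripts/tokenize.sh \
    --input /data/filtered \
    --output /data/tokenized
```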
## Pretraining

Run `scripts/pretrain.sh`:

- `scale`: the DCLM running scale; the supported ones are in `training/configs`
- `data-config`: specifies the run name (`name`) and the tokenized data location (`manifest_url`); create a new one when you have a new dataset (a sketch of such a config follows below)
- `logs`: where to store the checkpoints
- `multiple-data-passes`: used to allow multiple epochs
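A minimal sketch of a data config with the two fields named above, assuming a DCLM-style JSON file; the file path, run name, and manifest location are all placeholders, and your DCLM version may require additional fields:

```bash
# Write a hypothetical data config; the path, name, and manifest location are placeholders.
cat > my_dataset.json <<'EOF'
{
    "name": "my_dataset",
    "manifest_url": "/data/tokenized/manifest.jsonl"
}
EOF
```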
## Evaluation

Run `scripts/eval.sh`:

- `method`: the generated checkpoint dir name
- `checkpoint`: the specific epoch you want to evaluate
- `model`: the model scale config in `training/open_lm_configs`
- `output-file`: where to store the evaluation result
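A hypothetical invocation, assuming the parameters above are passed as flags (check `scripts/eval.sh` for the actual interface); all values are placeholders:

```bash
# Hypothetical flags and placeholder values; verify against scripts/eval.sh.
bash scripts/eval.sh \
    --method my_run \
    --checkpoint epoch_1 \
    --model open_lm_1b \
    --output-file results/my_run_eval.json
```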
## Citation

If you find this work useful, please consider citing:
```bibtex
@article{yu2025repro,
  title={{RePro}: Training Language Models to Faithfully Recycle the Web for Pretraining},
  author={Yu, Zichun and Xiong, Chenyan},
  journal={ArXiv preprint},
  year={2025}
}
```
