RePro: Training Language Models to Faithfully Recycle the Web for Pretraining

Recycled data coming soon...

Quick Links

Our RL code is based on Open R1, and our pretraining code is based on DCLM. Please refer to their repos for more details.

Prerequisites

The code is tested on Python 3.10.16. Install basic dependencies:

pip install -r requirements.txt
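If you prefer an isolated setup, you can first create a conda environment (the name repro here is just an example) and then run the command above:

conda create -n repro python=3.10.16
conda activate repro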

Run setup.py to download necessary files:

cd pretrain
python setup.py install

Rephraser Training

Run scripts/rl.sh.

  • We provide a 1000-example subset of the training data (rl/1000_sample_low_score.jsonl) for testing purposes. The full dataset can be found here.
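A minimal launch sketch; how rl.sh locates its training data (e.g., a path set inside the script, such as the sample file above) is an assumption, so check the script itself:

bash scripts/rl.sh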

Rephraser Inference

bash scripts/infer.sh 0 7
  • 0 and 7 are the start and end indices of the data shards you want to process; change them as needed.
  • We processed 600 shards for 72B tokens.
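For example, to split the 600 shards across two jobs (this assumes shard indices 0-599 and an inclusive end index; verify against infer.sh):

bash scripts/infer.sh 0 299
bash scripts/infer.sh 300 599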

Quality Filtering

Run scripts/filter.sh:

  • source_ref_paths: path to the data pool
  • output_dir: directory for the filtered data
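One way the two parameters might be supplied (the flag syntax is an assumption and the paths are placeholders; check filter.sh for its actual interface):

bash scripts/filter.sh \
  --source_ref_paths /path/to/data_pool \
  --output_dir /path/to/filtered_data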

Tokenization

Please install Rust in your conda environment.
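For example, one way to install Rust into the active conda environment is via the conda-forge package (rustup is an alternative):

conda install -c conda-forge rust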

Run scripts/tokenize.sh:

  • input: directory of the original text data
  • output: directory for the tokenized data
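A usage sketch (the flag syntax is an assumption and the paths are placeholders; check tokenize.sh for how it actually takes these parameters):

bash scripts/tokenize.sh \
  --input /path/to/filtered_data \
  --output /path/to/tokenized_data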

Pretraining

Run scripts/pretrain.sh:

  • scale: DCLM training scale; see training/configs for the supported options
  • data-config: specifies the run name ("name") and the tokenized data location ("manifest_url"); create a new one when you add a new dataset
  • logs: where to store the checkpoints
  • multiple-data-passes: allows multiple passes (epochs) over the data
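A launch sketch under assumptions: whether these options are passed as command-line flags or edited inside pretrain.sh is not confirmed here, the scale name is hypothetical (pick a supported one from training/configs), and the data config path is a placeholder:

bash scripts/pretrain.sh \
  --scale 1b_1x \
  --data-config /path/to/my_data_config.json \
  --logs ./checkpoints \
  --multiple-data-passes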

Evaluation

Run scripts/eval.sh:

  • method: the name of the generated checkpoint directory
  • checkpoint: the specific epoch you want to evaluate
  • model: model scale config in training/open_lm_configs
  • output-file: where to store the evaluation results
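An invocation sketch (the flag syntax and all example values are assumptions/placeholders; check eval.sh for the expected names):

bash scripts/eval.sh \
  --method repro_recycled_1b \
  --checkpoint epoch_5 \
  --model open_lm_1b \
  --output-file results/repro_recycled_1b.json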

Citation

If you find this work useful, please consider citing:

@article{yu2025repro,
  title={{RePro}: Training Language Models to Faithfully Recycle the Web for Pretraining},
  author={Yu, Zichun and Xiong, Chenyan},
  journal={ArXiv preprint},
  year={2025}
}
