
I've downloaded the UD Treebank dataset, set up a shell script to discover all folders for a given language and converted the .conllu files to .spacy.

Now I have a collection of files like this: de_gsd-ud-train.spacy, de_hdt-ud-train.spacy, ... I'd like to use all of them together to train a new nlp pipeline.

I have set up a training config, and it works well with a single .spacy file (well, two: one for train and one for dev). But it's not clear to me how to pass multiple files for the training set.

Here's the failing command for training:

python -m spacy train "$FILLED_CONFIG" --output "$OUTPUT_DIR" --gpu-id 0 --verbose \
  --paths.train "${TRAIN_SPACY_FILES[@]}" \
  --paths.dev "${DEV_SPACY_FILES[@]}"

This makes spaCy complain:

✘ Invalid config override './training_data/de_gsd-ud-train.spacy': name
should start with --

So spaCy parses the second training file (de_gsd-ud-train.spacy) as a new config override and complains that its name does not start with --.
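The cause is the shell expansion: "${TRAIN_SPACY_FILES[@]}" becomes one argument per file, so only the first path is taken as the value of --paths.train. Below is a simplified Python imitation of that parsing rule. It is not spaCy's actual code; parse_overrides is a made-up name, and the order of the two files is assumed for illustration.

```python
def parse_overrides(args):
    """Toy override parser: every option name must start with "--"."""
    overrides = {}
    args = list(args)
    while args:
        opt = args.pop(0)
        if not opt.startswith("--"):
            raise ValueError(
                f"Invalid config override '{opt}': name should start with --"
            )
        # A following token that is not an option becomes the single value.
        if args and not args[0].startswith("--"):
            overrides[opt[2:]] = args.pop(0)
        else:
            overrides[opt[2:]] = "true"
    return overrides


# What the shell passes after expanding a two-element array (assumed order).
argv = [
    "--paths.train",
    "./training_data/de_hdt-ud-train.spacy",
    "./training_data/de_gsd-ud-train.spacy",
]
try:
    parse_overrides(argv)
except ValueError as err:
    print(err)
```

Each override takes exactly one value, which is why a list of files cannot be passed this way.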

1 Answer

Instead of passing several files, concatenate the treebanks into a single .spacy data file. You could do the concatenation manually, but fortunately spaCy's convert command supports it.
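For the manual route, one option is to join the .conllu files before converting them. The sketch below assumes that approach. concatenate_conllu is a hypothetical helper, and it relies only on the CoNLL-U rule that sentences are separated by a blank line.

```python
from pathlib import Path


def concatenate_conllu(sources, target):
    """Join CoNLL-U files, keeping a blank line after each file's last sentence."""
    chunks = []
    for source in sources:
        text = Path(source).read_text(encoding="utf-8").strip("\n")
        if text:
            # CoNLL-U separates sentences with a blank line.
            chunks.append(text + "\n\n")
    Path(target).write_text("".join(chunks), encoding="utf-8")
    # Number of non-empty input files that were written.
    return len(chunks)
```

Calling it with a list of training files and an output path of your choice gives one file that spacy convert can process as usual. The steps below use spaCy's built-in route instead.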

  1. Split your treebanks into train, dev and test folders.

  2. Update your convert/preprocess script to call spacy convert with the --concatenate flag:

project.yml

# the input path points to a folder, not a '*.conllu' file
- "python -m spacy convert assets/EWT/train/ corpus/EWT/ --converter conllu --n-sents 10 --concatenate"

  3. The name of the output file will match the name of the last folder (train for the above). Update deps and outputs accordingly:

project.yml

deps:
  - "assets/EWT/train/ewt.conllu"
  - "assets/EWT/train/extra.conllu"
  - "assets/EWT/dev/ewt.conllu"
  - "assets/EWT/dev/extra.conllu"
  - "assets/EWT/test/ewt.conllu"
  - "assets/EWT/test/extra.conllu"
outputs:
  - "corpus/EWT/train.spacy"
  - "corpus/EWT/dev.spacy"
  - "corpus/EWT/test.spacy"

  4. Recheck your script for unnecessary logic. Because of the naming above, the cleanup lines present in the default templates are no longer needed:

project.yml

- "mv corpus/EWT/... corpus/EWT/train.spacy" # -- junk, remove this from the `preprocess`/`convert` weasel command

The output file is already named train.spacy and sits in the right folder.
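Step 1 can be scripted. The following is a minimal sketch that assumes the UD release naming, where files end in -ud-train.conllu, -ud-dev.conllu or -ud-test.conllu. sort_into_splits is a hypothetical helper, and target_dir is assumed to lie outside source_dir.

```python
import re
import shutil
from pathlib import Path

# Assumed UD naming: <lang>_<treebank>-ud-<split>.conllu
SPLIT_PATTERN = re.compile(r"-ud-(train|dev|test)\.conllu$")


def sort_into_splits(source_dir, target_dir):
    """Copy every treebank file under source_dir into target_dir/<split>/."""
    copied = {"train": [], "dev": [], "test": []}
    for path in sorted(Path(source_dir).rglob("*.conllu")):
        match = SPLIT_PATTERN.search(path.name)
        if match is None:
            continue  # not a recognisable split file, leave it alone
        split = match.group(1)
        destination = Path(target_dir) / split
        destination.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, destination / path.name)
        copied[split].append(path.name)
    return copied
```

After that, each of the train, dev and test folders can be passed to spacy convert with --concatenate as described in step 2.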
