I've downloaded the UD Treebank dataset, written a shell script that finds all folders for a given language, and converted the .conllu files to .spacy.
Now I have a collection of files like this: de_gsd-ud-train.spacy, de_hdt-ud-train.spacy, ...
I'd like to use all of them together to train a new nlp pipeline.
I have set up a training config, and it works well with a single .spacy file per split (one for train and one for dev). But it's not clear to me how to pass multiple files for the training set.
Here's the failing command for training:
python -m spacy train "$FILLED_CONFIG" --output "$OUTPUT_DIR" --gpu-id 0 --verbose \
--paths.train "${TRAIN_SPACY_FILES[@]}" \
--paths.dev "${DEV_SPACY_FILES[@]}"
This makes spaCy complain:
✘ Invalid config override './training_data/de_gsd-ud-train.spacy': name
should start with --
So it parses the second training file (de_gsd-ud-train.spacy) as a new config override and complains that the override name does not start with --.
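
To make the expansion visible, here is a minimal sketch of what the shell hands to the command. The two file names and the `show_args` helper are hypothetical stand-ins for illustration, not part of my actual script; the point is only that a quoted `[@]` expansion produces one separate argument per array element.

```shell
#!/usr/bin/env bash
# Hypothetical file list, standing in for the real TRAIN_SPACY_FILES array.
TRAIN_SPACY_FILES=(
  ./training_data/de_hdt-ud-train.spacy
  ./training_data/de_gsd-ud-train.spacy
)

# Hypothetical helper: print every argument it receives on its own line.
show_args() {
  printf '<%s>\n' "$@"
}

# The quoted [@] expansion yields one argument per array element,
# so the second path arrives as a standalone argument after the first.
show_args --paths.train "${TRAIN_SPACY_FILES[@]}"
```

The first path is consumed as the value of --paths.train, and the second one is left over as an argument of its own, which matches the error above.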