
I trained my unsupervised model using the fasttext.train_unsupervised() function in Python. I want to save it as a .vec file, since I will use this file for the pretrainedVectors parameter in fasttext.train_supervised(). pretrainedVectors only accepts a .vec file, but I am having trouble creating this .vec file. Can someone help me?

P.S. I am able to save it in .bin format. It would also be helpful if you could suggest a way to convert a .bin file to a .vec file.

3 Answers


To obtain a VEC file containing just all the word vectors, I took inspiration from the official bin_to_vec example.

from fasttext import load_model

# load the original BIN model
f = load_model("YOUR-BIN-MODEL-PATH")

# get all words from the model
words = f.get_words()

with open("YOUR-VEC-FILE-PATH", 'w') as file_out:

    # the first line must contain the total number of words and the vector dimension
    file_out.write(str(len(words)) + " " + str(f.get_dimension()) + "\n")

    # line by line, append each word and its vector to the VEC file
    for w in words:
        v = f.get_word_vector(w)
        vstr = ""
        for vi in v:
            vstr += " " + str(vi)
        try:
            file_out.write(w + vstr + '\n')
        except (IOError, UnicodeEncodeError):
            # skip words that cannot be written
            pass

The obtained VEC file can be big. To reduce file size, you can adjust the format of vector components.

If you want to keep only 4 decimal digits, you can replace vstr += " " + str(vi) with
vstr += " " + "{:.4f}".format(vi)


3 Comments

ValueError: Dimension of pretrained vectors (7598805550878845300) does not match dimension (300)! Unfortunately it gives me this error when I try to use the vec file that I created in that way. It seems it doesn't keep the dimensions of the word vectors that are supposed to be 300.
I received a similar error: "ValueError: Dimension of pretrained vectors (0) does not match dimension (100)!" I fixed the problem by adding the output of str(len(words)) + " " + str(f.get_dimension()) as the first line of the file, as suggested by @darwin007.
I would use write mode "a" with extreme caution. In fact, there is no reason to use "a" after the last change to the answer. If you run that line of code more than once, you will append the word count, dimension, and all the words and vectors again every time. Using "w" instead of "a" rewrites the file on every run, which is what you probably want. Full-line solution: with open(YOUR-VEC-FILE-PATH,'w') as file_out:

You should add the number of words and the vector dimension as the first line of your .vec file, then use the -pretrainedVectors parameter, as sketched below.
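
A minimal sketch of prepending that header to an existing .vec file that is missing it (file names are placeholders):

# minimal sketch: prepend the "word_count dimension" header to a .vec file
# that lacks it; file names are placeholders
def add_vec_header(src_path, dst_path):
    with open(src_path, "r", encoding="utf-8") as src:
        lines = src.readlines()
    # infer the dimension from the first vector line: "word v1 v2 ... vd"
    dim = len(lines[0].rstrip().split(" ")) - 1
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(str(len(lines)) + " " + str(dim) + "\n")
        dst.writelines(lines)

add_vec_header("vectors_noheader.vec", "vectors.vec")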



You can also try generating your fasttext embeddings with the gensim library. The gensim model has a wv.save_word2vec_format function that makes generating .vec files straightforward.

from gensim.models import FastText

# data.txt contains one sentence per line
with open('data.txt', 'r') as f:
    sentences = f.readlines()

# apply your desired tokenisation method to the sentences;
# a simple whitespace split is used here as a placeholder
tokenized_sentences = [s.split() for s in sentences]

model = FastText(vector_size=300, window=5, min_count=1, sentences=tokenized_sentences, epochs=10)

# save vectors to a .vec file
model.wv.save_word2vec_format("embeddings.vec")
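
As an optional sanity check, a short sketch (assuming gensim is installed) that reads the exported file back with KeyedVectors to confirm the header and dimension:

from gensim.models import KeyedVectors

# the .vec file should load with the expected dimension and vocabulary size
kv = KeyedVectors.load_word2vec_format("embeddings.vec")
print(kv.vector_size)         # should print 300
print(len(kv.key_to_index))   # number of words written to the file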

