
I trained my unsupervised model using the fasttext.train_unsupervised() function in Python. I want to save it as a .vec file, since I will use this file for the pretrainedVectors parameter in fasttext.train_supervised(). pretrainedVectors only accepts a .vec file, but I am having trouble creating this .vec file. Can someone help me?

P.S. I am able to save it in .bin format. It would also be helpful if you could suggest a way to convert a .bin file to a .vec file.

3 Answers


To obtain a VEC file containing just all the word vectors, I took inspiration from the official bin_to_vec example.

from fasttext import load_model

# load the original BIN model
f = load_model("YOUR-BIN-MODEL-PATH")

# get all words from the model
words = f.get_words()

with open("YOUR-VEC-FILE-PATH", 'w') as file_out:

    # the first line must contain the total number of words and the vector dimension
    file_out.write(str(len(words)) + " " + str(f.get_dimension()) + "\n")

    # line by line, append each word and its vector to the VEC file
    for w in words:
        v = f.get_word_vector(w)
        vstr = ""
        for vi in v:
            vstr += " " + str(vi)
        try:
            file_out.write(w + vstr + '\n')
        except (IOError, UnicodeEncodeError):
            # skip words that cannot be written
            pass

The obtained VEC file can be big. To reduce file size, you can adjust the format of vector components.

If you want to keep only 4 decimal digits, you can replace vstr += " " + str(vi) with
vstr += " " + "{:.4f}".format(vi)


3 Comments

ValueError: Dimension of pretrained vectors (7598805550878845300) does not match dimension (300)! Unfortunately it gives me this error when I try to use the vec file that I created in that way. It seems it doesn't keep the dimensions of the word vectors that are supposed to be 300.
I received a similar error: "ValueError: Dimension of pretrained vectors (0) does not match dimension (100)!" I fixed the problem by adding the output of str(len(words)) + " " + str(f.get_dimension()) as the first line of the file, as suggested by @darwin007.
I would use write mode "a" with extreme caution. In fact, there is no reason to use "a" after the last change to the answer. If you run that line of code more than once, you will append the word count, dimension, and all the words and vectors again every time. Using "w" instead of "a" rewrites the file on every run, which is what you probably want. Full-line solution: with open(YOUR-VEC-FILE-PATH,'w') as file_out:

You should add the number of words and the vector dimension as the first line of your .vec file, then use the -pretrainedVectors parameter, as sketched below.
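
A minimal sketch of prepending that header to an existing .vec file that is missing it (file names are placeholders):

# minimal sketch: prepend the "word_count dimension" header to a .vec file
# that lacks it; file names are placeholders
def add_vec_header(src_path, dst_path):
    with open(src_path, "r", encoding="utf-8") as src:
        lines = src.readlines()
    # infer the dimension from the first vector line: "word v1 v2 ... vd"
    dim = len(lines[0].rstrip().split(" ")) - 1
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(str(len(lines)) + " " + str(dim) + "\n")
        dst.writelines(lines)

add_vec_header("vectors_noheader.vec", "vectors.vec")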



You can also try generating your fasttext embeddings with the gensim library. The gensim model has a wv.save_word2vec_format function that makes generating .vec files straightforward.

from gensim.models import FastText

# data.txt contains one sentence per line
with open('data.txt', 'r') as f:
    sentences = f.readlines()

# apply your desired tokenisation method to the sentences;
# a simple whitespace split is used here as a placeholder
tokenized_sentences = [s.split() for s in sentences]

model = FastText(vector_size=300, window=5, min_count=1, sentences=tokenized_sentences, epochs=10)

# save vectors to a .vec file
model.wv.save_word2vec_format("embeddings.vec")
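
As an optional sanity check, a short sketch (assuming gensim is installed) that reads the exported file back with KeyedVectors to confirm the header and dimension:

from gensim.models import KeyedVectors

# the .vec file should load with the expected dimension and vocabulary size
kv = KeyedVectors.load_word2vec_format("embeddings.vec")
print(kv.vector_size)         # should print 300
print(len(kv.key_to_index))   # number of words written to the file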

