
I am using a pre-trained fastText model (https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md).

I use Gensim to load the fastText model. It can output a vector for any word, whether it is seen or unseen (out-of-vocabulary):

from gensim.models.wrappers import FastText
en_model = FastText.load_fasttext_format('../wiki.en/wiki.en')
print(en_model['car'])
print(en_model['carcaryou'])

In TensorFlow, I know that I can use the code below to get trainable embeddings for seen words:

# Embedding layer
embeddings = tf.get_variable('embedding_matrix', [vocab_size, state_size], trainable=True)
rnn_inputs = tf.nn.embedding_lookup(embeddings, x)

The indices of known words are easy to get. However, for unseen words FastText "predicts" a vector from their sub-word (character n-gram) patterns, so those words have no index in the vocabulary.

In this case, how should I use TensorFlow to handle both known and unseen words with fastText?

  • Can you use tf.cond to detect whether the word is known (see tf.lookup for tools for that) coupled with tf.py_func to call FastText if the word is not known? Commented Oct 30, 2017 at 19:54
  • @AlexandrePassos Yes. I think it is doable. But what if I want the embeddings of both known and unseen words to be trainable? For these unseen words, I need to store their embeddings somewhere. Am I right? Commented Oct 30, 2017 at 20:57
  • I am struggling to find an answer to this question as well. Did you figure it out @Munichong? Commented Jul 6, 2018 at 17:48
  • @user1669710 No... Sorry Commented Jul 9, 2018 at 19:27
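As a rough sketch of the tf.lookup / tf.py_func idea from the comments above (untested, TensorFlow 1.x style; the vocabulary, the model object, and every name below are placeholders of mine, not from the question), one could keep a trainable embedding matrix for in-vocabulary words and fall back to fastText vectors for everything else:

import numpy as np
import tensorflow as tf

vocab = ["car", "house", "tree"]   # hypothetical in-vocabulary words
state_size = 300                   # fastText vector dimension

# Map word strings to indices; unseen words get default_value -1.
table = tf.contrib.lookup.index_table_from_tensor(
    mapping=tf.constant(vocab), default_value=-1)

words = tf.placeholder(tf.string, [None])   # a flat batch of word strings
ids = table.lookup(words)                   # int64 indices, -1 for unseen words
is_known = tf.greater_equal(ids, 0)

# Trainable embeddings for in-vocabulary words (ids clamped so -1 is a valid index).
embeddings = tf.get_variable('embedding_matrix', [len(vocab), state_size], trainable=True)
known_vecs = tf.nn.embedding_lookup(embeddings, tf.maximum(ids, 0))

def fasttext_lookup(arr):
    # `model` is a gensim FastText model loaded elsewhere (assumption).
    return np.stack([model.wv[w.decode("utf-8")] for w in arr]).astype(np.float32)

# fastText vectors for every word; no gradient flows through py_func.
oov_vecs = tf.py_func(fasttext_lookup, [words], tf.float32)
oov_vecs.set_shape([None, state_size])

# Per word: trainable vector if known, fixed fastText vector otherwise.
mask = tf.tile(tf.expand_dims(is_known, -1), [1, state_size])
rnn_inputs = tf.where(mask, known_vecs, oov_vecs)

With this layout only the rows of embedding_matrix are trained, the fastText vectors stay fixed, and sess.run(tf.tables_initializer()) has to be called before anything is evaluated.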

1 Answer


I found a workaround using tf.py_func:

def lookup(arr):
    # arr: numpy array of byte strings, shape (batch, time)
    global model
    global decode

    decoded_arr = decode(arr)
    # One 300-d fastText vector per word; rows stay zero when the lookup fails.
    new_arr = np.zeros((*arr.shape, 300))
    for s, sent in enumerate(decoded_arr):
        for w, word in enumerate(sent):
            try:
                new_arr[s, w] = model.wv[word]
            except Exception as e:
                print(e)
                new_arr[s, w] = np.zeros(300)
    return new_arr.astype(np.float32)

z = tf.py_func(lookup, [x], tf.float32, stateful=True, name=None)

This piece of code works (it uses a French model, but that does not matter):

import tensorflow as tf
import numpy as np
from gensim.models.wrappers import FastText

model = FastText.load_fasttext_format("../../Tracfin/dev/han/data/embeddings/cc.fr.300.bin")
decode = np.vectorize(lambda x: x.decode("utf-8"))

def lookup(arr):
    global model
    global decode

    decoded_arr = decode(arr)
    new_arr = np.zeros((*arr.shape, 300))
    for s, sent in enumerate(decoded_arr):
        for w, word in enumerate(sent):
            try:
                new_arr[s, w] = model.wv[word]
            except Exception as e:
                print(e)
                new_arr[s, w] = np.zeros(300)
    return new_arr.astype(np.float32)

def extract_words(token):
    # Split the line into word tokens on spaces
    out = tf.string_split([token], delimiter=" ")
    # Convert to Dense tensor, filling with default value
    out = tf.reshape(tf.sparse_tensor_to_dense(out, default_value="<pad>"), [-1])
    return out


textfile = "text.txt"
words = [
    "ceci est un texte hexabromocyclododécanes intéressant qui mentionne des",
    "mots connus et des mots inconnus commeceluici ou celui-là polybromobiphényle",
]

with open(textfile, "w") as f:
    f.write("\n".join(words))

tf.reset_default_graph()
padded_shapes = tf.TensorShape([None])
padding_values = "<pad>"

dataset = tf.data.TextLineDataset(textfile)
dataset = dataset.map(extract_words, num_parallel_calls=2)
dataset = dataset.shuffle(10000, reshuffle_each_iteration=True)
dataset = dataset.repeat()
dataset = dataset.padded_batch(3, padded_shapes, padding_values)
iterator = tf.data.Iterator.from_structure(
    dataset.output_types, dataset.output_shapes
)
dataset_init_op = iterator.make_initializer(dataset, name="dataset_init_op")
x = iterator.get_next()
z = tf.py_func(lookup, [x], tf.float32, stateful=True, name=None)
sess = tf.InteractiveSession()
sess.run(dataset_init_op)
y, w = sess.run([x, z])
y = decode(y)

print(
    "\nWords out of vocabulary: ",
    sum(1 for word in y.reshape(-1) if word not in model.wv.vocab),
)
print("Lookup worked: ", all(model.wv[y[0][0][0]] == w[0][0][0]))

Prints:

Words out of vocabulary:  6
Lookup worked:  True

I did not try to optimize things, especially the lookup loop; comments are welcome.
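One possible improvement, as an untested sketch that assumes the same global model and decode as above: memoize the per-word lookups so repeated words (such as the "<pad>" token) are only computed once.

from functools import lru_cache

@lru_cache(maxsize=100000)
def word_vector(word):
    # Cache fastText lookups; fall back to zeros when all n-grams are unknown.
    try:
        return model.wv[word]
    except KeyError:
        return np.zeros(300, dtype=np.float32)

def lookup(arr):
    decoded_arr = decode(arr)
    new_arr = np.zeros((*arr.shape, 300), dtype=np.float32)
    for s, sent in enumerate(decoded_arr):
        for w, word in enumerate(sent):
            new_arr[s, w] = word_vector(word)
    return new_arr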


2 Comments

This is a solution for bypassing tf.nn.embedding_lookup. It means that you do get a vector for every word if at least some character n-gram is known. But the embeddings aren't trainable (something the OP asked for in the comments), not even the ones for seen words.
Hi, I am trying to use Ted's solution in my code. It works by itself with some slight revision, but when I use it to replace the tf.nn.embedding_lookup part of my code, it gives me an error.
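On the trainability point in the first comment: gradients cannot flow through tf.py_func, so the fastText vectors themselves stay frozen. A rough sketch of one common workaround (my addition, not part of the answer above) is to treat them as fixed features and learn a trainable transformation on top:

z = tf.py_func(lookup, [x], tf.float32, stateful=True, name=None)
z.set_shape([None, None, 300])   # batch, time, fastText dimension

# Trainable linear map applied to the fixed fastText features.
projection = tf.get_variable('embedding_projection', [300, 300], trainable=True)
rnn_inputs = tf.tensordot(z, projection, axes=[[2], [0]])

This learns one shared matrix rather than per-word embeddings, which sidesteps the indexing problem for unseen words.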
