
I am using a pre-trained fastText model (https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md).

I use Gensim to load the fastText model. It can output a vector for any word, whether it is seen or unseen (out-of-vocabulary):

from gensim.models.wrappers import FastText
en_model = FastText.load_fasttext_format('../wiki.en/wiki.en')
print(en_model['car'])
print(en_model['carcaryou'])

In TensorFlow, I know that I can use the code below to get trainable embeddings for seen words:

# Embedding layer
embeddings = tf.get_variable('embedding_matrix', [vocab_size, state_size], trainable=True)
rnn_inputs = tf.nn.embedding_lookup(embeddings, x)

The indices of known words are easy to get. However, for unseen words FastText "predicts" a vector from their sub-word (character n-gram) patterns, so those words have no index in the vocabulary.

In this case, how should I use TensorFlow to handle both known and unseen words with fastText?

  • Can you use tf.cond to detect whether the word is known (see tf.lookup for tools for that) coupled with tf.py_func to call FastText if the word is not known? Commented Oct 30, 2017 at 19:54
  • @AlexandrePassos Yes. I think it is doable. But what if I want the embeddings of both known and unseen words to be trainable? For these unseen words, I need to store their embeddings somewhere. Am I right? Commented Oct 30, 2017 at 20:57
  • I am struggling to find an answer to this question as well. Did you figure it out @Munichong? Commented Jul 6, 2018 at 17:48
  • @user1669710 No... Sorry Commented Jul 9, 2018 at 19:27
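As a rough sketch of the tf.lookup / tf.py_func idea from the comments above (untested, TensorFlow 1.x style; the vocabulary, the model object, and every name below are placeholders of mine, not from the question), one could keep a trainable embedding matrix for in-vocabulary words and fall back to fastText vectors for everything else:

import numpy as np
import tensorflow as tf

vocab = ["car", "house", "tree"]   # hypothetical in-vocabulary words
state_size = 300                   # fastText vector dimension

# Map word strings to indices; unseen words get default_value -1.
table = tf.contrib.lookup.index_table_from_tensor(
    mapping=tf.constant(vocab), default_value=-1)

words = tf.placeholder(tf.string, [None])   # a flat batch of word strings
ids = table.lookup(words)                   # int64 indices, -1 for unseen words
is_known = tf.greater_equal(ids, 0)

# Trainable embeddings for in-vocabulary words (ids clamped so -1 is a valid index).
embeddings = tf.get_variable('embedding_matrix', [len(vocab), state_size], trainable=True)
known_vecs = tf.nn.embedding_lookup(embeddings, tf.maximum(ids, 0))

def fasttext_lookup(arr):
    # `model` is a gensim FastText model loaded elsewhere (assumption).
    return np.stack([model.wv[w.decode("utf-8")] for w in arr]).astype(np.float32)

# fastText vectors for every word; no gradient flows through py_func.
oov_vecs = tf.py_func(fasttext_lookup, [words], tf.float32)
oov_vecs.set_shape([None, state_size])

# Per word: trainable vector if known, fixed fastText vector otherwise.
mask = tf.tile(tf.expand_dims(is_known, -1), [1, state_size])
rnn_inputs = tf.where(mask, known_vecs, oov_vecs)

With this layout only the rows of embedding_matrix are trained, the fastText vectors stay fixed, and sess.run(tf.tables_initializer()) has to be called before anything is evaluated.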

1 Answer


I found a workaround using tf.py_func:

def lookup(arr):
    # arr: numpy array of byte strings, shape (batch, time)
    global model
    global decode

    decoded_arr = decode(arr)
    # One 300-d fastText vector per word; rows stay zero when the lookup fails.
    new_arr = np.zeros((*arr.shape, 300))
    for s, sent in enumerate(decoded_arr):
        for w, word in enumerate(sent):
            try:
                new_arr[s, w] = model.wv[word]
            except Exception as e:
                print(e)
                new_arr[s, w] = np.zeros(300)
    return new_arr.astype(np.float32)

z = tf.py_func(lookup, [x], tf.float32, stateful=True, name=None)

This piece of code works (it uses a French model, but that does not matter):

import tensorflow as tf
import numpy as np
from gensim.models.wrappers import FastText

model = FastText.load_fasttext_format("../../Tracfin/dev/han/data/embeddings/cc.fr.300.bin")
decode = np.vectorize(lambda x: x.decode("utf-8"))

def lookup(arr):
    global model
    global decode

    decoded_arr = decode(arr)
    new_arr = np.zeros((*arr.shape, 300))
    for s, sent in enumerate(decoded_arr):
        for w, word in enumerate(sent):
            try:
                new_arr[s, w] = model.wv[word]
            except Exception as e:
                print(e)
                new_arr[s, w] = np.zeros(300)
    return new_arr.astype(np.float32)

def extract_words(token):
    # Split the line into word tokens on spaces
    out = tf.string_split([token], delimiter=" ")
    # Convert to Dense tensor, filling with default value
    out = tf.reshape(tf.sparse_tensor_to_dense(out, default_value="<pad>"), [-1])
    return out


textfile = "text.txt"
words = [
    "ceci est un texte hexabromocyclododécanes intéressant qui mentionne des",
    "mots connus et des mots inconnus commeceluici ou celui-là polybromobiphényle",
]

with open(textfile, "w") as f:
    f.write("\n".join(words))

tf.reset_default_graph()
padded_shapes = tf.TensorShape([None])
padding_values = "<pad>"

dataset = tf.data.TextLineDataset(textfile)
dataset = dataset.map(extract_words, num_parallel_calls=2)
dataset = dataset.shuffle(10000, reshuffle_each_iteration=True)
dataset = dataset.repeat()
dataset = dataset.padded_batch(3, padded_shapes, padding_values)
iterator = tf.data.Iterator.from_structure(
    dataset.output_types, dataset.output_shapes
)
dataset_init_op = iterator.make_initializer(dataset, name="dataset_init_op")
x = iterator.get_next()
z = tf.py_func(lookup, [x], tf.float32, stateful=True, name=None)
sess = tf.InteractiveSession()
sess.run(dataset_init_op)
y, w = sess.run([x, z])
y = decode(y)

print(
    "\nWords out of vocabulary: ",
    sum(1 for word in y.reshape(-1) if word not in model.wv.vocab),
)
print("Lookup worked: ", all(model.wv[y[0][0][0]] == w[0][0][0]))

Prints:

Words out of vocabulary:  6
Lookup worked:  True

I did not try to optimize things, especially the lookup loop; comments are welcome.
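One possible improvement, as an untested sketch that assumes the same global model and decode as above: memoize the per-word lookups so repeated words (such as the "<pad>" token) are only computed once.

from functools import lru_cache

@lru_cache(maxsize=100000)
def word_vector(word):
    # Cache fastText lookups; fall back to zeros when all n-grams are unknown.
    try:
        return model.wv[word]
    except KeyError:
        return np.zeros(300, dtype=np.float32)

def lookup(arr):
    decoded_arr = decode(arr)
    new_arr = np.zeros((*arr.shape, 300), dtype=np.float32)
    for s, sent in enumerate(decoded_arr):
        for w, word in enumerate(sent):
            new_arr[s, w] = word_vector(word)
    return new_arr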


2 Comments

This is a solution for bypassing tf.nn.embedding_lookup. It means that you do get a vector for every word if at least some character n-gram is known. But the embeddings aren't trainable (something the OP asked for in the comments), not even the ones for seen words.
Hi, I am trying to use Ted's solution in my code. It works by itself with some slight revision, but when I use it to replace the tf.nn.embedding_lookup part of my code, it gives me an error.
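On the trainability point in the first comment: gradients cannot flow through tf.py_func, so the fastText vectors themselves stay frozen. A rough sketch of one common workaround (my addition, not part of the answer above) is to treat them as fixed features and learn a trainable transformation on top:

z = tf.py_func(lookup, [x], tf.float32, stateful=True, name=None)
z.set_shape([None, None, 300])   # batch, time, fastText dimension

# Trainable linear map applied to the fixed fastText features.
projection = tf.get_variable('embedding_projection', [300, 300], trainable=True)
rnn_inputs = tf.tensordot(z, projection, axes=[[2], [0]])

This learns one shared matrix rather than per-word embeddings, which sidesteps the indexing problem for unseen words.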
