
From the page, I got the code below:

from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
# define documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
# define class labels
labels = array([1,1,1,1,1,0,0,0,0,0])
# integer encode the documents
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))
  1. I looked at encoded_docs and noticed that the words done and work both have a one_hot encoding of 2. Why? Is it because uniqueness of the word-to-index mapping is not guaranteed, as per this page?
  2. I got the embeddings with embeddings = model.layers[0].get_weights()[0]. In that case, why do we get an embedding object of size 50? Even though two words have the same one_hot number, do they have different embeddings?
  3. How can I tell which embedding belongs to which word, i.e. done vs. work?
  4. I also found the code below on the page, which could help with finding the embedding of each word. But I don't know how to create word_to_index.

    # word_to_index is a mapping (i.e. dict) from words to their index, e.g. love: 69
    words_embeddings = {w: embeddings[idx] for w, idx in word_to_index.items()}

  5. Please check that my understanding of the parameter counts below is correct.

The first layer has 400 parameters because the total word count is 50 and the embedding has 8 dimensions, so 50*8 = 400.

The last layer has 33 parameters because each sentence has at most 4 words, so 4*8 = 32 from the flattened embeddings plus 1 for the bias, 33 in total.

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, 4, 8)              400       
_________________________________________________________________
flatten_3 (Flatten)          (None, 32)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33        
=================================================================
  6. Finally, if point 1 above is correct, is there a better way to get the embedding layer model.add(Embedding(vocab_size, 8, input_length=max_length)) without doing one-hot encoding via encoded_docs = [one_hot(d, vocab_size) for d in docs]?

+++++++++++++++++++++++++++++++ Update: providing the updated code

from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
# define documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
# define class labels
labels = array([1,1,1,1,1,0,0,0,0,0])


from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()

#this creates the dictionary
#IMPORTANT: MUST HAVE ALL DATA - including Test data
#IMPORTANT2: This method should be called only once!!!
tokenizer.fit_on_texts(docs)

#this transforms the texts in to sequences of indices
encoded_docs2 = tokenizer.texts_to_sequences(docs)

encoded_docs2

max_length = 4
padded_docs2 = pad_sequences(encoded_docs2, maxlen=max_length, padding='post')
max_index = array(padded_docs2).reshape((-1,)).max()



# define the model
model = Sequential()
model.add(Embedding(max_index+1, 8, input_length=max_length))# you cannot use just max_index 
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_docs2, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs2, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

embeddings = model.layers[0].get_weights()[0]

# look up a word's index first, then its embedding row
index = tokenizer.texts_to_sequences([['well']])[0][0]
embedding_for_well = embeddings[index]

# inspect the fitted tokenizer
tokenizer.document_count
tokenizer.word_index

1 Answer

1 - Yes, word uniqueness is not guaranteed; see the docs:

  • From one_hot: This is a wrapper to the hashing_trick function...
  • From hashing_trick: "Two or more words may be assigned to the same index, due to possible collisions by the hashing function. The probability of a collision is in relation to the dimension of the hashing space and the number of distinct objects."

It would be better to use a Tokenizer for this. (See question 4)

It's very important to remember that you should include all words at once when creating the indices. You cannot build a dictionary from 2 words, then from 2 more words, and so on: that produces inconsistent dictionaries.
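
For instance, a minimal sketch (assuming the same vocab_size of 50 as in the question) that hashes each word on its own, so a collision between done and work becomes visible directly:

from keras.preprocessing.text import one_hot

vocab_size = 50
# hash each word individually; two different words landing on the same number is a collision
words = ['well', 'done', 'good', 'work', 'great', 'effort', 'nice', 'poor']
print({w: one_hot(w, vocab_size)[0] for w in words})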


2 - Embeddings have the size 50 x 8, because that was defined in the embedding layer:

Embedding(vocab_size, 8, input_length=max_length)
  • vocab_size = 50 - this means there are 50 words in the dictionary
  • embedding_size = 8 - this is the true size of the embedding: each word is represented by a vector of 8 numbers.
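
A quick way to confirm this, assuming the trained model from the question, is to inspect the weight matrix of the first layer directly:

embeddings = model.layers[0].get_weights()[0]
print(embeddings.shape)  # (50, 8): one 8-dimensional row per possible index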

3 - You don't know. They use the same embedding.

The model will use the same embedding vector (the one for index = 2) for both words. This is not healthy for your model at all; you should use another method for creating indices (see question 1).
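
To make that concrete, here is a small check (a sketch, assuming the model and the collision observed in the question): if the two words hash to the same index, they retrieve the identical row of the weight matrix.

import numpy as np
from keras.preprocessing.text import one_hot

embeddings = model.layers[0].get_weights()[0]
idx_done = one_hot('done', 50)[0]
idx_work = one_hot('work', 50)[0]
# if the indices collide (as they did in the question), the vectors are identical
print(idx_done, idx_work, np.allclose(embeddings[idx_done], embeddings[idx_work]))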


4 - You can create a word dictionary manually, or use the Tokenizer class.

Manually:

Make sure you remove punctuation and make all words lower case.

Then just create a dictionary entry for each word you have:

dictionary = dict()
current_key = 1

for doc in docs:
    for word in doc.split(' '):
        #remove punctuation and lower-case the word, so 'done!' and 'done' share one key
        word = word.lower().strip('.,!?')

        if not (word in dictionary):
            dictionary[word] = current_key
            current_key += 1
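
With that in place, the word_to_index mapping from question 4 is just this dictionary, so (as a sketch, assuming the Embedding layer was sized to at least len(dictionary) + 1 rows and embeddings was read from the trained model) the lookup from the question becomes:

word_to_index = dictionary
words_embeddings = {w: embeddings[idx] for w, idx in word_to_index.items()}
print(words_embeddings['done'])  # the 8-dimensional vector learned for 'done'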

Tokenizer:

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()

#this creates the dictionary
#IMPORTANT: MUST HAVE ALL DATA - including Test data
#IMPORTANT2: This method should be called only once!!!
tokenizer.fit_on_texts(docs)

#this transforms the texts in to sequences of indices
encoded_docs2 = tokenizer.texts_to_sequences(docs)

See the output of encoded_docs2:

[[6, 2], [3, 1], [7, 4], [8, 1], [9], [10], [5, 4], [11, 3], [5, 1], [12, 13, 2, 14]]
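
The mapping itself is stored on the tokenizer as word_index, which also answers question 3 for this approach:

print(tokenizer.word_index)
# {'work': 1, 'done': 2, 'good': 3, 'effort': 4, 'poor': 5, 'well': 6, 'great': 7,
#  'nice': 8, 'excellent': 9, 'weak': 10, 'not': 11, 'could': 12, 'have': 13, 'better': 14}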

See the maximum index:

padded_docs2 = pad_sequences(encoded_docs2, maxlen=max_length, padding='post')
max_index = array(padded_docs2).reshape((-1,)).max()

So, your vocab_size should be 15, that is max_index + 1 (otherwise you'd have lots of useless, though harmless, embedding rows). Notice that 0 was not used as an index; it only appears in the padding.
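
This is also why the updated code in the question rebuilds the model with max_index + 1 rather than max_index (a sketch, reusing max_index and max_length from the snippets above):

vocab_size = max_index + 1   # Tokenizer indices run 1..14 and padding adds 0, so 15 rows are needed
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))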

Do not "fit" the tokenizer again! Only use texts_to_sequences() or other methods here that are not related to "fitting".

Hint: it might be useful to include end_of_sentence words in your text sometimes.

Hint 2: it is a good idea to save your Tokenizer to be used later (since it has a specific dictionary for your data, created with fit_on_texts).

#save:
text_to_save = tokenizer.to_json()
with open('tokenizer.json', 'w') as f:   # write the JSON string to a file
    f.write(text_to_save)

#load:
from keras.preprocessing.text import tokenizer_from_json
with open('tokenizer.json') as f:        # read the JSON string back
    loaded_text = f.read()
tokenizer = tokenizer_from_json(loaded_text)

5 - Params for embedding are correct.

Dense:

Params for Dense are always based on the preceding layer (the Flatten in this case).

The formula is: previous_output * units + units

This results in 32 (from the Flatten) * 1 (Dense units) + 1 (one bias per unit) = 33

Flatten:

It multiplies all the previous dimensions together: the Embedding outputs length = 4 and embedding_size = 8, so the Flatten output is 4 * 8 = 32.
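
As a cross-check (a sketch for the original model with vocab_size = 50), you can reproduce these numbers by hand or ask Keras for them via count_params():

print(50 * 8)      # Embedding: vocab_size * embedding_size = 400
print(4 * 8)       # Flatten output size: input_length * embedding_size = 32
print(32 * 1 + 1)  # Dense: previous_output * units + units = 33
print([layer.count_params() for layer in model.layers])  # [400, 0, 33]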


6 - The Embedding layer does not depend on your data or how you preprocess it.

The Embedding layer simply has size 50 x 8 because you told it so (see question 2).

There are, of course, better ways of preprocessing the data; see question 4.

That will help you choose a better vocab_size (which is the dictionary size).

Seeing the embedding of a word:

Get the embeddings matrix:

embeddings = model.layers[0].get_weights()[0]

Choose any word index:

embedding_for_word_7 = embeddings[7]

That's all.

If you're using a tokenizer, get the word index with:

index = tokenizer.texts_to_sequences([['word']])[0][0]
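
Putting the pieces together (a sketch, assuming the Tokenizer-based model from the update in the question), you can also build the complete words_embeddings dictionary that question 4 asked about, just like the manual-dictionary version above:

embeddings = model.layers[0].get_weights()[0]
word_to_index = tokenizer.word_index
words_embeddings = {w: embeddings[idx] for w, idx in word_to_index.items()}
print(words_embeddings['done'])  # the 8-dimensional vector learned for 'done'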

2 Comments

I am going through your answer. Thanks for a detailed reply. Would it be possible to provide complete code that includes answers 3 and 4?
I used your suggestions. I noticed that I have to use model.add(Embedding(max_index+1, 8, input_length=max_length)) instead of model.add(Embedding(max_index, 8, input_length=max_length)). max_index is 14 and we added 0 for padding... is that adjustment correct?
