I'm currently training a seq2seq encoder-decoder network built with LSTMs in TensorFlow 2.x. The main problem right now is that the loss goes to NaN and the predictions returned are all NaN as well. I understand the possibility of exploding/vanishing gradients and have tried several ways to combat it (e.g. adding a min-max normalization layer, adding L2 regularization, using clipnorm/clipvalue, changing the learning rate, etc.). Almost every method I could find for this issue has been tried, but it persists.
The architecture is as follows:
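To be concrete about what "clipnorm/clipvalue and changing the learning rate" means, these are roughly the optimizer variants I tried (a minimal sketch; the exact values I experimented with varied, and clipnorm/clipvalue/learning_rate are standard Adam arguments, not anything specific to this model):

from tensorflow import keras

# variant 1: clip the global gradient norm, keep the default learning rate
optimizer = keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)
# variant 2: clip each gradient element instead, with a lower learning rate
# optimizer = keras.optimizers.Adam(learning_rate=1e-4, clipvalue=0.5)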
import numpy as np
from tensorflow import keras

embedding_size = 16
INPUT_LENGTH = X.shape[1]
# MAX_OUTPUT_LENGTH = y.shape[1]
MAX_OUTPUT_LENGTH = 10
# for min-max normalization - create a min-max layer
global_min = np.min(X_train)
global_max = np.max(X_train)
min_max_layer = keras.layers.Lambda(lambda x: (x - global_min) / (global_max - global_min))
# define encoder model
encoder = keras.models.Sequential()
encoder.add(keras.layers.Embedding(input_dim=len(aa_tokenizer.word_index) + 1,
                                   output_dim=embedding_size,
                                   input_shape=[None]))
                                   # input_length=MAX_OUTPUT_LENGTH))
encoder.add(min_max_layer)
encoder.add(keras.layers.LSTM(16, activity_regularizer=keras.regularizers.L2(0.1)))
# define decoder model
decoder = keras.models.Sequential()
# decoder.add(min_max_layer)
decoder.add(keras.layers.LSTM(16, return_sequences=True, activity_regularizer=keras.regularizers.L2(0.1)))
# decoder.add(min_max_layer)
decoder.add(keras.layers.Dense(len(codon_tokenizer.word_index) + 1, activation='softmax'))
# define inference model
model = keras.models.Sequential([encoder, keras.layers.RepeatVector(MAX_OUTPUT_LENGTH), decoder])
optimizer = keras.optimizers.Adam(learning_rate=0.001, clipnorm=1.0)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
              metrics=["accuracy"])
history = model.fit(X_train[:, :10, :], y_train[:, :10], epochs=2, validation_split=0.15)
With the output: loss: nan - accuracy: 0.0527 - val_loss: nan - val_accuracy: 0.0000e+00
(Note: this network currently trains on sequences of length 10 for speed while testing, but the original goal is to train on sequences of length 2,400.)
@noober I'm using sparse_categorical_crossentropy for the loss function. Let me look into how I can use print() statements to trace the network's path and see where the nan appears. In addition, the loss appears to be a real number for the first few batches but then suddenly jumps to nan.
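As a starting point for that tracing, something like the following should show the batch at which the loss first becomes non-finite (a minimal sketch, assuming the same model/X_train/y_train as above; keras.callbacks.Callback and keras.callbacks.TerminateOnNaN are standard Keras features, and BatchLossLogger is just an illustrative name):

import numpy as np
from tensorflow import keras

class BatchLossLogger(keras.callbacks.Callback):
    # print the loss after every batch so the first nan/inf batch is visible
    def on_train_batch_end(self, batch, logs=None):
        loss = logs.get("loss")
        print(f"batch {batch}: loss = {loss}")
        if loss is not None and not np.isfinite(loss):
            print(f"loss became non-finite at batch {batch}")

history = model.fit(X_train[:, :10, :], y_train[:, :10], epochs=2,
                    validation_split=0.15,
                    callbacks=[BatchLossLogger(),
                               keras.callbacks.TerminateOnNaN()])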