Every time you type on your smartphone, you see three words pop up as suggestions. That’s a predictive keyboard in action. These suggestions aren’t random: they come from deep learning models that have learned language patterns from large amounts of text data. So, if you want to learn how the model behind a predictive keyboard is built, this article is for you. In it, I’ll take you through the task of building a predictive keyboard model with PyTorch.
Building a Predictive Keyboard Model Using PyTorch
The task of building a predictive keyboard model includes these steps:
- Tokenizing and preparing natural language data
- Building a vocabulary and converting words to indices
- Training a next-word prediction model using LSTMs
- Generating top-k predictions like a predictive keyboard
We will work through these steps one by one, and in the end, we will see the model generate three suggestions for the next word, just like the predictive keyboard on your smartphone.
The richer the data, the better your model will generalize. So, the dataset we will use is based on the stories of Sherlock Holmes. You can find this dataset here.
Step 1: Preparing the Dataset
We will start with tokenizing the text data and converting everything to lowercase:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')
# load data
with open('sherlock-holm.es_stories_plain-text_advs.txt', 'r', encoding='utf-8') as f:
    text = f.read().lower()
tokens = word_tokenize(text)
print("Total Tokens:", len(tokens))Total Tokens: 125772
Here, we converted the text to lowercase (to maintain consistency) and used word_tokenize to break the entire corpus into word-level tokens. This prepares our data for model training by converting raw text into a structured format that the model can understand.
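As a quick illustration (this snippet is separate from the pipeline above), here is how word_tokenize splits a short sentence into word and punctuation tokens:

sample = "it was a dark and stormy night."
print(word_tokenize(sample))
# ['it', 'was', 'a', 'dark', 'and', 'stormy', 'night', '.']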
Step 2: Creating a Vocabulary
Next, we need a way to convert words into numbers. So we will create:
- a dictionary to map each word to an index
- and another dictionary to reverse it back
So, let’s build the vocabulary and create word-to-index mappings:
from collections import Counter
word_counts = Counter(tokens)
vocab = sorted(word_counts, key=word_counts.get, reverse=True)
word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for word, idx in word2idx.items()}
vocab_size = len(vocab)

Here, we counted how often each word appears using Counter, then sorted the vocabulary from most to least frequent. This sorted list helps us assign lower indices to more common words (useful for embeddings). Then, we created word2idx and idx2word dictionaries to convert words to unique IDs and back. Finally, we stored the total vocabulary size, which will define the input and output dimensions for our model.
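If you want to sanity-check the mappings (an optional step; the exact words and counts depend on the corpus, though common tokens like “the” will sit at the lowest indices), you can print the most frequent entries and do a round trip between a word and its index:

# inspect the five most frequent tokens and their indices
for word in vocab[:5]:
    print(word, "-> index", word2idx[word], "| count:", word_counts[word])

# round trip: word -> index -> word
print(idx2word[word2idx['holmes']])  # 'holmes'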
Step 3: Building Input-Output Sequences
To predict the next word, the model needs context. We can use a sliding window approach. So, let’s create input-target sequences for next word prediction:
import torch

sequence_length = 4  # e.g., "I am going to [predict this]"
data = []

for i in range(len(tokens) - sequence_length):
    input_seq = tokens[i:i + sequence_length - 1]
    target = tokens[i + sequence_length - 1]
    data.append((input_seq, target))

# convert words to indices
def encode(seq):
    return [word2idx[word] for word in seq]

encoded_data = [(torch.tensor(encode(inp)), torch.tensor(word2idx[target]))
                for inp, target in data]

Here, we used a sliding window approach to generate training samples: for every group of 3 consecutive words (the input), we predict the next word (the target). This prepares the data for sequence modelling.

Then, we defined an encode function to convert each word in a sequence into its corresponding index using our vocabulary. Finally, we built encoded_data, a list of (input_tensor, target_tensor) pairs, where each input is a tensor of word indices and the target is the index of the next word to be predicted.
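To make the windowing concrete, here is the same logic applied to a toy list of tokens (purely illustrative; the real pairs are built from the Sherlock Holmes tokens):

toy_tokens = ['i', 'am', 'going', 'to', 'predict', 'this']
for i in range(len(toy_tokens) - sequence_length):
    print(toy_tokens[i:i + sequence_length - 1], "->", toy_tokens[i + sequence_length - 1])
# ['i', 'am', 'going'] -> to
# ['am', 'going', 'to'] -> predict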
Step 4: Designing the Model Architecture
For sequence data, LSTMs are still the go-to. They can remember patterns across time steps, which makes them perfect for language modelling. So, let’s define the LSTM-based Predictive Keyboard model:
import torch.nn as nn
class PredictiveKeyboard(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super(PredictiveKeyboard, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        x = self.embedding(x)
        output, _ = self.lstm(x)
        output = self.fc(output[:, -1, :])  # last LSTM time step
        return output

This class defines our neural network model. First, the Embedding layer converts word indices into dense vectors. These embeddings are then passed through an LSTM layer, which captures the sequential context of the input.
Finally, we take the output of the last time step and feed it through a Linear layer to get a vector of size vocab_size, representing the predicted probabilities for each word in the vocabulary. This architecture allows the model to learn patterns and dependencies in word sequences for next-word prediction.
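As an optional sanity check (assuming vocab_size from Step 2 is in scope), you can push a dummy batch of word indices through an untrained model and confirm that it returns one score per vocabulary word for each sequence:

# dummy batch: 2 sequences of 3 word indices each
dummy_model = PredictiveKeyboard(vocab_size)
dummy_input = torch.randint(0, vocab_size, (2, 3))
with torch.no_grad():
    logits = dummy_model(dummy_input)
print(logits.shape)  # torch.Size([2, vocab_size])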
Step 5: Training the Model
We’ll use CrossEntropyLoss (the standard loss for classification tasks) and, to keep training fast, train on a subset of the sequences in each epoch. So, let’s train the model on the input-target word sequences:
import torch
import torch.optim as optim
import random
model = PredictiveKeyboard(vocab_size)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.005)
epochs = 20
for epoch in range(epochs):
    total_loss = 0
    random.shuffle(encoded_data)
    for input_seq, target in encoded_data[:10000]:  # Limit data for speed
        input_seq = input_seq.unsqueeze(0)
        output = model(input_seq)
        loss = criterion(output, target.unsqueeze(0))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")

Epoch 1, Loss: 66887.7021
Epoch 2, Loss: 67179.4026
Epoch 3, Loss: 68716.7401
Epoch 4, Loss: 72399.9313
Epoch 5, Loss: 72274.3920
Epoch 6, Loss: 73415.8201
Epoch 7, Loss: 73517.9691
Epoch 8, Loss: 75054.6567
Epoch 9, Loss: 75603.8497
Epoch 10, Loss: 75316.2069
Epoch 11, Loss: 77058.3788
Epoch 12, Loss: 77787.7627
Epoch 13, Loss: 77840.5772
Epoch 14, Loss: 79864.1127
Epoch 15, Loss: 77990.9982
Epoch 16, Loss: 80775.6010
Epoch 17, Loss: 79951.5634
Epoch 18, Loss: 80578.3322
Epoch 19, Loss: 81427.6879
Epoch 20, Loss: 81869.4849
Here, we instantiated the model, defined a loss function (CrossEntropyLoss), and used the Adam optimizer for efficient gradient updates. During each training epoch, we shuffled the dataset for better generalization. For each training sample, we added a batch dimension to the input, computed the output, and calculated the loss between predicted and actual next-word indices.
Then we performed backpropagation, updated the weights, and accumulated the total loss for tracking. This loop trains the model to predict the next word based on the previous sequence.
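Note that this loop updates the weights one sample at a time, which is slow and gives noisy gradients. As a rough sketch (not part of the original walkthrough), the same pairs could be wrapped in a PyTorch DataLoader so that each update uses a mini-batch:

from torch.utils.data import DataLoader, TensorDataset

# stack the (input, target) pairs built in Step 3 into two tensors
inputs = torch.stack([inp for inp, _ in encoded_data])    # shape: (num_samples, 3)
targets = torch.stack([tgt for _, tgt in encoded_data])   # shape: (num_samples,)
loader = DataLoader(TensorDataset(inputs, targets), batch_size=64, shuffle=True)

for batch_inputs, batch_targets in loader:
    output = model(batch_inputs)              # shape: (batch_size, vocab_size)
    loss = criterion(output, batch_targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()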
Predicting the Next Words
Now, we will use this model just like a smartphone keyboard. Instead of predicting just one word, we will mimic how smartphone keyboards suggest three possible next words. So, let’s generate the top 3 next-word predictions like a predictive keyboard:
import torch.nn.functional as F
def suggest_next_words(model, text_prompt, top_k=3):
    model.eval()
    tokens = word_tokenize(text_prompt.lower())
    if len(tokens) < sequence_length - 1:
        raise ValueError(f"Input should be at least {sequence_length - 1} words long.")
    input_seq = tokens[-(sequence_length - 1):]
    input_tensor = torch.tensor(encode(input_seq)).unsqueeze(0)
    with torch.no_grad():
        output = model(input_tensor)
        probs = F.softmax(output, dim=1).squeeze()
        top_indices = torch.topk(probs, top_k).indices.tolist()
    return [idx2word[idx] for idx in top_indices]

print("Suggestions:", suggest_next_words(model, "So, are we really at"))

Suggestions: ['the', 'his', 'a']
This function takes a user input like “So, are we really at”, tokenizes and encodes the last few words, and passes them through the trained model to get output scores.
These scores are then converted into probabilities using softmax, and the top k predictions (like the three most probable next words) are selected using torch.topk. The function then maps these indices back to actual words using idx2word, mimicking the behaviour of a real predictive keyboard.
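You can call the function with any prompt of at least three words. One caveat worth knowing: encode looks words up in word2idx directly, so a prompt containing a word the corpus has never seen would raise a KeyError. For example:

# ask for five suggestions instead of three (the output depends on your training run)
print(suggest_next_words(model, "it was the end of", top_k=5))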
Summary
And that’s it. You have just built the core of a predictive keyboard using PyTorch. From tokenizing raw text to training an LSTM and generating top-3 next-word suggestions, you’ve seen how deep learning models can understand and predict language patterns. I hope you liked this article on building a predictive keyboard model using PyTorch. Feel free to ask your questions in the comments section below. You can follow me on Instagram for many more resources.