Image to Sequence: Image Captioning
In this tutorial, we will use EIR for image-to-sequence tasks.
Image-to-sequence (img-to-seq) models convert an input image into a sequence of words.
This is useful for tasks such as image captioning, where the model generates a description of the contents of an image.
Here, we will generate captions for images from the COCO 2017 dataset.
A - Data
You can download the data for this tutorial here.
After downloading the data, the folder structure should look like this (we will look at the configs in a bit):
eir_tutorials/c_sequence_output/03_image_captioning
├── conf
│   ├── fusion.yaml
│   ├── globals.yaml
│   ├── inputs_resnet18.yaml
│   └── output.yaml
└── data
    ├── captions.csv
    └── images
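Before moving on, it can be useful to quickly verify the download. The short sketch below (standard library only, using the paths from the tree above) counts the images and prints the first few lines of captions.csv; the exact contents will depend on the downloaded data:

from pathlib import Path

data_root = Path("eir_tutorials/c_sequence_output/03_image_captioning/data")

# Count the downloaded images
n_images = len(list((data_root / "images").glob("*.jpg")))
print(f"Found {n_images} images")

# Peek at the first few lines of the captions file
with open(data_root / "captions.csv") as f:
    for _ in range(3):
        print(f.readline().rstrip())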
B - Training
Training follows a similar approach as we saw in the previous tutorial, Sequence Generation: Generating Movie Reviews.
For reference, here are the configurations:
globals.yaml

basic_experiment:
  batch_size: 64
  memory_dataset: false
  n_epochs: 3
  output_folder: eir_tutorials/tutorial_runs/c_sequence_output/03_image_captioning
  valid_size: 500
  dataloader_workers: 8
evaluation_checkpoint:
  checkpoint_interval: 500
  n_saved_models: 1
  sample_interval: 500
optimization:
  lr: 0.001
  optimizer: adabelief

fusion.yaml

model_type: "pass-through"

inputs_resnet18.yaml

input_info:
  input_source: eir_tutorials/c_sequence_output/03_image_captioning/data/images
  input_name: image_captioning
  input_type: image
input_type_info:
  size:
    - 64
  auto_augment: true
model_config:
  model_type: "resnet18"
  pretrained_model: True

output.yaml

output_info:
  output_source: eir_tutorials/c_sequence_output/03_image_captioning/data/captions.csv
  output_name: captions
  output_type: sequence
output_type_info:
  max_length: 32
  split_on: " "
  sampling_strategy_if_longer: "uniform"
  min_freq: 20
model_config:
  embedding_dim: 128
  model_init_config:
    num_layers: 6
sampling_config:
  generated_sequence_length: 64
  n_eval_inputs: 10
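To make the sequence output settings above concrete, here is a small illustrative sketch of what split_on, max_length, and min_freq amount to. This is a simplified approximation for intuition only, not EIR's actual tokenization code; the toy captions and the lowered min_freq value are made up for the example:

from collections import Counter

# Toy captions, purely for illustration
captions = [
    "A plate of food on a table",
    "A zebra standing in a field",
    "A horse grazing in a field",
]

max_length = 32  # output_type_info.max_length: longer captions are shortened
min_freq = 2     # the tutorial uses 20; lowered here so the toy data shows the effect

# split_on " ": whitespace tokenization, then truncate to max_length tokens
tokenized = [caption.split(" ")[:max_length] for caption in captions]

# min_freq: tokens seen fewer than min_freq times are excluded from the vocabulary
counts = Counter(token for tokens in tokenized for token in tokens)
vocab = {token for token, count in counts.items() if count >= min_freq}
print(sorted(vocab))  # ['A', 'a', 'field', 'in']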
Like previously, we will start by training a model only on the text to establish a baseline:
eirtrain \
--global_configs eir_tutorials/c_sequence_output/03_image_captioning/conf/globals.yaml \
--fusion_configs eir_tutorials/c_sequence_output/03_image_captioning/conf/fusion.yaml \
--output_configs eir_tutorials/c_sequence_output/03_image_captioning/conf/output.yaml \
--globals.basic_experiment.output_folder=eir_tutorials/tutorial_runs/c_sequence_output/03_image_captioning_text_only
When running the command above, I got the following training curve:
Now, we will train a model that uses both the image and the text:
eirtrain \
--global_configs eir_tutorials/c_sequence_output/03_image_captioning/conf/globals.yaml \
--input_configs eir_tutorials/c_sequence_output/03_image_captioning/conf/inputs_resnet18.yaml \
--fusion_configs eir_tutorials/c_sequence_output/03_image_captioning/conf/fusion.yaml \
--output_configs eir_tutorials/c_sequence_output/03_image_captioning/conf/output.yaml
When running the command above, I got the following training curve:
The fact that the validation loss is lower indicates that the model is likely able to use the image to improve the quality of the captions.
After training, we can look at some of the generated captions:
While the captions seem to be somewhat related to the images, they are far from perfect. As the validation loss is still decreasing, we could train the model for longer, try a larger model, use larger images, or use a larger dataset.
D - Serving
In this final section, we demonstrate serving our trained image captioning model as a web service and interacting with it using HTTP requests.
Starting the Web Service
To serve the model, use the following command:
eirserve --model-path [MODEL_PATH]
Replace [MODEL_PATH] with the actual path to your trained model. This command initiates a web service that listens for incoming requests.
Here is an example of the command:
eirserve \
--model-path eir_tutorials/tutorial_runs/c_sequence_output/03_image_captioning/saved_models/03_image_captioning_checkpoint_10500_perf-average=-1.5105.pt
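The checkpoint filename above contains the iteration number and validation performance, so it will differ between runs. Below is a minimal sketch (standard library only) for picking the most recently written checkpoint from the run's saved_models folder; the run folder path follows the output_folder set in globals.yaml:

from pathlib import Path

run_folder = Path("eir_tutorials/tutorial_runs/c_sequence_output/03_image_captioning")

# Pick the newest .pt checkpoint written during training
checkpoints = sorted(
    (run_folder / "saved_models").glob("*.pt"),
    key=lambda path: path.stat().st_mtime,
)
print(checkpoints[-1])  # pass this path to eirserve --model-path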
Sending Requests
With the server running, we can now send requests containing images and receive generated captions in return.
Here’s an example Python function demonstrating this process:
import base64
from io import BytesIO

import requests
from PIL import Image


def encode_image_to_base64(file_path: str) -> str:
    with Image.open(file_path) as image:
        buffered = BytesIO()
        image.save(buffered, format="JPEG")
        return base64.b64encode(buffered.getvalue()).decode("utf-8")


def send_request(url: str, payload: list[dict]) -> list[dict]:
    response = requests.post(url, json=payload)
    response.raise_for_status()
    return response.json()


image_base = "eir_tutorials/c_sequence_output/03_image_captioning/data/images"
payload = [
    {
        "image_captioning": encode_image_to_base64(f"{image_base}/000000000009.jpg"),
        "captions": "",
    },
    {
        "image_captioning": encode_image_to_base64(f"{image_base}/000000000034.jpg"),
        "captions": "",
    },
    {
        "image_captioning": encode_image_to_base64(f"{image_base}/000000581929.jpg"),
        "captions": "A horse",
    },
]

response = send_request(url="http://localhost:8000/predict", payload=payload)
print(response)
When running this, we get the following output:
{
    "result": [
        {
            "captions": "A white plate with vegetables on a wooden table."
        },
        {
            "captions": "A zebra grazing on grass near a tree in the field."
        },
        {
            "captions": "A horse standing in a field next to a cow."
        }
    ]
}
Analyzing Responses
Before analyzing the responses, let’s view the images that were used for generating captions:
000000000009.jpg
000000000034.jpg
000000581929.jpg
After sending requests to the served model, the responses can be analyzed. These responses provide insights into the model’s capability to generate captions for the input images. Note that the third request includes the partial caption "A horse": the model treats it as a prompt and continues the sequence from it, as seen in the response below.
[
    {
        "request": [
            {
                "image_captioning": "eir_tutorials/c_sequence_output/03_image_captioning/data/images/000000000009.jpg",
                "captions": ""
            },
            {
                "image_captioning": "eir_tutorials/c_sequence_output/03_image_captioning/data/images/000000000034.jpg",
                "captions": ""
            },
            {
                "image_captioning": "eir_tutorials/c_sequence_output/03_image_captioning/data/images/000000581929.jpg",
                "captions": "A horse"
            }
        ],
        "response": {
            "result": [
                {
                    "captions": "A plate of food on a white plate"
                },
                {
                    "captions": "A zebra standing on top of a lush green field."
                },
                {
                    "captions": "A horse grazing in a green field next to a"
                }
            ]
        }
    }
]
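To inspect the results programmatically rather than reading the raw JSON above, here is a short sketch that pairs each request with its generated caption. It assumes the payload and response variables from the earlier request example are still in scope:

# Pair each request entry with the corresponding generated caption
for request_item, result in zip(payload, response["result"]):
    prompt = request_item["captions"] or "<empty>"
    print(f"prompt: {prompt} -> caption: {result['captions']}")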
Thank you for reading!