An intelligent image captioning application that automatically generates descriptive captions for images using deep learning models. The project offers two model options: a custom-trained InceptionV3 + LSTM model and a lightweight pretrained ViT-GPT2 model.
- Overview
- Features
- Model Architecture
- Performance
- Installation
- Usage
- Project Structure
- Model Training
- Sample Results
- Technologies Used
- Future Improvements
- Contributing
- License
## Overview

This project implements an image captioning system that combines Computer Vision and Natural Language Processing to automatically generate human-readable descriptions of images. The application provides a user-friendly web interface built with Streamlit, allowing users to upload images and receive captions instantly.
## Features

- Dual Model Support: Choose between custom-trained or pretrained models
- Custom Model: InceptionV3 (CNN) + LSTM architecture trained on the Flickr8k dataset
- Pretrained Model: ViT-GPT2 model for quick and accurate captions
- Interactive Web Interface: Easy-to-use Streamlit application
- Real-time Processing: Generate captions in seconds
- Image Upload: Support for JPG, JPEG, and PNG formats
- Cross-platform: Works on Windows, macOS, and Linux
## Model Architecture

### Custom Model (InceptionV3 + LSTM)

- Feature Extractor: InceptionV3 pretrained on ImageNet
- Input: 299x299x3 RGB images
- Output: 2048-dimensional feature vector
- Caption Generator: LSTM-based sequence model
- Processes image features and generates word sequences
- Uses custom tokenizer trained on caption dataset
- Max caption length: 40 words
### Pretrained Model (ViT-GPT2)

- Vision Encoder: Vision Transformer (ViT)
- Language Decoder: GPT-2 architecture
- Source: `nlpconnect/vit-gpt2-image-captioning` from Hugging Face
- Optimized for speed and accuracy
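The pretrained model can be exercised directly through the Hugging Face `pipeline` API. The snippet below is a minimal sketch (it assumes `transformers` with a PyTorch backend is installed and uses a placeholder image path), not the exact code in `app.py`:

```python
# Minimal sketch: caption an image with the pretrained ViT-GPT2 model.
# "sample.jpg" is a placeholder path; any local image or PIL.Image works.
from transformers import pipeline

captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
result = captioner("sample.jpg")
print(result[0]["generated_text"])
```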
## Performance

- Average BLEU Score: 0.1233 (across 10 samples; see the evaluation sketch after this list)
- Training Dataset: Flickr8k (8,000 images)
- Model Training: Performed on Kaggle with GPU acceleration
- System Limitations: Due to computational constraints, training was limited to 8k images
- Scalability: With more powerful hardware (higher GPU memory, extended training time), the model can scale to:
- Flickr30k (30,000 images) - Expected BLEU improvement to 0.25-0.30
- MS COCO (100,000+ images) - Expected BLEU improvement to 0.35-0.45+
- Inference Time:
- Custom Model: ~2-3 seconds per image
- Pretrained Model: ~1-2 seconds per image
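The BLEU figure above can be reproduced along the following lines. This is a sketch with toy data, assuming `nltk` is installed; in practice the references come from the Flickr8k test split and the hypotheses from the model:

```python
# Minimal sketch of corpus-level BLEU evaluation with NLTK (toy data shown).
from nltk.translate.bleu_score import corpus_bleu

# Each image has several reference captions and one generated caption.
reference_captions = [
    ["a dog runs through the grass", "a brown dog is running outside"],
]
generated_captions = ["a dog is running in a field"]

references = [[ref.split() for ref in refs] for refs in reference_captions]
hypotheses = [cap.split() for cap in generated_captions]

print("Corpus BLEU:", corpus_bleu(references, hypotheses))
```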
## Installation

### Prerequisites

- Python 3.8 or higher
- pip package manager
- 4GB+ RAM recommended
- (Optional) GPU for faster inference
Clone the repository:

```bash
git clone https://github.com/yourusername/image-caption-generator.git
cd image-caption-generator
```

Create and activate a virtual environment:

```bash
# Windows
python -m venv venv
venv\Scripts\activate

# macOS/Linux
python3 -m venv venv
source venv/bin/activate
```

Install the dependencies:

```bash
pip install -r requirements.txt
```

Ensure you have the following files in your project directory:

- `image_caption_model.h5` - Your trained model weights
- `tokenizer.pkl` - Tokenizer for the custom model
For training your own model:
```python
import kagglehub

# Download Flickr8k dataset
path = kagglehub.dataset_download("adityajn105/flickr8k")
print("Path to dataset files:", path)
```

Note: The pretrained ViT-GPT2 model will be downloaded automatically on first use.
## Usage

1. Start the Streamlit app:

   ```bash
   streamlit run app.py
   ```

2. Open your browser:
   - The app will automatically open at http://localhost:8501
   - If not, manually navigate to the URL shown in the terminal

3. Generate captions:
   - Select your preferred model from the dropdown
   - Upload an image (JPG, JPEG, or PNG)
   - Wait for the caption to be generated
   - View your results!
```python
# For programmatic usage
from PIL import Image
from app import generate_caption, load_model_and_tokenizer, extract_feature, load_feature_extractor
# Load models
model, tokenizer = load_model_and_tokenizer()
fe_model = load_feature_extractor()
# Load and process image
image = Image.open("sample.jpg")
feature = extract_feature(image, fe_model)
# Generate caption
caption = generate_caption(model, tokenizer, feature)
print(f"Caption: {caption}")image-caption-generator/
โ
โโโ app.py # Main Streamlit application
โโโ requirements.txt # Python dependencies
โโโ image_caption_model.h5 # Trained model weights
โโโ tokenizer.pkl # Tokenizer for custom model
โโโ training_notebook.ipynb # Model training notebook (Kaggle)
โโโ README.md # Project documentation
โ
## Model Training
The custom model was trained using the following approach:
### Dataset
- **Dataset**: Flickr8k - 8,000 images with 5 captions each
- **Source**: Available on Kaggle
- **Access Method**:
```python
import kagglehub
# Download latest version
path = kagglehub.dataset_download("adityajn105/flickr8k")
print("Path to dataset files:", path)
- Dataset Link: Flickr8k on Kaggle
- Images preprocessed to 299x299 resolution
- Captions tokenized with special start/end tokens (see the preprocessing sketch after this list)
- Note: Due to system limitations, training was limited to the 8,000 images in Flickr8k. With more powerful hardware (higher GPU memory, longer training time), the model can be trained on larger datasets like Flickr30k (30,000 images) or MS COCO (>100,000 images) for significantly improved accuracy and BLEU scores.
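A rough sketch of this preprocessing is shown below, assuming TensorFlow/Keras; the helper name and the sample caption are illustrative, not taken from the training notebook:

```python
# Illustrative preprocessing sketch for the custom model.
import numpy as np
from PIL import Image
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

# Image side: resize to 299x299 and extract a 2048-dimensional feature vector.
feature_extractor = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def extract_feature(path: str) -> np.ndarray:
    img = Image.open(path).convert("RGB").resize((299, 299))
    x = preprocess_input(np.expand_dims(np.asarray(img, dtype=np.float32), axis=0))
    return feature_extractor.predict(x, verbose=0)[0]  # shape: (2048,)

# Text side: wrap every caption with the special start/end tokens before fitting the tokenizer.
captions = ["a dog runs through the grass"]        # illustrative caption
wrapped = [f"start {caption} end" for caption in captions]
```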
### Training Configuration

- Framework: TensorFlow/Keras
- Feature Extractor: InceptionV3 (frozen weights)
- Encoder-Decoder Architecture: LSTM-based
- Loss Function: Categorical Crossentropy
- Optimizer: Adam
- Training Platform: Kaggle GPU
Refer to training_notebook.ipynb for complete training code and detailed steps.
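For orientation, the encoder-decoder wiring described above can be sketched as a merge-style Keras model. The vocabulary size, embedding size, and layer widths below are illustrative assumptions, not the values used in the notebook:

```python
# Merge-style encoder-decoder sketch: image features + partial caption -> next word.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

VOCAB_SIZE = 8000   # illustrative; the real value comes from the fitted tokenizer
MAX_LEN = 40        # maximum caption length used by the project

# Image branch: frozen InceptionV3 features (2048-d) projected to the decoder size.
img_input = Input(shape=(2048,))
img_branch = Dense(256, activation="relu")(Dropout(0.5)(img_input))

# Text branch: the caption generated so far, embedded and encoded with an LSTM.
seq_input = Input(shape=(MAX_LEN,))
seq_branch = LSTM(256)(Dropout(0.5)(Embedding(VOCAB_SIZE, 256, mask_zero=True)(seq_input)))

# Merge both branches and predict the next word over the vocabulary.
decoder = Dense(256, activation="relu")(add([img_branch, seq_branch]))
output = Dense(VOCAB_SIZE, activation="softmax")(decoder)

model = Model(inputs=[img_input, seq_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```

At inference time such a model is called repeatedly, feeding the predicted word back into the text branch until the end token is produced or the 40-word limit is reached.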
Generated Caption: "a snowmobiler flies through the air end"
Generated Caption: "three dogs are playing in a grassy field end"
Generated Caption: "a man in a red jacket is climbing a snowy mountain end"
Note: Place sample images in the sample_images/ folder for testing
## Technologies Used

- Frontend: Streamlit
- Deep Learning:
- TensorFlow 2.17.0
- PyTorch
- Transformers (Hugging Face)
- Computer Vision:
- InceptionV3
- Vision Transformer (ViT)
- NLP:
- LSTM
- GPT-2
- Image Processing: Pillow (PIL)
- Numerical Computing: NumPy
## Future Improvements

- Train on larger datasets (Flickr30k, MS COCO) with better hardware
- Improve BLEU score through extended training epochs
- Add attention mechanism visualization
- Support for batch image processing
- Deploy on cloud platforms (AWS, Azure, Heroku)
- Add multilingual caption support
- Implement beam search for better captions
- Add confidence scores for generated captions
- Create REST API endpoint
- Add caption editing and feedback mechanism
- Fine-tune hyperparameters for optimal performance
- Implement transfer learning with newer architectures (EfficientNet, ResNet)
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- Flickr8k dataset from Kaggle: adityajn105/flickr8k
- InceptionV3 model from TensorFlow/Keras
- ViT-GPT2 model from Hugging Face (nlpconnect)
- Kaggle for providing free GPU training infrastructure
- Streamlit for the amazing web framework
Note: This project demonstrates image captioning capabilities within system constraints. With access to more powerful hardware (e.g., Tesla V100, A100 GPUs) and larger datasets, the model performance can be significantly enhanced.
## Contact

**Divy Dobariya**

- Email: divydobariya11@gmail.com
- LinkedIn: linkedin.com/in/divy-dobariya-92881423b
- GitHub: @Divy005
Project Link: https://github.com/Divy005/image_caption_generator
⭐ If you found this project helpful, please give it a star!
Made with ❤️ and Python


