Creating an Image QA System in Python: A Step-by-Step Tutorial

Analyzing images and extracting meaningful information from them is a common requirement in modern AI applications. Imagine an application that not only recognizes objects in an image but also answers specific questions about them. This tutorial walks you through building an Image Question Answering (QA) system in Python using the Gemini API. Whether you want to integrate AI into your projects or simply explore image processing, this guide is for you!

Introduction

Image QA systems have a wide range of applications, from assisting visually impaired individuals by describing images to enhancing customer experiences in e-commerce by answering questions about products. By leveraging advanced AI models, we can create a system capable of understanding images in a conversational context, identifying objects, counting them, and even performing spatial reasoning tasks.

Create a Test Image with Objects

This snippet uses the Pillow library to create a test image: a red rectangle, a blue rectangle, and a green ellipse on a 600 by 400 white canvas. The image serves as the input for the examples that follow.

from pathlib import Path


def create_test_image_with_objects():
    """Create a test image with various objects if it doesn't exist."""
    image_path = 'test_objects.png'
    
    # Check if image already exists
    if Path(image_path).exists():
        print(f"[OK] Using existing image: {image_path}")
        return image_path
    
    try:
        from PIL import Image, ImageDraw
        
        print("[NOTE] Creating new test image...")
        
        img = Image.new('RGB', (600, 400), color='white')
        draw = ImageDraw.Draw(img)
        
        # Draw objects
        draw.rectangle([50, 50, 150, 150], fill='red', outline='black', width=2)
        draw.rectangle([200, 50, 300, 150], fill='blue', outline='black', width=2)
        draw.ellipse([350, 50, 500, 200], fill='green', outline='black', width=2)
        
        img.save(image_path)
        print(f"[OK] Created new test image: {image_path}")
        return image_path
        
    except ImportError:
        print("[WARNING]  PIL not available. Please install: pip install Pillow")
        return None

Prerequisites and Setup

Before we dive into the implementation, ensure you have the following prerequisites:

Object Counting in an Image

This snippet shows how to count objects in an image by sending specific questions to an AI model, illustrating the interaction between image data and natural language processing.

from pathlib import Path

from google.genai import types  # requires: pip install google-genai


def object_counting(client, image_path):
    """Count objects in an image."""
    print("\n" + "=" * 60)
    print("  EXAMPLE 1: Object Counting")
    print("=" * 60)
    
    if not image_path or not Path(image_path).exists():
        print("\n[WARNING]  No image available")
        return
    
    with open(image_path, 'rb') as f:
        image_bytes = f.read()
    
    image_part = types.Part.from_bytes(data=image_bytes, mime_type='image/png')
    
    questions = [
        "How many rectangles are in this image?",
        "How many circles/ellipses can you see?",
        "What is the total number of shapes?",
        "Which color appears most frequently?"
    ]
    
    for question in questions:
        print(f"\n {question}")
        response = client.models.generate_content(
            model='gemini-2.5-flash',
            contents=[image_part, question]
        )
        print(f"[CHAT] {response.text}")

Python 3.x: Make sure Python is installed on your machine. You can download it from python.org.
Pillow Library: This library is essential for image manipulation. You can install it using pip:

pip install Pillow

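Gemini API: To access the Gemini API, you will need a Google Cloud account. Set up a project, enable the Gemini API, and obtain your API key. The snippets in this tutorial also rely on the client and types objects from Google’s Gen AI SDK for Python, which you can install using pip:

pip install google-genai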

Once you have everything set up, you are ready to start building your Image QA system!
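The snippets in this tutorial all take a client argument but never show how it is created. The following is a minimal sketch of one way to do it, not the original author’s setup: the helper names get_api_key and make_client and the GEMINI_API_KEY environment variable name are assumptions made for illustration. The SDK import is deferred so the key check can be used on its own.

```python
import os


def get_api_key(env=os.environ, var_name="GEMINI_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    The variable name is an assumption; use whatever name you store the key under.
    """
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


def make_client():
    """Create a Gemini client (requires: pip install google-genai)."""
    # Imported here so get_api_key() works even if the SDK is not installed yet.
    from google import genai

    return genai.Client(api_key=get_api_key())
```

With the key exported in your shell, client = make_client() gives you the object to pass to the example functions. Reading the key from the environment keeps it out of your source code.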

Core Concepts Explanation

Understanding the core concepts behind the implementation is crucial for grasping how the Image QA system works. Here are the key components:

Spatial Reasoning with Images

This snippet demonstrates how to ask questions about spatial relationships in an image, showcasing the model’s ability to understand and analyze the layout of visual elements.

def spatial_reasoning(client, image_path):
    """Ask about spatial relationships in images."""
    print("\n" + "=" * 60)
    print("  EXAMPLE 2: Spatial Reasoning")
    print("=" * 60)
    
    if not image_path or not Path(image_path).exists():
        print("\n[WARNING]  No image available")
        return
    
    with open(image_path, 'rb') as f:
        image_bytes = f.read()
    
    image_part = types.Part.from_bytes(data=image_bytes, mime_type='image/png')
    
    questions = [
        "What shapes are in the top row?",
        "What shapes are in the bottom row?",
        "Which shape is on the far right?",
        "Is there any text in the image? Where is it located?"
    ]
    
    for question in questions:
        print(f"\n {question}")
        response = client.models.generate_content(
            model='gemini-2.5-flash',
            contents=[image_part, question]
        )
        print(f"[CHAT] {response.text}")

Image Creation

The first step is to create a test image with known contents. Because we draw the shapes ourselves on a blank canvas with the Pillow library, we know exactly what the image contains and can check the model’s answers against it.

Object Counting

Object counting involves asking the AI model to identify and count the distinct objects present in our image. This showcases the model’s ability to recognize and differentiate between objects, making it a powerful feature for various applications.
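Since the test image is drawn in code, the correct answers to the counting questions can be computed rather than guessed. The sketch below is one possible way to do that. The SHAPES list simply restates the coordinates and colors from create_test_image_with_objects, while the expected_counts helper is a hypothetical name introduced here.

```python
from collections import Counter

# Shapes drawn by create_test_image_with_objects(): (kind, color, bounding box)
SHAPES = [
    ("rectangle", "red", (50, 50, 150, 150)),
    ("rectangle", "blue", (200, 50, 300, 150)),
    ("ellipse", "green", (350, 50, 500, 200)),
]


def expected_counts(shapes):
    """Count shapes by kind and add the overall total."""
    counts = Counter(kind for kind, _color, _box in shapes)
    counts["total"] = len(shapes)
    return dict(counts)


print(expected_counts(SHAPES))
# → {'rectangle': 2, 'ellipse': 1, 'total': 3}
```

Each color is used exactly once in this image, so a question about the most frequent color has no single correct answer. Expect the model to report a tie.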

Spatial Reasoning

Spatial reasoning allows the model to answer questions related to the layout and arrangement of objects within the image. For example, you might ask, “Which object is to the right of the blue rectangle?” This feature highlights the AI’s understanding of spatial relationships.
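For the generated test image, the answers to spatial questions follow directly from the bounding boxes passed to Pillow. The sketch below shows how you might derive them to sanity-check the model. The helper names right_of and far_right are hypothetical, and the comparison only looks at x coordinates.

```python
# Bounding boxes (left, top, right, bottom) as drawn in the test image.
SHAPES = {
    "red rectangle": (50, 50, 150, 150),
    "blue rectangle": (200, 50, 300, 150),
    "green ellipse": (350, 50, 500, 200),
}


def right_of(name, shapes):
    """Names of shapes whose left edge lies at or beyond the right edge of `name`."""
    right_edge = shapes[name][2]
    return [
        other
        for other, box in shapes.items()
        if other != name and box[0] >= right_edge
    ]


def far_right(shapes):
    """Name of the shape whose right edge has the largest x coordinate."""
    return max(shapes, key=lambda name: shapes[name][2])


print(right_of("blue rectangle", SHAPES))  # → ['green ellipse']
print(far_right(SHAPES))                   # → green ellipse
```

This gives a reference answer for questions such as “Which object is to the right of the blue rectangle?” without relying on the model itself.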

Text Extraction (OCR)

Optical Character Recognition (OCR) enables the extraction of text from images. This is particularly useful for scenarios where images contain text elements, such as signs, labels, or documents. Integrating OCR into our Image QA system allows us to answer questions related to textual content.

Step-by-Step Implementation Walkthrough

Now that we’ve covered the core concepts, let’s walk through the implementation of the Image QA system, as demonstrated in the code snippets.

Text Extraction from Images (OCR)

This snippet illustrates how to perform Optical Character Recognition (OCR) on an image to extract text, highlighting the integration of image analysis and text processing capabilities.

def ocr_text_extraction(client, image_path):
    """Extract text from images."""
    print("\n" + "=" * 60)
    print("  EXAMPLE 3: Text Extraction (OCR)")
    print("=" * 60)
    
    if not image_path or not Path(image_path).exists():
        print("\n[WARNING]  No image available")
        return
    
    with open(image_path, 'rb') as f:
        image_bytes = f.read()
    
    image_part = types.Part.from_bytes(data=image_bytes, mime_type='image/png')
    
    prompts = [
        "Is there any text in this image?",
        "Extract all text you can see.",
        "What does the text say and where is it positioned?"
    ]
    
    for prompt in prompts:
        print(f"\n[NOTE] {prompt}")
        response = client.models.generate_content(
            model='gemini-2.5-flash',
            contents=[image_part, prompt]
        )
        print(f" {response.text}")

Creating the Test Image

The first function in our implementation is responsible for creating a test image with various shapes. This is done using the Pillow library. The function checks if an image file already exists; if it does, it uses that existing image. Otherwise, it creates a new one. This approach saves time and resources during development, especially when testing various functionalities.

Counting Objects in the Image

The next function focuses on object counting. It reads the image file, wraps the bytes in an image part, and sends that part together with each question to the Gemini API through generate_content. This interaction between the image data and the AI model shows how natural-language questions can be asked about specific contents of the image.
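The model replies in free text, so comparing its answer to a known count needs a small parsing step. The following is a rough heuristic sketch, not part of the original tutorial code: extract_count is a hypothetical helper that returns the first number it finds, whether written as digits or as a word from zero to ten.

```python
import re

_NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}


def extract_count(answer):
    """Return the first count mentioned in a free-text answer, or None."""
    pattern = r"\b(\d+|" + "|".join(_NUMBER_WORDS) + r")\b"
    match = re.search(pattern, answer.lower())
    if match is None:
        return None
    token = match.group(1)
    return int(token) if token.isdigit() else _NUMBER_WORDS[token]


print(extract_count("There are 2 rectangles in this image."))  # → 2
print(extract_count("I can see one ellipse."))                 # → 1
```

Because it takes the first number in the text, an answer that mentions another number before the count will be misread. Asking the model to reply with a number only makes this kind of check more reliable.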

Spatial Reasoning

After counting objects, we move on to spatial reasoning. This function asks questions about the spatial relationships between objects in the image, providing insights into the layout. This is particularly useful in applications where understanding the arrangement of items is critical, such as in warehouse management or robotics.

Text Extraction

The final function integrates OCR capabilities into our system. By sending the image to the API, we can extract any text present and use it to answer relevant questions. This feature significantly expands the scope of our Image QA system, making it more versatile.

Advanced Features or Optimizations

While the basic functionalities of the Image QA system are powerful, there are several advanced features and optimizations you could implement:

Multi-Turn Image Q&A

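This snippet demonstrates how to conduct a multi-turn conversation about an image. The image is attached only to the first user message, and the full history is resent on every turn so the model keeps the context across interactions. Note that the test image as created in this tutorial contains only shapes and no text, so expect the model to say it cannot find a discount unless you supply an image that includes one.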

def multi_turn_image_qa(client, image_path):
    """Multi-turn conversation about an image."""
    print("\n" + "=" * 60)
    print("  EXAMPLE 4: Multi-Turn Image Q&A")
    print("=" * 60)
    
    if not image_path or not Path(image_path).exists():
        print("\n[WARNING]  No image available")
        return
    
    with open(image_path, 'rb') as f:
        image_bytes = f.read()
    
    image_part = types.Part.from_bytes(data=image_bytes, mime_type='image/png')
    
    history = []
    
    questions = [
        "Describe this image briefly.",
        "Which shape is the largest?",
        "If I wanted to buy something, what discount is offered?"
    ]
    
    print("\n[CHAT] Conversation about the image:\n")
    
    for i, question in enumerate(questions, 1):
        print(f"Turn {i}:")
        print(f" User: {question}")
        
        # First turn includes image
        if i == 1:
            history.append({"role": "user", "parts": [image_part, {"text": question}]})
        else:
            history.append({"role": "user", "parts": [{"text": question}]})
        
        response = client.models.generate_content(
            model='gemini-2.5-flash',
            contents=history
        )
        
        model_msg = response.text
        print(f" Model: {model_msg}\n")
        
        history.append({"role": "model", "parts": [{"text": model_msg}]})
    
    print("[OK] Multi-turn conversation maintains image context!")
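As an alternative to maintaining the history list by hand, the Gen AI SDK also provides chat sessions that track earlier turns for you. The sketch below assumes the client.chats.create and send_message calls behave as described in the SDK documentation. The function name chat_about_image is hypothetical, and image_part is expected to be built with types.Part.from_bytes as in the snippets above.

```python
def chat_about_image(client, image_part, questions, model="gemini-2.5-flash"):
    """Ask several questions in one chat session; the image is sent only once."""
    chat = client.chats.create(model=model)
    answers = []
    for turn, question in enumerate(questions):
        # Attach the image to the first message only; the session keeps earlier turns.
        message = [image_part, question] if turn == 0 else question
        response = chat.send_message(message)
        answers.append(response.text)
    return answers
```

The function returns the answers as a list instead of printing them, which makes it easier to check or log them afterwards.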

Improved Image Quality: Enhance image resolution and quality before processing to improve the accuracy of object detection and OCR.
Multiple Image Formats: The snippets above hardcode the image/png MIME type. Extend support to other formats such as JPEG to make the system more robust.
User Interface: Create a frontend interface where users can upload images and receive answers interactively.
Batch Processing: Implement functionality to process multiple images at once, increasing efficiency for users with large datasets.
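Two of the ideas above, format support and batch processing, can be sketched with the standard library alone. This is an illustrative sketch under assumptions: guess_image_mime and iter_images are hypothetical helper names, and the MIME type is inferred from the file extension only, not from the file contents.

```python
import mimetypes
from pathlib import Path


def guess_image_mime(path):
    """Return an image MIME type based on the file extension, or raise ValueError."""
    mime, _encoding = mimetypes.guess_type(str(path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognised image file: {path}")
    return mime


def iter_images(folder):
    """Yield (path, mime_type) for every image file directly inside `folder`."""
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        try:
            yield path, guess_image_mime(path)
        except ValueError:
            continue  # skip files that are not images


print(guess_image_mime("photo.jpg"))  # → image/jpeg
```

The returned MIME type can be passed as the mime_type argument of types.Part.from_bytes in place of the hardcoded 'image/png', and iter_images gives you a loop over a whole folder.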

Practical Applications

The applications of an Image QA system are vast. Here are a few practical scenarios where you might deploy such a system:

Accessibility Tools: Assist visually impaired users by describing images and providing relevant information in an accessible format.
Customer Support: Enable customers to ask questions about products directly from images, improving engagement and user experience.
Inventory Management: Use object counting and spatial reasoning to manage inventory and optimize layouts in warehouses.
Education: Develop educational tools that help students learn about objects and their relationships through interactive image analysis.

Common Pitfalls and Solutions

While implementing your Image QA system, you may encounter several common pitfalls:

API Limitations: Understand the limitations of the Gemini API, such as rate limits and data quotas. Always monitor your usage to avoid disruptions.
Image Quality: Ensure that the images used are of sufficient quality for accurate processing. Low-resolution images can lead to poor results in object detection and OCR.
Error Handling: Implement robust error handling in your code to manage API errors gracefully. This will enhance user experience and reliability.
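One common way to handle rate limits and transient API errors is to retry with exponential backoff. The sketch below is generic and makes no claims about the specific exceptions the Gemini SDK raises: call_with_retries is a hypothetical helper, and it catches Exception broadly only to stay self-contained.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:  # narrow this to the SDK's error types in real code
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the last error
            delay = base_delay * (2 ** attempt)
            print(f"[WARNING] Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
```

You could wrap an API call as call_with_retries(lambda: client.models.generate_content(model='gemini-2.5-flash', contents=[image_part, question])). The sleep parameter exists so the waiting can be replaced in tests.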

Conclusion and Next Steps

In this tutorial, we have explored the process of creating an Image QA system in Python using the Gemini API. From generating a test image to object counting, spatial reasoning, OCR, and multi-turn conversations, we’ve covered the building blocks of a system that can answer questions about images.

As you continue your journey in AI and image processing, consider expanding this project further. Explore different APIs, integrate machine learning models, or even delve into real-time image analysis. The possibilities are endless!

Happy coding!
