Gemini Multimodal API Guide: Working with Images, Audio, and Video

Gemini's native multimodal capabilities set it apart from many competing AI models. Unlike models that bolt on vision or audio as afterthoughts, Gemini was designed from the ground up to process text, images, audio, and video within a single unified model. This guide shows you how to leverage each modality effectively in your applications.

Understanding Multimodal Input

Gemini accepts multiple content types in a single request. You can combine text prompts with images, audio files, video clips, and PDF documents. The model processes all inputs together and generates a coherent text response that considers all provided content.

Working with Images

Gemini can analyze images for content description, object detection, text extraction (OCR), diagram interpretation, and visual question answering.

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Method 1: From a local file
with open("screenshot.png", "rb") as f:
    image_data = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Part.from_bytes(data=image_data, mime_type="image/png"),
        "Describe what you see in this image. Extract any visible text."
    ]
)

print(response.text)

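# Method 2: From a URL
# Download the bytes first, then send them inline. This works whether or not
# the API can fetch the URL itself.
from urllib.request import urlopen

with urlopen("https://example.com/chart.png", timeout=30) as resp:
    chart_data = resp.read()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Part.from_bytes(data=chart_data, mime_type="image/png"),
        "Analyze this chart and summarize the key trends."
    ]
)

print(response.text)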

Supported image formats include JPEG, PNG, GIF, WebP, and HEIC. For best results, use clear images at a moderate resolution. Extremely high-resolution images consume more tokens without necessarily improving analysis quality.

When sending multiple images in one request, Gemini can compare them, identify differences, and reason about relationships between them. This is useful for tasks like comparing UI mockups, analyzing before/after images, or processing multi-page documents.
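As a rough sketch of preparing several images for one request, the helper below checks each file against the formats listed above and pairs its bytes with a MIME type. The names `image_mime_type` and `load_images` are hypothetical helpers for illustration, not part of the Gemini SDK.

```python
import mimetypes
from pathlib import Path

# Image MIME types for the formats named in this guide.
SUPPORTED_IMAGE_MIME_TYPES = {
    "image/jpeg", "image/png", "image/gif", "image/webp", "image/heic",
}

# Extensions that Python's mimetypes table may not know on every platform.
_EXTRA_TYPES = {".webp": "image/webp", ".heic": "image/heic"}


def image_mime_type(path):
    """Return the MIME type for an image path, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    mime = _EXTRA_TYPES.get(suffix) or mimetypes.guess_type(f"file{suffix}")[0]
    if mime not in SUPPORTED_IMAGE_MIME_TYPES:
        raise ValueError(f"Unsupported image format: {path}")
    return mime


def load_images(paths):
    """Read each image and pair its bytes with its MIME type."""
    return [(Path(p).read_bytes(), image_mime_type(p)) for p in paths]
```

Each `(data, mime_type)` pair can then be passed to `types.Part.from_bytes(data=..., mime_type=...)` as in Method 1, with the text prompt appended as the last item in `contents`.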

Working with Audio

Gemini can transcribe speech, analyze audio content, answer questions about spoken content, and understand multiple languages in audio input.

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Upload the audio through the File API (recommended for files over 20MB)
audio_file = client.files.upload(file="meeting_recording.mp3")

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Part.from_uri(
            file_uri=audio_file.uri,
            mime_type="audio/mp3"
        ),
        "Transcribe this audio and provide a summary of the key discussion points."
    ]
)

print(response.text)

Supported audio formats include MP3, WAV, AIFF, AAC, OGG, and FLAC. Gemini can process audio files up to approximately 9.5 hours in length (when using the File API for uploads).

Working with Video

Video understanding is one of Gemini's standout capabilities. The model can analyze visual content frame by frame, understand temporal sequences, and answer questions about actions and events in the video.

import time

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Upload video using the File API
video_file = client.files.upload(file="demo_video.mp4")

# Wait for processing to complete
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = client.files.get(name=video_file.name)

# Stop early if the upload could not be processed
if video_file.state.name == "FAILED":
    raise RuntimeError("Video processing failed")

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Part.from_uri(
            file_uri=video_file.uri,
            mime_type="video/mp4"
        ),
        "Describe the key actions in this video. What is the person doing at the 30 second mark?"
    ]
)

print(response.text)

Supported video formats include MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, and 3GPP. Videos up to approximately 1 hour can be processed.

Videos of any substantial size should be uploaded via the File API, and an uploaded video may take some time to process before it is ready for use. Always check the file state before sending your generation request. Large video files consume significantly more tokens than text or image inputs.
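The polling loop shown earlier waits indefinitely. A minimal sketch of a bounded wait follows, assuming the file object exposes `state.name` with the values `PROCESSING`, `ACTIVE`, and `FAILED`. The function `wait_until_active` and its parameters are hypothetical, and the lookup is passed in as a callable so the helper does not depend on a particular client.

```python
import time


def wait_until_active(get_file, name, poll_seconds=5.0,
                      timeout_seconds=600.0, sleep=time.sleep):
    """Poll get_file(name) until the file is no longer PROCESSING.

    get_file is any callable that returns an object with a `.state.name`
    attribute, for example: lambda n: client.files.get(name=n)
    """
    waited = 0.0
    file = get_file(name)
    while file.state.name == "PROCESSING":
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"{name} still processing after {timeout_seconds:.0f}s"
            )
        sleep(poll_seconds)
        waited += poll_seconds
        file = get_file(name)
    # Anything other than ACTIVE (for example FAILED) is treated as an error.
    if file.state.name != "ACTIVE":
        raise RuntimeError(f"{name} ended in state {file.state.name}")
    return file
```

With the client from the example above, a call might look like `video_file = wait_until_active(lambda n: client.files.get(name=n), video_file.name)`.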

Working with PDF Documents

Gemini can analyze PDF documents including both text and visual content such as charts, tables, and diagrams:

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Upload PDF
pdf_file = client.files.upload(file="research_paper.pdf")

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        types.Part.from_uri(
            file_uri=pdf_file.uri,
            mime_type="application/pdf"
        ),
        "Summarize the methodology and key findings of this research paper."
    ]
)

print(response.text)

Combining Multiple Modalities

The real power of Gemini's multimodal capabilities comes from combining different content types in a single request:

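from pathlib import Path
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Example file names; replace with your own mockup and specification
mockup_image = Path("mockup.png").read_bytes()
spec_pdf = Path("spec.pdf").read_bytes()

# Analyze a UI design with reference to requirements
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        types.Part.from_bytes(data=mockup_image, mime_type="image/png"),
        types.Part.from_bytes(data=spec_pdf, mime_type="application/pdf"),
        "Compare this UI mockup against the specification document. "
        "List any discrepancies or missing features."
    ]
)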

Token Costs for Multimodal Content

Images: An image whose width and height are both 384 pixels or less costs 258 tokens. Larger images are cropped and scaled into 768x768 tiles, each costing 258 tokens.
Audio: Approximately 32 tokens per second of audio.
Video: Approximately 263 tokens per second of video (includes both visual and audio tracks).
PDFs: Each page costs approximately 258 tokens.
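To make these rates concrete, here is a small back-of-the-envelope estimator that uses only the approximate figures listed above. The function name `estimate_media_tokens` is hypothetical, and the image term is a lower bound because larger images can cost more than 258 tokens.

```python
# Approximate per-unit token rates quoted in this guide.
TOKENS_PER_IMAGE = 258          # baseline; larger images can cost more
TOKENS_PER_AUDIO_SECOND = 32
TOKENS_PER_VIDEO_SECOND = 263
TOKENS_PER_PDF_PAGE = 258


def estimate_media_tokens(images=0, audio_seconds=0.0,
                          video_seconds=0.0, pdf_pages=0):
    """Return a rough token estimate for the media in one request."""
    total = (
        images * TOKENS_PER_IMAGE
        + audio_seconds * TOKENS_PER_AUDIO_SECOND
        + video_seconds * TOKENS_PER_VIDEO_SECOND
        + pdf_pages * TOKENS_PER_PDF_PAGE
    )
    return round(total)


# A one-minute audio clip: 60 * 32 = 1920 tokens
print(estimate_media_tokens(audio_seconds=60))
# A one-hour video: 3600 * 263 = 946800 tokens
print(estimate_media_tokens(video_seconds=3600))
```

These numbers are estimates only. For an exact count on real content, the SDK's `client.models.count_tokens` call is the more reliable option.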

Best Practices

Use the File API for large files. Inline base64 encoding works for small files but is impractical for anything over a few MB.
Optimize image resolution. Downscale images when full resolution is not needed to reduce token costs.
Be specific in your prompts. Tell the model exactly what to look for in visual content.
Consider using Flash for simple visual tasks. Reserve Pro for complex multi-step visual reasoning.
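The first practice above can be turned into a simple size check. This is a minimal sketch that assumes the 20MB threshold mentioned in the audio example. The helper name `should_use_file_api` is hypothetical, and the limit is a parameter so it can be adjusted if your quota differs.

```python
import os

# Threshold taken from the audio example in this guide (20MB).
INLINE_LIMIT_BYTES = 20 * 1024 * 1024


def should_use_file_api(path, limit_bytes=INLINE_LIMIT_BYTES):
    """Return True when the file is too large to send inline."""
    return os.path.getsize(path) > limit_bytes
```

When this returns True, upload with `client.files.upload(file=path)`. Otherwise, read the bytes and send them with `types.Part.from_bytes`, as in the image example.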

