GPT-4o Complete Guide

GPT-4o Multimodal Model: Complete Guide

GPT-4o is OpenAI's flagship multimodal model. It accepts text, image, audio, and video inputs and generates text and audio outputs. The "o" stands for "omni," reflecting its ability to handle multiple modalities natively. This guide covers text generation, image understanding, image detail levels, choosing between GPT-4o and GPT-4o-mini, and JSON mode.

What Makes GPT-4o Special

Native multimodal — Processes images and text in a single model, not separate pipelines
Faster than GPT-4 Turbo — 2x faster response times on average
Cheaper than GPT-4 Turbo — 50% lower cost per token
128K context window — Process up to ~96,000 words in a single request
Improved multilingual — Better performance on non-English languages
Structured outputs — Native JSON mode and function calling support
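The ~96,000-word figure follows from a common rule of thumb of about 0.75 English words per token. The sketch below applies that arithmetic to check whether a prompt is likely to fit. The ratio, the helper names, and the output reserve are assumptions for illustration; a real tokenizer will give different counts, especially for code or non-English text.

```python
import math

CONTEXT_WINDOW_TOKENS = 128_000
WORDS_PER_TOKEN = 0.75  # assumption: rough average for English prose


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (hypothetical helper)."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


def fits_in_context(text: str, reserved_for_output: int = 1000) -> bool:
    """True if the estimated prompt size plus an output reserve fits in the window."""
    return estimate_tokens(text) + reserved_for_output <= CONTEXT_WINDOW_TOKENS


# 128,000 tokens * 0.75 words per token = 96,000 words
print(CONTEXT_WINDOW_TOKENS * WORDS_PER_TOKEN)
```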

Text Generation with GPT-4o

from openai import OpenAI

# The API key is read from the OPENAI_API_KEY environment variable by default
client = OpenAI(base_url="https://claude4u.com/v1")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are an expert software engineer."},
        {"role": "user", "content": "Explain the difference between REST and GraphQL."}
    ],
    temperature=0.7,
    max_tokens=1000
)

print(response.choices[0].message.content)

Image Understanding (Vision)

GPT-4o can analyze images sent as URLs or base64-encoded data:

from openai import OpenAI
import base64

client = OpenAI(base_url="https://claude4u.com/v1")

# Method 1: Image URL
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image? Describe it in detail."},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/photo.jpg"}
            }
        ]
    }]
)
print(response.choices[0].message.content)

# Method 2: Base64 encoded image
with open("screenshot.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Analyze this UI screenshot for accessibility issues."},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{img_b64}"}
            }
        ]
    }]
)

print(response.choices[0].message.content)
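The base64 example above hardcodes `image/png` in the data URL, which would mislabel a JPEG or WebP file. A minimal sketch of a helper that derives the MIME type from the file extension is shown here. The function name `image_to_data_url` is hypothetical, and it assumes the file extension reflects the actual image format.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image as a data URL, guessing the MIME type from the extension."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognized image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The returned string can be passed as the `url` value of an `image_url` content part in place of the hand-built f-string.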

Multiple Images in One Request

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare these two product designs and list the differences."},
            {"type": "image_url", "image_url": {"url": "https://example.com/design-a.png"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/design-b.png"}}
        ]
    }]
)
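When the number of images varies, the content list can be built programmatically instead of written out by hand. The sketch below is one way to do that; `build_vision_message` is a hypothetical helper, not part of the OpenAI SDK, and it only assembles the same message shape used in the request above.

```python
def build_vision_message(prompt: str, image_urls: list[str]) -> dict:
    """Build one user message containing a text part followed by one part per image."""
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}


message = build_vision_message(
    "Compare these two product designs and list the differences.",
    ["https://example.com/design-a.png", "https://example.com/design-b.png"],
)
print(len(message["content"]))  # 3 parts: one text, two images
```

The resulting dict can be passed as the single element of `messages` in `client.chat.completions.create(...)`.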

Image Detail Levels

Control how much detail the model uses when processing images, which affects both quality and cost:

content = [
    {"type": "text", "text": "Read all the text in this document."},
    {
        "type": "image_url",
        "image_url": {
            "url": "https://example.com/document.png",
            "detail": "high"  # Options: "auto", "low", "high"
        }
    }
]

low — 85 tokens per image. Fast and cheap, good for simple questions.
high — 85 base tokens plus 170 tokens per 512×512 tile after resizing. For example, a 1024×1024 image costs 765 tokens and a 2048×4096 image costs 1,105 tokens. Best for detailed analysis.
auto — The model decides based on the image size (default).
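To budget for image costs before sending a request, the token count can be estimated from the image dimensions. The sketch below follows the tiling rule OpenAI has documented for GPT-4o: fit the image within 2048×2048, scale the shortest side down to 768 pixels, then charge 170 tokens per 512-pixel tile plus 85 base tokens. Treat the constants as assumptions that may change, and note that this sketch assumes small images are never upscaled. The function name is hypothetical.

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate input tokens for one image (sketch of the documented tiling rule)."""
    if detail == "low":
        return 85  # flat cost regardless of size

    # Step 1: fit within a 2048 x 2048 square, preserving aspect ratio
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)

    # Step 2: scale down so the shortest side is at most 768 px (assumed: no upscaling)
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)

    # Step 3: count 512 px tiles and add the base cost
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


print(estimate_image_tokens(1024, 1024))         # 765
print(estimate_image_tokens(2048, 4096))         # 1105
print(estimate_image_tokens(4000, 3000, "low"))  # 85
```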

Node.js Vision Example

import OpenAI from 'openai';
import fs from 'fs';

const client = new OpenAI({ baseURL: 'https://claude4u.com/v1' });

const imageBuffer = fs.readFileSync('chart.png');
const base64Image = imageBuffer.toString('base64');

const response = await client.chat.completions.create({
    model: 'gpt-4o',
    messages: [{
        role: 'user',
        content: [
            { type: 'text', text: 'Extract all data points from this chart and format as JSON.' },
            {
                type: 'image_url',
                image_url: { url: `data:image/png;base64,${base64Image}` }
            }
        ]
    }],
    response_format: { type: 'json_object' }
});

const data = JSON.parse(response.choices[0].message.content);
console.log(data);

GPT-4o vs GPT-4o-mini

Choose the right variant for your use case:

GPT-4o — Best for complex reasoning, detailed analysis, creative writing, and code generation. $2.50 input / $10.00 output per 1M tokens.
GPT-4o-mini — Best for simple tasks, classification, extraction, and high-volume processing. $0.15 input / $0.60 output per 1M tokens. About 17x cheaper than GPT-4o.

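Tip: Use GPT-4o-mini for preprocessing and filtering, then route only complex cases to GPT-4o. At the prices above, and assuming similar token counts for both calls, this hybrid approach cuts costs by 80% or more when roughly 14% or fewer of requests are escalated, while maintaining quality where it matters.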
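How much the hybrid approach saves depends on how many requests get escalated. The sketch below estimates the saving from the prices listed above, assuming every request goes through GPT-4o-mini first and a fraction is then re-sent to GPT-4o with the same token counts. The function names, the default token counts, and the escalation rates are illustrative assumptions, not measurements.

```python
# Prices in USD per 1M tokens (input, output), taken from the comparison above
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single request."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def hybrid_savings(escalation_rate: float, input_tokens: int = 1000,
                   output_tokens: int = 500) -> float:
    """Fraction saved versus sending every request to GPT-4o."""
    full = request_cost("gpt-4o", input_tokens, output_tokens)
    mini = request_cost("gpt-4o-mini", input_tokens, output_tokens)
    # Every request pays for mini; only escalated requests also pay for GPT-4o
    hybrid = mini + escalation_rate * full
    return 1 - hybrid / full


print(f"{hybrid_savings(0.10):.0%}")  # 84%
print(f"{hybrid_savings(0.30):.0%}")  # 64%
```

Under these assumptions the saving is 94% minus the escalation rate, so the routing step needs to keep most traffic on the smaller model for the 80% figure to hold.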

Structured Outputs (JSON Mode)

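import json

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        # JSON mode requires the word "JSON" to appear somewhere in the prompt
        "content": "List 5 programming languages with their year of creation. Respond in JSON."
    }],
    response_format={"type": "json_object"}
)

# The message content is a JSON object serialized as a string
data = json.loads(response.choices[0].message.content)
print(data)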

Warning: When using JSON mode, you must include the word "JSON" in your prompt (system or user message). Otherwise the API will return an error.
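This requirement is easy to miss when prompts are assembled dynamically. A small pre-flight check, sketched below, can catch it before the request is sent. `mentions_json` is a hypothetical helper, and the case-insensitive match is an assumption about how the API performs its own check.

```python
def mentions_json(messages: list[dict]) -> bool:
    """Return True if any message's text mentions "json" (case-insensitive)."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, str):
            texts = [content]
        elif isinstance(content, list):
            # Multimodal messages: only inspect the text parts
            texts = [part.get("text", "") for part in content
                     if part.get("type") == "text"]
        else:
            texts = []
        if any("json" in text.lower() for text in texts):
            return True
    return False


messages = [{"role": "user", "content": "List 5 programming languages."}]
if not mentions_json(messages):
    # Add an explicit instruction rather than letting the request fail
    messages.insert(0, {"role": "system", "content": "Respond with a JSON object."})
print(mentions_json(messages))  # True
```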

Best Use Cases for GPT-4o

Document analysis — Extract data from scanned documents, receipts, invoices
Code review — Analyze screenshots of code for bugs and improvements
Product analysis — Compare product images, read labels, identify defects
Accessibility auditing — Evaluate UI screenshots for accessibility compliance
Chart and graph interpretation — Extract data from visualizations
Content moderation — Analyze images for policy compliance

Tip: Access GPT-4o and all other OpenAI models through claude4u.com with a single API key. The relay service handles load balancing, failover, and provides access to Claude and Gemini models through the same OpenAI-compatible interface.

Get Started with 轻舟 AI

Stable, fast AI API relay — supports Claude, OpenAI, Gemini and more
