Capabilities

Vision

Pass images alongside text inside the same content array. The model reads images and text together — no separate vision endpoint, no two-step flow.

Which models support vision

All five virtual ids accept image content blocks today. The capability flag on each model is exposed via GET /v1/models as supports_vision — read it at runtime if you want to feature-gate client UI.

Basic call with an image

Build the user message as a content array. Order matters — put the image first, then the question. Mixed content is supported in any order, but the model prefers context-then-question.

# Encode an image and embed it as a base64 content block.
B64=$(base64 -w0 receipt.jpg 2>/dev/null || base64 receipt.jpg)
curl https://caicaini.com/v1/messages \
  -H "Authorization: Bearer cai_api_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"caicaini/sonnet\",
    \"max_tokens\": 512,
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"image\", \"source\": {\"type\": \"base64\", \"media_type\": \"image/jpeg\", \"data\": \"${B64}\"}},
        {\"type\": \"text\", \"text\": \"Extract every line item with price into JSON.\"}
      ]
    }]
  }"

Image sources

We accept two source types. Inline base64 is the safest for production because the request is self-contained — no second network hop, no image-host availability problem mid-call.

base64 — pass the raw bytes encoded with standard base64 in the data field. Set media_type to one of image/jpeg, image/png, image/webp, image/gif.
url — we fetch the image server-side. The URL must be https://, return 200 within 5 seconds, and serve a supported MIME type. Private URLs (signed S3 links, presigned R2 links) work as long as the signature is valid at request time.

url source

{
  "type": "image",
  "source": {
    "type": "url",
    "url": "https://example.com/photos/receipt.jpg"
  }
}

Size and count limits

Up to 20 images per turn. Beyond that the request is rejected with type invalid_request_error.
Each image up to 5 MB raw, 8000 × 8000 px. We downscale very large images to a model-friendly resolution server-side before billing input tokens.
Total request body up to 16 MB. With base64 overhead this comfortably fits roughly two dozen photos.

Practical tips

Crop irrelevant whitespace before sending. Smaller, sharper images cost less and read better.
For documents, use PDFs only after rendering each page to PNG client-side. We do not currently accept application/pdf directly.
Steer with the system prompt — "Reply only with valid JSON. Do not summarize." works well for OCR-like extraction tasks.
For latency-sensitive vision workloads, route to caicaini/haiku first and escalate to caicaini/sonnet only when the result is uncertain. For high-volume document Q&A on long inputs, pin caicaini/kimi. See Models.

PreviousStreaming

NextTools

Vision

Which models support vision#

Basic call with an image#

Image sources#

Size and count limits#

Practical tips#

Which models support vision

Basic call with an image

Image sources

Size and count limits

Practical tips