Capabilities
Vision
Pass images alongside text inside the same content array. The model reads images and text together — no separate vision endpoint, no two-step flow.
Which models support vision
All five virtual ids accept image content blocks today. The capability flag on each model is exposed via GET /v1/models as supports_vision — read it at runtime if you want to feature-gate client UI.
Basic call with an image
Build the user message as a content array. Order matters — put the image first, then the question. Mixed content is supported in any order, but the model prefers context-then-question.
# Encode an image and embed it as a base64 content block.
B64=$(base64 -w0 receipt.jpg 2>/dev/null || base64 receipt.jpg)
curl https://caicaini.com/v1/messages \
-H "Authorization: Bearer cai_api_YOUR_KEY" \
-H "Content-Type: application/json" \
-d "{
\"model\": \"caicaini/sonnet\",
\"max_tokens\": 512,
\"messages\": [{
\"role\": \"user\",
\"content\": [
{\"type\": \"image\", \"source\": {\"type\": \"base64\", \"media_type\": \"image/jpeg\", \"data\": \"${B64}\"}},
{\"type\": \"text\", \"text\": \"Extract every line item with price into JSON.\"}
]
}]
}"Image sources
We accept two source types. Inline base64 is the safest for production because the request is self-contained — no second network hop, no image-host availability problem mid-call.
base64— pass the raw bytes encoded with standard base64 in thedatafield. Setmedia_typeto one ofimage/jpeg,image/png,image/webp,image/gif.url— we fetch the image server-side. The URL must behttps://, return 200 within 5 seconds, and serve a supported MIME type. Private URLs (signed S3 links, presigned R2 links) work as long as the signature is valid at request time.
url source
{
"type": "image",
"source": {
"type": "url",
"url": "https://example.com/photos/receipt.jpg"
}
}Size and count limits
- Up to 20 images per turn. Beyond that the request is rejected with type
invalid_request_error. - Each image up to 5 MB raw, 8000 × 8000 px. We downscale very large images to a model-friendly resolution server-side before billing input tokens.
- Total request body up to 16 MB. With base64 overhead this comfortably fits roughly two dozen photos.
Practical tips
- Crop irrelevant whitespace before sending. Smaller, sharper images cost less and read better.
- For documents, use PDFs only after rendering each page to PNG client-side. We do not currently accept
application/pdfdirectly. - Steer with the system prompt — "Reply only with valid JSON. Do not summarize." works well for OCR-like extraction tasks.
- For latency-sensitive vision workloads, route to
caicaini/haikufirst and escalate tocaicaini/sonnetonly when the result is uncertain. For high-volume document Q&A on long inputs, pincaicaini/kimi. See Models.