Caicaini
Get started

Capabilities

Vision

Pass images alongside text inside the same content array. The model reads images and text together — no separate vision endpoint, no two-step flow.

Which models support vision

All five virtual ids accept image content blocks today. The capability flag on each model is exposed via GET /v1/models as supports_vision — read it at runtime if you want to feature-gate client UI.

Basic call with an image

Build the user message as a content array. Order matters — put the image first, then the question. Mixed content is supported in any order, but the model prefers context-then-question.

# Encode an image and embed it as a base64 content block.
B64=$(base64 -w0 receipt.jpg 2>/dev/null || base64 receipt.jpg)
curl https://caicaini.com/v1/messages \
  -H "Authorization: Bearer cai_api_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"caicaini/sonnet\",
    \"max_tokens\": 512,
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"image\", \"source\": {\"type\": \"base64\", \"media_type\": \"image/jpeg\", \"data\": \"${B64}\"}},
        {\"type\": \"text\", \"text\": \"Extract every line item with price into JSON.\"}
      ]
    }]
  }"

Image sources

We accept two source types. Inline base64 is the safest for production because the request is self-contained — no second network hop, no image-host availability problem mid-call.

  • base64 — pass the raw bytes encoded with standard base64 in the data field. Set media_type to one of image/jpeg, image/png, image/webp, image/gif.
  • url — we fetch the image server-side. The URL must be https://, return 200 within 5 seconds, and serve a supported MIME type. Private URLs (signed S3 links, presigned R2 links) work as long as the signature is valid at request time.
url source
{
  "type": "image",
  "source": {
    "type": "url",
    "url": "https://example.com/photos/receipt.jpg"
  }
}

Size and count limits

  • Up to 20 images per turn. Beyond that the request is rejected with type invalid_request_error.
  • Each image up to 5 MB raw, 8000 × 8000 px. We downscale very large images to a model-friendly resolution server-side before billing input tokens.
  • Total request body up to 16 MB. With base64 overhead this comfortably fits roughly two dozen photos.

Practical tips

  • Crop irrelevant whitespace before sending. Smaller, sharper images cost less and read better.
  • For documents, use PDFs only after rendering each page to PNG client-side. We do not currently accept application/pdf directly.
  • Steer with the system prompt — "Reply only with valid JSON. Do not summarize." works well for OCR-like extraction tasks.
  • For latency-sensitive vision workloads, route to caicaini/haiku first and escalate to caicaini/sonnet only when the result is uncertain. For high-volume document Q&A on long inputs, pin caicaini/kimi. See Models.