Capabilities
Thinking
Extended thinking gives the model a private scratchpad before it responds. The model writes out its reasoning, you decide whether to surface it, and the answer that follows tends to be more accurate on hard problems.
When to use thinking
- Multi-step math, logic puzzles, planning problems.
- Code that requires holding many constraints in mind at once: architectural changes, debugging, optimization.
- Long agent loops where every wrong tool call cascades into more wrong calls.
- Anything where you currently chain a "think first, then answer" prompt manually.
Enabling thinking
Add the thinking field to your request. budget_tokens is the upper bound on private reasoning tokens; pick a value smaller than max_tokens so the model still has room to write the visible answer.
curl https://caicaini.com/v1/messages \
-H "Authorization: Bearer cai_api_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "caicaini/opus",
"max_tokens": 4096,
"thinking": { "type": "enabled", "budget_tokens": 2048 },
"messages": [
{"role":"user","content":"A jug has 12L. Two pours: 5L jug and 7L jug. Show me how to measure exactly 6L."}
]
}'
Response shape
The response is a normal Messages reply with one or more thinking blocks at the start of the content array, followed by the public text block(s).
{
"id": "msg_01H...",
"type": "message",
"role": "assistant",
"content": [
{
"type": "thinking",
"thinking": "Let me work this out step by step. I have 5L and 7L jugs, want exactly 6L. Fill the 7L jug, pour into 5L jug until full — leaves 2L. Empty 5L. Pour the 2L into the 5L. Fill the 7L again, pour into 5L until full (3L moves) — 4L left. Hmm not quite. Let me reconsider..."
},
{
"type": "text",
"text": "Here is one path: 1) Fill the 7L. 2) Pour into the 5L until full — 2L remains in the 7L. 3) Empty the 5L. 4) Move the 2L into the 5L. 5) Fill the 7L again. 6) Pour from the 7L into the 5L until full — that uses 3L, leaving 4L in the 7L. 7) Empty the 5L. 8) Move the 4L to the 5L. 9) Fill the 7L again. 10) Pour into the 5L until full — uses 1L, leaves exactly 6L in the 7L."
}
],
"model": "caicaini/opus",
"stop_reason": "end_turn",
"usage": {
"input_tokens": 47,
"output_tokens": 1830,
"cache_creation_input_tokens": 0,
"cache_read_input_tokens": 0,
"credits_consumed": 612
}
}
What it costs
- Thinking tokens are billed as output tokens at the same per-token rate as the rest of the model's output. They are counted in output_tokens on the response usage.
- usage.credits_consumed is authoritative and includes the cost of thinking.
- Set budget_tokens conservatively. 1024–2048 is plenty for most tasks; bump higher only when you observe the model getting cut off mid-reasoning.
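Because thinking is billed as output, it can help to build the request body in one place and reject a budget that leaves no room for the visible answer. The following is a minimal sketch, not an official client: build_thinking_request is a hypothetical helper name, and the defaults simply mirror the curl example on this page.

```python
def build_thinking_request(prompt, max_tokens=4096, budget_tokens=2048,
                           model="caicaini/opus"):
    """Build a Messages request body with extended thinking enabled.

    Hypothetical helper: the field names come from the curl example,
    the function itself is only an illustration.
    """
    # budget_tokens caps private reasoning, so it has to stay below
    # max_tokens or nothing is left for the visible answer.
    if budget_tokens >= max_tokens:
        raise ValueError(
            f"budget_tokens ({budget_tokens}) must be smaller than "
            f"max_tokens ({max_tokens})"
        )
    return {
        "model": model,
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }
```

Serialize the returned dict as JSON and send it as the POST body, the same way the curl example does.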
Should you show thinking to your users?
Usually no. The reasoning is a private scratchpad — sometimes it includes speculative paths, dead ends, or self-correction that a non-technical user will find confusing. Surface only the final text blocks by default. If you want a "show reasoning" affordance, label it clearly and render the thinking content as a collapsible section.
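One way to keep the scratchpad out of the default view is to split the content array by block type before rendering. This sketch assumes a response that has already been parsed into a dict with the shape shown under Response shape; split_content is a hypothetical name.

```python
def split_content(message):
    """Return (reasoning, answer) from a parsed Messages response dict.

    Thinking blocks carry their text under "thinking" and public blocks
    under "text", as in the sample response on this page.
    """
    blocks = message.get("content", [])
    reasoning = [b["thinking"] for b in blocks if b.get("type") == "thinking"]
    answer = [b["text"] for b in blocks if b.get("type") == "text"]
    return "\n\n".join(reasoning), "\n\n".join(answer)
```

Render the second value by default, and put the first behind a clearly labeled, collapsed "show reasoning" control if you offer one.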
Thinking + streaming
When you stream a thinking-enabled request, the SSE feed emits a content_block_start with type thinking, followed by content_block_delta frames whose delta.type is thinking_delta, then a content_block_stop before the first text block begins. See Streaming for the full event vocabulary.
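A rough sketch of consuming those events: it takes SSE data payloads that have already been parsed into dicts and accumulates reasoning and answer text separately. The event and delta type names content_block_delta and thinking_delta come from the description above. The text_delta type and the delta.thinking and delta.text payload fields are assumptions, so check them against the Streaming page before relying on them.

```python
def collect_stream(events):
    """Accumulate (reasoning, answer) from parsed streaming events.

    events: iterable of dicts, one per SSE data payload.
    """
    thinking_parts, text_parts = [], []
    for event in events:
        # Start/stop frames carry no text in this sketch; only deltas do.
        if event.get("type") != "content_block_delta":
            continue
        delta = event.get("delta", {})
        if delta.get("type") == "thinking_delta":
            # Assumed field name for the reasoning fragment.
            thinking_parts.append(delta.get("thinking", ""))
        elif delta.get("type") == "text_delta":
            # Assumed delta type and field name for public text.
            text_parts.append(delta.get("text", ""))
    return "".join(thinking_parts), "".join(text_parts)
```

In a live UI you would forward each fragment as it arrives instead of joining at the end; the joined form is just easier to inspect.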
Thinking with tools
Tool use and thinking compose. The model thinks, decides which tool to call, emits a tool_use block, you respond with the result, and the next turn may think again. This is the highest-quality configuration for agent loops on hard tasks. See Tools for the full loop.
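The loop described above might be sketched as follows. Almost everything here is an assumption made for illustration: send is a hypothetical callable that posts the message list with thinking enabled and returns the parsed response, the id, name, and input fields on tool_use blocks and the tool_result block shape follow common Messages-style conventions rather than anything stated on this page, and echoing the full assistant turn (thinking blocks included) back into the history is a guess. Treat the Tools page as the source of truth.

```python
def run_agent_loop(send, tools, messages, max_turns=8):
    """Drive a think / call tool / respond loop until the model stops calling tools.

    send:  hypothetical callable, list of messages -> parsed response dict
    tools: dict mapping tool name -> Python callable taking keyword arguments
    """
    for _ in range(max_turns):
        reply = send(messages)
        content = reply["content"]
        # Assumption: the assistant turn is echoed back whole.
        messages.append({"role": "assistant", "content": content})
        calls = [b for b in content if b.get("type") == "tool_use"]
        if not calls:
            return reply  # no tool requested, so this is the final answer
        results = []
        for call in calls:
            output = tools[call["name"]](**call["input"])
            results.append({
                "type": "tool_result",       # assumed block shape
                "tool_use_id": call["id"],
                "content": str(output),
            })
        messages.append({"role": "user", "content": results})
    raise RuntimeError("agent loop did not finish within max_turns")
```

The max_turns cap is a design choice for the sketch: it bounds cost if a wrong tool call starts cascading.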