Rate limits

Each key has a tier that controls requests per minute, input tokens per minute, and concurrent streams. Tiers are based on lifetime top-up; promotion is automatic.

Tiers

Tier	When you are here	RPM	TPM	Concurrent streams
free	New keys before any top-up.	5	20,000	1
pay_as_you_go	After your first confirmed top-up.	60	100,000	5
high_volume	Lifetime top-up of $500 or more.	300	500,000	20
enterprise	Custom limits, granted manually.	custom	custom	custom

Promotion runs every 5 minutes and looks at your confirmed lifetime top-up. There is no manual button: when your confirmed lifetime top-up crosses a threshold, your tier moves up on the next sweep, so allow up to 5 minutes for the change to appear. Need more than high_volume for a launch? Drop us a note via the support links in the footer.
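As a rough illustration of the thresholds in the table, the sketch below maps a confirmed lifetime top-up to a tier name and its documented RPM cap. The function names tier_for_topup and rpm_for_tier are hypothetical, and the sketch assumes whole US dollars and that any positive confirmed top-up counts as the first top-up. The real promotion runs server-side; this only mirrors the published table.

```shell
# Hypothetical helper: map a confirmed lifetime top-up (whole US dollars,
# an assumption of this sketch) to the tier name from the table.
# enterprise is granted manually, so it is never returned here.
tier_for_topup() {
  topup_dollars=$1
  if [ "$topup_dollars" -ge 500 ]; then
    echo "high_volume"
  elif [ "$topup_dollars" -gt 0 ]; then
    echo "pay_as_you_go"
  else
    echo "free"
  fi
}

# Hypothetical helper: print the documented RPM cap for a tier name.
# Prints an empty line for tiers with custom limits.
rpm_for_tier() {
  case "$1" in
    free)          echo 5 ;;
    pay_as_you_go) echo 60 ;;
    high_volume)   echo 300 ;;
    *)             echo "" ;;
  esac
}

tier=$(tier_for_topup 120)
echo "$tier $(rpm_for_tier "$tier")"
```

A client could use a lookup like this to size its worker pool before the first request; the X-RateLimit-Tier header remains the authoritative source once responses start arriving.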

Headers on every response

Every /v1/* response carries the live request-counter values for your key. Use them to throttle client-side instead of waiting for a 429.

X-RateLimit-Tier — the tier that produced these limits.
X-RateLimit-Limit-Requests — your RPM cap.
X-RateLimit-Remaining-Requests — requests left in the current minute.
X-RateLimit-Reset-Requests — seconds until the current minute window resets.
Retry-After — only set on 429. Seconds to sleep before retrying.

example

curl -i https://caicaini.com/v1/messages \
  -H "Authorization: Bearer cai_api_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"caicaini/auto","max_tokens":50,"messages":[{"role":"user","content":"Hi"}]}'

# X-RateLimit-Tier:               pay_as_you_go
# X-RateLimit-Limit-Requests:     60
# X-RateLimit-Remaining-Requests: 59
# X-RateLimit-Reset-Requests:     42         # seconds until the window resets
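One possible way to throttle client-side from these headers is sketched below. It assumes the response headers were saved to a file (for example with curl -D) and that pausing for the reset value once no requests remain is an acceptable policy. The helper names header_value and throttle_seconds are hypothetical.

```shell
# Hypothetical helper: print the value of one header from a saved header
# dump. Names are matched case-insensitively; surrounding whitespace and
# the trailing carriage return are stripped.
header_value() {
  name=$1
  file=$2
  awk -v want="$name" '
    {
      sep = index($0, ":")
      if (sep == 0) next                       # status line or blank line
      key = tolower(substr($0, 1, sep - 1))
      if (key != tolower(want)) next
      value = substr($0, sep + 1)
      gsub(/^[ \t]+|[ \t\r]+$/, "", value)
      print value
      exit
    }
  ' "$file"
}

# Hypothetical helper: print how many seconds to pause before the next
# request. Returns the reset value when no requests remain, otherwise 0.
throttle_seconds() {
  file=$1
  remaining=$(header_value "X-RateLimit-Remaining-Requests" "$file")
  reset=$(header_value "X-RateLimit-Reset-Requests" "$file")
  if [ "${remaining:-1}" -le 0 ]; then
    echo "${reset:-0}"
  else
    echo 0
  fi
}

# Demo with a hand-written header dump (sample values, not real output).
headers_file=$(mktemp)
printf 'HTTP/1.1 200 OK\r\nX-RateLimit-Tier: pay_as_you_go\r\nX-RateLimit-Remaining-Requests: 0\r\nX-RateLimit-Reset-Requests: 42\r\n\r\n' > "$headers_file"
throttle_seconds "$headers_file"   # prints 42
rm -f "$headers_file"
```

In a real loop the result would be passed to sleep before sending the next request, which avoids spending a round trip on a 429.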

When you exceed the limit

A request over the limit returns HTTP 429 with error type rate_limit_error and a Retry-After header giving the number of seconds to wait. The request is rejected before any provider call, so it does not consume credits.

response · 429

{
  "type": "error",
  "error": {
    "type": "rate_limit_error",
    "message": "Requests per minute exceeded for this key. Retry after 12s."
  }
}

Recommended backoff

Wrap calls in a small retry helper. Read Retry-After when it is present, and fall back to exponential backoff with jitter when it is absent. Cap retries at 4 to 6 attempts and the total wait at around 30 seconds.

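# bash retry loop: honors Retry-After on 429, otherwise exponential backoff with jitter
delay=1
for i in 1 2 3 4 5; do
  # -D saves the response headers so Retry-After can be read after the call
  status=$(curl -s -o /tmp/body -D /tmp/headers -w "%{http_code}" \
    -X POST https://caicaini.com/v1/messages \
    -H "Authorization: Bearer cai_api_YOUR_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model":"caicaini/auto","max_tokens":50,"messages":[{"role":"user","content":"Hi"}]}')
  # any status other than 429 is final: print the body and stop retrying
  if [ "$status" != "429" ]; then cat /tmp/body; exit 0; fi
  # prefer the server's Retry-After (seconds); header names are case-insensitive
  retry_after=$(awk 'tolower($1) == "retry-after:" { print $2 + 0; exit }' /tmp/headers)
  # fallback: current backoff delay plus 0 or 1 second of random jitter
  sleep "${retry_after:-$((delay + RANDOM % 2))}"
  delay=$((delay * 2))
done
echo "rate-limited after retries" >&2
exit 1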
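The wait calculation recommended above can also be factored into a small function, which makes it easier to check in isolation. The sketch below is one possible shape, not a prescribed implementation: retry_wait and MAX_DELAY are hypothetical names, and the jitter is a full-jitter variant that picks a whole number of seconds between 1 and the exponential ceiling.

```shell
# Assumed cap on a single wait, in seconds. With 5 attempts the ceilings are
# 1, 2, 4, 8, 16, so the total wait stays at or below 31 seconds.
MAX_DELAY=16

# Hypothetical helper: print the seconds to wait before retry number $1
# (1-based). $2 is the Retry-After value, or empty when the header is absent.
retry_wait() {
  attempt=$1
  retry_after=$2
  # The server's hint wins when it is present.
  if [ -n "$retry_after" ]; then
    echo "$retry_after"
    return
  fi
  # Exponential ceiling: 1, 2, 4, 8, ... seconds, capped at MAX_DELAY.
  ceiling=1
  i=1
  while [ "$i" -lt "$attempt" ]; do
    ceiling=$((ceiling * 2))
    i=$((i + 1))
  done
  if [ "$ceiling" -gt "$MAX_DELAY" ]; then
    ceiling=$MAX_DELAY
  fi
  # Full jitter: uniform whole seconds in [1, ceiling]. The seed mixes the
  # clock-based seed with the shell PID so clients started in the same
  # second do not all pick the same value.
  awk -v n="$ceiling" -v pid="$$" \
    'BEGIN { srand(); srand(srand() + pid); print 1 + int(rand() * n) }'
}

echo "retry 3 with Retry-After 12: $(retry_wait 3 12)s"
echo "retry 3 without Retry-After: $(retry_wait 3 "")s"
```

Spreading waits across the whole interval keeps many clients that were limited at the same moment from retrying in lockstep.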

Insufficient credits is a separate signal

Running out of credits returns HTTP 402 with error type insufficient_quota. Do not retry: the request will not succeed until your balance goes up. Catch the 402 explicitly, alert your operator, and pause the job. See Errors for the full table of error types.
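A minimal sketch of keeping the two signals apart in a batch script follows. The function name action_for_status and the action labels (ok, retry, pause, fail) are hypothetical and chosen only for this illustration; they are not part of the API.

```shell
# Hypothetical helper: map an HTTP status code to the action a job should take.
action_for_status() {
  case "$1" in
    429) echo "retry" ;;   # rate_limit_error: wait, then try again
    402) echo "pause" ;;   # insufficient_quota: stop until the balance goes up
    2??) echo "ok" ;;      # success
    *)   echo "fail" ;;    # anything else: consult the Errors page
  esac
}

for status in 200 429 402 500; do
  echo "$status -> $(action_for_status "$status")"
done
```

Branching on the status before entering any retry loop ensures a 402 alerts the operator instead of burning through the retry budget.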
