When autonomous agents make thousands of API calls per minute, spend governance can't be an afterthought bolted onto a dashboard. It has to be in the request path — fast enough that agents never notice, thorough enough that nothing slips through.
Every agent API call follows the same path.
The SDK adds X-Reins-Agent and X-Reins-Team headers for identity resolution. Zero code changes to your prompt logic. Supports Python, Node.js, and REST.
The gateway resolves tenant_id, team_id, and agent_id from headers. Key lookup uses a 60s LRU cache. Invalid keys return 401 in <1ms.
Rules are loaded from the policy_rules table in priority order. Supported conditions: budget threshold, model allowlist, time window, token limit, vendor restriction. Rules are cached in memory with a 5s TTL. Evaluation short-circuits on the first block action.
The vendor key is decrypted from service_connections (AES-256-GCM) and injected as the Authorization header. The request body is forwarded unmodified. Streaming is supported via SSE passthrough. Timeout: 120s default, configurable per policy.
Token counts are read from the usage field in API responses. The pricing table covers 40+ models across OpenAI, Anthropic, Google, Mistral, and Cohere.
Each request writes one transactions row: agent, vendor, model, tokens, cost, latency, policy evaluation result, timestamp. The write is async (non-blocking). Retention: 90 days hot, 2 years cold storage. Query performance: <50ms for 30-day windows.
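The key lookup cache (60s) and the rule cache (5s TTL) mentioned above are both time-bounded in-memory caches. A minimal sketch of one way to build such a cache follows. The class name `TtlLruCache`, its options, and the injectable `now` clock are assumptions made for illustration; the gateway's actual cache is not shown in this document.

```javascript
// Hypothetical sketch: LRU cache with a per-entry TTL, built on Map insertion order.
class TtlLruCache {
  constructor({ maxEntries, ttlMs, now = () => Date.now() }) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map(); // key -> { value, expiresAt }
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: treat as a miss
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
  }
}
```

A key cache in this style would be created with `ttlMs: 60000` and a rule cache with `ttlMs: 5000`; a miss falls through to the database lookup.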
Four systems, each with a single job. Loose coupling means a failure in one doesn't cascade to the others.
Request interception, routing, and transparent proxying
The gateway sits between your agent and the vendor API. It intercepts outbound requests, runs them through the policy pipeline, and forwards to the target vendor if all checks pass.
Header injection: The SDK adds X-Reins-Agent, X-Reins-Team, and X-Reins-Key headers. The gateway strips these before forwarding to the vendor — your vendor never sees Reins metadata.
Retry logic: On vendor 5xx, the gateway retries once with exponential backoff (100ms, then 500ms). On second failure, it returns the vendor error to the agent with an X-Reins-Retry-Count header so the agent can decide whether to retry.
```javascript
// Simplified gateway request handler
async function handleRequest(req, res) {
  const identity = resolveIdentity(req.headers);
  if (!identity) {
    return res.status(401).json({ error: 'invalid_key' });
  }

  // Rate limit check (token bucket). Assumes checkRateLimit resolves to
  // { allowed, retryAfter } so retryAfter is readable when the check fails.
  const rateLimit = await checkRateLimit(identity);
  if (!rateLimit.allowed) {
    return res.status(429).json({
      error: 'rate_limited',
      retry_after: rateLimit.retryAfter
    });
  }

  // Policy evaluation
  const decision = await evaluatePolicies(identity, req.body);
  if (decision.action === 'block') {
    await recordBlock(identity, decision);
    return res.status(403).json({
      error: 'policy_blocked',
      rule: decision.rule_name,
      reason: decision.reason
    });
  }

  // Forward to vendor
  const vendorRes = await proxyToVendor(identity, req);

  // Async: record transaction + update budget (deliberately not awaited)
  recordTransaction(identity, vendorRes)
    .catch(err => logError(err));

  return pipeResponse(vendorRes, res);
}
```
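The retry behavior described above is not part of the handler, so here is a rough sketch of how it might look. The function name `fetchWithRetry`, the `send` and `sleep` parameters, and the response shape are all assumptions. The text mentions a single retry but lists two delays, so the delay schedule is left configurable here.

```javascript
// Hypothetical sketch of vendor retry with a configurable backoff schedule.
// `send` performs one vendor call and resolves to { status, headers, body }.
async function fetchWithRetry(send, { delaysMs = [100], sleep = defaultSleep } = {}) {
  let response = await send();
  let retries = 0;
  for (const delay of delaysMs) {
    if (response.status < 500) break; // only 5xx responses are retried
    await sleep(delay);
    response = await send();
    retries += 1;
  }
  // Surface the retry count so the caller can decide whether to retry again.
  return {
    ...response,
    headers: { ...response.headers, 'X-Reins-Retry-Count': String(retries) }
  };
}

function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```

Passing `delaysMs: [100, 500]` allows two retries; the default of `[100]` allows one.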
Rule matching, priority ordering, and action execution
Policies are evaluated on every request. Each policy contains ordered rules. Rules have conditions (what to match) and actions (what to do when matched).
Evaluation order: Policies are sorted by priority (lower number = higher priority). Within a policy, rules execute top-to-bottom. The engine short-circuits on the first block action — remaining rules are skipped.
Condition types: Budget threshold (rolling window), model allowlist/denylist, time-of-day window (cron syntax), per-request token limit, vendor restriction, agent-specific overrides.
Actions: allow, block, alert, downgrade (switch to a cheaper model).
```javascript
// Policy evaluation with short-circuit
async function evaluatePolicies(identity, req) {
  const rules = await getRulesForTenant(identity.tenant_id);

  for (const rule of rules) {
    const ctx = {
      agent: identity.agent_id,
      team: identity.team_id,
      model: req.model,
      vendor: detectVendor(req),
      spend: await getCurrentSpend(identity, rule.window),
      time: new Date()
    };

    if (matchesConditions(rule.conditions, ctx)) {
      if (rule.action === 'block') {
        // Short-circuit: remaining rules are skipped
        return {
          action: 'block',
          rule_name: rule.name,
          reason: formatReason(rule, ctx)
        };
      }
      if (rule.action === 'alert') {
        emitAlert(identity, rule, ctx);
      }
      if (rule.action === 'downgrade') {
        req.model = rule.downgrade_to;
      }
    }
  }

  return { action: 'allow' };
}
```
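The evaluator calls `matchesConditions` without showing it. One possible shape is sketched below. The condition objects (`type`, `limit`, `models`, `max`, `vendors`) and the `maxTokens` context field are hypothetical, since the rule schema is not given in this document.

```javascript
// Hypothetical condition shapes; the real rule schema is not shown in the text.
const conditionCheckers = {
  // Matches when spend in the rule's window has reached the threshold.
  budget_threshold: (cond, ctx) => ctx.spend >= cond.limit,
  // Matches when the requested model is NOT on the allowlist.
  model_allowlist: (cond, ctx) => !cond.models.includes(ctx.model),
  // Matches when the requested model IS on the denylist.
  model_denylist: (cond, ctx) => cond.models.includes(ctx.model),
  // Matches when the request asks for more tokens than permitted.
  token_limit: (cond, ctx) => ctx.maxTokens > cond.max,
  // Matches when the vendor is one of the restricted vendors.
  vendor_restriction: (cond, ctx) => cond.vendors.includes(ctx.vendor)
};

// A rule matches only if every one of its conditions matches.
function matchesConditions(conditions, ctx) {
  return conditions.every((cond) => {
    const check = conditionCheckers[cond.type];
    if (!check) throw new Error(`unknown condition type: ${cond.type}`);
    return check(cond, ctx);
  });
}
```

In this sketch, a rule with an empty condition list matches every request, so it acts as a catch-all.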
Real-time cost calculation and multi-vendor aggregation
Every API response carries token usage data. The tracker extracts it, looks up the per-token price for the specific model, and computes cost. This happens on every request — not as a batch job.
Pricing table: Covers 40+ models across OpenAI, Anthropic, Google, Mistral, and Cohere. Updated within 24 hours of vendor price changes. Input and output tokens are priced separately.
Aggregation: Costs roll up three ways — per-agent, per-team, per-organization. Each level has its own rolling budget window. The policy engine reads these aggregations in real time for budget threshold rules.
Streaming: For SSE responses, token counts are estimated incrementally as chunks arrive using a lightweight tokenizer. Final count is reconciled when the stream completes.
```javascript
// Cost calculation from API response
function calculateCost(vendorResponse) {
  const usage = vendorResponse.usage;
  const model = vendorResponse.model;
  const pricing = getPricing(model);

  // Input and output tokens are priced separately
  const inputCost = usage.prompt_tokens * pricing.input_per_token;
  const outputCost = usage.completion_tokens * pricing.output_per_token;

  return {
    total: inputCost + outputCost,
    breakdown: {
      input_tokens: usage.prompt_tokens,
      output_tokens: usage.completion_tokens,
      input_cost: inputCost,
      output_cost: outputCost,
      model: model,
      vendor: pricing.vendor,
      timestamp: new Date()
    }
  };
}

// Rolling spend aggregation
async function updateSpend(identity, cost) {
  await db.query(`
    INSERT INTO transactions
      (agent_id, team_id, org_id, amount, vendor, model, created_at)
    VALUES ($1, $2, $3, $4, $5, $6, NOW())
  `, [
    identity.agent_id,
    identity.team_id,
    identity.org_id,
    cost.total,
    cost.breakdown.vendor,
    cost.breakdown.model
  ]);
}
```
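The three-level roll-up with rolling budget windows can be pictured with a small in-memory model. This is only an illustration under assumptions: the class name `RollingSpend`, the scope key format, and keeping events in memory rather than querying the transactions table are all hypothetical.

```javascript
// Hypothetical in-memory rolling spend tracker keyed by scope ("agent:<id>", etc.).
class RollingSpend {
  constructor(windowMs) {
    this.windowMs = windowMs;
    this.events = new Map(); // scope key -> [{ at, amount }], oldest first
  }

  // Record one cost against the agent, team, and organization scopes.
  record(identity, amount, at) {
    for (const key of scopeKeys(identity)) {
      if (!this.events.has(key)) this.events.set(key, []);
      this.events.get(key).push({ at, amount });
    }
  }

  // Sum of amounts recorded within the window ending at `now`.
  total(key, now) {
    const list = this.events.get(key) || [];
    const cutoff = now - this.windowMs;
    // Drop events that have aged out of the window.
    while (list.length > 0 && list[0].at <= cutoff) list.shift();
    return list.reduce((sum, e) => sum + e.amount, 0);
  }
}

function scopeKeys(identity) {
  return [
    `agent:${identity.agent_id}`,
    `team:${identity.team_id}`,
    `org:${identity.org_id}`
  ];
}
```

A budget threshold rule would then compare `total('team:<id>', now)` against its limit for the rule's window.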
Immutable transaction records, write path, and query performance
Every request produces an immutable audit record. "Immutable" means no UPDATE or DELETE — records are append-only. This is non-negotiable for compliance and debugging.
What gets captured: Request metadata (agent, team, vendor, model), token counts (input/output), computed cost, latency breakdown (auth, policy eval, vendor round-trip), policy evaluation result (which rules fired, what action was taken), and timestamp.
Write path: Audit writes are asynchronous — they happen after the response is returned to the agent. This means audit writes add zero blocking latency. Writes use a batched insert (up to 100 records per batch, flushed every 500ms or on batch full).
What is NOT captured: Prompt content, completion content, or any request/response body data. Reins never reads, stores, or logs your actual AI conversations.
```javascript
// Batched async audit writer
class AuditWriter {
  constructor() {
    this.buffer = [];
    // Flush on a 500ms timer
    this.flushInterval = setInterval(() => this.flush(), 500);
  }

  record(entry) {
    this.buffer.push({
      ...entry,
      id: generateULID(),
      created_at: new Date()
    });
    // Flush early when the batch is full
    if (this.buffer.length >= 100) {
      this.flush();
    }
  }

  async flush() {
    if (this.buffer.length === 0) return;
    const batch = this.buffer.splice(0, 100);
    try {
      await db.batchInsert('transactions', batch);
    } catch (err) {
      // Re-queue on failure
      this.buffer.unshift(...batch);
      metrics.increment('audit.write_failure');
    }
  }
}
```
From agent request to vendor response. Total overhead: 2-8ms.
The SDK changes the base URL from api.openai.com to gateway.reins.dev. Identity headers are injected. Request body is untouched.
Drop-in wrappers for your existing AI client. Zero changes to your prompt logic.
```python
# Python: OpenAI wrapper
# pip install reins-sdk openai
from reins import ReinsClient
from openai import OpenAI

# Initialize Reins (wraps your existing OpenAI client)
reins = ReinsClient(
    api_key="rns_live_abc123...",
    agent_id="data-extraction-agent",
    team="ml-team"
)

# Your existing OpenAI code, unchanged
client = reins.wrap(OpenAI(api_key="sk-..."))

try:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Analyze this dataset"}]
    )
    print(response.choices[0].message.content)
except reins.PolicyBlockedError as e:
    # Budget exceeded or model restricted
    print(f"Blocked: {e.rule_name} - {e.reason}")
    # e.rule_name = "daily-budget-limit"
    # e.reason = "Team ml-team exceeded $50/day budget"
except reins.RateLimitedError as e:
    print(f"Rate limited. Retry after {e.retry_after}s")
```
```javascript
// Node.js: Express middleware pattern
// npm install @reins/sdk openai
import { Reins } from '@reins/sdk';
import OpenAI from 'openai';

const reins = new Reins({
  apiKey: process.env.REINS_API_KEY,
  agentId: 'customer-support-bot',
  team: 'support',
  // Optional: fail-open if Reins is unreachable
  failOpen: true,
});

const openai = reins.wrap(new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
}));

// Use exactly like the standard OpenAI client
const completion = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [{ role: 'user', content: prompt }],
  stream: true, // Streaming works transparently
});

// Access Reins metadata after each call
const meta = reins.getLastTransaction();
console.log(meta.cost);    // 0.0023
console.log(meta.model);   // "gpt-4o-mini"
console.log(meta.latency); // 342 (ms, vendor only)
```
```python
# LangChain integration
# pip install reins-sdk langchain-openai
from reins.integrations.langchain import ReinsCallbackHandler
from langchain.agents import AgentExecutor
from langchain_openai import ChatOpenAI

# Reins as a LangChain callback handler
reins_handler = ReinsCallbackHandler(
    api_key="rns_live_abc123...",
    agent_id="research-agent",
    team="research",
    # Pre-check budget before LLM call
    enforce_policy=True
)

llm = ChatOpenAI(
    model="gpt-4o",
    callbacks=[reins_handler]
)

# Works with chains, agents, and tools
agent = AgentExecutor(
    agent=my_agent,
    tools=my_tools,
    callbacks=[reins_handler],  # Tracks every LLM call in the chain
    max_iterations=10
)

result = agent.invoke({"input": "Research quantum computing trends"})

# After execution: full cost breakdown by chain step
summary = reins_handler.get_run_summary()
print(f"Total cost: ${summary.total_cost:.4f}")
print(f"LLM calls: {summary.call_count}")
print(f"Tokens used: {summary.total_tokens}")
```
```shell
# Direct REST: no SDK required
# Just change the base URL and add headers
curl -X POST https://gateway.reins.dev/v1/chat/completions \
  -H "Authorization: Bearer rns_live_abc123..." \
  -H "X-Reins-Agent: my-agent" \
  -H "X-Reins-Team: my-team" \
  -H "X-Vendor-Key: sk-..." \
  -H "X-Vendor: openai" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'

# Response includes Reins headers:
# X-Reins-Cost: 0.0012
# X-Reins-Tokens-In: 8
# X-Reins-Tokens-Out: 42
# X-Reins-Transaction-Id: txn_01H...
# X-Reins-Budget-Remaining: 47.32
# X-Reins-Latency-Overhead: 3ms

# Error responses for policy blocks:
# HTTP 403
# {
#   "error": "policy_blocked",
#   "rule": "daily-budget-limit",
#   "reason": "Team exceeded $50/day budget",
#   "budget_used": 50.12,
#   "budget_limit": 50.00
# }
```
How Reins handles your most sensitive data — vendor API keys and agent credentials.
Vendor API keys are encrypted at rest using AES-256-GCM with per-tenant encryption keys derived via HKDF. Keys are decrypted in memory only for the duration of a proxied request. Plaintext keys never touch disk, logs, or any persistent store.
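To make the scheme above concrete, here is a minimal sketch using Node's built-in crypto module. The function names, the master secret input, the `tenant:<id>` info string, and the all-zero salt are assumptions made for illustration; how Reins actually manages its key material is not described here.

```javascript
const { hkdfSync, randomBytes, createCipheriv, createDecipheriv } = require('node:crypto');

// Derive a 32-byte per-tenant key from a master secret using HKDF-SHA256.
function deriveTenantKey(masterSecret, tenantId) {
  const salt = Buffer.alloc(32); // fixed zero salt, for the sketch only
  return Buffer.from(hkdfSync('sha256', masterSecret, salt, `tenant:${tenantId}`, 32));
}

function encryptVendorKey(tenantKey, plaintext) {
  const iv = randomBytes(12); // 96-bit nonce, fresh for every encryption
  const cipher = createCipheriv('aes-256-gcm', tenantKey, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  return { iv, ciphertext, tag: cipher.getAuthTag() };
}

function decryptVendorKey(tenantKey, { iv, ciphertext, tag }) {
  const decipher = createDecipheriv('aes-256-gcm', tenantKey, iv);
  decipher.setAuthTag(tag);
  // final() throws if the key is wrong or the ciphertext or tag was altered.
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString('utf8');
}
```

Because GCM is authenticated, a record encrypted for one tenant fails to decrypt under another tenant's derived key instead of returning garbage.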
Reins API keys use HMAC-SHA256 for validation. Keys are stored as hashed values — we cannot retrieve your key after creation. JWT tokens for dashboard access use RS256 with 15-minute expiry and rotating signing keys.
Request and response bodies are proxied in streaming mode. Reins never buffers, stores, or logs prompt content or completions. Only metadata (model, tokens, cost, latency) is extracted and recorded.
Reins API keys can be rotated with a 24-hour overlap window — both old and new keys remain valid during rotation. Vendor API keys can be updated in the dashboard; re-encryption happens synchronously.
Audit log records are append-only. No UPDATE or DELETE operations are permitted on the transactions table — not through the API, not through the dashboard, not through support. Records are cryptographically chained for tamper detection.
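Cryptographic chaining is commonly done by storing, with each record, a hash that covers both the record and the previous record's hash. The sketch below shows that general technique. The function names, the genesis value, and the use of `JSON.stringify` as the serialization are assumptions, not a description of the Reins storage format.

```javascript
const { createHash } = require('node:crypto');

const GENESIS_HASH = '0'.repeat(64);

// Hash of a record bound to the hash of the record before it.
// Assumes a stable serialization; JSON.stringify depends on key order.
function chainHash(prevHash, record) {
  return createHash('sha256')
    .update(prevHash)
    .update(JSON.stringify(record))
    .digest('hex');
}

// Appends an entry linked to the previous one; earlier entries are not modified.
function appendRecord(log, record) {
  const prevHash = log.length > 0 ? log[log.length - 1].hash : GENESIS_HASH;
  log.push({ record, prevHash, hash: chainHash(prevHash, record) });
  return log;
}

// Returns the index of the first entry that fails verification, or -1 if intact.
function verifyChain(log) {
  let prevHash = GENESIS_HASH;
  for (let i = 0; i < log.length; i++) {
    const entry = log[i];
    if (entry.prevHash !== prevHash || entry.hash !== chainHash(prevHash, entry.record)) {
      return i;
    }
    prevHash = entry.hash;
  }
  return -1;
}
```

Editing any stored record changes its hash, which breaks the link to every later entry, so verification pinpoints where tampering began.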
All traffic is TLS 1.3. The gateway enforces HSTS. Vendor API calls are made over TLS with certificate pinning for major vendors (OpenAI, Anthropic, Google). Internal service communication uses mutual TLS.
Current offering is SaaS; self-hosted and on-prem are on the roadmap.
| Feature | SaaS | Self-Hosted | On-Prem |
|---|---|---|---|
| Availability | Available now | Q3 2026 | Enterprise |
| Setup time | 5 minutes | ~1 hour | Custom |
| Data residency | US-East (AWS) | Your cloud | Your datacenter |
| Auto-updates | Yes | Opt-in | Manual |
| SSO / SAML | Yes | Yes | Yes |
| SLA | 99.9% | Self-managed | Custom |
| Audit log export | API + CSV | Direct DB | Direct DB |
| VPC peering | Enterprise | Native | Native |
| Custom domains | Yes | Yes | Yes |
Measured on production traffic. Latency overhead is the time Reins adds — vendor latency is excluded.
| Operation | P50 | P95 | P99 |
|---|---|---|---|
| Auth validation | 0.3ms | 0.8ms | 1.2ms |
| Rate limit check | 0.1ms | 0.2ms | 0.5ms |
| Policy evaluation | 1.2ms | 2.8ms | 4.1ms |
| Budget lookup | 0.5ms | 1.5ms | 3.0ms |
| Vendor proxy setup | 0.2ms | 0.5ms | 0.8ms |
| Response parsing | 0.3ms | 0.7ms | 1.5ms |
| Audit write (async) | 0ms | 0ms | 0ms |
| Total overhead | 2.6ms | 6.5ms | 11.1ms |
What breaks, how Reins responds, and what you experience.
Reins adds 2-8ms of latency per request. Auth validation takes under 1ms, policy evaluation 1-3ms, and the audit write is asynchronous, so it adds zero blocking latency. For context, a typical GPT-4o call takes 500-3000ms, so Reins overhead is usually around 1% of total round-trip time or less.
By default, Reins operates in fail-open mode. If the gateway is unreachable, the SDK falls back to calling the vendor API directly. Your agents keep working — they just aren't monitored until the gateway recovers. You can configure fail-closed mode for environments where unmonitored calls are unacceptable.
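The fail-open and fail-closed modes can be sketched on the SDK side as follows. The names `callWithFallback` and `GatewayUnreachableError`, and the `viaGateway` and `direct` callbacks, are hypothetical; the real SDK only exposes the `failOpen` option as far as this document shows.

```javascript
// Hypothetical sketch of SDK-side fail-open / fail-closed behavior.
// `viaGateway` and `direct` are caller-supplied async functions.
class GatewayUnreachableError extends Error {}

async function callWithFallback({ viaGateway, direct, failOpen = true }) {
  try {
    return { response: await viaGateway(), monitored: true };
  } catch (err) {
    // Policy blocks and other gateway responses must not be bypassed;
    // fall back only when the gateway itself could not be reached.
    if (!(err instanceof GatewayUnreachableError) || !failOpen) throw err;
    return { response: await direct(), monitored: false };
  }
}
```

The important design choice is that only connectivity failures trigger the fallback. A 403 policy block from a healthy gateway propagates to the caller in both modes.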
Reins maintains a pricing table for every supported vendor and model (40+ models). Token counts are extracted from the usage field in API responses. Costs are calculated per-request and aggregated per-agent, per-team, and per-organization in real time. For streaming responses, a lightweight tokenizer counts tokens incrementally.
Self-hosted deployment is on our roadmap for Q3 2026. Currently Reins runs as managed SaaS. For enterprises with strict data residency or compliance requirements, contact us about dedicated tenancy with data isolation guarantees.
No. Reins never reads, stores, or logs prompt content or AI completions. The gateway inspects request metadata (model, token count, vendor) for policy evaluation and cost tracking. Request and response bodies are proxied as opaque streams — we cannot access them even if we wanted to.
The gateway handles 10,000+ requests per second per tenant with horizontal auto-scaling. Policy evaluation uses an in-memory rule cache with a 5-second TTL. Rate limiting uses a token bucket algorithm with in-memory state. The bottleneck in practice is always the vendor API, not Reins.
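For readers unfamiliar with the token bucket algorithm mentioned above, a minimal in-memory version looks like this. The class name, the `capacity` and `refillPerSec` options, the `{ allowed, retryAfter }` return shape, and the injectable clock are assumptions chosen for the sketch.

```javascript
// Hypothetical in-memory token bucket; capacity and refill rate are illustrative.
class TokenBucket {
  constructor({ capacity, refillPerSec, now = () => Date.now() }) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.now = now;
    this.tokens = capacity; // start full
    this.lastRefill = now();
  }

  // Returns { allowed, retryAfter }, with retryAfter in seconds.
  take() {
    const t = this.now();
    const elapsedSec = (t - this.lastRefill) / 1000;
    // Refill in proportion to elapsed time, capped at capacity.
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = t;

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return { allowed: true, retryAfter: 0 };
    }
    // Time until one full token is available again.
    return { allowed: false, retryAfter: (1 - this.tokens) / this.refillPerSec };
  }
}
```

Capacity sets the burst size an agent may send at once, while the refill rate sets the sustained request rate.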
Reins proxies SSE streams transparently — the first byte reaches your agent with zero additional buffering. Token counting happens incrementally as chunks arrive, updating the running cost without waiting for the stream to complete. Stream interruptions are recorded as partial transactions in the audit log.
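Incremental counting with a final reconciliation step could be structured roughly like this. The class name `StreamingCostMeter` and its methods are hypothetical, and the four-characters-per-token estimate is a crude stand-in for whatever lightweight tokenizer is actually used.

```javascript
// Hypothetical sketch: estimate output tokens while a stream is in flight,
// then replace the estimate with the vendor-reported count when it arrives.
class StreamingCostMeter {
  constructor({ outputPricePerToken, estimateTokens = roughTokenEstimate }) {
    this.outputPricePerToken = outputPricePerToken;
    this.estimateTokens = estimateTokens;
    this.outputTokens = 0;
    this.reconciled = false;
  }

  // Called for each text chunk extracted from an SSE event.
  onChunk(text) {
    this.outputTokens += this.estimateTokens(text);
  }

  // Called once at end of stream if the vendor reports final usage.
  reconcile(reportedOutputTokens) {
    this.outputTokens = reportedOutputTokens;
    this.reconciled = true;
  }

  get runningCost() {
    return this.outputTokens * this.outputPricePerToken;
  }
}

// Crude stand-in for a real tokenizer: roughly four characters per token.
function roughTokenEstimate(text) {
  return Math.ceil(text.length / 4);
}
```

If a stream is interrupted before `reconcile` runs, the meter still holds the estimate accumulated so far, which is what a partial transaction would record.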
Every policy change creates a versioned snapshot. You can view the diff between versions in the dashboard and roll back to any previous version with one click. Active rollback takes effect within the 5-second cache window. Rollback history is immutable and auditable.