Architecture Deep Dive — Reins | AI Agent Spend Governance

System Overview

How requests flow through Reins

Every agent API call follows the same path.

                            REINS GATEWAY
                 ╔══════════════════════════════════╗
                 ║                                  ║
┌─────────┐      ║  ┌──────┐  ┌────────┐  ┌──────┐  ║      ┌──────────┐
│  Agent  │──────║──│ Auth │──│ Policy │──│ Proxy│──║──────│  Vendor  │
│   SDK   │◀─────║──│      │  │ Engine │  │      │──║──────│   API    │
└─────────┘      ║  └──┬───┘  └───┬────┘  └──┬───┘  ║      └──────────┘
                 ║     │          │          │      ║
                 ║     ▼          ▼          ▼      ║
                 ║  ┌──────┐  ┌────────┐  ┌──────┐  ║
                 ║  │ Rate │  │ Budget │  │Audit │  ║
                 ║  │Limit │  │ Track  │  │ Log  │  ║
                 ║  └──────┘  └────────┘  └──────┘  ║
                 ║                                  ║
                 ╚══════════════════════════════════╝

Agent SDK

Client-side interceptor

Thin wrapper around your existing OpenAI/Anthropic/etc. client. Redirects base URL to the Reins gateway. Adds X-Reins-Agent and X-Reins-Team headers for identity resolution. Zero code changes to your prompt logic. Supports Python, Node.js, and REST.

Auth Layer

Identity + key validation

Validates Reins API key via HMAC comparison against hashed keys in PostgreSQL. Resolves tenant_id, team_id, agent_id from headers. Key lookup uses a 60s LRU cache. Invalid keys return 401 in <1ms.
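The key check described above can be sketched roughly as follows. This is a minimal illustration, not the Reins implementation: `SERVER_SECRET`, `KeyCache`, `hashKey`, and the `storedKeys` array (standing in for the PostgreSQL table) are hypothetical names, and the cache here only expires entries by age, with LRU eviction left out.

```javascript
const crypto = require('node:crypto');

// Hypothetical server-side secret; a real deployment would load it from a secret store.
const SERVER_SECRET = 'demo-secret';

// HMAC-SHA256 digest of an API key; only this digest is stored.
function hashKey(apiKey) {
  return crypto.createHmac('sha256', SERVER_SECRET).update(apiKey).digest();
}

// Constant-time digest comparison (timingSafeEqual requires equal lengths).
function digestsMatch(a, b) {
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}

// TTL cache keyed by digest; LRU eviction is omitted for brevity.
class KeyCache {
  constructor(ttlMs = 60_000, now = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map();
  }
  get(digestHex) {
    const hit = this.entries.get(digestHex);
    if (!hit) return undefined;
    if (this.now() - hit.at > this.ttlMs) {
      this.entries.delete(digestHex);
      return undefined;
    }
    return hit.identity;
  }
  set(digestHex, identity) {
    this.entries.set(digestHex, { identity, at: this.now() });
  }
}

// storedKeys stands in for the PostgreSQL table: [{ digest: Buffer, identity }].
function resolveIdentity(apiKey, storedKeys, cache) {
  const digest = hashKey(apiKey);
  const digestHex = digest.toString('hex');
  const cached = cache.get(digestHex);
  if (cached) return cached;
  const row = storedKeys.find((r) => digestsMatch(r.digest, digest));
  if (!row) return null; // caller responds with 401
  cache.set(digestHex, row.identity);
  return row.identity;
}
```

Injecting the clock (`now`) keeps the expiry logic testable without waiting 60 seconds.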

Policy Engine

Rule evaluation pipeline

Evaluates rules from the policy_rules table in priority order. Supports conditions: budget threshold, model allowlist, time window, token limit, vendor restriction. Rules are cached in-memory with a 5s TTL. Evaluation short-circuits on first block action.

Vendor Proxy

Transparent request forwarding

Decrypts the vendor API key from service_connections (AES-256-GCM) and injects it as the Authorization header. Forwards the request body unmodified, except when a downgrade rule rewrites the model field. Supports streaming via SSE passthrough. Timeout: 120s default, configurable per-policy.

Budget Tracker

Real-time cost aggregation

Maintains running spend totals per agent, team, and org in PostgreSQL with 1-second granularity. Token counts extracted from usage field in API responses. Pricing table covers 40+ models across OpenAI, Anthropic, Google, Mistral, Cohere.

Audit Log

Immutable transaction record

Every request produces an immutable transactions row: agent, vendor, model, tokens, cost, latency, policy evaluation result, timestamp. Write is async (non-blocking). Retention: 90 days hot, 2 years cold storage. Query performance: <50ms for 30-day windows.

Core Components

Under the hood

Four systems, each with a single job. Loose coupling means a failure in one doesn't cascade to the others.

Gateway Layer

Request interception, routing, and transparent proxying

The gateway sits between your agent and the vendor API. It intercepts outbound requests, runs them through the policy pipeline, and forwards to the target vendor if all checks pass.

Header injection: The SDK adds X-Reins-Agent, X-Reins-Team, and X-Reins-Key headers. The gateway strips these before forwarding to the vendor — your vendor never sees Reins metadata.

Retry logic: On vendor 5xx, the gateway retries once with exponential backoff (100ms, then 500ms). On second failure, it returns the vendor error to the agent with an X-Reins-Retry-Count header so the agent can decide whether to retry.
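A rough sketch of this retry behaviour, for illustration only. `forwardWithRetry`, `send`, and the `delaysMs` option are hypothetical names. The default of a single 100ms delay is an assumption based on "retries once"; the 500ms figure would only come into play if a second delay were configured.

```javascript
// Default sleep; tests can inject a fake to avoid real waiting.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// send() performs one vendor call and resolves to an object with a numeric status.
async function forwardWithRetry(send, { delaysMs = [100], sleep = defaultSleep } = {}) {
  let res = await send();
  let retryCount = 0;
  for (const delay of delaysMs) {
    if (res.status < 500) break; // only vendor 5xx responses are retried
    await sleep(delay);
    res = await send();
    retryCount += 1;
  }
  // The caller can surface retryCount, for example in an X-Reins-Retry-Count header.
  return { res, retryCount };
}
```

Returning the retry count alongside the final response lets the caller report it without re-deriving it from logs.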

p50: 2ms | p95: 5ms | p99: 8ms

Gateway down: SDK falls back to direct vendor call (fail-open).

Vendor timeout: 120s default, configurable. Returns 504 with original vendor headers.

gateway/intercept.js

// Simplified gateway request handler
async function handleRequest(req, res) {
  const identity = resolveIdentity(req.headers);
  if (!identity) return res.status(401).json({
    error: 'invalid_key'
  });

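  // Rate limit check (token bucket). checkRateLimit is assumed to resolve
  // to { allowed, retryAfter } so the 429 body can report the wait time.
  const limit = await checkRateLimit(identity);
  if (!limit.allowed) return res.status(429).json({
    error: 'rate_limited',
    retry_after: limit.retryAfter
  });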

  // Policy evaluation
  const decision = await evaluatePolicies(
    identity, req.body
  );

  if (decision.action === 'block') {
    await recordBlock(identity, decision);
    return res.status(403).json({
      error: 'policy_blocked',
      rule: decision.rule_name,
      reason: decision.reason
    });
  }

  // Forward to vendor
  const vendorRes = await proxyToVendor(
    identity, req
  );

  // Async: record transaction + update budget
  recordTransaction(identity, vendorRes)
    .catch(err => logError(err));

  return pipeResponse(vendorRes, res);
}

Policy Engine

Rule matching, priority ordering, and action execution

Policies are evaluated on every request. Each policy contains ordered rules. Rules have conditions (what to match) and actions (what to do when matched).

Evaluation order: Policies are sorted by priority (lower number = higher priority). Within a policy, rules execute top-to-bottom. The engine short-circuits on the first block action — remaining rules are skipped.

Condition types: Budget threshold (rolling window), model allowlist/denylist, time-of-day window (cron syntax), per-request token limit, vendor restriction, agent-specific overrides.

Actions: allow, block, alert, downgrade (switch to a cheaper model).
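To make the evaluation order, condition types, and actions concrete, here is a small in-memory sketch. The rule shape and the condition type names (`budget_threshold`, `model_denylist`, and so on) are illustrative assumptions, not the Reins schema. Each check returns true when the rule should fire, so an allowlist condition matches when the model is not on the list.

```javascript
// Hypothetical condition checks; each returns true when the rule should fire.
const checks = {
  budget_threshold: (c, ctx) => ctx.spend >= c.limit,
  model_denylist: (c, ctx) => c.models.includes(ctx.model),
  model_allowlist: (c, ctx) => !c.models.includes(ctx.model),
  token_limit: (c, ctx) => ctx.requestedTokens > c.max,
  vendor_restriction: (c, ctx) => !c.vendors.includes(ctx.vendor),
};

// A rule matches only when every one of its conditions holds.
function matchesConditions(conditions, ctx) {
  return conditions.every((c) => {
    const check = checks[c.type];
    if (!check) throw new Error(`unknown condition type: ${c.type}`);
    return check(c, ctx);
  });
}

// Lower priority number runs first; the first block ends evaluation.
function evaluate(rules, ctx) {
  const sorted = [...rules].sort((a, b) => a.priority - b.priority);
  const result = { action: 'allow', model: ctx.model, alerts: [] };
  for (const rule of sorted) {
    const current = { ...ctx, model: result.model };
    if (!matchesConditions(rule.conditions, current)) continue;
    if (rule.action === 'block') {
      return { ...result, action: 'block', rule: rule.name };
    }
    if (rule.action === 'alert') result.alerts.push(rule.name);
    if (rule.action === 'downgrade') result.model = rule.downgrade_to;
  }
  return result;
}
```

Alerts and downgrades accumulate while evaluation continues, whereas a block returns immediately, which mirrors the short-circuit behaviour described above.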

Eval time: 1-3ms | Cache TTL: 5s | Max rules: 500/tenant

Cache miss: Falls back to DB query (~8ms). Warms cache for next request.

DB unreachable: Uses last-known rule set. Emits alert. Stale window: 60s max.
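The cache-miss and DB-unreachable behaviour can be sketched as a TTL cache with a bounded stale fallback. This is a simplified illustration: `RuleCache` and `loadRules` are hypothetical names, the loader is synchronous to keep the example short, and measuring the 60s stale window from the last successful load is an assumption.

```javascript
class RuleCache {
  constructor(loadRules, { ttlMs = 5000, maxStaleMs = 60000, now = Date.now } = {}) {
    this.loadRules = loadRules; // returns the tenant's rules or throws when the DB is down
    this.ttlMs = ttlMs;
    this.maxStaleMs = maxStaleMs;
    this.now = now;
    this.entries = new Map();
  }

  get(tenantId) {
    const entry = this.entries.get(tenantId);
    const age = entry ? this.now() - entry.loadedAt : Infinity;
    if (age <= this.ttlMs) return { rules: entry.rules, source: 'cache' };
    try {
      // Cache miss or expired entry: reload and warm the cache.
      const rules = this.loadRules(tenantId);
      this.entries.set(tenantId, { rules, loadedAt: this.now() });
      return { rules, source: 'db' };
    } catch (err) {
      // DB unreachable: serve the last-known rules while they are fresh enough.
      if (entry && age <= this.maxStaleMs) {
        return { rules: entry.rules, source: 'stale' };
      }
      throw err;
    }
  }
}
```

Returning the `source` alongside the rules makes it easy to emit an alert whenever a stale rule set is served.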

policy/evaluate.js

// Policy evaluation with short-circuit
async function evaluatePolicies(identity, req) {
  const rules = await getRulesForTenant(
    identity.tenant_id
  );

  for (const rule of rules) {
    const ctx = {
      agent: identity.agent_id,
      team: identity.team_id,
      model: req.model,
      vendor: detectVendor(req),
      spend: await getCurrentSpend(
        identity, rule.window
      ),
      time: new Date()
    };

    if (matchesConditions(rule.conditions, ctx)) {
      if (rule.action === 'block') {
        return {
          action: 'block',
          rule_name: rule.name,
          reason: formatReason(rule, ctx)
        };
      }
      if (rule.action === 'alert') {
        emitAlert(identity, rule, ctx);
      }
      if (rule.action === 'downgrade') {
        req.model = rule.downgrade_to;
      }
    }
  }

  return { action: 'allow' };
}

Transaction Tracker

Real-time cost calculation and multi-vendor aggregation

Every API response carries token usage data. The tracker extracts it, looks up the per-token price for the specific model, and computes cost. This happens on every request — not as a batch job.

Pricing table: Covers 40+ models across OpenAI, Anthropic, Google, Mistral, and Cohere. Updated within 24 hours of vendor price changes. Input and output tokens are priced separately.

Aggregation: Costs roll up three ways — per-agent, per-team, per-organization. Each level has its own rolling budget window. The policy engine reads these aggregations in real time for budget threshold rules.
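The three-level roll-up over a rolling window can be illustrated with a small in-memory structure. This is only a sketch: `RollingSpend` is a hypothetical name, amounts are kept as integer micro-dollars to avoid floating-point drift, and the production system is described as aggregating in PostgreSQL rather than in process memory.

```javascript
class RollingSpend {
  constructor(windowMs, now = Date.now) {
    this.windowMs = windowMs;
    this.now = now;
    this.events = [];
  }

  // identity: { agent_id, team_id, org_id }; amountMicros: integer micro-dollars.
  record(identity, amountMicros) {
    this.events.push({ at: this.now(), ...identity, amountMicros });
  }

  // level is 'agent_id', 'team_id', or 'org_id'.
  total(level, id) {
    const cutoff = this.now() - this.windowMs;
    // Drop events that have slid out of the window.
    this.events = this.events.filter((e) => e.at > cutoff);
    return this.events
      .filter((e) => e[level] === id)
      .reduce((sum, e) => sum + e.amountMicros, 0);
  }
}
```

A budget threshold rule would compare `total('team_id', ...)` against its limit at evaluation time.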

Streaming: For SSE responses, token counts are estimated incrementally as chunks arrive using a lightweight tokenizer. Final count is reconciled when the stream completes.
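The estimate-then-reconcile pattern for streams might look like the sketch below. The four-characters-per-token ratio is a crude stand-in for the "lightweight tokenizer" and is purely an assumption, as are the `StreamUsageEstimator` name and the integer micro-dollar pricing.

```javascript
const CHARS_PER_TOKEN = 4; // crude heuristic, an assumption; not a real tokenizer

class StreamUsageEstimator {
  constructor(outputMicrosPerToken) {
    this.outputMicrosPerToken = outputMicrosPerToken;
    this.chars = 0;
    this.reportedTokens = null;
  }

  // Called for each streamed text chunk; updates the running estimate.
  onChunk(text) {
    this.chars += text.length;
  }

  // Called once the stream ends and the vendor reports the real usage.
  reconcile(usage) {
    this.reportedTokens = usage.completion_tokens;
  }

  // Prefer the vendor-reported count; fall back to the estimate mid-stream.
  outputTokens() {
    return this.reportedTokens ?? Math.ceil(this.chars / CHARS_PER_TOKEN);
  }

  outputCostMicros() {
    return this.outputTokens() * this.outputMicrosPerToken;
  }
}
```

If a stream is interrupted before usage arrives, the estimate remains the best available figure for the partial transaction.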

Vendors: 5 | Models: 40+ | Price lag: <24h

tracker/cost.js

// Cost calculation from API response
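function calculateCost(vendorResponse) {
  const usage = vendorResponse.usage;
  const model = vendorResponse.model;
  const pricing = getPricing(model);

  // OpenAI reports prompt_tokens/completion_tokens; Anthropic reports
  // input_tokens/output_tokens. Accept either naming here; other vendors
  // need their own mapping.
  const inputTokens = usage.prompt_tokens ?? usage.input_tokens;
  const outputTokens = usage.completion_tokens ?? usage.output_tokens;

  // Input and output tokens are priced separately
  const inputCost = inputTokens * pricing.input_per_token;
  const outputCost = outputTokens * pricing.output_per_token;

  return {
    total: inputCost + outputCost,
    breakdown: {
      input_tokens: inputTokens,
      output_tokens: outputTokens,
      input_cost: inputCost,
      output_cost: outputCost,
      model: model,
      vendor: pricing.vendor,
      timestamp: new Date()
    }
  };
}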

// Rolling spend aggregation
async function updateSpend(identity, cost) {
  await db.query(`
    INSERT INTO transactions
      (agent_id, team_id, org_id, amount,
       vendor, model, created_at)
    VALUES ($1, $2, $3, $4, $5, $6, NOW())
  `, [
    identity.agent_id,
    identity.team_id,
    identity.org_id,
    cost.total,
    cost.breakdown.vendor,
    cost.breakdown.model
  ]);
}

Audit Log

Immutable transaction records, write path, and query performance

Every request produces an immutable audit record. "Immutable" means no UPDATE or DELETE — records are append-only. This is non-negotiable for compliance and debugging.

What gets captured: Request metadata (agent, team, vendor, model), token counts (input/output), computed cost, latency breakdown (auth, policy eval, vendor round-trip), policy evaluation result (which rules fired, what action was taken), and timestamp.

Write path: Audit writes are asynchronous — they happen after the response is returned to the agent. This means audit writes add zero blocking latency. Writes use a batched insert (up to 100 records per batch, flushed every 500ms or on batch full).

What is NOT captured: Prompt content, completion content, or any request/response body data. Reins never stores or logs your actual AI conversations.

Write: async | Query (30d): <50ms | Hot retention: 90 days

Write failure: Records buffered in-memory (max 10k). Retried on reconnect.

Buffer overflow: Oldest records dropped. Counter emitted to monitoring.

audit/writer.js

// Batched async audit writer
class AuditWriter {
  constructor() {
    this.buffer = [];
    this.flushInterval = setInterval(
      () => this.flush(), 500
    );
  }

  record(entry) {
    this.buffer.push({
      ...entry,
      id: generateULID(),
      created_at: new Date()
    });

    if (this.buffer.length >= 100) {
      this.flush();
    }
  }

  async flush() {
    if (this.buffer.length === 0) return;

    const batch = this.buffer.splice(
      0, 100
    );

    try {
      await db.batchInsert(
        'transactions', batch
      );
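    } catch (err) {
      // Re-queue on failure
      this.buffer.unshift(...batch);
      metrics.increment(
        'audit.write_failure'
      );
      // Cap the buffer at 10k records, dropping the oldest on overflow
      const overflow = this.buffer.length - 10000;
      if (overflow > 0) {
        this.buffer.splice(0, overflow);
        // Metric name is an assumption
        metrics.increment('audit.buffer_overflow');
      }
    }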
  }
}

Request Lifecycle

Anatomy of a single API call

From agent request to vendor response. Total overhead: 2-8ms.

T+0ms

Agent sends request

The Reins SDK intercepts the outbound API call. Base URL is rewritten from api.openai.com to gateway.reins.dev. Identity headers are injected. Request body is untouched.

T+0.3ms

Authentication

Gateway validates the Reins API key via HMAC comparison. Resolves tenant, team, and agent identity from cached key mapping. Cache hit: 0.1ms. Cache miss (cold start): 3ms.

T+0.5ms

Rate limiting

Token bucket algorithm checks per-agent and per-team request quotas. Bucket state is maintained in-memory with periodic sync to the database. Bucket refill rate: configurable per-team.
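For readers unfamiliar with the token bucket algorithm, a minimal in-memory version is sketched below. `TokenBucket`, `capacity`, and `refillPerSec` are illustrative names; the database sync and the per-agent and per-team keying mentioned above are left out.

```javascript
class TokenBucket {
  constructor({ capacity, refillPerSec, now = Date.now }) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.now = now;
    this.tokens = capacity; // start full
    this.last = now();
  }

  // Try to take n tokens; reports how long to wait (in seconds) when denied.
  tryRemove(n = 1) {
    const t = this.now();
    const refill = ((t - this.last) / 1000) * this.refillPerSec;
    this.tokens = Math.min(this.capacity, this.tokens + refill);
    this.last = t;
    if (this.tokens >= n) {
      this.tokens -= n;
      return { allowed: true, retryAfter: 0 };
    }
    return { allowed: false, retryAfter: (n - this.tokens) / this.refillPerSec };
  }
}
```

Refilling lazily on each call avoids a background timer per bucket, which matters when there is one bucket per agent and per team.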

T+1ms

Policy evaluation

All active rules are evaluated in priority order against the request context. Budget thresholds read from the real-time spend aggregation. Short-circuits on first block. Result: allow, block, alert, or downgrade.

T+3ms

Vendor proxy

Vendor API key decrypted in memory (AES-256-GCM). Request forwarded with original body + injected auth header. Reins metadata headers stripped. Streaming responses are passed through as-is.

T+3ms — T+N

Vendor processing

The vendor (OpenAI, Anthropic, etc.) processes the request. This is where the actual AI inference happens. Latency is entirely vendor-dependent — Reins adds nothing here.

T+N+1ms

Response parsing

Token usage extracted from response headers or body. Cost calculated from the pricing table. For streaming: incremental token counting as chunks arrive, reconciled on stream end.

T+N+2ms

Response returned + async audit write

Original vendor response returned to agent unmodified. Transaction record written asynchronously (zero blocking latency). Budget aggregations updated. Alerts fired if thresholds crossed. Done.

Data Flow

End-to-end request pipeline

Client SDK (Intercept) ➔ Auth (Validate) ➔ Rate Limit (Throttle) ➔ Budget Check (Enforce) ➔ Vendor Proxy (Forward) ➔ Response Parser (Extract) ➔ Audit Write (Record)

SDK Integration

Four ways to integrate

Drop-in wrappers for your existing AI client. Zero changes to your prompt logic.

# Python — OpenAI wrapper
# pip install reins-sdk openai

from reins import ReinsClient
from openai import OpenAI

# Initialize Reins — wraps your existing OpenAI client
reins = ReinsClient(
    api_key="rns_live_abc123...",
    agent_id="data-extraction-agent",
    team="ml-team"
)

# Your existing OpenAI code — unchanged
client = reins.wrap(OpenAI(api_key="sk-..."))

try:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Analyze this dataset"}]
    )
    print(response.choices[0].message.content)

except reins.PolicyBlockedError as e:
    # Budget exceeded or model restricted
    print(f"Blocked: {e.rule_name} — {e.reason}")
    # e.rule_name = "daily-budget-limit"
    # e.reason = "Team ml-team exceeded $50/day budget"

except reins.RateLimitedError as e:
    print(f"Rate limited. Retry after {e.retry_after}s")

// Node.js — Express middleware pattern
// npm install @reins/sdk openai

import { Reins } from '@reins/sdk';
import OpenAI from 'openai';

const reins = new Reins({
  apiKey: process.env.REINS_API_KEY,
  agentId: 'customer-support-bot',
  team: 'support',
  // Optional: fail-open if Reins is unreachable
  failOpen: true,
});

const openai = reins.wrap(new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
}));

// Use exactly like the standard OpenAI client
const completion = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [{ role: 'user', content: prompt }],
  stream: true, // Streaming works transparently
});

// Access Reins metadata after each call
const meta = reins.getLastTransaction();
console.log(meta.cost);     // 0.0023
console.log(meta.model);    // "gpt-4o-mini"
console.log(meta.latency);  // 342 (ms, vendor only)

# LangChain integration
# pip install reins-sdk langchain-openai

from reins.integrations.langchain import ReinsCallbackHandler
from langchain_openai import ChatOpenAI

# Reins as a LangChain callback handler
reins_handler = ReinsCallbackHandler(
    api_key="rns_live_abc123...",
    agent_id="research-agent",
    team="research",
    # Pre-check budget before LLM call
    enforce_policy=True
)

llm = ChatOpenAI(
    model="gpt-4o",
    callbacks=[reins_handler]
)

# Works with chains, agents, and tools
from langchain.agents import AgentExecutor
agent = AgentExecutor(
    agent=my_agent,
    tools=my_tools,
    callbacks=[reins_handler],  # Tracks every LLM call in the chain
    max_iterations=10
)

result = agent.invoke({"input": "Research quantum computing trends"})

# After execution: full cost breakdown by chain step
summary = reins_handler.get_run_summary()
print(f"Total cost: ${summary.total_cost:.4f}")
print(f"LLM calls: {summary.call_count}")
print(f"Tokens used: {summary.total_tokens}")

# Direct REST — no SDK required
# Just change the base URL and add headers

curl -X POST https://gateway.reins.dev/v1/chat/completions \
  -H "Authorization: Bearer rns_live_abc123..." \
  -H "X-Reins-Agent: my-agent" \
  -H "X-Reins-Team: my-team" \
  -H "X-Vendor-Key: sk-..." \
  -H "X-Vendor: openai" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'

# Response includes Reins headers:
# X-Reins-Cost: 0.0012
# X-Reins-Tokens-In: 8
# X-Reins-Tokens-Out: 42
# X-Reins-Transaction-Id: txn_01H...
# X-Reins-Budget-Remaining: 47.32
# X-Reins-Latency-Overhead: 3ms

# Error responses for policy blocks:
# HTTP 403
# {
#   "error": "policy_blocked",
#   "rule": "daily-budget-limit",
#   "reason": "Team exceeded $50/day budget",
#   "budget_used": 50.12,
#   "budget_limit": 50.00
# }

Security

Security architecture

How Reins handles your most sensitive data — vendor API keys and agent credentials.

Encryption

Token storage

Vendor API keys are encrypted at rest using AES-256-GCM with per-tenant encryption keys derived via HKDF. Keys are decrypted in memory only for the duration of a proxied request. Plaintext keys never touch disk, logs, or any persistent store.
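As an illustration of per-tenant key derivation with HKDF followed by AES-256-GCM, here is a minimal sketch using Node's built-in crypto module. The function names, the `tenant:` info string, the empty salt, and the 12-byte IV are assumptions made for the example; the actual Reins key hierarchy is not described in this document.

```javascript
const crypto = require('node:crypto');

// Derive a 32-byte per-tenant key from a master key with HKDF-SHA256.
function deriveTenantKey(masterKey, tenantId) {
  const derived = crypto.hkdfSync('sha256', masterKey, Buffer.alloc(0), `tenant:${tenantId}`, 32);
  return Buffer.from(derived); // hkdfSync returns an ArrayBuffer
}

// Encrypt a vendor API key; the IV and auth tag are stored next to the ciphertext.
function encryptVendorKey(plaintext, key) {
  const iv = crypto.randomBytes(12);
  const cipher = crypto.createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  return { iv, ciphertext, tag: cipher.getAuthTag() };
}

// Decrypt in memory; final() throws if the key or tag does not match.
function decryptVendorKey({ iv, ciphertext, tag }, key) {
  const decipher = crypto.createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString('utf8');
}
```

Because GCM authenticates the ciphertext, a key derived for a different tenant fails loudly at decryption instead of returning garbage.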

Auth Flow

Authentication

Reins API keys use HMAC-SHA256 for validation. Keys are stored as hashed values — we cannot retrieve your key after creation. JWT tokens for dashboard access use RS256 with 15-minute expiry and rotating signing keys.

Data Path

Zero prompt retention

Request and response bodies are proxied in streaming mode. Reins never buffers, stores, or logs prompt content or completions. Only metadata (model, tokens, cost, latency) is extracted and recorded.

Key Rotation

Key management

Reins API keys can be rotated with a 24-hour overlap window — both old and new keys remain valid during rotation. Vendor API keys can be updated in the dashboard; re-encryption happens synchronously.

Audit Trail

Immutability

Audit log records are append-only. No UPDATE or DELETE operations are permitted on the transactions table — not through the API, not through the dashboard, not through support. Records are cryptographically chained for tamper detection.
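One common way to chain records cryptographically is a hash chain, where each record's hash covers the previous record's hash. The sketch below shows the general technique under that assumption; the document does not say which scheme Reins uses, and `GENESIS`, `appendRecord`, and `verifyChain` are hypothetical names.

```javascript
const crypto = require('node:crypto');

const GENESIS = '0'.repeat(64); // placeholder hash before the first record

// Hash covers the previous hash plus the serialized record.
// JSON.stringify is order-sensitive, so records must be serialized consistently.
function recordHash(prevHash, record) {
  return crypto
    .createHash('sha256')
    .update(prevHash)
    .update(JSON.stringify(record))
    .digest('hex');
}

function appendRecord(chain, record) {
  const prev = chain.length ? chain[chain.length - 1].hash : GENESIS;
  chain.push({ record, prev_hash: prev, hash: recordHash(prev, record) });
}

// Returns -1 if the chain is intact, else the index of the first bad entry.
function verifyChain(chain) {
  let prev = GENESIS;
  for (let i = 0; i < chain.length; i++) {
    const entry = chain[i];
    if (entry.prev_hash !== prev || entry.hash !== recordHash(prev, entry.record)) {
      return i;
    }
    prev = entry.hash;
  }
  return -1;
}
```

Editing any earlier record changes its hash and breaks every link after it, so tampering is detectable from that point onward.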

Network

Transport security

All traffic is TLS 1.3. The gateway enforces HSTS. Vendor API calls are made over TLS with certificate pinning for major vendors (OpenAI, Anthropic, Google). Internal service communication uses mutual TLS.

Deployment

Deployment options

Current offering is SaaS; self-hosted and on-prem are on the roadmap.

Feature	SaaS	Self-Hosted	On-Prem
Availability	Available now	Q3 2026	Enterprise
Setup time	5 minutes	~1 hour	Custom
Data residency	US-East (AWS)	Your cloud	Your datacenter
Auto-updates	Yes	Opt-in	Manual
SSO / SAML	Yes	Yes	Yes
SLA	99.9%	Self-managed	Custom
Audit log export	API + CSV	Direct DB	Direct DB
VPC peering	Enterprise	Native	Native
Custom domains	Yes	Yes	Yes

Performance

Performance benchmarks

Measured on production traffic. Latency overhead is the time Reins adds — vendor latency is excluded.

Operation	P50	P95	P99
Auth validation	0.3ms	0.8ms	1.2ms
Rate limit check	0.1ms	0.2ms	0.5ms
Policy evaluation	1.2ms	2.8ms	4.1ms
Budget lookup	0.5ms	1.5ms	3.0ms
Vendor proxy setup	0.2ms	0.5ms	0.8ms
Response parsing	0.3ms	0.7ms	1.5ms
Audit write (async)	0ms	0ms	0ms
Total overhead	2.6ms	6.5ms	11.1ms

Reliability

Failure mode matrix

What breaks, how Reins responds, and what you experience.

Failure: Gateway unreachable

Response: SDK falls back to direct vendor call (fail-open mode).

Impact: Calls succeed but are unmonitored. No policy enforcement.

Recovery: Automatic. SDK retries gateway on next request.

Failure: Database connection lost

Response: Auth uses cached keys. Policy engine uses last-known rules (60s staleness).

Impact: Requests continue. Budget tracking paused. Audit records buffered.

Recovery: Automatic reconnect with exponential backoff. Buffered records flushed.

Failure: Vendor API down

Response: Vendor error returned to agent with original status code + headers.

Impact: Agent sees the same error it would without Reins. Audit records the failure.

Recovery: Vendor-dependent. Reins retries once on 5xx before returning.

Failure: Audit write backlog

Response: In-memory buffer absorbs up to 10,000 records. Writes retried.

Impact: Agent requests unaffected. Dashboard shows stale data until flushed.

Recovery: Automatic on DB reconnect. Overflow drops oldest records + emits metric.

Failure: Policy cache stale

Response: Requests evaluated against rules up to 5s old (normal TTL).

Impact: Newly created/modified rules may not apply for up to 5s.

Recovery: Automatic cache refresh. Force-refresh available via API.

Failure: Encryption key unavailable

Response: Cannot decrypt vendor keys. Requests blocked with 503.

Impact: All vendor calls for affected tenant fail. Agents see 503.

Recovery: Key management service restart. Typical recovery: <30 seconds.

FAQ

Architecture questions

What latency does Reins add to API calls?

Reins adds 2-8ms of latency per request. Auth validation takes under 1ms, policy evaluation takes 1-3ms, and the audit write is asynchronous, so it adds zero blocking latency. For context, a typical GPT-4o call takes 500-3000ms, so Reins overhead stays under 2% of total round-trip time.

What happens if Reins goes down?

By default, Reins operates in fail-open mode. If the gateway is unreachable, the SDK falls back to calling the vendor API directly. Your agents keep working — they just aren't monitored until the gateway recovers. You can configure fail-closed mode for environments where unmonitored calls are unacceptable.
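The difference between fail-open and fail-closed can be sketched as follows. This is not the SDK's code: `callWithFallback`, `viaGateway`, `direct`, and the list of error codes treated as "unreachable" are all assumptions made for illustration.

```javascript
// Network-level error codes treated as "gateway unreachable" (an assumed list).
const UNREACHABLE_CODES = new Set(['ECONNREFUSED', 'ECONNRESET', 'ETIMEDOUT', 'ENOTFOUND']);

function isUnreachable(err) {
  return UNREACHABLE_CODES.has(err.code);
}

// viaGateway and direct are async functions that perform the same vendor call.
async function callWithFallback({ viaGateway, direct, failOpen = true }) {
  try {
    return { response: await viaGateway(), monitored: true };
  } catch (err) {
    // Policy blocks, rate limits, and vendor errors must reach the caller as-is.
    if (!isUnreachable(err)) throw err;
    // Fail-closed: refuse to make an unmonitored call.
    if (!failOpen) throw err;
    return { response: await direct(), monitored: false };
  }
}
```

The key design point is that only connectivity failures trigger the fallback; a 403 policy block falling through to a direct call would defeat the purpose of enforcement.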

How does Reins calculate costs in real time?

Reins maintains a pricing table for every supported vendor and model (40+ models). Token counts are extracted from the usage field in API responses. Costs are calculated per-request and aggregated per-agent, per-team, and per-organization in real time. For streaming responses, a lightweight tokenizer counts tokens incrementally.

Can I run Reins in my own infrastructure?

Self-hosted deployment is on our roadmap for Q3 2026. Currently Reins runs as managed SaaS. For enterprises with strict data residency or compliance requirements, contact us about dedicated tenancy with data isolation guarantees.

Does Reins read or store my prompt data?

No. Reins never stores or logs prompt content or AI completions. The gateway inspects only request metadata (model, token count, vendor) for policy evaluation and cost tracking. Request and response bodies are streamed through to the vendor and back without being persisted.

What's the maximum throughput?

The gateway handles 10,000+ requests per second per tenant with horizontal auto-scaling. Policy evaluation uses an in-memory rule cache with a 5-second TTL. Rate limiting uses a token bucket algorithm with in-memory state. The bottleneck in practice is always the vendor API, not Reins.

How does Reins handle streaming responses?

Reins proxies SSE streams transparently — the first byte reaches your agent with zero additional buffering. Token counting happens incrementally as chunks arrive, updating the running cost without waiting for the stream to complete. Stream interruptions are recorded as partial transactions in the audit log.

How are policies versioned and rolled back?

Every policy change creates a versioned snapshot. You can view the diff between versions in the dashboard and roll back to any previous version with one click. Active rollback takes effect within the 5-second cache window. Rollback history is immutable and auditable.

Built for agents who run at machine speed.
