Adaptive implements multi-layer prompt caching combining AI-powered semantic understanding with provider-level infrastructure caching. This automatically optimizes for both similar queries and exact repeats, reducing costs by 60-95%.
Automatic: Both semantic and provider caching work without configuration for most applications.

How It Works

Adaptive checks caches in order of speed:
  1. Provider Cache: Exact matches at provider infrastructure (microseconds)
  2. Semantic Cache: AI-powered similar meaning matches (<100ms)
  3. Fresh Request: Standard API call when no cache hit

Key Benefits

  • Speed: Microseconds to 100ms across cache levels
  • Savings: 60-95% cost reduction
  • Intelligence: Exact + semantic matching
  • Integration: Provider-native caching

Real-World Examples

  • Customer Support: “How do I reset my password?” matches “I forgot my password”, “Password reset steps”, “Help me recover my account”
  • Technical Docs: “How to implement JWT authentication” matches “JWT auth implementation guide”, “Setting up JSON Web Token auth”
  • FAQ Systems: “What are your business hours?” matches “When are you open?”, “Office hours information”

Provider Prompt Caching

Adaptive automatically leverages provider-level caching when available. Key providers:
| Provider | Cache Reads | Configuration | Notes |
|---|---|---|---|
| OpenAI | 0.25x-0.50x cost | Automatic | 1024+ tokens minimum |
| Anthropic | 0.1x cost | cache_control required | 4 breakpoints max, 5min TTL |
| Google Gemini | 0.25x cost | Automatic or explicit | 1028-2048 tokens minimum |
| DeepSeek | 0.1x cost | Automatic | - |
| Grok | 0.25x cost | Automatic | - |
| Groq | Reduced rate | Automatic | Kimi K2 models |
Anthropic requires explicit cache_control markers on the content you want cached:

const message = await client.messages.create({
  model: 'claude-sonnet-4-5',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Based on this document:' },
        {
          type: 'text',
          text: 'LARGE TEXT CONTENT HERE',
          // Mark this block as cacheable at the provider (ephemeral, ~5 minute TTL)
          cache_control: { type: 'ephemeral' }
        },
        { type: 'text', text: 'Summarize the key points.' }
      ]
    }
  ]
});
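
Subsequent requests that include the same cached block within the provider's TTL are read back at the reduced cache-read rate shown in the table above.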

Provider Cache Examples

// OpenAI automatically caches prompts and reuses context
const completion = await client.chat.completions.create({
  model: 'gpt-5-mini',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Explain quantum computing' }
  ]
});

// Provider may cache this exact prompt for future identical requests

Provider vs Semantic Cache

| Type | Matching | Speed | Savings |
|---|---|---|---|
| Provider | Exact prompts only | Microseconds | 90-95% |
| Semantic | Similar meaning | <100ms | 60-90% |
| Combined | Best available | Automatic | Maximum |

Monitoring Provider Cache Hits

Provider cache hits are indicated differently than semantic cache hits:
// Check for provider-level caching indicators
const startTime = Date.now();

const completion = await client.chat.completions.create({
  model: '', // empty model lets Adaptive choose the provider/model automatically
  messages: [{ role: 'user', content: 'Hello!' }]
});

// Provider cache hits may show:
// - Extremely fast response times (<50ms)
// - Lower token counts (reused context)
// - Provider-specific cache indicators
console.log(`Response time: ${Date.now() - startTime}ms`);
console.log(`Cache tier: ${completion.usage.cache_tier || 'none'}`);
console.log(`Provider: ${completion.provider}`);

// For OpenAI: Look for system_fingerprint changes indicating cache use
console.log(`System fingerprint: ${completion.system_fingerprint}`);

Provider Cache Best Practices

Provider caching is automatic but can be optimized with consistent prompt patterns.
  1. Use Consistent System Prompts: Keep system messages identical across similar requests for provider cache hits (see the sketch below).
  2. Maintain Conversation Context: Provider caches work best with ongoing conversations using the same context.
  3. Monitor Response Times: Very fast responses (<50ms) often indicate provider cache hits.
  4. Combine with Semantic Cache: Let Adaptive automatically choose the best caching strategy.
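
A minimal sketch of the first practice, assuming the Adaptive client from the SDK Setup section below and a made-up system prompt: define the system message once and reuse it verbatim so the provider sees an identical prefix on every request.

// Hypothetical shared system prompt; keeping it byte-identical across requests
// maximizes the chance of provider-level prefix cache hits
const SYSTEM_PROMPT = 'You are a helpful support assistant for Acme Corp.';

async function ask(question) {
  return client.chat.completions.create({
    model: '',
    messages: [
      { role: 'system', content: SYSTEM_PROMPT },
      { role: 'user', content: question }
    ]
  });
}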

SDK Setup

import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: process.env.ADAPTIVE_API_KEY,
  baseURL: 'https://api.llmadaptive.uk/v1'
});

Configuration

Default: Semantic caching is enabled automatically; no configuration is needed.

Custom threshold examples:
// Stricter matching
model_router: { semantic_cache: { threshold: 0.9 } }

// Looser matching  
model_router: { semantic_cache: { threshold: 0.75 } }

// Disable for real-time data
model_router: { semantic_cache: { enabled: false } }

Custom Threshold Settings

// Higher threshold = stricter matching
const completion = await client.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "Technical implementation question" }],
  model_router: {
    semantic_cache: {
      threshold: 0.9  // Higher = more precise matching
    }
  }
});

Configuration Parameters

  • enabled: Enable/disable semantic caching (default: true)
  • threshold: Similarity threshold 0.0-1.0 (default: 0.85)

Threshold Guide

  • 0.7-0.8: Loose matching for FAQ/customer support (high hit rate)
  • 0.8-0.9: Balanced (default) for most applications
  • 0.9+: Strict matching for technical/legal content (high precision)
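
As an illustration of this guide (the constant name and exact values are examples, not part of the API), you could centralize threshold choices per content type:

// Example thresholds per use case, following the guide above
const CACHE_THRESHOLDS = {
  faq: 0.75,       // loose matching, high hit rate
  general: 0.85,   // balanced default
  technical: 0.92  // strict matching, high precision
};

const completion = await client.chat.completions.create({
  model: '',
  messages: [{ role: 'user', content: 'What are your business hours?' }],
  model_router: {
    semantic_cache: { threshold: CACHE_THRESHOLDS.faq }
  }
});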

Cache Performance Tracking

Every response includes cache information:
{
  "usage": {
    "cache_tier": "semantic_similar",
    "total_tokens": 40
  },
  "provider": "cached"
}

Cache Tier Values

  • semantic_exact: Perfect semantic match (90%+ savings, <50ms response)
  • semantic_similar: Similar semantic match (60-90% savings, <100ms response)
  • provider_cache: Provider infrastructure cache (microsecond response, high savings)
  • none: No cache hit (standard latency and costs)

Performance Characteristics

Performance: microseconds to 100ms latency, 60-80% cache hit rate, 60-95% cost savings

Use Cases

  • Perfect for: Customer support, FAQ systems, documentation search, educational content
  • Avoid for: Real-time data, personalized content, time-sensitive information

Best Practices

  • Use default settings (0.85 threshold) for most applications
  • Monitor cache_tier in responses to track effectiveness
  • Lower thresholds for FAQ systems, higher for technical content
  • Group similar content to maximize cache hits

Error Handling

Cache failures automatically fall back to fresh API calls, with no interruption to your requests.

Monitoring

Track cache performance in your Adaptive dashboard: hit rates, cost savings, and response times.

Technical Details

  • Multi-layer: Provider infrastructure + AI semantic caching
  • Embeddings: Sentence transformers for semantic matching
  • Similarity: Cosine similarity with configurable thresholds (illustrated below)
  • Storage: Dual systems with automatic fallback
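
As a rough sketch of how such a similarity check works (illustrative only, not Adaptive's internal implementation), a cached entry counts as a hit when the cosine similarity between the new prompt's embedding and a cached prompt's embedding meets the configured threshold:

// Cosine similarity between two embedding vectors of equal length
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// A cached response is reused when similarity meets or exceeds the threshold
function isCacheHit(newEmbedding, cachedEmbedding, threshold = 0.85) {
  return cosineSimilarity(newEmbedding, cachedEmbedding) >= threshold;
}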

Enhanced Cache Performance Tracking

Monitoring Cache Effectiveness

Check cache performance in API responses:
const startTime = Date.now();

const completion = await client.chat.completions.create({
  model: '',
  messages: [{ role: 'user', content: 'How do I reset my password?' }]
});

console.log(`Cache tier: ${completion.usage.cache_tier || 'none'}`);
console.log(`Provider: ${completion.provider}`);
console.log(`Response time: ${Date.now() - startTime}ms`);

Tracking Cache Hit Rates

Calculate cache effectiveness:
// Counters accumulated across every request your application makes
let totalRequests = 0;
let cacheHits = 0;

const completion = await client.chat.completions.create({
  model: '',
  messages: [{ role: 'user', content: 'How do I reset my password?' }]
});

totalRequests++;
if (completion.usage.cache_tier && completion.usage.cache_tier !== 'none') {
  cacheHits++;
}

const hitRate = ((cacheHits / totalRequests) * 100).toFixed(1);
console.log(`Cache hit rate: ${hitRate}%`);

Streaming with Cache Data

Cache information is available in streaming responses:
const stream = await client.chat.completions.create({
  model: '',
  messages: [{ role: 'user', content: 'Tell me a story' }],
  stream: true,
  stream_options: { include_usage: true }
});

for await (const chunk of stream) {
  if (chunk.usage) {
    console.log(`Cache tier: ${chunk.usage.cache_tier || 'none'}`);
  }
}

Production Cache Monitoring

Simple cache tracking for production:
class CacheMonitor {
  constructor() {
    this.requests = 0;
    this.hits = 0;
  }

  track(completion) {
    this.requests++;
    // Count any non-"none" tier (semantic or provider) as a cache hit
    if (completion.usage.cache_tier && completion.usage.cache_tier !== 'none') {
      this.hits++;
    }
  }

  getHitRate() {
    return ((this.hits / this.requests) * 100).toFixed(1) + '%';
  }
}
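
For example, with the client configured above (the question is just a placeholder):

const monitor = new CacheMonitor();

const completion = await client.chat.completions.create({
  model: '',
  messages: [{ role: 'user', content: 'How do I reset my password?' }]
});

monitor.track(completion);
console.log(`Cache hit rate: ${monitor.getHitRate()}`);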

Cost Savings Calculator

function calculateSavings(completion) {
  const tokens = completion.usage.total_tokens;
  const tier = completion.usage.cache_tier;
  const baseRate = 0.15; // illustrative per-token rate; substitute your model's actual pricing

  if (tier === 'semantic_exact') return { savings: '90%', cost: tokens * baseRate * 0.1 };
  if (tier === 'semantic_similar') return { savings: '75%', cost: tokens * baseRate * 0.25 };
  return { savings: '0%', cost: tokens * baseRate };
}
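
For example, passing in a completion from any of the requests above (remember the per-token rate inside the function is illustrative):

const { savings, cost } = calculateSavings(completion);
console.log(`Estimated savings: ${savings}, estimated cost: $${cost.toFixed(4)}`);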

Next Steps