Adaptive implements multi-layer prompt caching combining AI-powered semantic understanding with provider-level infrastructure caching. This automatically optimizes for both similar queries and exact repeats, reducing costs by 60-95%.
Automatic: Both semantic and provider caching work without configuration for most applications.

How It Works

Adaptive checks caches in order of speed:
  1. Provider Cache: Exact matches at provider infrastructure (microseconds)
  2. Semantic Cache: AI-powered similar meaning matches (<100ms)
  3. Fresh Request: Standard API call when no cache hit

Key Benefits

  • Speed: Microseconds to 100ms across cache levels
  • Savings: 60-95% cost reduction
  • Intelligence: Exact + semantic matching
  • Integration: Provider-native caching

Real-World Examples

  • Customer Support: “How do I reset my password?” matches “I forgot my password”, “Password reset steps”, “Help me recover my account”
  • Technical Docs: “How to implement JWT authentication” matches “JWT auth implementation guide”, “Setting up JSON Web Token auth”
  • FAQ Systems: “What are your business hours?” matches “When are you open?”, “Office hours information”

Provider Prompt Caching

Adaptive automatically leverages provider-level caching when available. Key providers:
| Provider | Cache Reads | Configuration | Notes |
|---|---|---|---|
| OpenAI | 0.25x-0.50x cost | Automatic | 1024+ tokens minimum |
| Anthropic | 0.1x cost | cache_control required | 4 breakpoints max, 5min TTL |
| Google Gemini | 0.25x cost | Automatic or explicit | 1028-2048 tokens minimum |
| DeepSeek | 0.1x cost | Automatic | - |
| Grok | 0.25x cost | Automatic | - |
| Groq | Reduced rate | Automatic | Kimi K2 models |
Anthropic requires explicit cache_control markers on the content you want cached:

const message = await client.messages.create({
  model: 'claude-sonnet-4-5',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Based on this document:' },
        {
          type: 'text',
          text: 'LARGE TEXT CONTENT HERE',
          // Mark this block as cacheable at the provider (ephemeral, ~5 minute TTL)
          cache_control: { type: 'ephemeral' }
        },
        { type: 'text', text: 'Summarize the key points.' }
      ]
    }
  ]
});
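
Subsequent requests that include the same cached block within the provider's TTL are read back at the reduced cache-read rate shown in the table above.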

Provider Cache Examples

// OpenAI automatically caches prompts and reuses context
const completion = await client.chat.completions.create({
  model: 'gpt-5-mini',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Explain quantum computing' }
  ]
});

// Provider may cache this exact prompt for future identical requests

Provider vs Semantic Cache

| Type | Matching | Speed | Savings |
|---|---|---|---|
| Provider | Exact prompts only | Microseconds | 90-95% |
| Semantic | Similar meaning | <100ms | 60-90% |
| Combined | Best available | Automatic | Maximum |

Monitoring Provider Cache Hits

Provider cache hits are indicated differently than semantic cache hits:
// Check for provider-level caching indicators
const startTime = Date.now();

const completion = await client.chat.completions.create({
  model: '', // empty model lets Adaptive choose the provider/model automatically
  messages: [{ role: 'user', content: 'Hello!' }]
});

// Provider cache hits may show:
// - Extremely fast response times (<50ms)
// - Lower token counts (reused context)
// - Provider-specific cache indicators
console.log(`Response time: ${Date.now() - startTime}ms`);
console.log(`Cache tier: ${completion.usage.cache_tier || 'none'}`);
console.log(`Provider: ${completion.provider}`);

// For OpenAI: Look for system_fingerprint changes indicating cache use
console.log(`System fingerprint: ${completion.system_fingerprint}`);

Provider Cache Best Practices

Provider caching is automatic but can be optimized with consistent prompt patterns.
  1. Use Consistent System Prompts: Keep system messages identical across similar requests for provider cache hits (see the sketch below).
  2. Maintain Conversation Context: Provider caches work best with ongoing conversations using the same context.
  3. Monitor Response Times: Very fast responses (<50ms) often indicate provider cache hits.
  4. Combine with Semantic Cache: Let Adaptive automatically choose the best caching strategy.
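
A minimal sketch of the first practice, assuming the Adaptive client from the SDK Setup section below and a made-up system prompt: define the system message once and reuse it verbatim so the provider sees an identical prefix on every request.

// Hypothetical shared system prompt; keeping it byte-identical across requests
// maximizes the chance of provider-level prefix cache hits
const SYSTEM_PROMPT = 'You are a helpful support assistant for Acme Corp.';

async function ask(question) {
  return client.chat.completions.create({
    model: '',
    messages: [
      { role: 'system', content: SYSTEM_PROMPT },
      { role: 'user', content: question }
    ]
  });
}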

SDK Setup

import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: process.env.ADAPTIVE_API_KEY,
  baseURL: 'https://api.llmadaptive.uk/v1'
});

Configuration

Default: Semantic caching is enabled automatically; no configuration is needed.

Custom threshold examples:
// Stricter matching
model_router: { semantic_cache: { threshold: 0.9 } }

// Looser matching  
model_router: { semantic_cache: { threshold: 0.75 } }

// Disable for real-time data
model_router: { semantic_cache: { enabled: false } }

Custom Threshold Settings

// Higher threshold = stricter matching
const completion = await client.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "Technical implementation question" }],
  model_router: {
    semantic_cache: {
      threshold: 0.9  // Higher = more precise matching
    }
  }
});

Configuration Parameters

  • enabled: Enable/disable semantic caching (default: true)
  • threshold: Similarity threshold 0.0-1.0 (default: 0.85)

Threshold Guide

  • 0.7-0.8: Loose matching for FAQ/customer support (high hit rate)
  • 0.8-0.9: Balanced (default) for most applications
  • 0.9+: Strict matching for technical/legal content (high precision)
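
As an illustration of this guide (the constant name and exact values are examples, not part of the API), you could centralize threshold choices per content type:

// Example thresholds per use case, following the guide above
const CACHE_THRESHOLDS = {
  faq: 0.75,       // loose matching, high hit rate
  general: 0.85,   // balanced default
  technical: 0.92  // strict matching, high precision
};

const completion = await client.chat.completions.create({
  model: '',
  messages: [{ role: 'user', content: 'What are your business hours?' }],
  model_router: {
    semantic_cache: { threshold: CACHE_THRESHOLDS.faq }
  }
});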

Cache Performance Tracking

Every response includes cache information:
{
  "usage": {
    "cache_tier": "semantic_similar",
    "total_tokens": 40
  },
  "provider": "cached"
}

Cache Tier Values

  • semantic_exact: Perfect semantic match (90%+ savings, <50ms response)
  • semantic_similar: Similar semantic match (60-90% savings, <100ms response)
  • provider_cache: Provider infrastructure cache (microsecond response, high savings)
  • none: No cache hit (standard latency and costs)

Performance Characteristics

Performance: microseconds to 100ms latency, 60-80% cache hit rate, 60-95% cost savings

Use Cases

  • Perfect for: Customer support, FAQ systems, documentation search, educational content
  • Avoid for: Real-time data, personalized content, time-sensitive information

Best Practices

  • Use default settings (0.85 threshold) for most applications
  • Monitor cache_tier in responses to track effectiveness
  • Lower thresholds for FAQ systems, higher for technical content
  • Group similar content to maximize cache hits

Error Handling

Cache failures automatically fall back to fresh API calls, with no interruption to your requests.

Monitoring

Track cache performance in your Adaptive dashboard: hit rates, cost savings, and response times.

Technical Details

  • Multi-layer: Provider infrastructure + AI semantic caching
  • Embeddings: Sentence transformers for semantic matching
  • Similarity: Cosine similarity with configurable thresholds (illustrated below)
  • Storage: Dual systems with automatic fallback
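
As a rough sketch of how such a similarity check works (illustrative only, not Adaptive's internal implementation), a cached entry counts as a hit when the cosine similarity between the new prompt's embedding and a cached prompt's embedding meets the configured threshold:

// Cosine similarity between two embedding vectors of equal length
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// A cached response is reused when similarity meets or exceeds the threshold
function isCacheHit(newEmbedding, cachedEmbedding, threshold = 0.85) {
  return cosineSimilarity(newEmbedding, cachedEmbedding) >= threshold;
}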

Enhanced Cache Performance Tracking

Monitoring Cache Effectiveness

Check cache performance in API responses:
const startTime = Date.now();

const completion = await client.chat.completions.create({
  model: '',
  messages: [{ role: 'user', content: 'How do I reset my password?' }]
});

console.log(`Cache tier: ${completion.usage.cache_tier || 'none'}`);
console.log(`Provider: ${completion.provider}`);
console.log(`Response time: ${Date.now() - startTime}ms`);

Tracking Cache Hit Rates

Calculate cache effectiveness:
// Counters accumulated across every request your application makes
let totalRequests = 0;
let cacheHits = 0;

const completion = await client.chat.completions.create({
  model: '',
  messages: [{ role: 'user', content: 'How do I reset my password?' }]
});

totalRequests++;
if (completion.usage.cache_tier && completion.usage.cache_tier !== 'none') {
  cacheHits++;
}

const hitRate = ((cacheHits / totalRequests) * 100).toFixed(1);
console.log(`Cache hit rate: ${hitRate}%`);

Streaming with Cache Data

Cache information is available in streaming responses:
const stream = await client.chat.completions.create({
  model: '',
  messages: [{ role: 'user', content: 'Tell me a story' }],
  stream: true,
  stream_options: { include_usage: true }
});

for await (const chunk of stream) {
  if (chunk.usage) {
    console.log(`Cache tier: ${chunk.usage.cache_tier || 'none'}`);
  }
}

Production Cache Monitoring

Simple cache tracking for production:
class CacheMonitor {
  constructor() {
    this.requests = 0;
    this.hits = 0;
  }

  track(completion) {
    this.requests++;
    // Count any non-"none" tier (semantic or provider) as a cache hit
    if (completion.usage.cache_tier && completion.usage.cache_tier !== 'none') {
      this.hits++;
    }
  }

  getHitRate() {
    return ((this.hits / this.requests) * 100).toFixed(1) + '%';
  }
}
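
For example, with the client configured above (the question is just a placeholder):

const monitor = new CacheMonitor();

const completion = await client.chat.completions.create({
  model: '',
  messages: [{ role: 'user', content: 'How do I reset my password?' }]
});

monitor.track(completion);
console.log(`Cache hit rate: ${monitor.getHitRate()}`);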

Cost Savings Calculator

function calculateSavings(completion) {
  const tokens = completion.usage.total_tokens;
  const tier = completion.usage.cache_tier;
  const baseRate = 0.15; // illustrative per-token rate; substitute your model's actual pricing

  if (tier === 'semantic_exact') return { savings: '90%', cost: tokens * baseRate * 0.1 };
  if (tier === 'semantic_similar') return { savings: '75%', cost: tokens * baseRate * 0.25 };
  return { savings: '0%', cost: tokens * baseRate };
}
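
For example, passing in a completion from any of the requests above (remember the per-token rate inside the function is illustrative):

const { savings, cost } = calculateSavings(completion);
console.log(`Estimated savings: ${savings}, estimated cost: $${cost.toFixed(4)}`);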

Next Steps