Semantic cache uses AI to understand the meaning of requests, automatically caching responses for similar queries even when they’re worded differently. This intelligent caching is enabled by default and can dramatically reduce costs and response times.
Enabled by Default: Semantic caching works automatically - no configuration needed for most applications.

How Semantic Caching Works

1. Request Analysis: Incoming prompts are converted to embeddings that capture their semantic meaning.
2. Similarity Search: The system searches for cached responses with similar meaning using AI-powered matching.
3. Intelligent Matching: Finds relevant cached responses even when questions are phrased differently.
4. Smart Delivery: Returns cached responses in under 100ms with full compatibility.
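
The steps above can be sketched as a nearest-neighbor lookup over embeddings. This is a minimal illustration, not Adaptive's implementation: it assumes cosine similarity as the metric, uses hand-written three-dimensional vectors in place of a real embedding model, and the names cosineSimilarity and findSemanticMatch are hypothetical.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the best cached entry whose similarity meets the threshold, or null.
function findSemanticMatch(queryEmbedding, cacheEntries, threshold = 0.85) {
  let best = null;
  for (const entry of cacheEntries) {
    const score = cosineSimilarity(queryEmbedding, entry.embedding);
    if (score >= threshold && (best === null || score > best.score)) {
      best = { response: entry.response, score };
    }
  }
  return best;
}

// Toy vectors standing in for real embeddings (assumption for illustration).
const cacheEntries = [
  { embedding: [0.9, 0.1, 0.0], response: "To reset your password..." },
  { embedding: [0.0, 0.2, 0.9], response: "To cancel your subscription..." },
];

const hit = findSemanticMatch([0.88, 0.15, 0.02], cacheEntries);
const miss = findSemanticMatch([0.5, 0.5, 0.5], cacheEntries);
console.log(hit ? hit.response : "miss");
console.log(miss ? miss.response : "miss");
```

The first query sits close to the password entry and is served from the cache; the second is not close enough to either entry and would go to the provider.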

Key Benefits

  • Speed: Response times under 100ms
  • Cost Savings: 60-80% reduction for similar queries
  • Intelligence: Understands meaning, not just exact matches
  • Reliability: Success-only caching; never caches errors

Real-World Examples

Customer Support Queries

Original Query: “How do I reset my password?”
Semantic Matches (all return the same cached response):
  • “I forgot my password”
  • “Password reset steps”
  • “Help me recover my account”
  • “Can’t log in - need new password”
  • “Locked out of my account”

Cache Priority System

Adaptive checks the caches from fastest to slowest and falls through to the provider only when neither cache has a match:
1. L1: Prompt Response Cache (microseconds). Exact matches, checked only if explicitly enabled per request.
2. L2: Semantic Cache (sub-100ms). Similar-meaning matches (enabled by default).
3. L3: Fresh Request (standard latency). New API call to the provider.
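
The ordering can be pictured as a chain of lookups that stops at the first hit. The sketch below is only an illustration under stated assumptions: resolveRequest and the tiers object are hypothetical, the lookups are synchronous stand-ins for what would be asynchronous calls in practice, and the returned tier strings mirror the cache_tier values documented on this page.

```javascript
// Walk the tiers in priority order and report which one answered.
function resolveRequest(prompt, tiers) {
  // L1: exact-match cache, consulted only when enabled for this request.
  if (tiers.promptCacheEnabled && tiers.promptCache.has(prompt)) {
    return { tier: "prompt_response", response: tiers.promptCache.get(prompt) };
  }
  // L2: semantic cache; the lookup returns null when nothing is similar enough.
  const semanticResponse = tiers.semanticLookup(prompt);
  if (semanticResponse !== null) {
    return { tier: "semantic_similar", response: semanticResponse };
  }
  // L3: no cache hit, so make a fresh provider call (tier stays undefined).
  return { tier: undefined, response: tiers.callProvider(prompt) };
}

// Hypothetical in-memory stand-ins for the three tiers.
const exampleTiers = {
  promptCacheEnabled: false,
  promptCache: new Map([["hello", "cached hello"]]),
  semanticLookup: (prompt) =>
    prompt.includes("password") ? "To reset your password..." : null,
  callProvider: (prompt) => `fresh answer for: ${prompt}`,
};

console.log(resolveRequest("I forgot my password", exampleTiers).tier);
console.log(resolveRequest("hello", exampleTiers).tier);
```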

Configuration Options

Most applications work perfectly with default settings:
// Default - semantic cache enabled automatically
const completion = await openai.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "How do I reset my password?" }]
  // Semantic cache works automatically
});

console.log(`Cache tier: ${completion.usage.cache_tier}`);

Custom Threshold Settings

// Higher threshold = stricter matching
const completion = await openai.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "Technical implementation question" }],
  model_router: {
    semantic_cache: {
      threshold: 0.9  // Higher = more precise matching
    }
  }
});

Configuration Parameters

model_router.semantic_cache (object): Configuration for semantic caching behavior

Similarity Threshold Guide

Choose the right threshold for your use case:

  • Loose Matching (0.7 - 0.8): FAQ systems, customer support, general inquiries
  • Balanced Matching (0.8 - 0.9; the 0.85 default falls in this range): Most applications, mixed content types
  • Strict Matching (0.9+): Technical docs, legal content, precise requirements
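
One way to keep threshold choices consistent across an application is to name them once and build request bodies from those names. In this sketch the preset names, the specific values picked from each range, and the buildRequest helper are all hypothetical; only the model_router.semantic_cache.threshold field comes from the configuration shown on this page.

```javascript
// Representative values chosen from the ranges above (assumed, not official presets).
const THRESHOLD_PRESETS = { loose: 0.75, balanced: 0.85, strict: 0.9 };

// Build a chat completion request body with the chosen semantic cache threshold.
function buildRequest(content, preset = "balanced") {
  const threshold = THRESHOLD_PRESETS[preset];
  if (threshold === undefined) {
    throw new Error(`Unknown threshold preset: ${preset}`);
  }
  return {
    model: "",
    messages: [{ role: "user", content }],
    model_router: { semantic_cache: { threshold } },
  };
}

const faqRequest = buildRequest("How do I cancel my subscription?", "loose");
const legalRequest = buildRequest("Summarize clause 4.2 of the contract", "strict");
console.log(faqRequest.model_router.semantic_cache.threshold);
console.log(legalRequest.model_router.semantic_cache.threshold);
```

The resulting object can be passed to openai.chat.completions.create in the same way as the custom threshold example.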

Threshold Examples

Loose matching (0.7 - 0.8). Good for: Customer support, FAQ systems
"How do I cancel my subscription?" 
// Matches: "cancel membership", "stop billing", "end service"
Pros: High cache hit rate, good for varied user language
Cons: May occasionally match unrelated queries

Cache Performance Tracking

Response Metadata

Every response includes cache performance information:
{
  "id": "chatcmpl-abc123",
  "choices": [{"message": {"content": "To reset your password..."}}],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 25,
    "total_tokens": 40,
    "cache_tier": "semantic_similar"  // Cache performance indicator
  },
  "provider": "cached",
  "model": "cached-response"
}

Cache Tier Values

  • semantic_exact: Perfect match. Identical semantic meaning found.
  • semantic_similar: Similar match. Semantically related content found.
  • undefined: No cache. Fresh response from the API provider.
  • prompt_response: Explicit cache. Served from the prompt-response cache, if enabled.
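
A small helper can turn these values into a hit rate for a batch of responses. This is a minimal sketch: summarizeCacheTiers is a hypothetical name, and it assumes each response carries usage.cache_tier as in the metadata example, treating a missing or unrecognized value as a fresh request.

```javascript
const CACHED_TIERS = ["semantic_exact", "semantic_similar", "prompt_response"];

// Count responses per cache tier and compute the overall hit rate.
function summarizeCacheTiers(responses) {
  const counts = { semantic_exact: 0, semantic_similar: 0, prompt_response: 0, fresh: 0 };
  for (const response of responses) {
    const tier = response.usage ? response.usage.cache_tier : undefined;
    if (CACHED_TIERS.includes(tier)) {
      counts[tier] += 1;
    } else {
      counts.fresh += 1; // undefined cache_tier means the provider answered
    }
  }
  const total = responses.length;
  const hitRate = total === 0 ? 0 : (total - counts.fresh) / total;
  return { counts, hitRate };
}

const sampleResponses = [
  { usage: { cache_tier: "semantic_similar" } },
  { usage: {} },
  { usage: { cache_tier: "semantic_exact" } },
  { usage: { cache_tier: undefined } },
];

const summary = summarizeCacheTiers(sampleResponses);
console.log(summary.hitRate); // 0.5
```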

Performance Characteristics

  • Hit Latency: 50-100ms, including embedding computation
  • Miss Overhead: 10-20ms added for similarity analysis
  • Hit Rate: 40-60% in typical applications
  • Storage: AI-powered, embedding-based indexing
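
These figures combine into a rough average latency: hits cost the hit latency, while misses cost the provider's latency plus the miss overhead. The sketch below works through that arithmetic with midpoints of the ranges above; the 1000ms provider latency is an assumed placeholder, and expectedLatencyMs is a hypothetical helper.

```javascript
// Expected latency = hitRate * hitLatency + (1 - hitRate) * (providerLatency + missOverhead)
function expectedLatencyMs({ hitRate, hitLatencyMs, missOverheadMs, providerLatencyMs }) {
  if (hitRate < 0 || hitRate > 1) {
    throw new RangeError("hitRate must be between 0 and 1");
  }
  const missLatencyMs = providerLatencyMs + missOverheadMs;
  return hitRate * hitLatencyMs + (1 - hitRate) * missLatencyMs;
}

const estimate = expectedLatencyMs({
  hitRate: 0.5,            // midpoint of the 40-60% range
  hitLatencyMs: 75,        // midpoint of 50-100ms
  missOverheadMs: 15,      // midpoint of 10-20ms
  providerLatencyMs: 1000, // assumed provider latency, not a documented figure
});
console.log(estimate); // 545
```

With these inputs the average drops from 1000ms to 545ms; substitute your own measured provider latency and hit rate for a realistic estimate.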

Ideal Use Cases

Perfect Fits

  • Customer Support. High hit rate: Users ask similar questions in many different ways
  • Documentation Search. Consistent responses: Same explanations for similar concepts
  • FAQ Systems. Multiple phrasings: Various ways to ask the same questions
  • Educational Content. Concept explanations: Similar topics with consistent answers

When to Disable

Consider disabling semantic cache for time-sensitive or highly personalized content.

  • Real-time Data. Examples: Stock prices, live sports scores, current weather
  • Personalized Content. Examples: User-specific data, account information, personal history
  • Time-sensitive Info. Examples: Breaking news, current events, recent updates
  • High-precision Content. Examples: Legal advice, medical information, financial guidance

Best Practices

For Maximum Effectiveness

1. Use Default Settings: The 0.85 threshold works well for most applications.
2. Monitor Hit Rates: Track cache_tier values to measure effectiveness.
3. Adjust by Use Case: Use lower thresholds for FAQ systems and higher thresholds for technical content.
4. Test Different Thresholds: Experiment to find optimal settings for your specific use case.

Optimization Tips

Content Strategy

Group similar topics: Organize content to maximize cache hits

Threshold Tuning

Start with defaults: Adjust based on hit rate and accuracy needs

Performance Monitoring

Track metrics: Monitor cache efficiency and response times

User Patterns

Analyze queries: Understand common user question patterns

Error Handling and Reliability

Graceful Degradation: Semantic cache failures never interrupt your requests. They automatically fall back to fresh API calls.

Failure Scenarios

  • Embedding Service Issues: Automatic fallback to fresh requests without semantic matching
  • Cache Storage Problems: Transparent handling; requests proceed normally
  • Similarity Computation Errors: Skip matching and proceed with standard routing
  • Network Issues: Retry logic with fallback to direct API calls
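
The common pattern behind these scenarios is that any error on the cache path is treated as a miss. The sketch below shows that pattern in isolation; it is not Adaptive's internal code, and completeWithCacheFallback, lookupCache, and callProvider are hypothetical names supplied by the caller.

```javascript
// Try the cache first; on any cache error or miss, fall back to the provider.
async function completeWithCacheFallback(prompt, lookupCache, callProvider) {
  try {
    const cached = await lookupCache(prompt);
    if (cached !== null && cached !== undefined) {
      return { source: "cache", response: cached };
    }
  } catch (error) {
    // Cache failures are deliberately swallowed so the request still succeeds.
  }
  return { source: "provider", response: await callProvider(prompt) };
}

// Simulate an embedding service outage: the lookup rejects, the provider answers.
const failingLookup = async () => {
  throw new Error("embedding service unavailable");
};
const fakeProvider = async (prompt) => `fresh answer for: ${prompt}`;

completeWithCacheFallback("How do I reset my password?", failingLookup, fakeProvider)
  .then((result) => console.log(result.source));
```

Provider errors are intentionally not caught here, so genuine request failures still surface to the caller.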

Monitoring and Analytics

Track semantic cache performance in your Adaptive dashboard:
  • Cache hit rates and trends over time
  • Similarity threshold effectiveness for your content
  • Cost savings achieved through semantic matching
  • Response time improvements from cached responses
  • Cache efficiency by content type and user patterns
