The semantic cache is enabled by default. It matches requests that mean the same thing even when they are worded differently, using AI embeddings to identify semantically similar queries. Only successful responses are cached, to ensure reliability.

Benefits

  • Sub-100ms responses for cached queries
  • 60-80% cost reduction for similar requests
  • Intelligent matching - recognizes paraphrases and synonyms
  • Success-only caching - prevents error propagation
  • Deferred storage - caches only after successful completion

How It Works

  • Embedding-based matching: Uses sentence transformers to understand request meaning
  • Similarity thresholds: Configurable matching sensitivity
  • Success validation: Only successful API responses are stored
  • Smart routing: Checked after prompt cache but before fresh API calls
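The matching step above can be sketched as a cosine-similarity search over cached embeddings. This is a minimal illustration, not the actual implementation: the embedding vectors, `find_semantic_match`, and the cache layout are hypothetical stand-ins for what a sentence-transformer model would produce.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def find_semantic_match(query_emb, cache, threshold=0.85):
    # Return the cached entry with the highest similarity that clears
    # the configured threshold, or None on a cache miss.
    best_entry, best_score = None, threshold
    for entry in cache:
        score = cosine_similarity(query_emb, entry["embedding"])
        if score >= best_score:
            best_entry, best_score = entry, score
    return best_entry

# Toy cache with stand-in 3-dimensional embeddings.
cache = [
    {"embedding": [0.9, 0.1, 0.0], "response": "To reset your password..."},
    {"embedding": [0.0, 0.2, 0.9], "response": "JWT auth guide..."},
]
hit = find_semantic_match([0.88, 0.15, 0.05], cache)  # close to entry 1
```

Raising the threshold shrinks the set of queries that count as "the same", which is exactly the knob exposed as `semantic_threshold` below.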

Cache Priority Order

  1. Prompt Cache - Exact parameter matches (if enabled)
  2. Semantic Cache - Similar meaning matches (enabled by default)
  3. Provider APIs - Fresh requests
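The priority order above amounts to a short-circuiting lookup chain. The sketch below assumes hypothetical interfaces (a dict-like prompt cache, a callable semantic lookup, and a `call_api` function); it shows the routing logic, not the real internals.

```python
def resolve(request, prompt_cache, semantic_lookup, call_api):
    # 1. Exact-parameter prompt cache first (if enabled).
    response = prompt_cache.get(request)
    if response is not None:
        return response, "prompt_cache"
    # 2. Semantic cache next: similar-meaning matches.
    response = semantic_lookup(request)
    if response is not None:
        return response, "semantic_cache"
    # 3. Fall through to a fresh provider API call.
    return call_api(request), "provider_api"

# Exact match short-circuits before the semantic layer is consulted.
route = resolve(
    "How do I reset my password?",
    {"How do I reset my password?": "cached answer"},
    lambda q: None,
    lambda q: "fresh answer",
)
```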

Examples

Cache hits for similar requests:

Original: “How do I reset my password?”
Semantic matches:
  • “I forgot my password”
  • “Password reset steps”
  • “Help me recover my account”
Technical queries:
  • “How to implement JWT authentication”
  • “JWT auth implementation guide”
  • “Setting up JSON Web Token auth”

Configuration

// Default behavior (enabled with balanced threshold)
{
  "model": "",
  "messages": [{"role": "user", "content": "How do I reset my password?"}]
  // semantic_cache enabled by default
}

// Customize threshold for stricter matching
{
  "model": "", 
  "messages": [{"role": "user", "content": "Technical implementation question"}],
  "semantic_cache": {
    "semantic_threshold": 0.9  // Higher = stricter matching
  }
}

// Disable for real-time or highly dynamic content
{
  "model": "",
  "messages": [{"role": "user", "content": "What's the current stock price?"}],
  "semantic_cache": {
    "enabled": false
  }
}

Configuration Options

  • enabled: Boolean to enable/disable semantic caching (default: true)
  • semantic_threshold: Float between 0.0-1.0 for similarity matching (default: 0.85)

Similarity Thresholds

  • 0.7-0.8: Loose matching - good for FAQs and general questions
  • 0.8-0.9: Balanced - default setting for most use cases
  • 0.9+: Strict matching - ideal for technical documentation and precise queries
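To make the bands above concrete, here is how a single similarity score fares under a loose, the default, and a strict threshold. Purely illustrative; the score is a made-up example of a close paraphrase.

```python
def is_cache_hit(similarity, threshold):
    # A query reuses a cached response only if its similarity to the
    # cached entry meets or exceeds the configured threshold.
    return similarity >= threshold

score = 0.84  # e.g. a close paraphrase of a cached query

loose = is_cache_hit(score, 0.75)     # hits: FAQ-style matching
balanced = is_cache_hit(score, 0.85)  # just misses the 0.85 default
strict = is_cache_hit(score, 0.90)    # misses: technical-doc matching
```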

Technical Implementation

  • Embeddings: Uses the protocol manager’s AI service for embedding generation and matching
  • Success validation: Responses are only cached after successful API completion
  • Deferred caching: Cache storage happens post-response to ensure reliability
  • Error resilience: Cache failures don’t interrupt request flow
  • Performance: Embedding comparison adds ~10-20ms overhead on cache misses
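The deferred, success-only flow described above can be sketched as follows. The function and the `status` field are hypothetical placeholders; the point is the ordering: call the provider first, store only on success, and never let a storage failure propagate.

```python
def fetch_with_deferred_cache(request, call_api, cache_store):
    # Call the provider first; deferred caching means nothing is
    # stored until the response has completed successfully.
    response = call_api(request)
    if response.get("status") == "success":
        try:
            cache_store(request, response)  # post-response storage
        except Exception:
            pass  # cache failures must not interrupt the request flow
    return response

# Successful responses are stored; error responses never are.
stored = []
result = fetch_with_deferred_cache(
    "reset password",
    lambda req: {"status": "success", "body": "To reset your password..."},
    lambda req, resp: stored.append((req, resp)),
)
```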

Performance Characteristics

  • Hit latency: ~50-100ms including embedding computation
  • Miss overhead: ~10-20ms for similarity computation
  • Hit rate: 40-60% for typical applications with similar queries
  • Storage efficiency: Embedding-based indexing with configurable retention

Best Practices

  • Use default settings for most applications - balanced threshold works well
  • Lower thresholds (0.7-0.8) for customer support and FAQ systems
  • Higher thresholds (0.9+) for technical documentation and precise queries
  • Disable for real-time data - stock prices, weather, current events
  • Monitor hit rates - track cache effectiveness in application metrics
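For the last practice above, a minimal hit-rate tracker looks like the sketch below. The class name is hypothetical; a real deployment would export these counters to its metrics system rather than keep them in memory.

```python
class CacheMetrics:
    # Tracks semantic-cache effectiveness as a simple hit ratio.
    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit: bool):
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Comparing the observed rate against the 40-60% range cited above is a quick way to tell whether the configured threshold suits your traffic.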

When to Disable

Consider disabling semantic cache for:
  • Real-time data requests (stock prices, live sports scores)
  • Highly personalized content (user-specific data)
  • Time-sensitive queries (current events, breaking news)
  • Exact precision required (legal, medical, financial advice)

Error Handling

  • Graceful degradation: Cache failures fall back to fresh API calls
  • Success-only storage: Failed API responses are never cached
  • Deferred caching: Storage happens after successful response delivery
  • Reliability: Cache issues don’t affect application availability
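The graceful-degradation behavior above amounts to wrapping the cache lookup so that any failure silently falls through to a fresh call. A minimal sketch, with hypothetical function names:

```python
def lookup_with_fallback(request, cache_lookup, call_api):
    # Any cache-layer failure degrades to a fresh API call instead of
    # surfacing an error to the caller.
    try:
        hit = cache_lookup(request)
        if hit is not None:
            return hit
    except Exception:
        pass  # a degraded cache never blocks the request
    return call_api(request)

def _failing_lookup(request):
    # Simulates an unavailable cache backend.
    raise RuntimeError("cache unavailable")

result = lookup_with_fallback("q", _failing_lookup, lambda r: "fresh response")
```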