Enabled by Default: Semantic caching works automatically - no configuration needed for most applications.
How Semantic Caching Works
1. Request Analysis
Incoming prompts are converted to embeddings that capture their semantic meaning
2. Similarity Search
The system compares the prompt's embedding against cached entries to find responses with similar meaning
3. Intelligent Matching
Relevant cached responses are found even when questions are phrased differently
4. Smart Delivery
Cached responses are returned in under 100ms with full compatibility
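The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Adaptive's implementation: `embed` is a toy bag-of-words stand-in for a real embedding model (it captures word overlap, not meaning), and `SemanticCache`, `store`, and `lookup` are hypothetical names chosen for this example.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return Counter(cleaned.split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical in-memory cache keyed by embedding similarity."""

    def __init__(self, threshold: float = 0.85) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []

    def store(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))

    def lookup(self, prompt: str) -> Optional[str]:
        query = embed(prompt)                      # step 1: request analysis
        best_score, best_response = 0.0, None
        for vector, response in self.entries:      # step 2: similarity search
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        if best_score >= self.threshold:           # step 3: threshold match
            return best_response                   # step 4: deliver cached answer
        return None


cache = SemanticCache()
cache.store("How do I reset my password?", "Use the reset link on the login page.")
hit = cache.lookup("how do i reset my password")    # same words, so it matches
miss = cache.lookup("What is the weather today?")   # no shared words, so None
```

A real embedding model would also match rephrasings such as "I forgot my password"; the toy `embed` function here does not, because it only counts shared words.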
Key Benefits
Speed
<100ms
Response times
Cost Savings
60-80%
Reduction for similar queries
Intelligence
Understands meaning
Not just exact matches
Reliability
Success-only caching
Never caches errors
Real-World Examples
Customer Support Queries
Original Query: “How do I reset my password?”
Semantic Matches (all return the same cached response):
- “I forgot my password”
- “Password reset steps”
- “Help me recover my account”
- “Can’t log in - need new password”
- “Locked out of my account”
Cache Priority System
Adaptive checks caches in optimal order for best performance:
L1: Prompt Response Cache
Microseconds: Exact matches, only if explicitly enabled per request
L2: Semantic Cache
Sub-100ms: Similar meaning matches (enabled by default)
L3: Fresh Request
Standard latency: New API call to provider
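The lookup order can be pictured as a short chain of checks. The sketch below is an assumption-laden illustration rather than Adaptive's routing code: `resolve` and its parameters are hypothetical names, the tier labels reuse the `cache_tier` values documented on this page, and `None` stands in for an undefined tier.

```python
from typing import Callable, Dict, Optional, Tuple


def resolve(
    prompt: str,
    exact_cache: Dict[str, str],
    semantic_lookup: Callable[[str], Optional[str]],
    call_provider: Callable[[str], str],
    use_exact_cache: bool = False,
) -> Tuple[str, Optional[str]]:
    """Return (response, tier), checking the cheapest source first."""
    # L1: exact prompt match, consulted only when enabled for this request
    if use_exact_cache and prompt in exact_cache:
        return exact_cache[prompt], "prompt_response"
    # L2: semantic match (on by default in this sketch)
    cached = semantic_lookup(prompt)
    if cached is not None:
        return cached, "semantic_similar"
    # L3: nothing cached, so make a fresh provider call
    return call_provider(prompt), None


# Stub dependencies, for illustration only
exact = {"hello": "cached hello"}
semantic = lambda p: "Use the reset link." if "password" in p else None
provider = lambda p: "fresh: " + p

first = resolve("hello", exact, semantic, provider, use_exact_cache=True)
second = resolve("reset my password", exact, semantic, provider)
third = resolve("hello", exact, semantic, provider)  # L1 not enabled, no L2 match
```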
Configuration Options
Default Behavior (Recommended)
Most applications work well with the default settings, so no configuration is required.
Custom Threshold Settings
Configuration Parameters
Configuration for semantic caching behavior
Similarity Threshold Guide
Choose the right threshold for your use case:
Loose Matching
0.7 - 0.8
FAQ systems, customer support, general inquiries
Balanced Matching
0.8 - 0.9 (Default)
Most applications, mixed content types
Strict Matching
0.9+
Technical docs, legal content, precise requirements
Threshold Examples
Good for: Customer support, FAQ systems
Pros: High cache hit rate, good for varied user language
Cons: May occasionally match unrelated queries
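To make the trade-off concrete, the sketch below applies three thresholds to the same set of candidate queries. The similarity scores are made-up illustrative numbers, not measurements from any embedding model, and `matches_at` is a hypothetical helper.

```python
from typing import Dict, List

# Hypothetical similarity scores against a cached answer for
# "How do I reset my password?" (illustrative values only).
candidate_scores: Dict[str, float] = {
    "Password reset steps": 0.93,
    "I forgot my password": 0.87,
    "Locked out of my account": 0.78,
    "How do I change my email?": 0.71,  # unrelated intent, moderately similar wording
}


def matches_at(threshold: float, scores: Dict[str, float]) -> List[str]:
    """Return the queries that would be served from cache at this threshold."""
    return [query for query, score in scores.items() if score >= threshold]


loose = matches_at(0.70, candidate_scores)     # all four, including the unrelated query
balanced = matches_at(0.85, candidate_scores)  # the two closest rephrasings
strict = matches_at(0.92, candidate_scores)    # only "Password reset steps"
```

With these numbers, the loose setting gets the highest hit rate but also serves the password answer to the email question, which is the kind of unrelated match the cons above warn about.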
Cache Performance Tracking
Response Metadata
Every response includes cache performance information.
Cache Tier Values
semantic_exact
Perfect match: Identical semantic meaning found
semantic_similar
Similar match: Semantically related content found
undefined
No cache: Fresh response from API provider
prompt_response
Explicit cache: From prompt-response cache if enabled
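One way to use these values is to tally them across responses and compute a hit rate. The sketch below assumes you have already collected the `cache_tier` value from each response into a list, with `None` representing an undefined tier; where the field sits in the response object is not shown here, and `summarize_tiers` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Tier values that indicate the response came from a cache
CACHE_HIT_TIERS = {"semantic_exact", "semantic_similar", "prompt_response"}


def summarize_tiers(tiers: Iterable[Optional[str]]) -> Dict[str, object]:
    """Count responses per tier and compute the overall cache hit rate."""
    counts = Counter("miss" if tier is None else tier for tier in tiers)
    total = sum(counts.values())
    hits = sum(counts[tier] for tier in CACHE_HIT_TIERS)
    return {
        "total": total,
        "hits": hits,
        "hit_rate": hits / total if total else 0.0,
        "by_tier": dict(counts),
    }


observed = ["semantic_exact", None, "semantic_similar", None, "prompt_response"]
summary = summarize_tiers(observed)
# summary["hits"] is 3 and summary["hit_rate"] is 0.6 for this sample
```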
Performance Characteristics
Hit Latency
50-100ms
Including embedding computation
Miss Overhead
10-20ms
Added for similarity analysis
Hit Rate
40-60%
Typical applications
Storage
AI-powered
Embedding-based indexing
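These figures combine into a rough expected latency: hits cost the hit latency, while misses cost the provider latency plus the miss overhead. The sketch below works through that arithmetic with midpoints of the ranges above; the 1000ms provider latency is an assumed figure for illustration, and `expected_latency_ms` is a hypothetical helper.

```python
def expected_latency_ms(
    hit_rate: float,
    hit_latency_ms: float,
    miss_overhead_ms: float,
    provider_latency_ms: float,
) -> float:
    """Average latency per request, weighted by the cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_latency_ms = provider_latency_ms + miss_overhead_ms
    return hit_rate * hit_latency_ms + (1.0 - hit_rate) * miss_latency_ms


# 50% hit rate, 75ms per hit, 15ms miss overhead, assumed 1000ms provider call:
# 0.5 * 75 + 0.5 * (1000 + 15) = 37.5 + 507.5 = 545.0
average = expected_latency_ms(0.5, 75, 15, 1000)
```

Under these assumptions the average drops from 1000ms to 545ms, and the miss overhead only outweighs the savings when the hit rate is very low.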
Ideal Use Cases
Perfect Fits
Customer Support
High hit rate: Users ask similar questions in many different ways
Documentation Search
Consistent responses: Same explanations for similar concepts
FAQ Systems
Multiple phrasings: Various ways to ask the same questions
Educational Content
Concept explanations: Similar topics with consistent answers
When to Disable
Consider disabling semantic cache for time-sensitive or highly personalized content.
Real-time Data
Examples: Stock prices, live sports scores, current weather
Personalized Content
Examples: User-specific data, account information, personal history
Time-sensitive Info
Examples: Breaking news, current events, recent updates
High-precision Content
Examples: Legal advice, medical information, financial guidance
Best Practices
For Maximum Effectiveness
1. Use Default Settings
The 0.85 threshold works well for most applications
2. Monitor Hit Rates
Track cache_tier values to measure effectiveness
3. Adjust by Use Case
Lower thresholds for FAQ systems, higher for technical content
4. Test Different Thresholds
Experiment to find optimal settings for your specific use case
Optimization Tips
Content Strategy
Group similar topics: Organize content to maximize cache hits
Threshold Tuning
Start with defaults: Adjust based on hit rate and accuracy needs
Performance Monitoring
Track metrics: Monitor cache efficiency and response times
User Patterns
Analyze queries: Understand common user question patterns
Error Handling and Reliability
Graceful Degradation: Semantic cache failures never interrupt your requests; they automatically fall back to fresh API calls.
Failure Scenarios
Embedding Service Issues
Automatic fallback to fresh requests without semantic matching
Cache Storage Problems
Transparent handling - requests proceed normally
Similarity Computation Errors
Skip matching and proceed with standard routing
Network Issues
Retry logic with fallback to direct API calls
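The fallback behavior described above follows a common pattern: treat any cache failure as a miss, and write to the cache only after a successful provider response. The sketch below illustrates that pattern in general terms; it is not Adaptive's code, and `answer_with_fallback` and its callback parameters are hypothetical names.

```python
from typing import Callable, Optional


def answer_with_fallback(
    prompt: str,
    cache_lookup: Callable[[str], Optional[str]],
    cache_store: Callable[[str, str], None],
    call_provider: Callable[[str], str],
) -> str:
    """Serve from cache when possible; never let a cache failure break the request."""
    try:
        cached = cache_lookup(prompt)
    except Exception:
        cached = None  # embedding, storage, or similarity failure: treat as a miss
    if cached is not None:
        return cached

    # If the provider call raises, the error propagates and nothing is cached.
    response = call_provider(prompt)

    try:
        cache_store(prompt, response)  # success-only caching
    except Exception:
        pass  # a failed cache write must not affect the caller
    return response


def broken_lookup(prompt: str) -> Optional[str]:
    raise RuntimeError("embedding service unavailable")


stored = {}
result = answer_with_fallback(
    "How do I reset my password?",
    broken_lookup,
    stored.__setitem__,
    lambda p: "Use the reset link on the login page.",
)
# result is the fresh provider response, and it is now stored for later requests
```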
Monitoring and Analytics
Track semantic cache performance in your Adaptive dashboard:
- Cache hit rates and trends over time
- Similarity threshold effectiveness for your content
- Cost savings achieved through semantic matching
- Response time improvements from cached responses
- Cache efficiency by content type and user patterns
Technical Implementation
Under the Hood
Technical Details
Embedding Generation
- Uses state-of-the-art sentence transformers
- Generates high-dimensional semantic representations
- Optimized for speed and accuracy
- Cosine similarity between embedding vectors
- Configurable threshold-based matching
- Fast vector operations for real-time performance
- Efficient vector indexing and search
- Deferred caching after successful responses
- Automatic cache warming and optimization
- Multiple fallback layers for reliability
- Success-only caching prevents error propagation
- Graceful degradation under any failure scenario