Important: Prompt response caching is disabled by default. Enable it per request when you want maximum speed for repeated or similar queries.
How It Works
1. Semantic Analysis: incoming prompts are converted to vector embeddings that capture their meaning and context.
2. Intelligent Matching: the system looks for both exact matches and semantically similar cached responses (see the sketch after these steps).
3. Instant Delivery: cached responses are returned in under 100ms, converted to your preferred format.
4. Streaming Simulation: cached responses work with streaming requests by chunking content with realistic timing.
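A conceptual sketch of steps 1 and 2, matching on cosine similarity over embeddings. This is illustrative only, not Adaptive's internals; the cache entry shape and the 0.95 threshold are assumptions:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def find_cached(prompt_embedding: list[float], cache: list[dict], threshold: float = 0.95):
    """Return the most semantically similar cached response, if any clears the threshold."""
    best, best_score = None, threshold
    for entry in cache:  # assumed shape: {"embedding": [...], "response": ...}
        score = cosine(prompt_embedding, entry["embedding"])
        if score >= best_score:
            best, best_score = entry["response"], score
    return best
```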
Key Benefits
- Speed: <100ms response times
- Cost: $0.00 (zero API costs on cache hits)
- Accuracy: 100% identical responses
- Reliability: Redis-backed persistent storage
Quick Start
Enable prompt caching by adding the prompt_cache parameter, as in the sketch below:
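A minimal sketch, assuming an OpenAI-compatible chat completions endpoint and the openai Python SDK; the base URL, model name, and the fields inside prompt_cache (enabled, ttl) are illustrative assumptions:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://your-adaptive-endpoint/v1",  # hypothetical endpoint
    api_key="your-api-key",
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    messages=[{"role": "user", "content": "What is semantic caching?"}],
    temperature=0,  # deterministic settings improve cache hit rates
    extra_body={"prompt_cache": {"enabled": True, "ttl": 3600}},  # assumed field names
)
print(response.choices[0].message.content)
```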
Cache Behavior
What Gets Cached
- Semantic content: prompt meaning is matched, so different phrasings of the same question can share an entry.
- Exact parameters: temperature, max_tokens, and other request settings must match exactly.
- Successful responses only: error responses are never cached, preventing error propagation.
- Model compatibility: caching works across compatible models when prompts are semantically similar.
Cache Matching Examples
Differently phrased prompts with the same meaning all hit the same cache entry and return the same cached response with cache_tier: "semantic_similar".
Cache Priority System
Adaptive checks caches in priority order for optimal performance (sketched after this list):
1. L1: Prompt Response Cache. Fastest: exact matches return in microseconds.
2. L2: Semantic Cache. Fast: similar-meaning matches in 1-2ms.
3. L3: Fresh Request. Standard: a new API call to the provider.
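A simplified sketch of this lookup order; the cache and provider objects are assumed stand-ins, not Adaptive's actual implementation:

```python
import hashlib
import json

def exact_key(prompt: str, params: dict) -> str:
    # Exact-match key: the prompt plus canonicalized request parameters.
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def resolve(prompt, params, exact_cache, semantic_cache, provider):
    """Check caches in priority order, falling through to a fresh request."""
    if (hit := exact_cache.get(exact_key(prompt, params))) is not None:
        return hit, "semantic_exact"                   # L1: exact match
    if (hit := semantic_cache.nearest(prompt, params)) is not None:
        return hit, "semantic_similar"                 # L2: similar meaning
    return provider.complete(prompt, **params), None   # L3: fresh API call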
Configuration Options
Configuration options for prompt response caching:
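A sketch of the prompt_cache object; the field names here are illustrative assumptions:

```python
# Illustrative prompt_cache configuration (field names are assumptions):
prompt_cache = {
    "enabled": True,  # opt in per request; caching is off by default
    "ttl": 3600,      # seconds to keep the entry (see TTL guidelines below)
}
```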
TTL (Time To Live) Guidelines
- Static content: 3600-86400s (1-24 hours). Documentation, facts, definitions.
- Semi-dynamic: 600-3600s (10 minutes to 1 hour). Analysis, explanations, tutorials.
- Dynamic content: 60-600s (1-10 minutes). Time-sensitive information.
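As a hypothetical helper, picking a TTL by content type per the guidelines above:

```python
# Hypothetical mapping from content type to a TTL within the guideline ranges.
TTL_BY_CONTENT = {
    "static": 86400,       # documentation, facts, definitions: up to 24 hours
    "semi_dynamic": 1800,  # analysis, explanations, tutorials: ~30 minutes
    "dynamic": 300,        # time-sensitive information: ~5 minutes
}

def cache_options(content_type: str) -> dict:
    return {"enabled": True, "ttl": TTL_BY_CONTENT[content_type]}
```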
Streaming Support
Cached responses work seamlessly with streaming requests:
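Continuing the Quick Start sketch, client code is identical whether the response is cached or fresh:

```python
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    messages=[{"role": "user", "content": "Explain TTL in caching."}],
    stream=True,
    extra_body={"prompt_cache": {"enabled": True}},  # assumed field name
)
for chunk in stream:
    # Cached responses are replayed as chunks with realistic timing.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```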
Cache Performance Tracking
Response Metadata
Every cached response includes performance information:
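A sketch of reading the cache_tier field; where exactly it sits in the response payload is an assumption here:

```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is semantic caching?"}],
    extra_body={"prompt_cache": {"enabled": True}},
)
# With the openai SDK, fields outside the standard schema are exposed via
# model_extra (the location of cache_tier is an assumption).
meta = response.model_extra or {}
print(meta.get("cache_tier"))  # e.g. "semantic_exact" or "semantic_similar"
```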
Cache Tier Values
- semantic_exact: perfect match; an identical prompt and parameters were found.
- semantic_similar: similar match; a semantically equivalent prompt was found.
- undefined: no cache; fresh response from the API provider.
- prompt_response: legacy cache format, being phased out.
Performance Characteristics
- Hit latency: 50-100ms, including semantic analysis
- Miss overhead: 10-20ms added for embedding generation
- Hit rate: 25-40% in typical applications
- Storage: Redis (persistent, scalable)
Best Practices
For Maximum Effectiveness
1. Use deterministic settings: set temperature: 0 for consistent, cacheable results.
2. Choose an appropriate TTL: match cache duration to your content's freshness requirements.
3. Monitor performance: track cache_tier values to measure cache effectiveness (see the sketch after this list).
4. Consider memory usage: monitor Redis memory, especially with high TTL values.
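A hypothetical sketch of practice 3, tallying cache_tier values to estimate your hit rate:

```python
from collections import Counter

tier_counts = Counter()

def record(response) -> None:
    # Treat a missing cache_tier as a cache miss (fresh response).
    tier = (response.model_extra or {}).get("cache_tier") or "miss"
    tier_counts[tier] += 1

def hit_rate() -> float:
    hits = tier_counts["semantic_exact"] + tier_counts["semantic_similar"]
    total = sum(tier_counts.values())
    return hits / total if total else 0.0
```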
Ideal Use Cases
- FAQ systems: a perfect fit, since users ask similar questions in different ways.
- Documentation queries: high hit rate from repeated searches for the same information.
- Educational content: consistent responses, with the same explanations for similar concepts.
- API responses: fast lookups for repeated queries over similar data.
Error Handling
Graceful Degradation: cache failures never block your requests; they automatically fall back to fresh API calls.
Failure Scenarios
- Redis issues: automatic fallback to fresh API requests, with no user impact.
- Cache corruption: self-healing; corrupted entries are automatically removed.
- Embedding errors: semantic matching is skipped and the request proceeds with standard routing.
- Memory limits: LRU eviction removes the least recently used entries to make space.
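A conceptual sketch of this graceful-degradation pattern (illustrative, not Adaptive's code):

```python
def answer(prompt, params, cache, provider):
    """Never let a cache failure block the request."""
    try:
        if (hit := cache.lookup(prompt, params)) is not None:
            return hit
    except Exception:
        pass  # Redis outage, corrupt entry, embedding error: fall through
    return provider.complete(prompt, **params)  # fresh API call
```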
Monitoring and Analytics
Track cache performance in your Adaptive dashboard:
- Cache hit rates and trends
- Semantic matching accuracy
- Cost savings from cached responses
- Response time improvements
- Memory usage and capacity planning