The prompt response cache stores complete responses for previous requests and uses AI-powered semantic matching to serve the same answer for similar prompts, typically in under 100ms. When enabled, this layer provides the fastest responses Adaptive can deliver.
Important: Prompt response caching is disabled by default. Enable it per request when you want maximum speed for repeated or similar queries.

How It Works

1. Semantic Analysis: Incoming prompts are converted to AI embeddings to capture their meaning and context.
2. Intelligent Matching: The system looks for both exact matches and semantically similar cached responses (see the sketch after this list).
3. Instant Delivery: Cached responses are returned in under 100ms, converted to your preferred format.
4. Streaming Simulation: Cached responses work with streaming requests by chunking content with realistic timing.
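
The matching step can be pictured with a small sketch: the incoming prompt's embedding is compared against stored embeddings using cosine similarity, and a cached response is returned once the score clears a threshold. This is only an illustration of the idea; the embedding model, threshold, and lookup structure Adaptive actually uses are internal and not shown here:
// Conceptual sketch of semantic matching (illustrative only, not Adaptive's implementation).
type CachedEntry = { embedding: number[]; response: string };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns a cached response when a stored prompt is "close enough" in meaning.
function findSemanticMatch(
  promptEmbedding: number[],
  entries: CachedEntry[],
  threshold = 0.9 // hypothetical threshold, chosen for illustration
): string | undefined {
  for (const entry of entries) {
    if (cosineSimilarity(promptEmbedding, entry.embedding) >= threshold) {
      return entry.response;
    }
  }
  return undefined;
}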

Key Benefits

  • Speed: <100ms response times
  • Cost: $0.00 per cached response (zero API costs)
  • Accuracy: 100% identical responses on cache hits
  • Reliability: Redis-backed persistent storage

Quick Start

Enable prompt caching by adding the prompt_cache parameter:
const completion = await openai.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "What is 2+2?" }],
  temperature: 0.7,
  prompt_cache: {
    enabled: true,
    ttl: 3600 // 1 hour (optional)
  }
});

// Check if response came from cache
console.log(`Cache tier: ${completion.usage.cache_tier}`);

Cache Behavior

What Gets Cached

  • Semantic content: The AI matching layer understands different phrasings of the same question.
  • Exact parameters: Request settings such as temperature, max_tokens, and other parameters must match exactly (see the example after this list).
  • Successful responses only: Error responses are never cached, so failures are not propagated.
  • Model compatibility: Caching works across compatible models when prompts are semantically similar.
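
For example, two requests with the same prompt but different sampling settings use different cache keys. The sketch below assumes an openai client already configured for Adaptive, as in the Quick Start:
// Same prompt, different temperature: the second request does not reuse the first entry.
const first = await openai.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "What is 2+2?" }],
  temperature: 0.7,
  prompt_cache: { enabled: true, ttl: 3600 }
});

const second = await openai.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "What is 2+2?" }],
  temperature: 0.2, // parameter mismatch, so this cannot hit the entry cached above
  prompt_cache: { enabled: true, ttl: 3600 }
});

console.log(first.usage.cache_tier);  // undefined on the very first call
console.log(second.usage.cache_tier); // undefined as well, because temperature differs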

Cache Matching Examples

These prompts would all hit the same cache entry:
// Original request
"What is 2+2?"

// Semantic matches (cache hits)
"Tell me 2 plus 2"
"What's two plus two?"  
"Calculate 2 + 2"
"What does 2+2 equal?"
All return the same cached response with cache_tier: "semantic_similar"
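
To see this in practice, send a rephrased prompt after the original has been cached with the same parameters and inspect cache_tier (client setup as in the Quick Start):
// Rephrased prompt: expect a semantic cache hit if "What is 2+2?" was cached earlier
// with the same request parameters.
const rephrased = await openai.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "What's two plus two?" }],
  temperature: 0.7,
  prompt_cache: { enabled: true, ttl: 3600 }
});

if (rephrased.usage.cache_tier === "semantic_similar") {
  console.log("Served from the semantic cache");
}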

Cache Priority System

Adaptive checks caches in priority order for optimal performance (a client-side way to observe this follows the list):

1. L1: Prompt Response Cache. Fastest: exact matches return in microseconds.
2. L2: Semantic Cache. Fast: similar-meaning matches in 1-2ms.
3. L3: Fresh Request. Standard: a new API call to the provider.
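
One simple way to observe the tiers from your application is to time each request and log the elapsed time next to cache_tier; cached tiers should return far faster than fresh provider calls. A minimal sketch:
// Measure wall-clock latency and log it alongside the reported cache tier.
const start = Date.now();
const completion = await openai.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "What is 2+2?" }],
  temperature: 0.7,
  prompt_cache: { enabled: true }
});
const elapsedMs = Date.now() - start;

console.log(`tier=${completion.usage.cache_tier ?? "fresh"} latency=${elapsedMs}ms`);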

Configuration Options

prompt_cache (object, optional)
Configuration for prompt response caching. Omit the parameter to leave caching disabled (the default).
  • enabled (boolean): set to true to turn on prompt response caching for the request
  • ttl (number, optional): lifetime of the cached entry, in seconds
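
For reference, the shape used throughout this guide can be written as a TypeScript interface; only the enabled and ttl fields shown in the examples are assumed:
// Shape of the prompt_cache parameter as used in this guide.
interface PromptCacheConfig {
  enabled: boolean; // turn the prompt response cache on for this request
  ttl?: number;     // optional lifetime of the cached entry, in seconds
}

const promptCache: PromptCacheConfig = { enabled: true, ttl: 3600 };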

TTL (Time To Live) Guidelines

  • Static content: 3600-86400s (1-24 hours). Documentation, facts, definitions.
  • Semi-dynamic content: 600-3600s (10 minutes to 1 hour). Analysis, explanations, tutorials.
  • Dynamic content: 60-600s (1-10 minutes). Time-sensitive information.
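
A small helper can keep these guidelines in one place; the category names and specific values below are hypothetical and simply mirror the ranges above:
// Map a content category to a TTL (in seconds) following the guidelines above.
type ContentCategory = "static" | "semi-dynamic" | "dynamic";

const TTL_BY_CATEGORY: Record<ContentCategory, number> = {
  "static": 86_400,      // up to 24 hours for facts and documentation
  "semi-dynamic": 3_600, // 1 hour for analysis and tutorials
  "dynamic": 300         // 5 minutes for time-sensitive information
};

// Usage: prompt_cache: { enabled: true, ttl: TTL_BY_CATEGORY["semi-dynamic"] }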

Streaming Support

Cached responses work seamlessly with streaming requests:
const stream = await openai.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "Write a poem about AI" }],
  stream: true,
  prompt_cache: {
    enabled: true,
    ttl: 3600
  }
});

// Cached responses stream naturally with 10ms delays between chunks
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '');
}

Cache Performance Tracking

Response Metadata

Every cached response includes performance information:
{
  "id": "chatcmpl-abc123",
  "choices": [{"message": {"content": "2+2 equals 4"}}],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 20,
    "total_tokens": 30,
    "cache_tier": "semantic_exact"  // Cache performance indicator
  },
  "provider": "cached",
  "model": "cached-response"
}

Cache Tier Values

  • semantic_exact: Perfect match. An identical prompt with identical parameters was found.
  • semantic_similar: Similar match. A semantically equivalent prompt was found.
  • undefined: No cache. The response came fresh from the API provider.
  • prompt_response: Legacy. An older cache format that is being phased out.
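
If you branch or log based on these values, a narrow type can help; this is purely a client-side convenience that mirrors the list above:
// Possible cache_tier values reported in usage (undefined means a fresh provider call).
type CacheTier = "semantic_exact" | "semantic_similar" | "prompt_response" | undefined;

function describeTier(tier: CacheTier): string {
  return tier === undefined ? "cache miss (fresh response)" : `cache hit (${tier})`;
}

// e.g. console.log(describeTier(completion.usage.cache_tier));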

Performance Characteristics

  • Hit latency: 50-100ms, including semantic analysis
  • Miss overhead: 10-20ms added for embedding generation
  • Hit rate: 25-40% in typical applications
  • Storage: Redis (persistent, scalable)
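
To estimate your own hit rate, count cache_tier values over a window of requests. A simple client-side counter sketch (not an official SDK feature):
// Track cache hits vs. misses across requests to estimate the hit rate.
let hits = 0;
let total = 0;

function recordCacheResult(cacheTier: string | undefined): void {
  total++;
  if (cacheTier !== undefined) hits++;
}

function hitRate(): number {
  return total === 0 ? 0 : hits / total;
}

// After each request:
// recordCacheResult(completion.usage.cache_tier);
// console.log(`hit rate: ${(hitRate() * 100).toFixed(1)}%`);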

Best Practices

For Maximum Effectiveness

1. Use deterministic settings: Set temperature: 0 for consistent, cacheable results (see the example after this list).
2. Choose an appropriate TTL: Match cache duration to your content's freshness requirements.
3. Monitor performance: Track cache_tier values to measure cache effectiveness.
4. Consider memory usage: Monitor Redis memory, especially with high TTL values.
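
For instance, a deterministic, cache-friendly request might look like this (client setup as in the Quick Start):
// temperature: 0 keeps outputs deterministic, so identical prompts stay cacheable.
const completion = await openai.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "Define idempotency in one sentence." }],
  temperature: 0,
  prompt_cache: { enabled: true, ttl: 86400 } // long TTL suits static definitions
});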

Ideal Use Cases

  • FAQ systems: A perfect fit, since users ask similar questions in different ways.
  • Documentation queries: High hit rates from repeated searches for the same information.
  • Educational content: Consistent responses, with the same explanations for similar concepts.
  • API responses: Fast lookups for repeated queries over similar data.

Error Handling

Graceful degradation: Cache failures never block your requests; they automatically fall back to fresh API calls.

Failure Scenarios

  • Redis issues: Automatic fallback to fresh API requests, with no user impact.
  • Cache corruption: Self-healing; corrupted entries are automatically removed.
  • Embedding errors: Semantic matching is skipped and the request proceeds with standard routing.
  • Memory limits: LRU eviction removes the oldest entries to make space.

Monitoring and Analytics

Track cache performance in your Adaptive dashboard:
  • Cache hit rates and trends
  • Semantic matching accuracy
  • Cost savings from cached responses
  • Response time improvements
  • Memory usage and capacity planning

Next Steps