The prompt response cache stores complete responses for previous requests and uses AI-powered semantic matching to serve the same answer for similar prompts, typically in under 100ms. When enabled, this layer provides the fastest responses Adaptive can deliver.
Important: Prompt response caching is disabled by default. Enable it per request when you want maximum speed for repeated or similar queries.

How It Works

1. Semantic Analysis: Incoming prompts are converted to AI embeddings to capture their meaning and context.
2. Intelligent Matching: The system looks for both exact matches and semantically similar cached responses (see the sketch after this list).
3. Instant Delivery: Cached responses are returned in under 100ms, converted to your preferred format.
4. Streaming Simulation: Cached responses work with streaming requests by chunking content with realistic timing.
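
The matching step can be pictured with a small sketch: the incoming prompt's embedding is compared against stored embeddings using cosine similarity, and a cached response is returned once the score clears a threshold. This is only an illustration of the idea; the embedding model, threshold, and lookup structure Adaptive actually uses are internal and not shown here:
// Conceptual sketch of semantic matching (illustrative only, not Adaptive's implementation).
type CachedEntry = { embedding: number[]; response: string };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns a cached response when a stored prompt is "close enough" in meaning.
function findSemanticMatch(
  promptEmbedding: number[],
  entries: CachedEntry[],
  threshold = 0.9 // hypothetical threshold, chosen for illustration
): string | undefined {
  for (const entry of entries) {
    if (cosineSimilarity(promptEmbedding, entry.embedding) >= threshold) {
      return entry.response;
    }
  }
  return undefined;
}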

Key Benefits

  • Speed: <100ms response times
  • Cost: $0.00 per cached response (zero API costs)
  • Accuracy: 100% identical responses on cache hits
  • Reliability: Redis-backed persistent storage

Quick Start

Enable prompt caching by adding the prompt_cache parameter:
const completion = await openai.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "What is 2+2?" }],
  temperature: 0.7,
  prompt_cache: {
    enabled: true,
    ttl: 3600 // 1 hour (optional)
  }
});

// Check if response came from cache
console.log(`Cache tier: ${completion.usage.cache_tier}`);

Cache Behavior

What Gets Cached

  • Semantic content: The AI matching layer understands different phrasings of the same question.
  • Exact parameters: Request settings such as temperature, max_tokens, and other parameters must match exactly (see the example after this list).
  • Successful responses only: Error responses are never cached, so failures are not propagated.
  • Model compatibility: Caching works across compatible models when prompts are semantically similar.
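
For example, two requests with the same prompt but different sampling settings use different cache keys. The sketch below assumes an openai client already configured for Adaptive, as in the Quick Start:
// Same prompt, different temperature: the second request does not reuse the first entry.
const first = await openai.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "What is 2+2?" }],
  temperature: 0.7,
  prompt_cache: { enabled: true, ttl: 3600 }
});

const second = await openai.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "What is 2+2?" }],
  temperature: 0.2, // parameter mismatch, so this cannot hit the entry cached above
  prompt_cache: { enabled: true, ttl: 3600 }
});

console.log(first.usage.cache_tier);  // undefined on the very first call
console.log(second.usage.cache_tier); // undefined as well, because temperature differs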

Cache Matching Examples

These prompts would all hit the same cache entry:
// Original request
"What is 2+2?"

// Semantic matches (cache hits)
"Tell me 2 plus 2"
"What's two plus two?"  
"Calculate 2 + 2"
"What does 2+2 equal?"
All return the same cached response with cache_tier: "semantic_similar"
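
To see this in practice, send a rephrased prompt after the original has been cached with the same parameters and inspect cache_tier (client setup as in the Quick Start):
// Rephrased prompt: expect a semantic cache hit if "What is 2+2?" was cached earlier
// with the same request parameters.
const rephrased = await openai.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "What's two plus two?" }],
  temperature: 0.7,
  prompt_cache: { enabled: true, ttl: 3600 }
});

if (rephrased.usage.cache_tier === "semantic_similar") {
  console.log("Served from the semantic cache");
}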

Cache Priority System

Adaptive checks caches in priority order for optimal performance (a client-side way to observe this follows the list):

1. L1: Prompt Response Cache. Fastest: exact matches return in microseconds.
2. L2: Semantic Cache. Fast: similar-meaning matches in 1-2ms.
3. L3: Fresh Request. Standard: a new API call to the provider.
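
One simple way to observe the tiers from your application is to time each request and log the elapsed time next to cache_tier; cached tiers should return far faster than fresh provider calls. A minimal sketch:
// Measure wall-clock latency and log it alongside the reported cache tier.
const start = Date.now();
const completion = await openai.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "What is 2+2?" }],
  temperature: 0.7,
  prompt_cache: { enabled: true }
});
const elapsedMs = Date.now() - start;

console.log(`tier=${completion.usage.cache_tier ?? "fresh"} latency=${elapsedMs}ms`);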

Configuration Options

prompt_cache (object, optional)
Configuration for prompt response caching. Omit the parameter to leave caching disabled (the default).
  • enabled (boolean): set to true to turn on prompt response caching for the request
  • ttl (number, optional): lifetime of the cached entry, in seconds
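
For reference, the shape used throughout this guide can be written as a TypeScript interface; only the enabled and ttl fields shown in the examples are assumed:
// Shape of the prompt_cache parameter as used in this guide.
interface PromptCacheConfig {
  enabled: boolean; // turn the prompt response cache on for this request
  ttl?: number;     // optional lifetime of the cached entry, in seconds
}

const promptCache: PromptCacheConfig = { enabled: true, ttl: 3600 };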

TTL (Time To Live) Guidelines

  • Static content: 3600-86400s (1-24 hours). Documentation, facts, definitions.
  • Semi-dynamic content: 600-3600s (10 minutes to 1 hour). Analysis, explanations, tutorials.
  • Dynamic content: 60-600s (1-10 minutes). Time-sensitive information.
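
A small helper can keep these guidelines in one place; the category names and specific values below are hypothetical and simply mirror the ranges above:
// Map a content category to a TTL (in seconds) following the guidelines above.
type ContentCategory = "static" | "semi-dynamic" | "dynamic";

const TTL_BY_CATEGORY: Record<ContentCategory, number> = {
  "static": 86_400,      // up to 24 hours for facts and documentation
  "semi-dynamic": 3_600, // 1 hour for analysis and tutorials
  "dynamic": 300         // 5 minutes for time-sensitive information
};

// Usage: prompt_cache: { enabled: true, ttl: TTL_BY_CATEGORY["semi-dynamic"] }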

Streaming Support

Cached responses work seamlessly with streaming requests:
const stream = await openai.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "Write a poem about AI" }],
  stream: true,
  prompt_cache: {
    enabled: true,
    ttl: 3600
  }
});

// Cached responses stream naturally with 10ms delays between chunks
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '');
}

Cache Performance Tracking

Response Metadata

Every cached response includes performance information:
{
  "id": "chatcmpl-abc123",
  "choices": [{"message": {"content": "2+2 equals 4"}}],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 20,
    "total_tokens": 30,
    "cache_tier": "semantic_exact"  // Cache performance indicator
  },
  "provider": "cached",
  "model": "cached-response"
}

Cache Tier Values

  • semantic_exact: Perfect match. An identical prompt with identical parameters was found.
  • semantic_similar: Similar match. A semantically equivalent prompt was found.
  • undefined: No cache. The response came fresh from the API provider.
  • prompt_response: Legacy. An older cache format that is being phased out.
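
If you branch or log based on these values, a narrow type can help; this is purely a client-side convenience that mirrors the list above:
// Possible cache_tier values reported in usage (undefined means a fresh provider call).
type CacheTier = "semantic_exact" | "semantic_similar" | "prompt_response" | undefined;

function describeTier(tier: CacheTier): string {
  return tier === undefined ? "cache miss (fresh response)" : `cache hit (${tier})`;
}

// e.g. console.log(describeTier(completion.usage.cache_tier));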

Performance Characteristics

  • Hit latency: 50-100ms, including semantic analysis
  • Miss overhead: 10-20ms added for embedding generation
  • Hit rate: 25-40% in typical applications
  • Storage: Redis (persistent, scalable)
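
To estimate your own hit rate, count cache_tier values over a window of requests. A simple client-side counter sketch (not an official SDK feature):
// Track cache hits vs. misses across requests to estimate the hit rate.
let hits = 0;
let total = 0;

function recordCacheResult(cacheTier: string | undefined): void {
  total++;
  if (cacheTier !== undefined) hits++;
}

function hitRate(): number {
  return total === 0 ? 0 : hits / total;
}

// After each request:
// recordCacheResult(completion.usage.cache_tier);
// console.log(`hit rate: ${(hitRate() * 100).toFixed(1)}%`);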

Best Practices

For Maximum Effectiveness

1. Use deterministic settings: Set temperature: 0 for consistent, cacheable results (see the example after this list).
2. Choose an appropriate TTL: Match cache duration to your content's freshness requirements.
3. Monitor performance: Track cache_tier values to measure cache effectiveness.
4. Consider memory usage: Monitor Redis memory, especially with high TTL values.
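
For instance, a deterministic, cache-friendly request might look like this (client setup as in the Quick Start):
// temperature: 0 keeps outputs deterministic, so identical prompts stay cacheable.
const completion = await openai.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "Define idempotency in one sentence." }],
  temperature: 0,
  prompt_cache: { enabled: true, ttl: 86400 } // long TTL suits static definitions
});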

Ideal Use Cases

  • FAQ systems: A perfect fit, since users ask similar questions in different ways.
  • Documentation queries: High hit rates from repeated searches for the same information.
  • Educational content: Consistent responses, with the same explanations for similar concepts.
  • API responses: Fast lookups for repeated queries over similar data.

Error Handling

Graceful degradation: Cache failures never block your requests; they automatically fall back to fresh API calls.

Failure Scenarios

  • Redis issues: Automatic fallback to fresh API requests, with no user impact.
  • Cache corruption: Self-healing; corrupted entries are automatically removed.
  • Embedding errors: Semantic matching is skipped and the request proceeds with standard routing.
  • Memory limits: LRU eviction removes the oldest entries to make space.

Monitoring and Analytics

Track cache performance in your Adaptive dashboard:
  • Cache hit rates and trends
  • Semantic matching accuracy
  • Cost savings from cached responses
  • Response time improvements
  • Memory usage and capacity planning

Next Steps