The fastest cache level - stores complete responses and returns them instantly when an identical prompt arrives again. Disabled by default - enable it per request. Supports both streaming and non-streaming requests.

How It Works

  • Exact matching: Caches responses for identical prompt + parameter combinations
  • SHA256-based keys: Deterministic cache key generation for consistency
  • Redis backend: Uses existing Redis infrastructure for reliability
  • Streaming conversion: Cached responses are converted to Server-Sent Events for streaming compatibility
  • Success-only caching: Only successful responses are cached, preventing error propagation (see the flow sketch after this list)
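
A minimal sketch of this flow, assuming a Node.js gateway with an ioredis client; cacheKey is the helper sketched under Cache Keys below, and callProvider stands in for the upstream API call:

// Sketch only - exact-match lookup with success-only caching
const Redis = require("ioredis");
const redis = new Redis();

async function withPromptCache(request, ttlSeconds, callProvider) {
  const key = `prompt_cache:${cacheKey(request)}`;      // deterministic SHA256 key

  const cached = await redis.get(key);                  // exact match only
  if (cached) return JSON.parse(cached);                // instant hit, zero API cost

  const response = await callProvider(request);         // cache miss: fresh request
  // Success-only caching: failed responses are never stored
  await redis.set(key, JSON.stringify(response), "EX", ttlSeconds);
  return response;
}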

Benefits

  • Low-millisecond response times for repeated requests (~1-5ms, see Performance Characteristics)
  • Zero API costs for cached responses
  • Perfect accuracy - exact same response every time
  • Streaming support - cached responses work with both regular and streaming requests
  • Redis reliability - persistent storage with configurable TTL

Cache Priority

The prompt cache is checked before the semantic cache for optimal performance (see the sketch after this list):
  1. Prompt Cache (fastest) - Exact parameter matches
  2. Semantic Cache - Similar meaning matches
  3. Provider APIs - Fresh requests
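
A minimal sketch of this lookup order; promptCache, semanticCache, and callProvider are illustrative names rather than the gateway's actual internals:

// Illustrative lookup chain - names are hypothetical
async function resolve(request) {
  const exact = await promptCache.lookup(request);      // 1. exact parameter match
  if (exact) return exact;

  const similar = await semanticCache.lookup(request);  // 2. similar-meaning match
  if (similar) return similar;

  return callProvider(request);                         // 3. fresh provider request
}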

Cache Keys

Responses are cached under a SHA256 hash of the following request fields (see the sketch after this list):
  • Complete message history
  • Model selection
  • All parameters (temperature, max_tokens, top_p, etc.)
  • Tools and tool choice configuration
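
As an illustration, the key can be derived roughly like this in Node.js; the exact field ordering and serialization used internally may differ:

const crypto = require("crypto");

// Sketch of deterministic key generation over the fields listed above
function cacheKey(request) {
  const material = JSON.stringify({
    messages: request.messages,            // complete message history
    model: request.model,                  // model selection
    temperature: request.temperature,      // sampling parameters
    max_tokens: request.max_tokens,
    top_p: request.top_p,
    tools: request.tools,                  // tools and tool choice configuration
    tool_choice: request.tool_choice,
  });
  return crypto.createHash("sha256").update(material).digest("hex");
}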

Examples

Basic Usage

// Enable prompt caching with custom TTL
const completion = await openai.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "What is 2+2?" }],
  temperature: 0.7,
  prompt_cache: {
    enabled: true,
    ttl: 7200  // 2 hours in seconds
  }
});

Streaming Support

// Works with streaming requests
const stream = await openai.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "Write a poem" }],
  stream: true,
  prompt_cache: {
    enabled: true,
    ttl: 3600
  }
});

// Cached responses are streamed word-by-word for consistent UX
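
Consuming the stream looks the same whether the chunks come fresh from the provider or are replayed from the cache:

// Read chunks as they arrive - identical code path for cached and fresh responses
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}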

Cache Miss Scenarios

// Different temperature = cache miss
const response1 = await openai.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "What is 2+2?" }],
  temperature: 0.8,  // Different parameter
  prompt_cache: { enabled: true }
});

// Different message = cache miss  
const response2 = await openai.chat.completions.create({
  model: "", 
  messages: [{ role: "user", content: "What is 3+3?" }],
  temperature: 0.7,
  prompt_cache: { enabled: true }
});
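
Conversely, repeating a request with exactly the same messages and parameters within the TTL is served from the cache:

// Identical payload to response2 above = cache hit (no provider call)
const response3 = await openai.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "What is 3+3?" }],
  temperature: 0.7,
  prompt_cache: { enabled: true }
});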

Configuration

The prompt cache accepts a prompt_cache object in requests:
{
  "prompt_cache": {
    "enabled": true,        // Required: enable caching for this request
    "ttl": 3600            // Optional: TTL in seconds (default: 3600)
  }
}

Configuration Options

  • enabled: Required boolean to enable caching for this request
  • ttl: Optional cache duration in seconds (default: 3600, i.e., 1 hour)

Technical Implementation

  • Storage: Redis with configurable TTL per entry
  • Key generation: SHA256 hash of request parameters for consistency
  • Error handling: Cache failures don’t block requests - they fall back to fresh API calls
  • Memory efficiency: Only successful responses are stored
  • Streaming simulation: Cached responses are chunked and streamed with realistic timing (see the sketch after this list)
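
A rough sketch of the streaming simulation; the word-sized chunks and the 10ms delay come from this page, while the function name and chunk shape are illustrative:

// Illustrative replay of a cached completion as a stream of small chunks
async function* replayAsStream(cachedText, delayMs = 10) {
  for (const word of cachedText.split(/(?<=\s)/)) {      // split but keep whitespace
    yield { choices: [{ delta: { content: word } }] };   // mimic a streaming chunk
    await new Promise((resolve) => setTimeout(resolve, delayMs));  // ~10ms pacing
  }
}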

Performance Characteristics

  • Hit latency: ~1-5ms including Redis roundtrip
  • Storage: Redis with persistence and replication
  • Hit rate: 15-25% for typical applications, higher for repeated workflows
  • Capacity: Limited by Redis memory configuration
  • Streaming delay: 10ms between chunks for a natural streaming feel

Best Practices

  • Use with deterministic requests: Set temperature: 0 for consistent results
  • Configure appropriate TTL: Shorter for time-sensitive content, longer for static responses (see the example after this list)
  • Monitor cache effectiveness: Track hit rates in application metrics
  • Batch identical requests: Group repeated identical requests to maximize cache hits
  • Consider memory usage: Monitor Redis memory consumption with high TTL values
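
For example, a static prompt can combine temperature: 0 with a long TTL, while time-sensitive prompts get a short one; the TTL values below are only illustrative:

// Static content - deterministic sampling plus a long TTL
const policy = await openai.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "Summarize the refund policy below: ..." }],
  temperature: 0,                                 // deterministic = more cache hits
  prompt_cache: { enabled: true, ttl: 86400 }     // 24 hours
});

// Time-sensitive content - short TTL so cached answers refresh quickly
const status = await openai.chat.completions.create({
  model: "",
  messages: [{ role: "user", content: "Summarize the status report below: ..." }],
  temperature: 0,
  prompt_cache: { enabled: true, ttl: 300 }       // 5 minutes
});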

Error Handling

  • Cache initialization failures don’t prevent application startup
  • Cache storage/retrieval errors fall back gracefully to fresh API calls (see the sketch after this list)
  • Only successful API responses are cached to prevent error propagation
  • Redis connection issues are logged but don’t interrupt request flow
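
A sketch of the graceful fallback described above, with illustrative names; the real logic lives inside the gateway, not in client code:

// Illustrative only - cache errors are logged and treated as misses
async function readCacheSafely(redis, key) {
  try {
    return await redis.get(key);
  } catch (err) {
    console.warn("prompt cache unavailable, falling back to provider:", err.message);
    return null;   // treated as a miss - the request continues to the provider API
  }
}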