The semantic cache is enabled by default. It matches requests that mean the same thing even when they are worded differently, using AI embeddings to identify semantically similar queries. Only successful responses are cached, to ensure reliability.

Benefits

  • Sub-100ms responses for cached queries
  • 60-80% cost reduction for similar requests
  • Intelligent matching - recognizes paraphrases and synonyms
  • Success-only caching - prevents error propagation
  • Deferred storage - caches only after successful completion

How It Works

  • Embedding-based matching: Uses sentence transformers to understand request meaning
  • Similarity thresholds: Configurable matching sensitivity
  • Success validation: Only successful API responses are stored
  • Smart routing: Checked after prompt cache but before fresh API calls
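The matching step above can be sketched as a cosine-similarity search over cached embeddings. This is a minimal illustration, not the actual implementation: the embedding vectors, `find_semantic_match`, and the cache layout are hypothetical stand-ins for what a sentence-transformer model would produce.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def find_semantic_match(query_emb, cache, threshold=0.85):
    # Return the cached entry with the highest similarity that clears
    # the configured threshold, or None on a cache miss.
    best_entry, best_score = None, threshold
    for entry in cache:
        score = cosine_similarity(query_emb, entry["embedding"])
        if score >= best_score:
            best_entry, best_score = entry, score
    return best_entry

# Toy cache with stand-in 3-dimensional embeddings.
cache = [
    {"embedding": [0.9, 0.1, 0.0], "response": "To reset your password..."},
    {"embedding": [0.0, 0.2, 0.9], "response": "JWT auth guide..."},
]
hit = find_semantic_match([0.88, 0.15, 0.05], cache)  # close to entry 1
```

Raising the threshold shrinks the set of queries that count as "the same", which is exactly the knob exposed as `semantic_threshold` below.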

Cache Priority Order

  1. Prompt Cache - Exact parameter matches (if enabled)
  2. Semantic Cache - Similar meaning matches (enabled by default)
  3. Provider APIs - Fresh requests
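The priority order above amounts to a short-circuiting lookup chain. The sketch below assumes hypothetical interfaces (a dict-like prompt cache, a callable semantic lookup, and a `call_api` function); it shows the routing logic, not the real internals.

```python
def resolve(request, prompt_cache, semantic_lookup, call_api):
    # 1. Exact-parameter prompt cache first (if enabled).
    response = prompt_cache.get(request)
    if response is not None:
        return response, "prompt_cache"
    # 2. Semantic cache next: similar-meaning matches.
    response = semantic_lookup(request)
    if response is not None:
        return response, "semantic_cache"
    # 3. Fall through to a fresh provider API call.
    return call_api(request), "provider_api"

# Exact match short-circuits before the semantic layer is consulted.
route = resolve(
    "How do I reset my password?",
    {"How do I reset my password?": "cached answer"},
    lambda q: None,
    lambda q: "fresh answer",
)
```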

Examples

Cache hits for similar requests:

Original: “How do I reset my password?”
Semantic matches:
  • “I forgot my password”
  • “Password reset steps”
  • “Help me recover my account”
Technical queries:
  • “How to implement JWT authentication”
  • “JWT auth implementation guide”
  • “Setting up JSON Web Token auth”

Configuration

// Default behavior (enabled with balanced threshold)
{
  "model": "",
  "messages": [{"role": "user", "content": "How do I reset my password?"}]
  // semantic_cache enabled by default
}

// Customize threshold for stricter matching
{
  "model": "", 
  "messages": [{"role": "user", "content": "Technical implementation question"}],
  "semantic_cache": {
    "semantic_threshold": 0.9  // Higher = stricter matching
  }
}

// Disable for real-time or highly dynamic content
{
  "model": "",
  "messages": [{"role": "user", "content": "What's the current stock price?"}],
  "semantic_cache": {
    "enabled": false
  }
}

Configuration Options

  • enabled: Boolean to enable/disable semantic caching (default: true)
  • semantic_threshold: Float between 0.0-1.0 for similarity matching (default: 0.85)

Similarity Thresholds

  • 0.7-0.8: Loose matching - good for FAQs and general questions
  • 0.8-0.9: Balanced - default setting for most use cases
  • 0.9+: Strict matching - ideal for technical documentation and precise queries
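To make the bands above concrete, here is how a single similarity score fares under a loose, the default, and a strict threshold. Purely illustrative; the score is a made-up example of a close paraphrase.

```python
def is_cache_hit(similarity, threshold):
    # A query reuses a cached response only if its similarity to the
    # cached entry meets or exceeds the configured threshold.
    return similarity >= threshold

score = 0.84  # e.g. a close paraphrase of a cached query

loose = is_cache_hit(score, 0.75)     # hits: FAQ-style matching
balanced = is_cache_hit(score, 0.85)  # just misses the 0.85 default
strict = is_cache_hit(score, 0.90)    # misses: technical-doc matching
```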

Technical Implementation

  • Embeddings: Uses the protocol manager’s AI service for embedding generation and matching
  • Success validation: Responses are only cached after successful API completion
  • Deferred caching: Cache storage happens post-response to ensure reliability
  • Error resilience: Cache failures don’t interrupt request flow
  • Performance: Embedding comparison adds ~10-20ms overhead on cache misses
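The deferred, success-only flow described above can be sketched as follows. The function and the `status` field are hypothetical placeholders; the point is the ordering: call the provider first, store only on success, and never let a storage failure propagate.

```python
def fetch_with_deferred_cache(request, call_api, cache_store):
    # Call the provider first; deferred caching means nothing is
    # stored until the response has completed successfully.
    response = call_api(request)
    if response.get("status") == "success":
        try:
            cache_store(request, response)  # post-response storage
        except Exception:
            pass  # cache failures must not interrupt the request flow
    return response

# Successful responses are stored; error responses never are.
stored = []
result = fetch_with_deferred_cache(
    "reset password",
    lambda req: {"status": "success", "body": "To reset your password..."},
    lambda req, resp: stored.append((req, resp)),
)
```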

Performance Characteristics

  • Hit latency: ~50-100ms including embedding computation
  • Miss overhead: ~10-20ms for similarity computation
  • Hit rate: 40-60% for typical applications with similar queries
  • Storage efficiency: Embedding-based indexing with configurable retention

Best Practices

  • Use default settings for most applications - balanced threshold works well
  • Lower thresholds (0.7-0.8) for customer support and FAQ systems
  • Higher thresholds (0.9+) for technical documentation and precise queries
  • Disable for real-time data - stock prices, weather, current events
  • Monitor hit rates - track cache effectiveness in application metrics
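For the last practice above, a minimal hit-rate tracker looks like the sketch below. The class name is hypothetical; a real deployment would export these counters to its metrics system rather than keep them in memory.

```python
class CacheMetrics:
    # Tracks semantic-cache effectiveness as a simple hit ratio.
    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit: bool):
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Comparing the observed rate against the 40-60% range cited above is a quick way to tell whether the configured threshold suits your traffic.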

When to Disable

Consider disabling semantic cache for:
  • Real-time data requests (stock prices, live sports scores)
  • Highly personalized content (user-specific data)
  • Time-sensitive queries (current events, breaking news)
  • Exact precision required (legal, medical, financial advice)

Error Handling

  • Graceful degradation: Cache failures fall back to fresh API calls
  • Success-only storage: Failed API responses are never cached
  • Deferred caching: Storage happens after successful response delivery
  • Reliability: Cache issues don’t affect application availability
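The graceful-degradation behavior above amounts to wrapping the cache lookup so that any failure silently falls through to a fresh call. A minimal sketch, with hypothetical function names:

```python
def lookup_with_fallback(request, cache_lookup, call_api):
    # Any cache-layer failure degrades to a fresh API call instead of
    # surfacing an error to the caller.
    try:
        hit = cache_lookup(request)
        if hit is not None:
            return hit
    except Exception:
        pass  # a degraded cache never blocks the request
    return call_api(request)

def _failing_lookup(request):
    # Simulates an unavailable cache backend.
    raise RuntimeError("cache unavailable")

result = lookup_with_fallback("q", _failing_lookup, lambda r: "fresh response")
```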