Overview
The Gemini Stream Generate Content endpoint provides real-time streaming responses using Google's Gemini API format. This endpoint is fully compatible with the official `@google/genai` SDK and streams responses as Server-Sent Events (SSE).
Streaming provides:
- Real-time token-by-token generation
- Lower perceived latency for users
- Progressive UI updates as content generates
- Full compatibility with Google’s Gen AI SDK
Authentication
Pass your Adaptive API key in the `x-goog-api-key` header. The `Authorization: Bearer`, `X-API-Key`, and `api-key` headers are also supported, as shown below.
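As a sketch, any one of the following headers authenticates a request; the key value is a placeholder:

```typescript
// Authentication header options; the key value is a placeholder.
const headers = {
  "x-goog-api-key": "YOUR_ADAPTIVE_API_KEY",
  // Equivalent alternatives:
  // "Authorization": "Bearer YOUR_ADAPTIVE_API_KEY",
  // "X-API-Key": "YOUR_ADAPTIVE_API_KEY",
  // "api-key": "YOUR_ADAPTIVE_API_KEY",
};
```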
Path Parameters
The model to use for streaming generation. Both Gemini model names and Adaptive's intelligent routing are supported. Examples (see the URL sketch after this list):
- `gemini-2.5-pro`: latest Gemini Pro model
- `gemini-2.5-flash`: fast Gemini Flash model (recommended for streaming)
- `gemini-1.5-pro`: Gemini 1.5 Pro
- Custom model aliases configured in Adaptive
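As a sketch, the model name is interpolated into the request path the same way as in the Gemini REST API, where `alt=sse` selects the SSE response format; the host below is a placeholder for your Adaptive endpoint:

```typescript
// Build the streaming endpoint URL; the host is a placeholder.
const model = "gemini-2.5-flash";
const url = `https://api.example-adaptive.com/v1beta/models/${model}:streamGenerateContent?alt=sse`;
```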
Request Body
The request body format is identical to the Generate Content endpoint. A representative body is sketched after this list.
- `contents`: an array of content parts representing the conversation history or prompt.
- `generationConfig`: generation configuration parameters (same as the non-streaming endpoint).
- Adaptive Extension: provider-specific configuration overrides.
- Adaptive Extension: control of intelligent routing behavior.
- Adaptive Extension: semantic caching configuration.
- Adaptive Extension: prompt caching configuration.
- Adaptive Extension: fallback configuration for provider failures.
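A representative body in TypeScript. `contents` and `generationConfig` follow the standard Gemini schema; the Adaptive extension keys shown are illustrative placeholders, not confirmed field names:

```typescript
// Example request body; the extension keys below are hypothetical.
const body = {
  contents: [
    { role: "user", parts: [{ text: "Summarize SSE in one paragraph." }] },
  ],
  generationConfig: {
    temperature: 0.7,
    maxOutputTokens: 512,
  },
  // Hypothetical Adaptive extension fields; check the Adaptive reference:
  semantic_cache: { enabled: true },
  fallback: { enabled: true },
};
```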
Response Format
The endpoint returns a Server-Sent Events (SSE) stream with `Content-Type: text/event-stream`. Each chunk follows Gemini's streaming format.
Stream Chunk Structure
- `candidates`: array of generated response candidates (partial content).
- `usageMetadata`: token usage information (included in the final chunk).
- `modelVersion`: the actual model version used for generation.
- Adaptive Extension: the provider that handled the request.
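A TypeScript sketch of the chunk shape, based on the fields above; `provider` carries Adaptive's extension, and its exact key name is an assumption:

```typescript
// Approximate chunk shape: standard Gemini fields plus Adaptive's provider.
interface StreamChunk {
  candidates: Array<{
    content: { role: string; parts: Array<{ text?: string }> };
    finishReason?: string; // present on the final chunk
  }>;
  usageMetadata?: {
    promptTokenCount: number;
    candidatesTokenCount: number;
    totalTokenCount: number;
  };
  modelVersion?: string;
  provider?: string; // Adaptive extension; key name assumed
}
```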
Code Examples
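A minimal streaming example with the official `@google/genai` SDK; the base URL is a placeholder for your Adaptive endpoint:

```typescript
import { GoogleGenAI } from "@google/genai";

// Point the official SDK at Adaptive's Gemini-compatible endpoint.
// The baseUrl below is a placeholder; use your deployment's URL.
const ai = new GoogleGenAI({
  apiKey: process.env.ADAPTIVE_API_KEY,
  httpOptions: { baseUrl: "https://api.example-adaptive.com" },
});

async function main() {
  const stream = await ai.models.generateContentStream({
    model: "gemini-2.5-flash",
    contents: "Explain how SSE streaming works in two sentences.",
  });

  // Each chunk carries partial content; print it as it arrives.
  for await (const chunk of stream) {
    process.stdout.write(chunk.text ?? "");
  }
}

main();
```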
Advanced Examples
React Component with Streaming
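A sketch of a React component that renders tokens as they arrive, reading the SSE stream with `fetch`; the host, path, and API key are placeholders:

```tsx
import { useState } from "react";

// Host, path, and API key are placeholders for your Adaptive deployment.
const API_KEY = "YOUR_ADAPTIVE_API_KEY";
const ENDPOINT =
  "https://api.example-adaptive.com/v1beta/models/gemini-2.5-flash:streamGenerateContent?alt=sse";

export function StreamingChat() {
  const [output, setOutput] = useState("");

  async function ask(prompt: string) {
    setOutput("");
    const res = await fetch(ENDPOINT, {
      method: "POST",
      headers: { "Content-Type": "application/json", "x-goog-api-key": API_KEY },
      body: JSON.stringify({
        contents: [{ role: "user", parts: [{ text: prompt }] }],
      }),
    });

    const reader = res.body!.getReader();
    const decoder = new TextDecoder();
    let buffer = "";

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      buffer += decoder.decode(value, { stream: true });

      // Process only complete lines; keep the incomplete tail in the buffer.
      const lines = buffer.split("\n");
      buffer = lines.pop() ?? "";
      for (const line of lines) {
        if (!line.startsWith("data: ")) continue;
        const chunk = JSON.parse(line.slice(6));
        const text = chunk.candidates?.[0]?.content?.parts?.[0]?.text ?? "";
        if (text) setOutput((prev) => prev + text);
      }
    }
  }

  return (
    <div>
      <button onClick={() => ask("Tell me a short story.")}>Ask</button>
      <pre>{output}</pre>
    </div>
  );
}
```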
Multi-Turn Streaming Conversation
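A multi-turn sketch using the SDK's chat object, which keeps conversation history between streamed replies; the base URL is a placeholder:

```typescript
import { GoogleGenAI } from "@google/genai";

// Base URL is a placeholder for your Adaptive deployment.
const ai = new GoogleGenAI({
  apiKey: process.env.ADAPTIVE_API_KEY,
  httpOptions: { baseUrl: "https://api.example-adaptive.com" },
});

async function main() {
  // The chat object carries the accumulated history into each turn.
  const chat = ai.chats.create({ model: "gemini-2.5-flash" });

  for (const message of [
    "Name three uses of SSE.",
    "Expand on the second one.",
  ]) {
    const stream = await chat.sendMessageStream({ message });
    for await (const chunk of stream) {
      process.stdout.write(chunk.text ?? "");
    }
    process.stdout.write("\n---\n");
  }
}

main();
```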
With Adaptive Extensions
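A sketch using raw `fetch` so the extension fields are visible in the body. The extension key names (`model_router`, `semantic_cache`, `fallback`) are illustrative assumptions, not confirmed; consult the Adaptive reference for the exact keys:

```typescript
// Streaming request with Adaptive extension fields (key names hypothetical).
async function streamWithExtensions(): Promise<Response> {
  return fetch(
    "https://api.example-adaptive.com/v1beta/models/gemini-2.5-flash:streamGenerateContent?alt=sse",
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "x-goog-api-key": "YOUR_ADAPTIVE_API_KEY",
      },
      body: JSON.stringify({
        contents: [{ role: "user", parts: [{ text: "Hello!" }] }],
        // Hypothetical Adaptive extension fields:
        model_router: { cost_bias: 0.5 },
        semantic_cache: { enabled: true },
        fallback: { enabled: true },
      }),
    }
  );
}
```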
Stream Event Format
Each Server-Sent Event (SSE) follows this format:The final chunk includes
finishReason
and complete usageMetadata
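A sketch of the wire format and a minimal parser; the payload shown is illustrative:

```typescript
// One event on the wire looks like (illustrative payload):
//
//   data: {"candidates":[{"content":{"role":"model","parts":[{"text":"Hel"}]}}]}
//
// Events are separated by blank lines. Given one complete line:
function parseEvent(line: string): unknown | null {
  if (!line.startsWith("data: ")) return null;
  return JSON.parse(line.slice("data: ".length));
}
```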
Error Handling
Stream Interruption
If the stream is interrupted, an error chunk is sent mid-stream. Solution: Adaptive's fallback system automatically retries with alternative providers. A client-side recovery sketch follows.
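On the client side, one recovery pattern is to fall back to the non-streaming endpoint when a stream breaks; a sketch, with both calls passed in as app-level helpers you would implement with the SDK or `fetch`:

```typescript
// Accumulate streamed text; if the stream is interrupted, retry once
// against the non-streaming endpoint.
async function generateWithRecovery(
  prompt: string,
  stream: (p: string) => AsyncIterable<string>,
  fallback: (p: string) => Promise<string>
): Promise<string> {
  let text = "";
  try {
    for await (const piece of stream(prompt)) text += piece;
    return text;
  } catch (err) {
    console.warn("Stream interrupted; retrying non-streaming:", err);
    return fallback(prompt);
  }
}
```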
Authentication Errors
Authentication errors return immediately, before streaming begins. Solution: provide a valid API key in the `x-goog-api-key` header.
Rate Limiting
Performance Optimization
- **Use Flash Models**: `gemini-2.5-flash` provides faster streaming than Pro models, ideal for real-time applications.
- **Streaming vs Non-Streaming**: streaming reduces perceived latency by ~60% but uses a similar number of total tokens.
- **Semantic Caching**: cache similar prompts to reduce latency and costs by up to 90%.
- **Load Balancing**: Adaptive distributes streaming requests across providers for higher throughput.
Best Practices
1. **Handle Partial Chunks**: accumulate text chunks progressively; don't assume complete sentences in each chunk.
2. **Monitor Final Chunk**: the last chunk contains `finishReason` and complete `usageMetadata`; use this for billing and completion detection.
3. **Implement Error Recovery**: handle stream interruptions gracefully with reconnection logic or a fallback to non-streaming.
4. **Use Flash for Speed**: for real-time chat applications, prefer `gemini-2.5-flash` over Pro models.
5. **Enable Caching**: use semantic and prompt caching for frequently asked questions or similar prompts.
Differences from Non-Streaming
| Feature | Streaming | Non-Streaming |
| --- | --- | --- |
| Response Format | SSE chunks | Single JSON response |
| Latency | Progressive (lower perceived) | Wait for full completion |
| UI Updates | Real-time, token-by-token | All at once |
| Token Usage | Same | Same |
| Error Handling | Mid-stream errors possible | Single error response |
| Use Case | Chat, real-time apps | Batch processing, APIs |
Related Endpoints
- Generate Content - Non-streaming version of this endpoint
- Chat Completions - OpenAI-compatible streaming chat
- Select Model - Get optimal model recommendations