POST /api/v1beta/models/{model}/streamGenerateContent
{
  "candidates": [
    {
      "content": {},
      "finishReason": "<string>"
    }
  ],
  "usageMetadata": {
    "promptTokenCount": 123,
    "candidatesTokenCount": 123,
    "totalTokenCount": 123,
    "cache_tier": "<string>"
  },
  "modelVersion": "<string>",
  "provider": "<string>"
}

Overview

The Gemini Stream Generate Content endpoint provides real-time streaming responses using Google’s Gemini API format. This endpoint is fully compatible with the official @google/genai SDK and streams responses as Server-Sent Events (SSE).
Streaming provides:
  • Real-time token-by-token generation
  • Lower perceived latency for users
  • Progressive UI updates as content generates
  • Full compatibility with Google’s Gen AI SDK

Authentication

x-goog-api-key
string
required
Your Adaptive API key. Also supports Authorization: Bearer, X-API-Key, or api-key headers.
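
When calling the endpoint directly rather than through the SDK, the key is passed as a request header. A minimal sketch, reusing the GEMINI_API_KEY environment variable from the examples below (any one of the accepted headers is sufficient):

// Any one of these headers authenticates the request; x-goog-api-key matches the Gemini SDK default
const headers = {
  'x-goog-api-key': process.env.GEMINI_API_KEY ?? '',
  // 'Authorization': `Bearer ${process.env.GEMINI_API_KEY}`,  // alternative
  // 'X-API-Key': process.env.GEMINI_API_KEY,                  // alternative
  'Content-Type': 'application/json'
};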

Path Parameters

model
string
required
The model to use for streaming generation. Supports Gemini model names and Adaptive’s intelligent routing. Examples:
  • gemini-2.5-pro - Latest Gemini Pro model
  • gemini-2.5-flash - Fast Gemini Flash model (recommended for streaming)
  • gemini-1.5-pro - Gemini 1.5 Pro
  • Custom model aliases configured in Adaptive

Request Body

The request body format is identical to the Generate Content endpoint.
contents
array
required
An array of content parts representing the conversation history or prompt.
"contents": [
  {
    "role": "user",
    "parts": [
      {
        "text": "Write a creative story about space exploration"
      }
    ]
  }
]
config
object
Generation configuration parameters (same as non-streaming endpoint).
provider_configs
object
Adaptive Extension: Provider-specific configuration overrides.
model_router
object
Adaptive Extension: Control intelligent routing behavior.
semantic_cache
object
Adaptive Extension: Semantic caching configuration.
prompt_cache
object
Adaptive Extension: Prompt caching configuration.
fallback
object
Adaptive Extension: Fallback configuration for provider failures.
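
Putting these fields together, an illustrative request body might look like the following (the config and extension values mirror the examples later on this page and are not defaults; provider_configs and prompt_cache are omitted here):

{
  "contents": [
    {
      "role": "user",
      "parts": [{ "text": "Write a creative story about space exploration" }]
    }
  ],
  "config": {
    "temperature": 0.9,
    "maxOutputTokens": 2048
  },
  "semantic_cache": {
    "enabled": true,
    "similarity_threshold": 0.95
  },
  "fallback": {
    "enabled": true,
    "max_retries": 3
  },
  "model_router": {
    "cost_optimization": true,
    "fallback_models": ["claude-3-5-haiku-20241022", "gpt-4o-mini"]
  }
}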

Response Format

The endpoint returns a Server-Sent Events (SSE) stream with Content-Type: text/event-stream. Each chunk follows Gemini’s streaming format.

Stream Chunk Structure

candidates
array
Array of generated response candidates (partial content).
usageMetadata
object
Token usage information (included in final chunk).
modelVersion
string
The actual model version used for generation.
provider
string
Adaptive Extension: The provider that handled the request.

Code Examples

import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI({
  apiKey: process.env.GEMINI_API_KEY,
  httpOptions: {
    baseUrl: 'https://www.llmadaptive.uk/api/v1beta'
  }
});

const stream = await ai.models.generateContentStream({
  model: 'gemini-2.5-flash',
  contents: [
    {
      role: 'user',
      parts: [
        { text: 'Write a creative story about space exploration' }
      ]
    }
  ],
  config: {
    temperature: 0.9,
    maxOutputTokens: 2048
  }
});

// Process stream chunks
for await (const chunk of stream) {
  // Guard: the final metadata chunk may not contain a text part
  const text = chunk.candidates?.[0]?.content?.parts?.[0]?.text ?? '';
  process.stdout.write(text);

  // Check for final chunk with usage metadata
  if (chunk.usageMetadata) {
    console.log('\n\nTokens used:', chunk.usageMetadata.totalTokenCount);
    console.log('Provider:', chunk.provider);
    console.log('Cache tier:', chunk.usageMetadata.cache_tier);
  }
}

Advanced Examples

React Component with Streaming

import { GoogleGenAI } from '@google/genai';
import { useState } from 'react';

// Create the client once, outside the component, so it isn't rebuilt on every render
const ai = new GoogleGenAI({
  apiKey: process.env.NEXT_PUBLIC_GEMINI_API_KEY,
  httpOptions: {
    baseUrl: 'https://www.llmadaptive.uk/api/v1beta'
  }
});

function StreamingChat() {
  const [prompt, setPrompt] = useState('');
  const [response, setResponse] = useState('');
  const [isStreaming, setIsStreaming] = useState(false);

  async function handleSubmit() {
    setIsStreaming(true);
    setResponse('');

    try {
      const stream = await ai.models.generateContentStream({
        model: 'gemini-2.5-flash',
        contents: [
          {
            role: 'user',
            parts: [{ text: prompt }]
          }
        ]
      });

      for await (const chunk of stream) {
        const text = chunk.candidates?.[0]?.content?.parts?.[0]?.text ?? '';
        setResponse(prev => prev + text);
      }
    } catch (error) {
      console.error('Streaming error:', error);
    } finally {
      setIsStreaming(false);
    }
  }

  return (
    <div>
      <textarea
        value={prompt}
        disabled={isStreaming}
        onChange={(e) => setPrompt(e.target.value)}
      />
      <button disabled={isStreaming} onClick={() => handleSubmit()}>
        Send
      </button>
      <div className="response">
        {response}
        {isStreaming && <span className="cursor"></span>}
      </div>
    </div>
  );
}

Multi-Turn Streaming Conversation

const conversationHistory = [
  {
    role: 'user',
    parts: [{ text: 'Tell me about the solar system' }]
  },
  {
    role: 'model',
    parts: [{ text: 'The solar system consists of...' }]
  },
  {
    role: 'user',
    parts: [{ text: 'What about Mars specifically?' }]
  }
];

const stream = await ai.models.generateContentStream({
  model: 'gemini-2.5-pro',
  contents: conversationHistory
});

for await (const chunk of stream) {
  // Print each partial chunk as it arrives
  const text = chunk.candidates?.[0]?.content?.parts?.[0]?.text ?? '';
  process.stdout.write(text);
}

With Adaptive Extensions

const stream = await ai.models.generateContentStream({
  model: 'gemini-2.5-flash',
  contents: [
    {
      role: 'user',
      parts: [{ text: 'Generate a Python web scraper' }]
    }
  ],
  config: {
    temperature: 0.3,
    maxOutputTokens: 3000
  },
  // Adaptive-specific features
  semantic_cache: {
    enabled: true,
    similarity_threshold: 0.95
  },
  fallback: {
    enabled: true,
    max_retries: 3
  },
  model_router: {
    cost_optimization: true,
    fallback_models: ['claude-3-5-haiku-20241022', 'gpt-4o-mini']
  }
});

let fullResponse = '';
for await (const chunk of stream) {
  const text = chunk.candidates?.[0]?.content?.parts?.[0]?.text ?? '';
  fullResponse += text;

  // Update UI progressively (updateUI is a placeholder for your own render logic)
  updateUI(fullResponse);

  // Log metadata from final chunk
  if (chunk.usageMetadata) {
    console.log('Cache hit:', chunk.usageMetadata.cache_tier !== 'none');
    console.log('Provider used:', chunk.provider);
  }
}

Stream Event Format

Each Server-Sent Event (SSE) follows this format:
data: {"candidates":[{"content":{"parts":[{"text":"Once upon"}],"role":"model"}}],"modelVersion":"gemini-2.5-flash","provider":"google"}

data: {"candidates":[{"content":{"parts":[{"text":" a time"}],"role":"model"}}],"modelVersion":"gemini-2.5-flash","provider":"google"}

data: {"candidates":[{"content":{"parts":[{"text":" in a"}],"role":"model"},"finishReason":"STOP"}],"usageMetadata":{"promptTokenCount":12,"candidatesTokenCount":156,"totalTokenCount":168,"cache_tier":"none"},"modelVersion":"gemini-2.5-flash","provider":"google"}

The final chunk includes finishReason and complete usageMetadata with token counts and cache information.
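
If you are not using the SDK, the stream can be consumed directly with fetch. A minimal sketch, assuming the endpoint path shown at the top of this page and a simplified parser (one data: line per SSE event):

// Read the SSE stream and print text as it arrives (Node 18+ or any runtime with fetch)
const res = await fetch(
  'https://www.llmadaptive.uk/api/v1beta/models/gemini-2.5-flash/streamGenerateContent',
  {
    method: 'POST',
    headers: {
      'x-goog-api-key': process.env.GEMINI_API_KEY ?? '',
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      contents: [{ role: 'user', parts: [{ text: 'Write a haiku about the sea' }] }]
    })
  }
);

const reader = res.body!.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });

  // SSE events are separated by a blank line; each event carries one JSON chunk after "data:"
  const events = buffer.split('\n\n');
  buffer = events.pop() ?? '';
  for (const event of events) {
    const data = event.replace(/^data:\s*/, '').trim();
    if (!data) continue;
    const chunk = JSON.parse(data);
    process.stdout.write(chunk.candidates?.[0]?.content?.parts?.[0]?.text ?? '');
  }
}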

Error Handling

If the stream is interrupted, an error chunk will be sent:
data: {
  "error": {
    "code": 500,
    "message": "Stream failed",
    "status": "INTERNAL"
  }
}
Solution: Adaptive’s fallback system automatically retries with alternative providers.
Authentication errors return immediately before streaming begins:
{
  "error": {
    "code": 401,
    "message": "API key required",
    "status": "UNAUTHENTICATED"
  }
}
Solution: Provide a valid API key in the x-goog-api-key header.
Rate limit errors:
{
  "error": {
    "code": 429,
    "message": "Rate limit exceeded",
    "status": "RESOURCE_EXHAUSTED"
  }
}
Solution: Adaptive’s multi-provider load balancing helps distribute requests and avoid rate limits.

Performance Optimization

Use Flash Models

gemini-2.5-flash provides faster streaming than Pro models—ideal for real-time applications

Streaming vs Non-Streaming

Streaming reduces perceived latency by ~60% but uses similar total tokens

Semantic Caching

Cache similar prompts to reduce latency and costs by up to 90%

Load Balancing

Adaptive distributes streaming requests across providers for higher throughput

Best Practices

1. Handle Partial Chunks: Accumulate text chunks progressively; don’t assume complete sentences in each chunk.
2. Monitor Final Chunk: The last chunk contains finishReason and complete usageMetadata; use this for billing and completion detection.
3. Implement Error Recovery: Handle stream interruptions gracefully with reconnection logic or fallback to non-streaming (see the sketch after this list).
4. Use Flash for Speed: For real-time chat applications, prefer gemini-2.5-flash over Pro models.
5. Enable Caching: Use semantic and prompt caching for frequently asked questions or similar prompts.
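
One way to implement the error-recovery practice above: accumulate streamed text inside a try/catch and, if the stream drops, retry the same request with the SDK’s non-streaming generateContent call. A minimal sketch, reusing the ai client from the earlier examples (the retry starts from scratch rather than resuming mid-stream):

// Stream first; fall back to a single non-streaming request if the stream is interrupted
async function generateWithFallback(contents: any[]): Promise<string> {
  try {
    const stream = await ai.models.generateContentStream({
      model: 'gemini-2.5-flash',
      contents
    });

    let text = '';
    for await (const chunk of stream) {
      text += chunk.candidates?.[0]?.content?.parts?.[0]?.text ?? '';
    }
    return text;
  } catch (error) {
    console.warn('Stream interrupted, retrying without streaming:', error);
    const result = await ai.models.generateContent({
      model: 'gemini-2.5-flash',
      contents
    });
    return result.candidates?.[0]?.content?.parts?.[0]?.text ?? '';
  }
}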

Differences from Non-Streaming

Feature          | Streaming                     | Non-Streaming
Response Format  | SSE chunks                    | Single JSON response
Latency          | Progressive (lower perceived) | Wait for full completion
UI Updates       | Real-time token-by-token      | All at once
Token Usage      | Same                          | Same
Error Handling   | Mid-stream errors possible    | Single error response
Use Case         | Chat, real-time apps          | Batch processing, APIs

SDK Integration

For the full SDK integration guide with streaming examples and best practices, see: