POST /api/v1beta/models/{model}/streamGenerateContent
{
  "candidates": [
    {
      "content": {},
      "finishReason": "<string>"
    }
  ],
  "usageMetadata": {
    "promptTokenCount": 123,
    "candidatesTokenCount": 123,
    "totalTokenCount": 123,
    "cache_tier": "<string>"
  },
  "modelVersion": "<string>",
  "provider": "<string>"
}

Overview

The Gemini Stream Generate Content endpoint provides real-time streaming responses using Google’s Gemini API format. This endpoint is fully compatible with the official @google/genai SDK and streams responses as Server-Sent Events (SSE).
Streaming provides:
  • Real-time token-by-token generation
  • Lower perceived latency for users
  • Progressive UI updates as content generates
  • Full compatibility with Google’s Gen AI SDK

Authentication

x-goog-api-key
string
required
Your Adaptive API key. Also supports Authorization: Bearer, X-API-Key, or api-key headers.
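
When calling the endpoint directly rather than through the SDK, the key is passed as a request header. A minimal sketch, reusing the GEMINI_API_KEY environment variable from the examples below (any one of the accepted headers is sufficient):

// Any one of these headers authenticates the request; x-goog-api-key matches the Gemini SDK default
const headers = {
  'x-goog-api-key': process.env.GEMINI_API_KEY ?? '',
  // 'Authorization': `Bearer ${process.env.GEMINI_API_KEY}`,  // alternative
  // 'X-API-Key': process.env.GEMINI_API_KEY,                  // alternative
  'Content-Type': 'application/json'
};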

Path Parameters

model
string
required
The model to use for streaming generation. Supports Gemini model names and Adaptive’s intelligent routing. Examples:
  • gemini-2.5-pro - Latest Gemini Pro model
  • gemini-2.5-flash - Fast Gemini Flash model (recommended for streaming)
  • gemini-1.5-pro - Gemini 1.5 Pro
  • Custom model aliases configured in Adaptive

Request Body

The request body format is identical to the Generate Content endpoint.
contents
array
required
An array of content parts representing the conversation history or prompt.
"contents": [
  {
    "role": "user",
    "parts": [
      {
        "text": "Write a creative story about space exploration"
      }
    ]
  }
]
config
object
Generation configuration parameters (same as non-streaming endpoint).
provider_configs
object
Adaptive Extension: Provider-specific configuration overrides.
model_router
object
Adaptive Extension: Control intelligent routing behavior.
semantic_cache
object
Adaptive Extension: Semantic caching configuration.
prompt_cache
object
Adaptive Extension: Prompt caching configuration.
fallback
object
Adaptive Extension: Fallback configuration for provider failures.
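
Putting these fields together, an illustrative request body might look like the following (the config and extension values mirror the examples later on this page and are not defaults; provider_configs and prompt_cache are omitted here):

{
  "contents": [
    {
      "role": "user",
      "parts": [{ "text": "Write a creative story about space exploration" }]
    }
  ],
  "config": {
    "temperature": 0.9,
    "maxOutputTokens": 2048
  },
  "semantic_cache": {
    "enabled": true,
    "similarity_threshold": 0.95
  },
  "fallback": {
    "enabled": true,
    "max_retries": 3
  },
  "model_router": {
    "cost_optimization": true,
    "fallback_models": ["claude-3-5-haiku-20241022", "gpt-4o-mini"]
  }
}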

Response Format

The endpoint returns a Server-Sent Events (SSE) stream with Content-Type: text/event-stream. Each chunk follows Gemini’s streaming format.

Stream Chunk Structure

candidates
array
Array of generated response candidates (partial content).
usageMetadata
object
Token usage information (included in final chunk).
modelVersion
string
The actual model version used for generation.
provider
string
Adaptive Extension: The provider that handled the request.

Code Examples

import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI({
  apiKey: process.env.GEMINI_API_KEY,
  httpOptions: {
    baseUrl: 'https://www.llmadaptive.uk/api/v1beta'
  }
});

const stream = await ai.models.generateContentStream({
  model: 'gemini-2.5-flash',
  contents: [
    {
      role: 'user',
      parts: [
        { text: 'Write a creative story about space exploration' }
      ]
    }
  ],
  config: {
    temperature: 0.9,
    maxOutputTokens: 2048
  }
});

// Process stream chunks
for await (const chunk of stream) {
  // Guard: the final metadata chunk may not contain a text part
  const text = chunk.candidates?.[0]?.content?.parts?.[0]?.text ?? '';
  process.stdout.write(text);

  // Check for final chunk with usage metadata
  if (chunk.usageMetadata) {
    console.log('\n\nTokens used:', chunk.usageMetadata.totalTokenCount);
    console.log('Provider:', chunk.provider);
    console.log('Cache tier:', chunk.usageMetadata.cache_tier);
  }
}

Advanced Examples

React Component with Streaming

import { GoogleGenAI } from '@google/genai';
import { useState } from 'react';

// Create the client once, outside the component, so it isn't rebuilt on every render
const ai = new GoogleGenAI({
  apiKey: process.env.NEXT_PUBLIC_GEMINI_API_KEY,
  httpOptions: {
    baseUrl: 'https://www.llmadaptive.uk/api/v1beta'
  }
});

function StreamingChat() {
  const [prompt, setPrompt] = useState('');
  const [response, setResponse] = useState('');
  const [isStreaming, setIsStreaming] = useState(false);

  async function handleSubmit() {
    setIsStreaming(true);
    setResponse('');

    try {
      const stream = await ai.models.generateContentStream({
        model: 'gemini-2.5-flash',
        contents: [
          {
            role: 'user',
            parts: [{ text: prompt }]
          }
        ]
      });

      for await (const chunk of stream) {
        const text = chunk.candidates?.[0]?.content?.parts?.[0]?.text ?? '';
        setResponse(prev => prev + text);
      }
    } catch (error) {
      console.error('Streaming error:', error);
    } finally {
      setIsStreaming(false);
    }
  }

  return (
    <div>
      <textarea
        value={prompt}
        disabled={isStreaming}
        onChange={(e) => setPrompt(e.target.value)}
      />
      <button disabled={isStreaming} onClick={() => handleSubmit()}>
        Send
      </button>
      <div className="response">
        {response}
        {isStreaming && <span className="cursor"></span>}
      </div>
    </div>
  );
}

Multi-Turn Streaming Conversation

const conversationHistory = [
  {
    role: 'user',
    parts: [{ text: 'Tell me about the solar system' }]
  },
  {
    role: 'model',
    parts: [{ text: 'The solar system consists of...' }]
  },
  {
    role: 'user',
    parts: [{ text: 'What about Mars specifically?' }]
  }
];

const stream = await ai.models.generateContentStream({
  model: 'gemini-2.5-pro',
  contents: conversationHistory
});

for await (const chunk of stream) {
  // Print each partial chunk as it arrives
  const text = chunk.candidates?.[0]?.content?.parts?.[0]?.text ?? '';
  process.stdout.write(text);
}

With Adaptive Extensions

const stream = await ai.models.generateContentStream({
  model: 'gemini-2.5-flash',
  contents: [
    {
      role: 'user',
      parts: [{ text: 'Generate a Python web scraper' }]
    }
  ],
  config: {
    temperature: 0.3,
    maxOutputTokens: 3000
  },
  // Adaptive-specific features
  semantic_cache: {
    enabled: true,
    similarity_threshold: 0.95
  },
  fallback: {
    enabled: true,
    max_retries: 3
  },
  model_router: {
    cost_optimization: true,
    fallback_models: ['claude-3-5-haiku-20241022', 'gpt-4o-mini']
  }
});

let fullResponse = '';
for await (const chunk of stream) {
  const text = chunk.candidates?.[0]?.content?.parts?.[0]?.text ?? '';
  fullResponse += text;

  // Update UI progressively (updateUI is a placeholder for your own render logic)
  updateUI(fullResponse);

  // Log metadata from final chunk
  if (chunk.usageMetadata) {
    console.log('Cache hit:', chunk.usageMetadata.cache_tier !== 'none');
    console.log('Provider used:', chunk.provider);
  }
}

Stream Event Format

Each Server-Sent Event (SSE) follows this format:
data: {"candidates":[{"content":{"parts":[{"text":"Once upon"}],"role":"model"}}],"modelVersion":"gemini-2.5-flash","provider":"google"}

data: {"candidates":[{"content":{"parts":[{"text":" a time"}],"role":"model"}}],"modelVersion":"gemini-2.5-flash","provider":"google"}

data: {"candidates":[{"content":{"parts":[{"text":" in a"}],"role":"model"},"finishReason":"STOP"}],"usageMetadata":{"promptTokenCount":12,"candidatesTokenCount":156,"totalTokenCount":168,"cache_tier":"none"},"modelVersion":"gemini-2.5-flash","provider":"google"}

The final chunk includes finishReason and complete usageMetadata with token counts and cache information.
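
If you are not using the SDK, the stream can be consumed directly with fetch. A minimal sketch, assuming the endpoint path shown at the top of this page and a simplified parser (one data: line per SSE event):

// Read the SSE stream and print text as it arrives (Node 18+ or any runtime with fetch)
const res = await fetch(
  'https://www.llmadaptive.uk/api/v1beta/models/gemini-2.5-flash/streamGenerateContent',
  {
    method: 'POST',
    headers: {
      'x-goog-api-key': process.env.GEMINI_API_KEY ?? '',
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      contents: [{ role: 'user', parts: [{ text: 'Write a haiku about the sea' }] }]
    })
  }
);

const reader = res.body!.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });

  // SSE events are separated by a blank line; each event carries one JSON chunk after "data:"
  const events = buffer.split('\n\n');
  buffer = events.pop() ?? '';
  for (const event of events) {
    const data = event.replace(/^data:\s*/, '').trim();
    if (!data) continue;
    const chunk = JSON.parse(data);
    process.stdout.write(chunk.candidates?.[0]?.content?.parts?.[0]?.text ?? '');
  }
}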

Error Handling

If the stream is interrupted, an error chunk will be sent:
data: {
  "error": {
    "code": 500,
    "message": "Stream failed",
    "status": "INTERNAL"
  }
}
Solution: Adaptive’s fallback system automatically retries with alternative providers.
Authentication errors return immediately before streaming begins:
{
  "error": {
    "code": 401,
    "message": "API key required",
    "status": "UNAUTHENTICATED"
  }
}
Solution: Provide a valid API key in the x-goog-api-key header.
Rate limit errors:
{
  "error": {
    "code": 429,
    "message": "Rate limit exceeded",
    "status": "RESOURCE_EXHAUSTED"
  }
}
Solution: Adaptive’s multi-provider load balancing helps distribute requests and avoid rate limits.

Performance Optimization

Use Flash Models

gemini-2.5-flash provides faster streaming than Pro models—ideal for real-time applications

Streaming vs Non-Streaming

Streaming reduces perceived latency by ~60% but uses similar total tokens

Semantic Caching

Cache similar prompts to reduce latency and costs by up to 90%

Load Balancing

Adaptive distributes streaming requests across providers for higher throughput

Best Practices

1. Handle Partial Chunks: Accumulate text chunks progressively; don’t assume complete sentences in each chunk.
2. Monitor Final Chunk: The last chunk contains finishReason and complete usageMetadata; use this for billing and completion detection.
3. Implement Error Recovery: Handle stream interruptions gracefully with reconnection logic or fallback to non-streaming (see the sketch after this list).
4. Use Flash for Speed: For real-time chat applications, prefer gemini-2.5-flash over Pro models.
5. Enable Caching: Use semantic and prompt caching for frequently asked questions or similar prompts.
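
One way to implement the error-recovery practice above: accumulate streamed text inside a try/catch and, if the stream drops, retry the same request with the SDK’s non-streaming generateContent call. A minimal sketch, reusing the ai client from the earlier examples (the retry starts from scratch rather than resuming mid-stream):

// Stream first; fall back to a single non-streaming request if the stream is interrupted
async function generateWithFallback(contents: any[]): Promise<string> {
  try {
    const stream = await ai.models.generateContentStream({
      model: 'gemini-2.5-flash',
      contents
    });

    let text = '';
    for await (const chunk of stream) {
      text += chunk.candidates?.[0]?.content?.parts?.[0]?.text ?? '';
    }
    return text;
  } catch (error) {
    console.warn('Stream interrupted, retrying without streaming:', error);
    const result = await ai.models.generateContent({
      model: 'gemini-2.5-flash',
      contents
    });
    return result.candidates?.[0]?.content?.parts?.[0]?.text ?? '';
  }
}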

Differences from Non-Streaming

Feature          | Streaming                     | Non-Streaming
Response Format  | SSE chunks                    | Single JSON response
Latency          | Progressive (lower perceived) | Wait for full completion
UI Updates       | Real-time token-by-token      | All at once
Token Usage      | Same                          | Same
Error Handling   | Mid-stream errors possible    | Single error response
Use Case         | Chat, real-time apps          | Batch processing, APIs

SDK Integration

For the full SDK integration guide with streaming examples and best practices, see: