Environment details
- Programming language: Python 3.10.19
- OS: Linux (WSL2, Ubuntu)
- Package version: google-genai 1.68.0
- API: Gemini Developer API (API key)
What happened
I'm building a TTS audio evaluation pipeline that sends WAV audio clips to Gemini and asks it to compare/rate them. I noticed that `gemini-3.1-flash-lite-preview` always returns `thoughts_token_count: 0` when audio is included in the request, even with `thinking_level="medium"` or `"high"`.

Text-only requests on the same model work fine — thinking tokens are generated as expected.

The same audio requests on `gemini-3-flash-preview` also work fine — 100% thinking activation.

So the issue seems specific to the combination of Flash Lite + audio input.
Steps to reproduce
- Send a request to `gemini-3.1-flash-lite-preview` with inline audio bytes and `ThinkingConfig(thinking_level="medium")`
- Check `response.usage_metadata.thoughts_token_count`
- It returns 0. The same request without audio, or the same request on `gemini-3-flash-preview`, returns non-zero thinking tokens.
Minimal script:

```python
import os

from google import genai
from google.genai.types import GenerateContentConfig, Part, ThinkingConfig

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

with open("sample.wav", "rb") as f:
    audio_bytes = f.read()  # any short WAV

# This produces 0 thinking tokens
response = client.models.generate_content(
    model="gemini-3.1-flash-lite-preview",
    contents=[
        Part.from_text(text="Rate this audio quality from 1-10."),
        Part.from_bytes(data=audio_bytes, mime_type="audio/wav"),
    ],
    config=GenerateContentConfig(
        temperature=1.0,
        thinking_config=ThinkingConfig(thinking_level="medium"),
    ),
)

print(response.usage_metadata.thoughts_token_count)  # 0

# Swap model to gemini-3-flash-preview → non-zero thinking tokens
```
No errors are raised. The parameter is silently accepted but has no effect.
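Because nothing fails loudly, the only way to catch this in a pipeline is to inspect the usage metadata after each call. A small guard I use (a sketch with my own helper name, not library behavior; `thoughts_token_count` can also be `None` when the field is absent, so both cases are treated as "no thinking"):

```python
def thinking_fired(usage_metadata) -> bool:
    """Return True if the response actually spent thinking tokens.

    Treats both None (field absent) and 0 (thinking skipped) as "no thinking".
    """
    count = getattr(usage_metadata, "thoughts_token_count", None)
    return bool(count) and count > 0
```

In the pipeline this gates whether a comparison result is trusted, e.g. `if not thinking_fired(response.usage_metadata): retry_or_flag()`.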
What I tested
I ran a more thorough test to isolate the issue: 4 input types × 2 models × 25 calls = 200 calls, all with `thinking_level="medium"`.

Thinking activation (calls with `thoughts_token_count > 0`):
| Input | gemini-3-flash-preview | gemini-3.1-flash-lite-preview |
| --- | --- | --- |
| Text only | 25/25 (100%) | 25/25 (100%) |
| Text + 1 WAV | 25/25 (100%) | 2/25 (8%) |
| Text + 2 WAVs | 25/25 (100%) | 0/25 (0%) |
| Text + 3 WAVs | 25/25 (100%) | 0/25 (0%) |
Flash Lite thinks normally on text-only, but drops to near-zero once any audio is in the request.
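For reference, the activation numbers above were tallied with logic equivalent to the sketch below (the request loop is elided since it just repeats the minimal script with different `contents`; the function name and tuple shape are mine):

```python
from collections import Counter


def activation_counts(results):
    """Tally thinking activation per (model, input_type) condition.

    `results` is an iterable of (model, input_type, thoughts_token_count)
    tuples collected from response.usage_metadata after each call.
    Returns {(model, input_type): (fired, total)}.
    """
    fired = Counter()
    total = Counter()
    for model, input_type, thoughts in results:
        key = (model, input_type)
        total[key] += 1
        if thoughts and thoughts > 0:  # None and 0 both count as "did not fire"
            fired[key] += 1
    return {key: (fired[key], total[key]) for key in total}
```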
I also ran 600 calls on Flash Lite alone across all 4 thinking levels (with 3 audio inputs):
| thinking_level | Fired / 150 |
| --- | --- |
| minimal | 0 (0%) |
| low | 108 (72%) |
| medium | 0 (0%) |
| high | 22 (14.7%) |
The ordering doesn't make sense — `low` triggers thinking far more than `medium` or `high`.

I also tried adding explicit instructions in the prompt like "Think step by step", "Listen to each audio carefully and thoroughly compare them before answering", and requesting a `reasoning` field in the output. None of these made a difference — `thoughts_token_count` stayed at 0 with audio input on Flash Lite.
Why it matters
In my use case (comparing TTS audio samples), responses without thinking show extreme position bias — the model just picks whichever audio was presented second/middle without actually comparing them. With thinking enabled (on `gemini-3-flash-preview`), the responses are far more meaningful. So this isn't just a cosmetic token count issue; it directly affects output quality for audio tasks.
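For context, "position bias" here is measured by shuffling which sample occupies each presentation slot across trials and tallying how often each slot wins; with genuine comparison the win rates should track sample quality, not slot. A sketch of the tally (function name is mine):

```python
from collections import Counter


def position_win_rates(winning_positions):
    """Fraction of trials won by each presentation slot (0-indexed).

    `winning_positions` lists, per trial, the slot index of the sample
    the model picked. A slot dominating regardless of which sample sits
    in it indicates position bias rather than real comparison.
    """
    counts = Counter(winning_positions)
    n = len(winning_positions)
    return {pos: counts[pos] / n for pos in sorted(counts)}
```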
My understanding
- This looks like a product-side issue (model behavior), not a client library bug
- The model card and thinking docs both list Flash Lite as supporting all four thinking levels with multimodal input