Calc Time to First Token (TTFT) and token speed #779
base: dev
Conversation
Pull request overview
This PR adds Time to First Token (TTFT) and token speed calculation capabilities to the LLM performance tracking infrastructure. The implementation aims to provide better observability into LLM response times and throughput.
- Enhanced the `timed_with_status` decorator to calculate and log content-based performance metrics (speed in chars/s, content length)
- Added TTFT tracking in OpenAI streaming responses to measure latency from request to first token (see the sketch after this list)
- Introduced model name tracking to correlate performance metrics with specific models
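In essence, the two new metrics reduce to a few lines of timing arithmetic. A minimal, runnable sketch (not the PR's actual code; the stand-in `stream` and the print statements are illustrative assumptions):

```python
import time

stream = iter(["Hello", ", ", "world"])  # stand-in for an LLM token stream

start = time.perf_counter()
first_token_time = None
chars = 0
for token in stream:
    if first_token_time is None:
        first_token_time = time.perf_counter()
        # Time To First Token: latency from the request to the first chunk
        print(f"TTFT: {(first_token_time - start) * 1000.0:.2f}ms")
    chars += len(token)

elapsed_s = time.perf_counter() - start
# "Token speed" is measured as character throughput (chars/s), as in the decorator
print(f"speed: {(chars / elapsed_s if elapsed_s > 0 else 0.0):.1f} chars/s")
```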
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/memos/utils.py | Modified the timed_with_status decorator to add performance metrics calculation including total time in seconds, content length, and speed (chars/s) for successful function calls with truthy results |
| src/memos/llms/openai.py | Added TTFT (Time To First Token) tracking in the generate_stream method with timing logic and model name logging for the first token received |
```
- log_extra_args:
    - can be a dict: fixed contextual fields that are always attached to logs;
    - or a callable: like `fn(*args, **kwargs) -> dict`, used to dynamically generate contextual fields at runtime.
- target_models: list of target models to filter performance tracking
```
Copilot AI · Dec 25, 2025
The documentation mentions a new parameter "target_models" but this parameter is not actually implemented in the function signature or used in the function body. Either remove this documentation line or implement the parameter.
Suggested change (delete this line):

```
- target_models: list of target models to filter performance tracking
```
```python
if success_flag and result:
    # Calculate content metrics
    content_length = len(result) if isinstance(result, str) else 0
```
Copilot AI · Dec 25, 2025
The logic assumes that when success_flag is True and result is truthy, the result will be a string for length calculation. However, this is a generic decorator that could be applied to any function. If the result is not a string (e.g., a list, dict, or custom object), content_length silently falls back to 0 and the logged speed becomes meaningless. Consider skipping the content metrics for non-string results, or document that this decorator should only be used with string-returning functions.
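One way to make that guard explicit (a sketch; `content_metrics` is a hypothetical helper, not part of the PR):

```python
def content_metrics(result, elapsed_s):
    """Return (content_length, chars_per_s), or (None, None) when not applicable."""
    if isinstance(result, str) and result:
        length = len(result)
        return length, (length / elapsed_s if elapsed_s > 0 else 0.0)
    # Non-string or empty results carry no content metrics
    return None, None


print(content_metrics("hello world", 0.5))        # (11, 22.0)
print(content_metrics({"not": "a string"}, 0.5))  # (None, None)
```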
```python
# Get model name from self.config
self = args[0] if args else None
model_name = (
    getattr(self, "config", {}).get("model_name_or_path", "unknown")
    if self
    else "unknown"
)
```
Copilot AI · Dec 25, 2025
Accessing self.config via getattr here is fragile. If args is empty, or args[0] has no config attribute, this silently falls back to "unknown"; worse, if config is an object rather than a dict (elsewhere in this PR it is accessed as self.config.model_name_or_path), the .get call raises AttributeError. This assumes all decorated functions are methods with a specific structure. Consider adding more robust type checking or documenting this requirement clearly.
Suggested change (replacing the snippet above):

```python
# Get model name from self.config, if available and well-formed
self_obj = args[0] if args else None
model_name = "unknown"
if self_obj is not None:
    config = getattr(self_obj, "config", None)
    if isinstance(config, dict):
        model_name = config.get("model_name_or_path", "unknown")
```
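Since the streaming code in this same PR reads the config as an attribute (`self.config.model_name_or_path`), a variant that tolerates both dict- and object-style configs may be safer. A sketch extending the suggestion above (the object branch is an assumption):

```python
self_obj = args[0] if args else None
model_name = "unknown"
if self_obj is not None:
    config = getattr(self_obj, "config", None)
    if isinstance(config, dict):
        model_name = config.get("model_name_or_path", "unknown")
    elif config is not None:
        # Object-style config (attribute access), as used in openai.py
        model_name = getattr(config, "model_name_or_path", "unknown")
```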
```python
elif (
    success_flag
):  # Standard logging for non-target models or when not tracking metrics
    msg = (
        f"[TIMER_WITH_STATUS] {log_prefix or fn.__name__} "
        f"took {elapsed_ms:.0f} ms{status_info}, args: {ctx_str}"
    )
    logger.info(msg)
```
Copilot AI · Dec 25, 2025
The logging condition is confusing. The branch at line 103-105 logs when success_flag is True but result is falsy (None, empty string, 0, False, etc.). This means successful executions with falsy results use the standard logging format, while successful executions with truthy results use the performance metrics format. This logic seems inconsistent - consider what should happen when a function successfully returns an empty string or None.
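One way to make the success branches consistent (a sketch reusing the names from the snippets above; the exact message format is an assumption):

```python
if success_flag:
    if isinstance(result, str) and result:
        # Non-empty string result: log with content metrics
        speed = len(result) / (elapsed_ms / 1000.0) if elapsed_ms > 0 else 0.0
        logger.info(
            f"[TIMER_WITH_STATUS] {log_prefix or fn.__name__} "
            f"took {elapsed_ms:.0f} ms{status_info}, "
            f"len={len(result)}, speed={speed:.1f} chars/s"
        )
    else:
        # Falsy or non-string result: fall back to the standard format
        logger.info(
            f"[TIMER_WITH_STATUS] {log_prefix or fn.__name__} "
            f"took {elapsed_ms:.0f} ms{status_info}, args: {ctx_str}"
        )
```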
```python
# 尝试从响应中获取实际模型信息
actual_model = getattr(chunk, "model", None) or self.config.model_name_or_path
requested_model = self.config.model_name_or_path

# Print TTFT info - 显示请求模型和实际模型(如果不一致)
```
Copilot AI · Dec 25, 2025
The comment is in Chinese. For consistency with the rest of the codebase which uses English comments, this should be translated to English.
Suggested change (replacing the snippet above):

```python
# Try to get the actual model information from the response
actual_model = getattr(chunk, "model", None) or self.config.model_name_or_path
requested_model = self.config.model_name_or_path

# Print TTFT info - show requested model and actual model (if they differ)
```
```python
first_token_time = None
start_time = time.perf_counter()

for chunk in response:
    delta = chunk.choices[0].delta

    # Calculate TTFT on first token
    if first_token_time is None:
        first_token_time = time.perf_counter()
        ttft_ms = (first_token_time - start_time) * 1000.0

        # 尝试从响应中获取实际模型信息
        actual_model = getattr(chunk, "model", None) or self.config.model_name_or_path
        requested_model = self.config.model_name_or_path

        # Print TTFT info - 显示请求模型和实际模型(如果不一致)
        if actual_model != requested_model:
            logger.info(
                f"TTFT: {ttft_ms:.2f}ms | Requested: {requested_model} | Actual: {actual_model}"
            )
        else:
            logger.info(f"TTFT: {ttft_ms:.2f}ms | {requested_model}")
```
Copilot AI · Dec 25, 2025
The TTFT (Time To First Token) calculation and logging happens inside the generator loop but only once when first_token_time is None. However, this timing logic doesn't account for cases where the stream might be empty or where no chunks are yielded. Additionally, since this code is inside generate_stream which is decorated with @timed_with_status, there's now redundant timing logic - the decorator already tracks total time. Consider whether TTFT should be tracked separately from the decorator's timing or if this overlaps with the decorator's functionality.
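A sketch of an empty-stream guard, reusing the loop above (the post-loop warning is an assumption):

```python
first_token_time = None
start_time = time.perf_counter()

for chunk in response:
    # ... existing per-chunk handling ...
    if first_token_time is None:
        first_token_time = time.perf_counter()
        ttft_ms = (first_token_time - start_time) * 1000.0
        logger.info(f"TTFT: {ttft_ms:.2f}ms | {self.config.model_name_or_path}")

# Runs once the stream is exhausted; if no chunk ever arrived, TTFT is undefined
if first_token_time is None:
    logger.warning("TTFT: stream ended before any chunk was received")
```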
Copilot AI · Dec 25, 2025
The new TTFT timing logic and model name retrieval in generate_stream lacks test coverage. The existing test at line 46-103 in test_openai.py doesn't verify the TTFT calculation or the logging of model information. Consider adding tests to verify that TTFT is calculated correctly and that model name logging works as expected.
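A possible shape for such a test (a sketch only: the `OpenAILLM` import path, the `openai_config` fixture, and the chunk structure are assumptions inferred from the diff, and the real fixtures in test_openai.py may differ):

```python
from types import SimpleNamespace
from unittest.mock import MagicMock

from memos.llms.openai import OpenAILLM  # class/path assumed from src/memos/llms/openai.py


def make_chunk(text, model="some-model"):
    # Mimic the attributes the streaming loop reads: chunk.choices[0].delta and chunk.model
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)], model=model)


def test_ttft_logged_once_per_stream(caplog, openai_config):  # config fixture assumed
    llm = OpenAILLM(openai_config)
    llm.client = MagicMock()
    llm.client.chat.completions.create.return_value = iter(
        [make_chunk("Hello"), make_chunk(" world")]
    )

    with caplog.at_level("INFO"):
        list(llm.generate_stream([{"role": "user", "content": "hi"}]))

    ttft_records = [r.message for r in caplog.records if "TTFT" in r.message]
    assert len(ttft_records) == 1  # TTFT should be logged exactly once, on the first chunk
```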
Description
Summary: (summary)
Fix: #(issue)
Docs Issue/PR: (docs-issue-or-pr-link)
Reviewer: @(reviewer)
Checklist: