Feat reads regional benchmark#16795
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements time-based microbenchmarks for regional bucket read operations using JSON and gRPC APIs, and updates the AsyncGrpcClient to support custom API endpoints and quota project IDs. Review feedback highlights several improvement opportunities: moving authentication token retrieval out of the module level to optimize multiprocessing startup and prevent test collection failures, specifying explicit file encoding when opening configuration files, removing an unused import, and adding HTTP status checks to ensure benchmark accuracy by failing fast on unauthorized or missing resource errors.
| token = subprocess.run( | ||
| ["gcloud", "auth", "print-access-token"], | ||
| capture_output=True, | ||
| text=True, | ||
| check=True, | ||
| ).stdout.strip() |
There was a problem hiding this comment.
Executing subprocess.run at the module level to fetch an authentication token is problematic for several reasons:
- Test Collection: This command runs whenever the module is imported, which can fail the entire test suite during collection if
gcloudis not installed or the user is not authenticated. - Multiprocessing Overhead: Since the benchmark uses the
spawnstart method (line 214), this module is re-imported in every worker process. For a workload with many processes (e.g., 96), this results in 96 concurrent, redundant calls togcloud, causing significant startup delay and potential rate limiting.
Consider fetching the token once in the parent process and passing it to the workers via the initializer or as an argument to the worker function.
| """Generates a dictionary of benchmark parameters for time based read operations.""" | ||
| params: Dict[str, List[TimeBasedReadParameters]] = {} | ||
| config_path = os.path.join(os.path.dirname(__file__), "config.yaml") | ||
| with open(config_path, "r") as f: |
There was a problem hiding this comment.
| import pytest | ||
| import aiohttp | ||
| import subprocess | ||
| import math |
| async with session.get(url, headers=headers) as response: | ||
| data = await response.read() | ||
| bytes_in_iteration += len(data) |
There was a problem hiding this comment.
The code reads the response body without checking the HTTP status code. If the request fails (e.g., 401 Unauthorized, 404 Not Found), response.read() will return the error message body, which will then be counted as downloaded data, leading to incorrect benchmark results. Adding response.raise_for_status() ensures fail-fast behavior and prevents potential issues with unexpected states.
| async with session.get(url, headers=headers) as response: | |
| data = await response.read() | |
| bytes_in_iteration += len(data) | |
| async with session.get(url, headers=headers) as response: | |
| response.raise_for_status() | |
| data = await response.read() | |
| bytes_in_iteration += len(data) |
References
- When a function encounters an unexpected state or unsupported input, it should raise an error instead of silently continuing to ensure fail-fast behavior.
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Fixes #<issue_number_goes_here> 🦕