[feature](be) Support zstd stream load compression #64711
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: When estimating content length for async group commit, stream load and HTTP stream load detect compressed input only through a list of legacy compressed CSV format enums. Inputs that declare compression through compress_type, such as ZSTD, keep the format FORMAT_CSV_PLAIN or FORMAT_JSON, so the group commit paths do not treat them as compressed. The shared load-format parser also does not recognize ZSTD, so ZSTD stream loads are unsupported. This change teaches LoadUtil to parse ZSTD, centralizes compressed-load detection on compress_type plus the legacy format types, applies it to stream load and HTTP stream group commit planning, and adds ZSTD regression coverage for the CSV, JSON, group commit stream load, and group commit HTTP stream paths.
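The centralized detection described above could look roughly like the following. This is a minimal sketch, not the code in this PR: the `FileFormat` and `CompressType` enums are hypothetical stand-ins for the generated Thrift types, and the function signature is an assumption.

```cpp
// Hypothetical stand-ins for the generated Thrift enums (assumption, not Doris' real types).
enum class FileFormat { CSV_PLAIN, CSV_GZ, CSV_LZO, CSV_BZ2, CSV_LZ4FRAME, CSV_DEFLATE, JSON };
enum class CompressType { UNKNOWN, PLAIN, GZ, LZO, BZ2, LZ4FRAME, DEFLATE, ZSTD };

// Returns true when the payload is compressed, whether that is signalled by
// compress_type (newer path) or by a legacy compressed CSV format value.
inline bool is_compressed_load(FileFormat format, CompressType compress) {
    // Newer path: any explicit compress type other than PLAIN means compressed input.
    if (compress != CompressType::UNKNOWN && compress != CompressType::PLAIN) {
        return true;
    }
    // Legacy path: compression encoded in the format enum itself.
    switch (format) {
    case FileFormat::CSV_GZ:
    case FileFormat::CSV_LZO:
    case FileFormat::CSV_BZ2:
    case FileFormat::CSV_LZ4FRAME:
    case FileFormat::CSV_DEFLATE:
        return true;
    default:
        return false;
    }
}
```

With a helper of this shape, a ZSTD CSV or JSON load is reported as compressed even though its format stays plain, and both the stream load and HTTP stream paths can call the same check.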
### Release note
Support ZSTD compressed stream load and HTTP stream load inputs.
### Check List (For Author)
- Test:
- Unit Test: ./run-be-ut.sh --run --filter=LoadUtilTest.*
- Regression test: ./run-regression-test.sh --run -d load_p0/stream_load -s test_stream_load_compress_type -c 'jdbc:mysql://127.0.0.1:33520/?useLocalSessionState=true&allowLoadLocalInfile=true&zeroDateTimeBehavior=round' -ha 127.0.0.1:33500
- Regression test: ./run-regression-test.sh --run -d load_p0/stream_load -s test_group_commit_stream_load -c 'jdbc:mysql://127.0.0.1:33520/?useLocalSessionState=true&allowLoadLocalInfile=true&zeroDateTimeBehavior=round' -ha 127.0.0.1:33500
- Regression test: ./run-regression-test.sh --run -d load_p0/http_stream -s test_group_commit_http_stream -c 'jdbc:mysql://127.0.0.1:33520/?useLocalSessionState=true&allowLoadLocalInfile=true&zeroDateTimeBehavior=round' -ha 127.0.0.1:33500
- Static check: ./build-support/clang-format.sh && ./build-support/check-format.sh
- Static check attempted: CLANG_TIDY_BINARY=/data/data3/huangruixin/include/src-master/ldb_toolchain/bin/clang-tidy ./build-support/run-clang-tidy.sh (blocked by existing/environment diagnostics: existing HTTP action and LoadUtilTest lint findings plus stddef.h not found; new zstd helper warning fixed and load_util.cpp/load_util.h reported clean)
- Behavior changed: Yes. Stream load and HTTP stream load now accept ZSTD as compress_type and treat it as compressed input for async group commit content length estimation.
- Does this need documentation: No
run buildall
TPC-H: Total hot run time: 29240 ms
TPC-DS: Total hot run time: 172685 ms
ClickBench: Total hot run time: 25.19 s
/review
Reviewed the changed BE stream/http load paths, LoadUtil parsing/helper behavior, FE stream-load and http_stream file-scan parameter propagation, BE CSV/JSON decompressor selection, BE utility test coverage, and regression suites/expected outputs/assets. I did not find a substantiated correctness, compatibility, lifecycle, concurrency, config/session propagation, or test-result issue that needs an inline review comment.
Critical checkpoints:
- Goal and tests: the PR adds ZSTD parsing for stream load plus compressed-load WAL sizing based on `compress_type`; BE UT and regression coverage exercise CSV/JSON ZSTD stream load and group-commit stream/http stream paths.
- Scope/focus: changes are limited to load compression parsing/accounting and targeted tests.
- Concurrency/lifecycle/config: no new shared mutable state, lifecycle ownership, or config item is introduced.
- Compatibility/parallel paths: legacy compressed CSV format enums remain handled, and the `compress_type` path is applied to both stream load and http_stream group-commit sizing.
- Test results: new result rows/counts match the added successful ZSTD loads; binary fixtures decompress successfully. I did not run the full regression or BE unit test suite locally in this review runner.
- User focus: no additional focus points were provided.
Subagent conclusions: optimizer-rewrite reported no optimizer/rewrite or join/aggregate semantic candidates. tests-session-config reported no regression, output, session/config propagation, compatibility, or basic CI/style candidates. No candidates were accepted, dismissed, or merged as duplicates. Convergence round 1 ended with both subagents reporting NO_NEW_VALUABLE_FINDINGS for the same ledger/comment set.
I found one small test issue:
Please change it to
@sollhui I checked this against the current PR head
Before this PR, the expected output for the first query already had 22 rows. This PR adds one
I also ran the related BE UT locally: `./run-be-ut.sh --run --filter=LoadUtilTest.* -j60`
Result: 3 tests passed.
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
Stream load does not support `compress_type=zstd` in the shared load format parser. Async group commit also checks only legacy compressed CSV format enum values when estimating compressed input size, so `compress_type` based compressed input is not handled consistently by stream load and HTTP stream load. This PR adds ZSTD parsing in `LoadUtil::parse_format`, adds a shared `LoadUtil::is_compressed_load` helper for `compress_type` and legacy compressed CSV format types, and uses it in stream load and HTTP stream group commit paths. This PR also adds BE UT and regression coverage for ZSTD CSV/JSON stream load and group commit stream/HTTP stream load.
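As a rough illustration of the parsing side, a `compress_type` string might be mapped to a codec as below. This is a sketch under assumptions: the `Codec` enum and `parse_compress_type` function are hypothetical names, and the case-insensitive matching and the treatment of an empty value as plain are illustrative choices, not confirmed behavior of `LoadUtil::parse_format`.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical codec enum for illustration only (assumption).
enum class Codec { PLAIN, GZ, LZO, BZ2, LZ4FRAME, DEFLATE, ZSTD };

// Maps a compress_type string to a codec. Returns false for unknown values
// and leaves *out untouched in that case.
inline bool parse_compress_type(const std::string& value, Codec* out) {
    // Normalize to upper case so "zstd" and "ZSTD" are treated alike (assumed behavior).
    std::string upper(value);
    std::transform(upper.begin(), upper.end(), upper.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });

    if (upper.empty() || upper == "PLAIN") { *out = Codec::PLAIN; return true; }
    if (upper == "GZ") { *out = Codec::GZ; return true; }
    if (upper == "LZO") { *out = Codec::LZO; return true; }
    if (upper == "BZ2") { *out = Codec::BZ2; return true; }
    if (upper == "LZ4FRAME") { *out = Codec::LZ4FRAME; return true; }
    if (upper == "DEFLATE") { *out = Codec::DEFLATE; return true; }
    if (upper == "ZSTD") { *out = Codec::ZSTD; return true; }  // the case this PR is about
    return false;
}
```

Returning `false` for unrecognized values lets the caller reject the request with a clear error instead of silently loading compressed bytes as plain text.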