
[feature](be) Support zstd stream load compression#64711

Merged
liaoxin01 merged 1 commit into apache:master from 0AyanamiRei:stream-load-zstd-compression
Jun 23, 2026

@0AyanamiRei 0AyanamiRei commented Jun 23, 2026

Contributor

What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary:

Stream load does not support compress_type=zstd in the shared load format parser. Async group commit also checks only the legacy compressed CSV format enum values when estimating compressed input size, so compress_type-based compressed input is not handled consistently by stream load and HTTP stream load.

This PR adds ZSTD parsing in LoadUtil::parse_format, adds a shared LoadUtil::is_compressed_load helper for compress_type and legacy compressed CSV format types, and uses it in stream load and HTTP stream group commit paths. This PR also adds BE UT and regression coverage for ZSTD CSV/JSON stream load and group commit stream/HTTP stream load.
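The centralized detection described above can be pictured with a minimal C++ sketch. It is not the code merged in this PR: the enums are simplified stand-ins for the real Thrift types, `parse_compress_type` is a hypothetical helper, and the case-insensitive matching and the treatment of an empty value as uncompressed are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Simplified stand-ins for the real format/compression enums (assumption).
enum class FormatType { CSV_PLAIN, CSV_GZ, CSV_BZ2, CSV_LZ4FRAME, JSON };
enum class CompressType { PLAIN, GZ, BZ2, LZ4FRAME, ZSTD, UNKNOWN };

// Hypothetical helper: maps a compress_type header value to an enum.
// Matching is case-insensitive; an empty value is treated as PLAIN (assumption).
CompressType parse_compress_type(std::string value) {
    std::transform(value.begin(), value.end(), value.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    if (value.empty() || value == "PLAIN") return CompressType::PLAIN;
    if (value == "GZ") return CompressType::GZ;
    if (value == "BZ2") return CompressType::BZ2;
    if (value == "LZ4FRAME") return CompressType::LZ4FRAME;
    if (value == "ZSTD") return CompressType::ZSTD;
    return CompressType::UNKNOWN;
}

// A load counts as compressed when compress_type names a codec, or when the
// format itself is one of the legacy compressed CSV variants.
bool is_compressed_load(FormatType format, CompressType compress) {
    if (compress != CompressType::PLAIN && compress != CompressType::UNKNOWN) {
        return true;
    }
    switch (format) {
    case FormatType::CSV_GZ:
    case FormatType::CSV_BZ2:
    case FormatType::CSV_LZ4FRAME:
        return true;
    default:
        return false;
    }
}
```

With a single predicate like this, a ZSTD load whose format stays CSV plain or JSON is still recognized as compressed, which is the case the legacy enum check missed.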

Release note

Support ZSTD compressed stream load and HTTP stream load.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes. Stream load and HTTP stream load support ZSTD compressed input.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Stream load and HTTP stream load parse compressed CSV data through a legacy format enum list when estimating async group commit content length, while newer compress_type-based inputs such as ZSTD remain FORMAT_CSV_PLAIN or FORMAT_JSON. This makes ZSTD stream loads unsupported by the shared load-format parser and misses compressed-load handling in group commit paths. This change teaches LoadUtil to parse ZSTD, centralizes compressed load detection by compress_type plus legacy format types, applies it to stream load and HTTP stream group commit planning, and adds zstd regression coverage for CSV, JSON, group commit stream load, and group commit HTTP stream paths.

### Release note

Support ZSTD compressed stream load and HTTP stream load inputs.

### Check List (For Author)

- Test:
    - Unit Test: `./run-be-ut.sh --run --filter=LoadUtilTest.*`
    - Regression test: `./run-regression-test.sh --run -d load_p0/stream_load -s test_stream_load_compress_type -c 'jdbc:mysql://127.0.0.1:33520/?useLocalSessionState=true&allowLoadLocalInfile=true&zeroDateTimeBehavior=round' -ha 127.0.0.1:33500`
    - Regression test: `./run-regression-test.sh --run -d load_p0/stream_load -s test_group_commit_stream_load -c 'jdbc:mysql://127.0.0.1:33520/?useLocalSessionState=true&allowLoadLocalInfile=true&zeroDateTimeBehavior=round' -ha 127.0.0.1:33500`
    - Regression test: `./run-regression-test.sh --run -d load_p0/http_stream -s test_group_commit_http_stream -c 'jdbc:mysql://127.0.0.1:33520/?useLocalSessionState=true&allowLoadLocalInfile=true&zeroDateTimeBehavior=round' -ha 127.0.0.1:33500`
    - Static check: `./build-support/clang-format.sh && ./build-support/check-format.sh`
    - Static check attempted: `CLANG_TIDY_BINARY=/data/data3/huangruixin/include/src-master/ldb_toolchain/bin/clang-tidy ./build-support/run-clang-tidy.sh` (blocked by existing/environment diagnostics: existing HTTP action and LoadUtilTest lint findings plus `stddef.h` not found; new zstd helper warning fixed and `load_util.cpp`/`load_util.h` reported clean)
- Behavior changed: Yes. Stream load and HTTP stream load now accept ZSTD as `compress_type` and treat it as compressed input for async group commit content length estimation.
- Does this need documentation: No
@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@0AyanamiRei

Contributor Author

run buildall

@hello-stephen

Contributor
TPC-H: Total hot run time: 29240 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit fdd2a59a9196fe6c4aae6062508b8f9b7995afc2, data reload: false

------ Round 1 ----------------------------------
============================================
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17834	4039	4109	4039
q2	2041	318	195	195
q3	10262	1431	847	847
q4	4684	467	339	339
q5	7506	840	568	568
q6	180	171	137	137
q7	805	859	619	619
q8	9364	1546	1682	1546
q9	5937	4517	4554	4517
q10	6765	1771	1515	1515
q11	434	280	245	245
q12	633	418	304	304
q13	18105	3377	2752	2752
q14	275	271	250	250
q15	q16	791	791	709	709
q17	970	998	979	979
q18	7008	5841	5577	5577
q19	1362	1294	996	996
q20	488	401	268	268
q21	5931	2659	2528	2528
q22	431	365	310	310
Total cold run time: 101806 ms
Total hot run time: 29240 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	4406	4376	4409	4376
q2	347	362	233	233
q3	4550	4971	4372	4372
q4	2079	2190	1399	1399
q5	4451	4319	4308	4308
q6	234	177	129	129
q7	1748	1997	1850	1850
q8	2632	2275	2213	2213
q9	8195	8495	8075	8075
q10	4851	4770	4341	4341
q11	592	455	371	371
q12	739	775	538	538
q13	3280	3685	3012	3012
q14	281	306	270	270
q15	q16	730	759	667	667
q17	1407	1370	1340	1340
q18	7951	7510	7199	7199
q19	1179	1121	1109	1109
q20	2227	2219	1938	1938
q21	5327	4683	4439	4439
q22	523	496	437	437
Total cold run time: 57729 ms
Total hot run time: 52616 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 172685 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit fdd2a59a9196fe6c4aae6062508b8f9b7995afc2, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query5	4317	633	491	491
query6	437	190	172	172
query7	4840	511	284	284
query8	356	209	195	195
query9	8757	4106	4082	4082
query10	466	308	248	248
query11	5867	2342	2122	2122
query12	175	102	98	98
query13	1253	611	436	436
query14	6382	5415	5064	5064
query14_1	4395	4398	4397	4397
query15	203	197	177	177
query16	990	437	402	402
query17	919	675	557	557
query18	2430	459	349	349
query19	196	177	138	138
query20	114	106	102	102
query21	209	135	122	122
query22	13670	13514	13442	13442
query23	17386	16517	16137	16137
query23_1	16326	16265	16311	16265
query24	7585	1783	1329	1329
query24_1	1330	1320	1298	1298
query25	530	428	370	370
query26	1295	306	165	165
query27	2718	584	336	336
query28	4456	2087	2069	2069
query29	1035	616	478	478
query30	311	233	199	199
query31	1103	1079	942	942
query32	105	60	58	58
query33	520	317	244	244
query34	1184	1122	643	643
query35	762	783	675	675
query36	1349	1368	1222	1222
query37	151	105	93	93
query38	1881	1701	1665	1665
query39	936	925	895	895
query39_1	919	854	881	854
query40	224	128	105	105
query41	70	68	66	66
query42	92	89	87	87
query43	339	329	281	281
query44	1462	792	796	792
query45	197	187	179	179
query46	1099	1235	784	784
query47	2348	2381	2212	2212
query48	425	439	300	300
query49	645	462	359	359
query50	1015	366	263	263
query51	4342	4331	4269	4269
query52	83	82	71	71
query53	256	270	201	201
query54	275	247	214	214
query55	73	70	69	69
query56	257	231	229	229
query57	1443	1382	1324	1324
query58	244	223	218	218
query59	1607	1638	1443	1443
query60	293	257	245	245
query61	176	164	199	164
query62	684	654	600	600
query63	240	190	193	190
query64	2488	741	604	604
query65	4849	4788	4814	4788
query66	1769	453	337	337
query67	29873	29646	29607	29607
query68	3199	1578	927	927
query69	419	308	271	271
query70	1056	978	939	939
query71	310	234	215	215
query72	2892	2613	2362	2362
query73	827	764	458	458
query74	5120	4924	4813	4813
query75	2654	2595	2238	2238
query76	2339	1178	791	791
query77	360	378	285	285
query78	12485	12467	11854	11854
query79	1390	1166	741	741
query80	571	469	365	365
query81	448	280	240	240
query82	597	167	122	122
query83	355	275	246	246
query84	309	150	115	115
query85	840	500	404	404
query86	371	306	293	293
query87	1839	1827	1783	1783
query88	3696	2833	2789	2789
query89	422	388	325	325
query90	1868	184	188	184
query91	170	165	131	131
query92	65	61	61	61
query93	1511	1441	969	969
query94	550	342	297	297
query95	681	371	342	342
query96	1086	754	328	328
query97	2702	2673	2545	2545
query98	228	208	212	208
query99	1181	1158	994	994
Total cold run time: 257724 ms
Total hot run time: 172685 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.19 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit fdd2a59a9196fe6c4aae6062508b8f9b7995afc2, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.00	0.00	0.01
query2	0.10	0.05	0.05
query3	0.25	0.14	0.13
query4	1.61	0.14	0.13
query5	0.24	0.22	0.22
query6	1.21	1.12	1.11
query7	0.04	0.01	0.00
query8	0.06	0.04	0.03
query9	0.37	0.31	0.32
query10	0.56	0.55	0.56
query11	0.19	0.14	0.14
query12	0.17	0.15	0.14
query13	0.47	0.48	0.48
query14	1.03	1.00	0.99
query15	0.62	0.60	0.59
query16	0.31	0.34	0.31
query17	1.10	1.11	1.11
query18	0.22	0.22	0.21
query19	1.98	1.85	1.92
query20	0.02	0.01	0.01
query21	15.42	0.22	0.15
query22	4.74	0.05	0.06
query23	16.13	0.32	0.11
query24	2.90	0.42	0.33
query25	0.12	0.06	0.05
query26	0.73	0.21	0.15
query27	0.04	0.04	0.03
query28	3.57	0.88	0.54
query29	12.52	4.33	3.49
query30	0.28	0.15	0.15
query31	2.77	0.59	0.31
query32	3.23	0.60	0.48
query33	3.27	3.26	3.26
query34	15.57	4.25	3.50
query35	3.52	3.45	3.50
query36	0.57	0.45	0.42
query37	0.10	0.07	0.07
query38	0.05	0.03	0.04
query39	0.04	0.02	0.02
query40	0.18	0.16	0.15
query41	0.08	0.04	0.03
query42	0.04	0.03	0.03
query43	0.04	0.04	0.03
Total cold run time: 96.46 s
Total hot run time: 25.19 s

@hello-stephen

Contributor

BE UT Coverage Report

Increment line coverage 53.85% (21/39) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.42% (21364/39256)
Line Coverage 38.06% (204302/536828)
Region Coverage 34.08% (160427/470744)
Branch Coverage 35.07% (70224/200222)

@sollhui

sollhui commented Jun 23, 2026

Contributor

/review

@sollhui sollhui self-requested a review June 23, 2026 02:44

@github-actions github-actions Bot left a comment

Contributor

Reviewed the changed BE stream/http load paths, LoadUtil parsing/helper behavior, FE stream-load and http_stream file-scan parameter propagation, BE CSV/JSON decompressor selection, BE utility test coverage, and regression suites/expected outputs/assets. I did not find a substantiated correctness, compatibility, lifecycle, concurrency, config/session propagation, or test-result issue that needs an inline review comment.

Critical checkpoints:

  • Goal and tests: the PR adds ZSTD parsing for stream load plus compressed-load WAL sizing based on compress_type; BE UT and regression coverage exercise CSV/JSON ZSTD stream load and group-commit stream/http stream paths.
  • Scope/focus: changes are limited to load compression parsing/accounting and targeted tests.
  • Concurrency/lifecycle/config: no new shared mutable state, lifecycle ownership, or config item is introduced.
  • Compatibility/parallel paths: legacy compressed CSV format enums remain handled, and the compress_type path is applied to both stream load and http_stream group-commit sizing.
  • Test results: new result rows/counts match the added successful ZSTD loads; binary fixtures decompress successfully. I did not run the full regression or BE unit test suite locally in this review runner.
  • User focus: no additional focus points were provided.

Subagent conclusions: optimizer-rewrite reported no optimizer/rewrite or join/aggregate semantic candidates. tests-session-config reported no regression, output, session/config propagation, compatibility, or basic CI/style candidates. No candidates were accepted, dismissed, or merged as duplicates. Convergence round 1 ended with both subagents reporting NO_NEW_VALUABLE_FINDINGS for the same ledger/comment set.

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 92.31% (36/39) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.17% (28430/38330)
Line Coverage 58.08% (309980/533676)
Region Coverage 54.85% (259373/472879)
Branch Coverage 56.18% (112620/200473)

@sollhui

sollhui commented Jun 23, 2026

Contributor

I found one small test issue:

regression-test/suites/load_p0/http_stream/test_group_commit_http_stream.groovy: the new zstd case only adds one compressed CSV load, so it contributes 4 rows. The expected output now contains 23 rows, but the suite waits for getRowCount(26). Since getRowCount only waits and does not assert, this makes the regression test spend the full retry window before continuing.

Please change it to getRowCount(23).

@0AyanamiRei

Contributor Author

@sollhui I checked this against the current PR head fdd2a59a919 and I think getRowCount(26) is correct here.

Before this PR, the expected output for the first query already had 22 rows. This PR adds one zstd compressed CSV load with 4 rows, so the expected row count becomes 22 + 4 = 26. The current test_group_commit_http_stream.out also has 26 rows in the first result set before the next -- !sql -- block.
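The row count of the first result set can also be checked mechanically. The following C++ sketch is only an illustration, not part of the regression framework: `first_block_row_count` is a hypothetical helper, and it assumes that each result set in the `.out` file starts with a marker line beginning with `-- !` (such as `-- !sql --`) and that every non-empty line after the marker is one data row.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Counts the non-empty lines between the first "-- !..." marker and the next
// marker (or end of input) in the text of a regression .out file.
std::size_t first_block_row_count(const std::string& out_text) {
    std::istringstream in(out_text);
    std::string line;
    bool in_block = false;
    std::size_t rows = 0;
    while (std::getline(in, line)) {
        // A marker line starts with "-- !" (assumed format, e.g. "-- !sql --").
        const bool is_marker = line.rfind("-- !", 0) == 0;
        if (is_marker) {
            if (in_block) {
                break; // second marker: the first result set is complete
            }
            in_block = true;
            continue;
        }
        if (in_block && !line.empty()) {
            ++rows;
        }
    }
    return rows;
}
```

Run against the contents of test_group_commit_http_stream.out, the returned value is the number to compare with the argument passed to getRowCount.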

I also ran the related BE UT locally:

./run-be-ut.sh --run --filter=LoadUtilTest.* -j60

Result: 3 tests passed.

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 92.31% (36/39) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.17% (28429/38330)
Line Coverage 58.08% (309944/533676)
Region Coverage 54.82% (259215/472879)
Branch Coverage 56.17% (112598/200473)

@sollhui sollhui left a comment

Contributor

LGTM

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label Jun 23, 2026
@github-actions

Contributor

PR approved by at least one committer and no changes requested.

@github-actions

Contributor

PR approved by anyone and no changes requested.

@liaoxin01 liaoxin01 merged commit d293e8e into apache:master Jun 23, 2026
34 of 35 checks passed
github-actions Bot pushed a commit that referenced this pull request Jun 23, 2026
Stream load does not support `compress_type=zstd` in the shared load
format parser. Async group commit also checks only legacy compressed CSV
format enum values when estimating compressed input size, so
`compress_type` based compressed input is not handled consistently by
stream load and HTTP stream load.

This PR adds ZSTD parsing in `LoadUtil::parse_format`, adds a shared
`LoadUtil::is_compressed_load` helper for `compress_type` and legacy
compressed CSV format types, and uses it in stream load and HTTP stream
group commit paths. This PR also adds BE UT and regression coverage for
ZSTD CSV/JSON stream load and group commit stream/HTTP stream load.
@0AyanamiRei 0AyanamiRei deleted the stream-load-zstd-compression branch June 24, 2026 01:08
yiguolei pushed a commit that referenced this pull request Jun 24, 2026
…#64750)

Cherry-picked from #64711

Co-authored-by: Refrain <refrain520cc@gmail.com>

Labels

approved (Indicates a PR has been approved by one committer.), dev/4.1.3-merged, reviewed

5 participants