[refine](column) add opt-in column and Arrow sanity checks#64694

Open
Mryange wants to merge 2 commits into apache:master from Mryange:column-sanity-arrow-validate
@Mryange

Mryange commented Jun 22, 2026

Contributor

What problem does this PR solve?

Issue Number: N/A

Problem Summary:

This is a staged change to make existing column and Arrow data validity checks available outside debug-only paths while keeping them disabled by default.

Root cause: ColumnString::sanity_check() was guarded by NDEBUG, so release builds could not enable the existing offset and chars consistency checks. Arrow conversion paths also accepted invalid Arrow arrays without a shared validation gate, and nested Arrow serde reads would repeat validation if validation were placed directly in every nested call.

This change adds two opt-in BE configs:

  • enable_column_sanity_check: enables column sanity checks in release builds.
  • enable_arrow_validate_full: enables Arrow ValidateFull() checks on Arrow read/write conversion paths.

The column path keeps the existing sanity_check() API name and semantics. Debug builds still run ColumnString sanity checks as before, while release builds return early unless enable_column_sanity_check is enabled. Expression evaluation calls sanity_check() after producing result columns when the config is enabled.
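
For illustration only, the gating described above can be sketched roughly as follows. This is not the code in this PR: `SimpleStringColumn` and `g_enable_column_sanity_check` are hypothetical stand-ins for `ColumnString` and the `enable_column_sanity_check` config, and the checks are a simplified guess at what an offsets/chars consistency check looks like.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the BE config enable_column_sanity_check.
static std::atomic<bool> g_enable_column_sanity_check{false};

// Simplified string column: offsets[i] is the end position of row i in chars.
struct SimpleStringColumn {
    std::vector<uint8_t> chars;
    std::vector<uint32_t> offsets;

    // Throws std::runtime_error when offsets and chars are inconsistent.
    void sanity_check() const {
#ifdef NDEBUG
        // Release build: return early unless the opt-in flag is set.
        if (!g_enable_column_sanity_check.load(std::memory_order_relaxed)) {
            return;
        }
#endif
        // Debug builds always reach this point, as before.
        uint32_t prev = 0;
        for (std::size_t i = 0; i < offsets.size(); ++i) {
            if (offsets[i] < prev) {
                throw std::runtime_error("offsets not monotonic at row " +
                                         std::to_string(i));
            }
            prev = offsets[i];
        }
        if (prev != chars.size()) {
            throw std::runtime_error("last offset does not match chars size");
        }
    }
};
```

With the flag off, a release build pays only one relaxed atomic load per call, which is the reason for putting the gate at the top of the function rather than at each call site.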

The Arrow read path now uses a non-virtual DataTypeSerDe::read_column_from_arrow() wrapper to run ValidateFull() once at the top-level entry, then delegates to read_column_from_arrow_impl(). Nested array/map/nullable/struct serde reads call the impl method directly to avoid repeated validation. The Arrow write path validates finished arrays and the final record batch when enable_arrow_validate_full is enabled, without adding another serde write implementation layer.
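
A minimal sketch of the wrapper/impl split on the read path, under loud assumptions: `MockArray`, `SerDe`, `ScalarSerDe`, `ArraySerDe`, and the two globals are hypothetical and do not use the real Arrow or Doris types. Only the method names `read_column_from_arrow()` and `read_column_from_arrow_impl()` come from the description above.

```cpp
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for an Arrow array; not arrow::Array.
struct MockArray {
    bool valid = true;
    std::vector<std::shared_ptr<MockArray>> children;

    // Recursively checks this array and all of its children.
    bool validate_full() const {
        if (!valid) return false;
        for (const auto& child : children) {
            if (!child->validate_full()) return false;
        }
        return true;
    }
};

// Hypothetical stand-ins for the config flag and a call counter for the demo.
static bool g_enable_arrow_validate_full = false;
static int g_validate_calls = 0;

class SerDe {
public:
    virtual ~SerDe() = default;

    // Non-virtual top-level entry: validate once, then delegate.
    void read_column_from_arrow(const MockArray& array) const {
        if (g_enable_arrow_validate_full) {
            ++g_validate_calls;
            if (!array.validate_full()) {
                throw std::invalid_argument("invalid Arrow array");
            }
        }
        read_column_from_arrow_impl(array);
    }

    // Nested serdes call this directly, so validation is not repeated.
    virtual void read_column_from_arrow_impl(const MockArray& array) const = 0;
};

class ScalarSerDe : public SerDe {
public:
    void read_column_from_arrow_impl(const MockArray&) const override {}
};

class ArraySerDe : public SerDe {
public:
    explicit ArraySerDe(std::shared_ptr<SerDe> nested) : _nested(std::move(nested)) {}

    void read_column_from_arrow_impl(const MockArray& array) const override {
        for (const auto& child : array.children) {
            _nested->read_column_from_arrow_impl(*child);  // skips the wrapper
        }
    }

private:
    std::shared_ptr<SerDe> _nested;
};
```

In this sketch, reading an array with three children through `ArraySerDe::read_column_from_arrow()` triggers exactly one full validation, because the full check at the top already covers the children.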

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Mryange

Mryange commented Jun 22, 2026

Contributor Author

run buildall

@hello-stephen

Contributor
TPC-H: Total hot run time: 29262 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 392005163daccd793203feff4d017135388163d6, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17616	4046	4028	4028
q2	1994	328	188	188
q3	10339	1464	831	831
q4	4681	474	345	345
q5	7522	861	570	570
q6	177	170	141	141
q7	775	840	622	622
q8	9377	1658	1720	1658
q9	5872	4559	4539	4539
q10	6808	1816	1540	1540
q11	445	278	245	245
q12	636	431	297	297
q13	18102	3407	2814	2814
q14	271	253	247	247
q15	q16	796	779	710	710
q17	1052	876	1049	876
q18	7356	5818	5621	5621
q19	1321	1316	1057	1057
q20	509	409	274	274
q21	5960	2711	2355	2355
q22	440	363	304	304
Total cold run time: 102049 ms
Total hot run time: 29262 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4356	4277	4281	4277
q2	344	373	242	242
q3	4641	4974	4481	4481
q4	2059	2184	1382	1382
q5	4453	4356	4330	4330
q6	230	179	130	130
q7	1744	2076	1890	1890
q8	2560	2250	2224	2224
q9	8450	8457	7952	7952
q10	4847	4784	4357	4357
q11	584	415	394	394
q12	756	769	549	549
q13	3372	3737	2932	2932
q14	296	298	262	262
q15	q16	703	767	651	651
q17	1377	1346	1468	1346
q18	7928	7449	7452	7449
q19	1155	1106	1109	1106
q20	2235	2232	1961	1961
q21	5294	4609	4520	4520
q22	514	452	408	408
Total cold run time: 57898 ms
Total hot run time: 52843 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 172403 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 392005163daccd793203feff4d017135388163d6, data reload: false

query5	4312	654	490	490
query6	460	191	171	171
query7	4814	547	309	309
query8	355	206	191	191
query9	8783	4024	4032	4024
query10	439	310	258	258
query11	5941	2358	2148	2148
query12	155	111	102	102
query13	1259	581	443	443
query14	6405	5374	5049	5049
query14_1	4396	4383	4349	4349
query15	209	195	175	175
query16	993	451	455	451
query17	991	719	587	587
query18	2450	482	345	345
query19	201	185	147	147
query20	113	107	106	106
query21	227	139	113	113
query22	13807	13623	13402	13402
query23	17353	16658	16173	16173
query23_1	16324	16350	16403	16350
query24	7607	1768	1323	1323
query24_1	1297	1330	1321	1321
query25	574	460	398	398
query26	1288	310	173	173
query27	2696	579	340	340
query28	4551	2064	2043	2043
query29	1097	638	507	507
query30	317	234	201	201
query31	1133	1068	954	954
query32	106	62	59	59
query33	527	332	281	281
query34	1207	1111	645	645
query35	734	763	678	678
query36	1400	1410	1215	1215
query37	147	106	91	91
query38	1883	1698	1680	1680
query39	926	907	894	894
query39_1	868	874	891	874
query40	222	120	97	97
query41	63	61	61	61
query42	90	87	87	87
query43	319	320	278	278
query44	1449	755	778	755
query45	195	188	179	179
query46	1076	1202	753	753
query47	2391	2351	2259	2259
query48	362	424	289	289
query49	605	479	346	346
query50	1023	351	261	261
query51	4382	4279	4238	4238
query52	81	82	67	67
query53	265	263	193	193
query54	263	215	193	193
query55	74	68	65	65
query56	227	211	208	208
query57	1453	1408	1321	1321
query58	255	207	203	203
query59	1596	1626	1454	1454
query60	281	238	228	228
query61	150	150	147	147
query62	701	655	591	591
query63	232	193	184	184
query64	2540	801	614	614
query65	4888	4814	4782	4782
query66	1825	457	338	338
query67	29842	29162	29669	29162
query68	3191	1586	950	950
query69	394	324	268	268
query70	1044	948	942	942
query71	302	234	217	217
query72	2888	2625	2391	2391
query73	846	747	436	436
query74	5121	5007	4778	4778
query75	2627	2603	2237	2237
query76	2309	1187	784	784
query77	352	382	285	285
query78	12614	12595	11950	11950
query79	1379	1097	748	748
query80	1289	471	384	384
query81	517	286	239	239
query82	625	154	117	117
query83	354	276	255	255
query84	303	152	109	109
query85	929	534	435	435
query86	425	299	294	294
query87	1824	1832	1771	1771
query88	3715	2798	2784	2784
query89	424	389	323	323
query90	1933	185	173	173
query91	170	163	134	134
query92	72	56	57	56
query93	1556	1441	879	879
query94	718	360	317	317
query95	674	452	342	342
query96	1086	787	355	355
query97	2744	2749	2584	2584
query98	218	211	200	200
query99	1174	1164	1023	1023
Total cold run time: 259477 ms
Total hot run time: 172403 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.39 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 392005163daccd793203feff4d017135388163d6, data reload: false

query1	0.00	0.01	0.00
query2	0.10	0.06	0.05
query3	0.26	0.13	0.14
query4	1.60	0.14	0.14
query5	0.24	0.22	0.22
query6	1.22	1.06	1.06
query7	0.04	0.01	0.01
query8	0.06	0.03	0.04
query9	0.37	0.30	0.32
query10	0.56	0.54	0.57
query11	0.19	0.14	0.14
query12	0.18	0.15	0.14
query13	0.47	0.47	0.47
query14	1.01	1.03	0.99
query15	0.63	0.59	0.61
query16	0.30	0.31	0.32
query17	1.16	1.17	1.11
query18	0.22	0.21	0.21
query19	2.07	2.00	2.01
query20	0.01	0.02	0.01
query21	15.44	0.24	0.14
query22	4.66	0.05	0.05
query23	16.15	0.33	0.12
query24	2.92	0.43	0.33
query25	0.12	0.06	0.04
query26	0.73	0.22	0.16
query27	0.03	0.04	0.03
query28	3.54	0.90	0.51
query29	12.47	4.36	3.54
query30	0.28	0.14	0.15
query31	2.77	0.59	0.32
query32	3.23	0.59	0.50
query33	3.20	3.36	3.20
query34	15.52	4.28	3.54
query35	3.54	3.54	3.52
query36	0.55	0.43	0.43
query37	0.09	0.06	0.07
query38	0.06	0.03	0.04
query39	0.04	0.04	0.03
query40	0.18	0.18	0.15
query41	0.09	0.04	0.04
query42	0.04	0.03	0.03
query43	0.04	0.04	0.03
Total cold run time: 96.38 s
Total hot run time: 25.39 s

@hello-stephen

Contributor

BE UT Coverage Report

Increment line coverage 72.34% (34/47) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.41% (21360/39254)
Line Coverage 38.03% (204144/536806)
Region Coverage 34.03% (160214/470750)
Branch Coverage 35.05% (70178/200222)

@Mryange

Mryange commented Jun 22, 2026

Contributor Author

/review

@github-actions github-actions Bot left a comment

Contributor

I found one correctness issue in the opt-in Arrow full-validation path. The final RecordBatch validation can reject the converter's existing large-string fallback because the generated array type no longer matches the unchanged schema.

_arrays[_cur_field_idx]->type()->name()));
}
}
*out = arrow::RecordBatch::Make(_schema, actual_rows, std::move(_arrays));

Contributor

This full-batch validation conflicts with the existing large-string fallback above. For string fields the schema is still built as arrow::utf8(), but when column->byte_size() >= MAX_ARROW_UTF8 the converter switches only the builder/array to arrow::large_utf8(). RecordBatch::ValidateFull() then validates that LargeString array against the original UTF8 field, so enabling enable_arrow_validate_full can fail on the very fallback this converter uses for large string columns. Please update the output schema when the array is promoted, or avoid validating a batch whose schema no longer matches the arrays.
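
To make the mismatch concrete, here is a small self-contained sketch of the first suggested fix (updating the schema when the array is promoted). It does not use the real Arrow API: `MockType`, `MockField`, `MockStringArray`, and all function names are hypothetical, and `kDemoMaxUtf8Bytes` is an arbitrary demo value, not the real `MAX_ARROW_UTF8`.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for arrow::utf8() / arrow::large_utf8().
enum class MockType { Utf8, LargeUtf8 };

struct MockField {
    std::string name;
    MockType type;
};

struct MockStringArray {
    MockType type;
};

// Arbitrary demo threshold; the converter's real limit is MAX_ARROW_UTF8.
constexpr std::size_t kDemoMaxUtf8Bytes = 16;

// Mirrors the fallback: large columns are built as LargeUtf8 arrays.
MockType choose_string_type(std::size_t byte_size) {
    return byte_size >= kDemoMaxUtf8Bytes ? MockType::LargeUtf8 : MockType::Utf8;
}

// Simplified model of the per-field type check a full batch validation does.
bool batch_types_match(const std::vector<MockField>& schema,
                       const std::vector<MockStringArray>& arrays) {
    if (schema.size() != arrays.size()) return false;
    for (std::size_t i = 0; i < schema.size(); ++i) {
        if (schema[i].type != arrays[i].type) return false;
    }
    return true;
}

// Sketch of the fix: rewrite each field type to the type actually produced.
void promote_schema_to_match(std::vector<MockField>& schema,
                             const std::vector<MockStringArray>& arrays) {
    for (std::size_t i = 0; i < schema.size() && i < arrays.size(); ++i) {
        schema[i].type = arrays[i].type;
    }
}
```

In this model a `Utf8` field paired with a promoted `LargeUtf8` array fails `batch_types_match()` until `promote_schema_to_match()` runs, which is the situation the validation would hit with the unchanged schema.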

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 78.72% (37/47) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.08% (28395/38328)
Line Coverage 58.00% (309526/533654)
Region Coverage 54.71% (258723/472885)
Branch Coverage 56.12% (112498/200473)

@Mryange

Mryange commented Jun 22, 2026

Contributor Author

run buildall

@hello-stephen

Contributor
TPC-H: Total hot run time: 29040 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 7883ca94a092740928e0ffe3527a2bbb8f4fc3bd, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17706	4137	3993	3993
q2	2036	314	184	184
q3	10377	1462	807	807
q4	4680	461	336	336
q5	7511	877	569	569
q6	176	174	132	132
q7	796	838	635	635
q8	9342	1555	1636	1555
q9	5720	4441	4573	4441
q10	6818	1804	1540	1540
q11	422	275	245	245
q12	624	410	284	284
q13	18090	3405	2771	2771
q14	264	257	252	252
q15	q16	792	789	706	706
q17	1020	941	1063	941
q18	7358	5629	5533	5533
q19	1298	1325	1141	1141
q20	515	413	267	267
q21	5856	2622	2414	2414
q22	423	353	294	294
Total cold run time: 101824 ms
Total hot run time: 29040 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4314	4256	4237	4237
q2	334	353	233	233
q3	4560	4962	4388	4388
q4	2034	2146	1358	1358
q5	4421	4522	4296	4296
q6	230	177	131	131
q7	1705	1657	2032	1657
q8	2563	2215	2165	2165
q9	8191	8191	8019	8019
q10	4816	4776	4342	4342
q11	592	402	370	370
q12	726	758	527	527
q13	3200	3655	3003	3003
q14	298	305	278	278
q15	q16	715	714	633	633
q17	1355	1319	1316	1316
q18	7888	7298	7266	7266
q19	1157	1127	1149	1127
q20	2204	2217	1962	1962
q21	5226	4526	4404	4404
q22	507	450	407	407
Total cold run time: 57036 ms
Total hot run time: 52119 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 172211 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 7883ca94a092740928e0ffe3527a2bbb8f4fc3bd, data reload: false

query5	4302	633	494	494
query6	443	194	171	171
query7	4805	603	316	316
query8	359	213	202	202
query9	8770	4025	4029	4025
query10	451	314	266	266
query11	5935	2320	2172	2172
query12	154	101	108	101
query13	1323	607	426	426
query14	6421	5431	5065	5065
query14_1	4375	4387	4397	4387
query15	201	198	181	181
query16	1011	443	404	404
query17	926	716	577	577
query18	2442	492	353	353
query19	204	187	145	145
query20	113	110	109	109
query21	232	139	117	117
query22	13581	13547	13415	13415
query23	17410	16537	16126	16126
query23_1	16276	16299	16257	16257
query24	7659	1783	1351	1351
query24_1	1321	1308	1338	1308
query25	568	461	421	421
query26	1293	313	205	205
query27	2669	541	347	347
query28	4439	2059	2061	2059
query29	1094	603	475	475
query30	316	240	202	202
query31	1107	1076	961	961
query32	107	61	57	57
query33	513	332	262	262
query34	1202	1151	655	655
query35	750	777	676	676
query36	1344	1403	1226	1226
query37	148	105	88	88
query38	1897	1723	1645	1645
query39	934	919	892	892
query39_1	888	897	873	873
query40	215	122	105	105
query41	66	62	60	60
query42	87	86	85	85
query43	320	317	276	276
query44	1411	769	774	769
query45	198	186	176	176
query46	1093	1206	766	766
query47	2329	2372	2217	2217
query48	405	426	312	312
query49	626	456	342	342
query50	1010	356	259	259
query51	4331	4235	4291	4235
query52	82	81	69	69
query53	260	260	191	191
query54	263	213	194	194
query55	70	68	71	68
query56	227	253	205	205
query57	1423	1396	1289	1289
query58	246	210	230	210
query59	1582	1640	1440	1440
query60	278	247	221	221
query61	153	152	150	150
query62	696	647	586	586
query63	228	191	185	185
query64	2511	755	592	592
query65	4865	4837	4614	4614
query66	1801	461	335	335
query67	29819	29861	28995	28995
query68	3215	1589	973	973
query69	414	302	278	278
query70	1092	963	950	950
query71	312	232	216	216
query72	2874	2600	2367	2367
query73	870	780	446	446
query74	5117	4991	4813	4813
query75	2619	2597	2239	2239
query76	2311	1185	788	788
query77	368	366	281	281
query78	12654	12588	11867	11867
query79	1291	1141	744	744
query80	547	486	386	386
query81	442	282	241	241
query82	307	156	120	120
query83	359	274	244	244
query84	311	142	115	115
query85	841	503	415	415
query86	357	298	288	288
query87	1827	1825	1768	1768
query88	3708	2809	2788	2788
query89	411	383	329	329
query90	1973	185	179	179
query91	167	167	132	132
query92	63	61	61	61
query93	1469	1444	906	906
query94	564	367	302	302
query95	657	370	362	362
query96	1108	850	358	358
query97	2708	2689	2566	2566
query98	219	209	200	200
query99	1187	1161	1023	1023
Total cold run time: 257445 ms
Total hot run time: 172211 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.34 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 7883ca94a092740928e0ffe3527a2bbb8f4fc3bd, data reload: false

query1	0.01	0.01	0.00
query2	0.09	0.06	0.05
query3	0.25	0.14	0.13
query4	1.61	0.14	0.14
query5	0.24	0.24	0.24
query6	1.27	1.02	1.06
query7	0.04	0.01	0.00
query8	0.06	0.04	0.03
query9	0.40	0.34	0.32
query10	0.57	0.55	0.55
query11	0.19	0.15	0.14
query12	0.18	0.14	0.15
query13	0.47	0.47	0.48
query14	1.01	1.02	1.01
query15	0.63	0.62	0.61
query16	0.32	0.32	0.32
query17	1.10	1.07	1.09
query18	0.23	0.20	0.20
query19	2.12	1.97	2.04
query20	0.02	0.01	0.02
query21	15.44	0.25	0.14
query22	4.67	0.05	0.04
query23	16.12	0.30	0.12
query24	2.98	0.44	0.34
query25	0.10	0.06	0.04
query26	0.72	0.21	0.16
query27	0.04	0.04	0.03
query28	3.50	0.86	0.51
query29	12.53	4.39	3.51
query30	0.27	0.17	0.15
query31	2.78	0.59	0.31
query32	3.22	0.61	0.51
query33	3.25	3.22	3.28
query34	15.72	4.20	3.53
query35	3.50	3.51	3.51
query36	0.56	0.44	0.44
query37	0.08	0.07	0.07
query38	0.05	0.03	0.04
query39	0.04	0.03	0.03
query40	0.18	0.17	0.15
query41	0.08	0.04	0.03
query42	0.04	0.03	0.02
query43	0.04	0.03	0.03
Total cold run time: 96.72 s
Total hot run time: 25.34 s

@hello-stephen

Contributor

BE UT Coverage Report

Increment line coverage 75.00% (33/44) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.43% (21364/39254)
Line Coverage 38.07% (204349/536805)
Region Coverage 34.07% (160401/470747)
Branch Coverage 35.08% (70237/200221)

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 75.00% (33/44) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 72.37% (27737/38328)
Line Coverage 55.74% (297452/533653)
Region Coverage 52.51% (248301/472882)
Branch Coverage 53.58% (107414/200472)

Comment thread be/src/common/config.cpp

DEFINE_mBool(enable_column_type_check, "true");
DEFINE_mBool(enable_column_sanity_check, "false");
DEFINE_mBool(enable_arrow_validate_full, "false");

Contributor

Shouldn't this default to true? We would still rather avoid crashes, and it has little impact on performance.
