[SPARK-56957][SDP] AutoCDC Flow Execution; Introduce and Integrate SCD1 Scd1MergeStreamingWrite
#56122
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,32 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one or more | ||
| * contributor license agreements. See the NOTICE file distributed with | ||
| * this work for additional information regarding copyright ownership. | ||
| * The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| * (the "License"); you may not use this file except in compliance with | ||
| * the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
||
| package org.apache.spark.sql.pipelines.autocdc | ||
|
|
||
| /** | ||
| * Names that AutoCDC reserves for its own use, both for internal columns it inserts during | ||
| * reconciliation (e.g. `${prefix}metadata`, `${prefix}winning_row`) and for internal tables it | ||
| * manages alongside user-defined targets (e.g. the per-target auxiliary state table). | ||
| * | ||
| * A single recognizable prefix gives a single auditable answer to "what does AutoCDC own", and | ||
| * lets user-defined columns and tables be unambiguously distinguished from AutoCDC-managed ones. | ||
| */ | ||
| private[pipelines] object AutoCdcReservedNames { | ||
|
|
||
| /** Common reserved-name prefix shared by AutoCDC internal columns and internal tables. */ | ||
| val prefix: String = "__spark_autocdc_" | ||
| } |
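
A reserved prefix like this is usually paired with a check that user-supplied column names do not collide with it. The following is a minimal sketch of such a check, not the PR's actual validation: the object name `ReservedNameCheckSketch` and both helper methods are hypothetical, and case-insensitive matching is an assumption made here for illustration.

```scala
import java.util.Locale

object ReservedNameCheckSketch {
  // Same literal as AutoCdcReservedNames.prefix in the diff above.
  val prefix: String = "__spark_autocdc_"

  // Hypothetical helper: return the column names that collide with the reserved prefix.
  // Assumption: matching is case-insensitive; the real check may differ.
  def reservedColumns(columns: Seq[String]): Seq[String] =
    columns.filter(_.toLowerCase(Locale.ROOT).startsWith(prefix))

  // Hypothetical helper: fail fast if any source column uses the reserved prefix.
  def requireNoReservedColumns(columns: Seq[String]): Unit = {
    val offending = reservedColumns(columns)
    require(
      offending.isEmpty,
      s"Columns must not start with '$prefix': ${offending.mkString(", ")}")
  }
}
```
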
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -367,19 +367,29 @@ case class Scd1BatchProcessor( | |
| val incomingWinsDelete = microbatchDeleteVersionField.isNotNull && | ||
| microbatchDeleteVersionField > destinationUpsertVersionField | ||
|
|
||
| // When the incoming upsert wins against an existing record, the entire row (all columns) | ||
| // will be overwritten, including the CDC metadata column. We only exclude keys because | ||
| // most merge implementations require that join columns are not being mutated, even if | ||
| // the mutation is a no-op. | ||
| val resolver = microbatchDf.sparkSession.sessionState.conf.resolver | ||
| val keyNames = changeArgs.keys.map(_.name) | ||
|
|
||
| def constructTargetColumnAssignmentsFromMicrobatch(columnName: String): (String, Column) = { | ||
| // Map a column in the target table to its direct equivalent in the microbatch. Note that | ||
| // because of target-table schema evolution during SDP dataset materialization, the | ||
| // microbatch's columns are always a subset of (or equal to) the target's columns. | ||
| val quotedCol = QuotingUtils.quoteIdentifier(columnName) | ||
| s"$destinationTableStr.$quotedCol" -> F.col(s"microbatch.$quotedCol") | ||
| } | ||
|
|
||
| // Most merge implementations require that join columns are not mutated, even when the | ||
| // mutation would be a no-op. The remaining microbatch columns (including the CDC metadata | ||
| // column) are overwritten outright when the incoming upsert wins. | ||
| val columnsToUpdateWhenIncomingWinsUpsert: Map[String, Column] = | ||
| microbatchDf.columns | ||
| .filterNot(c => keyNames.exists(resolver(_, c))) | ||
| .map { c => | ||
| val quotedCol = QuotingUtils.quoteIdentifier(c) | ||
| s"$destinationTableStr.$quotedCol" -> F.col(s"microbatch.$quotedCol") | ||
| } | ||
| .map(constructTargetColumnAssignmentsFromMicrobatch) | ||
| .toMap | ||
|
|
||
| val columnsToInsertOnNewKey: Map[String, Column] = | ||
| microbatchDf.columns | ||
| .map(constructTargetColumnAssignmentsFromMicrobatch) | ||
| .toMap | ||
|
Comment on lines +387 to 393 (Contributor, Author):
These changes were needed to support schema evolution, which I could only test now that we've integrated flow execution with the rest of SDP. Over multiple pipeline executions, the source microbatch's schema could ultimately become a subset of the target table's schema - we should take care to construct the column mappings appropriately. |
||
|
|
||
| microbatchDf | ||
|
|
@@ -391,7 +401,12 @@ case class Scd1BatchProcessor( | |
| // New key: only insert upserts; deletes for absent keys are no-ops for the target table | ||
| // merge, and instead would have been inserted as tombstones into the auxiliary table. | ||
| .whenNotMatched(microbatchDeleteVersionField.isNull) | ||
| .insertAll() | ||
| // When inserting a brand new row for a new key, construct column mappings from microbatch. | ||
| // The microbatch's columns may be a strict subset of the target's columns -- e.g. the user | ||
| // narrowed `column_list` between runs, or the source DF dropped a column. The target's | ||
| // columns can never be a strict subset of the microbatch's, however, because SDP's schema | ||
| // evolution always unions old and new schemas onto the target. | ||
| .insert(columnsToInsertOnNewKey) | ||
| .merge() | ||
| } | ||
|
|
||
|
|
@@ -417,17 +432,15 @@ case class Scd1BatchProcessor( | |
|
|
||
| object Scd1BatchProcessor { | ||
| /** | ||
| * Reserved column-name prefix for internal SDP AutoCDC processing. Source change-data-feed | ||
| * dataframes must not contain any columns starting with this prefix; the invariant is | ||
| * Internal columns inserted by AutoCDC reconciliation. Source change-data-feed dataframes must | ||
| * not contain any columns starting with [[AutoCdcReservedNames.prefix]]; the invariant is | ||
| * enforced at [[org.apache.spark.sql.pipelines.graph.AutoCdcMergeFlow]] construction. | ||
| */ | ||
| private[pipelines] val reservedColumnNamePrefix: String = "__spark_autocdc_" | ||
|
|
||
| private[autocdc] val winningRowColName: String = s"${reservedColumnNamePrefix}winning_row" | ||
| private[pipelines] val cdcMetadataColName: String = s"${reservedColumnNamePrefix}metadata" | ||
| private[autocdc] val winningRowColName: String = s"${AutoCdcReservedNames.prefix}winning_row" | ||
| private[pipelines] val cdcMetadataColName: String = s"${AutoCdcReservedNames.prefix}metadata" | ||
|
|
||
| private[autocdc] val cdcDeleteSequenceFieldName: String = "deleteSequence" | ||
| private[autocdc] val cdcUpsertSequenceFieldName: String = "upsertSequence" | ||
| private[pipelines] val cdcDeleteSequenceFieldName: String = "deleteSequence" | ||
| private[pipelines] val cdcUpsertSequenceFieldName: String = "upsertSequence" | ||
|
|
||
| /** Project the delete sequence out of the CDC metadata column. */ | ||
| private[autocdc] def deleteSequenceOf(cdcMetadataCol: Column): Column = | ||
|
|
||
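
The column-mapping change in `Scd1BatchProcessor` above can be illustrated without Spark. The sketch below is a simplified model under stated assumptions: `quote` is a hypothetical stand-in for `QuotingUtils.quoteIdentifier`, `sameName` is a hypothetical stand-in for the session resolver (assumed case-insensitive here), and assignments are modeled as plain strings rather than `Column` expressions. It shows why building both maps from the microbatch's columns keeps the merge valid when the microbatch schema is a subset of the target schema.

```scala
object ColumnAssignmentSketch {
  // Hypothetical stand-in for QuotingUtils.quoteIdentifier:
  // wrap in backticks and double any embedded backtick.
  def quote(name: String): String = "`" + name.replace("`", "``") + "`"

  // Hypothetical stand-in for the session resolver; assumes case-insensitive matching.
  def sameName(a: String, b: String): Boolean = a.equalsIgnoreCase(b)

  // Map one microbatch column to (target column -> microbatch expression).
  def assignment(target: String)(column: String): (String, String) = {
    val quoted = quote(column)
    s"$target.$quoted" -> s"microbatch.$quoted"
  }

  // Matched key, incoming upsert wins: overwrite every microbatch column except the keys.
  def updateAssignments(
      target: String,
      microbatchColumns: Seq[String],
      keys: Seq[String]): Map[String, String] =
    microbatchColumns
      .filterNot(c => keys.exists(sameName(_, c)))
      .map(assignment(target))
      .toMap

  // New key: assign only the columns the microbatch actually has, so target
  // columns missing from the microbatch receive no explicit assignment.
  def insertAssignments(target: String, microbatchColumns: Seq[String]): Map[String, String] =
    microbatchColumns.map(assignment(target)).toMap
}
```
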
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -303,6 +303,20 @@ object DatasetManager extends Logging { | |
| context.spark.sql(s"TRUNCATE TABLE ${table.identifier.quotedString}") | ||
| } | ||
|
|
||
| if (isFullRefresh) { | ||
| // On full refresh, drop the AutoCDC auxiliary state associated with this table (if any) so | ||
| // that stale delete-tracking data and table properties are not carried forward into the new | ||
| // table generation. We unconditionally issue the DROP for every fully-refreshed target. | ||
|
|
||
| // Intentionally DROP and not TRUNCATE: the auxiliary table is an internal state store | ||
| // that is not part of the dataflow graph, so it does not participate in regular schema | ||
| // evolution like user tables do. On a full refresh we want a clean recreation against | ||
| // the new target schema rather than carrying forward the previous generation's layout. | ||
|
|
||
| val auxiliaryTableId = AutoCdcAuxiliaryTable.identifier(table.identifier) | ||
| context.spark.sql(s"DROP TABLE IF EXISTS ${auxiliaryTableId.quotedString}") | ||
|
Member:
Layer-mixing / asymmetry. Worth considering a per-flow
Contributor, Author:
Yeah, this is a pretty fair concern; it's something I've been thinking about too. The AutoCDC flow is the first type of flow to introduce the concept of an internal state table. Streaming queries/flows have internal state, but that's managed in the form of checkpoint directories, not catalog table entities. And while it's an intentional decision to make the AutoCDC auxiliary tables real catalog tables (e.g. to inherit the catalog-based permission model and other functionality), it also means the pipeline needs to manage those internal tables in a similar fashion to actual destination tables. I'm not too worried about compounding behavior with SCD2, but I agree there's probably a better data model here that we should eventually refactor to. Either an But in either case, these are pure refactorings and don't affect user-observed behavior at all. Let's revisit this in the future as SCD2 lands, so that we don't end up prematurely choosing a refactor path that doesn't actually fit well. |
||
| } | ||
|
|
||
| // Alter the table if we need to | ||
| existingTableOpt.foreach { existingTable => | ||
| val existingSchema = v2ColumnsToStructType(existingTable.columns()) | ||
|
|
||
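
The full-refresh handling above can be sketched in isolation. This is an illustrative model only: `TableId` is a hypothetical simplification of a catalog identifier, and the `aux_` naming scheme in `auxiliaryId` is an assumption, since the real `AutoCdcAuxiliaryTable.identifier` is not shown in this diff. The sketch only captures the decision described in the comments: on a full refresh, issue a `DROP TABLE IF EXISTS` for the auxiliary table rather than truncating it.

```scala
object FullRefreshSketch {
  // Hypothetical, simplified table identifier (namespace parts plus table name).
  final case class TableId(namespace: Seq[String], name: String) {
    // Backtick-quote each part and join with dots.
    def quotedString: String =
      (namespace :+ name).map(p => "`" + p.replace("`", "``") + "`").mkString(".")
  }

  // Same literal as AutoCdcReservedNames.prefix.
  val prefix: String = "__spark_autocdc_"

  // Assumed naming scheme for illustration; the real derivation is not shown in the diff.
  def auxiliaryId(target: TableId): TableId =
    target.copy(name = s"${prefix}aux_${target.name}")

  // SQL statements to issue for the auxiliary state of one target table.
  def auxiliaryCleanup(target: TableId, isFullRefresh: Boolean): Seq[String] =
    if (isFullRefresh) {
      // DROP, not TRUNCATE: the table is recreated against the new target schema.
      Seq(s"DROP TABLE IF EXISTS ${auxiliaryId(target).quotedString}")
    } else {
      Seq.empty
    }
}
```

Because the statement uses `IF EXISTS`, issuing it for every fully refreshed target is harmless when a target has no auxiliary table.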