feat(import): add script tool for multiple hbase snapshot imports) #4606
Open: tianlei2 wants to merge 2 commits into googleapis:main from tianlei2:add-snapshot-import-script
65 changes: 65 additions & 0 deletions
bigtable-dataflow-parent/bigtable-beam-import/SNAPSHOT_IMPORT_USAGE.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,65 @@ | ||
| # HBase Snapshot Import Helper Script Usage | ||
|
|
||
| This document describes the environment variables used by the `run-snapshot-import.sh` script to automate HBase snapshot imports into Cloud Bigtable using Dataflow. | ||
|
|
||
| ## Environment Variables | ||
|
|
||
| The script relies on the following environment variables. You should set them before executing the script. | ||
|
|
||
| | Variable | Description | Example / Suggested Value | | ||
| | :--- | :--- | :--- | | ||
| | `PROJECT_ID` | The Google Cloud Project ID where the Bigtable instance and Dataflow jobs reside. | `your-project-id` | | ||
| | `INSTANCE_ID` | The Bigtable Instance ID to import data into. | `your-instance-id` | | ||
| | `BUCKET` | The GCS bucket name used for Dataflow staging, temp files, and default snapshot source path. | `your-gcs-bucket` | | ||
| | `REGION` | The GCP region to run the Dataflow jobs in. | `us-central1` | | ||
| | `TABLE_NAME` | The target Bigtable table name. | `your-table-name` | | ||
| | `SNAPSHOT_NAME` | The name of the HBase snapshot to import. | `your-snapshot-name` | | ||
| | `SNAPSHOT_SOURCE_DIR` | The GCS path where the HBase snapshot export is located. | `gs://your-gcs-bucket/snapshots` | | ||
| | `SERVICE_ACCOUNT` | The service account email to run the Dataflow jobs. | `your-service-account@developer.gserviceaccount.com` | | ||
| | `NUM_SHARDS` | The number of shards to split the import into for parallel processing. | `20` | | ||
| | `MAX_INFLIGHT_RPCS` | Maximum number of inflight RPCs for Bigtable client. | `100` | | ||
| | `BULK_MUTATION_CLOSE_TIMEOUT_MINUTES` | Timeout in minutes for closing bulk mutations. | `30` | | ||
| | `NETWORK` | VPC Network name for Dataflow workers. | `your-network` | | ||
| | `SUBNETWORK` | VPC Subnetwork name for Dataflow workers. | `regions/us-central1/subnetworks/your-subnetwork` | | ||
|
|
||
| ## Usage | ||
|
|
||
| ### Run a specific shard range | ||
| ```bash | ||
| ./run-snapshot-import.sh <start_shard> <end_shard> | ||
| ``` | ||
| Example: `./run-snapshot-import.sh 0 5` | ||
|
|
||
| ### Run all shards (Auto-parallel mode) | ||
| ```bash | ||
| ./run-snapshot-import.sh --all | ||
| ``` | ||
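| This mode first runs the restore step, then launches one background process per group of 4 consecutive shards. The groups run in parallel; the shards within a group run sequentially. The group size is hard-coded in the script (`SHARDS_PER_GROUP=4`). | ||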
|
|
||
| ## Advanced Usage | ||
|
|
||
| ### Manual Parallel Execution | ||
|
|
||
| To run shards in parallel groups (e.g., assuming 20 shards total), you can run multiple instances of this script. | ||
|
|
||
| > [!IMPORTANT] | ||
| > Because concurrent shards cannot delete or overwrite the restored snapshot directory simultaneously, **no shard** performs the restore step during a sharded run. You MUST run the restore step explicitly first! | ||
|
|
||
| Example for manual parallel execution: | ||
| ```bash | ||
| # 1. Run the blocking restore step first! | ||
| ./run-snapshot-import.sh --restore-only | ||
|
|
||
| # 2. Once the restore is complete, launch shards in parallel: | ||
| ./run-snapshot-import.sh 0 3 & | ||
| ./run-snapshot-import.sh 4 7 & | ||
| ./run-snapshot-import.sh 8 11 & | ||
| ./run-snapshot-import.sh 12 15 & | ||
| ./run-snapshot-import.sh 16 19 & | ||
| ``` | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| ### JDK Compatibility | ||
|
|
||
| If you are running on a newer JDK (like Java 21 or 26) and hit ByteBuddy errors, you can add `-Dnet.bytebuddy.experimental=true` to the `java` command lines in the script. |
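
The usage document above says the script relies on a set of environment variables but does not show what happens when one is missing. The following is a minimal bash sketch of a pre-flight check, not part of the PR: the helper name `missing_env_vars` and the `REQUIRED_VARS` array are hypothetical, and treating all thirteen variables from the table as required is an assumption.

```bash
#!/bin/bash
# Hypothetical helper, not part of run-snapshot-import.sh.
# Prints the name of each listed variable that is unset or empty,
# one per line, and returns non-zero if any are missing.
missing_env_vars() {
  local name status=0
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    if [ -z "${!name:-}" ]; then
      echo "$name"
      status=1
    fi
  done
  return "$status"
}

# Variable names copied from the Environment Variables table.
# Assumption: every one of them is required.
REQUIRED_VARS=(
  PROJECT_ID INSTANCE_ID BUCKET REGION
  TABLE_NAME SNAPSHOT_NAME SNAPSHOT_SOURCE_DIR SERVICE_ACCOUNT
  NUM_SHARDS MAX_INFLIGHT_RPCS BULK_MUTATION_CLOSE_TIMEOUT_MINUTES
  NETWORK SUBNETWORK
)
```

A caller could run `missing_env_vars "${REQUIRED_VARS[@]}"` near the top of the script and exit with an error when it returns non-zero, so that a missing value fails locally instead of producing a malformed `java` command line.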
171 changes: 171 additions & 0 deletions
bigtable-dataflow-parent/bigtable-beam-import/run-snapshot-import.sh
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,171 @@ | ||
| #!/bin/bash | ||
|
|
||
| # ============================================================================== | ||
| # HBase Snapshot Import Helper Script | ||
| # ============================================================================== | ||
| # This script runs a range of Dataflow snapshot import jobs sequentially or in parallel. | ||
| # Must be executed from the 'bigtable-dataflow-parent/bigtable-beam-import' directory. | ||
| # | ||
| # For detailed usage and advanced options, see: SNAPSHOT_IMPORT_USAGE.md | ||
| # ============================================================================== | ||
|
|
||
| # ------------------------------------------------------------------------------ | ||
| # Environment Variables | ||
| # ------------------------------------------------------------------------------ | ||
| # Most users will need to set these variables before running the script. | ||
| # See SNAPSHOT_IMPORT_USAGE.md for details and expected values. | ||
|
|
||
| # --- Required / Common Configurations --- | ||
| # export PROJECT_ID="your-project-id" | ||
| # export INSTANCE_ID="your-instance-id" | ||
| # export BUCKET="your-gcs-bucket" | ||
| # export REGION="us-central1" | ||
| # | ||
| # export TABLE_NAME="your-table-name" | ||
| # export SNAPSHOT_NAME="your-snapshot-name" | ||
| # export SNAPSHOT_SOURCE_DIR="gs://your-gcs-bucket/snapshots" | ||
| # export SERVICE_ACCOUNT="your-service-account" | ||
|
|
||
| # --- Sharding & Tuning --- | ||
| # export NUM_SHARDS="20" | ||
| # export MAX_INFLIGHT_RPCS="100" | ||
| # export BULK_MUTATION_CLOSE_TIMEOUT_MINUTES="30" | ||
|
|
||
| # --- Network Configurations --- | ||
| # export NETWORK="your-network" | ||
| # export SUBNETWORK="regions/us-central1/subnetworks/your-subnetwork" | ||
|
|
||
| # ------------------------------------------------------------------------------ | ||
| # Usage | ||
| # ------------------------------------------------------------------------------ | ||
| # Usage: ./run-snapshot-import.sh <start_shard> <end_shard> | ||
| # Or: ./run-snapshot-import.sh --all | ||
| # (Runs all shards in parallel groups of 4 by default) | ||
| # | ||
| # Examples: | ||
| # ./run-snapshot-import.sh 0 3 | ||
| # ./run-snapshot-import.sh --all | ||
|
|
||
| if [ "$#" -ne 2 ] && [ "$1" != "--all" ] && [ "$1" != "--restore-only" ]; then | ||
| echo "Usage: $0 <start_shard> <end_shard>" | ||
| echo " Or: $0 --all" | ||
| echo " Or: $0 --restore-only" | ||
| exit 1 | ||
| fi | ||
|
|
||
| START_SHARD=$1 | ||
| END_SHARD=$2 | ||
|
|
||
| # Configurations | ||
| JAR_PATH="target/bigtable-beam-import-2.18.2-SNAPSHOT-shaded.jar" | ||
| RESTORE_DIR="${SNAPSHOT_SOURCE_DIR}/restore-${SNAPSHOT_NAME}" | ||
|
|
||
| # --- RESTORE ONLY MODE --- | ||
| if [ "$1" == "--restore-only" ]; then | ||
| echo "🚀 Performing snapshot restore (blocking)..." | ||
| java -jar ${JAR_PATH} importsnapshot \ | ||
| --runner=DataflowRunner \ | ||
| --project=${PROJECT_ID} \ | ||
| --bigtableInstanceId=${INSTANCE_ID} \ | ||
| --bigtableTableId=${TABLE_NAME} \ | ||
| --hbaseSnapshotSourceDir=${SNAPSHOT_SOURCE_DIR} \ | ||
| --snapshots=${SNAPSHOT_NAME}:${TABLE_NAME} \ | ||
| --stagingLocation=gs://${BUCKET}/dataflow/staging \ | ||
| --tempLocation=gs://${BUCKET}/dataflow/temp \ | ||
| --region=${REGION} \ | ||
| --performOnlyRestoreStep=true \ | ||
| --restorePath=${RESTORE_DIR} \ | ||
| --jobName="restore-job" \ | ||
| --network=${NETWORK} \ | ||
| --subnetwork=${SUBNETWORK} | ||
| echo "✅ Restore completed." | ||
| echo "⚠️ IMPORTANT: Please manually cleanup the restore path once validation succeeds:" | ||
| echo " gsutil rm -r ${RESTORE_DIR}" | ||
| exit 0 | ||
| fi | ||
|
|
||
| # --- AUTO-PARALLEL MODE --- | ||
| if [ "$1" == "--all" ]; then | ||
| echo "🚀 Starting fully automated snapshot import..." | ||
|
|
||
| # Step 1: Perform ONLY the restore step | ||
| echo "Step 1/2: Performing snapshot restore (blocking)..." | ||
| java -jar ${JAR_PATH} importsnapshot \ | ||
| --runner=DataflowRunner \ | ||
| --project=${PROJECT_ID} \ | ||
| --bigtableInstanceId=${INSTANCE_ID} \ | ||
| --bigtableTableId=${TABLE_NAME} \ | ||
| --hbaseSnapshotSourceDir=${SNAPSHOT_SOURCE_DIR} \ | ||
| --snapshots=${SNAPSHOT_NAME}:${TABLE_NAME} \ | ||
| --stagingLocation=gs://${BUCKET}/dataflow/staging \ | ||
| --tempLocation=gs://${BUCKET}/dataflow/temp \ | ||
| --region=${REGION} \ | ||
| --performOnlyRestoreStep=true \ | ||
| --restorePath=${RESTORE_DIR} \ | ||
| --jobName="restore-job" \ | ||
| --network=${NETWORK} \ | ||
| --subnetwork=${SUBNETWORK} | ||
|
|
||
| echo "Restore completed. Proceeding to data import." | ||
|
|
||
| # Step 2: Launch parallel groups of 4 | ||
| echo "Step 2/2: Launching parallel groups of 4 shards..." | ||
| SHARDS_PER_GROUP=4 | ||
|
|
||
| for (( start=0; start<$NUM_SHARDS; start+=$SHARDS_PER_GROUP )); do | ||
| end=$((start + SHARDS_PER_GROUP - 1)) | ||
| [ $end -ge $NUM_SHARDS ] && end=$((NUM_SHARDS - 1)) | ||
|
|
||
| echo "Launching group: shards $start to $end in background" | ||
| # Call ourselves with the range! | ||
| $0 $start $end & | ||
| done | ||
|
|
||
| echo "All groups launched. Waiting for all background jobs to finish..." | ||
| wait | ||
| echo "🎉 All import jobs completed!" | ||
| echo "⚠️ IMPORTANT: Please manually cleanup the restore path once validation succeeds:" | ||
| echo " gsutil rm -r ${RESTORE_DIR}" | ||
| exit 0 | ||
| fi | ||
| # ---------------------------------------- | ||
|
|
||
| # Standard Range Mode | ||
| for i in $(seq $START_SHARD $END_SHARD); do | ||
| echo "Submitting Dataflow job for shardIndex: $i" | ||
|
|
||
| # As per the sharding contract, ALL parallel sharded jobs MUST skip the restore step | ||
| # to prevent concurrent shards from deleting the restore path. | ||
| # The --all mode runs performOnlyRestoreStep=true automatically in Step 1. | ||
| SKIP_RESTORE="true" | ||
|
|
||
| JOB="job-${i}" | ||
| java -jar ${JAR_PATH} importsnapshot \ | ||
| --runner=DataflowRunner \ | ||
| --project=${PROJECT_ID} \ | ||
| --bigtableInstanceId=${INSTANCE_ID} \ | ||
| --bigtableTableId=${TABLE_NAME} \ | ||
| --hbaseSnapshotSourceDir=${SNAPSHOT_SOURCE_DIR} \ | ||
| --snapshots=${SNAPSHOT_NAME}:${TABLE_NAME} \ | ||
| --stagingLocation=gs://${BUCKET}/dataflow/staging \ | ||
| --tempLocation=gs://${BUCKET}/dataflow/temp \ | ||
| --workerMachineType=n1-highmem-4 \ | ||
| --diskSizeGb=500 \ | ||
| --maxNumWorkers=10 \ | ||
| --region=${REGION} \ | ||
| --serviceAccount=${SERVICE_ACCOUNT} \ | ||
| --usePublicIps=false \ | ||
| --enableSnappy=true \ | ||
| --skipRestoreStep=${SKIP_RESTORE} \ | ||
| --deleteRestoredSnapshots=false \ | ||
| --restorePath=${RESTORE_DIR} \ | ||
| --numShards=${NUM_SHARDS} \ | ||
| --shardIndex=$i \ | ||
| --jobName="${JOB}" \ | ||
| --network=${NETWORK} \ | ||
| --subnetwork=${SUBNETWORK} \ | ||
| --maxInflightRpcs=${MAX_INFLIGHT_RPCS} \ | ||
| --bulkMutationCloseTimeoutMinutes=${BULK_MUTATION_CLOSE_TIMEOUT_MINUTES} | ||
|
|
||
| # Sequential within this script instance | ||
| done | ||
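
The `--all` mode above splits `NUM_SHARDS` into groups and then calls a bare `wait`. In bash, `wait` with no arguments returns a zero status even when a background job failed, so a failed shard group would not change the script's outcome. The sketch below, which is not part of the PR, mirrors the grouping loop as a function and waits on each PID individually. The names `shard_groups` and `wait_for_all` are hypothetical.

```bash
#!/bin/bash
# Hypothetical helpers, not part of run-snapshot-import.sh.

# Prints one "start end" pair per group, mirroring the loop in --all mode.
# The last group is clamped so that end never exceeds num_shards - 1.
shard_groups() {
  local num_shards=$1 per_group=$2 start end
  for (( start = 0; start < num_shards; start += per_group )); do
    end=$(( start + per_group - 1 ))
    if (( end >= num_shards )); then
      end=$(( num_shards - 1 ))
    fi
    echo "$start $end"
  done
}

# Waits for each given PID in turn and succeeds only if every one exited with 0.
wait_for_all() {
  local pid failures=0
  for pid in "$@"; do
    wait "$pid" || failures=$(( failures + 1 ))
  done
  [ "$failures" -eq 0 ]
}

# Possible use inside --all mode (left commented out; assumes NUM_SHARDS is set):
# pids=()
# while read -r start end; do
#   "$0" "$start" "$end" &
#   pids+=("$!")
# done < <(shard_groups "$NUM_SHARDS" 4)
# wait_for_all "${pids[@]}" || { echo "Some shard groups failed" >&2; exit 1; }
```

Collecting PIDs costs a few extra lines but lets the final "All import jobs completed" message be printed only when every group actually succeeded.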
This is good, but how are we passing the restorePath?
I feel like the script should generate a custom restore path (made idempotent by adding a timestamp, etc.), use it for the restore step, and pass it as the restorePath in every job.
Also, with this model, who cleans up the restore path? Is there a way to trigger a cleanup at the end of the script, or do we have a tool that can be used? We can also say it's a manual step, but then this script should output something to the tune of "the snapshot was imported, please cleanup $RESTORE_PATH once validation succeeds."
Added some output for manual cleanups and passing restore path to every job.
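
The review thread suggests a per-run restore path and a cleanup reminder. A minimal bash sketch of both follows, not taken from the PR: `make_restore_dir` and `print_cleanup_hint` are hypothetical names, the `restore-<snapshot>` prefix follows `RESTORE_DIR` in the script, and the UTC timestamp suffix is an assumption based on the reviewer's comment.

```bash
#!/bin/bash
# Hypothetical helpers, not part of run-snapshot-import.sh.

# Builds a restore path that is unique per run by appending a UTC timestamp.
# A single trailing slash on the source directory is stripped first.
make_restore_dir() {
  local source_dir=$1 snapshot_name=$2
  echo "${source_dir%/}/restore-${snapshot_name}-$(date -u +%Y%m%dT%H%M%SZ)"
}

# Prints the manual cleanup reminder for a given restore path,
# using the same gsutil command the script already echoes.
print_cleanup_hint() {
  echo "Please clean up the restore path once validation succeeds:"
  echo "  gsutil rm -r $1"
}

# Possible use (left commented out): compute the path once and export it so that
# every shard invocation reuses the same value instead of generating a new one.
# export RESTORE_DIR="${RESTORE_DIR:-$(make_restore_dir "$SNAPSHOT_SOURCE_DIR" "$SNAPSHOT_NAME")}"
```

One caveat with a timestamped path: the restore step and the shard jobs run as separate invocations of the script, so the path has to be computed once and shared (for example through an exported variable, as in the commented line). Otherwise each invocation would derive a different `--restorePath`.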