Data quality testing for SQL-, Spark-, and Pandas-accessible data.
We’re preparing a major release of Soda Core (v4.0), scheduled for January 28.
This release introduces Data Contracts as the default way to define data quality rules for tables.
⚠️ This is a breaking change: Soda Core is moving from the checks language to a Data Contracts–based syntax. The new approach offers a cleaner, more structured, and more maintainable way to define and manage data quality rules, based on community feedback and real-world usage.
If you are currently using Soda Core, you will need to migrate your existing checks to Data Contracts when upgrading to v4.
📖 More details, including migration guidance and examples of the new Data Contract format, will be shared closer to the release date.
📄 Check out the Soda Core 4.0 documentation to learn how to install it and see what’s coming.
If you are using Soda Cloud, a Customer Engineer will reach out to you to help schedule and support your migration to v4.
If you want to stay on Soda Core v3 and avoid automatically upgrading to v4, make sure to pin your dependency to a v3 version. Check out all previous versions of Soda Core releases.
Specify a v3 version to prevent automatic upgrades:
    pip install soda-core==3.5.6
We strongly recommend pinning versions in production environments to avoid unexpected breaking changes.
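A pin can also be double-checked at runtime, for example at the start of a pipeline job. The following is a minimal sketch, not part of Soda Core: the helper names `major_version` and `installed_soda_core_major` are hypothetical, and the distribution name `soda-core` is taken from the install command above. It relies only on the standard-library `importlib.metadata` module.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def major_version(version_string: str) -> int:
    """Return the leading major component of a version string such as '3.5.6'."""
    return int(version_string.split(".")[0])


def installed_soda_core_major() -> Optional[int]:
    """Return the major version of the installed soda-core package, or None if absent."""
    try:
        return major_version(version("soda-core"))
    except PackageNotFoundError:
        return None


# Example guard: refuse to continue if the environment has drifted off v3.
installed = installed_soda_core_major()
if installed is not None and installed != 3:
    print(f"Expected soda-core v3, found major version {installed}")
```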
✔ An open-source CLI tool and Python library for data quality testing
✔ Compatible with the Soda Checks Language (SodaCL)
✔ Enables data quality testing both in and out of your data pipelines and development workflows
✔ Integrated to allow a Soda scan in a data pipeline, or programmatic scans on a time-based schedule
Soda Core is a free, open-source, command-line tool and Python library that enables you to use the Soda Checks Language to turn user-defined input into aggregated SQL queries.
When it runs a scan on a dataset, Soda Core executes the checks to find invalid, missing, or unexpected data. When your Soda Checks fail, they surface the data that you defined as bad-quality.
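To make the idea of turning checks into aggregated SQL concrete, here is a toy sketch. It is not Soda Core's implementation: the `METRIC_SQL` mapping, the tuple format for checks, and the functions `build_query` and `run_checks` are all hypothetical, and it uses an in-memory SQLite table rather than a real data source. It shows how several metrics can be folded into one `SELECT` and then compared against thresholds.

```python
import operator
import sqlite3

# Hypothetical mapping from a metric name to a SQL aggregate expression.
METRIC_SQL = {
    "row_count": "COUNT(*)",
    "missing_count": "COALESCE(SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END), 0)",
}
COMPARATORS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


def build_query(table, checks):
    """Fold every check's metric into a single aggregated SELECT statement."""
    # Identifiers are interpolated directly, which is acceptable only for a
    # demo with trusted input.
    selects = [
        f"{METRIC_SQL[metric].format(column=column)} AS m{i}"
        for i, (metric, column, _op, _threshold) in enumerate(checks)
    ]
    return f"SELECT {', '.join(selects)} FROM {table}"


def run_checks(connection, table, checks):
    """Run one query, then compare each metric value to its threshold."""
    row = connection.execute(build_query(table, checks)).fetchone()
    results = {}
    for value, (metric, column, op, threshold) in zip(row, checks):
        name = f"{metric}({column})" if column else metric
        results[f"{name} {op} {threshold}"] = COMPARATORS[op](value, threshold)
    return results


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (id INTEGER, birth_date TEXT)")
conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?)",
    [(1, "1980-01-01"), (2, None), (3, "1975-06-30")],
)

# Each check is (metric, column, comparison operator, threshold).
checks = [("row_count", None, ">", 0), ("missing_count", "birth_date", "=", 0)]
results = run_checks(conn, "dim_customer", checks)
for label, passed in results.items():
    print(label, "PASS" if passed else "FAIL")
```

With the sample rows above, the row count check passes and the missing-value check fails, because one `birth_date` is null.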
Consider migrating to Soda Library, an extension of Soda Core that offers more features and functionality, and enables you to connect to a Soda Cloud account to collaborate with your team on data quality.
- Use Group by and Group Evolution configurations to intelligently group check results.
- Leverage Reconciliation checks to compare data between data sources for data migration projects.
- Use Schema Evolution checks to automatically validate schemas.
- Set up Anomaly Detection checks to automatically learn patterns and discover anomalies in your data.
Install Soda Library and get started with a 45-day free trial.
Soda Core currently supports connections to several data sources. See Compatibility for a complete list.
Requirements
- Python 3.8 or greater
- Pip 21.0 or greater
Install and run
- To get started, use the install command, replacing soda-core-postgres with the package that matches your data source. See Install Soda Core for a complete list.
    pip install soda-core-postgres
- Prepare a configuration.yml file to connect to your data source. Then, write data quality checks in a checks.yml file. See Configure Soda Core.
- Run a scan to review checks that passed, failed, or warned during a scan. See Run a Soda Core scan.
    soda scan -d your_datasource -c configuration.yml checks.yml
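If the scan needs to be triggered from a pipeline script, one option is to wrap the CLI command shown above. The sketch below is only an illustration: `build_scan_command` and `run_scan` are hypothetical helpers, and the arguments mirror the `soda scan -d ... -c ...` invocation from this README. What each exit code means is not covered here, so the sketch returns it and leaves the interpretation to the caller.

```python
import subprocess
from typing import List, Sequence


def build_scan_command(
    data_source: str, configuration: str, checks_files: Sequence[str]
) -> List[str]:
    """Assemble the argument list for `soda scan`, mirroring the CLI usage above."""
    return ["soda", "scan", "-d", data_source, "-c", configuration, *checks_files]


def run_scan(data_source: str, configuration: str, checks_files: Sequence[str]) -> int:
    """Run the scan as a subprocess and return the process exit code."""
    # Assumes the `soda` executable is on PATH in the current environment.
    completed = subprocess.run(
        build_scan_command(data_source, configuration, checks_files),
        check=False,
    )
    return completed.returncode


command = build_scan_command("your_datasource", "configuration.yml", ["checks.yml"])
print(" ".join(command))
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems when file paths contain spaces.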
# Checks for basic validations
checks for dim_customer:
  - row_count between 10 and 1000
  - missing_count(birth_date) = 0
  - invalid_percent(phone) < 1 %:
      valid format: phone number
  - invalid_count(number_cars_owned) = 0:
      valid min: 1
      valid max: 6
  - duplicate_count(phone) = 0

# Checks for schema changes
checks for dim_product:
  - schema:
      name: Find forbidden, missing, or wrong type
      warn:
        when required column missing: [dealer_price, list_price]
        when forbidden column present: [credit_card]
        when wrong column type:
          standard_cost: money
      fail:
        when forbidden column present: [pii*]
        when wrong column index:
          model_name: 22

  # Check for freshness
  - freshness(start_date) < 1d

# Check for referential integrity
checks for dim_department_group:
  - values in (department_group_name) must exist in dim_employee (department_name)
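As a rough illustration of what the schema check above expresses, the following sketch evaluates "required column missing" and "forbidden column present" against a list of column names. It is not how Soda Core evaluates schema checks: the helper names and the sample column list are hypothetical, and the `pii*` wildcard is assumed to behave like a shell-style glob, which may differ from Soda's actual matching rules.

```python
from fnmatch import fnmatchcase
from typing import List, Sequence


def missing_required(actual_columns: Sequence[str], required: Sequence[str]) -> List[str]:
    """Return the required column names that are absent from the table."""
    return [name for name in required if name not in actual_columns]


def forbidden_present(actual_columns: Sequence[str], patterns: Sequence[str]) -> List[str]:
    """Return the columns that match any forbidden pattern (glob-style, assumed)."""
    return [
        column
        for column in actual_columns
        if any(fnmatchcase(column, pattern) for pattern in patterns)
    ]


# Hypothetical column list for dim_product.
columns = ["product_key", "list_price", "pii_email", "credit_card"]

warn_missing = missing_required(columns, ["dealer_price", "list_price"])
warn_forbidden = forbidden_present(columns, ["credit_card"])
fail_forbidden = forbidden_present(columns, ["pii*"])

print("warn, required column missing:", warn_missing)
print("warn, forbidden column present:", warn_forbidden)
print("fail, forbidden column present:", fail_forbidden)
```

For this sample, `dealer_price` is reported as missing, `credit_card` triggers the warn condition, and `pii_email` triggers the fail condition.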