How to compare two tables with the UserConfigurableProfiler
In this guide, you will use a UserConfigurableProfiler (which generates Metrics and candidate Expectations from data) to create an Expectation Suite (a collection of verifiable assertions about data) that can be used to gauge whether two tables are identical. This workflow can be used, for example, to validate migrated data.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- A basic understanding of Expectation Configurations in Great Expectations
- Read the overview of Profilers, and the section on UserConfigurableProfilers in particular
Steps
1. Decide your use case
This workflow can be applied to batches created from full tables, or to batches created from queries against tables. The two approaches follow slightly different workflows, detailed below.
- Full Table
- Query
2. Set-Up
In this workflow, we will use the UserConfigurableProfiler to profile a BatchRequest (provided to a Datasource in order to create a Batch) representing our source data, and then validate the resulting suite against a BatchRequest representing our second set of data.
To begin, we'll need to set up our imports and instantiate our Data Context (the primary entry point for a Great Expectations deployment, with configurations and methods for all supporting components):
from ruamel import yaml
import great_expectations as gx
from great_expectations.core.batch import BatchRequest
from great_expectations.profile.user_configurable_profiler import (
UserConfigurableProfiler,
)
context = gx.get_context()
Depending on your use case, workflow, and directory structure, you may need to update your context root directory as follows:
context = gx.get_context(
context_root_dir='/my/context/root/directory/great_expectations'
)
3. Create Batch Requests
In order to profile our first table and validate our second table, we need to set up our Batch Requests pointing to each set of data.
In this guide, we will use a MySQL Datasource (which provides a standard API for accessing and interacting with data from a wide variety of source systems) as our source data -- the data we trust to be correct.
mysql_batch_request = BatchRequest(
datasource_name="my_mysql_datasource",
data_connector_name="default_inferred_data_connector_name",
data_asset_name="test_ci.mysql_taxi_data",
)
From this data, we will create an Expectation Suite and use that suite to validate our second table, pulled from a PostgreSQL Datasource.
pg_batch_request = BatchRequest(
datasource_name="my_postgresql_datasource",
data_connector_name="default_inferred_data_connector_name",
data_asset_name="public.postgres_taxi_data",
)
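If you are unsure which Data Asset names each Datasource exposes, you can list them before building the Batch Requests above. A minimal sketch, assuming the Datasource names used in this guide:
# Optional: confirm the Data Asset names available on each Datasource.
print(context.get_available_data_asset_names(datasource_names=["my_mysql_datasource"]))
print(context.get_available_data_asset_names(datasource_names=["my_postgresql_datasource"]))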
4. Profile Source Data
We can now use the mysql_batch_request defined above to build a Validator (used to run an Expectation Suite against data):
validator = context.get_validator(batch_request=mysql_batch_request)
Instantiate our UserConfigurableProfiler:
profiler = UserConfigurableProfiler(
profile_dataset=validator,
excluded_expectations=[
"expect_column_quantile_values_to_be_between",
"expect_column_mean_to_be_between",
],
)
And use that profiler to build and save an Expectation Suite:
expectation_suite_name = "compare_two_tables"
suite = profiler.build_suite()
context.save_expectation_suite(
expectation_suite=suite, expectation_suite_name=expectation_suite_name
)
Why excluded_expectations? In this example, we exclude expect_column_quantile_values_to_be_between, as it isn't fully supported by some SQL dialects. This is one example of the ways in which we can customize the Suite built by our Profiler. For more on these configurations, see our guide on the optional parameters available with the UserConfigurableProfiler.
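It can also be useful to review which Expectations the Profiler actually generated before validating the second table. A minimal sketch, using the suite object returned by build_suite() above:
# Optional: print each generated Expectation and the column it targets (if any).
for expectation in suite.expectations:
    print(expectation.expectation_type, expectation.kwargs.get("column", "<table-level>"))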
5. Checkpoint Set-Up
Before we can validate our second table, we need to define a Checkpoint (the primary means for validating data in a production deployment of Great Expectations).
We will pass both the pg_batch_request and the Expectation Suite defined above to this Checkpoint.
my_checkpoint_name = "comparison_checkpoint"
yaml_config = f"""
name: {my_checkpoint_name}
config_version: 1.0
class_name: SimpleCheckpoint
run_name_template: "%Y%m%d-%H%M%S-my-run-name-template"
expectation_suite_name: {expectation_suite_name}
"""
context.add_checkpoint(**yaml.load(yaml_config))
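To confirm the Checkpoint was registered with your Data Context, you can list the known Checkpoints. A minimal sketch:
# Optional: verify that "comparison_checkpoint" now appears in the Data Context.
print(context.list_checkpoints())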
6. Validation
Finally, we can use our Checkpoint to validate that our two tables are identical:
results = context.run_checkpoint(
checkpoint_name=my_checkpoint_name, batch_request=pg_batch_request
)
If we now inspect the results of this Checkpoint (results["success"]), we can see that our Validation was successful!
By default, the Checkpoint above also updates your Data Docs, allowing you to further inspect the results of this workflow.
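If you want to act on the result programmatically, here is a minimal sketch (the error message below is illustrative):
# Fail loudly if the two tables differ; otherwise open Data Docs for a closer look.
if not results["success"]:
    raise ValueError("PostgreSQL table does not match the MySQL source table")
context.open_data_docs()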
2. Set-Up
In this workflow, we will use the UserConfigurableProfiler to profile a RuntimeBatchRequest (provided to a Datasource in order to create a Batch) representing a query against our source data, and then validate the resulting suite against a RuntimeBatchRequest representing a query against our second set of data.
To begin, we'll need to set up our imports and instantiate our Data Context:
from ruamel import yaml
import great_expectations as gx
from great_expectations.core.batch import RuntimeBatchRequest
from great_expectations.profile.user_configurable_profiler import (
UserConfigurableProfiler,
)
context = gx.get_context()
Depending on your use case, workflow, and directory structure, you may need to update your context root directory as follows:
context = gx.get_context(
context_root_dir='/my/context/root/directory/great_expectations'
)
3. Create Batch Requests
In order to profile our first table and validate our second table, we need to set up our Batch Requests pointing to each set of data. These will be RuntimeBatchRequests, specifying a query against our data to be executed at runtime.
In this guide, we will use a MySQL Datasource as our source data -- the data we trust to be correct.
mysql_runtime_batch_request = RuntimeBatchRequest(
datasource_name="my_mysql_datasource",
data_connector_name="default_runtime_data_connector_name",
data_asset_name="mysql_asset",
runtime_parameters={"query": "SELECT * from taxi_data LIMIT 10"},
batch_identifiers={"batch_id": "default_identifier"},
)
From this data, we will create an Expectation Suite and use that suite to validate our second table, pulled from a PostgreSQL Datasource.
pg_runtime_batch_request = RuntimeBatchRequest(
datasource_name="my_postgresql_datasource",
data_connector_name="default_runtime_data_connector_name",
data_asset_name="postgres_asset",
runtime_parameters={"query": "SELECT * from taxi_data LIMIT 10"},
batch_identifiers={"batch_id": "default_identifier"},
)
4. Profile Source Data
We can now use the mysql_runtime_batch_request defined above to build a Validator:
validator = context.get_validator(
batch_request=mysql_runtime_batch_request,
)
Instantiate our UserConfigurableProfiler:
profiler = UserConfigurableProfiler(
profile_dataset=validator,
excluded_expectations=[
"expect_column_quantile_values_to_be_between",
"expect_column_mean_to_be_between",
],
)
And use that profiler to build and save an Expectation Suite:
expectation_suite_name = "compare_two_tables"
suite = profiler.build_suite()
context.save_expectation_suite(
expectation_suite=suite, expectation_suite_name=expectation_suite_name
)
Why excluded_expectations? In this example, we exclude expect_column_quantile_values_to_be_between, as it isn't fully supported by some SQL dialects. This is one example of the ways in which we can customize the Suite built by our Profiler. For more on these configurations, see our guide on the optional parameters available with the UserConfigurableProfiler.
5. Checkpoint Set-Up
Before we can validate our second table, we need to define a Checkpoint.
We will pass both the pg_runtime_batch_request and the Expectation Suite defined above to this Checkpoint.
my_checkpoint_name = "comparison_checkpoint"
yaml_config = f"""
name: {my_checkpoint_name}
config_version: 1.0
class_name: SimpleCheckpoint
run_name_template: "%Y%m%d-%H%M%S-my-run-name-template"
expectation_suite_name: {expectation_suite_name}
"""
context.add_checkpoint(**yaml.load(yaml_config))
6. Validation
Finally, we can use our Checkpoint to validate that our two batches of data, queried from two different tables, are identical:
results = context.run_checkpoint(
checkpoint_name=my_checkpoint_name, batch_request=pg_runtime_batch_request
)
If we now inspect the results of this Checkpoint (results["success"]), we can see that our Validation was successful!
By default, the Checkpoint above also updates your Data Docs, allowing you to further inspect the results of this workflow.
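If you need more detail than the overall success flag, the Checkpoint result also carries per-Validation statistics. A minimal sketch, assuming the standard Validation Result statistics keys:
# Summarize how many Expectations passed for each Validation in this run.
for validation_result in results.list_validation_results():
    stats = validation_result["statistics"]
    print(f"{stats['successful_expectations']} of {stats['evaluated_expectations']} Expectations passed.")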
Congratulations!
🎉 You've just compared two tables across Datasources! 🎉