How to Validate data with an in-memory Checkpoint
This guide will demonstrate how to Validate data using a Checkpoint that is configured and run entirely in-memory. This workflow is appropriate for environments or workflows where a user does not want to or cannot use a Checkpoint Store, e.g. in a hosted environment.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- A Data Context
- An Expectation Suite
- A Datasource
- A basic understanding of Checkpoints
We recommend reading our guide on Deploying Great Expectations in a hosted environment without file system or CLI, which covers the setup, data connection, and Expectation creation steps that take place before this process.
Steps
1. Import the necessary modules
The recommended method for creating a Checkpoint is to use the CLI to open a Jupyter Notebook which contains code scaffolding to assist you with the process. This guide assumes that you need an in-memory Checkpoint because you cannot use the CLI or access a filesystem, so that option is not available and you will have to provide the scaffolding yourself.
In the script in which you define and execute your Checkpoint, enter the following code:
import great_expectations as gx
from great_expectations.checkpoint import Checkpoint
Importing great_expectations will give you access to your Data Context, while we will configure an instance of the Checkpoint class as our in-memory Checkpoint.
If you are planning to use a YAML string to configure your in-memory Checkpoint, you will also need to import yaml from ruamel:
from ruamel import yaml
You will also need to initialize yaml.YAML(...):
yaml = yaml.YAML(typ="safe")
2. Initialize your Data Context
In the previous section you imported great_expectations in order to get access to your Data Context. The line of code that does this is:
context = gx.get_context()
Checkpoints require a Data Context in order to access the Stores from which they retrieve Expectation Suites and in which they store Validation Results and Metrics, so you will pass context in as a parameter when you initialize your Checkpoint instance later.
3. Define your Checkpoint configuration
In addition to a Data Context, you will need a configuration with which to initialize your Checkpoint. This configuration can be in the form of a YAML string or a Python dictionary. The following examples show configurations that are equivalent to the one used by the Getting Started Tutorial.
Normally, a Checkpoint configuration will include the keys class_name and module_name. These are used by Great Expectations to identify the class of Checkpoint that should be initialized with a given configuration. Since we are initializing an instance of the Checkpoint class directly, we don't need the configuration to indicate the class of Checkpoint to be initialized. Therefore, these two keys will be left out of our configuration.
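If you are starting from a configuration that was written for a Checkpoint Store, it may still carry those two keys. The following is a minimal sketch of dropping them before use, assuming the configuration is a plain Python dictionary. The helper name strip_class_keys and the sample values are hypothetical and are not part of Great Expectations.

```python
def strip_class_keys(config: dict) -> dict:
    """Return a copy of config without the top-level class-lookup keys."""
    # Hypothetical helper: only the top level is filtered, so the
    # "class_name" entries nested inside action_list are left untouched.
    unwanted = {"class_name", "module_name"}
    return {key: value for key, value in config.items() if key not in unwanted}


# Sample values for illustration only.
stored_config = {
    "name": "in_memory_checkpoint",
    "config_version": 1,
    "class_name": "Checkpoint",
    "module_name": "great_expectations.checkpoint",
}
cleaned = strip_class_keys(stored_config)
print(sorted(cleaned))
```

The original dictionary is not modified, so it can still be written back to a store unchanged if needed.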
Python Dictionary:
my_checkpoint_name = "in_memory_checkpoint"
python_config = {
    "name": my_checkpoint_name,
    "config_version": 1,
    "run_name_template": "%Y%m%d-%H%M%S-my-run-name-template",
    "action_list": [
        {
            "name": "store_validation_result",
            "action": {"class_name": "StoreValidationResultAction"},
        },
        {
            "name": "store_evaluation_params",
            "action": {"class_name": "StoreEvaluationParametersAction"},
        },
        {
            "name": "update_data_docs",
            "action": {"class_name": "UpdateDataDocsAction", "site_names": []},
        },
    ],
    "validations": [
        {
            "batch_request": {
                "datasource_name": "taxi_datasource",
                "data_connector_name": "default_inferred_data_connector_name",
                "data_asset_name": "yellow_tripdata_sample_2019-01",
                "data_connector_query": {"index": -1},
            },
            "expectation_suite_name": "my_expectation_suite",
        }
    ],
}
YAML String:
my_checkpoint_name = "in_memory_checkpoint"
yaml_config = f"""
name: {my_checkpoint_name}
config_version: 1.0
run_name_template: '%Y%m%d-%H%M%S-my-run-name-template'
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: store_evaluation_params
    action:
      class_name: StoreEvaluationParametersAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
      site_names: []
validations:
  - batch_request:
      datasource_name: taxi_datasource
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: yellow_tripdata_sample_2019-01
    expectation_suite_name: my_expectation_suite
"""
When you are tailoring the configuration for your own purposes, you will want to replace the Batch Request and Expectation Suite under the validations key with your own values. You can also add more Batch Request and Expectation Suite entries under the validations key. Alternatively, you can replace this configuration entirely and build one from scratch. If you choose to build a configuration from scratch, or to further modify the examples provided above, you may wish to reference our documentation on Checkpoint configurations as you do.
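Adding an entry under the validations key amounts to appending one more dictionary to a list. The sketch below shows one way to do that without mutating the original configuration. The helper name with_extra_validation, the second data asset name, and the second suite name are hypothetical and used only for illustration.

```python
import copy


def with_extra_validation(config: dict, batch_request: dict, suite_name: str) -> dict:
    """Return a deep copy of config with one more entry under "validations"."""
    new_config = copy.deepcopy(config)
    new_config.setdefault("validations", []).append(
        {"batch_request": batch_request, "expectation_suite_name": suite_name}
    )
    return new_config


# Trimmed-down version of the dictionary configuration from this guide.
base_config = {
    "name": "in_memory_checkpoint",
    "validations": [
        {
            "batch_request": {
                "datasource_name": "taxi_datasource",
                "data_connector_name": "default_inferred_data_connector_name",
                "data_asset_name": "yellow_tripdata_sample_2019-01",
            },
            "expectation_suite_name": "my_expectation_suite",
        }
    ],
}

extended = with_extra_validation(
    base_config,
    {
        "datasource_name": "taxi_datasource",
        "data_connector_name": "default_inferred_data_connector_name",
        "data_asset_name": "yellow_tripdata_sample_2019-02",  # hypothetical asset
    },
    "my_other_expectation_suite",  # hypothetical suite name
)
print(len(extended["validations"]))
```

Working on a deep copy keeps the base configuration reusable, which can be convenient when several scripts share one starting point.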
4. Initialize your Checkpoint
Once you have your Data Context and Checkpoint configuration you will be able to initialize a Checkpoint instance in memory. There is a minor variation in how you do so, depending on whether you are using a Python dictionary or a YAML string for your configuration.
Python Dictionary:
If you are using a Python dictionary as your configuration, you will need to unpack it as parameters for the Checkpoint object's initialization. This can be done with the code:
my_checkpoint = Checkpoint(data_context=context, **python_config)
YAML String:
If you are using a YAML string as your configuration, you will need to convert it into a dictionary and unpack it as parameters for the Checkpoint object's initialization. This can be done with the code:
my_checkpoint = Checkpoint(data_context=context, **yaml.load(yaml_config))
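If one script has to accept either form, the two variations can be folded into a single step that always yields keyword arguments. The following is a rough sketch under stated assumptions: checkpoint_kwargs is a hypothetical helper, a bare object() stands in for the Data Context, and json.loads stands in for the YAML loader so that the example runs with the standard library alone.

```python
import json
from typing import Any, Callable, Optional, Union


def checkpoint_kwargs(
    config: Union[dict, str],
    context: Any,
    loader: Optional[Callable[[str], dict]] = None,
) -> dict:
    """Build the keyword arguments for a Checkpoint from a dict or a string."""
    if isinstance(config, str):
        if loader is None:
            raise ValueError("a loader is required for string configurations")
        config = loader(config)
    # A "data_context" key in the configuration would collide with the
    # explicit data_context argument when the result is unpacked.
    if "data_context" in config:
        raise ValueError("the configuration must not define 'data_context'")
    return {"data_context": context, **config}


stand_in_context = object()  # placeholder for a real Data Context
from_dict = checkpoint_kwargs({"name": "in_memory_checkpoint"}, stand_in_context)
from_text = checkpoint_kwargs(
    '{"name": "in_memory_checkpoint"}', stand_in_context, loader=json.loads
)
print(from_dict == from_text)
```

With Great Expectations installed, the result would be unpacked as Checkpoint(**checkpoint_kwargs(python_config, context)), or with loader=yaml.load for the YAML string.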
5. Run your Checkpoint
Congratulations! You now have an initialized Checkpoint object in memory. You can use its run(...) method to Validate your data as specified in the configuration.
This is done with the line:
results = my_checkpoint.run()
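In an automated job you will usually want the script to signal whether Validation passed. The sketch below assumes that the object returned by run() exposes an overall success flag as an attribute. The helper exit_code_for is hypothetical, and SimpleNamespace objects stand in for real results so that the example runs without Great Expectations.

```python
from types import SimpleNamespace


def exit_code_for(result) -> int:
    """Map a result's overall success flag to a process exit code."""
    # Assumption: the result object has a boolean "success" attribute.
    return 0 if result.success else 1


# Stand-ins for the object returned by my_checkpoint.run().
passing = SimpleNamespace(success=True)
failing = SimpleNamespace(success=False)
print(exit_code_for(passing), exit_code_for(failing))
```

In the real script this could end with sys.exit(exit_code_for(results)), so that a scheduler or CI system treats failed Validation as a failed job.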
Your script is now ready to be run. Each time you run it, it will initialize and run a Checkpoint in memory, rather than retrieving a Checkpoint configuration from a Checkpoint Store.
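Each of those runs is named according to the run_name_template value in the configuration, which uses strftime-style date codes. The sketch below uses only Python's standard datetime module to preview the name such a template yields for a fixed timestamp. It assumes Great Expectations expands the template in the same way, which is worth confirming against your installed version. The timestamp is an arbitrary example.

```python
from datetime import datetime

# Template copied from the configuration in this guide.
run_name_template = "%Y%m%d-%H%M%S-my-run-name-template"

# Arbitrary timestamp used only to preview the result.
sample_run_time = datetime(2024, 1, 2, 3, 4, 5)
preview = sample_run_time.strftime(run_name_template)
print(preview)  # 20240102-030405-my-run-name-template
```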
6. Check your Data Docs
Once you have run your script you can verify that it has worked by checking your Data Docs for new results.
Notes
To view the full example scripts used in this documentation, see: