How to use auto-initializing Expectations

This guide will walk you through the process of using a auto-initializing ExpectationsA verifiable assertion about data. to automate parameter estimation when you are creating Expectations interactively by using a BatchA selection of records from a Data Asset. or Batches that have been loaded into a ValidatorUsed to run an Expectation Suite against data..

PREREQUISITES: THIS HOW-TO GUIDE ASSUMES YOU HAVE:

Steps

Setup

This guide assumes that you are creating and editing expectations in a Jupyter Notebook. This process is covered in the guide: How to create and edit expectations with instant feedback from a sample batch of data.

Additionally, this guide assumes that you are using a multi-batch Batch RequestProvided to a Datasource in order to create a Batch. to provide your sample data. (Auto-initializing Expectations will work when run on a single Batch, but they really shine when run on multiple Batches that would have otherwise needed to be individually processed if a manual aproach were taken.)

1. Determine if your Expectation is auto-initializing

Not all Expectations are auto-initializng. In order to be a auto-initializing Expectation, an Expectation must have parameters that can be estimated. As an example: ExpectColumnToExist only takes in a Domain (which is the column name) and checks whether the column name is in the list of names in the table's metadata. This would be an example of an Expectation that would not work under the auto-initializing framework.

An example of Expectations that would work under the auto-initializing framework would be the ones that have numeric ranges, like ExpectColumnMeanToBeBetween, ExpectColumnMaxToBeBetween, and ExpectColumnSumToBeBetween.

To check whether the Expectation you are interested in works under the auto-initializing framework, run the is_expectation_auto_initializing() method of the Expectation class.

For example:

Python script/Jupyter notebook
from great_expectations.expectations.expectation import Expectation

Expectation.is_expectation_self_initializing(name="expect_column_to_exist")

will return False and print the message:

Console output

The Expectation expect_column_to_exist is not able to be auto-initialized.

However, the command:

Python script/Jupyter notebook
Expectation.is_expectation_self_initializing(name="expect_column_mean_to_be_between")

will return True and print the message:

Console output

The Expectation expect_column_mean_to_be_between is able to be auto-initialized. Please run by using the auto=True parameter.

For the purposes of this guide, we will be using expect_column_mean_to_be_between as our example Expectation.

2. Run the expectation with `auto=True`

Say you are interested in constructing an Expectation that captures the average distance of taxi trips across all of 2018. You have a DatasourceProvides a standard API for accessing and interacting with data from a wide variety of source systems. that provides 12 Batches (one for each month of the year) and you know that expect_colum_mean_to_be_between is the Expectation you want to implement.

The manual way

The Expectation expect_column_mean_to_be_between() has the following parameters:

column (str): The column name.
min_value (float or None): The minimum value for the column mean.
max_value (float or None): The maximum value for the column mean.
strict_min (boolean): If True, the column mean must be strictly larger than min_value, default=False
strict_max (boolean): If True, the column mean must be strictly smaller than max_value, default=False

Without the auto-initialization framework you would have to get the values for min_value and max_value for your series of 12 Batches by calculating the mean value for each Batch and using calculated mean values to determine the min_value and max_value parameters to pass your Expectation. This, although not difficult, would be a monotonous and time consuming task.

Using `auto=True`

Auto-initializing Expectations automate this sort of calculation across batches. To perform the same calculation described above (the mean ranges across the 12 Batches in the 2018 taxi data) the only thing you need to do is run the Expectation with auto=True

Python script/Jupyter notebook
expectation_result = validator.expect_column_mean_to_be_between(
    column="trip_distance", auto=True
)

Now the Expectation will calculate the min_value (2.83) and max_value (3.06) using all of the Batches that are loaded into the Validator. In our case, that means all 12 Batches associated with the 2018 taxi data.

3. Save your Expectation with the calculated values

Now that the Expectation's upper and lower bounds have come from the Batches, you can save your Expectation SuiteA collection of verifiable assertions about data. and move on.

validator.save_expectation_suite(discard_failed_expectations=False)

Additional information

note

To view the full scripts that were used in this page, see them on GitHub:

Steps​

Setup​

1. Determine if your Expectation is auto-initializing​

2. Run the expectation with auto=True​

The manual way​

Using auto=True​

3. Save your Expectation with the calculated values​

Additional information​