How to Use Great Expectations with Prefect

This guide will help you run Great Expectations validations in a Prefect flow.

Prerequisites: This how-to guide assumes you have:

Prefect is a workflow management system that data engineers use to build data applications. The Prefect open source library lets you define workflows in Python and add semantics such as retries, logging, dynamic mapping, caching, and failure notifications to your data pipelines. Prefect Cloud automates and monitors dataflows built in Prefect 1.0, so you don't have to manage orchestration infrastructure yourself.

Great Expectations can validate the data passed between tasks in your Prefect flow. Validating data before you operate on it surfaces data issues early and reduces the time you spend debugging. Prefect also makes it easy to combine Great Expectations with other services in your data stack and orchestrate them all in a predictable manner.

The `RunGreatExpectationsValidation` task

With Prefect, you define your workflows with tasks and flows. A Task represents a discrete action in a Prefect workflow. A Flow is a container for Tasks. It represents an entire workflow or application by describing the dependencies between tasks. Prefect offers a suite of over 180 pre-built tasks in the Prefect Task Library. The RunGreatExpectationsValidation task is one of these pre-built tasks. With the RunGreatExpectationsValidation task you can run validations for an existing Great Expectations project.
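To make the task-and-flow model concrete, here is a minimal plain-Python sketch of how a set of tasks and their dependencies determines a run order. It does not use Prefect and is not how Prefect is implemented; the task names (`extract`, `validate`, `load`) and the helper `run_in_dependency_order` are hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_in_dependency_order(
    tasks: Dict[str, Callable[[], None]],
    upstream: Dict[str, Set[str]],
) -> List[str]:
    """Run each task after all of its upstream tasks and return the run order."""
    order = list(TopologicalSorter(upstream).static_order())
    for name in order:
        tasks[name]()
    return order


# Hypothetical three-step pipeline: validate the data before loading it.
log: List[str] = []
tasks = {
    "extract": lambda: log.append("extract"),
    "validate": lambda: log.append("validate"),
    "load": lambda: log.append("load"),
}
# Each key depends on the task names in its set.
upstream = {"extract": set(), "validate": {"extract"}, "load": {"validate"}}

run_order = run_in_dependency_order(tasks, upstream)
```

In a real flow, Prefect derives these dependencies from how tasks are called inside the `with Flow(...)` block, and a validation task plays the role of the `validate` step.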

To use the RunGreatExpectationsValidation task, install Prefect with the ge extra:

pip install "prefect[ge]"

Here is an example of a flow that runs a Great Expectations validation:

from prefect import Flow, Parameter
from prefect.tasks.great_expectations import RunGreatExpectationsValidation

validation_task = RunGreatExpectationsValidation()

with Flow("ge_test") as flow:
   checkpoint_name = Parameter("checkpoint_name")
   prev_run_row_count = 100
   validation_task(
      checkpoint_name=checkpoint_name,
      evaluation_parameters=dict(prev_run_row_count=prev_run_row_count),
   )

flow.run(parameters={"checkpoint_name": "my_checkpoint"})

Using the RunGreatExpectationsValidation task is as easy as importing the task, instantiating the task, and calling it in your flow. In the flow above, we parameterize our flow with the checkpoint name. This way, we're able to reuse our flow to run different Great Expectations validations based on the input.
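Because the checkpoint name is a flow parameter, the same flow can be run once per checkpoint. The sketch below shows one way to do that. `run_for_checkpoints` is a hypothetical helper, the checkpoint names are placeholders, and `fake_flow_run` stands in for `flow.run` so the snippet runs without Prefect installed.

```python
from typing import Any, Callable, Dict, Iterable


def run_for_checkpoints(
    run_flow: Callable[..., Any],
    checkpoint_names: Iterable[str],
) -> Dict[str, Any]:
    """Call run_flow once per checkpoint name and collect whatever it returns."""
    results: Dict[str, Any] = {}
    for name in checkpoint_names:
        # Same keyword argument the guide passes to flow.run().
        results[name] = run_flow(parameters={"checkpoint_name": name})
    return results


def fake_flow_run(parameters: Dict[str, str]) -> str:
    """Stand-in for flow.run; it only echoes the checkpoint it was given."""
    return f"ran {parameters['checkpoint_name']}"


# Placeholder checkpoint names; use the ones defined in your project.
results = run_for_checkpoints(
    fake_flow_run, ["orders_checkpoint", "customers_checkpoint"]
)
```

With the flow defined above, you would pass `flow.run` in place of `fake_flow_run`.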

Configuring the root context directory

By default, the RunGreatExpectationsValidation task looks in the current directory for a Great Expectations project in a folder named great_expectations. If your great_expectations.yml is located in another directory, you can configure the RunGreatExpectationsValidation task with the context_root_dir argument:

from prefect import Flow, Parameter
from prefect.tasks.great_expectations import RunGreatExpectationsValidation

validation_task = RunGreatExpectationsValidation()

with Flow("ge_test") as flow:
   checkpoint_name = Parameter("checkpoint_name")
   prev_run_row_count = 100
   validation_task(
      checkpoint_name=checkpoint_name,
      evaluation_parameters=dict(prev_run_row_count=prev_run_row_count),
      context_root_dir="../great_expectations"
   )

flow.run(parameters={"checkpoint_name": "my_checkpoint"})
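A relative path such as `../great_expectations` is typically resolved against the working directory of the running process, which can differ between a local run and a deployed one. One way to guard against that is to resolve the path and check for `great_expectations.yml` before handing it to the task. This is a sketch only; `resolve_context_root_dir` is a hypothetical helper and not part of Prefect or Great Expectations.

```python
import tempfile
from pathlib import Path


def resolve_context_root_dir(candidate: str) -> str:
    """Return the absolute path of candidate if it holds great_expectations.yml."""
    root = Path(candidate).expanduser().resolve()
    if not (root / "great_expectations.yml").is_file():
        raise FileNotFoundError(f"No great_expectations.yml found in {root}")
    return str(root)


# Demonstration against a throwaway project directory.
with tempfile.TemporaryDirectory() as tmp:
    project_dir = Path(tmp) / "great_expectations"
    project_dir.mkdir()
    (project_dir / "great_expectations.yml").write_text("config_version: 3.0\n")
    resolved = resolve_context_root_dir(str(project_dir))
    found_config = Path(resolved, "great_expectations.yml").is_file()
```

The returned string can then be passed as the `context_root_dir` argument when calling the task.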

Using dynamic runtime configuration

The RunGreatExpectationsValidation task also supports runtime configuration of your validation run. You can pass an in-memory DataContext via the context argument or an in-memory Checkpoint via the ge_checkpoint argument.

Here is an example with an in-memory DataContext:

from pathlib import Path

import great_expectations as gx

from great_expectations.data_context.types.base import (
    DataContextConfig,
)
from prefect import Flow, Parameter, task
from prefect.tasks.great_expectations import RunGreatExpectationsValidation

@task
def create_in_memory_data_context(project_path: Path, data_path: Path):
    data_context = gx.get_context(
        project_config=DataContextConfig(
            **{
                "config_version": 3.0,
                "datasources": {
                    "data__dir": {
                        "module_name": "great_expectations.datasource",
                        "data_connectors": {
                            "data__dir_example_data_connector": {
                                "default_regex": {
                                    "group_names": ["data_asset_name"],
                                    "pattern": "(.*)",
                                },
                                "base_directory": str(data_path),
                                "module_name": "great_expectations.datasource.data_connector",
                                "class_name": "InferredAssetFilesystemDataConnector",
                            },
                            "default_runtime_data_connector_name": {
                                "batch_identifiers": ["default_identifier_name"],
                                "module_name": "great_expectations.datasource.data_connector",
                                "class_name": "RuntimeDataConnector",
                            },
                        },
                        "execution_engine": {
                            "module_name": "great_expectations.execution_engine",
                            "class_name": "PandasExecutionEngine",
                        },
                        "class_name": "Datasource",
                    }
                },
                "config_variables_file_path": str(
                    project_path / "uncommitted" / "config_variables.yml"
                ),
                "stores": {
                    "expectations_store": {
                        "class_name": "ExpectationsStore",
                        "store_backend": {
                            "class_name": "TupleFilesystemStoreBackend",
                            "base_directory": str(
                                project_path / "expectations"
                            ),
                        },
                    },
                    "validations_store": {
                        "class_name": "ValidationsStore",
                        "store_backend": {
                            "class_name": "TupleFilesystemStoreBackend",
                            "base_directory": str(
                                project_path / "uncommitted" / "validations"
                            ),
                        },
                    },
                    "evaluation_parameter_store": {
                        "class_name": "EvaluationParameterStore"
                    },
                    "checkpoint_store": {
                        "class_name": "CheckpointStore",
                        "store_backend": {
                            "class_name": "TupleFilesystemStoreBackend",
                            "suppress_store_backend_id": True,
                            "base_directory": str(
                                project_path / "checkpoints"
                            ),
                        },
                    },
                },
                "expectations_store_name": "expectations_store",
                "validations_store_name": "validations_store",
                "evaluation_parameter_store_name": "evaluation_parameter_store",
                "checkpoint_store_name": "checkpoint_store",
                "data_docs_sites": {
                    "local_site": {
                        "class_name": "SiteBuilder",
                        "show_how_to_buttons": True,
                        "store_backend": {
                            "class_name": "TupleFilesystemStoreBackend",
                            "base_directory": str(
                                project_path / "uncommitted" / "data_docs" / "local_site"
                            ),
                        },
                        "site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
                    }
                },
                "anonymous_usage_statistics": {
                    "data_context_id": "abcdabcd-1111-2222-3333-abcdabcdabcd",
                    "enabled": False,
                },
                "notebooks": None,
                "concurrency": {"enabled": False},
            }
        )
    )

    return data_context

validation_task = RunGreatExpectationsValidation()

with Flow("ge_test") as flow:
   checkpoint_name = Parameter("checkpoint_name")
   prev_run_row_count = 100
   data_context = create_in_memory_data_context(project_path=Path.cwd(), data_path=Path.cwd().parent)
   validation_task(
      checkpoint_name=checkpoint_name,
      evaluation_parameters=dict(prev_run_row_count=prev_run_row_count),
      context=data_context
   )

flow.run(parameters={"checkpoint_name": "my_checkpoint"})

Validating in-memory data

Because Prefect passes data directly between tasks, you can use the RunGreatExpectationsValidation task on in-memory dataframes. You don't need to write data to remote storage and read it back between steps of your pipeline.

Here is an example of how to run a validation on an in-memory dataframe by passing in a RuntimeBatchRequest via the checkpoint_kwargs argument:

from great_expectations.core.batch import RuntimeBatchRequest
import pandas as pd
from prefect import Flow, Parameter, task
from prefect.tasks.great_expectations import RunGreatExpectationsValidation

validation_task = RunGreatExpectationsValidation()

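@task
def dataframe_creation_task() -> pd.DataFrame:
   # Placeholder for the upstream task that produces the DataFrame to validate.
   # The CSV path is hypothetical; replace it with your own data source.
   return pd.read_csv("data/yellow_tripdata_sample_2019-02.csv")

@task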
def create_runtime_batch_request(df: pd.DataFrame):
   return RuntimeBatchRequest(
        datasource_name="data__dir",
        data_connector_name="default_runtime_data_connector_name",
        data_asset_name="yellow_tripdata_sample_2019-02_df",
        runtime_parameters={"batch_data": df},
        batch_identifiers={
            "default_identifier_name": "ingestion step 1",
        },
    )

with Flow("ge_test") as flow:
   checkpoint_name = Parameter("checkpoint_name")
   prev_run_row_count = 100

   df = dataframe_creation_task()

   in_memory_runtime_batch_request = create_runtime_batch_request(df)

   validation_task(
      checkpoint_name=checkpoint_name,
      evaluation_parameters=dict(prev_run_row_count=prev_run_row_count),
      checkpoint_kwargs={
         "validations": [
            {
               "batch_request": in_memory_runtime_batch_request,
               "expectation_suite_name": "taxi.demo_pass",
            }
         ]
      },
   )

flow.run(parameters={"checkpoint_name": "my_checkpoint"})
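The `checkpoint_kwargs` value above is a plain dictionary, so it can be assembled by a small function when several batch and suite pairs need to be validated in one run. The helper below is a hypothetical sketch: `build_checkpoint_kwargs` is not part of either library, and a string stands in for a real RuntimeBatchRequest so the snippet runs on its own. Whether your checkpoint accepts more than one validation this way may depend on your Great Expectations version.

```python
from typing import Any, Dict, Iterable, Tuple


def build_checkpoint_kwargs(
    pairs: Iterable[Tuple[Any, str]],
) -> Dict[str, Any]:
    """Build a checkpoint_kwargs dict from (batch_request, suite_name) pairs."""
    return {
        "validations": [
            {"batch_request": batch_request, "expectation_suite_name": suite_name}
            for batch_request, suite_name in pairs
        ]
    }


# The string stands in for a RuntimeBatchRequest object in this illustration.
checkpoint_kwargs = build_checkpoint_kwargs(
    [("<batch request for step 1>", "taxi.demo_pass")]
)
```

The result has the same shape as the dictionary passed to `validation_task` in the flow above.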

Where to go for more information

The flexibility that Prefect and the RunGreatExpectationsValidation task offer makes it easy to incorporate data validation into your dataflows with Great Expectations.

For more info about the RunGreatExpectationsValidation task, refer to the Prefect documentation.
