Batch Request
Overview
Definition
A Batch Request is provided to a Datasource (which provides a standard API for accessing and interacting with data from a wide variety of source systems) in order to create a Batch (a selection of records from a Data Asset).
Features and promises
A Batch Request contains all the necessary details to query the appropriate underlying data. The relationship between a Batch Request and the data returned as a Batch is guaranteed. If a Batch Request identifies multiple Batches that fit the criteria of the user-provided batch_identifiers, the Batch Request will return all of the matching Batches.
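For illustration, here is a minimal sketch (the Datasource and Data Connector names are hypothetical, and context is assumed to be an existing Data Context): a Batch Request that supplies no narrowing identifiers matches every Batch the Data Connector can map to the Data Asset, and all of them are returned.
from great_expectations.core.batch import BatchRequest

# Hypothetical names; with no narrowing batch identifiers, this request
# matches every Batch the Data Connector maps to the Data Asset.
batch_request = BatchRequest(
    datasource_name="my_datasource",
    data_connector_name="my_configured_data_connector",
    data_asset_name="yellow_tripdata_sample",
)
batch_list = context.get_batch_list(batch_request=batch_request)  # all matching Batches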
Relationship to other objects
A Batch Request is always used when Great Expectations builds a Batch. The Batch Request includes a "query" for a Datasource's Data Connector (which provides the configuration details, based on the source data system, that a Datasource needs to define Data Assets) to describe the data to include in the Batch. Any time you interact with something that requires a Batch of data, such as a Profiler (generates Metrics and candidate Expectations from data), a Checkpoint (the primary means for validating data in a production deployment of Great Expectations), or a Validator (used to run an Expectation Suite against data), you will use a Batch Request and Datasource to create the Batch that is used.
Use cases
Connect to Data
Since a Batch Request is necessary in order to get a Batch from a Datasource, all of our guides on how to connect to specific source data systems include a section on using a Batch Request to test that your Datasource is properly configured. These sections also serve as examples of how to define a Batch Request for a Datasource that is configured for a given source data system.
You can find these guides in our documentation on how to connect to data.
Create Expectations
If you are using a Profiler or the interactive method of creating Expectations, you will need to provide a Batch of data for the Profiler to analyze or your manually defined Expectations to test against. For both of these processes, you will therefore need a Batch Request to get the Batch.
For more information, see:
- Our how-to guide on the interactive process for creating Expectations
- Our how-to guide on using a Profiler to generate Expectations
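As a brief sketch of the interactive workflow (assuming an existing Data Context named context and the Datasource configured in the getting-started guides; the suite name is hypothetical), you pass a Batch Request to context.get_validator(...) to get a Validator backed by the resulting Batch:
from great_expectations.core.batch import BatchRequest

batch_request = BatchRequest(
    datasource_name="getting_started_datasource",
    data_connector_name="default_inferred_data_connector_name",
    data_asset_name="yellow_tripdata_sample_2019-01.csv",
)

# The Validator wraps the Batch so Expectations can be tested interactively.
context.create_expectation_suite("my_suite", overwrite_existing=True)  # hypothetical suite name
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="my_suite",
)
validator.expect_column_values_to_not_be_null(column="passenger_count")  # assumes this column exists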
Validate Data
When validating data with a Checkpoint (validation being the act of applying an Expectation Suite to a Batch), you will need to provide one or more Batch Requests and one or more Expectation Suites (collections of verifiable assertions about data). You can do this at runtime, or by defining Batch Request and Expectation Suite pairs in advance in the Checkpoint's configuration.
For more information on setting up Batch Request/Expectation Suite pairs in a Checkpoint's configuration, see:
- Our guide on how to add data or suites to a Checkpoint
- Our guide on how to configure a new Checkpoint using test_yaml_config(...)
When passing RuntimeBatchRequests to a Checkpoint, you will not be pairing Expectation Suites with Batch Requests. Instead, when you provide RuntimeBatchRequests to a Checkpoint, it will run all of its configured Expectation Suites against each of the RuntimeBatchRequests that are passed in.
For examples of how to pass RuntimeBatchRequests to a Checkpoint, see the examples used to test your Datasource configurations in our documentation on how to connect to data. RuntimeBatchRequests are typically used when you need to pass in a DataFrame at runtime.
If you don't have a specific source data system in mind right now, Example 2 of our guide on how to pass an in-memory DataFrame to a Checkpoint is a good starting point.
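Here is a minimal sketch of that pattern (the Datasource, Data Connector, and Checkpoint names are hypothetical, and context is assumed to be a Data Context whose Datasource includes a RuntimeDataConnector):
import pandas as pd
from great_expectations.core.batch import RuntimeBatchRequest

df = pd.read_csv("yellow_tripdata_sample_2019-01.csv")  # any in-memory DataFrame

runtime_batch_request = RuntimeBatchRequest(
    datasource_name="my_pandas_datasource",
    data_connector_name="my_runtime_data_connector",
    data_asset_name="taxi_january",
    runtime_parameters={"batch_data": df},  # pass the DataFrame itself
    batch_identifiers={"run_id": "2019-01-ingest"},
)

# Every Expectation Suite configured on the Checkpoint runs against this Batch.
results = context.run_checkpoint(
    checkpoint_name="my_checkpoint",
    validations=[{"batch_request": runtime_batch_request}],
)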
Features
Guaranteed relationships
The relationship between a Batch and the Batch Request that generated it is guaranteed. A Batch Request includes all of the information necessary to identify a specific Batch or Batches.
Batches are always built using a Batch Request. When the Batch is built, additional metadata is attached, including a Batch Definition. The Batch Definition directly corresponds to the Batch Request that was used to create the Batch.
API basics
How to access
You will rarely need to access an existing Batch Request. Instead, you will often find yourself defining a Batch Request in a configuration file, or passing in parameters to create a Batch Request which you will then pass to a Datasource. Once you receive a Batch back, it is unlikely you will need to refer back to the Batch Request that generated it. Indeed, if the Batch Request was part of a configuration, Great Expectations will simply initialize a new copy rather than load an existing one when the Batch Request is needed.
How to create
Batch Requests are instances of either a RuntimeBatchRequest or a BatchRequest.
A BatchRequest can be defined by passing a dictionary with the necessary parameters when a BatchRequest is initialized, like so:
from great_expectations.core.batch import BatchRequest
batch_request_parameters = {
'datasource_name': 'getting_started_datasource',
'data_connector_name': 'default_inferred_data_connector_name',
'data_asset_name': 'yellow_tripdata_sample_2019-01.csv',
'limit': 1000
}
batch_request = BatchRequest(**batch_request_parameters)
Regardless of which source data system a Batch Request's Datasource is associated with, the parameters for initializing the Batch Request remain the same. Great Expectations will handle translating that information into a query appropriate for the source data system behind the scenes.
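For instance (hypothetical names; this sketch assumes a Datasource backed by a SQL database rather than files, and reuses the BatchRequest import from above), the same parameter shape applies unchanged:
sql_batch_request = BatchRequest(
    datasource_name="my_postgres_datasource",  # hypothetical SQL-backed Datasource
    data_connector_name="default_inferred_data_connector_name",
    data_asset_name="taxi_trips",  # a table name rather than a file name
)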
A RuntimeBatchRequest will need a Datasource that has been configured with a RuntimeDataConnector. You will then use a RuntimeBatchRequest to specify the Batch that you will be working with.
For more information and examples regarding setting up a Datasource for use with RuntimeBatchRequests, see our documentation on how to connect to data.
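As a rough sketch of such a configuration (the names are hypothetical; this assumes a Pandas-backed Datasource and uses the test_yaml_config(...) pattern mentioned above):
datasource_yaml = """
name: my_pandas_datasource
class_name: Datasource
execution_engine:
  class_name: PandasExecutionEngine
data_connectors:
  my_runtime_data_connector:
    class_name: RuntimeDataConnector
    batch_identifiers:
      - run_id  # the identifier keys that Runtime Batch Requests will supply
"""
context.test_yaml_config(datasource_yaml)  # sanity-check the config before saving it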
More Details
Batches and Batch Requests: Design Motivation
You do not generally need to access the metadata that Great Expectations uses to define a Batch. Typically, a user needs to specify only the Batch Request. The Batch Request will describe what data Great Expectations should fetch, including the name of the Data Asset and other identifiers (see more detail below).
A Batch Definition includes all the information required to precisely identify a set of data from the external data source that should be translated into a Batch. One or more BatchDefinitions are always returned from the Datasource as a result of processing the Batch Request. A Batch Definition includes several key components:
- Batch Identifiers: contains information that uniquely identifies a specific batch from the Data Asset, such as the delivery date or query time.
- Engine Passthrough: contains information that will be passed directly to the Execution Engine as part of the Batch Spec.
- Sample Definition: contains information about sampling or limiting done on the Data Asset to create a Batch.
We recommend that you make every Data Asset Name unique in your Data Context configuration. Even though a Batch Definition includes the Data Connector Name and Datasource Name, choosing a unique Data Asset name makes it easier to navigate quickly through Data Docs and ensures your logical data assets are not confused with any particular view of them provided by an Execution Engine.
A Batch Spec is an Execution Engine-specific description of the Batch. The Data Connector is responsible for working with the Execution Engine to translate the Batch Definition into a spec that enables Great Expectations to access the data using that Execution Engine.
Finally, BatchMarkers are additional pieces of metadata that can be useful for understanding reproducibility, such as the time the Batch was constructed, or the hash of an in-memory DataFrame.
Batches and Batch Requests: A full journey
Let's follow the journey from BatchRequest to Batch list, step by step:
- A Datasource's get_batch_list_from_batch_request method is passed a BatchRequest.
  - A BatchRequest can include data_connector_query params with values relative to the latest Batch (e.g. the "latest" slice). Conceptually, this enables "fetch the latest Batch" behavior. It is the key thing that differentiates a BatchRequest, which does NOT necessarily uniquely identify the Batch(es) to be fetched, from a BatchDefinition.
  - The BatchRequest can also include a section called batch_spec_passthrough to make it easy to directly communicate parameters to a specific Execution Engine.
  - When resolved, the BatchRequest may point to many BatchDefinitions and Batches.
  - BatchRequests can be defined as dictionaries, or by instantiating a BatchRequest object, for example:
runtime_batch_request = RuntimeBatchRequest(
datasource_name="my_pandas_datasource",
data_connector_name="my_runtime_data_connector",
data_asset_name="insert_your_data_asset_name_here",
runtime_parameters={"path": path_to_file},
batch_identifiers={
"some_key_maybe_pipeline_stage": "ingestion step 1",
"some_other_key_maybe_airflow_run_id": "run 18",
},
batch_spec_passthrough={
"reader_method": "read_csv",
"reader_options": {"sep": ",", "header": 0},
},
)
- The Datasource finds the Data Connector indicated by the BatchRequest, and uses it to obtain a BatchDefinition list.
datasource.get_batch_list_from_batch_request(batch_request=batch_request)
- A BatchDefinition resolves any ambiguity in the BatchRequest to uniquely identify a single Batch to be fetched. BatchDefinitions are Datasource- and Execution Engine-agnostic: their parameters may depend on the configuration of the Datasource, but they do not otherwise depend on the specific Data Connector type (e.g. filesystem, SQL, etc.) or Execution Engine being used to instantiate Batches.
BatchDefinition
    datasource: str
    data_connector: str
    data_asset_name: str
    batch_identifiers:
        ** contents depend on the configuration of the DataConnector **
        ** provides a persistent, unique identifier for the Batch within the context of the Data Asset **
- The Datasource then requests that the Data Connector transform the BatchDefinition list into BatchData, BatchSpec, and BatchMarkers.
- When the Data Connector receives this request, it first builds the BatchSpec, then calls its Execution Engine to create BatchData and BatchMarkers.
- A BatchSpec is a set of specific instructions for the Execution Engine to fetch specific data; it is the Execution Engine-specific version of the BatchDefinition. For example, a BatchSpec could include the path to files, information about headers, or other configuration required to ensure the data is loaded properly for validation.
- Batch Markers are metadata that can be used to calculate performance characteristics, ensure reproducibility of Validation Results, and provide indicators of the state of the underlying data system.
- After the Data Connector returns the BatchSpec, BatchData, and BatchMarkers, the Datasource builds and returns a list of Batches.
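As a small sketch of the end of this journey (assuming context is an existing Data Context and batch_request refers to a configured Datasource), you can inspect the metadata attached to each returned Batch:
batch_list = context.get_batch_list(batch_request=batch_request)

for batch in batch_list:
    print(batch.batch_definition)  # uniquely identifies this Batch
    print(batch.batch_spec)        # Execution Engine-specific fetch instructions
    print(batch.batch_markers)     # reproducibility metadata, e.g. when the Batch was loaded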
RuntimeDataConnector and RuntimeBatchRequest
A Runtime Data Connector is a special kind of Data Connector that supports easy integration with Pipeline Runners where the data is already available as a reference that needs only a lightweight wrapper to track validations. Runtime Data Connectors are used alongside a special kind of Batch Request class called a RuntimeBatchRequest. Instead of serving as a description of what data Great Expectations should fetch, a Runtime Batch Request serves as a wrapper for data that is passed in at runtime (as an in-memory DataFrame, file/S3 path, or SQL query), with user-provided identifiers for uniquely identifying the data.
In a Batch Definition produced by a Runtime Data Connector, the batch_identifiers come directly from the Runtime Batch Request and serve as a persistent, unique identifier for the data included in the Batch. By relying on user-provided batch_identifiers, we allow the definition of the specific batch's identifiers to happen at runtime, for example using a run_id from an Airflow DAG run. The specific runtime batch_identifiers to be expected are controlled in the Runtime Data Connector configuration. Using that configuration creates a control plane for governance-minded engineers who want to enforce some level of consistency between validations.
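As a rough sketch of that control plane (all names hypothetical): the Runtime Data Connector's configuration declares the identifier keys, and every Runtime Batch Request supplies its identifiers under those declared keys.
from great_expectations.core.batch import RuntimeBatchRequest

# The connector's configuration declares the expected identifier keys:
#
#   my_runtime_data_connector:
#     class_name: RuntimeDataConnector
#     batch_identifiers:
#       - pipeline_stage
#       - airflow_run_id
#
# Runtime Batch Requests then tag their data under those same keys:
runtime_batch_request = RuntimeBatchRequest(
    datasource_name="my_pandas_datasource",
    data_connector_name="my_runtime_data_connector",
    data_asset_name="insert_your_data_asset_name_here",
    runtime_parameters={"path": "/path/to/file.csv"},  # hypothetical file path
    batch_identifiers={
        "pipeline_stage": "ingestion step 1",
        "airflow_run_id": "run 18",
    },
)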