Skip to main content

How to configure an InferredAssetDataConnector

This guide demonstrates how to configure an InferredAssetDataConnector, and provides several examples you can use for configuration.

Prerequisites: This how-to guide assumes you have:

Great Expectations provides two types of DataConnector classes for connecting to Data AssetsA collection of records within a Datasource which is usually named based on the underlying data system and sliced to correspond to a desired specification. stored as file-system-like data (this includes files on disk, but also S3 object stores, etc) as well as relational database data:

  • A ConfiguredAssetDataConnector allows you to specify that you have multiple Data Assets in a Datasource, but also requires an explicit listing of each Data Asset you want to connect to. This allows more fine-tuning, but also requires more setup.
  • An InferredAssetDataConnector infers data_asset_name by using a regex that takes advantage of patterns that exist in the filename or folder structure.

InferredAssetDataConnector has fewer options, so it's simpler to set up. It’s a good choice if you want to connect to a single Data Asset, or several Data Assets that all share the same naming convention.

If you're not sure which one to use, please check out How to choose which DataConnector to use.

Steps

1. Instantiate your project's DataContext

Import these necessary packages and modules:

from ruamel import yaml

import great_expectations as gx
from great_expectations.core.batch import BatchRequest

2. Set up a Datasource

All the examples below assume you’re testing configurations using something like:

datasource_yaml = """
name: taxi_datasource
class_name: Datasource
execution_engine:
class_name: PandasExecutionEngine
data_connectors:
<DATA CONNECTOR NAME GOES HERE>:
<DATACONNECTOR CONFIGURATION GOES HERE>
"""
context.test_yaml_config(yaml_config=datasource_config)

If you’re not familiar with the test_yaml_config method, please check out: How to configure Data Context components using test_yaml_config

3. Add an InferredAssetDataConnector to a Datasource configuration

InferredAssetDataConnectors like InferredAssetFilesystemDataConnector and InferredAssetS3DataConnector require a default_regex parameter, with a configured regex pattern and capture group_names.

Imagine you have the following files in my_directory/:

<MY DIRECTORY>/yellow_tripdata_2019-01.csv
<MY DIRECTORY>/yellow_tripdata_2019-02.csv
<MY DIRECTORY>/yellow_tripdata_2019-03.csv

We can imagine two approaches to loading the data into GX.

The simplest approach would be to consider each file to be its own Data Asset. In that case, the configuration would look like the following:

datasource_yaml = r"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
module_name: great_expectations.execution_engine
class_name: PandasExecutionEngine
data_connectors:
default_inferred_data_connector_name:
class_name: InferredAssetFilesystemDataConnector
base_directory: <MY DIRECTORY>/
default_regex:
group_names:
- data_asset_name
pattern: (.*)\.csv
"""

Notice that the default_regex is configured to have one capture group ((.*)) which captures the entire filename. That capture group is assigned to data_asset_name under group_names. For InferredAssetDataConnectors data_asset_name is a required group_name, and it's associated capture group is the way each data_asset_name is inferred. Running test_yaml_config() would result in 3 Data Assets : yellow_tripdata_2019-01, yellow_tripdata_2019-02 and yellow_tripdata_2019-03.

However, a closer look at the filenames reveals a pattern that is common to the 3 files. Each have yellow_tripdata_ in the name, and have date information afterwards. These are the types of patterns that InferredAssetDataConnectors allow you to take advantage of.

We could treat yellow_tripdata_*.csv files as BatchesA selection of records from a Data Asset. within the yellow_tripdata Data Asset with a more specific regex pattern and adding group_names for year and month.

Note: We have chosen to be more specific in the capture groups for the year and month by specifying the integer value (using \d) and the number of digits, but a simpler capture group like (.*) would also work. For more information about capture groups, refer to the Python documentation on regular expressions.

datasource_yaml = r"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
module_name: great_expectations.execution_engine
class_name: PandasExecutionEngine
data_connectors:
default_inferred_data_connector_name:
class_name: InferredAssetFilesystemDataConnector
base_directory: <MY DIRECTORY>/
default_regex:
group_names:
- data_asset_name
- year
- month
pattern: (.*)_(\d{4})-(\d{2})\.csv
"""

Running test_yaml_config() would result in 1 Data Asset yellow_tripdata with 3 associated data_references: yellow_tripdata_2019-01.csv, yellow_tripdata_2019-02.csv and yellow_tripdata_2019-03.csv, seen also in Example 1 below.

A corresponding configuration for InferredAssetS3DataConnector would look similar but would require bucket and prefix values instead of base_directory.

datasource_yaml = r"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
module_name: great_expectations.execution_engine
class_name: PandasExecutionEngine
data_connectors:
default_inferred_data_connector_name:
class_name: InferredAssetS3DataConnector
bucket: <MY S3 BUCKET>/
prefix: <MY S3 BUCKET PREFIX>/
default_regex:
group_names:
- prefix
- data_asset_name
- year
- month
pattern: (.*)/(.*)_sample_(\d{4})-(\d{2})\.csv
"""

The following examples will show scenarios that InferredAssetDataConnectors can help you analyze, using InferredAssetFilesystemDataConnector.

Example 1: Basic configuration for a single Data Asset

Continuing the example above, imagine you have the following files in the directory <MY DIRECTORY>:

<MY DIRECTORY>/yellow_tripdata_2019-01.csv
<MY DIRECTORY>/yellow_tripdata_2019-02.csv
<MY DIRECTORY>/yellow_tripdata_2019-03.csv

Then this configuration:

# YAML
datasource_yaml = r"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
module_name: great_expectations.execution_engine
class_name: PandasExecutionEngine
data_connectors:
default_inferred_data_connector_name:
class_name: InferredAssetFilesystemDataConnector
base_directory: <MY DIRECTORY>/
default_regex:
group_names:
- data_asset_name
- year
- month

will make available yelow_tripdata as a single Data Asset with the following data_references:

Available data_asset_names (1 of 1):
yellow_tripdata (3 of 3): ['yellow_tripdata_2019-01.csv', 'yellow_tripdata_2019-02.csv', 'yellow_tripdata_2019-03.csv']

Unmatched data_references (0 of 0):[]

Once configured, you can get ValidatorsUsed to run an Expectation Suite against data. from the Data ContextThe primary entry point for a Great Expectations deployment, with configurations and methods for all supporting components. as follows:

)

batch_request = BatchRequest(
datasource_name="taxi_datasource",
data_connector_name="default_inferred_data_connector_name",
data_asset_name="yellow_tripdata",
)

validator = context.get_validator(
batch_request=batch_request,

Since this BatchRequest does not specify which data_reference to load, the ActiveBatch for the validator will be the last data_reference that was loaded. In this case, yellow_tripdata_2019-03.csv is what is being used by validator. We can verfiy this with:

print(validator.active_batch_definition)

which prints:

{
"datasource_name": "taxi_datasource",
"data_connector_name": "default_inferred_data_connector_name",
"data_asset_name": "yellow_tripdata",
"batch_identifiers": {
"year": "2019",
"month": "03"
}
}

Notice that the batch_identifiers for this batch_definition specify "year": "2019", "month": "03". The parameter batch_identifiers can be used in our BatchRequest to return the data_reference CSV of our choosing using the group_names defined in our DataConnector:

assert isinstance(validator, gx.validator.validator.Validator)

batch_request = BatchRequest(
datasource_name="taxi_datasource",
data_connector_name="default_inferred_data_connector_name",
data_asset_name="yellow_tripdata",
data_connector_query={"batch_filter_parameters": {"year": "2019", "month": "02"}},
)

validator = context.get_validator(
batch_request=batch_request,
print(validator.active_batch_definition)

which prints:

{
"datasource_name": "taxi_datasource",
"data_connector_name": "default_inferred_data_connector_name",
"data_asset_name": "yellow_tripdata",
"batch_identifiers": {
"year": "2019",
"month": "02"
}
}

This ability to access specific Batches using batch_identifiers is very useful when validating Data Assets that span multiple files. For more information on batches and batch_identifiers, please refer to our Batches documentation.

Example 2: Basic configuration with more than one Data Asset

Here’s a similar example, but this time two Data Assets are mixed together in one folder.

Note: For an equivalent configuration using ConfiguredAssetFilesSystemDataconnector, please see Example 2 in How to configure a ConfiguredAssetDataConnector.

<MY DIRECTORY>/yellow_tripdata_2019-01.csv
<MY DIRECTORY>/green_tripdata_2019-01.csv
<MY DIRECTORY>/yellow_tripdata_2019-02.csv
<MY DIRECTORY>/green_tripdata_2019-02.csv
<MY DIRECTORY>/yellow_tripdata_2019-03.csv
<MY DIRECTORY>/green_tripdata_2019-03.csv

The same configuration as Example 1:

# YAML
datasource_yaml = r"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
module_name: great_expectations.execution_engine
class_name: PandasExecutionEngine
data_connectors:
default_inferred_data_connector_name:
class_name: InferredAssetFilesystemDataConnector
base_directory: <MY DIRECTORY>/
default_regex:
group_names:
- data_asset_name
- year
- month

will now make yellow_tripdata and green_tripdata both available as Data Assets, with the following data_references:

Available data_asset_names (2 of 2):
green_tripdata (3 of 3): ['green_tripdata_2019-01.csv', 'green_tripdata_2019-02.csv', 'green_tripdata_2019-03.csv']
yellow_tripdata (3 of 3): ['yellow_tripdata_2019-01.csv', 'yellow_tripdata_2019-02.csv', 'yellow_tripdata_2019-03.csv']

Unmatched data_references (0 of 0): []

Example 3: Nested directory structure with the data_asset_name on the inside

Here’s a similar example, with a nested directory structure:

<MY DIRECTORY>/2018/10/yellow_tripdata.csv
<MY DIRECTORY>/2018/10/green_tripdata.csv
<MY DIRECTORY>/2018/11/yellow_tripdata.csv
<MY DIRECTORY>/2018/11/green_tripdata.csv
<MY DIRECTORY>/2018/12/yellow_tripdata.csv
<MY DIRECTORY>/2018/12/green_tripdata.csv
<MY DIRECTORY>/2019/01/yellow_tripdata.csv
<MY DIRECTORY>/2019/01/green_tripdata.csv
<MY DIRECTORY>/2019/02/yellow_tripdata.csv
<MY DIRECTORY>/2019/02/green_tripdata.csv
<MY DIRECTORY>/2019/03/yellow_tripdata.csv
<MY DIRECTORY>/2019/03/green_tripdata.csv

Then this configuration:

# YAML
datasource_yaml = r"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
module_name: great_expectations.execution_engine
class_name: PandasExecutionEngine
data_connectors:
default_inferred_data_connector_name:
class_name: InferredAssetFilesystemDataConnector
base_directory: <MY DIRECTORY>/
glob_directive: "*/*/*.csv"
default_regex:
group_names:
- year
- month
- data_asset_name

will now make yellow_tripdata and green_tripdata both available as Data Assets, with the following data_references:

Available data_asset_names (2 of 2):
green_tripdata (3 of 6): ['2018/10/green_tripdata.csv', '2018/11/green_tripdata.csv', '2018/12/green_tripdata.csv']
yellow_tripdata (3 of 6): ['2018/10/yellow_tripdata.csv', '2018/11/yellow_tripdata.csv', '2018/12/yellow_tripdata.csv']

Unmatched data_references (0 of 0):[]

The glob_directive is provided to give the DataConnector information about the directory structure to expect for each Data Asset. The default glob_directive for the InferredAssetFileSystemDataConnector is "*" and therefore must be overridden when your data_references exist in subdirectories.

Example 4: Nested directory structure with the data_asset_name on the outside

In the following example, files are placed in a folder structure with the data_asset_name defined by the folder name (yellow_tripdata or green_tripdata)

<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-01.csv
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-02.csv
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-03.csv
<MY DIRECTORY>/green_tripdata/2019-01.csv
<MY DIRECTORY>/green_tripdata/2019-02.csv
<MY DIRECTORY>/green_tripdata/2019-03.csv

Then this configuration:

# YAML
datasource_yaml = r"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
module_name: great_expectations.execution_engine
class_name: PandasExecutionEngine
data_connectors:
default_inferred_data_connector_name:
class_name: InferredAssetFilesystemDataConnector
base_directory: <MY DIRECTORY>/
glob_directive: "*/*.csv"
default_regex:
group_names:
- data_asset_name
- file_name_root
- year
- month

will now make yellow_tripdata and green_tripdata into Data Assets, with each containing 3 data_references

Available data_asset_names (2 of 2):
green_tripdata (3 of 3): ['green_tripdata/2019-01.csv', 'green_tripdata/2019-02.csv', 'green_tripdata/2019-03.csv']
yellow_tripdata (3 of 3): ['yellow_tripdata/yellow_tripdata_2019-01.csv', 'yellow_tripdata/yellow_tripdata_2019-02.csv', 'yellow_tripdata/yellow_tripdata_2019-03.csv']

Unmatched data_references (0 of 0):[]

Example 5: Redundant information in the naming convention

In the following example, files are placed in a folder structure with the data_asset_name defined by the folder name (yellow_tripdata or green_tripdata), but then the term yellow_tripdata is repeated in some filenames.

<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-01.csv
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-02.csv
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-03.csv
<MY DIRECTORY>/green_tripdata/2019-01.csv
<MY DIRECTORY>/green_tripdata/2019-02.csv
<MY DIRECTORY>/green_tripdata/2019-03.csv

Then this configuration:

# YAML
datasource_yaml = r"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
module_name: great_expectations.execution_engine
class_name: PandasExecutionEngine
data_connectors:
default_inferred_data_connector_name:
class_name: InferredAssetFilesystemDataConnector
base_directory: <MY DIRECTORY>/
glob_directive: "*/*.csv"
default_regex:
group_names:
- data_asset_name
- year
- month

will not display the redundant information:

Available data_asset_names (2 of 2):
green_tripdata (3 of 3): ['green_tripdata/*2019-01.csv', 'green_tripdata/*2019-02.csv', 'green_tripdata/*2019-03.csv']
yellow_tripdata (3 of 3): ['yellow_tripdata/*2019-01.csv', 'yellow_tripdata/*2019-02.csv', 'yellow_tripdata/*2019-03.csv']

Unmatched data_references (0 of 0):[]

Additional Notes

To view the full script used in this page, see it on GitHub: