How to choose which DataConnector to use
This guide demonstrates how to choose which Data Connectors to configure within your Datasources. A Data Connector provides the configuration details, based on the source data system, that a Datasource needs in order to define Data Assets. A Datasource provides a standard API for accessing and interacting with data from a wide variety of source systems.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- An understanding of the basics of Datasources in the V3 (Batch Request) API
- Learned how to configure a Data Context using test_yaml_config
Great Expectations provides three types of DataConnector classes. Two classes are for connecting to Data Assets stored as file-system-like data (this includes files on disk, but also S3 object stores, etc.) as well as relational database data. A Data Asset is a collection of records within a Datasource, usually named based on the underlying data system and sliced to correspond to a desired specification.
- An InferredAssetDataConnector infers data_asset_name by using a regex that takes advantage of patterns that exist in the filename or folder structure.
- A ConfiguredAssetDataConnector allows the most fine-tuning, and requires an explicit listing of each Data Asset you want to connect to.
InferredAssetDataConnectors | ConfiguredAssetDataConnectors |
---|---|
InferredAssetFilesystemDataConnector | ConfiguredAssetFilesystemDataConnector |
InferredAssetFilePathDataConnector | ConfiguredAssetFilePathDataConnector |
InferredAssetAzureDataConnector | ConfiguredAssetAzureDataConnector |
InferredAssetGCSDataConnector | ConfiguredAssetGCSDataConnector |
InferredAssetS3DataConnector | ConfiguredAssetS3DataConnector |
InferredAssetSqlDataConnector | ConfiguredAssetSqlDataConnector |
InferredAssetDBFSDataConnector | ConfiguredAssetDBFSDataConnector |
InferredAssetDataConnectors and ConfiguredAssetDataConnectors are used to define Data Assets and their associated data_references. A Data Asset is an abstraction that can consist of one or more data_references to CSVs or relational database tables. For instance, you might have a yellow_tripdata Data Asset containing information about taxi rides, which consists of twelve data_references to twelve CSVs, each consisting of one month of data.
The third type of DataConnector class is for providing a Batch's data directly at runtime (a Batch is a selection of records from a Data Asset):
- A RuntimeDataConnector enables you to use a RuntimeBatchRequest to wrap an in-memory dataframe, a filepath, or a SQL query. The request must include batch identifiers that uniquely identify the data (e.g. a run_id from an Airflow DAG run).
If you know, for example, that your Pipeline Runner will already have your batch data in memory at runtime, you can choose to configure a RuntimeDataConnector with unique batch identifiers. To get started with RuntimeDataConnectors, refer to the documents How to configure a RuntimeDataConnector and How to create a Batch of data from an in-memory Spark or Pandas dataframe.
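As a rough illustration of how the batch identifiers tie a RuntimeDataConnector to a RuntimeBatchRequest, the sketch below writes both out as plain Python dictionaries. The connector name, data asset name, and run_id value are made up for this example, and the helper function is a hypothetical check written for this page, not part of Great Expectations. The dictionary keys follow the V3 API as the author of this sketch understands it; treat them as assumptions and confirm them against the RuntimeDataConnector guide.

```python
# Hypothetical RuntimeDataConnector entry for the "data_connectors" section
# of a Datasource configuration. It declares which batch identifiers
# every runtime request is expected to supply.
runtime_connector_config = {
    "class_name": "RuntimeDataConnector",
    "batch_identifiers": ["run_id"],
}

# Keyword arguments one might pass to RuntimeBatchRequest (assumed V3 API keys).
# All names and values here are placeholders for illustration only.
runtime_batch_request_kwargs = {
    "datasource_name": "taxi_datasource",
    "data_connector_name": "default_runtime_data_connector_name",
    "data_asset_name": "yellow_tripdata_at_runtime",
    "runtime_parameters": {
        "path": "<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-01.csv"
    },
    "batch_identifiers": {"run_id": "example_run_2019_01"},
}


def mismatched_batch_identifiers(connector_config, request_kwargs):
    """Return identifier names that appear on only one side (illustrative check)."""
    declared = set(connector_config["batch_identifiers"])
    supplied = set(request_kwargs["batch_identifiers"])
    return sorted(declared ^ supplied)


mismatches = mismatched_batch_identifiers(
    runtime_connector_config, runtime_batch_request_kwargs
)
print(mismatches)
```

An empty list means the request supplies exactly the identifiers the connector declares. To wrap an in-memory dataframe or a SQL query instead of a filepath, the runtime_parameters entry would use a batch_data or query key in place of path.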
If you aren't sure which of the remaining DataConnector types to use, the following examples use DataConnector classes designed to connect to files on disk, namely InferredAssetFilesystemDataConnector and ConfiguredAssetFilesystemDataConnector, to demonstrate the difference between the two.
When to use an InferredAssetDataConnector
If you have the following <MY DIRECTORY>/ directory in your filesystem, and you want to treat the yellow_tripdata_*.csv files as batches within the yellow_tripdata Data Asset, and also do the same for files in the green_tripdata directory:
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-01.csv
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-02.csv
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-03.csv
<MY DIRECTORY>/green_tripdata/2019-01.csv
<MY DIRECTORY>/green_tripdata/2019-02.csv
<MY DIRECTORY>/green_tripdata/2019-03.csv
This configuration (shown first as YAML, then as an equivalent Python dictionary):
datasource_yaml = r"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  module_name: great_expectations.execution_engine
  class_name: PandasExecutionEngine
data_connectors:
  default_inferred_data_connector_name:
    class_name: InferredAssetFilesystemDataConnector
    base_directory: <MY DIRECTORY>/
    glob_directive: "*/*.csv"
    default_regex:
      group_names:
        - data_asset_name
        - year
        - month
      pattern: (.*)/.*(\d{4})-(\d{2})\.csv
"""
datasource_config = {
    "name": "taxi_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "PandasExecutionEngine",
    },
    "data_connectors": {
        "default_inferred_data_connector_name": {
            "class_name": "InferredAssetFilesystemDataConnector",
            "base_directory": "<MY DIRECTORY>/",
            "glob_directive": "*/*.csv",
            "default_regex": {
                "group_names": [
                    "data_asset_name",
                    "year",
                    "month",
                ],
                "pattern": r"(.*)/.*(\d{4})-(\d{2})\.csv",
            },
        },
    },
}
will make available the following Data Assets and data_references:
Available data_asset_names (2 of 2):
    green_tripdata (3 of 3): ['green_tripdata/*2019-01.csv', 'green_tripdata/*2019-02.csv', 'green_tripdata/*2019-03.csv']
    yellow_tripdata (3 of 3): ['yellow_tripdata/*2019-01.csv', 'yellow_tripdata/*2019-02.csv', 'yellow_tripdata/*2019-03.csv']
Unmatched data_references (0 of 0): []
Note that the InferredAssetFilesystemDataConnector infers data_asset_names from the regex you provide. This is the key difference between InferredAssetDataConnector and ConfiguredAssetDataConnector, and it requires that one of the group_names in the default_regex configuration be data_asset_name.
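To see how the pattern in the configuration above can yield both a data_asset_name and the batch identifiers, the following sketch applies the same regex and group_names to the six example paths using only Python's standard re module. It is a simplified illustration of the idea, not the logic Great Expectations runs internally; the helper name parse_reference is made up for this example.

```python
import re

# Same pattern and group names as in the configuration above.
PATTERN = re.compile(r"(.*)/.*(\d{4})-(\d{2})\.csv")
GROUP_NAMES = ["data_asset_name", "year", "month"]

# Paths relative to base_directory, taken from the example directory listing.
relative_paths = [
    "yellow_tripdata/yellow_tripdata_2019-01.csv",
    "yellow_tripdata/yellow_tripdata_2019-02.csv",
    "yellow_tripdata/yellow_tripdata_2019-03.csv",
    "green_tripdata/2019-01.csv",
    "green_tripdata/2019-02.csv",
    "green_tripdata/2019-03.csv",
]


def parse_reference(path):
    """Map a relative path to {group_name: value}, or None if it does not match."""
    match = PATTERN.match(path)
    if match is None:
        return None
    return dict(zip(GROUP_NAMES, match.groups()))


# Group the (year, month) pairs under the asset name captured by the first group.
inferred_assets = {}
for path in relative_paths:
    parsed = parse_reference(path)
    if parsed is not None:
        key = parsed["data_asset_name"]
        inferred_assets.setdefault(key, []).append((parsed["year"], parsed["month"]))

for name, batches in sorted(inferred_assets.items()):
    print(name, batches)
```

The first capture group matches the folder name, which is why both folders become separate Data Assets without being listed anywhere in the configuration.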
The glob_directive gives the DataConnector information about the directory structure to expect for each Data Asset. The default glob_directive for the InferredAssetFilesystemDataConnector is "*", so it must be overridden when your data_references exist in subdirectories.
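The effect of the two glob patterns can be checked with the standard library alone. The sketch below builds a small temporary directory that mirrors part of the example layout and compares what "*" and "*/*.csv" return via pathlib. It only illustrates glob semantics; how Great Expectations applies the glob_directive internally is not shown here and is assumed to behave similarly.

```python
import tempfile
from pathlib import Path

# A reduced version of the example layout, relative to the base directory.
relative_paths = [
    "yellow_tripdata/yellow_tripdata_2019-01.csv",
    "yellow_tripdata/yellow_tripdata_2019-02.csv",
    "green_tripdata/2019-01.csv",
    "green_tripdata/2019-02.csv",
]

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    # Create the empty files inside their subdirectories.
    for rel in relative_paths:
        target = base / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()

    # "*" only sees the top level: the two folders, none of the CSV files.
    top_level = sorted(p.relative_to(base).as_posix() for p in base.glob("*"))

    # "*/*.csv" descends exactly one level and picks up the CSV files.
    nested = sorted(p.relative_to(base).as_posix() for p in base.glob("*/*.csv"))

print(top_level)
print(nested)
```

With the default pattern, only the folder names are found, so none of the CSV files would be available as data_references.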
When to use a ConfiguredAssetDataConnector
On the other hand, ConfiguredAssetFilesystemDataConnector requires an explicit listing of each Data Asset you want to connect to. This tends to be helpful when the naming conventions for your Data Assets are less standardized, but you have a strong understanding of the semantics governing the segmentation of the data (files, database tables).
If you have the same <MY DIRECTORY>/ directory in your filesystem:
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-01.csv
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-02.csv
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-03.csv
<MY DIRECTORY>/green_tripdata/2019-01.csv
<MY DIRECTORY>/green_tripdata/2019-02.csv
<MY DIRECTORY>/green_tripdata/2019-03.csv
Then this configuration (shown first as YAML, then as an equivalent Python dictionary):
datasource_yaml = r"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  module_name: great_expectations.execution_engine
  class_name: PandasExecutionEngine
data_connectors:
  default_configured_data_connector_name:
    class_name: ConfiguredAssetFilesystemDataConnector
    base_directory: <MY DIRECTORY>/
    assets:
      yellow_tripdata:
        base_directory: yellow_tripdata/
        pattern: yellow_tripdata_(\d{4})-(\d{2})\.csv
        group_names:
          - year
          - month
      green_tripdata:
        base_directory: green_tripdata/
        pattern: (\d{4})-(\d{2})\.csv
        group_names:
          - year
          - month
"""
datasource_config = {
    "name": "taxi_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "PandasExecutionEngine",
    },
    "data_connectors": {
        "default_configured_data_connector_name": {
            "class_name": "ConfiguredAssetFilesystemDataConnector",
            "base_directory": "<MY DIRECTORY>/",
            "assets": {
                "yellow_tripdata": {
                    "base_directory": "yellow_tripdata/",
                    "pattern": r"yellow_tripdata_(\d{4})-(\d{2})\.csv",
                    "group_names": ["year", "month"],
                },
                "green_tripdata": {
                    "base_directory": "green_tripdata/",
                    "pattern": r"(\d{4})-(\d{2})\.csv",
                    "group_names": ["year", "month"],
                },
            },
        },
    },
}
will make available the following Data Assets and data_references:
Available data_asset_names (2 of 2):
    green_tripdata (3 of 3): ['2019-01.csv', '2019-02.csv', '2019-03.csv']
    yellow_tripdata (3 of 3): ['yellow_tripdata_2019-01.csv', 'yellow_tripdata_2019-02.csv', 'yellow_tripdata_2019-03.csv']
Unmatched data_references (0 of 0): []
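The per-asset patterns in the configuration above can be exercised in isolation. The sketch below, again using only the standard re module, applies each asset's own pattern and group_names to the filenames in that asset's directory. The asset names come from the explicit listing, not from a capture group. This is an illustration of the matching idea under that assumption, not Great Expectations' implementation.

```python
import re

# Per-asset settings copied from the "assets" section of the configuration above.
assets = {
    "yellow_tripdata": {
        "pattern": r"yellow_tripdata_(\d{4})-(\d{2})\.csv",
        "group_names": ["year", "month"],
    },
    "green_tripdata": {
        "pattern": r"(\d{4})-(\d{2})\.csv",
        "group_names": ["year", "month"],
    },
}

# Filenames found in each asset's own base_directory in the example layout.
files_by_asset = {
    "yellow_tripdata": [
        "yellow_tripdata_2019-01.csv",
        "yellow_tripdata_2019-02.csv",
        "yellow_tripdata_2019-03.csv",
    ],
    "green_tripdata": ["2019-01.csv", "2019-02.csv", "2019-03.csv"],
}

# Match every filename against the pattern of the asset it belongs to.
batches_by_asset = {}
for asset_name, settings in assets.items():
    regex = re.compile(settings["pattern"])
    batches_by_asset[asset_name] = []
    for filename in files_by_asset[asset_name]:
        match = regex.fullmatch(filename)
        if match is not None:
            identifiers = dict(zip(settings["group_names"], match.groups()))
            batches_by_asset[asset_name].append(identifiers)

for asset_name, batches in batches_by_asset.items():
    print(asset_name, batches)
```

Because each asset carries its own pattern, the two folders can follow different naming conventions, which is the case the configured connector is meant to handle.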
Additional Notes
- Additional examples and configurations for ConfiguredAssetFilesystemDataConnectors can be found here: How to configure a ConfiguredAssetDataConnector
- Additional examples and configurations for InferredAssetFilesystemDataConnectors can be found here: How to configure an InferredAssetDataConnector
- Additional examples and configurations for RuntimeDataConnectors can be found here: How to configure a RuntimeDataConnector
To view the full script used in this page, see it on GitHub: