How to configure a Pandas Datasource
This guide will walk you through the process of configuring a Pandas Datasource from scratch, verifying that your configuration is valid, and adding it to your Data Context. By the end of this guide you will have a Pandas Datasource which you can use in future workflows for creating Expectations and Validating data.
Steps
1. Import necessary modules and initialize your Data Context
from ruamel import yaml
import great_expectations as gx
data_context: gx.DataContext = gx.get_context()
The great_expectations module will give you access to your Data Context, which is the entry point for working with a Great Expectations project.
The yaml module from ruamel will be used to validate your Datasource's configuration. Great Expectations uses a Python dictionary representation of your Datasource configuration when you add the Datasource to your Data Context. However, because Great Expectations saves configurations as yaml files, you will first need to convert your configuration from a Python dictionary to a yaml string when you validate it.
The Data Context initialized by get_context() will be the Data Context defined in your current working directory. It provides convenience methods that we will use to validate your Datasource configuration and to add your Datasource to your Great Expectations project once you have configured it.
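If you would like to see what that dictionary-to-yaml conversion looks like, here is a minimal sketch using only the ruamel import shown above (the partial_config dictionary is an illustrative placeholder, not part of the guide's configuration):
from ruamel import yaml

partial_config: dict = {"name": "my_datasource_name"}

# yaml.dump converts the Python dictionary into a yaml string,
# which is the form that test_yaml_config expects later in this guide.
config_as_yaml: str = yaml.dump(partial_config)
print(config_as_yaml)  # -> "name: my_datasource_name"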
2. Create a new Datasource configuration.
A new Datasource can be configured in Python as a dictionary with a specific set of keys. We will build our Datasource configuration from scratch in this guide, although you can just as easily modify an existing one.
To start, create an empty dictionary. You will be populating it with keys as you go forward.
At this point, the configuration for your Datasource is merely:
datasource_config: dict = {}
However, from this humble beginning you will be able to build a full Datasource configuration.
The keys needed for your Datasource configuration
At the top level, your Datasource's configuration will need the following keys:
- name: The name of the Datasource, which will be used to reference the Datasource in Batch Requests.
- class_name: The name of the Python class instantiated by the Datasource. Typically, this will be the Datasource class.
- module_name: The name of the module that contains the Class definition indicated by class_name.
- execution_engine: A dictionary containing the class_name and module_name of the Execution Engine instantiated by the Datasource.
- data_connectors: The configurations for any Data Connectors and their associated Data Assets that you want to have available when utilizing the Datasource.
In the following steps we will add those keys and their corresponding values to your currently empty Datasource configuration dictionary.
3. Name your Datasource
The first key that you will need to define for your new Datasource is its name. You will use this to reference the Datasource in future workflows. It can be anything you want it to be, but ideally you will name it something relevant to the data that it interacts with.
For the purposes of this example, we will name this Datasource:
"name": "my_datasource_name", # Preferably name it something relevant
You should, however, name your Datasource something more relevant to your data.
At this point, your configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
}
4. Specify the Datasource class and module
The class_name and module_name for your Datasource will almost always indicate the Datasource class found at great_expectations.datasource. You may replace this with a specialized subclass or a custom class, but for almost all regular purposes these two default values will suffice. For the purposes of this guide, add those two values to their corresponding keys.
"class_name": "Datasource",
"module_name": "great_expectations.datasource"
Your full configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
}
5. Add the Pandas Execution Engine to your Datasource configuration
Your Execution Engine is where you will specify that you want this Datasource to use Pandas as its backend. As with the Datasource's top level configuration, you will need to provide the class_name and module_name that indicate the class definition and containing module for the Execution Engine that you will use.
For the purposes of this guide, these will consist of the PandasExecutionEngine found at great_expectations.execution_engine. The execution_engine key and its corresponding value will therefore look like this:
"execution_engine": {
"class_name": "PandasExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
After adding the above snippet to your Datasource configuration, your full configuration dictionary should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "PandasExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
}
6. Add a dictionary as the value of the data_connectors key
The data_connectors key should have a dictionary as its value. Each key/value pair in this dictionary will correspond to a Data Connector's name and configuration, respectively.
The keys in the data_connectors dictionary will be the names of the Data Connectors, which you will use to indicate which Data Connector to use in future workflows. As with the value of your Datasource's name key, you can use any value you want for a Data Connector's name. Ideally, you will use something relevant to the data that each particular Data Connector will provide; the only significant difference is that a Data Connector's name is its key in the data_connectors dictionary.
The values for each of your data_connectors keys will be the Data Connector configurations that correspond to each Data Connector's name. You may define multiple Data Connectors in the data_connectors dictionary by including multiple key/value pairs.
For now, start by adding an empty dictionary as the value of the data_connectors key. We will begin populating it with Data Connector configurations in the next step.
Your current configuration should look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "PandasExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {},
}
7. Configure your individual Data Connectors
For each Data Connector configuration, you will need to specify which type of Data Connector you will be using. When using Pandas to work with data in a file system, the most likely candidates are the InferredAssetFilesystemDataConnector, the ConfiguredAssetFilesystemDataConnector, and the RuntimeDataConnector.
If you are working with Pandas but not with a file system, please see our cloud-specific guides for more information.
If you are uncertain which Data Connector best suits your needs, please refer to our guide on how to choose which Data Connector to use.
Data Connector example configurations:
- InferredAssetFilesystemDataConnector
- ConfiguredAssetDataConnector
- RuntimeDataConnector
The InferredAssetDataConnector is ideal for:
- quickly setting up a Datasource and getting access to data
- diving straight in to working with Great Expectations
- initial data discovery and introspection
However, the InferredAssetDataConnector allows less control over the definitions of your Data Assets than the ConfiguredAssetDataConnector provides. If you are at the point of building a repeatable workflow, we encourage using the ConfiguredAssetDataConnector instead.
Remember, the key that you provide for each Data Connector configuration dictionary will be used as the name of the Data Connector. For this example we will use the name name_of_my_inferred_data_connector, but you may name it anything you like.
At this point, your configuration should look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "PandasExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {"name_of_my_inferred_data_connector": {}},
}
When defining an InferredAssetFilesystemDataConnector you will need to provide values for four keys in the Data Connector's configuration dictionary (the currently empty dictionary that corresponds to "name_of_my_inferred_data_connector" in the example above). These key/value pairs consist of:
- class_name: The name of the Class that will be instantiated for this DataConnector.
- base_directory: The string representation of the directory that contains your filesystem data.
- default_regex: A dictionary that describes how the data should be grouped into Batches.
- batch_spec_passthrough: A dictionary of values that are passed to the Execution Engine's backend.
Additionally, you may optionally choose to define:
- glob_directive: A glob-style wildcard pattern that can be used to access source data files contained in subfolders of your base_directory. If this is not defined, the default value of * will cause your Data Connector to only look at files in the base_directory itself.
For this example, you will be using the InferredAssetFilesystemDataConnector as your class_name. This is a subclass of the InferredAssetDataConnector that is specialized to support filesystem Execution Engines, such as the PandasExecutionEngine. This key/value entry will therefore look like:
"class_name": "InferredAssetFilesystemDataConnector",
Because we are using one of Great Expectations' built-in Data Connectors, an entry for module_name along with a default value will be provided when this Data Connector is initialized. However, if you want to use a custom Data Connector, you will need to explicitly add a module_name key alongside the class_name key. The value for module_name would then be set to the import path of the module containing your custom Data Connector, in the same fashion as you would provide class_name and module_name for a custom Datasource or Execution Engine.
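For instance, a custom Data Connector entry might look like the following sketch. The class and module names here are purely hypothetical placeholders, not classes that ship with Great Expectations:
"name_of_my_custom_data_connector": {
    "class_name": "MyCustomDataConnector",           # hypothetical custom class
    "module_name": "my_package.my_data_connectors",  # hypothetical import path to that class
    # ... the rest of the Data Connector's configuration ...
},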
For the base directory, you will want to put the relative path of your data from the folder that contains your Data Context. In this example we will use the same path that was used in the Getting Started Tutorial, Step 2: Connect to Data. Since we are manually entering this value rather than letting the CLI generate it, the key/value pair will look like:
"base_directory": "../data",
With these values added, along with blank dictionaries for default_regex (which we will define in the next step) and batch_spec_passthrough, your full configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "PandasExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_inferred_data_connector": {
"class_name": "InferredAssetFilesystemDataConnector",
"base_directory": "../data",
"default_regex": {},
"batch_spec_passthrough": {},
}
},
}
glob_directive
The glob_directive parameter is provided to give the DataConnector information about the directory structure to expect when identifying source data files to check against each Data Asset's default_regex. If you do not specify a value for glob_directive, a default value of "*" will be used. This will cause your Data Asset to check all files in the folder specified by base_directory to determine which should be returned as Batches for the Data Asset, but will ignore any files in subdirectories.
Overriding the glob_directive by providing your own value will allow your Data Connector to traverse subdirectories or otherwise alter which source data files are compared against your Data Connector's default_regex.
For example, assume your source data is in files contained by subdirectories of your base_directory, like so:
- 2019/yellow_taxidata_2019_01.csv
- 2020/yellow_taxidata_2020_01.csv
- 2021/yellow_taxidata_2021_01.csv
- 2022/yellow_taxidata_2022_01.csv
To include all of these files, you would need to tell the Data Connector to look for files that are nested one level deeper than the base_directory itself.
You would do this by setting the glob_directive key in your Data Connector config to a value of "*/*". This value will cause the Data Connector to look for regex matches against the file names of all files found in any subfolder of your base_directory. Such an entry would look like:
"glob_directive": "*/*"
The glob_directive parameter uses glob-style wildcard matching. You can also use it to limit which files will be compared against the Data Connector's default_regex for a match. For example, to only permit .csv files to be checked for a match, you could specify the glob_directive as "*.csv". To only check for matches against the .csv files in subdirectories, you would use the value "*/*.csv", and so forth.
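As a rough illustration of how these patterns behave, here is a small sketch using Python's standard glob module and the ../data path from this guide (the subdirectory layout is the assumed example above):
import glob
import os

base_directory = "../data"  # relative path used in this guide

# Equivalent of the default glob_directive "*": files directly inside base_directory only
print(glob.glob(os.path.join(base_directory, "*")))

# Equivalent of "*/*": files nested one level down, e.g. ../data/2020/yellow_taxidata_2020_01.csv
print(glob.glob(os.path.join(base_directory, "*/*")))

# Equivalent of "*/*.csv": only .csv files in first-level subdirectories
print(glob.glob(os.path.join(base_directory, "*/*.csv")))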
In this guide's examples, all of our data is assumed to be in the base_directory folder. Therefore, you will not need to add an entry for glob_directive to your configuration. However, if you were to include the example glob_directive from above, your full configuration would currently look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "PandasExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_inferred_data_connector": {
"class_name": "InferredAssetFilesystemDataConnector",
"base_directory": "../data",
"glob_directive": "*.*",
"default_regex": {},
}
},
}
A ConfiguredAssetDataConnector enables the most fine-tuning, allowing you to easily work with multiple Batches. It also requires an explicit listing of each Data Asset you connect to and of how Batches are defined within that Data Asset, which makes it very clear which Data Assets are being provided when you reference the Data Connector in Profilers, Batch Requests, or Checkpoints.
Remember, the key that you provide for each Data Connector configuration dictionary will be used as the name of the Data Connector. For this example we will use the name name_of_my_configured_data_connector, but you may name it anything you like.
At this point, your configuration should look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "PandasExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {"name_of_my_configured_data_connector": {}},
}
When defining a ConfiguredAssetFilesystemDataConnector you will need to provide values for four keys in the Data Connector's configuration dictionary (the currently empty dictionary that corresponds to "name_of_my_configured_data_connector" in the example above). These key/value pairs consist of:
- class_name: The name of the Class that will be instantiated for this DataConnector.
- base_directory: The string representation of the directory that contains your filesystem data.
- assets: A dictionary in which each entry explicitly defines a Data Asset and describes how its data should be grouped into Batches.
- batch_spec_passthrough: A dictionary of values that are passed to the Execution Engine's backend.
For this example, you will be using the ConfiguredAssetFilesystemDataConnector as your class_name. This is a subclass of the ConfiguredAssetDataConnector that is specialized to support filesystem Execution Engines, such as the PandasExecutionEngine. This key/value entry will therefore look like:
"class_name": "ConfiguredAssetFilesystemDataConnector",
Because we are using one of Great Expectations' built-in Data Connectors, an entry for module_name along with a default value will be provided when this Data Connector is initialized. However, if you want to use a custom Data Connector, you will need to explicitly add a module_name key alongside the class_name key. The value for module_name would then be set to the import path of the module containing your custom Data Connector, in the same fashion as you would provide class_name and module_name for a custom Datasource or Execution Engine.
For the base directory, you will want to put the relative path of your data from the folder that contains your Data Context. In this example we will use the same path that was used in the Getting Started Tutorial, Step 2: Connect to Data. Since we are manually entering this value rather than letting the CLI generate it, the key/value pair will look like:
"base_directory": "../data",
With these values added, along with blank dictionaries for assets and batch_spec_passthrough, your full configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "PandasExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_configured_data_connector": {
"class_name": "ConfiguredAssetFilesystemDataConnector",
"base_directory": "../data",
"assets": {},
"batch_spec_passthrough": {},
}
},
}
A RuntimeDataConnector is used to connect to an in-memory dataframe or path. The dataframe or path is therefore passed to the RuntimeDataConnector as part of a Batch Request, rather than being a static part of the RuntimeDataConnector's configuration.
A Runtime Data Connector will always return only one Batch of data: the current data that was passed in or specified as part of a Batch Request. This means that a RuntimeDataConnector does not define Data Assets the way an InferredAssetDataConnector or a ConfiguredAssetDataConnector would.
Instead, a Runtime Data Connector's configuration provides a way for you to attach identifying values to a returned Batch of data, so that the data as it was at the time it was returned can be referred to again in the future.
For more information on configuring a Batch Request for a Pandas Runtime Data Connector, please see our guide on how to create a Batch of data from an in-memory Spark or Pandas dataframe or path.
Remember, the key that you provide for each Data Connector configuration dictionary will be used as the name of the Data Connector. For this example we will use the name name_of_my_runtime_data_connector, but you may name it anything you like.
At this point, your configuration should look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "PandasExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {"name_of_my_runtime_data_connector": {}},
}
When defining a RuntimeDataConnector you will need to provide values for two keys in the Data Connector's configuration dictionary (the currently empty dictionary that corresponds to "name_of_my_runtime_data_connector" in the example above). These key/value pairs consist of:
- class_name: The name of the Class that will be instantiated for this DataConnector.
- batch_identifiers: A list of strings that will be used as keys for the identifying metadata that the user provides with the returned Batch.
For this example, you will be using the RuntimeDataConnector as your class_name. This key/value entry will therefore look like:
"class_name": "RuntimeDataConnector",
After including an empty list for your batch_identifiers and an empty dictionary for batch_spec_passthrough, your full configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "PandasExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_runtime_data_connector": {
"class_name": "RuntimeDataConnector",
"batch_spec_passthrough": {},
"batch_identifiers": [],
}
},
}
Because we are using one of Great Expectations' built-in Data Connectors, an entry for module_name along with a default value will be provided when this Data Connector is initialized. However, if you want to use a custom Data Connector, you will need to explicitly add a module_name key alongside the class_name key. The value for module_name would then be set to the import path of the module containing your custom Data Connector, in the same fashion as you would provide class_name and module_name for a custom Datasource or Execution Engine.
8. Configure the values for batch_spec_passthrough
The batch_spec_passthrough parameter is used to access some native capabilities of your Execution Engine. If you do not specify it, your Execution Engine will attempt to determine the values based on file extensions and defaults. If you do define it, it will contain two keys: reader_method and reader_options. These will correspond to a string and a dictionary, respectively.
"batch_spec_passthrough": {
"reader_method": "",
"reader_options": {},
Configuring your reader_method:
The reader_method is used to specify which of Pandas' read_* methods (for example, pandas.read_csv) will be used to read your data. For our example, we are using .csv files as our source data, so we will specify csv as our reader_method, like so:
"reader_method": "csv",
Configuring your reader_options:
Start by adding a blank dictionary as the value of the reader_options parameter. This dictionary will hold two key/value pairs: header and inferSchema.
"reader_options": {
"header": "",
"inferSchema": "",
},
The first key is header, and the value should be either True or False. This will indicate to the Data Connector whether or not the first row of each source data file is a header row. For our example, we will set this to True.
"header": True,
The second key to include is inferSchema. Again, the value should be either True or False. This will indicate to the Data Connector whether or not the Execution Engine should attempt to infer the data type contained by each column in the source data files. Again, we will set this to True for the purposes of this guide's example.
"inferSchema": True,
inferSchema will read datetime columns in as text columns.
At this point, your batch_spec_passthrough configuration should look like:
"batch_spec_passthrough": {
"reader_method": "csv",
"reader_options": {
"header": True,
"inferSchema": True,
},
},
And your full configuration will look like:
- InferredAssetFilesystemDataConnector
- ConfiguredAssetDataConnector
- RuntimeDataConnector
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "PandasExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_inferred_data_connector": {
"class_name": "InferredAssetFilesystemDataConnector",
"base_directory": "../data",
"default_regex": {},
"batch_spec_passthrough": {
"reader_method": "csv",
"reader_options": {
"header": True,
"inferSchema": True,
},
},
}
},
}
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "PandasExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_configured_data_connector": {
"class_name": "ConfiguredAssetFilesystemDataConnector",
"base_directory": "../data",
"batch_spec_passthrough": {
"reader_method": "csv",
"reader_options": {
"header": True,
"inferSchema": True,
},
},
}
},
}
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "PandasExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_runtime_data_connector": {
"class_name": "RuntimeDataConnector",
"batch_spec_passthrough": {
"reader_method": "csv",
"reader_options": {
"header": True,
"inferSchema": True,
},
},
"batch_identifiers": [],
}
},
}
9. Configure your Data Connector's Data Assets
- InferredAssetFilesystemDataConnector
- ConfiguredAssetDataConnector
- RuntimeDataConnector
In an Inferred Asset Data Connector for filesystem data, a regular expression is used to group the files into Batches for a Data Asset. This is done with the value we will define for the Data Connector's default_regex key. The value for this key will consist of a dictionary that contains two values:
- pattern: This is the regex pattern that will define your Data Asset's potential Batch or Batches.
- group_names: This is a list of names that correspond to the groups you defined in pattern's regular expression.
The pattern in default_regex will be matched against the files in your base_directory, and everything that matches against the first group in your regex will become a Batch in a Data Asset that possesses the name of the matching text. Any files that have a matching string for the first group will become Batches in the same Data Asset.
This means that when configuring your Data Connector's regular expression, you have the option to implement it so that the Data Connector is only capable of returning a single Batch per Data Asset, or so that it is capable of returning multiple Batches grouped into individual Data Assets. Each type of configuration is useful in certain cases, so we will provide examples of both.
If you are uncertain as to which type of configuration is best for your use case, please refer to our guide on how to choose between working with a single or multiple Batches of data.
- Single Batch Configuration
- Multi-Batch Configuration
Because of the simple regex matching that groups files into Batches for a given Data Asset, it is actually quite straightforward to create a Data Connector whose Data Assets are only capable of providing a single Batch. All you need to do is define a regular expression with a single group that corresponds to a portion of your data files' names that is unique for each file.
The simplest way to do this is to define a group that consists of the entire file name.
For this example, let's assume we have the following files in our data directory:
yellow_tripdata_sample_2020-01.csv
yellow_tripdata_sample_2020-02.csv
yellow_tripdata_sample_2020-03.csv
In this case you could define the pattern key as follows:
"pattern": "(.*)\\.csv",
This regex will match the full name of any file that has the .csv extension, and will put everything prior to the .csv extension into a group.
Since each .csv file will necessarily have a unique name preceding its extension, the content that matches this pattern will be unique for each file. This will ensure that only one file is included as a Batch for each Data Asset.
To correspond to the single group that was defined in your regex, you will define a single entry in the list for the group_names key. Since the first group in an Inferred Asset Data Connector is used to generate names for the inferred Data Assets, you should name that group as follows:
"group_names": ["data_asset_name"],
Looking back at our sample files, this regex will result in the InferredAssetFilesystemDataConnector providing three Data Assets, which can be accessed by the portion of the file name that matches the first group in our regex. In future workflows you will be able to refer to one of these Data Assets in a Batch Request by providing one of the following data_asset_names:
yellow_tripdata_sample_2020-01
yellow_tripdata_sample_2020-02
yellow_tripdata_sample_2020-03
Since we did not include .csv in the first group of the regex we defined, the .csv portion of the filename will be dropped from the value that is recognized as a valid data_asset_name.
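If you want to sanity-check a pattern before committing it to your configuration, a small sketch with Python's re module (using the sample file names above) shows how the first group becomes the inferred data_asset_name:
import re

pattern = r"(.*)\.csv"
files = [
    "yellow_tripdata_sample_2020-01.csv",
    "yellow_tripdata_sample_2020-02.csv",
    "yellow_tripdata_sample_2020-03.csv",
]

for file_name in files:
    match = re.match(pattern, file_name)
    if match:
        # The single group becomes the inferred data_asset_name
        print(match.group(1))
# yellow_tripdata_sample_2020-01
# yellow_tripdata_sample_2020-02
# yellow_tripdata_sample_2020-03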
With all of these values put together into a single dictionary, your Data Connector configuration will look like this:
"name_of_my_inferred_data_connector": {
"class_name": "InferredAssetFilesystemDataConnector",
"base_directory": "../data",
"default_regex": {
"pattern": "(.*)\\.csv",
"group_names": ["data_asset_name"],
},
}
And the full configuration for your Datasource should look like:
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "PandasExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_inferred_data_connector": {
"class_name": "InferredAssetFilesystemDataConnector",
"base_directory": "../data",
"default_regex": {
"pattern": "(.*)\\.csv",
"group_names": ["data_asset_name"],
},
"batch_spec_passthrough": {
"reader_method": "csv",
"reader_options": {
"header": True,
"inferSchema": True,
},
},
}
},
}
Configuring an InferredAssetFilesystemDataConnector so that its Data Assets are capable of returning more than one Batch is just a matter of defining an appropriate regular expression. For this kind of configuration, the regular expression you define should have two or more groups.
The first group will be treated as the Data Asset's name. It should be a portion of your file names that occurs in more than one file. The files that match this portion of the regular expression will be grouped together as a single Data Asset.
Any additional groups that you include in your regular expression will be used to identify specific Batches among those that are grouped together in each Data Asset.
For this example, let's assume you have the following files in your data directory:
yellow_tripdata_sample_2020-01.csv
yellow_tripdata_sample_2020-02.csv
yellow_tripdata_sample_2020-03.csv
You can configure a Data Asset that groups these files together and differentiates each Batch by month by defining a pattern in the dictionary for the default_regex key:
"pattern": "(yellow_tripdata_sample_2020)-(\\d.*)\\.csv",
This regex will group together all files that match the content of the first group as a single Data Asset. Since the first group does not include any special regex characters, this means that all of the .csv files that start with "yellow_tripdata_sample_2020" will be combined into one Data Asset, and that all other files will be ignored.
The second defined group consists of the numeric characters after the last dash in a file name and prior to the .csv extension. Specifying a value for that group in your future Batch Requests will allow you to request a specific Batch from the Data Asset.
Since you have defined two groups in your regex, you will need to provide two corresponding group names in your group_names key. Since the first group in an Inferred Asset Data Connector is used to generate the names for the inferred Data Assets provided by the Data Connector, and the second group you defined corresponds to the month of data that each file contains, you should name those groups as follows:
"group_names": ["data_asset_name", "month"],
Looking back at our sample files, this regex will result in the InferredAssetFilesystemDataConnector providing a single Data Asset, which will contain three Batches. In future workflows you will be able to refer to a specific Batch in this Data Asset in a Batch Request by providing the data_asset_name of "yellow_tripdata_sample_2020" and one of the following month values:
01
02
03
Any characters that are not included in a group when you define your regex will still be checked for when determining if a file name "matches" the regular expression. However, those characters will not be included in any of the Batch Identifiers, which is why the - and .csv portions of the filenames are not found in either the data_asset_name or month values.
For more information on the special characters and mechanics of matching and grouping strings with regular expressions, please see the Python documentation on the re module.
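To see how this two-group pattern carves up the sample file names, here is a small sketch with the same re module referenced above:
import re

pattern = r"(yellow_tripdata_sample_2020)-(\d.*)\.csv"
files = [
    "yellow_tripdata_sample_2020-01.csv",
    "yellow_tripdata_sample_2020-02.csv",
    "yellow_tripdata_sample_2020-03.csv",
]

for file_name in files:
    match = re.match(pattern, file_name)
    if match:
        data_asset_name, month = match.groups()
        # All three files share the same data_asset_name; month identifies the Batch
        print(data_asset_name, month)
# yellow_tripdata_sample_2020 01
# yellow_tripdata_sample_2020 02
# yellow_tripdata_sample_2020 03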
With all of these values put together into a single dictionary, your Data Connector configuration will look like this:
"name_of_my_inferred_data_connector": {
"class_name": "InferredAssetFilesystemDataConnector",
"base_directory": "../data",
"default_regex": {
"pattern": "(yellow_tripdata_sample_2020)-(\\d.*)\\.csv",
"group_names": ["data_asset_name", "month"],
},
}
And the full configuration for your Datasource should look like:
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "PandasExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_inferred_data_connector": {
"class_name": "InferredAssetFilesystemDataConnector",
"base_directory": "../data",
"default_regex": {
"pattern": "(yellow_tripdata_sample_2020)-(\\d.*)\\.csv",
"group_names": ["data_asset_name", "month"],
},
"batch_spec_passthrough": {
"reader_method": "csv",
"reader_options": {
"header": True,
"inferSchema": True,
},
},
}
},
}
In a Configured Asset Data Connector for filesystem data, each entry in the assets dictionary will correspond to an explicitly defined Data Asset. The key provided will be used as the name of the Data Asset, while the value will be a dictionary that contains two additional keys:
- pattern: This is the regex pattern that will define your Data Asset's potential Batch or Batches.
- group_names: This is a list of names that correspond to the groups you defined in pattern's regular expression.
The pattern in each assets entry will be matched against the files in your base_directory, and everything that matches against the pattern's value will become a Batch in a Data Asset with a name matching the key for this entry in the assets dictionary.
This means that when configuring your Data Connector's regular expression, you have the option to implement it so that the Data Connector is only capable of returning a single Batch per Data Asset, or so that it is capable of returning multiple Batches grouped into individual Data Assets. Each type of configuration is useful in certain cases, so we will provide examples of both.
If you are uncertain as to which type of configuration is best for your use case, please refer to our guide on how to choose between working with a single or multiple Batches of data.
- Single Batch Configuration
- Multi-Batch Configuration
Because you are explicitly defining each Data Asset in a ConfiguredAssetDataConnector, it is very easy to define one that will only have one Batch.
The simplest way to do this is to define a Data Asset with a pattern value that does not contain any regex special characters which would match on more than one value.
For this example, let's assume we have the following files in our data directory:
yellow_tripdata_sample_2020-01.csv
yellow_tripdata_sample_2020-02.csv
yellow_tripdata_sample_2020-03.csv
In this case, we want to define a single Data Asset for each month. To do so, we will need one entry in the assets dictionary for each of those Data Assets.
Let's walk through the creation of the Data Asset for January's data.
First, you need to add an empty dictionary entry to the assets dictionary. Since the key you associate with this entry will be treated as the Data Asset's name, go ahead and name it yellow_tripdata_jan.
At this point, your entry in the assets dictionary will look like:
"yellow_tripdata_jan": {}
Next, you will need to define the pattern value and group_names value for this Data Asset.
Since you want this Data Asset to only match the file yellow_tripdata_sample_2020-01.csv, the value for the pattern key should be one that does not contain any regex special characters that can match on more than one value. An example follows:
"pattern": "yellow_tripdata_sample_2020-(01)\\.csv",
The pattern we defined contains a regex group, even though we logically don't need a group to identify the desired Batch in a Data Asset that can only return one Batch. This is because Great Expectations currently does not permit pattern to be defined without also having group_names defined. Thus, in the example above you are creating a group that corresponds to 01 so that there is a valid group to associate a group_names entry with.
Since none of the characters in this regex can possibly match more than one value, the only file that can possibly be matched is the one you want it to match: yellow_tripdata_sample_2020-01.csv. This Batch will also be associated with the Batch Identifier 01, but you won't need to use that to specify the Batch in a Batch Request, as it is the only Batch that this Data Asset is capable of returning.
To correspond to the single group that was defined in your regex, you will define a single entry in the list for the group_names key. Since the assets dictionary key is used for this Data Asset's name, you can give this group a name relevant to what it is matching on:
"group_names": ["month"],
Put entirely together, your assets entry will look like:
"yellow_tripdata_jan": {
"pattern": "yellow_tripdata_sample_2020-(01)\\.csv",
"group_names": ["month"],
}
Looking back at our sample files, this entry will result in the ConfiguredAssetFilesystemDataConnector providing one Data Asset, which can be accessed by the name yellow_tripdata_jan. In future workflows you will be able to refer to this Data Asset and its single corresponding Batch by providing that name.
With all of these values put together into a single dictionary, your Data Connector configuration will look like this:
"name_of_my_configured_data_connector": {
"class_name": "ConfiguredAssetFilesystemDataConnector",
"base_directory": "../data",
"assets": {
"yellow_tripdata_jan": {
"pattern": "yellow_tripdata_sample_2020-(01)\\.csv",
"group_names": ["month"],
}
},
}
And the full configuration for your Datasource should look like:
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "PandasExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_configured_data_connector": {
"class_name": "ConfiguredAssetFilesystemDataConnector",
"base_directory": "../data",
"assets": {
"yellow_tripdata_jan": {
"pattern": "yellow_tripdata_sample_2020-(01)\\.csv",
"group_names": ["month"],
}
},
"batch_spec_passthrough": {
"reader_method": "csv",
"reader_options": {
"header": True,
"inferSchema": True,
},
},
}
},
}
Because Configured Data Assets require that you explicitly define each Data Asset they provide access to, you will have to add assets entries for February and March if you also want to access yellow_tripdata_sample_2020-02.csv and yellow_tripdata_sample_2020-03.csv in the same way.
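For example, following the same pattern as the January entry, those additional assets entries might look like this (the Data Asset names are illustrative; choose whatever names suit your data):
"yellow_tripdata_feb": {
    "pattern": "yellow_tripdata_sample_2020-(02)\\.csv",
    "group_names": ["month"],
},
"yellow_tripdata_mar": {
    "pattern": "yellow_tripdata_sample_2020-(03)\\.csv",
    "group_names": ["month"],
},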
Configuring a ConfiguredAssetFilesystemDataConnector so that its Data Assets are capable of returning more than one Batch is just a matter of defining an appropriate regular expression. For this kind of configuration, the regular expression you define should include at least one group that contains regular expression special characters capable of matching more than one value.
For this example, let's assume we have the following files in our data directory:
yellow_tripdata_sample_2020-01.csv
yellow_tripdata_sample_2020-02.csv
yellow_tripdata_sample_2020-03.csv
In this case, we want to define a Data Asset that contains all of our data for the year 2020.
First, you need to add an empty dictionary entry to the assets dictionary. Since the key you associate with this entry will be treated as the Data Asset's name, go ahead and name it yellow_tripdata_2020.
At this point, your entry in the assets dictionary will look like:
"yellow_tripdata_2020": {}
Next, you will need to define the pattern value and group_names value for this Data Asset.
Since you want this Data Asset to include all of the 2020 files, the value for pattern needs to be a regular expression that is capable of matching all of those files. To do this, we will need to use regular expression special characters that are capable of matching more than one value.
Looking back at the files in our data directory, you can see that each file differs from the others only in the digits indicating the month of the file. Therefore, the regular expression we create will separate those specific characters into a group, and will define the content of that group using special characters capable of matching on any value, like so:
"pattern": "yellow_tripdata_sample_2020-(.*)\\.csv",
To correspond to the single group that was defined in your pattern, you will define a single entry in the list for the group_names key. Since the assets dictionary key is used for this Data Asset's name, you can give this group a name relevant to what it is matching on:
"group_names": ["month"],
Since the group in the above regular expression will match on any characters, this regex will successfully match each of the file names in our data directory, and will associate each file with a month identifier that corresponds to the file's grouped characters:
- yellow_tripdata_sample_2020-01.csv will be the Batch identified by a month value of 01
- yellow_tripdata_sample_2020-02.csv will be the Batch identified by a month value of 02
- yellow_tripdata_sample_2020-03.csv will be the Batch identified by a month value of 03
Put entirely together, your assets entry will look like:
"yellow_tripdata_2020": {
"pattern": "yellow_tripdata_sample_2020-(.*)\\.csv",
"group_names": ["month"],
}
Looking back at our sample files, this entry will result in the ConfiguredAssetFilesystemDataConnector providing one Data Asset, which can be accessed by the name yellow_tripdata_2020. In future workflows you will be able to refer to this Data Asset by providing that name, and to a specific Batch within it by providing your Batch Request with a batch_identifier entry that uses the key month and a value corresponding to the month portion of the filename for the Batch in question.
For more information on the special characters and mechanics of matching and grouping strings with regular expressions, please see the Python documentation on the re module.
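Once this Datasource has been added to your Data Context, requesting the February Batch of this Data Asset might look roughly like the following sketch. This assumes the V3 BatchRequest API and the names used in this guide:
from great_expectations.core.batch import BatchRequest

# Request only the Batch whose "month" Batch Identifier is "02"
batch_request = BatchRequest(
    datasource_name="my_datasource_name",
    data_connector_name="name_of_my_configured_data_connector",
    data_asset_name="yellow_tripdata_2020",
    data_connector_query={"batch_filter_parameters": {"month": "02"}},
)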
With all of these values put together into a single dictionary, your Data Connector configuration will look like this:
"name_of_my_configured_data_connector": {
"class_name": "ConfiguredAssetFilesystemDataConnector",
"base_directory": "../data",
"assets": {
"yellow_tripdata_2020": {
"pattern": "yellow_tripdata_sample_2020-(.*)\\.csv",
"group_names": ["month"],
}
},
}
And the full configuration for your Datasource should look like:
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "PandasExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_configured_data_connector": {
"class_name": "ConfiguredAssetFilesystemDataConnector",
"base_directory": "../data",
"assets": {
"yellow_tripdata_2020": {
"pattern": "yellow_tripdata_sample_2020-(.*)\\.csv",
"group_names": ["month"],
}
},
"batch_spec_passthrough": {
"reader_method": "csv",
"reader_options": {
"header": True,
"inferSchema": True,
},
},
}
},
}
Remember that when you are working with a Configured Asset Data Connector you need to explicitly define each of your Data Assets. So, if you want to add additional Data Assets, go ahead and repeat the process of defining an entry in your configuration's assets dictionary for each one.
Runtime Data Connectors put a wrapper around a single Batch of data, and therefore do not support Data Asset configurations that permit the return of more than one Batch of data. In fact, since you will use a Batch Request to pass in or specify the data that a Runtime Data Connector uses, there is no need to specify a Data Asset configuration at all.
Instead, you will provide a batch_identifiers list, which will be used to attach identifying information to a returned Batch so that you can reference the same data again in the future.
For this example, let's assume we have the following files in our data directory:
yellow_tripdata_sample_2020-01.csv
yellow_tripdata_sample_2020-02.csv
yellow_tripdata_sample_2020-03.csv
With a Runtime Data Connector you won't actually refer to them in your configuration! As mentioned above, you will provide the path or dataframe for one of those files to the Data Connector as part of a Batch Request.
Therefore, the file names are inconsequential to your Runtime Data Connector's configuration. In fact, the batch_identifiers that you define in your Runtime Data Connector's configuration can be completely arbitrary. However, it is advisable to name them after something meaningful regarding your data or the circumstances under which you will be accessing it.
For instance, let's assume you are getting a daily update to your data, and so you are running daily validations. You could then choose to identify your Runtime Data Connector's Batches by the timestamp at which they are requested.
To do this, you would simply add a batch_timestamp entry to your batch_identifiers list. This would look like:
"batch_identifiers": ["batch_timestamp"]
Then, when you create your Batch Request, you would populate the batch_timestamp value in its batch_identifiers dictionary with the current date and time. This will attach the current date and time to the returned Batch, allowing you to reference the Batch again in the future even if the current data (the data that would be provided by the Runtime Data Connector if you requested a new Batch) has changed.
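As a rough sketch of what that future Batch Request might look like, assuming the V3 RuntimeBatchRequest API, one of the sample file paths from this guide, and an illustrative data_asset_name:
from datetime import datetime

from great_expectations.core.batch import RuntimeBatchRequest

batch_request = RuntimeBatchRequest(
    datasource_name="my_datasource_name",
    data_connector_name="name_of_my_runtime_data_connector",
    data_asset_name="yellow_tripdata_sample_2020",  # illustrative; Runtime Data Assets are named in the request
    runtime_parameters={"path": "../data/yellow_tripdata_sample_2020-01.csv"},
    # Attach the current timestamp so this Batch can be referred to again later
    batch_identifiers={"batch_timestamp": datetime.now().isoformat()},
)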
The full configuration for your Datasource should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "PandasExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
"data_connectors": {
"name_of_my_runtime_data_connector": {
"class_name": "RuntimeDataConnector",
"batch_spec_passthrough": {
"reader_method": "csv",
"reader_options": {
"header": True,
"inferSchema": True,
},
},
"batch_identifiers": ["batch_timestamp"],
}
},
}
We stated above that the names you use for your batch_identifiers in a Runtime Data Connector's configuration can be completely arbitrary, and that they will be used as keys for the batch_identifiers dictionary in future Batch Requests.
However, the same holds true for the values you pass in for each key in your Batch Request's batch_identifiers!
Always make sure that your Batch Requests utilizing Runtime Data Connectors provide meaningful identifying information, consistent with the keys that are derived from the batch_identifiers you have defined in your Runtime Data Connector's configuration.
10. Test your configuration with .test_yaml_config(...)
Now that you have a full Datasource configuration, you can confirm that it is valid by testing it with the .test_yaml_config(...) method. To do this, execute the following Python code:
data_context.test_yaml_config(yaml.dump(datasource_config))
When executed, test_yaml_config will instantiate the component described by the yaml configuration that is passed in and then run a self-check procedure to verify that the component works as expected.
For a Datasource, this includes:
- confirming that the connection works
- gathering a list of available Data Assets
- verifying that at least one Batch can be fetched from the Datasource
For more information on the .test_yaml_config(...) method, please see our guide on how to configure DataContext components using test_yaml_config.
11. (Optional) Add more Data Connectors to your configuration
The data_connectors dictionary in your datasource_config can contain multiple entries. If you want to add additional Data Connectors, just go through the process again, starting at step 7.
12. Add your new Datasource to your Data Context
Now that you have verified that you have a valid configuration you can add your new Datasource to your Data Context with the command:
data_context.add_datasource(**datasource_config)
If the value of datasource_config["name"] corresponds to a Datasource that is already defined in your Data Context, then the above command will overwrite the existing Datasource.
If you want to ensure that you only add a Datasource when it won't overwrite an existing one, you can use the following code instead:
# add_datasource only if it doesn't already exist in your Data Context
try:
data_context.get_datasource(datasource_config["name"])
except ValueError:
data_context.add_datasource(**datasource_config)
else:
print(
f"The datasource {datasource_config['name']} already exists in your Data Context!"
)
Next Steps
Congratulations! You have fully configured a Datasource and verified that it can be used in future workflows to provide a Batch or Batches of data.
For more information on using Batch Requests to retrieve data, please see our guide on how to get one or more Batches of data from a configured Datasource.
You can now move forward and create Expectations for your Datasource.