How to configure a SQL Datasource
This guide will walk you through the process of configuring a SQL Datasource from scratch, verifying that your configuration is valid, and adding it to your Data Context. By the end of this guide you will have a SQL Datasource which you can use in future workflows for creating Expectations and Validating data.
Steps
1. Import necessary modules and initialize your Data Context
from ruamel import yaml
import great_expectations as gx
data_context: gx.DataContext = gx.get_context()
The great_expectations module will give you access to your Data Context, which is the entry point for working with a Great Expectations project.
The yaml module from ruamel will be used to validate your Datasource's configuration. Great Expectations will use a Python dictionary representation of your Datasource configuration when you add your Datasource to your Data Context. However, Great Expectations saves configurations as yaml files, so when you validate your configuration you will need to convert it from a Python dictionary to a yaml string first.
The Data Context that is initialized by get_context() will be the Data Context defined in your current working directory. It will provide you with convenience methods that we will use to validate your Datasource configuration and add your Datasource to your Great Expectations project once you have configured it.
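For instance, converting a small dictionary to a yaml string with the dump function from ruamel (the same call used to validate the full configuration in step 9) is a one-liner; the dictionary here is just a toy example:
example_config: dict = {"name": "example"}
example_yaml: str = yaml.dump(example_config)  # should yield the yaml string "name: example"
print(example_yaml)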
2. Create a new Datasource configuration.
A new Datasource can be configured in Python as a dictionary with a specific set of keys. We will build our Datasource configuration from scratch in this guide, although you can just as easily modify an existing one.
To start, create an empty dictionary. You will be populating it with keys as you go forward.
At this point, the configuration for your Datasource is merely:
datasource_config: dict = {}
However, from this humble beginning you will be able to build a full Datasource configuration.
The keys needed for your Datasource configuration
At the top level, your Datasource's configuration will need the following keys:
- name: The name of the Datasource, which will be used to reference the Datasource in Batch Requests.
- class_name: The name of the Python class instantiated by the Datasource. Typically, this will be the Datasource class.
- module_name: The name of the module that contains the Class definition indicated by class_name.
- execution_engine: A dictionary containing the class_name and module_name of the Execution Engine instantiated by the Datasource.
- data_connectors: The configurations for any Data Connectors and their associated Data Assets that you want to have available when utilizing the Datasource.
In the following steps we will add those keys and their corresponding values to your currently empty Datasource configuration dictionary.
3. Name your Datasource
The first key that you will need to define for your new Datasource is its name. You will use this to reference the Datasource in future workflows. It can be anything you want it to be, but ideally you will name it something relevant to the data that it interacts with.
For the purposes of this example, we will name this Datasource:
"name": "my_datasource_name", # Preferably name it something relevant
You should, however, name your Datasource something more relevant to your data.
At this point, your configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
}
4. Specify the Datasource class and module
The class_name and module_name for your Datasource will almost always indicate the Datasource class found at great_expectations.datasource. You may replace this with a specialized subclass or a custom class, but for almost all regular purposes these two default values will suffice. For the purposes of this guide, add those two values to their corresponding keys.
"class_name": "Datasource",
"module_name": "great_expectations.datasource"
Your full configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
}
5. Add the SqlAlchemy Execution Engine to your Datasource configuration
Your Execution Engine is where you will specify that you want this Datasource to use SQL in the backend. As with the Datasource top level configuration, you will need to provide the class_name and module_name that indicate the class definition and containing module for the Execution Engine that you will use.
For the purposes of this guide, these will consist of the SqlAlchemyExecutionEngine found at great_expectations.execution_engine. The execution_engine key and its corresponding value will therefore look like this:
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
Additionally, your execution_engine dictionary will require a value for either connection_string or credentials. You will only need to use one of these keys, as they each serve the same purpose: to provide the parameters necessary for the SqlAlchemyExecutionEngine to connect to your desired database. For the purposes of this guide we will use the connection_string key, the value of which will be the string representation of the information necessary to connect to your SQL database. At this point your configuration should look like:
"connection_string": CONNECTION_STRING,
Your connection string will vary depending on the type of SQL database you are connecting to. For more information on how to configure your connection string, please see the appropriate guide for connecting to a specific database.
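As a rough illustration, here are two SQLAlchemy-style connection strings; the driver, credentials, host, and database names below are placeholders, and the exact format depends on your database and driver:
# PostgreSQL (placeholder credentials, host, and database name)
CONNECTION_STRING = "postgresql+psycopg2://my_user:my_password@localhost:5432/my_database"
# SQLite (placeholder file path)
# CONNECTION_STRING = "sqlite:///path/to/my_database.db"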
After adding the above snippets to your Datasource configuration, your full configuration dictionary should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
}
credentials instead of connection_string
The credentials key uses a dictionary to provide the elements of your connection string as separate, individual values. For information on how to populate the credentials dictionary and how to configure your great_expectations.yml project config file to populate credentials from either a YAML file or a secret manager, please see our guide on How to configure credentials.
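As a sketch, a credentials dictionary typically provides the individual pieces of the connection string under separate keys; the values below are placeholders and the keys your database accepts may differ, so please treat the linked guide as the authoritative reference:
"credentials": {
    "drivername": "postgresql+psycopg2",
    "host": "localhost",
    "port": "5432",
    "username": "my_user",
    "password": "my_password",
    "database": "my_database",
},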
In addition to substituting credentials for connection_string, there are a few other optional keys that you can add to your execution_engine dictionary.
The first of these is the optional key create_temp_table. Creating a temp table to improve query efficiency is the default behaviour of the SqlAlchemyExecutionEngine. However, you can change this to False if you don't have the permissions to create a temp table on the database you are working with.
The other valid keys are not actually defined by the Great Expectations code. Instead, any other key/value pairs that you add to the Execution Engine configuration will be passed directly to SqlAlchemy's create_engine(...) method. This allows you to fully leverage all the capabilities supported by SqlAlchemy through that method.
For more information on what keys create_engine(...) will accept as parameters and how you can utilize this feature, please refer to SqlAlchemy's documentation.
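For example, an execution_engine dictionary that disables temp table creation and passes one extra keyword argument straight through to create_engine(...) might look like the following sketch; pool_pre_ping is just one example of a create_engine parameter and is not required by Great Expectations:
"execution_engine": {
    "class_name": "SqlAlchemyExecutionEngine",
    "module_name": "great_expectations.execution_engine",
    "connection_string": CONNECTION_STRING,
    "create_temp_table": False,  # skip temp table creation if you lack the permissions for it
    "pool_pre_ping": True,  # not a Great Expectations key; passed directly to sqlalchemy.create_engine(...)
},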
6. Add a dictionary as the value of the data_connectors key
The data_connectors key should have a dictionary as its value. Each key/value pair in this dictionary will correspond to a Data Connector's name and configuration, respectively.
The keys in the data_connectors dictionary will be the names of the Data Connectors, which you will use to indicate which Data Connector to use in future workflows. As with the value of your Datasource's name key, you can use any value you want for a Data Connector's name. Ideally, you will use something relevant to the data that each particular Data Connector will provide; the only significant difference is that a Data Connector's name is its key in the data_connectors dictionary.
The values for each of your data_connectors keys will be the Data Connector configurations that correspond to each Data Connector's name. You may define multiple Data Connectors in the data_connectors dictionary by including multiple key/value pairs.
For now, start by adding an empty dictionary as the value of the data_connectors key. We will begin populating it with Data Connector configurations in the next step.
Your current configuration should look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {},
}
7. Configure your individual Data Connectors (Splitting, sampling, etc.)
For each Data Connector configuration, you will need to specify which type of Data Connector you will be using. For SQL, the most likely ones will be the InferredAssetSqlDataConnector, the ConfiguredAssetSqlDataConnector, and the RuntimeDataConnector.
If you are uncertain which Data Connector best suits your needs, please refer to our guide on how to choose which Data Connector to use.
Data Connector example configurations:
- InferredAssetSqlDataConnector
- ConfiguredAssetSqlDataConnector
- RuntimeDataConnector
The InferredAssetSqlDataConnector is ideal for:
- quickly setting up a Datasource and getting access to data
- diving straight in to working with Great Expectations
- initial data discovery and introspection
However, the InferredAssetSqlDataConnector allows less control over the definitions of your Data Assets than the ConfiguredAssetSqlDataConnector provides. If you are at the point of building a repeatable workflow, we encourage using the ConfiguredAssetSqlDataConnector instead.
Remember, the key that you provide for each Data Connector configuration dictionary will be used as the name of the Data Connector. For this example, we will use the name name_of_my_inferred_data_connector but you may have it be anything you like.
At this point, your configuration should look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_inferred_data_connector": {},
},
}
Required Data Connector configuration keys
When defining an InferredAssetSqlDataConnector you will need to provide a value for one required key in the Data Connector's configuration dictionary (the currently empty dictionary that corresponds to "name_of_my_inferred_data_connector" in the example above). This key/value pair consists of:
- class_name: The name of the Class that will be instantiated for this DataConnector.
For this example, you will be using the InferredAssetSqlDataConnector as your class_name. This is a subclass of the InferredAssetDataConnector that is specialized to support SQL Execution Engines, such as the SqlAlchemyExecutionEngine. This key/value entry will therefore look like:
"class_name": "InferredAssetSqlDataConnector",
Because we are using one of Great Expectations' built-in Data Connectors, an entry for module_name along with a default value will be provided when this Data Connector is initialized.
However, if you want to use a custom Data Connector, you will need to explicitly add a module_name key alongside the class_name key. The value for module_name would then be set as the import path for the module containing your custom Data Connector, in the same fashion as you would provide class_name and module_name for a custom Datasource or Execution Engine.
With this value added, your full configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_inferred_data_connector": {
"class_name": "InferredAssetSqlDataConnector",
},
},
}
Optional Data Connector configuration key for defining introspection behaviour
There is an optional key that can be defined for your Data Connector to alter the default behaviour of introspection methods such as those used by auto-initializing Expectations and Data Assistants. This key is:
- introspection_directives: A dictionary of arguments passed to the introspection method of auto-initializing Expectations and Data Assistants to guide the introspection process.
You will find a list of the valid keys for the introspection_directives dictionary and their corresponding values in the Introspection directives subsection of the Additional notes at the end of this guide.
There are also optional keys that you can define to alter how Data Assets are inferred, how they are split into Batches, and how data is sampled from Batches. This guide will go over those in the next section.
A ConfiguredAssetDataConnector enables the most fine-tuning, allowing you to easily work with multiple Batches. It also requires an explicit listing of each Data Asset you connect to and how Batches are defined within that Data Asset, which makes it very clear what Data Assets are being provided when you reference it in Profilers, Batch Requests, or Checkpoints.
Remember, the key that you provide for each Data Connector configuration dictionary will be used as the name of the Data Connector. For this example, we will use the name name_of_my_configured_data_connector but you may have it be anything you like.
At this point, your configuration should look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_configured_data_connector": {},
},
}
Required Data Connector configuration keys
When defining a ConfiguredAssetSqlDataConnector you will need to provide values for two required keys in the Data Connector's configuration dictionary (the currently empty dictionary that corresponds to "name_of_my_configured_data_connector" in the example above). These key/value pairs consist of:
- class_name: The name of the Class that will be instantiated for this DataConnector.
- assets: A dictionary where the keys are the names of specific Data Assets and the values are the configurations for the corresponding Data Asset.
For this example, you will be using the ConfiguredAssetSqlDataConnector as your class_name. This is a subclass of the ConfiguredAssetDataConnector that is specialized to support SQL Execution Engines, such as the SqlAlchemyExecutionEngine. This key/value entry will therefore look like:
"class_name": "ConfiguredAssetSqlDataConnector",
Because we are using one of Great Expectations' built-in Data Connectors, an entry for module_name along with a default value will be provided when this Data Connector is initialized.
However, if you want to use a custom Data Connector, you will need to explicitly add a module_name key alongside the class_name key. The value for module_name would then be set as the import path for the module containing your custom Data Connector, in the same fashion as you would provide class_name and module_name for a custom Datasource or Execution Engine.
With this value added, along with a blank dictionary for assets, your full configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_configured_data_connector": {
"class_name": "ConfiguredAssetSqlDataConnector",
"assets": {},
},
},
}
Optional Data Connector configuration keys for splitting Data Assets into Batches
In addition to the above key/value pairs, there are some optional key/value pairs that you can use to define how your Data Assets are split into Batches. If you define these key/value pairs in your Data Connector dictionary, they will be applied to all the Data Connector's Data Assets. However, if you choose not to define these key/value pairs, the Data Assets that are defined for this Data Connector will default to returning a single Batch consisting of a full table when requested from a Batch Request.
The key/value pairs that are used for splitting a Data Asset into Batches are:
- splitter_method: A string that names the method that will be used to split the target table into multiple Batches.
- splitter_kwargs: A dictionary containing keyword arguments to pass to splitter_method if splitter_method has been defined.
For more information on the available splitting methods, please see the Splitting methods subsection under Additional notes at the end of this guide.
Optional Data Connector configuration keys for sampling data from returned Batches
You may wish to only sample the data that is returned in your Data Asset's Batches. To do this, you will need to define the optional keys sampling_method and sampling_kwargs. As with splitter_method and splitter_kwargs, defining these key/value pairs in your Data Connector's dictionary will result in those values being applied to all Data Assets that are made available by the Data Connector.
The key/value pairs that are used for sampling data from a Data Asset are:
- sampling_method: A string that names the method that will be used to sample data from returned Batches.
- sampling_kwargs: A dictionary containing keyword arguments to pass to sampling_method if sampling_method has been defined.
Although this guide will not use sampling in its examples, there is a list of the available sampling methods in the Sampling methods subsection of the Additional notes section at the end of this guide.
Optional Data Connector configuration key for defining introspection behaviour
Finally, there is an optional key that can be defined to alter the default behaviour of introspection methods such as those used by auto-initializing Expectations and Data Assistants. This key is:
- introspection_directives: A dictionary of arguments passed to the introspection method of auto-initializing Expectations and Data Assistants to guide the introspection process.
You will find a list of the valid keys for the introspection_directives dictionary and their corresponding values in the Introspection directives subsection of the Additional notes at the end of this guide.
A note on optional Data Connector keys
These key/value pairs can also be defined in the configurations for individual Data Assets, which will be shown later in this guide. If these values are defined both in the Data Connector dictionary and in a Data Asset dictionary, the definition in the Data Asset will take precedence.
A RuntimeDataConnector is used to connect to an in-memory dataframe or path. The dataframe or path used for a RuntimeDataConnector is therefore passed to the RuntimeDataConnector as part of a Batch Request, rather than being a static part of the RuntimeDataConnector's configuration.
A Runtime Data Connector will always only return one Batch of data: the current data that was passed in or specified as part of a Batch Request. This means that a RuntimeDataConnector does not define Data Assets like an InferredAssetDataConnector or a ConfiguredAssetDataConnector would.
Instead, a Runtime Data Connector's configuration provides a way for you to attach identifying values to a returned Batch of data so that the data as it was at the time it was returned can be referred to again in the future.
For more information on configuring a Batch Request for a Runtime Data Connector, please see our guide on how to create a Batch of data from an in-memory Spark or Pandas dataframe or path.
Remember, the key that you provide for each Data Connector configuration dictionary will be used as the name of the Data Connector. For this example, we will use the name name_of_my_runtime_data_connector but you may have it be anything you like.
At this point, your configuration should look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_runtime_data_connector": {},
},
}
Required Data Connector configuration keys
When defining a RuntimeDataConnector you will need to provide values for two required keys in the Data Connector's configuration dictionary (the currently empty dictionary that corresponds to "name_of_my_runtime_data_connector" in the example above). These key/value pairs consist of:
- class_name: The name of the Class that will be instantiated for this DataConnector.
- batch_identifiers: A list of strings that will be used as keys for identifying metadata that the user provides for the returned Batch.
For this example, you will be using the RuntimeDataConnector as your class_name. This key/value entry will therefore look like:
"class_name": "RuntimeDataConnector",
After including an empty list for your batch_identifiers, your full configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_runtime_data_connector": {
"class_name": "RuntimeDataConnector",
"batch_identifiers": {},
},
},
}
Because we are using one of Great Expectations' built-in Data Connectors, an entry for module_name along with a default value will be provided when this Data Connector is initialized.
However, if you want to use a custom Data Connector, you will need to explicitly add a module_name key alongside the class_name key. The value for module_name would then be set as the import path for the module containing your custom Data Connector, in the same fashion as you would provide class_name and module_name for a custom Datasource or Execution Engine.
8. Configure your Data Connector's Data Assets (Splitting, sampling, etc.)
- InferredAssetSqlDataConnector
- ConfiguredAssetSqlDataConnector
- RuntimeDataConnector
In an Inferred Asset Data Connector for SQL data, all the behaviour for how Data Assets are inferred, how they are split into Batches, and how those Batches are sampled for data are configured in the Data Connector dictionary. To alter any of these behaviours from the default, you will simply set your desired behaviour by defining one of the following optional key/value pairs.
Optional Data Connector configuration keys for inferring Data Assets
A Data Asset in an Inferred Asset Data Connector for SQL data will consist of a single table in the database you have connected to. You can modify the way that the Data Connector infers which tables to utilize as Data Assets by defining the following key/value pairs in your Data Connector's dictionary in the Datasource configuration (a sketch using a few of these keys follows the list):
- data_asset_name_prefix: A string describing an optional prefix to prepend to the names of inferred Data Assets.
- data_asset_name_suffix: A string describing an optional suffix to append to the names of inferred Data Assets.
- include_schema_name: A boolean value which answers the question: "Should the data_asset_name include the schema as a prefix?"
- excluded_tables: A list of tables to ignore when inferring Data Assets.
- included_tables: A list that, if defined, will limit the inferred Data Assets to those tables that are included in the list.
- skip_inapplicable_tables: A boolean value. If True, tables that can't be successfully queried using sampling and splitter methods are excluded from inferred Data Assets. If False, the class will throw an error during initialization if any such tables are encountered.
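For instance, a sketch of an Inferred Asset Data Connector entry that prefixes inferred Data Asset names and restricts inference to two hypothetical tables might look like this (the table names are placeholders, and your database may require schema-qualified names):
"name_of_my_inferred_data_connector": {
    "class_name": "InferredAssetSqlDataConnector",
    "data_asset_name_prefix": "taxi__",  # inferred Data Assets are named e.g. taxi__<table_name>
    "include_schema_name": False,
    "included_tables": [
        "yellow_tripdata_sample_2019",  # hypothetical table name
        "yellow_tripdata_sample_2020",  # hypothetical table name
    ],
},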
Optional Data Connector configuration keys for splitting Data Assets into Batches
Next is the matter of how (or even if) your Data Connector splits Data Assets into Batches. By default, each Data Asset will provide a single Batch consisting of the entire table that it corresponds to. You can change this behaviour by specifying the following key/value pairs:
- splitter_method: A string that names the method that will be used to split the target table into multiple Batches.
- splitter_kwargs: A dictionary containing keyword arguments to pass to splitter_method if splitter_method has been defined.
For example, imagine that you have one or more tables containing the NYC taxi data from the getting started tutorial in your database. You could instruct your Data Connector to infer Data Assets that return each table as a single Batch by simply not including a splitter_method. Such a configuration would be identical to the Data Connector name_of_my_inferred_data_connector that was defined in the example at the end of step 7, so let's rename that data_connectors entry to inferred_data_connector_single_batch_asset since that is more meaningful. Your configuration for a single Batch Data Asset would now look like:
"inferred_data_connector_single_batch_asset": {
"class_name": "InferredAssetSqlDataConnector",
},
Alternatively, you could define a Data Asset that is split into Batches based on the year and month by defining the splitter_method to be split_on_year_and_month and providing a Datetime column. (In the case of the NYC taxi data, this would be the pickup_datetime column.) Creating a Data Asset like this would result in your configuration being:
"inferred_data_connector_multi_batch_asset_split_on_date_time": {
"class_name": "InferredAssetSqlDataConnector",
"splitter_method": "split_on_year_and_month",
"splitter_kwargs": {
"column_name": "pickup_datetime",
},
},
If you included both of these Data Assets, your complete configuration would look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"inferred_data_connector_single_batch_asset": {
"class_name": "InferredAssetSqlDataConnector",
},
"inferred_data_connector_multi_batch_asset_split_on_date_time": {
"class_name": "InferredAssetSqlDataConnector",
"splitter_method": "split_on_year_and_month",
"splitter_kwargs": {
"column_name": "pickup_datetime",
},
},
},
}
If you are uncertain whether you should be splitting your Data Assets into Batches, please refer to our guide on how to choose between working with a single or multiple Batches of data.
For more information on the available splitting methods, please see the Splitting methods subsection under Additional notes at the end of this guide.
Optional Data Connector configuration keys for sampling data from returned Batches
Finally, you may wish to only sample a portion of the data that would be returned in your Data Asset's Batches. To do this, you will need to define the optional keys sampling_method and sampling_kwargs. As with splitter_method and splitter_kwargs, defining these key/value pairs in your Data Connector's dictionary will result in those values being applied to all Data Assets that are made available by the Data Connector.
The key/value pairs that are used for sampling data from a Data Asset are:
- sampling_method: A string that names the method that will be used to sample data from returned Batches.
- sampling_kwargs: A dictionary containing keyword arguments to pass to sampling_method if sampling_method has been defined.
Although this guide will not use sampling in its examples, there is a list of the available sampling methods in the Sampling methods subsection of the Additional notes section at the end of this guide.
In a Configured Asset Data Connector for SQL data, each entry in the assets dictionary will correspond to an explicitly defined Data Asset. The key provided will be used as the name of the Data Asset, while the value will be a dictionary that specifies how that Data Asset is defined.
The key/value pairs inside a Data Asset's dictionary are technically all optional. If none of them are defined, the Data Asset will correspond to the table whose name matches the key of that (empty) Data Asset dictionary.
Optional Data Asset configuration keys for Data Asset names and schemas:
You may provide the following key/value pairs in your Data Asset configuration to alter how your Data Asset behaves regarding its associated table:
- table_name: A string that, if defined, specifies the table used for the Data Asset. If this is not defined, the table name defaults to the key corresponding to the Data Asset's dictionary in your configuration.
- schema_name: An optional string that defines the schema for the Data Asset.
- include_schema_name: A boolean value that determines whether the schema_name should be included as a prefix to the Data Asset's name.
For example, imagine that you have a copy of the NYC taxi data from the getting started tutorial in a table called yellow_tripdata_sample_2020, within a schema named main. You could access this data by defining an entry in the assets dictionary like:
"yellow_tripdata_sample_2020_full": {
"table_name": "yellow_tripdata_sample_2020",
"schema_name": "main",
},
Optional Data Asset configuration keys for splitting Data Assets into Batches
Next is the matter of how (or even if) your Data Connector splits Data Assets into Batches. By default, each Data Asset will provide a single Batch consisting of the entire table that it corresponds to. You can change this behaviour by specifying the following key/value pairs:
- splitter_method: A string that names the method that will be used to split the target table into multiple Batches.
- splitter_kwargs: A dictionary containing keyword arguments to pass to splitter_method if splitter_method has been defined.
The configuration we provided above does not include a splitter_method, and therefore will return the entire table as a single Batch.
Alternatively, you could define a Data Asset that is split into Batches based on the year and month by defining the splitter_method to be split_on_year_and_month and providing a Datetime column. (In the case of the NYC taxi data, this would be the pickup_datetime column.) Creating a Data Asset like this would result in the following key/value pair in your assets dictionary:
"yellow_tripdata_sample_2020_by_year_and_month": {
"table_name": "yellow_tripdata_sample_2020",
"schema_name": "main",
"splitter_method": "split_on_year_and_month",
"splitter_kwargs": {
"column_name": "pickup_datetime",
},
},
If you included both of these Data Assets, your complete configuration would look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_configured_data_connector": {
"class_name": "ConfiguredAssetSqlDataConnector",
"assets": {
"yellow_tripdata_sample_2020_full": {
"table_name": "yellow_tripdata_sample_2020",
"schema_name": "main",
},
"yellow_tripdata_sample_2020_by_year_and_month": {
"table_name": "yellow_tripdata_sample_2020",
"schema_name": "main",
"splitter_method": "split_on_year_and_month",
"splitter_kwargs": {
"column_name": "pickup_datetime",
},
},
},
},
},
}
If you are uncertain whether you should be splitting your Data Assets into Batches, please refer to our guide on how to choose between working with a single or multiple Batches of data.
The splitter_method and splitter_kwargs key/value pairs can be defined at either the Data Connector or the Data Asset level. If they are defined at the Data Connector level, they will apply to all Data Assets that do not have an alternative definition in their configuration. If they are defined at the Data Asset level, the Data Asset definition will take precedence over the Data Connector definition.
You can think of the Data Connector level definition of splitter_method and splitter_kwargs as a way to define default values of these keys for your Data Assets. The Data Asset level definitions, then, would be Data Asset specific definitions that overwrite those defaults, as illustrated in the sketch below.
For more information on the available splitting methods, please see the Splitting methods subsection under Additional notes at the end of this guide.
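As an illustrative sketch (not part of the running example), a Data Connector could set a default splitter at the Data Connector level and override it for one specific Data Asset like so:
"name_of_my_configured_data_connector": {
    "class_name": "ConfiguredAssetSqlDataConnector",
    # Data Connector level defaults: applied to every Data Asset below unless overridden
    "splitter_method": "split_on_year_and_month",
    "splitter_kwargs": {"column_name": "pickup_datetime"},
    "assets": {
        "yellow_tripdata_sample_2020_by_year_and_month": {
            "table_name": "yellow_tripdata_sample_2020",
            "schema_name": "main",
        },
        "yellow_tripdata_sample_2020_by_day": {
            "table_name": "yellow_tripdata_sample_2020",
            "schema_name": "main",
            # Data Asset level override: takes precedence over the defaults above
            "splitter_method": "split_on_year_and_month_and_day",
            "splitter_kwargs": {"column_name": "pickup_datetime"},
        },
    },
},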
Optional Data Asset configuration keys for sampling data from returned Batches
Finally, you may wish to only sample a portion of the data that would be returned in your Data Asset's Batches. To do this, you will need to define the optional keys sampling_method and sampling_kwargs. As with splitter_method and splitter_kwargs, defining these key/value pairs in your Data Connector's dictionary will result in those values being applied to all Data Assets that are made available by the Data Connector.
The key/value pairs that are used for sampling data from a Data Asset are:
- sampling_method: A string that names the method that will be used to sample data from returned Batches.
- sampling_kwargs: A dictionary containing keyword arguments to pass to sampling_method if sampling_method has been defined.
As with splitter_method and splitter_kwargs, sampling_method and sampling_kwargs can be defined at either the Data Connector or the Data Asset level. When defined at the Data Connector level, the definition acts as a default that applies to all Data Assets that do not have their own specific definition for the two keys. If defined at the Data Asset level, the values will take precedence over any that are defined in the Data Connector dictionary.
Although this guide will not use sampling in its examples, there is a list of the available sampling methods in the Sampling methods subsection of the Additional notes section at the end of this guide.
Optional Data Asset configuration key for defining introspection behaviour
Finally, there is an optional key that can be defined to alter the default behaviour of introspection methods such as those used by auto-initializing Expectations and Data Assistants. This key is:
- introspection_directives: A dictionary of arguments passed to the introspection method of auto-initializing Expectations and Data Assistants to guide the introspection process.
Similar to the splitting and sampling key/value pairs, this key can be defined at either the Data Connector or the Data Asset level. A definition at the Data Connector level will be applied to all Data Assets, while a definition at the Data Asset level will take precedence over the Data Connector's values.
You will find a list of the valid keys for the introspection_directives dictionary and their corresponding values in the Introspection directives subsection of the Additional notes at the end of this guide.
Runtime Data Connectors put a wrapper around a single Batch of data, and therefore do not support Data Asset configurations that permit the return of more than one Batch of data. In fact, since you will use a Batch Request to pass in or specify the data that a Runtime Data Connector uses, there is no need to specify a Data Asset configuration at all.
Instead, you will provide a batch_identifiers list which will be used to attach identifying information to a returned Batch so that you can reference the same data again in the future.
For this example, let's assume we have the following tables in our database:
yellow_tripdata_sample_2020-01
yellow_tripdata_sample_2020-02
yellow_tripdata_sample_2020-03
With a Runtime Data Connector you won't actually refer to them in your configuration! As mentioned above, you will provide the name of one of those tables to the Data Connector as part of a Batch Request.
Therefore, the table names are inconsequential to your Runtime Data Connector's configuration. In fact, the batch_identifiers that you define in your Runtime Data Connector's configuration can be completely arbitrary. However, it is advised that you name them after something meaningful regarding your data or the circumstances under which you will be accessing your data.
For instance, let's assume you are getting a daily update to your data, and so you are running daily validations. You could then choose to identify your Runtime Data Connector's Batches by the timestamp at which they are requested.
To do this, you would simply add a batch_timestamp entry in your batch_identifiers list. This would look like:
"batch_identifiers": ["batch_timestamp"],
Then, when you create your Batch Request you would populate the batch_timestamp value in its batch_identifiers dictionary with the current date and time. This will attach the current date and time to the returned Batch, allowing you to reference the Batch again in the future even if the current data (the data that would be provided by the Runtime Data Connector if you requested a new Batch) has changed.
The full configuration for your Datasource should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_runtime_data_connector": {
"class_name": "RuntimeDataConnector",
"batch_identifiers": ["batch_timestamp"],
},
},
}
We stated above that the names that you use for your batch_identifiers in a Runtime Data Connector's configuration can be completely arbitrary, and will be used as keys for the batch_identifiers dictionary in future Batch Requests.
However, the same holds true for the values you pass in for each key in your Batch Request's batch_identifiers!
Always make sure that your Batch Requests utilizing Runtime Data Connectors are providing meaningful identifying information, consistent with the keys that are derived from the batch_identifiers you have defined in your Runtime Data Connector's configuration.
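For instance, a Batch Request against this Runtime Data Connector might attach the current timestamp as its batch_timestamp identifier. The sketch below assumes the Datasource configured in this guide, a hypothetical Data Asset name, and a placeholder query; see the guide on how to get one or more Batches of data from a configured Datasource for the full Batch Request workflow:
import datetime
from great_expectations.core.batch import RuntimeBatchRequest

batch_request = RuntimeBatchRequest(
    datasource_name="my_datasource_name",
    data_connector_name="name_of_my_runtime_data_connector",
    data_asset_name="my_runtime_asset_name",  # arbitrary name used to reference this data later
    runtime_parameters={"query": "SELECT * FROM my_table"},  # placeholder query
    batch_identifiers={"batch_timestamp": datetime.datetime.now().isoformat()},
)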
9. Test your configuration with .test_yaml_config(...)
Now that you have a full Datasource configuration, you can confirm that it is valid by testing it with the .test_yaml_config(...) method. To do this, execute the Python code:
data_context.test_yaml_config(yaml.dump(datasource_config))
When executed, test_yaml_config will instantiate the component described by the yaml configuration that is passed in and then run a self check procedure to verify that the component works as expected.
For a Datasource, this includes:
- confirming that the connection works
- gathering a list of available Data Assets
- verifying that at least one Batch can be fetched from the Datasource
For more information on the .test_yaml_config(...) method, please see our guide on how to configure DataContext components using test_yaml_config.
10. (Optional) Add more Data Connectors to your configuration
The data_connectors dictionary in your datasource_config can contain multiple entries. If you want to add additional Data Connectors, just go through the process starting at step 7 again.
11. Add your new Datasource to your Data Context
Now that you have verified that you have a valid configuration you can add your new Datasource to your Data Context with the command:
data_context.add_datasource(**datasource_config)
If the value of datasource_config["name"] corresponds to a Datasource that is already defined in your Data Context, then using the above command will overwrite the existing Datasource.
If you want to ensure that you only add a Datasource when it won't overwrite an existing one, you can use the following code instead:
# add_datasource only if it doesn't already exist in your Data Context
try:
data_context.get_datasource(datasource_config["name"])
except ValueError:
data_context.add_datasource(**datasource_config)
else:
print(
f"The datasource {datasource_config['name']} already exists in your Data Context!"
)
Next steps
Congratulations! You have fully configured a Datasource and verified that it can be used in future workflows to provide a Batch or Batches of data.
For more information on using Batch Requests to retrieve data, please see our guide on how to get one or more Batches of data from a configured Datasource.
You can now move forward and create Expectations for your Datasource.
Additional notes
Splitting methods
- The names of splitter_method values can be specified with or without a preceding underscore.
Available methods for splitter_method values and their configuration parameters:
splitter_method | splitter_kwargs | Returned Batch Data |
---|---|---|
split_on_whole_table | N/A | Identical to original |
split_on_column_value | column_name='col' | Rows where value of column_name are same |
split_on_year | column_name='col' | Rows where the year of a datetime column are the same |
split_on_year_and_month | column_name='col' | Rows where the year and month of a datetime column are the same |
split_on_year_and_month_and_day | column_name='col' | Rows where the year, month and day of a datetime column are the same |
split_on_date_parts | column_name='col', date_parts='<list[DatePart]>' | Rows where the date parts of a datetime column are the same. Date parts can be specified as DatePart objects or as their string equivalent e.g. "year", "month", "week", "day", "hour", "minute", or "second" |
split_on_divided_integer | column_name='col', divisor=<int> | Rows where value of column_name divided (using integral division) by the given divisor are same |
split_on_mod_integer | column_name='col', mod=<int> | Rows where value of column_name divided (using modular division) by the given mod are same |
split_on_multi_column_values | column_names='<list[col]>' | Rows where values of column_names are same |
split_on_converted_datetime | column_name='col', date_format_string=<'%Y-%m-%d'> | Rows where value of column_name converted to datetime using the given date_format_string are same |
split_on_hashed_column | column_name='col', hash_digits=<int> | Rows where value of column_name hashed (using "md5" hash function) and retaining the stated number of hash_digits are same (experimental) |
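For example, reading from the table above, a sketch of a splitter configuration that returns one Batch per distinct value of a hypothetical vendor_id column would be:
"splitter_method": "split_on_column_value",
"splitter_kwargs": {
    "column_name": "vendor_id",  # hypothetical column; one Batch per distinct value
},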
Sampling methods
- The names of sampling_method values can be specified with or without a preceding underscore.
Available methods for sampling_method values and their configuration parameters:
values and their configuration parameters:
sampling_method | sampling_kwargs | Returned Batch Data |
---|---|---|
sample_using_limit | n=num_rows | First up to n (the specified limit parameter) rows of batch |
sample_using_random | p=fraction | Rows selected at random, whose number amounts to selected fraction of total number of rows in batch |
sample_using_mod | column_name='col', mod=<int> | Take the mod of named column, and only keep rows that match the given value |
sample_using_a_list | column_name='col', value_list=<list[val]> | Match the values in the named column against value_list, and only keep the matches |
sample_using_hash | column_name='col', hash_digits=<int>, hash_value=<str> | Hash the values in the named column (using "md5" hash function), and only keep rows that match the given hash_value |
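For example, a sketch of a sampling configuration that keeps at most the first 1000 rows of each returned Batch would pair sample_using_limit with its n keyword argument:
"sampling_method": "sample_using_limit",
"sampling_kwargs": {
    "n": 1000,  # keep at most the first 1000 rows of each Batch
},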
Introspection directives
Valid keys for the introspection_directives dictionary include:
- schema_name: A string describing the schema to introspect (default is None). If this is provided, the introspection will be limited to the specified schema.
- ignore_information_schemas_and_system_tables: A boolean value (default=True) which determines whether to ignore information schemas and system tables when introspecting the database.
- system_tables: An optional list of strings that define system_tables for your database.
- information_schemas: An optional list of strings that define information_schemas for your database.
- include_views: A boolean value (default=True) which determines whether to include views when introspecting the database.
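Putting a few of these keys together, a sketch of an introspection_directives entry that limits introspection to a hypothetical reporting schema and skips views might look like:
"introspection_directives": {
    "schema_name": "reporting",  # hypothetical schema to introspect
    "include_views": False,  # do not include views when introspecting
},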