How to configure a SQL Datasource
This guide will walk you through the process of configuring a SQL Datasource from scratch, verifying that your configuration is valid, and adding it to your Data Context. By the end of this guide you will have a SQL Datasource which you can use in future workflows for creating Expectations and Validating data.
Steps
1. Import necessary modules and initialize your Data Context
from ruamel import yaml
import great_expectations as gx
data_context: gx.DataContext = gx.get_context()
The great_expectations module will give you access to your Data Context, which is the entry point for working with a Great Expectations project.
The yaml module from ruamel will be used to validate your Datasource's configuration. Great Expectations will use a Python dictionary representation of your Datasource configuration when you add your Datasource to your Data Context. However, Great Expectations saves configurations as yaml files, so when you validate your configuration you will need to convert it from a Python dictionary to a yaml string first.
The Data Context that is initialized by get_context() will be the Data Context defined in your current working directory. It will provide you with convenience methods that we will use to validate your Datasource configuration and add your Datasource to your Great Expectations project once you have configured it.
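For instance, converting a small dictionary to a yaml string with the dump function from ruamel (the same call used to validate the full configuration in step 9) is a one-liner; the dictionary here is just a toy example:
example_config: dict = {"name": "example"}
example_yaml: str = yaml.dump(example_config)  # should yield the yaml string "name: example"
print(example_yaml)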
2. Create a new Datasource configuration.
A new Datasource can be configured in Python as a dictionary with a specific set of keys. We will build our Datasource configuration from scratch in this guide, although you can just as easily modify an existing one.
To start, create an empty dictionary. You will be populating it with keys as you go forward.
At this point, the configuration for your Datasource is merely:
datasource_config: dict = {}
However, from this humble beginning you will be able to build a full Datasource configuration.
The keys needed for your Datasource configuration
At the top level, your Datasource's configuration will need the following keys:
- name: The name of the Datasource, which will be used to reference the Datasource in Batch Requests.
- class_name: The name of the Python class instantiated by the Datasource. Typically, this will be the Datasource class.
- module_name: The name of the module that contains the Class definition indicated by class_name.
- execution_engine: A dictionary containing the class_name and module_name of the Execution Engine instantiated by the Datasource.
- data_connectors: The configurations for any Data Connectors and their associated Data Assets that you want to have available when utilizing the Datasource.
In the following steps we will add those keys and their corresponding values to your currently empty Datasource configuration dictionary.
3. Name your Datasource
The first key that you will need to define for your new Datasource is its name. You will use this to reference the Datasource in future workflows. It can be anything you want it to be, but ideally you will name it something relevant to the data that it interacts with.
For the purposes of this example, we will name this Datasource:
"name": "my_datasource_name", # Preferably name it something relevant
You should, however, name your Datasource something more relevant to your data.
At this point, your configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
}
4. Specify the Datasource class and module
The class_name and module_name for your Datasource will almost always indicate the Datasource class found at great_expectations.datasource. You may replace this with a specialized subclass or a custom class, but for almost all regular purposes these two default values will suffice. For the purposes of this guide, add those two values to their corresponding keys.
"class_name": "Datasource",
"module_name": "great_expectations.datasource"
Your full configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name", # Preferably name it something relevant
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
}
5. Add the SqlAlchemy Execution Engine to your Datasource configuration
Your Execution Engine is where you will specify that you want this Datasource to use SQL in the backend. As with the Datasource top level configuration, you will need to provide the class_name and module_name that indicate the class definition and containing module for the Execution Engine that you will use.
For the purposes of this guide, these will consist of the SqlAlchemyExecutionEngine found at great_expectations.execution_engine. The execution_engine key and its corresponding value will therefore look like this:
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
},
Additionally, your execution_engine dictionary will require a value for either connection_string or credentials. You will only need to use one of these keys, as they each serve the same purpose: to provide the parameters necessary for the SqlAlchemyExecutionEngine to connect to your desired database. For the purposes of this guide we will use the connection_string key, the value of which will be the string representation of the information necessary to connect to your SQL database. At this point your configuration should look like:
"connection_string": CONNECTION_STRING,
Your connection string will vary depending on the type of SQL database you are connecting to. For more information on how to configure your connection string, please see the appropriate guide for connecting to a specific database.
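As a rough illustration, here are two SQLAlchemy-style connection strings; the driver, credentials, host, and database names below are placeholders, and the exact format depends on your database and driver:
# PostgreSQL (placeholder credentials, host, and database name)
CONNECTION_STRING = "postgresql+psycopg2://my_user:my_password@localhost:5432/my_database"
# SQLite (placeholder file path)
# CONNECTION_STRING = "sqlite:///path/to/my_database.db"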
After adding the above snippets to your Datasource configuration, your full configuration dictionary should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
}
credentials instead of connection_string
The credentials key uses a dictionary to provide the elements of your connection string as separate, individual values. For information on how to populate the credentials dictionary and how to configure your great_expectations.yml project config file to populate credentials from either a YAML file or a secret manager, please see our guide on How to configure credentials.
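As a sketch, a credentials dictionary typically provides the individual pieces of the connection string under separate keys; the values below are placeholders and the keys your database accepts may differ, so please treat the linked guide as the authoritative reference:
"credentials": {
    "drivername": "postgresql+psycopg2",
    "host": "localhost",
    "port": "5432",
    "username": "my_user",
    "password": "my_password",
    "database": "my_database",
},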
In addition to substituting credentials for connection_string, there are a few other optional keys that you can add to your execution_engine dictionary.
The first of these is the optional key create_temp_table. Creating a temp table to improve query efficiency is the default behaviour of the SqlAlchemyExecutionEngine. However, you can change this to False if you don't have the permissions to create a temp table on the database you are working with.
The other valid keys are not actually defined by the Great Expectations code. Instead, any other key/value pairs that you add to the Execution Engine configuration will be passed directly to SqlAlchemy's create_engine(...) method. This allows you to fully leverage all the capabilities supported by SqlAlchemy through that method.
For more information on what keys create_engine(...) will accept as parameters and how you can utilize this feature, please refer to SqlAlchemy's documentation.
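For example, an execution_engine dictionary that disables temp table creation and passes one extra keyword argument straight through to create_engine(...) might look like the following sketch; pool_pre_ping is just one example of a create_engine parameter and is not required by Great Expectations:
"execution_engine": {
    "class_name": "SqlAlchemyExecutionEngine",
    "module_name": "great_expectations.execution_engine",
    "connection_string": CONNECTION_STRING,
    "create_temp_table": False,  # skip temp table creation if you lack the permissions for it
    "pool_pre_ping": True,  # not a Great Expectations key; passed directly to sqlalchemy.create_engine(...)
},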
6. Add a dictionary as the value of the data_connectors key
The data_connectors key should have a dictionary as its value. Each key/value pair in this dictionary will correspond to a Data Connector's name and configuration, respectively.
The keys in the data_connectors dictionary will be the names of the Data Connectors, which you will use to indicate which Data Connector to use in future workflows. As with the value of your Datasource's name key, you can use any value you want for a Data Connector's name. Ideally, you will use something relevant to the data that each particular Data Connector will provide; the only significant difference is that a Data Connector's name is its key in the data_connectors dictionary.
The values for each of your data_connectors keys will be the Data Connector configurations that correspond to each Data Connector's name. You may define multiple Data Connectors in the data_connectors dictionary by including multiple key/value pairs.
For now, start by adding an empty dictionary as the value of the data_connectors key. We will begin populating it with Data Connector configurations in the next step.
Your current configuration should look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {},
}
7. Configure your individual Data Connectors (Splitting, sampling, etc.)
For each Data Connector configuration, you will need to specify which type of Data Connector you will be using. For SQL, the most likely ones will be the InferredAssetSqlDataConnector, the ConfiguredAssetSqlDataConnector, and the RuntimeDataConnector.
If you are uncertain which Data Connector best suits your needs, please refer to our guide on how to choose which Data Connector to use.
Data Connector example configurations:
- InferredAssetSqlDataConnector
- ConfiguredAssetSqlDataConnector
- RuntimeDataConnector
The InferredAssetSqlDataConnector is ideal for:
- quickly setting up a Datasource and getting access to data
- diving straight in to working with Great Expectations
- initial data discovery and introspection
However, the InferredAssetSqlDataConnector allows less control over the definitions of your Data Assets than the ConfiguredAssetSqlDataConnector provides. If you are at the point of building a repeatable workflow, we encourage using the ConfiguredAssetSqlDataConnector instead.
Remember, the key that you provide for each Data Connector configuration dictionary will be used as the name of the Data Connector. For this example, we will use the name name_of_my_inferred_data_connector but you may have it be anything you like.
At this point, your configuration should look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_inferred_data_connector": {},
},
}
Required Data Connector configuration keys
When defining an InferredAssetSqlDataConnector you will need to provide a value for one required key in the Data Connector's configuration dictionary (the currently empty dictionary that corresponds to "name_of_my_inferred_data_connector" in the example above). This key/value pair consists of:
- class_name: The name of the Class that will be instantiated for this DataConnector.
For this example, you will be using the InferredAssetSqlDataConnector as your class_name. This is a subclass of the InferredAssetDataConnector that is specialized to support SQL Execution Engines, such as the SqlAlchemyExecutionEngine. This key/value entry will therefore look like:
"class_name": "InferredAssetSqlDataConnector",
Because we are using one of Great Expectations' built-in Data Connectors, an entry for module_name along with a default value will be provided when this Data Connector is initialized.
However, if you want to use a custom Data Connector, you will need to explicitly add a module_name key alongside the class_name key. The value for module_name would then be set as the import path for the module containing your custom Data Connector, in the same fashion as you would provide class_name and module_name for a custom Datasource or Execution Engine.
With this value added, your full configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_inferred_data_connector": {
"class_name": "InferredAssetSqlDataConnector",
},
},
}
Optional Data Connector configuration key for defining introspection behaviour
There is an optional key that can be defined for your Data Connector to alter the default behaviour of introspection methods such as those used by auto-initializing Expectations and Data Assistants. This key is:
- introspection_directives: A dictionary of arguments passed to the introspection method of auto-initializing Expectations and Data Assistants to guide the introspection process.
You will find a list of the valid keys for the introspection_directives dictionary and their corresponding values in the Introspection directives subsection of the Additional notes at the end of this guide.
There are also optional keys that you can define to alter how Data Assets are inferred, how they are split into Batches, and how data is sampled from Batches. This guide will go over those in the next section.
A ConfiguredAssetDataConnector enables the most fine-tuning, allowing you to easily work with multiple Batches. It also requires an explicit listing of each Data Asset you connect to and how Batches are defined within that Data Asset, which makes it very clear what Data Assets are being provided when you reference it in Profilers, Batch Requests, or Checkpoints.
Remember, the key that you provide for each Data Connector configuration dictionary will be used as the name of the Data Connector. For this example, we will use the name name_of_my_configured_data_connector but you may have it be anything you like.
At this point, your configuration should look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_configured_data_connector": {},
},
}
Required Data Connector configuration keys
When defining a ConfiguredAssetSqlDataConnector you will need to provide values for two required keys in the Data Connector's configuration dictionary (the currently empty dictionary that corresponds to "name_of_my_configured_data_connector" in the example above). These key/value pairs consist of:
- class_name: The name of the Class that will be instantiated for this DataConnector.
- assets: A dictionary where the keys are the names of specific Data Assets and the values are the configurations for the corresponding Data Asset.
For this example, you will be using the ConfiguredAssetSqlDataConnector as your class_name. This is a subclass of the ConfiguredAssetDataConnector that is specialized to support SQL Execution Engines, such as the SqlAlchemyExecutionEngine. This key/value entry will therefore look like:
"class_name": "ConfiguredAssetSqlDataConnector",
Because we are using one of Great Expectations' built-in Data Connectors, an entry for module_name along with a default value will be provided when this Data Connector is initialized.
However, if you want to use a custom Data Connector, you will need to explicitly add a module_name key alongside the class_name key. The value for module_name would then be set as the import path for the module containing your custom Data Connector, in the same fashion as you would provide class_name and module_name for a custom Datasource or Execution Engine.
With this value added, along with a blank dictionary for assets, your full configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_configured_data_connector": {
"class_name": "ConfiguredAssetSqlDataConnector",
"assets": {},
},
},
}
Optional Data Connector configuration keys for splitting Data Assets into Batches
In addition to the above key/value pairs, there are some optional key/value pairs that you can use to define how your Data Assets are split into Batches. If you define these key/value pairs in your Data Connector dictionary, they will be applied to all the Data Connector's Data Assets. However, if you choose not to define these key/value pairs, the Data Assets that are defined for this Data Connector will default to returning a single Batch consisting of a full table when requested from a Batch Request.
The key/value pairs that are used for splitting a Data Asset into Batches are:
- splitter_method: A string that names the method that will be used to split the target table into multiple Batches.
- splitter_kwargs: A dictionary containing keyword arguments to pass to splitter_method if splitter_method has been defined.
For more information on the available splitting methods, please see the Splitting methods subsection under Additional notes at the end of this guide.
Optional Data Connector configuration keys for sampling data from returned Batches
You may wish to only sample the data that is returned in your Data Asset's Batches. To do this, you will need to define the optional keys sampling_method and sampling_kwargs. As with splitter_method and splitter_kwargs, defining these key/value pairs in your Data Connector's dictionary will result in those values being applied to all Data Assets that are made available by the Data Connector.
The key/value pairs that are used for sampling data from a Data Asset are:
- sampling_method: A string that names the method that will be used to sample data from returned Batches.
- sampling_kwargs: A dictionary containing keyword arguments to pass to sampling_method if sampling_method has been defined.
Although this guide will not use sampling in its examples, there is a list of the available sampling methods in the Sampling methods subsection of the Additional notes section at the end of this guide.
Optional Data Connector configuration key for defining introspection behaviour
Finally, there is an optional key that can be defined to alter the default behaviour of introspection methods such as those used by auto-initializing Expectations and Data Assistants. This key is:
- introspection_directives: A dictionary of arguments passed to the introspection method of auto-initializing Expectations and Data Assistants to guide the introspection process.
You will find a list of the valid keys for the introspection_directives dictionary and their corresponding values in the Introspection directives subsection of the Additional notes at the end of this guide.
A note on optional Data Connector keys
These key/value pairs can also be defined in the configurations for individual Data Assets, which will be shown later in this guide. If these values are defined both in the Data Connector dictionary and in a Data Asset dictionary, the definition in the Data Asset will take precedence.
A RuntimeDataConnector is used to connect to an in-memory dataframe or path. The dataframe or path used for a RuntimeDataConnector is therefore passed to the RuntimeDataConnector as part of a Batch Request, rather than being a static part of the RuntimeDataConnector's configuration.
A Runtime Data Connector will always only return one Batch of data: the current data that was passed in or specified as part of a Batch Request. This means that a RuntimeDataConnector does not define Data Assets like an InferredAssetDataConnector or a ConfiguredAssetDataConnector would.
Instead, a Runtime Data Connector's configuration provides a way for you to attach identifying values to a returned Batch of data so that the data as it was at the time it was returned can be referred to again in the future.
For more information on configuring a Batch Request for a Runtime Data Connector, please see our guide on how to create a Batch of data from an in-memory Spark or Pandas dataframe or path.
Remember, the key that you provide for each Data Connector configuration dictionary will be used as the name of the Data Connector. For this example, we will use the name name_of_my_runtime_data_connector but you may have it be anything you like.
At this point, your configuration should look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_runtime_data_connector": {},
},
}
Required Data Connector configuration keys
When defining a RuntimeDataConnector you will need to provide values for two required keys in the Data Connector's configuration dictionary (the currently empty dictionary that corresponds to "name_of_my_runtime_data_connector" in the example above). These key/value pairs consist of:
- class_name: The name of the Class that will be instantiated for this DataConnector.
- batch_identifiers: A list of strings that will be used as keys for identifying metadata that the user provides for the returned Batch.
For this example, you will be using the RuntimeDataConnector as your class_name. This key/value entry will therefore look like:
"class_name": "RuntimeDataConnector",
After including an empty list for your batch_identifiers, your full configuration should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_runtime_data_connector": {
"class_name": "RuntimeDataConnector",
"batch_identifiers": {},
},
},
}
Because we are using one of Great Expectations' built-in Data Connectors, an entry for module_name along with a default value will be provided when this Data Connector is initialized.
However, if you want to use a custom Data Connector, you will need to explicitly add a module_name key alongside the class_name key. The value for module_name would then be set as the import path for the module containing your custom Data Connector, in the same fashion as you would provide class_name and module_name for a custom Datasource or Execution Engine.
8. Configure your Data Connector's Data Assets (Splitting, sampling, etc.)
- InferredAssetSqlDataConnector
- ConfiguredAssetSqlDataConnector
- RuntimeDataConnector
In an Inferred Asset Data Connector for SQL data, all the behaviour for how Data Assets are inferred, how they are split into Batches, and how those Batches are sampled for data are configured in the Data Connector dictionary. To alter any of these behaviours from the default, you will simply set your desired behaviour by defining one of the following optional key/value pairs.
Optional Data Connector configuration keys for inferring Data Assets
A Data Asset in an Inferred Asset Data Connector for SQL data will consist of a single table in the database you have connected to. You can modify the way that the Data Connector infers which tables to utilize as Data Assets by defining the following key/value pairs in your Data Connector's dictionary in the Datasource configuration (a sketch using a few of these keys follows the list):
- data_asset_name_prefix: A string describing an optional prefix to prepend to the names of inferred Data Assets.
- data_asset_name_suffix: A string describing an optional suffix to append to the names of inferred Data Assets.
- include_schema_name: A boolean value which answers the question: "Should the data_asset_name include the schema as a prefix?"
- excluded_tables: A list of tables to ignore when inferring Data Assets.
- included_tables: A list that, if defined, will limit the inferred Data Assets to those tables that are included in the list.
- skip_inapplicable_tables: A boolean value. If True, tables that can't be successfully queried using sampling and splitter methods are excluded from inferred Data Assets. If False, the class will throw an error during initialization if any such tables are encountered.
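For instance, a sketch of an Inferred Asset Data Connector entry that prefixes inferred Data Asset names and restricts inference to two hypothetical tables might look like this (the table names are placeholders, and your database may require schema-qualified names):
"name_of_my_inferred_data_connector": {
    "class_name": "InferredAssetSqlDataConnector",
    "data_asset_name_prefix": "taxi__",  # inferred Data Assets are named e.g. taxi__<table_name>
    "include_schema_name": False,
    "included_tables": [
        "yellow_tripdata_sample_2019",  # hypothetical table name
        "yellow_tripdata_sample_2020",  # hypothetical table name
    ],
},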
Optional Data Connector configuration keys for splitting Data Assets into Batches
Next is the matter of how (or even if) your Data Connector splits Data Assets into Batches. By default, each Data Asset will provide a single Batch consisting of the entire table that it corresponds to. You can change this behaviour by specifying the following key/value pairs:
- splitter_method: A string that names the method that will be used to split the target table into multiple Batches.
- splitter_kwargs: A dictionary containing keyword arguments to pass to splitter_method if splitter_method has been defined.
For example, imagine that you have one or more tables containing the NYC taxi data from the getting started tutorial in your database. You could instruct your Data Connector to infer Data Assets that return each table as a single Batch by simply not including a splitter_method. Such a configuration would be identical to the Data Connector name_of_my_inferred_data_connector that was defined in the example at the end of step 7, so let's rename that data_connectors entry to inferred_data_connector_single_batch_asset since that is more meaningful. Your configuration for a single Batch Data Asset would now look like:
"inferred_data_connector_single_batch_asset": {
"class_name": "InferredAssetSqlDataConnector",
},
Alternatively, you could define a Data Asset that is split into Batches based on the year and month by defining the splitter_method to be split_on_year_and_month and providing a Datetime column. (In the case of the NYC taxi data, this would be the pickup_datetime column.) Creating a Data Asset like this would result in your configuration being:
"inferred_data_connector_multi_batch_asset_split_on_date_time": {
"class_name": "InferredAssetSqlDataConnector",
"splitter_method": "split_on_year_and_month",
"splitter_kwargs": {
"column_name": "pickup_datetime",
},
},
If you included both of these Data Assets, your complete configuration would look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"inferred_data_connector_single_batch_asset": {
"class_name": "InferredAssetSqlDataConnector",
},
"inferred_data_connector_multi_batch_asset_split_on_date_time": {
"class_name": "InferredAssetSqlDataConnector",
"splitter_method": "split_on_year_and_month",
"splitter_kwargs": {
"column_name": "pickup_datetime",
},
},
},
}
If you are uncertain whether you should be splitting your Data Assets into Batches, please refer to our guide on how to choose between working with a single or multiple Batches of data.
For more information on the available splitting methods, please see the Splitting methods subsection under Additional notes at the end of this guide.
Optional Data Connector configuration keys for sampling data from returned Batches
Finally, you may wish to only sample a portion of the data that would be returned in your Data Asset's Batches. To do this, you will need to define the optional keys sampling_method and sampling_kwargs. As with splitter_method and splitter_kwargs, defining these key/value pairs in your Data Connector's dictionary will result in those values being applied to all Data Assets that are made available by the Data Connector.
The key/value pairs that are used for sampling data from a Data Asset are:
- sampling_method: A string that names the method that will be used to sample data from returned Batches.
- sampling_kwargs: A dictionary containing keyword arguments to pass to sampling_method if sampling_method has been defined.
Although this guide will not use sampling in its examples, there is a list of the available sampling methods in the Sampling methods subsection of the Additional notes section at the end of this guide.
In a Configured Asset Data Connector for SQL data, each entry in the assets dictionary will correspond to an explicitly defined Data Asset. The key provided will be used as the name of the Data Asset, while the value will be a dictionary that specifies how that Data Asset is defined.
The key/value pairs inside a Data Asset's dictionary are technically all optional. If none of them are defined, the Data Asset will correspond to the table whose name matches the key of that (empty) Data Asset dictionary.
Optional Data Asset configuration keys for Data Asset names and schemas:
You may provide the following key/value pairs in your Data Asset configuration to alter how your Data Asset behaves regarding its associated table:
- table_name: A string that, if defined, specifies the table used for the Data Asset. If this is not defined, the table name defaults to the key corresponding to the Data Asset's dictionary in your configuration.
- schema_name: An optional string that defines the schema for the Data Asset.
- include_schema_name: A boolean value that determines whether the schema_name should be included as a prefix to the Data Asset's name.
For example, imagine that you have a copy of the NYC taxi data from the getting started tutorial in a table called yellow_tripdata_sample_2020, within a schema named main. You could access this data by defining an entry in the assets dictionary like:
"yellow_tripdata_sample_2020_full": {
"table_name": "yellow_tripdata_sample_2020",
"schema_name": "main",
},
Optional Data Asset configuration keys for splitting Data Assets into Batches
Next is the matter of how (or even if) your Data Connector splits Data Assets into Batches. By default, each Data Asset will provide a single Batch consisting of the entire table that it corresponds to. You can change this behaviour by specifying the following key/value pairs:
- splitter_method: A string that names the method that will be used to split the target table into multiple Batches.
- splitter_kwargs: A dictionary containing keyword arguments to pass to splitter_method if splitter_method has been defined.
The configuration we provided above does not include a splitter_method, and therefore will return the entire table as a single Batch.
Alternatively, you could define a Data Asset that is split into Batches based on the year and month by defining the splitter_method to be split_on_year_and_month and providing a Datetime column. (In the case of the NYC taxi data, this would be the pickup_datetime column.) Creating a Data Asset like this would result in the following key/value pair in your assets dictionary:
"yellow_tripdata_sample_2020_by_year_and_month": {
"table_name": "yellow_tripdata_sample_2020",
"schema_name": "main",
"splitter_method": "split_on_year_and_month",
"splitter_kwargs": {
"column_name": "pickup_datetime",
},
},
If you included both of these Data Assets, your complete configuration would look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_configured_data_connector": {
"class_name": "ConfiguredAssetSqlDataConnector",
"assets": {
"yellow_tripdata_sample_2020_full": {
"table_name": "yellow_tripdata_sample_2020",
"schema_name": "main",
},
"yellow_tripdata_sample_2020_by_year_and_month": {
"table_name": "yellow_tripdata_sample_2020",
"schema_name": "main",
"splitter_method": "split_on_year_and_month",
"splitter_kwargs": {
"column_name": "pickup_datetime",
},
},
},
},
},
}
If you are uncertain whether you should be splitting your Data Assets into Batches, please refer to our guide on how to choose between working with a single or multiple Batches of data.
The splitter_method and splitter_kwargs key/value pairs can be defined at either the Data Connector or the Data Asset level. If they are defined at the Data Connector level, they will apply to all Data Assets that do not have an alternative definition in their configuration. If they are defined at the Data Asset level, the Data Asset definition will take precedence over the Data Connector definition.
You can think of the Data Connector level definition of splitter_method and splitter_kwargs as a way to define default values of these keys for your Data Assets. The Data Asset level definitions, then, would be Data Asset specific definitions that overwrite those defaults, as illustrated in the sketch below.
For more information on the available splitting methods, please see the Splitting methods subsection under Additional notes at the end of this guide.
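As an illustrative sketch (not part of the running example), a Data Connector could set a default splitter at the Data Connector level and override it for one specific Data Asset like so:
"name_of_my_configured_data_connector": {
    "class_name": "ConfiguredAssetSqlDataConnector",
    # Data Connector level defaults: applied to every Data Asset below unless overridden
    "splitter_method": "split_on_year_and_month",
    "splitter_kwargs": {"column_name": "pickup_datetime"},
    "assets": {
        "yellow_tripdata_sample_2020_by_year_and_month": {
            "table_name": "yellow_tripdata_sample_2020",
            "schema_name": "main",
        },
        "yellow_tripdata_sample_2020_by_day": {
            "table_name": "yellow_tripdata_sample_2020",
            "schema_name": "main",
            # Data Asset level override: takes precedence over the defaults above
            "splitter_method": "split_on_year_and_month_and_day",
            "splitter_kwargs": {"column_name": "pickup_datetime"},
        },
    },
},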
Optional Data Asset configuration keys for sampling data from returned Batches
Finally, you may wish to only sample a portion of the data that would be returned in your Data Asset's Batches. To do this, you will need to define the optional keys sampling_method and sampling_kwargs. As with splitter_method and splitter_kwargs, defining these key/value pairs in your Data Connector's dictionary will result in those values being applied to all Data Assets that are made available by the Data Connector.
The key/value pairs that are used for sampling data from a Data Asset are:
- sampling_method: A string that names the method that will be used to sample data from returned Batches.
- sampling_kwargs: A dictionary containing keyword arguments to pass to sampling_method if sampling_method has been defined.
As with splitter_method and splitter_kwargs, sampling_method and sampling_kwargs can be defined at either the Data Connector or the Data Asset level. When defined at the Data Connector level, the definition acts as a default that applies to all Data Assets that do not have their own specific definition for the two keys. If defined at the Data Asset level, the values will take precedence over any that are defined in the Data Connector dictionary.
Although this guide will not use sampling in its examples, there is a list of the available sampling methods in the Sampling methods subsection of the Additional notes section at the end of this guide.
Optional Data Asset configuration key for defining introspection behaviour
Finally, there is an optional key that can be defined to alter the default behaviour of introspection methods such as those used by auto-initializing Expectations and Data Assistants. This key is:
- introspection_directives: A dictionary of arguments passed to the introspection method of auto-initializing Expectations and Data Assistants to guide the introspection process.
Similar to the splitting and sampling key/value pairs, this key can be defined at either the Data Connector or the Data Asset level. A definition at the Data Connector level will be applied to all Data Assets, while a definition at the Data Asset level will take precedence over the Data Connector's values.
You will find a list of the valid keys for the introspection_directives dictionary and their corresponding values in the Introspection directives subsection of the Additional notes at the end of this guide.
Runtime Data Connectors put a wrapper around a single Batch of data, and therefore do not support Data Asset configurations that permit the return of more than one Batch of data. In fact, since you will use a Batch Request to pass in or specify the data that a Runtime Data Connector uses, there is no need to specify a Data Asset configuration at all.
Instead, you will provide a batch_identifiers list which will be used to attach identifying information to a returned Batch so that you can reference the same data again in the future.
For this example, let's assume we have the following tables in our database:
yellow_tripdata_sample_2020-01
yellow_tripdata_sample_2020-02
yellow_tripdata_sample_2020-03
With a Runtime Data Connector you won't actually refer to them in your configuration! As mentioned above, you will provide the name of one of those tables to the Data Connector as part of a Batch Request.
Therefore, the table names are inconsequential to your Runtime Data Connector's configuration. In fact, the batch_identifiers that you define in your Runtime Data Connector's configuration can be completely arbitrary. However, it is advised that you name them after something meaningful regarding your data or the circumstances under which you will be accessing your data.
For instance, let's assume you are getting a daily update to your data, and so you are running daily validations. You could then choose to identify your Runtime Data Connector's Batches by the timestamp at which they are requested.
To do this, you would simply add a batch_timestamp entry in your batch_identifiers list. This would look like:
"batch_identifiers": ["batch_timestamp"],
Then, when you create your Batch Request you would populate the batch_timestamp value in its batch_identifiers dictionary with the current date and time. This will attach the current date and time to the returned Batch, allowing you to reference the Batch again in the future even if the current data (the data that would be provided by the Runtime Data Connector if you requested a new Batch) has changed.
The full configuration for your Datasource should now look like:
datasource_config: dict = {
"name": "my_datasource_name",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"class_name": "SqlAlchemyExecutionEngine",
"module_name": "great_expectations.execution_engine",
"connection_string": CONNECTION_STRING,
},
"data_connectors": {
"name_of_my_runtime_data_connector": {
"class_name": "RuntimeDataConnector",
"batch_identifiers": ["batch_timestamp"],
},
},
}
We stated above that the names that you use for your batch_identifiers in a Runtime Data Connector's configuration can be completely arbitrary, and will be used as keys for the batch_identifiers dictionary in future Batch Requests.
However, the same holds true for the values you pass in for each key in your Batch Request's batch_identifiers!
Always make sure that your Batch Requests utilizing Runtime Data Connectors are providing meaningful identifying information, consistent with the keys that are derived from the batch_identifiers you have defined in your Runtime Data Connector's configuration.
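For instance, a Batch Request against this Runtime Data Connector might attach the current timestamp as its batch_timestamp identifier. The sketch below assumes the Datasource configured in this guide, a hypothetical Data Asset name, and a placeholder query; see the guide on how to get one or more Batches of data from a configured Datasource for the full Batch Request workflow:
import datetime
from great_expectations.core.batch import RuntimeBatchRequest

batch_request = RuntimeBatchRequest(
    datasource_name="my_datasource_name",
    data_connector_name="name_of_my_runtime_data_connector",
    data_asset_name="my_runtime_asset_name",  # arbitrary name used to reference this data later
    runtime_parameters={"query": "SELECT * FROM my_table"},  # placeholder query
    batch_identifiers={"batch_timestamp": datetime.datetime.now().isoformat()},
)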
9. Test your configuration with .test_yaml_config(...)
Now that you have a full Datasource configuration, you can confirm that it is valid by testing it with the .test_yaml_config(...) method. To do this, execute the Python code:
data_context.test_yaml_config(yaml.dump(datasource_config))
When executed, test_yaml_config will instantiate the component described by the yaml configuration that is passed in and then run a self check procedure to verify that the component works as expected.
For a Datasource, this includes:
- confirming that the connection works
- gathering a list of available Data Assets
- verifying that at least one Batch can be fetched from the Datasource
For more information on the .test_yaml_config(...) method, please see our guide on how to configure DataContext components using test_yaml_config.
10. (Optional) Add more Data Connectors to your configuration
The data_connectors dictionary in your datasource_config can contain multiple entries. If you want to add additional Data Connectors, just go through the process starting at step 7 again.
11. Add your new Datasource to your Data Context
Now that you have verified that you have a valid configuration you can add your new Datasource to your Data Context with the command:
data_context.add_datasource(**datasource_config)
If the value of datasource_config["name"] corresponds to a Datasource that is already defined in your Data Context, then using the above command will overwrite the existing Datasource.
If you want to ensure that you only add a Datasource when it won't overwrite an existing one, you can use the following code instead:
# add_datasource only if it doesn't already exist in your Data Context
try:
data_context.get_datasource(datasource_config["name"])
except ValueError:
data_context.add_datasource(**datasource_config)
else:
print(
f"The datasource {datasource_config['name']} already exists in your Data Context!"
)
Next steps
Congratulations! You have fully configured a Datasource and verified that it can be used in future workflows to provide a Batch or Batches of data.
For more information on using Batch Requests to retrieve data, please see our guide on how to get one or more Batches of data from a configured Datasource.
You can now move forward and create Expectations for your Datasource.
Additional notes
Splitting methods
- The names of splitter_method values can be specified with or without a preceding underscore.
Available methods for splitter_method values and their configuration parameters:
splitter_method | splitter_kwargs | Returned Batch Data |
---|---|---|
split_on_whole_table | N/A | Identical to original |
split_on_column_value | column_name='col' | Rows where value of column_name are same |
split_on_year | column_name='col' | Rows where the year of a datetime column are the same |
split_on_year_and_month | column_name='col' | Rows where the year and month of a datetime column are the same |
split_on_year_and_month_and_day | column_name='col' | Rows where the year, month and day of a datetime column are the same |
split_on_date_parts | column_name='col', date_parts='<list[DatePart]>' | Rows where the date parts of a datetime column are the same. Date parts can be specified as DatePart objects or as their string equivalent e.g. "year", "month", "week", "day", "hour", "minute", or "second" |
split_on_divided_integer | column_name='col', divisor=<int> | Rows where value of column_name divided (using integral division) by the given divisor are same |
split_on_mod_integer | column_name='col', mod=<int> | Rows where value of column_name divided (using modular division) by the given mod are same |
split_on_multi_column_values | column_names='<list[col]>' | Rows where values of column_names are same |
split_on_converted_datetime | column_name='col', date_format_string=<'%Y-%m-%d'> | Rows where value of column_name converted to datetime using the given date_format_string are same |
split_on_hashed_column | column_name='col', hash_digits=<int> | Rows where value of column_name hashed (using "md5" hash function) and retaining the stated number of hash_digits are same (experimental) |
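For example, reading from the table above, a sketch of a splitter configuration that returns one Batch per distinct value of a hypothetical vendor_id column would be:
"splitter_method": "split_on_column_value",
"splitter_kwargs": {
    "column_name": "vendor_id",  # hypothetical column; one Batch per distinct value
},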
Sampling methods
- The names of sampling_method values can be specified with or without a preceding underscore.
Available methods for sampling_method values and their configuration parameters:
values and their configuration parameters:
sampling_method | sampling_kwargs | Returned Batch Data |
---|---|---|
sample_using_limit | n=num_rows | First up to n (the specified limit parameter) rows of batch |
sample_using_random | p=fraction | Rows selected at random, whose number amounts to selected fraction of total number of rows in batch |
sample_using_mod | column_name='col', mod=<int> | Take the mod of named column, and only keep rows that match the given value |
sample_using_a_list | column_name='col', value_list=<list[val]> | Match the values in the named column against value_list, and only keep the matches |
sample_using_hash | column_name='col', hash_digits=<int>, hash_value=<str> | Hash the values in the named column (using "md5" hash function), and only keep rows that match the given hash_value |
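For example, a sketch of a sampling configuration that keeps at most the first 1000 rows of each returned Batch would pair sample_using_limit with its n keyword argument:
"sampling_method": "sample_using_limit",
"sampling_kwargs": {
    "n": 1000,  # keep at most the first 1000 rows of each Batch
},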
Introspection directives
Valid keys for the introspection_directives dictionary include:
- schema_name: A string describing the schema to introspect (default is None). If this is provided, the introspection will be limited to the specified schema.
- ignore_information_schemas_and_system_tables: A boolean value (default=True) which determines whether to ignore information schemas and system tables when introspecting the database.
- system_tables: An optional list of strings that define system_tables for your database.
- information_schemas: An optional list of strings that define information_schemas for your database.
- include_views: A boolean value (default=True) which determines whether to include views when introspecting the database.
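Putting a few of these keys together, a sketch of an introspection_directives entry that limits introspection to a hypothetical reporting schema and skips views might look like:
"introspection_directives": {
    "schema_name": "reporting",  # hypothetical schema to introspect
    "include_views": False,  # do not include views when introspecting
},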