Profiler
Overview
Definition
A Profiler generates Metrics (a computed attribute of data, such as the mean of a column) and candidate Expectations (verifiable assertions about data) from data.
Features and promises
A Profiler creates a starting point for quickly generating Expectations. For example, during the Getting Started Tutorial, Great Expectations uses the UserConfigurableProfiler to demonstrate important features of Expectations by creating and validating an Expectation Suite (a collection of verifiable assertions about data) that contains several kinds of Expectations built from a small sample of data.
There are several Profilers included with Great Expectations; conceptually, each Profiler is a checklist of questions which will generate an Expectation Suite when asked of a Batch of data.
Relationship to other objects
A Profiler builds an Expectation Suite from one or more Data Assets. Many Profiler workflows will also include a step that Validates the data against the newly generated Expectation Suite (Validation is the act of applying an Expectation Suite to a Batch) to return a Validation Result.
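As a rough sketch of that relationship (assuming you already have a Validator named validator wrapping a Batch of one of your Data Assets; a more complete, Checkpoint-based workflow is shown in the Getting Started Tutorial):

from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler

# Profile the Batch wrapped by the Validator to generate an Expectation Suite
profiler = UserConfigurableProfiler(profile_dataset=validator)
suite = profiler.build_suite()

# Validate the same Batch against the newly generated suite to get a Validation Result
results = validator.validate()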
Use cases
Create Expectations
Profilers come into use when it is time to configure Expectations for your project. At this point in your workflow you can configure a new Profiler, or use an existing one to generate Expectations from a Batch of data (a selection of records from a Data Asset).
For details on how to configure a customized Rule-Based Profiler, see our guide on how to create a new expectation suite using Rule-Based Profilers.
For instructions on how to use a UserConfigurableProfiler to generate Expectations from data, see our guide on how to create and edit Expectations with a Profiler.
Features
Multiple types of Profilers available
There are multiple types of Profilers built into Great Expectations. Below is a list with overviews of each one. For more information, you can view their docstrings and source code in the great_expectations/profile folder on our GitHub.
UserConfigurableProfiler
The UserConfigurableProfiler is used to build an Expectation Suite from a dataset. The Expectations built are strict: they can be used to determine whether two tables are the same. When these Profilers are instantiated they can be configured by providing one or more input configuration parameters, allowing you to rapidly create a Profiler without needing to edit configuration files. However, if you need to change these parameters you will also need to instantiate a new UserConfigurableProfiler using the updated parameters.
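For example (a minimal sketch; the ignored_columns values are illustrative assumptions):

from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler

profiler = UserConfigurableProfiler(profile_dataset=validator, ignored_columns=["comments"])

# To change a parameter, instantiate a new Profiler with the updated value:
profiler = UserConfigurableProfiler(profile_dataset=validator, ignored_columns=["comments", "notes"])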
For instructions on how to use a UserConfigurableProfiler to generate Expectations from data, see our guide on how to create and edit Expectations with a Profiler.
JsonSchemaProfiler
The JsonSchemaProfiler creates Expectation Suites from JSONSchema artifacts. Basic suites can be created from these specifications, with two caveats:
- There is not yet a notion of nested data types in Great Expectations, so suites generated by a JsonSchemaProfiler use column map expectations.
- A JsonSchemaProfiler does not traverse nested schemas and requires a top-level object of type object.
For an example of how to use the JsonSchemaProfiler, see our guide on how to create a new Expectation Suite by profiling from a JsonSchema file.
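As a minimal sketch of that workflow, based on the linked guide (the schema file name and suite name are illustrative assumptions, and the import path may vary between versions):

import json

from great_expectations.profile.json_schema_profiler import JsonSchemaProfiler

# Load a JSONSchema artifact whose top-level object is of type "object"
with open("my_schema.json") as f:
    schema = json.load(f)

# Build a basic Expectation Suite from the schema
profiler = JsonSchemaProfiler()
suite = profiler.profile(schema, "my_json_schema_suite")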
Rule-Based Profiler
Rule-Based Profilers are a newer implementation of Profiler that allows you to directly configure the Profiler through a YAML configuration. Rule-Based Profilers allow you to integrate organizational knowledge about your data into the profiling process. For example, a team might have a convention that all columns named "id" are primary keys, whereas all columns ending with the suffix "_id" are foreign keys. In that case, when the team using Great Expectations first encounters a new dataset that follows the convention, a Profiler could use that knowledge to add an expect_column_values_to_be_unique Expectation to the "id" column (but not, for example, to an "address_id" column).
For details on how to configure a customized Rule-Based Profiler, see our guide on how to create a new expectation suite using Rule-Based Profilers.
API basics
How to access
The recommended workflow for Profilers is to use the UserConfigurableProfiler. Doing so can be as simple as importing it and instantiating a copy by passing a Validator (used to run an Expectation Suite against data) to the class, like so:
from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler
profiler = UserConfigurableProfiler(profile_dataset=validator)
There are additional parameters that can be passed to a UserConfigurableProfiler, all of which are either optional or have a default value (a configuration sketch follows this list). These consist of:
- excluded_expectations: A list of Expectations to not include in the suite.
- ignored_columns: A list of columns for which you would like to NOT create Expectations.
- not_null_only: Boolean, default False. By default, each column is evaluated for nullity. If the column values contain fewer than 50% null values, then the Profiler will add expect_column_values_to_not_be_null; if greater than 50%, it will add expect_column_values_to_be_null. If not_null_only is set to True, the Profiler will add a not_null Expectation irrespective of the percent nullity (and therefore will not add an expect_column_values_to_be_null).
- primary_or_compound_key: A list containing one or more columns which are a dataset's primary or compound key. This will create an expect_column_values_to_be_unique or expect_compound_columns_to_be_unique Expectation. This will occur even if one or more of the primary_or_compound_key columns are specified in ignored_columns.
- semantic_types_dict: A dictionary where the keys are available semantic_types (see profiler.base.ProfilerSemanticTypes) and the values are lists of columns for which you would like to create semantic_type-specific Expectations, e.g.: "semantic_types": {"value_set": ["state", "country"], "numeric": ["age", "amount_due"]}.
- table_expectations_only: Boolean, default False. If True, this will only create the two table-level Expectations available to this Profiler (expect_table_columns_to_match_ordered_list and expect_table_row_count_to_be_between). If a primary_or_compound_key is specified, it will create a uniqueness Expectation for that column as well.
- value_set_threshold: Takes a string from the following ordered list: "none", "one", "two", "very_few", "few", "many", "very_many", "unique". When the Profiler runs without a semantic_types dict, each column is profiled for cardinality. This threshold determines the greatest cardinality for which to add expect_column_values_to_be_in_set. For example, if value_set_threshold is set to "unique", it will add a value_set Expectation for every included column. If set to "few", it will add a value_set Expectation for columns whose cardinality is one of "one", "two", "very_few", or "few". The default value is "many". For the purposes of comparing whether two tables are identical, it might make the most sense to set this to "unique".
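Putting several of these parameters together (a minimal sketch; the column names and parameter values are illustrative assumptions):

from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler

profiler = UserConfigurableProfiler(
    profile_dataset=validator,
    excluded_expectations=["expect_column_values_to_be_in_set"],
    ignored_columns=["comments"],
    primary_or_compound_key=["id"],
    value_set_threshold="few",
)

# build_suite() runs the Profiler's checklist against the Validator's data and returns the suite
suite = profiler.build_suite()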
How to create
It is unlikely that you will need to create a custom Profiler by extending an existing Profiler with a subclass. Instead, you should work with a Rule-Based Profiler, which can be fully configured in a YAML configuration file.
Configuring a custom Rule-Based Profiler is covered in more detail in the Configuration section below. You can also read our guide on how to create a new expectation suite using Rule-Based Profilers to be walked through the process, or view the full source code for that guide on our GitHub as an example.
Configuration
Rule-Based Profilers
Rule-Based Profilers allow users to provide a highly configurable specification, composed of Rules, that is used to build an Expectation Suite by profiling existing data.
Imagine you have a table of Sales that comes in every month. You could profile last month's data, inspecting it in order to automatically create a number of expectations that you can use to validate next month's data.
A Rule in a Rule-Based Profiler could say something like: "Look at every column in my Sales table, and if that column is numeric, add an expect_column_values_to_be_between Expectation to my Expectation Suite, where the min_value for the Expectation is the minimum value for the column, and the max_value for the Expectation is the maximum value for the column."
Each rule in a Rule-Based Profiler has three types of components:
- DomainBuilders: A DomainBuilder will inspect some data that you provide to the Profiler, and compile a list of Domains for which you would like to build Expectations.
- ParameterBuilders: A ParameterBuilder will inspect some data that you provide to the Profiler, and compile a dictionary of Parameters that you can use when constructing your ExpectationConfigurations.
- ExpectationConfigurationBuilders: An ExpectationConfigurationBuilder will take the Domains compiled by the DomainBuilder, and assemble ExpectationConfigurations using Parameters built by the ParameterBuilder.
In the above example, imagine your table of Sales has twenty columns, of which five are numeric:
- Your DomainBuilder would inspect all twenty columns, and then yield a list of the five numeric columns.
- You would specify two ParameterBuilders: one which gets the min of a column, and one which gets the max. Your Profiler would loop over the Domain (or column) list built by the DomainBuilder and use the two ParameterBuilders to get the min and max for each column.
- Then the Profiler loops over the Domains built by the DomainBuilder and uses the ExpectationConfigurationBuilders to add an expect_column_values_to_be_between Expectation for each of these Domains, where the min_value and max_value are the values obtained by the ParameterBuilders.
In addition to Rules, a Rule-Based Profiler enables you to specify Variables, which are global and can be used in any of the Rules. For instance, you may want to reference the same BatchRequest or the same tolerance in multiple Rules, and declaring these as Variables will enable you to do so.
Below is an example configuration based on this discussion:
variables:
  my_last_month_sales_batch_request: # We will use this BatchRequest in our DomainBuilder and both of our ParameterBuilders so we can pinpoint the data to Profile
    datasource_name: my_sales_datasource
    data_connector_name: monthly_sales
    data_asset_name: sales_data
    data_connector_query:
      index: -1
  mostly_default: 0.95 # We can set a variable here that we can reference as the `mostly` value for our expectations below
rules:
  my_rule_for_numeric_columns: # This is the name of our Rule
    domain_builder:
      batch_request: $variables.my_last_month_sales_batch_request # We use the BatchRequest that we specified in Variables above using this $ syntax
      class_name: SemanticTypeColumnDomainBuilder # We use this class of DomainBuilder so we can specify the numeric type below
      semantic_types:
        - numeric
    parameter_builders:
      - parameter_name: my_column_min
        class_name: MetricParameterBuilder
        batch_request: $variables.my_last_month_sales_batch_request
        metric_name: column.min # This is the metric we want to get with this ParameterBuilder
        metric_domain_kwargs: $domain.domain_kwargs # This tells us to use the same Domain that is gotten by the DomainBuilder. We could also put a different column name in here to get a metric for that column instead.
      - parameter_name: my_column_max
        class_name: MetricParameterBuilder
        batch_request: $variables.my_last_month_sales_batch_request
        metric_name: column.max
        metric_domain_kwargs: $domain.domain_kwargs
    expectation_configuration_builders:
      - expectation_type: expect_column_values_to_be_between # This is the name of the expectation that we would like to add to our suite
        class_name: DefaultExpectationConfigurationBuilder
        column: $domain.domain_kwargs.column
        min_value: $parameter.my_column_min.value # We can reference the Parameters created by our ParameterBuilders using the same $ notation that we use to get Variables
        max_value: $parameter.my_column_max.value
        mostly: $variables.mostly_default
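If you would rather drive a configuration like this from Python, the linked Rule-Based Profiler guide walks through the full workflow. Very roughly, and noting that the RuleBasedProfiler import path, constructor, and run() signature have changed between Great Expectations versions (treat every call below as an assumption to verify against the guide for your version):

from ruamel.yaml import YAML

import great_expectations as ge
from great_expectations.rule_based_profiler import RuleBasedProfiler

yaml = YAML()
data_context = ge.get_context()

# profiler_config holds the YAML shown above as a string
config = yaml.load(profiler_config)

profiler = RuleBasedProfiler(
    name="my_rule_based_profiler",  # hypothetical name
    config_version=1.0,
    variables=config.get("variables"),
    rules=config["rules"],
    data_context=data_context,
)

suite = profiler.run(expectation_suite_name="my_new_suite")  # exact signature varies by version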