Add freshness documentation (#9449)
@@ -13,9 +13,15 @@ After the metadata ingestion has been done correctly, we can configure and deplo

This Pipeline will be in charge of feeding the Profiler tab of the Table Entity, as well as running any tests configured in the Entity.

<Image
  src={"/images/openmetadata/ingestion/workflows/profiler/profiler-summary-table.png"}
  alt="Table profile summary page"
  caption="Table profile summary page"
/>

<Image
  src={"/images/openmetadata/ingestion/workflows/profiler/profiler-summary-colomn.png"}
  alt="Column profile summary page"
  caption="Column profile summary page"
/>

@@ -43,21 +49,26 @@ Define the name of the Profiler Workflow. While we only support a single workflo

As profiling is a costly task, this enables a fine-grained approach to profiling and running tests by specifying different filters for each pipeline.

**Database filter pattern (Optional)**
Regex to filter databases.

**Schema filter pattern (Optional)**
Regex to filter schemas.

**Table filter pattern (Optional)**
Regex to filter tables.

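When the same workflow is defined in YAML, these filters can be expressed in the profiler `sourceConfig`, roughly as in the sketch below (the regex values are placeholders):

```yaml
# Sketch of the profiler pipeline sourceConfig with filter patterns.
# The regex values are placeholders; adjust them to your own naming.
sourceConfig:
  config:
    type: Profiler
    databaseFilterPattern:
      includes:
        - analytics_.*
    schemaFilterPattern:
      excludes:
        - information_schema.*
    tableFilterPattern:
      includes:
        - orders
        - customers
```
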
**Profile Sample (Optional)**
Set the sample to be used by the profiler for the specific table.
- `Percentage`: Value must be between 0 and 100 exclusive (0 < percentage < 100). The table will be sampled based on a percentage of its rows.
- `Row Count`: The table will be sampled based on a number of rows (e.g. `1,000`, `2,000`).

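In YAML, the sampling choice can be sketched as below; the `profileSampleType` and `profileSample` field names are assumed from the profiler pipeline configuration and should be verified against your release:

```yaml
# Sketch: sampling options in the profiler pipeline sourceConfig.
sourceConfig:
  config:
    type: Profiler
    # Percentage-based sampling (0 < value < 100)
    profileSampleType: PERCENTAGE
    profileSample: 50
    # Or row-count-based sampling:
    # profileSampleType: ROWS
    # profileSample: 1000
```
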
**Thread Count (Optional)**
Number of threads to use when computing metrics for the profiler. For Snowflake users we recommend setting it to 1. There is a known issue with one of the dependencies (`snowflake-connector-python`) affecting projects with certain environments.

**Timeout in Seconds (Optional)**
This will set the duration a profiling job against a table should wait before interrupting its execution and moving on to profiling the next table. Note that the profiler will wait for the hanging query to terminate before killing the execution. If there is a risk of your profiling job hanging, it is important to also set a query/connection timeout on your database engine. The default value for the profiler timeout is 12 hours.

**Ingest Sample Data**
Whether the profiler should ingest sample data.

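A rough YAML equivalent of these options, assuming the `threadCount`, `timeoutSeconds` and `generateSampleData` fields of the profiler pipeline configuration:

```yaml
# Sketch: performance and sample-data options for the profiler pipeline.
sourceConfig:
  config:
    type: Profiler
    threadCount: 1          # recommended value for Snowflake
    timeoutSeconds: 43200   # 12 hours, the default
    generateSampleData: true
```
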
@@ -80,14 +91,29 @@ Once you have created your profiler you can adjust some behavior at the table le
/>

#### Profiler Options
**Profile Sample**
Set the sample to be used by the profiler for the specific table. This overrides the sample set at the workflow level.
- `Percentage`: Value must be between 0 and 100 exclusive (0 < percentage < 100). The table will be sampled based on a percentage of its rows.
- `Row Count`: The table will be sampled based on a number of rows (e.g. `1,000`, `2,000`).

**Profile Sample Query**
Use a query to sample data for the profiler. This will overwrite any profile sample set.

**Enable Column Profile**
This setting allows users to exclude or include specific columns and metrics from the profiler.

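If you run the workflow from YAML rather than the UI, these table-level options can usually be set through the `orm-profiler` processor. The sketch below is illustrative; the table FQN, column names and query are made up:

```yaml
# Illustrative table-level overrides in the orm-profiler processor config.
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: my_service.my_db.my_schema.orders   # hypothetical table FQN
        profileSample: 50                                        # overrides the workflow-level sample
        profileQuery: SELECT * FROM orders WHERE region = 'EU'   # sample query, overrides any profile sample
        columnConfig:
          excludeColumns:
            - internal_notes
          includeColumns:
            - columnName: amount
              metrics: ["MIN", "MAX", "MEAN"]
```
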
**Enable Partition**
If your table includes a timestamp, date or datetime column type, you can enable partitioning. If enabled, the profiler will fetch the last `<interval>` `<interval unit>` of data to profile the table. Note that if "profile sample" is set, this configuration will be used against the partitioned data and not the whole table.
- `Column Name`: the name of the column that will be used as the partition field
- `Interval Type`:
  - `TIME-UNIT`: a business logical timestamp/date/datetime (e.g. order date, sign up datetime, etc.)
  - `INGESTION-TIME`: a process logical timestamp/date/datetime (i.e. when the data was ingested in the table)
- `Interval`: the interval value (e.g. `1`, `2`, etc.)
- `Interval Unit`:
  - `HOUR`
  - `DAY`
  - `MONTH`
  - `YEAR`

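A corresponding partition block might look like the sketch below. The field names mirror the UI options above (Column Name, Interval Type, Interval, Interval Unit) and are assumptions to check against the JSON Schema of your release:

```yaml
# Illustrative partition settings for a table entry in tableConfig.
# Field names mirror the UI options and should be verified for your version.
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: my_service.my_db.my_schema.orders
        partitionConfig:
          partitionColumnName: order_date     # Column Name
          partitionIntervalType: TIME-UNIT    # Interval Type
          partitionInterval: 1                # Interval
          partitionIntervalUnit: DAY          # Interval Unit
```
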
## YAML Configuration
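For orientation, a profiler workflow definition follows this overall shape; all values below are placeholders and the security settings depend on your auth provider:

```yaml
# Minimal skeleton of a profiler workflow definition (placeholder values).
source:
  type: snowflake            # your connector type
  serviceName: my_service    # the service created during metadata ingestion
  sourceConfig:
    config:
      type: Profiler
processor:
  type: orm-profiler
  config: {}
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    # plus the securityConfig required by your auth provider
```
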
@@ -11,7 +11,6 @@ A Metric is a computation that we can run on top of a Table or Column to receive

* **Metrics** define the queries and computations generically. They do not aim at specific columns or database dialects. Instead, they are expressions built with SQLAlchemy that should run everywhere.
* A **Profiler** is the binding between a set of metrics and the external world. The Profiler contains the Table and Session information and is in charge of executing the metrics.
* A **Test Case** adds logic to the Metrics results. A Metric is neither good nor bad, so we need the Test definitions to map results into Success or Failures.

On this page, you will learn all the metrics that we currently support and their meaning. We will base all the naming on the definitions in the JSON Schemas.

@@ -35,6 +34,15 @@ It computes the number of rows in the Table.

Returns the number of columns in the Table.

## System Metrics
System metrics are metrics related to DML operations performed on the table. These metrics are available for BigQuery, Redshift and Snowflake only. Other database engines are currently not supported, so the computation of the system metrics will be skipped.

### DML Operations
This metric shows all the DML operations performed (`INSERT`, `UPDATE`, `DELETE`) against the table in a time series fashion.

### Rows Affected by the DML Operation
This metric shows the number of rows that were affected by a DML operation (`INSERT`, `UPDATE`, `DELETE`) over time.

## Column Metrics

List of Metrics that we run for all the columns.

@@ -114,6 +122,28 @@ Only for numerical values. Returns the standard deviation.

The histogram returns a dictionary of the different bins and the number of values found for each bin.

## Grant Access to Users for System Metrics
OpenMetadata uses system tables to compute system metrics. You can find the required access as well as more details for your database engine below.
### Snowflake
OpenMetadata uses the `QUERY_HISTORY_BY_WAREHOUSE` view of the `INFORMATION_SCHEMA` to collect metrics about DML operations. To collect the number of rows affected by each operation, the query ID will be passed to the `RESULT_SCAN` function. You need to make sure the user running the profiler workflow has access to this view and this function.

OpenMetadata will look at the past 24 hours to fetch the operations that were performed against a table.

### Redshift
OpenMetadata uses `stl_insert`, `stl_delete`, `svv_table_info`, and `stl_querytext` to fetch DML operations as well as the number of rows affected by these operations. You need to make sure the user running the profiler workflow has access to these views and tables.

OpenMetadata will look at the previous day to fetch the operations that were performed against a table.

### BigQuery
BigQuery uses the `JOBS` table of the `INFORMATION_SCHEMA` to fetch DML operations as well as the number of rows affected by these operations. You will need to make sure your data location is properly set when creating your BigQuery service connection in OpenMetadata.

OpenMetadata will look at the previous day to fetch the operations that were performed against a table, filtering on the `creation_time` partition field to limit the size of data scanned.

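As an illustration, the data location is set on the BigQuery service connection; the sketch below assumes the connection's `usageLocation` field and should be verified against your version:

```yaml
# Sketch: setting the data location on the BigQuery service connection.
serviceConnection:
  config:
    type: BigQuery
    usageLocation: us   # region assumed to be used when querying INFORMATION_SCHEMA.JOBS
```
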
## Reach out!
Is there any metric you'd like to see? Open an [issue](https://github.com/open-metadata/OpenMetadata/issues/new/choose) or reach out on [Slack](https://slack.open-metadata.org).