doc: added best practice section to profiler (#10712)

* doc: added best practice section to profiler

* Update openmetadata-docs/content/connectors/ingestion/workflows/profiler/index.md

Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>

* Update openmetadata-docs/content/connectors/ingestion/workflows/profiler/index.md

Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>

* Update openmetadata-docs/content/connectors/ingestion/workflows/profiler/index.md

Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>

* Update openmetadata-docs/content/connectors/ingestion/workflows/profiler/index.md

Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>

* Update openmetadata-docs/content/connectors/ingestion/workflows/profiler/index.md

Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>

* Update openmetadata-docs/content/connectors/ingestion/workflows/profiler/index.md

Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>

---------

Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>
Teddy 2023-03-22 19:36:42 +01:00 committed by GitHub
parent 51f4e0b170
commit b2e1eed842


@@ -150,3 +150,24 @@ This is a good option if you wish to execute your workflow via the Airflow SDK
- customers
[...]
```
## Profiler Best Practices
When setting up a profiler workflow, it is important to keep in mind that queries will run against your database. Depending on your database engine, you may incur costs (e.g., Google BigQuery, Snowflake). Execution time will also vary depending on your database engine's computing power, the size of the table, and the number of columns. Given these elements, there are a few best practices we recommend you follow.
### 1. Profile What You Need
Profiling every table in your data platform might not be the most efficient approach. A profiled table gives an indication of its structure, which is most valuable for tables where this information matters (e.g., tables used by analysts or data scientists).
When setting up a profiler workflow, you can include or exclude specific databases, schemas, or tables. Using these filters will greatly help you narrow down which tables you want to profile, as sketched below.
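For example, a minimal sketch of filter patterns in the workflow's `sourceConfig` might look like the following (the service and schema names are hypothetical, and exact field availability may vary with your OpenMetadata version):

```yaml
source:
  type: bigquery
  serviceName: my_bigquery_service  # hypothetical service name
  sourceConfig:
    config:
      type: Profiler
      # Only profile the "analytics" schema...
      schemaFilterPattern:
        includes:
          - analytics
      # ...and skip temporary or staging tables
      tableFilterPattern:
        excludes:
          - ^tmp_.*
          - ^stg_.*
```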
### 2. Sampling and Partitioning Your Tables
On a table asset, you can set a sample percentage or number of rows and define partitioning logic. Doing so will significantly reduce the amount of data scanned and the computing power required to perform the different operations.
For sampling, you can also set a sampling percentage at the workflow level, as in the sketch below.
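A minimal sketch of a workflow-level sample percentage (the value and field placement are illustrative and may differ slightly by version):

```yaml
source:
  type: snowflake
  serviceName: my_snowflake_service  # hypothetical service name
  sourceConfig:
    config:
      type: Profiler
      # Profile roughly 20% of each table's rows instead of a full scan
      profileSample: 20
```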
### 3. Excluding/Including Specific Columns/Metrics
By default, the profiler computes all metrics against all columns. This behavior can be fine-tuned to include or exclude only specific columns and specific metrics.
For example, excluding `id` columns will reduce the number of columns against which the metrics are computed.
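As a rough sketch, column- and metric-level tuning can be done in the processor's table configuration (the fully qualified name, column names, and metric names below are hypothetical, and field names may vary by version):

```yaml
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: my_service.my_db.analytics.orders  # hypothetical table
        columnConfig:
          # Skip surrogate-key columns entirely
          excludeColumns:
            - id
          # For the remaining columns of interest, compute only a few metrics
          includeColumns:
            - columnName: amount
              metrics:
                - MIN
                - MAX
                - MEAN
```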
### 4. Set Up Multiple Workflows
If you have a large number of tables you would like to profile, setting up multiple workflows will help distribute the load. It is important, though, to monitor your instance's CPU and memory, as running many workflows simultaneously will require a corresponding amount of resources. One way of splitting the load is sketched below.
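For instance, one hypothetical way to distribute the work is to give each workflow a disjoint filter pattern (the schema names are made up):

```yaml
# Workflow A: profile only the "sales" schema
sourceConfig:
  config:
    type: Profiler
    schemaFilterPattern:
      includes:
        - sales
```

```yaml
# Workflow B: profile only the "marketing" schema
sourceConfig:
  config:
    type: Profiler
    schemaFilterPattern:
      includes:
        - marketing
```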