Doc: Sample data Removal (#21709)

Co-authored-by: Rounak Dhillon <rounakdhillon@Rounaks-MacBook-Air.local>
Rounak Dhillon 2025-06-11 17:56:49 +05:30 committed by GitHub
parent a618a273f2
commit 2045a4be2f
GPG Key ID: B5690EEEBB952194
12 changed files with 18 additions and 132 deletions

@@ -860,8 +860,6 @@ site_menu:
url: /how-to-guides/data-quality-observability/profiler/metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Custom Metrics
url: /how-to-guides/data-quality-observability/profiler/custom-metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Sample Data
url: /how-to-guides/data-quality-observability/profiler/external-sample-data
- category: How-to Guides / Data Quality and Observability / Data Profiler / External Workflow
url: /how-to-guides/data-quality-observability/profiler/external-workflow
- category: How-to Guides / Data Quality and Observability / Data Observability
@@ -957,6 +955,8 @@ site_menu:
url: /how-to-guides/data-governance/classification/auto-classification/external-workflow
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Auto PII Tagging
url: /how-to-guides/data-governance/classification/auto-classification/auto-pii-tagging
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Sample Data
url: /how-to-guides/data-governance/classification/auto-classification/external-sample-data
- category: How-to Guides / Data Governance / Classification / What are Tiers
url: /how-to-guides/data-governance/classification/tiers
- category: How-to Guides / Data Governance / Classification / Best Practices for Classification

@@ -1,6 +1,6 @@
---
title: External Storage for Sample Data
slug: /how-to-guides/data-quality-observability/profiler/external-sample-data
slug: /how-to-guides/data-governance/classification/auto-classification/external-sample-data
---
# External Storage for Sample Data

@@ -83,10 +83,7 @@ This Flag is useful in scenarios when you have different schemas with same name
**Compute Metrics**
Set the Compute Metrics toggle off to not perform any metric computation during the profiler ingestion workflow. Used in combination with Ingest Sample Data toggle on allows you to only ingest sample data.
**Advanced Configuration**
**PII Inference Confidence LevelConfidence (Optional)**
If `Auto PII Tagging` is enable, this confidence level will determine the threshold to use for OpenMetadata's NLP model to consider a column as containing PII data.
**Advanced Configuration**
**Sample Data Rows Count**
Set the number of rows to ingest when Ingest Sample Data toggle is on. Defaults to 50.
@@ -124,9 +121,6 @@ Set the sample to be use by the profiler for the specific table.
⚠️ This option is currently not support for Druid. Sampling leverage `RANDOM` functions in most database (some have specific sampling functions) and Druid provides neither of these option. We recommend using the partitioning or sample query option if you need to limit the amount of data scanned.
**Profile Sample Query**
Use a query to sample data for the profiler. This will overwrite any profle sample set.
**Enable Column Profile**
This setting allows user to exclude or include specific columns and metrics from the profiler.
@@ -198,15 +192,6 @@ This is a sample config for the profiler:
{% codeInfoContainer %}
{% codeInfo srNumber=10 %}
#### Source Configuration - Source Config
You can find all the definitions and types for the `sourceConfig` [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/databaseServiceProfilerPipeline.json).
**generateSampleData**: Option to turn on/off generating sample data.
{% /codeInfo %}
{% codeInfo srNumber=22 %}
**computeMetrics**: Option to turn on/off computing profiler metrics. This flag is useful when you want to only ingest the sample data with the profiler workflow and not any other information.
@@ -226,19 +211,6 @@ You can find all the definitions and types for the `sourceConfig` [here](https:
{% /codeInfo %}
{% codeInfo srNumber=13 %}
**processPiiSensitive**: Optional configuration to automatically tag columns that might contain sensitive information.
{% /codeInfo %}
{% codeInfo srNumber=14 %}
**confidence**: Set the Confidence value for which you want the column to be marked
{% /codeInfo %}
{% codeInfo srNumber=15 %}
**timeoutSeconds**: Profiler Timeout in Seconds
@@ -305,9 +277,6 @@ source:
type: Profiler
```
```yaml {% srNumber=10 %}
generateSampleData: true
```
```yaml {% srNumber=22 %}
computeMetrics: true
```
@@ -317,12 +286,6 @@ source:
```yaml {% srNumber=12 %}
# threadCount: 5
```
```yaml {% srNumber=13 %}
processPiiSensitive: false
```
```yaml {% srNumber=14 %}
# confidence: 80
```
```yaml {% srNumber=15 %}
# timeoutSeconds: 43200
```
@@ -363,7 +326,6 @@ processor:
# profileSample: <number between 0 and 99> # default
# profileSample: <number between 0 and 99> # default will be 100 if omitted
# profileQuery: <query to use for sampling data for the profiler>
# columnConfig:
# excludeColumns:
# - <column name>
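
For anyone carrying an older profiler YAML forward, the keys whose documentation the hunks above drop from the profiler workflow page (`generateSampleData`, `processPiiSensitive`, `confidence`) can be flagged with a small check. This is a minimal sketch, not part of the commit: it assumes the `sourceConfig` section has already been loaded into a Python dict with its options under a `config` key, and the function name and example dict are hypothetical. The commit does not say whether the ingestion framework rejects or ignores these keys.

```python
# Keys whose documentation is removed from the profiler workflow page in this commit.
REMOVED_PROFILER_KEYS = ("generateSampleData", "processPiiSensitive", "confidence")


def find_removed_keys(source_config: dict) -> list[str]:
    """Return the removed keys still present in a profiler sourceConfig dict.

    Assumes (hypothetically) that the options live under a "config" key.
    """
    config = source_config.get("config", {})
    return [key for key in REMOVED_PROFILER_KEYS if key in config]


# Hypothetical, already-parsed sourceConfig for illustration only.
example = {
    "config": {
        "type": "Profiler",
        "generateSampleData": True,
        "computeMetrics": True,
        "processPiiSensitive": False,
    }
}

print(find_removed_keys(example))
```

Any key the check reports is a candidate to move to the auto-classification workflow, which is where this commit relocates the sample data page.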

@@ -993,8 +993,6 @@ site_menu:
url: /how-to-guides/data-quality-observability/profiler/metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Custom Metrics
url: /how-to-guides/data-quality-observability/profiler/custom-metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Sample Data
url: /how-to-guides/data-quality-observability/profiler/external-sample-data
- category: How-to Guides / Data Quality and Observability / Data Profiler / External Workflow
url: /how-to-guides/data-quality-observability/profiler/external-workflow
- category: How-to Guides / Data Quality and Observability / Data Observability
@@ -1076,6 +1074,8 @@ site_menu:
url: /how-to-guides/data-governance/classification/auto-classification/external-workflow
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Auto PII Tagging
url: /how-to-guides/data-governance/classification/auto-classification/auto-pii-tagging
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Sample Data
url: /how-to-guides/data-governance/classification/auto-classification/external-sample-data
- category: How-to Guides / Data Governance / Classification / What are Tiers
url: /how-to-guides/data-governance/classification/tiers
- category: How-to Guides / Data Governance / Classification / Best Practices for Classification

@@ -902,8 +902,6 @@ site_menu:
url: /how-to-guides/data-quality-observability/profiler/metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Custom Metrics
url: /how-to-guides/data-quality-observability/profiler/custom-metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Sample Data
url: /how-to-guides/data-quality-observability/profiler/external-sample-data
- category: How-to Guides / Data Quality and Observability / Data Profiler / External Workflow
url: /how-to-guides/data-quality-observability/profiler/auto-pii-tagging
- category: How-to Guides / Data Quality and Observability / Data Observability
@@ -1001,6 +999,8 @@ site_menu:
url: /how-to-guides/data-governance/classification/auto-classification/external-workflow
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Auto PII Tagging
url: /how-to-guides/data-governance/classification/auto-classification/auto-pii-tagging
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Sample Data
url: /how-to-guides/data-governance/classification/auto-classification/external-sample-data
- category: How-to Guides / Data Governance / Classification / What are Tiers
url: /how-to-guides/data-governance/classification/tiers
- category: How-to Guides / Data Governance / Classification / Best Practices for Classification

@@ -1,6 +1,6 @@
---
title: External Storage for Sample Data
slug: /how-to-guides/data-quality-observability/profiler/external-sample-data
slug: /how-to-guides/data-governance/classification/auto-classification/external-sample-data
---
# External Storage for Sample Data

@@ -83,10 +83,7 @@ This Flag is useful in scenarios when you have different schemas with same name
**Compute Metrics**
Set the Compute Metrics toggle off to not perform any metric computation during the profiler ingestion workflow. Used in combination with Ingest Sample Data toggle on allows you to only ingest sample data.
**Advanced Configuration**
**PII Inference Confidence LevelConfidence (Optional)**
If `Auto PII Tagging` is enable, this confidence level will determine the threshold to use for OpenMetadata's NLP model to consider a column as containing PII data.
**Advanced Configuration**
**Sample Data Rows Count**
Set the number of rows to ingest when Ingest Sample Data toggle is on. Defaults to 50.
@@ -124,9 +121,6 @@ Set the sample to be use by the profiler for the specific table.
⚠️ This option is currently not support for Druid. Sampling leverage `RANDOM` functions in most database (some have specific sampling functions) and Druid provides neither of these option. We recommend using the partitioning or sample query option if you need to limit the amount of data scanned.
**Profile Sample Query**
Use a query to sample data for the profiler. This will overwrite any profle sample set.
**Enable Column Profile**
This setting allows user to exclude or include specific columns and metrics from the profiler.
@@ -198,15 +192,6 @@ This is a sample config for the profiler:
{% codeInfoContainer %}
{% codeInfo srNumber=10 %}
#### Source Configuration - Source Config
You can find all the definitions and types for the `sourceConfig` [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/databaseServiceProfilerPipeline.json).
**generateSampleData**: Option to turn on/off generating sample data.
{% /codeInfo %}
{% codeInfo srNumber=22 %}
**computeMetrics**: Option to turn on/off computing profiler metrics. This flag is useful when you want to only ingest the sample data with the profiler workflow and not any other information.
@@ -226,19 +211,6 @@ You can find all the definitions and types for the `sourceConfig` [here](https:
{% /codeInfo %}
{% codeInfo srNumber=13 %}
**processPiiSensitive**: Optional configuration to automatically tag columns that might contain sensitive information.
{% /codeInfo %}
{% codeInfo srNumber=14 %}
**confidence**: Set the Confidence value for which you want the column to be marked
{% /codeInfo %}
{% codeInfo srNumber=15 %}
**timeoutSeconds**: Profiler Timeout in Seconds
@@ -305,9 +277,6 @@ source:
type: Profiler
```
```yaml {% srNumber=10 %}
generateSampleData: true
```
```yaml {% srNumber=22 %}
computeMetrics: true
```
@@ -317,12 +286,6 @@ source:
```yaml {% srNumber=12 %}
# threadCount: 5
```
```yaml {% srNumber=13 %}
processPiiSensitive: false
```
```yaml {% srNumber=14 %}
# confidence: 80
```
```yaml {% srNumber=15 %}
# timeoutSeconds: 43200
```
@@ -362,8 +325,7 @@ processor:
# - fullyQualifiedName: <table fqn>
# profileSample: <number between 0 and 99> # default
# profileSample: <number between 0 and 99> # default will be 100 if omitted
# profileQuery: <query to use for sampling data for the profiler>
# profileSample: <number between 0 and 99> # default will be 100 if omitted
# columnConfig:
# excludeColumns:
# - <column name>

@@ -1019,8 +1019,6 @@ site_menu:
url: /how-to-guides/data-quality-observability/profiler/metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Custom Metrics
url: /how-to-guides/data-quality-observability/profiler/custom-metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Sample Data
url: /how-to-guides/data-quality-observability/profiler/external-sample-data
- category: How-to Guides / Data Quality and Observability / Data Profiler / External Workflow
url: /how-to-guides/data-quality-observability/profiler/external-workflow
- category: How-to Guides / Data Quality and Observability / Data Observability
@@ -1104,6 +1102,8 @@ site_menu:
url: /how-to-guides/data-governance/classification/auto-classification/external-workflow
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Auto PII Tagging
url: /how-to-guides/data-governance/classification/auto-classification/auto-pii-tagging
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Sample Data
url: /how-to-guides/data-governance/classification/auto-classification/external-sample-data
- category: How-to Guides / Data Governance / Classification / What are Tiers
url: /how-to-guides/data-governance/classification/tiers
- category: How-to Guides / Data Governance / Classification / Best Practices for Classification

@@ -902,8 +902,6 @@ site_menu:
url: /how-to-guides/data-quality-observability/profiler/metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Custom Metrics
url: /how-to-guides/data-quality-observability/profiler/custom-metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Sample Data
url: /how-to-guides/data-quality-observability/profiler/external-sample-data
- category: How-to Guides / Data Quality and Observability / Data Profiler / External Workflow
url: /how-to-guides/data-quality-observability/profiler/auto-pii-tagging
- category: How-to Guides / Data Quality and Observability / Data Observability
@@ -1001,6 +999,8 @@ site_menu:
url: /how-to-guides/data-governance/classification/auto-classification/external-workflow
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Auto PII Tagging
url: /how-to-guides/data-governance/classification/auto-classification/auto-pii-tagging
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Sample Data
url: /how-to-guides/data-governance/classification/auto-classification/external-sample-data
- category: How-to Guides / Data Governance / Classification / What are Tiers
url: /how-to-guides/data-governance/classification/tiers
- category: How-to Guides / Data Governance / Classification / Best Practices for Classification

@@ -1,6 +1,6 @@
---
title: External Storage for Sample Data
slug: /how-to-guides/data-quality-observability/profiler/external-sample-data
slug: /how-to-guides/data-governance/classification/auto-classification/external-sample-data
---
# External Storage for Sample Data

@@ -85,9 +85,6 @@ Set the Compute Metrics toggle off to not perform any metric computation during
**Advanced Configuration**
**PII Inference Confidence LevelConfidence (Optional)**
If `Auto PII Tagging` is enable, this confidence level will determine the threshold to use for OpenMetadata's NLP model to consider a column as containing PII data.
**Sample Data Rows Count**
Set the number of rows to ingest when Ingest Sample Data toggle is on. Defaults to 50.
@@ -124,9 +121,6 @@ Set the sample to be use by the profiler for the specific table.
⚠️ This option is currently not support for Druid. Sampling leverage `RANDOM` functions in most database (some have specific sampling functions) and Druid provides neither of these option. We recommend using the partitioning or sample query option if you need to limit the amount of data scanned.
**Profile Sample Query**
Use a query to sample data for the profiler. This will overwrite any profle sample set.
**Enable Column Profile**
This setting allows user to exclude or include specific columns and metrics from the profiler.
@@ -198,15 +192,6 @@ This is a sample config for the profiler:
{% codeInfoContainer %}
{% codeInfo srNumber=10 %}
#### Source Configuration - Source Config
You can find all the definitions and types for the `sourceConfig` [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/databaseServiceProfilerPipeline.json).
**generateSampleData**: Option to turn on/off generating sample data.
{% /codeInfo %}
{% codeInfo srNumber=22 %}
**computeMetrics**: Option to turn on/off computing profiler metrics. This flag is useful when you want to only ingest the sample data with the profiler workflow and not any other information.
@@ -226,19 +211,6 @@ You can find all the definitions and types for the `sourceConfig` [here](https:
{% /codeInfo %}
{% codeInfo srNumber=13 %}
**processPiiSensitive**: Optional configuration to automatically tag columns that might contain sensitive information.
{% /codeInfo %}
{% codeInfo srNumber=14 %}
**confidence**: Set the Confidence value for which you want the column to be marked
{% /codeInfo %}
{% codeInfo srNumber=15 %}
**timeoutSeconds**: Profiler Timeout in Seconds
@@ -305,9 +277,6 @@ source:
type: Profiler
```
```yaml {% srNumber=10 %}
generateSampleData: true
```
```yaml {% srNumber=22 %}
computeMetrics: true
```
@@ -317,12 +286,6 @@ source:
```yaml {% srNumber=12 %}
# threadCount: 5
```
```yaml {% srNumber=13 %}
processPiiSensitive: false
```
```yaml {% srNumber=14 %}
# confidence: 80
```
```yaml {% srNumber=15 %}
# timeoutSeconds: 43200
```
@@ -363,7 +326,6 @@ processor:
# profileSample: <number between 0 and 99> # default
# profileSample: <number between 0 and 99> # default will be 100 if omitted
# profileQuery: <query to use for sampling data for the profiler>
# columnConfig:
# excludeColumns:
# - <column name>

@@ -1025,8 +1025,6 @@ site_menu:
url: /how-to-guides/data-quality-observability/profiler/metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Custom Metrics
url: /how-to-guides/data-quality-observability/profiler/custom-metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Sample Data
url: /how-to-guides/data-quality-observability/profiler/external-sample-data
- category: How-to Guides / Data Quality and Observability / Data Profiler / External Workflow
url: /how-to-guides/data-quality-observability/profiler/external-workflow
- category: How-to Guides / Data Quality and Observability / Data Observability
@@ -1110,6 +1108,8 @@ site_menu:
url: /how-to-guides/data-governance/classification/auto-classification/external-workflow
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Auto PII Tagging
url: /how-to-guides/data-governance/classification/auto-classification/auto-pii-tagging
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Sample Data
url: /how-to-guides/data-governance/classification/auto-classification/external-sample-data
- category: How-to Guides / Data Governance / Classification / What are Tiers
url: /how-to-guides/data-governance/classification/tiers
- category: How-to Guides / Data Governance / Classification / Best Practices for Classification