mirror of https://github.com/open-metadata/OpenMetadata.git
synced 2025-11-02 19:48:17 +00:00
Doc: Sample data Removal (#21709)
Co-authored-by: Rounak Dhillon <rounakdhillon@Rounaks-MacBook-Air.local>
parent a618a273f2
commit 2045a4be2f
@@ -860,8 +860,6 @@ site_menu:
url: /how-to-guides/data-quality-observability/profiler/metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Custom Metrics
url: /how-to-guides/data-quality-observability/profiler/custom-metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Sample Data
url: /how-to-guides/data-quality-observability/profiler/external-sample-data
- category: How-to Guides / Data Quality and Observability / Data Profiler / External Workflow
url: /how-to-guides/data-quality-observability/profiler/external-workflow
- category: How-to Guides / Data Quality and Observability / Data Observability
@@ -957,6 +955,8 @@ site_menu:
url: /how-to-guides/data-governance/classification/auto-classification/external-workflow
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Auto PII Tagging
url: /how-to-guides/data-governance/classification/auto-classification/auto-pii-tagging
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Sample Data
url: /how-to-guides/data-governance/classification/auto-classification/external-sample-data
- category: How-to Guides / Data Governance / Classification / What are Tiers
url: /how-to-guides/data-governance/classification/tiers
- category: How-to Guides / Data Governance / Classification / Best Practices for Classification

@@ -1,6 +1,6 @@
---
title: External Storage for Sample Data
slug: /how-to-guides/data-quality-observability/profiler/external-sample-data
slug: /how-to-guides/data-governance/classification/auto-classification/external-sample-data
---

# External Storage for Sample Data
@@ -83,10 +83,7 @@ This Flag is useful in scenarios when you have different schemas with same name
**Compute Metrics**
Set the Compute Metrics toggle off to not perform any metric computation during the profiler ingestion workflow. Used in combination with Ingest Sample Data toggle on allows you to only ingest sample data.

**Advanced Configuration**

**PII Inference Confidence LevelConfidence (Optional)**
If `Auto PII Tagging` is enable, this confidence level will determine the threshold to use for OpenMetadata's NLP model to consider a column as containing PII data.
**Advanced Configuration**

**Sample Data Rows Count**
Set the number of rows to ingest when Ingest Sample Data toggle is on. Defaults to 50.
@@ -124,9 +121,6 @@ Set the sample to be use by the profiler for the specific table.

⚠️ This option is currently not support for Druid. Sampling leverage `RANDOM` functions in most database (some have specific sampling functions) and Druid provides neither of these option. We recommend using the partitioning or sample query option if you need to limit the amount of data scanned.

**Profile Sample Query**
Use a query to sample data for the profiler. This will overwrite any profle sample set.

**Enable Column Profile**
This setting allows user to exclude or include specific columns and metrics from the profiler.

@@ -198,15 +192,6 @@ This is a sample config for the profiler:

{% codeInfoContainer %}

{% codeInfo srNumber=10 %}
#### Source Configuration - Source Config

You can find all the definitions and types for the `sourceConfig` [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/databaseServiceProfilerPipeline.json).

**generateSampleData**: Option to turn on/off generating sample data.

{% /codeInfo %}

{% codeInfo srNumber=22 %}

**computeMetrics**: Option to turn on/off computing profiler metrics. This flag is useful when you want to only ingest the sample data with the profiler workflow and not any other information.
@@ -226,19 +211,6 @@ You can find all the definitions and types for the `sourceConfig` [here](https:

{% /codeInfo %}

{% codeInfo srNumber=13 %}

**processPiiSensitive**: Optional configuration to automatically tag columns that might contain sensitive information.

{% /codeInfo %}

{% codeInfo srNumber=14 %}

**confidence**: Set the Confidence value for which you want the column to be marked

{% /codeInfo %}


{% codeInfo srNumber=15 %}

**timeoutSeconds**: Profiler Timeout in Seconds
@@ -305,9 +277,6 @@ source:
type: Profiler
```

```yaml {% srNumber=10 %}
generateSampleData: true
```
```yaml {% srNumber=22 %}
computeMetrics: true
```
@@ -317,12 +286,6 @@ source:
```yaml {% srNumber=12 %}
# threadCount: 5
```
```yaml {% srNumber=13 %}
processPiiSensitive: false
```
```yaml {% srNumber=14 %}
# confidence: 80
```
```yaml {% srNumber=15 %}
# timeoutSeconds: 43200
```
@@ -363,7 +326,6 @@ processor:
# profileSample: <number between 0 and 99> # default

# profileSample: <number between 0 and 99> # default will be 100 if omitted
# profileQuery: <query to use for sampling data for the profiler>
# columnConfig:
# excludeColumns:
# - <column name>

@@ -993,8 +993,6 @@ site_menu:
url: /how-to-guides/data-quality-observability/profiler/metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Custom Metrics
url: /how-to-guides/data-quality-observability/profiler/custom-metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Sample Data
url: /how-to-guides/data-quality-observability/profiler/external-sample-data
- category: How-to Guides / Data Quality and Observability / Data Profiler / External Workflow
url: /how-to-guides/data-quality-observability/profiler/external-workflow
- category: How-to Guides / Data Quality and Observability / Data Observability
@@ -1076,6 +1074,8 @@ site_menu:
url: /how-to-guides/data-governance/classification/auto-classification/external-workflow
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Auto PII Tagging
url: /how-to-guides/data-governance/classification/auto-classification/auto-pii-tagging
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Sample Data
url: /how-to-guides/data-governance/classification/auto-classification/external-sample-data
- category: How-to Guides / Data Governance / Classification / What are Tiers
url: /how-to-guides/data-governance/classification/tiers
- category: How-to Guides / Data Governance / Classification / Best Practices for Classification

@@ -902,8 +902,6 @@ site_menu:
url: /how-to-guides/data-quality-observability/profiler/metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Custom Metrics
url: /how-to-guides/data-quality-observability/profiler/custom-metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Sample Data
url: /how-to-guides/data-quality-observability/profiler/external-sample-data
- category: How-to Guides / Data Quality and Observability / Data Profiler / External Workflow
url: /how-to-guides/data-quality-observability/profiler/auto-pii-tagging
- category: How-to Guides / Data Quality and Observability / Data Observability
@@ -1001,6 +999,8 @@ site_menu:
url: /how-to-guides/data-governance/classification/auto-classification/external-workflow
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Auto PII Tagging
url: /how-to-guides/data-governance/classification/auto-classification/auto-pii-tagging
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Sample Data
url: /how-to-guides/data-governance/classification/auto-classification/external-sample-data
- category: How-to Guides / Data Governance / Classification / What are Tiers
url: /how-to-guides/data-governance/classification/tiers
- category: How-to Guides / Data Governance / Classification / Best Practices for Classification

@@ -1,6 +1,6 @@
---
title: External Storage for Sample Data
slug: /how-to-guides/data-quality-observability/profiler/external-sample-data
slug: /how-to-guides/data-governance/classification/auto-classification/external-sample-data
---

# External Storage for Sample Data
@@ -83,10 +83,7 @@ This Flag is useful in scenarios when you have different schemas with same name
**Compute Metrics**
Set the Compute Metrics toggle off to not perform any metric computation during the profiler ingestion workflow. Used in combination with Ingest Sample Data toggle on allows you to only ingest sample data.

**Advanced Configuration**

**PII Inference Confidence LevelConfidence (Optional)**
If `Auto PII Tagging` is enable, this confidence level will determine the threshold to use for OpenMetadata's NLP model to consider a column as containing PII data.
**Advanced Configuration**

**Sample Data Rows Count**
Set the number of rows to ingest when Ingest Sample Data toggle is on. Defaults to 50.
@@ -124,9 +121,6 @@ Set the sample to be use by the profiler for the specific table.

⚠️ This option is currently not support for Druid. Sampling leverage `RANDOM` functions in most database (some have specific sampling functions) and Druid provides neither of these option. We recommend using the partitioning or sample query option if you need to limit the amount of data scanned.

**Profile Sample Query**
Use a query to sample data for the profiler. This will overwrite any profle sample set.

**Enable Column Profile**
This setting allows user to exclude or include specific columns and metrics from the profiler.

@@ -198,15 +192,6 @@ This is a sample config for the profiler:

{% codeInfoContainer %}

{% codeInfo srNumber=10 %}
#### Source Configuration - Source Config

You can find all the definitions and types for the `sourceConfig` [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/databaseServiceProfilerPipeline.json).

**generateSampleData**: Option to turn on/off generating sample data.

{% /codeInfo %}

{% codeInfo srNumber=22 %}

**computeMetrics**: Option to turn on/off computing profiler metrics. This flag is useful when you want to only ingest the sample data with the profiler workflow and not any other information.
@@ -226,19 +211,6 @@ You can find all the definitions and types for the `sourceConfig` [here](https:

{% /codeInfo %}

{% codeInfo srNumber=13 %}

**processPiiSensitive**: Optional configuration to automatically tag columns that might contain sensitive information.

{% /codeInfo %}

{% codeInfo srNumber=14 %}

**confidence**: Set the Confidence value for which you want the column to be marked

{% /codeInfo %}


{% codeInfo srNumber=15 %}

**timeoutSeconds**: Profiler Timeout in Seconds
@@ -305,9 +277,6 @@ source:
type: Profiler
```

```yaml {% srNumber=10 %}
generateSampleData: true
```
```yaml {% srNumber=22 %}
computeMetrics: true
```
@@ -317,12 +286,6 @@ source:
```yaml {% srNumber=12 %}
# threadCount: 5
```
```yaml {% srNumber=13 %}
processPiiSensitive: false
```
```yaml {% srNumber=14 %}
# confidence: 80
```
```yaml {% srNumber=15 %}
# timeoutSeconds: 43200
```
@@ -362,8 +325,7 @@ processor:
# - fullyQualifiedName: <table fqn>
# profileSample: <number between 0 and 99> # default

# profileSample: <number between 0 and 99> # default will be 100 if omitted
# profileQuery: <query to use for sampling data for the profiler>
# profileSample: <number between 0 and 99> # default will be 100 if omitted
# columnConfig:
# excludeColumns:
# - <column name>

@@ -1019,8 +1019,6 @@ site_menu:
url: /how-to-guides/data-quality-observability/profiler/metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Custom Metrics
url: /how-to-guides/data-quality-observability/profiler/custom-metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Sample Data
url: /how-to-guides/data-quality-observability/profiler/external-sample-data
- category: How-to Guides / Data Quality and Observability / Data Profiler / External Workflow
url: /how-to-guides/data-quality-observability/profiler/external-workflow
- category: How-to Guides / Data Quality and Observability / Data Observability
@@ -1104,6 +1102,8 @@ site_menu:
url: /how-to-guides/data-governance/classification/auto-classification/external-workflow
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Auto PII Tagging
url: /how-to-guides/data-governance/classification/auto-classification/auto-pii-tagging
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Sample Data
url: /how-to-guides/data-governance/classification/auto-classification/external-sample-data
- category: How-to Guides / Data Governance / Classification / What are Tiers
url: /how-to-guides/data-governance/classification/tiers
- category: How-to Guides / Data Governance / Classification / Best Practices for Classification

@@ -902,8 +902,6 @@ site_menu:
url: /how-to-guides/data-quality-observability/profiler/metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Custom Metrics
url: /how-to-guides/data-quality-observability/profiler/custom-metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Sample Data
url: /how-to-guides/data-quality-observability/profiler/external-sample-data
- category: How-to Guides / Data Quality and Observability / Data Profiler / External Workflow
url: /how-to-guides/data-quality-observability/profiler/auto-pii-tagging
- category: How-to Guides / Data Quality and Observability / Data Observability
@@ -1001,6 +999,8 @@ site_menu:
url: /how-to-guides/data-governance/classification/auto-classification/external-workflow
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Auto PII Tagging
url: /how-to-guides/data-governance/classification/auto-classification/auto-pii-tagging
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Sample Data
url: /how-to-guides/data-governance/classification/auto-classification/external-sample-data
- category: How-to Guides / Data Governance / Classification / What are Tiers
url: /how-to-guides/data-governance/classification/tiers
- category: How-to Guides / Data Governance / Classification / Best Practices for Classification

@@ -1,6 +1,6 @@
---
title: External Storage for Sample Data
slug: /how-to-guides/data-quality-observability/profiler/external-sample-data
slug: /how-to-guides/data-governance/classification/auto-classification/external-sample-data
---

# External Storage for Sample Data
@@ -85,9 +85,6 @@ Set the Compute Metrics toggle off to not perform any metric computation during

**Advanced Configuration**

**PII Inference Confidence LevelConfidence (Optional)**
If `Auto PII Tagging` is enable, this confidence level will determine the threshold to use for OpenMetadata's NLP model to consider a column as containing PII data.

**Sample Data Rows Count**
Set the number of rows to ingest when Ingest Sample Data toggle is on. Defaults to 50.

@@ -124,9 +121,6 @@ Set the sample to be use by the profiler for the specific table.

⚠️ This option is currently not support for Druid. Sampling leverage `RANDOM` functions in most database (some have specific sampling functions) and Druid provides neither of these option. We recommend using the partitioning or sample query option if you need to limit the amount of data scanned.

**Profile Sample Query**
Use a query to sample data for the profiler. This will overwrite any profle sample set.

**Enable Column Profile**
This setting allows user to exclude or include specific columns and metrics from the profiler.

@@ -198,15 +192,6 @@ This is a sample config for the profiler:

{% codeInfoContainer %}

{% codeInfo srNumber=10 %}
#### Source Configuration - Source Config

You can find all the definitions and types for the `sourceConfig` [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/databaseServiceProfilerPipeline.json).

**generateSampleData**: Option to turn on/off generating sample data.

{% /codeInfo %}

{% codeInfo srNumber=22 %}

**computeMetrics**: Option to turn on/off computing profiler metrics. This flag is useful when you want to only ingest the sample data with the profiler workflow and not any other information.
@@ -226,19 +211,6 @@ You can find all the definitions and types for the `sourceConfig` [here](https:

{% /codeInfo %}

{% codeInfo srNumber=13 %}

**processPiiSensitive**: Optional configuration to automatically tag columns that might contain sensitive information.

{% /codeInfo %}

{% codeInfo srNumber=14 %}

**confidence**: Set the Confidence value for which you want the column to be marked

{% /codeInfo %}


{% codeInfo srNumber=15 %}

**timeoutSeconds**: Profiler Timeout in Seconds
@@ -305,9 +277,6 @@ source:
type: Profiler
```

```yaml {% srNumber=10 %}
generateSampleData: true
```
```yaml {% srNumber=22 %}
computeMetrics: true
```
@@ -317,12 +286,6 @@ source:
```yaml {% srNumber=12 %}
# threadCount: 5
```
```yaml {% srNumber=13 %}
processPiiSensitive: false
```
```yaml {% srNumber=14 %}
# confidence: 80
```
```yaml {% srNumber=15 %}
# timeoutSeconds: 43200
```
@@ -363,7 +326,6 @@ processor:
# profileSample: <number between 0 and 99> # default

# profileSample: <number between 0 and 99> # default will be 100 if omitted
# profileQuery: <query to use for sampling data for the profiler>
# columnConfig:
# excludeColumns:
# - <column name>

@@ -1025,8 +1025,6 @@ site_menu:
url: /how-to-guides/data-quality-observability/profiler/metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Custom Metrics
url: /how-to-guides/data-quality-observability/profiler/custom-metrics
- category: How-to Guides / Data Quality and Observability / Data Profiler / Sample Data
url: /how-to-guides/data-quality-observability/profiler/external-sample-data
- category: How-to Guides / Data Quality and Observability / Data Profiler / External Workflow
url: /how-to-guides/data-quality-observability/profiler/external-workflow
- category: How-to Guides / Data Quality and Observability / Data Observability
@@ -1110,6 +1108,8 @@ site_menu:
url: /how-to-guides/data-governance/classification/auto-classification/external-workflow
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Auto PII Tagging
url: /how-to-guides/data-governance/classification/auto-classification/auto-pii-tagging
- category: How-to Guides / Data Governance / Classification / Auto-Classification Workflow / Sample Data
url: /how-to-guides/data-governance/classification/auto-classification/external-sample-data
- category: How-to Guides / Data Governance / Classification / What are Tiers
url: /how-to-guides/data-governance/classification/tiers
- category: How-to Guides / Data Governance / Classification / Best Practices for Classification