Mirror of https://github.com/open-metadata/OpenMetadata.git
Synced 2025-10-13 17:58:36 +00:00

DOCS - Storage Connector example (#16910)

This commit is contained in:
parent 02e06c7795
commit e3bbb18a03
@@ -121,23 +121,51 @@ Again, this information will be added on top of the inferred schema from the data

### Global Manifest

You can also manage a **single** manifest file, named `openmetadata_storage_manifest.json`, to centralize the ingestion
process for any container. In that case, you will need to add a `containerName` entry to the structure above. For example:

{% codePreview %}

{% codeInfoContainer %}

{% codeInfo srNumber=1 %}

The fields shown above (`dataPath`, `structureFormat`, `isPartitioned`, etc.) are still valid.

{% /codeInfo %}

{% codeInfo srNumber=2 %}

**Container Name**: Since we are using a single manifest for all your containers, the `containerName` field will
help us identify which container (or bucket in S3, etc.) contains the presented information.

{% /codeInfo %}

{% /codeInfoContainer %}

{% codeBlock fileName="openmetadata-global.json" %}

```json {% srNumber=1 %}
{
  "entries": [
    {
      "dataPath": "transactions",
      "structureFormat": "csv",
      "isPartitioned": false,
```

```json {% srNumber=2 %}
      "containerName": "collate-demo-storage"
    }
  ]
}
```

{% /codeBlock %}

{% /codePreview %}

You can also keep local `openmetadata.json` manifests in each container, but when possible, we will always try to pick
up the global manifest during the ingestion.
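Since the ingestion keys off these exact field names, a quick way to catch typos is to sanity-check the manifest before uploading it. A minimal Python sketch (the helper and required-key set are our own, not the connector's actual validation logic):

```python
import json

# Fields every entry uses in the example above; a sketch, not the
# connector's real JSON-schema validation.
REQUIRED_KEYS = {"dataPath", "structureFormat", "isPartitioned"}

def validate_global_manifest(text: str) -> list[str]:
    """Return a list of problems found in a global manifest document."""
    problems = []
    manifest = json.loads(text)
    for i, entry in enumerate(manifest.get("entries", [])):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing {sorted(missing)}")
        # A *global* manifest must also say which container each entry belongs to.
        if "containerName" not in entry:
            problems.append(f"entry {i}: global manifests need 'containerName'")
    return problems

doc = """
{"entries": [{"dataPath": "transactions", "structureFormat": "csv",
              "isPartitioned": false, "containerName": "collate-demo-storage"}]}
"""
print(validate_global_manifest(doc))  # → []
```

An empty list means every entry carries the fields the connector expects.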
@@ -22,6 +22,12 @@ In these systems we can have different types of information:

- Structured data in single and independent files (which can also be ingested with the [Data Lake connector](/connectors/database/datalake))
- Structured data in partitioned files, e.g., `my_table/year=2022/...parquet`, `my_table/year=2023/...parquet`, etc.

{% note %}

The Storage Connector will help you bring in **structured data in partitioned files**.

{% /note %}

Then the question is: how do we know which data in each container is relevant, and which structure it follows? To
optimize ingestion costs and make sure we are only bringing in useful metadata, the Storage Services ingestion process
follows this approach:
@@ -41,3 +47,168 @@ We are flattening this structure to simplify the navigation.

{% /note %}

{% partial file="/v1.4/connectors/storage/manifest.md" /%}

## Example

Let's walk through an example of what the data and the ingested metadata look like. We will work with S3, using a global manifest,
and two buckets.

### S3 Data

In S3 we have:

```
S3
|__ om-glue-test                             # bucket
|   |__ openmetadata_storage_manifest.json   # Global Manifest
|__ collate-demo-storage                     # bucket
    |__ cities_multiple_simple/
    |   |__ 20230412/
    |       |__ State=AL/                    # Directory with parquet files
    |       |__ State=AZ/                    # Directory with parquet files
    |__ cities_multiple/
    |   |__ Year=2023/
    |       |__ State=AL/                    # Directory with parquet files
    |       |__ State=AZ/                    # Directory with parquet files
    |__ cities/
    |   |__ State=AL/                        # Directory with parquet files
    |   |__ State=AZ/                        # Directory with parquet files
    |__ transactions_separator/              # Directory with CSV files using ;
    |__ transactions/                        # Directory with CSV files using ,
```

1. We have a bucket `om-glue-test` where our `openmetadata_storage_manifest.json` global manifest lives.
2. We have another bucket `collate-demo-storage` where we want to ingest the metadata of 5 containers with different formats:
   1. The `cities_multiple_simple` container has a time partition (formatted as a plain date) and a `State` partition.
   2. The `cities_multiple` container has a `Year` and a `State` partition.
   3. The `cities` container is only partitioned by `State`.
   4. The `transactions_separator` container contains multiple CSV files separated by `;`.
   5. The `transactions` container contains multiple CSV files separated by `,`.

The ingestion process will pick up a random sample of files from these directories (or subdirectories) to infer the schema.
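The `Year=2023/State=AL` layout above is hive-style partitioning, where each directory segment encodes a column name and value. A rough sketch of how such paths map to partition columns (the helper is illustrative, not the connector's actual code):

```python
def partition_values(path: str) -> dict[str, str]:
    """Extract hive-style `name=value` partition segments from an object key."""
    values = {}
    for segment in path.split("/"):
        if "=" in segment:
            name, _, value = segment.partition("=")
            values[name] = value
    return values

print(partition_values("cities_multiple/Year=2023/State=AL/part-0.parquet"))
# → {'Year': '2023', 'State': 'AL'}
```

An unpartitioned path such as `transactions/file.csv` yields no partition values at all.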
### Global Manifest

Our global manifest looks as follows:

```json
{
  "entries": [
    {
      "dataPath": "transactions",
      "structureFormat": "csv",
      "isPartitioned": false,
      "containerName": "collate-demo-storage"
    },
    {
      "dataPath": "transactions_separator",
      "structureFormat": "csv",
      "isPartitioned": false,
      "separator": ";",
      "containerName": "collate-demo-storage"
    },
    {
      "dataPath": "cities",
      "structureFormat": "parquet",
      "isPartitioned": true,
      "containerName": "collate-demo-storage"
    },
    {
      "dataPath": "cities_multiple",
      "structureFormat": "parquet",
      "isPartitioned": true,
      "containerName": "collate-demo-storage",
      "partitionColumns": [
        {
          "name": "Year",
          "dataType": "DATE",
          "dataTypeDisplay": "date (year)"
        },
        {
          "name": "State",
          "dataType": "STRING"
        }
      ]
    },
    {
      "dataPath": "cities_multiple_simple",
      "structureFormat": "parquet",
      "isPartitioned": true,
      "containerName": "collate-demo-storage",
      "partitionColumns": [
        {
          "name": "State",
          "dataType": "STRING"
        }
      ]
    }
  ]
}
```

We are specifying:
1. Where to find the data for each container we want to ingest, via the `dataPath`.
2. The `structureFormat` of that data.
3. Whether the data has sub-partitions or not (e.g., `State` or `Year`).
4. The `containerName`, so that the process knows in which S3 bucket to look for this data.
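Because one global manifest can describe many buckets, a consumer first has to group entries by `containerName` before reading anything from S3. A small sketch of that grouping, using two of the entries above:

```python
from collections import defaultdict

# Two entries as they appear in the global manifest above.
entries = [
    {"dataPath": "transactions", "structureFormat": "csv",
     "isPartitioned": False, "containerName": "collate-demo-storage"},
    {"dataPath": "transactions_separator", "structureFormat": "csv",
     "isPartitioned": False, "separator": ";",
     "containerName": "collate-demo-storage"},
]

# Map each bucket to the data paths the manifest declares for it.
by_container = defaultdict(list)
for entry in entries:
    by_container[entry["containerName"]].append(entry["dataPath"])

print(dict(by_container))
# → {'collate-demo-storage': ['transactions', 'transactions_separator']}
```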
### Source Config

In order to prepare the ingestion, we will:
1. Set the `sourceConfig` to include only the containers we are interested in.
2. Set the `storageMetadataConfigSource` pointing to the global manifest stored in S3, specifying the container name as `om-glue-test`.

```yaml
source:
  type: s3
  serviceName: s3-demo
  serviceConnection:
    config:
      type: S3
      awsConfig:
        awsAccessKeyId: ...
        awsSecretAccessKey: ...
        awsRegion: ...
  sourceConfig:
    config:
      type: StorageMetadata
      containerFilterPattern:
        includes:
          - collate-demo-storage
          - om-glue-test
      storageMetadataConfigSource:
        securityConfig:
          awsAccessKeyId: ...
          awsSecretAccessKey: ...
          awsRegion: ...
        prefixConfig:
          containerName: om-glue-test
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    securityConfig:
      jwtToken: "..."
```
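The `containerFilterPattern` entries are treated as regular expressions matched against container names. A sketch of include/exclude semantics under that assumption (the function is ours, not the framework's API):

```python
import re

def is_included(name: str, includes: list[str], excludes: list[str] = ()) -> bool:
    """Keep a container if it matches some include pattern and no exclude pattern."""
    if excludes and any(re.match(p, name) for p in excludes):
        return False
    if includes:
        return any(re.match(p, name) for p in includes)
    return True  # no patterns at all: everything passes

includes = ["collate-demo-storage", "om-glue-test"]
print(is_included("collate-demo-storage", includes))  # → True
print(is_included("some-other-bucket", includes))     # → False
```

With the `includes` list above, only the two buckets from our example are scanned.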
You can run this same process from the UI, or directly with the `metadata` CLI via `metadata ingest -c <path to yaml>`.

### Checking the results

Once the ingestion process runs, we'll see the following metadata:

First, the service we called `s3-demo`, which has the two buckets we included in the filter.

{% image src="/images/v1.4/connectors/storage/s3-demo.png" alt="s3-demo" /%}

Then, if we click on the `collate-demo-storage` container, we'll see all the children defined in the manifest.

{% image src="/images/v1.4/connectors/storage/collate-demo-storage.png" alt="collate-demo-storage" /%}

- **cities**: Will show the columns extracted from the sampled parquet files, since there are no partition columns specified.
- **cities_multiple**: Will have the parquet columns plus the `Year` and `State` columns indicated in the partitions.
- **cities_multiple_simple**: Will have the parquet columns plus the `State` column indicated in the partition.
- **transactions** and **transactions_separator**: Will have the CSV columns.
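The bullet points boil down to one rule: the final table schema is the file-inferred columns plus the manifest's `partitionColumns`. A toy sketch (the inferred columns here are made up for illustration):

```python
# Columns inferred from sampled parquet files (hypothetical example data).
inferred = [{"name": "city", "dataType": "STRING"},
            {"name": "population", "dataType": "INT"}]

# Partition columns declared for `cities_multiple` in the global manifest.
partition_columns = [{"name": "Year", "dataType": "DATE"},
                     {"name": "State", "dataType": "STRING"}]

# The resulting schema appends the declared partition columns to the
# file-inferred ones.
schema = inferred + partition_columns
print([c["name"] for c in schema])  # → ['city', 'population', 'Year', 'State']
```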
BIN  openmetadata-docs/images/v1.4/connectors/storage/s3-demo.png  (new file, 24 KiB)
BIN  openmetadata-docs/images/v1.5/connectors/storage/s3-demo.png  (new file, 24 KiB)