diff --git a/openmetadata-docs/content/partials/v1.4/connectors/storage/manifest.md b/openmetadata-docs/content/partials/v1.4/connectors/storage/manifest.md index 289d48a1aec..be18c834c6d 100644 --- a/openmetadata-docs/content/partials/v1.4/connectors/storage/manifest.md +++ b/openmetadata-docs/content/partials/v1.4/connectors/storage/manifest.md @@ -121,23 +121,51 @@ Again, this information will be added on top of the inferred schema from the dat ### Global Manifest -You can also manage a **single** manifest file to centralize the ingestion process for any container. In that case, +You can also manage a **single** manifest file, named `openmetadata_storage_manifest.json`, to centralize the ingestion process for any container. + +In that case, you will need to add a `containerName` entry to the structure above. For example: -```yaml +{% codePreview %} + +{% codeInfoContainer %} + +{% codeInfo srNumber=1 %} + +The fields shown above (`dataPath`, `structureFormat`, `isPartitioned`, etc.) are still valid. + +{% /codeInfo %} + +{% codeInfo srNumber=2 %} + +**Container Name**: Since we are using a single manifest for all your containers, the field `containerName` will +help us identify which container (or bucket in S3, etc.) contains the presented information. + +{% /codeInfo %} + +{% /codeInfoContainer %} + +{% codeBlock fileName="openmetadata-global.json" %} + +```json {% srNumber=1 %} { "entries": [ { "dataPath": "transactions", "structureFormat": "csv", "isPartitioned": false, +``` + +```json {% srNumber=2 %} "containerName": "collate-demo-storage" } ] } ``` -You can also keep local manifests in each container, but if possible, we will always try to pick up the global manifest -during the ingestion. +{% /codeBlock %} -We will look for a file named `openmetadata_storage_manifest.json`. 
+{% /codePreview %} + +You can also keep local `openmetadata.json` manifests in each container, but whenever possible we will pick up the global manifest +during the ingestion. diff --git a/openmetadata-docs/content/partials/v1.5/connectors/storage/manifest.md b/openmetadata-docs/content/partials/v1.5/connectors/storage/manifest.md index 289d48a1aec..d37692bab60 100644 --- a/openmetadata-docs/content/partials/v1.5/connectors/storage/manifest.md +++ b/openmetadata-docs/content/partials/v1.5/connectors/storage/manifest.md @@ -121,23 +121,51 @@ Again, this information will be added on top of the inferred schema from the dat ### Global Manifest -You can also manage a **single** manifest file to centralize the ingestion process for any container. In that case, +You can also manage a **single** manifest file, named `openmetadata_storage_manifest.json`, to centralize the ingestion process for any container. + +In that case, you will need to add a `containerName` entry to the structure above. For example: -```yaml +{% codePreview %} + +{% codeInfoContainer %} + +{% codeInfo srNumber=1 %} + +The fields shown above (`dataPath`, `structureFormat`, `isPartitioned`, etc.) are still valid. + +{% /codeInfo %} + +{% codeInfo srNumber=2 %} + +**Container Name**: Since we are using a single manifest for all your containers, the field `containerName` will +help us identify which container (or bucket in S3, etc.) contains the presented information. + +{% /codeInfo %} + +{% /codeInfoContainer %} + +{% codeBlock fileName="openmetadata-global.json" %} + +```json {% srNumber=1 %} { "entries": [ { "dataPath": "transactions", "structureFormat": "csv", "isPartitioned": false, +``` + +```json {% srNumber=2 %} "containerName": "collate-demo-storage" } ] } ``` -You can also keep local manifests in each container, but if possible, we will always try to pick up the global manifest -during the ingestion. 
+{% /codeBlock %} -We will look for a file named `openmetadata_storage_manifest.json`. +{% /codePreview %} + +You can also keep local `openmetadata.json` manifests in each container, but whenever possible we will pick up the global manifest +during the ingestion. diff --git a/openmetadata-docs/content/v1.4.x/connectors/storage/index.md b/openmetadata-docs/content/v1.4.x/connectors/storage/index.md index b2f37907587..8786a1421ed 100644 --- a/openmetadata-docs/content/v1.4.x/connectors/storage/index.md +++ b/openmetadata-docs/content/v1.4.x/connectors/storage/index.md @@ -22,6 +22,12 @@ In these systems we can have different types of information: - Structured data in single and independent files (which can also be ingested with the [Data Lake connector](/connectors/database/datalake)) - Structured data in partitioned files, e.g., `my_table/year=2022/...parquet`, `my_table/year=2023/...parquet`, etc. +{% note %} + +The Storage Connector will help you bring in **Structured data in partitioned files**. + +{% /note %} + Then the question is, how do we know which data in each Container is relevant and which structure does it follow? In order to optimize ingestion costs and make sure we are only bringing in useful metadata, the Storage Services ingestion process follow this approach: @@ -41,3 +47,168 @@ We are flattening this structure to simplify the navigation. {% /note %} {% partial file="/v1.4/connectors/storage/manifest.md" /%} + +## Example + +Let's walk through an example of what the ingestion process and the resulting metadata look like. We will work with S3, using a global manifest, +and two buckets. 
+ +### S3 Data + +In S3 we have: + +``` +S3 +|__ om-glue-test # bucket +| |__ openmetadata_storage_manifest.json # Global Manifest +|__ collate-demo-storage # bucket + |__ cities_multiple_simple/ + | |__ 20230412/ + | |__ State=AL/ # Directory with parquet files + | |__ State=AZ/ # Directory with parquet files + |__ cities_multiple/ + | |__ Year=2023/ + | |__ State=AL/ # Directory with parquet files + | |__ State=AZ/ # Directory with parquet files + |__ cities/ + | |__ State=AL/ # Directory with parquet files + | |__ State=AZ/ # Directory with parquet files + |__ transactions_separator/ # Directory with CSV files using ; + |__ transactions/ # Directory with CSV files using , +``` + +1. We have a bucket `om-glue-test` where our `openmetadata_storage_manifest.json` global manifest lives. +2. We have another bucket `collate-demo-storage` where we want to ingest the metadata of 5 containers with different formats and partitioning: + 1. The `cities_multiple_simple` container has a time partition (a directory named with just a date, e.g., `20230412`) and a `State` partition. + 2. The `cities_multiple` container has a `Year` and a `State` partition. + 3. The `cities` container is only partitioned by `State`. + 4. The `transactions_separator` container contains multiple CSV files that use `;` as the separator. + 5. The `transactions` container contains multiple CSV files that use `,` as the separator. + +The ingestion process will pick up a random sample of files from the directories (or subdirectories). 
+ +### Global Manifest + +Our global manifest looks as follows: + +```json +{ + "entries":[ + { + "dataPath": "transactions", + "structureFormat": "csv", + "isPartitioned": false, + "containerName": "collate-demo-storage" + }, + { + "dataPath": "transactions_separator", + "structureFormat": "csv", + "isPartitioned": false, + "separator": ";", + "containerName": "collate-demo-storage" + }, + { + "dataPath": "cities", + "structureFormat": "parquet", + "isPartitioned": true, + "containerName": "collate-demo-storage" + }, + { + "dataPath": "cities_multiple", + "structureFormat": "parquet", + "isPartitioned": true, + "containerName": "collate-demo-storage", + "partitionColumns": [ + { + "name": "Year", + "dataType": "DATE", + "dataTypeDisplay": "date (year)" + }, + { + "name": "State", + "dataType": "STRING" + } + ] + }, + { + "dataPath": "cities_multiple_simple", + "structureFormat": "parquet", + "isPartitioned": true, + "containerName": "collate-demo-storage", + "partitionColumns": [ + { + "name": "State", + "dataType": "STRING" + } + ] + } + ] +} +``` + +We are specifying: +1. Where to find the data for each container we want to ingest via the `dataPath`, +2. The format of the files via the `structureFormat`, +3. Whether the data has sub-partitions (e.g., `State` or `Year`), +4. The `containerName`, so that the process knows in which S3 bucket to look for this data. + +### Source Config + +In order to prepare the ingestion, we will: +1. Set the `sourceConfig` to include only the containers we are interested in. +2. Set the `storageMetadataConfigSource` pointing to the global manifest stored in S3, specifying the container name as `om-glue-test`. + +```yaml +source: + type: s3 + serviceName: s3-demo + serviceConnection: + config: + type: S3 + awsConfig: + awsAccessKeyId: ... + awsSecretAccessKey: ... + awsRegion: ... 
+ sourceConfig: + config: + type: StorageMetadata + containerFilterPattern: + includes: + - collate-demo-storage + - om-glue-test + storageMetadataConfigSource: + securityConfig: + awsAccessKeyId: ... + awsSecretAccessKey: ... + awsRegion: ... + prefixConfig: + containerName: om-glue-test +sink: + type: metadata-rest + config: {} +workflowConfig: + openMetadataServerConfig: + hostPort: http://localhost:8585/api + authProvider: openmetadata + securityConfig: + jwtToken: "..." +``` + +You can run this same process from the UI, or directly with the `metadata` CLI via `metadata ingest -c `. + +### Checking the results + +Once the ingestion process runs, we'll see the following metadata: + +First, the service we called `s3-demo`, which has the two buckets we included in the filter. + +{% image src="/images/v1.4/connectors/storage/s3-demo.png" alt="s3-demo" /%} + +Then, if we click on the `collate-demo-storage` container, we'll see all the children defined in the manifest. + +{% image src="/images/v1.4/connectors/storage/collate-demo-storage.png" alt="collate-demo-storage" /%} + +- **cities**: Will show the columns extracted from the sampled parquet files, since there are no partition columns specified. +- **cities_multiple**: Will have the parquet columns and the `Year` and `State` columns indicated in the partitions. +- **cities_multiple_simple**: Will have the parquet columns and the `State` column indicated in the partition. +- **transactions** and **transactions_separator**: Will have the CSV columns. 
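
Before running the ingestion, it can help to sanity-check the global manifest locally. The following is only a minimal sketch, not part of the connector: the helper name `group_entries_by_container` and the choice of `dataPath`, `structureFormat` and `containerName` as the keys to check are assumptions made for illustration, based on the fields used in the example manifest above.

```python
import json
from collections import defaultdict

# Assumption: these are the keys we expect on every global-manifest entry,
# taken from the example manifest in this page (not from the connector's schema).
EXPECTED_KEYS = ("dataPath", "structureFormat", "containerName")


def group_entries_by_container(manifest_text):
    """Parse a manifest and return {containerName: [dataPath, ...]}.

    Raises ValueError when an entry lacks one of EXPECTED_KEYS.
    """
    manifest = json.loads(manifest_text)
    grouped = defaultdict(list)
    for index, entry in enumerate(manifest.get("entries", [])):
        missing = [key for key in EXPECTED_KEYS if key not in entry]
        if missing:
            raise ValueError(f"entry {index} is missing: {', '.join(missing)}")
        grouped[entry["containerName"]].append(entry["dataPath"])
    return dict(grouped)


# A shortened copy of the example manifest, used only to exercise the helper.
EXAMPLE_MANIFEST = """
{
  "entries": [
    {"dataPath": "transactions", "structureFormat": "csv",
     "isPartitioned": false, "containerName": "collate-demo-storage"},
    {"dataPath": "cities", "structureFormat": "parquet",
     "isPartitioned": true, "containerName": "collate-demo-storage"}
  ]
}
"""

if __name__ == "__main__":
    print(group_entries_by_container(EXAMPLE_MANIFEST))
```

Grouping by `containerName` makes it easy to compare the manifest against the buckets listed in `containerFilterPattern`.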
diff --git a/openmetadata-docs/content/v1.5.x-SNAPSHOT/connectors/storage/index.md b/openmetadata-docs/content/v1.5.x-SNAPSHOT/connectors/storage/index.md index e983fb9cc52..7bc596f57a2 100644 --- a/openmetadata-docs/content/v1.5.x-SNAPSHOT/connectors/storage/index.md +++ b/openmetadata-docs/content/v1.5.x-SNAPSHOT/connectors/storage/index.md @@ -22,6 +22,12 @@ In these systems we can have different types of information: - Structured data in single and independent files (which can also be ingested with the [Data Lake connector](/connectors/database/datalake)) - Structured data in partitioned files, e.g., `my_table/year=2022/...parquet`, `my_table/year=2023/...parquet`, etc. +{% note %} + +The Storage Connector will help you bring in **Structured data in partitioned files**. + +{% /note %} + Then the question is, how do we know which data in each Container is relevant and which structure does it follow? In order to optimize ingestion costs and make sure we are only bringing in useful metadata, the Storage Services ingestion process follow this approach: @@ -41,3 +47,168 @@ We are flattening this structure to simplify the navigation. {% /note %} {% partial file="/v1.5/connectors/storage/manifest.md" /%} + +## Example + +Let's walk through an example of what the ingestion process and the resulting metadata look like. We will work with S3, using a global manifest, +and two buckets. 
+ +### S3 Data + +In S3 we have: + +``` +S3 +|__ om-glue-test # bucket +| |__ openmetadata_storage_manifest.json # Global Manifest +|__ collate-demo-storage # bucket + |__ cities_multiple_simple/ + | |__ 20230412/ + | |__ State=AL/ # Directory with parquet files + | |__ State=AZ/ # Directory with parquet files + |__ cities_multiple/ + | |__ Year=2023/ + | |__ State=AL/ # Directory with parquet files + | |__ State=AZ/ # Directory with parquet files + |__ cities/ + | |__ State=AL/ # Directory with parquet files + | |__ State=AZ/ # Directory with parquet files + |__ transactions_separator/ # Directory with CSV files using ; + |__ transactions/ # Directory with CSV files using , +``` + +1. We have a bucket `om-glue-test` where our `openmetadata_storage_manifest.json` global manifest lives. +2. We have another bucket `collate-demo-storage` where we want to ingest the metadata of 5 containers with different formats and partitioning: + 1. The `cities_multiple_simple` container has a time partition (a directory named with just a date, e.g., `20230412`) and a `State` partition. + 2. The `cities_multiple` container has a `Year` and a `State` partition. + 3. The `cities` container is only partitioned by `State`. + 4. The `transactions_separator` container contains multiple CSV files that use `;` as the separator. + 5. The `transactions` container contains multiple CSV files that use `,` as the separator. + +The ingestion process will pick up a random sample of files from the directories (or subdirectories). 
+ +### Global Manifest + +Our global manifest looks as follows: + +```json +{ + "entries":[ + { + "dataPath": "transactions", + "structureFormat": "csv", + "isPartitioned": false, + "containerName": "collate-demo-storage" + }, + { + "dataPath": "transactions_separator", + "structureFormat": "csv", + "isPartitioned": false, + "separator": ";", + "containerName": "collate-demo-storage" + }, + { + "dataPath": "cities", + "structureFormat": "parquet", + "isPartitioned": true, + "containerName": "collate-demo-storage" + }, + { + "dataPath": "cities_multiple", + "structureFormat": "parquet", + "isPartitioned": true, + "containerName": "collate-demo-storage", + "partitionColumns": [ + { + "name": "Year", + "dataType": "DATE", + "dataTypeDisplay": "date (year)" + }, + { + "name": "State", + "dataType": "STRING" + } + ] + }, + { + "dataPath": "cities_multiple_simple", + "structureFormat": "parquet", + "isPartitioned": true, + "containerName": "collate-demo-storage", + "partitionColumns": [ + { + "name": "State", + "dataType": "STRING" + } + ] + } + ] +} +``` + +We are specifying: +1. Where to find the data for each container we want to ingest via the `dataPath`, +2. The format of the files via the `structureFormat`, +3. Whether the data has sub-partitions (e.g., `State` or `Year`), +4. The `containerName`, so that the process knows in which S3 bucket to look for this data. + +### Source Config + +In order to prepare the ingestion, we will: +1. Set the `sourceConfig` to include only the containers we are interested in. +2. Set the `storageMetadataConfigSource` pointing to the global manifest stored in S3, specifying the container name as `om-glue-test`. + +```yaml +source: + type: s3 + serviceName: s3-demo + serviceConnection: + config: + type: S3 + awsConfig: + awsAccessKeyId: ... + awsSecretAccessKey: ... + awsRegion: ... 
+ sourceConfig: + config: + type: StorageMetadata + containerFilterPattern: + includes: + - collate-demo-storage + - om-glue-test + storageMetadataConfigSource: + securityConfig: + awsAccessKeyId: ... + awsSecretAccessKey: ... + awsRegion: ... + prefixConfig: + containerName: om-glue-test +sink: + type: metadata-rest + config: {} +workflowConfig: + openMetadataServerConfig: + hostPort: http://localhost:8585/api + authProvider: openmetadata + securityConfig: + jwtToken: "..." +``` + +You can run this same process from the UI, or directly with the `metadata` CLI via `metadata ingest -c `. + +### Checking the results + +Once the ingestion process runs, we'll see the following metadata: + +First, the service we called `s3-demo`, which has the two buckets we included in the filter. + +{% image src="/images/v1.5/connectors/storage/s3-demo.png" alt="s3-demo" /%} + +Then, if we click on the `collate-demo-storage` container, we'll see all the children defined in the manifest. + +{% image src="/images/v1.5/connectors/storage/collate-demo-storage.png" alt="collate-demo-storage" /%} + +- **cities**: Will show the columns extracted from the sampled parquet files, since there are no partition columns specified. +- **cities_multiple**: Will have the parquet columns and the `Year` and `State` columns indicated in the partitions. +- **cities_multiple_simple**: Will have the parquet columns and the `State` column indicated in the partition. +- **transactions** and **transactions_separator**: Will have the CSV columns. 
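
The partition columns declared in the manifest are meant to line up with the `key=value` directories in the bucket. The sketch below is a hypothetical helper for checking that by hand; it is not how the connector works internally. The function names and the sample file name `part-0.parquet` are assumptions for illustration only.

```python
def partition_keys(path):
    """Return the key of every `key=value` path segment, in path order.

    Segments without `=` (plain directories, file names) are ignored.
    """
    keys = []
    for segment in path.strip("/").split("/"):
        key, separator, value = segment.partition("=")
        if separator and key and value:
            keys.append(key)
    return keys


def matches_manifest(path, partition_columns):
    """Check that the keys found in `path` equal the declared column names."""
    declared = [column["name"] for column in partition_columns]
    return partition_keys(path) == declared


if __name__ == "__main__":
    # Hypothetical object key under the `cities_multiple` container.
    sample = "cities_multiple/Year=2023/State=AL/part-0.parquet"
    columns = [{"name": "Year", "dataType": "DATE"}, {"name": "State", "dataType": "STRING"}]
    print(partition_keys(sample), matches_manifest(sample, columns))
```

Note that a date-only directory such as `20230412` carries no key, which is consistent with the example manifest declaring only `State` for `cities_multiple_simple`.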
diff --git a/openmetadata-docs/images/v1.4/connectors/storage/collate-demo-storage.png b/openmetadata-docs/images/v1.4/connectors/storage/collate-demo-storage.png new file mode 100644 index 00000000000..293a48ff95e Binary files /dev/null and b/openmetadata-docs/images/v1.4/connectors/storage/collate-demo-storage.png differ diff --git a/openmetadata-docs/images/v1.4/connectors/storage/s3-demo.png b/openmetadata-docs/images/v1.4/connectors/storage/s3-demo.png new file mode 100644 index 00000000000..a520d5f5f6c Binary files /dev/null and b/openmetadata-docs/images/v1.4/connectors/storage/s3-demo.png differ diff --git a/openmetadata-docs/images/v1.5/connectors/storage/collate-demo-storage.png b/openmetadata-docs/images/v1.5/connectors/storage/collate-demo-storage.png new file mode 100644 index 00000000000..293a48ff95e Binary files /dev/null and b/openmetadata-docs/images/v1.5/connectors/storage/collate-demo-storage.png differ diff --git a/openmetadata-docs/images/v1.5/connectors/storage/s3-demo.png b/openmetadata-docs/images/v1.5/connectors/storage/s3-demo.png new file mode 100644 index 00000000000..a520d5f5f6c Binary files /dev/null and b/openmetadata-docs/images/v1.5/connectors/storage/s3-demo.png differ