Mirror of https://github.com/open-metadata/OpenMetadata.git, synced 2025-08-23 08:28:10 +00:00

[Docs] - Docs storage for manifest & Domo (#13290)

* Domo docs
* Storage Manifest

This commit is contained in:
parent f05e874c7a
commit c53fe684fd
@ -0,0 +1,112 @@
## OpenMetadata Manifest

Our manifest file is defined as a [JSON Schema](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/storage/containerMetadataConfig.json),
and can look like this:

{% codePreview %}

{% codeInfoContainer %}

{% codeInfo srNumber=1 %}

**Entries**: We need to add a list of `entries`. Each inner JSON structure will be ingested as a child container of the
top-level one. In this case, we will be ingesting 4 children.

{% /codeInfo %}

{% codeInfo srNumber=2 %}

**Simple Container**: The simplest container we can have is structured data without partitions. Note that we still
need to bring information about:

- **dataPath**: Where we can find the data. This should be a path relative to the top-level container.
- **structureFormat**: The format of the data we are going to find. This information will be used to read the data.

After ingesting this container, we will bring in the schema of the data in the `dataPath`.

{% /codeInfo %}

{% codeInfo srNumber=3 %}

**Partitioned Container**: We can ingest partitioned data without bringing in any further details.

By setting the `isPartitioned` field to `true`, we'll flag the container as `Partitioned`. We will still read the
source files' schemas, but won't add any other information.

{% /codeInfo %}

{% codeInfo srNumber=4 %}

**Single-Partition Container**: We can bring partition information by specifying the `partitionColumns`. Their definition
is based on the [JSON Schema](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/entity/data/table.json#L232)
definition for table columns. The minimum required information is the `name` and `dataType`.

When passing `partitionColumns`, these values will be added to the schema, on top of the information inferred from the files.

{% /codeInfo %}

{% codeInfo srNumber=5 %}

**Multiple-Partition Container**: We can add multiple columns as partitions.

Note how in the example we even bring our custom `displayName` for the column and `dataTypeDisplay` for its type.

Again, this information will be added on top of the schema inferred from the data files.

{% /codeInfo %}

{% /codeInfoContainer %}

{% codeBlock fileName="openmetadata.json" %}

```json {% srNumber=1 %}
{
    "entries": [
```
```json {% srNumber=2 %}
        {
            "dataPath": "transactions",
            "structureFormat": "csv"
        },
```
```json {% srNumber=3 %}
        {
            "dataPath": "cities",
            "structureFormat": "parquet",
            "isPartitioned": true
        },
```
```json {% srNumber=4 %}
        {
            "dataPath": "cities_multiple_simple",
            "structureFormat": "parquet",
            "isPartitioned": true,
            "partitionColumns": [
                {
                    "name": "State",
                    "dataType": "STRING"
                }
            ]
        },
```
```json {% srNumber=5 %}
        {
            "dataPath": "cities_multiple",
            "structureFormat": "parquet",
            "isPartitioned": true,
            "partitionColumns": [
                {
                    "name": "Year",
                    "displayName": "Year (Partition)",
                    "dataType": "DATE",
                    "dataTypeDisplay": "date (year)"
                },
                {
                    "name": "State",
                    "dataType": "STRING"
                }
            ]
        }
    ]
}
```

{% /codeBlock %}

{% /codePreview %}
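To make the required fields concrete, here is a minimal structural check of a manifest in Python. This helper is an illustrative sketch, not part of the OpenMetadata SDK; the authoritative validation is the JSON Schema linked above.

```python
import json

# Minimal structural check for an openmetadata.json manifest.
# Illustrative sketch only: the real validation is driven by the
# containerMetadataConfig JSON Schema, not by this helper.

def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of problems found in the manifest; empty means OK."""
    problems = []
    entries = manifest.get("entries")
    if not isinstance(entries, list):
        return ["manifest must contain a list under 'entries'"]
    for i, entry in enumerate(entries):
        # Every child container needs a data path and a format to read it.
        for required in ("dataPath", "structureFormat"):
            if required not in entry:
                problems.append(f"entry {i}: missing '{required}'")
        # Partition columns need at least a name and a dataType each.
        for col in entry.get("partitionColumns", []):
            if "name" not in col or "dataType" not in col:
                problems.append(
                    f"entry {i}: partition column missing 'name' or 'dataType'"
                )
    return problems

manifest = json.loads("""
{
    "entries": [
        {"dataPath": "transactions", "structureFormat": "csv"},
        {"dataPath": "cities", "structureFormat": "parquet", "isPartitioned": true}
    ]
}
""")
print(validate_manifest(manifest))  # -> []
```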
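The notes above state that declared `partitionColumns` are layered on top of the schema inferred from the files. A rough sketch of that merge, assuming simple column dicts shaped like the table column definition; the merge logic itself is an illustration, not OpenMetadata's actual implementation:

```python
# Illustrative sketch: layer declared partitionColumns on top of a schema
# inferred from the data files. Column dicts follow the table column shape
# (name, dataType, optional displayName/dataTypeDisplay). The merge policy
# shown here is an assumption for illustration purposes.

def merge_schema(inferred: list[dict], partition_columns: list[dict]) -> list[dict]:
    """Declared partition columns extend (or override by name) the inferred columns."""
    by_name = {col["name"]: dict(col) for col in inferred}
    for col in partition_columns:
        # A declared partition column is appended, or wins over the inferred one.
        by_name.setdefault(col["name"], {}).update(col)
    return list(by_name.values())

inferred = [
    {"name": "city", "dataType": "STRING"},
    {"name": "population", "dataType": "INT"},
]
partitions = [
    {"name": "Year", "displayName": "Year (Partition)", "dataType": "DATE"},
    {"name": "State", "dataType": "STRING"},
]
merged = merge_schema(inferred, partitions)
print([c["name"] for c in merged])  # -> ['city', 'population', 'Year', 'State']
```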
@ -40,115 +40,5 @@ We are flattening this structure to simplify the navigation.

 {% /note %}

-## OpenMetadata Manifest
-
-Our manifest file is defined as a [JSON Schema](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/storage/containerMetadataConfig.json),
-and can look like this:
-
-[... the rest of the inline manifest walkthrough and `openmetadata.json` example, elided ...]
+{% partial file="/v1.1/connectors/storage/manifest.md" /%}
@ -82,6 +82,16 @@ The policy would look like:
 }
 ```

+### OpenMetadata Manifest
+
+Unlike other connectors, where metadata extraction happens automatically, here we can only extract high-level
+metadata from buckets. To understand their internal structure, users need to provide an `openmetadata.json`
+file at the bucket root.
+
+You can learn more about this [here](/connectors/storage). Keep reading for an example of the shape of the manifest file.
+
+{% partial file="/v1.1/connectors/storage/manifest.md" /%}
+
 ## Metadata Ingestion

 {% stepsContainer %}
@ -92,6 +92,16 @@ To run the Athena ingestion, you will need to install:
 pip3 install "openmetadata-ingestion[athena]"
 ```

+### OpenMetadata Manifest
+
+Unlike other connectors, where metadata extraction happens automatically, here we can only extract high-level
+metadata from buckets. To understand their internal structure, users need to provide an `openmetadata.json`
+file at the bucket root.
+
+You can learn more about this [here](/connectors/storage). Keep reading for an example of the shape of the manifest file.
+
+{% partial file="/v1.1/connectors/storage/manifest.md" /%}
+
 ## Metadata Ingestion

 All connectors are defined as JSON Schemas.
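The added sections ask users to place an `openmetadata.json` at the bucket root. As a minimal sketch (the entry values and the upload command are illustrative assumptions, not taken from the docs), assembling and writing that file could look like:

```python
import json
import tempfile
from pathlib import Path

# Sketch: assemble a minimal openmetadata.json, ready to be uploaded to the
# bucket root. The entries below are illustrative placeholders.
manifest = {
    "entries": [
        {"dataPath": "transactions", "structureFormat": "csv"},
        {
            "dataPath": "cities",
            "structureFormat": "parquet",
            "isPartitioned": True,
        },
    ]
}

out_dir = Path(tempfile.mkdtemp())
out_file = out_dir / "openmetadata.json"  # the file must sit at the bucket root
out_file.write_text(json.dumps(manifest, indent=4))

# From here you would upload the file to the bucket root, for example with the
# AWS CLI: aws s3 cp openmetadata.json s3://<your-bucket>/openmetadata.json
```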
@ -48,7 +48,7 @@ For questions related to scopes, click [here](https://developer.domo.com/portal/
 - **Secret Token**: Secret Token to Connect DOMO Dashboard.
 - **Access Token**: Access to Connect to DOMO Dashboard.
 - **API Host**: API Host to Connect to DOMO Dashboard instance.
-- **SandBox Domain**: Connect to SandBox Domain.
+- **Instance Domain**: URL to connect to your Domo instance UI. For example `https://<your>.domo.com`.

 {% /extraContent %}
@ -90,7 +90,7 @@ This is a sample config for Domo-Dashboard:

 {% codeInfo srNumber=5 %}

-**SandBox Domain**: Connect to SandBox Domain.
+**Instance Domain**: URL to connect to your Domo instance UI. For example `https://<your>.domo.com`.

 {% /codeInfo %}
@ -145,7 +145,7 @@ source:
 apiHost: api.domo.com
 ```
 ```yaml {% srNumber=5 %}
-sandboxDomain: https://<api_domo>.domo.com
+instanceDomain: https://<your>.domo.com
 ```
 ```yaml {% srNumber=6 %}
 sourceConfig:
@ -63,7 +63,7 @@ For questions related to scopes, click [here](https://developer.domo.com/portal/
 - **Secret Token**: Secret Token to Connect DOMO Database.
 - **Access Token**: Access to Connect to DOMO Database.
 - **API Host**: API Host to Connect to DOMO Database instance.
-- **SandBox Domain**: Connect to SandBox Domain.
+- **Instance Domain**: URL to connect to your Domo instance UI. For example `https://<your>.domo.com`.

 {% /extraContent %}
@ -108,7 +108,7 @@ This is a sample config for DomoDatabase:

 {% codeInfo srNumber=5 %}

-**SandBox Domain**: Connect to SandBox Domain.
+**Instance Domain**: URL to connect to your Domo instance UI. For example `https://<your>.domo.com`.

 {% /codeInfo %}
@ -186,7 +186,7 @@ source:
 apiHost: api.domo.com
 ```
 ```yaml {% srNumber=5 %}
-sandboxDomain: https://<api_domo>.domo.com
+instanceDomain: https://<your>.domo.com
 ```
 ```yaml {% srNumber=6 %}
 # database: database
@ -40,7 +40,7 @@ For questions related to scopes, click [here](https://developer.domo.com/portal/
 - **Secret Token**: Secret Token to Connect to DOMO Pipeline.
 - **Access Token**: Access to Connect to DOMO Pipeline.
 - **API Host**: API Host to Connect to DOMO Pipeline.
-- **SandBox Domain**: Connect to SandBox Domain.
+- **Instance Domain**: URL to connect to your Domo instance UI. For example `https://<your>.domo.com`.

 {% /extraContent %}
@ -85,7 +85,7 @@ This is a sample config for Domo-Pipeline:

 {% codeInfo srNumber=5 %}

-**SandBox Domain**: Connect to SandBox Domain.
+**Instance Domain**: URL to connect to your Domo instance UI. For example `https://<your>.domo.com`.

 {% /codeInfo %}
@ -143,7 +143,7 @@ source:
 apiHost: api.domo.com
 ```
 ```yaml {% srNumber=5 %}
-sandboxDomain: https://<api_domo>.domo.com
+instanceDomain: https://<your>.domo.com
 ```
 ```yaml {% srNumber=6 %}
 sourceConfig:
@ -40,115 +40,4 @@ We are flattening this structure to simplify the navigation.

 {% /note %}

-## OpenMetadata Manifest
-
-Our manifest file is defined as a [JSON Schema](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/storage/containerMetadataConfig.json),
-and can look like this:
-
-[... the rest of the inline manifest walkthrough and `openmetadata.json` example, elided ...]
+{% partial file="/v1.2/connectors/storage/manifest.md" /%}
@ -82,6 +82,16 @@ The policy would look like:
 }
 ```

+### OpenMetadata Manifest
+
+Unlike other connectors, where metadata extraction happens automatically, here we can only extract high-level
+metadata from buckets. To understand their internal structure, users need to provide an `openmetadata.json`
+file at the bucket root.
+
+You can learn more about this [here](/connectors/storage). Keep reading for an example of the shape of the manifest file.
+
+{% partial file="/v1.2/connectors/storage/manifest.md" /%}
+
 ## Metadata Ingestion

 {% stepsContainer %}
@ -92,6 +92,16 @@ To run the Athena ingestion, you will need to install:
 pip3 install "openmetadata-ingestion[athena]"
 ```

+### OpenMetadata Manifest
+
+Unlike other connectors, where metadata extraction happens automatically, here we can only extract high-level
+metadata from buckets. To understand their internal structure, users need to provide an `openmetadata.json`
+file at the bucket root.
+
+You can learn more about this [here](/connectors/storage). Keep reading for an example of the shape of the manifest file.
+
+{% partial file="/v1.2/connectors/storage/manifest.md" /%}
+
 ## Metadata Ingestion

 All connectors are defined as JSON Schemas.