[Docs] - Docs storage for manifest & Domo (#13290)

* Domo docs

* Storage Manifest
Pere Miquel Brull 2023-09-21 13:06:56 +02:00 committed by GitHub
parent f05e874c7a
commit c53fe684fd
14 changed files with 275 additions and 232 deletions

View File

@ -0,0 +1,112 @@
## OpenMetadata Manifest
Our manifest file is defined as a [JSON Schema](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/storage/containerMetadataConfig.json),
and can look like this:
{% codePreview %}
{% codeInfoContainer %}
{% codeInfo srNumber=1 %}
**Entries**: We need to add a list of `entries`. Each inner JSON structure will be ingested as a child container of the top-level
one. In this case, we will be ingesting 4 children.
{% /codeInfo %}
{% codeInfo srNumber=2 %}
**Simple Container**: The simplest container we can have would be structured, but without partitions. Note that we still
need to bring information about:
- **dataPath**: Where we can find the data. This should be a path relative to the top-level container.
- **structureFormat**: What is the format of the data we are going to find. This information will be used to read the data.
After ingesting this container, we will bring in the schema of the data in the `dataPath`.
{% /codeInfo %}
{% codeInfo srNumber=3 %}
**Partitioned Container**: We can ingest partitioned data without bringing in any further details.
By setting the `isPartitioned` field to `true`, we'll flag the container as `Partitioned`. We will read the
source files' schemas, but won't add any other information.
{% /codeInfo %}
{% codeInfo srNumber=4 %}
**Single-Partition Container**: We can bring partition information by specifying the `partitionColumns`. Their definition
is based on the [JSON Schema](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/entity/data/table.json#L232)
definition for table columns. The minimum required information is the `name` and `dataType`.
When passing `partitionColumns`, these values will be added to the schema, on top of the inferred information from the files.
{% /codeInfo %}
{% codeInfo srNumber=5 %}
**Multiple-Partition Container**: We can add multiple columns as partitions.
Note how in the example we even bring our custom `displayName` for the column and `dataTypeDisplay` for its type.
Again, this information will be added on top of the inferred schema from the data files.
{% /codeInfo %}
{% /codeInfoContainer %}
{% codeBlock fileName="openmetadata.json" %}
```json {% srNumber=1 %}
{
"entries": [
```
```json {% srNumber=2 %}
{
"dataPath": "transactions",
"structureFormat": "csv"
},
```
```json {% srNumber=3 %}
{
"dataPath": "cities",
"structureFormat": "parquet",
"isPartitioned": true
},
```
```json {% srNumber=4 %}
{
"dataPath": "cities_multiple_simple",
"structureFormat": "parquet",
"isPartitioned": true,
"partitionColumns": [
{
"name": "State",
"dataType": "STRING"
}
]
},
```
```json {% srNumber=5 %}
{
"dataPath": "cities_multiple",
"structureFormat": "parquet",
"isPartitioned": true,
"partitionColumns": [
{
"name": "Year",
"displayName": "Year (Partition)",
"dataType": "DATE",
"dataTypeDisplay": "date (year)"
},
{
"name": "State",
"dataType": "STRING"
}
]
}
]
}
```
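
The required fields described above can be checked before uploading a manifest. This is a minimal sketch, not the validation OpenMetadata itself performs: `check_manifest` is a hypothetical helper, and it treats `dataPath` and `structureFormat` as required for every entry (and `name` and `dataType` for every partition column) purely because the text above says so. The linked JSON Schema remains the authority.

```python
import json
from typing import List

# A shortened manifest using two of the entries from the example above.
MANIFEST = """
{
  "entries": [
    {"dataPath": "transactions", "structureFormat": "csv"},
    {
      "dataPath": "cities_multiple_simple",
      "structureFormat": "parquet",
      "isPartitioned": true,
      "partitionColumns": [{"name": "State", "dataType": "STRING"}]
    }
  ]
}
"""


def check_manifest(manifest: dict) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    entries = manifest.get("entries")
    if not isinstance(entries, list):
        return ["manifest must contain a list under 'entries'"]

    problems = []
    for i, entry in enumerate(entries):
        # Assumption from the prose: each entry needs these two keys.
        for key in ("dataPath", "structureFormat"):
            if key not in entry:
                problems.append(f"entries[{i}] is missing '{key}'")
        # Partition columns need at least a name and a data type.
        for j, column in enumerate(entry.get("partitionColumns", [])):
            for key in ("name", "dataType"):
                if key not in column:
                    problems.append(
                        f"entries[{i}].partitionColumns[{j}] is missing '{key}'"
                    )
    return problems


problems = check_manifest(json.loads(MANIFEST))
print(problems or "no problems found")
```

The helper only looks at key presence. It does not check that `structureFormat` or `dataType` hold values the schema accepts.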

View File

@ -0,0 +1,112 @@
## OpenMetadata Manifest
Our manifest file is defined as a [JSON Schema](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/storage/containerMetadataConfig.json),
and can look like this:
{% codePreview %}
{% codeInfoContainer %}
{% codeInfo srNumber=1 %}
**Entries**: We need to add a list of `entries`. Each inner JSON structure will be ingested as a child container of the top-level
one. In this case, we will be ingesting 4 children.
{% /codeInfo %}
{% codeInfo srNumber=2 %}
**Simple Container**: The simplest container we can have would be structured, but without partitions. Note that we still
need to bring information about:
- **dataPath**: Where we can find the data. This should be a path relative to the top-level container.
- **structureFormat**: What is the format of the data we are going to find. This information will be used to read the data.
After ingesting this container, we will bring in the schema of the data in the `dataPath`.
{% /codeInfo %}
{% codeInfo srNumber=3 %}
**Partitioned Container**: We can ingest partitioned data without bringing in any further details.
By setting the `isPartitioned` field to `true`, we'll flag the container as `Partitioned`. We will read the
source files' schemas, but won't add any other information.
{% /codeInfo %}
{% codeInfo srNumber=4 %}
**Single-Partition Container**: We can bring partition information by specifying the `partitionColumns`. Their definition
is based on the [JSON Schema](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/entity/data/table.json#L232)
definition for table columns. The minimum required information is the `name` and `dataType`.
When passing `partitionColumns`, these values will be added to the schema, on top of the inferred information from the files.
{% /codeInfo %}
{% codeInfo srNumber=5 %}
**Multiple-Partition Container**: We can add multiple columns as partitions.
Note how in the example we even bring our custom `displayName` for the column and `dataTypeDisplay` for its type.
Again, this information will be added on top of the inferred schema from the data files.
{% /codeInfo %}
{% /codeInfoContainer %}
{% codeBlock fileName="openmetadata.json" %}
```json {% srNumber=1 %}
{
"entries": [
```
```json {% srNumber=2 %}
{
"dataPath": "transactions",
"structureFormat": "csv"
},
```
```json {% srNumber=3 %}
{
"dataPath": "cities",
"structureFormat": "parquet",
"isPartitioned": true
},
```
```json {% srNumber=4 %}
{
"dataPath": "cities_multiple_simple",
"structureFormat": "parquet",
"isPartitioned": true,
"partitionColumns": [
{
"name": "State",
"dataType": "STRING"
}
]
},
```
```json {% srNumber=5 %}
{
"dataPath": "cities_multiple",
"structureFormat": "parquet",
"isPartitioned": true,
"partitionColumns": [
{
"name": "Year",
"displayName": "Year (Partition)",
"dataType": "DATE",
"dataTypeDisplay": "date (year)"
},
{
"name": "State",
"dataType": "STRING"
}
]
}
]
}
```

View File

@ -40,115 +40,5 @@ We are flattening this structure to simplify the navigation.
{% /note %}
## OpenMetadata Manifest
Our manifest file is defined as a [JSON Schema](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/storage/containerMetadataConfig.json),
and can look like this:
{% codePreview %}
{% codeInfoContainer %}
{% codeInfo srNumber=1 %}
**Entries**: We need to add a list of `entries`. Each inner JSON structure will be ingested as a child container of the top-level
one. In this case, we will be ingesting 4 children.
{% /codeInfo %}
{% codeInfo srNumber=2 %}
**Simple Container**: The simplest container we can have would be structured, but without partitions. Note that we still
need to bring information about:
- **dataPath**: Where we can find the data. This should be a path relative to the top-level container.
- **structureFormat**: What is the format of the data we are going to find. This information will be used to read the data.
After ingesting this container, we will bring in the schema of the data in the `dataPath`.
{% /codeInfo %}
{% codeInfo srNumber=3 %}
**Partitioned Container**: We can ingest partitioned data without bringing in any further details.
By setting the `isPartitioned` field to `true`, we'll flag the container as `Partitioned`. We will read the
source files' schemas, but won't add any other information.
{% /codeInfo %}
{% codeInfo srNumber=4 %}
**Single-Partition Container**: We can bring partition information by specifying the `partitionColumns`. Their definition
is based on the [JSON Schema](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/entity/data/table.json#L232)
definition for table columns. The minimum required information is the `name` and `dataType`.
When passing `partitionColumns`, these values will be added to the schema, on top of the inferred information from the files.
{% /codeInfo %}
{% codeInfo srNumber=5 %}
**Multiple-Partition Container**: We can add multiple columns as partitions.
Note how in the example we even bring our custom `displayName` for the column and `dataTypeDisplay` for its type.
Again, this information will be added on top of the inferred schema from the data files.
{% /codeInfo %}
{% /codeInfoContainer %}
{% codeBlock fileName="openmetadata.json" %}
```json {% srNumber=1 %}
{
"entries": [
```
```json {% srNumber=2 %}
{
"dataPath": "transactions",
"structureFormat": "csv"
},
```
```json {% srNumber=3 %}
{
"dataPath": "cities",
"structureFormat": "parquet",
"isPartitioned": true
},
```
```json {% srNumber=4 %}
{
"dataPath": "cities_multiple_simple",
"structureFormat": "parquet",
"isPartitioned": true,
"partitionColumns": [
{
"name": "State",
"dataType": "STRING"
}
]
},
```
```json {% srNumber=5 %}
{
"dataPath": "cities_multiple",
"structureFormat": "parquet",
"isPartitioned": true,
"partitionColumns": [
{
"name": "Year",
"displayName": "Year (Partition)",
"dataType": "DATE",
"dataTypeDisplay": "date (year)"
},
{
"name": "State",
"dataType": "STRING"
}
]
}
]
}
```
{% partial file="/v1.1/connectors/storage/manifest.md" /%}

View File

@ -82,6 +82,16 @@ The policy would look like:
}
```
### OpenMetadata Manifest
In any other connector, extracting metadata happens automatically. In this case, we will be able to extract high-level
metadata from buckets, but in order to understand their internal structure we need users to provide an `openmetadata.json`
file at the bucket root.
You can learn more about this [here](/connectors/storage). Keep reading for an example of the shape of the manifest file.
{% partial file="/v1.1/connectors/storage/manifest.md" /%}
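
Producing the `openmetadata.json` file mentioned above can be scripted. The following is a sketch under stated assumptions: `build_manifest` is a hypothetical helper, the entries are copied from the manifest example, and the file is written to a temporary local directory. Uploading it to the bucket root is deliberately left out, since that depends on your storage client.

```python
import json
import tempfile
from pathlib import Path


def build_manifest(entries):
    """Wrap a list of entry dicts in the top-level `entries` key."""
    return {"entries": list(entries)}


manifest = build_manifest(
    [
        {"dataPath": "transactions", "structureFormat": "csv"},
        {"dataPath": "cities", "structureFormat": "parquet", "isPartitioned": True},
    ]
)

# Written locally for illustration; the real file must live at the bucket root.
target = Path(tempfile.mkdtemp()) / "openmetadata.json"
target.write_text(json.dumps(manifest, indent=2), encoding="utf-8")

print(target.read_text(encoding="utf-8"))
```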
## Metadata Ingestion
{% stepsContainer %}

View File

@ -92,6 +92,16 @@ To run the Athena ingestion, you will need to install:
pip3 install "openmetadata-ingestion[athena]"
```
### OpenMetadata Manifest
In any other connector, extracting metadata happens automatically. In this case, we will be able to extract high-level
metadata from buckets, but in order to understand their internal structure we need users to provide an `openmetadata.json`
file at the bucket root.
You can learn more about this [here](/connectors/storage). Keep reading for an example of the shape of the manifest file.
{% partial file="/v1.1/connectors/storage/manifest.md" /%}
## Metadata Ingestion
All connectors are defined as JSON Schemas.

View File

@ -48,7 +48,7 @@ For questions related to scopes, click [here](https://developer.domo.com/portal/
- **Secret Token**: Secret Token to Connect DOMO Dashboard.
- **Access Token**: Access Token to Connect to DOMO Dashboard.
- **API Host**: API Host to Connect to DOMO Dashboard instance.
- **SandBox Domain**: Connect to SandBox Domain.
- **Instance Domain**: URL to connect to your Domo instance UI. For example `https://<your>.domo.com`.
{% /extraContent %}

View File

@ -90,7 +90,7 @@ This is a sample config for Domo-Dashboard:
{% codeInfo srNumber=5 %}
**SandBox Domain**: Connect to SandBox Domain.
**Instance Domain**: URL to connect to your Domo instance UI. For example `https://<your>.domo.com`.
{% /codeInfo %}
@ -145,7 +145,7 @@ source:
apiHost: api.domo.com
```
```yaml {% srNumber=5 %}
sandboxDomain: https://<api_domo>.domo.com
instanceDomain: https://<your>.domo.com
```
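
The key rename in this hunk (`sandboxDomain` to `instanceDomain`) can also be applied to an existing config in code. This is a minimal sketch, assuming the connection options are already loaded into a plain Python dict. `rename_sandbox_domain` is a hypothetical helper and not part of `openmetadata-ingestion`, and it assumes the old value already points at your Domo instance UI, which you should verify by hand.

```python
def rename_sandbox_domain(connection: dict) -> dict:
    """Return a copy of `connection` with `sandboxDomain` renamed to `instanceDomain`."""
    updated = dict(connection)  # shallow copy so the input is left untouched
    if "sandboxDomain" in updated:
        old_value = updated.pop("sandboxDomain")
        # Keep an existing `instanceDomain` if one was already set.
        updated.setdefault("instanceDomain", old_value)
    return updated


# Illustrative values only; replace with your own instance URL.
old_config = {
    "apiHost": "api.domo.com",
    "sandboxDomain": "https://example.domo.com",
}
print(rename_sandbox_domain(old_config))
```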
```yaml {% srNumber=6 %}
sourceConfig:

View File

@ -63,7 +63,7 @@ For questions related to scopes, click [here](https://developer.domo.com/portal/
- **Secret Token**: Secret Token to Connect DOMO Database.
- **Access Token**: Access Token to Connect to DOMO Database.
- **Api Host**: API Host to Connect to DOMO Database instance.
- **SandBox Domain**: Connect to SandBox Domain.
- **Instance Domain**: URL to connect to your Domo instance UI. For example `https://<your>.domo.com`.
{% /extraContent %}

View File

@ -108,7 +108,7 @@ This is a sample config for DomoDatabase:
{% codeInfo srNumber=5 %}
**SandBox Domain**: Connect to SandBox Domain.
**Instance Domain**: URL to connect to your Domo instance UI. For example `https://<your>.domo.com`.
{% /codeInfo %}
@ -186,7 +186,7 @@ source:
apiHost: api.domo.com
```
```yaml {% srNumber=5 %}
sandboxDomain: https://<api_domo>.domo.com
instanceDomain: https://<your>.domo.com
```
```yaml {% srNumber=6 %}
# database: database

View File

@ -40,7 +40,7 @@ For questions related to scopes, click [here](https://developer.domo.com/portal/
- **Secret Token**: Secret Token to Connect to DOMO Pipeline.
- **Access Token**: Access Token to Connect to DOMO Pipeline.
- **API Host**: API Host to Connect to DOMO Pipeline.
- **SandBox Domain**: Connect to SandBox Domain.
- **Instance Domain**: URL to connect to your Domo instance UI. For example `https://<your>.domo.com`.
{% /extraContent %}

View File

@ -85,7 +85,7 @@ This is a sample config for Domo-Pipeline:
{% codeInfo srNumber=5 %}
**SandBox Domain**: Connect to SandBox Domain.
**Instance Domain**: URL to connect to your Domo instance UI. For example `https://<your>.domo.com`.
{% /codeInfo %}
@ -143,7 +143,7 @@ source:
apiHost: api.domo.com
```
```yaml {% srNumber=5 %}
sandboxDomain: https://<api_domo>.domo.com
instanceDomain: https://<your>.domo.com
```
```yaml {% srNumber=6 %}
sourceConfig:

View File

@ -40,115 +40,4 @@ We are flattening this structure to simplify the navigation.
{% /note %}
## OpenMetadata Manifest
Our manifest file is defined as a [JSON Schema](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/storage/containerMetadataConfig.json),
and can look like this:
{% codePreview %}
{% codeInfoContainer %}
{% codeInfo srNumber=1 %}
**Entries**: We need to add a list of `entries`. Each inner JSON structure will be ingested as a child container of the top-level
one. In this case, we will be ingesting 4 children.
{% /codeInfo %}
{% codeInfo srNumber=2 %}
**Simple Container**: The simplest container we can have would be structured, but without partitions. Note that we still
need to bring information about:
- **dataPath**: Where we can find the data. This should be a path relative to the top-level container.
- **structureFormat**: What is the format of the data we are going to find. This information will be used to read the data.
After ingesting this container, we will bring in the schema of the data in the `dataPath`.
{% /codeInfo %}
{% codeInfo srNumber=3 %}
**Partitioned Container**: We can ingest partitioned data without bringing in any further details.
By setting the `isPartitioned` field to `true`, we'll flag the container as `Partitioned`. We will read the
source files' schemas, but won't add any other information.
{% /codeInfo %}
{% codeInfo srNumber=4 %}
**Single-Partition Container**: We can bring partition information by specifying the `partitionColumns`. Their definition
is based on the [JSON Schema](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/entity/data/table.json#L232)
definition for table columns. The minimum required information is the `name` and `dataType`.
When passing `partitionColumns`, these values will be added to the schema, on top of the inferred information from the files.
{% /codeInfo %}
{% codeInfo srNumber=5 %}
**Multiple-Partition Container**: We can add multiple columns as partitions.
Note how in the example we even bring our custom `displayName` for the column and `dataTypeDisplay` for its type.
Again, this information will be added on top of the inferred schema from the data files.
{% /codeInfo %}
{% /codeInfoContainer %}
{% codeBlock fileName="openmetadata.json" %}
```json {% srNumber=1 %}
{
"entries": [
```
```json {% srNumber=2 %}
{
"dataPath": "transactions",
"structureFormat": "csv"
},
```
```json {% srNumber=3 %}
{
"dataPath": "cities",
"structureFormat": "parquet",
"isPartitioned": true
},
```
```json {% srNumber=4 %}
{
"dataPath": "cities_multiple_simple",
"structureFormat": "parquet",
"isPartitioned": true,
"partitionColumns": [
{
"name": "State",
"dataType": "STRING"
}
]
},
```
```json {% srNumber=5 %}
{
"dataPath": "cities_multiple",
"structureFormat": "parquet",
"isPartitioned": true,
"partitionColumns": [
{
"name": "Year",
"displayName": "Year (Partition)",
"dataType": "DATE",
"dataTypeDisplay": "date (year)"
},
{
"name": "State",
"dataType": "STRING"
}
]
}
]
}
```
{% partial file="/v1.2/connectors/storage/manifest.md" /%}

View File

@ -82,6 +82,16 @@ The policy would look like:
}
```
### OpenMetadata Manifest
In any other connector, extracting metadata happens automatically. In this case, we will be able to extract high-level
metadata from buckets, but in order to understand their internal structure we need users to provide an `openmetadata.json`
file at the bucket root.
You can learn more about this [here](/connectors/storage). Keep reading for an example of the shape of the manifest file.
{% partial file="/v1.2/connectors/storage/manifest.md" /%}
## Metadata Ingestion
{% stepsContainer %}

View File

@ -92,6 +92,16 @@ To run the Athena ingestion, you will need to install:
pip3 install "openmetadata-ingestion[athena]"
```
### OpenMetadata Manifest
In any other connector, extracting metadata happens automatically. In this case, we will be able to extract high-level
metadata from buckets, but in order to understand their internal structure we need users to provide an `openmetadata.json`
file at the bucket root.
You can learn more about this [here](/connectors/storage). Keep reading for an example of the shape of the manifest file.
{% partial file="/v1.2/connectors/storage/manifest.md" /%}
## Metadata Ingestion
All connectors are defined as JSON Schemas.