## OpenMetadata Manifest
Our manifest file is defined as a [JSON Schema](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/storage/containerMetadataConfig.json),
and can look like this:
{% codePreview %}
{% codeInfoContainer %}
{% codeInfo srNumber=1 %}
**Entries**: We need to add a list of `entries`. Each inner JSON structure will be ingested as a child container of the top-level
one. In this example, the manifest lists seven entries: four structured containers and three unstructured ones.
{% /codeInfo %}
{% codeInfo srNumber=2 %}
**Simple Container**: The simplest container we can have would be structured, but without partitions. Note that we still
need to bring information about:
- **dataPath**: Where we can find the data. This should be a path relative to the top-level container.
- **structureFormat**: What is the format of the data we are going to find. This information will be used to read the data.
- **separator**: Optionally, for delimiter-separated formats such as CSV, you can specify the separator to use when reading the file.
If you don't, we will use `,` for CSV and `\t` for TSV files.
After ingesting this container, we will bring in the schema of the data in the `dataPath`.
{% /codeInfo %}
{% codeInfo srNumber=3 %}
**Partitioned Container**: We can ingest partitioned data without bringing in any further details.
By setting the `isPartitioned` field to `true`, we'll flag the container as `Partitioned`. We will read the
source files' schemas, but won't add any other information.
{% /codeInfo %}
{% codeInfo srNumber=4 %}
**Single-Partition Container**: We can bring partition information by specifying the `partitionColumns`. Their definition
is based on the [JSON Schema](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/entity/data/table.json#L232)
definition for table columns. The minimum required information is the `name` and `dataType`.
When passing `partitionColumns`, these values will be added to the schema, on top of the inferred information from the files.
{% /codeInfo %}
{% codeInfo srNumber=5 %}
**Multiple-Partition Container**: We can add multiple columns as partitions.
Note how in the example we even bring our custom `displayName` for the column and `dataTypeDisplay` for its type.
Again, this information will be added on top of the inferred schema from the data files.
{% /codeInfo %}
{% codeInfo srNumber=6 %}
**Unstructured Container**: OpenMetadata supports ingesting unstructured files such as images and PDFs. We support fetching the file names, sizes, and tags associated with such files.
To ingest a single unstructured file, specifying its full path in `dataPath` is enough.
To ingest all unstructured files with specific extensions, for example `pdf` and `png`, provide the folder containing those files in `dataPath` and the list of extensions in the `unstructuredFormats` field.
To ingest all unstructured files regardless of their type or extension, provide the folder containing those files in `dataPath` and `["*"]` in the `unstructuredFormats` field.
{% /codeInfo %}
{% /codeInfoContainer %}
{% codeBlock fileName="openmetadata.json" %}
```json {% srNumber=1 %}
{
"entries": [
```
```json {% srNumber=2 %}
{
"dataPath": "transactions",
"structureFormat": "csv",
"separator": ","
},
```
```json {% srNumber=3 %}
{
"dataPath": "cities",
"structureFormat": "parquet",
"isPartitioned": true
},
```
```json {% srNumber=4 %}
{
"dataPath": "cities_multiple_simple",
"structureFormat": "parquet",
"isPartitioned": true,
"partitionColumns": [
{
"name": "State",
"dataType": "STRING"
}
]
},
```
```json {% srNumber=5 %}
{
"dataPath": "cities_multiple",
"structureFormat": "parquet",
"isPartitioned": true,
"partitionColumns": [
{
"name": "Year",
"displayName": "Year (Partition)",
"dataType": "DATE",
"dataTypeDisplay": "date (year)"
},
{
"name": "State",
"dataType": "STRING"
}
]
},
```
```json {% srNumber=6 %}
{
"dataPath": "path/to/solution.pdf"
},
{
"dataPath": "path/to/unstructured_folder_png_pdf",
"unstructuredFormats": ["png","pdf"]
},
{
"dataPath": "path/to/unstructured_folder_all",
"unstructuredFormats": ["*"]
}
]
}
```
{% /codeBlock %}
{% /codePreview %}
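As a rough illustration of the rules above, the following Python sketch builds a small manifest, fills in the default separator, and checks the minimum fields described in this section. The helper names `default_separator` and `check_entry` are hypothetical and are not part of OpenMetadata; the authoritative validation is the JSON Schema linked above.

```python
import json

# Default separators described above: `,` for CSV and a tab for TSV.
DEFAULT_SEPARATORS = {"csv": ",", "tsv": "\t"}


def default_separator(entry):
    """Return the explicit separator, or the documented default for CSV/TSV."""
    if "separator" in entry:
        return entry["separator"]
    return DEFAULT_SEPARATORS.get(entry.get("structureFormat"))


def check_entry(entry):
    """Collect problems for one manifest entry (illustrative checks only)."""
    problems = []
    if "dataPath" not in entry:
        problems.append("missing dataPath")
    # Partition columns need at least `name` and `dataType`.
    for column in entry.get("partitionColumns", []):
        for field in ("name", "dataType"):
            if field not in column:
                problems.append(f"partition column missing {field}")
    return problems


manifest = {
    "entries": [
        {"dataPath": "transactions", "structureFormat": "csv"},
        {
            "dataPath": "cities_multiple_simple",
            "structureFormat": "parquet",
            "isPartitioned": True,
            "partitionColumns": [{"name": "State", "dataType": "STRING"}],
        },
        {"dataPath": "path/to/unstructured_folder_all", "unstructuredFormats": ["*"]},
    ]
}

for entry in manifest["entries"]:
    print(entry["dataPath"], check_entry(entry), repr(default_separator(entry)))

# Serializing with the standard library guarantees valid JSON
# (no trailing commas, booleans written as true/false).
manifest_json = json.dumps(manifest, indent=2)
```

Generating the file with `json.dumps` rather than writing it by hand avoids small syntax slips, such as a trailing comma after the last key of an entry.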
### Global Manifest
You can also manage a **single** manifest file, named `openmetadata_storage_manifest.json`, to centralize the ingestion process for any container. In that case,
you will need to add a `containerName` entry to the structure above. For example:
{% codePreview %}
{% codeInfoContainer %}
{% codeInfo srNumber=1 %}
The fields shown above (`dataPath`, `structureFormat`, `isPartitioned`, etc.) are still valid.
{% /codeInfo %}
{% codeInfo srNumber=2 %}
**Container Name**: Since we are using a single manifest for all our containers, the field `containerName` will
help us identify which container (or Bucket in S3, etc.) contains the presented information.
{% /codeInfo %}
{% /codeInfoContainer %}
{% codeBlock fileName="openmetadata-global.json" %}
```json {% srNumber=1 %}
{
"entries": [
{
"dataPath": "transactions",
"structureFormat": "csv",
"isPartitioned": false,
```
```json {% srNumber=2 %}
"containerName": "collate-demo-storage"
}
]
}
```
{% /codeBlock %}
{% /codePreview %}
You can also keep local `openmetadata.json` manifests in each container, but whenever possible, the ingestion will try to pick up
the global manifest first.
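
The precedence described above can be sketched as follows. This is only an illustration under stated assumptions: `pick_manifest` and `entries_for_container` are hypothetical helpers, the second container name is invented for the example, and how the ingestion actually locates the manifest files is not shown here.

```python
GLOBAL_MANIFEST = "openmetadata_storage_manifest.json"
LOCAL_MANIFEST = "openmetadata.json"


def pick_manifest(available_files):
    """Prefer the global manifest, fall back to the local one, else None."""
    for name in (GLOBAL_MANIFEST, LOCAL_MANIFEST):
        if name in available_files:
            return name
    return None


def entries_for_container(global_manifest, container_name):
    """Keep only the global-manifest entries that belong to one container."""
    return [
        entry
        for entry in global_manifest.get("entries", [])
        if entry.get("containerName") == container_name
    ]


global_manifest = {
    "entries": [
        {
            "dataPath": "transactions",
            "structureFormat": "csv",
            "isPartitioned": False,
            "containerName": "collate-demo-storage",
        },
        {
            "dataPath": "cities",
            "structureFormat": "parquet",
            "containerName": "another-container",  # hypothetical second container
        },
    ]
}

print(pick_manifest({LOCAL_MANIFEST, GLOBAL_MANIFEST}))
print(entries_for_container(global_manifest, "collate-demo-storage"))
```

The `containerName` filter is what lets one global file describe several containers without their entries mixing.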