OpenMetadata Manifest
Our manifest file is defined by a JSON Schema and can look like this:
{% codePreview %}
{% codeInfoContainer %}
{% codeInfo srNumber=1 %}
Entries: We need to add a list of entries. Each inner JSON structure will be ingested as a child container of the top-level
one. In this case, we will be ingesting 4 children.
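
As a rough illustration only (this is not OpenMetadata code), the sketch below parses a manifest with Python's standard `json` module and lists the `dataPath` of each entry. The inline manifest is a trimmed copy of the example on this page, and the helper name `list_entries` is hypothetical.

```python
import json

# Trimmed copy of the example manifest; only the fields used here are kept.
MANIFEST = """
{
  "entries": [
    {"dataPath": "transactions", "structureFormat": "csv"},
    {"dataPath": "cities", "structureFormat": "parquet", "isPartitioned": true},
    {"dataPath": "cities_multiple_simple", "structureFormat": "parquet", "isPartitioned": true},
    {"dataPath": "cities_multiple", "structureFormat": "parquet", "isPartitioned": true}
  ]
}
"""


def list_entries(manifest_text):
    """Return the dataPath of every entry; each entry maps to one child container."""
    manifest = json.loads(manifest_text)
    return [entry["dataPath"] for entry in manifest["entries"]]


children = list_entries(MANIFEST)
print(len(children), children)
```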
{% /codeInfo %}
{% codeInfo srNumber=2 %}
Simple Container: The simplest container is structured but not partitioned. We still need to provide:
- dataPath: Where we can find the data. This should be a path relative to the top-level container.
- structureFormat: The format of the data. This information will be used to read the data.
After ingesting this container, we will bring in the schema of the data found in the dataPath.
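
A minimal sketch of a pre-ingestion sanity check for a single entry, assuming both fields above are mandatory and that "relative" means no URL scheme and no leading slash. The function `check_entry` and these rules are assumptions made for illustration, not the validation OpenMetadata itself applies.

```python
from urllib.parse import urlparse

# Fields the text above says every entry needs.
REQUIRED_FIELDS = ("dataPath", "structureFormat")


def check_entry(entry):
    """Return a list of problems found in one manifest entry (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    data_path = entry.get("dataPath", "")
    # Assumption: a relative path has no URL scheme and no leading slash.
    if urlparse(data_path).scheme or data_path.startswith("/"):
        problems.append(f"dataPath should be relative: {data_path}")
    return problems


print(check_entry({"dataPath": "transactions", "structureFormat": "csv"}))
print(check_entry({"dataPath": "s3://bucket/transactions"}))
```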
{% /codeInfo %}
{% codeInfo srNumber=3 %}
Partitioned Container: We can ingest partitioned data without bringing in any further details.
By setting the isPartitioned field to true, we'll flag the container as Partitioned. We will read the
source files' schemas, but won't add any other information.
{% /codeInfo %}
{% codeInfo srNumber=4 %}
Single-Partition Container: We can bring partition information by specifying the partitionColumns. Their definition
is based on the JSON Schema definition for table columns. The minimum required information is the name and dataType.
When passing partitionColumns, these values will be added to the schema, on top of the information inferred from the files.
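
To make "added on top of the inferred schema" concrete, here is a small sketch under stated assumptions: the inferred columns are invented sample data, the partition columns are simply appended after them, and `merge_partition_columns` is a hypothetical helper. How OpenMetadata actually orders or merges the columns may differ.

```python
REQUIRED_COLUMN_FIELDS = {"name", "dataType"}


def merge_partition_columns(inferred_columns, partition_columns):
    """Append partition columns to the columns inferred from the data files."""
    for column in partition_columns:
        # name and dataType are the minimum required information.
        missing = REQUIRED_COLUMN_FIELDS - set(column)
        if missing:
            raise ValueError(f"partition column is missing {sorted(missing)}")
    return list(inferred_columns) + list(partition_columns)


# Hypothetical columns, standing in for what would be read from the files.
inferred = [
    {"name": "city", "dataType": "STRING"},
    {"name": "population", "dataType": "INT"},
]
partitions = [{"name": "State", "dataType": "STRING"}]

schema = merge_partition_columns(inferred, partitions)
print([column["name"] for column in schema])
```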
{% /codeInfo %}
{% codeInfo srNumber=5 %}
Multiple-Partition Container: We can add multiple columns as partitions.
Note how in the example we even bring our custom displayName for the column and dataTypeDisplay for its type.
Again, this information will be added on top of the inferred schema from the data files.
{% /codeInfo %}
{% /codeInfoContainer %}
{% codeBlock fileName="openmetadata.json" %}
{
    "entries": [
        {
            "dataPath": "transactions",
            "structureFormat": "csv"
        },
        {
            "dataPath": "cities",
            "structureFormat": "parquet",
            "isPartitioned": true
        },
        {
            "dataPath": "cities_multiple_simple",
            "structureFormat": "parquet",
            "isPartitioned": true,
            "partitionColumns": [
                {
                    "name": "State",
                    "dataType": "STRING"
                }
            ]
        },
        {
            "dataPath": "cities_multiple",
            "structureFormat": "parquet",
            "isPartitioned": true,
            "partitionColumns": [
                {
                    "name": "Year",
                    "displayName": "Year (Partition)",
                    "dataType": "DATE",
                    "dataTypeDisplay": "date (year)"
                },
                {
                    "name": "State",
                    "dataType": "STRING"
                }
            ]
        }
    ]
}