
## OpenMetadata Manifest

Our manifest file is defined as a [JSON Schema](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/storage/containerMetadataConfig.json),
and can look like this:

{% codePreview %}

{% codeInfoContainer %}

{% codeInfo srNumber=1 %}

**Entries**: We need to add a list of `entries`. Each inner JSON structure will be ingested as a child container of the top-level
one. In this example, the manifest lists seven entries.

{% /codeInfo %}

{% codeInfo srNumber=2 %}

**Simple Container**: The simplest container we can have is structured, but without partitions. Note that we still
need to provide information about:

- **dataPath**: Where we can find the data. This should be a path relative to the top-level container.
- **structureFormat**: What is the format of the data we are going to find. This information will be used to read the data.
- **separator**: Optionally, for delimiter-separated formats such as CSV, you can specify the separator to use when reading the file.
  If you don't, we will use `,` for CSV and `\t` for TSV files.

After ingesting this container, we will bring in the schema of the data in the `dataPath`.

{% /codeInfo %}

{% codeInfo srNumber=3 %}

**Partitioned Container**: We can ingest partitioned data without bringing in any further details.

By setting the `isPartitioned` field to `true`, we'll flag the container as `Partitioned`. We will read the
source files' schemas, but won't add any other information.

{% /codeInfo %}

{% codeInfo srNumber=4 %}

**Single-Partition Container**: We can bring partition information by specifying the `partitionColumns`. Their definition
is based on the [JSON Schema](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/entity/data/table.json#L232)
definition for table columns. The minimum required information is the `name` and `dataType`.

When passing `partitionColumns`, these values will be added to the schema, on top of the inferred information from the files.

{% /codeInfo %}

{% codeInfo srNumber=5 %}

**Multiple-Partition Container**: We can add multiple columns as partitions.

Note how in the example we even bring our custom `displayName` for the column and `dataTypeDisplay` for its type.

Again, this information will be added on top of the inferred schema from the data files.

{% /codeInfo %}

{% codeInfo srNumber=6 %}

**Unstructured Container**: OpenMetadata supports ingesting unstructured files such as images, PDFs, etc. We support fetching the file names, sizes, and tags associated with such files.

To ingest a single unstructured file, specifying its full path in `dataPath` is enough.

To ingest all unstructured files with specific extensions, for example `pdf` and `png`, provide the folder containing those files in `dataPath` and the list of extensions in the `unstructuredFormats` field.

To ingest all unstructured files irrespective of their file type or extension, provide the folder containing those files in `dataPath` and `["*"]` in the `unstructuredFormats` field.

{% /codeInfo %}


{% /codeInfoContainer %}

{% codeBlock fileName="openmetadata.json" %}

```json {% srNumber=1 %}
{
    "entries": [
```
```json {% srNumber=2 %}
        {
            "dataPath": "transactions",
            "structureFormat": "csv",
            "separator": ","
        },
```
```json {% srNumber=3 %}
        {
            "dataPath": "cities",
            "structureFormat": "parquet",
            "isPartitioned": true
        },
```
```json {% srNumber=4 %}
        {
            "dataPath": "cities_multiple_simple",
            "structureFormat": "parquet",
            "isPartitioned": true,
            "partitionColumns": [
                {
                    "name": "State",
                    "dataType": "STRING"
                }
            ]
        },
```
```json {% srNumber=5 %}
        {
            "dataPath": "cities_multiple",
            "structureFormat": "parquet",
            "isPartitioned": true,
            "partitionColumns": [
                {
                    "name": "Year",
                    "displayName": "Year (Partition)",
                    "dataType": "DATE",
                    "dataTypeDisplay": "date (year)"
                },
                {
                    "name": "State",
                    "dataType": "STRING"
                }
            ]
        },
```
```json {% srNumber=6 %}
        {
            "dataPath": "path/to/solution.pdf"
        },
        {
            "dataPath": "path/to/unstructured_folder_png_pdf",
            "unstructuredFormats": ["png","pdf"]
        },
        {
            "dataPath": "path/to/unstructured_folder_all",
            "unstructuredFormats": ["*"]
        }
    ]
}
```

{% /codeBlock %}

{% /codePreview %}
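
As a further illustration of the `separator` field described above, here is a minimal sketch of a manifest that reads a pipe-delimited CSV dataset and a TSV dataset that relies on the default separator. The paths `exports/orders` and `exports/events` are hypothetical, and `tsv` as a `structureFormat` value is an assumption based on the TSV default mentioned above; check the JSON Schema for the formats your version accepts.

```json
{
    "entries": [
        {
            "dataPath": "exports/orders",
            "structureFormat": "csv",
            "separator": "|"
        },
        {
            "dataPath": "exports/events",
            "structureFormat": "tsv"
        }
    ]
}
```

Note that the manifest must be strictly valid JSON: a trailing comma after the last field or the last entry will prevent the file from being parsed.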


### Global Manifest

You can also manage a **single** manifest file, named `openmetadata_storage_manifest.json`, to centralize the ingestion process for any container.

In that case,
you will need to add a `containerName` entry to the structure above. For example:

{% codePreview %}

{% codeInfoContainer %}

{% codeInfo srNumber=1 %}

The fields shown above (`dataPath`, `structureFormat`, `isPartitioned`, etc.) are still valid.

{% /codeInfo %}

{% codeInfo srNumber=2 %}

**Container Name**: Since we are using a single manifest for all the containers, the field `containerName` will
help us identify which container (or bucket in S3, etc.) contains the presented information.

{% /codeInfo %}

{% /codeInfoContainer %}

{% codeBlock fileName="openmetadata-global.json" %}

```json {% srNumber=1 %}
{
  "entries": [
    {
      "dataPath": "transactions",
      "structureFormat": "csv",
      "isPartitioned": false,
```

```json {% srNumber=2 %}
      "containerName": "collate-demo-storage"
    }
  ]
}
```

{% /codeBlock %}

{% /codePreview %}

You can also keep a local `openmetadata.json` manifest in each container, but whenever possible, we will try to pick up the global manifest
during the ingestion.
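
Because `containerName` is set per entry, one global manifest can describe data living in several containers. The following is a minimal sketch under that assumption: `collate-demo-storage` comes from the example above, while `collate-demo-archive` is a hypothetical second container used only for illustration.

```json
{
  "entries": [
    {
      "dataPath": "transactions",
      "structureFormat": "csv",
      "isPartitioned": false,
      "containerName": "collate-demo-storage"
    },
    {
      "dataPath": "cities",
      "structureFormat": "parquet",
      "isPartitioned": true,
      "containerName": "collate-demo-archive"
    }
  ]
}
```

Each `dataPath` is still expected to be relative to the container named in the same entry.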