Schemas for schemaless formats (CSV, TSV, JSONL, JSON) are inferred. For CSV, TSV, and JSONL files, we infer the schema from the first 100 rows by default; this limit can be controlled via the `max_rows` recipe parameter (see [below](#config-details)).
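As a rough illustration (not the connector's actual implementation), row-limited inference in PySpark can be sketched by reading the file as raw text, truncating to the row limit, and re-parsing the sample. The bucket path and session setup below are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-inference").getOrCreate()

# Illustrative only: take the first `max_rows` lines of the CSV as a sample,
# then let Spark infer column types from that sample alone. The header line
# counts toward the sample here, which is fine for a sketch.
max_rows = 100
sample = spark.read.text("s3a://my-bucket/events.csv").limit(max_rows)
inferred = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv(sample.rdd.map(lambda row: row[0]))
)
inferred.printSchema()
```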
Schemas for JSON files are inferred from the entire file (given the difficulty of extracting only the first few objects from arbitrary JSON), which may impact performance on large files.
We are working on using iterator-based JSON parsers to avoid reading the entire JSON document into memory.
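For a sense of what an iterator-based approach looks like, the sketch below uses the third-party `ijson` streaming parser to pull only the first few objects from a top-level JSON array; this is one possible technique, not this project's implementation, and the file name is a placeholder:

```python
import itertools
import ijson  # third-party streaming (iterator-based) JSON parser

# Illustrative only: yield objects one at a time from a top-level JSON array
# and stop after 100, without materializing the whole document in memory.
with open("records.json", "rb") as f:
    sample = list(itertools.islice(ijson.items(f, "item"), 100))
```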
Note that because profiling is run with PySpark, we require Spark 3.0.3 with Hadoop 3.2 to be installed (see [compatibility](#compatibility) for more details). If profiling, make sure that permissions for **s3a://** access are set, since Spark and Hadoop interface with AWS through the s3a:// protocol (schema inference outside of profiling requires s3:// access).
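One common way to wire up s3a:// access for a Spark session is via the standard Hadoop-AWS configuration keys, shown below as an assumption-laden sketch; the bucket path is a placeholder, and the `hadoop-aws` package may already be bundled with your Spark/Hadoop installation:

```python
from pyspark.sql import SparkSession

# Illustrative only: configure s3a:// credentials through the standard
# Hadoop-AWS options. Replace the placeholder credentials and path.
spark = (
    SparkSession.builder.appName("profiling")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
    .config("spark.hadoop.fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
    .config("spark.hadoop.fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")
    .getOrCreate()
)

# Verify access by reading a dataset over s3a://.
spark.read.parquet("s3a://my-bucket/path/").printSchema()
```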