# Data lake files

For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).

## Setup

To install this plugin, run `pip install 'acryl-datahub[data-lake]'`. Because the files are read using PySpark, we require Spark 3.0.3 with Hadoop 3.2 to be installed.

The data lake connector extracts schemas and profiles from a variety of file formats (see below for an exhaustive list). Individual files are ingested as tables, and profiles are computed similarly to the [SQL profiler](./sql_profiles.md).

Enabling profiling will slow down ingestion runs.

:::caution

Running profiling against many tables or over many rows can run up significant costs.
While we've done our best to limit the cost of the queries the profiler runs, you
should be prudent about the set of tables profiling is enabled on and the frequency
of the profiling runs.

:::
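To keep costs down, profiling can be narrowed with the pattern and profiling options documented under [Config details](#config-details). Here is a minimal sketch; the regex and the column cap are purely illustrative:

```yml
source:
  type: data-lake
  config:
    platform: "local-data-lake"
    base_path: "/path/to/data/folder"
    # Only profile tables whose names match this (illustrative) pattern.
    profile_patterns:
      allow:
        - ".*sales.*"
    profiling:
      enabled: true
      # Cap the number of columns profiled per table (illustrative value).
      max_number_of_fields_to_profile: 10

sink:
  # sink configs
```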
## Capabilities

Extracts:

- Row and column counts for each table
- For each column, if applicable:
  - null counts and proportions
  - distinct counts and proportions
  - minimum, maximum, mean, median, standard deviation, some quantile values
  - histograms or frequencies of unique values

This connector supports both local files and files stored on AWS S3 (which must be identified using the `s3://` prefix). Supported file types are as follows:

- CSV
- TSV
- Parquet
- JSON

:::caution

If you are ingesting datasets from AWS S3, we recommend running the ingestion on a server in the same region to avoid high egress costs.

:::
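As a sketch, a recipe pointing at an S3 prefix might look like the following; the platform name, bucket, and region are placeholders, and the `aws_config` options are described under [Config details](#config-details):

```yml
source:
  type: data-lake
  config:
    platform: "s3-data-lake"
    # Placeholder bucket and prefix; any path under the s3:// scheme is crawled the same way as a local folder.
    base_path: "s3://my-bucket/path/to/data"
    aws_config:
      # Placeholder region; access keys are autodetected unless set explicitly (see the config table below).
      aws_region: "us-east-1"

sink:
  # sink configs
```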
## Quickstart recipe

Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.

For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).

```yml
source:
  type: data-lake
  config:
    env: "prod"
    platform: "local-data-lake"
    base_path: "/path/to/data/folder"
    profiling:
      enabled: true

sink:
  # sink configs
```

## Config details

Note that a `.` is used to denote nested fields in the YAML recipe.
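For example, `aws_config.aws_region` and `profiling.enabled` in the table below correspond to the following nesting in the recipe (the region value is just a placeholder):

```yml
# Fragment of the `source.config` block:
config:
  aws_config:
    aws_region: "us-east-1" # placeholder region
  profiling:
    enabled: true
```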
| Field | Required | Default | Description |
| --- | --- | --- | --- |
| `env` | | `PROD` | Environment to use in namespace when constructing URNs. |
| `platform` | ✅ | | Platform to use in namespace when constructing URNs. |
| `base_path` | ✅ | | Path of the base folder to crawl. Unless `schema_patterns` and `profile_patterns` are set, the connector will ingest all files in this folder. |
| `spark_driver_memory` | | `4g` | Max amount of memory to grant Spark. |
| `aws_config.aws_region` | If ingesting from AWS S3 | | AWS region code. |
| `aws_config.aws_access_key_id` | | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `aws_config.aws_secret_access_key` | | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `aws_config.aws_session_token` | | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `schema_patterns.allow` | | `*` | List of regex patterns for tables to ingest. Defaults to all. |
| `schema_patterns.deny` | | | List of regex patterns for tables to skip during ingestion. Defaults to none. |
| `schema_patterns.ignoreCase` | | `True` | Whether to ignore case when matching patterns for tables to ingest. |
| `profile_patterns.allow` | | `*` | List of regex patterns for tables to profile (a table must also be ingested for it to be profiled). Defaults to all. |
| `profile_patterns.deny` | | | List of regex patterns for tables to skip during profiling (a table must also be ingested for it to be profiled). Defaults to none. |
| `profile_patterns.ignoreCase` | | `True` | Whether to ignore case when matching patterns for tables to profile. |
| `profiling.enabled` | | `False` | Whether profiling should be done. |
| `profiling.profile_table_level_only` | | `False` | Whether to perform profiling at table level only or include column-level profiling as well. |
| `profiling.max_number_of_fields_to_profile` | | `None` | A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up. |
| `profiling.include_field_null_count` | | `True` | Whether to profile the number of nulls for each column. |
| `profiling.include_field_min_value` | | `True` | Whether to profile the min value of numeric columns. |
| `profiling.include_field_max_value` | | `True` | Whether to profile the max value of numeric columns. |
| `profiling.include_field_mean_value` | | `True` | Whether to profile the mean value of numeric columns. |
| `profiling.include_field_median_value` | | `True` | Whether to profile the median value of numeric columns. |
| `profiling.include_field_stddev_value` | | `True` | Whether to profile the standard deviation of numeric columns. |
| `profiling.include_field_quantiles` | | `True` | Whether to profile the quantiles of numeric columns. |
| `profiling.include_field_distinct_value_frequencies` | | `False` | Whether to profile distinct value frequencies. |
| `profiling.include_field_histogram` | | `False` | Whether to profile histograms for numeric fields. |
| `profiling.include_field_sample_values` | | `True` | Whether to profile sample values for all columns. |

## Compatibility

Files are read using PySpark and profiles are computed with PyDeequ.
We currently require Spark 3.0.3 with Hadoop 3.2 to be installed and the `SPARK_HOME` environment variable to be set for PySpark.
The Spark+Hadoop binary can be downloaded [here](https://www.apache.org/dyn/closer.lua/spark/spark-3.0.3/spark-3.0.3-bin-hadoop3.2.tgz).
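If profiling larger files exhausts the default driver memory, the `spark_driver_memory` option listed in the config table can be raised in the recipe; a quick sketch, where the `8g` value is illustrative:

```yml
source:
  type: data-lake
  config:
    platform: "local-data-lake"
    base_path: "/path/to/data/folder"
    spark_driver_memory: "8g" # default is 4g; size to your largest files
    profiling:
      enabled: true

sink:
  # sink configs
```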
For an example guide on setting up PyDeequ on AWS, see [this guide](https://aws.amazon.com/blogs/big-data/testing-data-quality-at-scale-with-pydeequ/).

## Questions

If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!