### Path Specs

**Example - Dataset per file**

Bucket structure:

```
test-s3-bucket
├── employees.csv
└── food_items.csv
```

Path specs config:

```
path_specs:
  - include: s3://test-s3-bucket/*.csv
```
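
For context, path specs sit inside the S3 source's `config` block of an ingestion recipe. A minimal sketch, with an illustrative region and sink address (the exact set of surrounding fields depends on your DataHub version):

```yaml
source:
  type: s3
  config:
    path_specs:
      - include: s3://test-s3-bucket/*.csv
    aws_config:
      aws_region: us-east-1  # illustrative; credentials are resolved from the environment here
    env: PROD
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080  # illustrative DataHub GMS endpoint
```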

**Example - Datasets with partitions**

Bucket structure:

```
test-s3-bucket
├── orders
│   └── year=2022
│       └── month=2
│           ├── 1.parquet
│           └── 2.parquet
└── returns
    └── year=2021
        └── month=2
            └── 1.parquet
```

Path specs config:

```
path_specs:
  - include: s3://test-s3-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
```

**Example - Datasets with partition and exclude**

Bucket structure:

```
test-s3-bucket
├── orders
│   └── year=2022
│       └── month=2
│           ├── 1.parquet
│           └── 2.parquet
└── tmp_orders
    └── year=2021
        └── month=2
            └── 1.parquet
```

Path specs config:

```
path_specs:
  - include: s3://test-s3-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
    exclude:
      - **/tmp_orders/**
```

**Example - Datasets of mixed nature**

Bucket structure:

```
test-s3-bucket
├── customers
│   ├── part1.json
│   ├── part2.json
│   ├── part3.json
│   └── part4.json
├── employees.csv
├── food_items.csv
├── tmp_10101000.csv
└── orders
    └── year=2022
        └── month=2
            ├── 1.parquet
            ├── 2.parquet
            └── 3.parquet
```

Path specs config:

```
path_specs:
  - include: s3://test-s3-bucket/*.csv
    exclude:
      - **/tmp_10101000.csv
  - include: s3://test-s3-bucket/{table}/*.json
  - include: s3://test-s3-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
```

**Valid path_specs.include**

```python
s3://my-bucket/foo/tests/bar.avro # single file table
s3://my-bucket/foo/tests/*.* # multiple file-level tables
s3://my-bucket/foo/tests/{table}/*.avro # table without partition
s3://my-bucket/foo/tests/{table}/*/*.avro # table where partitions are not specified
s3://my-bucket/foo/tests/{table}/*.* # table where neither partitions nor data type are specified
s3://my-bucket/{dept}/tests/{table}/*.avro # specifying keywords to be used in the display name
s3://my-bucket/{dept}/tests/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.avro # specify partition key and value format
s3://my-bucket/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.avro # specify partition value only format
s3://my-bucket/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # for all extensions
s3://my-bucket/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # table is present 2 levels down in the bucket
s3://my-bucket/*/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # table is present 3 levels down in the bucket
```
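
Named variables such as `{dept}` above can be reused when naming the resulting datasets. A hypothetical sketch, assuming your DataHub version's path spec supports a `table_name` template (check the S3 source configuration reference for your release):

```yaml
path_specs:
  - include: s3://my-bucket/{dept}/tests/{table}/*.avro
    table_name: "{dept}.{table}"  # assumed option: builds the display name from named variables
```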

**Valid path_specs.exclude**

- \**/tests/**
- s3://my-bucket/hr/**
- **/tests/*.csv
- s3://my-bucket/foo/*/my_table/**

**Notes**

- {table} represents the folder for which a dataset will be created.
- The include path must end with (*.* or *.[ext]) to represent the leaf level.
- If *.[ext] is provided, only files with the specified extension will be scanned.
- /*/ represents a single folder.
- {partition[i]} represents the value of a partition.
- {partition_key[i]} represents the name of a partition.
- During extraction, "i" is used to match a partition_key to its partition value.
- All folder levels need to be specified in the include path. Only the exclude path can use **-style matching.
- The exclude path cannot contain named variables ( {} ).
- Folder names should not contain {, }, *, or / characters.
- {folder} is reserved for internal use; please do not use it as a named variable.
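
A small annotated spec pulling several of these rules together (bucket and folder names are illustrative):

```yaml
path_specs:
  - include: s3://my-bucket/data/{table}/{partition_key[0]}={partition[0]}/*.parquet  # ends at the leaf level with *.[ext]
    exclude:
      - "**/_tmp/**"  # exclude may use **-style matching, but never named variables
```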

If you would like to write a more complicated function for resolving file names, then a {transformer} would be a good fit.

:::caution
Specify as long a fixed prefix (without /*/) as possible in `path_specs.include`. This reduces scanning time and cost, especially on AWS S3.
:::
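
For example, if all of your tables live under a known prefix, spelling that prefix out avoids listing unrelated parts of the bucket (names below are illustrative):

```yaml
path_specs:
  # Broad: the leading wildcard forces a listing of every top-level folder.
  # - include: s3://my-bucket/*/{table}/*.parquet
  # Narrower: the fixed prefix restricts scanning to a single subtree.
  - include: s3://my-bucket/warehouse/daily/{table}/*.parquet
```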

:::caution
Running profiling against many tables or over many rows can run up significant costs. While we've done our best to limit the expensiveness of the queries the profiler runs, you should be prudent about the set of tables profiling is enabled on or the frequency of the profiling runs.
:::
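
One way to keep profiling costs in check is to enable it only in a recipe whose path specs cover the tables you actually need profiled. A minimal sketch, assuming the source exposes a `profiling` block with an `enabled` flag:

```yaml
source:
  type: s3
  config:
    path_specs:
      - include: s3://my-bucket/curated/{table}/*.parquet  # narrow scope: only curated tables
    profiling:
      enabled: true  # keep disabled in recipes that cover large raw datasets
```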

:::caution
If you are ingesting datasets from AWS S3, we recommend running the ingestion on a server in the same region to avoid high egress costs.
:::

### Compatibility

Profiles are computed with PyDeequ, which relies on PySpark. Therefore, for computing profiles, we currently require Spark 3.0.3 with Hadoop 3.2 to be installed and the `SPARK_HOME` and `SPARK_VERSION` environment variables to be set. The Spark+Hadoop binary can be downloaded [here](https://www.apache.org/dyn/closer.lua/spark/spark-3.0.3/spark-3.0.3-bin-hadoop3.2.tgz).

For an example guide on setting up PyDeequ on AWS, see [this guide](https://aws.amazon.com/blogs/big-data/testing-data-quality-at-scale-with-pydeequ/).