## Valid path_specs.include

```python
s3://my-bucket/foo/tests/bar.avro # single file table
s3://my-bucket/foo/tests/*.* # multiple file-level tables
s3://my-bucket/foo/tests/{table}/*.avro # table without partition
s3://my-bucket/foo/tests/{table}/*/*.avro # table where partitions are not specified
s3://my-bucket/foo/tests/{table}/*.* # table where neither partition nor data type is specified
s3://my-bucket/{dept}/tests/{table}/*.avro # specifying keywords to be used in display name
s3://my-bucket/{dept}/tests/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.avro # specify partition key and value format
s3://my-bucket/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.avro # specify partition value only format
s3://my-bucket/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # for all extensions
s3://my-bucket/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # table is present 2 levels down in the bucket
s3://my-bucket/*/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # table is present 3 levels down in the bucket
```
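
To make the named-variable matching concrete, here is a minimal sketch in Python (not DataHub's actual implementation) of how a spec like the ones above can be translated into a regular expression, with `{table}` and `{partition[i]}` becoming capture groups and `*` matching a single folder level:

```python
import re

def spec_to_regex(spec: str) -> str:
    """Translate a path spec into a regex: each {name} placeholder becomes
    a named capture group and * matches within a single folder level."""
    parts = []
    for token in re.split(r"(\{[^}]+\}|\*)", spec):
        if token == "*":
            parts.append(r"[^/]+")
        elif token.startswith("{") and token.endswith("}"):
            # e.g. "{partition[0]}" -> group name "partition_0"
            name = re.sub(r"\W+", "_", token[1:-1]).strip("_")
            parts.append(f"(?P<{name}>[^/]+)")
        else:
            parts.append(re.escape(token))
    return "".join(parts) + "$"

spec = "s3://my-bucket/foo/tests/{table}/{partition[0]}/*.avro"
key = "s3://my-bucket/foo/tests/orders/year=2023/part-0001.avro"
match = re.match(spec_to_regex(spec), key)
print(match.group("table"), match.group("partition_0"))  # orders year=2023
```
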
## Valid path_specs.exclude

- `**/tests/**`
- `s3://my-bucket/hr/**`
- `**/tests/*.csv`
- `s3://my-bucket/foo/*/my_table/**`
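
Putting include and exclude together, here is a minimal sketch of running an ingestion programmatically. The bucket, paths, region, and sink address are placeholder assumptions; `Pipeline` is DataHub's Python ingestion entry point.

```python
from datahub.ingestion.run.pipeline import Pipeline

# Placeholder bucket/paths and a local datahub-rest sink; adjust to your setup.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "s3",
            "config": {
                "path_specs": [
                    {
                        "include": "s3://my-bucket/foo/tests/{table}/*.avro",
                        "exclude": ["**/tests/*.csv"],
                    }
                ],
                "aws_config": {"aws_region": "us-east-1"},
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```
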
### Notes

- `{table}` represents the folder for which a dataset will be created.
- The include path must end with `*.*` or `*.[ext]` to represent the leaf level.
- If `*.[ext]` is provided, only files of the specified type will be scanned.
- `/*/` represents a single folder level.
- `{partition[i]}` represents the value of a partition.
- `{partition_key[i]}` represents the name of a partition.
- While extracting, `i` is used to match each `partition_key` to its `partition`.
- All folder levels need to be specified in the include path. Only the exclude path can use `**`-style matching.
- The exclude path cannot contain named variables (`{}`).
- Folder names should not contain `{`, `}`, `*`, or `/`.
- `{folder}` is reserved for internal use; please do not use it as a named variable.

If you would like to write a more complicated function for resolving file names, then a `{transformer}` would be a good fit.

:::caution
Specify as long a fixed prefix (without `/*/`) as possible in `path_specs.include`. This will reduce scanning time and cost, specifically on AWS S3.
:::
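
For example, of the two hypothetical specs below, the second is preferable: the fixed `prod` prefix lets the connector restrict its S3 listing instead of enumerating every top-level folder.

```python
s3://my-bucket/*/logs/{table}/*.parquet     # scans every top-level prefix
s3://my-bucket/prod/logs/{table}/*.parquet  # lists only under prod/
```
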
:::caution
Running profiling against many tables or over many rows can run up significant costs. While we've done our best to limit the expensiveness of the queries the profiler runs, you should be prudent about the set of tables profiling is enabled on or the frequency of the profiling runs.

:::
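
One way to keep costs bounded is to restrict which tables are profiled in the source config. The sketch below assumes the s3 source's `profiling` and `profile_patterns` options and uses placeholder paths:

```python
# Assumed s3 source config fields (profiling, profile_patterns); placeholder paths.
source_config = {
    "path_specs": [{"include": "s3://my-bucket/foo/tests/{table}/*.avro"}],
    "profiling": {"enabled": True},
    # Only profile tables matching this allow-list pattern.
    "profile_patterns": {"allow": ["s3://my-bucket/foo/tests/orders.*"]},
}
```
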
:::caution
If you are ingesting datasets from AWS S3, we recommend running the ingestion on a server in the same region to avoid high egress costs.
:::
## Compatibility
Profiles are computed with PyDeequ, which relies on PySpark. Therefore, for computing profiles, we currently require Spark 3.0.3 with Hadoop 3.2 to be installed and the `SPARK_HOME` and `SPARK_VERSION` environment variables to be set. The Spark+Hadoop binary can be downloaded [here](https://www.apache.org/dyn/closer.lua/spark/spark-3.0.3/spark-3.0.3-bin-hadoop3.2.tgz).
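
As a rough pre-flight check (the directory shown is an assumption for illustration), you can verify that the environment is set up before enabling profiling:

```python
import os

# Both variables must be set for PyDeequ/PySpark profiling to work.
spark_home = os.environ.get("SPARK_HOME")        # e.g. /opt/spark-3.0.3-bin-hadoop3.2
spark_version = os.environ.get("SPARK_VERSION")  # must match the installed release

if not spark_home or spark_version != "3.0.3":
    raise RuntimeError(
        "Profiling requires Spark 3.0.3 with Hadoop 3.2: "
        "set SPARK_HOME and SPARK_VERSION=3.0.3"
    )
```
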
For an example guide on setting up PyDeequ on AWS, see [this guide](https://aws.amazon.com/blogs/big-data/testing-data-quality-at-scale-with-pydeequ/).