datahub/gcs.md at unity_sql_proxy

yujunjun/datahub

Fork 0

mirror of https://github.com/datahub-project/datahub.git synced 2025-07-04 07:34:44 +00:00

Andrew Sikowitz d138a64a6a

ci(graphql,workflows): Format .md, .graphql, and workflow .yml files via prettier (#13220 )

2025-04-16 16:55:51 -07:00

4.4 KiB

Raw Permalink Blame History

Path Specs

Example - Dataset per file

Bucket structure:

test-gs-bucket
├── employees.csv
└── food_items.csv

Path specs config

path_specs:
    - include: gs://test-gs-bucket/*.csv

Example - Datasets with partitions

Bucket structure:

test-gs-bucket
├── orders
│   └── year=2022
│       └── month=2
│           ├── 1.parquet
│           └── 2.parquet
└── returns
    └── year=2021
        └── month=2
            └── 1.parquet

Path specs config:

path_specs:
    - include: gs://test-gs-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet

Example - Datasets with partition and exclude

Bucket structure:

test-gs-bucket
├── orders
│   └── year=2022
│       └── month=2
│           ├── 1.parquet
│           └── 2.parquet
└── tmp_orders
    └── year=2021
        └── month=2
            └── 1.parquet

Path specs config:

path_specs:
    - include: gs://test-gs-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
      exclude:
        - **/tmp_orders/**

Example - Datasets of mixed nature

Bucket structure:

test-gs-bucket
├── customers
│   ├── part1.json
│   ├── part2.json
│   ├── part3.json
│   └── part4.json
├── employees.csv
├── food_items.csv
├── tmp_10101000.csv
└──  orders
    └── year=2022
        └── month=2
            ├── 1.parquet
            ├── 2.parquet
            └── 3.parquet

Path specs config:

path_specs:
    - include: gs://test-gs-bucket/*.csv
      exclude:
        - **/tmp_10101000.csv
    - include: gs://test-gs-bucket/{table}/*.json
    - include: gs://test-gs-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet

Valid path_specs.include

gs://my-bucket/foo/tests/bar.avro # single file table
gs://my-bucket/foo/tests/*.* # mulitple file level tables
gs://my-bucket/foo/tests/{table}/*.avro #table without partition
gs://my-bucket/foo/tests/{table}/*/*.avro #table where partitions are not specified
gs://my-bucket/foo/tests/{table}/*.* # table where no partitions as well as data type specified
gs://my-bucket/{dept}/tests/{table}/*.avro # specifying keywords to be used in display name
gs://my-bucket/{dept}/tests/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.avro # specify partition key and value format
gs://my-bucket/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.avro # specify partition value only format
gs://my-bucket/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # for all extensions
gs://my-bucket/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # table is present at 2 levels down in bucket
gs://my-bucket/*/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # table is present at 3 levels down in bucket

Valid path_specs.exclude

**/tests/**
gs://my-bucket/hr/**
*/tests/.csv
gs://my-bucket/foo/*/my_table/**

Notes

{table} represents folder for which dataset will be created.
include path must end with (. or *.[ext]) to represent leaf level.
if *.[ext] is provided then only files with specified type will be scanned.
/*/ represents single folder.
{partition[i]} represents value of partition.
{partition_key[i]} represents name of the partition.
While extracting, “i” will be used to match partition_key to partition.
all folder levels need to be specified in include. Only exclude path can have ** like matching.
exclude path cannot have named variables ( {} ).
Folder names should not contain {, }, *, / in their names.
{folder} is reserved for internal working. please do not use in named variables.

If you would like to write a more complicated function for resolving file names, then a {transformer} would be a good fit.

:::caution

Specify as long fixed prefix ( with out /*/ ) as possible in path_specs.include. This will reduce the scanning time and cost, specifically on Google Cloud Storage.

:::

:::caution

If you are ingesting datasets from Google Cloud Storage, we recommend running the ingestion on a server in the same region to avoid high egress costs.

:::

4.4 KiB Raw Permalink Blame History

Path Specs

4.4 KiB

Raw Permalink Blame History