docs(csv-enricher): add example csv file & recipe (#8141)

This commit is contained in:
Gabe Lyons 2023-05-29 06:32:26 -07:00 committed by GitHub
parent 6adb496581
commit ada6ea5a45
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 22 additions and 12 deletions

View File

@ -0,0 +1,8 @@
source:
type: csv-enricher
config:
# relative path to your csv file to ingest
filename: ./path/to/your/file.csv
# Default sink is datahub-rest and doesn't need to be configured
# See https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for customization options

View File

@ -98,22 +98,24 @@ class CSVEnricherReport(SourceReport):
@support_status(SupportStatus.INCUBATING)
class CSVEnricherSource(Source):
"""
This plugin is used to apply glossary terms, tags, owners and domain at the entity level. It can also be used to apply tags
and glossary terms at the column level. These values are read from a CSV file and can be used to either overwrite
or append the above aspects to entities.
This plugin is used to bulk upload metadata to Datahub.
It will apply glossary terms, tags, decription, owners and domain at the entity level. It can also be used to apply tags,
glossary terms, and documentation at the column level. These values are read from a CSV file. You have the option to either overwrite
or append existing values.
The format of the CSV must be like so, with a few example rows.
The format of the CSV is demonstrated below. The header is required and URNs should be surrounded by quotes when they contains commas (most URNs contains commas).
|resource |subresource|glossary_terms |tags |owners |ownership_type |description |domain |
|----------------------------------------------------------------|-----------|------------------------------------|-------------------|---------------------------------------------------|---------------|---------------|---------------------------|
|urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)| |[urn:li:glossaryTerm:AccountBalance]|[urn:li:tag:Legacy]|[urn:li:corpuser:datahub|urn:li:corpuser:jdoe]|TECHNICAL_OWNER|new description|urn:li:domain:Engineering |
|urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)|field_foo |[urn:li:glossaryTerm:AccountBalance]| | | |field_foo! | |
|urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)|field_bar | |[urn:li:tag:Legacy]| | |field_bar? | |
```
resource,subresource,glossary_terms,tags,owners,ownership_type,description,domain
"urn:li:dataset:(urn:li:dataPlatform:snowflake,datahub.growth.users,PROD",,[urn:li:glossaryTerm:Users],[urn:li:tag:HighQuality],[urn:li:corpuser:lfoe;urn:li:corpuser:jdoe],TECHNICAL_OWNER,"description for users table",urn:li:domain:Engineering
"urn:li:dataset:(urn:li:dataPlatform:hive,datahub.growth.users,PROD",first_name,[urn:li:glossaryTerm:FirstName],,,,"first_name description"
"urn:li:dataset:(urn:li:dataPlatform:hive,datahub.growth.users,PROD",last_name,[urn:li:glossaryTerm:LastName],,,,"last_name description"
```
Note that the first row does not have a subresource populated. That means any glossary terms, tags, and owners will
be applied at the entity field. If a subresource IS populated (as it is for the second and third rows), glossary
terms and tags will be applied on the subresource. Every row MUST have a resource. Also note that owners can only
be applied at the resource level and will be ignored if populated for a row with a subresource.
be applied at the entity field. If a subresource is populated (as it is for the second and third rows), glossary
terms and tags will be applied on the column. Every row MUST have a resource. Also note that owners can only
be applied at the resource level.
:::note
This source will not work on very large csv files that do not fit in memory.