feat(docs): refactor source and sink ingestion docs (#3031)
This commit is contained in:
parent a7ea888612
commit 32b8fc6108
@ -159,6 +159,13 @@ function markdown_guess_title(
|
||||
} else {
|
||||
// Find first h1 header and use it as the title.
|
||||
const headers = contents.content.match(/^# (.+)$/gm);
|
||||
|
||||
if (!headers) {
|
||||
throw new Error(
|
||||
`${filepath} must have at least one h1 header for setting the title`
|
||||
);
|
||||
}
|
||||
|
||||
if (headers.length > 1 && contents.content.indexOf("```") < 0) {
|
||||
throw new Error(`too many h1 headers in ${filepath}`);
|
||||
}
|
||||
|
||||
@ -55,6 +55,14 @@ module.exports = {
|
||||
"docs/architecture/metadata-serving",
|
||||
//"docs/what/gms",
|
||||
],
|
||||
"Metadata Ingestion": [
|
||||
{
|
||||
Sources: list_ids_in_directory("metadata-ingestion/source_docs"),
|
||||
},
|
||||
{
|
||||
Sinks: list_ids_in_directory("metadata-ingestion/sink_docs"),
|
||||
},
|
||||
],
|
||||
"Metadata Modeling": [
|
||||
"docs/modeling/metadata-model",
|
||||
"docs/modeling/extending-the-metadata-model",
|
||||
|
||||
@ -40,7 +40,7 @@ Our open sourcing [blog post](https://engineering.linkedin.com/blog/2020/open-so
|
||||
- **Schema history**: view and diff historic versions of schemas
|
||||
- **GraphQL**: visualization of GraphQL schemas
|
||||
|
||||
### Jos/flows [*coming soon*]
|
||||
### Jobs/flows [*coming soon*]
|
||||
- **Search**: full-text & advanced search, search ranking
|
||||
- **Browse**: browsing through a configurable hierarchy
|
||||
- **Basic information**:
|
||||
|
||||
@ -28,38 +28,47 @@ If you run into an error, try checking the [_common setup issues_](./developing.
|
||||
|
||||
#### Installing Plugins
|
||||
|
||||
We use a plugin architecture so that you can install only the dependencies you actually need.
|
||||
We use a plugin architecture so that you can install only the dependencies you actually need. Click the plugin name to learn more about the specific source recipe and any FAQs!
|
||||
|
||||
| Plugin Name | Install Command | Provides |
|
||||
| --------------- | ---------------------------------------------------------- | ----------------------------------- |
|
||||
| file | _included by default_ | File source and sink |
|
||||
| console | _included by default_ | Console sink |
|
||||
| athena | `pip install 'acryl-datahub[athena]'` | AWS Athena source |
|
||||
| bigquery | `pip install 'acryl-datahub[bigquery]'` | BigQuery source |
|
||||
| bigquery-usage | `pip install 'acryl-datahub[bigquery-usage]'` | BigQuery usage statistics source |
|
||||
| feast | `pip install 'acryl-datahub[feast]'` | Feast source |
|
||||
| glue | `pip install 'acryl-datahub[glue]'` | AWS Glue source |
|
||||
| hive | `pip install 'acryl-datahub[hive]'` | Hive source |
|
||||
| mssql | `pip install 'acryl-datahub[mssql]'` | SQL Server source |
|
||||
| mysql | `pip install 'acryl-datahub[mysql]'` | MySQL source |
|
||||
| oracle | `pip install 'acryl-datahub[oracle]'` | Oracle source |
|
||||
| postgres | `pip install 'acryl-datahub[postgres]'` | Postgres source |
|
||||
| redshift | `pip install 'acryl-datahub[redshift]'` | Redshift source |
|
||||
| sagemaker | `pip install 'acryl-datahub[sagemaker]'` | AWS SageMaker source |
|
||||
| sqlalchemy | `pip install 'acryl-datahub[sqlalchemy]'` | Generic SQLAlchemy source |
|
||||
| snowflake | `pip install 'acryl-datahub[snowflake]'` | Snowflake source |
|
||||
| snowflake-usage | `pip install 'acryl-datahub[snowflake-usage]'` | Snowflake usage statistics source |
|
||||
| sql-profiles | `pip install 'acryl-datahub[sql-profiles]'` | Data profiles for SQL-based systems |
|
||||
| superset | `pip install 'acryl-datahub[superset]'` | Superset source |
|
||||
| mongodb | `pip install 'acryl-datahub[mongodb]'` | MongoDB source |
|
||||
| ldap | `pip install 'acryl-datahub[ldap]'` ([extra requirements]) | LDAP source |
|
||||
| looker | `pip install 'acryl-datahub[looker]'` | Looker source |
|
||||
| lookml | `pip install 'acryl-datahub[lookml]'` | LookML source, requires Python 3.7+ |
|
||||
| kafka | `pip install 'acryl-datahub[kafka]'` | Kafka source |
|
||||
| druid | `pip install 'acryl-datahub[druid]'` | Druid Source |
|
||||
| dbt | `pip install 'acryl-datahub[dbt]'` | dbt source |
|
||||
| datahub-rest | `pip install 'acryl-datahub[datahub-rest]'` | DataHub sink over REST API |
|
||||
| datahub-kafka | `pip install 'acryl-datahub[datahub-kafka]'` | DataHub sink over Kafka |
|
||||
Sources:
|
||||
|
||||
| Plugin Name | Install Command | Provides |
|
||||
| ----------------------------------------------- | ---------------------------------------------------------- | ----------------------------------- |
|
||||
| [file](./source_docs/file.md) | _included by default_ | File source and sink |
|
||||
| [athena](./source_docs/athena.md) | `pip install 'acryl-datahub[athena]'` | AWS Athena source |
|
||||
| [bigquery](./source_docs/bigquery.md) | `pip install 'acryl-datahub[bigquery]'` | BigQuery source |
|
||||
| [bigquery-usage](./source_docs/bigquery.md) | `pip install 'acryl-datahub[bigquery-usage]'` | BigQuery usage statistics source |
|
||||
| [dbt](./source_docs/dbt.md) | _no additional dependencies_ | dbt source |
|
||||
| [druid](./source_docs/druid.md) | `pip install 'acryl-datahub[druid]'` | Druid source |
|
||||
| [feast](./source_docs/feast.md) | `pip install 'acryl-datahub[feast]'` | Feast source |
|
||||
| [glue](./source_docs/glue.md) | `pip install 'acryl-datahub[glue]'` | AWS Glue source |
|
||||
| [hive](./source_docs/hive.md) | `pip install 'acryl-datahub[hive]'` | Hive source |
|
||||
| [kafka](./source_docs/kafka.md) | `pip install 'acryl-datahub[kafka]'` | Kafka source |
|
||||
| [kafka-connect](./source_docs/kafka-connect.md) | `pip install 'acryl-datahub[kafka-connect]'` | Kafka connect source |
|
||||
| [ldap](./source_docs/ldap.md) | `pip install 'acryl-datahub[ldap]'` ([extra requirements]) | LDAP source |
|
||||
| [looker](./source_docs/looker.md) | `pip install 'acryl-datahub[looker]'` | Looker source |
|
||||
| [lookml](./source_docs/lookml.md) | `pip install 'acryl-datahub[lookml]'` | LookML source, requires Python 3.7+ |
|
||||
| [mongodb](./source_docs/mongodb.md) | `pip install 'acryl-datahub[mongodb]'` | MongoDB source |
|
||||
| [mssql](./source_docs/mssql.md) | `pip install 'acryl-datahub[mssql]'` | SQL Server source |
|
||||
| [mysql](./source_docs/mysql.md) | `pip install 'acryl-datahub[mysql]'` | MySQL source |
|
||||
| [oracle](./source_docs/oracle.md) | `pip install 'acryl-datahub[oracle]'` | Oracle source |
|
||||
| [postgres](./source_docs/postgres.md) | `pip install 'acryl-datahub[postgres]'` | Postgres source |
|
||||
| [redshift](./source_docs/redshift.md) | `pip install 'acryl-datahub[redshift]'` | Redshift source |
|
||||
| [sagemaker](./source_docs/sagemaker.md) | `pip install 'acryl-datahub[sagemaker]'` | AWS SageMaker source |
|
||||
| [snowflake](./source_docs/snowflake.md) | `pip install 'acryl-datahub[snowflake]'` | Snowflake source |
|
||||
| [snowflake-usage](./source_docs/snowflake.md) | `pip install 'acryl-datahub[snowflake-usage]'` | Snowflake usage statistics source |
|
||||
| sql-profiles | `pip install 'acryl-datahub[sql-profiles]'` | Data profiles for SQL-based systems |
|
||||
| [sqlalchemy](./source_docs/sqlalchemy.md) | `pip install 'acryl-datahub[sqlalchemy]'` | Generic SQLAlchemy source |
|
||||
| [superset](./source_docs/superset.md) | `pip install 'acryl-datahub[superset]'` | Superset source |
|
||||
|
||||
Sinks:
|
||||
|
||||
| Plugin Name | Install Command | Provides |
|
||||
| --------------------------------------- | -------------------------------------------- | -------------------------- |
|
||||
| [file](./sink_docs/file.md) | _included by default_ | File source and sink |
|
||||
| [console](./sink_docs/console.md) | _included by default_ | Console sink |
|
||||
| [datahub-rest](./sink_docs/datahub.md) | `pip install 'acryl-datahub[datahub-rest]'` | DataHub sink over REST API |
|
||||
| [datahub-kafka](./sink_docs/datahub.md) | `pip install 'acryl-datahub[datahub-kafka]'` | DataHub sink over Kafka |
|
||||
|
||||
These plugins can be mixed and matched as desired. For example:
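One illustrative pairing (a sketch, not the only option) pulls metadata from BigQuery and pushes it to DataHub over REST by installing both extras in a single command:

```shell
pip install 'acryl-datahub[bigquery,datahub-rest]'
```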
|
||||
|
||||
@ -137,875 +146,7 @@ Running a recipe is quite easy.
|
||||
datahub ingest -c ./examples/recipes/mssql_to_datahub.yml
|
||||
```
|
||||
|
||||
A number of recipes are included in the examples/recipes directory.
|
||||
|
||||
## Sources
|
||||
|
||||
### Kafka Metadata `kafka`
|
||||
|
||||
Extracts:
|
||||
|
||||
- List of topics - from the Kafka broker
|
||||
- Schemas associated with each topic - from the schema registry
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: "kafka"
|
||||
config:
|
||||
connection:
|
||||
bootstrap: "broker:9092"
|
||||
consumer_config: {} # passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#confluent_kafka.DeserializingConsumer
|
||||
schema_registry_url: http://localhost:8081
|
||||
schema_registry_config: {} # passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#confluent_kafka.schema_registry.SchemaRegistryClient
|
||||
```
|
||||
|
||||
The options in the consumer config and schema registry config are passed to the Kafka DeserializingConsumer and SchemaRegistryClient respectively.
|
||||
|
||||
For a full example with a number of security options, see this [example recipe](./examples/recipes/secured_kafka.yml).
|
||||
|
||||
### MySQL Metadata `mysql`
|
||||
|
||||
Extracts:
|
||||
|
||||
- List of databases and tables
|
||||
- Column types and schema associated with each table
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: mysql
|
||||
config:
|
||||
username: root
|
||||
password: example
|
||||
database: dbname
|
||||
host_port: localhost:3306
|
||||
table_pattern:
|
||||
deny:
|
||||
# Note that the deny patterns take precedence over the allow patterns.
|
||||
- "performance_schema"
|
||||
allow:
|
||||
- "schema1.table2"
|
||||
# Although the 'table_pattern' enables you to skip everything from certain schemas,
|
||||
# having another option to allow/deny on schema level is an optimization for the case when there is a large number
|
||||
# of schemas that one wants to skip and you want to avoid the time to needlessly fetch those tables only to filter
|
||||
# them out afterwards via the table_pattern.
|
||||
schema_pattern:
|
||||
deny:
|
||||
- "garbage_schema"
|
||||
allow:
|
||||
- "schema1"
|
||||
```
|
||||
|
||||
### Microsoft SQL Server Metadata `mssql`
|
||||
|
||||
We have two options for the underlying library used to connect to SQL Server: (1) [python-tds](https://github.com/denisenkom/pytds) and (2) [pyodbc](https://github.com/mkleehammer/pyodbc). The TDS library is pure Python and hence easier to install, but only PyODBC supports encrypted connections.
|
||||
|
||||
Extracts:
|
||||
|
||||
- List of databases, schema, tables and views
|
||||
- Column types associated with each table/view
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: mssql
|
||||
config:
|
||||
username: user
|
||||
password: pass
|
||||
host_port: localhost:1433
|
||||
database: DemoDatabase
|
||||
include_views: True # whether to include views, defaults to True
|
||||
table_pattern:
|
||||
deny:
|
||||
- "^.*\\.sys_.*" # deny all tables that start with sys_
|
||||
allow:
|
||||
- "schema1.table1"
|
||||
- "schema1.table2"
|
||||
options:
|
||||
# Any options specified here will be passed to SQLAlchemy's create_engine as kwargs.
|
||||
# See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details.
|
||||
# Many of these options are specific to the underlying database driver, so that library's
|
||||
# documentation will be a good reference for what is supported. To find which dialect is likely
|
||||
# in use, consult this table: https://docs.sqlalchemy.org/en/14/dialects/index.html.
|
||||
charset: "utf8"
|
||||
# If set to true, we'll use the pyodbc library. This requires you to have
|
||||
# already installed the Microsoft ODBC Driver for SQL Server.
|
||||
# See https://docs.microsoft.com/en-us/sql/connect/python/pyodbc/step-1-configure-development-environment-for-pyodbc-python-development?view=sql-server-ver15
|
||||
use_odbc: False
|
||||
uri_args: {}
|
||||
```
|
||||
|
||||
<details>
|
||||
<summary>Example: using ingestion with ODBC and encryption</summary>
|
||||
|
||||
This requires you to have already installed the Microsoft ODBC Driver for SQL Server.
|
||||
See https://docs.microsoft.com/en-us/sql/connect/python/pyodbc/step-1-configure-development-environment-for-pyodbc-python-development?view=sql-server-ver15
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: mssql
|
||||
config:
|
||||
# See https://docs.sqlalchemy.org/en/14/dialects/mssql.html#module-sqlalchemy.dialects.mssql.pyodbc
|
||||
use_odbc: True
|
||||
username: user
|
||||
password: pass
|
||||
host_port: localhost:1433
|
||||
database: DemoDatabase
|
||||
include_views: True # whether to include views, defaults to True
|
||||
uri_args:
|
||||
# See https://docs.microsoft.com/en-us/sql/connect/odbc/dsn-connection-string-attribute?view=sql-server-ver15
|
||||
driver: "ODBC Driver 17 for SQL Server"
|
||||
Encrypt: "yes"
|
||||
TrustServerCertificate: "Yes"
|
||||
ssl: "True"
|
||||
# Trusted_Connection: "yes"
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### Hive `hive`
|
||||
|
||||
Extracts:
|
||||
|
||||
- List of databases, schema, and tables
|
||||
- Column types associated with each table
|
||||
- Detailed table and storage information
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: hive
|
||||
config:
|
||||
# For more details on authentication, see the PyHive docs:
|
||||
# https://github.com/dropbox/PyHive#passing-session-configuration.
|
||||
# LDAP, Kerberos, etc. are supported using connect_args, which can be
|
||||
# added under the `options` config parameter.
|
||||
#scheme: 'hive+http' # set this if Thrift should use the HTTP transport
|
||||
#scheme: 'hive+https' # set this if Thrift should use the HTTP with SSL transport
|
||||
username: user # optional
|
||||
password: pass # optional
|
||||
host_port: localhost:10000
|
||||
database: DemoDatabase # optional, if not specified, ingests from all databases
|
||||
# table_pattern/schema_pattern is same as above
|
||||
# options is same as above
|
||||
```
|
||||
|
||||
<details>
|
||||
<summary>Example: using ingestion with Azure HDInsight</summary>
|
||||
|
||||
```yml
|
||||
# Connecting to Microsoft Azure HDInsight using TLS.
|
||||
source:
|
||||
type: hive
|
||||
config:
|
||||
scheme: "hive+https"
|
||||
host_port: <cluster_name>.azurehdinsight.net:443
|
||||
username: admin
|
||||
password: "<password>"
|
||||
options:
|
||||
connect_args:
|
||||
http_path: "/hive2"
|
||||
auth: BASIC
|
||||
# table_pattern/schema_pattern is same as above
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### PostgreSQL `postgres`
|
||||
|
||||
Extracts:
|
||||
|
||||
- List of databases, schema, and tables
|
||||
- Column types associated with each table
|
||||
- Also supports PostGIS extensions
|
||||
- database_alias (optional) can be used to change the name of database to be ingested
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: postgres
|
||||
config:
|
||||
username: user
|
||||
password: pass
|
||||
host_port: localhost:5432
|
||||
database: DemoDatabase
|
||||
database_alias: DatabaseNameToBeIngested
|
||||
include_views: True # whether to include views, defaults to True
|
||||
# table_pattern/schema_pattern is same as above
|
||||
# options is same as above
|
||||
```
|
||||
|
||||
### Redshift `redshift`
|
||||
|
||||
Extracts:
|
||||
|
||||
- List of databases, schema, and tables
|
||||
- Column types associated with each table
|
||||
- Also supports PostGIS extensions
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: redshift
|
||||
config:
|
||||
username: user
|
||||
password: pass
|
||||
host_port: example.something.us-west-2.redshift.amazonaws.com:5439
|
||||
database: DemoDatabase
|
||||
include_views: True # whether to include views, defaults to True
|
||||
# table_pattern/schema_pattern is same as above
|
||||
# options is same as above
|
||||
```
|
||||
|
||||
<details>
|
||||
<summary>Extra options when running Redshift behind a proxy</summary>
|
||||
|
||||
When connecting to Redshift through a proxy, you may need to adjust the SSL options that are passed to the underlying driver, as shown below.
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: redshift
|
||||
config:
|
||||
# username, password, database, etc are all the same as above
|
||||
host_port: my-proxy-hostname:5439
|
||||
options:
|
||||
connect_args:
|
||||
sslmode: "prefer" # or "require" or "verify-ca"
|
||||
sslrootcert: ~ # needed to unpin the AWS Redshift certificate
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
### AWS SageMaker `sagemaker`
|
||||
|
||||
Extracts:
|
||||
|
||||
- Feature groups
|
||||
- Models, jobs, and lineage between the two (e.g. when jobs output a model or a model is used by a job)
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: sagemaker
|
||||
config:
|
||||
aws_region: # aws_region_name, i.e. "eu-west-1"
|
||||
env: # environment for the DatasetSnapshot URN, one of "DEV", "EI", "PROD" or "CORP". Defaults to "PROD".
|
||||
|
||||
# Credentials. If not specified here, these are picked up according to boto3 rules.
|
||||
# (see https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html)
|
||||
aws_access_key_id: # Optional.
|
||||
aws_secret_access_key: # Optional.
|
||||
aws_session_token: # Optional.
|
||||
aws_role: # Optional (Role chaining supported by using a sorted list).
|
||||
|
||||
extract_feature_groups: True # if feature groups should be ingested, default True
|
||||
extract_models: True # if models should be ingested, default True
|
||||
extract_jobs: # if jobs should be ingested, default True for all
|
||||
auto_ml: True
|
||||
compilation: True
|
||||
edge_packaging: True
|
||||
hyper_parameter_tuning: True
|
||||
labeling: True
|
||||
processing: True
|
||||
training: True
|
||||
transform: True
|
||||
```
|
||||
|
||||
### Snowflake `snowflake`
|
||||
|
||||
Extracts:
|
||||
|
||||
- List of databases, schema, and tables
|
||||
- Column types associated with each table
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: snowflake
|
||||
config:
|
||||
username: user
|
||||
password: pass
|
||||
host_port: account_name
|
||||
database_pattern:
|
||||
# The escaping of the \$ symbol helps us skip the environment variable substitution.
|
||||
allow:
|
||||
- ^MY_DEMO_DATA.*
|
||||
- ^ANOTHER_DB_REGEX
|
||||
deny:
|
||||
- ^SNOWFLAKE\$
|
||||
- ^SNOWFLAKE_SAMPLE_DATA\$
|
||||
warehouse: "COMPUTE_WH" # optional
|
||||
role: "sysadmin" # optional
|
||||
include_views: True # whether to include views, defaults to True
|
||||
# table_pattern/schema_pattern is same as above
|
||||
# options is same as above
|
||||
```
|
||||
|
||||
:::tip
|
||||
|
||||
You can also get fine-grained usage statistics for Snowflake using the `snowflake-usage` source.
|
||||
|
||||
:::
|
||||
|
||||
### SQL Profiles `sql-profiles`
|
||||
|
||||
The SQL-based profiler does not run alone, but rather can be enabled for other SQL-based sources.
|
||||
Enabling profiling will slow down ingestion runs.
|
||||
|
||||
Extracts:
|
||||
|
||||
- row and column counts for each table
|
||||
- for each column, if applicable:
|
||||
- null counts and proportions
|
||||
- distinct counts and proportions
|
||||
- minimum, maximum, mean, median, standard deviation, some quantile values
|
||||
- histograms or frequencies of unique values
|
||||
|
||||
Supported SQL sources:
|
||||
|
||||
- AWS Athena
|
||||
- BigQuery
|
||||
- Druid
|
||||
- Hive
|
||||
- Microsoft SQL Server
|
||||
- MySQL
|
||||
- Oracle
|
||||
- Postgres
|
||||
- Redshift
|
||||
- Snowflake
|
||||
- Generic SQLAlchemy source
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: <sql-source> # can be bigquery, snowflake, etc - see above for the list
|
||||
config:
|
||||
# username, password, etc - varies by source type
|
||||
profiling:
|
||||
enabled: true
|
||||
limit: 1000 # optional - max rows to profile
|
||||
offset: 0 # optional - offset of first row to profile
|
||||
profile_pattern:
|
||||
deny:
|
||||
# Skip all tables ending with "_staging"
|
||||
- _staging\$
|
||||
allow:
|
||||
          # Profile all tables in "myschema" that start with "gold_"
|
||||
- myschema\.gold_.*
|
||||
|
||||
# If you only want profiles (but no catalog information), set these to false
|
||||
include_tables: true
|
||||
include_views: true
|
||||
```
|
||||
|
||||
:::caution
|
||||
|
||||
Running profiling against many tables or over many rows can run up significant costs.
While we've done our best to limit the cost of the queries the profiler runs, you
should be prudent about which tables profiling is enabled on and how frequently
the profiling runs are scheduled.
|
||||
|
||||
:::
|
||||
|
||||
### Superset `superset`
|
||||
|
||||
Extracts:
|
||||
|
||||
- List of charts and dashboards
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: superset
|
||||
config:
|
||||
username: user
|
||||
password: pass
|
||||
provider: db | ldap
|
||||
connect_uri: http://localhost:8088
|
||||
env: "PROD" # Optional, default is "PROD"
|
||||
```
|
||||
|
||||
See the documentation for Superset's `/security/login` endpoint at https://superset.apache.org/docs/rest-api for more details on Superset's login API.
|
||||
|
||||
### Oracle `oracle`
|
||||
|
||||
Extracts:
|
||||
|
||||
- List of databases, schema, and tables
|
||||
- Column types associated with each table
|
||||
|
||||
Using the Oracle source requires that you've also installed the correct drivers; see the [cx_Oracle docs](https://cx-oracle.readthedocs.io/en/latest/user_guide/installation.html). The easiest one is the [Oracle Instant Client](https://www.oracle.com/database/technologies/instant-client.html).
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: oracle
|
||||
config:
|
||||
# For more details on authentication, see the documentation:
|
||||
# https://docs.sqlalchemy.org/en/14/dialects/oracle.html#dialect-oracle-cx_oracle-connect and
|
||||
# https://cx-oracle.readthedocs.io/en/latest/user_guide/connection_handling.html#connection-strings.
|
||||
username: user
|
||||
password: pass
|
||||
host_port: localhost:5432
|
||||
database: dbname
|
||||
service_name: svc # omit database if using this option
|
||||
include_views: True # whether to include views, defaults to True
|
||||
# table_pattern/schema_pattern is same as above
|
||||
# options is same as above
|
||||
```
|
||||
|
||||
### Feast `feast`
|
||||
|
||||
**Note: Feast ingestion requires Docker to be installed.**
|
||||
|
||||
Extracts:
|
||||
|
||||
- List of feature tables (modeled as [`MLFeatureTable`](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/ml/metadata/MLFeatureTableProperties.pdl)s),
|
||||
features ([`MLFeature`](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/ml/metadata/MLFeatureProperties.pdl)s),
|
||||
and entities ([`MLPrimaryKey`](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/ml/metadata/MLPrimaryKeyProperties.pdl)s)
|
||||
- Column types associated with each feature and entity
|
||||
|
||||
Note: this source uses a separate Docker container to extract Feast's metadata into a JSON file, which is then
parsed into DataHub's native objects. This was done because of a dependency conflict in the `feast` module.
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: feast
|
||||
config:
|
||||
core_url: localhost:6565 # default
|
||||
env: "PROD" # Optional, default is "PROD"
|
||||
use_local_build: False # Whether to build Feast ingestion image locally, default is False
|
||||
```
|
||||
|
||||
### Google BigQuery `bigquery`
|
||||
|
||||
Extracts:
|
||||
|
||||
- List of databases, schema, and tables
|
||||
- Column types associated with each table
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: bigquery
|
||||
config:
|
||||
project_id: project # optional - can autodetect from environment
|
||||
options: # options is same as above
|
||||
# See https://github.com/mxmzdlv/pybigquery#authentication for details.
|
||||
credentials_path: "/path/to/keyfile.json" # optional
|
||||
include_views: True # whether to include views, defaults to True
|
||||
# table_pattern/schema_pattern is same as above
|
||||
```
|
||||
|
||||
:::tip
|
||||
|
||||
You can also get fine-grained usage statistics for BigQuery using the `bigquery-usage` source.
|
||||
|
||||
:::
|
||||
|
||||
### AWS Athena `athena`
|
||||
|
||||
Extracts:
|
||||
|
||||
- List of databases and tables
|
||||
- Column types associated with each table
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: athena
|
||||
config:
|
||||
username: aws_access_key_id # Optional. If not specified, credentials are picked up according to boto3 rules.
|
||||
# See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
|
||||
password: aws_secret_access_key # Optional.
|
||||
database: database # Optional, defaults to "default"
|
||||
aws_region: aws_region_name # i.e. "eu-west-1"
|
||||
s3_staging_dir: s3_location # "s3://<bucket-name>/prefix/"
|
||||
# The s3_staging_dir parameter is needed because Athena always writes query results to S3.
|
||||
# See https://docs.aws.amazon.com/athena/latest/ug/querying.html
|
||||
    # However, the Athena driver will transparently fetch these results as you would expect from any other SQL client.
|
||||
work_group: athena_workgroup # "primary"
|
||||
# table_pattern/schema_pattern is same as above
|
||||
```
|
||||
|
||||
### AWS Glue `glue`
|
||||
|
||||
Note: if you also have files in S3 that you'd like to ingest, we recommend you use Glue's built-in data catalog. See [here](./s3-ingestion.md) for a quick guide on how to set up a crawler on Glue and ingest the outputs with DataHub.
|
||||
|
||||
Extracts:
|
||||
|
||||
- List of tables
|
||||
- Column types associated with each table
|
||||
- Table metadata, such as owner, description and parameters
|
||||
- Jobs and their component transformations, data sources, and data sinks
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: glue
|
||||
config:
|
||||
aws_region: # aws_region_name, i.e. "eu-west-1"
|
||||
extract_transforms: True # whether to ingest Glue jobs, defaults to True
|
||||
env: # environment for the DatasetSnapshot URN, one of "DEV", "EI", "PROD" or "CORP". Defaults to "PROD".
|
||||
|
||||
# Filtering patterns for databases and tables to scan
|
||||
database_pattern: # Optional, to filter databases scanned, same as schema_pattern above.
|
||||
table_pattern: # Optional, to filter tables scanned, same as table_pattern above.
|
||||
|
||||
# Credentials. If not specified here, these are picked up according to boto3 rules.
|
||||
# (see https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html)
|
||||
aws_access_key_id: # Optional.
|
||||
aws_secret_access_key: # Optional.
|
||||
aws_session_token: # Optional.
|
||||
aws_role: # Optional (Role chaining supported by using a sorted list).
|
||||
underlying_platform: #Optional (Can change platform name to be athena)
|
||||
```
|
||||
|
||||
### Druid `druid`
|
||||
|
||||
Extracts:
|
||||
|
||||
- List of databases, schema, and tables
|
||||
- Column types associated with each table
|
||||
|
||||
**Note:** It is important to explicitly define a deny schema pattern for the internal Druid databases (`lookup` & `sys`)
if you add a schema pattern, since otherwise the crawler may crash before processing the relevant databases.
This deny pattern is defined by default but is overridden by user-submitted configurations.
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: druid
|
||||
config:
|
||||
# Point to broker address
|
||||
host_port: localhost:8082
|
||||
schema_pattern:
|
||||
deny:
|
||||
- "^(lookup|sys).*"
|
||||
# options is same as above
|
||||
```
|
||||
|
||||
### Other databases using SQLAlchemy `sqlalchemy`
|
||||
|
||||
The `sqlalchemy` source is useful if we don't have a pre-built source for your chosen
|
||||
database system, but there is an [SQLAlchemy dialect](https://docs.sqlalchemy.org/en/14/dialects/)
|
||||
defined elsewhere. In order to use this, you must `pip install` the required dialect packages yourself.
|
||||
|
||||
Extracts:
|
||||
|
||||
- List of schemas and tables
|
||||
- Column types associated with each table
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: sqlalchemy
|
||||
config:
|
||||
# See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls
|
||||
connect_uri: "dialect+driver://username:password@host:port/database"
|
||||
options: {} # same as above
|
||||
schema_pattern: {} # same as above
|
||||
table_pattern: {} # same as above
|
||||
include_views: True # whether to include views, defaults to True
|
||||
```
|
||||
|
||||
### MongoDB `mongodb`
|
||||
|
||||
Extracts:
|
||||
|
||||
- List of databases
|
||||
- List of collections in each database and infers schemas for each collection
|
||||
|
||||
By default, schema inference samples 1,000 documents from each collection. Setting `schemaSamplingSize: null` will scan the entire collection.
|
||||
Moreover, setting `useRandomSampling: False` will sample the first documents found without random selection, which may be faster for large collections.
|
||||
|
||||
Note that `schemaSamplingSize` has no effect if `enableSchemaInference: False` is set.
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: "mongodb"
|
||||
config:
|
||||
# For advanced configurations, see the MongoDB docs.
|
||||
# https://pymongo.readthedocs.io/en/stable/examples/authentication.html
|
||||
connect_uri: "mongodb://localhost"
|
||||
username: admin
|
||||
password: password
|
||||
env: "PROD" # Optional, default is "PROD"
|
||||
authMechanism: "DEFAULT"
|
||||
options: {}
|
||||
database_pattern: {}
|
||||
collection_pattern: {}
|
||||
enableSchemaInference: True
|
||||
schemaSamplingSize: 1000
|
||||
useRandomSampling: True # whether to randomly sample docs for schema or just use the first ones, True by default
|
||||
# database_pattern/collection_pattern are similar to schema_pattern/table_pattern from above
|
||||
```
|
||||
|
||||
### LDAP `ldap`
|
||||
|
||||
Extracts:
|
||||
|
||||
- List of people
|
||||
- Names, emails, titles, and manager information for each person
|
||||
- List of groups
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: "ldap"
|
||||
config:
|
||||
ldap_server: ldap://localhost
|
||||
ldap_user: "cn=admin,dc=example,dc=org"
|
||||
ldap_password: "admin"
|
||||
base_dn: "dc=example,dc=org"
|
||||
filter: "(objectClass=*)" # optional field
|
||||
drop_missing_first_last_name: False # optional
|
||||
```
|
||||
|
||||
The `drop_missing_first_last_name` option should be set to true if you have many "headless" LDAP accounts
for devices or services that should be excluded because they do not contain a first and last name. This will only
affect the ingestion of LDAP users; LDAP groups are unaffected by this config option.
|
||||
|
||||
### LookML `lookml`
|
||||
|
||||
Note! This plugin uses a package that requires Python 3.7+!
|
||||
|
||||
Extracts:
|
||||
|
||||
- LookML views from model files
|
||||
- Name, upstream table names, dimensions, measures, and dimension groups
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: "lookml"
|
||||
config:
|
||||
base_folder: /path/to/model/files # where the *.model.lkml and *.view.lkml files are stored
|
||||
connection_to_platform_map: # mappings between connection names in the model files to platform names
|
||||
connection_name: platform_name (or platform_name.database_name) # for ex. my_snowflake_conn: snowflake.my_database
|
||||
model_pattern: {}
|
||||
view_pattern: {}
|
||||
env: "PROD" # optional, default is "PROD"
|
||||
parse_table_names_from_sql: False # see note below
|
||||
platform_name: "looker" # optional, default is "looker"
|
||||
```
|
||||
|
||||
Note! The integration can use [`sql-metadata`](https://pypi.org/project/sql-metadata/) to try to parse the tables that the
views depend on. As these SQL queries can be complicated, and the package doesn't officially support all the SQL dialects that
Looker supports, the result might not be correct. This parsing is disabled by default, but can be enabled by setting
`parse_table_names_from_sql: True`.
|
||||
|
||||
### Looker dashboards `looker`
|
||||
|
||||
Extracts:
|
||||
|
||||
- Looker dashboards and dashboard elements (charts)
|
||||
- Names, descriptions, URLs, chart types, input view for the charts
|
||||
|
||||
See the [Looker authentication docs](https://docs.looker.com/reference/api-and-integration/api-auth#authentication_with_an_sdk) for the steps to create a client ID and secret.
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: "looker"
|
||||
config:
|
||||
client_id: # Your Looker API3 client ID
|
||||
client_secret: # Your Looker API3 client secret
|
||||
base_url: # The url to your Looker instance: https://company.looker.com:19999 or https://looker.company.com, or similar.
|
||||
dashboard_pattern: # supports allow/deny regexes
|
||||
chart_pattern: # supports allow/deny regexes
|
||||
actor: urn:li:corpuser:etl # Optional, defaults to urn:li:corpuser:etl
|
||||
env: "PROD" # Optional, default is "PROD"
|
||||
platform_name: "looker" # Optional, default is "looker"
|
||||
```
|
||||
|
||||
### File `file`
|
||||
|
||||
Pulls metadata from a previously generated file. Note that the file sink
|
||||
can produce such files, and a number of samples are included in the
|
||||
[examples/mce_files](examples/mce_files) directory.
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: file
|
||||
config:
|
||||
filename: ./path/to/mce/file.json
|
||||
```
|
||||
|
||||
### dbt `dbt`
|
||||
|
||||
Pulls metadata from dbt artifact files:
|
||||
|
||||
- [dbt manifest file](https://docs.getdbt.com/reference/artifacts/manifest-json)
|
||||
- This file contains model, source and lineage data.
|
||||
- [dbt catalog file](https://docs.getdbt.com/reference/artifacts/catalog-json)
|
||||
- This file contains schema data.
|
||||
  - dbt does not record schema data for ephemeral models; as such, DataHub will show ephemeral models in the lineage, but there will be no associated schema for them
|
||||
- [dbt sources file](https://docs.getdbt.com/reference/artifacts/sources-json)
|
||||
- This file contains metadata for sources with freshness checks.
|
||||
- We transfer dbt's freshness checks to DataHub's last-modified fields.
|
||||
- Note that this file is optional – if not specified, we'll use time of ingestion instead as a proxy for time last-modified.
|
||||
- target_platform:
|
||||
- The data platform you are enriching with dbt metadata.
|
||||
- [data platforms](https://github.com/linkedin/datahub/blob/master/gms/impl/src/main/resources/DataPlatformInfo.json)
|
||||
- load_schemas:
|
||||
- Load schemas from dbt catalog file, not necessary when the underlying data platform already has this data.
|
||||
- node_type_pattern:
|
||||
  - Use this filter to include or exclude node types using allow and deny patterns
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: "dbt"
|
||||
config:
|
||||
manifest_path: "./path/dbt/manifest_file.json"
|
||||
catalog_path: "./path/dbt/catalog_file.json"
|
||||
sources_path: "./path/dbt/sources_file.json" # (optional, used for freshness checks)
|
||||
target_platform: "postgres" # optional, eg "postgres", "snowflake", etc.
|
||||
load_schemas: True or False
|
||||
node_type_pattern: # optional
|
||||
deny:
|
||||
- ^test.*
|
||||
allow:
|
||||
- ^.*
|
||||
```
|
||||
|
||||
Note: when `load_schemas` is False, models that use [identifiers](https://docs.getdbt.com/reference/resource-properties/identifier) to reference their source tables are ingested using the model identifier as the model name to preserve the lineage.
|
||||
|
||||
### Google BigQuery Usage Stats `bigquery-usage`
|
||||
|
||||
- Fetch a list of queries issued
|
||||
- Fetch a list of tables and columns accessed
|
||||
- Aggregate these statistics into buckets, by day or hour granularity
|
||||
|
||||
Note: the client must have one of the following OAuth scopes, and should be authorized on all projects you'd like to ingest usage stats from.
|
||||
|
||||
- https://www.googleapis.com/auth/logging.read
|
||||
- https://www.googleapis.com/auth/logging.admin
|
||||
- https://www.googleapis.com/auth/cloud-platform.read-only
|
||||
- https://www.googleapis.com/auth/cloud-platform
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: bigquery-usage
|
||||
config:
|
||||
projects: # optional - can autodetect a single project from the environment
|
||||
- project_id_1
|
||||
- project_id_2
|
||||
options:
|
||||
# See https://googleapis.dev/python/logging/latest/client.html for details.
|
||||
credentials: ~ # optional - see docs
|
||||
env: PROD
|
||||
|
||||
bucket_duration: "DAY"
|
||||
start_time: ~ # defaults to the last full day in UTC (or hour)
|
||||
end_time: ~ # defaults to the last full day in UTC (or hour)
|
||||
|
||||
top_n_queries: 10 # number of queries to save for each table
|
||||
```
|
||||
|
||||
:::note
|
||||
|
||||
This source only does usage statistics. To get the tables, views, and schemas in your BigQuery project, use the `bigquery` source.
|
||||
|
||||
:::
|
||||
|
||||
### Snowflake Usage Stats `snowflake-usage`
|
||||
|
||||
- Fetch a list of queries issued
|
||||
- Fetch a list of tables and columns accessed (excludes views)
|
||||
- Aggregate these statistics into buckets, by day or hour granularity
|
||||
|
||||
Note: the user/role must have access to the account usage table. The "accountadmin" role has this by default, and other roles can be [granted this permission](https://docs.snowflake.com/en/sql-reference/account-usage.html#enabling-account-usage-for-other-roles).
|
||||
|
||||
Note: the underlying access history views that we use are only available in Snowflake's enterprise edition or higher.
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: snowflake-usage
|
||||
config:
|
||||
username: user
|
||||
password: pass
|
||||
host_port: account_name
|
||||
role: ACCOUNTADMIN
|
||||
env: PROD
|
||||
|
||||
bucket_duration: "DAY"
|
||||
start_time: ~ # defaults to the last full day in UTC (or hour)
|
||||
end_time: ~ # defaults to the last full day in UTC (or hour)
|
||||
|
||||
top_n_queries: 10 # number of queries to save for each table
|
||||
```
|
||||
|
||||
:::note
|
||||
|
||||
This source only does usage statistics. To get the tables, views, and schemas in your Snowflake warehouse, ingest using the `snowflake` source.
|
||||
|
||||
:::
|
||||
|
||||
### Kafka Connect `kafka-connect`
|
||||
|
||||
Extracts:
|
||||
|
||||
- Each Kafka Connect connector as an individual `DataFlowSnapshotClass` entity
- Individual `DataJobSnapshotClass` entities, named using the `{connector_name}:{source_dataset}` convention
- Lineage information between the source database and the Kafka topic
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: "kafka-connect"
|
||||
config:
|
||||
connect_uri: "http://localhost:8083"
|
||||
cluster_name: "connect-cluster"
|
||||
connector_patterns:
|
||||
deny:
|
||||
- ^denied-connector.*
|
||||
allow:
|
||||
- ^allowed-connector.*
|
||||
```
|
||||
|
||||
Current limitations:
|
||||
|
||||
- Currently works only for Debezium source connectors.
|
||||
|
||||
## Sinks
|
||||
|
||||
### DataHub Rest `datahub-rest`
|
||||
|
||||
Pushes metadata to DataHub using the GMA REST API. The advantage of the REST-based interface
is that any errors can immediately be reported.
|
||||
|
||||
```yml
|
||||
sink:
|
||||
type: "datahub-rest"
|
||||
config:
|
||||
server: "http://localhost:8080"
|
||||
```
|
||||
|
||||
### DataHub Kafka `datahub-kafka`
|
||||
|
||||
Pushes metadata to DataHub by publishing messages to Kafka. The advantage of the Kafka-based
|
||||
interface is that it's asynchronous and can handle higher throughput. This requires the
|
||||
DataHub mce-consumer container to be running.
|
||||
|
||||
```yml
|
||||
sink:
|
||||
type: "datahub-kafka"
|
||||
config:
|
||||
connection:
|
||||
bootstrap: "localhost:9092"
|
||||
producer_config: {} # passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#confluent_kafka.SerializingProducer
|
||||
schema_registry_url: "http://localhost:8081"
|
||||
schema_registry_config: {} # passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#confluent_kafka.schema_registry.SchemaRegistryClient
|
||||
```
|
||||
|
||||
The options in the producer config and schema registry config are passed to the Kafka SerializingProducer and SchemaRegistryClient respectively.
|
||||
|
||||
For a full example with a number of security options, see this [example recipe](./examples/recipes/secured_kafka.yml).
|
||||
|
||||
### Console `console`
|
||||
|
||||
Simply prints each metadata event to stdout. Useful for experimentation and debugging purposes.
|
||||
|
||||
```yml
|
||||
sink:
|
||||
type: "console"
|
||||
```
|
||||
|
||||
### File `file`
|
||||
|
||||
Outputs metadata to a file. This can be used to decouple metadata sourcing from the
|
||||
process of pushing it into DataHub, and is particularly useful for debugging purposes.
|
||||
Note that the file source can read files generated by this sink.
|
||||
|
||||
```yml
|
||||
sink:
|
||||
type: file
|
||||
config:
|
||||
filename: ./path/to/mce/file.json
|
||||
```
|
||||
A number of recipes are included in the [examples/recipes](./examples/recipes) directory. For full info and context on each source and sink, see the pages described in the [table of plugins](#installing-plugins).
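As a minimal end-to-end illustration, the file source and the datahub-rest sink shown above can be combined into a single recipe (a sketch; the file path and server URL are placeholders to adjust for your setup):

```yml
# A minimal sketch: replay previously generated MCE events from a file
# into a local DataHub instance over REST.
source:
  type: file
  config:
    filename: ./path/to/mce/file.json

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```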
|
||||
|
||||
## Transformations
|
||||
|
||||
@ -1040,10 +181,13 @@ If you're simply looking to run ingestion on a schedule, take a look at these sa
|
||||
The Airflow lineage backend is only supported in Airflow 1.10.15+ and 2.0.2+.
|
||||
|
||||
:::
|
||||
|
||||
1. You need to install the required dependency in your Airflow environment. See https://registry.astronomer.io/providers/datahub/modules/datahublineagebackend
|
||||
```shell
|
||||
pip install acryl-datahub[airflow]
|
||||
```
|
||||
|
||||
|
||||
|
||||
2. You must configure an Airflow hook for DataHub. We support both a DataHub REST hook and a Kafka-based hook, but you only need one.
|
||||
|
||||
```shell
|
||||
|
||||
@ -13,7 +13,6 @@ source:
|
||||
collection_pattern: {}
|
||||
enableSchemaInference: True
|
||||
schemaSamplingSize: 1000
|
||||
# database_pattern/collection_pattern are similar to schema_pattern/table_pattern from above
|
||||
sink:
|
||||
type: "datahub-rest"
|
||||
config:
|
||||
|
||||
33
metadata-ingestion/sink_docs/console.md
Normal file
@ -0,0 +1,33 @@
|
||||
# Console
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
## Setup
|
||||
|
||||
Works with `acryl-datahub` out of the box.
|
||||
|
||||
## Capabilities
|
||||
|
||||
Simply prints each metadata event to stdout. Useful for experimentation and debugging purposes.
|
||||
|
||||
## Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
# source configs
|
||||
|
||||
sink:
|
||||
type: "console"
|
||||
```
|
||||
|
||||
## Config details
|
||||
|
||||
None!
|
||||
|
||||
## Questions
|
||||
|
||||
If you've got any questions on configuring this sink, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
|
||||
87
metadata-ingestion/sink_docs/datahub.md
Normal file
@ -0,0 +1,87 @@
|
||||
# DataHub
|
||||
|
||||
## DataHub Rest
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
### Setup
|
||||
|
||||
To install this plugin, run `pip install 'acryl-datahub[datahub-rest]'`.
|
||||
|
||||
### Capabilities
|
||||
|
||||
Pushes metadata to DataHub using the GMA REST API. The advantage of the REST-based interface
is that any errors can immediately be reported.
|
||||
|
||||
### Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
# source configs
|
||||
sink:
|
||||
type: "datahub-rest"
|
||||
config:
|
||||
server: "http://localhost:8080"
|
||||
```
|
||||
|
||||
### Config details
|
||||
|
||||
Note that a `.` is used to denote nested fields in the YAML recipe.
|
||||
|
||||
| Field | Required | Default | Description |
|
||||
| -------- | -------- | ------- | ---------------------------- |
|
||||
| `server` | ✅ | | URL of DataHub GMS endpoint. |
|
||||
|
||||
## DataHub Kafka
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
### Setup
|
||||
|
||||
To install this plugin, run `pip install 'acryl-datahub[datahub-kafka]'`.
|
||||
|
||||
### Capabilities
|
||||
|
||||
Pushes metadata to DataHub by publishing messages to Kafka. The advantage of the Kafka-based
|
||||
interface is that it's asynchronous and can handle higher throughput.
|
||||
|
||||
### Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
# source configs
|
||||
|
||||
sink:
|
||||
type: "datahub-kafka"
|
||||
config:
|
||||
connection:
|
||||
bootstrap: "localhost:9092"
|
||||
schema_registry_url: "http://localhost:8081"
|
||||
```
|
||||
|
||||
### Config details
|
||||
|
||||
Note that a `.` is used to denote nested fields in the YAML recipe.
|
||||
|
||||
| Field | Required | Default | Description |
|
||||
| -------------------------------------------- | -------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `connection.bootstrap` | ✅ | | Kafka bootstrap URL. |
|
||||
| `connection.producer_config.<option>` | | | Passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#confluent_kafka.SerializingProducer |
|
||||
| `connection.schema_registry_url` | ✅ | | URL of schema registry being used. |
|
||||
| `connection.schema_registry_config.<option>` | | | Passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#confluent_kafka.schema_registry.SchemaRegistryClient |
|
||||
|
||||
The options in the producer config and schema registry config are passed to the Kafka SerializingProducer and SchemaRegistryClient respectively.
|
||||
|
||||
For a full example with a number of security options, see this [example recipe](../examples/recipes/secured_kafka.yml).
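As an illustration, SASL credentials can be supplied through `producer_config` and `schema_registry_config`. The option names below are standard Confluent/librdkafka client settings rather than DataHub-specific ones, so treat this as a sketch and double-check them against the linked Confluent docs and the example recipe:

```yml
sink:
  type: "datahub-kafka"
  config:
    connection:
      bootstrap: "broker:9092"
      schema_registry_url: "https://schema-registry:8081"
      producer_config:
        security.protocol: "SASL_SSL"   # librdkafka producer setting
        sasl.mechanism: "PLAIN"         # assumption: adjust to your cluster's SASL mechanism
        sasl.username: "<username>"
        sasl.password: "<password>"
      schema_registry_config:
        basic.auth.user.info: "<username>:<password>"  # SchemaRegistryClient auth option
```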
|
||||
|
||||
## Questions
|
||||
|
||||
If you've got any questions on configuring this sink, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
|
||||
41
metadata-ingestion/sink_docs/file.md
Normal file
@ -0,0 +1,41 @@
|
||||
# File
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
## Setup
|
||||
|
||||
Works with `acryl-datahub` out of the box.
|
||||
|
||||
## Capabilities
|
||||
|
||||
Outputs metadata to a file. This can be used to decouple metadata sourcing from the
|
||||
process of pushing it into DataHub, and is particularly useful for debugging purposes.
|
||||
Note that the [file source](../source_docs/file.md) can read files generated by this sink.
|
||||
|
||||
## Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
# source configs
|
||||
|
||||
sink:
|
||||
type: file
|
||||
config:
|
||||
filename: ./path/to/mce/file.json
|
||||
```
|
||||
|
||||
## Config details
|
||||
|
||||
Note that a `.` is used to denote nested fields in the YAML recipe.
|
||||
|
||||
| Field | Required | Default | Description |
|
||||
| -------- | -------- | ------- | ------------------------- |
|
||||
| filename | ✅ | | Path to file to write to. |
|
||||
|
||||
## Questions
|
||||
|
||||
If you've got any questions on configuring this sink, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
|
||||
70
metadata-ingestion/source_docs/athena.md
Normal file
@ -0,0 +1,70 @@
|
||||
# Athena
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
## Setup
|
||||
|
||||
To install this plugin, run `pip install 'acryl-datahub[athena]'`.
|
||||
|
||||
## Capabilities
|
||||
|
||||
This plugin extracts the following:
|
||||
|
||||
- Metadata for databases, schemas, and tables
|
||||
- Column types associated with each table
|
||||
|
||||
## Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: athena
|
||||
config:
|
||||
# Coordinates
|
||||
aws_region: my_aws_region_name
|
||||
work_group: my_work_group
|
||||
|
||||
# Credentials
|
||||
username: my_aws_access_key_id
|
||||
password: my_aws_secret_access_key
|
||||
database: my_database
|
||||
|
||||
# Options
|
||||
s3_staging_dir: "s3://<bucket-name>/<folder>/"
|
||||
|
||||
sink:
|
||||
# sink configs
|
||||
```
|
||||
|
||||
## Config details
|
||||
|
||||
Note that a `.` is used to denote nested fields in the YAML recipe.
|
||||
|
||||
| Field | Required | Default | Description |
|
||||
| ---------------------- | -------- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `username` | | Autodetected | Username credential. If not specified, detected with boto3 rules. See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
|
||||
| `password` | | Autodetected | Same detection scheme as `username` |
|
||||
| `database` | | Autodetected | |
|
||||
| `aws_region` | ✅ | | AWS region code. |
|
||||
| `s3_staging_dir` | ✅ | | Of format `"s3://<bucket-name>/prefix/"`. The `s3_staging_dir` parameter is needed because Athena always writes query results to S3. <br />See https://docs.aws.amazon.com/athena/latest/ug/querying.html. |
|
||||
| `work_group` | ✅ | | Name of Athena workgroup. <br />See https://docs.aws.amazon.com/athena/latest/ug/manage-queries-control-costs-with-workgroups.html. |
|
||||
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
|
||||
| `options.<option>` | | | Any options specified here will be passed to SQLAlchemy's `create_engine` as kwargs.<br />See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details. |
|
||||
| `table_pattern.allow` | | | Regex pattern for tables to include in ingestion. |
|
||||
| `table_pattern.deny` | | | Regex pattern for tables to exclude from ingestion. |
|
||||
| `schema_pattern.allow` | | | Regex pattern for schemas to include in ingestion. |
|
||||
| `schema_pattern.deny` | | | Regex pattern for schemas to exclude from ingestion. |
|
||||
| `view_pattern.allow` | | | Regex pattern for views to include in ingestion. |
|
||||
| `view_pattern.deny` | | | Regex pattern for views to exclude from ingestion. |
|
||||
| `include_tables` | | `True` | Whether tables should be ingested. |
|
||||
|
||||
## Compatibility
|
||||
|
||||
Coming soon!
|
||||
|
||||
## Questions
|
||||
|
||||
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
|
||||
135
metadata-ingestion/source_docs/bigquery.md
Normal file
@ -0,0 +1,135 @@
|
||||
# BigQuery
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
## Setup
|
||||
|
||||
To install this plugin, run `pip install 'acryl-datahub[bigquery]'`.
|
||||
|
||||
## Capabilities
|
||||
|
||||
This plugin extracts the following:
|
||||
|
||||
- Metadata for databases, schemas, and tables
|
||||
- Column types associated with each table
|
||||
|
||||
:::tip
|
||||
|
||||
You can also get fine-grained usage statistics for BigQuery using the `bigquery-usage` source described below.
|
||||
|
||||
:::
|
||||
|
||||
## Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: bigquery
|
||||
config:
|
||||
# Coordinates
|
||||
project_id: my_project_id
|
||||
|
||||
sink:
|
||||
# sink configs
|
||||
```
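If you authenticate with a service account key file rather than application default credentials, the key file path can be passed through `options`, as in the sketch below (the path is a placeholder; see https://github.com/mxmzdlv/pybigquery#authentication for details):

```yml
source:
  type: bigquery
  config:
    project_id: my_project_id
    options:
      # Forwarded to the BigQuery SQLAlchemy dialect; see the pybigquery docs for details.
      credentials_path: "/path/to/keyfile.json"

sink:
  # sink configs
```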
|
||||
|
||||
## Config details
|
||||
|
||||
Note that a `.` is used to denote nested fields in the YAML recipe.
|
||||
|
||||
| Field | Required | Default | Description |
|
||||
| ---------------------- | -------- | ------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `project_id` | | Autodetected | Project ID to ingest from. If not specified, will infer from environment. |
|
||||
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
|
||||
| `options.<option>` | | | Any options specified here will be passed to SQLAlchemy's `create_engine` as kwargs.<br />See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details. |
|
||||
| `table_pattern.allow` | | | Regex pattern for tables to include in ingestion. |
|
||||
| `table_pattern.deny` | | | Regex pattern for tables to exclude from ingestion. |
|
||||
| `schema_pattern.allow` | | | Regex pattern for schemas to include in ingestion. |
|
||||
| `schema_pattern.deny` | | | Regex pattern for schemas to exclude from ingestion. |
|
||||
| `view_pattern.allow` | | | Regex pattern for views to include in ingestion. |
|
||||
| `view_pattern.deny` | | | Regex pattern for views to exclude from ingestion. |
|
||||
| `include_tables` | | `True` | Whether tables should be ingested. |
|
||||
| `include_views` | | `True` | Whether views should be ingested. |
|
||||
|
||||
## Compatibility
|
||||
|
||||
Coming soon!
|
||||
|
||||
## BigQuery Usage Stats
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
### Setup
|
||||
|
||||
To install this plugin, run `pip install 'acryl-datahub[bigquery-usage]'`.
|
||||
|
||||
### Capabilities
|
||||
|
||||
This plugin extracts the following:
|
||||
|
||||
- Statistics on queries issued and tables and columns accessed (excludes views)
|
||||
- Aggregation of these statistics into buckets, by day or hour granularity
|
||||
|
||||
Note: the client must have one of the following OAuth scopes, and should be authorized on all projects you'd like to ingest usage stats from.
|
||||
|
||||
- https://www.googleapis.com/auth/logging.read
|
||||
- https://www.googleapis.com/auth/logging.admin
|
||||
- https://www.googleapis.com/auth/cloud-platform.read-only
|
||||
- https://www.googleapis.com/auth/cloud-platform
|
||||
|
||||
:::note
|
||||
|
||||
This source only does usage statistics. To get the tables, views, and schemas in your BigQuery project, use the `bigquery` source described above.
|
||||
|
||||
:::
|
||||
|
||||
### Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: bigquery-usage
|
||||
config:
|
||||
# Coordinates
|
||||
projects:
|
||||
- project_id_1
|
||||
- project_id_2
|
||||
|
||||
# Options
|
||||
top_n_queries: 10
|
||||
|
||||
sink:
|
||||
# sink configs
|
||||
```
|
||||
|
||||
### Config details
|
||||
|
||||
Note that a `.` is used to denote nested fields in the YAML recipe.
|
||||
|
||||
By default, we extract usage stats for the last day, with the recommendation that this source is executed every day.
|
||||
|
||||
| Field | Required | Default | Description |
|
||||
| ---------------------- | -------- | -------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `projects`             |          |                                                                | List of project IDs to ingest usage stats from.                                                                                                                                                                                                                                                                                                                                          |
|
||||
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
|
||||
| `start_time` | | Last full day in UTC (or hour, depending on `bucket_duration`) | Earliest date of usage logs to consider. |
|
||||
| `end_time` | | Last full day in UTC (or hour, depending on `bucket_duration`) | Latest date of usage logs to consider. |
|
||||
| `top_n_queries` | | `10` | Number of top queries to save to each table. |
|
||||
| `extra_client_options` | | | Additional options to pass to `google.cloud.logging_v2.client.Client`. |
|
||||
| `query_log_delay`      |          |                                                                | To account for the possibility that the query event arrives after the read event in the audit logs, we wait for at least `query_log_delay` additional events to be processed before attempting to resolve BigQuery job information from the logs. If `query_log_delay` is `None`, it gets treated as an unlimited delay, which prioritizes correctness at the expense of memory usage. |
|
||||
| `max_query_duration` | | `15` | Correction to pad `start_time` and `end_time` with. For handling the case where the read happens within our time range but the query completion event is delayed and happens after the configured end time. |
|
||||
|
||||
### Compatibility
|
||||
|
||||
Coming soon!
|
||||
|
||||
## Questions
|
||||
|
||||
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
|
||||
76
metadata-ingestion/source_docs/dbt.md
Normal file
@ -0,0 +1,76 @@
|
||||
# dbt
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
## Setup
|
||||
|
||||
Works with `acryl-datahub` out of the box.
|
||||
|
||||
## Capabilities
|
||||
|
||||
This plugin pulls metadata from dbt's artifact files:
|
||||
|
||||
- [dbt manifest file](https://docs.getdbt.com/reference/artifacts/manifest-json)
|
||||
- This file contains model, source and lineage data.
|
||||
- [dbt catalog file](https://docs.getdbt.com/reference/artifacts/catalog-json)
|
||||
- This file contains schema data.
|
||||
- dbt does not record schema data for ephemeral models, so DataHub will show ephemeral models in the lineage, but without an associated schema.
|
||||
- [dbt sources file](https://docs.getdbt.com/reference/artifacts/sources-json)
|
||||
- This file contains metadata for sources with freshness checks.
|
||||
- We transfer dbt's freshness checks to DataHub's last-modified fields.
|
||||
- Note that this file is optional – if not specified, we'll use time of ingestion instead as a proxy for time last-modified.
|
||||
- target_platform:
|
||||
- The data platform you are enriching with dbt metadata.
|
||||
- [data platforms](https://github.com/linkedin/datahub/blob/master/gms/impl/src/main/resources/DataPlatformInfo.json)
|
||||
- load_schemas:
|
||||
- Load schemas from dbt catalog file, not necessary when the underlying data platform already has this data.
|
||||
- node_type_pattern:
|
||||
- Use this filter to include or exclude node types, via `allow` and `deny` patterns.
|
||||
|
||||
## Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: "dbt"
|
||||
config:
|
||||
# Coordinates
|
||||
manifest_path: "./path/dbt/manifest_file.json"
|
||||
catalog_path: "./path/dbt/catalog_file.json"
|
||||
sources_path: "./path/dbt/sources_file.json"
|
||||
|
||||
# Options
|
||||
target_platform: "my_target_platform_id"
|
||||
load_schemas: True # note: if this is disabled, table schema details (e.g. columns) will not be ingested
|
||||
|
||||
sink:
|
||||
# sink configs
|
||||
```
|
||||
|
||||
## Config details
|
||||
|
||||
Note that a `.` is used to denote nested fields in the YAML recipe.
|
||||
|
||||
| Field | Required | Default | Description |
|
||||
| ------------------------- | -------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `manifest_path` | ✅ | | Path to dbt manifest JSON. See https://docs.getdbt.com/reference/artifacts/manifest-json |
|
||||
| `catalog_path` | ✅ | | Path to dbt catalog JSON. See https://docs.getdbt.com/reference/artifacts/catalog-json |
|
||||
| `sources_path` | | | Path to dbt sources JSON. See https://docs.getdbt.com/reference/artifacts/sources-json. If not specified, last-modified fields will not be populated. |
|
||||
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
|
||||
| `target_platform` | ✅ | | The platform that dbt is loading onto. |
|
||||
| `load_schemas` | ✅ | | Whether to load database schemas. If set to `False`, table schema details (e.g. columns) will not be ingested. |
|
||||
| `node_type_pattern.allow` | | | Regex pattern for dbt nodes to include in ingestion. |
|
||||
| `node_type_pattern.deny` | | | Regex pattern for dbt nodes to exclude from ingestion. |
|
||||
|
||||
Note: when `load_schemas` is False, models that use [identifiers](https://docs.getdbt.com/reference/resource-properties/identifier) to reference their source tables are ingested using the model identifier as the model name to preserve the lineage.
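As a sketch of the `node_type_pattern` filter described above — assuming your project uses standard dbt node types such as `model`, `source`, and `test` — a recipe that keeps models and sources but drops everything else might look like:

```yml
source:
  type: "dbt"
  config:
    # Coordinates
    manifest_path: "./path/dbt/manifest_file.json"
    catalog_path: "./path/dbt/catalog_file.json"

    # Options
    target_platform: "my_target_platform_id"
    load_schemas: True

    # Keep models and sources, drop other node types (e.g. tests).
    node_type_pattern:
      allow:
        - "model"
        - "source"

sink:
  # sink configs
```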
|
||||
|
||||
## Compatibility
|
||||
|
||||
Coming soon!
|
||||
|
||||
## Questions
|
||||
|
||||
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
|
||||
67
metadata-ingestion/source_docs/druid.md
Normal file
@ -0,0 +1,67 @@
|
||||
# Druid
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
## Setup
|
||||
|
||||
To install this plugin, run `pip install 'acryl-datahub[druid]'`.
|
||||
|
||||
## Capabilities
|
||||
|
||||
This plugin extracts the following:
|
||||
|
||||
- Metadata for databases, schemas, and tables
|
||||
- Column types associated with each table
|
||||
|
||||
**Note**: It is important to explicitly define the deny schema pattern for internal Druid databases (lookup & sys) if adding a schema pattern. Otherwise, the crawler may crash before processing relevant databases. This deny pattern is defined by default but is overridden by user-submitted configurations.
|
||||
|
||||
## Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: druid
|
||||
config:
|
||||
# Coordinates
|
||||
host_port: "localhost:8082"
|
||||
|
||||
# Credentials
|
||||
username: admin
|
||||
password: password
|
||||
|
||||
sink:
|
||||
# sink configs
|
||||
```
|
||||
|
||||
## Config details
|
||||
|
||||
Note that a `.` is used to denote nested fields in the YAML recipe.
|
||||
|
||||
| Field | Required | Default | Description |
|
||||
| ---------------------- | -------- | ----------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `username` | | | Database username. |
|
||||
| `password` | | | Database password. |
|
||||
| `host_port` | ✅ | | Host URL and port to connect to. |
|
||||
| `database` | | | Database to ingest. |
|
||||
| `database_alias` | | | Alias to apply to database when ingesting. |
|
||||
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
|
||||
| `options.<option>` | | | Any options specified here will be passed to SQLAlchemy's `create_engine` as kwargs.<br />See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details. |
|
||||
| `table_pattern.allow` | | | Regex pattern for tables to include in ingestion. |
|
||||
| `table_pattern.deny` | | | Regex pattern for tables to exclude from ingestion. |
|
||||
| `schema_pattern.allow` | | | Regex pattern for schemas to include in ingestion. |
|
||||
| `schema_pattern.deny`  |          | `"^(lookup\|sys).*"`    | Regex pattern for schemas to exclude from ingestion.                                                                                                                                       |
|
||||
| `view_pattern.allow` | | | Regex pattern for views to include in ingestion. |
|
||||
| `view_pattern.deny` | | | Regex pattern for views to exclude from ingestion. |
|
||||
| `include_tables` | | `True` | Whether tables should be ingested. |
|
||||
| `include_views` | | `True` | Whether views should be ingested. |
|
||||
|
||||
## Compatibility
|
||||
|
||||
Coming soon!
|
||||
|
||||
## Questions
|
||||
|
||||
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
|
||||
56
metadata-ingestion/source_docs/feast.md
Normal file
@ -0,0 +1,56 @@
|
||||
# Feast
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
## Setup
|
||||
|
||||
**Note: Feast ingestion requires Docker to be installed.**
|
||||
|
||||
To install this plugin, run `pip install 'acryl-datahub[feast]'`.
|
||||
|
||||
## Capabilities
|
||||
|
||||
This plugin extracts the following:
|
||||
|
||||
- List of feature tables (modeled as [`MLFeatureTable`](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/ml/metadata/MLFeatureTableProperties.pdl)s),
|
||||
features ([`MLFeature`](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/ml/metadata/MLFeatureProperties.pdl)s),
|
||||
and entities ([`MLPrimaryKey`](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/ml/metadata/MLPrimaryKeyProperties.pdl)s)
|
||||
- Column types associated with each feature and entity
|
||||
|
||||
Note: this source uses a separate Docker container to extract Feast's metadata into a JSON file, which is then
parsed into DataHub's native objects. This separation exists because of a dependency conflict in the `feast` module.
|
||||
|
||||
## Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: feast
|
||||
config:
|
||||
# Coordinates
|
||||
core_url: "localhost:6565"
|
||||
|
||||
sink:
|
||||
# sink configs
|
||||
```
|
||||
|
||||
## Config details
|
||||
|
||||
Note that a `.` is used to denote nested fields in the YAML recipe.
|
||||
|
||||
| Field | Required | Default | Description |
|
||||
| ----------------- | -------- | ------------------ | ------------------------------------------------------- |
|
||||
| `core_url` | | `"localhost:6565"` | URL of Feast Core instance. |
|
||||
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
|
||||
| `use_local_build` | | `False` | Whether to build Feast ingestion Docker image locally. |
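For instance, if you want the Feast ingestion image to be built locally rather than pulled, a minimal sketch based on the options above would be:

```yml
source:
  type: feast
  config:
    # Coordinates
    core_url: "localhost:6565"

    # Build the Feast ingestion Docker image locally instead of pulling it.
    use_local_build: True

sink:
  # sink configs
```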
|
||||
|
||||
## Compatibility
|
||||
|
||||
Coming soon!
|
||||
|
||||
## Questions
|
||||
|
||||
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
|
||||
46
metadata-ingestion/source_docs/file.md
Normal file
@ -0,0 +1,46 @@
|
||||
# File
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
## Setup
|
||||
|
||||
Works with `acryl-datahub` out of the box.
|
||||
|
||||
## Capabilities
|
||||
|
||||
This plugin pulls metadata from a previously generated file. The [file sink](../sink_docs/file.md)
|
||||
can produce such files, and a number of samples are included in the
|
||||
[examples/mce_files](../examples/mce_files) directory.
|
||||
|
||||
## Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: file
|
||||
config:
|
||||
# Coordinates
|
||||
filename: ./path/to/mce/file.json
|
||||
|
||||
sink:
|
||||
# sink configs
|
||||
```
|
||||
|
||||
## Config details
|
||||
|
||||
Note that a `.` is used to denote nested fields in the YAML recipe.
|
||||
|
||||
| Field | Required | Default | Description |
|
||||
| ---------- | -------- | ------- | ----------------------- |
|
||||
| `filename` | ✅ | | Path to file to ingest. |
|
||||
|
||||
## Compatibility
|
||||
|
||||
Coming soon!
|
||||
|
||||
## Questions
|
||||
|
||||
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
|
||||
62
metadata-ingestion/source_docs/glue.md
Normal file
@ -0,0 +1,62 @@
|
||||
# Glue
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
## Setup
|
||||
|
||||
To install this plugin, run `pip install 'acryl-datahub[glue]'`.
|
||||
|
||||
Note: if you also have files in S3 that you'd like to ingest, we recommend you use Glue's built-in data catalog. See [here](../s3-ingestion.md) for a quick guide on how to set up a crawler on Glue and ingest the outputs with DataHub.
|
||||
|
||||
## Capabilities
|
||||
|
||||
This plugin extracts the following:
|
||||
|
||||
- Tables in the Glue catalog
|
||||
- Column types associated with each table
|
||||
- Table metadata, such as owner, description and parameters
|
||||
- Jobs and their component transformations, data sources, and data sinks
|
||||
|
||||
## Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: glue
|
||||
config:
|
||||
# Coordinates
|
||||
aws_region: "my-aws-region"
|
||||
|
||||
sink:
|
||||
# sink configs
|
||||
```
|
||||
|
||||
## Config details
|
||||
|
||||
Note that a `.` is used to denote nested fields in the YAML recipe.
|
||||
|
||||
| Field | Required | Default | Description |
|
||||
| ------------------------ | -------- | --------------------------- | ---------------------------------------------------------------------------------- |
|
||||
| `aws_region` | ✅ | | AWS region code. |
|
||||
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
|
||||
| `aws_access_key_id` | | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
|
||||
| `aws_secret_access_key` | | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
|
||||
| `aws_session_token` | | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
|
||||
| `aws_role` | | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
|
||||
| `extract_transforms` | | `True` | Whether to extract Glue transform jobs. |
|
||||
| `database_pattern.allow` | | | Regex pattern for databases to include in ingestion. |
|
||||
| `database_pattern.deny` | | | Regex pattern for databases to exclude from ingestion. |
|
||||
| `table_pattern.allow` | | | Regex pattern for tables to include in ingestion. |
|
||||
| `table_pattern.deny` | | | Regex pattern for tables to exclude from ingestion. |
|
||||
| `underlying_platform`    |          |                             | Override for platform name.                                                         |
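For example, to ingest only a couple of Glue databases and skip the transform jobs, a sketch along these lines should work (the database names and region are placeholders, and it assumes the pattern filters accept lists of regexes):

```yml
source:
  type: glue
  config:
    # Coordinates
    aws_region: "my-aws-region"

    # Skip Glue transform jobs and only look at two databases.
    extract_transforms: False
    database_pattern:
      allow:
        - "sales_db"
        - "marketing_db"

sink:
  # sink configs
```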
|
||||
|
||||
## Compatibility
|
||||
|
||||
Coming soon!
|
||||
|
||||
## Questions
|
||||
|
||||
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
|
||||
100
metadata-ingestion/source_docs/hive.md
Normal file
@ -0,0 +1,100 @@
|
||||
# Hive
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
## Setup
|
||||
|
||||
To install this plugin, run `pip install 'acryl-datahub[hive]'`.
|
||||
|
||||
## Capabilities
|
||||
|
||||
This plugin extracts the following:
|
||||
|
||||
- Metadata for databases, schemas, and tables
|
||||
- Column types associated with each table
|
||||
- Detailed table and storage information
|
||||
|
||||
## Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: hive
|
||||
config:
|
||||
# Coordinates
|
||||
host_port: localhost:10000
|
||||
database: DemoDatabase # optional, if not specified, ingests from all databases
|
||||
|
||||
# Credentials
|
||||
username: user # optional
|
||||
password: pass # optional
|
||||
|
||||
# For more details on authentication, see the PyHive docs:
|
||||
# https://github.com/dropbox/PyHive#passing-session-configuration.
|
||||
# LDAP, Kerberos, etc. are supported using connect_args, which can be
|
||||
# added under the `options` config parameter.
|
||||
#scheme: 'hive+http' # set this if Thrift should use the HTTP transport
|
||||
#scheme: 'hive+https' # set this if Thrift should use the HTTP with SSL transport
|
||||
|
||||
sink:
|
||||
# sink configs
|
||||
```
|
||||
|
||||
<details>
|
||||
<summary>Example: using ingestion with Azure HDInsight</summary>
|
||||
|
||||
```yml
|
||||
# Connecting to Microsoft Azure HDInsight using TLS.
|
||||
source:
|
||||
type: hive
|
||||
config:
|
||||
# Coordinates
|
||||
host_port: <cluster_name>.azurehdinsight.net:443
|
||||
|
||||
# Credentials
|
||||
username: admin
|
||||
password: password
|
||||
|
||||
# Options
|
||||
options:
|
||||
connect_args:
|
||||
http_path: "/hive2"
|
||||
auth: BASIC
|
||||
|
||||
sink:
|
||||
# sink configs
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
## Config details
|
||||
|
||||
Note that a `.` is used to denote nested fields in the YAML recipe.
|
||||
|
||||
| Field | Required | Default | Description |
|
||||
| ---------------------- | -------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `username` | | | Database username. |
|
||||
| `password` | | | Database password. |
|
||||
| `host_port` | ✅ | | Host URL and port to connect to. |
|
||||
| `database` | | | Database to ingest. |
|
||||
| `database_alias` | | | Alias to apply to database when ingesting. |
|
||||
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
|
||||
| `options.<option>` | | | Any options specified here will be passed to SQLAlchemy's `create_engine` as kwargs.<br />See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details. |
|
||||
| `table_pattern.allow` | | | Regex pattern for tables to include in ingestion. |
|
||||
| `table_pattern.deny` | | | Regex pattern for tables to exclude from ingestion. |
|
||||
| `schema_pattern.allow` | | | Regex pattern for schemas to include in ingestion. |
|
||||
| `schema_pattern.deny` | | | Regex pattern for schemas to exclude from ingestion. |
|
||||
| `view_pattern.allow` | | | Regex pattern for views to include in ingestion. |
|
||||
| `view_pattern.deny` | | | Regex pattern for views to exclude from ingestion. |
|
||||
| `include_tables` | | `True` | Whether tables should be ingested. |
|
||||
|
||||
## Compatibility
|
||||
|
||||
Coming soon!
|
||||
|
||||
## Questions
|
||||
|
||||
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
|
||||
63
metadata-ingestion/source_docs/kafka-connect.md
Normal file
@ -0,0 +1,63 @@
|
||||
# Kafka Connect
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
## Setup
|
||||
|
||||
To install this plugin, run `pip install 'acryl-datahub[kafka-connect]'`.
|
||||
|
||||
## Capabilities
|
||||
|
||||
This plugin extracts the following:
|
||||
|
||||
- Each Kafka Connect connector as an individual `DataFlowSnapshotClass` entity
- Individual `DataJobSnapshotClass` entities, named using the `{connector_name}:{source_dataset}` convention
- Lineage information from the source database to the Kafka topic
|
||||
|
||||
Current limitations:
|
||||
|
||||
- Currently works only for Debezium source connectors.
|
||||
|
||||
## Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: "kafka-connect"
|
||||
config:
|
||||
# Coordinates
|
||||
connect_uri: "http://localhost:8083"
|
||||
cluster_name: "connect-cluster"
|
||||
|
||||
# Credentials
|
||||
username: admin
|
||||
password: password
|
||||
|
||||
sink:
|
||||
# sink configs
|
||||
```
|
||||
|
||||
## Config details
|
||||
|
||||
Note that a `.` is used to denote nested fields in the YAML recipe.
|
||||
|
||||
| Field | Required | Default | Description |
|
||||
| -------------------------- | -------- | -------------------------- | ------------------------------------------------------- |
|
||||
| `connect_uri` | | `"http://localhost:8083/"` | URI to connect to. |
|
||||
| `username` | | | Kafka Connect username. |
|
||||
| `password` | | | Kafka Connect password. |
|
||||
| `cluster_name` | | `"connect-cluster"` | Cluster to ingest from. |
|
||||
| `connector_patterns.allow` |          |                            | Regex pattern for connectors to include in ingestion.   |
| `connector_patterns.deny`  |          |                            | Regex pattern for connectors to exclude from ingestion. |
|
||||
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
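As a sketch, to ingest only connectors whose names start with a given prefix, you could combine the coordinates above with a connector pattern (the `mysql-` prefix is just an example, and this assumes the `allow` filter accepts a list of regexes):

```yml
source:
  type: "kafka-connect"
  config:
    # Coordinates
    connect_uri: "http://localhost:8083"
    cluster_name: "connect-cluster"

    # Only ingest connectors whose names start with "mysql-".
    connector_patterns:
      allow:
        - "^mysql-.*"

sink:
  # sink configs
```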
|
||||
|
||||
## Compatibility
|
||||
|
||||
Coming soon!
|
||||
|
||||
## Questions
|
||||
|
||||
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
|
||||
60
metadata-ingestion/source_docs/kafka.md
Normal file
@ -0,0 +1,60 @@
|
||||
# Kafka Metadata
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
## Setup
|
||||
|
||||
To install this plugin, run `pip install 'acryl-datahub[kafka]'`.
|
||||
|
||||
## Capabilities
|
||||
|
||||
This plugin extracts the following:
|
||||
|
||||
- Topics from the Kafka broker
|
||||
- Schemas associated with each topic from the schema registry
|
||||
|
||||
## Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: "kafka"
|
||||
config:
|
||||
# Coordinates
|
||||
connection:
|
||||
bootstrap: "broker:9092"
|
||||
|
||||
schema_registry_url: http://localhost:8081
|
||||
|
||||
sink:
|
||||
# sink configs
|
||||
```
|
||||
|
||||
## Config details
|
||||
|
||||
Note that a `.` is used to denote nested fields in the YAML recipe.
|
||||
|
||||
| Field | Required | Default | Description |
|
||||
| -------------------------------------------- | -------- | ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `connection.bootstrap`                       |          | `"localhost:9092"`        | Bootstrap servers.                                                                                                                                                                                                                                                                       |
| `connection.schema_registry_url`             |          | `"http://localhost:8081"` | Schema registry location.                                                                                                                                                                                                                                                                |
|
||||
| `connection.schema_registry_config.<option>` | | | Extra schema registry config. These options will be passed into Kafka's SchemaRegistryClient. See https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html?#schemaregistryclient. |
|
||||
| `connection.consumer_config.<option>` | | | Extra consumer config. These options will be passed into Kafka's DeserializingConsumer. See https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#deserializingconsumer and https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md. |
|
||||
| `connection.producer_config.<option>` | | | Extra producer config. These options will be passed into Kafka's SerializingProducer. See https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#serializingproducer and https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md. |
|
||||
| `topic_patterns.allow` | | | Regex pattern for topics to include in ingestion. |
|
||||
| `topic_patterns.deny` | | | Regex pattern for topics to exclude from ingestion. |
|
||||
|
||||
The options in the consumer config and schema registry config are passed to the Kafka DeserializingConsumer and SchemaRegistryClient respectively.
|
||||
|
||||
For a full example with a number of security options, see this [example recipe](../examples/recipes/secured_kafka.yml).
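As a smaller sketch, a recipe that skips Kafka's internal topics (which usually start with an underscore, e.g. `__consumer_offsets` or `_schemas`) might look like this, assuming the `deny` filter accepts a list of regexes:

```yml
source:
  type: "kafka"
  config:
    # Coordinates
    connection:
      bootstrap: "broker:9092"
      schema_registry_url: http://localhost:8081

    # Ignore internal topics such as __consumer_offsets and _schemas.
    topic_patterns:
      deny:
        - "^_.*"

sink:
  # sink configs
```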
|
||||
|
||||
## Compatibility
|
||||
|
||||
Coming soon!
|
||||
|
||||
## Questions
|
||||
|
||||
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
|
||||
65
metadata-ingestion/source_docs/ldap.md
Normal file
@ -0,0 +1,65 @@
|
||||
# LDAP
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
## Setup
|
||||
|
||||
To install this plugin, run `pip install 'acryl-datahub[ldap]'`.
|
||||
|
||||
## Capabilities
|
||||
|
||||
This plugin extracts the following:
|
||||
|
||||
- People
|
||||
- Names, emails, titles, and manager information for each person
|
||||
- List of groups
|
||||
|
||||
## Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: "ldap"
|
||||
config:
|
||||
# Coordinates
|
||||
ldap_server: ldap://localhost
|
||||
|
||||
# Credentials
|
||||
ldap_user: "cn=admin,dc=example,dc=org"
|
||||
ldap_password: "admin"
|
||||
|
||||
# Options
|
||||
base_dn: "dc=example,dc=org"
|
||||
|
||||
sink:
|
||||
# sink configs
|
||||
```
|
||||
|
||||
## Config details
|
||||
|
||||
Note that a `.` is used to denote nested fields in the YAML recipe.
|
||||
|
||||
| Field | Required | Default | Description |
|
||||
| ------------------------------ | -------- | ------------------- | ----------------------------------------------------------------------- |
|
||||
| `ldap_server` | ✅ | | LDAP server URL. |
|
||||
| `ldap_user` | ✅ | | LDAP user. |
|
||||
| `ldap_password` | ✅ | | LDAP password. |
|
||||
| `base_dn` | ✅ | | LDAP DN. |
|
||||
| `filter` | | `"(objectClass=*)"` | LDAP extractor filter. |
|
||||
| `drop_missing_first_last_name` | | `True` | If set to true, any users without first and last names will be dropped. |
|
||||
| `page_size` | | `20` | Size of each page to fetch when extracting metadata. |
|
||||
|
||||
Set `drop_missing_first_last_name` to true if you have many "headless" LDAP accounts for devices or services
that should be excluded because they do not have a first and last name. This only affects the ingestion of
LDAP users; LDAP groups are unaffected by this option.
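As a sketch, you can also narrow the LDAP filter to only pull person entries; the exact object class to use depends on your directory's schema, so treat `(objectClass=person)` as an illustrative value:

```yml
source:
  type: "ldap"
  config:
    # Coordinates
    ldap_server: ldap://localhost

    # Credentials
    ldap_user: "cn=admin,dc=example,dc=org"
    ldap_password: "admin"

    # Options
    base_dn: "dc=example,dc=org"
    # Only pull entries that represent people; adjust to your directory's schema.
    filter: "(objectClass=person)"

sink:
  # sink configs
```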
|
||||
|
||||
## Compatibility
|
||||
|
||||
Coming soon!
|
||||
|
||||
## Questions
|
||||
|
||||
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
|
||||
62
metadata-ingestion/source_docs/looker.md
Normal file
@ -0,0 +1,62 @@
|
||||
# Looker dashboards
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
## Setup
|
||||
|
||||
To install this plugin, run `pip install 'acryl-datahub[looker]'`.
|
||||
|
||||
## Capabilities
|
||||
|
||||
This plugin extracts the following:
|
||||
|
||||
- Looker dashboards and dashboard elements (charts)
|
||||
- Names, descriptions, URLs, chart types, input view for the charts
|
||||
|
||||
See the [Looker authentication docs](https://docs.looker.com/reference/api-and-integration/api-auth#authentication_with_an_sdk) for the steps to create a client ID and secret.
|
||||
|
||||
## Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: "looker"
|
||||
config:
|
||||
# Coordinates
|
||||
base_url: https://company.looker.com:19999
|
||||
|
||||
# Credentials
|
||||
client_id: admin
|
||||
client_secret: password
|
||||
|
||||
sink:
|
||||
# sink configs
|
||||
```
|
||||
|
||||
## Config details
|
||||
|
||||
Note that a `.` is used to denote nested fields in the YAML recipe.
|
||||
|
||||
| Field | Required | Default | Description |
|
||||
| ------------------------- | -------- | ----------------------- | ------------------------------------------------------------------------------------------------------------ |
|
||||
| `client_id` | ✅ | | Looker API3 client ID. |
|
||||
| `client_secret` | ✅ | | Looker API3 client secret. |
|
||||
| `base_url`                | ✅       |                         | URL to your Looker instance: `https://company.looker.com:19999` or `https://looker.company.com`, or similar.  |
|
||||
| `platform_name` | | `"looker"` | Platform to use in namespace when constructing URNs. |
|
||||
| `actor` | | `"urn:li:corpuser:etl"` | Actor to use in ownership properties of ingested metadata. |
|
||||
| `dashboard_pattern.allow` | | | Regex pattern for dashboards to include in ingestion. |
|
||||
| `dashboard_pattern.deny` | | | Regex pattern for dashboards to exclude from ingestion. |
|
||||
| `chart_pattern.allow` | | | Regex pattern for charts to include in ingestion. |
|
||||
| `chart_pattern.deny` | | | Regex pattern for charts to exclude from ingestion. |
|
||||
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
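For instance, a sketch that restricts ingestion to dashboards matching an allow pattern while excluding a particular chart might look like this; the pattern values are placeholders, and this assumes the `allow`/`deny` filters accept lists of regexes:

```yml
source:
  type: "looker"
  config:
    # Coordinates
    base_url: https://company.looker.com:19999

    # Credentials
    client_id: admin
    client_secret: password

    # Placeholder patterns: restrict which dashboards and charts get ingested.
    dashboard_pattern:
      allow:
        - "10.*"
    chart_pattern:
      deny:
        - "2"

sink:
  # sink configs
```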
|
||||
|
||||
## Compatibility
|
||||
|
||||
Coming soon!
|
||||
|
||||
## Questions
|
||||
|
||||
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
|
||||
66
metadata-ingestion/source_docs/lookml.md
Normal file
@ -0,0 +1,66 @@
|
||||
# LookML
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
## Setup
|
||||
|
||||
To install this plugin, run `pip install 'acryl-datahub[lookml]'`.
|
||||
|
||||
Note: this plugin uses a package that requires Python 3.7+.
|
||||
|
||||
## Capabilities
|
||||
|
||||
This plugin extracts the following:
|
||||
|
||||
- LookML views from model files
|
||||
- Name, upstream table names, dimensions, measures, and dimension groups
|
||||
|
||||
## Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: "lookml"
|
||||
config:
|
||||
# Coordinates
|
||||
base_folder: /path/to/model/files
|
||||
|
||||
# Options
|
||||
connection_to_platform_map:
|
||||
connection_name: platform_name (or platform_name.database_name) # for ex. my_snowflake_conn: snowflake.my_database
|
||||
|
||||
sink:
|
||||
# sink configs
|
||||
```
|
||||
|
||||
## Config details
|
||||
|
||||
Note that a `.` is used to denote nested fields in the YAML recipe.
|
||||
|
||||
| Field | Required | Default | Description |
|
||||
| ---------------------------------------------- | -------- | ---------- | ----------------------------------------------------------------------- |
|
||||
| `base_folder` | ✅ | | Where the `*.model.lkml` and `*.view.lkml` files are stored. |
|
||||
| `connection_to_platform_map.<connection_name>` | ✅ | | Mappings between connection names in the model files to platform names. |
|
||||
| `platform_name` | | `"looker"` | Platform to use in namespace when constructing URNs. |
|
||||
| `model_pattern.allow` | | | Regex pattern for models to include in ingestion. |
|
||||
| `model_pattern.deny` | | | Regex pattern for models to exclude from ingestion. |
|
||||
| `view_pattern.allow` | | | Regex pattern for views to include in ingestion. |
|
||||
| `view_pattern.deny` | | | Regex pattern for views to exclude from ingestion. |
|
||||
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
|
||||
| `parse_table_names_from_sql` | | `False` | See note below. |
|
||||
|
||||
Note: the integration can use [`sql-metadata`](https://pypi.org/project/sql-metadata/) to try to parse the tables the
views depend on. As these SQL queries can be complicated, and the package doesn't officially support all the SQL dialects
that Looker supports, the results might not be correct. This parsing is disabled by default, but can be enabled by setting
`parse_table_names_from_sql: True`.
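Putting that together, a sketch that maps two LookML connections to their platforms and opts into the best-effort SQL parsing might look like this; the connection and database names are placeholders:

```yml
source:
  type: "lookml"
  config:
    # Coordinates
    base_folder: /path/to/model/files

    # Placeholder connection names mapped to a platform (optionally with a database).
    connection_to_platform_map:
      my_snowflake_conn: snowflake.my_database
      my_bigquery_conn: bigquery

    # Best-effort SQL parsing of upstream tables; results may be imperfect.
    parse_table_names_from_sql: True

sink:
  # sink configs
```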
|
||||
|
||||
## Compatibility
|
||||
|
||||
Coming soon!
|
||||
|
||||
## Questions
|
||||
|
||||
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
|
||||
73
metadata-ingestion/source_docs/mongodb.md
Normal file
@ -0,0 +1,73 @@
|
||||
# MongoDB
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
## Setup
|
||||
|
||||
To install this plugin, run `pip install 'acryl-datahub[mongodb]'`.
|
||||
|
||||
## Capabilities
|
||||
|
||||
This plugin extracts the following:
|
||||
|
||||
- Databases and associated metadata
|
||||
- Collections in each database and schemas for each collection (via schema inference)
|
||||
|
||||
By default, schema inference samples 1,000 documents from each collection. Setting `schemaSamplingSize: null` will scan the entire collection.
|
||||
Moreover, setting `useRandomSampling: False` will sample the first documents found without random selection, which may be faster for large collections.
|
||||
|
||||
Note that `schemaSamplingSize` has no effect if `enableSchemaInference: False` is set.
|
||||
|
||||
## Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: "mongodb"
|
||||
config:
|
||||
# Coordinates
|
||||
connect_uri: "mongodb://localhost"
|
||||
|
||||
# Credentials
|
||||
username: admin
|
||||
password: password
|
||||
authMechanism: "DEFAULT"
|
||||
|
||||
# Options
|
||||
enableSchemaInference: True
|
||||
useRandomSampling: True
|
||||
|
||||
sink:
|
||||
# sink configs
|
||||
```
|
||||
|
||||
## Config details
|
||||
|
||||
Note that a `.` is used to denote nested fields in the YAML recipe.
|
||||
|
||||
| Field | Required | Default | Description |
|
||||
| -------------------------- | -------- | ----------------------- | ------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `connect_uri` | | `"mongodb://localhost"` | MongoDB connection URI. |
|
||||
| `username` | | | MongoDB username. |
|
||||
| `password` | | | MongoDB password. |
|
||||
| `authMechanism` | | | MongoDB authentication mechanism. See https://pymongo.readthedocs.io/en/stable/examples/authentication.html for details. |
|
||||
| `options` | | | Additional options to pass to `pymongo.MongoClient()`. |
|
||||
| `enableSchemaInference` | | `True` | Whether to infer schemas. |
|
||||
| `schemaSamplingSize`       |          | `1000`                  | Number of documents to sample when inferring the schema. If set to `0`, all documents will be scanned.                     |
|
||||
| `useRandomSampling` | | `True` | If documents for schema inference should be randomly selected. If `False`, documents will be selected from start. |
|
||||
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
|
||||
| `database_pattern.allow` | | | Regex pattern for databases to include in ingestion. |
|
||||
| `database_pattern.deny` | | | Regex pattern for databases to exclude from ingestion. |
|
||||
| `collection_pattern.allow` | | | Regex pattern for collections to include in ingestion. |
|
||||
| `collection_pattern.deny` | | | Regex pattern for collections to exclude from ingestion. |
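As a sketch, to scan every document during schema inference while skipping MongoDB's internal databases (`admin`, `config`, and `local`), you might write something like the following; the deny pattern is illustrative, and this assumes the pattern filters accept lists of regexes:

```yml
source:
  type: "mongodb"
  config:
    # Coordinates
    connect_uri: "mongodb://localhost"

    # Credentials
    username: admin
    password: password

    # Scan entire collections instead of sampling 1,000 documents.
    enableSchemaInference: True
    schemaSamplingSize: null

    # Placeholder pattern: skip MongoDB's internal databases.
    database_pattern:
      deny:
        - "^(admin|config|local)$"

sink:
  # sink configs
```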
|
||||
|
||||
## Compatibility
|
||||
|
||||
Coming soon!
|
||||
|
||||
## Questions
|
||||
|
||||
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
|
||||
101
metadata-ingestion/source_docs/mssql.md
Normal file
@ -0,0 +1,101 @@
|
||||
# Microsoft SQL Server
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
## Setup
|
||||
|
||||
To install this plugin, run `pip install 'acryl-datahub[mssql]'`.
|
||||
|
||||
We have two options for the underlying library used to connect to SQL Server: (1) [python-tds](https://github.com/denisenkom/pytds) and (2) [pyodbc](https://github.com/mkleehammer/pyodbc). The TDS library is pure Python and hence easier to install, but only PyODBC supports encrypted connections.
|
||||
|
||||
## Capabilities
|
||||
|
||||
This plugin extracts the following:
|
||||
|
||||
- Metadata for databases, schemas, views and tables
|
||||
- Column types associated with each table/view
|
||||
|
||||
## Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: mssql
|
||||
config:
|
||||
# Coordinates
|
||||
host_port: localhost:1433
|
||||
database: DemoDatabase
|
||||
|
||||
# Credentials
|
||||
username: user
|
||||
password: pass
|
||||
|
||||
sink:
|
||||
# sink configs
|
||||
```
|
||||
|
||||
<details>
|
||||
<summary>Example: using ingestion with ODBC and encryption</summary>
|
||||
|
||||
This requires you to have already installed the Microsoft ODBC Driver for SQL Server.
|
||||
See https://docs.microsoft.com/en-us/sql/connect/python/pyodbc/step-1-configure-development-environment-for-pyodbc-python-development?view=sql-server-ver15
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: mssql
|
||||
config:
|
||||
# Coordinates
|
||||
host_port: localhost:1433
|
||||
database: DemoDatabase
|
||||
|
||||
# Credentials
|
||||
username: admin
|
||||
password: password
|
||||
|
||||
# Options
|
||||
uri_args:
|
||||
driver: "ODBC Driver 17 for SQL Server"
|
||||
Encrypt: "yes"
|
||||
TrustServerCertificate: "Yes"
|
||||
ssl: "True"
|
||||
|
||||
sink:
|
||||
# sink configs
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
## Config details
|
||||
|
||||
Note that a `.` is used to denote nested fields in the YAML recipe.
|
||||
|
||||
| Field | Required | Default | Description |
|
||||
| ---------------------- | -------- | ------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `username` | | | MSSQL username. |
|
||||
| `password` | | | MSSQL password. |
|
||||
| `host_port` | | `"localhost:1433"` | MSSQL host URL. |
|
||||
| `database` | | | MSSQL database. |
|
||||
| `database_alias` | | | Alias to apply to database when ingesting. |
|
||||
| `use_odbc` | | `False` | See https://docs.sqlalchemy.org/en/14/dialects/mssql.html#module-sqlalchemy.dialects.mssql.pyodbc. |
|
||||
| `uri_args.<uri_arg>` | | | Arguments to URL-encode when connecting. See https://docs.microsoft.com/en-us/sql/connect/odbc/dsn-connection-string-attribute?view=sql-server-ver15. |
|
||||
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
|
||||
| `options.<option>` | | | Any options specified here will be passed to SQLAlchemy's `create_engine` as kwargs.<br />See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details. |
|
||||
| `table_pattern.allow` | | | Regex pattern for tables to include in ingestion. |
|
||||
| `table_pattern.deny` | | | Regex pattern for tables to exclude from ingestion. |
|
||||
| `schema_pattern.allow` | | | Regex pattern for schemas to include in ingestion. |
|
||||
| `schema_pattern.deny` | | | Regex pattern for schemas to exclude from ingestion. |
|
||||
| `view_pattern.allow` | | | Regex pattern for views to include in ingestion. |
|
||||
| `view_pattern.deny` | | | Regex pattern for views to exclude from ingestion. |
|
||||
| `include_tables` | | `True` | Whether tables should be ingested. |
|
||||
| `include_views` | | `True` | Whether views should be ingested. |
|
||||
|
||||
## Compatibility
|
||||
|
||||
Coming soon!
|
||||
|
||||
## Questions
|
||||
|
||||
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
|
||||
66
metadata-ingestion/source_docs/mysql.md
Normal file
@ -0,0 +1,66 @@
|
||||
# MySQL
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
## Setup
|
||||
|
||||
To install this plugin, run `pip install 'acryl-datahub[mysql]'`.
|
||||
|
||||
## Capabilities
|
||||
|
||||
This plugin extracts the following:
|
||||
|
||||
- Metadata for databases, schemas, and tables
|
||||
- Column types and schema associated with each table
|
||||
|
||||
## Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: mysql
|
||||
config:
|
||||
# Coordinates
|
||||
host_port: localhost:3306
|
||||
database: dbname
|
||||
|
||||
# Credentials
|
||||
username: root
|
||||
password: example
|
||||
|
||||
sink:
|
||||
# sink configs
|
||||
```
|
||||
|
||||
## Config details
|
||||
|
||||
Note that a `.` is used to denote nested fields in the YAML recipe.
|
||||
|
||||
| Field | Required | Default | Description |
|
||||
| ---------------------- | -------- | ------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `username` | | | MySQL username. |
|
||||
| `password` | | | MySQL password. |
|
||||
| `host_port` | | `"localhost:3306"` | MySQL host URL. |
|
||||
| `database` | | | MySQL database. |
|
||||
| `database_alias` | | | Alias to apply to database when ingesting. |
|
||||
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
|
||||
| `options.<option>` | | | Any options specified here will be passed to SQLAlchemy's `create_engine` as kwargs.<br />See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details. |
|
||||
| `table_pattern.allow` | | | Regex pattern for tables to include in ingestion. |
|
||||
| `table_pattern.deny` | | | Regex pattern for tables to exclude from ingestion. |
|
||||
| `schema_pattern.allow` | | | Regex pattern for schemas to include in ingestion. |
|
||||
| `schema_pattern.deny` | | | Regex pattern for schemas to exclude from ingestion. |
|
||||
| `view_pattern.allow` | | | Regex pattern for views to include in ingestion. |
|
||||
| `view_pattern.deny` | | | Regex pattern for views to exclude from ingestion. |
|
||||
| `include_tables` | | `True` | Whether tables should be ingested. |
|
||||
| `include_views` | | `True` | Whether views should be ingested. |
|
||||
|
||||
## Compatibility
|
||||
|
||||
Coming soon!
|
||||
|
||||
## Questions
|
||||
|
||||
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
|
||||
74
metadata-ingestion/source_docs/oracle.md
Normal file
@ -0,0 +1,74 @@
|
||||
# Oracle
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
## Setup
|
||||
|
||||
To install this plugin, run `pip install 'acryl-datahub[oracle]'`.
|
||||
|
||||
## Capabilities
|
||||
|
||||
This plugin extracts the following:
|
||||
|
||||
- Metadata for databases, schemas, and tables
|
||||
- Column types associated with each table
|
||||
|
||||
Using the Oracle source requires that you've also installed the correct drivers; see the [cx_Oracle docs](https://cx-oracle.readthedocs.io/en/latest/user_guide/installation.html). The easiest one is the [Oracle Instant Client](https://www.oracle.com/database/technologies/instant-client.html).
|
||||
|
||||
## Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: oracle
|
||||
config:
|
||||
# Coordinates
|
||||
host_port: localhost:5432
|
||||
database: dbname
|
||||
|
||||
# Credentials
|
||||
username: user
|
||||
password: pass
|
||||
|
||||
# Options
|
||||
service_name: svc # omit database if using this option
|
||||
|
||||
sink:
|
||||
# sink configs
|
||||
```
|
||||
|
||||
## Config details
|
||||
|
||||
Note that a `.` is used to denote nested fields in the YAML recipe.
|
||||
|
||||
Exactly one of `database` or `service_name` is required.
|
||||
|
||||
| Field | Required | Default | Description |
|
||||
| ---------------------- | ------------------------------ | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `username` | | | Oracle username. For more details on authentication, see the documentation: https://docs.sqlalchemy.org/en/14/dialects/oracle.html#dialect-oracle-cx_oracle-connect <br /> and https://cx-oracle.readthedocs.io/en/latest/user_guide/connection_handling.html#connection-strings. |
|
||||
| `password` | | | Oracle password. |
|
||||
| `host_port` | | | Oracle host URL. |
|
||||
| `database`             | If `service_name` is not set   |          | Oracle database. If using this, omit `service_name`.                                                                                                                                                                                                                                  |
| `service_name`         | If `database` is not set       |          | Oracle service name. If using this, omit `database`.                                                                                                                                                                                                                                  |
|
||||
| `database_alias` | | | Alias to apply to database when ingesting. |
|
||||
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
|
||||
| `options.<option>` | | | Any options specified here will be passed to SQLAlchemy's `create_engine` as kwargs.<br />See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details. |
|
||||
| `table_pattern.allow` | | | Regex pattern for tables to include in ingestion. |
|
||||
| `table_pattern.deny` | | | Regex pattern for tables to exclude from ingestion. |
|
||||
| `schema_pattern.allow` | | | Regex pattern for schemas to include in ingestion. |
|
||||
| `schema_pattern.deny` | | | Regex pattern for schemas to exclude from ingestion. |
|
||||
| `view_pattern.allow` | | | Regex pattern for views to include in ingestion. |
|
||||
| `view_pattern.deny` | | | Regex pattern for views to exclude from ingestion. |
|
||||
| `include_tables` | | `True` | Whether tables should be ingested. |
|
||||
| `include_views` | | `True` | Whether views should be ingested. |
|
||||
|
||||
## Compatibility
|
||||
|
||||
Coming soon!
|
||||
|
||||
## Questions
|
||||
|
||||
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
|
||||
71
metadata-ingestion/source_docs/postgres.md
Normal file
@ -0,0 +1,71 @@
|
||||
# PostgreSQL
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
## Setup
|
||||
|
||||
To install this plugin, run `pip install 'acryl-datahub[postgres]'`.
|
||||
|
||||
## Capabilities
|
||||
|
||||
This plugin extracts the following:
|
||||
|
||||
- Metadata for databases, schemas, views, and tables
|
||||
- Column types associated with each table
|
||||
- Also supports PostGIS extensions
|
||||
- `database_alias` (optional) can be used to change the name under which the database is ingested
|
||||
|
||||
## Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: postgres
|
||||
config:
|
||||
# Coordinates
|
||||
host_port: localhost:5432
|
||||
database: DemoDatabase
|
||||
|
||||
# Credentials
|
||||
username: user
|
||||
password: pass
|
||||
|
||||
# Options
|
||||
database_alias: DatabaseNameToBeIngested
|
||||
|
||||
sink:
|
||||
# sink configs
|
||||
```
|
||||
|
||||
## Config details
|
||||
|
||||
Note that a `.` is used to denote nested fields in the YAML recipe.
|
||||
|
||||
| Field | Required | Default | Description |
|
||||
| ---------------------- | -------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `username` | | | PostgreSQL username. |
|
||||
| `password` | | | PostgreSQL password. |
|
||||
| `host_port` | ✅ | | PostgreSQL host URL. |
|
||||
| `database` | | | PostgreSQL database. |
|
||||
| `database_alias` | | | Alias to apply to database when ingesting. |
|
||||
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
|
||||
| `options.<option>` | | | Any options specified here will be passed to SQLAlchemy's `create_engine` as kwargs.<br />See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details. |
|
||||
| `table_pattern.allow` | | | Regex pattern for tables to include in ingestion. |
|
||||
| `table_pattern.deny` | | | Regex pattern for tables to exclude from ingestion. |
|
||||
| `schema_pattern.allow` | | | Regex pattern for schemas to include in ingestion. |
|
||||
| `schema_pattern.deny` | | | Regex pattern for schemas to exclude from ingestion. |
|
||||
| `view_pattern.allow` | | | Regex pattern for views to include in ingestion. |
|
||||
| `view_pattern.deny` | | | Regex pattern for views to exclude from ingestion. |
|
||||
| `include_tables` | | `True` | Whether tables should be ingested. |
|
||||
| `include_views` | | `True` | Whether views should be ingested. |
|
||||
|
||||
## Compatibility
|
||||
|
||||
Coming soon!
|
||||
|
||||
## Questions
|
||||
|
||||
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
|
||||
97
metadata-ingestion/source_docs/redshift.md
Normal file
@ -0,0 +1,97 @@
|
||||
# Redshift
|
||||
|
||||
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
|
||||
|
||||
## Setup
|
||||
|
||||
To install this plugin, run `pip install 'acryl-datahub[redshift]'`.
|
||||
|
||||
## Capabilities
|
||||
|
||||
This plugin extracts the following:
|
||||
|
||||
- Metadata for databases, schemas, views and tables
|
||||
- Column types associated with each table
|
||||
- Also supports PostGIS extensions
|
||||
|
||||
## Quickstart recipe
|
||||
|
||||
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
|
||||
|
||||
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: redshift
|
||||
config:
|
||||
# Coordinates
|
||||
host_port: example.something.us-west-2.redshift.amazonaws.com:5439
|
||||
database: DemoDatabase
|
||||
|
||||
# Credentials
|
||||
username: user
|
||||
password: pass
|
||||
|
||||
# Options
|
||||
options:
|
||||
# driver_option: some-option
|
||||
|
||||
include_views: True # whether to include views, defaults to True
|
||||
include_tables: True # whether to include tables, defaults to True
|
||||
|
||||
sink:
|
||||
# sink configs
|
||||
```
|
||||
|
||||
<details>
|
||||
<summary>Extra options when running Redshift behind a proxy</summary>
|
||||
|
||||
When connecting to Redshift through a proxy, you may need to adjust the SSL-related connection arguments that are passed through to the underlying driver, for example:
|
||||
|
||||
```yml
|
||||
source:
|
||||
type: redshift
|
||||
config:
|
||||
host_port: my-proxy-hostname:5439
|
||||
|
||||
options:
|
||||
connect_args:
|
||||
sslmode: "prefer" # or "require" or "verify-ca"
|
||||
sslrootcert: ~ # needed to unpin the AWS Redshift certificate
|
||||
|
||||
sink:
|
||||
# sink configs
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
## Config details

Note that a `.` is used to denote nested fields in the YAML recipe.

| Field                  | Required | Default  | Description |
| ---------------------- | -------- | -------- | ----------- |
| `username`             |          |          | Redshift username. |
| `password`             |          |          | Redshift password. |
| `host_port`            | ✅       |          | Redshift host URL. |
| `database`             |          |          | Redshift database. |
| `database_alias`       |          |          | Alias to apply to database when ingesting. |
| `env`                  |          | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `options.<option>`     |          |          | Any options specified here will be passed to SQLAlchemy's `create_engine` as kwargs.<br />See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details. |
| `table_pattern.allow`  |          |          | Regex pattern for tables to include in ingestion. |
| `table_pattern.deny`   |          |          | Regex pattern for tables to exclude from ingestion. |
| `schema_pattern.allow` |          |          | Regex pattern for schemas to include in ingestion. |
| `schema_pattern.deny`  |          |          | Regex pattern for schemas to exclude from ingestion. |
| `view_pattern.allow`   |          |          | Regex pattern for views to include in ingestion. |
| `view_pattern.deny`    |          |          | Regex pattern for views to exclude from ingestion. |
| `include_tables`       |          | `True`   | Whether tables should be ingested. |
| `include_views`        |          | `True`   | Whether views should be ingested. |

## Compatibility

Coming soon!

## Questions

If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
62 metadata-ingestion/source_docs/sagemaker.md Normal file
@ -0,0 +1,62 @@
# SageMaker

For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).

## Setup

To install this plugin, run `pip install 'acryl-datahub[sagemaker]'`.

## Capabilities

This plugin extracts the following:

- Feature groups
- Models, jobs, and lineage between the two (e.g. when jobs output a model or a model is used by a job)

## Quickstart recipe

Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.

For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).

```yml
source:
  type: sagemaker
  config:
    # Coordinates
    aws_region: "my-aws-region"

sink:
  # sink configs
```
## Config details

Note that a `.` is used to denote nested fields in the YAML recipe.

| Field                                 | Required | Default      | Description |
| ------------------------------------- | -------- | ------------ | ----------- |
| `aws_region`                          | ✅       |              | AWS region code. |
| `env`                                 |          | `"PROD"`     | Environment to use in namespace when constructing URNs. |
| `aws_access_key_id`                   |          | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `aws_secret_access_key`               |          | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `aws_session_token`                   |          | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `aws_role`                            |          | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `extract_feature_groups`              |          | `True`       | Whether to extract feature groups. |
| `extract_models`                      |          | `True`       | Whether to extract models. |
| `extract_jobs.auto_ml`                |          | `True`       | Whether to extract AutoML jobs. |
| `extract_jobs.compilation`            |          | `True`       | Whether to extract compilation jobs. |
| `extract_jobs.edge_packaging`         |          | `True`       | Whether to extract edge packaging jobs. |
| `extract_jobs.hyper_parameter_tuning` |          | `True`       | Whether to extract hyperparameter tuning jobs. |
| `extract_jobs.labeling`               |          | `True`       | Whether to extract labeling jobs. |
| `extract_jobs.processing`             |          | `True`       | Whether to extract processing jobs. |
| `extract_jobs.training`               |          | `True`       | Whether to extract training jobs. |
| `extract_jobs.transform`              |          | `True`       | Whether to extract transform jobs. |
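
As an illustration of the nested `extract_jobs.*` flags, the sketch below keeps feature groups and models but only extracts training and transform jobs (the region is a placeholder, and the booleans simply override the defaults listed above):

```yml
source:
  type: sagemaker
  config:
    aws_region: "my-aws-region"

    extract_feature_groups: True
    extract_models: True

    # narrow down which job types are extracted
    extract_jobs:
      auto_ml: False
      compilation: False
      edge_packaging: False
      hyper_parameter_tuning: False
      labeling: False
      processing: False
      training: True
      transform: True

sink:
  # sink configs
```
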
## Compatibility

Coming soon!

## Questions

If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
147 metadata-ingestion/source_docs/snowflake.md Normal file
@ -0,0 +1,147 @@
# Snowflake

For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).

## Setup

To install this plugin, run `pip install 'acryl-datahub[snowflake]'`.

## Capabilities

This plugin extracts the following:

- Metadata for databases, schemas, views and tables
- Column types associated with each table

:::tip

You can also get fine-grained usage statistics for Snowflake using the `snowflake-usage` source described below.

:::

## Quickstart recipe

Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.

For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).

```yml
source:
  type: snowflake
  config:
    # Coordinates
    host_port: account_name
    warehouse: "COMPUTE_WH"

    # Credentials
    username: user
    password: pass
    role: "sysadmin"

sink:
  # sink configs
```
## Config details

Note that a `.` is used to denote nested fields in the YAML recipe.

| Field                    | Required | Default                                                              | Description |
| ------------------------ | -------- | -------------------------------------------------------------------- | ----------- |
| `username`               |          |                                                                       | Snowflake username. |
| `password`               |          |                                                                       | Snowflake password. |
| `host_port`              | ✅       |                                                                       | Snowflake host URL. |
| `warehouse`              |          |                                                                       | Snowflake warehouse. |
| `role`                   |          |                                                                       | Snowflake role. |
| `env`                    |          | `"PROD"`                                                              | Environment to use in namespace when constructing URNs. |
| `options.<option>`       |          |                                                                       | Any options specified here will be passed to SQLAlchemy's `create_engine` as kwargs.<br />See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details. |
| `database_pattern.allow` |          |                                                                       | Regex pattern for databases to include in ingestion. |
| `database_pattern.deny`  |          | `"^UTIL_DB$"`<br />`"^SNOWFLAKE$"`<br />`"^SNOWFLAKE_SAMPLE_DATA$"`   | Regex pattern for databases to exclude from ingestion. |
| `table_pattern.allow`    |          |                                                                       | Regex pattern for tables to include in ingestion. |
| `table_pattern.deny`     |          |                                                                       | Regex pattern for tables to exclude from ingestion. |
| `schema_pattern.allow`   |          |                                                                       | Regex pattern for schemas to include in ingestion. |
| `schema_pattern.deny`    |          |                                                                       | Regex pattern for schemas to exclude from ingestion. |
| `view_pattern.allow`     |          |                                                                       | Regex pattern for views to include in ingestion. |
| `view_pattern.deny`      |          |                                                                       | Regex pattern for views to exclude from ingestion. |
| `include_tables`         |          | `True`                                                                | Whether tables should be ingested. |
| `include_views`          |          | `True`                                                                | Whether views should be ingested. |
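
For example, to restrict ingestion to a single database and skip staging schemas, the `database_pattern` and `schema_pattern` fields above can be nested like this sketch (the database name, schema regex, and other values are placeholders):

```yml
source:
  type: snowflake
  config:
    host_port: account_name
    warehouse: "COMPUTE_WH"
    username: user
    password: pass
    role: "sysadmin"

    # only ingest the ANALYTICS database
    database_pattern:
      allow:
        - "^ANALYTICS$"

    # skip any schema ending in _STAGING
    schema_pattern:
      deny:
        - ".*_STAGING$"

sink:
  # sink configs
```
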
## Compatibility

Coming soon!
## Snowflake Usage Stats

For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).

### Setup

To install this plugin, run `pip install 'acryl-datahub[snowflake-usage]'`.

### Capabilities

This plugin extracts the following:

- Statistics on queries issued and tables and columns accessed (excludes views)
- Aggregation of these statistics into buckets, by day or hour granularity

Note: the user/role must have access to the account usage table. The "accountadmin" role has this by default, and other roles can be [granted this permission](https://docs.snowflake.com/en/sql-reference/account-usage.html#enabling-account-usage-for-other-roles).

Note: the underlying access history views that we use are only available in Snowflake's enterprise edition or higher.

:::note

This source only ingests usage statistics. To get the tables, views, and schemas in your Snowflake warehouse, ingest using the `snowflake` source described above.

:::
### Quickstart recipe

Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.

For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).

```yml
source:
  type: snowflake-usage
  config:
    # Coordinates
    host_port: account_name
    warehouse: "COMPUTE_WH"

    # Credentials
    username: user
    password: pass
    role: "sysadmin"

    # Options
    top_n_queries: 10

sink:
  # sink configs
```

### Config details

Note that a `.` is used to denote nested fields in the YAML recipe.

| Field             | Required | Default                                                         | Description |
| ----------------- | -------- | --------------------------------------------------------------- | ----------- |
| `username`        |          |                                                                  | Snowflake username. |
| `password`        |          |                                                                  | Snowflake password. |
| `host_port`       | ✅       |                                                                  | Snowflake host URL. |
| `warehouse`       |          |                                                                  | Snowflake warehouse. |
| `role`            |          |                                                                  | Snowflake role. |
| `env`             |          | `"PROD"`                                                         | Environment to use in namespace when constructing URNs. |
| `bucket_duration` |          | `"DAY"`                                                          | Duration to bucket usage events by. Can be `"DAY"` or `"HOUR"`. |
| `start_time`      |          | Last full day in UTC (or hour, depending on `bucket_duration`)   | Earliest date of usage logs to consider. |
| `end_time`        |          | Last full day in UTC (or hour, depending on `bucket_duration`)   | Latest date of usage logs to consider. |
| `top_n_queries`   |          | `10`                                                             | Number of top queries to save to each table. |
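
To backfill a specific window at hourly granularity, the bucketing and time-range options can be combined as in the sketch below (the timestamps are placeholders, written as ISO 8601 strings on the assumption that standard datetime formats are accepted):

```yml
source:
  type: snowflake-usage
  config:
    host_port: account_name
    username: user
    password: pass
    role: "sysadmin"

    # bucket usage events by hour instead of the default day
    bucket_duration: "HOUR"

    # only consider usage logs inside this window (placeholder timestamps)
    start_time: "2021-07-01T00:00:00Z"
    end_time: "2021-07-02T00:00:00Z"

    top_n_queries: 20

sink:
  # sink configs
```
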
### Compatibility

Coming soon!

## Questions

If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
84 metadata-ingestion/source_docs/sql_profiles.md Normal file
@ -0,0 +1,84 @@
# SQL Profiles

For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).

## Setup

To install this plugin, run `pip install 'acryl-datahub[sql-profiles]'`.

The SQL-based profiler does not run alone, but rather is enabled on top of other SQL-based sources.
Enabling profiling will slow down ingestion runs.

:::caution

Running profiling against many tables or over many rows can run up significant costs.
While we've done our best to limit the cost of the queries the profiler runs, you
should be prudent about the set of tables profiling is enabled on and the frequency
of the profiling runs.

:::

## Capabilities

Extracts:

- Row and column counts for each table
- For each column, if applicable:
  - null counts and proportions
  - distinct counts and proportions
  - minimum, maximum, mean, median, standard deviation, some quantile values
  - histograms or frequencies of unique values

Supported SQL sources:

- AWS Athena
- BigQuery
- Druid
- Hive
- Microsoft SQL Server
- MySQL
- Oracle
- Postgres
- Redshift
- Snowflake
- Generic SQLAlchemy source
## Quickstart recipe

Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.

For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).

```yml
source:
  type: <sql-source> # can be bigquery, snowflake, etc - see above for the list
  config:
    # ... any other source-specific options ...

    # Options
    profiling:
      enabled: true

sink:
  # sink configs
```

## Config details

Note that a `.` is used to denote nested fields in the YAML recipe.

| Field                   | Required | Default | Description |
| ----------------------- | -------- | ------- | ----------- |
| `profiling.enabled`     |          | `False` | Whether profiling should be done. |
| `profiling.limit`       |          |         | Max number of documents to profile. By default, profiles all documents. |
| `profiling.offset`      |          |         | Offset in documents to profile. By default, uses no offset. |
| `profile_pattern.allow` |          |         | Regex pattern for tables to profile. |
| `profile_pattern.deny`  |          |         | Regex pattern for tables to not profile. |
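
Since profiling can be expensive, it usually pays to scope it down. The sketch below layers the `profile_pattern` and `profiling.limit` options from the table above onto a Postgres source (the coordinates, regex, and limit are placeholders):

```yml
source:
  type: postgres # profiling is enabled on top of a SQL source
  config:
    host_port: localhost:5432
    database: DemoDatabase
    username: user
    password: pass

    profiling:
      enabled: true
      limit: 10000 # placeholder cap; see `profiling.limit` above

    # only profile tables matching this pattern
    profile_pattern:
      allow:
        - 'analytics\..*'

sink:
  # sink configs
```
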
## Compatibility

Coming soon!

## Questions

If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
62 metadata-ingestion/source_docs/sqlalchemy.md Normal file
@ -0,0 +1,62 @@
# Other SQLAlchemy databases

For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).

## Setup

To install this plugin, run `pip install 'acryl-datahub[sqlalchemy]'`.

The `sqlalchemy` source is useful if we don't have a pre-built source for your chosen
database system, but there is an [SQLAlchemy dialect](https://docs.sqlalchemy.org/en/14/dialects/)
defined elsewhere. In order to use this, you must `pip install` the required dialect packages yourself.

## Capabilities

This plugin extracts the following:

- Metadata for databases, schemas, views, and tables
- Column types associated with each table

## Quickstart recipe

Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.

For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).

```yml
source:
  type: sqlalchemy
  config:
    # Coordinates
    platform: my_platform # required; name of the platform, used when constructing URNs
    connect_uri: "dialect+driver://username:password@host:port/database"

sink:
  # sink configs
```
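
As a concrete sketch, suppose your database ships its SQLAlchemy dialect as a separate package. For instance, with ClickHouse you might install the `clickhouse-sqlalchemy` package (package name and URI shown as an assumption; check your dialect's own documentation) and then use a recipe like:

```yml
source:
  type: sqlalchemy
  config:
    platform: clickhouse # used when constructing URNs
    connect_uri: "clickhouse://user:pass@localhost:8123/default"

sink:
  # sink configs
```
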
## Config details

Note that a `.` is used to denote nested fields in the YAML recipe.

| Field                  | Required | Default  | Description |
| ---------------------- | -------- | -------- | ----------- |
| `platform`             | ✅       |          | Name of platform being ingested, used in constructing URNs. |
| `connect_uri`          | ✅       |          | URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls |
| `env`                  |          | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `options.<option>`     |          |          | Any options specified here will be passed to SQLAlchemy's `create_engine` as kwargs.<br />See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details. |
| `table_pattern.allow`  |          |          | Regex pattern for tables to include in ingestion. |
| `table_pattern.deny`   |          |          | Regex pattern for tables to exclude from ingestion. |
| `schema_pattern.allow` |          |          | Regex pattern for schemas to include in ingestion. |
| `schema_pattern.deny`  |          |          | Regex pattern for schemas to exclude from ingestion. |
| `view_pattern.allow`   |          |          | Regex pattern for views to include in ingestion. |
| `view_pattern.deny`    |          |          | Regex pattern for views to exclude from ingestion. |
| `include_tables`       |          | `True`   | Whether tables should be ingested. |
| `include_views`        |          | `True`   | Whether views should be ingested. |

## Compatibility

Coming soon!

## Questions

If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
57 metadata-ingestion/source_docs/superset.md Normal file
@ -0,0 +1,57 @@
# Superset

For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).

## Setup

To install this plugin, run `pip install 'acryl-datahub[superset]'`.

See the documentation for Superset's `/security/login` endpoint at https://superset.apache.org/docs/rest-api for more details on Superset's login API.

## Capabilities

This plugin extracts the following:

- Charts, dashboards, and associated metadata

## Quickstart recipe

Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.

For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).

```yml
source:
  type: superset
  config:
    # Coordinates
    connect_uri: http://localhost:8088

    # Credentials
    username: user
    password: pass
    provider: ldap

sink:
  # sink configs
```

## Config details

Note that a `.` is used to denote nested fields in the YAML recipe.

| Field         | Required | Default            | Description |
| ------------- | -------- | ------------------ | ----------- |
| `connect_uri` |          | `"localhost:8088"` | Superset host URL. |
| `username`    |          |                    | Superset username. |
| `password`    |          |                    | Superset password. |
| `provider`    |          | `"db"`             | Superset provider. |
| `env`         |          | `"PROD"`           | Environment to use in namespace when constructing URNs. |

## Compatibility

Coming soon!

## Questions

If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!