feat(docs): refactor source and sink ingestion docs (#3031)

This commit is contained in:
Kevin Hu 2021-08-08 16:40:51 -04:00 committed by GitHub
parent a7ea888612
commit 32b8fc6108
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
32 changed files with 2046 additions and 905 deletions

View File

@ -159,6 +159,13 @@ function markdown_guess_title(
} else {
// Find first h1 header and use it as the title.
const headers = contents.content.match(/^# (.+)$/gm);
if (!headers) {
throw new Error(
`${filepath} must have at least one h1 header for setting the title`
);
}
if (headers.length > 1 && contents.content.indexOf("```") < 0) {
throw new Error(`too many h1 headers in ${filepath}`);
}

View File

@ -55,6 +55,14 @@ module.exports = {
"docs/architecture/metadata-serving",
//"docs/what/gms",
],
"Metadata Ingestion": [
{
Sources: list_ids_in_directory("metadata-ingestion/source_docs"),
},
{
Sinks: list_ids_in_directory("metadata-ingestion/sink_docs"),
},
],
"Metadata Modeling": [
"docs/modeling/metadata-model",
"docs/modeling/extending-the-metadata-model",

View File

@ -40,7 +40,7 @@ Our open sourcing [blog post](https://engineering.linkedin.com/blog/2020/open-so
- **Schema history**: view and diff historic versions of schemas
- **GraphQL**: visualization of GraphQL schemas
### Jos/flows [*coming soon*]
### Jobs/flows [*coming soon*]
- **Search**: full-text & advanced search, search ranking
- **Browse**: browsing through a configurable hierarchy
- **Basic information**:

View File

@ -28,38 +28,47 @@ If you run into an error, try checking the [_common setup issues_](./developing.
#### Installing Plugins
We use a plugin architecture so that you can install only the dependencies you actually need.
We use a plugin architecture so that you can install only the dependencies you actually need. Click the plugin name to learn more about the specific source recipe and any FAQs!
| Plugin Name | Install Command | Provides |
| --------------- | ---------------------------------------------------------- | ----------------------------------- |
| file | _included by default_ | File source and sink |
| console | _included by default_ | Console sink |
| athena | `pip install 'acryl-datahub[athena]'` | AWS Athena source |
| bigquery | `pip install 'acryl-datahub[bigquery]'` | BigQuery source |
| bigquery-usage | `pip install 'acryl-datahub[bigquery-usage]'` | BigQuery usage statistics source |
| feast | `pip install 'acryl-datahub[feast]'` | Feast source |
| glue | `pip install 'acryl-datahub[glue]'` | AWS Glue source |
| hive | `pip install 'acryl-datahub[hive]'` | Hive source |
| mssql | `pip install 'acryl-datahub[mssql]'` | SQL Server source |
| mysql | `pip install 'acryl-datahub[mysql]'` | MySQL source |
| oracle | `pip install 'acryl-datahub[oracle]'` | Oracle source |
| postgres | `pip install 'acryl-datahub[postgres]'` | Postgres source |
| redshift | `pip install 'acryl-datahub[redshift]'` | Redshift source |
| sagemaker | `pip install 'acryl-datahub[sagemaker]'` | AWS SageMaker source |
| sqlalchemy | `pip install 'acryl-datahub[sqlalchemy]'` | Generic SQLAlchemy source |
| snowflake | `pip install 'acryl-datahub[snowflake]'` | Snowflake source |
| snowflake-usage | `pip install 'acryl-datahub[snowflake-usage]'` | Snowflake usage statistics source |
| sql-profiles | `pip install 'acryl-datahub[sql-profiles]'` | Data profiles for SQL-based systems |
| superset | `pip install 'acryl-datahub[superset]'` | Superset source |
| mongodb | `pip install 'acryl-datahub[mongodb]'` | MongoDB source |
| ldap | `pip install 'acryl-datahub[ldap]'` ([extra requirements]) | LDAP source |
| looker | `pip install 'acryl-datahub[looker]'` | Looker source |
| lookml | `pip install 'acryl-datahub[lookml]'` | LookML source, requires Python 3.7+ |
| kafka | `pip install 'acryl-datahub[kafka]'` | Kafka source |
| druid | `pip install 'acryl-datahub[druid]'` | Druid Source |
| dbt | `pip install 'acryl-datahub[dbt]'` | dbt source |
| datahub-rest | `pip install 'acryl-datahub[datahub-rest]'` | DataHub sink over REST API |
| datahub-kafka | `pip install 'acryl-datahub[datahub-kafka]'` | DataHub sink over Kafka |
Sources:
| Plugin Name | Install Command | Provides |
| ----------------------------------------------- | ---------------------------------------------------------- | ----------------------------------- |
| [file](./source_docs/file.md) | _included by default_ | File source and sink |
| [athena](./source_docs/athena.md) | `pip install 'acryl-datahub[athena]'` | AWS Athena source |
| [bigquery](./source_docs/bigquery.md) | `pip install 'acryl-datahub[bigquery]'` | BigQuery source |
| [bigquery-usage](./source_docs/bigquery.md) | `pip install 'acryl-datahub[bigquery-usage]'` | BigQuery usage statistics source |
| [dbt](./source_docs/dbt.md) | _no additional dependencies_ | dbt source |
| [druid](./source_docs/druid.md) | `pip install 'acryl-datahub[druid]'` | Druid Source |
| [feast](./source_docs/feast.md) | `pip install 'acryl-datahub[feast]'` | Feast source |
| [glue](./source_docs/glue.md) | `pip install 'acryl-datahub[glue]'` | AWS Glue source |
| [hive](./source_docs/hive.md) | `pip install 'acryl-datahub[hive]'` | Hive source |
| [kafka](./source_docs/kafka.md) | `pip install 'acryl-datahub[kafka]'` | Kafka source |
| [kafka-connect](./source_docs/kafka-connect.md) | `pip install 'acryl-datahub[kafka-connect]'` | Kafka connect source |
| [ldap](./source_docs/ldap.md) | `pip install 'acryl-datahub[ldap]'` ([extra requirements]) | LDAP source |
| [looker](./source_docs/looker.md) | `pip install 'acryl-datahub[looker]'` | Looker source |
| [lookml](./source_docs/lookml.md) | `pip install 'acryl-datahub[lookml]'` | LookML source, requires Python 3.7+ |
| [mongodb](./source_docs/mongodb.md) | `pip install 'acryl-datahub[mongodb]'` | MongoDB source |
| [mssql](./source_docs/mssql.md) | `pip install 'acryl-datahub[mssql]'` | SQL Server source |
| [mysql](./source_docs/mysql.md) | `pip install 'acryl-datahub[mysql]'` | MySQL source |
| [oracle](./source_docs/oracle.md) | `pip install 'acryl-datahub[oracle]'` | Oracle source |
| [postgres](./source_docs/postgres.md) | `pip install 'acryl-datahub[postgres]'` | Postgres source |
| [redshift](./source_docs/redshift.md) | `pip install 'acryl-datahub[redshift]'` | Redshift source |
| [sagemaker](./source_docs/sagemaker.md) | `pip install 'acryl-datahub[sagemaker]'` | AWS SageMaker source |
| [snowflake](./source_docs/snowflake.md) | `pip install 'acryl-datahub[snowflake]'` | Snowflake source |
| [snowflake-usage](./source_docs/snowflake.md) | `pip install 'acryl-datahub[snowflake-usage]'` | Snowflake usage statistics source |
| sql-profiles | `pip install 'acryl-datahub[sql-profiles]'` | Data profiles for SQL-based systems |
| [sqlalchemy](./source_docs/sqlalchemy.md) | `pip install 'acryl-datahub[sqlalchemy]'` | Generic SQLAlchemy source |
| [superset](./source_docs/superset.md) | `pip install 'acryl-datahub[superset]'` | Superset source |
Sinks:
| Plugin Name | Install Command | Provides |
| --------------------------------------- | -------------------------------------------- | -------------------------- |
| [file](./sink_docs/file.md) | _included by default_ | File source and sink |
| [console](./sink_docs/console.md) | _included by default_ | Console sink |
| [datahub-rest](./sink_docs/datahub.md) | `pip install 'acryl-datahub[datahub-rest]'` | DataHub sink over REST API |
| [datahub-kafka](./sink_docs/datahub.md) | `pip install 'acryl-datahub[datahub-kafka]'` | DataHub sink over Kafka |
These plugins can be mixed and matched as desired. For example:
@ -137,875 +146,7 @@ Running a recipe is quite easy.
datahub ingest -c ./examples/recipes/mssql_to_datahub.yml
```
A number of recipes are included in the examples/recipes directory.
## Sources
### Kafka Metadata `kafka`
Extracts:
- List of topics - from the Kafka broker
- Schemas associated with each topic - from the schema registry
```yml
source:
type: "kafka"
config:
connection:
bootstrap: "broker:9092"
consumer_config: {} # passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#confluent_kafka.DeserializingConsumer
schema_registry_url: http://localhost:8081
schema_registry_config: {} # passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#confluent_kafka.schema_registry.SchemaRegistryClient
```
The options in the consumer config and schema registry config are passed to the Kafka DeserializingConsumer and SchemaRegistryClient respectively.
For a full example with a number of security options, see this [example recipe](./examples/recipes/secured_kafka.yml).
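As a rough inline sketch of how those pass-through options are used (the property names below are standard confluent-kafka settings, but the values are placeholders and should be adapted to your own cluster), a SASL-secured source might look like:
```yml
source:
  type: "kafka"
  config:
    connection:
      bootstrap: "broker:9092"
      consumer_config:
        # Passed straight through to the DeserializingConsumer; placeholder values.
        security.protocol: "SASL_SSL"
        sasl.mechanism: "PLAIN"
        sasl.username: "my_cluster_api_key"
        sasl.password: "my_cluster_api_secret"
      schema_registry_url: "http://localhost:8081"
      schema_registry_config:
        # Passed straight through to the SchemaRegistryClient; placeholder value.
        basic.auth.user.info: "my_registry_api_key:my_registry_api_secret"
```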
### MySQL Metadata `mysql`
Extracts:
- List of databases and tables
- Column types and schema associated with each table
```yml
source:
type: mysql
config:
username: root
password: example
database: dbname
host_port: localhost:3306
table_pattern:
deny:
# Note that the deny patterns take precedence over the allow patterns.
- "performance_schema"
allow:
- "schema1.table2"
# Although 'table_pattern' can be used to skip everything in certain schemas, the
# schema-level allow/deny option is an optimization for when there are many schemas
# to skip: it avoids needlessly fetching their tables only to filter them out
# afterwards via the table_pattern.
schema_pattern:
deny:
- "garbage_schema"
allow:
- "schema1"
```
### Microsoft SQL Server Metadata `mssql`
We have two options for the underlying library used to connect to SQL Server: (1) [python-tds](https://github.com/denisenkom/pytds) and (2) [pyodbc](https://github.com/mkleehammer/pyodbc). The TDS library is pure Python and hence easier to install, but only PyODBC supports encrypted connections.
Extracts:
- List of databases, schema, tables and views
- Column types associated with each table/view
```yml
source:
type: mssql
config:
username: user
password: pass
host_port: localhost:1433
database: DemoDatabase
include_views: True # whether to include views, defaults to True
table_pattern:
deny:
- "^.*\\.sys_.*" # deny all tables that start with sys_
allow:
- "schema1.table1"
- "schema1.table2"
options:
# Any options specified here will be passed to SQLAlchemy's create_engine as kwargs.
# See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details.
# Many of these options are specific to the underlying database driver, so that library's
# documentation will be a good reference for what is supported. To find which dialect is likely
# in use, consult this table: https://docs.sqlalchemy.org/en/14/dialects/index.html.
charset: "utf8"
# If set to true, we'll use the pyodbc library. This requires you to have
# already installed the Microsoft ODBC Driver for SQL Server.
# See https://docs.microsoft.com/en-us/sql/connect/python/pyodbc/step-1-configure-development-environment-for-pyodbc-python-development?view=sql-server-ver15
use_odbc: False
uri_args: {}
```
<details>
<summary>Example: using ingestion with ODBC and encryption</summary>
This requires you to have already installed the Microsoft ODBC Driver for SQL Server.
See https://docs.microsoft.com/en-us/sql/connect/python/pyodbc/step-1-configure-development-environment-for-pyodbc-python-development?view=sql-server-ver15
```yml
source:
type: mssql
config:
# See https://docs.sqlalchemy.org/en/14/dialects/mssql.html#module-sqlalchemy.dialects.mssql.pyodbc
use_odbc: True
username: user
password: pass
host_port: localhost:1433
database: DemoDatabase
include_views: True # whether to include views, defaults to True
uri_args:
# See https://docs.microsoft.com/en-us/sql/connect/odbc/dsn-connection-string-attribute?view=sql-server-ver15
driver: "ODBC Driver 17 for SQL Server"
Encrypt: "yes"
TrustServerCertificate: "Yes"
ssl: "True"
# Trusted_Connection: "yes"
```
</details>
### Hive `hive`
Extracts:
- List of databases, schema, and tables
- Column types associated with each table
- Detailed table and storage information
```yml
source:
type: hive
config:
# For more details on authentication, see the PyHive docs:
# https://github.com/dropbox/PyHive#passing-session-configuration.
# LDAP, Kerberos, etc. are supported using connect_args, which can be
# added under the `options` config parameter.
#scheme: 'hive+http' # set this if Thrift should use the HTTP transport
#scheme: 'hive+https' # set this if Thrift should use the HTTP with SSL transport
username: user # optional
password: pass # optional
host_port: localhost:10000
database: DemoDatabase # optional, if not specified, ingests from all databases
# table_pattern/schema_pattern is same as above
# options is same as above
```
<details>
<summary>Example: using ingestion with Azure HDInsight</summary>
```yml
# Connecting to Microsoft Azure HDInsight using TLS.
source:
type: hive
config:
scheme: "hive+https"
host_port: <cluster_name>.azurehdinsight.net:443
username: admin
password: "<password>"
options:
connect_args:
http_path: "/hive2"
auth: BASIC
# table_pattern/schema_pattern is same as above
```
</details>
### PostgreSQL `postgres`
Extracts:
- List of databases, schema, and tables
- Column types associated with each table
- Also supports PostGIS extensions
- database_alias (optional) can be used to change the name of the database to be ingested
```yml
source:
type: postgres
config:
username: user
password: pass
host_port: localhost:5432
database: DemoDatabase
database_alias: DatabaseNameToBeIngested
include_views: True # whether to include views, defaults to True
# table_pattern/schema_pattern is same as above
# options is same as above
```
### Redshift `redshift`
Extracts:
- List of databases, schema, and tables
- Column types associated with each table
- Also supports PostGIS extensions
```yml
source:
type: redshift
config:
username: user
password: pass
host_port: example.something.us-west-2.redshift.amazonaws.com:5439
database: DemoDatabase
include_views: True # whether to include views, defaults to True
# table_pattern/schema_pattern is same as above
# options is same as above
```
<details>
<summary>Extra options when running Redshift behind a proxy</summary>
These options are passed to the underlying database driver via SQLAlchemy's `connect_args`.
```yml
source:
type: redshift
config:
# username, password, database, etc are all the same as above
host_port: my-proxy-hostname:5439
options:
connect_args:
sslmode: "prefer" # or "require" or "verify-ca"
sslrootcert: ~ # needed to unpin the AWS Redshift certificate
```
</details>
### AWS SageMaker `sagemaker`
Extracts:
- Feature groups
- Models, jobs, and lineage between the two (e.g. when jobs output a model or a model is used by a job)
```yml
source:
type: sagemaker
config:
aws_region: # aws_region_name, i.e. "eu-west-1"
env: # environment for the DatasetSnapshot URN, one of "DEV", "EI", "PROD" or "CORP". Defaults to "PROD".
# Credentials. If not specified here, these are picked up according to boto3 rules.
# (see https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html)
aws_access_key_id: # Optional.
aws_secret_access_key: # Optional.
aws_session_token: # Optional.
aws_role: # Optional (Role chaining supported by using a sorted list).
extract_feature_groups: True # if feature groups should be ingested, default True
extract_models: True # if models should be ingested, default True
extract_jobs: # if jobs should be ingested, default True for all
auto_ml: True
compilation: True
edge_packaging: True
hyper_parameter_tuning: True
labeling: True
processing: True
training: True
transform: True
```
### Snowflake `snowflake`
Extracts:
- List of databases, schema, and tables
- Column types associated with each table
```yml
source:
type: snowflake
config:
username: user
password: pass
host_port: account_name
database_pattern:
# Note: the $ symbols in the deny patterns are escaped (\$) to prevent environment variable substitution.
allow:
- ^MY_DEMO_DATA.*
- ^ANOTHER_DB_REGEX
deny:
- ^SNOWFLAKE\$
- ^SNOWFLAKE_SAMPLE_DATA\$
warehouse: "COMPUTE_WH" # optional
role: "sysadmin" # optional
include_views: True # whether to include views, defaults to True
# table_pattern/schema_pattern is same as above
# options is same as above
```
:::tip
You can also get fine-grained usage statistics for Snowflake using the `snowflake-usage` source.
:::
### SQL Profiles `sql-profiles`
The SQL-based profiler does not run alone, but rather can be enabled for other SQL-based sources.
Enabling profiling will slow down ingestion runs.
Extracts:
- row and column counts for each table
- for each column, if applicable:
- null counts and proportions
- distinct counts and proportions
- minimum, maximum, mean, median, standard deviation, some quantile values
- histograms or frequencies of unique values
Supported SQL sources:
- AWS Athena
- BigQuery
- Druid
- Hive
- Microsoft SQL Server
- MySQL
- Oracle
- Postgres
- Redshift
- Snowflake
- Generic SQLAlchemy source
```yml
source:
type: <sql-source> # can be bigquery, snowflake, etc - see above for the list
config:
# username, password, etc - varies by source type
profiling:
enabled: true
limit: 1000 # optional - max rows to profile
offset: 0 # optional - offset of first row to profile
profile_pattern:
deny:
# Skip all tables ending with "_staging"
- _staging\$
allow:
# Profile all tables in "myschema" that start with "gold_"
- myschema\.gold_.*
# If you only want profiles (but no catalog information), set these to false
include_tables: true
include_views: true
```
:::caution
Running profiling against many tables or over many rows can run up significant costs.
While we've done our best to limit the cost of the queries the profiler runs, you
should be prudent about which tables profiling is enabled on and how often the
profiling runs are executed.
:::
### Superset `superset`
Extracts:
- List of charts and dashboards
```yml
source:
type: superset
config:
username: user
password: pass
provider: db | ldap
connect_uri: http://localhost:8088
env: "PROD" # Optional, default is "PROD"
```
See the documentation for Superset's `/security/login` endpoint at https://superset.apache.org/docs/rest-api for more details on Superset's login API.
### Oracle `oracle`
Extracts:
- List of databases, schema, and tables
- Column types associated with each table
Using the Oracle source requires that you've also installed the correct drivers; see the [cx_Oracle docs](https://cx-oracle.readthedocs.io/en/latest/user_guide/installation.html). The easiest one is the [Oracle Instant Client](https://www.oracle.com/database/technologies/instant-client.html).
```yml
source:
type: oracle
config:
# For more details on authentication, see the documentation:
# https://docs.sqlalchemy.org/en/14/dialects/oracle.html#dialect-oracle-cx_oracle-connect and
# https://cx-oracle.readthedocs.io/en/latest/user_guide/connection_handling.html#connection-strings.
username: user
password: pass
host_port: localhost:5432
database: dbname
service_name: svc # omit database if using this option
include_views: True # whether to include views, defaults to True
# table_pattern/schema_pattern is same as above
# options is same as above
```
### Feast `feast`
**Note: Feast ingestion requires Docker to be installed.**
Extracts:
- List of feature tables (modeled as [`MLFeatureTable`](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/ml/metadata/MLFeatureTableProperties.pdl)s),
features ([`MLFeature`](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/ml/metadata/MLFeatureProperties.pdl)s),
and entities ([`MLPrimaryKey`](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/ml/metadata/MLPrimaryKeyProperties.pdl)s)
- Column types associated with each feature and entity
Note: this uses a separate Docker container to extract Feast's metadata into a JSON file, which is then
parsed into DataHub's native objects. This separation was done because of a dependency conflict in the `feast` module.
```yml
source:
type: feast
config:
core_url: localhost:6565 # default
env: "PROD" # Optional, default is "PROD"
use_local_build: False # Whether to build Feast ingestion image locally, default is False
```
### Google BigQuery `bigquery`
Extracts:
- List of databases, schema, and tables
- Column types associated with each table
```yml
source:
type: bigquery
config:
project_id: project # optional - can autodetect from environment
options: # options is same as above
# See https://github.com/mxmzdlv/pybigquery#authentication for details.
credentials_path: "/path/to/keyfile.json" # optional
include_views: True # whether to include views, defaults to True
# table_pattern/schema_pattern is same as above
```
:::tip
You can also get fine-grained usage statistics for BigQuery using the `bigquery-usage` source.
:::
### AWS Athena `athena`
Extracts:
- List of databases and tables
- Column types associated with each table
```yml
source:
type: athena
config:
username: aws_access_key_id # Optional. If not specified, credentials are picked up according to boto3 rules.
# See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
password: aws_secret_access_key # Optional.
database: database # Optional, defaults to "default"
aws_region: aws_region_name # i.e. "eu-west-1"
s3_staging_dir: s3_location # "s3://<bucket-name>/prefix/"
# The s3_staging_dir parameter is needed because Athena always writes query results to S3.
# See https://docs.aws.amazon.com/athena/latest/ug/querying.html
# However, the Athena driver will transparently fetch these results as you would expect from any other SQL client.
work_group: athena_workgroup # "primary"
# table_pattern/schema_pattern is same as above
```
### AWS Glue `glue`
Note: if you also have files in S3 that you'd like to ingest, we recommend you use Glue's built-in data catalog. See [here](./s3-ingestion.md) for a quick guide on how to set up a crawler on Glue and ingest the outputs with DataHub.
Extracts:
- List of tables
- Column types associated with each table
- Table metadata, such as owner, description and parameters
- Jobs and their component transformations, data sources, and data sinks
```yml
source:
type: glue
config:
aws_region: # aws_region_name, i.e. "eu-west-1"
extract_transforms: True # whether to ingest Glue jobs, defaults to True
env: # environment for the DatasetSnapshot URN, one of "DEV", "EI", "PROD" or "CORP". Defaults to "PROD".
# Filtering patterns for databases and tables to scan
database_pattern: # Optional, to filter databases scanned, same as schema_pattern above.
table_pattern: # Optional, to filter tables scanned, same as table_pattern above.
# Credentials. If not specified here, these are picked up according to boto3 rules.
# (see https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html)
aws_access_key_id: # Optional.
aws_secret_access_key: # Optional.
aws_session_token: # Optional.
aws_role: # Optional (Role chaining supported by using a sorted list).
underlying_platform: # Optional (can change the platform name to athena)
```
### Druid `druid`
Extracts:
- List of databases, schema, and tables
- Column types associated with each table
**Note**: It is important to explicitly define a deny schema pattern for the internal Druid databases (lookup & sys)
if adding a schema pattern; otherwise the crawler may crash before processing relevant databases.
This deny pattern is defined by default but is overridden by user-submitted configurations.
```yml
source:
type: druid
config:
# Point to broker address
host_port: localhost:8082
schema_pattern:
deny:
- "^(lookup|sys).*"
# options is same as above
```
### Other databases using SQLAlchemy `sqlalchemy`
The `sqlalchemy` source is useful if we don't have a pre-built source for your chosen
database system, but there is an [SQLAlchemy dialect](https://docs.sqlalchemy.org/en/14/dialects/)
defined elsewhere. In order to use this, you must `pip install` the required dialect packages yourself.
Extracts:
- List of schemas and tables
- Column types associated with each table
```yml
source:
type: sqlalchemy
config:
# See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls
connect_uri: "dialect+driver://username:password@host:port/database"
options: {} # same as above
schema_pattern: {} # same as above
table_pattern: {} # same as above
include_views: True # whether to include views, defaults to True
```
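For instance, a hypothetical recipe using the Presto dialect shipped with the PyHive package (installed separately, e.g. via `pip install 'pyhive[presto]'`) might look like the following sketch; the connection details are placeholders:
```yml
source:
  type: sqlalchemy
  config:
    # PyHive registers the "presto" dialect; the URI format is
    # presto://<user>@<host>:<port>/<catalog>/<schema>
    connect_uri: "presto://analyst@localhost:8080/hive/default"
    schema_pattern:
      deny:
        - "information_schema"
    include_views: True
```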
### MongoDB `mongodb`
Extracts:
- List of databases
- List of collections in each database and infers schemas for each collection
By default, schema inference samples 1,000 documents from each collection. Setting `schemaSamplingSize: null` will scan the entire collection.
Moreover, setting `useRandomSampling: False` will sample the first documents found without random selection, which may be faster for large collections.
Note that `schemaSamplingSize` has no effect if `enableSchemaInference: False` is set.
```yml
source:
type: "mongodb"
config:
# For advanced configurations, see the MongoDB docs.
# https://pymongo.readthedocs.io/en/stable/examples/authentication.html
connect_uri: "mongodb://localhost"
username: admin
password: password
env: "PROD" # Optional, default is "PROD"
authMechanism: "DEFAULT"
options: {}
database_pattern: {}
collection_pattern: {}
enableSchemaInference: True
schemaSamplingSize: 1000
useRandomSampling: True # whether to randomly sample docs for schema or just use the first ones, True by default
# database_pattern/collection_pattern are similar to schema_pattern/table_pattern from above
```
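To make the sampling trade-offs above concrete, a variant that scans every document instead of sampling (only advisable for small collections) could look like this sketch:
```yml
source:
  type: "mongodb"
  config:
    connect_uri: "mongodb://localhost"
    enableSchemaInference: True
    # Scan every document when inferring schemas, instead of sampling 1,000 of them.
    schemaSamplingSize: null
```
Conversely, for very large collections you might keep the default sample size but set `useRandomSampling: False` so that the first documents found are used rather than a random sample.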
### LDAP `ldap`
Extracts:
- List of people
- Names, emails, titles, and manager information for each person
- List of groups
```yml
source:
type: "ldap"
config:
ldap_server: ldap://localhost
ldap_user: "cn=admin,dc=example,dc=org"
ldap_password: "admin"
base_dn: "dc=example,dc=org"
filter: "(objectClass=*)" # optional field
drop_missing_first_last_name: False # optional
```
Set `drop_missing_first_last_name` to true if you have many "headless" LDAP accounts for devices or services
that should be excluded when they do not contain a first and last name. This only impacts the ingestion of
LDAP users; LDAP groups are unaffected by this config option.
### LookML `lookml`
Note! This plugin uses a package that requires Python 3.7+!
Extracts:
- LookML views from model files
- Name, upstream table names, dimensions, measures, and dimension groups
```yml
source:
type: "lookml"
config:
base_folder: /path/to/model/files # where the *.model.lkml and *.view.lkml files are stored
connection_to_platform_map: # mappings between connection names in the model files to platform names
connection_name: platform_name (or platform_name.database_name) # for ex. my_snowflake_conn: snowflake.my_database
model_pattern: {}
view_pattern: {}
env: "PROD" # optional, default is "PROD"
parse_table_names_from_sql: False # see note below
platform_name: "looker" # optional, default is "looker"
```
Note! The integration can use [`sql-metadata`](https://pypi.org/project/sql-metadata/) to try to parse the tables the
views depend on. As these SQL statements can be complicated, and the package doesn't officially support all the SQL
dialects that Looker supports, the result might not be correct. This parsing is disabled by default, but can be enabled
by setting `parse_table_names_from_sql: True`.
### Looker dashboards `looker`
Extracts:
- Looker dashboards and dashboard elements (charts)
- Names, descriptions, URLs, chart types, input view for the charts
See the [Looker authentication docs](https://docs.looker.com/reference/api-and-integration/api-auth#authentication_with_an_sdk) for the steps to create a client ID and secret.
```yml
source:
type: "looker"
config:
client_id: # Your Looker API3 client ID
client_secret: # Your Looker API3 client secret
base_url: # The url to your Looker instance: https://company.looker.com:19999 or https://looker.company.com, or similar.
dashboard_pattern: # supports allow/deny regexes
chart_pattern: # supports allow/deny regexes
actor: urn:li:corpuser:etl # Optional, defaults to urn:li:corpuser:etl
env: "PROD" # Optional, default is "PROD"
platform_name: "looker" # Optional, default is "looker"
```
### File `file`
Pulls metadata from a previously generated file. Note that the file sink
can produce such files, and a number of samples are included in the
[examples/mce_files](examples/mce_files) directory.
```yml
source:
type: file
config:
filename: ./path/to/mce/file.json
```
### dbt `dbt`
Pull metadata from dbt artifacts files:
- [dbt manifest file](https://docs.getdbt.com/reference/artifacts/manifest-json)
- This file contains model, source and lineage data.
- [dbt catalog file](https://docs.getdbt.com/reference/artifacts/catalog-json)
- This file contains schema data.
- dbt does not record schema data for ephemeral models. As such, DataHub will show ephemeral models in the lineage, but there will be no associated schema for them
- [dbt sources file](https://docs.getdbt.com/reference/artifacts/sources-json)
- This file contains metadata for sources with freshness checks.
- We transfer dbt's freshness checks to DataHub's last-modified fields.
- Note that this file is optional; if not specified, we'll use the time of ingestion as a proxy for the last-modified time.
- target_platform:
- The data platform you are enriching with dbt metadata.
- [data platforms](https://github.com/linkedin/datahub/blob/master/gms/impl/src/main/resources/DataPlatformInfo.json)
- load_schemas:
- Load schemas from dbt catalog file, not necessary when the underlying data platform already has this data.
- node_type_pattern:
- Use this filter to include or exclude node types using allow/deny patterns.
```yml
source:
type: "dbt"
config:
manifest_path: "./path/dbt/manifest_file.json"
catalog_path: "./path/dbt/catalog_file.json"
sources_path: "./path/dbt/sources_file.json" # (optional, used for freshness checks)
target_platform: "postgres" # optional, eg "postgres", "snowflake", etc.
load_schemas: True or False
node_type_pattern: # optional
deny:
- ^test.*
allow:
- ^.*
```
Note: when `load_schemas` is False, models that use [identifiers](https://docs.getdbt.com/reference/resource-properties/identifier) to reference their source tables are ingested using the model identifier as the model name to preserve the lineage.
### Google BigQuery Usage Stats `bigquery-usage`
- Fetch a list of queries issued
- Fetch a list of tables and columns accessed
- Aggregate these statistics into buckets, by day or hour granularity
Note: the client must have one of the following OAuth scopes, and should be authorized on all projects you'd like to ingest usage stats from.
- https://www.googleapis.com/auth/logging.read
- https://www.googleapis.com/auth/logging.admin
- https://www.googleapis.com/auth/cloud-platform.read-only
- https://www.googleapis.com/auth/cloud-platform
```yml
source:
type: bigquery-usage
config:
projects: # optional - can autodetect a single project from the environment
- project_id_1
- project_id_2
options:
# See https://googleapis.dev/python/logging/latest/client.html for details.
credentials: ~ # optional - see docs
env: PROD
bucket_duration: "DAY"
start_time: ~ # defaults to the last full day in UTC (or hour)
end_time: ~ # defaults to the last full day in UTC (or hour)
top_n_queries: 10 # number of queries to save for each table
```
:::note
This source only does usage statistics. To get the tables, views, and schemas in your BigQuery project, use the `bigquery` source.
:::
### Snowflake Usage Stats `snowflake-usage`
- Fetch a list of queries issued
- Fetch a list of tables and columns accessed (excludes views)
- Aggregate these statistics into buckets, by day or hour granularity
Note: the user/role must have access to the account usage table. The "accountadmin" role has this by default, and other roles can be [granted this permission](https://docs.snowflake.com/en/sql-reference/account-usage.html#enabling-account-usage-for-other-roles).
Note: the underlying access history views that we use are only available in Snowflake's enterprise edition or higher.
```yml
source:
type: snowflake-usage
config:
username: user
password: pass
host_port: account_name
role: ACCOUNTADMIN
env: PROD
bucket_duration: "DAY"
start_time: ~ # defaults to the last full day in UTC (or hour)
end_time: ~ # defaults to the last full day in UTC (or hour)
top_n_queries: 10 # number of queries to save for each table
```
:::note
This source only does usage statistics. To get the tables, views, and schemas in your Snowflake warehouse, ingest using the `snowflake` source.
:::
### Kafka Connect `kafka-connect`
Extracts:
- Kafka Connect connectors as individual `DataFlowSnapshotClass` entities
- Individual `DataJobSnapshotClass` entities using `{connector_name}:{source_dataset}` naming
- Lineage information between the source database and the Kafka topic
```yml
source:
type: "kafka-connect"
config:
connect_uri: "http://localhost:8083"
cluster_name: "connect-cluster"
connector_patterns:
deny:
- ^denied-connector.*
allow:
- ^allowed-connector.*
```
Current limitations:
- Currently works only for Debezium source connectors.
## Sinks
### DataHub Rest `datahub-rest`
Pushes metadata to DataHub using the GMA REST API. The advantage of the REST-based interface
is that any errors can immediately be reported.
```yml
sink:
type: "datahub-rest"
config:
server: "http://localhost:8080"
```
### DataHub Kafka `datahub-kafka`
Pushes metadata to DataHub by publishing messages to Kafka. The advantage of the Kafka-based
interface is that it's asynchronous and can handle higher throughput. This requires the
DataHub mce-consumer container to be running.
```yml
sink:
type: "datahub-kafka"
config:
connection:
bootstrap: "localhost:9092"
producer_config: {} # passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#confluent_kafka.SerializingProducer
schema_registry_url: "http://localhost:8081"
schema_registry_config: {} # passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#confluent_kafka.schema_registry.SchemaRegistryClient
```
The options in the producer config and schema registry config are passed to the Kafka SerializingProducer and SchemaRegistryClient respectively.
For a full example with a number of security options, see this [example recipe](./examples/recipes/secured_kafka.yml).
### Console `console`
Simply prints each metadata event to stdout. Useful for experimentation and debugging purposes.
```yml
sink:
type: "console"
```
### File `file`
Outputs metadata to a file. This can be used to decouple metadata sourcing from the
process of pushing it into DataHub, and is particularly useful for debugging purposes.
Note that the file source can read files generated by this sink.
```yml
sink:
type: file
config:
filename: ./path/to/mce/file.json
```
A number of recipes are included in the [examples/recipes](./examples/recipes) directory. For full info and context on each source and sink, see the pages described in the [table of plugins](#installing-plugins).
## Transformations
@ -1040,10 +181,13 @@ If you're simply looking to run ingestion on a schedule, take a look at these sa
The Airflow lineage backend is only supported in Airflow 1.10.15+ and 2.0.2+.
:::
1. You need to install the required dependency in your airflow. See https://registry.astronomer.io/providers/datahub/modules/datahublineagebackend
```shell
pip install acryl-datahub[airflow]
```
```shell
pip install acryl-datahub[airflow]
```
2. You must configure an Airflow hook for Datahub. We support both a Datahub REST hook and a Kafka-based hook, but you only need one.
```shell

View File

@ -13,7 +13,6 @@ source:
collection_pattern: {}
enableSchemaInference: True
schemaSamplingSize: 1000
# database_pattern/collection_pattern are similar to schema_pattern/table_pattern from above
sink:
type: "datahub-rest"
config:

View File

@ -0,0 +1,33 @@
# Console
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
Works with `acryl-datahub` out of the box.
## Capabilities
Simply prints each metadata event to stdout. Useful for experimentation and debugging purposes.
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
# source configs
sink:
type: "console"
```
## Config details
None!
## Questions
If you've got any questions on configuring this sink, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,87 @@
# DataHub
## DataHub Rest
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
### Setup
To install this plugin, run `pip install 'acryl-datahub[datahub-rest]'`.
### Capabilities
Pushes metadata to DataHub using the GMA REST API. The advantage of the REST-based interface
is that any errors can immediately be reported.
### Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
# source configs
sink:
type: "datahub-rest"
config:
server: "http://localhost:8080"
```
### Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| -------- | -------- | ------- | ---------------------------- |
| `server` | ✅ | | URL of DataHub GMS endpoint. |
## DataHub Kafka
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
### Setup
To install this plugin, run `pip install 'acryl-datahub[datahub-kafka]'`.
### Capabilities
Pushes metadata to DataHub by publishing messages to Kafka. The advantage of the Kafka-based
interface is that it's asynchronous and can handle higher throughput.
### Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
# source configs
sink:
type: "datahub-kafka"
config:
connection:
bootstrap: "localhost:9092"
schema_registry_url: "http://localhost:8081"
```
### Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| -------------------------------------------- | -------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `connection.bootstrap` | ✅ | | Kafka bootstrap URL. |
| `connection.producer_config.<option>` | | | Passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#confluent_kafka.SerializingProducer |
| `connection.schema_registry_url` | ✅ | | URL of schema registry being used. |
| `connection.schema_registry_config.<option>` | | | Passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#confluent_kafka.schema_registry.SchemaRegistryClient |
The options in the producer config and schema registry config are passed to the Kafka SerializingProducer and SchemaRegistryClient respectively.
For a full example with a number of security options, see this [example recipe](../examples/recipes/secured_kafka.yml).
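As an illustrative sketch of how those pass-through options are used (standard confluent-kafka property names, placeholder values — adapt them to your own brokers and registry), a SASL-secured sink might look like:
```yml
source:
  # source configs
sink:
  type: "datahub-kafka"
  config:
    connection:
      bootstrap: "broker:9092"
      producer_config:
        # Passed straight through to the SerializingProducer; placeholder values.
        security.protocol: "SASL_SSL"
        sasl.mechanism: "PLAIN"
        sasl.username: "my_cluster_api_key"
        sasl.password: "my_cluster_api_secret"
      schema_registry_url: "http://localhost:8081"
      schema_registry_config:
        # Passed straight through to the SchemaRegistryClient; placeholder value.
        basic.auth.user.info: "my_registry_api_key:my_registry_api_secret"
```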
## Questions
If you've got any questions on configuring this sink, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,41 @@
# File
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
Works with `acryl-datahub` out of the box.
## Capabilities
Outputs metadata to a file. This can be used to decouple metadata sourcing from the
process of pushing it into DataHub, and is particularly useful for debugging purposes.
Note that the [file source](../source_docs/file.md) can read files generated by this sink.
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
# source configs
sink:
type: file
config:
filename: ./path/to/mce/file.json
```
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| -------- | -------- | ------- | ------------------------- |
| filename | ✅ | | Path to file to write to. |
## Questions
If you've got any questions on configuring this sink, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,70 @@
# Athena
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
To install this plugin, run `pip install 'acryl-datahub[athena]'`.
## Capabilities
This plugin extracts the following:
- Metadata for databases, schemas, and tables
- Column types associated with each table
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: athena
config:
# Coordinates
aws_region: my_aws_region_name
work_group: my_work_group
# Credentials
username: my_aws_access_key_id
password: my_aws_secret_access_key
database: my_database
# Options
s3_staging_dir: "s3://<bucket-name>/<folder>/"
sink:
# sink configs
```
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| ---------------------- | -------- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `username` | | Autodetected | Username credential. If not specified, detected with boto3 rules. See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `password` | | Autodetected | Same detection scheme as `username` |
| `database` | | Autodetected | |
| `aws_region` | ✅ | | AWS region code. |
| `s3_staging_dir` | ✅ | | Of format `"s3://<bucket-name>/prefix/"`. The `s3_staging_dir` parameter is needed because Athena always writes query results to S3. <br />See https://docs.aws.amazon.com/athena/latest/ug/querying.html. |
| `work_group` | ✅ | | Name of Athena workgroup. <br />See https://docs.aws.amazon.com/athena/latest/ug/manage-queries-control-costs-with-workgroups.html. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `options.<option>` | | | Any options specified here will be passed to SQLAlchemy's `create_engine` as kwargs.<br />See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details. |
| `table_pattern.allow` | | | Regex pattern for tables to include in ingestion. |
| `table_pattern.deny` | | | Regex pattern for tables to exclude from ingestion. |
| `schema_pattern.allow` | | | Regex pattern for schemas to include in ingestion. |
| `schema_pattern.deny` | | | Regex pattern for schemas to exclude from ingestion. |
| `view_pattern.allow` | | | Regex pattern for views to include in ingestion. |
| `view_pattern.deny` | | | Regex pattern for views to exclude from ingestion. |
| `include_tables` | | `True` | Whether tables should be ingested. |
## Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,135 @@
# BigQuery
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
To install this plugin, run `pip install 'acryl-datahub[bigquery]'`.
## Capabilities
This plugin extracts the following:
- Metadata for databases, schemas, and tables
- Column types associated with each table
:::tip
You can also get fine-grained usage statistics for BigQuery using the `bigquery-usage` source described below.
:::
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: bigquery
config:
# Coordinates
project_id: my_project_id
sink:
# sink configs
```
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| ---------------------- | -------- | ------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `project_id` | | Autodetected | Project ID to ingest from. If not specified, will infer from environment. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `options.<option>` | | | Any options specified here will be passed to SQLAlchemy's `create_engine` as kwargs.<br />See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details. |
| `table_pattern.allow` | | | Regex pattern for tables to include in ingestion. |
| `table_pattern.deny` | | | Regex pattern for tables to exclude from ingestion. |
| `schema_pattern.allow` | | | Regex pattern for schemas to include in ingestion. |
| `schema_pattern.deny` | | | Regex pattern for schemas to exclude from ingestion. |
| `view_pattern.allow` | | | Regex pattern for views to include in ingestion. |
| `view_pattern.deny` | | | Regex pattern for views to exclude from ingestion. |
| `include_tables` | | `True` | Whether tables should be ingested. |
| `include_views` | | `True` | Whether views should be ingested. |
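If you need to authenticate with a service-account keyfile rather than application-default credentials, pybigquery's `credentials_path` option can be passed through `options`; a sketch (the keyfile path is a placeholder):
```yml
source:
  type: bigquery
  config:
    # Coordinates
    project_id: my_project_id
    # Options passed to SQLAlchemy's create_engine / pybigquery.
    # See https://github.com/mxmzdlv/pybigquery#authentication for details.
    options:
      credentials_path: "/path/to/keyfile.json"
sink:
  # sink configs
```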
## Compatibility
Coming soon!
## BigQuery Usage Stats
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
### Setup
To install this plugin, run `pip install 'acryl-datahub[bigquery-usage]'`.
### Capabilities
This plugin extracts the following:
- Statistics on queries issued and tables and columns accessed (excludes views)
- Aggregation of these statistics into buckets, by day or hour granularity
Note: the client must have one of the following OAuth scopes, and should be authorized on all projects you'd like to ingest usage stats from.
- https://www.googleapis.com/auth/logging.read
- https://www.googleapis.com/auth/logging.admin
- https://www.googleapis.com/auth/cloud-platform.read-only
- https://www.googleapis.com/auth/cloud-platform
:::note
This source only does usage statistics. To get the tables, views, and schemas in your BigQuery project, use the `bigquery` source described above.
:::
### Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: bigquery-usage
config:
# Coordinates
projects:
- project_id_1
- project_id_2
# Options
top_n_queries: 10
sink:
# sink configs
```
### Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
By default, we extract usage stats for the last day, with the recommendation that this source is executed every day.
| Field | Required | Default | Description |
| ---------------------- | -------- | -------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `projects` | | Autodetected | List of project IDs to ingest usage stats from. If not specified, a single project is inferred from the environment. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `start_time` | | Last full day in UTC (or hour, depending on `bucket_duration`) | Earliest date of usage logs to consider. |
| `end_time` | | Last full day in UTC (or hour, depending on `bucket_duration`) | Latest date of usage logs to consider. |
| `top_n_queries` | | `10` | Number of top queries to save to each table. |
| `extra_client_options` | | | Additional options to pass to `google.cloud.logging_v2.client.Client`. |
| `query_log_delay` | | | To account for the possibility that the query event arrives after the read event in the audit logs, we wait for at least `query_log_delay` additional events to be processed before attempting to resolve BigQuery job information from the logs. If `query_log_delay` is `None`, it gets treated as an unlimited delay, which prioritizes correctness at the expense of memory usage. |
| `max_query_duration` | | `15` | Correction to pad `start_time` and `end_time` with. For handling the case where the read happens within our time range but the query completion event is delayed and happens after the configured end time. |
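To backfill a fixed window rather than relying on the default last-day bucket, the time bounds can be set explicitly. The sketch below assumes ISO-8601 timestamps are accepted by the config parser and uses the day-level bucketing described above:
```yml
source:
  type: bigquery-usage
  config:
    # Coordinates
    projects:
      - project_id_1
    # Hypothetical backfill of one week, bucketed by day.
    bucket_duration: "DAY"
    start_time: "2021-07-01T00:00:00Z"
    end_time: "2021-07-08T00:00:00Z"
    top_n_queries: 10
sink:
  # sink configs
```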
### Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,76 @@
# dbt
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
Works with `acryl-datahub` out of the box.
## Capabilities
This plugin pulls metadata from dbt's artifact files:
- [dbt manifest file](https://docs.getdbt.com/reference/artifacts/manifest-json)
- This file contains model, source and lineage data.
- [dbt catalog file](https://docs.getdbt.com/reference/artifacts/catalog-json)
- This file contains schema data.
- dbt does not record schema data for ephemeral models. As such, DataHub will show ephemeral models in the lineage, but there will be no associated schema for them
- [dbt sources file](https://docs.getdbt.com/reference/artifacts/sources-json)
- This file contains metadata for sources with freshness checks.
- We transfer dbt's freshness checks to DataHub's last-modified fields.
- Note that this file is optional; if not specified, we'll use the time of ingestion as a proxy for the last-modified time.
- target_platform:
- The data platform you are enriching with dbt metadata.
- [data platforms](https://github.com/linkedin/datahub/blob/master/gms/impl/src/main/resources/DataPlatformInfo.json)
- load_schemas:
- Load schemas from dbt catalog file, not necessary when the underlying data platform already has this data.
- node_type_pattern:
- Use this filter to include or exclude node types using allow/deny patterns.
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: "dbt"
config:
# Coordinates
manifest_path: "./path/dbt/manifest_file.json"
catalog_path: "./path/dbt/catalog_file.json"
sources_path: "./path/dbt/sources_file.json"
# Options
target_platform: "my_target_platform_id"
load_schemas: True # note: if this is disabled, table schema details (e.g. columns) will not be ingested
sink:
# sink configs
```
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| ------------------------- | -------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| `manifest_path` | ✅ | | Path to dbt manifest JSON. See https://docs.getdbt.com/reference/artifacts/manifest-json |
| `catalog_path` | ✅ | | Path to dbt catalog JSON. See https://docs.getdbt.com/reference/artifacts/catalog-json |
| `sources_path` | | | Path to dbt sources JSON. See https://docs.getdbt.com/reference/artifacts/sources-json. If not specified, last-modified fields will not be populated. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `target_platform` | ✅ | | The platform that dbt is loading onto. |
| `load_schemas` | ✅ | | Whether to load database schemas. If set to `False`, table schema details (e.g. columns) will not be ingested. |
| `node_type_pattern.allow` | | | Regex pattern for dbt nodes to include in ingestion. |
| `node_type_pattern.deny` | | | Regex pattern for dbt nodes to exclude from ingestion. |
Note: when `load_schemas` is False, models that use [identifiers](https://docs.getdbt.com/reference/resource-properties/identifier) to reference their source tables are ingested using the model identifier as the model name to preserve the lineage.
## Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,67 @@
# Druid
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
To install this plugin, run `pip install 'acryl-datahub[druid]'`.
## Capabilities
This plugin extracts the following:
- Metadata for databases, schemas, and tables
- Column types associated with each table
**Note**: It is important to explicitly define the deny schema pattern for internal Druid databases (lookup & sys) if adding a schema pattern. Otherwise, the crawler may crash before processing relevant databases. This deny pattern is defined by default but is overridden by user-submitted configurations.
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: druid
config:
# Coordinates
host_port: "localhost:8082"
# Credentials
username: admin
password: password
sink:
# sink configs
```
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| ---------------------- | -------- | ----------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `username` | | | Database username. |
| `password` | | | Database password. |
| `host_port` | ✅ | | Host URL and port to connect to. |
| `database` | | | Database to ingest. |
| `database_alias` | | | Alias to apply to database when ingesting. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `options.<option>` | | | Any options specified here will be passed to SQLAlchemy's `create_engine` as kwargs.<br />See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details. |
| `table_pattern.allow` | | | Regex pattern for tables to include in ingestion. |
| `table_pattern.deny` | | | Regex pattern for tables to exclude from ingestion. |
| `schema_pattern.allow` | | | Regex pattern for schemas to include in ingestion. |
| `schema_pattern.deny` | | `"^(lookup \| sys).\*"` | Regex pattern for schemas to exclude from ingestion. |
| `view_pattern.allow` | | | Regex pattern for views to include in ingestion. |
| `view_pattern.deny` | | | Regex pattern for views to exclude from ingestion. |
| `include_tables` | | `True` | Whether tables should be ingested. |
| `include_views` | | `True` | Whether views should be ingested. |
## Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,56 @@
# Feast
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
**Note: Feast ingestion requires Docker to be installed.**
To install this plugin, run `pip install 'acryl-datahub[feast]'`.
## Capabilities
This plugin extracts the following:
- List of feature tables (modeled as [`MLFeatureTable`](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/ml/metadata/MLFeatureTableProperties.pdl)s),
features ([`MLFeature`](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/ml/metadata/MLFeatureProperties.pdl)s),
and entities ([`MLPrimaryKey`](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/ml/metadata/MLPrimaryKeyProperties.pdl)s)
- Column types associated with each feature and entity
Note: this uses a separate Docker container to extract Feast's metadata into a JSON file, which is then
parsed into DataHub's native objects. This separation exists because of a dependency conflict in the `feast` module.
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: feast
config:
# Coordinates
core_url: "localhost:6565"
sink:
# sink configs
```
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| ----------------- | -------- | ------------------ | ------------------------------------------------------- |
| `core_url` | | `"localhost:6565"` | URL of Feast Core instance. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `use_local_build` | | `False` | Whether to build Feast ingestion Docker image locally. |
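If you want to build the Feast ingestion Docker image locally (for example, while testing changes) rather than use the prebuilt one, you can flip `use_local_build`. A minimal sketch:

```yml
source:
  type: feast
  config:
    core_url: "localhost:6565"
    use_local_build: True  # build the Feast ingestion Docker image locally
sink:
  # sink configs
```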
## Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,46 @@
# File
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
Works with `acryl-datahub` out of the box.
## Capabilities
This plugin pulls metadata from a previously generated file. The [file sink](../sink_docs/file.md)
can produce such files, and a number of samples are included in the
[examples/mce_files](../examples/mce_files) directory.
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: file
config:
# Coordinates
filename: ./path/to/mce/file.json
sink:
# sink configs
```
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| ---------- | -------- | ------- | ----------------------- |
| `filename` | ✅ | | Path to file to ingest. |
## Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,62 @@
# Glue
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
To install this plugin, run `pip install 'acryl-datahub[glue]'`.
Note: if you also have files in S3 that you'd like to ingest, we recommend you use Glue's built-in data catalog. See [here](../s3-ingestion.md) for a quick guide on how to set up a crawler on Glue and ingest the outputs with DataHub.
## Capabilities
This plugin extracts the following:
- Tables in the Glue catalog
- Column types associated with each table
- Table metadata, such as owner, description and parameters
- Jobs and their component transformations, data sources, and data sinks
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: glue
config:
# Coordinates
aws_region: "my-aws-region"
sink:
# sink configs
```
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| ------------------------ | -------- | --------------------------- | ---------------------------------------------------------------------------------- |
| `aws_region` | ✅ | | AWS region code. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `aws_access_key_id` | | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `aws_secret_access_key` | | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `aws_session_token` | | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `aws_role` | | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `extract_transforms` | | `True` | Whether to extract Glue transform jobs. |
| `database_pattern.allow` | | | Regex pattern for databases to include in ingestion. |
| `database_pattern.deny` | | | Regex pattern for databases to exclude from ingestion. |
| `table_pattern.allow` | | | Regex pattern for tables to include in ingestion. |
| `table_pattern.deny` | | | Regex pattern for tables to exclude from ingestion. |
| `underlying_platform` | | | Override for platform name. |
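If you'd rather not rely on boto3's credential autodetection, the credential fields above can be set explicitly. A sketch with placeholder values (substitute your own keys and role ARN):

```yml
source:
  type: glue
  config:
    aws_region: "my-aws-region"
    # Explicit credentials (placeholders); omit these to fall back to boto3 autodetection
    aws_access_key_id: "MY_ACCESS_KEY_ID"
    aws_secret_access_key: "MY_SECRET_ACCESS_KEY"
    aws_role: "arn:aws:iam::123456789012:role/my-ingestion-role"  # placeholder role ARN
    extract_transforms: False  # skip Glue transform jobs if you only need tables
sink:
  # sink configs
```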
## Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,100 @@
# Hive
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
To install this plugin, run `pip install 'acryl-datahub[hive]'`.
## Capabilities
This plugin extracts the following:
- Metadata for databases, schemas, and tables
- Column types associated with each table
- Detailed table and storage information
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: hive
config:
# Coordinates
host_port: localhost:10000
database: DemoDatabase # optional, if not specified, ingests from all databases
# Credentials
username: user # optional
password: pass # optional
# For more details on authentication, see the PyHive docs:
# https://github.com/dropbox/PyHive#passing-session-configuration.
# LDAP, Kerberos, etc. are supported using connect_args, which can be
# added under the `options` config parameter.
#scheme: 'hive+http' # set this if Thrift should use the HTTP transport
#scheme: 'hive+https' # set this if Thrift should use the HTTP with SSL transport
sink:
# sink configs
```
<details>
<summary>Example: using ingestion with Azure HDInsight</summary>
```yml
# Connecting to Microsoft Azure HDInsight using TLS.
source:
type: hive
config:
# Coordinates
host_port: <cluster_name>.azurehdinsight.net:443
# Credentials
username: admin
password: password
# Options
options:
connect_args:
http_path: "/hive2"
auth: BASIC
sink:
# sink configs
```
</details>
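As the recipe comments mention, LDAP, Kerberos, and similar authentication schemes are configured through `options.connect_args`, which is passed along to PyHive. A hedged sketch, assuming PyHive's `auth` and `kerberos_service_name` connection parameters:

```yml
source:
  type: hive
  config:
    # Coordinates
    host_port: localhost:10000
    # Options
    options:
      connect_args:
        auth: KERBEROS                # or LDAP, NOSASL, etc. -- see the PyHive docs
        kerberos_service_name: hive   # the Hive service principal name, typically "hive"
sink:
  # sink configs
```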
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| ---------------------- | -------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `username` | | | Database username. |
| `password` | | | Database password. |
| `host_port` | ✅ | | Host URL and port to connect to. |
| `database` | | | Database to ingest. |
| `database_alias` | | | Alias to apply to database when ingesting. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `options.<option>` | | | Any options specified here will be passed to SQLAlchemy's `create_engine` as kwargs.<br />See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details. |
| `table_pattern.allow` | | | Regex pattern for tables to include in ingestion. |
| `table_pattern.deny` | | | Regex pattern for tables to exclude from ingestion. |
| `schema_pattern.allow` | | | Regex pattern for schemas to include in ingestion. |
| `schema_pattern.deny` | | | Regex pattern for schemas to exclude from ingestion. |
| `view_pattern.allow` | | | Regex pattern for views to include in ingestion. |
| `view_pattern.deny` | | | Regex pattern for views to exclude from ingestion. |
| `include_tables` | | `True` | Whether tables should be ingested. |
## Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,63 @@
# Kafka Connect
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
To install this plugin, run `pip install 'acryl-datahub[kafka-connect]'`.
## Capabilities
This plugin extracts the following:
- Each Kafka Connect connector as an individual `DataFlowSnapshotClass` entity
- Individual `DataJobSnapshotClass` entities, named using the `{connector_name}:{source_dataset}` convention
- Lineage information from the source database to the Kafka topic
Current limitations:
- Currently works only for Debezium source connectors.
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: "kafka-connect"
config:
# Coordinates
connect_uri: "http://localhost:8083"
cluster_name: "connect-cluster"
# Credentials
username: admin
password: password
sink:
# sink configs
```
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| -------------------------- | -------- | -------------------------- | ------------------------------------------------------- |
| `connect_uri` | | `"http://localhost:8083/"` | URI to connect to. |
| `username` | | | Kafka Connect username. |
| `password` | | | Kafka Connect password. |
| `cluster_name` | | `"connect-cluster"` | Cluster to ingest from. |
| `connector_patterns.allow` | | | Regex pattern for connectors to include in ingestion. |
| `connector_patterns.deny` | | | Regex pattern for connectors to exclude from ingestion. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
## Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,60 @@
# Kafka Metadata
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
To install this plugin, run `pip install 'acryl-datahub[kafka]'`.
## Capabilities
This plugin extracts the following:
- Topics from the Kafka broker
- Schemas associated with each topic from the schema registry
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: "kafka"
config:
# Coordinates
connection:
bootstrap: "broker:9092"
schema_registry_url: http://localhost:8081
sink:
# sink configs
```
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| -------------------------------------------- | -------- | ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `connection.bootstrap` | | `"localhost:9092"` | Bootstrap servers. |
| `connection.schema_registry_url` | | `"http://localhost:8081"` | Schema registry location. |
| `connection.schema_registry_config.<option>` | | | Extra schema registry config. These options will be passed into Kafka's SchemaRegistryClient. See https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html?#schemaregistryclient. |
| `connection.consumer_config.<option>` | | | Extra consumer config. These options will be passed into Kafka's DeserializingConsumer. See https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#deserializingconsumer and https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md. |
| `connection.producer_config.<option>` | | | Extra producer config. These options will be passed into Kafka's SerializingProducer. See https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#serializingproducer and https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md. |
| `topic_patterns.allow` | | | Regex pattern for topics to include in ingestion. |
| `topic_patterns.deny` | | | Regex pattern for topics to exclude from ingestion. |
The options in the consumer config and schema registry config are passed to the Kafka DeserializingConsumer and SchemaRegistryClient respectively.
For a full example with a number of security options, see this [example recipe](../examples/recipes/secured_kafka.yml).
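Building on that, here is a minimal sketch of passing security options through the nested configs; the option names come from librdkafka and the Confluent `SchemaRegistryClient`, and the credentials are placeholders:

```yml
source:
  type: "kafka"
  config:
    connection:
      bootstrap: "broker:9092"
      consumer_config:
        security.protocol: "SASL_SSL"   # librdkafka option
        sasl.mechanism: "PLAIN"         # librdkafka option
        sasl.username: "kafka-user"     # placeholder
        sasl.password: "kafka-password" # placeholder
      schema_registry_url: "http://localhost:8081"
      schema_registry_config:
        basic.auth.user.info: "registry-user:registry-password"  # placeholder credentials
sink:
  # sink configs
```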
## Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,65 @@
# LDAP
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
To install this plugin, run `pip install 'acryl-datahub[ldap]'`.
## Capabilities
This plugin extracts the following:
- People
- Names, emails, titles, and manager information for each person
- List of groups
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: "ldap"
config:
# Coordinates
ldap_server: ldap://localhost
# Credentials
ldap_user: "cn=admin,dc=example,dc=org"
ldap_password: "admin"
# Options
base_dn: "dc=example,dc=org"
sink:
# sink configs
```
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| ------------------------------ | -------- | ------------------- | ----------------------------------------------------------------------- |
| `ldap_server` | ✅ | | LDAP server URL. |
| `ldap_user` | ✅ | | LDAP user. |
| `ldap_password` | ✅ | | LDAP password. |
| `base_dn` | ✅ | | LDAP DN. |
| `filter` | | `"(objectClass=*)"` | LDAP extractor filter. |
| `drop_missing_first_last_name` | | `True` | If set to true, any users without first and last names will be dropped. |
| `page_size` | | `20` | Size of each page to fetch when extracting metadata. |
Set `drop_missing_first_last_name` to true if you have many "headless" LDAP accounts for devices or services
that should be excluded because they do not contain a first and last name. This will only
impact the ingestion of LDAP users; LDAP groups will be unaffected by this config option.
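For example, to restrict extraction to person entries and keep headless accounts out of ingestion (the filter value and page size here are only illustrative):

```yml
source:
  type: "ldap"
  config:
    ldap_server: ldap://localhost
    ldap_user: "cn=admin,dc=example,dc=org"
    ldap_password: "admin"
    base_dn: "dc=example,dc=org"
    filter: "(objectClass=person)"      # illustrative filter; the default is "(objectClass=*)"
    drop_missing_first_last_name: True  # skip headless accounts without a first and last name
    page_size: 50                       # fetch results in pages of 50
sink:
  # sink configs
```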
## Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,62 @@
# Looker dashboards
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
To install this plugin, run `pip install 'acryl-datahub[looker]'`.
## Capabilities
This plugin extracts the following:
- Looker dashboards and dashboard elements (charts)
- Names, descriptions, URLs, chart types, input view for the charts
See the [Looker authentication docs](https://docs.looker.com/reference/api-and-integration/api-auth#authentication_with_an_sdk) for the steps to create a client ID and secret.
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: "looker"
config:
# Coordinates
base_url: https://company.looker.com:19999
# Credentials
client_id: admin
client_secret: password
sink:
# sink configs
```
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| ------------------------- | -------- | ----------------------- | ------------------------------------------------------------------------------------------------------------ |
| `client_id` | ✅ | | Looker API3 client ID. |
| `client_secret` | ✅ | | Looker API3 client secret. |
| `base_url` | ✅ | | URL of your Looker instance: `https://company.looker.com:19999` or `https://looker.company.com`, or similar. |
| `platform_name` | | `"looker"` | Platform to use in namespace when constructing URNs. |
| `actor` | | `"urn:li:corpuser:etl"` | Actor to use in ownership properties of ingested metadata. |
| `dashboard_pattern.allow` | | | Regex pattern for dashboards to include in ingestion. |
| `dashboard_pattern.deny` | | | Regex pattern for dashboards to exclude from ingestion. |
| `chart_pattern.allow` | | | Regex pattern for charts to include in ingestion. |
| `chart_pattern.deny` | | | Regex pattern for charts to exclude from ingestion. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
## Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,66 @@
# LookML
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
To install this plugin, run `pip install 'acryl-datahub[lookml]'`.
Note! This plugin uses a package that requires Python 3.7+!
## Capabilities
This plugin extracts the following:
- LookML views from model files
- Name, upstream table names, dimensions, measures, and dimension groups
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: "lookml"
config:
# Coordinates
base_folder: /path/to/model/files
# Options
connection_to_platform_map:
connection_name: platform_name (or platform_name.database_name) # for ex. my_snowflake_conn: snowflake.my_database
sink:
# sink configs
```
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| ---------------------------------------------- | -------- | ---------- | ----------------------------------------------------------------------- |
| `base_folder` | ✅ | | Where the `*.model.lkml` and `*.view.lkml` files are stored. |
| `connection_to_platform_map.<connection_name>` | ✅ | | Mappings between connection names in the model files to platform names. |
| `platform_name` | | `"looker"` | Platform to use in namespace when constructing URNs. |
| `model_pattern.allow` | | | Regex pattern for models to include in ingestion. |
| `model_pattern.deny` | | | Regex pattern for models to exclude from ingestion. |
| `view_pattern.allow` | | | Regex pattern for views to include in ingestion. |
| `view_pattern.deny` | | | Regex pattern for views to exclude from ingestion. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `parse_table_names_from_sql` | | `False` | See note below. |
Note! The integration can use [`sql-metadata`](https://pypi.org/project/sql-metadata/) to try to parse the tables the
views depend on. As these SQL statements can be complicated, and the package doesn't officially support all the SQL dialects that
Looker supports, the results might not be correct. This parsing is disabled by default, but can be enabled by setting
`parse_table_names_from_sql: True`.
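Putting the notes above together, a sketch that maps a couple of hypothetical connection names to their platforms and opts into the best-effort SQL parsing:

```yml
source:
  type: "lookml"
  config:
    base_folder: /path/to/model/files
    connection_to_platform_map:
      my_snowflake_conn: snowflake.my_database  # hypothetical connection name
      my_postgres_conn: postgres                # hypothetical connection name
    parse_table_names_from_sql: True  # best-effort parsing via sql-metadata; results may be imperfect
sink:
  # sink configs
```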
## Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,73 @@
# MongoDB
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
To install this plugin, run `pip install 'acryl-datahub[mongodb]'`.
## Capabilities
This plugin extracts the following:
- Databases and associated metadata
- Collections in each database and schemas for each collection (via schema inference)
By default, schema inference samples 1,000 documents from each collection. Setting `schemaSamplingSize: null` will scan the entire collection.
Moreover, setting `useRandomSampling: False` will sample the first documents found without random selection, which may be faster for large collections.
Note that `schemaSamplingSize` has no effect if `enableSchemaInference: False` is set.
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: "mongodb"
config:
# Coordinates
connect_uri: "mongodb://localhost"
# Credentials
username: admin
password: password
authMechanism: "DEFAULT"
# Options
enableSchemaInference: True
useRandomSampling: True
sink:
# sink configs
```
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| -------------------------- | -------- | ----------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `connect_uri` | | `"mongodb://localhost"` | MongoDB connection URI. |
| `username` | | | MongoDB username. |
| `password` | | | MongoDB password. |
| `authMechanism` | | | MongoDB authentication mechanism. See https://pymongo.readthedocs.io/en/stable/examples/authentication.html for details. |
| `options` | | | Additional options to pass to `pymongo.MongoClient()`. |
| `enableSchemaInference` | | `True` | Whether to infer schemas. |
| `schemaSamplingSize` | | `1000` | Number of documents to use when inferring schema size. If set to `0`, all documents will be scanned. |
| `useRandomSampling` | | `True` | Whether documents for schema inference should be randomly selected. If `False`, documents are sampled from the start of the collection. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `database_pattern.allow` | | | Regex pattern for databases to include in ingestion. |
| `database_pattern.deny` | | | Regex pattern for databases to exclude from ingestion. |
| `collection_pattern.allow` | | | Regex pattern for collections to include in ingestion. |
| `collection_pattern.deny` | | | Regex pattern for collections to exclude from ingestion. |
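For instance, to scan every document during schema inference instead of sampling 1,000 per collection (slower, but exhaustive), a sketch would look like:

```yml
source:
  type: "mongodb"
  config:
    connect_uri: "mongodb://localhost"
    enableSchemaInference: True
    schemaSamplingSize: null  # scan the entire collection rather than sampling
sink:
  # sink configs
```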
## Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,101 @@
# Microsoft SQL Server
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
To install this plugin, run `pip install 'acryl-datahub[mssql]'`.
We have two options for the underlying library used to connect to SQL Server: (1) [python-tds](https://github.com/denisenkom/pytds) and (2) [pyodbc](https://github.com/mkleehammer/pyodbc). The TDS library is pure Python and hence easier to install, but only PyODBC supports encrypted connections.
## Capabilities
This plugin extracts the following:
- Metadata for databases, schemas, views and tables
- Column types associated with each table/view
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: mssql
config:
# Coordinates
host_port: localhost:1433
database: DemoDatabase
# Credentials
username: user
password: pass
sink:
# sink configs
```
<details>
<summary>Example: using ingestion with ODBC and encryption</summary>
This requires you to have already installed the Microsoft ODBC Driver for SQL Server.
See https://docs.microsoft.com/en-us/sql/connect/python/pyodbc/step-1-configure-development-environment-for-pyodbc-python-development?view=sql-server-ver15
```yml
source:
type: mssql
config:
# Coordinates
host_port: localhost:1433
database: DemoDatabase
# Credentials
username: admin
password: password
# Options
uri_args:
driver: "ODBC Driver 17 for SQL Server"
Encrypt: "yes"
TrustServerCertificate: "Yes"
ssl: "True"
sink:
# sink configs
```
</details>
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| ---------------------- | -------- | ------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `username` | | | MSSQL username. |
| `password` | | | MSSQL password. |
| `host_port` | | `"localhost:1433"` | MSSQL host URL. |
| `database` | | | MSSQL database. |
| `database_alias` | | | Alias to apply to database when ingesting. |
| `use_odbc` | | `False` | See https://docs.sqlalchemy.org/en/14/dialects/mssql.html#module-sqlalchemy.dialects.mssql.pyodbc. |
| `uri_args.<uri_arg>` | | | Arguments to URL-encode when connecting. See https://docs.microsoft.com/en-us/sql/connect/odbc/dsn-connection-string-attribute?view=sql-server-ver15. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `options.<option>` | | | Any options specified here will be passed to SQLAlchemy's `create_engine` as kwargs.<br />See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details. |
| `table_pattern.allow` | | | Regex pattern for tables to include in ingestion. |
| `table_pattern.deny` | | | Regex pattern for tables to exclude from ingestion. |
| `schema_pattern.allow` | | | Regex pattern for schemas to include in ingestion. |
| `schema_pattern.deny` | | | Regex pattern for schemas to exclude from ingestion. |
| `view_pattern.allow` | | | Regex pattern for views to include in ingestion. |
| `view_pattern.deny` | | | Regex pattern for views to exclude from ingestion. |
| `include_tables` | | `True` | Whether tables should be ingested. |
| `include_views` | | `True` | Whether views should be ingested. |
## Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,66 @@
# MySQL
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
To install this plugin, run `pip install 'acryl-datahub[mysql]'`.
## Capabilities
This plugin extracts the following:
- Metadata for databases, schemas, and tables
- Column types and schema associated with each table
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: mysql
config:
# Coordinates
host_port: localhost:3306
database: dbname
# Credentials
username: root
password: example
sink:
# sink configs
```
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| ---------------------- | -------- | ------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `username` | | | MySQL username. |
| `password` | | | MySQL password. |
| `host_port` | | `"localhost:3306"` | MySQL host URL. |
| `database` | | | MySQL database. |
| `database_alias` | | | Alias to apply to database when ingesting. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `options.<option>` | | | Any options specified here will be passed to SQLAlchemy's `create_engine` as kwargs.<br />See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details. |
| `table_pattern.allow` | | | Regex pattern for tables to include in ingestion. |
| `table_pattern.deny` | | | Regex pattern for tables to exclude from ingestion. |
| `schema_pattern.allow` | | | Regex pattern for schemas to include in ingestion. |
| `schema_pattern.deny` | | | Regex pattern for schemas to exclude from ingestion. |
| `view_pattern.allow` | | | Regex pattern for views to include in ingestion. |
| `view_pattern.deny` | | | Regex pattern for views to exclude from ingestion. |
| `include_tables` | | `True` | Whether tables should be ingested. |
| `include_views` | | `True` | Whether views should be ingested. |
## Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,74 @@
# Oracle
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
To install this plugin, run `pip install 'acryl-datahub[oracle]'`.
## Capabilities
This plugin extracts the following:
- Metadata for databases, schemas, and tables
- Column types associated with each table
Using the Oracle source requires that you've also installed the correct drivers; see the [cx_Oracle docs](https://cx-oracle.readthedocs.io/en/latest/user_guide/installation.html). The easiest one is the [Oracle Instant Client](https://www.oracle.com/database/technologies/instant-client.html).
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: oracle
config:
# Coordinates
    host_port: localhost:1521
database: dbname
# Credentials
username: user
password: pass
# Options
    # service_name: svc # use this instead of `database` if connecting via a service name
sink:
# sink configs
```
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
Exactly one of `database` or `service_name` is required.
| Field | Required | Default | Description |
| ---------------------- | ------------------------------ | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `username` | | | Oracle username. For more details on authentication, see the documentation: https://docs.sqlalchemy.org/en/14/dialects/oracle.html#dialect-oracle-cx_oracle-connect <br /> and https://cx-oracle.readthedocs.io/en/latest/user_guide/connection_handling.html#connection-strings. |
| `password` | | | Oracle password. |
| `host_port` | | | Oracle host URL. |
| `database` | If `service_name` is not set | | Oracle database name. If using, omit `service_name`. |
| `service_name` | If `database` is not set | | Oracle service name. If using, omit `database`. |
| `database_alias` | | | Alias to apply to database when ingesting. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `options.<option>` | | | Any options specified here will be passed to SQLAlchemy's `create_engine` as kwargs.<br />See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details. |
| `table_pattern.allow` | | | Regex pattern for tables to include in ingestion. |
| `table_pattern.deny` | | | Regex pattern for tables to exclude from ingestion. |
| `schema_pattern.allow` | | | Regex pattern for schemas to include in ingestion. |
| `schema_pattern.deny` | | | Regex pattern for schemas to exclude from ingestion. |
| `view_pattern.allow` | | | Regex pattern for views to include in ingestion. |
| `view_pattern.deny` | | | Regex pattern for views to exclude from ingestion. |
| `include_tables` | | `True` | Whether tables should be ingested. |
| `include_views` | | `True` | Whether views should be ingested. |
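Since exactly one of `database` and `service_name` may be set, connecting by service name looks like the sketch below (the service name and port are placeholders):

```yml
source:
  type: oracle
  config:
    # Coordinates
    host_port: localhost:1521
    service_name: svc    # placeholder service name; omit `database` when using this
    # database: dbname   # ...or set `database` instead and omit `service_name`
    # Credentials
    username: user
    password: pass
sink:
  # sink configs
```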
## Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,71 @@
# PostgreSQL
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
To install this plugin, run `pip install 'acryl-datahub[postgres]'`.
## Capabilities
This plugin extracts the following:
- Metadata for databases, schemas, views, and tables
- Column types associated with each table
- Also supports PostGIS extensions
- `database_alias` (optional) can be used to change the name of the database to be ingested
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: postgres
config:
# Coordinates
host_port: localhost:5432
database: DemoDatabase
# Credentials
username: user
password: pass
# Options
database_alias: DatabaseNameToBeIngested
sink:
# sink configs
```
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| ---------------------- | -------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `username` | | | PostgreSQL username. |
| `password` | | | PostgreSQL password. |
| `host_port` | ✅ | | PostgreSQL host URL. |
| `database` | | | PostgreSQL database. |
| `database_alias` | | | Alias to apply to database when ingesting. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `options.<option>` | | | Any options specified here will be passed to SQLAlchemy's `create_engine` as kwargs.<br />See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details. |
| `table_pattern.allow` | | | Regex pattern for tables to include in ingestion. |
| `table_pattern.deny` | | | Regex pattern for tables to exclude from ingestion. |
| `schema_pattern.allow` | | | Regex pattern for schemas to include in ingestion. |
| `schema_pattern.deny` | | | Regex pattern for schemas to exclude from ingestion. |
| `view_pattern.allow` | | | Regex pattern for views to include in ingestion. |
| `view_pattern.deny` | | | Regex pattern for views to exclude from ingestion. |
| `include_tables` | | `True` | Whether tables should be ingested. |
| `include_views` | | `True` | Whether views should be ingested. |
## Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,97 @@
# Redshift
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
To install this plugin, run `pip install 'acryl-datahub[redshift]'`.
## Capabilities
This plugin extracts the following:
- Metadata for databases, schemas, views and tables
- Column types associated with each table
- Also supports PostGIS extensions
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: redshift
config:
# Coordinates
host_port: example.something.us-west-2.redshift.amazonaws.com:5439
database: DemoDatabase
# Credentials
username: user
password: pass
# Options
options:
# driver_option: some-option
include_views: True # whether to include views, defaults to True
    include_tables: True # whether to include tables, defaults to True
sink:
# sink configs
```
<details>
<summary>Extra options when running Redshift behind a proxy</summary>
```yml
source:
type: redshift
config:
host_port: my-proxy-hostname:5439
options:
connect_args:
sslmode: "prefer" # or "require" or "verify-ca"
sslrootcert: ~ # needed to unpin the AWS Redshift certificate
sink:
# sink configs
```
</details>
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| ---------------------- | -------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `username` | | | Redshift username. |
| `password` | | | Redshift password. |
| `host_port` | ✅ | | Redshift host URL. |
| `database` | | | Redshift database. |
| `database_alias` | | | Alias to apply to database when ingesting. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `options.<option>` | | | Any options specified here will be passed to SQLAlchemy's `create_engine` as kwargs.<br />See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details. |
| `table_pattern.allow` | | | Regex pattern for tables to include in ingestion. |
| `table_pattern.deny` | | | Regex pattern for tables to exclude from ingestion. |
| `schema_pattern.allow` | | | Regex pattern for schemas to include in ingestion. |
| `schema_pattern.deny` | | | Regex pattern for schemas to exclude from ingestion. |
| `view_pattern.allow` | | | Regex pattern for views to include in ingestion. |
| `view_pattern.deny` | | | Regex pattern for views to exclude from ingestion. |
| `include_tables` | | `True` | Whether tables should be ingested. |
| `include_views` | | `True` | Whether views should be ingested. |
## Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,62 @@
# SageMaker
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
To install this plugin, run `pip install 'acryl-datahub[sagemaker]'`.
## Capabilities
This plugin extracts the following:
- Feature groups
- Models, jobs, and lineage between the two (e.g. when jobs output a model or a model is used by a job)
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: sagemaker
config:
# Coordinates
aws_region: "my-aws-region"
sink:
# sink configs
```
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| ------------------------------------- | -------- | ------------ | ---------------------------------------------------------------------------------- |
| `aws_region` | ✅ | | AWS region code. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `aws_access_key_id` | | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `aws_secret_access_key` | | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `aws_session_token` | | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `aws_role` | | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `extract_feature_groups` | | `True` | Whether to extract feature groups. |
| `extract_models` | | `True` | Whether to extract models. |
| `extract_jobs.auto_ml` | | `True` | Whether to extract AutoML jobs. |
| `extract_jobs.compilation` | | `True` | Whether to extract compilation jobs. |
| `extract_jobs.edge_packaging` | | `True` | Whether to extract edge packaging jobs. |
| `extract_jobs.hyper_parameter_tuning` | | `True` | Whether to extract hyperparameter tuning jobs. |
| `extract_jobs.labeling` | | `True` | Whether to extract labeling jobs. |
| `extract_jobs.processing` | | `True` | Whether to extract processing jobs. |
| `extract_jobs.training` | | `True` | Whether to extract training jobs. |
| `extract_jobs.transform` | | `True` | Whether to extract transform jobs. |
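Each `extract_jobs.*` flag can be toggled independently. A sketch that skips labeling and edge packaging jobs while keeping the rest (every job type is listed explicitly to be safe):

```yml
source:
  type: sagemaker
  config:
    aws_region: "my-aws-region"
    extract_jobs:
      auto_ml: True
      compilation: True
      edge_packaging: False  # skip edge packaging jobs
      hyper_parameter_tuning: True
      labeling: False        # skip labeling jobs
      processing: True
      training: True
      transform: True
sink:
  # sink configs
```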
## Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,147 @@
# Snowflake
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
To install this plugin, run `pip install 'acryl-datahub[snowflake]'`.
## Capabilities
This plugin extracts the following:
- Metadata for databases, schemas, views and tables
- Column types associated with each table
:::tip
You can also get fine-grained usage statistics for Snowflake using the `snowflake-usage` source described below.
:::
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: snowflake
config:
# Coordinates
host_port: account_name
warehouse: "COMPUTE_WH"
# Credentials
username: user
password: pass
role: "sysadmin"
sink:
# sink configs
```
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| ------------------------ | -------- | -------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `username` | | | Snowflake username. |
| `password` | | | Snowflake password. |
| `host_port` | ✅ | | Snowflake host URL. |
| `warehouse` | | | Snowflake warehouse. |
| `role` | | | Snowflake role. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `options.<option>` | | | Any options specified here will be passed to SQLAlchemy's `create_engine` as kwargs.<br />See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details. |
| `database_pattern.allow` | | | Regex pattern for databases to include in ingestion. |
| `database_pattern.deny` | | `"^UTIL_DB$"`<br />`"^SNOWFLAKE$"`<br />`"^SNOWFLAKE_SAMPLE_DATA$"` | Regex pattern for databases to exclude from ingestion. |
| `table_pattern.allow` | | | Regex pattern for tables to include in ingestion. |
| `table_pattern.deny` | | | Regex pattern for tables to exclude from ingestion. |
| `schema_pattern.allow` | | | Regex pattern for schemas to include in ingestion. |
| `schema_pattern.deny` | | | Regex pattern for schemas to exclude from ingestion. |
| `view_pattern.allow` | | | Regex pattern for views to include in ingestion. |
| `view_pattern.deny` | | | Regex pattern for views to exclude from ingestion. |
| `include_tables` | | `True` | Whether tables should be ingested. |
| `include_views` | | `True` | Whether views should be ingested. |
## Compatibility
Coming soon!
## Snowflake Usage Stats
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
### Setup
To install this plugin, run `pip install 'acryl-datahub[snowflake-usage]'`.
### Capabilities
This plugin extracts the following:
- Statistics on queries issued and tables and columns accessed (excludes views)
- Aggregation of these statistics into buckets, by day or hour granularity
Note: the user/role must have access to the account usage table. The "accountadmin" role has this by default, and other roles can be [granted this permission](https://docs.snowflake.com/en/sql-reference/account-usage.html#enabling-account-usage-for-other-roles).
Note: the underlying access history views that we use are only available in Snowflake's enterprise edition or higher.
:::note
This source only does usage statistics. To get the tables, views, and schemas in your Snowflake warehouse, ingest using the `snowflake` source described above.
:::
### Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: snowflake-usage
config:
# Coordinates
host_port: account_name
warehouse: "COMPUTE_WH"
# Credentials
username: user
password: pass
role: "sysadmin"
# Options
top_n_queries: 10
sink:
# sink configs
```
### Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| ----------------- | -------- | -------------------------------------------------------------- | --------------------------------------------------------------- |
| `username` | | | Snowflake username. |
| `password` | | | Snowflake password. |
| `host_port` | ✅ | | Snowflake host URL. |
| `warehouse` | | | Snowflake warehouse. |
| `role` | | | Snowflake role. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `bucket_duration` | | `"DAY"` | Duration to bucket usage events by. Can be `"DAY"` or `"HOUR"`. |
| `start_time` | | Last full day in UTC (or hour, depending on `bucket_duration`) | Earliest date of usage logs to consider. |
| `end_time` | | Last full day in UTC (or hour, depending on `bucket_duration`) | Latest date of usage logs to consider. |
| `top_n_queries` | | `10` | Number of top queries to save to each table. |
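For example, to bucket usage events hourly and keep more top queries per table, a sketch on top of the quickstart above (the role should be whichever role has account usage access in your deployment):

```yml
source:
  type: snowflake-usage
  config:
    # Coordinates
    host_port: account_name
    warehouse: "COMPUTE_WH"
    # Credentials
    username: user
    password: pass
    role: "accountadmin"     # or another role granted account usage access
    # Options
    bucket_duration: "HOUR"  # bucket usage events by hour instead of by day
    top_n_queries: 25        # keep the top 25 queries per table
sink:
  # sink configs
```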
### Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,84 @@
# SQL Profiles
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
To install this plugin, run `pip install 'acryl-datahub[sql-profiles]'`.
The SQL-based profiler does not run alone, but rather can be enabled for other SQL-based sources.
Enabling profiling will slow down ingestion runs.
:::caution
Running profiling against many tables or over many rows can incur significant costs.
While we've done our best to limit the cost of the queries the profiler runs, you
should be prudent about which tables profiling is enabled on and how frequently
the profiling runs happen.
:::
## Capabilities
Extracts:
- Row and column counts for each table
- For each column, if applicable:
- null counts and proportions
- distinct counts and proportions
- minimum, maximum, mean, median, standard deviation, some quantile values
- histograms or frequencies of unique values
Supported SQL sources:
- AWS Athena
- BigQuery
- Druid
- Hive
- Microsoft SQL Server
- MySQL
- Oracle
- Postgres
- Redshift
- Snowflake
- Generic SQLAlchemy source
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: <sql-source> # can be bigquery, snowflake, etc - see above for the list
config:
# ... any other source-specific options ...
# Options
profiling:
enabled: true
sink:
# sink configs
```
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| ----------------------- | -------- | ------- | ----------------------------------------------------------------------- |
| `profiling.enabled` | | `False` | Whether profiling should be done. |
| `profiling.limit` | | | Max number of documents to profile. By default, profiles all documents. |
| `profiling.offset` | | | Offset in documents to profile. By default, uses no offset. |
| `profile_pattern.allow` | | | Regex pattern for tables to profile. |
| `profile_pattern.deny` | | | Regex pattern for tables to not profile. |
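For instance, to cap how much gets profiled and restrict profiling to a subset of tables (the limit and patterns below are hypothetical), you might use:

```yml
source:
  type: postgres  # or any other supported SQL source from the list above
  config:
    host_port: localhost:5432
    database: DemoDatabase
    profiling:
      enabled: true
      limit: 10000  # hypothetical cap on the number of documents to profile
    profile_pattern:
      allow:
        - 'public\.important_.*'  # hypothetical: only profile these tables
      deny:
        - '.*_staging'            # hypothetical: skip staging tables
sink:
  # sink configs
```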
## Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,62 @@
# Other SQLAlchemy databases
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
To install this plugin, run `pip install 'acryl-datahub[sqlalchemy]'`.
The `sqlalchemy` source is useful if we don't have a pre-built source for your chosen
database system, but there is an [SQLAlchemy dialect](https://docs.sqlalchemy.org/en/14/dialects/)
defined elsewhere. In order to use this, you must `pip install` the required dialect packages yourself.
## Capabilities
This plugin extracts the following:
- Metadata for databases, schemas, views, and tables
- Column types associated with each table
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: sqlalchemy
config:
# Coordinates
connect_uri: "dialect+driver://username:password@host:port/database"
sink:
# sink configs
```
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| ---------------------- | -------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `platform` | ✅ | | Name of platform being ingested, used in constructing URNs. |
| `connect_uri` | ✅ | | URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `options.<option>` | | | Any options specified here will be passed to SQLAlchemy's `create_engine` as kwargs.<br />See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details. |
| `table_pattern.allow` | | | Regex pattern for tables to include in ingestion. |
| `table_pattern.deny` | | | Regex pattern for tables to exclude from ingestion. |
| `schema_pattern.allow` | | | Regex pattern for schemas to include in ingestion. |
| `schema_pattern.deny` | | | Regex pattern for schemas to exclude from ingestion. |
| `view_pattern.allow` | | | Regex pattern for views to include in ingestion. |
| `view_pattern.deny` | | | Regex pattern for views to exclude from ingestion. |
| `include_tables` | | `True` | Whether tables should be ingested. |
| `include_views` | | `True` | Whether views should be ingested. |
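As a concrete, hypothetical example: if you've installed a ClickHouse dialect such as `clickhouse-sqlalchemy`, the recipe might look like the following (the URI scheme and platform name are assumptions about that dialect, not something this plugin ships with):

```yml
source:
  type: sqlalchemy
  config:
    platform: clickhouse  # used when constructing URNs
    connect_uri: "clickhouse://user:pass@localhost:8123/default"  # dialect-specific URI
sink:
  # sink configs
```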
## Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!

View File

@ -0,0 +1,57 @@
# Superset
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
To install this plugin, run `pip install 'acryl-datahub[superset]'`.
See the documentation for Superset's `/security/login` endpoint at https://superset.apache.org/docs/rest-api for more details on Superset's login API.
## Capabilities
This plugin extracts the following:
- Charts, dashboards, and associated metadata
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
type: superset
config:
# Coordinates
connect_uri: http://localhost:8088
# Credentials
username: user
password: pass
provider: ldap
sink:
# sink configs
```
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
| ------------- | -------- | ------------------ | ------------------------------------------------------- |
| `connect_uri` | | `"localhost:8088"` | Superset host URL. |
| `username` | | | Superset username. |
| `password` | | | Superset password. |
| `provider` | | `"db"` | Superset provider. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
## Compatibility
Coming soon!
## Questions
If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!