Doc update (#1716)

* GitBook: [#70] Update to Roadmap

* GitBook: [#72] Modified Looker and Oracle Connector

* GitBook: [#71] Change Events

* GitBook: [#74] API Component

* GitBook: [#73] MlFlow Connector

* GitBook: [#75] PUT Diagram

* GitBook: [#76] Change Events

* GitBook: [#77] Snowflake Single Sign-on changes

Co-authored-by: OpenMetadata <github@harsha.io>
Co-authored-by: Ayush Shah <ayush.shah@deuexsolutions.com>
Co-authored-by: pmbrull <peremiquelbrull@gmail.com>
parthp2107 2021-12-13 12:49:33 +05:30 committed by GitHub
parent 4267f83433
commit c2f1b6bb8a
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
11 changed files with 368 additions and 65 deletions

Binary image file (not shown). After: 132 KiB

Binary image file (not shown). After: 172 KiB


@ -10,27 +10,27 @@ OpenMetadata enables metadata management end-to-end, giving you the ability to u
OpenMetadata provides connectors that enable you to perform metadata ingestion from a number of common database, dashboard, messaging, and pipeline services. With each release, we add additional connectors and the ingestion framework provides a structured and straightforward method for creating your own connectors. See the table below for a list of supported connectors.
| Database | Dashboard | Pipeline | Messaging | Modeling |
| ------------------------------------------------------------- | ----------------------------------------------- | ---------------------------------------------- | ----------------------------------------- | ------------------------ |
| [Athena](openmetadata/connectors/athena.md) | [Looker](openmetadata/connectors/looker.md) | [Airflow](install/metadata-ingestion/airflow/) | [Kafka](openmetadata/connectors/kafka.md) | [DBT](connectors/dbt.md) |
| [BigQuery](openmetadata/connectors/bigquery.md) | [Redash](openmetadata/connectors/redash.md) | Prefect | Pulsar (WIP) | |
| [BigQuery Usage](openmetadata/connectors/bigquery-usage.md) | [Superset](openmetadata/connectors/superset.md) | Glue | | |
| [Elasticsearch](openmetadata/connectors/elastic-search.md) | [Tableau](openmetadata/connectors/tableau.md) | | | |
| [Glue Catalog](connectors/glue-catalog.md) | | | | |
| [Hive](openmetadata/connectors/hive.md) | | | | |
| [MariaDB](connectors/mariadb.md) | | | | |
| [MSSQL](openmetadata/connectors/mssql.md) | | | | |
| [MySQL](openmetadata/connectors/mysql.md) | | | | |
| [Oracle](openmetadata/connectors/oracle.md) | | | | |
| [Postgres](openmetadata/connectors/postgres.md) | | | | |
| [Presto](openmetadata/connectors/presto.md) | | | | |
| [Redshift](openmetadata/connectors/redshift.md) | | | | |
| [Redshift Usage](openmetadata/connectors/redshift-usage.md) | | | | |
| [Salesforce](openmetadata/connectors/salesforce.md) | | | | |
| [Snowflake](openmetadata/connectors/snowflake.md) | | | | |
| [Snowflake Usage](openmetadata/connectors/snowflake-usage.md) | | | | |
| [Trino](openmetadata/connectors/trino.md) | | | | |
| [Vertica](openmetadata/connectors/vertica.md) | | | | |
| Database | Dashboard | Pipeline | Messaging | Modeling | ML Models |
| ------------------------------------------------------------- | ----------------------------------------------- | ---------------------------------------------- | ----------------------------------------- | ------------------------ | ------------------------------ |
| [Athena](openmetadata/connectors/athena.md) | [Looker](openmetadata/connectors/looker.md) | [Airflow](install/metadata-ingestion/airflow/) | [Kafka](openmetadata/connectors/kafka.md) | [DBT](connectors/dbt.md) | [MlFlow](connectors/mlflow.md) |
| [BigQuery](openmetadata/connectors/bigquery.md) | [Redash](openmetadata/connectors/redash.md) | Prefect | Pulsar (WIP) | | |
| [BigQuery Usage](openmetadata/connectors/bigquery-usage.md) | [Superset](openmetadata/connectors/superset.md) | Glue | | | |
| [Elasticsearch](openmetadata/connectors/elastic-search.md) | [Tableau](openmetadata/connectors/tableau.md) | | | | |
| [Glue Catalog](connectors/glue-catalog.md) | | | | | |
| [Hive](openmetadata/connectors/hive.md) | | | | | |
| [MariaDB](connectors/mariadb.md) | | | | | |
| [MSSQL](openmetadata/connectors/mssql.md) | | | | | |
| [MySQL](openmetadata/connectors/mysql.md) | | | | | |
| [Oracle](openmetadata/connectors/oracle.md) | | | | | |
| [Postgres](openmetadata/connectors/postgres.md) | | | | | |
| [Presto](openmetadata/connectors/presto.md) | | | | | |
| [Redshift](openmetadata/connectors/redshift.md) | | | | | |
| [Redshift Usage](openmetadata/connectors/redshift-usage.md) | | | | | |
| [Salesforce](openmetadata/connectors/salesforce.md) | | | | | |
| [Snowflake](openmetadata/connectors/snowflake.md) | | | | | |
| [Snowflake Usage](openmetadata/connectors/snowflake-usage.md) | | | | | |
| [Trino](openmetadata/connectors/trino.md) | | | | | |
| [Vertica](openmetadata/connectors/vertica.md) | | | | | |
## OpenMetadata Components


@ -17,6 +17,7 @@
* [Kafka](openmetadata/connectors/kafka.md)
* [Looker](openmetadata/connectors/looker.md)
* [MariaDB](connectors/mariadb.md)
* [MlFlow](connectors/mlflow.md)
* [MsSQL](openmetadata/connectors/mssql.md)
* [MySQL](openmetadata/connectors/mysql.md)
* [Oracle](openmetadata/connectors/oracle.md)

docs/connectors/mlflow.md (new file, 79 lines)

@ -0,0 +1,79 @@
---
description: This guide will help you install the MlFlow connector and run it manually
---
# MlFlow
{% hint style="info" %}
**Prerequisites**
OpenMetadata is built using Java, DropWizard, Jetty, and MySQL.
1. Python 3.7 or above
{% endhint %}
### Install from PyPI
{% tabs %}
{% tab title="Install Using PyPI" %}
```bash
pip install 'openmetadata-ingestion[mlflow]'
```
{% endtab %}
{% endtabs %}
### Run Manually
```bash
metadata ingest -c ./examples/workflows/mlflow.json
```
### Configuration
{% code title="mlflow.json" %}
```javascript
{
  "source": {
    "type": "mlflow",
    "config": {
      "tracking_uri": "http://localhost:5000",
      "registry_uri": "mysql+pymysql://mlflow:password@localhost:3307/experiments"
    }
  ...
```
{% endcode %}
1. **tracking\_uri** - URI of the MlFlow server containing the tracking information of runs and experiments ([docs](https://mlflow.org/docs/latest/tracking.html#)).
2. **registry\_uri** - Backend store where the Tracking Server stores experiment and run metadata ([docs](https://mlflow.org/docs/latest/tracking.html#id14)); see the sketch below for how these URIs can map to a local MlFlow server.
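For local testing, this is a hedged sketch of starting an MlFlow tracking server whose backend store matches the `registry_uri` above; the artifact root and host values are illustrative and are not part of the connector configuration.

```bash
# Illustrative: serve MlFlow on port 5000, backed by the same MySQL store
# referenced by registry_uri above (assumes MlFlow is installed locally)
mlflow server \
  --backend-store-uri mysql+pymysql://mlflow:password@localhost:3307/experiments \
  --default-artifact-root ./mlruns \
  --host 0.0.0.0 \
  --port 5000
```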
## Publish to OpenMetadata
Below is the configuration to publish MlFlow data into the OpenMetadata service.
Optionally add the `pii` processor, and add the `metadata-rest` sink along with the `metadata-server` config.
{% code title="mlflow.json" %}
```javascript
{
  "source": {
    "type": "mlflow",
    "config": {
      "tracking_uri": "http://localhost:5000",
      "registry_uri": "mysql+pymysql://mlflow:password@localhost:3307/experiments"
    }
  },
  "sink": {
    "type": "metadata-rest",
    "config": {}
  },
  "metadata_server": {
    "type": "metadata-server",
    "config": {
      "api_endpoint": "http://localhost:8585/api",
      "auth_provider_type": "no-auth"
    }
  }
}
```
{% endcode %}
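With the sink and `metadata_server` sections in place, the workflow is run with the same command shown earlier, assuming the JSON above is saved as `./examples/workflows/mlflow.json`:

```bash
# Re-run the ingestion with the publish configuration
metadata ingest -c ./examples/workflows/mlflow.json
```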


@ -45,3 +45,7 @@ OpenMetadata supports connectors to some popular services. We will continue as a
* Airflow
* Prefect
* Glue
**ML Services**
* [MlFlow](../../../connectors/mlflow.md)


@ -100,3 +100,127 @@ This JSON does not store any Relationship. E.g., a User owning a Dashboard is a
This separation helps us decouple concerns. We can process related entities independently and validate at runtime what information needs to be updated and/or retrieved. For example, if we delete a Dashboard owned by a User, we will clean up the corresponding row in `entity_relationship`, but that won't alter the User's information.
A trickier example would be trying to delete a Database that contains Tables. In this case, the process checks whether the Database Entity is empty; since it is not, the removal cannot continue.
### Change Events Store
You might have already noticed that all Entity definitions have a `changeDescription` field. It is defined as _"Change that leads to this version of the entity"_. If we inspect the properties of `changeDescription` further, we can see that it stores the differences between the current and previous versions of an Entity.
This gives visibility into the most recent update of each Entity instance. However, there might be times when this level of tracking is not enough.
One of the greatest features of OpenMetadata is the ability to track **all** **Entity versions**. Each operation that leads to a change (`PUT`, `POST`, `PATCH`) will generate a trace that is going to be stored in the table `change_event`.
Using the API to get event data, or directly exploring the different versions of each entity, gives great debugging power to both data consumers and producers.
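As a rough illustration of using the API to get events data, the request below assumes a local server and an `/api/v1/events` endpoint; the query parameter names and the `table` entity type are illustrative assumptions rather than a definitive reference.

```bash
# Illustrative: fetch change events for tables updated since a given timestamp (ms)
curl -s "http://localhost:8585/api/v1/events?entityUpdated=table&timestamp=1639000000000"
```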
## API Component Diagram
Now that we have a clear picture of the main pieces and their roles, we will analyze the logical flow of `POST` and `PUT` calls to the API. The main goal of this section is to get familiar with the code organisation and its main steps.
{% hint style="info" %}
To get the most out of this section, it is recommended to follow the source code as well, from the Entity JSON you'd like to use as an example to its `Resource` and `Repository` implementations.
{% endhint %}
### Create a new Entity - POST
We will start with the simplest scenario: creating a new Entity via a `POST` call. This is a great first point to review, as part of the logic and methods is reused during updates.
![Component Diagram of a POST call to the API](../../.gitbook/assets/system-context-diagram-API-component-POST-diagram.drawio.png)
#### Create
As we already know, the recipient of the HTTP call will be the `EntityResource`. In there, we have the `create` function with the `@POST` **annotation** and the description of the API endpoint and expected schemas.
The role of this first component is to receive the call and validate the request body and headers, but the real implementation happens in the `EntityRepository`, which we already described as the **DAO**.
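To make this concrete, here is a rough sketch of the external call that triggers this flow, assuming a local server at `http://localhost:8585` with no auth; the `mlmodels` collection path and the payload values are illustrative (the `algorithm` field is included because, as noted later in this section, the MlModel schema requires it).

```bash
# Illustrative POST creating a new MlModel entity on a local server
curl -s -X POST "http://localhost:8585/api/v1/mlmodels" \
  -H "Content-Type: application/json" \
  -d '{"name": "sales_forecast", "algorithm": "RandomForestRegressor"}'
```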
For the `POST` operation, the internal flow is rather simple and is composed of two steps:
1. **Prepare**: Validates the Entity data and computes some attributes on the server side.
2. **Store**: Saves the Entity JSON and its Relationships to the backend DB.
#### Prepare
This method is used for **validating** an entity to be created during `POST`, `PUT`, and `PATCH` operations and **preparing** the entity with all the required attributes and relationships.
Here we handle, for example, the process of setting up the FQDN of an Entity based on its hierarchy. While all Entities require an FQDN, this is not an attribute we expect to receive in a request.
Moreover, this step checks that the received attributes are valid, e.g., that we have a valid `User` as the `owner` or a valid `Database` for a `Table`.
#### Store
The storing process is divided into two different steps (as we have two tables holding the information).
1. We strip any `href` attributes (such as `owner` or `tags`) from the validated Entity in order to store a JSON document with only the Entity's intrinsic values.
2. We then store the graph representation of the Relationships for the attributes omitted above.
At the end of these calls, we end up with a validated Entity holding all the required attributes, stored accordingly. We can then return the created Entity to the caller.
### Create or Update an Entity - PUT
Let's now build on top of what we learned during the `POST` discussion, expanding the example to the handling of a `PUT` request.
![Component Diagram of a PUT call to the API](../../.gitbook/assets/system-context-diagram-API-component-PUT-diagram.drawio.png)
The first steps are fairly similar:
1. We have a function in our `Resource` annotated as `@PUT` and handling headers, auth and schemas.
2. The `Resource` then calls the DAO at the `Repository`, bootstrapping the data-related logic.
3. We validate the Entity and compute some attributes during the `prepare` step.
After processing and validating the Entity request, we check whether the Entity instance has already been stored, querying the backend database by its FQDN. If it has not, we proceed with the same logic as the `POST` operation (simple creation). Otherwise, we need to validate the updated fields.
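For reference, the external call driving this flow is a `PUT` against the same collection used in the `POST` example; the payload below is illustrative. If `sales_forecast` does not exist yet, the server follows the creation path; if it does, the update path described next kicks in.

```bash
# Illustrative PUT (create-or-update) of the same MlModel
curl -s -X PUT "http://localhost:8585/api/v1/mlmodels" \
  -H "Content-Type: application/json" \
  -d '{"name": "sales_forecast", "algorithm": "RandomForestRegressor", "description": "Weekly sales forecast model"}'
```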
#### Set Fields
Not all fields can be updated for a given Entity instance. For example, the `id` and `name` remain immutable once the instance is created, and the same holds for the `Database` of a `Table`.
The list of fields that can change is defined in each Entity's `Repository`, and we only allow changes to those attributes that can naturally evolve throughout the **lifecycle** of the object.
At this step, we set on the Entity the fields that are either required by the JSON schema definition (e.g., the `algorithm` for an `MlModel`) or, in the case of a `GET` operation, requested as `GET <url>/api/v1/<collectionName>/<id>?fields=field1,field2,...`.
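Following the URL pattern above, a request asking only for selected fields of an `MlModel` could look like the sketch below; `<id>`, `owner`, and `dashboard` are placeholders.

```bash
# Illustrative GET requesting only the owner and dashboard fields
curl -s "http://localhost:8585/api/v1/mlmodels/<id>?fields=owner,dashboard"
```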
#### Update
In the `EntityRepository` there is an abstract implementation of the `EntityUpdater` interface, which is in charge of defining the generic update logic flow common for all the Entities.
The main steps handled in the `update` calls are:
1. Update the Entity **generic** fields, such as the description or the owner.
2. Run Entity **specific** updates, which are implemented by each Entity's `EntityUpdater` extension.
3. **Store** the updated Entity JSON doc to the Entity Table in MySQL.
#### Entity Specific Updates
Each Entity has a set of attributes that define it. These attributes have very specific behaviour, so the implementation of the `update` logic falls to each Entity's `Repository`.
For example, we can update the `Columns` of a `Table`, or the `Dashboard` holding the performance metrics of an `MlModel`. These two changes are treated differently in terms of how the Entity performs the update internally, how the Entity **version** is affected, and the impact on the **Relationship** data.
For the sake of discussion, we'll follow a couple of `update` scenarios.
#### Example 1 - Updating Columns of a Table
When updating `Columns`, we need to compare the existing set of columns in the original Entity vs. the incoming columns of the `PUT` request.
If we receive an existing column, we might need to update its `description` or `tags`. This is considered a **minor** change, so the version of the Entity is bumped by `0.1`, following the software release specification model.
However, what happens if a stored column is not received in the updated instance? That would mean the column has been deleted. This type of change could break integrations built on top of the Table's data, so we mark this scenario as a **major** update, and the version of the Entity increases by `1.0`.
Checking the Change Events or visiting the Entity history will easily show us the evolution of an Entity instance, which will be immensely valuable when debugging data issues.
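As a hedged sketch of visiting the Entity history, the call below assumes a versions endpoint under each collection, mirroring the URL convention used earlier; `<id>` is a placeholder.

```bash
# Illustrative: list all stored versions of a Table instance
curl -s "http://localhost:8585/api/v1/tables/<id>/versions"
```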
#### Example 2 - Updating the Dashboard of an ML Model
One of the attributes of an `MlModel` is an `EntityReference` to a `Dashboard` holding the evolution of its performance metrics.
As this attribute is a reference to another existing Entity, this data is not directly stored in the `MlModel` JSON doc, but rather as a Relationship graph, as we have been discussing previously. Therefore, during the `update` step we will need to:
1. Insert the relationship if the original Entity had no Dashboard set,
2. Delete the relationship if the Dashboard has been removed, or
3. Update the relationship if we now point to a different Dashboard.
Note how during the `POST` operation we always needed to call the `storeRelationship` function, as it was the first time we were storing the instance's information. During an update, we only modify the Relationship data if the Entity's specific attributes require it.
### Handling Events
Throughout these discussions and examples, we have shown how the backend API handles HTTP requests and what the Entities' data lifecycle looks like. We have not only focused on the JSON docs and Relationships, but have also talked about Change Events from time to time.
Moreover, in the _API Container Diagram_ we drew a Container representing the Table holding the Change Event data, yet we have not found any Component accessing it.
This is because the API server is powered by **Jetty**, which means that luckily we do not need to make those calls ourselves! By defining a `ChangeEventHandler` and registering it during the creation of the server, this postprocessing of the calls happens transparently.
Our `ChangeEventHandler` will check if the Entity has been Created, Updated or Deleted and will store the appropriate `ChangeEvent` data from our response to the backend DB.


@ -36,18 +36,19 @@ metadata ingest -c ./examples/workflows/looker.json
"source": {
"type": "looker",
"config": {
"username": "username",
"password": "password",
"username": "Looker Client ID",
"password": "Looker Client Secret",
"url": "http://localhost",
"service_name": "looker",
"service_type": "Looker",
"service_type": "Looker"
}
},
}
...
```
{% endcode %}
1. **username** - pass the Looker username.
2. **password** - the password for the Looker username.
1. **username** - pass the Looker Client ID.
2. **password** - pass the Looker Client Secret.
3. **url** - Looker connector URL
4. **service\_name** - Service Name for this Looker cluster. If you added the Looker cluster through OpenMetadata UI, make sure the service name matches the same.
5. **filter\_pattern** - It contains includes, excludes options to choose which pattern of datasets you want to ingest into OpenMetadata.
@ -61,15 +62,14 @@ Add Optionally`pii` processor and `metadata-rest` sink along with `metadata-serv
{% code title="looker.json" %}
```javascript
{
{
"source": {
"type": "looker",
"config": {
"username": "username",
"password": "password",
"username": "Looker Client ID",
"password": "Looker Client Secret",
"url": "http://localhost",
"service_name": "looker",
"service_type": "Looker",
"service_type": "Looker"
}
},
"sink": {


@ -10,6 +10,7 @@ description: This guide will help install Oracle connector and run manually
OpenMetadata is built using Java, DropWizard, Jetty, and MySQL.
1. Python 3.7 or above
2. Oracle Client Libraries (ref: [download Oracle Client Libraries](https://help.ubuntu.com/community/Oracle%20Instant%20Client))
{% endhint %}
### Install from PyPI
@ -27,24 +28,27 @@ pip install 'openmetadata-ingestion[oracle]'
{% code title="oracle.json" %}
```javascript
{
"source": {
"type": "oracle",
"config": {
"host_port":"host_port",
"username": "openmetadata_user",
"password": "openmetadata_password",
"service_name": "local_oracle",
"service_type": "Oracle"
"source": {
"type": "oracle",
"config": {
"host_port":"host:1521",
"username": "pdbadmin",
"password": "password",
"service_name": "local_oracle",
"service_type": "Oracle",
"oracle_service_name": "ORCLPDB1"
}
},
...
...
```
{% endcode %}
1. **username** - pass the Oracle username. We recommend creating a user with read-only permissions to all the databases in your Oracle installation
2. **password** - password for the username
3. **service\_name** - Service Name for this Oracle cluster. If you added Oracle cluster through OpenMetadata UI, make sure the service name matches the same.
4. **filter\_pattern** - It contains includes, excludes options to choose which pattern of datasets you want to ingest into OpenMetadata
3. **host\_port** - Host and port where the Oracle instance is running (e.g., `host:1521`)
4. **service\_name** - Service Name for this Oracle cluster. If you added Oracle cluster through OpenMetadata UI, make sure the service name matches the same.
5. **oracle\_service\_name** - Oracle Service Name (TNS alias)
6. **filter\_pattern** - It contains includes, excludes options to choose which pattern of datasets you want to ingest into OpenMetadata
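Once the configuration above is saved locally, running the connector follows the same pattern as the other connectors; the `oracle.json` path below is illustrative.

```bash
# Assumes the configuration above is saved as ./examples/workflows/oracle.json
metadata ingest -c ./examples/workflows/oracle.json
```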
## Publish to OpenMetadata
@ -58,29 +62,23 @@ Add `metadata-rest` sink along with `metadata-server` config
"source": {
"type": "oracle",
"config": {
"host_port":"host_port",
"username": "openmetadata_user",
"password": "openmetadata_password",
"host_port": "host:1521",
"username": "pdbadmin",
"password": "password",
"service_name": "local_oracle",
"service_type": "Oracle"
}
},
"processor": {
"type": "pii",
"config": {
"api_endpoint": "http://localhost:8585/api"
"service_type": "Oracle",
"oracle_service_name": "ORCLPDB1"
}
},
"sink": {
"type": "metadata-rest",
"config": {
}
"config": {}
},
"metadata_server": {
"type": "metadata-server",
"config": {
"api_endpoint": "http://localhost:8585/api",
"auth_provider_type": "no-auth"
"auth_provider_type": "no-auth"
}
}
}


@ -60,10 +60,61 @@ metadata ingest -c ./examples/workflows/snowflake.json
3. **service\_name** - Service Name for this Snowflake cluster. If you added the Snowflake cluster through OpenMetadata UI, make sure the service name matches the same.
4. **filter\_pattern** - It contains includes, excludes options to choose which pattern of datasets you want to ingest into OpenMetadata.
5. **database -** Database name from where data is to be fetched.
6. **data\_profiler\_enabled** - Enable data-profiling (Optional). It will provide you the newly ingested data.
6. **data\_profiler\_enabled** - Enable data-profiling (Optional). It will provide you with the newly ingested data.
7. **data\_profiler\_offset** - Specify offset.
8. **data\_profiler\_limit** - Specify limit.
### SSO Configuration
{% hint style="info" %}
Snowflake SQLAlchemy supports Single Sign-On with and without the password parameter.
Please refer to [this link](https://github.com/snowflakedb/snowflake-sqlalchemy/issues/115) for more information.
{% endhint %}
#### SSO - with username and password
{% code title="snowflake.json" %}
```javascript
{
  "source": {
    "type": "snowflake",
    "config": {
      "host_port": "account.region.service.snowflakecomputing.com",
      "username": "OKTA_USER",
      "password": "OKTA_PASSWORD",
      "account": "account",
      "service_name": "snowflake",
      "options": {
        "authenticator": "https://something.okta.com/"
      }
    }
  },
  ...
```
{% endcode %}
**SSO - without password**
{% code title="snowflake.json" %}
```javascript
{
  "source": {
    "type": "snowflake",
    "config": {
      "host_port": "account.region.service.snowflakecomputing.com",
      "username": "email",
      "account": "account",
      "service_name": "snowflake",
      "options": {
        "authenticator": "externalbrowser"
      }
    }
  },
  ...
```
{% endcode %}
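Either SSO variant is run with the same workflow command used earlier in this guide, assuming the chosen snippet is saved as `snowflake.json`:

```bash
# Run the Snowflake ingestion with the SSO configuration
metadata ingest -c ./examples/workflows/snowflake.json
```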
### Publish to OpenMetadata
Below is the configuration to publish Snowflake data into the OpenMetadata service.


@ -83,23 +83,69 @@ If you would like to prioritize any feature or would like to add a new feature t
### Airflow APIs
* Airflow APIs to deploy DAGS and manage them
* UI integration to deploy ingestion workflows&#x20;
* UI integration to deploy ingestion workflows
### Connectors
* AWS Glue
* DBT
* MariaDB
## 0.7 Release - Dec 15th, 2021
### Support for User Collaboration
#### Theme: Data Collaboration - Activity Feeds
* Allow users to ask questions, suggest changes, request new features for data assets
* Activity feeds for User and Data assets
* Tracking activity feeds as tasks
### UI - Activity Feed, Improved UX for Search
* Users will have access to Activity Feed of all the changes to the Metadata
* New and Improved UX for Search and Landing page
### Lineage new features
* Allow users to add lineage information manually for table and column levels
* Tier propagation to upstream datasets using lineage
* Propagating column level tags and descriptions using lineage (Work in progress)
### Support for Table Location
* Extract Location information from Glue, Redshift
* Show Location details on the Table Page
### Elasticsearch - Improvements
* Support SSL-enabled Elasticsearch (including self-signed certs)
* New entities will be indexed into Elasticsearch directly
### Connectors
* Metabase
* Apache Druid
* Glue Improvements
* MSSQL - SSL support
* Apache Atlas Import connector
* Amundsen Import connector
### Other features
* Metadata Change Event integration into Slack and framework for integration into other services such as Kafka or other Notification frameworks
* Delta Lake support, Databricks, Iceberg
## 0.8 Release - Jan 15th, 2022
### Data Quality
* Data Quality Tests support with JSON Schemas and APIs
* UI Integration to enable users to write tests and run them on Airflow
* Store the test results and provide notifications via eventing APIs
* Provide integration of DBT tests into OpenMetadata
### Access Control Policies
* Design of Access Control Policies
* Provide role-based access control with community feedback
### Eventing Webhook
* Register webhooks to get metadata event notifications
* Metadata Change Event integration into Slack and framework for integration into other services such as Kafka or other Notification frameworks
### Connectors
* Delta Lake
* Iceberg
* PowerBI
* Azure SQL