Add Athena, Lineage and Usage docs & Fix Athena UI Lineage and Usage workflows (#11148)

* Athena docs

* Lineage and Usage docs

* Missing section close

* Fix Athena Model
Pere Miquel Brull 2023-04-20 06:31:53 +02:00 committed by GitHub
parent 3cf6442459
commit 91cd1491ee
9 changed files with 428 additions and 19 deletions

View File

@ -13,7 +13,7 @@
Athena Models
"""
from datetime import datetime
from typing import List
from typing import List, Optional
from pydantic import BaseModel
@ -70,7 +70,7 @@ class AthenaQueryExecution(BaseModel):
Statistics: Statistics
WorkGroup: str
EngineVersion: EngineVersion
SubstatementType: str
SubstatementType: Optional[str]
class AthenaQueryExecutionList(BaseModel):
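Making `SubstatementType` optional means the model no longer fails when Athena's `GetQueryExecution` response omits that field. A minimal sketch of the difference, assuming pydantic v1 semantics (where an `Optional` field without a default falls back to `None`); the class names and payload are hypothetical:

```python
from typing import Optional

from pydantic import BaseModel


class StrictExecution(BaseModel):
    # Before the fix: the field is required, so a missing key raises a ValidationError.
    SubstatementType: str


class RelaxedExecution(BaseModel):
    # After the fix: the field may be absent and falls back to None (pydantic v1).
    SubstatementType: Optional[str]


payload = {}  # hypothetical response payload without SubstatementType

print(RelaxedExecution(**payload).SubstatementType)  # None

try:
    StrictExecution(**payload)
except Exception as err:  # pydantic.ValidationError: field required
    print(f"Strict model rejects the payload: {err}")
```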

View File

@ -44,6 +44,113 @@ To deploy OpenMetadata, check the Deployment guides.
To run the Ingestion via the UI you'll need to use the OpenMetadata Ingestion Container, which comes shipped with
custom Airflow plugins to handle the workflow deployment.
The Athena connector ingests metadata through JDBC connections.
{% note %}
According to AWS's official [documentation](https://docs.aws.amazon.com/athena/latest/ug/policy-actions.html):
*If you are using the JDBC or ODBC driver, ensure that the IAM
permissions policy includes all of the actions listed in [AWS managed policy: AWSQuicksightAthenaAccess](https://docs.aws.amazon.com/athena/latest/ug/managed-policies.html#awsquicksightathenaaccess-managed-policy).*
{% /note %}
This policy groups the following permissions:
- `athena` Allows the principal to run queries on Athena resources.
- `glue` Allows principals access to AWS Glue databases, tables, and partitions. This is required so that the principal can use the AWS Glue Data Catalog with Athena.
- `s3` Allows the principal to write and read query results from Amazon S3.
- `lakeformation` Allows principals to request temporary credentials to access data in a data lake location that is registered with Lake Formation.
The policy is defined as follows:
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:BatchGetQueryExecution",
"athena:GetQueryExecution",
"athena:GetQueryResults",
"athena:GetQueryResultsStream",
"athena:ListQueryExecutions",
"athena:StartQueryExecution",
"athena:StopQueryExecution",
"athena:ListWorkGroups",
"athena:ListEngineVersions",
"athena:GetWorkGroup",
"athena:GetDataCatalog",
"athena:GetDatabase",
"athena:GetTableMetadata",
"athena:ListDataCatalogs",
"athena:ListDatabases",
"athena:ListTableMetadata"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"glue:CreateDatabase",
"glue:DeleteDatabase",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:UpdateDatabase",
"glue:CreateTable",
"glue:DeleteTable",
"glue:BatchDeleteTable",
"glue:UpdateTable",
"glue:GetTable",
"glue:GetTables",
"glue:BatchCreatePartition",
"glue:CreatePartition",
"glue:DeletePartition",
"glue:BatchDeletePartition",
"glue:UpdatePartition",
"glue:GetPartition",
"glue:GetPartitions",
"glue:BatchGetPartition"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:GetObject",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:ListMultipartUploadParts",
"s3:AbortMultipartUpload",
"s3:CreateBucket",
"s3:PutObject",
"s3:PutBucketPublicAccessBlock"
],
"Resource": [
"arn:aws:s3:::aws-athena-query-results-*"
]
},
{
"Effect": "Allow",
"Action": [
"lakeformation:GetDataAccess"
],
"Resource": [
"*"
]
}
]
}
```
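As a hedged sketch (not an official setup step), a policy document like the one above can be attached as an inline IAM policy with boto3; the user name, policy name, and trimmed-down policy below are hypothetical placeholders:

```python
import json

import boto3

# Placeholder policy document: in practice, use the full JSON policy shown above.
athena_access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["athena:StartQueryExecution", "athena:GetQueryExecution"],
            "Resource": ["*"],
        }
    ],
}

iam = boto3.client("iam")

# Attach the policy inline to the IAM user used by the Athena connector.
iam.put_user_policy(
    UserName="openmetadata-athena",          # hypothetical user name
    PolicyName="AthenaOpenMetadataAccess",   # hypothetical policy name
    PolicyDocument=json.dumps(athena_access_policy),
)
```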
You can find further information on the Athena connector in the [docs](https://docs.open-metadata.org/connectors/database/athena).
### Python Requirements
To run the Athena ingestion, you will need to install:

View File

@ -44,6 +44,113 @@ To deploy OpenMetadata, check the Deployment guides.
To run the Ingestion via the UI you'll need to use the OpenMetadata Ingestion Container, which comes shipped with
custom Airflow plugins to handle the workflow deployment.
The Athena connector ingests metadata through JDBC connections.
{% note %}
According to AWS's official [documentation](https://docs.aws.amazon.com/athena/latest/ug/policy-actions.html):
*If you are using the JDBC or ODBC driver, ensure that the IAM
permissions policy includes all of the actions listed in [AWS managed policy: AWSQuicksightAthenaAccess](https://docs.aws.amazon.com/athena/latest/ug/managed-policies.html#awsquicksightathenaaccess-managed-policy).*
{% /note %}
This policy groups the following permissions:
- `athena` Allows the principal to run queries on Athena resources.
- `glue` Allows principals access to AWS Glue databases, tables, and partitions. This is required so that the principal can use the AWS Glue Data Catalog with Athena.
- `s3` Allows the principal to write and read query results from Amazon S3.
- `lakeformation` Allows principals to request temporary credentials to access data in a data lake location that is registered with Lake Formation.
The policy is defined as follows:
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:BatchGetQueryExecution",
"athena:GetQueryExecution",
"athena:GetQueryResults",
"athena:GetQueryResultsStream",
"athena:ListQueryExecutions",
"athena:StartQueryExecution",
"athena:StopQueryExecution",
"athena:ListWorkGroups",
"athena:ListEngineVersions",
"athena:GetWorkGroup",
"athena:GetDataCatalog",
"athena:GetDatabase",
"athena:GetTableMetadata",
"athena:ListDataCatalogs",
"athena:ListDatabases",
"athena:ListTableMetadata"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"glue:CreateDatabase",
"glue:DeleteDatabase",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:UpdateDatabase",
"glue:CreateTable",
"glue:DeleteTable",
"glue:BatchDeleteTable",
"glue:UpdateTable",
"glue:GetTable",
"glue:GetTables",
"glue:BatchCreatePartition",
"glue:CreatePartition",
"glue:DeletePartition",
"glue:BatchDeletePartition",
"glue:UpdatePartition",
"glue:GetPartition",
"glue:GetPartitions",
"glue:BatchGetPartition"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:GetObject",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:ListMultipartUploadParts",
"s3:AbortMultipartUpload",
"s3:CreateBucket",
"s3:PutObject",
"s3:PutBucketPublicAccessBlock"
],
"Resource": [
"arn:aws:s3:::aws-athena-query-results-*"
]
},
{
"Effect": "Allow",
"Action": [
"lakeformation:GetDataAccess"
],
"Resource": [
"*"
]
}
]
}
```
You can find further information on the Athena connector in the [docs](https://docs.open-metadata.org/connectors/database/athena).
### Python Requirements
To run the Athena ingestion, you will need to install:

View File

@ -67,6 +67,113 @@ To deploy OpenMetadata, check the Deployment guides.
To run the Ingestion via the UI you'll need to use the OpenMetadata Ingestion Container, which comes shipped with
custom Airflow plugins to handle the workflow deployment.
The Athena connector ingests metadata through JDBC connections.
{% note %}
According to AWS's official [documentation](https://docs.aws.amazon.com/athena/latest/ug/policy-actions.html):
*If you are using the JDBC or ODBC driver, ensure that the IAM
permissions policy includes all of the actions listed in [AWS managed policy: AWSQuicksightAthenaAccess](https://docs.aws.amazon.com/athena/latest/ug/managed-policies.html#awsquicksightathenaaccess-managed-policy).*
{% /note %}
This policy groups the following permissions:
- `athena` Allows the principal to run queries on Athena resources.
- `glue` Allows principals access to AWS Glue databases, tables, and partitions. This is required so that the principal can use the AWS Glue Data Catalog with Athena.
- `s3` Allows the principal to write and read query results from Amazon S3.
- `lakeformation` Allows principals to request temporary credentials to access data in a data lake location that is registered with Lake Formation.
The policy is defined as follows:
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:BatchGetQueryExecution",
"athena:GetQueryExecution",
"athena:GetQueryResults",
"athena:GetQueryResultsStream",
"athena:ListQueryExecutions",
"athena:StartQueryExecution",
"athena:StopQueryExecution",
"athena:ListWorkGroups",
"athena:ListEngineVersions",
"athena:GetWorkGroup",
"athena:GetDataCatalog",
"athena:GetDatabase",
"athena:GetTableMetadata",
"athena:ListDataCatalogs",
"athena:ListDatabases",
"athena:ListTableMetadata"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"glue:CreateDatabase",
"glue:DeleteDatabase",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:UpdateDatabase",
"glue:CreateTable",
"glue:DeleteTable",
"glue:BatchDeleteTable",
"glue:UpdateTable",
"glue:GetTable",
"glue:GetTables",
"glue:BatchCreatePartition",
"glue:CreatePartition",
"glue:DeletePartition",
"glue:BatchDeletePartition",
"glue:UpdatePartition",
"glue:GetPartition",
"glue:GetPartitions",
"glue:BatchGetPartition"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:GetObject",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:ListMultipartUploadParts",
"s3:AbortMultipartUpload",
"s3:CreateBucket",
"s3:PutObject",
"s3:PutBucketPublicAccessBlock"
],
"Resource": [
"arn:aws:s3:::aws-athena-query-results-*"
]
},
{
"Effect": "Allow",
"Action": [
"lakeformation:GetDataAccess"
],
"Resource": [
"*"
]
}
]
}
```
You can find further information on the Athena connector in the [docs](https://docs.open-metadata.org/connectors/database/athena).
## Metadata Ingestion
{% stepsContainer %}

View File

@ -54,10 +54,12 @@ the packages need to be present in the Airflow instances.
You will need to install:
```python
pip3 install "openmetadata-ingestion[<connector-name>]"
pip3 install "openmetadata-ingestion[<connector-name>]==x.y.z"
```
And then run the DAG as explained in each [Connector](/connectors).
And then run the DAG as explained in each [Connector](/connectors), where `x.y.z` matches the version of your
OpenMetadata server. For example, if you are on version 1.0.0, you can install `openmetadata-ingestion`
at any `1.0.0.*` version, e.g., `1.0.0.0`, `1.0.0.1`, etc., but not `1.0.1.x`.
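A small sketch, using the `packaging` library, showing which client builds satisfy that rule for the example server version:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# Server on 1.0.0: any 1.0.0.* build of the client is compatible.
compatible = SpecifierSet("==1.0.0.*")

for candidate in ("1.0.0.0", "1.0.0.1", "1.0.1.0"):
    print(candidate, "ok" if Version(candidate) in compatible else "not compatible")
# 1.0.0.0 ok, 1.0.0.1 ok, 1.0.1.0 not compatible
```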
### Airflow APIs
@ -85,13 +87,17 @@ The goal of this module is to add some HTTP endpoints that the UI calls for depl
The first step can be achieved by running:
```python
pip3 install "openmetadata-managed-apis"
pip3 install "openmetadata-managed-apis==x.y.z"
```
Then, check the Connector Modules guide above to learn how to install the `openmetadata-ingestion` package with the
necessary plugins. They are necessary because even if we install the APIs, the Airflow instance needs to have the
required libraries to connect to each source.
Here, the same versioning logic applies: `x.y.z` must match the version of your
OpenMetadata server. For example, if you are on version 1.0.0, you can install `openmetadata-managed-apis`
at any `1.0.0.*` version, e.g., `1.0.0.0`, `1.0.0.1`, etc., but not `1.0.1.x`.
### AIRFLOW_HOME
The APIs will look for the `AIRFLOW_HOME` environment variable to place the dynamically generated DAGs. Make
@ -192,6 +198,18 @@ Please update it accordingly.
## Ingestion Pipeline deployment issues
### Airflow APIs Not Found
Validate the installation by making sure that the Airflow host is reachable from the OpenMetadata server and that the
call to `/health` returns the expected response:
```bash
$ curl -XGET ${AIRFLOW_HOST}/api/v1/openmetadata/health
{"status": "healthy", "version": "x.y.z"}
```
Also, make sure that the version of your OpenMetadata server matches the `openmetadata-ingestion` client version installed in Airflow.
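As a quick sanity check, here is a sketch (assuming the `requests` library; the `OM_SERVER_VERSION` placeholder is hypothetical) that compares the client version reported by the health endpoint above with your server version:

```python
import os

import requests

# Fill in your OpenMetadata server version; the env var name is a placeholder.
server_version = os.environ.get("OM_SERVER_VERSION", "1.0.0")

airflow_host = os.environ["AIRFLOW_HOST"]  # e.g., http://localhost:8080

# The managed APIs health endpoint reports the version running in Airflow.
health = requests.get(f"{airflow_host}/api/v1/openmetadata/health", timeout=10).json()
client_version = health["version"]

# Only the x.y.z part needs to match; the client may carry an extra build digit.
if client_version.split(".")[:3] == server_version.split(".")[:3]:
    print(f"OK: client {client_version} matches server {server_version}")
else:
    print(f"Mismatch: client {client_version} vs server {server_version}")
```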
### GetServiceException: Could not get service from type XYZ
In this case, the OpenMetadata client running in the Airflow host had issues getting the service you are trying to

View File

@ -75,6 +75,16 @@
"supportsQueryComment": {
"title": "Supports Query Comment",
"$ref": "../connectionBasicType.json#/definitions/supportsQueryComment"
},
"supportsUsageExtraction": {
"description": "Supports Usage Extraction.",
"type": "boolean",
"default": true
},
"supportsLineageExtraction": {
"description": "Supports Lineage Extraction.",
"type": "boolean",
"default": true
}
},
"additionalProperties": false,

View File

@ -6,9 +6,12 @@ In this section, we provide guides and references to use the Athena connector.
The Athena connector ingests metadata through JDBC connections.
$$note
According to AWS's official [documentation](https://docs.aws.amazon.com/athena/latest/ug/policy-actions.html):
*If you are using the JDBC or ODBC driver, ensure that the IAM
permissions policy includes all of the actions listed in [AWS managed policy: AWSQuicksightAthenaAccess](https://docs.aws.amazon.com/athena/latest/ug/managed-policies.html#awsquicksightathenaaccess-managed-policy).*
$$
This policy groups the following permissions:
@ -116,14 +119,14 @@ SQLAlchemy driver scheme options.
$$
$$section
### Aws Config $(id="awsConfig")
### AWS Config $(id="awsConfig")
AWS credentials configs.
<!-- awsConfig to be updated -->
$$
$$section
### Aws Access Key Id $(id="awsAccessKeyId")
### AWS Access Key ID $(id="awsAccessKeyId")
When you interact with AWS, you specify your AWS security credentials to verify who you are and whether you have
permission to access the resources that you are requesting. AWS uses the security credentials to authenticate and
@ -139,19 +142,9 @@ You can find further information on how to manage your access keys [here](https:
$$
$$section
### Aws Secret Access Key $(id="awsSecretAccessKey")
### AWS Secret Access Key $(id="awsSecretAccessKey")
When you interact with AWS, you specify your AWS security credentials to verify who you are and whether you have
permission to access the resources that you are requesting. AWS uses the security credentials to authenticate and
authorize your requests ([docs](https://docs.aws.amazon.com/IAM/latest/UserGuide/security-creds.html)).
Access keys consist of two parts:
1. An access key ID (for example, `AKIAIOSFODNN7EXAMPLE`),
2. And a secret access key (for example, `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY`).
You must use both the access key ID and secret access key together to authenticate your requests.
You can find further information on how to manage your access keys [here](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html)
Secret access key (for example, `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY`).
$$
$$section

View File

@ -0,0 +1,32 @@
# Lineage
Lineage information for Database Services is extracted by fetching the queries run against the service
and parsing them.
Depending on the service, these queries are picked up from query history tables such as `query_history` in Snowflake,
or via API calls for Databricks or Athena.
$$note
Note that in order to find the lineage information, you will first need to have the tables ingested into OpenMetadata by running
the Metadata Workflow. We use the table names identified in the queries to match the information already present in OpenMetadata.
$$
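As a rough illustration of that parsing step, a SQL lineage parser such as the open-source `sqllineage` package (shown here only as a sketch, not necessarily the exact parser used internally) can recover the table names from a query:

```python
from sqllineage.runner import LineageRunner

# A query like the ones fetched from query history tables or the Athena API.
query = (
    "INSERT INTO analytics.daily_sales "
    "SELECT * FROM raw.sales WHERE ds = '2023-04-19'"
)

parsed = LineageRunner(query)

# These table names are what gets matched against the assets already in OpenMetadata.
print(parsed.source_tables())  # source tables, e.g. raw.sales
print(parsed.target_tables())  # target tables, e.g. analytics.daily_sales
```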
Depending on the number of queries run in the service, this can become an expensive operation. We offer two ways of
limiting the number of parsed queries:
## Query Log Duration $(id="queryLogDuration")
This is the value in **days** used to filter out past queries. For example, if today is `2023/04/19` and we set this value
to 2, we would list queries from `2023/04/17` to `2023/04/19` (inclusive).
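A minimal sketch of how that date window is derived, using the dates from the example:

```python
from datetime import date, timedelta

query_log_duration = 2       # Query Log Duration, in days
today = date(2023, 4, 19)

start = today - timedelta(days=query_log_duration)

print(f"Listing queries from {start} until {today} (inclusive)")
# Listing queries from 2023-04-17 until 2023-04-19 (inclusive)
```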
## Result Limit $(id="resultLimit")
Another way to limit the data is by setting a maximum number of records to process. This value is applied as follows:
```sql
SELECT xyz FROM query_history limit <resultLimit>
```
This value will take precedence over the `Query Log Duration`.

View File

@ -0,0 +1,35 @@
# Usage
Usage information for Database Services is extracted by fetching the queries run against the service
and parsing them.
Depending on the service, these queries are picked up from query history tables such as `query_history` in Snowflake,
or via API calls for Databricks or Athena.
$$note
Note that in order to find the usage information, you will first need to have the tables ingested into OpenMetadata by running
the Metadata Workflow. We use the table names identified in the queries to match the information already present in OpenMetadata.
$$
We use this information to compute asset relevancy, show the queries run against each table in the UI, and identify
frequently joined tables.
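As an informal sketch of the "frequently joined tables" idea (not the exact implementation), table pairs that appear together in parsed queries can simply be counted:

```python
from collections import Counter
from itertools import combinations

# Hypothetical parser output: the set of tables referenced by each query.
tables_per_query = [
    {"dim_address", "fact_sale"},
    {"dim_address", "fact_sale", "dim_customer"},
    {"fact_sale", "dim_customer"},
]

joined_pairs = Counter()
for tables in tables_per_query:
    for pair in combinations(sorted(tables), 2):
        joined_pairs[pair] += 1

# Most frequently co-occurring table pairs across the parsed queries.
print(joined_pairs.most_common(2))
```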
Depending on the number of queries run in the service, this can become an expensive operation. We offer two ways of
limiting the number of parsed queries:
## Query Log Duration $(id="queryLogDuration")
This is the value in **days** used to filter out past queries. For example, if today is `2023/04/19` and we set this value
to 2, we would list queries from `2023/04/17` to `2023/04/19` (inclusive).
## Result Limit $(id="resultLimit")
Another way to limit the data is by setting a maximum number of records to process. This value is applied as follows:
```sql
SELECT xyz FROM query_history limit <resultLimit>
```
This value will take precedence over the `Query Log Duration`.