Add Athena, Lineage and Usage docs & Fix Athena UI Lineage and Usage workflows (#11148)
* Athena docs
* Lineage and Usage docs
* Missing section close
* Fix Athena Model
This commit is contained in: parent 3cf6442459, commit 91cd1491ee

@@ -13,7 +13,7 @@
Athena Models
"""
from datetime import datetime
-from typing import List
+from typing import List, Optional

from pydantic import BaseModel

@@ -70,7 +70,7 @@ class AthenaQueryExecution(BaseModel):
    Statistics: Statistics
    WorkGroup: str
    EngineVersion: EngineVersion
-    SubstatementType: str
+    SubstatementType: Optional[str]


class AthenaQueryExecutionList(BaseModel):
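
The model fix above loosens `SubstatementType` so that responses without it still parse. A minimal sketch of the behavior, assuming Pydantic v1 semantics and trimming the model to the one field that changed:

```python
from typing import Optional

from pydantic import BaseModel


class AthenaQueryExecution(BaseModel):
    # Trimmed to the field relevant to the fix; the real model has more fields.
    SubstatementType: Optional[str]


# Athena can return query executions without a SubstatementType key.
# With `SubstatementType: str` this payload raised a validation error;
# with Optional[str] (Pydantic v1) it parses and the field defaults to None.
execution = AthenaQueryExecution.parse_obj({})
print(execution.SubstatementType)  # None
```
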
@@ -44,6 +44,113 @@ To deploy OpenMetadata, check the Deployment guides.
To run the Ingestion via the UI you'll need to use the OpenMetadata Ingestion Container, which comes shipped with
custom Airflow plugins to handle the workflow deployment.

The Athena connector ingests metadata through JDBC connections.

{% note %}

According to AWS's official [documentation](https://docs.aws.amazon.com/athena/latest/ug/policy-actions.html):

*If you are using the JDBC or ODBC driver, ensure that the IAM
permissions policy includes all of the actions listed in [AWS managed policy: AWSQuicksightAthenaAccess](https://docs.aws.amazon.com/athena/latest/ug/managed-policies.html#awsquicksightathenaaccess-managed-policy).*

{% /note %}

This policy groups the following permissions:

- `athena` – Allows the principal to run queries on Athena resources.
- `glue` – Allows principals access to AWS Glue databases, tables, and partitions. This is required so that the principal can use the AWS Glue Data Catalog with Athena.
- `s3` – Allows the principal to write and read query results from Amazon S3.
- `lakeformation` – Allows principals to request temporary credentials to access data in a data lake location that is registered with Lake Formation.

And is defined as:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "athena:BatchGetQueryExecution",
                "athena:GetQueryExecution",
                "athena:GetQueryResults",
                "athena:GetQueryResultsStream",
                "athena:ListQueryExecutions",
                "athena:StartQueryExecution",
                "athena:StopQueryExecution",
                "athena:ListWorkGroups",
                "athena:ListEngineVersions",
                "athena:GetWorkGroup",
                "athena:GetDataCatalog",
                "athena:GetDatabase",
                "athena:GetTableMetadata",
                "athena:ListDataCatalogs",
                "athena:ListDatabases",
                "athena:ListTableMetadata"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateDatabase",
                "glue:DeleteDatabase",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:UpdateDatabase",
                "glue:CreateTable",
                "glue:DeleteTable",
                "glue:BatchDeleteTable",
                "glue:UpdateTable",
                "glue:GetTable",
                "glue:GetTables",
                "glue:BatchCreatePartition",
                "glue:CreatePartition",
                "glue:DeletePartition",
                "glue:BatchDeletePartition",
                "glue:UpdatePartition",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:BatchGetPartition"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:ListMultipartUploadParts",
                "s3:AbortMultipartUpload",
                "s3:CreateBucket",
                "s3:PutObject",
                "s3:PutBucketPublicAccessBlock"
            ],
            "Resource": [
                "arn:aws:s3:::aws-athena-query-results-*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
```
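
To sanity-check that a given set of credentials actually carries these permissions, one option is a small boto3 probe; a sketch, where the region (and using boto3 at all) are assumptions, not part of the policy above:

```python
# Probe a couple of the actions granted above; a missing permission raises
# botocore.exceptions.ClientError with an AccessDenied-style message.
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # region is an assumption
glue = boto3.client("glue", region_name="us-east-1")

print(athena.list_work_groups()["WorkGroups"])  # exercises athena:ListWorkGroups
print(glue.get_databases()["DatabaseList"])     # exercises glue:GetDatabases
```
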
You can find further information on the Athena connector in the [docs](https://docs.open-metadata.org/connectors/database/athena).

### Python Requirements

To run the Athena ingestion, you will need to install:

@@ -67,6 +67,113 @@ To deploy OpenMetadata, check the Deployment guides.
To run the Ingestion via the UI you'll need to use the OpenMetadata Ingestion Container, which comes shipped with
custom Airflow plugins to handle the workflow deployment.

The Athena connector ingests metadata through JDBC connections.

{% note %}

According to AWS's official [documentation](https://docs.aws.amazon.com/athena/latest/ug/policy-actions.html):

*If you are using the JDBC or ODBC driver, ensure that the IAM
permissions policy includes all of the actions listed in [AWS managed policy: AWSQuicksightAthenaAccess](https://docs.aws.amazon.com/athena/latest/ug/managed-policies.html#awsquicksightathenaaccess-managed-policy).*

{% /note %}

This policy groups the following permissions:

- `athena` – Allows the principal to run queries on Athena resources.
- `glue` – Allows principals access to AWS Glue databases, tables, and partitions. This is required so that the principal can use the AWS Glue Data Catalog with Athena.
- `s3` – Allows the principal to write and read query results from Amazon S3.
- `lakeformation` – Allows principals to request temporary credentials to access data in a data lake location that is registered with Lake Formation.

And is defined as:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "athena:BatchGetQueryExecution",
                "athena:GetQueryExecution",
                "athena:GetQueryResults",
                "athena:GetQueryResultsStream",
                "athena:ListQueryExecutions",
                "athena:StartQueryExecution",
                "athena:StopQueryExecution",
                "athena:ListWorkGroups",
                "athena:ListEngineVersions",
                "athena:GetWorkGroup",
                "athena:GetDataCatalog",
                "athena:GetDatabase",
                "athena:GetTableMetadata",
                "athena:ListDataCatalogs",
                "athena:ListDatabases",
                "athena:ListTableMetadata"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateDatabase",
                "glue:DeleteDatabase",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:UpdateDatabase",
                "glue:CreateTable",
                "glue:DeleteTable",
                "glue:BatchDeleteTable",
                "glue:UpdateTable",
                "glue:GetTable",
                "glue:GetTables",
                "glue:BatchCreatePartition",
                "glue:CreatePartition",
                "glue:DeletePartition",
                "glue:BatchDeletePartition",
                "glue:UpdatePartition",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:BatchGetPartition"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:ListMultipartUploadParts",
                "s3:AbortMultipartUpload",
                "s3:CreateBucket",
                "s3:PutObject",
                "s3:PutBucketPublicAccessBlock"
            ],
            "Resource": [
                "arn:aws:s3:::aws-athena-query-results-*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
```

You can find further information on the Athena connector in the [docs](https://docs.open-metadata.org/connectors/database/athena).

## Metadata Ingestion

{% stepsContainer %}

@@ -54,10 +54,12 @@ the packages need to be present in the Airflow instances.
You will need to install:

```python
-pip3 install "openmetadata-ingestion[<connector-name>]"
+pip3 install "openmetadata-ingestion[<connector-name>]==x.y.z"
```

-And then run the DAG as explained in each [Connector](/connectors).
+And then run the DAG as explained in each [Connector](/connectors), where `x.y.z` matches the version of your
+OpenMetadata server. For example, if you are on version 1.0.0, then you can install `openmetadata-ingestion`
+with versions `1.0.0.*`, e.g., `1.0.0.0`, `1.0.0.1`, etc., but not `1.0.1.x`.
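
To make the version-matching rule concrete, a small illustration with hypothetical version strings:

```python
# The rule from the docs above: the first three segments of the client
# version must match the server version; the fourth segment is free.
def is_compatible(server: str, client: str) -> bool:
    return client.split(".")[:3] == server.split(".")[:3]

print(is_compatible("1.0.0", "1.0.0.1"))  # True
print(is_compatible("1.0.0", "1.0.1.0"))  # False
```
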
### Airflow APIs

@@ -85,13 +87,17 @@ The goal of this module is to add some HTTP endpoints that the UI calls for depl
The first step can be achieved by running:

```python
-pip3 install "openmetadata-managed-apis"
+pip3 install "openmetadata-managed-apis==x.y.z"
```

Then, check the Connector Modules guide above to learn how to install the `openmetadata-ingestion` package with the
necessary plugins. They are necessary because even if we install the APIs, the Airflow instance needs to have the
required libraries to connect to each source.

Here, the same versioning logic applies: `x.y.z` matches the version of your
OpenMetadata server. For example, if you are on version 1.0.0, then you can install `openmetadata-managed-apis`
with versions `1.0.0.*`, e.g., `1.0.0.0`, `1.0.0.1`, etc., but not `1.0.1.x`.

### AIRFLOW_HOME

The APIs will look for the `AIRFLOW_HOME` environment variable to place the dynamically generated DAGs. Make
@@ -192,6 +198,18 @@ Please update it accordingly.

## Ingestion Pipeline deployment issues

### Airflow APIs Not Found

Validate the installation, making sure that from the OpenMetadata server you can reach the Airflow host, and that the
call to `/health` gives the proper response:

```bash
$ curl -XGET ${AIRFLOW_HOST}/api/v1/openmetadata/health
{"status": "healthy", "version": "x.y.z"}
```

Also, make sure that the version of your OpenMetadata server matches the `openmetadata-ingestion` client version installed in Airflow.

### GetServiceException: Could not get service from type XYZ

In this case, the OpenMetadata client running in the Airflow host had issues getting the service you are trying to
@@ -75,6 +75,16 @@
    "supportsQueryComment": {
      "title": "Supports Query Comment",
      "$ref": "../connectionBasicType.json#/definitions/supportsQueryComment"
    },
    "supportsUsageExtraction": {
      "description": "Supports Usage Extraction.",
      "type": "boolean",
      "default": true
    },
    "supportsLineageExtraction": {
      "description": "Supports Lineage Extraction.",
      "type": "boolean",
      "default": true
    }
  },
  "additionalProperties": false,
@@ -6,9 +6,12 @@ In this section, we provide guides and references to use the Athena connector.

The Athena connector ingests metadata through JDBC connections.

$$note
According to AWS's official [documentation](https://docs.aws.amazon.com/athena/latest/ug/policy-actions.html):

*If you are using the JDBC or ODBC driver, ensure that the IAM
permissions policy includes all of the actions listed in [AWS managed policy: AWSQuicksightAthenaAccess](https://docs.aws.amazon.com/athena/latest/ug/managed-policies.html#awsquicksightathenaaccess-managed-policy).*
$$

This policy groups the following permissions:

@@ -116,14 +119,14 @@ SQLAlchemy driver scheme options.
$$

$$section
-### Aws Config $(id="awsConfig")
+### AWS Config $(id="awsConfig")

AWS credentials configs.
<!-- awsConfig to be updated -->
$$

$$section
-### Aws Access Key Id $(id="awsAccessKeyId")
+### AWS Access Key ID $(id="awsAccessKeyId")

When you interact with AWS, you specify your AWS security credentials to verify who you are and whether you have
permission to access the resources that you are requesting. AWS uses the security credentials to authenticate and

@@ -139,19 +142,9 @@ You can find further information on how to manage your access keys [here](https:
$$

$$section
-### Aws Secret Access Key $(id="awsSecretAccessKey")
+### AWS Secret Access Key $(id="awsSecretAccessKey")

-When you interact with AWS, you specify your AWS security credentials to verify who you are and whether you have
-permission to access the resources that you are requesting. AWS uses the security credentials to authenticate and
-authorize your requests ([docs](https://docs.aws.amazon.com/IAM/latest/UserGuide/security-creds.html)).
-
-Access keys consist of two parts:
-1. An access key ID (for example, `AKIAIOSFODNN7EXAMPLE`),
-2. And a secret access key (for example, `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY`).
-
-You must use both the access key ID and secret access key together to authenticate your requests.
-
-You can find further information on how to manage your access keys [here](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html)
+Secret access key (for example, `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY`).
$$
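
For illustration only, both halves of an access key are passed together when creating a client. A minimal boto3 sketch reusing AWS's documented example key pair; the region is an assumption:

```python
import boto3

# Example credentials quoted from the AWS docs above; replace with real ones.
session = boto3.Session(
    aws_access_key_id="AKIAIOSFODNN7EXAMPLE",
    aws_secret_access_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    region_name="us-east-1",  # region is an assumption
)
athena = session.client("athena")
```
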
$$section

@@ -0,0 +1,32 @@
# Lineage

Extracting lineage information from Database Services happens by extracting the queries run against the service
and parsing them.

Depending on the service, these queries are picked up from query history tables such as `query_history` in Snowflake,
or via API calls for Databricks or Athena.

$$note

Note that in order to find the lineage information, you will first need to have the tables in OpenMetadata by running
the Metadata Workflow. We use the table names identified in the queries to match the information present in OpenMetadata.

$$
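
As a toy illustration of the query-parsing step, here is a sketch using the `sqllineage` library; picking sqllineage (and the query below) is an assumption for the example, not a statement about the exact parser the workflow runs:

```python
# Parse one query and report table-level lineage.
from sqllineage.runner import LineageRunner

query = "INSERT INTO mart.daily_totals SELECT dt, SUM(amount) FROM raw.orders GROUP BY dt"
runner = LineageRunner(query)

print(runner.source_tables())  # [Table: raw.orders]
print(runner.target_tables())  # [Table: mart.daily_totals]
```
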
Depending on the number of queries run in the service, this can become an expensive operation. We offer two ways of
limiting the number of parsed queries:

## Query Log Duration $(id="queryLogDuration")

This is the value in **days** used to filter out past queries. For example, if today is `2023/04/19` and we set this value
to 2, we would be listing queries from `2023/04/17` until `2023/04/19` (inclusive).

## Result Limit $(id="resultLimit")

Another way to limit data is by adding a maximum number of records to process. This value works as:

```sql
SELECT xyz FROM query_history LIMIT <resultLimit>
```

This value will take precedence over the `Query Log Duration`.

@@ -0,0 +1,35 @@
# Usage

Extracting usage information from Database Services happens by extracting the queries run against the service
and parsing them.

Depending on the service, these queries are picked up from query history tables such as `query_history` in Snowflake,
or via API calls for Databricks or Athena.

$$note

Note that in order to find the usage information, you will first need to have the tables in OpenMetadata by running
the Metadata Workflow. We use the table names identified in the queries to match the information present in OpenMetadata.

$$

We will use this information to compute asset relevancy, show in the UI the queries run against each table, and process
frequently joined tables.

Depending on the number of queries run in the service, this can become an expensive operation. We offer two ways of
limiting the number of parsed queries:

## Query Log Duration $(id="queryLogDuration")

This is the value in **days** used to filter out past queries. For example, if today is `2023/04/19` and we set this value
to 2, we would be listing queries from `2023/04/17` until `2023/04/19` (inclusive).
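
To make the day-window arithmetic concrete, a small sketch mirroring the example above:

```python
from datetime import date, timedelta

query_log_duration = 2  # the value configured above
today = date(2023, 4, 19)
start = today - timedelta(days=query_log_duration)

print(start, "->", today)  # 2023-04-17 -> 2023-04-19, both ends included
```
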
## Result Limit $(id="resultLimit")

Another way to limit data is by adding a maximum number of records to process. This value works as:

```sql
SELECT xyz FROM query_history LIMIT <resultLimit>
```

This value will take precedence over the `Query Log Duration`.