Add Athena, Lineage and Usage docs & Fix Athena UI Lineage and Usage workflows (#11148)

* Athena docs

* Lineage and Usage docs

* Missing section close

* Fix Athena Model
Pere Miquel Brull 2023-04-20 06:31:53 +02:00 committed by GitHub
parent 3cf6442459
commit 91cd1491ee
9 changed files with 428 additions and 19 deletions

View File

@ -13,7 +13,7 @@
Athena Models
"""
from datetime import datetime
from typing import List
from typing import List, Optional
from pydantic import BaseModel
@ -70,7 +70,7 @@ class AthenaQueryExecution(BaseModel):
Statistics: Statistics
WorkGroup: str
EngineVersion: EngineVersion
SubstatementType: str
SubstatementType: Optional[str]
class AthenaQueryExecutionList(BaseModel):
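Making `SubstatementType` optional means the model no longer fails when Athena's `GetQueryExecution` response omits that field. A minimal sketch of the difference, assuming pydantic v1 semantics (where an `Optional` field without a default falls back to `None`); the class names and payload are hypothetical:

```python
from typing import Optional

from pydantic import BaseModel


class StrictExecution(BaseModel):
    # Before the fix: the field is required, so a missing key raises a ValidationError.
    SubstatementType: str


class RelaxedExecution(BaseModel):
    # After the fix: the field may be absent and falls back to None (pydantic v1).
    SubstatementType: Optional[str]


payload = {}  # hypothetical response payload without SubstatementType

print(RelaxedExecution(**payload).SubstatementType)  # None

try:
    StrictExecution(**payload)
except Exception as err:  # pydantic.ValidationError: field required
    print(f"Strict model rejects the payload: {err}")
```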

View File

@ -44,6 +44,113 @@ To deploy OpenMetadata, check the Deployment guides.
To run the Ingestion via the UI you'll need to use the OpenMetadata Ingestion Container, which comes shipped with
custom Airflow plugins to handle the workflow deployment.
The Athena connector ingests metadata through JDBC connections.
{% note %}
According to AWS's official [documentation](https://docs.aws.amazon.com/athena/latest/ug/policy-actions.html):
*If you are using the JDBC or ODBC driver, ensure that the IAM
permissions policy includes all of the actions listed in [AWS managed policy: AWSQuicksightAthenaAccess](https://docs.aws.amazon.com/athena/latest/ug/managed-policies.html#awsquicksightathenaaccess-managed-policy).*
{% /note %}
This policy groups the following permissions:
- `athena` Allows the principal to run queries on Athena resources.
- `glue` Allows principals access to AWS Glue databases, tables, and partitions. This is required so that the principal can use the AWS Glue Data Catalog with Athena.
- `s3` Allows the principal to write and read query results from Amazon S3.
- `lakeformation` Allows principals to request temporary credentials to access data in a data lake location that is registered with Lake Formation.
The policy is defined as follows:
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:BatchGetQueryExecution",
"athena:GetQueryExecution",
"athena:GetQueryResults",
"athena:GetQueryResultsStream",
"athena:ListQueryExecutions",
"athena:StartQueryExecution",
"athena:StopQueryExecution",
"athena:ListWorkGroups",
"athena:ListEngineVersions",
"athena:GetWorkGroup",
"athena:GetDataCatalog",
"athena:GetDatabase",
"athena:GetTableMetadata",
"athena:ListDataCatalogs",
"athena:ListDatabases",
"athena:ListTableMetadata"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"glue:CreateDatabase",
"glue:DeleteDatabase",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:UpdateDatabase",
"glue:CreateTable",
"glue:DeleteTable",
"glue:BatchDeleteTable",
"glue:UpdateTable",
"glue:GetTable",
"glue:GetTables",
"glue:BatchCreatePartition",
"glue:CreatePartition",
"glue:DeletePartition",
"glue:BatchDeletePartition",
"glue:UpdatePartition",
"glue:GetPartition",
"glue:GetPartitions",
"glue:BatchGetPartition"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:GetObject",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:ListMultipartUploadParts",
"s3:AbortMultipartUpload",
"s3:CreateBucket",
"s3:PutObject",
"s3:PutBucketPublicAccessBlock"
],
"Resource": [
"arn:aws:s3:::aws-athena-query-results-*"
]
},
{
"Effect": "Allow",
"Action": [
"lakeformation:GetDataAccess"
],
"Resource": [
"*"
]
}
]
}
```
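As a hedged sketch (not an official setup step), a policy document like the one above can be attached as an inline IAM policy with boto3; the user name, policy name, and trimmed-down policy below are hypothetical placeholders:

```python
import json

import boto3

# Placeholder policy document: in practice, use the full JSON policy shown above.
athena_access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["athena:StartQueryExecution", "athena:GetQueryExecution"],
            "Resource": ["*"],
        }
    ],
}

iam = boto3.client("iam")

# Attach the policy inline to the IAM user used by the Athena connector.
iam.put_user_policy(
    UserName="openmetadata-athena",          # hypothetical user name
    PolicyName="AthenaOpenMetadataAccess",   # hypothetical policy name
    PolicyDocument=json.dumps(athena_access_policy),
)
```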
You can find further information on the Athena connector in the [docs](https://docs.open-metadata.org/connectors/database/athena).
### Python Requirements
To run the Athena ingestion, you will need to install:

View File

@ -44,6 +44,113 @@ To deploy OpenMetadata, check the Deployment guides.
To run the Ingestion via the UI you'll need to use the OpenMetadata Ingestion Container, which comes shipped with
custom Airflow plugins to handle the workflow deployment.
The Athena connector ingests metadata through JDBC connections.
{% note %}
According to AWS's official [documentation](https://docs.aws.amazon.com/athena/latest/ug/policy-actions.html):
*If you are using the JDBC or ODBC driver, ensure that the IAM
permissions policy includes all of the actions listed in [AWS managed policy: AWSQuicksightAthenaAccess](https://docs.aws.amazon.com/athena/latest/ug/managed-policies.html#awsquicksightathenaaccess-managed-policy).*
{% /note %}
This policy groups the following permissions:
- `athena` Allows the principal to run queries on Athena resources.
- `glue` Allows principals access to AWS Glue databases, tables, and partitions. This is required so that the principal can use the AWS Glue Data Catalog with Athena.
- `s3` Allows the principal to write and read query results from Amazon S3.
- `lakeformation` Allows principals to request temporary credentials to access data in a data lake location that is registered with Lake Formation.
The policy is defined as follows:
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:BatchGetQueryExecution",
"athena:GetQueryExecution",
"athena:GetQueryResults",
"athena:GetQueryResultsStream",
"athena:ListQueryExecutions",
"athena:StartQueryExecution",
"athena:StopQueryExecution",
"athena:ListWorkGroups",
"athena:ListEngineVersions",
"athena:GetWorkGroup",
"athena:GetDataCatalog",
"athena:GetDatabase",
"athena:GetTableMetadata",
"athena:ListDataCatalogs",
"athena:ListDatabases",
"athena:ListTableMetadata"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"glue:CreateDatabase",
"glue:DeleteDatabase",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:UpdateDatabase",
"glue:CreateTable",
"glue:DeleteTable",
"glue:BatchDeleteTable",
"glue:UpdateTable",
"glue:GetTable",
"glue:GetTables",
"glue:BatchCreatePartition",
"glue:CreatePartition",
"glue:DeletePartition",
"glue:BatchDeletePartition",
"glue:UpdatePartition",
"glue:GetPartition",
"glue:GetPartitions",
"glue:BatchGetPartition"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:GetObject",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:ListMultipartUploadParts",
"s3:AbortMultipartUpload",
"s3:CreateBucket",
"s3:PutObject",
"s3:PutBucketPublicAccessBlock"
],
"Resource": [
"arn:aws:s3:::aws-athena-query-results-*"
]
},
{
"Effect": "Allow",
"Action": [
"lakeformation:GetDataAccess"
],
"Resource": [
"*"
]
}
]
}
```
You can find further information on the Athena connector in the [docs](https://docs.open-metadata.org/connectors/database/athena).
### Python Requirements
To run the Athena ingestion, you will need to install:

View File

@ -67,6 +67,113 @@ To deploy OpenMetadata, check the Deployment guides.
To run the Ingestion via the UI you'll need to use the OpenMetadata Ingestion Container, which comes shipped with
custom Airflow plugins to handle the workflow deployment.
The Athena connector ingests metadata through JDBC connections.
{% note %}
According to AWS's official [documentation](https://docs.aws.amazon.com/athena/latest/ug/policy-actions.html):
*If you are using the JDBC or ODBC driver, ensure that the IAM
permissions policy includes all of the actions listed in [AWS managed policy: AWSQuicksightAthenaAccess](https://docs.aws.amazon.com/athena/latest/ug/managed-policies.html#awsquicksightathenaaccess-managed-policy).*
{% /note %}
This policy groups the following permissions:
- `athena` Allows the principal to run queries on Athena resources.
- `glue` Allows principals access to AWS Glue databases, tables, and partitions. This is required so that the principal can use the AWS Glue Data Catalog with Athena.
- `s3` Allows the principal to write and read query results from Amazon S3.
- `lakeformation` Allows principals to request temporary credentials to access data in a data lake location that is registered with Lake Formation.
The policy is defined as follows:
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:BatchGetQueryExecution",
"athena:GetQueryExecution",
"athena:GetQueryResults",
"athena:GetQueryResultsStream",
"athena:ListQueryExecutions",
"athena:StartQueryExecution",
"athena:StopQueryExecution",
"athena:ListWorkGroups",
"athena:ListEngineVersions",
"athena:GetWorkGroup",
"athena:GetDataCatalog",
"athena:GetDatabase",
"athena:GetTableMetadata",
"athena:ListDataCatalogs",
"athena:ListDatabases",
"athena:ListTableMetadata"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"glue:CreateDatabase",
"glue:DeleteDatabase",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:UpdateDatabase",
"glue:CreateTable",
"glue:DeleteTable",
"glue:BatchDeleteTable",
"glue:UpdateTable",
"glue:GetTable",
"glue:GetTables",
"glue:BatchCreatePartition",
"glue:CreatePartition",
"glue:DeletePartition",
"glue:BatchDeletePartition",
"glue:UpdatePartition",
"glue:GetPartition",
"glue:GetPartitions",
"glue:BatchGetPartition"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:GetObject",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:ListMultipartUploadParts",
"s3:AbortMultipartUpload",
"s3:CreateBucket",
"s3:PutObject",
"s3:PutBucketPublicAccessBlock"
],
"Resource": [
"arn:aws:s3:::aws-athena-query-results-*"
]
},
{
"Effect": "Allow",
"Action": [
"lakeformation:GetDataAccess"
],
"Resource": [
"*"
]
}
]
}
```
You can find further information on the Athena connector in the [docs](https://docs.open-metadata.org/connectors/database/athena).
## Metadata Ingestion
{% stepsContainer %}

View File

@ -54,10 +54,12 @@ the packages need to be present in the Airflow instances.
You will need to install:
```python
pip3 install "openmetadata-ingestion[<connector-name>]"
pip3 install "openmetadata-ingestion[<connector-name>]==x.y.z"
```
And then run the DAG as explained in each [Connector](/connectors).
And then run the DAG as explained in each [Connector](/connectors), where `x.y.z` matches the version of your
OpenMetadata server. For example, if you are on version 1.0.0, you can install `openmetadata-ingestion`
at any `1.0.0.*` version, e.g., `1.0.0.0`, `1.0.0.1`, etc., but not `1.0.1.x`.
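A small sketch, using the `packaging` library, showing which client builds satisfy that rule for the example server version:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# Server on 1.0.0: any 1.0.0.* build of the client is compatible.
compatible = SpecifierSet("==1.0.0.*")

for candidate in ("1.0.0.0", "1.0.0.1", "1.0.1.0"):
    print(candidate, "ok" if Version(candidate) in compatible else "not compatible")
# 1.0.0.0 ok, 1.0.0.1 ok, 1.0.1.0 not compatible
```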
### Airflow APIs
@ -85,13 +87,17 @@ The goal of this module is to add some HTTP endpoints that the UI calls for depl
The first step can be achieved by running:
```python
pip3 install "openmetadata-managed-apis"
pip3 install "openmetadata-managed-apis==x.y.z"
```
Then, check the Connector Modules guide above to learn how to install the `openmetadata-ingestion` package with the
necessary plugins. They are necessary because even if we install the APIs, the Airflow instance needs to have the
required libraries to connect to each source.
Here, the same versioning logic applies: `x.y.z` must match the version of your
OpenMetadata server. For example, if you are on version 1.0.0, you can install `openmetadata-managed-apis`
at any `1.0.0.*` version, e.g., `1.0.0.0`, `1.0.0.1`, etc., but not `1.0.1.x`.
### AIRFLOW_HOME
The APIs will look for the `AIRFLOW_HOME` environment variable to place the dynamically generated DAGs. Make
@ -192,6 +198,18 @@ Please update it accordingly.
## Ingestion Pipeline deployment issues
### Airflow APIs Not Found
Validate the installation by making sure that the Airflow host is reachable from the OpenMetadata server and that the
call to `/health` returns the expected response:
```bash
$ curl -XGET ${AIRFLOW_HOST}/api/v1/openmetadata/health
{"status": "healthy", "version": "x.y.z"}
```
Also, make sure that the version of your OpenMetadata server matches the `openmetadata-ingestion` client version installed in Airflow.
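As a quick sanity check, here is a sketch (assuming the `requests` library; the `OM_SERVER_VERSION` placeholder is hypothetical) that compares the client version reported by the health endpoint above with your server version:

```python
import os

import requests

# Fill in your OpenMetadata server version; the env var name is a placeholder.
server_version = os.environ.get("OM_SERVER_VERSION", "1.0.0")

airflow_host = os.environ["AIRFLOW_HOST"]  # e.g., http://localhost:8080

# The managed APIs health endpoint reports the version running in Airflow.
health = requests.get(f"{airflow_host}/api/v1/openmetadata/health", timeout=10).json()
client_version = health["version"]

# Only the x.y.z part needs to match; the client may carry an extra build digit.
if client_version.split(".")[:3] == server_version.split(".")[:3]:
    print(f"OK: client {client_version} matches server {server_version}")
else:
    print(f"Mismatch: client {client_version} vs server {server_version}")
```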
### GetServiceException: Could not get service from type XYZ
In this case, the OpenMetadata client running in the Airflow host had issues getting the service you are trying to

View File

@ -75,6 +75,16 @@
"supportsQueryComment": {
"title": "Supports Query Comment",
"$ref": "../connectionBasicType.json#/definitions/supportsQueryComment"
},
"supportsUsageExtraction": {
"description": "Supports Usage Extraction.",
"type": "boolean",
"default": true
},
"supportsLineageExtraction": {
"description": "Supports Lineage Extraction.",
"type": "boolean",
"default": true
}
},
"additionalProperties": false,

View File

@ -6,9 +6,12 @@ In this section, we provide guides and references to use the Athena connector.
The Athena connector ingests metadata through JDBC connections.
$$note
According to AWS's official [documentation](https://docs.aws.amazon.com/athena/latest/ug/policy-actions.html):
*If you are using the JDBC or ODBC driver, ensure that the IAM
permissions policy includes all of the actions listed in [AWS managed policy: AWSQuicksightAthenaAccess](https://docs.aws.amazon.com/athena/latest/ug/managed-policies.html#awsquicksightathenaaccess-managed-policy).*
$$
This policy groups the following permissions:
@ -116,14 +119,14 @@ SQLAlchemy driver scheme options.
$$
$$section
### Aws Config $(id="awsConfig")
### AWS Config $(id="awsConfig")
AWS credentials configs.
<!-- awsConfig to be updated -->
$$
$$section
### Aws Access Key Id $(id="awsAccessKeyId")
### AWS Access Key ID $(id="awsAccessKeyId")
When you interact with AWS, you specify your AWS security credentials to verify who you are and whether you have
permission to access the resources that you are requesting. AWS uses the security credentials to authenticate and
@ -139,19 +142,9 @@ You can find further information on how to manage your access keys [here](https:
$$
$$section
### Aws Secret Access Key $(id="awsSecretAccessKey")
### AWS Secret Access Key $(id="awsSecretAccessKey")
When you interact with AWS, you specify your AWS security credentials to verify who you are and whether you have
permission to access the resources that you are requesting. AWS uses the security credentials to authenticate and
authorize your requests ([docs](https://docs.aws.amazon.com/IAM/latest/UserGuide/security-creds.html)).
Access keys consist of two parts:
1. An access key ID (for example, `AKIAIOSFODNN7EXAMPLE`),
2. And a secret access key (for example, `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY`).
You must use both the access key ID and secret access key together to authenticate your requests.
You can find further information on how to manage your access keys [here](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html)
Secret access key (for example, `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY`).
$$
$$section

View File

@ -0,0 +1,32 @@
# Lineage
Lineage information for Database Services is extracted by fetching the queries run against the service
and parsing them.
Depending on the service, these queries are picked up from query history tables such as `query_history` in Snowflake,
or via API calls for Databricks or Athena.
$$note
Note that in order to find the lineage information, you will first need to have the tables ingested into OpenMetadata by running
the Metadata Workflow. We use the table names identified in the queries to match the information already present in OpenMetadata.
$$
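As a rough illustration of that parsing step, a SQL lineage parser such as the open-source `sqllineage` package (shown here only as a sketch, not necessarily the exact parser used internally) can recover the table names from a query:

```python
from sqllineage.runner import LineageRunner

# A query like the ones fetched from query history tables or the Athena API.
query = (
    "INSERT INTO analytics.daily_sales "
    "SELECT * FROM raw.sales WHERE ds = '2023-04-19'"
)

parsed = LineageRunner(query)

# These table names are what gets matched against the assets already in OpenMetadata.
print(parsed.source_tables())  # source tables, e.g. raw.sales
print(parsed.target_tables())  # target tables, e.g. analytics.daily_sales
```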
Depending on the number of queries run in the service, this can become an expensive operation. We offer two ways of
limiting the number of parsed queries:
## Query Log Duration $(id="queryLogDuration")
This is the value in **days** used to filter out past queries. For example, if today is `2023/04/19` and we set this value
to 2, we would list queries from `2023/04/17` to `2023/04/19` (inclusive).
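A minimal sketch of how that date window is derived, using the dates from the example:

```python
from datetime import date, timedelta

query_log_duration = 2       # Query Log Duration, in days
today = date(2023, 4, 19)

start = today - timedelta(days=query_log_duration)

print(f"Listing queries from {start} until {today} (inclusive)")
# Listing queries from 2023-04-17 until 2023-04-19 (inclusive)
```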
## Result Limit $(id="resultLimit")
Another way to limit the data is by setting a maximum number of records to process. This value is applied as follows:
```sql
SELECT xyz FROM query_history limit <resultLimit>
```
This value will take precedence over the `Query Log Duration`.

View File

@ -0,0 +1,35 @@
# Usage
Usage information for Database Services is extracted by fetching the queries run against the service
and parsing them.
Depending on the service, these queries are picked up from query history tables such as `query_history` in Snowflake,
or via API calls for Databricks or Athena.
$$note
Note that in order to find the usage information, you will first need to have the tables ingested into OpenMetadata by running
the Metadata Workflow. We use the table names identified in the queries to match the information already present in OpenMetadata.
$$
We use this information to compute asset relevancy, show the queries run against each table in the UI, and identify
frequently joined tables.
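As an informal sketch of the "frequently joined tables" idea (not the exact implementation), table pairs that appear together in parsed queries can simply be counted:

```python
from collections import Counter
from itertools import combinations

# Hypothetical parser output: the set of tables referenced by each query.
tables_per_query = [
    {"dim_address", "fact_sale"},
    {"dim_address", "fact_sale", "dim_customer"},
    {"fact_sale", "dim_customer"},
]

joined_pairs = Counter()
for tables in tables_per_query:
    for pair in combinations(sorted(tables), 2):
        joined_pairs[pair] += 1

# Most frequently co-occurring table pairs across the parsed queries.
print(joined_pairs.most_common(2))
```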
Depending on the number of queries run in the service, this can become an expensive operation. We offer two ways of
limiting the number of parsed queries:
## Query Log Duration $(id="queryLogDuration")
This is the value in **days** used to filter out past queries. For example, if today is `2023/04/19` and we set this value
to 2, we would list queries from `2023/04/17` to `2023/04/19` (inclusive).
## Result Limit $(id="resultLimit")
Another way to limit the data is by setting a maximum number of records to process. This value is applied as follows:
```sql
SELECT xyz FROM query_history limit <resultLimit>
```
This value will take precedence over the `Query Log Duration`.