
Prerequisites

To ingest metadata from Tableau, you will need:

Integration Details

This plugin extracts metadata for Sheets, Dashboards, and Embedded and Published Data Sources contained in Workbooks in a given project on a Tableau site. The plugin is in beta and has only been tested against a PostgreSQL database and sample workbooks on Tableau Online. Metadata is extracted through Tableau's GraphQL interface; the queries used are located in metadata-ingestion/src/datahub/ingestion/source/tableau_common.py.

Concept Mapping

This ingestion source maps the following Source System Concepts to DataHub Concepts:

| Source Concept              | DataHub Concept       | Notes                            |
|-----------------------------|-----------------------|----------------------------------|
| "Tableau"                   | Data Platform         |                                  |
| Embedded DataSource         | Dataset               | SubType "Embedded Data Source"   |
| Published DataSource        | Dataset               | SubType "Published Data Source"  |
| Custom SQL Table            | Dataset               | SubTypes "View", "Custom SQL"    |
| Embedded or External Tables | Dataset               |                                  |
| Sheet                       | Chart                 |                                  |
| Dashboard                   | Dashboard             |                                  |
| User                        | User (a.k.a CorpUser) |                                  |
| Workbook                    | Container             | SubType "Workbook"               |
| Tag                         | Tag                   |                                  |

Workbook

Workbooks from Tableau are ingested as Containers in DataHub.

  • GraphQL query
{
  workbooksConnection(first: 10, offset: 0, filter: {projectNameWithin: ["default", "Project 2"]}) {
    nodes {
      id
      name
      luid
      uri
      projectName
      owner {
        username
      }
      description
      createdAt
      updatedAt
    }
    pageInfo {
      hasNextPage
      endCursor
    }
    totalCount
  }
}
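
The query above is paged with the `first` and `offset` arguments. The following is a minimal sketch of a paging loop over it, not the plugin's actual code. It assumes a caller-supplied `run_query` function (hypothetical) that sends a query string to the metadata API and returns the decoded `data` object; `build_workbooks_query` and `iter_workbooks` are likewise illustrative names, and the selected fields are trimmed for brevity.

```python
import json
from typing import Any, Callable, Dict, Iterator, List


def build_workbooks_query(projects: List[str], first: int, offset: int) -> str:
    """Render a trimmed version of the workbook query for one page."""
    # json.dumps produces a double-quoted list literal, which GraphQL accepts.
    project_filter = json.dumps(projects)
    return (
        "{ workbooksConnection(first: %d, offset: %d, "
        "filter: {projectNameWithin: %s}) "
        "{ nodes { id name luid projectName } "
        "pageInfo { hasNextPage endCursor } totalCount } }"
        % (first, offset, project_filter)
    )


def iter_workbooks(
    run_query: Callable[[str], Dict[str, Any]],  # hypothetical transport function
    projects: List[str],
    page_size: int = 10,
) -> Iterator[Dict[str, Any]]:
    """Yield workbook nodes page by page until hasNextPage is false."""
    offset = 0
    while True:
        data = run_query(build_workbooks_query(projects, page_size, offset))
        connection = data["workbooksConnection"]
        nodes = connection["nodes"]
        yield from nodes
        offset += len(nodes)
        # Stop on an empty page as well, so a misbehaving server cannot loop forever.
        if not nodes or not connection["pageInfo"]["hasNextPage"]:
            break
```

Passing the transport in as a function keeps the loop testable without a Tableau site; in practice `run_query` would wrap whatever client is used to reach the metadata API.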

Dashboard

Dashboards from Tableau are ingested as Dashboards in DataHub.

  • GraphQL query
{
  workbooksConnection(first: 10, offset: 0, filter: {projectNameWithin: ["default", "Project 2"]}) {
    nodes {
      .....
      dashboards {
        id
        name
        path
        createdAt
        updatedAt
        sheets {
          id
          name
        }
      }
    }
    pageInfo {
      hasNextPage
      endCursor
    }
    totalCount
  }
}

Sheet

Sheets from Tableau are ingested as Charts in DataHub.

  • GraphQL query
{
  workbooksConnection(first: 10, offset: 0, filter: {projectNameWithin: ["default"]}) {
    nodes {
      .....
      sheets {
        id
        name
        path
        createdAt
        updatedAt
        tags {
          name
        }
        containedInDashboards {
          name
          path
        }
        upstreamDatasources {
          id
          name
        }
        datasourceFields {
          __typename
          id
          name
          description
          upstreamColumns {
            name
          }
          ... on ColumnField {
            dataCategory
            role
            dataType
            aggregation
          }
          ... on CalculatedField {
            role
            dataType
            aggregation
            formula
          }
          ... on GroupField {
            role
            dataType
          }
          ... on DatasourceField {
            remoteField {
              __typename
              id
              name
              description
              folderName
              ... on ColumnField {
                dataCategory
                role
                dataType
                aggregation
              }
              ... on CalculatedField {
                role
                dataType
                aggregation
                formula
              }
              ... on GroupField {
                role
                dataType
              }
            }
          }
        }
      }
    }
     .....
  }
}
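
Because of the inline fragments in the query above, each entry in `datasourceFields` carries different keys depending on its `__typename`, and a `DatasourceField` nests the underlying field under `remoteField`. The sketch below shows one way a consumer might flatten such an entry into a uniform record. The function name `resolve_field` and the output keys are illustrative assumptions, not part of the plugin.

```python
from typing import Any, Dict, Optional


def resolve_field(field: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Flatten one datasourceFields entry into a uniform record."""
    target = field
    # A DatasourceField wraps the real field; use remoteField when it is present.
    if field.get("__typename") == "DatasourceField" and field.get("remoteField"):
        target = field["remoteField"]
    return {
        "name": field.get("name"),
        "type_name": target.get("__typename"),
        "role": target.get("role"),
        "data_type": target.get("dataType"),
        # Only CalculatedField entries carry a formula; others yield None.
        "formula": target.get("formula"),
    }
```

Using `dict.get` throughout means fields that a given type does not expose simply come back as `None` instead of raising.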

Embedded Data Source

Embedded Data Sources from Tableau are ingested as Datasets in DataHub.

  • GraphQL query
{
  workbooksConnection(first: 10, offset: 0, filter: {projectNameWithin: ["default"]}) {
    nodes {
      ....
      embeddedDatasources {
        __typename
        id
        name
        hasExtracts
        extractLastRefreshTime
        extractLastIncrementalUpdateTime
        extractLastUpdateTime
        upstreamDatabases {
          id
          name
          connectionType
          isEmbedded
        }
        upstreamTables {
          name
          schema
          columns {
            name
            remoteType
          }
        }
        fields {
          __typename
          id
          name
          description
          isHidden
          folderName
          ... on ColumnField {
            dataCategory
            role
            dataType
            defaultFormat
            aggregation
            columns {
              table {
                ... on CustomSQLTable {
                  id
                  name
                }
              }
            }
          }
          ... on CalculatedField {
            role
            dataType
            defaultFormat
            aggregation
            formula
          }
          ... on GroupField {
            role
            dataType
          }
        }
        upstreamDatasources {
          id
          name
        }
        workbook {
          name
          projectName
        }
      }
    }
    ....
  }
}

Published Data Source

Published Data Sources from Tableau are ingested as Datasets in DataHub.

  • GraphQL query
{
  publishedDatasourcesConnection(first: 10, offset: 0, filter: {idWithin: ["00cce29f-b561-bb41-3557-8e19660bb5dd", "618c87db-5959-338b-bcc7-6f5f4cc0b6c6"]}) {
    nodes {
      __typename
      id
      name
      hasExtracts
      extractLastRefreshTime
      extractLastIncrementalUpdateTime
      extractLastUpdateTime
      downstreamSheets {
        id
        name
      }
      upstreamTables {
        name
        schema
        fullName
        connectionType
        description
        contact {
          name
        }
      }
      fields {
        __typename
        id
        name
        description
        isHidden
        folderName
        ... on ColumnField {
          dataCategory
          role
          dataType
          defaultFormat
          aggregation
          columns {
            table {
              ... on CustomSQLTable {
                id
                name
              }
            }
          }
        }
        ... on CalculatedField {
          role
          dataType
          defaultFormat
          aggregation
          formula
        }
        ... on GroupField {
          role
          dataType
        }
      }
      owner {
        username
      }
      description
      uri
      projectName
    }
    pageInfo {
      hasNextPage
      endCursor
    }
    totalCount
  }
}

Custom SQL Data Source

For Custom SQL data sources, the query is viewable in the UI under the View Definition tab.

  • GraphQL query
{
  customSQLTablesConnection(first: 10, offset: 0, filter: {idWithin: ["22b0b4c3-6b85-713d-a161-5a87fdd78f40"]}) {
    nodes {
      id
      name
      query
      columns {
        id
        name
        remoteType
        description
        referencedByFields {
          datasource {
            id
            name
            upstreamDatabases {
              id
              name
            }
            upstreamTables {
              id
              name
              schema
              connectionType
              columns {
                id
              }
            }
            ... on PublishedDatasource {
              projectName
            }
            ... on EmbeddedDatasource {
              workbook {
                name
                projectName
              }
            }
          }
        }
      }
      tables {
        id
        name
        schema
        connectionType
      }
    }
  }
}

Lineage

Lineage is emitted as received from Tableau's metadata API for:

  • Sheets contained in Dashboard
  • Embedded or Published datasources upstream to Sheet
  • Published datasources upstream to Embedded datasource
  • Tables upstream to Embedded or Published datasource
  • Custom SQL datasources upstream to Embedded or Published datasource
  • Tables upstream to Custom SQL datasource

Caveats

  • The Tableau metadata API might return an incorrect schema name for tables in some databases, leading to incorrect metadata in DataHub. This source attempts to extract the correct schema from the databaseTable's fully qualified name wherever possible. Read "Using the databaseTable object in query" for caveats in using the schema attribute.
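
As a rough illustration of the idea, the sketch below pulls the schema out of a fully qualified table name and falls back to the reported `schema` value when that is not possible. It assumes, without confirmation from the source, that the name is dot-separated with optional square brackets around each part (for example `[public].[customers]`); the function name `schema_from_full_name` is hypothetical, and the plugin's actual logic may differ.

```python
import re
from typing import Optional

# One name part: either a bracketed segment (which may contain dots) or a bare one.
_PART = re.compile(r"\[([^\]]*)\]|([^.\[\]]+)")


def schema_from_full_name(
    full_name: Optional[str], fallback: Optional[str] = None
) -> Optional[str]:
    """Return the second-to-last part of full_name, or fallback if unavailable."""
    if not full_name:
        return fallback
    parts = [
        m.group(1) if m.group(1) is not None else m.group(2)
        for m in _PART.finditer(full_name)
    ]
    # A lone table name carries no schema information.
    if len(parts) < 2:
        return fallback
    return parts[-2] or fallback
```

For example, `schema_from_full_name("[public].[customers]", fallback="wrong")` returns `"public"`, while a bare `"customers"` returns the fallback.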

Troubleshooting

Why are only some workbooks/custom SQLs/published datasources ingested from the specified project?

This may happen when the Tableau API returns a NODE_LIMIT_EXCEEDED error in response to a metadata query and returns partial results with the message "Showing partial results. , The request exceeded the n node limit. Use pagination, additional filtering, or both in the query to adjust results." To resolve this, consider:

  • reducing the page size using the page_size config parameter in the DataHub recipe (defaults to 10).
  • increasing the Tableau configuration metadata query node limit to a higher value.
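
The first option can also be explored programmatically. The sketch below checks a decoded GraphQL response for the node-limit condition and suggests a smaller page size. It assumes the response follows the usual GraphQL convention of an `errors` list whose entries have a `message` and an `extensions.code`; that shape, and the names `hit_node_limit` and `next_page_size`, are assumptions for illustration and not confirmed plugin behavior.

```python
from typing import Any, Dict


def hit_node_limit(response: Dict[str, Any]) -> bool:
    """Return True if any error in the response signals NODE_LIMIT_EXCEEDED."""
    for error in response.get("errors") or []:
        code = (error.get("extensions") or {}).get("code")
        message = error.get("message") or ""
        # Match on the error code, or on the message text as a fallback.
        if code == "NODE_LIMIT_EXCEEDED" or "node limit" in message:
            return True
    return False


def next_page_size(current: int, response: Dict[str, Any]) -> int:
    """Halve the page size (never below 1) when the node limit was hit."""
    if hit_node_limit(response):
        return max(1, current // 2)
    return current
```

Halving is only one possible back-off policy; a fixed smaller `page_size` in the recipe achieves the same effect without retry logic.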