OpenMetadata/docs/tutorials/tutorial-data-discovery-with-openmetadata.md
parthp2107 e2578d6be3
Added documentation changes done in 0.5.0 branch to main (#1168)
Co-authored-by: Ayush Shah <ayush.shah@deuexsolutions.com>
Co-authored-by: Suresh Srinivas <suresh@getcollate.io>
Co-authored-by: Shannon Bradshaw <shannon.bradshaw@arrikto.com>
Co-authored-by: OpenMetadata <github@harsha.io>
2021-11-13 09:33:20 -08:00

Tutorial: Data Discovery with OpenMetadata

In this tutorial, we will explore key features of the OpenMetadata standard and its discovery and collaboration user interface. Specifically, we will demonstrate how to:

  • Find data using keyword search across services, databases, tables, tags, etc.
  • Use tags to identify the relative importance of different datasets.
  • Use data descriptions to distinguish the right data to use for your use case from among many possibilities.

For this tutorial, we will assume the role of data analysts who have been asked to analyze product sales by region. We will use the OpenMetadata sandbox. The sandbox is an environment in which you can explore OpenMetadata in the context of data assets and the metadata with which a community of users has annotated these resources.

1. Log in to the OpenMetadata sandbox using a Google account

2. Add yourself as a user and add yourself to several teams

This is only necessary if you have not previously logged in to OpenMetadata.

Once logged in, your view of the sandbox should look something like the figure below.

3. Search for "sales"

In the search box, enter the search term sales. OpenMetadata will perform the search across all assets, regardless of type, and retrieve those that match by name or by the text of their associated metadata.

Note that as we type the search term sales, OpenMetadata auto-suggests a number of matching assets categorized by type in a dropdown just below the search box. In this case, there are assets of type Table, Topic, and Dashboard displayed. See the figure below for an example. OpenMetadata search also looks for pipelines, column names, tags, and other assets matching your query. Keyword search is, therefore, a powerful tool for locating relevant assets.
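
To make the matching behavior concrete, here is a minimal sketch of a keyword search that checks names, descriptions, and tags and groups the matches by asset type. It is an in-memory illustration only and does not call OpenMetadata; the `Asset` class, the `keyword_search` function, and the sample catalog are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Asset:
    """A hypothetical, simplified stand-in for a catalog entry."""
    name: str
    asset_type: str  # e.g. "Table", "Topic", "Dashboard", "Pipeline"
    description: str = ""
    tags: list = field(default_factory=list)


def keyword_search(assets, term):
    """Return names of assets whose name, description, or tags contain
    `term` (case-insensitive substring match), grouped by asset type."""
    needle = term.lower()
    grouped = defaultdict(list)
    for asset in assets:
        haystack = " ".join([asset.name, asset.description, *asset.tags]).lower()
        if needle in haystack:
            grouped[asset.asset_type].append(asset.name)
    return dict(grouped)


# Illustrative sample data, not taken from the sandbox.
catalog = [
    Asset("fact_sale", "Table", "Sales facts for reporting."),
    Asset("dim_customer", "Table", "Customer dimension."),
    Asset("sales_events", "Topic"),
    Asset("Regional Overview", "Dashboard", "Weekly sales by region."),
]

results = keyword_search(catalog, "sales")
```

Note that fact_sale and the dashboard match on their descriptions rather than their names, which mirrors the point above: a search that looks only at asset names would miss them.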

4. Explore the search results: Tables, Dashboards, Pipelines

Having issued our search for sales, we see results similar to those depicted below. This query matches 12 tables across the BigQuery and Redshift services.

In addition, we've identified four dashboards...

...and an ETL pipeline for sales data.

5. Take note of descriptions and tags

As we look through all of this, it's important to note the descriptions for these assets. For example, the fact_order_and_sales_etl pipeline identifies the fact_sale table as a critical reporting table.

We also see tags that other users have applied to help identify data types of particular interest contained in each asset.

Finally, we see that some of the assets are identified with a tag specifying tiers ranging from Tier1 to Tier5. Tiers are a means of identifying the relative importance of assets.

6. View in-product documentation for Tiers

To learn more about Tiers and other tags, we can visit Settings > Tags.

Clicking Tier from the Tag Categories provides us with a description of the Tier tag type as well as a detailed description of each tier.

Note also that the description for each tier includes a Usage label identifying the number of assets to which that tag has been applied. This number is linked to all assets tagged accordingly. Usage data is maintained for Tier tags and all other tags as well.
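
As a rough illustration of what a Usage count represents, the sketch below tallies how many assets carry each tag and lists the assets behind a given count. The mapping, the table names other than fact_sale and fact_order, and both helper functions are hypothetical; this is not how OpenMetadata stores or computes tag usage.

```python
from collections import Counter

# Hypothetical asset -> tags mapping; the tier assignments are illustrative only.
asset_tags = {
    "fact_sale": ["Tier1"],
    "fact_order": ["Tier1"],
    "dim_product": ["Tier2"],
    "raw_clicks": [],
}


def tag_usage(asset_tags):
    """Count the number of assets to which each tag has been applied."""
    counts = Counter()
    for tags in asset_tags.values():
        counts.update(set(tags))  # set() so a repeated tag counts once per asset
    return counts


def assets_with_tag(asset_tags, tag):
    """Return the sorted names of all assets carrying `tag`."""
    return sorted(name for name, tags in asset_tags.items() if tag in tags)


usage = tag_usage(asset_tags)
tier1_assets = assets_with_tag(asset_tags, "Tier1")
```

Here `usage["Tier1"]` plays the role of the Usage label, and `tier1_assets` corresponds to the list we reach by following the linked number.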

7. Focus on Tier1 (important) assets

In general, for analyses that will drive business decisions, we want to ensure that the data we are using is important and already being used to drive other decisions. As we saw in the previous step, Tier1 assets meet this criterion.

Based on our consideration of asset descriptions, tags, and tiers, we now have a better sense of how to locate the data we need in order to perform an analysis of sales by region.

Let's go back to the tables tab in our search results, since that's where we'll find the source data we need. Looking at the options for filtering search results, we can select Tier1 to limit results to just the most important tables among the assets matching our query.

8. Sort by usage frequency

In addition to tiers, another indicator of importance is how frequently a table is used. The OpenMetadata search UI enables us to sort results by weekly usage. Let's go ahead and do that.
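
The filter from step 7 and the sort from this step can be expressed in a few lines. The following is a sketch under stated assumptions: `TableInfo`, `top_tables`, the table names other than fact_sale and fact_order, and all usage numbers are hypothetical and do not come from the sandbox.

```python
from typing import NamedTuple


class TableInfo(NamedTuple):
    name: str
    tier: str
    weekly_usage: int  # hypothetical number of uses in the past week


def top_tables(tables, tier="Tier1"):
    """Keep only tables in the given tier, most frequently used first."""
    selected = [t for t in tables if t.tier == tier]
    return sorted(selected, key=lambda t: t.weekly_usage, reverse=True)


# Illustrative values only.
tables = [
    TableInfo("fact_order", "Tier1", 70),
    TableInfo("sales_staging", "Tier3", 95),
    TableInfo("fact_sale", "Tier1", 90),
    TableInfo("sales_archive", "Tier1", 4),
]

ranked = [t.name for t in top_tables(tables)]
```

Filtering before sorting matters here: sales_staging has the highest usage in the sample data, but it is excluded because it is not a Tier1 asset.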

9. Limit consideration to high usage, Tier1 assets

Having sorted the Tier1 assets, we can see that there are probably only two tables that warrant further consideration: fact_sale and fact_order. Both rank roughly in the top quarter of all tables by usage frequency. Based on their names, either could serve our purpose, so we'll need to dig deeper.
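
One way to reason about "top quarter" is as a percentile rank: the share of tables whose usage is at or below that of the table in question. The sketch below uses that definition as an assumption; it is not necessarily how OpenMetadata calculates usage, and the function name, the placeholder table names, and the usage counts are hypothetical.

```python
def usage_percentile(usage_by_table, name):
    """Percentage of tables whose usage is at or below that of `name`."""
    target = usage_by_table[name]
    at_or_below = sum(1 for u in usage_by_table.values() if u <= target)
    return 100.0 * at_or_below / len(usage_by_table)


# Illustrative weekly usage counts for eight tables.
weekly_usage = {
    "table_a": 5,
    "table_b": 10,
    "table_c": 20,
    "table_d": 30,
    "table_e": 40,
    "table_f": 50,
    "fact_order": 70,
    "fact_sale": 90,
}

fact_order_pct = usage_percentile(weekly_usage, "fact_order")
fact_sale_pct = usage_percentile(weekly_usage, "fact_sale")

# Under this definition, a table is in the top quarter when its percentile exceeds 75.
in_top_quarter = [n for n in weekly_usage if usage_percentile(weekly_usage, n) > 75]
```

With these sample numbers, only fact_order and fact_sale land in the top quarter, which matches the shortlist we arrived at by inspecting the sorted results.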

10. Use descriptions to distinguish between candidate assets

At this point, we can see that we'll need to compare fact_sale and fact_order to determine which best suits our needs. Looking at the descriptions for each table, we see a couple of statements that help clarify which table we should use.

First, the fact_sale description includes a statement indicating that we should use fact_sale.

Then, the fact_order description includes a statement that directs us to use the fact_sale table when computing financial metrics.

As further evidence, recall that the description of the fact_order_and_sales_etl pipeline that we reviewed in step 5 above also calls out the use of fact_sale for critical reporting.

Taken together, the Tier1 designation, the frequency of use, and the direction we've gleaned from three asset descriptions provide a high degree of confidence that fact_sale is the right table for us to use.

In the next tutorial, we will explore how to assess an asset to learn what we need to know about the individual fields, related tables and other assets, and how to get help with specific questions about the asset.

Thanks for following along with this introduction to OpenMetadata! Have questions? Please join the OpenMetadata Slack. We have an active and engaged community that is ready to help!