Onkar Ravgan 14fa96958f
Added dbt workflow docs (#9493)
* Added dbt workflow docs

* added dbt small case

* Fixed review comments
2022-12-22 13:11:18 +00:00

8.5 KiB

title slug
Upgrade 0.11 to 0.12 /deployment/upgrade/versions/011-to-012

Upgrade from 0.11 to 0.12

Upgrading from 0.11 to 0.12 can be done directly on your instances. This page will list few general details you should take into consideration when running the upgrade.

Highlights

Database Connection Environment Variables

On 0.11, the Environment Variables to connect to Database used were

  1. MYSQL_USER
  2. MYSQL_USER_PASSWORD
  3. MYSQL_HOST
  4. MYSQL_PORT
  5. MYSQL_DATABASE.

These environment variables are changed in 0.12.0 Release

  1. DB_USER
  2. DB_USER_PASSWORD
  3. DB_HOST
  4. DB_PORT
  5. OM_DATABASE.

This will effect to all the bare metal and docker instances which configures a custom database depending on the above environment variable values.

This change is however not affected for Kubernetes deployments.

Data Profiler and Data Quality Tests

On 0.11, the Profiler Workflow handled two things:

  • Computing metrics on the data
  • Running the configured Data Quality Tests

There has been a major overhaul where not only the UI greatly improved, now showing all historical data, but on the internals as well. Main topics to consider:

  1. Tests now run with the Test Suite workflow and cannot be configured in the Profiler Workflow
  2. Any past test data will be cleaned up during the upgrade to 0.12.0, as the internal data storage has been improved
  3. The Profiler Ingestion Pipelines will be cleaned up during the upgrade to 0.12.0 as well
  4. You will see broken profiler DAGs in airflow -- you can simply delete these DAGs

dbt Tests Integration

From 0.12, OpenMetadata supports ingesting the tests and the test results from your dbt project.

  • Along with manifest.json and catalog.json files, We now provide an option to ingest the run_results.json file generated from the dbt run and ingest the test results from it.
  • The field to enter the run_results.json file path is an optional field in the case of local and http dbt configs. The test results will be ingested from this file if it is provided.
  • For others the file will be picked from their respective sources if it is available.

Profiler Workflow Updates

On top of the information above, the fqnFilterPattern has been converted into the same patterns we use for ingestion, databaseFilterPattern, schemaFilterPattern and tableFilterPattern.

In the processor you can now configure:

  • profileSample to specify the % of the table to run the profiling on
  • columnConfig.profileQuery as a query to use to sample the data of the table
  • columnConfig.excludeColumns and columnConfig.includeColumns to mark which columns to skip.
  • In columnConfig.includeColumns we can also specify a list of metrics to run from our supported metrics.

Profiler Multithreading for Snowflake users

In OpenMetadata 0.12 we have migrated the metrics computation to multithreading. This migration reduced metrics computation time by 70%.

Snowflake users may experience a circular import error. This is a known issue with snowflake-connector-python. If you experience such error we recommend to either 1) run the ingestion workflow in Python 3.8 environment or 2) if you can't manage your environement set ThreadCount to 1. You can find more information on the profiler setting here

Airflow Version

The Airflow version from the Ingestion container image has been upgraded to 2.3.3.

Note that this means that now this is the version that will be used to run the Airflow metadata extraction. This impacted for example when ingesting status from Airflow 2.1.4 (issue[https://github.com/open-metadata/OpenMetadata/issues/7228]).

Moreover, the authentication mechanism that Airflow exposes for the custom plugins has changed. This required us to fully update how we were handling the managed APIs, both on the plugin side and the OpenMetadata API (which is the one sending the authentication).

To continue working with your own Airflow linked to the OpenMetadata UI for ingestion management, we recommend migrating to Airflow 2.3.3.

If you are using your own Airflow to prepare the ingestion from the UI, which is stuck in version 2.1.4, and you cannot upgrade that, but you want to use OM 0.12, reach out to us.

Note
Upgrading airflow from 2.1.4 to 2.3 requires a few steps. If you are using your airflow instance only to run OpenMetadata workflow we recommend you to simply drop the airflow database. You can simply connect to your database engine where your airflow database exist, make a back of your database, drop it and recreate it.

DROP DATABASE airflow_db;
CREATE DATABASE airflow_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
GRANT ALL PRIVILEGES ON airflow_db.* TO 'airflow_user'@'%' WITH GRANT OPTION;

If you would like to keep the existing data in your airflow instance or simply cannot drop your airflow database you will need to follow the steps below:

  1. Make a backup of your database (this will come handing if the migration fails and you need to perform any kind of restore)
  2. Upgrade your airflow instance from 2.1.x to 2.2.x
  3. Before upgrading to the 2.3.x version perform the steps describe on the airflow documentation page [here] to make sure character set/collation uses the correct encoding -- this has been changing across MySQL versions.
  4. Once the above has been performed you should be able to upgrade to airflow 2.3.x

Connector Improvements

  • Oracle: In 0.11.x and previous releases, we were using the Cx_Oracle driver to extract the metadata from oracledb. The drawback of using this driver was it required Oracle Client libraries to be installed in the host machine in order to run the ingestion. With the 0.12 release, we will be using the python-oracledb driver which is a upgraded version of Cx_Oracle. python-oracledb with Thin mode does not need Oracle Client libraries.

  • Azure SQL & MSSQL: Azure SQL & MSSQL with pyodbc scheme requires ODBC driver to be installed, with 0.12 release we are shipping the ODBC Driver 18 for SQL Server out of the box in our ingestion docker image.

Service Connection Updates

  • DynamoDB
    • Removed: database
  • Deltalake:
    • Removed: connectionOptions and supportsProfiler
  • Looker
    • Renamed username to clientId and password to clientSecret to align on the internals required for the metadata extraction.
    • Removed: env
  • Oracle
    • Removed: databaseSchema and oracleServiceName from the root.
    • Added: oracleConnectionType which will either contain oracleServiceName or databaseSchema. This will reduce confusion on setting up the connection.
  • Athena
    • Removed: hostPort
  • Databricks
    • Removed: username and password
  • dbt Config
    • Added: dbtRunResultsFilePath and dbtRunResultsHttpPath where path of the run_results.json file can be passed to get the test results data from dbt.

In 0.12.1

  • DeltaLake:
    • Updated the structure of the connection to better reflect the possible options.
      • Removed metastoreHostPort and metastoreFilePath, which are now embedded inside metastoreConnection
      • Added metastoreDb as an option to be passed inside metastoreConnection
    • You can find more information about the current structure here

Ingestion from CLI

We have stopped updating the service connection parameters when running the ingestion workflow from the CLI. The connection parameter will be retrieved from the server if the service already exists. Therefore, the connection parameters of a service will only be possible to be updated from the OpenMetadata UI.

Bots configuration

In the 0.12.1 version, AIRFLOW_AUTH_PROVIDER and OM_AUTH_AIRFLOW_{AUTH_PROVIDER} parameters are not needed to configure how the ingestion is performed from Airflow when our OpenMetadata server is secured. This can be achieved directly from UI through the Bots configuration in the settings page. For more information, visit the section of each SSO configuration in the Enable Security chapter.

Note that the ingestion-bot bot is created (or updated if it already exists) as a system bot that cannot be deleted, and the credentials used for this bot, if they did not exist before, will be the ones present in the OpenMetadata configuration. Otherwise, a JWT Token will be generated to be the default authentication mechanism of the ingestion-bot.