Map repo owner fix, change 'main' to 'Producer' and reset sort id

2025-12-28 10:28:22 +00:00 · 2016-09-02 13:39:12 -07:00 · 2016-09-02 13:39:12 -07:00 · 4c500402fe
commit 4c500402fe
parent a809b0ac47 9bbbf1a68e
10 changed files with 252 additions and 14 deletions
--- a/README.md
+++ b/README.md
@@ -67,6 +67,11 @@ Execute the [DDL files][DDL] to create the required repository tables in **where

 Want to contribute? Check out the [Contributors Guide][CON]

+## Community
+
+Want help? Check out the [Google Groups][LIST]
+
+
 [wiki]: https://github.com/LinkedIn/Wherehows/wiki
 [GS]: https://github.com/LinkedIn/Wherehows/wiki/Getting-Started
 [CON]: https://github.com/LinkedIn/Wherehows/wiki/Contributing
@@ -74,3 +79,4 @@ Want to contribute? Check out the [Contributors Guide][CON]
 [EXJAR]: https://github.com/LinkedIn/Wherehows/wiki/Getting-Started#download-third-party-jar-files
 [DDL]: https://github.com/linkedin/WhereHows/tree/master/data-model/DDL
 [DB]: https://github.com/LinkedIn/Wherehows/wiki/Getting-Started#set-up-your-database
+[LIST]: https://groups.google.com/forum/#!forum/wherehows
--- a/backend-service/README
+++ b/backend-service/README
@@ -1,4 +0,0 @@
-This is your new Play application
-=====================================
-
-This file will be packaged with your application, when using `play dist`.
--- a/backend-service/README.md
+++ b/backend-service/README.md
@@ -0,0 +1,184 @@
+# LinkedIn Wherehows - a metadata data warehouse
+
+Wherehows works by sending out ‘crawlers’ to capture metadata from databases, hdfs, directory services, schedulers, and data integration tools. The collected metadata is loaded into an integrated data warehouse. Wherehows provides a web-ui service and a backend service.
+
+Wherehows comes in three operational components:
+- A web-ui service
+- Backend service
+- Database schema for MySQL
+
+The backend service provides the RESTful API but, more importantly, runs the ETL jobs that gather the metadata. The backend service relies heavily on the MySQL wherehows database instance for configuration information and as the location where the metadata lands.
+
+The Web UI provides navigation between the bits of information and the ability to annotate the collected data with comments, ownership and more. The example below collects Hive metadata from the Cloudera Hadoop VM.
+
+
+Configuration notes:
+MySQL database for the Wherehows metadata database
+```
+host:	<mysqlhost>
+db:		wherehows
+user:	wherehows
+pass:	wherehows
+```
+Wherehows application directory (in test):
+```
+Host:	<edge node>
+Folder:	/opt/wherehows
+```
+
+# Key notes:
+
+Please become familiar with these pages:
+- https://github.com/linkedin/WhereHows/wiki/Architecture (Nice tech overview)
+- https://github.com/dmoore247/WhereHows  (this is my fork, used to stabilize the release)
+- https://github.com/linkedin/WhereHows
+- https://github.com/LinkedIn/Wherehows/wiki/Getting-Started
+
+First set the environment variables for Play and Gradle:
+```
+export WH_HOME=~/development/wherehows/src/deployment/
+export PLAY_HOME=~/development/play-2.2.4
+export GRADLE_HOME=~/development/gradle-2.4
+export PATH=$PATH:$GRADLE_HOME/bin:$PLAY_HOME
+```
+
+### Build:
+```
+gradlew dist
+```
+
+### Install:
+Download/upload the distribution binaries, unzip to 
+```
+/opt/wherehows/backend-service-1.0-SNAPSHOT
+```
+
+Create temp space for wherehows
+```
+sudo mkdir /var/tmp/wherehows
+sudo chmod a+rw /var/tmp/wherehows
+```
+
+```
+cd /opt/wherehows/backend-service-1.0-SNAPSHOT
+```
+Ensure that the wherehows configuration tables are initialized by running the insert scripts (download the 1.9 KB wherehows.dump). Note that you need to change the MySQL host property for the wherehows database (on <mysqlhost>). The initial SQL:
+~~~~
+--
+-- Dumping data for table `wh_etl_job`
+--
+
+INSERT INTO `wh_etl_job` VALUES (21,'HIVE_DATASET_METADATA_ETL','DATASET','5 * * * * ?',61,'DB',NULL,1470390365,'comments','','Y');
+
+
+
+--
+-- Dumping data for table `wh_etl_job_property`
+--
+
+INSERT INTO `wh_etl_job_property` VALUES (117,'HIVE_DATASET_METADATA_ETL',61,'DB','hive.metastore.jdbc.url','jdbc:mysql://10.153.252.111:3306/metastore','N','url to connect to hive metastore'),(118,'HIVE_DATASET_METADATA_ETL',61,'DB','hive.metastore.jdbc.driver','com.mysql.jdbc.Driver','N',NULL),(119,'HIVE_DATASET_METADATA_ETL',61,'DB','hive.metastore.password','hive','N',NULL),(120,'HIVE_DATASET_METADATA_ETL',61,'DB','hive.metastore.username','hive','N',NULL),(121,'HIVE_DATASET_METADATA_ETL',61,'DB','hive.schema_json_file','/var/tmp/wherehows/hive_schema.json','N',NULL),(122,'HIVE_DATASET_METADATA_ETL',61,'DB','hive.schema_csv_file','/var/tmp/wherehows/hive_schema.csv','N',NULL),(123,'HIVE_DATASET_METADATA_ETL',61,'DB','hive.field_metadata','/var/tmp/wherehows/hive_field_metadata.csv','N',NULL);
+
+--
+-- Table structure for table `wh_property`
+--
+
+
+--
+-- Dumping data for table `wh_property`
+--
+
+INSERT INTO `wh_property` VALUES ('wherehows.app_folder','/var/tmp/wherehows','N',NULL),('wherehows.db.driver','com.mysql.jdbc.Driver','N',NULL),('wherehows.db.jdbc.url','jdbc:mysql://localhost/wherehows','N',NULL),('wherehows.db.password','wherehows','N',NULL),('wherehows.db.username','wherehows','N',NULL),('wherehows.encrypt.master.key.loc','/var/tmp/wherehows/.wherehows/master_key','N',NULL),('wherehows.ui.tree.dataset.file','/var/tmp/wherehows/resource/dataset.json','N',NULL),('wherehows.ui.tree.flow.file','/var/tmp/wherehows/resource/flow.json','N',NULL);
+
+
+~~~~
+
+The hive metastore (as MySQL database) properties need to match the hadoop cluster:
+```
+Host	 <metastore host>
+Port	 3306
+Username hive
+Password hive
+URL		 jdbc:mysql://<metastore host>:3306/metastore
+```
+Set the hive metastore driver class to ```com.mysql.jdbc.Driver```
+and set the other properties per your configuration.
+
+Ensure these JAR files are present
+```
+ lib/jython-standalone-2.7.0.jar
+ lib/mysql-connector-java-5.1.36.jar
+```
+### Run
+To run the backend service:
+Set these variables to configure the application (or edit conf/database.conf)
+```
+export WHZ_DB_URL=jdbc:mysql://<mysql host>:3306/wherehows
+export WHZ_DB_USERNAME=wherehows
+export WHZ_DB_PASSWORD=wherehows
+export WHZ_DB_HOST=<mysql host>
+```
+Run the backend service application on port 9001 (from the backend-service folder):
+```
+$PLAY_HOME/play "run -Dhttp.port=9001"
+```
+In a separate window, monitor the log:
+```tail -f /var/tmp/wherehows/wherehows.log```
+
+Open a browser to ```http://<edge node>:9001/```
+This will show ‘TEST’. This is the RESTful API endpoint.
+
+Run the web ui
+```
+cd <web ui deployment dir>
+cd web
+# <ensure the conf/*.conf files are configured>
+$PLAY_HOME/play run
+```
+
+## Next steps
+- Once the Hive ETL is fully fleshed out, look at the HDFS metadata ETL
+- Configure multiple Hive & HDFS jobs to gather data from all Hadoop clusters
+- Add additional crawlers for Oracle, Teradata, ETL and schedulers
+
+### Troubleshooting
+To check the configuration properties:
+```
+select * from wh_etl_job;
+select * from wh_etl_job_property;
+select * from wh_property;
+
+select distinct wh_etl_job_name from wh_etl_job;
+
+select j.wh_etl_job_name, j.ref_id_type, j.ref_id, 
+       coalesce(d.db_code, a.app_code) db_or_app_code,
+       j.cron_expr, p.property_name, p.property_value
+from wh_etl_job j join wh_etl_job_property p
+  on j.wh_etl_job_name = p.wh_etl_job_name
+ and j.ref_id_type = p.ref_id_type
+ and j.ref_id = p.ref_id
+     left join cfg_database d
+  on j.ref_id = d.db_id
+ and j.ref_id_type = 'DB'
+     left join cfg_application a
+  on j.ref_id = a.app_id
+ and j.ref_id_type = 'APP'
+where j.wh_etl_job_name = 'HIVE_DATASET_METADATA_ETL'
+/*  AZKABAN_EXECUTION_METADATA_ETL
+    AZKABAN_LINEAGE_METADATA_ETL
+    ELASTICSEARCH_EXECUTION_INDEX_ETL
+    HADOOP_DATASET_METADATA_ETL
+    HADOOP_DATASET_OWNER_ETL
+    HIVE_DATASET_METADATA_ETL
+    KAFKA_CONSUMER_ETL
+    LDAP_USER_ETL
+    OOZIE_EXECUTION_METADATA_ETL
+    ORACLE_DATASET_METADATA_ETL
+    PRODUCT_REPO_METADATA_ETL
+    TERADATA_DATASET_METADATA_ETL */
+--  and j.ref_id = 123
+/*  based on cfg_database or cfg_application */
+order by j.wh_etl_job_name, db_or_app_code, p.property_name;
+```
+To log in the first time to the web UI:
+
+You have to create an account. In the upper right corner there is a "Not a member yet? Join Now" link. Click it to get a sign-up form to fill out.
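
A small pre-flight check can catch a missing database setting before the backend service starts. The sketch below is not part of WhereHows: the names `REQUIRED_SETTINGS` and `missing_settings` are hypothetical, and the only thing taken from the README above is the list of four `WHZ_DB_*` variables in its Run section.

```python
import os

# The four variables the README's Run section asks you to export.
REQUIRED_SETTINGS = (
    "WHZ_DB_URL",
    "WHZ_DB_USERNAME",
    "WHZ_DB_PASSWORD",
    "WHZ_DB_HOST",
)


def missing_settings(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to the process environment; passing a plain dict
    makes the check easy to exercise without touching os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]
```

Calling `missing_settings()` with no arguments inspects the current shell environment; an empty list means all four variables are set. Whether the service also reads overrides from `conf/database.conf` is outside what this check covers.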
--- a/metadata-etl/src/main/resources/jython/MultiproductExtract.py
+++ b/metadata-etl/src/main/resources/jython/MultiproductExtract.py
@@ -262,7 +262,7 @@ class MultiproductLoad:
                repo_fullname,
                scm_type,
                repo_id,
-                acl_name,
+                acl_name.title(),
                owner,
                sort_id,
                paths,
--- a/metadata-etl/src/main/resources/jython/MultiproductLoad.py
+++ b/metadata-etl/src/main/resources/jython/MultiproductLoad.py
@@ -95,6 +95,7 @@ class MultiproductLoad:

  def merge_repo_owners_into_dataset_owners(self):
    merge_repo_owners_into_dataset_owners_cmd = '''
+    -- find owner app_id, 300 for USER, 301 for GROUP
    UPDATE stg_repo_owner stg
    JOIN (select app_id, user_id from dir_external_user_info) ldap
    ON stg.owner_name = ldap.user_id
@@ -105,16 +106,20 @@ class MultiproductLoad:
    ON stg.owner_name = ldap.group_id
    SET stg.app_id = ldap.app_id;

+    -- INSERT/UPDATE into dataset_owner
    INSERT INTO dataset_owner (
    dataset_id, dataset_urn, owner_id, sort_id, namespace, app_id, owner_type, owner_sub_type, owner_id_type,
    owner_source, db_ids, is_group, is_active, source_time, created_time, wh_etl_exec_id
    )
    SELECT * FROM (
    SELECT ds.id, ds.urn, r.owner_name n_owner_id, r.sort_id n_sort_id,
-        'urn:li:corpuser' n_namespace, r.app_id, r.owner_type n_owner_type, null n_owner_sub_type,
+        'urn:li:corpuser' n_namespace, r.app_id,
+        IF(r.owner_type = 'main', 'Producer', r.owner_type) n_owner_type,
+        null n_owner_sub_type,
        case when r.app_id = 300 then 'USER' when r.app_id = 301 then 'GROUP' else null end n_owner_id_type,
-        'SCM' n_owner_source, null db_ids, case when r.app_id = 301 then 'Y' else 'N' end is_group, 'Y' is_active,
-        0 source_time, unix_timestamp(NOW()) created_time, r.wh_etl_exec_id
+        'SCM' n_owner_source, null db_ids,
+        IF(r.app_id = 301, 'Y', 'N') is_group,
+        'Y' is_active, 0 source_time, unix_timestamp(NOW()) created_time, r.wh_etl_exec_id
    FROM (SELECT id, urn FROM dict_dataset WHERE urn like 'dalids:///%') ds
      JOIN (SELECT object_name, mapped_object_name FROM cfg_object_name_map WHERE mapped_object_type = 'scm') m
        ON m.object_name = concat('/', substring_index(substring_index(ds.urn, '/', 4), '/', -1))
@@ -133,6 +138,19 @@ class MultiproductLoad:
    namespace = COALESCE(namespace, n.n_namespace),
    wh_etl_exec_id = n.wh_etl_exec_id,
    modified_time = unix_timestamp(NOW());
+
+    -- reset dataset owner sort id
+    UPDATE dataset_owner d
+      JOIN (
+        select dataset_urn, dataset_id, owner_type, owner_id, sort_id,
+            @owner_rank := IF(@current_dataset_id = dataset_id, @owner_rank + 1, 0) rank,
+            @current_dataset_id := dataset_id
+        from dataset_owner, (select @current_dataset_id := 0, @owner_rank := 0) t
+        where dataset_urn like 'dalids:///%'
+        order by dataset_id asc, owner_type desc, sort_id asc, owner_id asc
+      ) s
+    ON d.dataset_id = s.dataset_id AND d.owner_id = s.owner_id
+    SET d.sort_id = s.rank;
    '''

    self.executeCommands(merge_repo_owners_into_dataset_owners_cmd)
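
The two SQL changes in this file (mapping `'main'` to `'Producer'`, then renumbering `sort_id` per dataset) can be read as the following plain-Python sketch. It is an illustration only, not code from the commit: the function names and the dict-based row shape are hypothetical, and it compares `owner_type` with Python's case-sensitive string ordering, which may differ from the MySQL collation in use. The SQL also relies on user variables being evaluated in `ORDER BY` order, which the sketch sidesteps by sorting explicitly.

```python
from itertools import groupby


def map_owner_type(owner_type):
    # Mirrors IF(r.owner_type = 'main', 'Producer', r.owner_type).
    return 'Producer' if owner_type == 'main' else owner_type


def reset_sort_ids(owners):
    """Return copies of the owner rows with sort_id renumbered from 0
    within each dataset.

    Each row is a dict with dataset_id, owner_type, sort_id, owner_id.
    Ordering follows the SQL: dataset_id asc, owner_type desc,
    sort_id asc, owner_id asc.
    """
    # Python's sort is stable, so apply the least significant keys first.
    ordered = sorted(owners, key=lambda o: (o['sort_id'], o['owner_id']))
    ordered = sorted(ordered, key=lambda o: o['owner_type'], reverse=True)
    ordered = sorted(ordered, key=lambda o: o['dataset_id'])

    result = []
    for _, group in groupby(ordered, key=lambda o: o['dataset_id']):
        # The rank restarts at 0 for every dataset, as @owner_rank does.
        for rank, owner in enumerate(group):
            result.append(dict(owner, sort_id=rank))
    return result
```

With this ordering, `'Producer'` owners sort ahead of `'Owner'` rows within a dataset because the comparison on `owner_type` is descending, so a mapped `'main'` owner ends up with one of the lowest `sort_id` values.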
--- a/web/app/views/index.scala.html
+++ b/web/app/views/index.scala.html
@@ -321,7 +321,7 @@
              </td>
              <td class="commentsArea">
                <div class="commentsArea">
-                  {{#schema-comment schema=schema datasetId=this.model.id fieldId=schema.id getSchema="getSchema"}}{{/schema-comment}}
+                  {{#schema-comment schema=schema datasetId=dataset.id fieldId=schema.id}}{{/schema-comment}}
                  {{schema.commentHtml}}
                </div>
              </td>
--- a/web/public/javascripts/components/components.js
+++ b/web/public/javascripts/components/components.js
@@ -102,6 +102,36 @@ App.DatasetSchemaComponent = Ember.Component.extend({
    }, 500);
  },
  actions: {
+    getSchema: function(){
+      var _this = this
+      var id = _this.get('dataset.id')
+      var columnUrl = 'api/v1/datasets/' + id + "/columns";
+      _this.set("isTable", true);
+      _this.set("isJSON", false);
+      $.get(columnUrl, function(data) {
+        if (data && data.status == "ok")
+        {
+          if (data.columns && (data.columns.length > 0))
+          {
+            _this.set("hasSchemas", true);
+            data.columns = data.columns.map(function(item, idx){
+              item.commentHtml = marked(item.comment).htmlSafe()
+              return item
+            })
+            _this.set("schemas", data.columns);
+            setTimeout(initializeColumnTreeGrid, 500);
+          }
+          else
+          {
+            _this.set("hasSchemas", false);
+          }
+        }
+        else
+        {
+          _this.set("hasSchemas", false);
+        }
+      });
+    },
    setView: function (view) {
      switch (view) {
        case "tabular":
--- a/web/public/javascripts/components/schema-comment.js
+++ b/web/public/javascripts/components/schema-comment.js
@@ -497,7 +497,10 @@ App.SchemaCommentComponent = Ember.Component.extend({
          })
          .on('hidden.bs.modal', function(){
            _this.set('propModal', false)
-            _this.sendAction('getSchema')
+            if (_this.parentView && _this.parentView.controller)
+            {
+              _this.parentView.controller.send('getSchema')
+            }
            $("#datasetSchemaColumnCommentModal").modal('hide');
          })
      }, 300)
--- a/web/public/javascripts/main.js
+++ b/web/public/javascripts/main.js
@@ -339,3 +339,8 @@ function filterListView(category, filter)
    }
 }

+function initializeColumnTreeGrid()
+{
+    $('#json-table').treegrid();
+}
+
--- a/web/public/javascripts/routers/datasets.js
+++ b/web/public/javascripts/routers/datasets.js
@@ -1,7 +1,3 @@
-function initializeColumnTreeGrid()
-{
-  $('#json-table').treegrid();
-}

 function initializeDependsTreeGrid()
 {