176 Commits

Author SHA1 Message Date
Na Zhang
ce5e8a2a14 Add exception handling for code search job 2017-02-10 16:10:41 -08:00
Yi (Alan) Wang
665a5dbded Add retry for ETL jobs failed at initialization (#308) 2017-01-27 11:17:38 -08:00
Na Zhang
ecb302efee change default value for static boosting score to 1 2017-01-25 16:40:10 -08:00
Yi Wang
4bc63cd4bd Fix bug for Teradata Load job, not loading samples on days that won't extract sample data. 2017-01-18 17:01:33 -08:00
camelliazhang
724f754f03 clean and refactor elastic search ETL job (#300) 2016-12-14 21:22:30 -08:00
Eric Sun
a3504fa57f Fix jsonpath after upgrading com.jayway.jsonpath to 2.2 (#299)
* use schema_url_helper to fetch avro schema from hdfs or http location

* trim space

* add dfs.namenode.kerberos.principal.pattern; include htrace for SchemaUrlHelper

* fix jsonpath for job history log parser; do not throw exception if kerberos config files are missing for job history http connection

* avoid null return value for sepCommaString(); fix a typo
2016-12-13 21:14:52 -08:00
Yi (Alan) Wang
20db44df20 Merge pull request #277 from alyiwang/master
Map git repo and owners to Oracle/espresso/dali datasets
2016-11-30 17:40:05 -08:00
Na Zhang
63d3b2c82d refine elastic search boosting rule 2016-11-29 11:24:00 -08:00
Yi Wang
51f911b400 Map git repo and owners to Oracle/espresso/dali datasets 2016-11-22 10:51:00 -08:00
camelliazhang
2aaafed98c Merge pull request #274 from camelliazhang/master
mark SCM users confirmed by system automatically
2016-11-11 11:53:22 -08:00
Yi (Alan) Wang
06ada42bb9 Merge pull request #272 from alyiwang/master
Add JobExecutionLineageEvent and kafka processor
2016-11-11 11:39:09 -08:00
Na Zhang
1962f0a477 mark SCM users confirmed by system automatically 2016-11-11 11:12:28 -08:00
Na Zhang
2facf409b2 update the score table during elastic search dataset update 2016-11-11 10:09:31 -08:00
Yi Wang
b4f5e438e2 Add JobExecutionLineageEvent and kafka processor 2016-11-08 19:11:37 -08:00
Na Zhang
725e689326 add exception handling for DATABASE_SCM_METADATA_ETL and collect info 2016-11-08 17:37:36 -08:00
Na Zhang
217b7d9d09 search ranking improvement with static boosting 2016-11-08 15:18:51 -08:00
Eric Sun
7b36d09b58 Add get_schema_literal_from_url() to fetch schema literal based on schema url (#268)
* use schema_url_helper to fetch avro schema from hdfs or http location

* trim space

* add dfs.namenode.kerberos.principal.pattern; include htrace for SchemaUrlHelper
2016-11-07 08:14:45 -08:00
Yi Wang
664e4072bb Upgrade to play 2.4.8 2016-10-19 17:42:28 -07:00
Na Zhang
dbaf053e76 Add local test properties template for teradata and scm owners ETL 2016-10-19 14:10:29 -07:00
Na Zhang
043dc25e89 Get owners for espresso and oracle, and fix a bug for teradata 2016-10-19 11:13:32 -07:00
Yi Wang
5049c847fa Update Kafka consumer actors to reduce memory usage 2016-10-10 14:49:14 -07:00
Yi Wang
c9f4f18d9c Update Azkaban_Execution job to fetch cronExpression in flow scheduling 2016-10-06 13:43:10 -07:00
Yi (Alan) Wang
c9dfb637af Update MetadataChangeEvent APIs according to schema change (#243)
* Update MetadataChangeEvent APIs according to schema change

* Update MultiproductLoad to reflect new Owner types

* Add comments for Owner_type precedence (priority) and compliance
2016-10-06 13:33:45 -07:00
Yi Wang
0356497124 Add comments for Owner_type precedence (priority) and compliance 2016-10-06 13:24:29 -07:00
Yi Wang
8ab5c824b0 Update MultiproductLoad to reflect new Owner types 2016-10-03 18:39:21 -07:00
camelliazhang
fe1e698b8a remove hive instance hardcode cluster name (#236) 2016-09-30 17:15:43 -07:00
Na Zhang
10339690a9 Update HiveTransform and HiveLoad, remove hardcoded cluster name 2016-09-30 16:59:59 -07:00
Eric Sun
fd3b4baef8 avoid loop in LDAP org hierarchy (#242) 2016-09-30 16:45:38 -07:00
jerrybai2009
5f0426ea6b using the dynamic cursor to reduce the memory usage (#241) 2016-09-30 16:45:17 -07:00
jbai
a11e4908dc tracking the GobblinTrackingEvent_audit to get owner information 2016-09-29 15:01:32 -07:00
Yi Wang
ac34eb683f Update Kafka processor casting Object to String, also add debug info if can't fetch schema from Registry 2016-09-26 15:06:33 -07:00
Na Zhang
5c76f47313 remove hive instance hardcode cluster name 2016-09-26 15:06:30 -07:00
Yi Wang
1ad2b1528e logback redirect ETL job logs into corresponding files 2016-09-23 16:54:52 -07:00
jerrybai2009
f7878cdfe4 fix the elastic search index out of gc issue (#223) 2016-09-13 16:43:48 -07:00
Eric Sun
86bf71499f Reformat the ETL job info message in log. (#222)
* Use ProcessBuilder and redirected log file for HDFS Extract

* relax urn validation rule

* continue process if hive sql parser encounters error

* reformat etl job log message
2016-09-13 14:01:14 -07:00
Yi Wang
33e592da14 Modify HdfsLoad to improve speed 2016-09-09 17:41:13 -07:00
Yi Wang
4c500402fe Map repo owner fix, change 'main' to 'Producer' and reset sort id 2016-09-02 13:52:00 -07:00
Yi Wang
a809b0ac47 Map repo owner fix to use dataset group mapping 2016-09-01 18:19:41 -07:00
Yi Wang
81f891bfab Map scm repo owner to dataset owner table 2016-08-30 15:35:28 -07:00
Yi (Alan) Wang
579b8fc9d7 Add metadataChangeEvent APIs to backend-service (#205)
* Add multiproduct and git repo metadata etl job

* Extract commit hash use it when querying acl

* Use FileWriter to write records into CSV file

* Remove unnecessary log entries from kafka processor

* Fix the incompatibility between integer repo_id in db and string field in record

* merge API tables to existing dataset owner and schema field table

* Add confidential and recursive column to dict_dataset_field
2016-08-24 09:10:35 -07:00
Yi (Alan) Wang
078e90e8bd Add multiproduct and git repo metadata etl job (#202)
* Add multiproduct and git repo metadata etl job

* implement the dataset availability section

* Extract commit hash use it when querying acl

* Use FileWriter to write records into CSV file

* Remove unnecessary log entries from kafka processor

* Fix the incompatibility between integer repo_id in db and string field in record
2016-08-12 12:26:55 -07:00
Eric Sun
cd4853d0a5 Use ProcessBuilder and redirected log file for HDFS Extract (#198)
* Use ProcessBuilder and redirected log file for HDFS Extract

* relax urn validation rule
2016-08-08 14:02:34 -07:00
Yi Wang
3d3b2a8075 Get kafka job id from application.conf and then get ref_id and configs from DB 2016-08-03 18:55:07 -07:00
Yi Wang
dbbdb6e2fb Modify Oracle metadata ETL job, use Json dumps and remove unnecessary quotes 2016-08-03 18:49:00 -07:00
jerrybai2009
b4a718efd0 Merge pull request #195 from ericsun2/master
temp fix for hdfs_schema_crawler getRuntime().exec() hangs problem
2016-08-03 18:15:43 -07:00
jerrybai2009
e7c7175cba Merge pull request #188 from jerrybai2009/master
load the teradata and hadoop data into table dict_dataset_instance
2016-08-03 18:13:06 -07:00
Eric Sun
1cd5872369 temp fix for hdfs_schema_crawler getRuntime().exec() hangs problem; exclude log4j 2016-08-03 15:50:00 -07:00
Eric Sun
6355ccc039 add python module [requests] for simple REST client 2016-07-29 23:10:33 -07:00
jbai
ea1ac0da9f load the teradata and hadoop data into table dict_dataset_instance 2016-07-29 10:59:33 -07:00
Yi Wang
74ed769bab add Oracle dataset metadata ETL job 2016-07-28 14:07:07 -07:00