Mirror of https://github.com/datahub-project/datahub.git (synced 2025-08-16 05:02:59 +00:00)

commit 6ece2d6469 (parent 91486a2ffd)

Start adding java ETL examples, starting with kafka etl. (#1805)

We've had a few requests to start providing Java examples rather than Python, due to type safety. I've also started to add these to metadata-ingestion-examples to make it clearer that these are *examples*. They can be used directly or as a basis for other things. As we port to Java, we'll move examples to contrib.
@@ -46,7 +46,7 @@ Please follow the [DataHub Quickstart Guide](docs/quickstart.md) to get a copy o
 * [Frontend](datahub-frontend)
 * [Web App](datahub-web)
 * [Generalized Metadata Service](gms)
-* [Metadata Ingestion](metadata-ingestion)
+* [Metadata Ingestion](metadata-ingestion-examples)
 * [Metadata Processing Jobs](metadata-jobs)

 ## Releases
@@ -44,7 +44,8 @@ project.ext.externalDependency = [
 'httpClient': 'org.apache.httpcomponents:httpclient:4.5.9',
 'jacksonCore': 'com.fasterxml.jackson.core:jackson-core:2.9.7',
 'jacksonDataBind': 'com.fasterxml.jackson.core:jackson-databind:2.9.7',
-"javatuples": "org.javatuples:javatuples:1.2",
+'javatuples': 'org.javatuples:javatuples:1.2',
+'javaxInject' : 'javax.inject:javax.inject:1',
 'jerseyCore': 'org.glassfish.jersey.core:jersey-client:2.25.1',
 'jerseyGuava': 'org.glassfish.jersey.bundles.repackaged:jersey-guava:2.25.1',
 'jsonSimple': 'com.googlecode.json-simple:json-simple:1.1.1',
@@ -57,8 +58,8 @@ project.ext.externalDependency = [
 'mariadbConnector': 'org.mariadb.jdbc:mariadb-java-client:2.6.0',
 'mockito': 'org.mockito:mockito-core:3.0.0',
 'mysqlConnector': 'mysql:mysql-connector-java:5.1.47',
-"neo4jHarness": "org.neo4j.test:neo4j-harness:3.4.11",
-"neo4jJavaDriver": "org.neo4j.driver:neo4j-java-driver:4.0.0",
+'neo4jHarness': 'org.neo4j.test:neo4j-harness:3.4.11',
+'neo4jJavaDriver': 'org.neo4j.driver:neo4j-java-driver:4.0.0',
 'parseqTest': 'com.linkedin.parseq:parseq:3.0.7:test',
 'playDocs': 'com.typesafe.play:play-docs_2.11:2.6.18',
 'playGuice': 'com.typesafe.play:play-guice_2.11:2.6.18',
@@ -66,7 +67,7 @@ project.ext.externalDependency = [
 'playTest': 'com.typesafe.play:play-test_2.11:2.6.18',
 'postgresql': 'org.postgresql:postgresql:42.2.14',
 'reflections': 'org.reflections:reflections:0.9.11',
-"rythmEngine": "org.rythmengine:rythm-engine:1.3.0",
+'rythmEngine': 'org.rythmengine:rythm-engine:1.3.0',
 'servletApi': 'javax.servlet:javax.servlet-api:3.1.0',
 'springBeans': 'org.springframework:spring-beans:5.2.3.RELEASE',
 'springContext': 'org.springframework:spring-context:5.2.3.RELEASE',
contrib/metadata-ingestion/python/README.md (new file, 23 lines)

# Python ETL examples

ETL scripts written in Python.

## Prerequisites

1. Before running any Python metadata ingestion job, make sure that the DataHub backend services are all running.
   The easiest way to do that is through [Docker images](../../docker).

2. You also need to build the `mxe-schemas` module as below.

   ```
   ./gradlew :metadata-events:mxe-schemas:build
   ```

   This is needed to generate `MetadataChangeEvent.avsc`, the schema for the `MetadataChangeEvent` Kafka topic.

3. All the scripts are written using Python 3 and most likely won't work with Python 2.x interpreters.
   You can verify your Python version with the following command.

   ```
   python --version
   ```

   We recommend using [pyenv](https://github.com/pyenv/pyenv) to install and manage your Python environment.

4. Before launching each ETL ingestion pipeline, install/verify the required library versions as below.

   ```
   pip install --user -r requirements.txt
   ```
contrib/metadata-ingestion/python/kafka-etl/README.md (new file, 17 lines)

# Kafka ETL

## Ingest metadata from Kafka to DataHub

The kafka_etl script provides an ETL channel from your Kafka cluster to DataHub.

```
➜ Configure your ZooKeeper environment variable in the file.
ZOOKEEPER # Your zookeeper host.

➜ Configure your Kafka environment variables in the file.
AVROLOADPATH # Your model event in avro format.
KAFKATOPIC # Your event topic.
BOOTSTRAP # Kafka bootstrap server.
SCHEMAREGISTRY # Kafka schema registry host.

➜ python kafka_etl.py
```

This will bootstrap DataHub with the metadata in your Kafka cluster as dataset entities.
@@ -12,19 +12,24 @@ dependencies {
   avsc project(':metadata-events:mxe-schemas')
 }
 
+def genDir = file("src/generated/java")
+
 task avroCodeGen(type: com.commercehub.gradle.plugin.avro.GenerateAvroJavaTask, dependsOn: configurations.avsc) {
   source("$rootDir/metadata-events/mxe-schemas/src/renamed/avro")
-  outputDir = file("src/generated/java")
+  outputDir = genDir
+  dependsOn(':metadata-events:mxe-schemas:renameNamespace')
 }
 
 compileJava.source(avroCodeGen.outputs)
-build.dependsOn avroCodeGen
 
-clean {
-  project.delete('src/generated')
+idea {
+  module {
+    sourceDirs += genDir
+    generatedSourceDirs += genDir
+  }
 }
 
-avroCodeGen.dependsOn(':metadata-events:mxe-schemas:renameNamespace')
+project.rootProject.tasks.idea.dependsOn(avroCodeGen)
 
 // Exclude classes from avro-schemas
 jar {
@@ -53,7 +53,7 @@ public class EventUtils {
 
   @Nonnull
   private static Schema getAvroSchemaFromResource(@Nonnull String resourcePath) {
-    URL url = Resources.getResource(resourcePath);
+    URL url = EventUtils.class.getClassLoader().getResource(resourcePath);
     try {
       return Schema.parse(Resources.toString(url, Charsets.UTF_8));
     } catch (IOException e) {
metadata-ingestion-examples/README.md (new file, 47 lines)

# Metadata Ingestion

This directory contains example apps for ingesting data into DataHub.

You are more than welcome to use these examples directly, or to use them as a reference for your own jobs.

See the README of each example for more information.

### Common themes

All these examples ingest by firing MetadataChangeEvent Kafka events. They do not ingest directly into DataHub, though
this is possible. Instead, the mce-consumer-job should be running, listening for these events, and performing the
ingestion for us.
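To make that pattern concrete, here is a condensed sketch distilled from the kafka-etl example later in this commit. The class and method names come from that example; `McePatternSketch` / `fireMce` are hypothetical helper names, and the Kafka producer wiring (see the `KafkaConfig` class below) is assumed to happen elsewhere.

```java
import com.linkedin.common.FabricType;
import com.linkedin.common.urn.DataPlatformUrn;
import com.linkedin.common.urn.DatasetUrn;
import com.linkedin.metadata.aspect.DatasetAspect;
import com.linkedin.metadata.dao.producer.KafkaMetadataEventProducer;
import com.linkedin.metadata.snapshot.DatasetSnapshot;
import com.linkedin.schema.SchemaMetadata;
import org.apache.avro.generic.IndexedRecord;
import org.apache.kafka.clients.producer.Producer;

public final class McePatternSketch {
  private McePatternSketch() {
  }

  /**
   * Fires a single MCE declaring the schema of a Kafka topic as a dataset; the running
   * mce-consumer-job then consumes the event and writes the metadata into DataHub.
   */
  public static void fireMce(Producer<String, IndexedRecord> producer, String topicName, SchemaMetadata schema) {
    final KafkaMetadataEventProducer<DatasetSnapshot, DatasetAspect, DatasetUrn> eventProducer =
        new KafkaMetadataEventProducer<>(DatasetSnapshot.class, DatasetAspect.class, producer);
    eventProducer.produceSnapshotBasedMetadataChangeEvent(
        new DatasetUrn(new DataPlatformUrn("kafka"), topicName, FabricType.PROD), schema);
    producer.flush();
  }
}
```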
### A note on languages

We initially wrote these examples in Python (they still exist in `metadata-ingestion`; TODO to delete them once they're
all ported). The idea was that these were very small example scripts that should be easy to use. However, upon
reflection, not all developers are familiar with Python, and the lack of types can hinder development. So the decision
was made to port the examples to Java.

You're more than welcome to extrapolate these examples into whatever language you like. At LinkedIn, we primarily use
Java.

### Ingestion at LinkedIn

It is worth noting that we do not use any of these examples directly (in Java, Python, or anything else) at LinkedIn. We
have several different pipelines for ingesting data; it all depends on the source.

- Some pipelines are based off other Kafka events, where we transform some existing Kafka event into a metadata event.
  - For example, we get Kafka events for Hive changes. We make MCEs out of those Hive events to ingest Hive data.
- For others, we've directly instrumented existing pipelines / apps / jobs to also emit metadata events.
  - For example, TODO? Gobblin?
- For others still, we've created a series of offline jobs to ingest data.
  - For example, we have an Azkaban job to process our HDFS datasets.

For some sources of data, one of these example scripts may work fine. For others, it may make more sense to have some
custom logic, as in the list above. Notably, all these examples today are one-off (they run, fire events, and then stop);
you may wish to build continuous ingestion pipelines instead.

### "Real" Ingestion Applications

We appreciate any contributions of apps you may wish to make to ingest data from other sources.

TODO this section feels a little weird. Are our ingestion apps not really real apps? :p LDAP is real, as is Kafka.
Granted, these are just one-off apps to ingest. Maybe we should provide a library for these, then expose the one-off
apps as examples?
metadata-ingestion-examples/common/build.gradle (new file, 20 lines)

plugins {
  id 'java'
}

dependencies {
  compile project(':metadata-dao-impl:kafka-producer')

  compile externalDependency.javaxInject
  compile externalDependency.kafkaAvroSerde
  compile externalDependency.kafkaSerializers
  compile externalDependency.lombok
  compile externalDependency.springBeans
  compile externalDependency.springBootAutoconfigure
  compile externalDependency.springCore
  compile externalDependency.springKafka

  annotationProcessor externalDependency.lombok

  runtime externalDependency.logbackClassic
}
KafkaConfig.java (new file, 42 lines)

package com.linkedin.metadata.examples.configs;

import io.confluent.kafka.serializers.AbstractKafkaAvroSerDeConfig;
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import java.util.Arrays;
import java.util.Map;
import org.apache.avro.generic.IndexedRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.autoconfigure.kafka.KafkaProperties;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;


@Configuration
public class KafkaConfig {
  @Value("${KAFKA_BOOTSTRAP_SERVER:localhost:29092}")
  private String kafkaBootstrapServers;

  @Value("${KAFKA_SCHEMAREGISTRY_URL:http://localhost:8081}")
  private String kafkaSchemaRegistryUrl;

  @Bean(name = "kafkaEventProducer")
  public Producer<String, IndexedRecord> kafkaListenerContainerFactory(KafkaProperties properties) {
    KafkaProperties.Producer producerProps = properties.getProducer();

    producerProps.setKeySerializer(StringSerializer.class);
    producerProps.setValueSerializer(KafkaAvroSerializer.class);

    // KAFKA_BOOTSTRAP_SERVER has precedence over SPRING_KAFKA_BOOTSTRAP_SERVERS
    if (kafkaBootstrapServers != null && kafkaBootstrapServers.length() > 0) {
      producerProps.setBootstrapServers(Arrays.asList(kafkaBootstrapServers.split(",")));
    } // else we rely on KafkaProperties which defaults to localhost:9092

    Map<String, Object> props = properties.buildProducerProperties();
    props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, kafkaSchemaRegistryUrl);

    return new KafkaProducer<>(props);
  }
}
SchemaRegistryConfig.java (new file, 19 lines)

package com.linkedin.metadata.examples.configs;

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;


@Configuration
public class SchemaRegistryConfig {
  @Value("${SCHEMAREGISTRY_URL:http://localhost:8081}")
  private String schemaRegistryUrl;

  @Bean(name = "schemaRegistryClient")
  public SchemaRegistryClient schemaRegistryFactory() {
    return new CachedSchemaRegistryClient(schemaRegistryUrl, 512);
  }
}
ZooKeeperConfig.java (new file, 26 lines)

package com.linkedin.metadata.examples.configs;

import java.io.IOException;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;


@Configuration
public class ZooKeeperConfig {
  @Value("${ZOOKEEPER:localhost:2181}")
  private String zookeeper;

  @Value("${ZOOKEEPER_TIMEOUT_MILLIS:3000}")
  private int timeoutMillis;

  @Bean(name = "zooKeeper")
  public ZooKeeper zooKeeperFactory() throws IOException {
    Watcher noopWatcher = event -> {
    };

    return new ZooKeeper(zookeeper, timeoutMillis, noopWatcher);
  }
}
metadata-ingestion-examples/kafka-etl/README.md (new file, 40 lines)

# Kafka ETL

A small application which reads existing Kafka topics from ZooKeeper, retrieves their schemas from the schema registry,
and then fires an MCE for each schema.

## Running the Application

First, ensure that the services this application depends on (schema registry, ZooKeeper, mce-consumer-job, GMS, etc.)
are all running.

This application can be run via Gradle:

```
./gradlew :metadata-ingestion-examples:kafka-etl:bootRun
```

Or by building and running the jar:

```
./gradlew :metadata-ingestion-examples:kafka-etl:build
java -jar metadata-ingestion-examples/kafka-etl/build/libs/kafka-etl.jar
```

### Environment Variables

See the files under `src/main/java/com/linkedin/metadata/examples/kafka/config` for a list of customizable Spring
environment variables.

### Common pitfalls

For events to be fired correctly, schemas must exist in the schema registry. If a topic was newly created but no schema
has been registered for it yet, this application will fail to retrieve the schema for that topic. Check the output of
the application to see if this happens. If you see a message like

```
io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Subject not found.; error code: 40401
```

then the odds are good that you need to register the schema for that topic.
metadata-ingestion-examples/kafka-etl/build.gradle (new file, 29 lines)

plugins {
  id 'org.springframework.boot'
  id 'java'
}

dependencies {
  compile project(':metadata-utils')
  compile project(':metadata-builders')
  compile project(':metadata-dao-impl:kafka-producer')
  compile project(':metadata-events:mxe-schemas')
  compile project(':metadata-ingestion-examples:common')

  compile externalDependency.javaxInject
  compile externalDependency.kafkaAvroSerde
  compile externalDependency.kafkaSerializers
  compile externalDependency.lombok
  compile externalDependency.springBeans
  compile externalDependency.springBootAutoconfigure
  compile externalDependency.springCore
  compile externalDependency.springKafka

  annotationProcessor externalDependency.lombok

  runtime externalDependency.logbackClassic
}

bootJar {
  mainClassName = 'com.linkedin.metadata.examples.kafka.KafkaEtlApplication'
}
KafkaEtl.java (new file, 115 lines)

package com.linkedin.metadata.examples.kafka;

import com.linkedin.common.AuditStamp;
import com.linkedin.common.FabricType;
import com.linkedin.common.urn.CorpuserUrn;
import com.linkedin.common.urn.DataPlatformUrn;
import com.linkedin.common.urn.DatasetUrn;
import com.linkedin.metadata.aspect.DatasetAspect;
import com.linkedin.metadata.dao.producer.KafkaMetadataEventProducer;
import com.linkedin.metadata.snapshot.DatasetSnapshot;
import com.linkedin.mxe.MetadataChangeEvent;
import com.linkedin.schema.KafkaSchema;
import com.linkedin.schema.SchemaField;
import com.linkedin.schema.SchemaFieldArray;
import com.linkedin.schema.SchemaFieldDataType;
import com.linkedin.schema.SchemaMetadata;
import com.linkedin.schema.StringType;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;
import java.util.List;
import javax.inject.Inject;
import javax.inject.Named;
import lombok.extern.slf4j.Slf4j;
import org.apache.avro.generic.IndexedRecord;
import org.apache.kafka.clients.producer.Producer;
import org.apache.zookeeper.ZooKeeper;
import org.springframework.boot.CommandLineRunner;
import org.springframework.stereotype.Component;


/**
 * Gathers Kafka topics from the local zookeeper instance and schemas from the schema registry, and then fires
 * MetadataChangeEvents for their schemas.
 *
 * <p>This should cause DataHub to be populated with this information, assuming it and the mce-consumer-job are running
 * locally.
 *
 * <p>Can be run with {@code ./gradlew :metadata-ingestion-examples:java:kafka-etl:bootRun}.
 */
@Slf4j
@Component
public final class KafkaEtl implements CommandLineRunner {
  private static final DataPlatformUrn KAFKA_URN = new DataPlatformUrn("kafka");

  @Inject
  @Named("kafkaEventProducer")
  private Producer<String, IndexedRecord> _producer;

  @Inject
  @Named("zooKeeper")
  private ZooKeeper _zooKeeper;

  @Inject
  @Named("schemaRegistryClient")
  private SchemaRegistryClient _schemaRegistryClient;

  private SchemaMetadata buildDatasetSchema(String datasetName, String schema, int schemaVersion) {
    final AuditStamp auditStamp = new AuditStamp();
    auditStamp.setTime(System.currentTimeMillis());
    auditStamp.setActor(new CorpuserUrn(System.getenv("USER")));
    final SchemaMetadata.PlatformSchema platformSchema = new SchemaMetadata.PlatformSchema();
    platformSchema.setKafkaSchema(new KafkaSchema().setDocumentSchema(schema));
    return new SchemaMetadata().setSchemaName(datasetName)
        .setPlatform(KAFKA_URN)
        .setCreated(auditStamp)
        .setLastModified(auditStamp)
        .setVersion(schemaVersion)
        .setHash("")
        .setPlatformSchema(platformSchema)
        .setFields(new SchemaFieldArray(new SchemaField().setFieldPath("")
            .setDescription("")
            .setNativeDataType("string")
            .setType(new SchemaFieldDataType().setType(SchemaFieldDataType.Type.create(new StringType())))));
  }

  private void produceKafkaDatasetMce(SchemaMetadata schemaMetadata) {
    MetadataChangeEvent.class.getClassLoader().getResource("avro/com/linkedin/mxe/MetadataChangeEvent.avsc");

    // Kafka topics are considered datasets in the current DataHub metadata ecosystem.
    final KafkaMetadataEventProducer<DatasetSnapshot, DatasetAspect, DatasetUrn> eventProducer =
        new KafkaMetadataEventProducer<>(DatasetSnapshot.class, DatasetAspect.class, _producer);
    eventProducer.produceSnapshotBasedMetadataChangeEvent(
        new DatasetUrn(KAFKA_URN, schemaMetadata.getSchemaName(), FabricType.PROD), schemaMetadata);
    _producer.flush();
  }

  @Override
  public void run(String... args) throws Exception {
    log.info("Starting up");

    final List<String> topics = _zooKeeper.getChildren("/brokers/topics", false);
    for (String datasetName : topics) {
      if (datasetName.startsWith("_")) {
        continue;
      }

      final String topic = datasetName + "-value";
      io.confluent.kafka.schemaregistry.client.SchemaMetadata schemaMetadata;
      try {
        schemaMetadata = _schemaRegistryClient.getLatestSchemaMetadata(topic);
      } catch (Throwable t) {
        log.error("Failed to get schema for topic " + datasetName, t);
        log.error("Common failure: does this event schema exist in the schema registry?");
        continue;
      }

      if (schemaMetadata == null) {
        log.warn(String.format("Skipping topic without schema: %s", topic));
        continue;
      }
      log.trace(topic);

      produceKafkaDatasetMce(buildDatasetSchema(datasetName, schemaMetadata.getSchema(), schemaMetadata.getVersion()));
      log.info("Successfully fired MCE for " + datasetName);
    }
  }
}
KafkaEtlApplication.java (new file, 16 lines)

package com.linkedin.metadata.examples.kafka;

import org.springframework.boot.WebApplicationType;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.autoconfigure.elasticsearch.rest.RestClientAutoConfiguration;
import org.springframework.boot.builder.SpringApplicationBuilder;


@SuppressWarnings("checkstyle:HideUtilityClassConstructor")
@SpringBootApplication(exclude = {RestClientAutoConfiguration.class}, scanBasePackages = {
    "com.linkedin.metadata.examples.configs", "com.linkedin.metadata.examples.kafka"})
public class KafkaEtlApplication {
  public static void main(String[] args) {
    new SpringApplicationBuilder(KafkaEtlApplication.class).web(WebApplicationType.NONE).run(args);
  }
}
Logback configuration for kafka-etl (new file, 40 lines)

<configuration>
  <property name="LOG_DIR" value="${LOG_DIR:- /tmp/datahub/logs}"/>

  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>

  <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>${LOG_DIR}/kafka-etl-java.log</file>
    <append>true</append>
    <encoder>
      <pattern>%d{HH:mm:ss} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
    <rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
      <FileNamePattern>${LOG_DIR}/kafka-etl.%i.log</FileNamePattern>
      <minIndex>1</minIndex>
      <maxIndex>3</maxIndex>
    </rollingPolicy>
    <triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
      <MaxFileSize>100MB</MaxFileSize>
    </triggeringPolicy>
  </appender>

  <logger name="org.apache.kafka.clients" level="warn" additivity="false">
    <appender-ref ref="STDOUT" />
    <appender-ref ref="FILE"/>
  </logger>

  <logger name="com.linkedin.metadata.examples.kafka" level="info" additivity="false">
    <appender-ref ref="STDOUT" />
    <appender-ref ref="FILE"/>
  </logger>

  <root level="warn">
    <appender-ref ref="STDOUT" />
    <appender-ref ref="FILE"/>
  </root>
</configuration>
@@ -90,22 +90,6 @@ The ldap_etl provides you ETL channel to communicate with your LDAP server.
 ```
 This will bootstrap DataHub with your metadata in the LDAP server as an user entity.
 
-## Ingest metadata from Kafka to DataHub
-The kafka_etl provides you ETL channel to communicate with your kafka.
-```
-➜ Config your kafka environmental variable in the file.
-ZOOKEEPER # Your zookeeper host.
-
-➜ Config your Kafka broker environmental variable in the file.
-AVROLOADPATH # Your model event in avro format.
-KAFKATOPIC # Your event topic.
-BOOTSTRAP # Kafka bootstrap server.
-SCHEMAREGISTRY # Kafka schema registry host.
-
-➜ python kafka_etl.py
-```
-This will bootstrap DataHub with your metadata in the kafka as a dataset entity.
-
 ## Ingest metadata from MySQL to DataHub
 The mysql_etl provides you ETL channel to communicate with your MySQL.
 ```
@@ -18,6 +18,8 @@ include 'metadata-events:mxe-avro-1.7'
 include 'metadata-events:mxe-registration'
 include 'metadata-events:mxe-schemas'
 include 'metadata-events:mxe-utils-avro-1.7'
+include 'metadata-ingestion-examples:common'
+include 'metadata-ingestion-examples:kafka-etl'
 include 'metadata-jobs:mae-consumer-job'
 include 'metadata-jobs:mce-consumer-job'
 include 'metadata-models'