MCE Producer/Consumer CLI
The mce_cli.py script provides a convenient way to produce a list of MCEs from a data file.
Each MCE in the data file must be on a single line. The script also supports consuming from
the MetadataChangeEvent topic.
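For reference, each line of the data file holds one complete MCE record written as a Python literal (see the sample records further below). Below is a minimal sketch of parsing such a file, assuming that literal format; the file name bootstrap_mce.dat is taken from the produce command later in this document.

```python
import ast

# Hypothetical local copy of the sample data file; each line is one MCE record.
DATA_FILE = "bootstrap_mce.dat"

with open(DATA_FILE) as f:
    for i, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            continue
        # The sample records use Python literal syntax (None, True, tuples),
        # so ast.literal_eval safely turns each line into a dict.
        mce = ast.literal_eval(line)
        print(f"MCE{i}: {mce}")
```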
Tested & confirmed platforms:
- Red Hat Enterprise Linux Workstation release 7.6 (Maipo) w/Python 3.6.8
- MacOS 10.15.5 (19F101) Darwin 19.5.0 w/Python 3.7.3
➜  python mce_cli.py --help
usage: mce_cli.py [-h] [-b BOOTSTRAP_SERVERS] [-s SCHEMA_REGISTRY]
                  [-d DATA_FILE] [-l SCHEMA_RECORD]
                  {produce,consume}

Client for producing/consuming MetadataChangeEvent

positional arguments:
  {produce,consume}     Execution mode (produce | consume)

optional arguments:
  -h, --help            show this help message and exit
  -b BOOTSTRAP_SERVERS  Kafka broker(s) (localhost[:port])
  -s SCHEMA_REGISTRY    Schema Registry (http(s)://localhost[:port])
  -l SCHEMA_RECORD      Avro schema record; required if running 'producer' mode
  -d DATA_FILE          MCE data file; required if running 'producer' mode
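The option set above corresponds to a small argparse parser. The sketch below is only an illustration of that interface, not the actual mce_cli.py source; the default broker and Schema Registry addresses are assumptions.

```python
import argparse

# Illustrative parser matching the help text above (not the real mce_cli.py).
parser = argparse.ArgumentParser(
    description="Client for producing/consuming MetadataChangeEvent")
parser.add_argument("mode", choices=["produce", "consume"],
                    help="Execution mode (produce | consume)")
parser.add_argument("-b", dest="bootstrap_servers", default="localhost:9092",
                    help="Kafka broker(s) (localhost[:port])")
parser.add_argument("-s", dest="schema_registry", default="http://localhost:8081",
                    help="Schema Registry (http(s)://localhost[:port])")
parser.add_argument("-l", dest="schema_record",
                    help="Avro schema record; required if running 'produce' mode")
parser.add_argument("-d", dest="data_file",
                    help="MCE data file; required if running 'produce' mode")
args = parser.parse_args()
```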
Bootstrapping DataHub
- Ensure DataHub is running and you have run ./gradlew :metadata-events:mxe-schemas:build (required to generate event definitions).
- [Optional] Open a new terminal to consume the events:
  ➜  python3 contrib/metadata-ingestion/python/mce-cli/mce_cli.py consume -l metadata-events/mxe-schemas/src/renamed/avro/com/linkedin/mxe/MetadataChangeEvent.avsc
- Run the mce-cli to quickly ingest the sample data and see DataHub in action:
  ➜  python3 contrib/metadata-ingestion/python/mce-cli/mce_cli.py produce -l metadata-events/mxe-schemas/src/renamed/avro/com/linkedin/mxe/MetadataChangeEvent.avsc -d contrib/metadata-ingestion/python/mce-cli/bootstrap_mce.dat
Producing MetadataChangeEvent records to topic MetadataChangeEvent. ^c to exit.
MCE1: {"auditHeader": None, "proposedSnapshot": ("com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot", {"urn": "urn:li:corpuser:foo", "aspects": [{"active": True,"email": "foo@linkedin.com"}]}), "proposedDelta": None}
MCE2: {"auditHeader": None, "proposedSnapshot": ("com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot", {"urn": "urn:li:corpuser:bar", "aspects": [{"active": False,"email": "bar@linkedin.com"}]}), "proposedDelta": None}
Flushing records...
This will bootstrap DataHub with sample datasets and sample users.
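Under the hood, produce mode amounts to loading the Avro schema, parsing each line of the data file, and publishing the records to the MetadataChangeEvent topic through a Schema Registry-aware producer. The sketch below illustrates that flow for a single sample record using confluent-kafka's legacy AvroProducer; the library choice and the broker/Schema Registry addresses are assumptions, not values taken from mce_cli.py.

```python
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

# Schema path matches the commands above; broker and Schema Registry
# addresses are assumed local defaults.
SCHEMA_FILE = ("metadata-events/mxe-schemas/src/renamed/avro/"
               "com/linkedin/mxe/MetadataChangeEvent.avsc")

producer = AvroProducer(
    {
        "bootstrap.servers": "localhost:9092",
        "schema.registry.url": "http://localhost:8081",
    },
    default_value_schema=avro.load(SCHEMA_FILE),
)

# One of the sample records from the output above; note the tuple notation
# selecting the CorpUserSnapshot branch of the proposedSnapshot union.
record = {
    "auditHeader": None,
    "proposedSnapshot": (
        "com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot",
        {"urn": "urn:li:corpuser:foo",
         "aspects": [{"active": True, "email": "foo@linkedin.com"}]},
    ),
    "proposedDelta": None,
}

producer.produce(topic="MetadataChangeEvent", value=record)
producer.flush()
```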
Note
There is a known issue with the Python Avro serialization library that can lead to unexpected results when serializing a union of types. Always use the tuple notation, as in the sample records above, to avoid these difficult-to-debug issues.
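Concretely, the tuple notation pairs the fully-qualified type name with the value, making the chosen union branch explicit:

```python
# Explicit union branch: (fully-qualified type name, value).
# A bare dict would leave the serializer to guess which snapshot type is
# meant, which is the source of the hard-to-debug issues noted above.
proposed_snapshot = (
    "com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot",
    {"urn": "urn:li:corpuser:foo",
     "aspects": [{"active": True, "email": "foo@linkedin.com"}]},
)
```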