The `datahub-protobuf` module is designed to be used with the Java Emitter. Its input is a compiled protobuf binary file (`*.protoc`) and, optionally, the corresponding `*.proto` source code. You can also supply the root message in cases where a single protobuf source file includes multiple non-nested messages.
The raw protobuf source can be included as well, along with information that allows additional references to GitHub and Slack in the source code comments to be parsed.
To extract even more metadata from the protobuf schema, you can extend `FieldOptions` and `MessageOptions` to annotate messages and fields with arbitrary information. This information can then be emitted as DataHub primary key information, tags, glossary terms, or properties on the dataset.
*Note*: Extending FieldOptions and MessageOptions does not change the messages themselves. The metadata is not included in messages being sent over the wire.
In order to use the annotations above, create a proto file called `meta.proto`. Feel free to customize the kinds of metadata and how it is emitted to DataHub for your use cases.
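As a minimal sketch of what such a `meta.proto` could contain: the `meta` package and the `fld` and `msg` wrapper messages follow from the `(meta.fld.type)` and `(meta.msg.*)` references used in the examples in this document, and `TAG`, `TAG_LIST` and `TERM` appear there as well. The enum name `DataHubMetadataType`, the remaining enum values, the field numbers, and the `team_name` option are assumptions made for illustration, so check them against the version of the module you use.

```protobuf
syntax = "proto3";
package meta;

import "google/protobuf/descriptor.proto";

// Describes how an annotated option should be represented in DataHub.
// The enum name and numbering are assumptions for this sketch.
enum DataHubMetadataType {
  PROPERTY = 0;  // dataset property (assumed value name)
  TAG = 1;       // single tag
  TAG_LIST = 2;  // comma delimited string of tags
  TERM = 3;      // glossary term
  OWNER = 4;     // owner (assumed value name)
  DOMAIN = 5;    // domain (assumed value name)
}

message fld {
  extend google.protobuf.FieldOptions {
    // Referenced as (meta.fld.type) when declaring an annotation.
    // The field number is an assumption.
    repeated DataHubMetadataType type = 6000;
  }
}

message msg {
  extend google.protobuf.MessageOptions {
    // Hypothetical annotation, intended to be emitted as a dataset property.
    string team_name = 5001 [(meta.fld.type) = PROPERTY];
  }
}
```

Other schema files would then import `meta.proto` and set the annotation on a message, for example `option(meta.msg.team_name) = "my-team";`.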
Repeated values are collected and stored as a serialized JSON array. For example, a repeated option set to `a`, `b`, and `c` results in the value `["a","b","c"]`.
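A sketch of such a repeated property is shown here. It assumes that `meta.proto` defines a `PROPERTY` metadata type; the option name `repeat_string` is hypothetical.

```protobuf
message msg {
  extend google.protobuf.MessageOptions {
    // Hypothetical repeated option, emitted as a property.
    repeated string repeat_string = 5000 [(meta.fld.type) = PROPERTY];
  }
}
message Message {
  // Each assignment appends one element to the repeated option.
  option(meta.msg.repeat_string) = "a";
  option(meta.msg.repeat_string) = "b";
  option(meta.msg.repeat_string) = "c";
}
```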
The tag list assumes a string that contains the comma delimited values of the tags. In the example below, tags would be added as `a`, `b`, `c`.
```protobuf
message msg {
  extend google.protobuf.MessageOptions {
    string tags = 5000 [(meta.fld.type) = TAG_LIST];
  }
}
message Message {
  option(meta.msg.tags) = "a, b, c";
}
```
Tags could also be represented as separate boolean options. Only the `true` options result in tags. In this example, a single tag of `tagA` would be added to the dataset.
```protobuf
message msg {
  extend google.protobuf.MessageOptions {
    bool tagA = 5000 [(meta.fld.type) = TAG];
    bool tagB = 5001 [(meta.fld.type) = TAG];
  }
}
message Message {
  option(meta.msg.tagA) = true;
  option(meta.msg.tagB) = false;
}
```
Alternatively, tags can be separated into different fields with the option name as a dot delimited prefix. The following would produce two tags with values of `tagA.a` and `tagB.a`.
```protobuf
message msg {
  extend google.protobuf.MessageOptions {
    string tagA = 5000 [(meta.fld.type) = TAG];
    string tagB = 5001 [(meta.fld.type) = TAG];
  }
}
message Message {
  option(meta.msg.tagA) = "a";
  option(meta.msg.tagB) = "a";
}
```
The dot delimited prefix also works with enum types, where the prefix is the enum type name. In this example a single tag, `MetaEnumExample.ENTITY`, is created.
```protobuf
enum MetaEnumExample {
  UNKNOWN = 0;
  ENTITY = 1;
  EVENT = 2;
}
message msg {
  extend google.protobuf.MessageOptions {
    MetaEnumExample tag = 5000 [(meta.fld.type) = TAG];
  }
}
message Message {
  option(meta.msg.tag) = ENTITY;
}
```
Terms are specified by either a fully qualified string value or an enum where the enum type's name is the first element in the fully qualified term name.
The following example shows both methods, either of which would result in the term `Classification.HighlyConfidential` being applied.
```protobuf
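enum Classification {
  HighlyConfidential = 0;
  Confidential = 1;
  Sensitive = 2;
}
message msg {
  extend google.protobuf.MessageOptions {
    Classification term = 5000 [(meta.fld.type) = TERM];
    string class = 5001 [(meta.fld.type) = TERM];
  }
}
message Message {
  option(meta.msg.term) = HighlyConfidential;
  option(meta.msg.class) = "Classification.HighlyConfidential";
}
```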
One or more owners can be specified, in any combination of `corpUser` and `corpGroup` entities. The default entity type is `corpGroup`. By default, the ownership type is set to `technical_owner`; see the second example for setting the ownership type.
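The two sketches here illustrate the owner annotation. They assume that `meta.proto` defines an `OWNER` metadata type and that an owner value takes the form `corpUser:<name>` or `corpGroup:<name>`, with an unprefixed value falling back to the default `corpGroup`. The option names and owner names are hypothetical.

```protobuf
message msg {
  extend google.protobuf.MessageOptions {
    repeated string owner = 5000 [(meta.fld.type) = OWNER];
  }
}
message Message {
  option(meta.msg.owner) = "corpUser:username";    // corpUser entity
  option(meta.msg.owner) = "corpGroup:groupname";  // corpGroup entity
  option(meta.msg.owner) = "othergroup";           // no prefix, assumed to default to corpGroup
}
```

For the second example, this sketch further assumes that the ownership type is taken from the option name:

```protobuf
message msg {
  extend google.protobuf.MessageOptions {
    // Option names are assumed to select the ownership type.
    repeated string technical_owner = 5000 [(meta.fld.type) = OWNER];
    repeated string data_steward = 5001 [(meta.fld.type) = OWNER];
  }
}
message Message {
  option(meta.msg.technical_owner) = "corpUser:username";
  option(meta.msg.data_steward) = "corpGroup:groupname";
}
```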
Set the domain id for the dataset. The domain should already exist. Note that the value is the domain's *id*; if no id was specified when the domain was created, it is likely a random string.
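A sketch of a domain annotation is shown here. It assumes that `meta.proto` defines a `DOMAIN` metadata type; the option name `domain`, the field number, and the id `engineering` are hypothetical.

```protobuf
message msg {
  extend google.protobuf.MessageOptions {
    string domain = 5004 [(meta.fld.type) = DOMAIN];
  }
}
message Message {
  // The value is the id of an existing domain, not its display name.
  option(meta.msg.domain) = "engineering";
}
```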
Follow the specific instructions for your build system to declare a dependency on the appropriate version of the package.
**_Note_**: Check the [Maven repository](https://mvnrepository.com/artifact/io.acryl/datahub-protobuf) for the latest version of the package before following the instructions below.
An example application, **Proto2DataHub**, is included as part of this project.
You can also set up a standalone project that works with the `protobuf-gradle-plugin`; see the standalone [example project](../datahub-protobuf-example).
```
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
✅ Successfully emitted 90 events for 5 files to DataHub REST
```
You can also route results to a file by using the `--transport file --filename events.json` options.
##### Important Flags
Here are a few important flags to use with this command:
- `--env`: Defaults to `DEV`. Switch to `PROD` once you have ironed out any issues with running this command.
- `--platform`: Defaults to Kafka (as most people use protobuf schema repos with Kafka), but you can provide a custom platform name (e.g. `schema_repo` or `<company_name>_schemas`). If you use a custom platform, make sure to provision it on your DataHub instance with a logo etc. to get a native experience.
- `--subtype`: Gives your entities a more descriptive category than Dataset in the UI. Defaults to `schema`, but you might find `topic`, `event`, or `message` more descriptive.
## Example Application (separate project)
The standalone [example project](../datahub-protobuf-example) shows how to create an independent project that uses this module as part of a build task.