The `datahub-protobuf` module is designed to be used with the Java Emitter. Its input is a compiled protobuf binary (a `*.protoc` file) and, optionally, the corresponding `*.proto` source code. You can also supply the root message in cases where a single protobuf source file includes multiple non-nested messages.
Additionally, the raw protobuf source can be included, along with information that allows references to GitHub and Slack in the source code comments to be parsed.
To extract even more metadata from the protobuf schema, we can extend `FieldOptions` and `MessageOptions` so that messages and fields can be annotated with arbitrary information. This information can then be emitted as DataHub primary key information, tags, glossary terms, or properties on the dataset.
*Note*: Extending FieldOptions and MessageOptions does not change the messages themselves. The metadata is not included in messages being sent over the wire.
In order to use the annotations above, create a proto file called `meta.proto`. Feel free to customize the kinds of metadata and how it is emitted to DataHub for your use cases.
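A minimal sketch of what `meta.proto` might contain follows. The `meta` package, the `fld` message, and the `type` option are inferred from the `(meta.fld.type)` references used in the examples in this document, where `TAG`, `TAG_LIST`, and `TERM` also appear. The enum name `DataHubMetadataType`, the remaining enum values, and all field numbers are assumptions made for illustration.

```protobuf
syntax = "proto3";
package meta;

import "google/protobuf/descriptor.proto";

// Describes how an annotated option should be emitted to DataHub.
// The enum name and the numeric values are assumptions for this sketch.
enum DataHubMetadataType {
  PROPERTY = 0; // custom property on the dataset
  TAG = 1;      // single tag
  TAG_LIST = 2; // tags from a comma-delimited string
  TERM = 3;     // glossary term
  OWNER = 4;    // dataset owner
  DOMAIN = 5;   // dataset domain
}

message fld {
  extend google.protobuf.FieldOptions {
    // Referenced as (meta.fld.type) when declaring other options.
    // The field number 6000 is an assumption.
    repeated DataHubMetadataType type = 6000;
  }
}
```

Declaring `type` as `repeated` is a design choice in this sketch: it would let a single annotation be exported in more than one form, for example as both a property and a tag.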
Repeated values are collected and stored as a serialized JSON array. For example, the repeated values `a`, `b`, and `c` would be stored as `["a","b","c"]`.
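A sketch of a repeated option that would produce such an array is shown here. It assumes a `PROPERTY` value exists in the metadata type enum of your `meta.proto`; the option name `a_list` and the field number are hypothetical.

```protobuf
message msg {
  extend google.protobuf.MessageOptions {
    // PROPERTY is assumed to be defined in meta.proto; a_list is a hypothetical name.
    repeated string a_list = 5000 [(meta.fld.type) = PROPERTY];
  }
}

message Message {
  // Each option statement appends one value to the repeated option.
  option(meta.msg.a_list) = "a";
  option(meta.msg.a_list) = "b";
  option(meta.msg.a_list) = "c";
}
```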
A tag list is a single string containing comma-delimited tag values. In the example below, the tags `a`, `b`, and `c` would be added.
```protobuf
message msg {
  extend google.protobuf.MessageOptions {
    string tags = 5000 [(meta.fld.type) = TAG_LIST];
  }
}

message Message {
  option(meta.msg.tags) = "a, b, c";
}
```
Tags can also be represented as separate boolean options. Only options set to `true` result in tags. In this example, a single tag, `tagA`, would be added to the dataset.
```protobuf
message msg {
  extend google.protobuf.MessageOptions {
    bool tagA = 5000 [(meta.fld.type) = TAG];
    bool tagB = 5001 [(meta.fld.type) = TAG];
  }
}

message Message {
  option(meta.msg.tagA) = true;
  option(meta.msg.tagB) = false;
}
```
Alternatively, tags can be separated into different fields, with the option name used as a dot-delimited prefix. The following would produce two tags with the values `tagA.a` and `tagB.a`.
```protobuf
message msg {
  extend google.protobuf.MessageOptions {
    string tagA = 5000 [(meta.fld.type) = TAG];
    string tagB = 5001 [(meta.fld.type) = TAG];
  }
}

message Message {
  option(meta.msg.tagA) = "a";
  option(meta.msg.tagB) = "a";
}
```
The dot-delimited prefix also works with enum types, where the prefix is the enum type name. In this example a single tag is created, `MetaEnumExample.ENTITY`.
```protobuf
enum MetaEnumExample {
  UNKNOWN = 0;
  ENTITY = 1;
  EVENT = 2;
}

message msg {
  extend google.protobuf.MessageOptions {
    MetaEnumExample tag = 5000 [(meta.fld.type) = TAG];
  }
}

message Message {
  option(meta.msg.tag) = ENTITY;
}
```
##### TERM
Terms are specified by either a fully qualified string value or an enum where the enum type's name is the first element in the fully qualified term name.
The following example shows both methods, either of which would result in the term `Classification.HighlyConfidential` being applied.
```protobuf
enum Classification {
  HighlyConfidential = 0;
  Confidential = 1;
  Sensitive = 2;
}

message msg {
  extend google.protobuf.MessageOptions {
    Classification term = 5000 [(meta.fld.type) = TERM];
  }
}

message Message {
  option(meta.msg.term) = HighlyConfidential;
}
```
One or more owners can be specified, in any combination of `corpUser` and `corpGroup` entities. The default entity type is `corpGroup`. By default, the ownership type is set to `producer`; see the second example for setting the ownership type.
The following example assigns ownership to a group called `myGroup` and a user called `myName`.
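A sketch of how such an ownership annotation might look is given here. It assumes an `OWNER` value in the metadata type enum of your `meta.proto` and a `corpUser:` prefix for selecting the user entity type. Both are assumptions for illustration, as are the option name `owner` and the field number.

```protobuf
message msg {
  extend google.protobuf.MessageOptions {
    // OWNER is assumed to be defined in meta.proto.
    repeated string owner = 5000 [(meta.fld.type) = OWNER];
  }
}

message Message {
  // No prefix: the default entity type, corpGroup, applies.
  option(meta.msg.owner) = "myGroup";
  // The "corpUser:" prefix is an assumed convention for marking a user.
  option(meta.msg.owner) = "corpUser:myName";
}
```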
Sets the domain id for the dataset. The domain should already exist. Note that the value is the domain's *id*; if no id was specified when the domain was created, it is likely a random string.
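As a sketch, a domain annotation could be declared as follows, assuming a `DOMAIN` value in the metadata type enum of your `meta.proto`. The option name `domain`, the field number, and the id `engineering` are hypothetical.

```protobuf
message msg {
  extend google.protobuf.MessageOptions {
    // DOMAIN is assumed to be defined in meta.proto.
    string domain = 5000 [(meta.fld.type) = DOMAIN];
  }
}

message Message {
  // "engineering" stands in for the id of an existing domain, not its display name.
  option(meta.msg.domain) = "engineering";
}
```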