DataHub Lite is a lightweight embeddable version of DataHub with no external dependencies. It is intended to enable local developer tooling use-cases such as simple access to metadata for scripts and other tools.
DataHub Lite is compatible with the DataHub metadata format and all the ingestion connectors that DataHub supports.
Currently DataHub Lite uses DuckDB under the covers as its default storage layer, but that might change in the future.
## Features
- Designed for the CLI
- Available as a Python library or REST API
- Ingest metadata from all DataHub ingestion sources
- Metadata Reads
  - navigate metadata using a hierarchy
  - get metadata for an entity
  - search / query metadata across all entities
- Forward metadata automatically to a central GMS or Kafka instance
DataHub Lite is NOT meant to be a replacement for the production Java DataHub server ([datahub-gms](./architecture/metadata-serving.md)). It does not offer the full set of APIs that the DataHub GMS server does.
The following features are **NOT** supported:
- Full-text search with ranking and relevance features
- Graph traversal of relationships (e.g. lineage)
- Metadata change stream over Kafka (only forwarding of writes is supported)
To use `datahub lite` commands, you need to install [`acryl-datahub`](https://pypi.org/project/acryl-datahub/) > 0.9.6 ([install instructions](./cli.md#using-pip)) and the `datahub-lite` plugin.
To ingest metadata into DataHub Lite, all you have to do is change the `sink:` in your recipe to be a `datahub-lite` instance. See the detailed sink docs [here](../metadata-ingestion/sink_docs/datahub.md#datahub-lite-experimental).
For example, here is a simple recipe file that ingests mysql metadata into datahub-lite.
```yaml
# mysql.in.dhub.yaml
source:
  type: mysql
  config:
    host_port: localhost:3306
    username: datahub
    password: datahub

sink:
  type: datahub-lite
```
By default, `lite` will create a local instance under `~/.datahub/lite/`.
Ingesting metadata into DataHub Lite is as simple as running ingestion:
`datahub ingest -c mysql.in.dhub.yaml`
:::note
DataHub Lite currently doesn't support stateful ingestion, so you'll have to turn off stateful ingestion in your recipe to use it. This will be fixed shortly.
:::
### Forwarding to a central DataHub GMS over REST or Kafka
DataHub Lite can be configured to forward all writes to a central DataHub GMS using either the REST API or the Kafka API.
To configure forwarding, add a **forward_to** section to your DataHub Lite config that conforms to a valid DataHub Sink configuration. Here is an example:
```yaml
# mysql.in.dhub.yaml with forwarding to datahub-gms over REST
```

:::note
Forwarding is currently best-effort, so metadata can be lost if the remote server is down. For a reliable sync mechanism, see the [exporting metadata](#export-metadata-export) section to generate a standard metadata file.
:::
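As a rough sketch of the shape such a recipe might take, the Python below builds the MySQL recipe from earlier as a dictionary and nests a `forward_to` block under the sink's `config`. The `datahub-rest` sink type and its `server` key are standard DataHub sink settings, but the exact placement of `forward_to` and the server URL are assumptions made for illustration; check the sink docs before relying on them.

```python
import json

# Base recipe mirroring the MySQL example above.
recipe = {
    "source": {
        "type": "mysql",
        "config": {
            "host_port": "localhost:3306",
            "username": "datahub",
            "password": "datahub",
        },
    },
    "sink": {
        "type": "datahub-lite",
        "config": {
            # Assumption: forward_to holds an ordinary sink definition.
            "forward_to": {
                "type": "datahub-rest",
                # Placeholder URL; replace with your own GMS endpoint.
                "config": {"server": "http://localhost:8080"},
            }
        },
    },
}

# Print the recipe so its nesting can be compared with a YAML recipe file.
print(json.dumps(recipe, indent=2))
```

The nesting is the point here: the forwarding target is described the same way as any other sink, only one level down inside the `datahub-lite` sink config.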
### Importing from a file
As a convenient shortcut, you can import metadata from any standard DataHub metadata JSON file (e.g. one generated using a file sink) by issuing a *datahub lite import* command.
Listing functions like a directory structure that is customized based on the kind of system being explored. DataHub's metadata is automatically organized into databases, tables, views, dashboards, charts, etc.
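To make the directory analogy concrete, here is a minimal sketch of listing the children of a path in a flat set of entity paths. The `ls` helper and the sample paths are hypothetical and only illustrate the idea; they are not DataHub Lite's implementation or its actual path layout.

```python
def ls(paths, prefix="/"):
    """Return the sorted, distinct child names directly under prefix."""
    base = [part for part in prefix.strip("/").split("/") if part]
    children = set()
    for path in paths:
        parts = path.strip("/").split("/")
        # Keep paths that extend the prefix by at least one segment.
        if len(parts) > len(base) and parts[:len(base)] == base:
            children.add(parts[len(base)])
    return sorted(children)


# Hypothetical entity paths, invented for this example.
paths = [
    "/databases/mysql/datahub/tables/metadata_aspect_v2",
    "/databases/mysql/datahub/tables/metadata_index",
    "/dashboards/looker/sales_overview",
]

print(ls(paths))                                     # top-level kinds
print(ls(paths, "/databases/mysql/datahub/tables"))  # tables in one database
```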
Using the `ls` command below is much more pleasant when you have tab completion enabled on your shell. Check out the [Setting up Tab Completion](#tab-completion) section at the bottom of the guide.
Once you have located a path of interest, you can read the metadata for that entity by issuing a **get**. You can additionally filter the metadata retrieved from an entity by aspect type (e.g. to request the schema, filter by the **schemaMetadata** aspect).
Aside: If you are curious about which types of entities and aspects DataHub supports, check out the metadata model of entities like [Dataset](./generated/metamodel/entities/dataset.md), [Dashboard](./generated/metamodel/entities/dashboard.md), etc.
The general template for the get responses looks like:
```
{
  "urn": <urn_of_the_entity>,
  <aspect_name>: {
    "value": <aspect_value>, # aspect value as a dictionary
    "systemMetadata": <system_metadata> # present if details are requested
  }
}
```
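A script that consumes this output could split it into the urn and a map of aspect values. The sketch below assumes the response is valid JSON shaped like the template above; the sample payload and the `split_response` helper are hypothetical.

```python
import json

# Hypothetical response text, shaped like the template above.
raw = """
{
  "urn": "urn:li:dataset:(urn:li:dataPlatform:mysql,datahub.metadata_aspect_v2,PROD)",
  "status": {
    "value": {"removed": false},
    "systemMetadata": {"runId": "example-run"}
  }
}
"""


def split_response(text):
    """Return (urn, {aspect_name: aspect_value}) from a get response."""
    doc = json.loads(text)
    urn = doc.pop("urn")
    # Every remaining top-level key is an aspect name.
    aspects = {name: body["value"] for name, body in doc.items()}
    return urn, aspects


urn, aspects = split_response(raw)
print(urn)
print(aspects["status"])
```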
Here is what executing a *get* command looks like:
Using the `get` command by path is much more pleasant when you have tab completion enabled on your shell. Check out the [Setting up Tab Completion](#tab-completion) section at the bottom of the guide.
DataHub Lite preserves every version of metadata ingested, just like DataHub GMS. You can also query the metadata as of a specific point in time by adding the *--asof* parameter to your *get* command.
```shell
> datahub lite get "urn:li:dataset:(urn:li:dataPlatform:mysql,datahub.metadata_aspect_v2,PROD)" --aspect status --asof 2020-01-01
null
> datahub lite get "urn:li:dataset:(urn:li:dataPlatform:mysql,datahub.metadata_aspect_v2,PROD)" --aspect status --asof 2023-01-16
```
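The as-of behaviour can be pictured as picking the newest stored version whose timestamp is not later than the requested time, and returning nothing when no version existed yet. The following is a minimal sketch of that idea with a hypothetical `get_as_of` helper and invented timestamps; it is not how DataHub Lite stores or queries versions.

```python
from bisect import bisect_right
from datetime import datetime


def get_as_of(history, asof):
    """Return the latest value at or before asof, or None if none exists.

    history is a list of (timestamp, value) pairs sorted by timestamp.
    """
    times = [timestamp for timestamp, _ in history]
    idx = bisect_right(times, asof)
    return history[idx - 1][1] if idx else None


# Hypothetical version history for a status aspect.
history = [
    (datetime(2023, 1, 10), {"removed": False}),
    (datetime(2023, 1, 20), {"removed": True}),
]

print(get_as_of(history, datetime(2020, 1, 1)))   # before any version: None
print(get_as_of(history, datetime(2023, 1, 16)))  # first version is current
```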
DataHub Lite also allows you to search the metadata using the `datahub lite search` command.
You can provide a free-form search query such as "customer", and DataHub Lite will attempt to find entities that match the name customer, either in the id of the entity or within the name fields of aspects in the entities.
You can also query the metadata precisely using DuckDB's [JSON](https://duckdb.org/docs/extensions/json.html) extract functions.
Writing these functions requires that you understand the DataHub metadata model and how the data is laid out in DataHub Lite.
For example, to find all entities whose *datasetProperties* aspect includes the *view_definition* in its *customProperties* sub-field, we can issue the following command:
```shell
> datahub lite search --aspect datasetProperties --flavor exact "metadata -> '$.customProperties' ->> '$.view_definition' IS NOT NULL"
```
Search will return results that include the *id* of the entity that matched along with the *aspect* and the content of the aspect as part of the *snippet* field. If you just want the *id* of the entity to be returned, use the *--no-details* flag.
```shell
> datahub lite search --aspect datasetProperties --flavor exact "metadata -> '$.customProperties' ->> '$.view_definition' IS NOT NULL" --no-details
```
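For readers less familiar with JSON extract syntax, the predicate in the query above can be read as: take the aspect's `customProperties` object and keep the entity if it has a non-null `view_definition`. Here is a rough Python equivalent over already-parsed aspects; the rows and the `has_view_definition` helper are hypothetical.

```python
# Hypothetical rows: an entity id plus its parsed datasetProperties aspect.
rows = [
    {
        "id": "urn:li:dataset:(urn:li:dataPlatform:mysql,datahub.example_view,PROD)",
        "metadata": {"customProperties": {"view_definition": "SELECT 1"}},
    },
    {
        "id": "urn:li:dataset:(urn:li:dataPlatform:mysql,datahub.metadata_aspect_v2,PROD)",
        "metadata": {"customProperties": {}},
    },
]


def has_view_definition(metadata):
    """Mirror the predicate: customProperties.view_definition IS NOT NULL."""
    custom = metadata.get("customProperties") or {}
    return custom.get("view_definition") is not None


# Collect only the ids, similar in spirit to what --no-details returns.
matching_ids = [row["id"] for row in rows if has_view_definition(row["metadata"])]
print(matching_ids)
```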
DataHub Lite can be run as a lightweight HTTP server, exposing an OpenAPI spec over FastAPI.
```shell
> datahub lite serve
INFO: Started server process [33364]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8979 (Press CTRL+C to quit)
```
OpenAPI docs are available via your browser at the same port: http://localhost:8979
The server exposes endpoints similar to the **lite** CLI commands over HTTP:
- entities: list all entity ids and get metadata for an entity
- browse: traverse the entity hierarchy in a path based way
- search: execute search against the metadata
#### Server Configuration
Configuration for the server is picked up from the standard location for the **datahub** CLI, **~/.datahubenv**, and can be created using **datahub lite init**.
Here is a sample config file with the **lite** section filled out:
### Export Metadata (export)
The *export* command allows you to export the contents of DataHub Lite into a metadata events file that you can then send to another DataHub instance (e.g. over REST).
```shell
> datahub lite export --file datahub_lite_export.json
Successfully exported 1785 events to datahub_lite_export.json
```
### Clear (nuke)
If you want to clear your DataHub Lite instance, you can just issue the `nuke` command.
```shell
> datahub lite nuke
DataHub Lite destroyed at <path>
```
### Use a different file (init)
By default, DataHub Lite will create and use a local DuckDB instance located at `~/.datahub/lite/datahub.duckdb`.
If you want to use a different location, you can configure it using the `datahub lite init` command.
```shell
> datahub lite init --type duckdb --file my_local_datahub.duckdb
Will replace datahub lite config type='duckdb' config={'file': '/Users/<username>/.datahub/lite/datahub.duckdb', 'options': {}} with type='duckdb' config={'file': 'my_local_datahub.duckdb', 'options': {}} [y/N]: y
DataHub Lite inited at my_local_datahub.duckdb
```
### Reindex
DataHub Lite maintains a few derived tables to make access possible via both the native id (urn) as well as the logical path of the entity. The `reindex` command recomputes these indexes.
DataHub Lite is a very new project. Do not use it for production use-cases. The APIs and storage formats are subject to change as we get feedback from early adopters. That said, we are really interested in accepting PRs and suggestions for improvements to this fledgling project.
## Advanced Options
### Tab Completion
Using the datahub lite commands like `ls` or `get` is much more pleasant when you have tab completion enabled on your shell. Tab completion is supported on the command line through the [Click Shell completion](https://click.palletsprojects.com/en/8.1.x/shell-completion/) module.
Using eval means that the command is invoked and evaluated every time a shell is started, which can delay shell responsiveness. To speed it up, write the generated script to a file, then source that.