DataHub Lite is a lightweight embeddable version of DataHub with no external dependencies. It is intended to enable local developer tooling use-cases such as simple access to metadata for scripts and other tools.
DataHub Lite is compatible with the DataHub metadata format and all the ingestion connectors that DataHub supports.
Currently DataHub Lite uses DuckDB under the covers as its default storage layer, but that might change in the future.
DataHub Lite is NOT meant to be a replacement for the production Java DataHub server ([datahub-gms](./architecture/metadata-serving.md)). It does not offer the full set of APIs that the DataHub GMS server does.
To use `datahub lite` commands, you need to install [`acryl-datahub`](https://pypi.org/project/acryl-datahub/) > 0.9.6 ([install instructions](./cli.md#using-pip)) and the `datahub-lite` plugin.
To ingest metadata into DataHub Lite, all you have to do is change the `sink:` in your recipe to be a `datahub-lite` instance. See the detailed sink docs [here](../metadata-ingestion/sink_docs/datahub.md#datahub-lite-experimental).
For example, here is a simple recipe file that ingests MySQL metadata into DataHub Lite:
```yaml
# mysql.in.dhub.yaml
source:
  type: mysql
  config:
    host_port: localhost:3306
    username: datahub
    password: datahub
sink:
  type: datahub-lite
```
By default, `lite` will create a local instance under `~/.datahub/lite/`.
Ingesting metadata into DataHub Lite is as simple as running ingestion:
```shell
datahub ingest -c mysql.in.dhub.yaml
```
:::note
DataHub Lite currently doesn't support stateful ingestion, so you'll have to turn off stateful ingestion in your recipe to use it (see the snippet below). This will be fixed shortly.
:::
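For instance, starting from the recipe above, disabling stateful ingestion is a matter of adding a `stateful_ingestion` block under the source's `config`:

```yaml
source:
  type: mysql
  config:
    host_port: localhost:3306
    username: datahub
    password: datahub
    # turn off stateful ingestion until DataHub Lite supports it
    stateful_ingestion:
      enabled: false
sink:
  type: datahub-lite
```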
### Forwarding to a central DataHub GMS over REST or Kafka
DataHub Lite can be configured to forward all writes to a central DataHub GMS using either the REST API or the Kafka API.
To configure forwarding, add a **forward_to** section to your DataHub Lite config that conforms to a valid DataHub Sink configuration. Here is an example:
```yaml
# mysql.in.dhub.yaml with forwarding to datahub-gms over REST
source:
  type: mysql
  config:
    host_port: localhost:3306
    username: datahub
    password: datahub
sink:
  type: datahub-lite
  config:
    forward_to:
      type: datahub-rest
      config:
        server: http://localhost:8080 # replace with the address of your DataHub GMS server
```

Forwarding is currently best-effort, so metadata can be lost if the remote server is down. For a reliable sync mechanism, look at the [exporting metadata](#export-metadata-export) section to generate a standard metadata file.
### Import (import)

As a convenient shortcut, you can import metadata from any standard DataHub metadata JSON file (e.g. one generated via a file sink) by issuing a _datahub lite import_ command.
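A sketch of an import, assuming `import` takes a `--file` argument symmetrical to the `export` command shown later in this guide (check `datahub lite import --help` for the exact flags):

```shell
> datahub lite import --file metadata_events.json
```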
### List (ls)

Listing works like navigating a directory structure that is customized based on the kind of system being explored. DataHub's metadata is automatically organized into databases, tables, views, dashboards, charts, etc.
Using the `ls` command below is much more pleasant when you have tab completion enabled on your shell. Check out the [Setting up Tab Completion](#tab-completion) section at the bottom of the guide.
```shell
# Listing a leaf entity functions just like the unix ls command
> datahub lite ls /databases/mysql/instances/default/databases/datahub/tables/metadata_aspect_v2
metadata_aspect_v2
```
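Listing a non-leaf path enumerates its children, just like `ls` on a directory (the output below is illustrative, assuming the MySQL metadata ingested earlier):

```shell
> datahub lite ls /databases/mysql/instances/default/databases/datahub/tables
metadata_aspect_v2
```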
### Read (get)
Once you have located a path of interest, you can read metadata at that entity, by issuing a **get**. You can additionally filter the metadata retrieved from an entity by the aspect type of the metadata (e.g. to request the schema, filter by the **schemaMetadata** aspect).
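For example, here is a sketch that reads the schema of the table located above (assuming the `--aspect` flag takes the aspect name):

```shell
> datahub lite get /databases/mysql/instances/default/databases/datahub/tables/metadata_aspect_v2 --aspect schemaMetadata
```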
Aside: if you are curious about all the entity and aspect types DataHub supports, check out the metadata model of entities like [Dataset](./generated/metamodel/entities/dataset.md), [Dashboard](./generated/metamodel/entities/dashboard.md), etc.
The general template for the get responses looks like:
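(A sketch of the shape; each aspect appears under its own name, and the exact layout may vary by version.)

```
{
  "urn": <urn of the entity>,
  <aspect name>: {
      <aspect value as JSON>
  }
}
```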
Using the `get` command by path is much more pleasant when you have tab completion enabled on your shell. Check out the [Setting up Tab Completion](#tab-completion) section at the bottom of the guide.
DataHub Lite preserves every version of metadata ingested, just like DataHub GMS. You can also query the metadata as of a specific point in time by adding the _--asof_ parameter to your _get_ command.
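For example (a sketch; the accepted timestamp formats are an assumption, so check `datahub lite get --help`):

```shell
> datahub lite get /databases/mysql/instances/default/databases/datahub/tables/metadata_aspect_v2 --aspect status --asof "2022-12-02 06:37:00"
```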
### Search (search)

DataHub Lite also allows you to search within the metadata using the `datahub lite search` command.
You can provide a free-form query like "customer", and DataHub Lite will attempt to find entities that match it, either in the id of the entity or within the name fields of the entities' aspects.
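A free-form query looks like this (a minimal sketch of the invocation):

```shell
> datahub lite search customer
```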
For example, suppose you want to find all entities whose _datasetProperties_ aspect includes a _view_definition_ in its _customProperties_ sub-field.
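A sketch of such a command, assuming a `--flavor exact` flag selects exact-match queries over aspect fields (check `datahub lite search --help` for the exact syntax):

```shell
> datahub lite search --flavor exact "customProperties.view_definition"
```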
Search returns results that include the _id_ of the matching entity, along with the _aspect_ name and the content of the aspect in the _snippet_ field. If you just want the _id_ of the entity, use the _--no-details_ flag.
### Server (serve)

DataHub Lite can also be run as a lightweight HTTP server that exposes the metadata over a few read-only APIs:

- entities: list all entity ids and get metadata for an entity
- browse: traverse the entity hierarchy in a path-based way
- search: execute searches against the metadata
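A sketch of starting the server, assuming the subcommand is named after this section (check `datahub lite --help` for the exact command and its bind options):

```shell
> datahub lite serve
```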
#### Server Configuration
Configuration for the server is picked up from the standard location for the **datahub** CLI (**~/.datahubenv**) and can be created using **datahub lite init**.
Here is a sample config file with the **lite** section filled out:
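(A minimal sketch; the field shape is inferred from the `datahub lite init` output shown later in this guide, so verify against your own **~/.datahubenv**.)

```yaml
# ~/.datahubenv
lite:
  type: duckdb
  config:
    file: /Users/<username>/.datahub/lite/datahub.duckdb
    options: {}
```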
### Export Metadata (export)

The _export_ command allows you to export the contents of DataHub Lite into a metadata events file that you can then send to another DataHub instance (e.g. over REST).
```shell
> datahub lite export --file datahub_lite_export.json
Successfully exported 1785 events to datahub_lite_export.json
```
### Clear (nuke)
If you want to clear your DataHub Lite instance, you can just issue the `nuke` command.
```shell
> datahub lite nuke
DataHub Lite destroyed at <path>
```
### Use a different file (init)
By default, DataHub Lite will create and use a local duckdb instance located at `~/.datahub/lite/datahub.duckdb`.
If you want to use a different location, you can configure it using the `datahub lite init` command.
```shell
> datahub lite init --type duckdb --file my_local_datahub.duckdb
Will replace datahub lite config type='duckdb' config={'file': '/Users/<username>/.datahub/lite/datahub.duckdb', 'options': {}} with type='duckdb' config={'file': 'my_local_datahub.duckdb', 'options': {}} [y/N]: y
DataHub Lite inited at my_local_datahub.duckdb
```
### Reindex
DataHub Lite maintains a few derived tables to make access possible via both the native id (urn) as well as the logical path of the entity. The `reindex` command recomputes these indexes.
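Running it is a one-liner (a sketch; check `datahub lite reindex --help` for any options):

```shell
> datahub lite reindex
```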
## Caveats

DataHub Lite is a very new project. Do not use it for production use-cases. The APIs and storage formats are subject to change as we gather feedback from early adopters. That said, we are very interested in accepting PRs and suggestions for improvements to this fledgling project.
## Advanced Options
### Tab Completion
Using the datahub lite commands like `ls` or `get` is much more pleasant when you have tab completion enabled on your shell. Tab completion is supported on the command line through the [Click Shell completion](https://click.palletsprojects.com/en/8.1.x/shell-completion/) module.
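For example, for zsh the standard Click setup looks like this (Click derives the `_DATAHUB_COMPLETE` variable name from the `datahub` executable; use `bash_source` on bash):

```shell
# add to ~/.zshrc: generate and evaluate the completion script on every shell start
eval "$(_DATAHUB_COMPLETE=zsh_source datahub)"
```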
Using eval means that the command is invoked and evaluated every time a shell is started, which can delay shell responsiveness. To speed it up, write the generated script to a file, then source that.
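A sketch of that approach for zsh (the file path is arbitrary):

```shell
# generate the completion script once
_DATAHUB_COMPLETE=zsh_source datahub > ~/.datahub-complete.zsh

# add to ~/.zshrc: source the pre-generated script instead of invoking datahub
source ~/.datahub-complete.zsh
```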