mirror of
https://github.com/open-metadata/OpenMetadata.git
synced 2025-10-24 15:25:10 +00:00

* Update the partial references for 1.7.x-SNAPSHOT * Delete image folder for 1.4 and add images for 1.7 * Update the new connector stages
318 lines
17 KiB
Markdown
318 lines
17 KiB
Markdown
---
|
|
title: High Level Design
|
|
slug: /main-concepts/high-level-design
|
|
---
|
|
|
|
# High Level Design
|
|
|
|
This Solution Design document will help us explore and understand the internals of OpenMetadata services, how are they built and
|
|
their interactions.
|
|
|
|
We will start by describing the big picture of the software design of the application. Bit by bit we will get inside
|
|
specific components, describing their behaviour and showing examples on how to use them.
|
|
|
|
## System Context
|
|
|
|
The goal of this first section is to get familiar with the high-level concepts and technologies involved. The learning objectives here are:
|
|
|
|
- Describe the elements that compose OpenMetadata and their relationships.
|
|
- How end-users and external applications can communicate with the system.
|
|
|
|
Here we have the main actors of the solution:
|
|
|
|
{% image
|
|
src="/images/v1.7/main-concepts/high-level-design/system-context.png"
|
|
alt="system-context" /%}
|
|
|
|
|
|
|
|
- **API**: This is the main pillar of OpenMetadata. Here we have defined how we can interact with the metadata Entities.
|
|
It powers all the other components of the solution.
|
|
- **UI**: Discovery-focused tool that helps users keep track of all the data assets in the organisation. Its goal is
|
|
enabling and fueling collaboration.
|
|
- **Ingestion Framework**: Based on the API specifications, this system is the foundation of all the Connectors, i.e., the
|
|
components that define the interaction between OpenMetadata and external systems containing the metadata we want to integrate.
|
|
- **Entity Store**: MySQL storage that contains real-time information on the state of all the Entities and their Relationships.
|
|
- **Search Engine**: Powered by ElasticSearch, it is the indexing system for the UI to help users discover the metadata.
|
|
|
|
## JSON Schemas
|
|
|
|
If we abstract away from the Storage Layer for a moment, we then realize that the OpenMetadata implementation is the
|
|
integration of three blocks:
|
|
|
|
- The core **API**, unifying and centralising the communication with internal and external systems.
|
|
- The **UI** for a team-centric metadata Serving Layer.
|
|
- The **Ingestion Framework** as an Interface between OpenMetadata and external sources.
|
|
|
|
The only thing these components have in common is the **vocabulary** -> All of them are shaping, describing, and moving
|
|
around metadata Entities.
|
|
|
|
OpenMetadata is based on a **standard definition** for metadata. Therefore, we need to make sure that in our implementation
|
|
of this standard we share this definition in the end-to-end workflow. To this end, the main lexicon is defined as JSON Schemas,
|
|
a readable and language-agnostic solution.
|
|
|
|
Then, when packaging the main components, we generate the specific programming classes for all the Entities.
|
|
What we achieve is three views from the same source:
|
|
|
|
- Java Classes for the API,
|
|
- Python Classes for the Ingestion Framework and
|
|
- TypeScript Types for the UI,
|
|
|
|
each of them modeled after a single source of truth. Thanks to this approach we can be sure that it does not matter at
|
|
which point we zoom in throughout the whole process, we are always going to find a univocal well-defined Entity.
|
|
|
|
## API Container Diagram
|
|
|
|
Now we are going to zoom inside the API Container. As the central Software System of the solution, its goal is to manage
|
|
calls (both from internal and external sources, e.g., Ingestion Framework or any custom integration) and update the
|
|
state of the metadata Entities.
|
|
|
|
While the data is stored in the MySQL container, the API will be the one fetching it and completing the necessary
|
|
information, validating the Entities data and all the relationships.
|
|
|
|
Having a Serving Layer (API) decoupled from the Storage Layer allows users and integrations to ask for what they need
|
|
in a simple language (REST), without the learning curve of diving into specific data models and design choices.
|
|
|
|
|
|
{% image
|
|
src="/images/v1.7/main-concepts/high-level-design/api-container-diagram.png"
|
|
alt="api-container-diagram" /%}
|
|
|
|
|
|
## Entity Resource
|
|
|
|
When we interact with most of our Entities, we follow the same endpoint structure. For example:
|
|
|
|
- `GET <url>/api/v1/<collectionName>/<id>` to retrieve an Entity instance by ID, or
|
|
- `GET <url>/api/v1/<collectionName>/name/<FQN>` to query by its fully qualified domain name.
|
|
|
|
Similarly, we support other CRUD operations, each of them expecting a specific incoming data structure, and returning
|
|
the Entity's class. As the foundations of OpenMetadata are the Entities definitions, we have this data contract with
|
|
any consumer, where the backend will validate the received data, as well as the outputs.
|
|
|
|
The endpoint definition and datatype setting are what happens at the Entity Resource. Each metadata Entity is packed
|
|
with a Resource class, which builds the API definition for the given Entity.
|
|
|
|
This logic is what then surfaces in the [API docs](/swagger.html).
|
|
|
|
## Entity Repository
|
|
|
|
The goal of the Entity Repository is to perform Read & Write operations to the **backend database** to Create, Retrieve,
|
|
Update and Delete Entities.
|
|
|
|
While the Entity Resource handles external communication, the Repository is in charge of managing how the whole
|
|
process interacts with the Storage Layer, making sure that incoming and outgoing Entities are valid and hold proper
|
|
and complete information.
|
|
|
|
This means that here is where we define our **DAO** (Data Access Object), with all the validation and data storage logic.
|
|
|
|
As there are processes repeated across all Entities (e.g., listing entities in a collection or getting a specific
|
|
version from an Entity), the Entity Repository extends an **Interface** that implements some basic functionalities and
|
|
abstracts Entity specific logic.
|
|
|
|
Each Entity then needs to implement its **server-side processes** such as building the FQN based on the Entity hierarchy,
|
|
how the Entity stores and retrieves **Relationship** information with other Entities or how the Entity reacts to **Change Events**.
|
|
|
|
## Entity Storage Layer
|
|
|
|
In the API Container Diagram, we showed how the Entity Repository interacts with three different Storage Containers
|
|
(tables) depending on what type of information is being processed.
|
|
|
|
To fully understand this decision, we should first talk about the information contained by Entities instances.
|
|
|
|
An Entity has two types of fields: **attributes** (JSON Schema properties) and **relationships** (JSON Schema href):
|
|
|
|
- **Attributes** are the core properties of the Entity: the name and id, the columns for a table, or the algorithm
|
|
for an ML Model. Those are intrinsic pieces of information of an Entity and their existence and values are what
|
|
help us differentiate both Entity instances (Table A vs. Table B) and Entity definitions (Dashboard vs. Topic).
|
|
- **Relationships** are associations between two Entities. For example, a Table belongs to a Database, a User owns a
|
|
Dashboard, etc. Relationships are a special type of attribute that is captured using Entity References.
|
|
|
|
## Entity and Relationship Store
|
|
|
|
Entities are stored as JSON documents in the database. Each entity has an associated table (`<entityName>_entity`) which
|
|
contains the JSON defining the Entity attributes and other metadata fields, such as the id, `updatedAt` or `updatedBy`.
|
|
|
|
This JSON does not store any Relationship. E.g., a User owning a Dashboard is a piece of information that is materialised
|
|
in a separate table entity_relationship as graph nodes, where the edge holds the type of the Relationship (e.g., `contains`,
|
|
`uses`, `follows`...).
|
|
|
|
This separation helps us decouple concerns. We can process related entities independently and validate at runtime what
|
|
information needs to be updated and/or retrieved. For example, if we delete a Dashboard being owned by a User, we will then
|
|
clean up this row in `entity_relationship`, but that won't alter the information from the User.
|
|
|
|
Another trickier example would be trying to delete a Database that contains Tables. In this case, the process would check
|
|
that the Database Entity is not empty, and therefore we cannot continue with the removal.
|
|
|
|
## Change Events Store
|
|
|
|
You might have already noticed that in all Entities definitions we have a `changeDescription` field. It is defined as
|
|
"Change that leads to this version of the entity". If we inspect further the properties of `changeDescription`, we can
|
|
see how it stores the differences between the current and last versions of an Entity.
|
|
|
|
This results in giving visibility on the last update step of each Entity instance. However, there might be times when
|
|
this level of tracking is not enough.
|
|
|
|
One of the greatest features of OpenMetadata is the ability to track all Entity versions. Each operation that leads
|
|
to a change (`PUT`, `POST`, `PATCH`) will generate a trace that is going to be stored in the table `change_event`.
|
|
|
|
Using the API to get events data, or directly exploring the different versions of each entity gives great debugging
|
|
power to both data consumers and producers.
|
|
|
|
## API Component Diagram
|
|
|
|
Now that we have a clear picture of the main pieces and their roles, we will analyze the logical flow of a `POST` and a
|
|
`PUT` calls to the API. The main goal of this section is to get familiar with the code organisation and its main steps.
|
|
|
|
{% note %}
|
|
|
|
To take the most out of this section, it is recommended to follow the source code as well, from the Entity JSON you'd
|
|
like to use as an example to its implementation of Resource and Repository.
|
|
|
|
{% /note %}
|
|
|
|
### Create a new Entity - POST
|
|
|
|
We will start with the simplest scenario: Creating a new Entity via a `POST` call. This is a great first point to review
|
|
as part of the logic and methods are reused during updates.
|
|
|
|
{% image
|
|
src="/images/v1.7/main-concepts/high-level-design/create-new-entity.png"
|
|
alt="create-new-entity" /%}
|
|
|
|
|
|
#### Create
|
|
|
|
As we already know, the recipient of the HTTP call will be the `EntityResource`. In there, we have the create function
|
|
with the @POST annotation and the description of the API endpoint and expected schemas.
|
|
|
|
The role of this first component is to receive the call and validate the request body and headers, but the real
|
|
implementation happens in the `EntityRepository`, which we already described as the **DAO**. For the `POST` operation, the
|
|
internal flow is rather simple and is composed of two steps:
|
|
|
|
- **Prepare**: Which validates the Entity data and computes some attributes at the server-side.
|
|
- **Store**: This saves the Entity JSON and its Relationships to the backend DB.
|
|
|
|
#### Prepare
|
|
|
|
This method is used for validating an entity to be created during `POST`, `PUT`, and `PATCH` operations and preparing the
|
|
entity with all the required attributes and relationships.
|
|
|
|
Here we handle, for example, the process of setting up the FQN of an Entity based on its hierarchy. While all Entities
|
|
require an FQN, this is not an attribute we expect to receive in a request.
|
|
|
|
Moreover, this checks that the received attributes are being correctly informed, e.g., we have a valid `User` as an `owner`
|
|
or a valid `Database` for a `Table`.
|
|
|
|
#### Store
|
|
|
|
The storing process is divided into two different steps (as we have two tables holding the information).
|
|
|
|
We strip the validated Entity from any `href` attribute (such as `owner` or `tags`) in order to just store a JSON document
|
|
with the Entity intrinsic values.
|
|
|
|
We then store the graph representation of the Relationships for the attributes omitted above.
|
|
|
|
At the end of these calls, we end up with a validated Entity holding all the required attributes,
|
|
which have been validated and stored accordingly. We can then return the created Entity to the caller.
|
|
|
|
### Create or Update an Entity - PUT
|
|
|
|
Let's now build on top of what we learned during the `POST` discussion, expanding the example to a `PUT` request handling.
|
|
|
|
{% image
|
|
src="/images/v1.7/main-concepts/high-level-design/create-or-update.png"
|
|
alt="create-update-entity" /%}
|
|
|
|
|
|
The first steps are fairly similar:
|
|
|
|
1. We have a function in our `Resource` annotated as `@PUT` and handling headers, auth and schemas.
|
|
2. The `Resource` then calls the DAO at the Repository, bootstrapping the data-related logic.
|
|
3. We validate the Entity and cook some attributes during the prepare step.
|
|
|
|
After processing and validating the Entity request, we then check if the Entity instance has already been stored,
|
|
querying the backend database by its FQN. If it has not, then we proceed with the same logic as the `POST`
|
|
operation -> simple creation. Otherwise, we need to validate the updated fields.
|
|
|
|
#### Set Fields
|
|
|
|
We cannot allow all fields to be updated for a given Entity instance. For example, the `id` or `name` stay immutable once
|
|
the instance is created, and the same thing happens to the `Database` of a `Table`.
|
|
|
|
The list of specified fields that can change is defined at each Entity's Repository, and we should only allow changes
|
|
on those attributes that can naturally evolve throughout the lifecycle of the object.
|
|
|
|
At this step, we set the fields to the Entity that are either required by the JSON schema definition (e.g.,
|
|
the algorithm for an `MlModel`) or, in the case of a `GET` operation, that are requested as
|
|
`GET <url>/api/v1/<collectionName>/<id>?fields=field1,field2...`
|
|
|
|
#### Update
|
|
|
|
In the `EntityRepository` there is an abstract implementation of the `EntityUpdater` interface, which is in charge of
|
|
defining the generic update logic flow common for all the Entities.
|
|
|
|
The main steps handled in the update calls are:
|
|
|
|
**1.** Update the Entity **generic** fields, such as the description or the owner.
|
|
**2.** Run Entity **specific** updates, which are implemented by each Entity's `EntityUpdater` extension.
|
|
**3.** **Store** the updated Entity JSON doc to the Entity Table in MySQL.
|
|
|
|
#### Entity Specific Updates
|
|
|
|
Each Entity has a set of attributes that define it. These attributes are going to have a very specific behaviour,
|
|
so the implementation of the `update` logic falls to each Entity Repository.
|
|
|
|
For example, we can update the `Columns` of a `Table`, or the `Dashboard` holding the performance metrics of an `MlModel`.
|
|
Both of these changes are going to be treated differently, in terms of how the Entity performs internally the update,
|
|
how the Entity version gets affected, or the impact on the **Relationship** data.
|
|
|
|
For the sake of discussion, we'll follow a couple of update scenarios.
|
|
|
|
#### Example 1 - Updating Columns of a Table
|
|
|
|
When updating `Columns`, we need to compare the existing set of columns in the original Entity vs. the incoming columns
|
|
of the `PUT` request.
|
|
|
|
If we are receiving an existing column, we might need to update its description or tags. This change will be
|
|
considered a minor change. Therefore, the version of the Entity will be bumped by 0.1, following the software
|
|
release specification model.
|
|
|
|
However, what happens if a stored column is not received in the updated instance? That would mean that such a column
|
|
has been deleted. This is a type of change that could possibly break integrations on top of the `Table`'s data.
|
|
Therefore, we can mark this scenario as a major update. In this case, the version of the Entity will increase by `1.0`.
|
|
|
|
Checking the Change Events or visiting the Entity history will easily show us the evolution of an Entity instance,
|
|
which will be immensely valuable when debugging data issues.
|
|
|
|
#### Example 2 - Updating the Dashboard of an ML Model
|
|
|
|
One of the attributes for an MlModel is the `EntityReference` to a `Dashboard` holding its performance metrics evolution.
|
|
|
|
As this attribute is a reference to another existing Entity, this data is not directly stored in the `MlModel` JSON doc,
|
|
but rather as a Relationship graph, as we have been discussing previously. Therefore, during the update step we will need to:
|
|
|
|
**1.** Insert the relationship, if the original Entity had no `Dashboard` informed,
|
|
**2.** Delete the relationship if the `Dashboard` has been removed, or
|
|
**3.** Update the relationship if we now point to a different `Dashboard`.
|
|
|
|
Note how during the `POST` operation we needed to always call the `storeRelationship` function, as it was the first
|
|
time we were storing the instance's information. During an update, we will just modify the Relationship data if the
|
|
Entity's specific attributes require it.
|
|
|
|
## Handling Events
|
|
|
|
During all these discussions and examples we've been showing how the backend API handles HTTP requests and what the
|
|
Entities' data lifecycle is. Not only we've been focusing on the JSON docs and **Relationships**, but from time to time we
|
|
have talked about Change Events.
|
|
|
|
Moreover, In the API Container Diagram we drew a Container representing the `Table` holding the Change Event data,
|
|
but yet, we have not found any Component accessing it.
|
|
|
|
This is because the API server is powered by Jetty, which means that luckily we do not need to make those calls ourselves!
|
|
By defining a `ChangeEventHandler` and registering it during the creation of the server, this postprocessing of the calls
|
|
happens transparently.
|
|
|
|
Our `ChangeEventHandler` will check if the Entity has been `Created`, `Updated` or `Deleted` and will store the appropriate
|
|
`ChangeEvent` data from our response to the backend DB.
|