docs: move rfcs to separate repo (#6621)
* moving RFCs
* ignoring rfc/
* moving RFC
* updating RFC policy page
* removing unused filter pattern
@@ -9,7 +9,6 @@ This module contains a React application that serves as the DataHub UI.

 Feel free to take a look around, deploy, and contribute.

-For details about the motivation please see [this RFC](../docs/rfc/active/2055-react-app/README.md).

 ## Functional Goals

 The initial milestone for the app was to achieve functional parity with the previous Ember app. This meant supporting
@@ -121,7 +121,6 @@ function list_markdown_files(): string[] {
   /^metadata-ingestion\/docs\/sources\//, // these are used to generate docs, so we don't want to consider them here
   /^metadata-ingestion-examples\//,
   /^docker\/(?!README|datahub-upgrade|airflow\/local_airflow)/, // Drop all but a few docker docs.
-  /^docs\/rfc\/templates\/000-template\.md$/,
   /^docs\/docker\/README\.md/, // This one is just a pointer to another file.
   /^docs\/README\.md/, // This one is just a pointer to the hosted docs site.
   /^SECURITY\.md$/,
@@ -479,9 +479,6 @@ module.exports = {
   "docs/CONTRIBUTING",
   "docs/links",
   "docs/rfc",
-  {
-    RFCs: list_ids_in_directory("docs/rfc/active"),
-  },
 ],

 "Release History": ["releases"],
docs/rfc.md
@@ -23,7 +23,7 @@ for more visibility.

 - *Landed*: when an RFC's proposed changes are shipped in an actual release.
 - *Rejected*: when an RFC PR is closed without being merged.

-[Pending RFC List](https://github.com/datahub-project/datahub/pulls?q=is%3Apr+is%3Aopen+label%3Arfc+)
+[Pending RFC List](https://github.com/datahub-project/rfcs/pulls?q=is%3Apr+is%3Aopen)

 ## When to follow this process
@@ -56,8 +56,8 @@ pull request with the specific implementation design. We also highly recommend s

 In short, to get a major feature added to DataHub, one must first get the RFC merged into the RFC repo as a markdown
 file. At that point the RFC is 'active' and may be implemented with the goal of eventual inclusion into DataHub.

-- Fork the DataHub repository.
-- Copy the `000-template.md` template file to `docs/rfc/active/000-my-feature.md`, where `my-feature` is more
+- Fork the [datahub-project/rfcs repository](https://github.com/datahub-project/rfcs).
+- Copy the `000-template.md` template file to `rfc/active/000-my-feature.md`, where `my-feature` is more
   descriptive. Don't assign an RFC number yet.
 - Fill in the RFC. Put care into the details. *RFCs that do not present convincing motivation, demonstrate understanding
   of the impact of the design, or are disingenuous about the drawbacks or alternatives tend to be poorly-received.*
@@ -108,8 +108,8 @@ already working on it, feel free to ask (e.g. by leaving a comment on the associ

 ## Implemented RFCs

 Once an RFC has finally been implemented, first off, congratulations! And thank you for your contribution! Second, to
-help track the status of the RFC, please make one final PR to move the RFC from `docs/rfc/active` to
-`docs/rfc/finished`.
+help track the status of the RFC, please make one final PR to move the RFC from `rfc/active` to
+`rfc/finished`.

 ## Reviewing RFCs
@@ -1,121 +0,0 @@
- Start Date: 2020-08-03
- RFC PR: https://github.com/datahub-project/datahub/pull/1778
- Implementation PR(s): https://github.com/datahub-project/datahub/pull/1775

# Dashboards

## Summary

Adding support for dashboards (and charts) metadata cataloging and enabling search & discovery for them.
The design should accommodate the different dashboarding tools ([Looker](https://looker.com), [Redash](https://redash.io/)) used within a company.

## Motivation

Dashboards are a key piece of a company's data ecosystem. They are used by different groups of employees across different organizations.
They provide a way to visualize data assets (tracking datasets or metrics) by allowing slicing and dicing of the input data source.
As a company scales, data assets, including dashboards, get richer and bigger. Therefore, it's important to be able to find and access the right dashboard.

## Goals

By having dashboards as a top-level entity in DataHub, we achieve the following goals:

- Enabling search & discovery for dashboard assets by using dashboard metadata
- Linking dashboards to underlying data sources to get a more complete picture of data lineage

## Non-goals

DataHub will only serve as a catalog for dashboards, where users search for dashboards by using keywords.
The entity page for a dashboard might contain links to direct users to view the dashboard after finding it.
However, DataHub will not try to show the actual dashboard or any charts within it. This is not desired and shouldn't be allowed because:

- Dashboards or charts within a dashboard might have different ACLs that prevent users without the necessary permission from viewing the dashboard.
  Generally, the source of truth for these ACLs is the dashboarding tool.
- Underlying data sources might have some ACLs too. Again, the source of truth for these ACLs is the specific data platform.

## Detailed design



As shown in the above diagram, dashboards are composed of a collection of charts at a very high level. These charts
could be shared by different dashboards. In the example sketched above, `Chart_1`, `Chart_2` and `Chart_3` are part of
`Dashboard_A`, and `Chart_3` and `Chart_4` are part of `Dashboard_B`.

### Entities
There will be 2 top-level GMA [entities](../../../what/entity.md) in the design: dashboards and charts.
It's important to make charts a top-level entity because charts could be shared between different dashboards.
We'll need to build `Contains` relationships between Dashboard and Chart entities.

### URN Representation
We'll define two [URNs](../../../what/urn.md): `DashboardUrn` and `ChartUrn`.
These URNs should allow for unique identification of dashboards and charts even when multiple dashboarding tools
are used within a company. Most of the time, dashboards & charts are given unique ids by the dashboarding tool used.
An example Dashboard URN for Looker will look like below:
```
urn:li:dashboard:(Looker,<<dashboard_id>>)
```
An example Chart URN for Redash will look like below:
```
urn:li:chart:(Redash,<<chart_id>>)
```

### Chart metadata
Dashboarding tools generally have different jargon to denote a chart.
They are called [Looks](https://docs.looker.com/exploring-data/saving-and-editing-looks) in Looker
and [Visualizations](https://redash.io/help/user-guide/visualizations/visualization-types) in Redash.
But, irrespective of the name, charts are the different tiles which exist in a dashboard.
Charts are mainly used for delivering some information visually to make it easily understandable.
They might use single or multiple data sources and generally have an associated query running against
the underlying data source to generate the data that they present.

Below is a list of metadata which can be associated with a chart:

- Title
- Description
- Type (Bar chart, Pie chart, Scatter plot etc.)
- Input sources
- Query (and its type)
- Access level (public, private etc.)
- Ownership
- Status (removed or not)
- Audit info (last modified, last refreshed)

### Dashboard metadata
Aside from containing a set of charts, dashboards carry metadata attached to them.
Below is a list of metadata which can be associated with a dashboard:

- Title
- Description
- List of charts
- Access level (public, private etc.)
- Ownership
- Status (removed or not)
- Audit info (last modified, last refreshed)

### Metadata graph



An example metadata graph showing the complete data lineage picture is shown above.
In this picture, `Dash_A` and `Dash_B` are dashboards, and they are connected to charts through `Contains` edges.
`C1`, `C2`, `C3` and `C4` are charts, and they are connected to underlying datasets through `DownstreamOf` edges.
`D1`, `D2` and `D3` are datasets.

## How we teach this

We should create/update user guides to educate users on:
- Search & discovery experience (how to find a dashboard in DataHub)
- Lineage experience (how to find upstream datasets of a dashboard and how to find dashboards generated from a dataset)

## Rollout / Adoption Strategy

The design is supposed to be generic enough that any user of DataHub should easily be able
to onboard their dashboard metadata to DataHub irrespective of their dashboarding platform.

The only thing users will need to do is write an ETL script customized for their
dashboarding platform (if it's not already provided in the DataHub repo). This ETL script will (see the sketch below):
- Extract the metadata for all available dashboards and charts using the APIs of the dashboarding platform
- Construct and emit this metadata in the form of [MCEs](../../../what/mxe.md)
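A minimal sketch of what such a script could look like is shown below. The `DashboardingToolClient` and `MceEmitter` interfaces, the field names, and the aspect payloads are all hypothetical placeholders for this illustration; a real script would use the dashboarding tool's actual API and DataHub's MCE schemas.

```java
import java.util.List;
import java.util.Map;

/** Illustrative ETL sketch: pull dashboard/chart metadata from a tool and emit MCE-shaped payloads. */
public class DashboardEtlSketch {

  /** Hypothetical stand-in for a dashboarding tool's API client (e.g. Looker or Redash). */
  interface DashboardingToolClient {
    List<Map<String, Object>> listDashboards();
    List<Map<String, Object>> listCharts(String dashboardId);
  }

  /** Hypothetical stand-in for a Kafka/REST MCE emitter. */
  interface MceEmitter {
    void emit(Map<String, Object> metadataChangeEvent);
  }

  public static void run(DashboardingToolClient client, MceEmitter emitter) {
    for (Map<String, Object> dashboard : client.listDashboards()) {
      String dashboardId = String.valueOf(dashboard.get("id"));
      // URNs follow the convention proposed above, e.g. urn:li:dashboard:(Looker,<<dashboard_id>>).
      String dashboardUrn = String.format("urn:li:dashboard:(%s,%s)", "Looker", dashboardId);
      emitter.emit(Map.of("urn", dashboardUrn, "title", String.valueOf(dashboard.get("title"))));
      for (Map<String, Object> chart : client.listCharts(dashboardId)) {
        String chartUrn = String.format("urn:li:chart:(%s,%s)", "Looker", chart.get("id"));
        emitter.emit(Map.of("urn", chartUrn, "title", String.valueOf(chart.get("title"))));
      }
    }
  }
}
```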
## Unresolved questions (To-do)

1. We'll be adding social features like subscribe and follow later on. However, it's out of scope for this RFC.
@@ -1,125 +0,0 @@
- Start Date: 08/18/2020
- RFC PR: https://github.com/datahub-project/datahub/pull/1812
- Implementation PR(s): https://github.com/datahub-project/datahub/pull/1721

# Machine Learning Models

## Summary

Adding support for cataloging trained machine learning models and features metadata and enabling search and discovery over them. This is a step towards organizing the essential facts of machine learning models in a structured way, leading to responsible democratization of machine learning and related artificial intelligence technology. The work is inspired by Google's model card [paper](https://arxiv.org/pdf/1810.03993.pdf).

## Motivation

We need to model ML model metadata for transparent model reporting. Below are some of the reasons why storing machine learning model metadata is important:
- Search and discovery of ML models trained across an organization.
- Drawing boundaries around a model's capabilities and limitations: There is a need to store the conditions under which a model performs best and most consistently, and whether it has some blind spots. It helps potential users of the models be better informed on which models are best for their specific purposes. Also, it helps minimize the usage of machine learning models in contexts for which they are not well suited.
- Metrics and Limitations: A model's performance can be measured in countless ways, but we need to catalog the metrics that are most relevant and useful. Similarly, there is a need to store a model's potential limitations that are most useful to quantify.
- Ensure comparability across models in a well-informed way: Modeling metadata of ML models allows us to compare candidate models' results across not only traditional evaluation metrics but also along the axes of ethical, inclusive, and fairness considerations.
- Promote reproducibility: Often a model is trained on transformed data; there are preprocessing steps involved in transforming the data, e.g. centering, scaling, dealing with missing values, etc. These transforms should be stored as part of the model metadata to ensure reproducibility.
- Ensure Data Governance: Increasing public concern over consumer privacy is resulting in new data laws, such as GDPR and CCPA, causing enterprises to strengthen their data governance & compliance efforts. Therefore, there is a need to store compliance information of ML models containing PII or confidential data (through manual tags or automated processes) to eliminate the risk of sensitive data exposure.

## Detailed design



As shown in the above diagram, machine learning models use machine learning features as inputs. These machine learning features could be shared across different machine learning models. In the example sketched above, `ML_Feature_1` and `ML_Feature_2` are used as inputs for `ML_Model_A`, while `ML_Feature_2`, `ML_Feature_3` and `ML_Feature_4` are inputs for `ML_Model_B`.

### URN Representation
We'll define two [URNs](../../../what/urn.md): `MLModelUrn` and `MLFeatureUrn`.
These URNs should allow for unique identification of machine learning models and features, respectively. Machine learning models, like datasets, will be identified by a combination of the standardized platform urn, the name of the model and the fabric type the model belongs to or was generated in. Here the platform urn corresponds to the data platform for ML Models (like TensorFlow); representing the platform as an urn enables us to attach platform-specific metadata to it.

A machine learning model URN will look like below:
```
urn:li:mlModel:(<<platform>>,<<modelName>>,<<fabric>>)
```
A machine learning feature will be uniquely identified by its name and the namespace this feature belongs to.
A machine learning feature URN will look like below:
```
urn:li:mlFeature:(<<namespace>>,<<featureName>>)
```
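For illustration only, here is a tiny Java sketch of composing such URN strings under the convention above. The class name, method names and example values are made up for this example and are not part of the DataHub codebase.

```java
/** Illustrative helpers for the mlModel / mlFeature URN conventions proposed above. */
public final class MlUrnSketch {

  /** e.g. urn:li:mlModel:(urn:li:dataPlatform:tensorflow,fraud_detector,PROD) */
  public static String mlModelUrn(String platformUrn, String modelName, String fabric) {
    return String.format("urn:li:mlModel:(%s,%s,%s)", platformUrn, modelName, fabric);
  }

  /** e.g. urn:li:mlFeature:(member_features,age_bucket) */
  public static String mlFeatureUrn(String namespace, String featureName) {
    return String.format("urn:li:mlFeature:(%s,%s)", namespace, featureName);
  }

  public static void main(String[] args) {
    System.out.println(mlModelUrn("urn:li:dataPlatform:tensorflow", "fraud_detector", "PROD"));
    System.out.println(mlFeatureUrn("member_features", "age_bucket"));
  }
}
```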
### Entities
There will be 2 top-level GMA [entities](../../../what/entity.md) in the design: ML models and ML features.
It's important to make ML features a top-level entity because ML features could be shared between different ML models.

### ML Model metadata
- Model properties: Basic information about the ML model
  - Model date
  - Model description
  - Model version
  - Model type: Basic model architecture details, e.g. whether it is a Naive Bayes classifier, a Convolutional Neural Network, etc.
  - ML features used for training
  - Hyperparameters of the model, used to control the learning process
- Tags: Primarily to enhance search and discovery of ML models
- Ownership: Users who own the ML model, to help with directing questions or comments about the model.
- Intended Use
  - Primary intended use cases
  - Primary intended user types
  - Out-of-scope use cases
- Model Factors: Factors affecting model performance, including groups, instrumentation and environments
  - Relevant Factors: Foreseeable factors for which model performance may vary
  - Evaluation Factors: Factors that are being reported
- Metrics: Measures of model performance being reported, as well as decision thresholds (if any) used.
- Training Data: Details on datasets used for training ML Models
  - Datasets used to train the ML model
  - Motivation behind choosing these datasets
  - Preprocessing steps involved: crucial for reproducibility
  - Link to the process/job that captures the training execution
- Evaluation Data: Mirrors Training Data.
- Quantitative Analyses: Provides the results of evaluating the model according to the chosen metrics by linking to the relevant dashboard.
- Ethical Considerations: Demonstrates the ethical considerations that went into model development, surfacing ethical challenges and solutions to stakeholders.
- Caveats and Recommendations: Captures additional concerns regarding the model
  - Did the results suggest any further testing?
  - Relevant groups that were not represented in the evaluation dataset
  - Recommendations for model use
  - Ideal characteristics of an evaluation dataset
- Source Code: Contains training and evaluation pipeline source code, along with the source code where the ML Model is defined.
- Institutional Memory: Institutional knowledge for easy search and discovery.
- Status: Captures whether the model has been soft deleted or not.
- Cost: Cost associated with the model, based on the project/component this model belongs to.
- Deprecation: Captures whether the model has been deprecated or not.

### ML Feature metadata
- Feature Properties: Basic information about the ML Feature
  - Description of the feature
  - Data type of the feature, i.e. boolean, text, etc. These also include [data types](https://towardsdatascience.com/7-data-types-a-better-way-to-think-about-data-types-for-machine-learning-939fae99a689#:~:text=In%20the%20machine%20learning%20world,groups%20are%20often%20called%20out.) particularly relevant for Machine Learning practitioners.
- Ownership: Owners of the ML Feature.
- Institutional Memory: Institutional knowledge for easy search and discovery.
- Status: Captures whether the feature has been soft deleted or not.
- Deprecation: Captures whether the feature has been deprecated or not.

### Metadata graph


An example metadata graph with the complete data lineage picture is shown above. Below are the main edges of the graph:
1. The evaluation dataset contains data used for quantitative analyses and is used for evaluating the ML Model; hence the ML Model is connected to the evaluation dataset(s) through an `EvaluatedOn` edge.
2. The training dataset(s) contain the training data and are used for training the ML Model; hence the ML Model is connected to the training dataset(s) through a `TrainedOn` edge.
3. The ML Model is connected to the `DataProcess` entity, which captures the training execution, through a (newly proposed) `TrainedBy` edge.
4. The `DataProcess` entity itself uses the training dataset(s) (mentioned in 2) as its input and hence is connected to the training datasets through a `Consumes` edge.
5. The ML Model is connected to ML Feature(s) through a `Contains` edge.
6. Results of the performance of the ML Model can be viewed in a dashboard; the ML Model is therefore connected to the `Dashboard` entity through a `Produces` edge.

## How we teach this

We should create/update user guides to educate users on:
- Search & discovery experience (how to find a machine learning model in DataHub)
- Lineage experience (how to find the different entities connected to the machine learning model)

## Alternatives
A machine learning model could as well store a model ID that uniquely identifies it in the machine learning model lifecycle management system. This could then be the only component of `MLModelUrn`; however, we would then need a system to retrieve the model name given the model ID. Hence we chose the approach of modeling `MLModelUrn` similarly to `DatasetUrn`.

## Rollout / Adoption Strategy

The design is supposed to be generic enough that any user of DataHub should easily be able
to onboard their ML model and ML feature metadata to DataHub irrespective of their machine learning platform.

The only thing users will need to do is write an ETL script customized for their machine learning platform (if it's not already provided in the DataHub repo). This ETL script will construct and emit ML model and ML feature metadata in the form of [MCEs](../../../what/mxe.md).

## Future Work

- This RFC does not cover model evolution/versions, linking related models together, or how we will handle that - it will require its own RFC.
- This RFC does not cover the UI design of ML Model and ML Feature.
- This RFC does not cover social features like subscribe and follow on ML Model and/or ML Feature.
@@ -1,105 +0,0 @@
- Start Date: 2020-08-25
- RFC PR: https://github.com/datahub-project/datahub/pull/1820
- Implementation PR(s): https://github.com/datahub-project/datahub/pull/1732

# Azkaban Flows and Jobs

## Summary

Adding support for [Azkaban](https://azkaban.github.io/) job and flow metadata and enabling search and discovery for them.

The design includes the metadata needed to represent Azkaban jobs and flows as data job entities and their relationships to other
entities like Datasets.

## Motivation

Azkaban is a popular open source workflow manager created and extensively used at LinkedIn. Azkaban metadata is a critical piece
of the metadata graph since data processing jobs are the primary driver of data movement and creation.

Without job metadata, it is not possible to understand the data flow across an organization. Additionally, jobs are needed in the
lineage graph to surface operational metadata and have a complete view of data movement and processing. Capturing job and flow
metadata in the lineage graph also helps in understanding the dependencies between multiple flows and jobs and the structure of data
pipelines in the end-to-end data flow.

## Requirements

The following requirements exist as part of this RFC:

- Define data flows and jobs as entities and model metadata for Azkaban data jobs and flows
- Enable search & discovery for data jobs and flows
- Link DataJob entities to existing entities like Datasets to build a more complete metadata graph
- Automatically derive dataset upstream lineage from data job metadata (inputs and outputs)

## Non Requirements

Azkaban has its own application to surface jobs, flows, operational metadata and job logs. DataHub doesn't intend to be
a replacement for it. Users will still need to go to the Azkaban UI to look at logs and debug issues. DataHub will only show
important, high-level metadata in the context of search, discovery and exploration (including lineage) and will link to
the Azkaban UI for further debugging or finer-grained information.

## Detailed design



The graph diagram above shows the relationships and high-level metadata associated with Data Job and Flow entities.

An Azkaban flow is a DAG of one or more Azkaban jobs. Usually, most data processing jobs consume one or more inputs and
produce one or more outputs (represented by datasets in the diagram). There can be other kinds of housekeeping jobs as well,
like cleanup jobs, which don't have any data processing involved.

In the diagram above, the Azkaban job node consumes datasets `ds1` and `ds2` and produces `ds3`. It is also linked to the
flow it is part of. As shown in the diagram, dataset upstream lineage is derived from the Azkaban job metadata, which results
in `ds1` and `ds2` being upstreams of `ds3`.

### Entities
There will be 2 top-level GMA [entities](../../../what/entity.md) in the design: DataJob and DataFlow.

### URN Representation
We'll define two [URNs](../../../what/urn.md): `DataJobUrn` and `DataFlowUrn`.
These URNs should allow for unique identification of a data job and flow, respectively.

A DataFlow URN will consist of the following parts:
1. Workflow manager type (e.g. azkaban, airflow etc.)
2. Flow id - Id of a flow, unique within a cluster
3. Cluster - Cluster where the flow is deployed/executed

A DataJob URN will consist of the following parts:
1. Flow Urn - Urn of the data flow this job is part of
2. Job id - Unique id of the job within the flow

An example DataFlow URN will look like below:
```
urn:li:dataFlow:(azkaban,flow_id,cluster)
```

An example DataJob URN will look like below:
```
urn:li:dataJob:(urn:li:dataFlow:(azkaban,flow_id,cluster),job_id)
```
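To make the nesting explicit, here is a small illustrative Java sketch that composes a DataJob URN from its parent DataFlow URN; the helper names and example ids are invented for this example and are not DataHub APIs.

```java
/** Illustrative composition of the nested dataFlow/dataJob URN strings proposed above. */
public final class DataJobUrnSketch {

  public static String dataFlowUrn(String workflowManager, String flowId, String cluster) {
    return String.format("urn:li:dataFlow:(%s,%s,%s)", workflowManager, flowId, cluster);
  }

  /** The DataJob URN embeds the full DataFlow URN as its first part. */
  public static String dataJobUrn(String flowUrn, String jobId) {
    return String.format("urn:li:dataJob:(%s,%s)", flowUrn, jobId);
  }

  public static void main(String[] args) {
    String flowUrn = dataFlowUrn("azkaban", "daily_metrics_flow", "prod_cluster");
    // Prints: urn:li:dataJob:(urn:li:dataFlow:(azkaban,daily_metrics_flow,prod_cluster),compute_metrics)
    System.out.println(dataJobUrn(flowUrn, "compute_metrics"));
  }
}
```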
### Azkaban Flow metadata

Below is a list of metadata which can be associated with an Azkaban flow:

- Project for the flow (the concept of a project may not exist for other workflow managers, so it may not apply in all cases)
- Flow name
- Ownership

### Azkaban Job metadata

Below is a list of metadata which can be associated with an Azkaban job:

- Job name
- Job type (could be spark, mapreduce, hive, presto, command etc.)
- Inputs consumed by the job
- Outputs produced by the job

## Rollout / Adoption Strategy

The design references open source Azkaban, so it is adoptable by anyone using Azkaban as their
workflow manager.

## Future Work

1. Adding operational metadata associated with Azkaban entities
2. Adding Azkaban references in upstream lineage so that the jobs show up in the lineage graph
@@ -1,251 +0,0 @@
- Start Date: 2020-08-28
- RFC PR: #1841
- Discussion Issue: #1731
- Implementation PR(s):

# RFC - Field Level Lineage

## Summary

DataHub supports dataset-level lineage. UpStreamLineage is an aspect of dataset that powers the dataset-level lineage (a.k.a. coarse-grained lineage).
However, there is a need to understand the lineage at the field level (a.k.a. fine-grained lineage).

In this RFC, we will discuss the following and get consensus on the modelling involved:
- Representation of a field in a dataset
- Representation of the field level lineage
- Process of creating dataset fields and their relations to other entities
- The transformation function involved in the field level lineage is out of scope for the current RFC.



## Basic example

### DatasetFieldURN
A unique identifier for a field in a dataset will be introduced in the form of `DatasetFieldUrn`, and this urn will be the key for the DatasetField entity. A sample is below.

> urn:li:datasetField:(urn:li:dataset:(urn:li:dataPlatform:kafka,demo.orders,PROD),/restaurant)

### Aspect representing the field level lineage

```json
{
  "DatasetUpstreamLineage": {
    "fieldMappings": [
      {
        "created": null,
        "transformationFunction": {
          "string": "com.linkedin.finegrainedmetadata.transformation.Identity"
        },
        "sourceFields": [
          {
            "string": "urn:li:datasetField:(urn:li:dataset:(urn:li:dataPlatform:kafka,demo.orders,PROD),/header/customerId)"
          }
        ],
        "destinationField": "urn:li:datasetField:(urn:li:dataset:(urn:li:dataPlatform:hdfs,/demo/fact_orders,PROD),/customerId)"
      },
      {
        "created": null,
        "transformationFunction": {
          "string": "com.linkedin.finegrainedmetadata.transformation.Identity"
        },
        "sourceFields": [
          {
            "string": "urn:li:datasetField:(urn:li:dataset:(urn:li:dataPlatform:kafka,demo.orders,PROD),/header/timeStamp)"
          }
        ],
        "destinationField": "urn:li:datasetField:(urn:li:dataset:(urn:li:dataPlatform:hdfs,/demo/fact_orders,PROD),/timeStamp)"
      }
    ]
  }
}
```

## Motivation
There is a lot of interest in field level lineage for datasets. Related issues/RFCs are:

1. when does Fine grain lineage feature come out? #1649
2. dataset field level Lineage support? #1519
3. add lineage workflow schedule support #1615
4. Design Review: column level lineage feature #1731
5. Alternate proposal for field level lineage #1784

There are alternate proposals for field level lineage (refer to #1731 and #1784). However, I believe there is a need to uniquely identify a dataset field and represent it as a URN for the following reasons:
- It provides a natural path forward to make the dataset field a first-class entity. Producers and consumers of this dataset field can naturally provide more metadata for a field which doesn't come with / can't be expressed as part of the schema definition.
- Search and discovery of datasets based on the field and its metadata will be a natural extension of this.

## Detailed design

### Models
#### DatasetField
We propose a standard identifier for the dataset field in the below format:
> `urn:li:datasetField:(\<datasetUrn>,\<fieldPath>)`

It contains two parts:
- Dataset Urn -> Standard identifier of the dataset. This URN is already part of the DataHub models
- Field Path -> Represents the field of a dataset

In most typical cases, the field path is the field name or column name of the dataset. Where the fields are nested in nature, this will be a path to reach the leaf node.
To standardize the field paths for different formats, there is a need to build standardized `schema normalizers`.

```json
{
  "type": "record",
  "name": "Record1",
  "fields": [
    {
      "name": "foo1",
      "type": "int"
    },
    {
      "name": "foo2",
      "type": {
        "type": "record",
        "name": "Record2",
        "fields": [
          {
            "name": "bar1",
            "type": "string"
          },
          {
            "name": "bar2",
            "type": [
              "null",
              "int"
            ]
          }
        ]
      }
    }
  ]
}
```
If this is the schema of the dataset, then the dataset fields that emanate from this schema are:
>1. urn:li:datasetField:(urn:li:dataset:(urn:li:dataPlatform:kafka,demo.orders,PROD),/foo1)
>2. urn:li:datasetField:(urn:li:dataset:(urn:li:dataPlatform:kafka,demo.orders,PROD),/foo2/bar1)
>3. urn:li:datasetField:(urn:li:dataset:(urn:li:dataPlatform:kafka,demo.orders,PROD),/foo2/bar2/int)
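To make the path convention concrete, here is a rough, self-contained Java sketch (using Jackson) that walks the schema above and prints those three field paths. It is only an illustration of what a schema normalizer might do, not the proposed implementation, and the `schema.json` file name is assumed.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

/**
 * Rough sketch of a "schema normalizer" for Avro-style schemas, assuming the path rules
 * implied by the example above (nested records join names with '/', unions append the
 * non-null branch type).
 */
public class FieldPathSketch {

  public static void emitPaths(JsonNode schema, String prefix) {
    for (JsonNode field : schema.get("fields")) {
      String path = prefix + "/" + field.get("name").asText();
      JsonNode type = field.get("type");
      if (type.isTextual()) {
        System.out.println(path);                       // primitive -> leaf path, e.g. /foo1
      } else if (type.isObject() && "record".equals(type.path("type").asText())) {
        emitPaths(type, path);                          // nested record -> recurse, e.g. /foo2/bar1
      } else if (type.isArray()) {
        for (JsonNode branch : type) {                  // union -> append the non-null branch, e.g. /foo2/bar2/int
          if (!"null".equals(branch.asText())) {
            System.out.println(path + "/" + branch.asText());
          }
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    JsonNode schema = new ObjectMapper().readTree(new java.io.File("schema.json"));
    emitPaths(schema, "");                              // prints /foo1, /foo2/bar1, /foo2/bar2/int
  }
}
```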
#### Aspect

```pdl
/**
 * ASPECT :: Fine Grained upstream lineage for fields in a dataset
 */
record DatasetUpstreamLineage {
  /**
   * Upstream to downstream field level lineage mappings
   */
  fieldMappings: array[DatasetFieldMapping]
}
```

```pdl
/**
 * Representation of mapping between fields in the source dataset to the field in the destination dataset
 */
record DatasetFieldMapping {
  /**
   * Source fields from which the fine grained lineage is derived
   */
  sourceFields: array[ typeref : union[DatasetFieldUrn]]

  /**
   * Destination field which is derived from source fields
   */
  destinationField: DatasetFieldUrn

  /**
   * A UDF mentioning how the source got transformed to destination.
   * UDF also annotates how some metadata can be carried over from source fields to destination fields.
   * BlackBox UDF implies the transformation function is unknown from source to destination.
   * Identity UDF implies pure copy of the data from source to destination.
   */
  transformationFunction: string = "com.linkedin.finegrainedmetadata.transformation.BlackBox"
}
```

- An aspect with the name `DatasetUpstreamLineage` will be introduced to capture fine-grained lineage. Technically, coarse-grained lineage is already captured by the fine-grained lineage.
- One can also provide a transformation function on how the data got transformed from source fields to the destination field. The exact syntax of such a function is out of scope for this document.
  + BlackBox UDF means the destination field is derived from the source fields, but the transformation function is not known.
  + Identity UDF means the destination field is a pure copy of the source field and the transformation is Identity.
- Upstream sources in the field level relations are dataset field urns, and this is extensible to support other types in the future. Think of a REST API as a possible upstream producing a field in a dataset.

### DataFlow in DataHub for Field Level Lineage

As part of the POC we did, we used the below workflow. Essentially, `DatasetFieldUrn` is introduced, paving the path for it to become a first-class entity.


1. GraphBuilder, on receiving an MAE for the `SchemaMetadata` aspect, will do the below:
   1. Create the Dataset entity in the graph db.
   2. Use schema normalizers and extract field paths. This schema, and henceforth the fields, are the source of truth for dataset fields.
   3. Create dataset field entities in the graph db.
   4. A new relationship builder, `Data Derived From Relation Builder`, will create `HasField` relations between the `Dataset` entity and the `DatasetField` entities.
2. GraphBuilder, on receiving an MAE for the `DatasetUpstreamLineage` aspect, will create the field level lineages (relationship `DataDerivedFrom`) between the source `DatasetField`s and the destination `DatasetField`.

#### Models representing the lineage

```pdl
@pairings = [ {
  "destination" : "com.linkedin.common.urn.DatasetFieldUrn",
  "source" : "com.linkedin.common.urn.DatasetUrn"
} ]
record HasField includes BaseRelationship {
}
```

```pdl
@pairings = [ {
  "destination" : "com.linkedin.common.urn.DatasetFieldUrn",
  "source" : "com.linkedin.common.urn.DatasetFieldUrn"
} ]
record DataDerivedFrom includes BaseRelationship {
}
```

Two new relationships can be introduced to represent the relationships in the graph:
- The `HasField` relationship represents the relation from a dataset to a dataset field.
- The `DataDerivedFrom` relationship represents that the data in a destination dataset field is derived from source dataset fields.

### DataFlow When DatasetField is a First-Class Entity
Once we decide to make the dataset field a first-class entity, producers can start emitting MCEs for dataset fields.
The below represents what the end-to-end flow of the dataset field entity will look like in the larger picture.



- Schema normalizers will be developed as a utility.
- The `DatasetField` entity will be introduced with the aspect `FieldInfo`.
- Producers can use schema normalizers and emit `DatasetField` MCEs for every field in the schema.
- Producers will still emit `SchemaMetadata` as an aspect of the `Dataset` entity. This aspect serves as the metadata for the `HasField` relationship between `Dataset` and `DatasetField` entities.
- An aspect with the name `DatasetUpstreamLineage` will be introduced to capture field level lineage. Technically, coarse-grained lineage is already captured by the fine-grained lineage.

## How we teach this
We are introducing the capability of field level lineage in DataHub. As part of this, below are the salient features one should know:
1. `Schema Normalizers` will be defined to standardize the field paths in a schema. Once this is done, field level lineage will be a relation between the standardized field paths of the source and destination.
2. `Dataset Field URN` will be introduced and `DatasetField` will be a first-class entity in DataHub.
3. `HasField` relations will be populated in the graph db between `Dataset` and `DatasetField`.
4. `DataDerivedFrom` relations will be populated at the field level.
5. `SchemaMetadata` will still serve the schema information of a dataset, but it is used as the source of truth for the presence of dataset field entities.

This is an extension to the current support of coarse-grained lineage by DataHub.
The Relationships tab in the DataHub UI can also be enhanced to show field level lineage.

## Drawbacks
We haven't thought of any potential drawbacks.

## Alternatives
In the alternate design, we wouldn't need to consider defining a dataset field urn. There is an extensive RFC and discussion on this at #1784.

## Rollout / Adoption Strategy
This introduces a new aspect, `DatasetUpstreamLineage`, which is capable of defining lineage at the field level. Hence, existing customers shouldn't be impacted by this change.

## Unresolved questions
- The syntax of the transformation function representing how the source fields got transformed to the destination fields has not been thought through.
- How to automatically derive the field level lineage by parsing the higher level languages or query plans of different execution environments.

For the above two, we need to have more detailed RFCs.
@@ -1,279 +0,0 @@
- Start Date: 08/28/2020
- RFC PR: 1842
- Implementation PR(s): TBD

# Business Glossary

## Summary

Adding support for a Business Glossary enhances the value of metadata and brings in the business view. It helps to document the business terms used across the business and provides a common vocabulary to the entire data stakeholder community. This encourages/motivates the business community to interact with the Data Catalog to discover relevant data assets of interest. It also enables finding relationships between data assets through the business terms that belong to them. The following [article](https://dataedo.com/blog/business-glossary-vs-data-dictionary) illustrates the importance of a business glossary.

## Motivation

We need to model a Business Glossary, where the business team can define business terms and link them to the data elements being onboarded to data platforms/data catalogs. This gives the following benefits:
- Define and enable a common vocabulary in the organization and enable easy collaboration between the business & technical communities
- Organizations can leverage existing industry taxonomies, where they can import the definitions and can enhance or define their specific terms/definitions
- The crux of the business glossary is linking datasets/elements to business terms, so that business users/consumers can discover the datasets of interest easily with the help of business terms
- Promote usage and reduce redundancy: a business glossary helps to discover datasets quickly through business terms, and this also helps reduce unnecessary onboarding of the same/similar datasets by different consumers.

## Detailed design

### What is Business Glossary
A **Business Glossary** is a list of business terms with their definitions. It defines business concepts for an organization or industry and is independent of any specific database, platform or vendor.

A **Data Dictionary** is a description of a data set; it provides the details about the attributes and data types.

### Relationship
Even though the Data Dictionary and the Business Glossary are separate entities, they work nicely together to describe different aspects and levels of abstraction of the data environment of an organization.
Business terms can be linked to specific entities/tables and columns in a data asset/data dictionary to provide more context and a consistent, approved definition for different instances of the terms in different platforms/databases.

### Sample Business Glossary Definition

| URN | Business Term | Definition | Domain/Namespace | Owner | Ext Source | Ext Reference |
|--|--|--|--|--|--|--|
| urn:li:glossaryTerm:instrument.cashInstrument | instrument.cashInstrument | a financial instrument whose value is determined by the market and that is readily transferable (highly liquid) | Foundation | abc@domain.com | fibo | https://spec.edmcouncil.org/fibo/ontology/FBC/FinancialInstruments/FinancialInstruments/CashInstrument |
| urn:li:glossaryTerm:common.dateTime | common.dateTime | time point including a date and a time, optionally including a time zone offset | Finance | xyz@domain.com | fibo | https://spec.edmcouncil.org/fibo/ontology/FND/DatesAndTimes/FinancialDates/DateTime |
| urn:li:glossaryTerm:market.bidSize | market.bidSize | The bid size represents the quantity of a security that investors are willing to purchase at a specified bid price | Trading | xyz@domain.com | - | - |

### Business Term & Dataset - Relationship

| Attribute Name | Data Type | Nullable? | **Business Term** | Description |
|--|--|--|--|--|
| recordId | int | N | | In the case of FX QuoteData the RecordId is equal to the UIC from SymbolsBase |
| arrivalTime | TimestampTicks | N | | Time the price book was received by the TickCollector. 100s of Nanoseconds since 1st January 1970 (ticks) |
| bid1Price | com.xxxx.yyy.schema.common.Price | N | **common.monetoryAmount** | The bid price with rank 1/29. |
| bid1Size | int | N | market.bidSize | The amount the bid price with rank 5/29 is good for. |

### Stitching Together

The Business Glossary will be a first-class entity where one can define `GlossaryTerm`s, similar to entities like Dataset, CorpUser etc. A business term can be linked to other entities like Dataset and DatasetField. In the future, business terms can be linked to Dashboards, Metrics etc.



The above diagram illustrates how business terms will be connected to other entities like Dataset and DatasetField. The example depicts business terms `Term-1`, `Term-2`, ... `Term-n` and how they are linked to `DatasetField` and `Dataset`.
Dataset (`DS-1`) field `e11` is linked to business term `Term-2` and `e12` is linked to `Term-1`.
Dataset (`DS-2`) element `e23` is linked to business term `Term-2`, `e22` to `Term-3` and `e24` to `Term-5`.
Dataset (`DS-2`) itself is linked to business term `Term-4`.

## Metadata Model Enhancements

There will be 1 top-level GMA [entity](../../../what/entity.md) in the design: glossaryTerm (Business Glossary).
It's important to make glossaryTerm a top-level entity because it can exist without a Dataset and can be defined independently by the business team.

### URN Representation
We'll define a [URN](../../../what/urn.md): `GlossaryTermUrn`.
This URN should allow for unique identification of a business term.

A business term URN (GlossaryTermUrn) will look like below:
```
urn:li:glossaryTerm:<<name>>
```
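Purely as an illustration of the naming convention used in the sample table above (a dotted `namespace.termName`), here is a small Java sketch; the class and helper names are invented for this example and are not part of DataHub.

```java
/** Illustrative helpers for the glossaryTerm URN convention proposed above. */
public final class GlossaryTermUrnSketch {

  public static String glossaryTermUrn(String fullyQualifiedName) {
    return "urn:li:glossaryTerm:" + fullyQualifiedName;
  }

  /** Splits a dotted term name such as "instrument.cashInstrument" into namespace and term. */
  public static String[] splitName(String fullyQualifiedName) {
    int lastDot = fullyQualifiedName.lastIndexOf('.');
    return lastDot < 0
        ? new String[] {"", fullyQualifiedName}
        : new String[] {fullyQualifiedName.substring(0, lastDot), fullyQualifiedName.substring(lastDot + 1)};
  }

  public static void main(String[] args) {
    System.out.println(glossaryTermUrn("instrument.cashInstrument")); // urn:li:glossaryTerm:instrument.cashInstrument
    System.out.println(String.join(" / ", splitName("instrument.cashInstrument"))); // instrument / cashInstrument
  }
}
```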
### New Snapshot Object
There will be a new snapshot object to onboard business terms along with their definitions.

Path: metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot/
```java
/**
 * A metadata snapshot for a specific GlossaryTerm entity.
 */
record GlossaryTermSnapshot {

  /**
   * URN for the entity the metadata snapshot is associated with.
   */
  urn: GlossaryTermUrn

  /**
   * The list of metadata aspects associated with the GlossaryTerm. Depending on the use case, this can either be all, or a selection, of supported aspects.
   */
  aspects: array[GlossaryTermAspect]
}
```

Path: metadata-models/src/main/pegasus/com/linkedin/metadata/aspect/

### GlossaryTermAspect
A new aspect will be defined to capture the required attributes & ownership information.

```
/**
 * A union of all supported metadata aspects for a GlossaryTerm
 */
typeref GlossaryTermAspect = union[
  GlossaryTermInfo,
  Ownership
]
```

Business Term Entity Definition
```java
/**
 * Data model for a Business Term entity
 */
record GlossaryTermEntity includes BaseEntity {

  /**
   * Urn for the business term
   */
  urn: GlossaryTermUrn

  /**
   * Business Term native name e.g. CashInstrument
   */
  name: optional string

}
```

### Entity GlossaryTermInfo

```java
/**
 * Properties associated with a GlossaryTerm
 */
record GlossaryTermInfo {

  /**
   * Definition of business term
   */
  definition: string

  /**
   * Source of the Business Term (INTERNAL or EXTERNAL) with default value as INTERNAL
   */
  termSource: string

  /**
   * External Reference to the business-term (URL)
   */
  sourceRef: optional string

  /**
   * The abstracted URI such as https://spec.edmcouncil.org/fibo/ontology/FBC/FinancialInstruments/FinancialInstruments/CashInstrument.
   */
  sourceUrl: optional Url

  /**
   * A key-value map to capture any other non-standardized properties for the glossary term
   */
  customProperties: map[string, string] = { }

}
```

### Business Term Relationship with Owner
Business terms will be owned by certain business users.

```
/**
 * A generic model for the Owned-By relationship
 */
@pairings = [ {
  "destination" : "com.linkedin.common.urn.CorpuserUrn",
  "source" : "com.linkedin.common.urn.GlossaryTermUrn"
}, {
  "destination" : "com.linkedin.common.urn.GlossaryTermUrn",
  "source" : "com.linkedin.common.urn.CorpuserUrn"
} ]
record OwnedBy includes BaseRelationship {

  /**
   * The type of the ownership
   */
  type: OwnershipType
}
```

### Business Glossary Aspect
A business term can be associated with a Dataset Field as well as a Dataset. Defining the aspect that can be associated with Dataset and DatasetField:

```
record GlossaryTerms {
  /**
   * The related business terms
   */
  terms: array[GlossaryTermAssociation]

  /**
   * Audit stamp containing who reported the related business term
   */
  auditStamp: AuditStamp
}

record GlossaryTermAssociation {
  /**
   * Urn of the applied glossary term
   */
  urn: GlossaryTermUrn
}
```

It is proposed to have the following changes to SchemaField to associate it (optionally) with Business Glossary terms:

```
record SchemaField {
  ...
  /**
   * Tags associated with the field
   */
  globalTags: optional GlobalTags

+ /**
+  * Glossary terms associated with the field
+  */
+ glossaryTerms: optional GlossaryTerms
}
```

It is proposed to have the following changes to the Dataset aspect to associate it (optionally) with Business Glossary terms:

```
/**
 * A union of all supported metadata aspects for a Dataset
 */
typeref DatasetAspect = union[
  DatasetProperties,
  DatasetDeprecation,
  UpstreamLineage,
  InstitutionalMemory,
  Ownership,
  Status,
  SchemaMetadata
+ GlossaryTerms
]
```

## Metadata Graph

This might not be a critical requirement, but it is nice to have.

1. Users should be able to search for business terms and see all the Datasets that have elements linked to that business term.

## How we teach this

We should create/update user guides to educate users on:
- The importance and value that a Business Glossary brings to the Data Catalog
- Search & discovery experience through business terms (how to find relevant datasets quickly in DataHub)

## Alternatives
This is a new feature in DataHub that brings a common vocabulary across data stakeholders and also enables better discoverability of datasets. There is no clear alternative to this feature; at most, users can document the `business term` outside the `Data Catalog` and reference/associate those terms as an additional property on a Dataset column.

## Rollout / Adoption Strategy

The design is supposed to be generic enough that any user of DataHub should easily be able to onboard their Business Glossary (list of terms and definitions) to DataHub irrespective of their industry. Organizations that subscribe to/download an industry-standard taxonomy should, with slight modelling and integration, be able to bring in the business glossary quickly.

While onboarding datasets, business/tech teams need to link the business terms to the data elements; once users see the value of this, they will be motivated to link the elements with the appropriate business terms.

## Unresolved questions

- This RFC does not cover the UI design for Business Glossary Definition.
@ -1,643 +0,0 @@
|
||||
- Start Date: 12/17/2020
|
||||
- RFC PR: 2042
|
||||
- Implementation PR(s): 2044
|
||||
|
||||
# GraphQL Frontend (Part 1)
|
||||
|
||||
## Summary
|
||||
|
||||
This RFC outlines a proposal to implement a GraphQL specification in `datahub-frontend`. Ultimately, this proposal aims to model the following using GraphQL:
|
||||
|
||||
1. Reads against the Metadata Catalog (P1)
|
||||
2. Writes against the Metadata Catalog (P2)
|
||||
3. Auth, Search, & Browse against the Metadata Catalog (P2)
|
||||
|
||||
We propose that this initiative take place in phases, starting with CRUD-style read support against the entities, aspects, and relationships comprising the DataHub catalog. The scope of this RFC is limited to Part 1: reading against the catalog. It will cover the topics of introducing a dedicated GraphQL endpoint into ``datahub-frontend`` and provide a recipe for onboarding GMS entities to GQL. Subsequent RFCs will address writing, searching, and browsing against the catalog.
|
||||
|
||||
Along with the RFC, we've included a proof-of-concept demonstrating partial GQL read support, showcasing `Dataset` and its relationship to `CorpUser`. The following files will be useful to reference as you read along:
|
||||
- `datahub-frontend.graphql` - Where the GQL Schema is defined.
|
||||
- `datahub-frontend/app/conf/routes` - Where the frontend API routes are defined.
|
||||
- `datahub-frontend/app/controllers/GraphQLController.java` - The entry point for executing GQL queries.
|
||||
- `datahub-frontend/app/graphql/resolvers` - Definition of GQL DataFetchers, discussed below.
|
||||
- `datahub-dao` - Module containing the DAOs used in fetching downstream data from GMS
|
||||
|
||||
It is important to note that there are some questions that would be best discussed among the community, especially those pertaining to modeling of the Metadata Catalog on the frontend. These will be covered in the **Unresolved Questions** section below.
|
||||
|
||||
|
||||
## Motivation
|
||||
Exposing a GQL API for client-side apps has numerous benefits with respect to developer productivity, maintainability, performance among other things. This RFC will not attempt to fully enumerate the benefits of GraphQL as an IDL. For a more in-depth look at these advantages, [this](https://www.apollographql.com/docs/intro/benefits/) is a good place to start.
|
||||
|
||||
We will provide a few reasons GraphQL is particularly suited for DataHub:
|
||||
- **Reflects Reality**: The metadata managed by DataHub can naturally be represented by a graph. Providing the ability to query it as such will not only lead to a more intuitive experience for client-side developers, but also provide more numerous opportunities for code reuse within both client side apps and the frontend server.
|
||||
|
||||
- **Minimizes Surface Area**: Different frontend use cases typically require differing quantities of information. This is shown by the numerous subresource endpoints exposed in the current api: `/datasets/urn/owners`, `/dataset/urn/institutionalmemory`, etc. Instead of requiring the creation of these endpoints individually, GraphQL allows the client to ask for exactly what it needs, in doing so reducing the number of endpoints that need to be maintained.
|
||||
|
||||
- **Reduces API Calls**: Frontend apps are naturally oriented around pages, the data for which can typically be represented using a single document (view). DataHub is no exception. GraphQL allows frontend developers to easily materialize those views, without requiring complex frontend logic to coordinate multiple calls to the API.
|
||||
|
||||
## Detailed design
|
||||
|
||||
This section will outline the changes required to introduce a GQL support within ``datahub-frontend``, along with a description of how we can model GMA entities in the graph.
|
||||
|
||||
At a high level, the following changes will be made within `datahub-frontend`:
|
||||
1. Define a GraphQL Schema spec (datahub-frontend.graphql)
|
||||
|
||||
2. Configure a `/graphql` endpoint accepting POST requests (routes)
|
||||
|
||||
3. Introduce a dedicated `GraphQL` Play Controller
|
||||
|
||||
a. Configure GraphQL Engine
|
||||
|
||||
b. Parse, validate, & prepare inbound queries
|
||||
|
||||
c. Execute Queries
|
||||
|
||||
d. Format & send client response
|
||||
|
||||
We will continue by taking a detailed look at each step. Throughout the design, we will reference the existing `Dataset` entity, its `Ownership` aspect, and its relationship to the `CorpUser` entity for purposes of illustration.
|
||||
|
||||
### Defining the GraphQL Schema
|
||||
GraphQL APIs must have a corresponding schema, representing the types & relationships present in the graph. This is usually defined centrally within a `.graphql` file. We've introduced `datahub-frontend/app/conf/datahub-frontend.graphql`
|
||||
for this purpose:
|
||||
|
||||
*datahub-frontend.graphql*
|
||||
```graphql
|
||||
schema {
|
||||
query: Query
|
||||
}
|
||||
|
||||
type Query {
|
||||
dataset(urn: String): Dataset
|
||||
}
|
||||
|
||||
type Dataset {
|
||||
urn: String!
|
||||
....
|
||||
}
|
||||
|
||||
type CorpUser {
|
||||
urn: String!
|
||||
....
|
||||
}
|
||||
|
||||
...
|
||||
```
|
||||
There are two classes of object types within the GraphQL spec: **user-defined** types and **system** types.
|
||||
|
||||
**User-defined** types model the entities & relationships in your domain model. In the case of DataHub: Datasets, Users, Metrics, Schemas, & more. These types can reference one another in composition, creating edges among types.
|
||||
|
||||
**System** types include special "root" types which provide entry points to the graph:
|
||||
- `Query`: Reads against the graph (covered in RFC 2042)
|
||||
- `Mutation`: Writes against the graph
|
||||
- `Subscription`: Subscribing to changes within the graph
|
||||
|
||||
In this design, we will focus on the `Query` type. Based on the types defined above, the following would be a valid GQL query:
|
||||
|
||||
*Example Query*
|
||||
```graphql
|
||||
query datasets($urn: String!) {
|
||||
dataset(urn: $urn) {
|
||||
ownership {
|
||||
owner {
|
||||
username
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
For more details about the GraphQL Schemas & Types, see [here](https://graphql.org/learn/schema/).
|
||||
|
||||
|
||||
### Configuring a GraphQL Endpoint
|
||||
All GraphQL queries are serviced via a single endpoint. We place the new POST route in `datahub-frontend/conf/routes`:
|
||||
|
||||
```POST /api/v2/graphql react.controllers.GraphQLController.execute()```
|
||||
|
||||
We also provide an implementation of a GraphQL Play Controller, exposing an "execute" method. The controller is responsible for
|
||||
- parsing & validating incoming queries
|
||||
- delegating execution of the query
|
||||
- formatting the client response
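
To make these responsibilities concrete, below is a minimal sketch of what such a controller's `execute` method could look like. It assumes the Play Java API (where the current request is available via `request()`) and the graphql-java engine that is configured in the next section; the class shape, field names, and error handling here are illustrative rather than a final implementation.

```java
package react.controllers;

import java.util.Collections;
import java.util.Map;

import com.fasterxml.jackson.databind.JsonNode;

import graphql.ExecutionInput;
import graphql.ExecutionResult;
import graphql.GraphQL;

import play.libs.Json;
import play.mvc.Controller;
import play.mvc.Result;

public class GraphQLController extends Controller {

    private final GraphQL _engine; // engine construction is covered in "Executing a Query" below

    public GraphQLController(GraphQL engine) {
        _engine = engine;
    }

    public Result execute() {
        // 1. Parse & validate the inbound body: { "query": "...", "variables": { ... } }
        final JsonNode body = request().body().asJson();
        if (body == null || !body.has("query")) {
            return badRequest("Expected a JSON body containing a 'query' field");
        }
        Map<String, Object> variables = Collections.emptyMap();
        if (body.has("variables")) {
            variables = Json.mapper().convertValue(body.get("variables"), Map.class);
        }

        // 2. Delegate execution of the query to the graphql-java engine.
        final ExecutionInput input = ExecutionInput.newExecutionInput()
            .query(body.get("query").asText())
            .variables(variables)
            .build();
        final ExecutionResult result = _engine.execute(input);

        // 3. Format & send the client response in the standard GraphQL shape ("data", "errors", "extensions").
        return ok(Json.toJson(result.toSpecification()));
    }
}
```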
|
||||
|
||||
### Executing a Query
|
||||
|
||||
For executing queries, we will use the [graphql-java](https://www.graphql-java.com/) library.
|
||||
|
||||
There are 2 components provided to the engine that enable execution:
|
||||
1. **Data Resolvers**: Resolve individual projections of a query. Defined for top-level entities and foreign key relationship fields
|
||||
|
||||
2. **Data Loaders**: Efficiently load data required to resolve field(s) by aggregating calls to downstream data sources
|
||||
|
||||
**Data Resolvers**
|
||||
|
||||
During read queries, the library "resolves" each field in the selection set of the query. It does so by intelligently invoking user-provided classes extending `DataFetcher`. These 'resolvers' define how to fetch a particular field in the graph. Once implemented, 'resolvers' must be registered with the query engine.
|
||||
|
||||
In DataHub's case, resolvers will be required for
|
||||
- Entities available for query (query type fields)
|
||||
- Relationships available for traversal (foreign-key fields)
|
||||
|
||||
*Defining resolvers*
|
||||
|
||||
Below you'll find sample resolvers corresponding to
|
||||
- the `dataset` query type defined above (query field resolver)
|
||||
- an `owner` field within the Ownership aspect (user-defined field resolver, foreign key reference)
|
||||
|
||||
```java
|
||||
/**
|
||||
* Resolver responsible for resolving the 'dataset' field of Query
|
||||
*/
|
||||
public class DatasetResolver implements DataFetcher<CompletableFuture<Map<String, Object>>> {
|
||||
@Override
|
||||
public CompletableFuture<Map<String, Object>> get(DataFetchingEnvironment environment) throws Exception {
|
||||
final DataLoader<String, Dataset> dataLoader = environment.getDataLoader("datasetLoader");
|
||||
return dataLoader.load(environment.getArgument("urn"))
|
||||
.thenApply(RecordTemplate::data);
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Resolver responsible for resolving the 'owner' field of Ownership.
|
||||
*/
|
||||
public class OwnerResolver implements DataFetcher<CompletableFuture<Map<String, Object>>> {
|
||||
@Override
|
||||
public CompletableFuture<Map<String, Object>> get(DataFetchingEnvironment environment) throws Exception {
|
||||
final Map<String, Object> parent = environment.getSource();
|
||||
final DataLoader<String, CorpUser> dataLoader = environment.getDataLoader("corpUserLoader");
|
||||
return dataLoader.load((String) parent.get("owner"))
|
||||
.thenApply(RecordTemplate::data);
|
||||
}
|
||||
}
|
||||
```
|
||||
Resolvers load the correct data when invoked by the GraphQL engine via the ``get`` method. The inputs provided to a resolver include:
|
||||
- parent field resolver result
|
||||
- optional arguments
|
||||
- optional context object
|
||||
- query variable map
|
||||
|
||||
For a more detailed look of resolvers in graphql-java, check out the [Data Fetching](https://www.graphql-java.com/documentation/v11/data-fetching/) documentation.
|
||||
|
||||
*Registering resolvers*
|
||||
|
||||
To register resolvers, we first construct a `RuntimeWiring` object provided by graphql-java:
|
||||
|
||||
```java
|
||||
private static RuntimeWiring configureResolvers() {
|
||||
/*
|
||||
* Register GraphQL field Resolvers.
|
||||
*/
|
||||
return newRuntimeWiring()
|
||||
/*
|
||||
* Query Resolvers
|
||||
*/
|
||||
.type("Query", typeWiring -> typeWiring
|
||||
.dataFetcher("dataset", new DatasetResolver())
|
||||
)
|
||||
/*
|
||||
* Relationship Resolvers
|
||||
*/
|
||||
.type("Owner", typeWiring -> typeWiring
|
||||
.dataFetcher("owner", new OwnerResolver())
|
||||
)
|
||||
.build();
|
||||
}
|
||||
```
|
||||
This tells the engine which classes should be invoked to resolve which fields.
|
||||
|
||||
The `RuntimeWiring` object is then used to create a GraphQL engine:
|
||||
```java
|
||||
GraphQLSchema graphQLSchema = schemaGenerator.makeExecutableSchema(typeDefinitionRegistry, configureResolvers());
|
||||
GraphQL engine = GraphQL.newGraphQL(graphQLSchema).build();
|
||||
```
|
||||
|
||||
You'll notice within the resolvers we use ``DataLoaders`` to materialize the desired data. We'll discuss these next.
|
||||
|
||||
|
||||
**Data Loaders**
|
||||
|
||||
DataLoaders are an abstraction provided by ``graphql-java`` to make retrieval of data from downstream sources more efficient, by batching calls for the same data types.
|
||||
DataLoaders are defined and registered with the GraphQL engine for each entity type to be loaded from a remote source.
|
||||
|
||||
*Defining a DataLoader*
|
||||
|
||||
Below you'll find sample loaders corresponding to the ``Dataset`` and ``CorpUser`` GMA entities.
|
||||
|
||||
```java
|
||||
// Create Dataset Loader
|
||||
BatchLoader<String, com.linkedin.dataset.Dataset> datasetBatchLoader = new BatchLoader<String, com.linkedin.dataset.Dataset>() {
|
||||
@Override
|
||||
public CompletionStage<List<com.linkedin.dataset.Dataset>> load(List<String> keys) {
|
||||
return CompletableFuture.supplyAsync(() -> {
|
||||
try {
|
||||
return DaoFactory.getDatasetsDao().getDatasets(keys);
|
||||
} catch (Exception e) {
|
||||
throw new RuntimeException("Failed to batch load Datasets", e);
|
||||
}
|
||||
});
|
||||
}
|
||||
};
|
||||
DataLoader datasetLoader = DataLoader.newDataLoader(datasetBatchLoader);
|
||||
|
||||
// Create CorpUser Loader
|
||||
BatchLoader<String, com.linkedin.identity.CorpUser> corpUserBatchLoader = new BatchLoader<String, com.linkedin.identity.CorpUser>() {
|
||||
@Override
|
||||
public CompletionStage<List<com.linkedin.identity.CorpUser>> load(List<String> keys) {
|
||||
return CompletableFuture.supplyAsync(() -> {
|
||||
try {
|
||||
return DaoFactory.getCorpUsersDao().getCorpUsers(keys);
|
||||
} catch (Exception e) {
|
||||
throw new RuntimeException("Failed to batch load CorpUsers", e);
|
||||
}
|
||||
});
|
||||
}
|
||||
};
|
||||
DataLoader corpUserLoader = DataLoader.newDataLoader(corpUserBatchLoader);
|
||||
```
|
||||
In extending `BatchLoader`, a single batch "load" method must be provided. This API is exploited by the GraphQL engine, which aggregates calls for entities of the same type, reducing the number of downstream calls made within a single query.
|
||||
|
||||
*Registering a DataLoader*
|
||||
|
||||
DataLoaders are registered with the `DataLoaderRegistry` and subsequently included as input to the GraphQL engine:
|
||||
|
||||
```java
|
||||
/*
|
||||
* Build DataLoader Registry
|
||||
*/
|
||||
DataLoaderRegistry registry = new DataLoaderRegistry();
|
||||
registry.register("datasetLoader", datasetLoader);
|
||||
registry.register("corpUserLoader", corpUserLoader);
|
||||
|
||||
/*
|
||||
* Construct execution input
|
||||
*/
|
||||
ExecutionInput executionInput = ExecutionInput.newExecutionInput()
|
||||
.query(queryJson.asText())
|
||||
.variables(variables)
|
||||
.dataLoaderRegistry(registry)
|
||||
.build();
|
||||
|
||||
/*
|
||||
* Execute GraphQL Query
|
||||
*/
|
||||
ExecutionResult executionResult = _engine.execute(executionInput);
|
||||
```
|
||||
|
||||
For more information about `DataLoaders` see the [Using DataLoader](https://www.graphql-java.com/documentation/v15/batching/) doc.
|
||||
|
||||
For a full reference implementation of the query execution process, see the ``GraphQLController.java`` class associated with this PR.
|
||||
|
||||
### Bonus: Instrumentation
|
||||
[graphql-java](https://www.graphql-java.com/) provides an [Instrumentation](https://github.com/graphql-java/graphql-java/blob/master/src/main/java/graphql/execution/instrumentation/Instrumentation.java) interface that can be implemented to record information about steps in the query execution process.
|
||||
|
||||
Conveniently, `graphql-java` provides a `TracingInstrumentation` implementation out of the box. This can be used to gain a deeper understanding of the performance of queries, by capturing granular (ie. field-level) tracing metrics for each query. This tracing information is included in the engine `ExecutionResult`. From there it can be sent to a remote monitoring service, logged, or simply provided as part of the GraphQL response.
|
||||
|
||||
For now, we will simply return the tracing results in the "extensions" portion of the GQL response, as described in [this doc](https://github.com/apollographql/apollo-tracing). In future designs, we can consider providing more formal extension points for injecting custom remote monitoring logic.
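
As a rough sketch (reusing the schema and execution input shown earlier), enabling tracing is a one-line addition when building the engine; the `TracingInstrumentation` class ships with graphql-java:

```java
// graphql.execution.instrumentation.tracing.TracingInstrumentation is provided by graphql-java.
GraphQL engine = GraphQL.newGraphQL(graphQLSchema)
    .instrumentation(new TracingInstrumentation())
    .build();

// After execution, per-field timings appear under the "tracing" key of the extensions map,
// and are included automatically when the result is serialized via toSpecification().
ExecutionResult executionResult = engine.execute(executionInput);
Object tracing = executionResult.getExtensions().get("tracing");
```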
|
||||
|
||||
|
||||
### Modeling Queries
|
||||
|
||||
The first phase of the GQL rollout will support primary-key lookups of GMA entities and projection of their associated aspects. In order to achieve this, we will
|
||||
- model entities, aspects, and the relationships among them
|
||||
- expose queries against top-level entities
|
||||
using GQL.
|
||||
|
||||
The proposed steps for onboarding a GMA entity, its related aspects, and the relationships among them are outlined next.
|
||||
|
||||
#### 1. **Model an entity in GQL**
|
||||
|
||||
The entity model should include its GMA aspects, as shown below, as simple object fields. The client need not be intimately aware of the concept of "aspects". Instead, it should be concerned with entities and their relationships.
|
||||
|
||||
*Modeling Dataset*
|
||||
```graphql
|
||||
"""
|
||||
Represents the GMA Dataset Entity
|
||||
"""
|
||||
type Dataset {
|
||||
|
||||
urn: String!
|
||||
|
||||
platform: String!
|
||||
|
||||
name: String!
|
||||
|
||||
origin: FabricType
|
||||
|
||||
description: String
|
||||
|
||||
uri: String
|
||||
|
||||
platformNativeType: PlatformNativeType
|
||||
|
||||
tags: [String]!
|
||||
|
||||
properties: [PropertyTuple]
|
||||
|
||||
createdTime: Long!
|
||||
|
||||
modifiedTime: Long!
|
||||
|
||||
ownership: Ownership
|
||||
}
|
||||
|
||||
"""
|
||||
Represents Ownership
|
||||
"""
|
||||
type Ownership {
|
||||
|
||||
owners: [Owner]
|
||||
|
||||
lastModified: Long!
|
||||
}
|
||||
|
||||
"""
|
||||
Represents an Owner
|
||||
"""
|
||||
type Owner {
|
||||
"""
|
||||
The fully-resolved owner
|
||||
"""
|
||||
owner: CorpUser!
|
||||
|
||||
"""
|
||||
The type of the ownership
|
||||
"""
|
||||
type: OwnershipType
|
||||
|
||||
"""
|
||||
Source information for the ownership
|
||||
"""
|
||||
source: OwnershipSource
|
||||
}
|
||||
```
|
||||
|
||||
Notice that the Dataset's ``Ownership`` aspect includes a nested ``Owner`` field that references a ``CorpUser`` type. In the GMS model, this is represented as a foreign-key relationship (urn). In GraphQL, we include the *resolved* relationship, allowing the client to easily retrieve information about a Dataset's owners.
|
||||
|
||||
To support traversal of this relationship, we additionally include a ``CorpUser`` type:
|
||||
|
||||
```graphql
|
||||
"""
|
||||
Represents the CorpUser GMA Entity
|
||||
"""
|
||||
type CorpUser {
|
||||
|
||||
urn: String!
|
||||
|
||||
username: String!
|
||||
|
||||
info: CorpUserInfo
|
||||
|
||||
editableInfo: CorpUserEditableInfo
|
||||
}
|
||||
```
|
||||
|
||||
#### 2. **Extend the 'Query' type**:
|
||||
|
||||
GraphQL defines a top-level 'Query' type that serves as the entry point for reads against the graph. This is extended to support querying the new entity type.
|
||||
|
||||
```graphql
|
||||
type Query {
|
||||
dataset(urn: String!): Dataset # Add this!
|
||||
datasets(urn: [String]!): [Dataset] # Or if batch support required, add this!
|
||||
}
|
||||
```
|
||||
|
||||
#### 3. **Define & Register DataLoaders**
|
||||
This is illustrated in the previous section. It involves extending `DataLoader` and registering the loader in the `DataLoaderRegistry`.
|
||||
|
||||
#### 4. **Define & Register DataFetcher (Data Resolver)**
|
||||
This is illustrated in the previous section. It involves implementing the DataFetcher interface and attaching it to fields in the graph.
|
||||
|
||||
#### 5. **Query your new entity**
|
||||
Deploy & start issuing queries against the graph.
|
||||
|
||||
```graphql
|
||||
// Input
|
||||
query datasets($urn: String!) {
|
||||
dataset(urn: $urn) {
|
||||
urn
|
||||
ownership {
|
||||
owners {
|
||||
owner {
|
||||
username
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Output
|
||||
{
|
||||
"data": {
|
||||
"dataset": {
|
||||
"urn": "urn:li:dataset:(urn:li:dataPlatform:foo,bar,PROD)",
|
||||
"ownership": {
|
||||
"owners": [
|
||||
{
|
||||
"owner": {
|
||||
"username": "james joyce",
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Disclaimer**
|
||||
|
||||
It is the intention that the GraphQL type system be **auto-generated** based on the PDL models declared at the GMA layer. This means that the frontend schema should not need to be maintained separately from the GMA models it is derived from.
|
||||
|
||||
Various changes will be required to accomplish this:
|
||||
|
||||
1. Introduce a `readOnly` annotation into GMA PDLs, used to identify which fields should *not* be writable.
|
||||
2. Introduce a `relationship` annotation into GMA PDLs, used to declare foreign-key relationships (an alternative to the relationship + pairings convention that exists today)
|
||||
3. Implement a GraphQL schema generator that can be configured to
|
||||
- Load relevant entity PDLs
|
||||
- Generate `Query` types, including resolved relationships
|
||||
- Generate `Mutation` types, omitting specific fields
|
||||
- Write generated types to GraphQL schema file
|
||||
- Run as a build-time gradle task
|
||||
|
||||
We omit proposal of this portion of the design from this RFC. There will be a subsequent RFC proposing implementation of the items above.
|
||||
|
||||
*Auto-generation of GQL resolvers should also be explored.* This would be possible provided that standardized DAOs are available at the frontend service layer. It'd be incredible if we could
|
||||
- Auto generate "client" objects from a rest spec
|
||||
- Auto generate "dao" objects that use a client
|
||||
- Auto generate "resolvers" that use a "dao"
|
||||
|
||||
## How we teach this
|
||||
|
||||
We will create user guides that cover:
|
||||
- Modeling & onboarding entities to the GQL type system
|
||||
- Modeling & onboarding entity aspects to the GQL type system
|
||||
- Modeling & onboarding relationships among entities to the GQL type system
|
||||
|
||||
## Alternatives
|
||||
Keep the resource-oriented approach, with different endpoints for each entity / aspect.
|
||||
|
||||
## Rollout / Adoption Strategy
|
||||
|
||||
1. Rollout CRUD-style reads against entities / aspects in the graph
|
||||
- Entities: Dataset, CorpUser, DataPlatform
|
||||
- Relationships: Dataset->CorpUser, Dataset->DataPlatform, [if there is demand] CorpUser -> Dataset
|
||||
2. Rollout CRUD-style writes against entities / aspects in the graph
|
||||
3. Rollout full text search against entities / aspects in the graph
|
||||
4. Rollout browse against entities / aspects in the graph
|
||||
5. Migrate client-side apps to leverage new GraphQL API
|
||||
- Build out a parallel data fetching layer, where the existing models are populated either by the old clients or the new GQL clients. Place this behind a client-side feature flag.
|
||||
- Allow users of DataHub to configure which API they want to run with, swap at their own pace.
|
||||
|
||||
## Unresolved questions
|
||||
|
||||
**How should aspects be modeled on the GQL Graph?**
|
||||
|
||||
Aspects should be modeled as nothing more than fields on their parent entity. The frontend clients should not require understanding of the aspect concept. Instead, fetching specific aspects should be a matter of querying the parent entities with particular projections.
|
||||
|
||||
For example, retrieving the `Ownership` aspect of `Datasets` is done using the following GQL query:
|
||||
|
||||
```graphql
|
||||
query datasets($urn: String!) {
|
||||
dataset(urn: $urn) {
|
||||
ownership {
|
||||
owners {
|
||||
owner {
|
||||
username
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
as opposed to the following:
|
||||
|
||||
```graphql
|
||||
query ownership($datasetUrn: String!) {
|
||||
ownership(datasetUrn: $datasetUrn) {
|
||||
owners {
|
||||
owner {
|
||||
username
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Aspects should not be exposed in the top-level 'Query' model.
|
||||
|
||||
**What should the GraphQL graph model contain? How should it be constructed?**
|
||||
|
||||
There are 2 primary options:
|
||||
|
||||
1. Directly expose transposed GMS models (entities, aspects as is) via the GQL type system. One alteration would be introducing resolved relationships extending outward from the aspect objects.
|
||||
- Pros: Simpler because no new POJOs need to be maintained at the `datahub-frontend` layer (can reuse the Rest.li models provided by GMS). In the longer term, GQL type system can be generated directly from GMS Pegasus models.
|
||||
- Cons: Exposes frontend clients to the entire GMA graph, much of which may not be useful for presentation
|
||||
2. Create a brand new presentation-layer Graph model
|
||||
- Pros: Only expose what frontend clients need from the graph (simplifies topology)
|
||||
- Cons: Requires that we maintain (perhaps generate) additional POJOs specific to the presentation layer (maintenance cost?)
|
||||
|
||||
We're interested to get community feedback on this question!
|
||||
|
||||
**Should foreign-keys corresponding to outbound relationships be included in the GQL schema?**
|
||||
|
||||
Using the example from above, that would entail a model like:
|
||||
```graphql
|
||||
"""
|
||||
Represents an Owner
|
||||
"""
|
||||
type Owner {
|
||||
"""
|
||||
Owner URN, e.g. urn:li:corpuser:ldap, urn:li:corpGroup:group_name, and urn:li:product:product_name
|
||||
"""
|
||||
ownerUrn: String!
|
||||
|
||||
"""
|
||||
The fully resolved owner!
|
||||
"""
|
||||
owner: CorpUser!
|
||||
|
||||
"""
|
||||
The type of the ownership
|
||||
"""
|
||||
type: OwnershipType
|
||||
|
||||
"""
|
||||
Source information for the ownership
|
||||
"""
|
||||
source: OwnershipSource
|
||||
}
|
||||
```
|
||||
We should *not* need to include such fields. This is because we can efficiently implement the following query:
|
||||
|
||||
```graphql
|
||||
query datasets($urn: String!) {
|
||||
dataset(urn: $urn) {
|
||||
urn
|
||||
ownership {
|
||||
owners {
|
||||
owner {
|
||||
urn
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
without actually calling downstream services. By cleverly implementing the "owner" field resolver, we can return quickly when the urn is the only projection:
|
||||
|
||||
```java
/**
 * GraphQL Resolver responsible for fetching Dataset owners.
 */
public class OwnerResolver implements DataFetcher<CompletableFuture<Map<String, Object>>> {
    @Override
    public CompletableFuture<Map<String, Object>> get(DataFetchingEnvironment environment) throws Exception {
        final Map<String, Object> parent = environment.getSource();
        // Short-circuit: when 'urn' is the only projected field, answer directly from the
        // foreign key stored on the parent, without calling any downstream service.
        if (environment.getSelectionSet().contains("urn") && environment.getSelectionSet().getFields().size() == 1) {
            if (parent.get("owner") != null) {
                return CompletableFuture.completedFuture(ImmutableMap.of("urn", parent.get("owner")));
            }
        }
        // Otherwise, resolve the full CorpUser as usual via the registered DataLoader.
        final DataLoader<String, CorpUser> dataLoader = environment.getDataLoader("corpUserLoader");
        return dataLoader.load((String) parent.get("owner"))
            .thenApply(RecordTemplate::data);
    }
}
```
|
||||
|
||||
**How can I play with these changes?**
|
||||
|
||||
1. Apply changes from this branch locally
|
||||
2. Launch datahub-gms & dependencies & populate with some data as usual
|
||||
3. Launch `datahub-frontend` server using ``cd datahub-frontend/run && ./run-local-frontend``.
|
||||
4. Authenticate yourself at http://localhost:9001 (username: datahub) and extract the PLAY_SESSION cookie that is set in your browser.
|
||||
5. Issue a GraphQL query using CURL or a tool like Postman. For example:
|
||||
```
|
||||
curl --location --request POST 'http://localhost:9001/api/v2/graphql' \
|
||||
--header 'X-RestLi-Protocol-Version: 2.0.0' \
|
||||
--header 'Content-Type: application/json' \
|
||||
--header 'Cookie: PLAY_SESSION=<your-cookie-here>' \
|
||||
--data-raw '{"query":"query datasets($urn: String!) {
|
||||
\n dataset(urn: $urn) {
|
||||
\n urn
|
||||
\n ownership {
|
||||
\n owners {
|
||||
\n owner {
|
||||
\n username
|
||||
\n info {
|
||||
\n manager {
|
||||
\n username
|
||||
\n }
|
||||
\n }
|
||||
\n }
|
||||
\n type
|
||||
\n source {
|
||||
\n type
|
||||
\n url
|
||||
\n }
|
||||
\n }\n }
|
||||
\n platform
|
||||
\n }
|
||||
\n}",
|
||||
"variables":{"urn":"<your dataset urn>"}}'```
|
||||
|
||||
|
||||
@ -1,162 +0,0 @@
|
||||
- Start Date: 1/12/2020
|
||||
- RFC PR: 2055
|
||||
- Implementation PR(s): N/A
|
||||
|
||||
# Proposal to Incubate a new React Application
|
||||
|
||||
## Proposal
|
||||
|
||||
In this document, we propose the incubation of a new React application inside the DataHub repository. ‘Incubation’ implies iterative development by the community over time, as opposed to a big-bang rewrite, which is impractical given the scope of work.
|
||||
|
||||
We’ll begin by outlining the motivations for this proposal, followed by a characterization of the design principles & functional requirements, and conclude with a look at the proposed architecture. We will largely omit specific implementation details from this RFC, which will be left to subsequent RFCs + PRs.
|
||||
|
||||
## Goals
|
||||
|
||||
The goal of this RFC is to get community buy-in on the development of a React app that will exist in parallel to the existing Ember app inside the DataHub repository.
|
||||
|
||||
## Non Goals
|
||||
|
||||
The following are omitted from the scope of this RFC
|
||||
|
||||
- GraphQL server-side implementation (Play Server versus separate server)
|
||||
- Specific React component architecture
|
||||
- Specific tech / tooling choices within React ecosystem (state mgmt, client, etc)
|
||||
|
||||
## Motivation
|
||||
|
||||
The primary motivation behind developing a new React app is improving the reach & accessibility of DataHub. It’s no secret that React is a much more popular technology than Ember by the numbers:
|
||||
|
||||
- React GitHub stars: ~160k
|
||||
- Ember GitHub stars: ~20k
|
||||
|
||||
Adopting a more familiar stack will facilitate an active community by lowering the barrier to contribution, in addition to providing access to a richer ecosystem.
|
||||
|
||||
A secondary motivation is that a new client would allow us to address tech debt present in the existing Ember app, including
|
||||
|
||||
- **Legacy & unused code**: Special handling logic exists to support legacy versions of DataHub (ie. WhereHows). An example of this can be found in [legacy.ts](https://github.com/datahub-project/datahub/blob/master/datahub-web/@datahub/data-models/addon/entity/dataset/utils/legacy.ts). Additionally, there is code that goes unused in the OSS client, such as that pertaining to Dataset [compliance](https://github.com/datahub-project/datahub/blob/master/datahub-web/packages/data-portal/app/utils/datasets/compliance-suggestions.ts). A new client will provide legibility benefits, lacking historical baggage.
|
||||
|
||||
|
||||
- **Difficulty of extension**: Given the lack of formal guidance, steep learning curve for Ember (& the addon structure), & presence of legacy / unused code, it is nontrivial to extend the existing web client.
|
||||
|
||||
|
||||
- **Difficulty of customization**: There is a lack of clear customization levers for modifying the Ember application. Because DataHub is deployed in a variety of different organizations, it would be useful to support customization of
|
||||
- Theme: How it looks (color, ux, assets, copy)
|
||||
- Features: How it behaves (enable / disable features)
|
||||
|
||||
out of the box!
|
||||
|
||||
|
||||
- **Coupling with GMA**: GMA concepts of [entity](https://github.com/datahub-project/datahub/blob/master/datahub-web/@datahub/data-models/addon/entity/base-entity.ts) and [aspect](https://github.com/datahub-project/datahub/blob/master/datahub-web/@datahub/data-models/addon/entity/utils/aspects.ts) are rooted in the Ember client. With the new client, we can revisit the abstractions exposed to the client side & look for opportunities to simplify.
|
||||
|
||||
A clean slate will allow us to address these items, improving the frontend development experience & making community contribution easier.
|
||||
|
||||
It is important to note that we are not proposing deprecation of the Ember client at this time. Maintenance and feature development should be free to continue on Ember as the React app evolves in isolation.
|
||||
|
||||
### Design Principles
|
||||
|
||||
In developing the new application, it is important that we have an agreed-upon set of design principles to guide our decisions.
|
||||
|
||||
Such principles should promote the health of the community (eg. by increasing the likelihood of contribution) & the value proposition of the DataHub product for organizations (eg. by permitting domain-specific modification of the default deployment).
|
||||
|
||||
Specifically, the new client should be
|
||||
|
||||
1. **Extensible**
|
||||
- Modular, composable architecture
|
||||
- Formal guidance on extending the client to support domain-specific needs
|
||||
|
||||
2. **Configurable**
|
||||
- Clear, consistent, & documented levers to alter style & behavior between DataHub deployments
|
||||
- Support injection of custom ‘applets’ or ‘widgets’ where appropriate
|
||||
|
||||
3. **Scalable**
|
||||
- An architecture suited for scale, both along the people & feature dimensions
|
||||
- Easy to contribute!
|
||||
|
||||
These principles should serve as evaluation criteria used by authors & reviewers of application changes.
|
||||
|
||||
|
||||
### Functional Requirements
|
||||
|
||||
#### Near term
|
||||
|
||||
Initially, our goal is to achieve functional parity with the existing Ember frontend for common use-cases. Specifically, the React app should support
|
||||
|
||||
- Authenticating a user
|
||||
- Displaying metadata entities
|
||||
- Updating metadata entities
|
||||
- Browsing metadata entities
|
||||
- Searching metadata entities
|
||||
- Managing a user account
|
||||
|
||||
The finer details of which entities fall into each feature bucket will be dictated by the needs of the community, with the short-term milestone to achieve parity with entities appearing in the Ember client (Datasets, CorpUsers).
|
||||
|
||||
|
||||
#### Long term
|
||||
|
||||
In the longer term, we will work with the community to define a more extensive functional road map, which may include
|
||||
|
||||
- Providing migration pathway from the Ember application to the React application
|
||||
- New entities, aspects, operations (eg. Dashboards, Charts, etc)
|
||||
- Custom, server-driven ‘extensions’ or ‘applets’ to display in the UI
|
||||
- Admin Dashboard
|
||||
- Metrics Collection
|
||||
- Social features
|
||||
& more!
|
||||
|
||||
### Architecture
|
||||
|
||||
The figure below depicts the updated DataHub architecture given this proposal:
|
||||
|
||||

|
||||
|
||||
Where the boxes outlined in green denote newly introduced components.
|
||||
|
||||
Notice that the app will be completely independent of the existing Ember client, meaning there are no compatibility risks for existing deployments. Moreover, the React app will communicate exclusively with a GraphQL server (See [RFC 2042](https://github.com/datahub-project/datahub/pulls?q=is%3Apr+is%3Aclosed) for proposal). This will improve the frontend development experience by providing
|
||||
- a clearly defined API contract
|
||||
- simplified state management (via Apollo GQL client -- no redux required)
|
||||
- auto-generated models for queries and data types
|
||||
|
||||
That’s the extent of the technical specifics we’ll cover for now. Stay tuned for a proof-of-concept PR coming soon that will present an initial React shell.
|
||||
|
||||
## How we teach this
|
||||
|
||||
A major goal of this initiative is to develop a frontend web client that can be easily extended by the DataHub community. Toward that end, we will provide documentation detailing the process of changing the frontend client to do things like:
|
||||
|
||||
- Add a new entity page
|
||||
- Extend an existing entity page
|
||||
- Enable / disable specific features
|
||||
- Modify configurations
|
||||
- Test new components
|
||||
& more!
|
||||
|
||||
|
||||
## Alternatives
|
||||
|
||||
### Evolve the Ember App in place
|
||||
|
||||
*What?*: Iterate on the existing Ember client.
|
||||
|
||||
*Why not?* Firstly, we actually do not consider this to be mutually exclusive with introducing a separate React app. Regardless, there are benefits to adopting a more accessible technology like React that do not change with improvements to the existing Ember app.
|
||||
|
||||
### Mixing Ember & React
|
||||
|
||||
*What?*: Migrate from Ember to React by incrementally replacing Ember components with React components.
|
||||
|
||||
*Why not?*: The intermediate state of a half-react, half-ember app is something we’d rather not think about -- it’s scary & sad. We’d like to avoid degrading client-side developer experience with this type of complexity. Since this migration will take some time, we feel it more productive to iterate independently.
|
||||
|
||||
## Rollout / Adoption Strategy
|
||||
|
||||
As described above, the rollout of the React frontend will be iterative. In the short term, existing deployments will continue using Ember. In the long term, organizations will be free to validate and migrate to the new client at their own pace.
|
||||
|
||||
## Open Questions
|
||||
|
||||
**Can we reuse code from the Ember client?**
|
||||
|
||||
Great Question :) Yes -- we should actively try to extract as much common code as possible from Ember (most likely shared UI components), so long as it conforms to the principles laid out above. This will hopefully speed up the development process and allow for improvements across both clients at the same time.
|
||||
|
||||
**Which GMS entities should appear in the new frontend? Which update operations?**
|
||||
|
||||
This is something we’ll look to the community to help define! Initially, we’ll target functional parity with the Ember app, which today supports
|
||||
|
||||
- reading Dataset & CorpUser
|
||||
- writing certain Dataset aspects (eg. ownership)
|
||||
|
Before Width: | Height: | Size: 274 KiB |
|
Before Width: | Height: | Size: 14 KiB |
@ -1,74 +0,0 @@
|
||||
- Start Date: (fill me in with today's date, 2022-02-22)
|
||||
- RFC PR: https://github.com/datahub-project/datahub/pull/4237
|
||||
- Discussion Issue: (GitHub issue this was discussed in before the RFC, if any)
|
||||
- Implementation PR(s): (leave this empty)
|
||||
|
||||
# Extend data model to model Notebook entity
|
||||
|
||||
## Background
|
||||
[Querybook](https://www.querybook.org/) is Pinterest’s open-source big data IDE via a notebook interface.
|
||||
We (Included Health) leverage it as our main querying tool. It has a feature, DataDoc, which organizes rich text,
|
||||
queries, and charts into a notebook to easily document analyses. People could work collaboratively with others in a
|
||||
DataDoc and get real-time updates. We believe it would be valuable to ingest the DataDoc metadata to Datahub and make
|
||||
it easily searchable and discoverable by others.
|
||||
|
||||
## Summary
|
||||
This RFC proposes the data model used to model the DataDoc entity. It does not cover architecture, APIs, or other
|
||||
implementation details. This RFC only includes the minimum data model which could meet our initial goal. If the community
|
||||
decides to adopt this new entity, further effort is needed.
|
||||
|
||||
## Detailed design
|
||||
|
||||
### DataDoc Model
|
||||

|
||||
|
||||
As shown in the above diagram, DataDoc is a document which contains a list of DataDoc cells. It organizes rich text,
|
||||
queries, and charts into a notebook to easily document analyses. We can see that the DataDoc model is very similar to a
|
||||
Notebook; a DataDoc can be viewed as a subset of a Notebook. Therefore, we are going to model Notebook rather than DataDoc.
|
||||
We will include "subTypes" aspect to differentiate Notebook and DataDoc
|
||||
|
||||
### Notebook Data Model
|
||||
This section talks about the minimum data model of Notebook which could meet our needs.
|
||||
- notebookKey (keyAspect)
|
||||
- notebookTool: The name of the notebook tool, such as Querybook, Notebook, etc.
|
||||
- notebookId: Unique id for the DataDoc
|
||||
- notebookInfo
|
||||
- title(Searchable): The title of this DataDoc
|
||||
- description(Searchable): Detailed description about the DataDoc
|
||||
- lastModified: Captures information about who created/last modified/deleted this DataDoc and when
|
||||
- notebookContent
|
||||
- content: The content of a DataDoc which is composed by a list of DataDocCell
|
||||
- editableDataDocProperties
|
||||
- ownership
|
||||
- status
|
||||
- globalTags
|
||||
- institutionalMemory
|
||||
- browsePaths
|
||||
- domains
|
||||
- subTypes
|
||||
- dataPlatformInstance
|
||||
- glossaryTerms
|
||||
|
||||
### Notebook Cells
|
||||
A Notebook cell is the unit that composes a Notebook. There are three types of cells: Text Cell, Query Cell, and Chart Cell. Each
|
||||
type of cell has its own metadata. Since the cell only lives within a Notebook, we model cells as one aspect of Notebook
|
||||
rather than another entity. Here are the metadata of each type of cell:
|
||||
- TextCell
|
||||
- cellTitle: Title of the cell
|
||||
- cellId: Unique id for the cell.
|
||||
- lastModified: Captures information about who created/last modified/deleted this Notebook cell and when
|
||||
- text: The actual text in a TextCell in a Notebook
|
||||
- QueryCell
|
||||
- cellTitle: Title of the cell
|
||||
- cellId: Unique id for the cell.
|
||||
- lastModified: Captures information about who created/last modified/deleted this Notebook cell and when
|
||||
- rawQuery: Raw query to explain some specific logic in a Notebook
|
||||
- lastExecuted: Captures information about who last executed this query cell and when
|
||||
- ChartCell
|
||||
- cellTitle: Title of the cell
|
||||
- cellId: Unique id for the cell.
|
||||
- lastModified: Captures information about who created/last modified/deleted this Notebook cell and when
|
||||
|
||||
## Future Work
|
||||
Querybook provides an embeddable feature. We could embed a query tab in DataHub that utilizes this embedded feature
|
||||
and provides a search-and-explore experience to users.
|
||||
@ -1,401 +0,0 @@
|
||||
# RBAC: Fine-grained Access Controls in DataHub
|
||||
|
||||
## Abstract
|
||||
|
||||
Access control is about managing what operations can be performed by whom. There are 2 broad buckets comprising access control:
|
||||
|
||||
- **Authentication**: Logging in. Associating an actor with a known identity.
|
||||
- **Authorization**: Performing an action. Allowing / denying known identities to perform specific types of operations.
|
||||
|
||||
Over the past few months, numerous requests have surfaced around controlling access to metadata stored inside DataHub.
|
||||
In this doc, we will propose a design for supporting pluggable authentication along with fine-grained authorization within DataHub's backend (GMS).
|
||||
|
||||
## Requirements
|
||||
|
||||
We will cover the use cases around access control in this section, gathered from a multitude of sources.
|
||||
|
||||
### Personas
|
||||
|
||||
This feature is targeted primarily at the DataHub **Operator** & **Admin** personas (often the same person). This feature can help admins of DataHub comply with their respective company policies.
|
||||
|
||||
The secondary beneficiaries are **Data Users** themselves. Fine-grained access controls will permit Data Owners and Data Stewards to more tightly control the evolution of the metadata under management. It will also make it more difficult to make mistakes while changing metadata, such as accidentally overriding or removing good metadata authored by someone else.
|
||||
|
||||
### Community Asks
|
||||
|
||||
Sheetal Pratik (Saxo Bank)
|
||||
|
||||
**Asks**
|
||||
|
||||
- Model metadata "domains" (ie. resource scopes, namespaces) using DataHub
|
||||
- Define access policies that are scoped to a particular domain
|
||||
- Ability to define policies against DataHub resources at the following granularities:
|
||||
- individual resource (primary key based)
|
||||
- resource type (eg. all 'datasets')
|
||||
- action (eg. VIEW, UPDATE)
|
||||
which can be associated with requests against DataHub backend via mapping from resolved Actor information (principal / username, groups, etc).
|
||||
Resources can include entities, their aspects, access policies, etc.
|
||||
- Ability to compose & reuse groups of access policies.
|
||||
- Support for integrating with Active Directory (users, groups, and mappings to access policies)
|
||||
|
||||
Alasdair McBride (G-Research)
|
||||
|
||||
**Asks**
|
||||
|
||||
- Ability to organize multiple assets into groups and assign bucketed policies to these groups.
|
||||
- Ability to define READ / UPDATE / DELETE policies against DataHub resources at the following granularities:
|
||||
- individual resource (primary key based)
|
||||
- resource type (eg. all 'datasets')
|
||||
- resource group
|
||||
which can be associated with requests against DataHub backend via mapping from resolved Actor information (principal / username, groups, etc).
|
||||
Resources can include entities, their aspects, roles, policies, etc.
|
||||
- Support for service principals
|
||||
- Support for integrating with Active Directory (users, groups, and mappings to access policies)
|
||||
|
||||
|
||||
As you may have noticed, the concepts of "domain" and "group" described in each set of requirements are quite similar. From here
|
||||
on out, we will refer to a bucket of related entities that should be managed together as a metadata "domain".
|
||||
|
||||
### User Stories
|
||||
|
||||
|As a... |I want to.. |Because.. |
|
||||
|-----------------|----------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
|DataHub Operator |Restrict the types of metadata that certain teams / individuals can change. |Reduce the chances of mistakes or malicious changes to metadata. Improve quality of metadata by putting it in the hands of the most knowledgeable|
|
||||
|DataHub Operator |Restrict the types of metadata that certain teams / individuals can view. |Reduce the risk of falling out of compliance by displaying sensitive data in the Metadata UI (sample data values & beyond) |
|
||||
|DataHub Operator |Grant the ability to manage access policies to other users of DataHub. |I want to delegate this task to individual team managers. (Large org) |
|
||||
|DataHub Operator |Define bounded contexts, or "domains", of related metadata that can be access controlled together |I want to empower teams with most domain knowledge to manage their own access controls. |
|
||||
|DataHub Operator |Map users & groups from 3rd party identity providers to resolved access policies |I want to reuse the identity definitions that my organization already has |
|
||||
|DataHub Operator |Create identities for services and associate them with policies. (service principals) |I want to access DataHub programmatically while honoring with restricted access controls. |
|
||||
|DataHub User |Update Metadata that I know intimately. For example, table descriptions. |I want to provide high-quality metadata to my consumers. |
|
||||
|
||||
|
||||
### Concrete Requirements
|
||||
|
||||
#### Must Haves
|
||||
|
||||
a. a central notion of "authenticated user" in the DataHub backend (GMS).
|
||||
|
||||
b. pluggable authentication responsible for resolving DataHub users
|
||||
|
||||
- in scope: file-based username password plugin (for built-in roles), continue to support OIDC
|
||||
- in the future: saml, ldap / ad, api key, native authentication plugins
|
||||
|
||||
c. ability to define fine-grained access control policies based on a combination of
|
||||
|
||||
- actors: the users + groups the policy should be applied to (with ability to specify "all users" or "all groups")
|
||||
- resource type: the type of resource being accessed on the DataHub platform (eg. dataset entity, dataset aspect, roles, privileges etc) (exact match or ALL)
|
||||
- resource identifier: the primary key identifier for a resource (eg. dataset urn) (support for pattern matching)
|
||||
- action (bound to resource type. eg. read + write)
|
||||
- [in the future] domains
|
||||
|
||||
with support for optional conjunctions of filtering on resource type, & identifier (eg. resource type = "entity:dataset:ownership", resource identifier = "urn:li:dataset:1", action = "UPDATE")
|
||||
and with support for the following resource types:
|
||||
|
||||
- metadata entities: datasets, charts, dashboards, etc.
|
||||
- metadata aspects: dataset ownership, chart info, etc.
|
||||
- access control objects: access policies, etc.
|
||||
|
||||
d. ability to resolve DataHub users to a set of access policies
|
||||
|
||||
- where User metadata includes principal name, group names, freeform string properties
|
||||
|
||||
e. ability to manage access policies programmatically via Rest API
|
||||
|
||||
f. ability to enforce fine-grained access control policies (ref.b) (Authorizer implementation)
|
||||
- Inputs: resolved access policies, resource type, resource key
|
||||
|
||||
#### Nice to Haves
|
||||
|
||||
a. policies that are tied to arbitrary attributes of a target resource object. (full ABAC)
|
||||
|
||||
b. ability to manage access policies via React UI
|
||||
|
||||
c. domain-partitioned access controls (assigning domains to all DH assets + then allowing policies including domain-based predicates)
|
||||
|
||||
### What success looks like
|
||||
|
||||
Based on the requirements gathered from talking with folks in the community, we decided to rally around the following goal. It should be possible to
|
||||
|
||||
1. Define a named access control policy
|
||||
- Resource Granularity: individual, asset type
|
||||
- Action Granularity: VIEW, UPDATE
|
||||
against an individual or group of DataHub resources (entities, aspects, roles, policies)
|
||||
2. Define mapping conditions from an authenticated user (DataHub user, groups) to one or more access policies
|
||||
|
||||
Within 15 minutes or less.
|
||||
|
||||
|
||||
## Implementation
|
||||
|
||||
This section will outline the technical solution proposed to address the stated requirements.
|
||||
|
||||
### In Scope
|
||||
|
||||
- Pluggable **Authentication** at GMS layer.
|
||||
- **Access Management** at GMS layer.
|
||||
- **Authorization** at GMS layer.
|
||||
|
||||
#### API-based Role Management
|
||||
|
||||
We aim to provide a rich API for defining access control policies. A default admin policy will be assigned to the `datahub` account.
|
||||
New users will be automatically assigned to a configurable "default" policy.
|
||||
|
||||
### Out of Scope
|
||||
|
||||
#### UI-based Role Management
|
||||
|
||||
Eventually, we aim to provide an in-app experience for defining access policies. This, however, is not in scope of the first milestone deliverable.
|
||||
|
||||
#### Support for Dynamic Local Username / Password Authentication
|
||||
|
||||
Initially, we aim to support limited local username / password authentication driven by a configuration file provided to GMS. We will not support persisting sessions, hashed passwords, groups to a native store inside DataHub (yet).
|
||||
|
||||
#### Support for LDAP & AD Username / Password Authentication
|
||||
|
||||
Though the APIs we are building *will* be amenable to supporting both Active Directory and LDAP authentication (discussed more below) we will not include implementation of these plugins as part of the scope of the initial impl, as we will use this as an opportunity to focus on getting the foundational aspects of access management right.
|
||||
|
||||
#### Modeling Domains in DataHub
|
||||
|
||||
As part of *this* particular initiative, we will omit from scope implementation of the domains, or sub-scopes / namespaces
|
||||
tied to resources on DataHub. However, we aim to design a system which can accommodate policies based on domain
|
||||
predicates in the future.
|
||||
|
||||
### Concepts
|
||||
|
||||
We propose the introduction of the following concepts into the DataHub platform.
|
||||
|
||||
1. **Actor**: A user or system actor recognized by DataHub. Defined by a unique **principal** name & an optional set of *group* names. In practice, an authenticated actor will be identified via a CorpUser urn. (`urn:li:corpuser:johndoe`)
|
||||
1. **Principal**: A unique username associated with an actor. (Captured via a CorpUser urn)
|
||||
2. **Groups**: A set of groups that a user belongs to. (Captured via CorpGroup urns)
|
||||
2. **Resource**: Any resource that can be access controlled on the DataHub platform. Examples include Entities, Relationships, Roles, etc. Resources can include
|
||||
- Type: the unique type of the resource on DataHub's platform.
|
||||
3. **Policy**: A fine-grained access control rule comprised of target actors, resource type, a resource reference, and an action (specific to a resource type, eg. Read, Read / write)
|
||||
- Actors: Who the policy applies to (users + groups)
|
||||
- Action: CREATE, READ, UPDATE, DELETE
|
||||
- Match Criteria: resource type, reference filter
|
||||
|
||||
### Components
|
||||
|
||||
#### DataHub Backend (datahub-gms)
|
||||
|
||||
GMS will be augmented to include
|
||||
|
||||
1. a set of Auth-related primary store tables. (SQL)
|
||||
2. a set of Auth-related Rest APIs.
|
||||
3. an Authentication Filter executed on each request to GMS.
|
||||
4. an Authorizer component executed within endpoints to authorize particular actions.
|
||||
|
||||
**Auth Tables & Endpoints**
|
||||
|
||||
1. *Policies*: Create, Read, Update fine-grained access policies.
|
||||
|
||||
```
|
||||
// Create a policy.
|
||||
POST /gms/policy
|
||||
|
||||
{
|
||||
name: "manage_datasets_msd",
|
||||
users: ["urn:li:corpuser:johndoe", "urn:li:corpuser:test"],
|
||||
groups: ["urn:li:corpGroup:eng_all"],
|
||||
actions: ["VIEW_AND_UPDATE"],
|
||||
resource: {
|
||||
type: "ENTITY",
|
||||
attributes: {
|
||||
entity: "dataset",
|
||||
urn: ["*"],
|
||||
}
|
||||
},
|
||||
// optional, defaults to "true"
|
||||
allow: "true"
|
||||
}
|
||||
```
|
||||
|
||||
In the above example, we are creating an access policy that permits reads & writes against all
|
||||
"ownership" aspect of the "dataset" entity. There are a few important pieces to note:
|
||||
|
||||
1. Name - All policies are named
|
||||
2. Users / Groups - The users and groups the policy should apply to. Can be wildcard for all.
|
||||
3. Action - The action to be permitted or denied. We will initially ship with ("VIEW", "VIEW_AND_UPDATE")
|
||||
4. Resource - The resource filters. The resource that the action is being requested against. Examples can be specific metadata assets,
|
||||
policies, operator stats and more.
|
||||
5. Allow - A flag determining whether to allow the action on a match of user / group, action, and resource filters.
|
||||
|
||||
Notice the use of a resource type along with resource-type-specific attributes. These
|
||||
attributes will serve as matching criteria for resource specifications passed into an Authorizer component at runtime.
|
||||
Also note that policy attribute fields will support wildcard matching.
|
||||
|
||||
The attributes section of the policies provides a mechanism for extension in the future. For example, adding a "domain" qualification
|
||||
to a resource, and defining policies that leverage the domain attribute would be simply a matter of adding to the resource attributes.
|
||||
|
||||
An additional "allow" flag will be supported to determine whether the action against the specified resource should be allowed or denied.
|
||||
This will default to "true", meaning that the action should be permitted given that the actor, action, and resource types match the policy.
|
||||
|
||||
At authorization time, users will be resolved to policies by matching the user + groups specified in the policy against the authenticated user.
|
||||
Moreover, resource specs will be constructed by the code invoking the authorization component and matched against the resource
|
||||
filters defined within the policies.
|
||||
|
||||
2. *Tokens*:
|
||||
|
||||
Tokens are used for **authentication** to GMS. They can be retrieved given authentication via another means, such as username / password auth.
|
||||
|
||||
- `/generateTokenForActor`: Generates a signed, OAuth-compliant GMS access token + refresh token pair based on a **provided principal, groups, and metadata**. The caller must be authorized to use this functionality.
|
||||
- `/generateToken`: Generates a signed, OAuth-compliant GMS access token + refresh token pair **based on the currently authenticated actor**.
|
||||
|
||||
**Auth Filter**
|
||||
|
||||
The auth filter will be a configurable Rest filter that executes on each request to GMS.
|
||||
|
||||
Responsibility 1: Authentication
|
||||
|
||||
*Authenticator Chain*
|
||||
|
||||
Inside the filter will live a configurable chain of "Authenticators" that will be executed in sequence with the goal of resolving a standardized "Actor" object model, which will contain the following fields:
|
||||
|
||||
1. `principal` (required): a unique identifier used on DataHub, represented as a CorpUser urn
|
||||
2. `groups` (optional): a list of groups associated with the user, represented as a set of CorpGroup urns
|
||||
|
||||
Upon resolution of an "Actor" object, the authentication stage will be considered complete.
|
||||
|
||||
Responsibility 2: Saving to Thread Context
|
||||
|
||||
After resolving the authenticated user, the state of the Actor object will be written to the local ThreadContext, from which it will be retrieved to perform Authorization.
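
As an illustrative sketch only (none of these class names or shapes are final), the authenticator chain and thread-local context could look roughly like this:

```java
import java.util.List;
import java.util.Set;

import javax.servlet.http.HttpServletRequest;

/** The resolved identity produced by the authentication stage. */
public class Actor {
    private final String principal;   // CorpUser urn, e.g. urn:li:corpuser:johndoe
    private final Set<String> groups; // CorpGroup urns

    public Actor(String principal, Set<String> groups) {
        this.principal = principal;
        this.groups = groups;
    }

    public String getPrincipal() { return principal; }
    public Set<String> getGroups() { return groups; }
}

/** A single link in the configurable authenticator chain. */
interface Authenticator {
    /** Returns a resolved Actor, or null if this authenticator cannot handle the request. */
    Actor authenticate(HttpServletRequest request);
}

/** Executed by the auth filter on every request to GMS. */
class AuthenticatorChain {
    private static final ThreadLocal<Actor> CONTEXT = new ThreadLocal<>();

    private final List<Authenticator> authenticators;

    AuthenticatorChain(List<Authenticator> authenticators) {
        this.authenticators = authenticators;
    }

    /** Responsibility 1: resolve an Actor. Responsibility 2: save it to the thread context. */
    void authenticate(HttpServletRequest request) {
        for (Authenticator authenticator : authenticators) {
            final Actor actor = authenticator.authenticate(request);
            if (actor != null) {
                CONTEXT.set(actor);
                return;
            }
        }
        throw new RuntimeException("Unauthorized: no authenticator could resolve an Actor");
    }

    /** Retrieved later by the Authorizer to evaluate policies for the current request. */
    static Actor currentActor() {
        return CONTEXT.get();
    }
}
```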
|
||||
|
||||
**Authorizer**
|
||||
|
||||
The authorizer is a component that will be called by endpoints + services internal to GMS in order to authorize a particular action, e.g. editing an entity, relationship, or permission.
|
||||
|
||||
It will accept the following arguments:
|
||||
|
||||
1. The resource spec:
|
||||
- resource type
|
||||
- resource attributes
|
||||
2. The action being attempted on the resource
|
||||
3. The actor attempting the action
|
||||
|
||||
and perform the following steps:
|
||||
|
||||
1. Resolve the Actor to a set of relevant access policies
|
||||
2. Evaluate the fetched policies against the inputs
|
||||
3. If the Actor is authorized to perform the action, allow the action.
|
||||
4. If the Actor is not authorized to perform the action, deny the action.
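
For illustration only, an Authorizer along these lines might look as follows. The `ResourceSpec`, `Policy`, and `PolicyStore` shapes are hypothetical and exist purely for the sketch (it also reuses the `Actor` sketch from the authentication section); they are not part of the proposal's concrete API.

```java
import java.util.List;
import java.util.Map;

/** Describes the resource an action is being attempted against. */
class ResourceSpec {
    private final String type;                    // e.g. "ENTITY"
    private final Map<String, String> attributes; // e.g. entity=dataset, urn=<dataset urn>

    ResourceSpec(String type, Map<String, String> attributes) {
        this.type = type;
        this.attributes = attributes;
    }

    String getType() { return type; }
    Map<String, String> getAttributes() { return attributes; }
}

/** Hypothetical view of a stored policy (see the POST /gms/policy example above). */
interface Policy {
    boolean matches(String action, ResourceSpec resource);
    boolean isAllow();
}

/** Hypothetical accessor over the stored policies. */
interface PolicyStore {
    List<Policy> getPoliciesForActor(Actor actor);
}

interface Authorizer {
    /** Returns true if the actor may perform the action against the described resource. */
    boolean isAuthorized(Actor actor, String action, ResourceSpec resource);
}

/** Default implementation backed by the stored policies described above. */
class PolicyBasedAuthorizer implements Authorizer {
    private final PolicyStore policyStore;

    PolicyBasedAuthorizer(PolicyStore policyStore) {
        this.policyStore = policyStore;
    }

    @Override
    public boolean isAuthorized(Actor actor, String action, ResourceSpec resource) {
        // 1. Resolve the Actor (principal + groups) to the set of relevant access policies.
        final List<Policy> policies = policyStore.getPoliciesForActor(actor);
        // 2. Evaluate each policy's action & resource filters against the inputs.
        for (Policy policy : policies) {
            if (policy.matches(action, resource)) {
                return policy.isAllow();
            }
        }
        // 3. Deny when no policy allows the action.
        return false;
    }
}
```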
|
||||
|
||||

|
||||
|
||||
The authorizer will additionally be designed to support multiple authorizer filters in a single authorizer chain.
|
||||
This permits the addition of custom authorization logic in the future, for example for resolving "virtual policies" based on
|
||||
edges in the metadata graph (discussed further below)
|
||||
|
||||
#### DataHub Frontend (datahub-frontend)
|
||||
|
||||
DataHub frontend will continue to handle much of the heavy lifting when it comes to OIDC SSO for the time being. However, the specific details of both OIDC and username / password authentication will be slightly different going forward.
|
||||
|
||||
|
||||
##### Case 1: OIDC
|
||||
|
||||
DataHub frontend will continue to handle OIDC authentication by performing redirects to the Identity Provider and handling the callback from the Identity Provider for backwards compatibility. What occurs after authentication on the Identity Provider is what will change.
|
||||
|
||||
After successful authentication with an IdP, DataHub frontend will perform the following steps on `/callback` :
|
||||
|
||||
1. Contact a protected "generateTokenForUser" endpoint exposed by GMS to generate an access token and refresh token from a principal & set of groups extracted from the IdP UserInfo. In this call, `datahub-frontend` will identify itself using a service principal that will come preconfigured in GMS, allowing it the ability to generate a token on behalf of a user on demand. In this world, `datahub-frontend` is considered a highly trusted party by GMS.
|
||||
2. Set the access + refresh tokens in cookies returned to the UI client.
|
||||
|
||||
For all subsequent calls, `datahub-frontend` will be expected to validate the authenticity of the GMS-issued access token using a public key provided in its configuration. This public key must match the private key that GMS uses to generate the original access token.
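
For illustration, token validation on the `datahub-frontend` side could look roughly like the following. This sketch assumes an RSA-signed JWT and the `com.auth0:java-jwt` library; the actual token format, signing scheme, and claim names are not decided in this RFC.

```java
import java.security.interfaces.RSAPublicKey;

import com.auth0.jwt.JWT;
import com.auth0.jwt.JWTVerifier;
import com.auth0.jwt.algorithms.Algorithm;
import com.auth0.jwt.interfaces.DecodedJWT;

public class GmsTokenValidator {
    private final JWTVerifier verifier;

    public GmsTokenValidator(RSAPublicKey gmsPublicKey) {
        // The public key comes from datahub-frontend configuration and must correspond
        // to the private key GMS uses to sign access tokens.
        this.verifier = JWT.require(Algorithm.RSA256(gmsPublicKey, null)).build();
    }

    /** Verifies the signature & expiry of a GMS-issued access token and returns its principal. */
    public String validate(String accessToken) {
        DecodedJWT decoded = verifier.verify(accessToken); // throws JWTVerificationException if invalid or expired
        return decoded.getClaim("principal").asString();   // the "principal" claim name is an assumption
    }
}
```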
|
||||
|
||||
Upon expiration of the access token, `datahub-frontend` will be responsible for fetching a new access token from GMS and updating client side cookies. (In the datahub-frontend auth stage)
|
||||
|
||||

|
||||
|
||||
##### Case 2: Username / Password
|
||||
|
||||
In the case of username / password authentication, `datahub-frontend` will do something new: it will call a "generateToken" endpoint in GMS with a special Authorization header containing basic authentication - the username and password provided by the user on the UI.
|
||||
|
||||
This endpoint will validate the username / password combination using an **Authenticator** (by default one will exist to validate the "datahub" super user account) and return a pair of access token, refresh token to datahub-frontend. DataHub frontend will then set these as cookies in the UI and validate them using the same mechanism as discussed previously.
|
||||
|
||||
This allows us to evolve GMS to include LDAP, AD, and native username / password Authenticators while keeping **datahub-frontend** the same.
|
||||
|
||||

|
||||
|
||||
In the future, we will be able to easily add support for remote username / password authentication, such as using an LDAP / AD directory service. The call flow in such cases is shown below
|
||||
|
||||

|
||||
|
||||
##### Summary
|
||||
|
||||
On login, datahub-frontend will *always* call GMS to get an access token + refresh token. These will then serve as the credentials for both datahub-frontend, who will do lightweight validation on the token, and GMS who will handle authorization based on the principal + groups associated with the token.
|
||||
|
||||
It is the intention that GMS eventually take on the heavy lifting of *all* authentication, including OIDC and SAML authentication, which both require a UI component. It will have a set of APIs that `datahub-frontend` will be able to use to perform the correct SSO redirects, along with validation endpoints for creating GMS tokens on successful login.
|
||||
|
||||
Because this means implementing the OpenID Connect (OIDC) specification at the GMS layer, as well as adding a host of new APIs between datahub-frontend and GMS, we've decided to delay moving full OIDC responsibility to GMS at this time. This will be part of a follow-up phase 2 milestone on the auth track.
|
||||
|
||||
In the future, we imagine 3 cases that the `frontend` server will have to handle in different ways:
|
||||
|
||||
- OIDC
|
||||
- SAML
|
||||
- username / password
|
||||
|
||||
In contrast, GMS will have to know about the finer details of each, for example the ability to authenticate usernames and passwords using LDAP/AD, native (local db), or file-based credential stores.
|
||||
|
||||
### Milestones

**Milestone 1**

- The first version of all components described above is implemented, with basic authenticators for:
  - GMS-issued OAuth tokens
  - file-based username / password combinations
- Support Entity-level policy granularity (no aspect level, yet)
- Support enforcement of write-side policies (VIEW_AND_UPDATE actions)
- Support a "default" access control policy with limited customization capability (allow VIEW on all resources as the default access control policy)
- Configurable ownership-based authorizer

**Milestone 2**

- Support enforcement of read-side policies (VIEW actions)
- Add support for associating each resource stored within DataHub with a particular domain. Permit domain-based predicates in policies.
- Add UI for managing access policies

**Milestone 3**

- OpenLDAP authenticator impl
- Active Directory authenticator impl
- SAML authenticator impl
## Bonus 1: Modeling - To Graph or Not

### Risks with modeling on the Graph

- No strong read-after-write consistency based on non-primary keys
  - We don't think this will be common, given that the name is the primary key and likely what folks will query by.
- Philosophy: Are "policies" really *metadata*? Do they belong on the metadata graph?
  - Should policies be retrievable via the "/entities" endpoint?
  - Should there be a separate, internal DataHub system graph, stored in a separate MySQL table?
- Query pattern: Is the query pattern drastically different from that of other entities? Policies will often be "read all" and cached.
- If we need to migrate away from the graph, we'd need to do a data migration from the aspects table to another table.
### Benefits of modeling on the Graph

- Can reuse existing APIs for searching, fetching, creating, etc.
- Less boilerplate code. No hardcoded tables. We will still want a specific REST API for managing policies, however.
## Bonus 2: The Ownership Question

The ownership question: How can we support the requirement that "owners should be allowed to edit the resources they own"? We can do so using pluggable Authorizers.

Possible solution: a pluggable Authorizer chain.

![]()

We plan to implement an "Ownership Authorizer" that is responsible for resolving an entity to its ownership information and generating a "virtual policy" that allows asset owners to make changes to the entities they own.
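
A minimal sketch of what such a chain could look like is below; the function names, the "first allow wins" rule, and the ownership lookup are illustrative assumptions rather than the final design.

```python
from typing import Callable, List, Set

# (principal URN, action, resource URN) -> allowed?
Authorizer = Callable[[str, str, str], bool]


def lookup_owners(resource_urn: str) -> Set[str]:
    """Hypothetical ownership lookup; a real implementation would resolve this via GMS."""
    return {"urn:li:corpuser:datahub"}


def ownership_authorizer(principal: str, action: str, resource_urn: str) -> bool:
    """The "virtual policy": owners may update the entities they own."""
    return action == "UPDATE" and principal in lookup_owners(resource_urn)


def policy_authorizer(principal: str, action: str, resource_urn: str) -> bool:
    """Placeholder for evaluation of the stored access policies (not shown here)."""
    return False


def is_authorized(chain: List[Authorizer], principal: str, action: str, resource_urn: str) -> bool:
    # The request is allowed if any authorizer in the chain allows it.
    return any(authorizer(principal, action, resource_urn) for authorizer in chain)
```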

An alternate solution is to permit an additional flag inside a policy that marks it as applying to "owners" of the target resource. The challenge here is that not all resource types are guaranteed to have owners in the first place.
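
Purely for illustration, such a flag might be expressed as an extra field on the policy itself; the field names here are hypothetical and not part of the proposal.

```python
# Hypothetical policy payload with an "applies to owners" flag; field names are made up.
edit_own_entities_policy = {
    "actions": ["VIEW_AND_UPDATE"],
    "resourceType": "dataset",
    "appliesToResourceOwners": True,  # grants the actions to whoever owns the resource
}
```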
## References

In the process of writing this ERD, I researched the following systems to learn from and take inspiration:

- Elasticsearch
- Pinot
- Airflow
- Apache Atlas
- Apache Ranger
@ -1,190 +0,0 @@
- Start Date: 2021-02-17
- RFC PR: https://github.com/datahub-project/datahub/pull/2112
- Discussion Issue: (GitHub issue this was discussed in before the RFC, if any)
- Implementation PR(s): (leave this empty)
# Tags

## Summary

We suggest a generic, global tagging solution for DataHub. As the solution is quite generic and flexible, it can hopefully also serve as a stepping stone for new, cool features in the future.
## Motivation

Currently some entities, such as Datasets, can be tagged using strings, but unfortunately this solution is quite limited.

A general tag implementation will allow us to define and attach a new and simple type of metadata to all types of entities. As tags would be defined globally, tagging multiple objects with the same tag gives us the ability to define and search based on a new kind of relationship, for example which datasets and ML models are tagged as containing PII data. This allows for describing relationships between objects that would otherwise not have a direct lineage relationship. Moreover, tags would lower the bar for adding simple metadata to any object in the DataHub instance and open the door to crowd-sourcing metadata. Remembering that tags themselves are entities, it would also be possible to tag tags, enabling a hierarchy of sorts.

The solution is meant to be quite generic and flexible, and we're not trying to be too opinionated about how a user should use the feature. We hope that this initial generic solution can serve as a stepping stone for cool features in the future.
## Requirements

- Ability to associate tags with any type of entity, even other tags!
- Ability to tag the same entity with multiple tags.
- Ability to tag multiple objects with the same tag instance.
- To the point above, ability to make easy tag-based searches later on.
- Metadata on tags is TBD
### Extensibility

The normal new-entity-onboarding work is obviously required.

Hopefully this can serve as a stepping stone to work on special cases such as the tag-based privacy tagging mentioned in the roadmap.
## Non-Requirements

Let's leave the UI work required for this to another time.
## Detailed design

We want to introduce some new models under `datahub/metadata-models/src/main/pegasus/com/linkedin/common/`.
### `Tag` entity

First we create a `TagMetadata` entity, which defines the actual tag object.

The edit property defines the edit rights of the tag, as some tags (like sensitivity tags) should be read-only for a majority of users.

```
/**
 * Tag information
 */
record TagMetadata {
  /**
   * Tag URN, e.g. urn:li:tag:<name>
   */
  urn: TagUrn

  /**
   * Tag value.
   */
  value: string

  /**
   * Optional tag description
   */
  description: optional string

  /**
   * Audit stamp associated with creation of this tag
   */
  createStamp: AuditStamp
}
```
### `TagAttachment`

We define a `TagAttachment` model, which describes the application of a tag to an entity.

```
/**
 * Tag attachment information
 */
record TagAttachment {

  /**
   * Tag in question
   */
  tag: TagUrn

  /**
   * Who has edit rights to this tag attachment.
   * WIP, pending access-control support in DataHub.
   * Relevant for privacy tags at least.
   * We might also want to add view rights?
   */
  edit: union[None, any, role-urn]

  /**
   * Audit stamp associated with the attachment of this tag to this entity
   */
  attachmentStamp: AuditStamp
}
```
### `Tags` container

Then we define a `Tags` aspect, which is used as a container for tag attachments.

```
namespace com.linkedin.common

/**
 * Tags information
 */
record Tags {

  /**
   * List of tag attachments
   */
  elements: array[TagAttachment] = [ ]
}
```

This can easily be taken into use by all entities that we want to be able to tag, e.g. `Datasets`. As we see a lot of potential in tagging individual dataset fields as well, we can either add a reference to a `Tags` object in the `SchemaField` object, or alternatively create a new `DatasetFieldTags`, similar to `DatasetFieldMapping`.
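
For illustration only, a serialized `Tags` aspect attached to a dataset might look like the example below; the URNs are made up, the `edit` field is omitted because it is still WIP, and the exact JSON shape would follow from the Pegasus records above.

```python
# Hypothetical serialized form of the Tags aspect defined above; all values are illustrative.
example_tags_aspect = {
    "elements": [
        {
            "tag": "urn:li:tag:PII",
            "attachmentStamp": {"time": 1613520000000, "actor": "urn:li:corpuser:jdoe"},
        },
        {
            "tag": "urn:li:tag:Deprecated",
            "attachmentStamp": {"time": 1613520000000, "actor": "urn:li:corpuser:jdoe"},
        },
    ]
}
```

The same structure could be referenced from `SchemaField` (or a new `DatasetFieldTags`) to tag individual dataset fields.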
## How we teach this

We should create/update user guides to educate users on:

- Suggestions on how to use tags: low-threshold metadata addition, and the possibility of doing new types of searches
## Drawbacks

This is definitely more complex than just adding strings to an array.
## Alternatives

An array of strings is a simpler solution, but it does not allow for the same functionality as suggested here.

Another alternative would be to simplify the models by removing some of the metadata in the `TagMetadata` and `TagAttachment` entities, such as the edit/view permission field, the audit stamps, and the descriptions.

Apache Atlas uses a similar approach. It requires you to create a Tag instance before it can be associated with an "asset", and the attachment is done using a dropdown list. The tags can also have attributes and a description. See [here](https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.5.3/bk_data-governance/content/ch_working_with_atlas_tags.html) for an example. The tags are a central piece of the UI and are readily searchable, as easily as datasets.

Atlas also has a concept very closely related to tags, called _classification_. Classifications are similar to tags in that they need to be created separately, can have attributes (but no description?), and attaching them to assets is done using a dropdown list. Classifications have the added functionality of propagation, which means that they are automatically applied to downstream assets, unless specifically set not to do so. Any change to a classification (say an attribute change) also flows downstream, and in downstream assets you're able to see where the classification propagated from. See [here](https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/using-atlas/content/propagate_classifications_to_derived_entities.html) for an example.
## Rollout / Adoption Strategy

Using the functionality is optional and does not break other functionality as is. The solution is generic enough that users can easily adopt it; it can be taken into use like any other entity and aspect.
## Future Work

- Add `Tags` to aspects for entities.
- Implement relationship builders as needed.
- The implementation of and need for access control on tags is an open question.
- As this is first and foremost a tool for discovery, the UI work is extensive:
  - Creating tags in a way that makes duplication and spelling mistakes difficult.
  - Attaching tags to entities: autocomplete, dropdown, etc.
  - Visualizing existing tags, and which are most popular?
- Explore the idea of a special "classification" type that propagates downstream, as in Atlas.
## Unresolved questions

- How do we want to map dataset fields to tags?
- Do we want to implement edit/view rights?
@ -1,93 +0,0 @@
- Start Date: (fill me in with today's date, YYYY-MM-DD)
- RFC PR: (after opening the RFC PR, update this with a link to it and update the file name)
- Discussion Issue: (GitHub issue this was discussed in before the RFC, if any)
- Implementation PR(s): (leave this empty)

# <RFC title>

## Summary

> Brief explanation of the feature.

## Basic example

> If the proposal involves a new or changed API, include a basic code example. Omit this section if it's not applicable.

## Motivation

> Why are we doing this? What use cases does it support? What is the expected outcome?
>
> Please focus on explaining the motivation so that if this RFC is not accepted, the motivation could be used to develop
> alternative solutions. In other words, enumerate the constraints you are trying to solve without coupling them too
> closely to the solution you have in mind.

## Requirements

> What specific requirements does your design need to meet? This should ideally be a bulleted list of items you wish
> to achieve with your design. This can help everyone involved (including yourself!) make sure your design is robust
> enough to meet these requirements.
>
> Once everyone has agreed upon the set of requirements for your design, we can use this list to review the detailed
> design.

### Extensibility

> Please also call out extensibility requirements. Is this proposal meant to be extended in the future? Are you adding
> a new API or set of models that others can build on later? Please list these concerns here as well.

## Non-Requirements

> Call out things you don't want to discuss in detail during this review here, to help focus the conversation. This can
> include things you may build in the future based off this design, but don't wish to discuss in detail, in which case
> it may also be wise to explicitly list that extensibility in your design is a requirement.
>
> This list can be high level and not detailed. It is to help focus the conversation on what you want to focus on.

## Detailed design

> This is the bulk of the RFC.

> Explain the design in enough detail for somebody familiar with the framework to understand, and for somebody familiar
> with the implementation to implement. This should get into specifics and corner-cases, and include examples of how the
> feature is used. Any new terminology should be defined here.

## How we teach this

> What names and terminology work best for these concepts and why? How is this idea best presented? As a continuation
> of existing DataHub patterns, or as a wholly new one?

> What audience or audiences would be impacted by this change? Just DataHub backend developers? Frontend developers?
> Users of the DataHub application itself?

> Would the acceptance of this proposal mean the DataHub guides must be re-organized or altered? Does it change how
> DataHub is taught to new users at any level?

> How should this feature be introduced and taught to existing audiences?

## Drawbacks

> Why should we *not* do this? Please consider the impact on teaching DataHub, on the integration of this feature with
> other existing and planned features, on the impact of the API churn on existing apps, etc.

> There are tradeoffs to choosing any path; please attempt to identify them here.

## Alternatives

> What other designs have been considered? What is the impact of not doing this?

> This section could also include prior art, that is, how other frameworks in the same domain have solved this problem.

## Rollout / Adoption Strategy

> If we implemented this proposal, how will existing users / developers adopt it? Is it a breaking change? Can we write
> automatic refactoring / migration tools? Can we provide a runtime adapter library for the original API it replaces?

## Future Work

> Describe any future projects, at a very high level, that will build off this proposal. This does not need to be
> exhaustive, nor does it need to be anything you work on. It just helps reviewers see how this can be used in the
> future, so they can help ensure your design is flexible enough.

## Unresolved questions

> Optional, but suggested for first drafts. What parts of the design are still TBD?