# Metadata Service Authentication

## Introduction

We recently introduced Authentication in the Metadata Service layer. This document provides a technical overview of the feature, aimed at developers
evaluating or operating DataHub. It covers the motivations for the feature, the key components in its design, the new capabilities it provides, and configuration instructions.

## Background

Let's recall 2 critical components of DataHub's architecture:

- **DataHub Frontend Proxy** (datahub-frontend) - Resource server that routes requests to the downstream Metadata Service.
- **DataHub Metadata Service** (datahub-gms) - Source of truth for storing and serving the DataHub Metadata Graph.

Previously, Authentication was exclusively handled by the Frontend Proxy. This service would perform the following steps
when a user navigated to `http://localhost:9002/`:

a. Check for the presence of a special `PLAY_SESSION` cookie.

b. If the cookie was present and valid, redirect to the home page.

c. If the cookie was missing or invalid, redirect to either a) the DataHub login screen (for [JAAS authentication](https://datahubproject.io/docs/how/auth/jaas/)) or b) a [configured OIDC Identity Provider](https://datahubproject.io/docs/how/auth/sso/configure-oidc-react/) to perform authentication.

Once authentication had succeeded at the frontend proxy layer, a stateless (token-based) session cookie (`PLAY_SESSION`) would be set in the user's browser.
All subsequent requests, including the GraphQL requests issued by the React UI, would be authenticated using this session cookie. Once a request had made it beyond
the frontend service layer, it was assumed to have been already authenticated. Hence, there was **no native authentication inside of the Metadata Service**.

### Problems with this approach

The major challenge with this situation is that requests to the backend Metadata Service were completely unauthenticated. There were 2 options for folks who required authentication at the Metadata Service layer:

1. Set up a proxy in front of the Metadata Service that performed authentication
2. [A more recent possibility] Route requests to the Metadata Service through DataHub Frontend Proxy, including the `PLAY_SESSION`
   cookie with every request.

Neither option is ideal. Setting up a proxy to do authentication takes time and expertise. Extracting a session cookie from the browser and setting it for programmatic use is
clunky and unscalable. On top of that, extending the authentication system was difficult, requiring implementing a new [Play module](https://www.playframework.com/documentation/2.8.8/api/java/play/mvc/Security.Authenticator.html) within DataHub Frontend.

## Introducing Authentication in DataHub Metadata Service

To address these problems, we introduced configurable Authentication inside the **Metadata Service** itself,
meaning that requests are no longer considered trusted until they are authenticated by the Metadata Service.

Why push Authentication down? In addition to the problems described above, we wanted to plan for a future
where Authentication of Kafka-based writes could be performed in the same manner as REST writes.

Next, we'll cover the components being introduced to support Authentication inside the Metadata Service.

### Concepts & Key Components

We introduced a few important concepts to the Metadata Service to make authentication work:

1. Actor
2. Authenticator
3. AuthenticatorChain
4. AuthenticationFilter
5. DataHub Access Token
6. DataHub Token Service

In the following sections, we'll take a closer look at each individually.

*High level overview of Metadata Service Authentication*

#### What is an Actor?

An **Actor** is a concept within the new Authentication subsystem that represents a unique identity / principal initiating actions (e.g. read & write requests)
on the platform.

An actor can be characterized by 2 attributes:

1. **Type**: The "type" of the actor making a request. The purpose is, for example, to distinguish between a "user" and a "service" actor. Currently, the "user" actor type is the only one
   formally supported.
2. **Id**: A unique identifier for the actor within DataHub. This is commonly known as a "principal" in other systems. In the case of users, this
   represents a unique "username". This username is in turn used when converting from the "Actor" concept into a Metadata Entity Urn (e.g. CorpUserUrn).

For example, the root "datahub" super user would have the following attributes:

```
{
   "type": "USER",
   "id": "datahub"
}
```

Which is mapped to the CorpUser urn:

```
urn:li:corpuser:datahub
```

for Metadata retrieval.

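The Actor-to-urn mapping described above can be sketched in a few lines of Python. This is only an illustrative model: the `Actor` and `ActorType` classes and the `to_urn` method are hypothetical names chosen for this sketch, not the classes used inside the Metadata Service.

```python
from dataclasses import dataclass
from enum import Enum


class ActorType(Enum):
    # "USER" is the only actor type the text describes as formally supported.
    USER = "USER"


@dataclass(frozen=True)
class Actor:
    """Hypothetical model of an Actor: a type plus a unique id."""

    type: ActorType
    id: str

    def to_urn(self) -> str:
        # A USER actor maps onto a CorpUser urn built from its id.
        if self.type is ActorType.USER:
            return f"urn:li:corpuser:{self.id}"
        raise ValueError(f"Unsupported actor type: {self.type}")


root = Actor(ActorType.USER, "datahub")
print(root.to_urn())  # urn:li:corpuser:datahub
```
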
#### What is an Authenticator?

An **Authenticator** is a pluggable component inside the Metadata Service that is responsible for authenticating an inbound request, given context about the request (currently, the request headers).
Authentication boils down to successfully resolving an **Actor** to associate with the inbound request.

There can be many types of Authenticator. For example, there can be Authenticators that

- Verify the authenticity of access tokens (i.e. issued by either DataHub itself or a 3rd-party IdP)
- Authenticate username / password credentials against a remote database (e.g. LDAP)

and more! A key goal of the abstraction is *extensibility*: a custom Authenticator can be developed to authenticate requests
based on an organization's unique needs.

DataHub ships with 2 Authenticators by default:

- **DataHubSystemAuthenticator**: Verifies that inbound requests have originated from inside DataHub itself using a shared system identifier
  and secret. This authenticator is always present.

- **DataHubTokenAuthenticator**: Verifies that inbound requests contain a DataHub-issued Access Token (discussed further in the "DataHub Access Token" section below) in their
  `Authorization` header. This authenticator is required if Metadata Service Authentication is enabled.

#### What is an AuthenticatorChain?

An **AuthenticatorChain** is a series of **Authenticators** that are configured to run one after another. This allows
for configuring multiple ways to authenticate a given request, for example via LDAP OR via local key file.

A request is rejected only if every Authenticator within the chain fails to authenticate it.

The Authenticator Chain can be configured in the `application.yml` file under `authentication.authenticators`:

```
authentication:
  ....
  authenticators:
    # Configure the Authenticators in the chain
    - type: com.datahub.authentication.Authenticator1
      ...
    - type: com.datahub.authentication.Authenticator2
    ....
```

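The chain semantics described above (try each Authenticator in order, reject only when all of them fail) can be modelled with a short Python sketch. Everything here is hypothetical: the two demo authenticators, the header names they check, and the decision to treat an exception as a failed attempt are assumptions made for illustration, not the behaviour of the Metadata Service implementation.

```python
from typing import Callable, Mapping, Optional, Sequence

# In this sketch an authenticator is any callable that inspects the request
# headers and returns an actor id on success, or None on failure.
Authenticator = Callable[[Mapping[str, str]], Optional[str]]


def token_authenticator(headers: Mapping[str, str]) -> Optional[str]:
    """Hypothetical authenticator that accepts one hard-coded bearer token."""
    if headers.get("Authorization") == "Bearer demo-token":
        return "datahub"
    return None


def api_key_authenticator(headers: Mapping[str, str]) -> Optional[str]:
    """Hypothetical authenticator that accepts a custom header."""
    if headers.get("X-Api-Key") == "demo-key":
        return "ingestion-bot"
    return None


def authenticate(chain: Sequence[Authenticator], headers: Mapping[str, str]) -> Optional[str]:
    """Run each authenticator in order; the first one to resolve an actor wins."""
    for authenticator in chain:
        try:
            actor = authenticator(headers)
        except Exception:
            # Assumption of this sketch: an error counts as a failed attempt
            # and the next authenticator still gets a chance.
            actor = None
        if actor is not None:
            return actor
    # Every authenticator failed, so the request should be rejected.
    return None


chain = [token_authenticator, api_key_authenticator]
print(authenticate(chain, {"X-Api-Key": "demo-key"}))
```

Ordering matters in a design like this: cheaper or more commonly used authenticators are usually placed first so that most requests resolve early.
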
#### What is the AuthenticationFilter?

The **AuthenticationFilter** is a [servlet filter](http://tutorials.jenkov.com/java-servlets/servlet-filters.html) that authenticates each and every request to the Metadata Service.
It does so by constructing and invoking an **AuthenticatorChain**, described above.

If the AuthenticatorChain is unable to resolve an Actor, the filter returns a 401 Unauthorized response.

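As a rough illustration of the filter's role, the sketch below wraps a request handler so that it only runs once an actor has been resolved. It is written in Python rather than as a real servlet filter, and `authentication_filter`, `demo_chain`, and `demo_handler` are hypothetical names used only for this example.

```python
from typing import Callable, Mapping, Optional, Tuple

Response = Tuple[int, str]  # (HTTP status code, body)


def authentication_filter(
    resolve_actor: Callable[[Mapping[str, str]], Optional[str]],
    handler: Callable[[str], Response],
) -> Callable[[Mapping[str, str]], Response]:
    """Wrap a handler so it is only reached by authenticated requests."""

    def filtered(headers: Mapping[str, str]) -> Response:
        actor = resolve_actor(headers)
        if actor is None:
            # No authenticator could resolve an actor: stop here with a 401.
            return 401, "Unauthorized"
        # Pass the resolved actor on to the downstream handler.
        return handler(actor)

    return filtered


def demo_chain(headers: Mapping[str, str]) -> Optional[str]:
    """Stand-in for an authenticator chain (hypothetical)."""
    if headers.get("Authorization") == "Bearer demo-token":
        return "datahub"
    return None


def demo_handler(actor: str) -> Response:
    return 200, f"hello {actor}"


endpoint = authentication_filter(demo_chain, demo_handler)
print(endpoint({}))
print(endpoint({"Authorization": "Bearer demo-token"}))
```
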
#### What is a DataHub Token Service? What are Access Tokens?

Along with Metadata Service Authentication comes an important new component called the **DataHub Token Service**. The purpose of this
component is twofold:

1. Generate Access Tokens that grant access to the Metadata Service
2. Verify the validity of Access Tokens presented to the Metadata Service

**Access Tokens** granted by the Token Service take the form of [JSON Web Tokens](https://jwt.io/introduction), a type of stateless token which
has a finite lifespan and is verified using a unique signature. JWTs can also contain a set of claims embedded within them. Tokens issued by the Token
Service contain the following claims:

- exp: the expiration time of the token
- version: version of the DataHub Access Token for purposes of evolvability (currently 1)
- type: The type of token, currently SESSION (used for UI-based sessions) or PERSONAL (used for personal access tokens)
- actorType: The type of the **Actor** associated with the token. Currently, USER is the only type supported.
- actorId: The id of the **Actor** associated with the token.

Today, Access Tokens are granted by the Token Service under two scenarios:

1. **UI Login**: When a user logs into the DataHub UI, for example via [JaaS](https://datahubproject.io/docs/how/auth/jaas/) or
   [OIDC](https://datahubproject.io/docs/how/auth/sso/configure-oidc-react/), the `datahub-frontend` service issues a
   request to the Metadata Service to generate a SESSION token *on behalf of* the user logging in. (Only the frontend service is authorized to perform this action.)
2. **Generating Personal Access Tokens**: When a user requests to generate a Personal Access Token (described below) from the UI.

> At present, the Token Service supports the symmetric signing method `HS256` to generate and verify tokens.

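To make the two responsibilities above concrete, here is a minimal sketch of generating and verifying an `HS256`-signed JWT using only the Python standard library. The claim names come from the list above; the function names, the demo signing key, and the exact claim values (for example `version` as the integer 1) are assumptions for illustration. This is not the Token Service implementation, and production code would normally rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    # JWTs use URL-safe base64 without trailing padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that was stripped during encoding.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def generate_token(secret: bytes, actor_id: str, ttl_seconds: int,
                   token_type: str = "PERSONAL") -> str:
    """Build a signed token carrying the claims listed in the text."""
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "exp": int(time.time()) + ttl_seconds,
        "version": 1,
        "type": token_type,
        "actorType": "USER",
        "actorId": actor_id,
    }
    signing_input = (
        _b64url(json.dumps(header).encode()) + "." + _b64url(json.dumps(claims).encode())
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)


def verify_token(secret: bytes, token: str) -> Optional[dict]:
    """Return the claims if the signature is valid and the token is unexpired."""
    try:
        header_b64, claims_b64, signature_b64 = token.split(".")
        expected = hmac.new(
            secret, f"{header_b64}.{claims_b64}".encode("ascii"), hashlib.sha256
        ).digest()
        # Constant-time comparison avoids leaking signature bytes via timing.
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return None
        claims = json.loads(_b64url_decode(claims_b64))
        if claims["exp"] <= time.time():
            return None
    except (ValueError, TypeError, KeyError):
        # Malformed token, bad encoding, or missing/invalid `exp` claim.
        return None
    return claims


secret = b"demo-signing-key"  # hypothetical key, for demonstration only
token = generate_token(secret, "datahub", ttl_seconds=3600)
print(verify_token(secret, token)["actorId"])
```

Because the signature is symmetric, anyone holding the signing key can both mint and verify tokens, which is why rotating the key invalidates every token issued with it.
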
Now that we're familiar with the concepts, let's talk concretely about the new capabilities that have been built on top
of Metadata Service Authentication.

### New Capabilities

#### Personal Access Tokens

With these changes, we introduced a way to generate a "Personal Access Token" suitable for programmatic use with both the DataHub GraphQL
and DataHub Rest.li (Ingestion) APIs.

Personal Access Tokens have a finite lifespan (default 3 months) and currently cannot be revoked without changing the signing key that
DataHub uses to generate these tokens (via the TokenService described above). Most importantly, they inherit the permissions
granted to the user who generates them.

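Since tokens expire and cannot be revoked individually, it can be handy to check when a token you hold will lapse. The sketch below reads the `exp` claim from a JWT payload, assuming it is expressed in seconds since the Unix epoch as is standard for JWTs. It deliberately does not verify the signature, so it is suitable only for inspecting your own token, never for authentication decisions. The helper names and the demo token are hypothetical.

```python
import base64
import json
from datetime import datetime, timezone


def token_expiry(token: str) -> datetime:
    """Read the `exp` claim from a JWT payload WITHOUT verifying the signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments drop base64 padding, so restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    return datetime.fromtimestamp(claims["exp"], tz=timezone.utc)


def _demo_token(exp: int) -> str:
    """Build a token-shaped string with a placeholder signature (demo only)."""

    def encode(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

    return f"{encode({'alg': 'HS256'})}.{encode({'exp': exp, 'actorId': 'datahub'})}.signature"


expires_at = token_expiry(_demo_token(1700000000))
print(expires_at.isoformat())
```
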
##### Generating Personal Access Tokens

To generate a personal access token, users must have been granted the "Generate Personal Access Tokens" (GENERATE_PERSONAL_ACCESS_TOKENS) Privilege via a [DataHub Policy](./policies.md). Once
they have this permission, users can navigate to **'Settings'** > **'Access Tokens'** > **'Generate Personal Access Token'** to generate a token.

The token expiration dictates how long the token will be valid for. We recommend setting the shortest duration possible, as tokens are not currently
revocable once granted (without changing the signing key).

#### Using a Personal Access Token

The user will subsequently be able to make authenticated requests to the DataHub frontend proxy or DataHub GMS directly by providing
the generated Access Token as a Bearer token in the `Authorization` header:

```
Authorization: Bearer <generated-access-token>
```

For example, using curl against the frontend proxy (preferred in production):

`curl 'http://localhost:9002/api/gms/entities/urn:li:corpuser:datahub' -H 'Authorization: Bearer <access-token>'`

or against the Metadata Service directly:

`curl 'http://localhost:8080/entities/urn:li:corpuser:datahub' -H 'Authorization: Bearer <access-token>'`

Without an access token, programmatic requests will receive a 401 response from the server if Metadata Service Authentication
is enabled.

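The same request can be built from Python with the standard library. This is a minimal sketch: `build_authenticated_request` is a hypothetical helper, the URL is the frontend proxy example from the curl command above, and the actual network call is left commented out because it needs a running DataHub instance and a real token.

```python
import urllib.request


def build_authenticated_request(url: str, access_token: str) -> urllib.request.Request:
    """Attach the access token as a Bearer token in the Authorization header."""
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {access_token}"}
    )


request = build_authenticated_request(
    "http://localhost:9002/api/gms/entities/urn:li:corpuser:datahub",
    "<access-token>",  # placeholder: substitute a generated Personal Access Token
)
print(request.get_header("Authorization"))

# Sending the request requires a reachable DataHub deployment:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```
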
### Configuring Metadata Service Authentication

Metadata Service Authentication is currently **opt-in**. This means that you may continue to use DataHub without Metadata Service Authentication without interruption.
To enable Metadata Service Authentication:

- set the `METADATA_SERVICE_AUTH_ENABLED` environment variable to "true" for the `datahub-gms` AND `datahub-frontend` containers / pods.

OR

- change the Metadata Service `application.yml` configuration file to set `authentication.enabled` to "true" AND
- change the Frontend Proxy Service `application.config` configuration file to set `metadataService.auth.enabled` to "true"

After setting the configuration flag, simply restart the Metadata Service to start enforcing Authentication.

Once enabled, all requests to the Metadata Service will need to be authenticated; if you're using the default Authenticators
that ship with DataHub, this means that all requests will need to present an Access Token in the `Authorization` header as follows:

```
Authorization: Bearer <access-token>
```

For users logging into the UI, this process will be handled for you. When logging in, a cookie will be set in your browser that internally
contains a valid Access Token for the Metadata Service. When browsing the UI, this token will be extracted and sent to the Metadata Service
to authenticate each request.

For users who want to access the Metadata Service programmatically, e.g. for running ingestion, the current recommendation is to generate
a **Personal Access Token** (described above) from the root "datahub" user account, and to use this token when configuring your [Ingestion Recipes](https://datahubproject.io/docs/metadata-ingestion/#recipes).
To configure the token for use in ingestion, simply populate the "token" configuration for the `datahub-rest` sink:

```
source:
  # source configs
sink:
  type: "datahub-rest"
  config:
    ...
    token: <your-personal-access-token-here!>
```

> Note that ingestion occurring via the `datahub-kafka` sink will continue to be Unauthenticated *for now*. Soon, we will be introducing
> support for providing an access token in the event payload itself to authenticate ingestion requests over Kafka.

### The Role of DataHub Frontend Proxy Going Forward

With these changes, DataHub Frontend Proxy will continue to play a vital part in the complex dance of Authentication. It will serve as the place
where UI-based session authentication originates and will continue to support 3rd Party SSO configuration (OIDC)
and JAAS configuration as it does today.

The major improvement is that the Frontend Service will validate credentials provided at UI login time
and generate a DataHub **Access Token**, embedding it into the traditional session cookie (which will continue to work).

In summary, DataHub Frontend Service will continue to play a vital role in Authentication. Its scope, however, will likely
remain limited to concerns specific to the React UI.

## Where to go from here

These changes represent the first milestone in Metadata Service Authentication. They will serve as a foundation upon which we can build new features, prioritized based on Community demand:

1. **Dynamic Authenticator Plugins**: Configure + register custom Authenticator implementations, without forking DataHub.
2. **Service Accounts**: Create service accounts and generate Access Tokens on their behalf.
3. **Kafka Ingestion Authentication**: Authenticate ingestion requests coming from the Kafka ingestion sink inside the Metadata Service.
4. **Access Token Management**: Ability to view, manage, and revoke access tokens that have been generated. (Currently, access tokens include no server-side state, and thus cannot be revoked once granted.)

...and more! To advocate for these features or others, reach out on [Slack](https://datahubspace.slack.com/join/shared_invite/zt-nx7i0dj7-I3IJYC551vpnvvjIaNRRGw#/shared-invite/email).

## Q&As

### What if I don't want to use Metadata Service Authentication?

That's perfectly fine, for now. Metadata Service Authentication is disabled by default and is only enabled if you provide the
environment variable `METADATA_SERVICE_AUTH_ENABLED` to the `datahub-gms` container or set `authentication.enabled` to "true"
inside your DataHub Metadata Service configuration (`application.yml`).

That being said, we will be recommending that you enable Authentication for production use cases, to prevent
arbitrary actors from ingesting metadata into DataHub.

### If I enable Metadata Service Authentication, will ingestion stop working?

If you enable Metadata Service Authentication, you will need to provide a value for the "token" configuration
when using the `datahub-rest` sink in your [Ingestion Recipes](https://datahubproject.io/docs/metadata-ingestion/#recipes). See
the [Rest Sink Docs](https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub#config-details) for configuration details.

We'd recommend generating a Personal Access Token (described above) from a trusted DataHub Account (e.g. root 'datahub' user) when configuring
your Ingestion sources.

Note that you can also provide the "extraHeaders" configuration in the `datahub-rest` sink to specify a custom header to
pass with each request. This can be used in conjunction with a custom Authenticator to authenticate requests, for example.

### How do I generate an Access Token for a service account?

There is no formal concept of "service account" or "bot" on DataHub (yet). For now, we recommend you configure any
programmatic clients of DataHub to use a Personal Access Token generated from a user with the correct privileges, for example
the root "datahub" user account.

### I want to authenticate requests using a custom Authenticator. How do I do this?

You can configure DataHub to add your custom **Authenticator** to the **Authentication Chain** by changing the `application.yml` configuration file for the Metadata Service:

```yml
authentication:
  enabled: true # Enable Metadata Service Authentication
  ....
  authenticators: # Configure an Authenticator Chain
    - type: <fully-qualified-authenticator-class-name> # E.g. com.linkedin.datahub.authentication.CustomAuthenticator
      configs: # Specific configs that should be passed into 'init' method of Authenticator
        customConfig1: <value>
```

Note that you will need a class that implements the `Authenticator` interface, with a zero-argument constructor, available on the classpath
of the Metadata Service Java process.

We love contributions! Feel free to raise a PR to contribute an Authenticator back if it's generally useful.

### Now that I can make authenticated requests to either DataHub Proxy Service or DataHub Metadata Service, which should I use?

Previously, we were recommending that folks contact the Metadata Service directly when doing things like

- ingesting Metadata via recipes
- issuing programmatic requests to the Rest.li APIs
- issuing programmatic requests to the GraphQL APIs

With these changes, we will be shifting to the recommendation that folks direct all traffic, whether it's programmatic or not,
to the **DataHub Frontend Proxy**, as routing to Metadata Service endpoints is currently available at the path `/api/gms`.
This recommendation is an effort to minimize the exposed surface area of DataHub, to make securing, operating, maintaining, and developing
the platform simpler.

In practice, this will require migrating Metadata [Ingestion Recipes](https://datahubproject.io/docs/metadata-ingestion/#recipes) that use the `datahub-rest` sink to point at a slightly different
host + path.

Example recipe that proxies through DataHub Frontend

```yml
source:
  # source configs
sink:
  type: "datahub-rest"
  config:
    ...
    token: <your-personal-access-token-here!>
```

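The "slightly different host + path" can be illustrated with a small Python helper that rewrites a direct Metadata Service URL into its frontend proxy equivalent by prefixing `/api/gms`. The function name and the default `http://localhost:9002` base are assumptions taken from the curl examples earlier in this document; any path on the frontend base URL itself is ignored by this sketch.

```python
from urllib.parse import urlsplit, urlunsplit


def via_frontend_proxy(gms_url: str, frontend_base: str = "http://localhost:9002") -> str:
    """Rewrite a direct Metadata Service URL to go through the frontend proxy."""
    gms = urlsplit(gms_url)
    frontend = urlsplit(frontend_base)
    # Metadata Service endpoints are exposed by the proxy under /api/gms.
    proxied_path = "/api/gms" + gms.path
    return urlunsplit(
        (frontend.scheme, frontend.netloc, proxied_path, gms.query, gms.fragment)
    )


print(via_frontend_proxy("http://localhost:8080/entities/urn:li:corpuser:datahub"))
# http://localhost:9002/api/gms/entities/urn:li:corpuser:datahub
```

Applied to a bare Metadata Service base URL such as `http://localhost:8080`, the helper yields `http://localhost:9002/api/gms`, which is the kind of value the sink's server address would be pointed at in a recipe.
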
## Feedback / Questions / Concerns

We want to hear from you! For any inquiries, including Feedback, Questions, or Concerns, reach out on [Slack](https://datahubspace.slack.com/join/shared_invite/zt-nx7i0dj7-I3IJYC551vpnvvjIaNRRGw#/shared-invite/email)!