mirror of
https://github.com/datahub-project/datahub.git
synced 2025-09-01 05:13:15 +00:00
fix(docs): build and broken snowflake docs fix (#6997)
This commit is contained in:
parent
9578e418c9
commit
c82d4fb2fb
@ -1,380 +0,0 @@
|
||||
- Start Date: 2022-09-02
|
||||
- RFC PR: https://github.com/datahub-project/datahub/pull/5818
|
||||
- Discussion Issue: https://github.com/datahub-project/datahub/issues/5635
|
||||
- Implementation PR(s): (leave this empty)
|
||||
|
||||
# Serialisation of Updates via GMS
|
||||
|
||||
## Summary
|
||||
|
||||
Make it possible for the GMS to serialise updates by rejecting an update if an aspect has changed between a client
|
||||
reading the state and writing a proposed new state.
|
||||
|
||||
## Basic example
|
||||
|
||||
When a client connects to DataHub and wants to make changes to an existing aspect, their update may depend on the
|
||||
current state of that aspect. An example would be adding items to a list or adding a new aspect on the basis that
|
||||
one doesn't yet exist.
|
||||
|
||||
Because the update endpoint requires client to write the full state of the aspect they wish to update, this can lead
|
||||
to race conditions. If Client A and Client B are both writing to the same aspect concurrently they will (silently) find
|
||||
that only one of their updates will have worked.
|
||||
|
||||
See the very basic example below:
|
||||
|
||||
### Current State
|
||||
```json
|
||||
{
|
||||
"myList": ["red", "blue", "green"]
|
||||
}
|
||||
```
|
||||
|
||||
Both clients legitimately read the starting state as above.
|
||||
|
||||
### Client A
|
||||
|
||||
Wants to add "yellow" to the list, so the target state is
|
||||
```json
|
||||
{
|
||||
"myList": ["red", "blue", "green", "yellow"]
|
||||
}
|
||||
```
|
||||
|
||||
### Client B
|
||||
|
||||
Wants to add "purple" to the list, so the target state is
|
||||
```json
|
||||
{
|
||||
"myList": ["red", "blue", "green", "purple"]
|
||||
}
|
||||
```
|
||||
|
||||
If they both run their updates in a similar timeframe, both will succeed but either "yellow" or "purple" will be
|
||||
added to the list, not both.
|
||||
|
||||
I would like a way for a client to request for an update to be rejected if the initial state differs from its
|
||||
assumptions, i.e. only update the state if the starting state is still the same.
|
||||
|
||||
## Motivation
|
||||
|
||||
We wish to avoid losing data silently when two clients make updates to the same aspect. This can be quite likely in
|
||||
an event-driven world driving downstream operations on Datasets. We should offer clients the chance to conditionally
|
||||
update an aspect on the basis that what they recently observed is still the case and signal to the client if the
|
||||
state changed before the update was possible. Essentially we need a long "compare-and-swap" operation
|
||||
for atomic updates.
|
||||
|
||||
## Requirements
|
||||
|
||||
> What specific requirements does your design need to meet? This should ideally be a bulleted list of items you wish
|
||||
> to achieve with your design. This can help everyone involved (including yourself!) make sure your design is robust
|
||||
> enough to meet these requirements.
|
||||
>
|
||||
> Once everyone has agreed upon the set of requirements for your design, we can use this list to review the detailed
|
||||
> design.
|
||||
|
||||
* A client needs to be able to identify the unique state or version of a given aspect when querying for it.
|
||||
* The client needs to be able to reference the same state to a GMS "update" endpoint if it wants to ensure the
|
||||
aspect has not changed between fetching it and mutating it.
|
||||
* A client could include multiple aspects in its precondition state, in case one update relies on the state of many
|
||||
other aspects. We need to defend against race conditions for all the aspects given.
|
||||
|
||||
### Extensibility
|
||||
|
||||
> Please also call out extensibility requirements. Is this proposal meant to be extended in the future? Are you adding
|
||||
> a new API or set of models that others can build on in later? Please list these concerns here as well.
|
||||
|
||||
1. The proposal could be extended to include a list of aspect versions which must hold true in order for a single
|
||||
aspect update to occur.
|
||||
2. We could extend to include the state version in a batch update, though this may be complicated if handling
|
||||
multiple updates to the same aspect in a single batch.
|
||||
|
||||
## Non-Requirements
|
||||
|
||||
It's not important for us to discuss complex prerequisites here. The only thing we need to enforce is the state of
|
||||
an aspect has not changed when making an update to it.
|
||||
|
||||
There are some potential spin-offs of this design which could involve writing some kind of PATCH update where a client
|
||||
supplies a diff instead of a complete new state, but this is out of scope of this particular RFC.
|
||||
|
||||
## Detailed design
|
||||
|
||||
> This is the bulk of the RFC.
|
||||
|
||||
> Explain the design in enough detail for somebody familiar with the framework to understand, and for somebody familiar
|
||||
> with the implementation to implement. This should get into specifics and corner-cases, and include examples of how the
|
||||
> feature is used. Any new terminology should be defined here.
|
||||
|
||||
### PROPOSED SOLUTION: Read and Conditional Update
|
||||
|
||||
Read the current aspect(s) and pass that state back to the client. The client can propose an update on the basis that
|
||||
what they just read is still valid. The client should return an HTTP error code that specifically mentions the state
|
||||
has changed underneath it when attempting the update. In this case the client can choose to re-apply the update
|
||||
(read-modify-write) or throw its own error.
|
||||
|
||||
Good
|
||||
* Guarantee of state between the read and write
|
||||
* No need for business-specific logic in GMS code
|
||||
|
||||
Bad
|
||||
* Slow, especially under high contention
|
||||
* Client will need to handle retries
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
actor Client
|
||||
loop Attempt until applied or error
|
||||
Client->>GMS: GetAspect
|
||||
GMS-->>Client: currentAspect
|
||||
Client-->>Client: Locally modify aspect
|
||||
Client->>GMS: UpdateAspect if still currentAspect
|
||||
GMS->>MetadataStore: UpdateAspect if still currentAspect
|
||||
MetadataStore-->>GMS: applied?
|
||||
alt applied
|
||||
GMS-->>Client: OK
|
||||
else error
|
||||
GMS-->>Client: ERROR
|
||||
end
|
||||
end
|
||||
```
|
||||
|
||||
#### Implementation via Etag
|
||||
|
||||
We could achieve this via HTTP [Conditional Requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Conditional_requests#avoiding_the_lost_update_problem_with_optimistic_locking).
|
||||
GMS would populate the `ETag` header in the GET response for aspects. This can be used in the `If-Match` or
|
||||
`If-None-Match` header when a POST request is passed to GMS to update the aspects. The main concern is then how to
|
||||
format the ASCII `Etag` value for multiple aspects such that client and server are able to compose them.
|
||||
|
||||
A suggested format would be semicolon-separated aspect-value pairs, e.g. `myAspect1=hashValue1;myAspect2=hashValue2`.
|
||||
|
||||
## How we teach this
|
||||
|
||||
> What names and terminology work best for these concepts and why? How is this idea best presented? As a continuation
|
||||
> of existing DataHub patterns, or as a wholly new one?
|
||||
|
||||
> What audience or audiences would be impacted by this change? Just DataHub backend developers? Frontend developers?
|
||||
> Users of the DataHub application itself?
|
||||
|
||||
> Would the acceptance of this proposal mean the DataHub guides must be re-organized or altered? Does it change how
|
||||
> DataHub is taught to new users at any level?
|
||||
|
||||
> How should this feature be introduced and taught to existing audiences?
|
||||
|
||||
We would have two potential additions to terminology:
|
||||
1. Previous state: The version(s) of aspects required in order for a GM update to succeed.
|
||||
2. Preconditions: A wider set of assertions which must be true in order for a GM update to succeed.
|
||||
|
||||
We don't really cover "Preconditions" in this RFC, and in fact I argue we should never do, as this leaks business
|
||||
knowledge into the GMS code. I would therefore use "Previous State" to describe the aspect versions required for
|
||||
updates.
|
||||
|
||||
### Documentation
|
||||
|
||||
This should be added as an extra section of the following:
|
||||
* REST API examples
|
||||
* Java/Python REST Emitter
|
||||
* GraphQL mutations?
|
||||
|
||||
## Drawbacks
|
||||
|
||||
> Why should we *not* do this? Please consider the impact on teaching DataHub, on the integration of this feature with
|
||||
> other existing and planned features, on the impact of the API churn on existing apps, etc.
|
||||
|
||||
> There are tradeoffs to choosing any path, please attempt to identify them here.
|
||||
|
||||
There are two major reasons this will be difficult to achieve: Aspect Versioning and Performance.
|
||||
|
||||
### Aspect Versioning
|
||||
|
||||
Currently, [aspect versioning](/docs/advanced/aspect-versioning.md) uses 0 as a mutable placeholder for the latest
|
||||
aspect version. We would have to find a way to get around this or provide a different, deterministic identifier for
|
||||
a specific aspect version if we aren't able to reference a single fixed integer for the aspect versions.
|
||||
|
||||
We don't necessarily have to overhaul the versioning approach here, as all we need is a way of uniquely identifying
|
||||
the aspect instance that is to be changed. We don't need to rely on numerical version ordering for this, but
|
||||
something will need to be added as the general aspect level to allow for this new functionality.
|
||||
|
||||
Whatever is done, the most important thing is that changes can be isolated at the metadata storage level, whether
|
||||
through locking, transactions or Cassandra LWTs.
|
||||
|
||||
### Performance
|
||||
|
||||
There are some significant performance drawbacks for most of these designs, as they will involve an extra hop between
|
||||
the client and DataHub. Things will get even worse if the aspect updates are under heavy contention.
|
||||
|
||||
#### Under contention
|
||||
|
||||
This approach would suffer a lot under high contention. Let's say:
|
||||
* the time sending a request from client to gms and receiving response is `x`
|
||||
* the time taken to manipulate read to create a new write request at the client is `y` (assume constant for simplicity)
|
||||
* a constant transaction/LWT overhead of `z` (LWT if using Cassandra)
|
||||
|
||||
Under a contention of n clients sending 1 request targeting the same aspect, we would see:
|
||||
* 1 client would take `2x + y + z` time
|
||||
* 1 client would take `4x + 2y + 2z` time
|
||||
* ...
|
||||
* nth concurrent client would take 2nx + ny + nz time
|
||||
|
||||
This is assuming a client is just making one update to the aspect. If an aspect is being used in some complex
|
||||
communication chain this could take a very long time.
|
||||
|
||||
## Alternatives
|
||||
|
||||
> What other designs have been considered? What is the impact of not doing this?
|
||||
|
||||
> This section could also include prior art, that is, how other frameworks in the same domain have solved this problem.
|
||||
|
||||
### Modelling
|
||||
|
||||
Alternatives involve cleverly modelling aspects so they are only ever upserted without a previous state in mind, but
|
||||
I haven't worked out how to achieve this for our requirements given the standard entity and aspect model available.
|
||||
|
||||
### Change Processor
|
||||
|
||||
Check the allowed set of transitions between current and proposed aspect state. This would be achieved with
|
||||
business-specific logic injected into the DataHub codebase via plugins. This feels like the wrong approach.
|
||||
|
||||
Good
|
||||
* Fast as all code is executed in GMS
|
||||
|
||||
Bad
|
||||
* No guarantee of state between the read and write
|
||||
* No guarantee that the plugin code will run
|
||||
* Specific business logic getting tied up with GMS code
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
actor Client
|
||||
Client->>+GMS: IngestAspect
|
||||
GMS->>MetadataStore: GetAspect
|
||||
MetadataStore-->>GMS: aspect
|
||||
GMS->>+ChangeProcessor: IsPermitted?
|
||||
alt permitted
|
||||
ChangeProcessor-->>GMS: yes
|
||||
note right of MetadataStore: Aspect may have changed here
|
||||
GMS->>MetadataStore: store new version
|
||||
MetadataStore-->>GMS: result
|
||||
else disallowed
|
||||
ChangeProcessor-->>-GMS: no
|
||||
end
|
||||
GMS-->>-Client: result
|
||||
```
|
||||
|
||||
### Idempotent Partial Update (Patch)
|
||||
|
||||
The client proposes a partial change to an aspect that is idempotent. The change can apply whatever state the aspect
|
||||
is in at the time. This will need to be extended to check for multiple currentAspect states when requested.
|
||||
|
||||
Good
|
||||
* No need for business-specific logic in GMS code
|
||||
* Fast, as retry loop will be confined to GMS
|
||||
|
||||
Bad
|
||||
* GMS will need to understand partial (PATCH) updates
|
||||
* Retries will need to be handled in GMS
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
actor Client
|
||||
Client->>GMS: UpdateAspect
|
||||
loop Attempt until applied or error
|
||||
GMS->>MetadataStore: GetAspect
|
||||
MetadataStore-->>GMS: currentAspect
|
||||
GMS-->>GMS: Locally modify aspect
|
||||
GMS->>MetadataStore: UpdateAspect if still currentAspect
|
||||
MetadataStore-->>GMS: applied?
|
||||
end
|
||||
alt applied
|
||||
GMS-->>Client: OK
|
||||
else error
|
||||
GMS-->>Client: ERROR
|
||||
end
|
||||
```
|
||||
|
||||
#### Patch Language
|
||||
|
||||
If a patch language were available, it might be possible to update collections in such a way that it's not
|
||||
necessary for clients to know about the previous state of an aspect.
|
||||
|
||||
MongoDB offers an update mode which allows clients to [add items to a set](https://mongodb.github.io/mongo-java-driver/4.7/apidocs/mongodb-driver-core/com/mongodb/client/model/Updates.html#addToSet(java.lang.String,TItem)),
|
||||
for example. This would avoid the need to go back to the client to ask them to construct the final state of a
|
||||
collection.
|
||||
|
||||
##### Implementation of PATCH with json-patch format
|
||||
|
||||
Client sends its update request in json-patch or some other delta format. GMS will reject the update if the "previous"
|
||||
side of the delta does not match. This closely matches the work in https://github.com/datahub-project/datahub/pull/5901.
|
||||
|
||||
### Serialised Updates Exclusively via Kafka
|
||||
|
||||
Here, the client never updates the aspect via HTTP but instead passes it via the MetadataChangeProposal topic in Kafka.
|
||||
I don't think this is a viable solution due to the possibility of issuing out-of-date updates.
|
||||
|
||||
Good
|
||||
* Local ordering
|
||||
* Fast performance
|
||||
|
||||
Bad
|
||||
* No way of knowing what the current state is unless the client constructs it from the Kafka topics
|
||||
* Danger of proposing out-of-date updates if the topic is lagging
|
||||
|
||||
### URN-Aspect Locking
|
||||
|
||||
Lock the target URN aspect(s) when reading. The client is guaranteed that the state it has read will not change
|
||||
until it releases the lock.
|
||||
|
||||
Good
|
||||
* Guarantee of state between the read and write
|
||||
* No need for business-specific logic in GMS code
|
||||
|
||||
Bad
|
||||
* Very slow, especially under high contention
|
||||
* Need to handle lock release if a client fails to release the lock
|
||||
* Potentially blocking MCE Kafka queue
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
actor Client
|
||||
Client->>GMS: GetAspectWithLock
|
||||
GMS->>+AspectLock: LockUrnAspect
|
||||
GMS->>MetadataStore: GetAspect
|
||||
MetadataStore-->>GMS: currentAspect
|
||||
GMS-->>Client: currentAspect+lock
|
||||
Client-->>Client: Locally modify aspect
|
||||
Client->>GMS: UpdateAspect(lock)
|
||||
GMS->>MetadataStore: UpdateAspect
|
||||
MetadataStore-->>GMS: applied?
|
||||
alt applied
|
||||
GMS-->>Client: OK
|
||||
else error
|
||||
GMS-->>Client: ERROR
|
||||
end
|
||||
Client->>GMS: ReleaseAspectLock
|
||||
GMS->>AspectLock: ReleaseUrnAspectLock
|
||||
opt timeout
|
||||
AspectLock->>-AspectLock: ReleaseUrnAspectLock
|
||||
end
|
||||
```
|
||||
|
||||
## Rollout / Adoption Strategy
|
||||
|
||||
> If we implemented this proposal, how will existing users / developers adopt it? Is it a breaking change? Can we write
|
||||
> automatic refactoring / migration tools? Can we provide a runtime adapter library for the original API it replaces?
|
||||
|
||||
This rollout could be done as either a new API endpoint or an optional additional header to an existing API, so
|
||||
existing APIs will not be broken. Once available, clients may opt in to use the new features.
|
||||
|
||||
## Future Work
|
||||
|
||||
> Describe any future projects, at a very high level, that will build off this proposal. This does not need to be
|
||||
> exhaustive, nor does it need to be anything you work on. It just helps reviewers see how this can be used in the
|
||||
> future, so they can help ensure your design is flexible enough.
|
||||
|
||||
The capabilities built here would allow us to implement more parts of GMS than just UPSERT change types, as we could
|
||||
guarantee the current state before making any changes.
|
||||
|
||||
## Unresolved questions
|
||||
|
||||
> Optional, but suggested for first drafts. What parts of the design are still TBD?
|
||||
|
||||
I'm not as familiar with the GMS code as I could be, so it's unclear to me exactly where we'd end up making code
|
||||
changes. I know the solution needs to be agnostic to whichever backing store is being used.
|
@ -62,10 +62,7 @@ class SnowflakeV2Config(
|
||||
|
||||
extract_tags: TagOption = Field(
|
||||
default=TagOption.skip,
|
||||
description="""Optional. Allowed values are `without_lineage`, `with_lineage`, and `skip` (default).
|
||||
`without_lineage` only extracts tags that have been applied directly to the given entity.
|
||||
`with_lineage` extracts both directly applied and propagated tags, but will be significantly slower.
|
||||
See the [Snowflake documentation](https://docs.snowflake.com/en/user-guide/object-tagging.html#tag-lineage) for information about tag lineage/propagation. """,
|
||||
description="""Optional. Allowed values are `without_lineage`, `with_lineage`, and `skip` (default). `without_lineage` only extracts tags that have been applied directly to the given entity. `with_lineage` extracts both directly applied and propagated tags, but will be significantly slower. See the [Snowflake documentation](https://docs.snowflake.com/en/user-guide/object-tagging.html#tag-lineage) for information about tag lineage/propagation. """,
|
||||
)
|
||||
|
||||
classification: Optional[ClassificationConfig] = Field(
|
||||
|
Loading…
x
Reference in New Issue
Block a user