fix(docs): build and broken snowflake docs fix (#6997)

2025-12-20 06:28:21 +00:00 · 2023-01-10 22:52:36 +05:30 · 2023-01-10 22:52:36 +05:30 · c82d4fb2fb
commit c82d4fb2fb
parent 9578e418c9
2 changed files with 1 additions and 384 deletions
--- a/docs/rfc/active/5818-serial-updates.md
+++ b/docs/rfc/active/5818-serial-updates.md
@ -1,380 +0,0 @@
 - Start Date: 2022-09-02
 - RFC PR: https://github.com/datahub-project/datahub/pull/5818
 - Discussion Issue: https://github.com/datahub-project/datahub/issues/5635
 - Implementation PR(s): (leave this empty)
 # Serialisation of Updates via GMS
 ## Summary
 Make it possible for the GMS to serialise updates by rejecting an update if an aspect has changed between a client 
 reading the state and writing a proposed new state.
 ## Basic example
 When a client connects to DataHub and wants to make changes to an existing aspect, their update may depend on the 
 current state of that aspect. An example would be adding items to a list or adding a new aspect on the basis that 
 one doesn't yet exist.
 Because the update endpoint requires client to write the full state of the aspect they wish to update, this can lead 
 to race conditions. If Client A and Client B are both writing to the same aspect concurrently they will (silently) find 
 that only one of their updates will have worked.
 See the very basic example below:
 ### Current State
 ```json
 {
  "myList": ["red", "blue", "green"]
 }
 ```
 Both clients legitimately read the starting state as above.
 ### Client A
 Wants to add "yellow" to the list, so the target state is
 ```json
 {
  "myList": ["red", "blue", "green", "yellow"]
 }
 ```
 ### Client B
 Wants to add "purple" to the list, so the target state is
 ```json
 {
  "myList": ["red", "blue", "green", "purple"]
 }
 ```
 If they both run their updates in a similar timeframe, both will succeed but either "yellow" or "purple" will be 
 added to the list, not both.
 I would like a way for a client to request for an update to be rejected if the initial state differs from its 
 assumptions, i.e. only update the state if the starting state is still the same.
 ## Motivation
 We wish to avoid losing data silently when two clients make updates to the same aspect. This can be quite likely in 
 an event-driven world driving downstream operations on Datasets. We should offer clients the chance to conditionally 
 update an aspect on the basis that what they recently observed is still the case and signal to the client if the 
 state changed before the update was possible. Essentially we need a long "compare-and-swap" operation 
 for atomic updates.
 ## Requirements
 > What specific requirements does your design need to meet? This should ideally be a bulleted list of items you wish
 > to achieve with your design. This can help everyone involved (including yourself!) make sure your design is robust
 > enough to meet these requirements.
 >
 > Once everyone has agreed upon the set of requirements for your design, we can use this list to review the detailed
 > design.
 * A client needs to be able to identify the unique state or version of a given aspect when querying for it.
 * The client needs to be able to reference the same state to a GMS "update" endpoint if it wants to ensure the 
  aspect has not changed between fetching it and mutating it.
 * A client could include multiple aspects in its precondition state, in case one update relies on the state of many 
  other aspects. We need to defend against race conditions for all the aspects given.
 ### Extensibility
 > Please also call out extensibility requirements. Is this proposal meant to be extended in the future? Are you adding
 > a new API or set of models that others can build on in later? Please list these concerns here as well.
 1. The proposal could be extended to include a list of aspect versions which must hold true in order for a single 
   aspect update to occur.
 2. We could extend to include the state version in a batch update, though this may be complicated if handling 
   multiple updates to the same aspect in a single batch.
 ## Non-Requirements
 It's not important for us to discuss complex prerequisites here. The only thing we need to enforce is the state of 
 an aspect has not changed when making an update to it.
 There are some potential spin-offs of this design which could involve writing some kind of PATCH update where a client 
 supplies a diff instead of a complete new state, but this is out of scope of this particular RFC.
 ## Detailed design
 > This is the bulk of the RFC.
 > Explain the design in enough detail for somebody familiar with the framework to understand, and for somebody familiar
 > with the implementation to implement. This should get into specifics and corner-cases, and include examples of how the
 > feature is used. Any new terminology should be defined here.
 ### PROPOSED SOLUTION: Read and Conditional Update
 Read the current aspect(s) and pass that state back to the client. The client can propose an update on the basis that 
 what they just read is still valid. The client should return an HTTP error code that specifically mentions the state 
 has changed underneath it when attempting the update. In this case the client can choose to re-apply the update 
 (read-modify-write) or throw its own error.
 Good
 * Guarantee of state between the read and write
 * No need for business-specific logic in GMS code
 Bad
 * Slow, especially under high contention
 * Client will need to handle retries
 ```mermaid
 sequenceDiagram
    actor Client
    loop Attempt until applied or error
    Client->>GMS: GetAspect
    GMS-->>Client: currentAspect
    Client-->>Client: Locally modify aspect
    Client->>GMS: UpdateAspect if still currentAspect
    GMS->>MetadataStore: UpdateAspect if still currentAspect
    MetadataStore-->>GMS: applied?
    alt applied
        GMS-->>Client: OK
    else error
        GMS-->>Client: ERROR
    end
    end
 ```
 #### Implementation via Etag
 We could achieve this via HTTP [Conditional Requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Conditional_requests#avoiding_the_lost_update_problem_with_optimistic_locking).
 GMS would populate the `ETag` header in the GET response for aspects. This can be used in the `If-Match` or 
 `If-None-Match` header when a POST request is passed to GMS to update the aspects. The main concern is then how to 
 format the ASCII `Etag` value for multiple aspects such that client and server are able to compose them.
 A suggested format would be semicolon-separated aspect-value pairs, e.g. `myAspect1=hashValue1;myAspect2=hashValue2`.
 ## How we teach this
 > What names and terminology work best for these concepts and why? How is this idea best presented? As a continuation
 > of existing DataHub patterns, or as a wholly new one?
 > What audience or audiences would be impacted by this change? Just DataHub backend developers? Frontend developers?
 > Users of the DataHub application itself?
 > Would the acceptance of this proposal mean the DataHub guides must be re-organized or altered? Does it change how
 > DataHub is taught to new users at any level?
 > How should this feature be introduced and taught to existing audiences?
 We would have two potential additions to terminology:
 1. Previous state: The version(s) of aspects required in order for a GM update to succeed.
 2. Preconditions: A wider set of assertions which must be true in order for a GM update to succeed.
 We don't really cover "Preconditions" in this RFC, and in fact I argue we should never do, as this leaks business 
 knowledge into the GMS code. I would therefore use "Previous State" to describe the aspect versions required for 
 updates.
 ### Documentation
 This should be added as an extra section of the following:
 * REST API examples
 * Java/Python REST Emitter
 * GraphQL mutations?
 ## Drawbacks
 > Why should we *not* do this? Please consider the impact on teaching DataHub, on the integration of this feature with
 > other existing and planned features, on the impact of the API churn on existing apps, etc.
 > There are tradeoffs to choosing any path, please attempt to identify them here.
 There are two major reasons this will be difficult to achieve: Aspect Versioning and Performance.
 ### Aspect Versioning
 Currently, [aspect versioning](/docs/advanced/aspect-versioning.md) uses 0 as a mutable placeholder for the latest 
 aspect version. We would have to find a way to get around this or provide a different, deterministic identifier for 
 a specific aspect version if we aren't able to reference a single fixed integer for the aspect versions.
 We don't necessarily have to overhaul the versioning approach here, as all we need is a way of uniquely identifying
 the aspect instance that is to be changed. We don't need to rely on numerical version ordering for this, but
 something will need to be added as the general aspect level to allow for this new functionality.
 Whatever is done, the most important thing is that changes can be isolated at the metadata storage level, whether
 through locking, transactions or Cassandra LWTs.
 ### Performance
 There are some significant performance drawbacks for most of these designs, as they will involve an extra hop between 
 the client and DataHub. Things will get even worse if the aspect updates are under heavy contention.
 #### Under contention
 This approach would suffer a lot under high contention. Let's say:
 * the time sending a request from client to gms and receiving response is `x`
 * the time taken to manipulate read to create a new write request at the client is `y` (assume constant for simplicity)
 * a constant transaction/LWT overhead of `z` (LWT if using Cassandra)
 Under a contention of n clients sending 1 request targeting the same aspect, we would see:
 * 1 client would take `2x + y + z` time
 * 1 client would take `4x + 2y + 2z` time
 * ...
 * nth concurrent client would take 2nx + ny + nz time
 This is assuming a client is just making one update to the aspect. If an aspect is being used in some complex 
 communication chain this could take a very long time.
 ## Alternatives
 > What other designs have been considered? What is the impact of not doing this?
 > This section could also include prior art, that is, how other frameworks in the same domain have solved this problem.
 ### Modelling
 Alternatives involve cleverly modelling aspects so they are only ever upserted without a previous state in mind, but 
 I haven't worked out how to achieve this for our requirements given the standard entity and aspect model available.
 ### Change Processor
 Check the allowed set of transitions between current and proposed aspect state. This would be achieved with
 business-specific logic injected into the DataHub codebase via plugins. This feels like the wrong approach.
 Good
 * Fast as all code is executed in GMS
 Bad
 * No guarantee of state between the read and write
 * No guarantee that the plugin code will run
 * Specific business logic getting tied up with GMS code
 ```mermaid
 sequenceDiagram
    actor Client
    Client->>+GMS: IngestAspect
    GMS->>MetadataStore: GetAspect
    MetadataStore-->>GMS: aspect
    GMS->>+ChangeProcessor: IsPermitted?
    alt permitted
        ChangeProcessor-->>GMS: yes
        note right of MetadataStore: Aspect may have changed here
        GMS->>MetadataStore: store new version
        MetadataStore-->>GMS: result
    else disallowed
        ChangeProcessor-->>-GMS: no
    end
    GMS-->>-Client: result
 ```
 ### Idempotent Partial Update (Patch)
 The client proposes a partial change to an aspect that is idempotent. The change can apply whatever state the aspect
 is in at the time. This will need to be extended to check for multiple currentAspect states when requested.
 Good
 * No need for business-specific logic in GMS code
 * Fast, as retry loop will be confined to GMS
 Bad
 * GMS will need to understand partial (PATCH) updates
 * Retries will need to be handled in GMS
 ```mermaid
 sequenceDiagram
    actor Client
    Client->>GMS: UpdateAspect
    loop Attempt until applied or error
    GMS->>MetadataStore: GetAspect
    MetadataStore-->>GMS: currentAspect
    GMS-->>GMS: Locally modify aspect
    GMS->>MetadataStore: UpdateAspect if still currentAspect
    MetadataStore-->>GMS: applied?
    end
    alt applied
        GMS-->>Client: OK
    else error
        GMS-->>Client: ERROR
    end
 ```
 #### Patch Language
 If a patch language were available, it might be possible to update collections in such a way that it's not
 necessary for clients to know about the previous state of an aspect.
 MongoDB offers an update mode which allows clients to [add items to a set](https://mongodb.github.io/mongo-java-driver/4.7/apidocs/mongodb-driver-core/com/mongodb/client/model/Updates.html#addToSet(java.lang.String,TItem)),
 for example. This would avoid the need to go back to the client to ask them to construct the final state of a
 collection.
 ##### Implementation of PATCH with json-patch format
 Client sends its update request in json-patch or some other delta format. GMS will reject the update if the "previous"
 side of the delta does not match. This closely matches the work in https://github.com/datahub-project/datahub/pull/5901.
 ### Serialised Updates Exclusively via Kafka
 Here, the client never updates the aspect via HTTP but instead passes it via the MetadataChangeProposal topic in Kafka.
 I don't think this is a viable solution due to the possibility of issuing out-of-date updates.
 Good
 * Local ordering
 * Fast performance
 Bad
 * No way of knowing what the current state is unless the client constructs it from the Kafka topics
 * Danger of proposing out-of-date updates if the topic is lagging
 ### URN-Aspect Locking
 Lock the target URN aspect(s) when reading. The client is guaranteed that the state it has read will not change 
 until it releases the lock.
 Good
 * Guarantee of state between the read and write
 * No need for business-specific logic in GMS code
 Bad
 * Very slow, especially under high contention
 * Need to handle lock release if a client fails to release the lock
 * Potentially blocking MCE Kafka queue
 ```mermaid
 sequenceDiagram
    actor Client
    Client->>GMS: GetAspectWithLock
    GMS->>+AspectLock: LockUrnAspect
    GMS->>MetadataStore: GetAspect
    MetadataStore-->>GMS: currentAspect
    GMS-->>Client: currentAspect+lock
    Client-->>Client: Locally modify aspect
    Client->>GMS: UpdateAspect(lock)
    GMS->>MetadataStore: UpdateAspect
    MetadataStore-->>GMS: applied?
    alt applied
        GMS-->>Client: OK
    else error
        GMS-->>Client: ERROR
    end
    Client->>GMS: ReleaseAspectLock
    GMS->>AspectLock: ReleaseUrnAspectLock
    opt timeout
        AspectLock->>-AspectLock: ReleaseUrnAspectLock
    end
 ```
 ## Rollout / Adoption Strategy
 > If we implemented this proposal, how will existing users / developers adopt it? Is it a breaking change? Can we write
 > automatic refactoring / migration tools? Can we provide a runtime adapter library for the original API it replaces? 
 This rollout could be done as either a new API endpoint or an optional additional header to an existing API, so 
 existing APIs will not be broken. Once available, clients may opt in to use the new features.
 ## Future Work
 > Describe any future projects, at a very high level, that will build off this proposal. This does not need to be
 > exhaustive, nor does it need to be anything you work on. It just helps reviewers see how this can be used in the
 > future, so they can help ensure your design is flexible enough.
 The capabilities built here would allow us to implement more parts of GMS than just UPSERT change types, as we could 
 guarantee the current state before making any changes.
 ## Unresolved questions
 > Optional, but suggested for first drafts. What parts of the design are still TBD?
 I'm not as familiar with the GMS code as I could be, so it's unclear to me exactly where we'd end up making code 
 changes. I know the solution needs to be agnostic to whichever backing store is being used.
--- a/metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py
+++ b/metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py
@ -62,10 +62,7 @@ class SnowflakeV2Config(
    extract_tags: TagOption = Field(
        default=TagOption.skip,
-        description="""Optional. Allowed values are `without_lineage`, `with_lineage`, and `skip` (default).
+        description="""Optional. Allowed values are `without_lineage`, `with_lineage`, and `skip` (default). `without_lineage` only extracts tags that have been applied directly to the given entity. `with_lineage` extracts both directly applied and propagated tags, but will be significantly slower. See the [Snowflake documentation](https://docs.snowflake.com/en/user-guide/object-tagging.html#tag-lineage) for information about tag lineage/propagation. """,
        `without_lineage` only extracts tags that have been applied directly to the given entity.
        `with_lineage` extracts both directly applied and propagated tags, but will be significantly slower.
        See the [Snowflake documentation](https://docs.snowflake.com/en/user-guide/object-tagging.html#tag-lineage) for information about tag lineage/propagation. """,
    )
    classification: Optional[ClassificationConfig] = Field(