fix: fix features section (#8571)
parent ab101bc49c
commit 47114bf83c
@ -410,7 +410,6 @@ module.exports = {
         "docs/features/dataset-usage-and-query-history",
         "docs/posts",
         "docs/sync-status",
-        "docs/architecture/stemming_and_synonyms",
         "docs/lineage/lineage-feature-guide",
         {
           type: "doc",
@ -1,158 +0,0 @@
import FeatureAvailability from '@site/src/components/FeatureAvailability';

# About DataHub [Stemming and Synonyms Support]

<FeatureAvailability/>
This feature adds improvements to our current search implementation in an effort to make search results more relevant. The improvements include:

* Stemming - Using a multi-language stemmer to allow better partial matching based on lexicographical roots, e.g. "log" resolves from "logs", "logging", "logger", etc.
* Urn matching - Previously, both partial and full Urns gave undesirable behavior in search results; they are now properly indexed and queried to give better matching results.
* Word breaks across special characters - Previously, when typing a query like "logging_events", autocomplete would fail to resolve results after the underscore was typed until at least "logging_eve" had been entered; the same occurred with spaces. This has been resolved.
* Synonyms - A static list of synonyms that will match across search results has been added. We will evolve this list over time to improve matching of jargon to its full-word equivalent. For example, typing "staging" in a query can resolve datasets with "stg" in their name.

<!-- TODO: ADD RELEASE VERSION -->
## [Stemming and Synonyms Support] Setup, Prerequisites, and Permissions

A reindex is required for this feature to work, as it modifies non-dynamic mappings and settings in the index. The reindex is carried out as part of the bootstrapping process by DataHub Upgrade, which has been added to the helm charts and docker-compose files as a required component with default configurations that should work for most deployments.

The job uses existing credentials and permissions for ElasticSearch to achieve this. During the reindex, writes to ElasticSearch will fail, so it is recommended to schedule an outage during this time. If doing a rolling update, old versions of GMS should still be able to serve queries, but at minimum ingestion traffic needs to be stopped. Estimated downtime for instances on the order of a few million records is ~30 minutes; larger instances may require several hours.

Once the reindex has succeeded, a message will be sent to new GMS and MCL/MAE Consumer instances indicating that the state is ready for them to start up. Until then, they will hold off on starting, using an exponential backoff to check for readiness.

Relevant configuration for the Upgrade Job:
### Helm Values

```yaml
global:
  elasticsearch:
    ## The following section controls when and how reindexing of elasticsearch indices is performed
    index:
      ## Enable reindexing when mappings change based on the data model annotations
      enableMappingsReindex: false

      ## Enable reindexing when static index settings change.
      ## Dynamic settings which do not require reindexing are not affected.
      ## Primarily this should be enabled when re-sharding is necessary for scaling/performance.
      enableSettingsReindex: false

      ## Index settings can be overridden for entity indices or other indices on an index-by-index basis
      ## Some index settings, such as the number of shards, require reindexing, while others, e.g. replicas, do not
      ## Non-entity indices do not require the prefix
      # settingsOverrides: '{"graph_service_v1":{"number_of_shards":"5"},"system_metadata_service_v1":{"number_of_shards":"5"}}'
      ## Entity indices do not require the prefix or suffix
      # entitySettingsOverrides: '{"dataset":{"number_of_shards":"10"}}'

      ## The amount of delay between indexing a document and having it returned in queries
      ## Increasing this value can improve performance when ingesting large amounts of data
      # refreshIntervalSeconds: 1

      ## The following options control settings for the datahub-upgrade job when creating or reindexing indices
      upgrade:
        enabled: true

        ## When reindexing is required, this option will clone the existing index as a backup.
        ## The clone indices are not currently managed.
        # cloneIndices: true

        ## Typically when reindexing, the document counts between the original and destination indices should match.
        ## In some cases reindexing might not be able to proceed due to incompatibilities between a document in the
        ## original index and the new index's mappings. Such a document could be dropped and re-ingested or restored from
        ## the SQL database.
        ##
        ## This setting allows continuing if and only if the cloneIndices setting is also enabled, which
        ## ensures a complete backup of the original index is preserved.
        # allowDocCountMismatch: false
```
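For example, to re-shard the dataset entity index during an upgrade while keeping a backup, a values override along these lines could be used. This is a minimal sketch built only from the keys documented above; the shard count is illustrative, not a recommendation:

```yaml
# Minimal sketch using the configuration keys documented above.
# The shard count is illustrative, not a recommendation.
global:
  elasticsearch:
    index:
      ## Re-sharding changes a static setting, so a settings reindex is required
      enableSettingsReindex: true
      entitySettingsOverrides: '{"dataset":{"number_of_shards":"10"}}'
      upgrade:
        enabled: true
        ## Keep a clone of each index as a backup before reindexing
        cloneIndices: true
```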
### Docker Environment Variables

* ELASTICSEARCH_INDEX_BUILDER_MAPPINGS_REINDEX - Controls whether to perform a reindex for mappings mismatches.
* ELASTICSEARCH_INDEX_BUILDER_SETTINGS_REINDEX - Controls whether to perform a reindex for settings mismatches.
* ELASTICSEARCH_BUILD_INDICES_ALLOW_DOC_COUNT_MISMATCH - Used in conjunction with ELASTICSEARCH_BUILD_INDICES_CLONE_INDICES to allow users to skip past document count mismatches when reindexing. Count mismatches may indicate dropped records during the reindex, so to prevent data loss this is only allowed if cloning is enabled.
* ELASTICSEARCH_BUILD_INDICES_CLONE_INDICES - Enables creating a clone of the current index to prevent data loss. Defaults to true.
* ELASTICSEARCH_BUILD_INDICES_INITIAL_BACK_OFF_MILLIS - Controls the GMS and MCL Consumer backoff for checking whether the reindex process has completed during start-up. It is recommended to leave the defaults, which will result in waiting up to ~5 minutes before killing the start-up process, allowing a new pod to attempt to start up in orchestrated deployments.
* ELASTICSEARCH_BUILD_INDICES_MAX_BACK_OFFS - The maximum number of back-off checks performed before the start-up process gives up.
* ELASTICSEARCH_BUILD_INDICES_BACK_OFF_FACTOR - The multiplier applied to the delay between successive back-off checks.
* ELASTICSEARCH_BUILD_INDICES_WAIT_FOR_BUILD_INDICES - Controls whether to require waiting for the Build Indices job to finish. Defaults to true. It is not recommended to change this, as it will allow GMS and MCL Consumers to start up in an error state.
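In a docker-compose deployment, these variables are set on the upgrade job's container. The snippet below is a minimal, illustrative sketch; the service name and image tag are placeholders, not the exact values shipped in DataHub's compose files:

```yaml
# Minimal sketch: service name and image tag are placeholders.
services:
  datahub-upgrade:
    image: acryldata/datahub-upgrade:head
    environment:
      - ELASTICSEARCH_INDEX_BUILDER_MAPPINGS_REINDEX=true
      - ELASTICSEARCH_INDEX_BUILDER_SETTINGS_REINDEX=true
      # Clone first so a document count mismatch cannot silently lose data
      - ELASTICSEARCH_BUILD_INDICES_CLONE_INDICES=true
      - ELASTICSEARCH_BUILD_INDICES_ALLOW_DOC_COUNT_MISMATCH=false
```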
## Using [Stemming and Synonyms Support]

### Stemming

Stemming matches on the root of a word, ignoring suffixes, to capture the intent of a search when a user is not quite sure of the precise name of a resource.
<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/elasticsearch_optimization/eso-stemming_1.png"/>
</p>

In this first image, stemming is shown in the results: even though the query is "event", the results contain instances with "events."

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/elasticsearch_optimization/eso-stemming_2.png"/>
</p>

The second image exemplifies stemming on the query side: the query is for "events", but the results show resources containing "event" as well.
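Mechanically, stemming of this kind is implemented with a stemmer token filter in the index's analyzer chain. The following is an illustrative sketch only, expressed as YAML for readability (Elasticsearch accepts the equivalent JSON); the filter and analyzer names are hypothetical, not DataHub's actual mapping:

```yaml
# Hypothetical analyzer: names are illustrative, not DataHub's actual mapping.
settings:
  analysis:
    filter:
      text_stemmer:
        type: stemmer      # built-in Elasticsearch stemmer token filter
        language: english
    analyzer:
      stemmed_text:
        tokenizer: standard
        filter:
          - lowercase
          - text_stemmer   # reduces terms like "logs" and "logging" toward the root "log"
```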
### Urn Matching

Previously, queries did not properly parse out and tokenize the expected portions of Urn types. Changes have been made on both the index mapping and the query side to support partial and full Urn matching.
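As a purely illustrative sketch of the indexing side (not DataHub's actual implementation), a pattern tokenizer can split a Urn on its delimiter characters so that components like the platform and dataset name become individually matchable:

```yaml
# Hypothetical sketch: split "urn:li:dataset:(urn:li:dataPlatform:hive,SampleTable,PROD)"
# into tokens such as "hive" and "sampletable" so partial Urn queries can match.
settings:
  analysis:
    tokenizer:
      urn_parts:
        type: pattern
        pattern: "[:,()]"   # split on the Urn delimiter characters
    analyzer:
      urn_components:
        tokenizer: urn_parts
        filter:
          - lowercase
```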
<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/elasticsearch_optimizations/eso-exact_match.png"/>
</p>

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/elasticsearch_optimizations/eso-partial_urn_1.png"/>
</p>

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/elasticsearch_optimizations/eso-partial_urn_2.png"/>
</p>
### Synonyms

Synonyms support includes a static list of equivalent terms that is baked into the index at index creation time, which allows for efficient indexing of related terms. It is possible to add synonyms on the query side as well to allow for dynamic synonyms, but this is unsupported at this time and has performance implications.
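To illustrate the mechanism (the actual synonym list and filter names in DataHub's mapping are not shown here), an index-time synonym token filter in Elasticsearch has roughly this shape, expressed as YAML for readability:

```yaml
# Hypothetical sketch: a static synonym filter applied at index creation time.
settings:
  analysis:
    filter:
      static_synonyms:
        type: synonym
        synonyms:
          - "stg, staging"      # a query for "staging" matches assets named with "stg"
          - "dev, development"
    analyzer:
      synonym_text:
        tokenizer: standard
        filter:
          - lowercase
          - static_synonyms
```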
<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/elasticsearch_optimizations/eso-synonyms_1.png"/>
</p>

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/elasticsearch_optimizations/eso-synonyms_2.png"/>
</p>
### Autocomplete Improvements

Improvements were made to autocomplete handling around special characters like underscores and spaces.
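A common Elasticsearch building block for this behavior is a word-delimiter style token filter, which emits sub-tokens at underscores and similar characters so that partial input like "logging_e" can still match "logging_events". A hedged sketch, not DataHub's actual autocomplete analyzer:

```yaml
# Hypothetical sketch: emit sub-tokens at underscores so autocomplete matches mid-token.
settings:
  analysis:
    filter:
      split_on_delimiters:
        type: word_delimiter_graph # "logging_events" -> "logging", "events"
        preserve_original: true    # also keep the full "logging_events" token
    analyzer:
      autocomplete_text:
        tokenizer: whitespace      # keep tokens intact so the filter controls splitting
        filter:
          - split_on_delimiters
          - lowercase
```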
<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/elasticsearch_optimizations/eso-autocomplete_1.png"/>
</p>

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/elasticsearch_optimizations/eso-autocomplete_2.png"/>
</p>

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/elasticsearch_optimizations/eso-autocomplete_3.png"/>
</p>
## Additional Resources

### Videos

**DataHub TownHall: Search Improvements Preview**

<p align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/ECxIMbKwuOY?start=1529" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
</p>

## FAQ and Troubleshooting

*Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!*
@ -1,11 +1,11 @@
 import FeatureAvailability from '@site/src/components/FeatureAvailability';

-# About Metadata Tests
+# Metadata Tests

 <FeatureAvailability saasOnly />

 DataHub includes a highly configurable, no-code framework that allows you to configure broad-spanning monitors & continuous actions
 for the data assets - datasets, dashboards, charts, pipelines - that make up your enterprise Metadata Graph.
 At the center of this framework is the concept of a Metadata Test.

 There are two powerful use cases that are uniquely enabled by the Metadata Tests framework:
@ -13,59 +13,56 @@ There are two powerful use cases that are uniquely enabled by the Metadata Tests
 1. Automated Asset Classification
 2. Automated Metadata Completion Monitoring

 ### Automated Asset Classification

 Metadata Tests allow you to define conditions for selecting a subset of data assets (e.g. datasets, dashboards, etc.),
 along with a set of actions to take for entities that are selected. After the test is defined, the actions
 will be applied continuously over time, as the selection set evolves & changes with your data ecosystem.

 When defining selection criteria, you'll be able to choose from a range of useful technical signals (e.g. usage, size) that are automatically
 extracted by DataHub (which vary by integration). This makes automatically classifying the "important" assets in your organization quite easy, which
 is in turn critical for running effective Data Governance initiatives within your organization.

 For example, we can define a Metadata Test which selects all Snowflake Tables which are in the top 10% of "most queried"
 for the past 30 days, and then assign those Tables to a special "Tier 1" group using DataHub Tags, Glossary Terms, or Domains.

 ### Automated Data Governance Monitoring

 Metadata Tests allow you to define & monitor a set of rules that apply to assets in your data ecosystem (e.g. datasets, dashboards, etc.). This is particularly useful when attempting to govern
 your data, as it allows for the (1) definition and (2) measurement of centralized metadata standards, which are key for both bootstrapping
 and maintaining a well-governed data ecosystem.

 For example, we can define a Metadata Test which requires that all "Tier 1" data assets (e.g. those marked with a special Tag or Glossary Term)
 must have the following metadata:

-1. At least 1 explicit owner *and*
-2. High-level, human-authored documentation *and*
+1. At least 1 explicit owner _and_
+2. High-level, human-authored documentation _and_
 3. At least 1 Glossary Term from the "Classification" Term Group

 Then, we can closely monitor which assets are passing and failing these rules as we work to improve things over time.
-We can easily identify assets that are *in* and *out of* compliance with a set of centrally-defined standards.
+We can easily identify assets that are _in_ and _out of_ compliance with a set of centrally-defined standards.

 By applying automation, Metadata Tests
 can enable the full lifecycle of complex Data Governance initiatives - from scoping to execution to monitoring.

 ## Metadata Tests Setup, Prerequisites, and Permissions

 What you need to manage Metadata Tests on DataHub:

-* **Manage Tests** Privilege
+- **Manage Tests** Privilege

 This Platform Privilege allows users to create, edit, and remove all Metadata Tests on DataHub. Therefore, it should only be
 given to those users who will be serving as metadata Admins of the platform. The default `Admin` role has this Privilege.

 > Note that the Metadata Tests feature is currently limited in support for the following DataHub Asset Types:
 >
->- Dataset
->- Dashboard
->- Chart
->- Data Flow (e.g. Pipeline)
->- Data Job (e.g. Task)
->- Container (Database, Schema, Project)
+> - Dataset
+> - Dashboard
+> - Chart
+> - Data Flow (e.g. Pipeline)
+> - Data Job (e.g. Task)
+> - Container (Database, Schema, Project)
 >
 > If you'd like to see Metadata Tests for other asset types, please let your Acryl Customer Success partner know!

 ## Using Metadata Tests
@ -83,10 +80,10 @@ To begin building a new Metadata Test, click **Create new Test**.
 Inside the Metadata Test builder, we'll need to construct the 3 parts of a Metadata Test:

 1. **Selection Criteria** - Select assets that are in the scope of the test
 2. **Rules** - Define rules that selected assets can either pass or fail
 3. **Actions (Optional)** - Define automated actions to be taken for assets that are passing
    or failing the test

 <p align="center">
 <img width="80%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/metadata-test-create.png"/>
 </p>
@ -99,8 +96,8 @@ Once the test is created, the test will be evaluated for any assets which fall i
 or once every day).

 ##### Selecting Asset Types

-You must select at least one asset *type* from a set that includes Datasets, Dashboards, Charts, Data Flows (Pipelines), Data Jobs (Tasks),
+You must select at least one asset _type_ from a set that includes Datasets, Dashboards, Charts, Data Flows (Pipelines), Data Jobs (Tasks),
 and Containers.

 <p align="center">
@ -108,27 +105,26 @@ and Containers.
 </p>

 Entities with the selected types will be considered in scope, while those of other types will be considered out of scope and
 thus omitted from evaluation of the test.

 ##### Building Conditions

-**Property** conditions are the basic unit of comparison used for selecting data assets. Each **Property** condition consists of a target *property*,
-an *operator*, and an optional *value*.
+**Property** conditions are the basic unit of comparison used for selecting data assets. Each **Property** condition consists of a target _property_,
+an _operator_, and an optional _value_.

-A *property* is an attribute of a data asset. It can either be a technical signal (e.g. a **metric** such as usage or storage size) or a
-metadata signal (e.g. owners, domain, glossary terms, tags, and more), depending on the asset type and applicability of the signal.
-The full set of supported *properties* can be found in the table below.
+A _property_ is an attribute of a data asset. It can either be a technical signal (e.g. a **metric** such as usage or storage size) or a
+metadata signal (e.g. owners, domain, glossary terms, tags, and more), depending on the asset type and applicability of the signal.
+The full set of supported _properties_ can be found in the table below.

-An *operator* is the type of predicate that will be applied to the selected *property* when evaluating the test for an asset. The types
+An _operator_ is the type of predicate that will be applied to the selected _property_ when evaluating the test for an asset. The types
 of operators that are applicable depend on the selected property. Some examples of operators include `Equals`, `Exists`, `Matches Regex`,
 and `Contains`.

-A *value* defines the right-hand side of the condition, or a pre-configured value to evaluate the property and operator against. The type of the value
-is dependent on the selected *property* and *operator*. For example, if the selected *operator* is `Matches Regex`, the type of the
-value would be a string.
+A _value_ defines the right-hand side of the condition, or a pre-configured value to evaluate the property and operator against. The type of the value
+is dependent on the selected _property_ and _operator_. For example, if the selected _operator_ is `Matches Regex`, the type of the
+value would be a string.

 By selecting a property, operator, and value, we can create a single condition (or predicate) used for
 selecting a data asset to be tested. For example, we can build property conditions that match:

 - All datasets in the top 25% of query usage in the past 30 days
@ -143,9 +139,9 @@ To create a **Property** condition, simply click **Add Condition** then select *
 <img width="80%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/metadata-test-create-property-condition.png"/>
 </p>

 We can combine **Property** conditions using boolean operators including `AND`, `OR`, and `NOT`, by
 creating **Logical** conditions. To create a **Logical** condition, simply click **Add Condition** then select an
 **And**, **Or**, or **Not** condition.

 <p align="center">
 <img width="80%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/metadata-test-create-logical-condition.png"/>
@ -156,13 +152,12 @@ Logical conditions allow us to accommodate complex real-world selection requirem
 - All Snowflake Tables that are in the Top 25% of most queried AND do not have a Domain
 - All Looker Dashboards that do not have a description authored in Looker OR in DataHub

 #### Step 2: Defining Rules

 In the second step, we can define a set of conditions that selected assets must match in order to be "passing" the test.
 To do so, we can construct another set of **Property** conditions (as described above).

 > **Pro-Tip**: If no rules are supplied, then all assets that are selected by the criteria defined in Step 1 will be considered "passing".
 > If you need to apply an automated Action to the selected assets, you can leave the Rules blank and continue to the next step.

 <p align="center">
@ -171,11 +166,10 @@ To do so, we can construct another set of **Property** conditions (as described

 When combined with the selection criteria, Rules allow us to define complex, highly custom **Data Governance** policies such as:

 - All datasets in the top 25% of query usage in the past 30 days **must have an owner**.
 - All assets in the "Marketing" Domain **must have a description**
 - All Snowflake Tables that are in the Top 25% of most queried AND do not have a Domain **must have
   a Glossary Term from the Classification Term Group**

 ##### Validating Test Conditions
@ -183,18 +177,17 @@ During Step 2, we can quickly verify that the Selection Criteria & Rules we've a
 match our expectations by testing them against some existing assets indexed by DataHub.

 To verify your Test conditions, simply click **Try it out**, find an asset to test against by searching & filtering down your assets,
 and finally click **Run Test** to see whether the asset passes or fails the provided conditions.

 <p align="center">
 <img width="80%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/metadata-test-validate-conditions.png"/>
 </p>

 #### Step 3: Defining Actions (Optional)

 > If you don't wish to take any actions for assets that pass or fail the test, simply click 'Skip'.

 In the third step, we can define a set of Actions that will be automatically applied to each selected asset which passes or fails the Rules conditions.

 For example, we may wish to mark **passing** assets with a special DataHub Tag or Glossary Term (e.g. "Tier 1"), or remove these special markings from those which are failing.
 This allows us to automatically control classifications of data assets as they move in and out of compliance with the Rules defined in Step 2.
@ -210,46 +203,41 @@ A few of the supported Action types include:
 <img width="80%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/metadata-test-define-actions.png"/>
 </p>

 #### Step 4: Name, Category, Description

 In the final step, we can add a freeform name, category, and description for our new Metadata Test.

 ### Viewing Test Results

 Metadata Test results can be viewed in 2 places:

 1. On an asset profile page (e.g. Dataset profile page), inside the **Validation** tab.
 2. On the Metadata Tests management page. To view all assets passing or failing a particular test,
    simply click on the labels showing the number of passing or failing assets.

 <p align="center">
 <img width="50%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/metadata-test-view-results.png"/>
 </p>

 ### Updating an Existing Test

 To update an existing Test, simply click **Edit** on the test you wish to change.

 Then, make the changes required and click **Save**. When you save a Test, it may take up to 2 minutes for changes
 to be reflected across DataHub.

 ### Removing a Test

 To remove a Test, simply click on the trashcan icon located on the Tests list. This will remove the Test and
 deactivate it so that it is no longer evaluated.

 When you delete a Test, it may take up to 2 minutes for changes to be reflected.

 ### GraphQL

-* [listTests](../../graphql/queries.md#listtests)
-* [createTest](../../graphql/mutations.md#createtest)
-* [deleteTest](../../graphql/mutations.md#deletetest)
+- [listTests](../../graphql/queries.md#listtests)
+- [createTest](../../graphql/mutations.md#createtest)
+- [deleteTest](../../graphql/mutations.md#deletetest)

 ## FAQ and Troubleshooting
@ -263,19 +251,18 @@ Metadata Tests are evaluated in 2 scenarios:
 **Can I configure a custom evaluation schedule for my Metadata Test?**

 No, you cannot. Currently, the internal evaluator will ensure that tests are run continuously for
 each asset, regardless of whether it is being changed on DataHub.

 **How is a Metadata Test different from an Assertion?**

 An Assertion is a specific test, similar to a unit test, that is defined for a single data asset. Typically,
 it will include domain-specific knowledge about the asset and test against physical attributes of it. For example, an Assertion
 may verify that the number of rows for a specific table in Snowflake falls into a well-defined range.

 A Metadata Test is a broad-spanning predicate which applies to a subset of the Metadata Graph (e.g. across multiple
-data assets). Typically, it is defined against *metadata* attributes, as opposed to the physical data itself. For example,
+data assets). Typically, it is defined against _metadata_ attributes, as opposed to the physical data itself. For example,
 a Metadata Test may verify that ALL tables in Snowflake have at least 1 assigned owner, and a human-authored description.
 Metadata Tests allow you to manage broad policies across your entire data ecosystem driven by metadata, for example to
 augment a larger scale Data Governance initiative.

-*Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!*
+_Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!_