Jan documentation updates (#1612)

* Update workflow docs

* Docs cleanup
Nathan Evans 2025-01-10 11:36:27 -08:00 committed by GitHub
parent 63042d22f3
commit 0e7d22bfb0
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
3 changed files with 53 additions and 53 deletions


@ -51,6 +51,7 @@ Available scripts are:
- `poetry run poe test_unit` - This will execute unit tests.
- `poetry run poe test_integration` - This will execute integration tests.
- `poetry run poe test_smoke` - This will execute smoke tests.
- `poetry run poe test_verbs` - This will execute tests of the basic workflows.
- `poetry run poe check` - This will perform a suite of static checks across the package, including:
  - formatting
  - documentation formatting

View File

@ -7,7 +7,7 @@ The knowledge model is a specification for data outputs that conform to our data
- `Document` - An input document into the system. These represent either individual rows in a CSV or individual .txt files.
- `TextUnit` - A chunk of text to analyze. The size of these chunks, their overlap, and whether they adhere to any data boundaries may be configured below. A common use case is to set `CHUNK_BY_COLUMNS` to `id` so that there is a 1-to-many relationship between documents and TextUnits instead of a many-to-many.
- `Entity` - An entity extracted from a TextUnit. These represent people, places, events, or some other entity-model that you provide.
- `Relationship` - A relationship between two entities.
- `Covariate` - Extracted claim information, which contains statements about entities which may be time-bound.
- `Community` - Once the graph of entities and relationships is built, we perform hierarchical community detection on them to create a clustering structure.
- `Community Report` - The contents of each community are summarized into a generated report, useful for human reading and downstream search.
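To make the shape of this model concrete, here is a minimal, hypothetical sketch of a few of these artifacts as Python dataclasses. The field names are illustrative only and do not reflect the exact schema of the exported tables.

```python
from dataclasses import dataclass

# Illustrative only: field names are assumptions, not the exported schema.
@dataclass
class Document:
    id: str
    text: str

@dataclass
class TextUnit:
    id: str
    text: str
    document_ids: list[str]  # 1-to-many by default; many-to-many if configured

@dataclass
class Entity:
    title: str
    type: str
    description: str
    text_unit_ids: list[str]  # provenance back to the source text

@dataclass
class Relationship:
    source: str  # title of the source entity
    target: str  # title of the target entity
    description: str
```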
@ -24,7 +24,7 @@ title: Dataflow Overview
flowchart TB
subgraph phase1[Phase 1: Compose TextUnits]
documents[Documents] --> chunk[Chunk]
chunk --> textUnits[Text Units]
end
subgraph phase2[Phase 2: Graph Extraction]
textUnits --> graph_extract[Entity & Relationship Extraction]
@ -34,39 +34,37 @@ flowchart TB
end
subgraph phase3[Phase 3: Graph Augmentation]
graph_outputs --> community_detect[Community Detection]
community_detect --> community_outputs[Communities Table]
end
subgraph phase4[Phase 4: Community Summarization]
community_outputs --> summarized_communities[Community Summarization]
summarized_communities --> community_report_outputs[Community Reports Table]
end
subgraph phase5[Phase 5: Document Processing]
documents --> link_to_text_units[Link to TextUnits]
textUnits --> link_to_text_units
link_to_text_units --> document_outputs[Documents Table]
end
subgraph phase6[Phase 6: Network Visualization]
graph_outputs --> graph_embed[Graph Embedding]
graph_embed --> umap_entities[Umap Entities]
umap_entities --> combine_nodes[Final Nodes]
end
subgraph phase7[Phase 7: Text Embeddings]
textUnits --> text_embed[Text Embedding]
graph_outputs --> description_embed[Description Embedding]
community_report_outputs --> content_embed[Content Embedding]
end
```
## Phase 1: Compose TextUnits
The first phase of the default-configuration workflow is to transform input documents into _TextUnits_. A _TextUnit_ is a chunk of text that is used for our graph extraction techniques. They are also used as source references by extracted knowledge items, providing breadcrumbs and provenance that link concepts back to their original source text.
The chunk size (counted in tokens) is user-configurable. By default this is set to 300 tokens, although we've had positive experience with 1200-token chunks using a single "glean" step. (A "glean" step is a follow-on extraction.) Larger chunks result in lower-fidelity output and less meaningful reference texts; however, using larger chunks can result in much faster processing time.
The group-by configuration is also user-configurable. By default, we align our chunks to document boundaries, meaning that there is a strict 1-to-many relationship between Documents and TextUnits. In rare cases, this can be turned into a many-to-many relationship. This is useful when the documents are very short and we need several of them to compose a meaningful analysis unit (e.g. Tweets or a chat log).
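As a rough illustration, a minimal token-based chunker might look like the following. This is a sketch, not the pipeline's actual chunking implementation; the use of `tiktoken` and the function shape are assumptions, with the chunk size mirroring the default described above.

```python
import tiktoken

def chunk_text(text: str, chunk_size: int = 300, overlap: int = 100) -> list[str]:
    """Split text into overlapping, token-counted chunks (illustrative only)."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start : start + chunk_size]
        chunks.append(enc.decode(window))
        # Stop once the window reaches the end of the token stream.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```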
```mermaid
---
title: Documents into Text Chunks
@ -95,15 +93,15 @@ flowchart LR
### Entity & Relationship Extraction
In this first step of graph extraction, we process each text-unit in order to extract entities and relationships out of the raw text using the LLM. The output of this step is a subgraph-per-TextUnit containing a list of **entities** with a _title_, _type_, and _description_, and a list of **relationships** with a _source_, _target_, and _description_.
These subgraphs are merged together - any entities with the same _title_ and _type_ are merged by creating an array of their descriptions. Similarly, any relationships with the same _source_ and _target_ are merged by creating an array of their descriptions.
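A minimal sketch of the merge step (the pipeline performs this over dataframes; the record shape here is an assumption):

```python
def merge_entities(extracted: list[dict]) -> list[dict]:
    """Collapse entity records that share a (title, type) key,
    accumulating their descriptions for later summarization."""
    merged: dict[tuple[str, str], dict] = {}
    for ent in extracted:
        key = (ent["title"], ent["type"])
        record = merged.setdefault(
            key, {"title": ent["title"], "type": ent["type"], "descriptions": []}
        )
        record["descriptions"].append(ent["description"])
    return list(merged.values())
```

Relationships are merged the same way, keyed on (_source_, _target_).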
### Entity & Relationship Summarization
Now that we have a graph of entities and relationships, each with a list of descriptions, we can summarize these lists into a single description per entity and relationship. This is done by asking the LLM for a short summary that captures all of the distinct information from each description. This allows all of our entities and relationships to have a single concise description.
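As a rough sketch, the summarization call might look like the following. The OpenAI client, prompt, and model name are illustrative assumptions; the pipeline's actual prompt and model are set by configuration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; model choice is illustrative

def summarize_descriptions(title: str, descriptions: list[str]) -> str:
    """Collapse an entity's accumulated descriptions into one concise summary."""
    prompt = (
        f"Combine the following descriptions of '{title}' into a single, "
        "concise description that preserves all distinct information:\n"
        + "\n".join(f"- {d}" for d in descriptions)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```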
### Claim Extraction (optional)
Finally, as an independent workflow, we extract claims from the source TextUnits. These claims represent positive factual statements with an evaluated status and time-bounds. These get exported as a primary artifact called **Covariates**.
@ -111,27 +109,23 @@ Note: claim extraction is _optional_ and turned off by default. This is because
## Phase 3: Graph Augmentation
Now that we have a usable graph of entities and relationships, we want to understand their community structure. Communities give us an explicit way of understanding the topological structure of our graph.
```mermaid
---
title: Graph Augmentation
---
flowchart LR
cd[Leiden Hierarchical Community Detection] --> ag[Graph Tables]
```
### Community Detection
In this step, we generate a hierarchy of entity communities using the Hierarchical Leiden Algorithm. This method will apply a recursive community-clustering to our graph until we reach a community-size threshold. This will allow us to understand the community structure of our graph and provide a way to navigate and summarize the graph at different levels of granularity.
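A minimal sketch of this step, assuming the graspologic library's hierarchical Leiden implementation (an assumption about tooling; any hierarchical Leiden implementation would work similarly):

```python
import networkx as nx
from graspologic.partition import hierarchical_leiden

# Build a toy entity graph; in the pipeline this is the extracted
# entity-relationship graph.
graph = nx.Graph()
graph.add_edges_from([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])

# max_cluster_size is the community-size threshold at which recursion stops.
partitions = hierarchical_leiden(graph, max_cluster_size=10)
for p in partitions:
    print(p.node, p.cluster, p.level)
```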
### Graph Tables
Once our graph augmentation steps are complete, the final **Entities**, **Relationships**, and **Communities** tables are exported.
## Phase 4: Community Summarization
@ -140,10 +134,10 @@ Once our graph augmentation steps are complete, the final **Entities** and **Rel
title: Community Summarization
---
flowchart LR
sc[Generate Community Reports] --> ss[Summarize Community Reports] --> co[Community Reports Table]
```
At this point, we have a functional graph of entities and relationships and a hierarchy of communities for the entities.
Now we want to build on the communities data and generate reports for each community. This gives us a high-level understanding of the graph at several levels of granularity. For example, if community A is the top-level community, we'll get a report about the entire graph. If the community is lower-level, we'll get a report about a local cluster.
@ -155,13 +149,9 @@ In this step, we generate a summary of each community using the LLM. This will a
In this step, each _community report_ is then summarized via the LLM for shorthand use.
### Community Reports Table
At this point, some bookkeeping work is performed and we export the **Community Reports** table.
## Phase 5: Document Processing
@ -172,7 +162,7 @@ In this phase of the workflow, we create the _Documents_ table for the knowledge
title: Document Processing
---
flowchart LR
aug[Augment] --> dp[Link to TextUnits] --> dg[Documents Table]
```
### Augment with Columns (CSV Only)
@ -183,15 +173,11 @@ If the workflow is operating on CSV data, you may configure your workflow to add
In this step, we link each document to the text-units that were created in the first phase. This allows us to understand which documents are related to which text-units and vice-versa.
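A sketch of this linking with pandas (column names are illustrative, not the exact output schema):

```python
import pandas as pd

# Each text unit carries the ids of the documents it came from.
text_units = pd.DataFrame({"id": ["tu1", "tu2"], "document_ids": [["d1"], ["d1"]]})

# Explode into one (document_id, text_unit_id) link per row.
links = (
    text_units.explode("document_ids")
    .rename(columns={"id": "text_unit_id", "document_ids": "document_id"})
)

# Attach the list of linked text units to each document.
documents = pd.DataFrame({"id": ["d1"], "title": ["sample.txt"]})
documents = documents.merge(
    links.groupby("document_id")["text_unit_id"].apply(list).rename("text_unit_ids"),
    left_on="id",
    right_index=True,
    how="left",
)
```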
### Documents Table
At this point, we can export the **Documents** table into the knowledge model.
## Phase 6: Network Visualization (optional)
In this phase of the workflow, we perform some steps to support network visualization of our high-dimensional vector spaces within our existing graphs. At this point there are two logical graphs at play: the _Entity-Relationship_ graph and the _Document_ graph.
@ -200,7 +186,27 @@ In this phase of the workflow, we perform some steps to support network visualiz
title: Network Visualization Workflows
---
flowchart LR
ag[Graph Tables] --> ge[Node2Vec Graph Embedding] --> ne[Umap Entities] --> ng[Nodes Table]
```
### Graph Embedding
In this step, we generate a vector representation of our graph using the Node2Vec algorithm. This will allow us to understand the implicit structure of our graph and provide an additional vector-space in which to search for related concepts during our query phase.
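A sketch of this step, assuming graspologic's Node2Vec implementation (an assumption; any Node2Vec library would work similarly):

```python
import networkx as nx
from graspologic.embed import node2vec_embed

# Embed a toy graph; in the pipeline this is the entity-relationship graph.
graph = nx.karate_club_graph()
embeddings, node_labels = node2vec_embed(graph, dimensions=128)
print(embeddings.shape)  # one 128-dimensional vector per node
```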
### Dimensionality Reduction
For each of the logical graphs, we perform a UMAP dimensionality reduction to generate a 2D representation of the graph. This will allow us to visualize the graph in a 2D space and understand the relationships between the nodes in the graph. The UMAP embeddings are then exported as a table of _Nodes_. The rows of this table include the UMAP dimensions as x/y coordinates.
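For example, using the umap-learn package (random vectors stand in for the real graph embeddings here):

```python
import numpy as np
import umap  # from the umap-learn package

# Project high-dimensional node embeddings down to 2D x/y coordinates.
embeddings = np.random.rand(100, 128)
coords = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
print(coords.shape)  # (100, 2): one (x, y) pair per node
```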
## Phase 7: Text Embedding
For all artifacts that require downstream vector search, we generate text embeddings as a final step. These embeddings are written directly to a configured vector store. By default we embed entity descriptions, text unit text, and community report text.
```mermaid
---
title: Text Embedding Workflows
---
flowchart LR
textUnits[Text Units] --> text_embed[Text Embedding]
graph_outputs[Graph Tables] --> description_embed[Description Embedding]
community_report_outputs[Community Reports] --> content_embed[Content Embedding]
```
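A minimal sketch of this final step, assuming the OpenAI embeddings API and a LanceDB vector store (both choices here are illustrative; GraphRAG's actual model and store are set in configuration):

```python
import lancedb
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
texts = ["an entity description", "a text unit", "a community report summary"]
response = client.embeddings.create(model="text-embedding-3-small", input=texts)

# One row per embedded text, with the vector alongside the source text.
rows = [
    {"id": i, "text": text, "vector": item.embedding}
    for i, (text, item) in enumerate(zip(texts, response.data))
]
db = lancedb.connect("./vector_store")
db.create_table("embeddings", data=rows)
```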


@ -10,7 +10,7 @@ Indexing Pipelines are configurable. They are composed of workflows, standard an
- embed entities into a graph vector space
- embed text chunks into a textual vector space
The outputs of the pipeline are stored as Parquet tables by default, and embeddings are written to your configured vector store.
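For example, the emitted tables can be loaded directly with pandas (the file names here are illustrative of typical outputs, not a guaranteed contract):

```python
import pandas as pd

# Read two of the pipeline's Parquet tables from the output directory.
entities = pd.read_parquet("output/entities.parquet")
relationships = pd.read_parquet("output/relationships.parquet")
print(entities.head())
```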
## Getting Started
@ -18,7 +18,6 @@ The outputs of the pipeline can be stored in a variety of formats, including JSO
See the [requirements](../developing.md#requirements) section in [Get Started](../get_started.md) for details on setting up a development environment.
To configure GraphRAG, see the [configuration](../config/overview.md) documentation.
After you have a config file you can run the pipeline using the CLI or the Python API.
@ -29,12 +28,6 @@ After you have a config file you can run the pipeline using the CLI or the Pytho
```bash
# Via Poetry
poetry run poe cli --root <data_root> # default config mode
```
### Python API