48 Commits

Author SHA1 Message Date
Nathan Evans
27c6de846f
Update docs for 2.0+ (#1984)
* Update docs

* Fix prompt links
2025-06-23 13:49:47 -07:00
Nathan Evans
36948b8d2e
Various minor updates (#1932)
* Add text unit ids to Community model

* Add graph utilities

* Turn off LCC for clustering by default

* Simplify embeddings config/flow

* Semver
2025-05-16 14:48:53 -07:00
Nathan Evans
25bbae8642
Docs: Add models page (#1842)
* Add models page

* Update config docs for new params

* Spelling

* Add comment on CoT with o-series

* Add notes about managed identity

* Update the viz guide

* Spruce up the getting started wording

* Capitalization

* Add BYOG page

* More BYOG edits

* Update dictionary

* Change example model name
2025-04-28 17:36:08 -07:00
Nathan Evans
56e0fad218
NLP graph parity (#1888)
* Update stopwords config

* Minor edits

* Update PMI

* Format

* Perf improvements

* Semver

* Remove edge collection apply

* Remove source/target apply

* Add edge weight to graph snapshot

* Revert breaking optimizations

* Add perf fixes back in

* Format/types

* Update defaults

* Fix source/target ordering

* Fix test
2025-04-25 17:09:06 -06:00
Nathan Evans
3b1e70c06b
Update config docs (2.1.0) (#1818)
* Align docs with config

* Semver

* Spelling

* Format

* Spelling
2025-03-18 12:39:30 -07:00
Nathan Evans
ddc6541ab6
Add docs page about input formats (#1784)
* Add docs page about input formats

* Add json example

* Spelling
2025-03-11 17:37:46 -07:00
Nathan Evans
321d479ab6
Update notebooks for 2.0 (#1785)
* Update API overview

* Fix global search example

* Fix local search example

* Fix global dynamic example

* Fix drift example

* Update multi-index example

* Semver
2025-03-11 17:23:49 -07:00
Nathan Evans
bcb74789f1
Next release docs (#1627)
* Wording updates

* Update yaml config and add notes to "deprecated" env

* Add basic search section

* Update versioning docs

* Minor edits for clarity

* Update init command

* Update init to add --force in docs

* Add NLP extraction params

* Move vector_store to root

* Add workflows to config

* Add FastGraphRAG docs

* add metadata column changes

* Added documentation for multi index search.

* Minor fixes.

* Add config and table renames

* Update migration notebook and comments to specify v1

* Add frequency to entity table docs

* add new chunking options for metadata

* Update output docs

* Minor edits and cleanup

* Add model ids to search configs

* Spruce up migration notebook

* Lint/format multi-index notebook

* SpaCy model note

* Update SpaCy footnote

* Updated multi_index_search.ipynb to remove ruff errors.

* add spacy to dictionary

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
Co-authored-by: Dayenne Souza <ddesouza@microsoft.com>
Co-authored-by: dorbaker <dorbaker@microsoft.com>
2025-03-03 14:46:00 -08:00
Nathan Evans
61a309b182
Incremental model alignment (#1766)
* Used shared schema lists for all final columns

* Semver
2025-02-25 13:14:42 -06:00
Nathan Evans
981fd31963
Community children (#1704)
* Add children to the community tables

* Replace NaN children with empty list

* Replace subcommunity logic with built-in parent/child fields

* Remove restore_community_hierarchy

* Add children and frequency to migration notebook

* Format

* Semver

* Add children to reports

* Update tests

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-02-13 17:03:51 -08:00
Josh Bradley
b8b949f3bb
Cleanup query api - remove code duplication (#1690)
* consolidate query api functions and remove code duplication

* refactor and remove more code duplication

* Add semversioner file

* fix basic search

* fix drift search and update base class function names

* update example notebooks
2025-02-13 16:31:08 -05:00
Nathan Evans
c02ab0984a
Streamline workflows (#1674)
* Remove create_final_nodes

* Rename final entity output to "entities"

* Remove duplicate code from graph extraction

* Rename create_final_relationships output to "relationships"

* Rename create_final_communities output to "communities"

* Combine compute_communities and create_final_communities

* Rename create_final_covariates output to "covariates"

* Rename create_final_community_reports output to "community_reports"

* Rename create_final_text_units output to "text_units"

* Rename create_final_documents output to "documents"

* Remove transient snapshots config

* Move create_final_entities to finalize_entities operation

* Move create_final_relationships flow to finalize_relationships operation

* Reuse some community report functions

* Collapse most of graph and text unit-based report generation

* Unify schemas files

* Move community reports extractor

* Move NLP report prompt to prompts folder

* Fix a few pandas warnings

* Rename embeddings config to embed_text

* Rename claim_extraction config to extract_claims

* Remove nltk from standard graph extraction

* Fix verb tests

* Fix extract graph config naming

* Fix moved file reference

* Create v1-to-v2 migration notebook

* Semver

* Fix smoke test artifact count

* Raise tpm/rpm on smoke tests

* Update drift settings for smoke tests

* Reuse project directory var in api notebook

* Format

* Format
2025-02-07 11:11:03 -08:00
JunHo Kim (김준호)
30f36316af
Fix typo in table formatting in env_vars documentation (#1632)
Corrected a missing backtick in a note within the `GRAPHRAG_API_KEY` description. This ensures proper code formatting and improves readability in the documentation. No content was altered aside from formatting adjustments.

Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
2025-02-04 13:14:58 -08:00
Shamik
053bf60162
Update auto_prompt_tuning.md (#1659)
Updated the auto prompt tuning doc with `--selection-method` instead of only `--method` as per the latest API.

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-01-27 13:33:25 -06:00
Derek Worthen
c644338bae
Refactor config (#1593)
* Refactor config

- Add new ModelConfig to represent LLM settings
    - Combines LLMParameters, ParallelizationParameters, encoding_model, and async_mode
- Add top level models config that is a list of available LLM ModelConfigs
- Remove LLMConfig inheritance and delete LLMConfig
    - Replace the inheritance with a model_id reference to the ModelConfig listed in the top level models config
- Remove all fallbacks and hydration logic from create_graphrag_config
    - This removes the automatic env variable overrides
- Support env variables within config files using Templating
    - This requires "$" to be escaped with extra "$" so ".*\\.txt$" becomes ".*\\.txt$$"
- Update init content to initialize new config file with the ModelConfig structure

* Use dict of ModelConfig instead of list

* Add model validations and unit tests

* Fix ruff checks

* Add semversioner change

* Fix unit tests

* validate root_dir in pydantic model

* Rename ModelConfig to LanguageModelConfig

* Rename ModelConfigMissingError to LanguageModelConfigMissingError

* Add validation for unexpected API keys

* Allow skipping pydantic validation for testing/mocking purposes.

* Add default lm configs to verb tests

* smoke test

* remove config from flows to fix llm arg mapping

* Fix embedding llm arg mapping

* Remove timestamp from smoke test outputs

* Remove unused "subworkflows" smoke test properties

* Add models to smoke test configs

* Update smoke test output path

* Send logs to logs folder

* Fix output path

* Fix csv test file pattern

* Update placeholder

* Format

* Instantiate default model configs

* Fix unit tests for config defaults

* Fix migration notebook

* Remove create_pipeline_config

* Remove several unused config models

* Remove indexing embedding and input configs

* Move embeddings function to config

* Remove skip_workflows

* Remove skip embeddings in favor of explicit naming

* fix unit test spelling mistake

* self.models[model_id] is already a language model. Remove redundant casting.

* update validation errors to instruct users to rerun graphrag init

* instantiate LanguageModelConfigs with validation

* skip validation in unit tests

* update verb tests to use default model settings instead of skipping validation

* test using llm settings

* cleanup verb tests

* remove unsafe default model config

* remove the ability to skip pydantic validation

* remove None union types when default values are set

* move vector_store from embeddings to top level of config and delete resolve_paths

* update vector store settings

* fix vector store and smoke tests

* fix serializing vector_store settings

* fix vector_store usage

* fix vector_store type

* support cli overrides for loading graphrag config

* rename storage to output

* Add --force flag to init

* Remove run_id and resume, fix Drift config assignment

* Ruff

---------

Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-01-21 17:52:06 -06:00
Alonso Guevara
e21a38f2ab
Fix/notebooks (#1614)
* Add new inputs and missing vector store for retrieving vectors

* Format

* Semver

* Remove .Identifier files

* Fix spellcheck

* Remove unnecessary input file for notebooks
2025-01-13 17:41:39 -06:00
Nathan Evans
0e7d22bfb0
Jan documentation updates (#1612)
* Update workflow docs

* Docs cleanup
2025-01-10 11:36:27 -08:00
Nathan Evans
7ec9ef0261
Refactor callbacks (#1583)
* Unify Workflow and Verb callbacks interfaces

* Semver

* Fix storage class instantiation (#1582)

---------

Co-authored-by: Josh Bradley <joshbradley@microsoft.com>
2025-01-06 10:58:59 -08:00
Nathan Evans
a35cb12741
Remove datashaper strip code (#1581)
Remove datashaper
2025-01-03 13:59:26 -08:00
Alonso Guevara
2abd6c5f5c
Update blog posts (#1571)
2024-12-30 17:16:08 -06:00
Josh Bradley
983664397b
Update doc site with api overview notebook (#1509)
update doc site
2024-12-12 16:08:24 -05:00
Josh Bradley
823342188d
Cleanup factory methods (#1482)
* cleanup factory methods to have similar design pattern across codebase

* add semversioner file

* cleanup logging factory

* update developer guide

* add comment

* typo fix

* cleanup reporter terminology

* rename reporter to logger

* fix comments

* update comment

* instantiate factory classes correctly and update index api callback parameter

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-12-10 16:11:11 -06:00
Alonso Guevara
04405803db
Add Parent to communities in data model (#1491)
* Add Parent to communities in data model

* Semver

* Pyright

* Update docs

* Use leiden cluster parent id

* Format
2024-12-10 14:38:11 -06:00
Nathan Evans
61816e076f
Migration notebook (#1492)
* Add migration notebook

* Update migration instructions

* Semver

* Rename item in relationships table

* Remove indexing vector store shim

* Remove query shims

* Remove columns from migrated data

* Format

* Add community parents
2024-12-10 14:23:26 -06:00
Josh Bradley
b00142260d
Update index API + a notebook that provides a general API overview (#1454)
* update index api to accept callbacks

* fix hardcoded folder name that was creating an empty folder

* add API notebook

* add semversioner file

* filename change

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-12-05 15:34:21 -06:00
Nathan Evans
d17dfd01f9
Graph collapse (#1464)
* Refactor graph creation

* Semver

* Spellcheck

* Update integ pipeline

* Fix cast

* Improve pandas chaining

* Cleaner apply

* Use list comprehensions

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-12-05 11:57:26 -06:00
Josh Bradley
dad2176b3c
Miscellaneous code cleanup procedures (#1452)
2024-11-27 13:27:43 -05:00
Nathan Evans
0b2120ca45
Docs and notebooks update (#1451)
* Fix local question gen and example notebook

* Update global search notebook

* Add lazy blog post

* Update breaking changes doc for migration notes

* Simplify Getting Started page

* Semver

* Spellcheck

* Fix types

* Add comments on cache-free migration

* Update wording

* Spelling

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-11-27 09:56:48 -08:00
Josh Bradley
22a57d14c7
Improve CLI speed with lazy imports (#1319)
2024-11-15 19:41:10 -05:00
Nathan Evans
9b4f24ebce
First cut at config cleanup (#1411)
* First cut at config cleanup

* Reorder top nav

* Add query prompts to tuning page

* Remove dynamic notebook from nav

* Add more thorough yml config descriptions in docs

* Further clean out the config

* Semver

* Add new blog post

* Emphasize yaml

* Clarify output

* Fix unit test

* Fix bullet nesting
2024-11-15 14:33:26 -08:00
Nathan Evans
425dbc60e3
Docs update (#1408)
* Fix footer contrast

* Fix broken links

* Remove a few unneeded examples

* Point python API example to the whole folder

* Convert schema bullets to tables
2024-11-14 21:26:29 -06:00
JunHo Kim (김준호)
ec9cdcce4d
fix typo. Correct the wording "global search" to "drift search" in drift search documentation (#1383)
Updated the wording of the example scenario from "global search" to "drift search" to accurately reflect the topic. This improves clarity and ensures the documentation accurately describes its content.

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-11-14 16:55:44 -06:00
Nathan Evans
51912b2e03
Move prompts (#1404)
* Move indexing prompts to root

* Move query prompts to root

* Export query prompts during init

* Extract general knowledge prompt

* Load query prompts from disk

* Semver

* Fix unit tests
2024-11-14 10:45:37 -08:00
Nathan Evans
c8c354e357
Artifact cleanup (#1341)
* Add source documents for verb tests

* Remove entity_type erroneous column

* Add new test data

* Remove source/target degree columns

* Remove top_level_node_id

* Remove chunk column configs

* Rename "chunk" to "text"

* Rename "chunk" to "text" in base

* Re-map document input to use base text units

* Revert base text units as final documents dep

* Update test data

* Split/rename node source_id

* Drop node size (dup of degree)

* Drop document_ids from covariates

* Remove unused document_ids from models

* Remove n_tokens from covariate table

* Fix missed document_ids delete

* Wire base text units to final documents

* Rename relationship rank as combined_degree

* Add rank as first-class property to Relationship

* Remove split_text operation

* Fix relationships test parquet

* Update test parquets

* Add entity ids to community table

* Remove stored graph embedding columns

* Format

* Semver

* Fix JSON typo

* Spelling

* Rename lancedb

* Sort lancedb

* Fix unit test

* Fix test to account for changing period

* Update tests for separate embeddings

* Format

* Better assertion printing

* Fix unit test for windows

* Rename document.raw_content -> document.text

* Remove read_documents function

* Remove unused document summary from model

* Remove unused imports

* Format

* Add new snapshots to default init

* Use util to construct embeddings collection name

* Align inc index model with branch changes

* Update data and tests for int ids

* Clean up embedding locs

* Switch entity "name" to "title" for consistency

* Fix short_id -> human_readable_id defaults

* Format

* Rework community IDs

* Fix community size compute

* Fix unit tests

* Fix report read

* Pare down nodes table output

* Fix unit test

* Fix merge

* Fix community loading

* Format

* Fix community id report extraction

* Update tests

* Consistent short IDs and ordering

* Update ordering and tests

* Update incremental for new nodes model

* Guard document columns loc

* Match column ordering

* Fix document guard

* Update smoke tests

* Fill NA on community extract

* Logging for smoke test debug

* Add parquet schema details doc

* Fix community hierarchy guard

* Use better empty hierarchy guard

* Back-compat shims

* Semver

* Fix warning

* Format

* Remove default fallback

* Reuse key
2024-11-13 15:11:19 -08:00
Alonso Guevara
e53422366d
Implement dynamic community selection for global search (#1396)
* update gitignore

* add dynamic community selection to updated main branch

* update SearchResult to record output_tokens.

* update search result

* dynamic search working

* format

* add llm_calls_categories and prompt_tokens and output_tokens cate

* update

* formatting

* log drift search output and prompt tokens separately

* update global_search.ipynb. update operate dulce dataset and add create_final_communities. update dynamic community selection init

* add .ipynb back to cspell.config.yaml

* format

* add notebook example on dynamic search

* rearrange

* update gitignore

* format code

* code format

* code format

* fix default variable

---------

Co-authored-by: Bryan Li <bryanlimy@gmail.com>
2024-11-11 16:45:07 -08:00
Josh Bradley
a8ccded83c
Fix file path issue in the viz guide (#1372)
* Fix a file paths issue in the viz guide.

* fix formatting
2024-11-06 14:42:07 -08:00
Alonso Guevara
2047c1561c
Fix styling and misalignment on drift docs (#1373)
2024-11-06 16:29:53 -06:00
Josh Bradley
9762f33c1a
Add visualization guide (#1340)
2024-11-06 14:06:50 -05:00
Alonso Guevara
1557ce34f9
Fix init defaults for vector store and img in drift docs (#1357)
* Fix init defaults for vector store and img in drift docs

* Add more docs

* Spellcheck

* Remove example
2024-11-05 14:14:17 -06:00
Alonso Guevara
d9f985ae52
Drift Search CLI, API, Docs and Example Notebook (#1348)
* Drift CLI and backwards compat

* Adding DRIFT Cli, Docs and example notebook

* Update tests and fix ruff

* Format

* Small cleanup

* Fix smoke tests

* Update notebook

* Oopsie fix

* Delete duplicate img
2024-11-05 12:05:19 -06:00
Nathan Evans
634e3ed62a
Transient entity graph (#1349)
* Make base_entity_graph transient

* Add transient snapshots

* Semver

* Fix unit test

* Fix smoke tests
2024-11-04 17:23:29 -08:00
gaudyb
17658c5df8
New workflow to generate embeddings in a single workflow (#1296)
* New workflow to generate embeddings in a single workflow

* New workflow to generate embeddings in a single workflow

* version change

* clean tests without any embeddings references

* clean tests without any embeddings references

* remove code

* feedback implemented

* changes in logic

* feedback implemented

* store in table bug fixed

* smoke test for generate_text_embeddings workflow

* smoke test fix

* add generate_text_embeddings to the list of transient workflows

* smoke tests

* fix

* ruff formatting updates

* fix

* smoke test fixed

* smoke test fixed

* fix lancedb import

* smoke test fix

* ignore sorting

* smoke test fixed

* smoke test fixed

* check smoke test

* smoke test fixed

* change config for vector store

* format fix

* vector store changes

* revert debug profile back to empty filepath

* merge conflict solved

* merge conflict solved

* format fixed

* format fixed

* fix return dataframe

* snapshot fix

* format fix

* embeddings param implemented

* validation fixes

* fix map

* fix map

* fix properties

* config updates

* smoke test fixed

* settings change

* Update collection config and rework back-compat

* Replace . with - for embedding store

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
Co-authored-by: Josh Bradley <joshbradley@microsoft.com>
Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
2024-11-01 15:01:35 -07:00
Josh Bradley
083de12bcf
Auto-generate CLI doc pages (#1325)
2024-10-25 19:00:24 -04:00
Josh Bradley
d6e6f5c077
Convert CLI to Typer app (#1305)
2024-10-24 14:22:32 -04:00
KennyZhang1
e0840a2dc4
Fix vector store logic and refactor audience parameter (#1259)
2024-10-21 16:56:56 -04:00
Andres Morales
fc502ee029
Fix cookie consent script missing (#1292)
2024-10-17 09:44:14 -06:00
Andres Morales
137a5cd550
Fix/docs auto prompt img (#1283)
* Fix auto prompt tuning image path
2024-10-14 09:02:31 -06:00
Andres Morales
fc9895f793
Replace current docs by mkdocs (#1263)
* Replace docs by mkdocs-material

* Fix markdown

* Fix versions in gh-pages workflow

* remove whitespaces

* add semver

* Add build docs check on python-ci

* Fix command in index cli

* Spellcheck

* Spellcheck

* remove docsite paths

* clear outputs from notebook

* remove dependabot npm for docsite

* remove more docsite leftovers

* execute notebooks

* Update notebooks

* update poetry lock

* Remove notebook build from ci

* Revert dep update

* Navigation tabs

* Fix stylesheet

* add kwds to dictionary

* Turn on notebook execution

* Update gitignore

* Add MSR Blog posts

* spellcheck

* Accessibility Changes

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-10-11 13:39:03 -06:00