8 Commits

Author SHA1 Message Date
Derek Worthen
4c8ef97760
V3/main (#2190)
* Remove graph embedding and UMAP (#2048)

* Remove umap/layout operation

* Remove graph embedding

* Bump unified-search to GR 2.5.0

* Remove graph vis from unified-search

* Remove file filtering (#2050)

* Remove document filtering

* Semver

* Fix integ tests

* Fix file find tuple

* Fix another dangling find tuple

* Remove text unit grouping (#2052)

* Remove text unit group_by_columns

* Semver

* Fix default token split test

* Fix models in config test samples

* Fix token length in context sort test

* Fix document sort

* Re-implement hierarchical Leiden (#2049)

* Use graspologic-native hierarchical leiden

* Re-implement largest_connected_component

* Copy in modularity

* Use graspologic-native directly in pyproject

* Remove directed graph tests (we don't use this)

* Semver

* Remove graspologic dep

* Use 4.1 and text-embedding-3-large as defaults

* Update comment

* Clean vector store (#2077)

* clean vector store code

* fix

* fix launch.json

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Update v3/main missing config + functions (#2082)

* reduce schema fields (#2089)

* reduce schema fields

* fix launch.json

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Remove strategy dicts (#2090)

* Remove "strategy" from community reports config/workflow

* Remove extraction strategy from extract_graph

* Remove summarization strategy from extract_graph

* Remove strategy from claim extraction

* Strongly type prompt templates

* Remove strategy from embed_text

* Push hydrated params into community report workflows

* Push hydrated params into extract covariates

* Push hydrated params into extract graph NLP

* Push hydrated params into extract graph

* Push hydrated params into text embeddings

* Remove a few more low-level defaults

* Semver

* Remove configurable prompt delimiters

* Update smoke tests

* Remove fnllm (#2095)

* Sort deps alpha

* Remove multi search (#2093)

* Remove multi-search from CLI

* Remove multi-search from API

* Flatten vector_store config

* Push hydrated vector store down to embed_text

* Remove outputs from config

* Remove multi-search notebook/docs

* Add missing response_type in basic search API

* Fix basic search context and id mapping

* Fix v1 migration notebook

* Fix query entity search tests

* V3 docs and cleanup (#2100)

* Remove community contrib notebooks

* Add migration notebook and breaking changes page edits

* Update/polish docs

* Make model instance name configurable

* Add vector schema updates to v3 migration notebook

* Spellcheck

* Bump smoke test runtimes

* Remove document overwrite (#2101)

* remove document overwrite from vector store configuration

* remove document overwrite and refactor load documents method

* fix test

* fix test

* fix test

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Unified factory (#2105)

* Simplify Factory interface

* Migrate CacheFactory to standard base class

* Migrate LoggerFactory to standard base class

* Migrate StorageFactory to standard base class

* Migrate VectorStoreFactory to standard base class

* Update vector store example notebook

* Delete notebook outputs

* Move default providers into factories

* Move retry/limit tests into integ

* Split language model factories

* Set smoke test tpm/rpm

* Fix factory integ tests

* Add method to smoke test, switch text to 'fast'

* Fix text smoke config for fast workflow

* Add new workflows to text smoke test

* Convert input readers to a proper factory

* Remove covariates from fast smoke test

* Update docs for input factory

* Bump smoke runtime

* Even longer runtime

* min-csv timeout

* Remove unnecessary lambdas

* Prefix vector store (#2106)

* add prefix to vector store configuration and removal of container name

* docs updated

* change prefix property name

* change prefix property name

* feedback implemented

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* fix for container name

* Restructure project as monorepo. (#2111)

* Restructure project as monorepo.

* Fix formatting

* Storage fixes and cleanup (#2118)

* Fix pipeline recursion

* Remove base_dir from storage.find

* Remove max_count from storage.find

* Remove prefix on storage integ test

* Add base_dir in creation_date test

* Wrap base_dir in Path

* Use constants for input/update directories

* Nov 2025 housekeeping (#2120)

* Remove gensim sideload

* Split CI build/type checks from unit tests

* Thorough review of docs to align with v3

* Format

* Fix version

* Fix type

* Graphrag config (#2119)

* Add load_config to graphrag-common package.

* Empty graph guards (#2126)

* Remove networkx from graph_extractor and clean out redundancy

* Bubble pipeline error to console

* Remove embeddings optional new (#2128)

* remove optional embeddings

* fix test

* fix tests

* fix pipeline

* fix test

* fix test

* fix test

* fix tests

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Format

* Add empty checks for NLP graphs (#2133)

* Init command asks for models (#2137)

* Add init prompting for models

* Remove hard-coded model config validation

* Switch to typer option prompt for full CLI use with models

* Update getting started for init model input

* Bump request timeout and overall smoke test timeout

* Add graphrag-storage. (#2127)

* Add graphrag-storage.

* Python update (3.13) (#2149)

* Update to python 3.14 as default, with range down to 3.10

* Fix enum value in query cli

* Update pyarrow

* Update py version for storage package

* Remove 3.10

* add fastuuid

* Update Python support to 3.11-3.14 with stricter dependency constraints

- Set minimum Python version to 3.11 (removed 3.10 support)
- Added support for Python 3.14
- Updated CI workflows: single-version jobs use 3.14, matrix jobs use 3.11 and 3.14
- Fixed license format to use SPDX-compatible format for Python 3.14
- Updated pyarrow to >=22.0.0 for Python 3.14 wheel support
- Added explicit fastuuid~=0.14 and blis~=1.3 for Python 3.14 compatibility
- Replaced all loose version constraints (>=) with compatible release (~=) for better lock file control
- Applied stricter versioning to all packages: graphrag, graphrag-common, graphrag-storage, unified-search-app
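
The switch from loose (>=) to compatible release (~=) constraints described above can be illustrated with a minimal sketch. It assumes plain numeric versions only (no pre-releases, epochs, or local tags), and the helper names `parse` and `satisfies_compatible_release` are hypothetical, not part of any package in this repository. Under PEP 440, `~=1.3` behaves like `>=1.3, ==1.*`, and `~=1.3.3` behaves like `>=1.3.3, ==1.3.*`.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn a plain dotted version such as '1.3.3' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies_compatible_release(candidate: str, spec: str) -> bool:
    """Return True if `candidate` satisfies `~=spec` (numeric versions only)."""
    want = parse(spec)
    have = parse(candidate)
    if len(want) < 2:
        # PEP 440 does not allow ~= with a single-segment version.
        raise ValueError("~= needs at least two version segments")

    # Pad both versions with zeros so they compare segment by segment.
    width = max(len(have), len(want))
    have_padded = have + (0,) * (width - len(have))
    want_padded = want + (0,) * (width - len(want))

    # Everything except the last segment of the spec must match exactly,
    # and the candidate must not be older than the spec.
    prefix = want[:-1]
    return have_padded[: len(prefix)] == prefix and have_padded >= want_padded


if __name__ == "__main__":
    for candidate, spec in [("22.1", "22.0"), ("23.0", "22.0"), ("1.3.9", "1.3.3")]:
        print(f"{candidate} ~= {spec}:", satisfies_compatible_release(candidate, spec))
```

The practical effect is that a two-segment pin such as `blis~=1.3` still admits 1.4, while the three-segment pin `blis~=1.3.3` used later in this commit stays within the 1.3 series.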

* update uv lock

* Pin blis to ~=1.3.3 to ensure Python 3.14 wheel availability

* Update uv lock

* Update numpy to >=2.0.0 for Python 3.14 Windows compatibility

Numpy 1.25.x has access violation issues on Python 3.14 Windows.
Numpy 2.x has proper Python 3.14 support including Windows wheels.

* update uv lock

* Update pandas to >=2.3.0 for numpy 2.x compatibility

Pandas 2.2.x was compiled against numpy 1.x and causes ABI
incompatibility errors with numpy 2.x. Pandas 2.3.0+ supports
numpy 2.x properly.

* update uv.lock

* Add scipy>=1.15.0 for numpy 2.x compatibility

Scipy versions < 1.15.0 have C extensions built against numpy 1.x
and are incompatible with numpy 2.x, causing dtype size errors.

* update uv lock

* Update Python support to 3.11-3.13 with compatible dependencies

- Set Python version range to 3.11-3.13 (removed 3.14 support)
- Updated CI workflows: single-version jobs use 3.13, matrix jobs use 3.11 and 3.13
- Dependencies optimized for Python 3.13 compatibility:
  - pyarrow~=22.0 (has Python 3.13 wheels)
  - numpy~=1.26
  - pandas~=2.2
  - blis~=1.0
  - fastuuid~=0.13
- Applied stricter version constraints using ~= operator throughout
- Updated uv.lock with resolved dependencies

* Update numpy to 2.1+ and pandas to 2.3+ for Python 3.13 Windows compatibility

Numpy 1.26.x causes access violations on Python 3.13 Windows.
Numpy 2.1+ has proper Python 3.13 support with Windows wheels.
Pandas 2.3+ is required for numpy 2.x compatibility.

* update vsts.yml python version

* Add GraphRAG Cache package. (#2153)

* Add GraphRAG Cache package.

* Fix a bunch of module comments and function visibility (#2154)

* Issue #2004 fix (#2159)

* fix issue #2004 using KeenhoChu idea in his PR

* add unit test for dynamic community selection

* add unit test for dynamic community selection implementing #2158 logic

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Mismatch between header in community report generation prompt examples and input data (id vs human_readable_id) (#2161)

* fix issue #860 for mismatch in prompts and input

* fix format

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Chunker factory (#2156)

* Delete NoopTextSplitter

* Delete unused check_token_limit

* Add base chunking factory and migrate workflow to use it

* Split apart chunker module

* Co-locate chunking/splitting

* Collapse token splitting functionality into one class/function

* Restore create_base_text_units parameterization

* Move Tokenizer base class to common package

* Move pre-pending into chunkers

* Streamline config

* Fix defaults construction

* Add prepending tests

* Remove chunk_size_includes_metadata config

* Revert ChunkingDocument interface

* Move metadata prepending to a util

* Move Tokenizer back to GR core

* Fix tokenizer removal from chunker

* Set defaults for chunking config

* Move chunking to monorepo package

* Format

* Typo

* Add ChunkResult model

* Streamline chunking config

* Add missing version updates for graphrag_chunking

* Input factory (#2168)

* Update input factory to match other factories

* Move input config alongside input readers

* Move file pattern logic into InputReader

* Set encoding default

* Clean up optional column configs

* Combine structured data extraction

* Remove pandas from input loading

* Throw if empty documents

* Add json lines (jsonl) input support

* Store raw data

* Fix merge imports

* Move metadata handling entirely to chunking

* Nicer automatic title

* Typo

* Add get_property utility for nested dictionary access with dot notation

* Update structured_file_reader to use get_property utility

* Extract input module into new graphrag-input monorepo package

- Create new graphrag-input package with input loading utilities
- Move InputConfig, InputFileType, InputReader, TextDocument, and file readers (CSV, JSON, JSONL, Text)
- Add get_property utility for nested dictionary access with dot notation
- Include hashing utility for document ID generation
- Update all imports throughout codebase to use graphrag_input
- Add package to workspace configuration and release tasks
- Remove old graphrag.index.input module
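
As a rough illustration of the "nested dictionary access with dot notation" idea mentioned above, here is a minimal sketch. Only the name `get_property` comes from the commit message; the signature, the `default` parameter, and the error behavior are assumptions, and the actual utility in graphrag-input may differ.

```python
from typing import Any

# Sentinel so that None can be used as a legitimate default value.
_MISSING = object()


def get_property(data: dict[str, Any], path: str, default: Any = _MISSING) -> Any:
    """Look up a nested value using a dotted path such as 'meta.author.name'.

    Hypothetical behavior: raise KeyError when the path cannot be resolved
    and no default was supplied.
    """
    current: Any = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            if default is _MISSING:
                raise KeyError(path)
            return default
        current = current[key]
    return current


if __name__ == "__main__":
    record = {"title": "Report", "meta": {"author": {"name": "Ada"}}}
    print(get_property(record, "meta.author.name"))
    print(get_property(record, "meta.year", default=None))
```

A helper of this shape would let a structured file reader accept column settings like `meta.author.name` for JSON input without special-casing nested objects.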

* Rename ChunkResult to TextChunk and add transformer support

- Rename chunk_result.py to text_chunk.py with ChunkResult -> TextChunk
- Add 'original' field to TextChunk to track pre-transform text
- Add optional transform callback to chunker.chunk() method
- Add add_metadata transformer for prepending metadata to chunks
- Update create_chunk_results to apply transforms and populate original
- Update sentence_chunker and token_chunker with transform support
- Refactor create_base_text_units to use new transformer pattern
- Rename pluck_metadata to get/collect methods on TextDocument
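
The transformer pattern listed above might look roughly like the following sketch. The names `TextChunk`, `original`, and `add_metadata` come from the commit message; the remaining fields, the function signatures, and the fixed-width character splitting are assumptions made purely for illustration and do not reflect the actual chunkers.

```python
from __future__ import annotations

from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class TextChunk:
    original: str  # text before any transform was applied
    text: str      # text after the optional transform
    index: int     # position of the chunk within the document


def add_metadata(metadata: dict[str, str], delimiter: str = ": ") -> Callable[[str], str]:
    """Build a transform that prepends 'key: value' lines to a chunk."""
    header = "\n".join(f"{key}{delimiter}{value}" for key, value in metadata.items())

    def transform(text: str) -> str:
        return f"{header}\n{text}" if header else text

    return transform


def chunk(
    text: str, size: int, transform: Callable[[str], str] | None = None
) -> list[TextChunk]:
    """Split text into fixed-width pieces (a stand-in for real chunking)."""
    pieces = [text[i : i + size] for i in range(0, len(text), size)]
    return [
        TextChunk(original=piece, text=transform(piece) if transform else piece, index=i)
        for i, piece in enumerate(pieces)
    ]


if __name__ == "__main__":
    for item in chunk("abcdef", 4, transform=add_metadata({"title": "T"})):
        print(item)
```

Keeping `original` alongside the transformed `text` means downstream steps can still see the raw chunk even when metadata has been prepended.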

* Back-compat comment

* Align input config type name with other factory configs

* Add MarkItDown support

* Remove pattern default from MarkItDown reader

* Remove plugins flag (implicit disabled)

* Format

* Update verb tests

* Separate storage from input config

* Add empty objects for NaN raw_data

* Fix smoke tests

* Fix BOM in csv smoke

* Format

* DRIFT fixes (#2171)

* Use stable ids for community reports

* Remove deprecated title from embedding flow

* Remove embedding column from df loaders

* Fix lancedb insertion

* Add drift back to smoke tests

* Fix mock embedder to match default embedding length

* Fix DRIFT notebook

* Push drift_k_followups through to prompt

* Format

* Vector package (#2172)

* Extract graphrag-vectors package

* Simplify vector factory usage and config defaults

* Update factory integ initializers

* Fix mock patch

* Format

* Register vector stores in tests

* Set a default vector store name

* Update vector readme

* Remove impls from init

* Move some validation into impls

* Remove index_prefix

* Move duplicate method to base class

* Fix smoke vector config

* Update index bug (#2173)

* fix update index bug

* blob storage bug fix

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Add GraphRAG LLM package. (#2174)

* Update documentation for v3 release (#2176)

update documentation for v3 release

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Graphrag llm cleanup (#2181)

* Migration update (#2180)

* fix formatting.

---------

Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com>
Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>
Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com>
2026-01-27 10:23:45 -08:00
Nathan Evans
ac8a7f5eef
Housekeeping (#2086)
* Add deprecation warnings for fnllm and multi-search

* Fix dangling token_encoder refs

* Fix local_search notebook

* Fix global search dynamic notebook

* Fix global search notebook

* Fix drift notebook

* Switch example notebooks to use LiteLLM config

* Properly annotate dev deps as a group

* Semver

* Remove --extra dev

* Remove llm_model variable

* Ignore ruff ASYNC240

* Add note about expected broken notebook in docs

* Fix custom vector store notebook

* Push tokenizer throughout
2025-10-07 16:21:24 -07:00
Copilot
7c28c70d5c
Switch from Poetry to uv for package management (#2008)
* Initial plan

* Switch from Poetry to uv for package management

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Clean up build artifacts and update gitignore

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* remove build artifacts

* remove hardcoded version string

* fix calls to pip in cicd

* Update gh-pages.yml workflow to use uv instead of Poetry

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* ruff formatting fixes

* update cicd workflow with latest uv action

* fix command to retrieve package version

* update development instructions

* remove Poetry references

* Replace deprecated azuright action with npm-based Azurite installation

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* skip api version check for azurite

* add semversioner file

* update more changes from switching to UV

* Migrate unified-search-app from Poetry to uv package management

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* minor typo update

* minor Dockerfile update

* update cicd thresholds

* update pytest thresholds

* ruff fixes

* ruff fixes

* remove legacy npm settings that no longer apply

* Update Unified Search App Readme

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>
Co-authored-by: Josh Bradley <joshbradley@microsoft.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-08-13 18:57:25 -06:00
KennyZhang1
8368b12532
Add Cosmos DB storage/cache option (#1431)
* added cosmosdb constructor and database methods

* added rest of abstract method headers

* added cosmos db container methods

* implemented has and delete methods

* finished implementing abstract class methods

* integrated class into storage factory

* integrated cosmosdb class into cache factory

* added support for new config file fields

* replaced primary key cosmosdb initialization with connection strings

* modified cosmosdb setter to require json

* Fix non-default emitters

* Format

* Ruff

* ruff

* first successful run of cosmosdb indexing

* removed extraneous container_name setting

* require base_dir to be typed as str

* reverted merged changes from closed branch

* removed nested try statement

* readded initial non-parquet emitter fix

* added basic support for parquet emitter using internal conversions

* merged with main and resolved conflicts

* fixed more merge conflicts

* added cosmosdb functionality to query pipeline

* tested query for cosmosdb

* collapsed cosmosdb schema to use minimal containers and databases

* simplified create_database and create_container functions

* ruff fixes and semversioner

* spellcheck and ci fixes

* updated pyproject toml and lock file

* apply fixes after merge from main

* add temporary comments

* refactor cache factory

* refactored storage factory

* minor formatting

* update dictionary

* fix spellcheck typo

* fix default value

* fix pydantic model defaults

* update pydantic models

* fix init_content

* cleanup how factory passes parameters to file storage

* remove unnecessary output file type

* update pydantic model

* cleanup code

* implemented clear method

* fix merge from main

* add test stub for cosmosdb

* regenerate lock file

* modified set method to collapse parquet rows

* modified get method to collapse parquet rows

* updated has and delete methods and docstrings to adhere to new schema

* added prefix helper function

* replaced delimiter for prefixed id

* verified empty tests are passing

* fix merges from main

* add find test

* update cicd step name

* tested querying for new schema

* resolved errors from merge conflicts

* refactored set method to handle cache in new schema

* refactored get method to handle cache in new schema

* force unique ids to be written to cosmos for nodes

* found bug with has and delete methods

* modified has and delete to work with cache in new schema

* fix the merge from main

* minor typo fixes

* update lock file

* spellcheck fix

* fix init function signature

* minor formatting updates

* remove https protocol

* change localhost to 127.0.0.1 address

* update pytest to use bacj engine

* verified cache tests

* improved speed of has function

* resolved pytest error with find function

* added test for child method

* make container_name variable private as _container_name

* minor variable name fix

* cleanup cosmos pytest and make the cosmosdb storage class operations more efficient

* update cicd to use different cosmosdb emulator

* test with http protocol

* added pytest for clear()

* add longer timeout for cosmosdb emulator startup

* revert http connection back to https

* add comments to cicd code for future dev usage

* set container and database clients to None upon deletion

* ruff changes

* add comments to cicd code

* removed unneeded None statements and ruff fixes

* more ruff fixes

* Update test_run.py

* remove unnecessary call to delete container

* ruff format updates

* Reverted test_run.py

* fix ruff formatter errors

* cleanup variable names to be more consistent

* remove extra semversioner file

* revert pydantic model changes

* revert pydantic model change

* revert pydantic model change

* re-enable inline formatting rule

* update documentation in dev guide

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
Co-authored-by: Josh Bradley <joshbradley@microsoft.com>
2024-12-19 13:43:21 -06:00
Josh Bradley
0394b55086
Update CI/CD - skip running unit tests on documentation-only PRs (#1371)
2024-11-06 14:19:21 -05:00
Josh Bradley
3df6f8c65b
Allow ci/cd to skip draft PRs (#1314)
2024-10-23 12:46:00 -04:00
Alonso Guevara
044516f538
Clean and organize run index code (#1090)
* Create entrypoint for cli and api (#1067)

* Add cli and api entrypoints for update index

* Semver

* Update docs

* Run tests on feature branch main

* Better /main handling in tests

* Clean and organize run index code

* Ruff fix

* Pyright fix

* Format fixes

* Pyright fix

* Format

* Fix integ tests

* Fix ruff

* Reorganize and clean up
2024-09-05 08:15:10 -06:00
Nathan Evans
f5b4d2fea5
Ci streamline (#988)
* Remove excess vars from gh-pages build

* Delete redundant javascript ci

* Pull apart testing CI

* Clean up integration tests build

* Move storage tests to integration CI

* Take py 3.10 out of smoke tests matrix

* Use minimum supported python version for most tests

* Re-run main CI on any test change

* Add Josh and Kenny to author list

* Update auto-resolve perms
2024-08-21 15:16:15 -06:00