Derek Worthen 4c8ef97760
V3/main (#2190)
* Remove graph embedding and UMAP (#2048)

* Remove umap/layout operation

* Remove graph embedding

* Bump unified-search to GR 2.5.0

* Remove graph vis from unified-search

* Remove file filtering (#2050)

* Remove document filtering

* Semver

* Fix integ tests

* Fix file find tuple

* Fix another dangling find tuple

* Remove text unit grouping (#2052)

* Remove text unit group_by_columns

* Semver

* Fix default token split test

* Fix models in config test samples

* Fix token length in context sort test

* Fix document sort

* Re-implement hierarchical Leiden (#2049)

* Use graspologic-native hierarchical leiden

* Re-implement largest_connected_component

* Copy in modularity

* Use graspologic-native directly in pyproject

* Remove directed graph tests (we don't use this)

* Semver

* Remove graspologic dep

* Use 4.1 and text-embedding-3-large as defaults

* Update comment

* Clean vector store (#2077)

* clean vector store code

* fix

* fix launch.json

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Update v3/main missing config + functions (#2082)

* reduce schema fields (#2089)

* reduce schema fields

* fix launch.json

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Remove strategy dicts (#2090)

* Remove "strategy" from community reports config/workflow

* Remove extraction strategy from extract_graph

* Remove summarization strategy from extract_graph

* Remove strategy from claim extraction

* Strongly type prompt templates

* Remove strategy from embed_text

* Push hydrated params into community report workflows

* Push hydrated params into extract covariates

* Push hydrated params into extract graph NLP

* Push hydrated params into extract graph

* Push hydrated params into text embeddings

* Remove a few more low-level defaults

* Semver

* Remove configurable prompt delimiters

* Update smoke tests

* Remove fnllm (#2095)

* Sort deps alpha

* Remove multi search (#2093)

* Remove multi-search from CLI

* Remove multi-search from API

* Flatten vector_store config

* Push hydrated vector store down to embed_text

* Remove outputs from config

* Remove multi-search notebook/docs

* Add missing response_type in basic search API

* Fix basic search context and id mapping

* Fix v1 migration notebook

* Fix query entity search tests

* V3 docs and cleanup (#2100)

* Remove community contrib notebooks

* Add migration notebook and breaking changes page edits

* Update/polish docs

* Make model instance name configurable

* Add vector schema updates to v3 migration notebook

* Spellcheck

* Bump smoke test runtimes

* Remove document overwrite (#2101)

* remove document overwrite from vector store configuration

* remove document overwrite and refactor load documents method

* fix test

* fix test

* fix test

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Unified factory (#2105)

* Simplify Factory interface

* Migrate CacheFactory to standard base class

* Migrate LoggerFactory to standard base class

* Migrate StorageFactory to standard base class

* Migrate VectorStoreFactory to standard base class

* Update vector store example notebook

* Delete notebook outputs

* Move default providers into factories

* Move retry/limit tests into integ

* Split language model factories

* Set smoke test tpm/rpm

* Fix factory integ tests

* Add method to smoke test, switch text to 'fast'

* Fix text smoke config for fast workflow

* Add new workflows to text smoke test

* Convert input readers to a proper factory

* Remove covariates from fast smoke test

* Update docs for input factory

* Bump smoke runtime

* Even longer runtime

* min-csv timeout

* Remove unnecessary lambdas

* Prefix vector store (#2106)

* add prefix to vector store configuration and removal of container name

* docs updated

* change prefix property name

* change prefix property name

* feedback implemented

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* fix for container name

* Restructure project as monorepo. (#2111)

* Restructure project as monorepo.

* Fix formatting

* Storage fixes and cleanup (#2118)

* Fix pipeline recursion

* Remove base_dir from storage.find

* Remove max_count from storage.find

* Remove prefix on storage integ test

* Add base_dir in creation_date test

* Wrap base_dir in Path

* Use constants for input/update directories

* Nov 2025 housekeeping (#2120)

* Remove gensim sideload

* Split CI build/type checks from unit tests

* Thorough review of docs to align with v3

* Format

* Fix version

* Fix type

* Graphrag config (#2119)

* Add load_config to graphrag-common package.

* Empty graph guards (#2126)

* Remove networkx from graph_extractor and clean out redundancy

* Bubble pipeline error to console

* Remove embeddings optional new (#2128)

* remove optional embeddings

* fix test

* fix tests

* fix pipeline

* fix test

* fix test

* fix test

* fix tests

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Format

* Add empty checks for NLP graphs (#2133)

* Init command asks for models (#2137)

* Add init prompting for models

* Remove hard-coded model config validation

* Switch to typer option prompt for full CLI use with models

* Update getting started for init model input

* Bump request timeout and overall smoke test timeout

* Add graphrag-storage. (#2127)

* Add graphrag-storage.

* Python update (3.13) (#2149)

* Update to python 3.14 as default, with range down to 3.10

* Fix enum value in query cli

* Update pyarrow

* Update py version for storage package

* Remove 3.10

* add fastuuid

* Update Python support to 3.11-3.14 with stricter dependency constraints

- Set minimum Python version to 3.11 (removed 3.10 support)
- Added support for Python 3.14
- Updated CI workflows: single-version jobs use 3.14, matrix jobs use 3.11 and 3.14
- Fixed license format to use SPDX-compatible format for Python 3.14
- Updated pyarrow to >=22.0.0 for Python 3.14 wheel support
- Added explicit fastuuid~=0.14 and blis~=1.3 for Python 3.14 compatibility
- Replaced all loose version constraints (>=) with compatible release (~=) for better lock file control
- Applied stricter versioning to all packages: graphrag, graphrag-common, graphrag-storage, unified-search-app
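
For readers unfamiliar with the compatible release operator mentioned above, the sketch below expands a `~=` constraint into the equivalent pair of clauses defined by PEP 440. The helper name `expand_compatible_release` is hypothetical and is not part of this repository; it assumes plain dotted release numbers only (no pre-release or post-release suffixes).

```python
def expand_compatible_release(version: str) -> str:
    """Expand the version in '~=X.Y[.Z]' into its PEP 440 equivalent.

    Hypothetical helper for illustration only; assumes a plain dotted
    release number such as '22.0' or '1.3.3'.
    """
    parts = version.split(".")
    if len(parts) < 2:
        # PEP 440 does not allow '~=' with a single release segment.
        raise ValueError("~= requires at least two release segments")
    # Drop the last segment and allow any value in its place.
    prefix = ".".join(parts[:-1])
    return f">={version}, =={prefix}.*"


print(expand_compatible_release("22.0"))   # >=22.0, ==22.*
print(expand_compatible_release("1.3.3"))  # >=1.3.3, ==1.3.*
```

Under this reading, `pyarrow~=22.0` accepts any 22.x release, while `blis~=1.3.3` accepts only 1.3.x patch releases at or above 1.3.3.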

* update uv lock

* Pin blis to ~=1.3.3 to ensure Python 3.14 wheel availability

* Update uv lock

* Update numpy to >=2.0.0 for Python 3.14 Windows compatibility

Numpy 1.25.x has access violation issues on Python 3.14 Windows.
Numpy 2.x has proper Python 3.14 support including Windows wheels.

* update uv lock

* Update pandas to >=2.3.0 for numpy 2.x compatibility

Pandas 2.2.x was compiled against numpy 1.x and causes ABI
incompatibility errors with numpy 2.x. Pandas 2.3.0+ supports
numpy 2.x properly.

* update uv.lock

* Add scipy>=1.15.0 for numpy 2.x compatibility

Scipy versions < 1.15.0 have C extensions built against numpy 1.x
and are incompatible with numpy 2.x, causing dtype size errors.

* update uv lock

* Update Python support to 3.11-3.13 with compatible dependencies

- Set Python version range to 3.11-3.13 (removed 3.14 support)
- Updated CI workflows: single-version jobs use 3.13, matrix jobs use 3.11 and 3.13
- Dependencies optimized for Python 3.13 compatibility:
  - pyarrow~=22.0 (has Python 3.13 wheels)
  - numpy~=1.26
  - pandas~=2.2
  - blis~=1.0
  - fastuuid~=0.13
- Applied stricter version constraints using ~= operator throughout
- Updated uv.lock with resolved dependencies

* Update numpy to 2.1+ and pandas to 2.3+ for Python 3.13 Windows compatibility

Numpy 1.26.x causes access violations on Python 3.13 Windows.
Numpy 2.1+ has proper Python 3.13 support with Windows wheels.
Pandas 2.3+ is required for numpy 2.x compatibility.

* update vsts.yml python version

* Add GraphRAG Cache package. (#2153)

* Add GraphRAG Cache package.

* Fix a bunch of module comments and function visibility (#2154)

* Issue #2004 fix (#2159)

* fix issue #2004 using the idea from KeenhoChu's PR

* add unit test for dynamic community selection

* add unit test for dynamic community selection implementing #2158 logic

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Mismatch between header in community report generation prompt examples and input data (id vs human_readable_id) (#2161)

* fix issue #860 for mismatch in prompts and input

* fix format

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Chunker factory (#2156)

* Delete NoopTextSplitter

* Delete unused check_token_limit

* Add base chunking factory and migrate workflow to use it

* Split apart chunker module

* Co-locate chunking/splitting

* Collapse token splitting functionality into one class/function

* Restore create_base_text_units parameterization

* Move Tokenizer base class to common package

* Move pre-pending into chunkers

* Streamline config

* Fix defaults construction

* Add prepending tests

* Remove chunk_size_includes_metadata config

* Revert ChunkingDocument interface

* Move metadata prepending to a util

* Move Tokenizer back to GR core

* Fix tokenizer removal from chunker

* Set defaults for chunking config

* Move chunking to monorepo package

* Format

* Typo

* Add ChunkResult model

* Streamline chunking config

* Add missing version updates for graphrag_chunking

* Input factory (#2168)

* Update input factory to match other factories

* Move input config alongside input readers

* Move file pattern logic into InputReader

* Set encoding default

* Clean up optional column configs

* Combine structured data extraction

* Remove pandas from input loading

* Throw if empty documents

* Add json lines (jsonl) input support

* Store raw data

* Fix merge imports

* Move metadata handling entirely to chunking

* Nicer automatic title

* Typo

* Add get_property utility for nested dictionary access with dot notation

* Update structured_file_reader to use get_property utility

* Extract input module into new graphrag-input monorepo package

- Create new graphrag-input package with input loading utilities
- Move InputConfig, InputFileType, InputReader, TextDocument, and file readers (CSV, JSON, JSONL, Text)
- Add get_property utility for nested dictionary access with dot notation
- Include hashing utility for document ID generation
- Update all imports throughout codebase to use graphrag_input
- Add package to workspace configuration and release tasks
- Remove old graphrag.index.input module
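
As a rough illustration of the dot-notation lookup described above, here is a minimal sketch. It is not the code from the graphrag-input package: the signature, the error handling, and the sample record are all assumptions made for this example.

```python
from typing import Any


def get_property(data: dict[str, Any], path: str) -> Any:
    """Return the value at a dot-separated path inside nested dicts.

    Illustrative sketch only; the real utility may differ in signature
    and in how it reports missing keys.
    """
    current: Any = data
    for key in path.split("."):
        # Stop as soon as the path leaves the nested-dict structure.
        if not isinstance(current, dict) or key not in current:
            raise KeyError(f"Property '{path}' not found")
        current = current[key]
    return current


# Hypothetical record, such as one row parsed from a JSON input file.
record = {"title": "Report", "metadata": {"author": {"name": "Ada"}}}
print(get_property(record, "metadata.author.name"))  # prints: Ada
```

A lookup like this lets a reader configuration name a nested field (for example a title or text column) with a single string instead of a chain of dictionary accesses.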

* Rename ChunkResult to TextChunk and add transformer support

- Rename chunk_result.py to text_chunk.py with ChunkResult -> TextChunk
- Add 'original' field to TextChunk to track pre-transform text
- Add optional transform callback to chunker.chunk() method
- Add add_metadata transformer for prepending metadata to chunks
- Update create_chunk_results to apply transforms and populate original
- Update sentence_chunker and token_chunker with transform support
- Refactor create_base_text_units to use new transformer pattern
- Rename pluck_metadata to get/collect methods on TextDocument
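
The transformer pattern described above can be sketched roughly as follows. This is an assumption-laden illustration, not the graphrag_chunking API: the `TextChunk` fields other than `original`, the fixed-width `chunk` function, and the `add_metadata` signature are all hypothetical stand-ins.

```python
from __future__ import annotations

from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class TextChunk:
    """Hypothetical chunk model; only 'original' is named in the notes above."""

    text: str      # chunk text after the optional transform
    original: str  # chunk text before the transform
    index: int     # position of the chunk within the document


def add_metadata(metadata: dict[str, str]) -> Callable[[str], str]:
    """Build a transform that prepends 'key: value' lines to a chunk."""
    header = "\n".join(f"{key}: {value}" for key, value in metadata.items())

    def transform(text: str) -> str:
        return f"{header}\n{text}"

    return transform


def chunk(
    text: str,
    size: int,
    transform: Callable[[str], str] | None = None,
) -> list[TextChunk]:
    """Split text into fixed-width pieces and apply the optional transform.

    Character-based splitting is a stand-in for the real sentence and
    token chunkers.
    """
    pieces = [text[i : i + size] for i in range(0, len(text), size)]
    return [
        TextChunk(text=transform(p) if transform else p, original=p, index=i)
        for i, p in enumerate(pieces)
    ]


chunks = chunk("abcdef", 3, transform=add_metadata({"title": "T"}))
print(chunks[0].text)      # title: T, then abc on the next line
print(chunks[0].original)  # abc
```

Keeping the untransformed text in `original` means downstream steps can still see the raw chunk even after metadata has been prepended.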

* Back-compat comment

* Align input config type name with other factory configs

* Add MarkItDown support

* Remove pattern default from MarkItDown reader

* Remove plugins flag (implicit disabled)

* Format

* Update verb tests

* Separate storage from input config

* Add empty objects for NaN raw_data

* Fix smoke tests

* Fix BOM in csv smoke

* Format

* DRIFT fixes (#2171)

* Use stable ids for community reports

* Remove deprecated title from embedding flow

* Remove embedding column from df loaders

* Fix lancedb insertion

* Add drift back to smoke tests

* Fix mock embedder to match default embedding length

* Fix DRIFT notebook

* Push drift_k_followups through to prompt

* Format

* Vector package (#2172)

* Extract graphrag-vectors package

* Simplify vector factory usage and config defaults

* Update factory integ initializers

* Fix mock patch

* Format

* Register vector stores in tests

* Set a default vector store name

* Update vector readme

* Remove impls from init

* Move some validation into impls

* Remove index_prefix

* Move duplicate method to base class

* Fix smoke vector config

* Update index bug (#2173)

* fix update index bug

* blob storage bug fix

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Add GraphRAG LLM package. (#2174)

* Update documentation for v3 release (#2176)

update documentation for v3 release

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>

* Graphrag llm cleanup (#2181)

* Migration update (#2180)

* fix formatting.

---------

Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com>
Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>
Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com>
2026-01-27 10:23:45 -08:00