graphrag

mirror of https://github.com/microsoft/graphrag.git synced 2026-01-28 10:48:13 +00:00

Author	SHA1	Message	Date
Derek Worthen	4c8ef97760	V3/main (#2190) * Remove graph embedding and UMAP (#2048) * Remove umap/layout operation * Remove graph embedding * Bump unified-search to GR 2.5.0 * Remove graph vis from unified-search * Remove file filtering (#2050) * Remove document filtering * Semver * Fix integ tests * Fix file find tuple * Fix another dangling find tuple * Remove text unit grouping (#2052) * Remove text unit group_by_columns * Semver * Fix default token split test * Fix models in config test samples * Fix token length in context sort test * Fix document sort * Re-implement hierarchical Leiden (#2049) * Use graspologic-native hierarchical leiden * Re-implement largest_connected_component * Copy in modularity * Use graspologic-native directly in pyproject * Remove directed graph tests (we don't use this) * Semver * Remove graspologic dep * Use 4.1 and text-embedding-3-large as defaults * Update comment * Clean vector store (#2077) * clean vector store code * fix * fix launch.json --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Update v3/main missing config + functions (#2082) * reduce schema fields (#2089) * reduce schema fields * fix launch.json --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Remove strategy dicts (#2090) * Remove "strategy" from community reports config/workflow * Remove extraction strategy from extract_graph * Remove summarization strategy from extract_graph * Remove strategy from claim extraction * Strongly type prompt templates * Remove strategy from embed_text * Push hydrated params into community report workflows * Push hydrated params into extract covariates * Push hydrated params into extract graph NLP * Push hydrated params into extract graph * Push hydrated params into text embeddings * Remove a few more low-level defaults * Semver * Remove configurable prompt delimiters * Update smoke tests * Remove fnllm (#2095) * Sort deps alpha * Remove multi
search (#2093) * Remove multi-search from CLI * Remove multi-search from API * Flatten vector_store config * Push hydrated vector store down to embed_text * Remove outputs from config * Remove multi-search notebook/docs * Add missing response_type in basic search API * Fix basic search context and id mapping * Fix v1 migration notebook * Fix query entity search tests * V3 docs and cleanup (#2100) * Remove community contrib notebooks * Add migration notebook and breaking changes page edits * Update/polish docs * Make model instance name configurable * Add vector schema updates to v3 migration notebook * Spellcheck * Bump smoke test runtimes * Remove document overwrite (#2101) * remove document overwrite from vector store configuration * remove document overwrite and refactor load documents method * fix test * fix test * fix test --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Unified factory (#2105) * Simplify Factory interface * Migrate CacheFactory to standard base class * Migrate LoggerFactory to standard base class * Migrate StorageFactory to standard base class * Migrate VectorStoreFactory to standard base class * Update vector store example notebook * Delete notebook outputs * Move default providers into factories * Move retry/limit tests into integ * Split language model factories * Set smoke test tpm/rpm * Fix factory integ tests * Add method to smoke test, switch text to 'fast' * Fix text smoke config for fast workflow * Add new workflows to text smoke test * Convert input readers to a proper factory * Remove covariates from fast smoke test * Update docs for input factory * Bump smoke runtime * Even longer runtime * min-csv timeout * Remove unnecessary lambdas * Prefix vector store (#2106) * add prefix to vector store configuration and removal of container name * docs updated * change prefix property name * change prefix property name * feedback implemented --------- Co-authored-by: Gaudy Blanco 
<gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * fix for container name * Restructure project as monorepo. (#2111) * Restructure project as monorepo. * Fix formatting * Storage fixes and cleanup (#2118) * Fix pipeline recursion * Remove base_dir from storage.find * Remove max_count from storage.find * Remove prefix on storage integ test * Add base_dir in creation_date test * Wrap base_dir in Path * Use constants for input/update directories * Nov 2025 housekeeping (#2120) * Remove gensim sideload * Split CI build/type checks from unit tests * Thorough review of docs to align with v3 * Format * Fix version * Fix type * Graphrag config (#2119) * Add load_config to graphrag-common package. * Empty graph guards (#2126) * Remove networkx from graph_extractor and clean out redundancy * Bubble pipeline error to console * Remove embeddings optional new (#2128) * remove optional embeddings * fix test * fix tests * fix pipeline * fix test * fix test * fix test * fix tests --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Format * Add empty checks for NLP graphs (#2133) * Init command asks for models (#2137) * Add init prompting for models * Remove hard-coded model config validation * Switch to typer option prompt for full CLI use with models * Update getting started for init model input * Bump request timeout and overall smoke test timeout * Add graphrag-storage. (#2127) * Add graphrag-storage. 
* Python update (3.13) (#2149) * Update to python 3.14 as default, with range down to 3.10 * Fix enum value in query cli * Update pyarrow * Update py version for storage package * Remove 3.10 * add fastuuid * Update Python support to 3.11-3.14 with stricter dependency constraints - Set minimum Python version to 3.11 (removed 3.10 support) - Added support for Python 3.14 - Updated CI workflows: single-version jobs use 3.14, matrix jobs use 3.11 and 3.14 - Fixed license format to use SPDX-compatible format for Python 3.14 - Updated pyarrow to >=22.0.0 for Python 3.14 wheel support - Added explicit fastuuid~=0.14 and blis~=1.3 for Python 3.14 compatibility - Replaced all loose version constraints (>=) with compatible release (~=) for better lock file control - Applied stricter versioning to all packages: graphrag, graphrag-common, graphrag-storage, unified-search-app * update uv lock * Pin blis to ~=1.3.3 to ensure Python 3.14 wheel availability * Update uv lock * Update numpy to >=2.0.0 for Python 3.14 Windows compatibility Numpy 1.25.x has access violation issues on Python 3.14 Windows. Numpy 2.x has proper Python 3.14 support including Windows wheels. * update uv lock * Update pandas to >=2.3.0 for numpy 2.x compatibility Pandas 2.2.x was compiled against numpy 1.x and causes ABI incompatibility errors with numpy 2.x. Pandas 2.3.0+ supports numpy 2.x properly. * update uv.lock * Add scipy>=1.15.0 for numpy 2.x compatibility Scipy versions < 1.15.0 have C extensions built against numpy 1.x and are incompatible with numpy 2.x, causing dtype size errors. 
* update uv lock * Update Python support to 3.11-3.13 with compatible dependencies - Set Python version range to 3.11-3.13 (removed 3.14 support) - Updated CI workflows: single-version jobs use 3.13, matrix jobs use 3.11 and 3.13 - Dependencies optimized for Python 3.13 compatibility: - pyarrow~=22.0 (has Python 3.13 wheels) - numpy~=1.26 - pandas~=2.2 - blis~=1.0 - fastuuid~=0.13 - Applied stricter version constraints using ~= operator throughout - Updated uv.lock with resolved dependencies * Update numpy to 2.1+ and pandas to 2.3+ for Python 3.13 Windows compatibility Numpy 1.26.x causes access violations on Python 3.13 Windows. Numpy 2.1+ has proper Python 3.13 support with Windows wheels. Pandas 2.3+ is required for numpy 2.x compatibility. * update vsts.yml python version * Add GraphRAG Cache package. (#2153) * Add GraphRAG Cache package. * Fix a bunch of module comments and function visibility (#2154) * Issue #2004 fix (#2159) * fix issue #2004 using KeenhoChu idea in his PR * add unit test for dynamic community selection * add unit test for dynamic community selection implementing #2158 logic --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Mismatch between header in community report generation prompt examples and input data (id vs human_readable_id) (#2161) * fix issue #860 for mismatch in prompts and input * fix format --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Chunker factory (#2156) * Delete NoopTextSplitter * Delete unused check_token_limit * Add base chunking factory and migrate workflow to use it * Split apart chunker module * Co-locate chunking/splitting * Collapse token splitting functionality into one class/function * Restore create_base_text_units parameterization * Move Tokenizer base class to common package * Move pre-pending into chunkers * Streamline config * Fix defaults construction * Add prepending tests * Remove chunk_size_includes_metadata 
config * Revert ChunkingDocument interface * Move metadata prepending to a util * Move Tokenizer back to GR core * Fix tokenizer removal from chunker * Set defaults for chunking config * Move chunking to monorepo package * Format * Typo * Add ChunkResult model * Streamline chunking config * Add missing version updates for graphrag_chunking * Input factory (#2168) * Update input factory to match other factories * Move input config alongside input readers * Move file pattern logic into InputReader * Set encoding default * Clean up optional column configs * Combine structured data extraction * Remove pandas from input loading * Throw if empty documents * Add json lines (jsonl) input support * Store raw data * Fix merge imports * Move metadata handling entirely to chunking * Nicer automatic title * Typo * Add get_property utility for nested dictionary access with dot notation * Update structured_file_reader to use get_property utility * Extract input module into new graphrag-input monorepo package - Create new graphrag-input package with input loading utilities - Move InputConfig, InputFileType, InputReader, TextDocument, and file readers (CSV, JSON, JSONL, Text) - Add get_property utility for nested dictionary access with dot notation - Include hashing utility for document ID generation - Update all imports throughout codebase to use graphrag_input - Add package to workspace configuration and release tasks - Remove old graphrag.index.input module * Rename ChunkResult to TextChunk and add transformer support - Rename chunk_result.py to text_chunk.py with ChunkResult -> TextChunk - Add 'original' field to TextChunk to track pre-transform text - Add optional transform callback to chunker.chunk() method - Add add_metadata transformer for prepending metadata to chunks - Update create_chunk_results to apply transforms and populate original - Update sentence_chunker and token_chunker with transform support - Refactor create_base_text_units to use new transformer pattern - 
Rename pluck_metadata to get/collect methods on TextDocument * Back-compat comment * Align input config type name with other factory configs * Add MarkItDown support * Remove pattern default from MarkItDown reader * Remove plugins flag (implicit disabled) * Format * Update verb tests * Separate storage from input config * Add empty objects for NaN raw_data * Fix smoke tests * Fix BOM in csv smoke * Format * DRIFT fixes (#2171) * Use stable ids for community reports * Remove deprecated title from embedding flow * Remove embedding column from df loaders * Fix lancedb insertion * Add drift back to smoke tests * Fix mock embedder to match default embedding length * Fix DRIFT notebook * Push drift_k_followups through to prompt * Format * Vector package (#2172) * Extract graphrag-vectors package * Simplify vector factory usage and config defaults * Update factory integ initializers * Fix mock patch * Format * Register vector stores in tests * Set a default vector store name * Update vector readme * Remove impls from init * Move some validation into impls * Remove index_prefix * Move duplicate method to base class * Fix smoke vector config * Update index bug (#2173) * fix update index bug * blob storage bug fix --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Add GraphRAG LLM package. (#2174) * Update documentation for v3 release (#2176) update documentation for v3 release Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Graphrag llm cleanup (#2181) * Migration update (#2180) * fix formatting. --------- Co-authored-by: Nathan Evans <github@talkswithnumbers.com> Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com> Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com>	2026-01-27 10:23:45 -08:00
Nathan Evans	ac8a7f5eef	Housekeeping (#2086) * Add deprecation warnings for fnllm and multi-search * Fix dangling token_encoder refs * Fix local_search notebook * Fix global search dynamic notebook * Fix global search notebook * Fix drift notebook * Switch example notebooks to use LiteLLM config * Properly annotate dev deps as a group * Semver * Remove --extra dev * Remove llm_model variable * Ignore ruff ASYNC240 * Add note about expected broken notebook in docs * Fix custom vector store notebook * Push tokenizer throughout	2025-10-07 16:21:24 -07:00
Copilot	7c28c70d5c	Switch from Poetry to uv for package management (#2008) * Initial plan * Switch from Poetry to uv for package management Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Clean up build artifacts and update gitignore Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * remove build artifacts * remove hardcoded version string * fix calls to pip in cicd * Update gh-pages.yml workflow to use uv instead of Poetry Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * ruff formatting fixes * update cicd workflow with latest uv action * fix command to retrieve package version * update development instructions * remove Poetry references * Replace deprecated azuright action with npm-based Azurite installation Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * skip api version check for azurite * add semversioner file * update more changes from switching to UV * Migrate unified-search-app from Poetry to uv package management Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * minor typo update * minor Dockerfile update * update cicd thresholds * update pytest thresholds * ruff fixes * ruff fixes * remove legacy npm settings that no longer apply * Update Unified Search App Readme --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> Co-authored-by: Josh Bradley <joshbradley@microsoft.com> Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2025-08-13 18:57:25 -06:00
KennyZhang1	8368b12532	Add Cosmos DB storage/cache option (#1431) * added cosmosdb constructor and database methods * added rest of abstract method headers * added cosmos db container methods * implemented has and delete methods * finished implementing abstract class methods * integrated class into storage factory * integrated cosmosdb class into cache factory * added support for new config file fields * replaced primary key cosmosdb initialization with connection strings * modified cosmosdb setter to require json * Fix non-default emitters * Format * Ruff * ruff * first successful run of cosmosdb indexing * removed extraneous container_name setting * require base_dir to be typed as str * reverted merged changed from closed branch * removed nested try statement * readded initial non-parquet emitter fix * added basic support for parquet emitter using internal conversions * merged with main and resolved conflicts * fixed more merge conflicts * added cosmosdb functionality to query pipeline * tested query for cosmosdb * collapsed cosmosdb schema to use minimal containers and databases * simplified create_database and create_container functions * ruff fixes and semversioner * spellcheck and ci fixes * updated pyproject toml and lock file * apply fixes after merge from main * add temporary comments * refactor cache factory * refactored storage factory * minor formatting * update dictionary * fix spellcheck typo * fix default value * fix pydantic model defaults * update pydantic models * fix init_content * cleanup how factory passes parameters to file storage * remove unnecessary output file type * update pydantic model * cleanup code * implemented clear method * fix merge from main * add test stub for cosmosdb * regenerate lock file * modified set method to collapse parquet rows * modified get method to collapse parquet rows * updated has and delete methods and docstrings to adhere to new schema * added prefix helper function * replaced delimiter for prefixed id *
verified empty tests are passing * fix merges from main * add find test * update cicd step name * tested querying for new schema * resolved errors from merge conflicts * refactored set method to handle cache in new schema * refactored get method to handle cache in new schema * force unique ids to be written to cosmos for nodes * found bug with has and delete methods * modified has and delete to work with cache in new schema * fix the merge from main * minor typo fixes * update lock file * spellcheck fix * fix init function signature * minor formatting updates * remove https protocol * change localhost to 127.0.0.1 address * update pytest to use bacj engine * verified cache tests * improved speed of has function * resolved pytest error with find function * added test for child method * make container_name variable private as _container_name * minor variable name fix * cleanup cosmos pytest and make the cosmosdb storage class operations more efficient * update cicd to use different cosmosdb emulator * test with http protocol * added pytest for clear() * add longer timeout for cosmosdb emulator startup * revert http connection back to https * add comments to cicd code for future dev usage * set to container and database clients to none upon deletion * ruff changes * add comments to cicd code * removed unneeded None statements and ruff fixes * more ruff fixes * Update test_run.py * remove unnecessary call to delete container * ruff format updates * Reverted test_run.py * fix ruff formatter errors * cleanup variable names to be more consistent * remove extra semversioner file * revert pydantic model changes * revert pydantic model change * revert pydantic model change * re-enable inline formatting rule * update documentation in dev guide --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com> Co-authored-by: Josh Bradley <joshbradley@microsoft.com>	2024-12-19 13:43:21 -06:00
Josh Bradley	0394b55086	Update CI/CD - skip running unit tests on documentation-only PRs (#1371)	2024-11-06 14:19:21 -05:00
Josh Bradley	3df6f8c65b	Allow ci/cd to skip draft PRs (#1314)	2024-10-23 12:46:00 -04:00
Alonso Guevara	044516f538	Clean and organize run index code (#1090) * Create entrypoint for cli and api (#1067) * Add cli and api entrypoints for update index * Semver * Update docs * Run tests on feature branch main * Better /main handling in tests * Clean and organize run index code * Ruff fix * Pyright fix * Format fixes * Pyright fix * Format * Fix integ tests * Fix ruff * Reorganize and clean up	2024-09-05 08:15:10 -06:00
Nathan Evans	f5b4d2fea5	Ci streamline (#988) * Remove excess vars from gh-pages build * Delete redundant javascript ci * Pull apart testing CI * Clean up integration tests build * Move storage tests to integration CI * Take py 3.10 out of smoke tests matrix * Use minimum supported python version for most tests * Re-run main CI on any test change * Add Josh and Kenny to author list * Update auto-resolve perms	2024-08-21 15:16:15 -06:00

8 Commits