graphrag

mirror of https://github.com/microsoft/graphrag.git synced 2025-07-03 07:04:19 +00:00

Author	SHA1	Message	Date
Derek Worthen	54885b8ab1	Refactor config defaults (#1723) * Refactor config defaults - Implement type-safe, hierarchical dataclass for config defaults instead of namespaced constants. - Allow for instantiating config directly from defaults data structure. * fix vector_store db_uri default --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2025-02-20 13:01:29 -06:00
Nathan Evans	35b639399b	Incremental flow rework (#1696) * Rework update output structure * Semver * Fix unit test * Update frequency in incremental --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2025-02-13 18:22:32 -06:00
Josh Bradley	f14cda2b6d	Improve default llm retry logic to be more optimized (#1701)	2025-02-13 16:56:37 -05:00
Nathan Evans	fe461417b5	Export NLP community reports prompt (#1697) * Properly export the NLP community reports prompt * Semver * Fix verb tests	2025-02-12 10:41:39 -08:00
Dayenne Souza	b94290ec2b	add option to add metadata into text chunks (#1681) * add new options * add metadata json into input document * remove doc change * add metadata column into text loader * prepend_metadata * run fix * fix tests and patch * fix test * add warning for metadata tokens > config size * fix typo and run fix * fix test_integration * fix test * run check * rename and fix chunking * fix * fix * fix test verbs * fix * fix tests * fix chunking * fix index * fix cosmos test * fix vars * fix after PR * fix	2025-02-12 09:38:03 -08:00
Nathan Evans	c02ab0984a	Streamline workflows (#1674) * Remove create_final_nodes * Rename final entity output to "entities" * Remove duplicate code from graph extraction * Rename create_final_relationships output to "relationships" * Rename create_final_communities output to "communities" * Combine compute_communities and create_final_communities * Rename create_final_covariates output to "covariates" * Rename create_final_community_reports output to "community_reports" * Rename create_final_text_units output to "text_units" * Rename create_final_documents output to "documents" * Remove transient snapshots config * Move create_final_entities to finalize_entities operation * Move create_final_relationships flow to finalize_relationships operation * Reuse some community report functions * Collapse most of graph and text unit-based report generation * Unify schemas files * Move community reports extractor * Move NLP report prompt to prompts folder * Fix a few pandas warnings * Rename embeddings config to embed_text * Rename claim_extraction config to extract_claims * Remove nltk from standard graph extraction * Fix verb tests * Fix extract graph config naming * Fix moved file reference * Create v1-to-v2 migration notebook * Semver * Fix smoke test artifact count * Raise tpm/rpm on smoke tests * Update drift settings for smoke tests * Reuse project directory var in api notebook * Format * Format	2025-02-07 11:11:03 -08:00
KennyZhang1	83cc2daf91	Multi-index query CLI support (#1675) * Add vector store id reference to embeddings config. * changed structure of output config section * added cli integration for multi index global * added cli integration for multi index local * added cli integration for multi index drift and basic * finished local testing of multi-index cli * ruff fixes * partially refactored test code to align with new output section * more test changes for new output structure * semversioner * refactored to align with new multi index config proposal * locally tested new multi-index output proposal * cleaned up tests to align with new structure --------- Co-authored-by: Derek Worthen <worthend.derek@gmail.com>	2025-02-07 12:56:48 -05:00
Dayenne Souza	ad5b5120ec	remove unused columns and rename document_attribute_columns (#1672) * remove unused columns and change property document_attribute_columns to metadata * format file * fix 'metadata' column on output * run check * fix test on nltk * remove docs changes	2025-02-03 14:37:06 -03:00
Derek Worthen	94bd2bb816	Require explicit azure auth settings when using AOI. (#1665) * Require explicit azure auth settings when using AOI. - Must set LanguageModel.azure_auth_type to either "api_key" or "managed_identity" when using AOI. * Fix smoke tests * Use general auth_type property instead of azure_auth_type * Remove unused error type * Update validation * Update validation comment	2025-01-29 12:28:47 -08:00
Derek Worthen	eeee84e9d9	Add vector store id reference to embeddings config. (#1662)	2025-01-28 10:46:41 -08:00
KennyZhang1	1bbce33f42	Multi-index querying for API layer (#1644) * added multi-global-query function header * ported over code for merging dataframes * added connection to global streaming api function * added function header for update context helper * implemented and incorporated update_context function * Updated to make sure 'parent' column in final_communities gets incremented for multi index. * first cut at multi_local_seach function * several minor changes and fixes * Updated multi index local search. * Cleaned up code. * fixed lambda function ruff errors * fixed more ruff errors * moved query api helpers to util file * moved index api helpers to util file * merged in code left out of conflict * changed GraphRagConfig object to support lists of vector stores * Updated with fixes for multi_local_search. * Minor updates. * Minor updates. * Updates for ruff check. * Minor updates. * removed redundant vector_store_configs arg * ruff formatting changes * semversioner * Minor fix. * spellcheck fixes * ruff * test fix for cicd errors * another test fix * added explicit typing for ci tests * added dict type check for vector_store during indexing * more ruff fixes * moved type check * Removed streaming. Added multi drift and basic searches. * Formatting changes. * Updates for pyright. * Update for ruff. * Ruff formatted. * first cut at fixing vector store typing errors * got multi local search working with new config * ruff and test fixes * added fix for embeddings type error * renamed multi index api functions * ruff * convert config model to dict[VectorStoreConfig] * modified tests to support new vector_store model * ruff fixes * changed some test setups to match new model * changed ci/cd settings files to match new structure * Fix stderror check * fixed bug in vector_store_config validation * ruff * add database_name field to vectorstoreconfig * removed print statements * small refactoring for PR comments * modified default config in test * modified vector store config unit test --------- Co-authored-by: dorbaker <dorbaker@microsoft.com> Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2025-01-27 17:26:38 -05:00
Derek Worthen	c644338bae	Refactor config (#1593) * Refactor config - Add new ModelConfig to represent LLM settings - Combines LLMParameters, ParallelizationParameters, encoding_model, and async_mode - Add top level models config that is a list of available LLM ModelConfigs - Remove LLMConfig inheritance and delete LLMConfig - Replace the inheritance with a model_id reference to the ModelConfig listed in the top level models config - Remove all fallbacks and hydration logic from create_graphrag_config - This removes the automatic env variable overrides - Support env variables within config files using Templating - This requires "$" to be escaped with extra "$" so ".\\.txt$" becomes ".\\.txt$$" - Update init content to initialize new config file with the ModelConfig structure * Use dict of ModelConfig instead of list * Add model validations and unit tests * Fix ruff checks * Add semversioner change * Fix unit tests * validate root_dir in pydantic model * Rename ModelConfig to LanguageModelConfig * Rename ModelConfigMissingError to LanguageModelConfigMissingError * Add validation for unexpected API keys * Allow skipping pydantic validation for testing/mocking purposes. * Add default lm configs to verb tests * smoke test * remove config from flows to fix llm arg mapping * Fix embedding llm arg mapping * Remove timestamp from smoke test outputs * Remove unused "subworkflows" smoke test properties * Add models to smoke test configs * Update smoke test output path * Send logs to logs folder * Fix output path * Fix csv test file pattern * Update placeholder * Format * Instantiate default model configs * Fix unit tests for config defaults * Fix migration notebook * Remove create_pipeline_config * Remove several unused config models * Remove indexing embedding and input configs * Move embeddings function to config * Remove skip_workflows * Remove skip embeddings in favor of explicit naming * fix unit test spelling mistake * self.models[model_id] is already a language model. Remove redundant casting. * update validation errors to instruct users to rerun graphrag init * instantiate LanguageModelConfigs with validation * skip validation in unit tests * update verb tests to use default model settings instead of skipping validation * test using llm settings * cleanup verb tests * remove unsafe default model config * remove the ability to skip pydantic validation * remove None union types when default values are set * move vector_store from embeddings to top level of config and delete resolve_paths * update vector store settings * fix vector store and smoke tests * fix serializing vector_store settings * fix vector_store usage * fix vector_store type * support cli overrides for loading graphrag config * rename storage to output * Add --force flag to init * Remove run_id and resume, fix Drift config assignment * Ruff --------- Co-authored-by: Nathan Evans <github@talkswithnumbers.com> Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2025-01-21 17:52:06 -06:00
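
Note on the "$" escaping mentioned in c644338bae: the commit message says only that env variables in config files are supported "using Templating" and that a literal "$" must be doubled. The sketch below illustrates that rule on the assumption that the templating behaves like Python's standard `string.Template`. The settings keys, the `GRAPHRAG_API_KEY` variable name, and the `render` helper are hypothetical and are not taken from the graphrag code.

```python
from string import Template

# Hypothetical settings text (keys and variable name are illustrative only).
# "${GRAPHRAG_API_KEY}" is a placeholder to be filled from the environment;
# the regex's trailing "$" is written as "$$" so it survives substitution.
raw_settings = "\n".join([
    "api_key: ${GRAPHRAG_API_KEY}",
    r'file_pattern: ".*\\.txt$$"',
])


def render(text: str, env: dict) -> str:
    """Substitute ${NAME} placeholders from env; "$$" becomes a literal "$"."""
    return Template(text).substitute(env)


# A plain dict stands in for os.environ to keep the sketch deterministic.
rendered = render(raw_settings, {"GRAPHRAG_API_KEY": "secret-value"})
print(rendered)

# An unescaped "$" that is not followed by a valid name is rejected,
# which is why the regex pattern needs the doubled form.
try:
    render(r'file_pattern: ".*\\.txt$"', {})
    unescaped_rejected = False
except ValueError:
    unescaped_rejected = True
```

Under this assumption, a file pattern that ends in a regex anchor has to be written with "$$" in the settings file, while `${NAME}` placeholders are reserved for environment values.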

12 Commits