32 Commits

Author SHA1 Message Date
Derek Worthen
54885b8ab1
Refactor config defaults (#1723)
* Refactor config defaults

- Implement type-safe, hierarchical dataclass for config
defaults instead of namespaced constants.
- Allow for instantiating config directly from defaults data structure.
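
A rough sketch of what "hierarchical dataclass defaults" could look like next to flat namespaced constants. Every class name, field, and value below is hypothetical and chosen only for illustration; none is taken from the actual change.

```python
from dataclasses import asdict, dataclass, field


# Hypothetical leaf groups of defaults (names and values are illustrative only).
@dataclass(frozen=True)
class ChunkDefaults:
    size: int = 1200
    overlap: int = 100


@dataclass(frozen=True)
class VectorStoreDefaults:
    type: str = "lancedb"
    db_uri: str = "output/lancedb"


# Hypothetical root object: one typed tree instead of many flat constants.
@dataclass(frozen=True)
class ConfigDefaults:
    chunks: ChunkDefaults = field(default_factory=ChunkDefaults)
    vector_store: VectorStoreDefaults = field(default_factory=VectorStoreDefaults)


defaults = ConfigDefaults()


# Hypothetical config model instantiated directly from the defaults structure.
@dataclass
class ChunkingConfig:
    size: int
    overlap: int


chunking = ChunkingConfig(**asdict(defaults.chunks))
```

With this shape a type checker can flag a misspelled attribute such as defaults.chunks.sise, which a string-keyed lookup would only surface at runtime.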

* fix vector_store db_uri default

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-02-20 13:01:29 -06:00
Alonso Guevara
7bdeaee94a
Create Language Model Providers and Registry methods. Remove fnllm coupling (#1724)
* Base structure

* Add fnllm providers and Mock LLM

* Remove fnllm coupling, introduce llm providers

* Ruff + Tests fix

* Spellcheck

* Semver

* Format

* Default MockChat params

* Fix more tests

* Fix embedding smoke test

* Fix embeddings smoke test

* Fix MockEmbeddingLLM

* Rename LLM to model. Package organization

* Fix prompt tuning

* Oops

* Oops II
2025-02-20 08:56:20 -06:00
Nathan Evans
35b639399b
Incremental flow rework (#1696)
* Rework update output structure

* Semver

* Fix unit test

* Update frequency in incremental

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-02-13 18:22:32 -06:00
Josh Bradley
f14cda2b6d
Improve default llm retry logic to be more optimized (#1701)
2025-02-13 16:56:37 -05:00
Nathan Evans
fe461417b5
Export NLP community reports prompt (#1697)
* Properly export the NLP community reports prompt

* Semver

* Fix verb tests
2025-02-12 10:41:39 -08:00
Dayenne Souza
b94290ec2b
add option to add metadata into text chunks (#1681)
* add new options

* add metadata json into input document

* remove doc change

* add metadata column into text loader

* prepend_metadata

* run fix

* fix tests and patch

* fix test

* add warning for metadata tokens > config size

* fix typo and run fix

* fix test_integration

* fix test

* run check

* rename and fix chunking

* fix

* fix

* fix test verbs

* fix

* fix tests

* fix chunking

* fix index

* fix cosmos test

* fix vars

* fix after PR

* fix
2025-02-12 09:38:03 -08:00
Nathan Evans
c02ab0984a
Streamline workflows (#1674)
* Remove create_final_nodes

* Rename final entity output to "entities"

* Remove duplicate code from graph extraction

* Rename create_final_relationships output to "relationships"

* Rename create_final_communities output to "communities"

* Combine compute_communities and create_final_communities

* Rename create_final_covariates output to "covariates"

* Rename create_final_community_reports output to "community_reports"

* Rename create_final_text_units output to "text_units"

* Rename create_final_documents output to "documents"

* Remove transient snapshots config

* Move create_final_entities to finalize_entities operation

* Move create_final_relationships flow to finalize_relationships operation

* Reuse some community report functions

* Collapse most of graph and text unit-based report generation

* Unify schemas files

* Move community reports extractor

* Move NLP report prompt to prompts folder

* Fix a few pandas warnings

* Rename embeddings config to embed_text

* Rename claim_extraction config to extract_claims

* Remove nltk from standard graph extraction

* Fix verb tests

* Fix extract graph config naming

* Fix moved file reference

* Create v1-to-v2 migration notebook

* Semver

* Fix smoke test artifact count

* Raise tpm/rpm on smoke tests

* Update drift settings for smoke tests

* Reuse project directory var in api notebook

* Format

* Format
2025-02-07 11:11:03 -08:00
KennyZhang1
83cc2daf91
Multi-index query CLI support (#1675)
* Add vector store id reference to embeddings config.

* changed structure of output config section

* added cli integration for multi index global

* added cli integration for multi index local

* added cli integration for multi index drift and basic

* finished local testing of multi-index cli

* ruff fixes

* partially refactored test code to align with new output section

* more test changes for new output structure

* semversioner

* refactored to align with new multi index config proposal

* locally tested new multi-index output proposal

* cleaned up tests to align with new structure

---------

Co-authored-by: Derek Worthen <worthend.derek@gmail.com>
2025-02-07 12:56:48 -05:00
Dayenne Souza
ad5b5120ec
remove unused columns and rename document_attribute_columns (#1672)
* remove unused columns and change property document_attribute_columns to metadata

* format file

* fix 'metadata' column on output

* run check

* fix test on nltk

* remove docs changes
2025-02-03 14:37:06 -03:00
Derek Worthen
94bd2bb816
Require explicit azure auth settings when using AOI. (#1665)
* Require explicit azure auth settings when using AOI.

- Must set LanguageModel.azure_auth_type to either
"api_key" or "managed_identity" when using AOI.

* Fix smoke tests

* Use general auth_type property instead of azure_auth_type

* Remove unused error type

* Update validation

* Update validation comment
2025-01-29 12:28:47 -08:00
Derek Worthen
eeee84e9d9
Add vector store id reference to embeddings config. (#1662)
2025-01-28 10:46:41 -08:00
KennyZhang1
1bbce33f42
Multi-index querying for API layer (#1644)
* added multi-global-query function header

* ported over code for merging dataframes

* added connection to global streaming api function

* added function header for update context helper

* implemented and incorporated update_context function

* Updated to make sure 'parent' column in final_communities gets incremented for multi index.

* first cut at multi_local_search function

* several minor changes and fixes

* Updated multi index local search.

* Cleaned up code.

* fixed lambda function ruff errors

* fixed more ruff errors

* moved query api helpers to util file

* moved index api helpers to util file

* merged in code left out of conflict

* changed GraphRagConfig object to support lists of vector stores

* Updated with fixes for multi_local_search.

* Minor updates.

* Minor updates.

* Updates for ruff check.

* Minor updates.

* removed redundant vector_store_configs arg

* ruff formatting changes

* semversioner

* Minor fix.

* spellcheck fixes

* ruff

* test fix for cicd errors

* another test fix

* added explicit typing for ci tests

* added dict type check for vector_store during indexing

* more ruff fixes

* moved type check

* Removed streaming. Added multi drift and basic searches.

* Formatting changes.

* Updates for pyright.

* Update for ruff.

* Ruff formatted.

* first cut at fixing vector store typing errors

* got multi local search working with new config

* ruff and test fixes

* added fix for embeddings type error

* renamed multi index api functions

* ruff

* convert config model to dict[VectorStoreConfig]

* modified tests to support new vector_store model

* ruff fixes

* changed some test setups to match new model

* changed ci/cd settings files to match new structure

* Fix stderror check

* fixed bug in vector_store_config validation

* ruff

* add database_name field to vectorstoreconfig

* removed print statements

* small refactoring for PR comments

* modified default config in test

* modified vector store config unit test

---------

Co-authored-by: dorbaker <dorbaker@microsoft.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-01-27 17:26:38 -05:00
Derek Worthen
c644338bae
Refactor config (#1593)
* Refactor config

- Add new ModelConfig to represent LLM settings
    - Combines LLMParameters, ParallelizationParameters, encoding_model, and async_mode
- Add top level models config that is a list of available LLM ModelConfigs
- Remove LLMConfig inheritance and delete LLMConfig
    - Replace the inheritance with a model_id reference to the ModelConfig listed in the top level models config
- Remove all fallbacks and hydration logic from create_graphrag_config
    - This removes the automatic env variable overrides
- Support env variables within config files using Templating
    - This requires "$" to be escaped with extra "$" so ".*\\.txt$" becomes ".*\\.txt$$"
- Update init content to initialize new config file with the ModelConfig structure
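
A minimal sketch of the "$$" escaping rule described above, assuming the templating behaves like Python's standard string.Template (the commit message does not name the mechanism). The helper render_config_text and the GRAPHRAG_API_KEY variable are hypothetical and used only for illustration.

```python
from string import Template


def render_config_text(raw: str, env: dict[str, str]) -> str:
    """Substitute ${VAR} placeholders; a literal "$" must be written as "$$"."""
    return Template(raw).substitute(env)


# Hypothetical settings fragment: one env placeholder plus a regex ending in "$".
raw = "\n".join([
    r"api_key: ${GRAPHRAG_API_KEY}",
    r'file_pattern: ".*\\.txt$$"',
])

rendered = render_config_text(raw, {"GRAPHRAG_API_KEY": "secret"})

# An unescaped trailing "$" is rejected by string.Template with a ValueError.
try:
    render_config_text(r'file_pattern: ".*\\.txt$"', {})
    unescaped_ok = True
except ValueError:
    unescaped_ok = False
```

Under this assumption the regex keeps a single "$" after rendering, while the unescaped form fails at load time rather than silently changing the pattern.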

* Use dict of ModelConfig instead of list

* Add model validations and unit tests

* Fix ruff checks

* Add semversioner change

* Fix unit tests

* validate root_dir in pydantic model

* Rename ModelConfig to LanguageModelConfig

* Rename ModelConfigMissingError to LanguageModelConfigMissingError

* Add validation for unexpected API keys

* Allow skipping pydantic validation for testing/mocking purposes.

* Add default lm configs to verb tests

* smoke test

* remove config from flows to fix llm arg mapping

* Fix embedding llm arg mapping

* Remove timestamp from smoke test outputs

* Remove unused "subworkflows" smoke test properties

* Add models to smoke test configs

* Update smoke test output path

* Send logs to logs folder

* Fix output path

* Fix csv test file pattern

* Update placeholder

* Format

* Instantiate default model configs

* Fix unit tests for config defaults

* Fix migration notebook

* Remove create_pipeline_config

* Remove several unused config models

* Remove indexing embedding and input configs

* Move embeddings function to config

* Remove skip_workflows

* Remove skip embeddings in favor of explicit naming

* fix unit test spelling mistake

* self.models[model_id] is already a language model. Remove redundant casting.

* update validation errors to instruct users to rerun graphrag init

* instantiate LanguageModelConfigs with validation

* skip validation in unit tests

* update verb tests to use default model settings instead of skipping validation

* test using llm settings

* cleanup verb tests

* remove unsafe default model config

* remove the ability to skip pydantic validation

* remove None union types when default values are set

* move vector_store from embeddings to top level of config and delete resolve_paths

* update vector store settings

* fix vector store and smoke tests

* fix serializing vector_store settings

* fix vector_store usage

* fix vector_store type

* support cli overrides for loading graphrag config

* rename storage to output

* Add --force flag to init

* Remove run_id and resume, fix Drift config assignment

* Ruff

---------

Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-01-21 17:52:06 -06:00
Nathan Evans
a35cb12741
Remove datashaper strip code (#1581)
Remove datashaper
2025-01-03 13:59:26 -08:00
Derek Worthen
80367be018
Remove config input models (#1570)
* Remove config input models

* remove unit tests related to config input models

* add semversioner change

* Merge branch 'main' into config-remove-input-models
2025-01-02 15:25:10 -08:00
gaudyb
185f513ca7
Basic search implementation (#1563)
* basic search implementation

* basic streaming functionality

* format check

* check fix

* release change

* Chore/gleanings any encoding (#1569)

* Make claims and entities independent of encoding

* Semver

* Change semver release type

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-01-02 13:49:11 -06:00
Alonso Guevara
1c3b0f34c3
Chore/lib updates (#1477)
* Update dependencies and fix issues

* Format

* Semver

* Fix Pyright

* Pyright

* More Pyright

* Pyright
2024-12-06 14:08:24 -06:00
Nathan Evans
d17dfd01f9
Graph collapse (#1464)
* Refactor graph creation

* Semver

* Spellcheck

* Update integ pipeline

* Fix cast

* Improve pandas chaining

* Cleaner apply

* Use list comprehensions

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-12-05 11:57:26 -06:00
Josh Bradley
22a57d14c7
Improve CLI speed with lazy imports (#1319)
2024-11-15 19:41:10 -05:00
Nathan Evans
51912b2e03
Move prompts (#1404)
* Move indexing prompts to root

* Move query prompts to root

* Export query prompts during init

* Extract general knowledge prompt

* Load query prompts from disk

* Semver

* Fix unit tests
2024-11-14 10:45:37 -08:00
Nathan Evans
634e3ed62a
Transient entity graph (#1349)
* Make base_entity_graph transient

* Add transient snapshots

* Semver

* Fix unit test

* Fix smoke tests
2024-11-04 17:23:29 -08:00
Josh Bradley
083de12bcf
Auto-generate CLI doc pages (#1325)
2024-10-25 19:00:24 -04:00
Andres Morales
fc9895f793
Replace current docs by mkdocs (#1263)
* Replace docs by mkdocs-material

* Fix markdown

* Fix versions in gh-pages workflow

* remove whitespaces

* add semver

* Add build docs check on python-ci

* Fix command in index cli

* Spellcheck

* Spellcheck

* remove docsite paths

* clear outputs from notebook

* remove dependabot npm for docsite

* remove more docsite leftovers

* execute notebooks

* Update notebooks

* update poetry lock

* Remove notebook build from ci

* Revert dep update

* Navigation tabs

* Fix stylesheet

* add kwds to dictionary

* Turn on notebook execution

* Update gitignore

* Add MSR Blog posts

* spellcheck

* Accessibility Changes

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-10-11 13:39:03 -06:00
Derek Worthen
2d45ece9b6
fix setting base_dir to full paths when not using file system. (#1096)
* fix setting base_dir to full paths when not using file system.

* add general resolve_path
2024-09-04 11:33:44 -07:00
Derek Worthen
ab29cc2a7e
Consistent config load_config (#1065)
* Consistent config load_config

- Provide a consistent way to load configuration
- Resolve potential timestamp directories upfront
    upon config object creation
- Add unit tests for resolving timestamp directories
- Resolves #599
- Resolves #1049

* fix formatting issues

* remove unnecessary path resolution

* fix smoke tests

* update prompts to use load_config

* Update none checks

* Update none checks

* Update searching for config method signature

* Update unit tests

* fix formatting issues
2024-09-03 16:33:16 -06:00
Chris Trevino
9d99f323ea
Add encoding model to entity/claim extraction config sections (#740)
* Add encoding-model configuration to entity & claim extraction

* add change note

* pr updates

* test fix

* disable GH-based smoke tests
2024-07-26 15:05:08 -07:00
Chris Trevino
4c229afec8
add encoding model to text-chunking config (#743)
* add encoding model to text-chunking config

* revert groupby fix, handled in other pr

* revert environment reader update for other pr
2024-07-26 14:15:17 -07:00
Chris Trevino
4e6589b614
fix config reader to allow for zero gleans (#735)
2024-07-26 09:11:34 -07:00
Alonso Guevara
ce462515d8
Local search llm params (#533)
* initialize config with LocalSearchConfig and GlobalSearchConfig

* init_content LocalSearchConfig and GlobalSearchConfig

* rollback MAP_SYSTEM_PROMPT

* Small changes before merging. Notebook rollback

* Semver

---------

Co-authored-by: glide-the <2533736852@qq.com>
2024-07-15 13:01:56 -06:00
Kylin
e2572c7fab
[bug fix]Fix community_report config doesn't work in settings.yaml (#405)
* fix community_report doesn't work in settings.yaml

* add semversioner

* fix unittest about community report to community reports of env

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-07-08 22:48:02 -06:00
Alonso Guevara
b912081f1b
Add N parameter support (#390)
* Add N parameter support

* Fix unit tests

* Add new env vars to param testing
2024-07-08 14:04:49 -06:00
Alonso Guevara
81b81cf60b
Initial Release
2024-07-01 15:25:30 -06:00