9 Commits

Author SHA1 Message Date
Nathan Evans
bd06d8b4f0
Context property bag ("state") (#1774)
* Add pipeline state property bag to run context

* Move state creation out of context util

* Move callbacks into PipelineRunContext

* Semver

* Rename state.json to context.json to avoid confusion with stats.json

* Expand smoke test row count

* Add util to create storage and cache
2025-02-28 09:31:48 -08:00
Alonso Guevara
7bdeaee94a
Create Language Model Providers and Registry methods. Remove fnllm coupling (#1724)
* Base structure

* Add fnllm providers and Mock LLM

* Remove fnllm coupling, introduce llm providers

* Ruff + Tests fix

* Spellcheck

* Semver

* Format

* Default MockChat params

* Fix more tests

* Fix embedding smoke test

* Fix embeddings smoke test

* Fix MockEmbeddingLLM

* Rename LLM to model. Package organization

* Fix prompt tuning

* Oops

* Oops II
2025-02-20 08:56:20 -06:00
Nathan Evans
96219a2182
Register workflows (#1691)
* Add workflow registration

* Add ability to mutate config by workflows

* Separate graph finalization

* Separate graph pruning

* Semver

* Update tests

* Update smoke tests

* Fix iterrows on create_graph

* Remove prune_graph from llm construction

* Update test data

* Remove prune_graph from smoke tests
2025-02-14 13:21:31 -08:00
Nathan Evans
c02ab0984a
Streamline workflows (#1674)
* Remove create_final_nodes

* Rename final entity output to "entities"

* Remove duplicate code from graph extraction

* Rename create_final_relationships output to "relationships"

* Rename create_final_communities output to "communities"

* Combine compute_communities and create_final_communities

* Rename create_final_covariates output to "covariates"

* Rename create_final_community_reports output to "community_reports"

* Rename create_final_text_units output to "text_units"

* Rename create_final_documents output to "documents"

* Remove transient snapshots config

* Move create_final_entities to finalize_entities operation

* Move create_final_relationships flow to finalize_relationships operation

* Reuse some community report functions

* Collapse most of graph and text unit-based report generation

* Unify schemas files

* Move community reports extractor

* Move NLP report prompt to prompts folder

* Fix a few pandas warnings

* Rename embeddings config to embed_text

* Rename claim_extraction config to extract_claims

* Remove nltk from standard graph extraction

* Fix verb tests

* Fix extract graph config naming

* Fix moved file reference

* Create v1-to-v2 migration notebook

* Semver

* Fix smoke test artifact count

* Raise tpm/rpm on smoke tests

* Update drift settings for smoke tests

* Reuse project directory var in api notebook

* Format

* Format
2025-02-07 11:11:03 -08:00
Derek Worthen
c644338bae
Refactor config (#1593)
* Refactor config

- Add new ModelConfig to represent LLM settings
    - Combines LLMParameters, ParallelizationParameters, encoding_model, and async_mode
- Add top level models config that is a list of available LLM ModelConfigs
- Remove LLMConfig inheritance and delete LLMConfig
    - Replace the inheritance with a model_id reference to the ModelConfig listed in the top level models config
- Remove all fallbacks and hydration logic from create_graphrag_config
    - This removes the automatic env variable overrides
- Support env variables within config files using Templating
    - This requires "$" to be escaped with extra "$" so ".*\\.txt$" becomes ".*\\.txt$$"
- Update init content to initialize new config file with the ModelConfig structure
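
The "$" escaping rule noted above matches the behavior of Python's standard `string.Template`, where `$$` renders as a literal `$`. The following is a minimal sketch assuming the config loader substitutes environment variables that way. The key names, the `EXAMPLE_API_KEY` variable, and the `render_config` helper are hypothetical and exist only for illustration.

```python
from string import Template

# Hypothetical config text: key names and the variable name are illustrative only.
# The raw string keeps the regex backslashes exactly as they would appear in a file.
RAW_CONFIG = r'''
api_key: ${EXAMPLE_API_KEY}
file_pattern: ".*\\.txt$$"
'''


def render_config(text: str, env: dict[str, str]) -> str:
    """Substitute ${NAME} placeholders; a doubled "$$" becomes a literal "$"."""
    return Template(text).substitute(env)


# A real loader would likely pass os.environ; a plain dict keeps the sketch self-contained.
rendered = render_config(RAW_CONFIG, {"EXAMPLE_API_KEY": "secret"})
print(rendered)
```

With `substitute`, an unescaped trailing `$` (as in the original `".*\\.txt$"` pattern) raises `ValueError`, which is why the regex has to be written with `$$` once templating is enabled.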

* Use dict of ModelConfig instead of list

* Add model validations and unit tests

* Fix ruff checks

* Add semversioner change

* Fix unit tests

* validate root_dir in pydantic model

* Rename ModelConfig to LanguageModelConfig

* Rename ModelConfigMissingError to LanguageModelConfigMissingError

* Add validation for unexpected API keys

* Allow skipping pydantic validation for testing/mocking purposes.

* Add default lm configs to verb tests

* smoke test

* remove config from flows to fix llm arg mapping

* Fix embedding llm arg mapping

* Remove timestamp from smoke test outputs

* Remove unused "subworkflows" smoke test properties

* Add models to smoke test configs

* Update smoke test output path

* Send logs to logs folder

* Fix output path

* Fix csv test file pattern

* Update placeholder

* Format

* Instantiate default model configs

* Fix unit tests for config defaults

* Fix migration notebook

* Remove create_pipeline_config

* Remove several unused config models

* Remove indexing embedding and input configs

* Move embeddings function to config

* Remove skip_workflows

* Remove skip embeddings in favor of explicit naming

* fix unit test spelling mistake

* self.models[model_id] is already a language model. Remove redundant casting.

* update validation errors to instruct users to rerun graphrag init

* instantiate LanguageModelConfigs with validation

* skip validation in unit tests

* update verb tests to use default model settings instead of skipping validation

* test using llm settings

* cleanup verb tests

* remove unsafe default model config

* remove the ability to skip pydantic validation

* remove None union types when default values are set

* move vector_store from embeddings to top level of config and delete resolve_paths

* update vector store settings

* fix vector store and smoke tests

* fix serializing vector_store settings

* fix vector_store usage

* fix vector_store type

* support cli overrides for loading graphrag config

* rename storage to output

* Add --force flag to init

* Remove run_id and resume, fix Drift config assignment

* Ruff

---------

Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-01-21 17:52:06 -06:00
Nathan Evans
7ec9ef0261
Refactor callbacks (#1583)
* Unify Workflow and Verb callbacks interfaces

* Semver

* Fix storage class instantiation (#1582)

---------

Co-authored-by: Josh Bradley <joshbradley@microsoft.com>
2025-01-06 10:58:59 -08:00
Nathan Evans
a35cb12741
Remove datashaper strip code (#1581)
Remove datashaper
2025-01-03 13:59:26 -08:00
KennyZhang1
8368b12532
Add Cosmos DB storage/cache option (#1431)
* added cosmosdb constructor and database methods

* added rest of abstract method headers

* added cosmos db container methods

* implemented has and delete methods

* finished implementing abstract class methods

* integrated class into storage factory

* integrated cosmosdb class into cache factory

* added support for new config file fields

* replaced primary key cosmosdb initialization with connection strings

* modified cosmosdb setter to require json

* Fix non-default emitters

* Format

* Ruff

* ruff

* first successful run of cosmosdb indexing

* removed extraneous container_name setting

* require base_dir to be typed as str

* reverted merged changes from closed branch

* removed nested try statement

* readded initial non-parquet emitter fix

* added basic support for parquet emitter using internal conversions

* merged with main and resolved conflicts

* fixed more merge conflicts

* added cosmosdb functionality to query pipeline

* tested query for cosmosdb

* collapsed cosmosdb schema to use minimal containers and databases

* simplified create_database and create_container functions

* ruff fixes and semversioner

* spellcheck and ci fixes

* updated pyproject toml and lock file

* apply fixes after merge from main

* add temporary comments

* refactor cache factory

* refactored storage factory

* minor formatting

* update dictionary

* fix spellcheck typo

* fix default value

* fix pydantic model defaults

* update pydantic models

* fix init_content

* cleanup how factory passes parameters to file storage

* remove unnecessary output file type

* update pydantic model

* cleanup code

* implemented clear method

* fix merge from main

* add test stub for cosmosdb

* regenerate lock file

* modified set method to collapse parquet rows

* modified get method to collapse parquet rows

* updated has and delete methods and docstrings to adhere to new schema

* added prefix helper function

* replaced delimiter for prefixed id

* verified empty tests are passing

* fix merges from main

* add find test

* update cicd step name

* tested querying for new schema

* resolved errors from merge conflicts

* refactored set method to handle cache in new schema

* refactored get method to handle cache in new schema

* force unique ids to be written to cosmos for nodes

* found bug with has and delete methods

* modified has and delete to work with cache in new schema

* fix the merge from main

* minor typo fixes

* update lock file

* spellcheck fix

* fix init function signature

* minor formatting updates

* remove https protocol

* change localhost to 127.0.0.1 address

* update pytest to use bacj engine

* verified cache tests

* improved speed of has function

* resolved pytest error with find function

* added test for child method

* make container_name variable private as _container_name

* minor variable name fix

* cleanup cosmos pytest and make the cosmosdb storage class operations more efficient

* update cicd to use different cosmosdb emulator

* test with http protocol

* added pytest for clear()

* add longer timeout for cosmosdb emulator startup

* revert http connection back to https

* add comments to cicd code for future dev usage

* set container and database clients to None upon deletion

* ruff changes

* add comments to cicd code

* removed unneeded None statements and ruff fixes

* more ruff fixes

* Update test_run.py

* remove unnecessary call to delete container

* ruff format updates

* Reverted test_run.py

* fix ruff formatter errors

* cleanup variable names to be more consistent

* remove extra semversioner file

* revert pydantic model changes

* revert pydantic model change

* revert pydantic model change

* re-enable inline formatting rule

* update documentation in dev guide

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
Co-authored-by: Josh Bradley <joshbradley@microsoft.com>
2024-12-19 13:43:21 -06:00
Nathan Evans
1d68af308b
Community workflow (#1495)
* Create separate communities workflow

* Add test for new workflow

* Rename workflows

* Collapse subflows into parents

* Rename flows, reuse variables

* Semver

* Fix integration test

* Fix smoke tests

* Fix megapipeline format

* Rename missed files

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-12-11 15:41:16 -06:00