365 Commits

Author SHA1 Message Date
dependabot[bot]
97949ff014
Bump cryptography from 43.0.0 to 44.0.1 in /unified-search-app
Bumps [cryptography](https://github.com/pyca/cryptography) from 43.0.0 to 44.0.1.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/43.0.0...44.0.1)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-version: 44.0.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-04-07 19:04:37 +00:00
gaudyb
0e1a6e3770
Unified search added to graphrag (#1862)
* unified search app added to graphrag repository

* ignore print statements

* update words for unified-search

* fix lint errors

* fix lint error

* fix module name

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>
2025-04-07 11:59:02 -06:00
KennyZhang1
61769dd47e
Vector Store Integration Tests (#1856)
* Add vector store id reference to embeddings config.

* generated initial vector store pytests

* cleaned up cosmosdb vector store test

* fixed class name typo and debugged cosmosdb vector store test

* reset emulator connection string

* remove unnecessary comments

* removed extra comments from azure ai search test

* ruff

* semversioner

* fix cicd issues

* bypass diskANN policy for test env

* handle floating point imprecisions

---------

Co-authored-by: Derek Worthen <worthend.derek@gmail.com>
2025-04-01 11:05:04 -04:00
Gabriel Nieves-Ponce
ffd8db7104
Gnievesponce prompt tune embedd chunking (#1826)
* Added support for embeddings chunking as defined by the config.

* ran semversioner -t patch

* Eliminated redundant code by using the embed_text strategy directly

* Added fix to support brackets within the corpus text; for example, inline LaTeX within a markdown file

---------

Co-authored-by: Gabriel Nieves <gnievesponce@microsoft.com>
2025-03-31 12:38:01 -04:00
Alonso Guevara
b7b2b562ce
fnllm version fix (#1835)
* Fix fnllm version

* Semver
2025-03-21 22:13:56 -07:00
Nathan Evans
3b1e70c06b
Update config docs (2.1.0) (#1818)
* Align docs with config

* Semver

* Spelling

* Format

* Spelling
2025-03-18 12:39:30 -07:00
Nathan Evans
813b4de99f
Fix API key reference for gh-pages (#1821) 2025-03-18 11:10:11 -07:00
Nathan Evans
ddc6541ab6
Add docs page about input formats (#1784)
* Add docs page about input formats

* Add json example

* Spelling
2025-03-11 17:37:46 -07:00
Nathan Evans
321d479ab6
Update notebooks for 2.0 (#1785)
* Update API overview

* Fix global search example

* Fix local search example

* Fix global dynamic example

* Fix drift example

* Update multi-index example

* Semver
2025-03-11 17:23:49 -07:00
Alonso Guevara
0d363e6957
Release v2.1.0 (#1800) v2.1.0 2025-03-11 18:16:08 -06:00
Alonso Guevara
53950f8442
Fix/model provider key injection check (#1799)
* Check available models for type validation

* Semver

* Fix ruff and pyright

* Apply feedback
2025-03-11 17:48:30 -06:00
Gabriel Nieves-Ponce
e39d869bed
Added support for verbose logging and csv-metadata to the prompt tune… (#1789)
* Added support for verbose logging and csv-metadata to the prompt tune client.

* Updated community report summarization file name and prompt template

* updated semversioner

* ran ruff linter

* Ran poe format

* Fix Ruff complaints

* Fix a new ruff complaint :P

* Pyright

* Fix tests

---------

Co-authored-by: Gabriel Nieves <gnievesponce@microsoft.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-03-11 14:55:02 -06:00
Nathan Evans
66c2cfb3ce
Support JSON input files (#1777)
* Add csv loader tests

* Add text loader tests

* Add json input support

* Remove temp path constraint

* Reuse loader code

* Semver

* Set file pattern automatically based on type, if empty

* Remove pattern from smoke test config

* Spelling

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-03-10 14:04:07 -07:00
Nathan Evans
bcb74789f1
Next release docs (#1627)
* Wording updates

* Update yaml config and add notes to "deprecated" env

* Add basic search section

* Update versioning docs

* Minor edits for clarity

* Update init command

* Update init to add --force in docs

* Add NLP extraction params

* Move vector_store to root

* Add workflows to config

* Add FastGraphRAG docs

* add metadata column changes

* Added documentation for multi index search.

* Minor fixes.

* Add config and table renames

* Update migration notebook and comments to specify v1

* Add frequency to entity table docs

* add new chunking options for metadata

* Update output docs

* Minor edits and cleanup

* Add model ids to search configs

* Spruce up migration notebook

* Lint/format multi-index notebook

* SpaCy model note

* Update SpaCy footnote

* Updated multi_index_search.ipynb to remove ruff errors.

* add spacy to dictionary

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
Co-authored-by: Dayenne Souza <ddesouza@microsoft.com>
Co-authored-by: dorbaker <dorbaker@microsoft.com>
2025-03-03 14:46:00 -08:00
Nathan Evans
bd06d8b4f0
Context property bag ("state") (#1774)
* Add pipeline state property bag to run context

* Move state creation out of context util

* Move callbacks into PipelineRunContext

* Semver

* Rename state.json to context.json to avoid confusion with stats.json

* Expand smoke test row count

* Add util to create storage and cache
2025-02-28 09:31:48 -08:00
Nathan Evans
a15942629b
Add more verb tests (#1773)
* Add NLP verb test

* Add finalize_graph tests

* Add more thorough final column assertions
2025-02-27 09:31:46 -08:00
Alonso Guevara
b4b8b81c0a
Remove spacy model from toml (#1771)
* Remove spacy model from toml

* Semver
2025-02-26 10:58:02 -06:00
Alonso Guevara
716f93dd8b
Release v2.0.0 (#1769)
* Release v2.0.0

* snapshots...
v2.0.0
2025-02-25 17:52:30 -06:00
Alonso Guevara
facf68148a
Fix summarization and relationship grouping on Inc Indexing (#1768)
* Fix summarization for large descriptions on incremental indexing

* Semver

* Ruff
2025-02-25 17:29:55 -06:00
Nathan Evans
ede6a74546
Pipeline callbacks (#1729)
* Add pipeline_start and pipeline_end callbacks

* Collapse redundant callback/logger logic

* Remove redundant reporting config classes

* Remove a few out-of-date type ignores

* Semver

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-02-25 15:07:51 -08:00
Nathan Evans
e40476153d
Speed up smoke tests (#1736)
* Move verb tests to regular CI

* Clean up env vars

* Update smoke runtime expectations

* Rework artifact assertions

* Fix plural in name

* remove redundant artifact len check

* Remove redundant artifact len check

* Adjust graph output expectations

* Update community expectations

* Include all workflow output

* Adjust text unit expectations

* Adjust assertions per dataset

* Fix test config param name

* Update nan allowed for optional model fields

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-02-25 13:24:35 -08:00
Nathan Evans
61a309b182
Incremental model alignment (#1766)
* Used shared schema lists for all final columns

* Semver
2025-02-25 13:14:42 -06:00
Alonso Guevara
0144b3fd88
Update FNLLM (#1738)
* Add ModelProvider to Query package.

* Spellcheck + others

* Semver

* Fix tests

* Format

* Fix Pyright

* Fix tests

* Fix for smoke tests

* Update fnllm version

* Semver

* Ruff
2025-02-24 20:30:45 -06:00
Nathan Evans
5dd9fc53cd
Move embeddings snapshots (#1737)
* Move embedding snapshots to the workflow runner

* Semver

* Rename input tables
2025-02-24 17:38:01 -08:00
Alonso Guevara
e0d233fe10
Feat/llm provider query (#1735)
* Add ModelProvider to Query package.

* Spellcheck + others

* Semver

* Fix tests

* Format

* Fix Pyright

* Fix tests

* Fix for smoke tests
2025-02-24 18:35:51 -06:00
Nathan Evans
faa05b691f
Fix text unit incremental ID updates (#1734)
* Increment text_unit ids during incremental

* Semver
2025-02-24 14:58:00 -08:00
Nathan Evans
a932b2d342
Fix StopAsyncIteration catch (#1730) 2025-02-21 11:46:44 -08:00
Derek Worthen
54885b8ab1
Refactor config defaults (#1723)
* Refactor config defaults

- Implement type-safe, hierarchical dataclass for config
defaults instead of namespaced constants.
- Allow for instantiating config directly from defaults data structure.

* fix vector_store db_uri default

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-02-20 13:01:29 -06:00
Alonso Guevara
7bdeaee94a
Create Language Model Providers and Registry methods. Remove fnllm coupling (#1724)
* Base structure

* Add fnllm providers and Mock LLM

* Remove fnllm coupling, introduce llm providers

* Ruff + Tests fix

* Spellcheck

* Semver

* Format

* Default MockChat params

* Fix more tests

* Fix embedding smoke test

* Fix embeddings smoke test

* Fix MockEmbeddingLLM

* Rename LLM to model. Package organization

* Fix prompt tuning

* Oops

* Oops II
2025-02-20 08:56:20 -06:00
Nathan Evans
a42772d368
Query callbacks (#1721)
* Add callbacks to global search

* Add callbacks to local search

* Add streaming callbacks in local search CLI

* Add callbacks to basic search

* Add callbacks to DRIFT search

* Semver

* Return generators directly in API

* Guard callbacks
2025-02-19 13:00:07 -08:00
Nathan Evans
efcaf9636d
Tuck flow functions under their workflows (#1720)
* Move flow functions to workflow

* Remove redundant workflow_name variable

* Semver
2025-02-18 15:33:36 -06:00
Alonso Guevara
7f020826be
Fix/json mode community reports (#1713)
* Patch json mode on Community Reports

* Semversioner

* Wording oopsie
2025-02-14 16:51:42 -06:00
Nathan Evans
96219a2182
Register workflows (#1691)
* Add workflow registration

* Add ability to mutate config by workflows

* Separate graph finalization

* Separate graph pruning

* Semver

* Update tests

* Update smoke tests

* Fix iterrows on create_graph

* Remove prune_graph from llm construction

* Update test data

* Remove prune_graph from smoke tests
2025-02-14 13:21:31 -08:00
Nathan Evans
981fd31963
Community children (#1704)
* Add children to the community tables

* Replace NaN children with empty list

* Replace subcommunity logic with built-in parent/child fields

* Remove restore_community_hierarchy

* Add children and frequency to migration notebook

* Format

* Semver

* Add children to reports

* Update tests

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-02-13 17:03:51 -08:00
Nathan Evans
35b639399b
Incremental flow rework (#1696)
* Rework update output structure

* Semver

* Fix unit test

* Update frequency in incremental

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-02-13 18:22:32 -06:00
Alonso Guevara
5ef2399a6f
Chore/remove iterrows (#1708)
* Remove most iterrows usages

* Semver

* Ruff

* Pyright

* Format
2025-02-13 17:32:54 -06:00
Josh Bradley
f14cda2b6d
Improve default llm retry logic to be more optimized (#1701) 2025-02-13 16:56:37 -05:00
Josh Bradley
b8b949f3bb
Cleanup query api - remove code duplication (#1690)
* consolidate query api functions and remove code duplication

* refactor and remove more code duplication

* Add semversioner file

* fix basic search

* fix drift search and update base class function names

* update example notebooks
2025-02-13 16:31:08 -05:00
Nathan Evans
fe461417b5
Export NLP community reports prompt (#1697)
* Properly export the NLP community reports prompt

* Semver

* Fix verb tests
2025-02-12 10:41:39 -08:00
Dayenne Souza
b94290ec2b
add option to add metadata into text chunks (#1681)
* add new options

* add metadata json into input document

* remove doc change

* add metadata column into text loader

* prepend_metadata

* run fix

* fix tests and patch

* fix test

* add warning for metadata tokens > config size

* fix typo and run fix

* fix test_integration

* fix test

* run check

* rename and fix chunking

* fix

* fix

* fix test verbs

* fix

* fix tests

* fix chunking

* fix index

* fix cosmos test

* fix vars

* fix after PR

* fix
2025-02-12 09:38:03 -08:00
KennyZhang1
b9dc7b90d5
Fix/streamline workflow miq bugs (#1694)
* Add vector store id reference to embeddings config.

* added communities to links and maxvals

* Consistent naming

* Update entity_ids to include index_name

* added consistent logging messages to miq cli

* semversioner

---------

Co-authored-by: Derek Worthen <worthend.derek@gmail.com>
Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
2025-02-11 16:13:28 -05:00
Nathan Evans
a6a78d5897
Nlp cache (#1689)
* Add cache to build_noun_graph

* Semver
2025-02-10 11:00:51 -08:00
Nathan Evans
c02ab0984a
Streamline workflows (#1674)
* Remove create_final_nodes

* Rename final entity output to "entities"

* Remove duplicate code from graph extraction

* Rename create_final_relationships output to "relationships"

* Rename create_final_communities output to "communities"

* Combine compute_communities and create_final_communities

* Rename create_final_covariates output to "covariates"

* Rename create_final_community_reports output to "community_reports"

* Rename create_final_text_units output to "text_units"

* Rename create_final_documents output to "documents"

* Remove transient snapshots config

* Move create_final_entities to finalize_entities operation

* Move create_final_relationships flow to finalize_relationships operation

* Reuse some community report functions

* Collapse most of graph and text unit-based report generation

* Unify schemas files

* Move community reports extractor

* Move NLP report prompt to prompts folder

* Fix a few pandas warnings

* Rename embeddings config to embed_text

* Rename claim_extraction config to extract_claims

* Remove nltk from standard graph extraction

* Fix verb tests

* Fix extract graph config naming

* Fix moved file reference

* Create v1-to-v2 migration notebook

* Semver

* Fix smoke test artifact count

* Raise tpm/rpm on smoke tests

* Update drift settings for smoke tests

* Reuse project directory var in api notebook

* Format

* Format
2025-02-07 11:11:03 -08:00
KennyZhang1
83cc2daf91
Multi-index query CLI support (#1675)
* Add vector store id reference to embeddings config.

* changed structure of output config section

* added cli integration for multi index global

* added cli integration for multi index local

* added cli integration for multi index drift and basic

* finished local testing of multi-index cli

* ruff fixes

* partially refactored test code to align with new output section

* more test changes for new output structure

* semversioner

* refactored to align with new multi index config proposal

* locally tested new multi-index output proposal

* cleaned up tests to align with new structure

---------

Co-authored-by: Derek Worthen <worthend.derek@gmail.com>
2025-02-07 12:56:48 -05:00
Alonso Guevara
0805924a35
Fix/drift n depth (#1676)
* Fix n_depth param

* Semver

* Change smoke tests params for drift

* Reduce log printing for expected exceptions
2025-02-05 17:22:34 -06:00
JunHo Kim (김준호)
a4d35bc66f
Fix typo in DEVELOPING.md instructions (#1631)
Corrected "this values" to "these values" for improved clarity. This ensures the documentation is more accurate and professional.

Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
2025-02-04 13:16:57 -08:00
JunHo Kim (김준호)
30f36316af
Fix typo in table formatting in env_vars documentation (#1632)
Corrected a missing backtick in a note within the `GRAPHRAG_API_KEY` description. This ensures proper code formatting and improves readability in the documentation. No content was altered aside from formatting adjustments.

Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
2025-02-04 13:14:58 -08:00
Dayenne Souza
ad5b5120ec
remove unused columns and rename document_attribute_columns (#1672)
* remove unused columns and change property document_attribute_columns to metadata

* format file

* fix 'metadata' column on output

* run check

* fix test on nltk

* remove docs changes
2025-02-03 14:37:06 -03:00
Nathan Evans
907d271f4e
Fix recursive report generation (#1669) 2025-01-30 11:03:25 -08:00
Nathan Evans
53b06aa2ac
Add generate_text_embeddings to FGR (#1667) 2025-01-29 14:31:48 -08:00