graphrag

mirror of https://github.com/microsoft/graphrag.git synced 2025-11-29 00:20:57 +00:00

Author	SHA1	Message	Date
Nathan Evans	a35cb12741	Remove datashaper strip code (#1581 ) Remove datashaper	2025-01-03 13:59:26 -08:00
dependabot[bot]	58f646a019	Bump ruff from 0.8.4 to 0.8.5 (#1579 ) * Bump ruff from 0.8.4 to 0.8.5 Bumps [ruff](https://github.com/astral-sh/ruff) from 0.8.4 to 0.8.5. - [Release notes](https://github.com/astral-sh/ruff/releases) - [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md) - [Commits](https://github.com/astral-sh/ruff/compare/0.8.4...0.8.5) --- updated-dependencies: - dependency-name: ruff dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> * Fix ruff * Semver * Another ruff --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2025-01-02 17:45:52 -06:00
Derek Worthen	80367be018	Remove config input models (#1570 ) * Remove config input models * remove unit tests related to config input models * add semversioner change * Merge branch 'main' into config-remove-input-models	2025-01-02 15:25:10 -08:00
gaudyb	185f513ca7	Basic search implementation (#1563 ) * basic search implementation * basic streaming functionality * format check * check fix * release change * Chore/gleanings any encoding (#1569) * Make claims and entities independent of encoding * Semver * Change semver release type --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2025-01-02 13:49:11 -06:00
Alonso Guevara	5f9ad0d003	Chore/gleanings any encoding (#1569 ) * Make claims and entities independent of encoding * Semver * Change semver release type	2025-01-02 11:44:21 -06:00
Alonso Guevara	2abd6c5f5c	Update blog posts (#1571 )	2024-12-30 17:16:08 -06:00
Alonso Guevara	5258bc5f4f	Fix/gleanings loop (#1564 ) * Fix gleaning output parsing * Semver	2024-12-30 12:57:33 -06:00
Nathan Evans	a2647da473	Simplify flow config (#1554 ) * Flatten compute_communities config * Remove cluster strategy type * Flatten create_base_text_units config * Move cluster seed to config default, leave as None in functions * Remove "prechunked" logic * Remove hard-coded encoding model * Remove unused variables * Strongly type embed_config * Simplify layout_graph config * Semver * Fix integration test * Fix config unit tests: ignore new config defaults * Remove pipeline integ test	2024-12-27 16:38:36 -08:00
Theo Beigbeder	e6de713f25	Fix in load_llm.py (#1508 ) Fixed an issue where the "proxy" setting was passed to the PublicOpenAPI constructor instead of the "api_base" parameter, disabling the use of on-premise OpenAI-based LLM servers Co-authored-by: Alonso Guevara <alonsog@microsoft.com> Co-authored-by: Josh Bradley <joshbradley@microsoft.com>	2024-12-19 13:51:01 -06:00
joeyhacker	c450f85edd	Solved graphrag index can't use other llm problem (#1507 ) Co-authored-by: Josh Bradley <joshbradley@microsoft.com> Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-12-19 13:49:47 -06:00
KennyZhang1	8368b12532	Add Cosmos DB storage/cache option (#1431 ) * added cosmosdb constructor and database methods * added rest of abstract method headers * added cosmos db container methods * implemented has and delete methods * finished implementing abstract class methods * integrated class into storage factory * integrated cosmosdb class into cache factory * added support for new config file fields * replaced primary key cosmosdb initialization with connection strings * modified cosmosdb setter to require json * Fix non-default emitters * Format * Ruff * ruff * first successful run of cosmosdb indexing * removed extraneous container_name setting * require base_dir to be typed as str * reverted merged changed from closed branch * removed nested try statement * readded initial non-parquet emitter fix * added basic support for parquet emitter using internal conversions * merged with main and resolved conflicts * fixed more merge conflicts * added cosmosdb functionality to query pipeline * tested query for cosmosdb * collapsed cosmosdb schema to use minimal containers and databases * simplified create_database and create_container functions * ruff fixes and semversioner * spellcheck and ci fixes * updated pyproject toml and lock file * apply fixes after merge from main * add temporary comments * refactor cache factory * refactored storage factory * minor formatting * update dictionary * fix spellcheck typo * fix default value * fix pydantic model defaults * update pydantic models * fix init_content * cleanup how factory passes parameters to file storage * remove unnecessary output file type * update pydantic model * cleanup code * implemented clear method * fix merge from main * add test stub for cosmosdb * regenerate lock file * modified set method to collapse parquet rows * modified get method to collapse parquet rows * updated has and delete methods and docstrings to adhere to new schema * added prefix helper function * replaced delimiter for prefixed id * 
verified empty tests are passing * fix merges from main * add find test * update cicd step name * tested querying for new schema * resolved errors from merge conflicts * refactored set method to handle cache in new schema * refactored get method to handle cache in new schema * force unique ids to be written to cosmos for nodes * found bug with has and delete methods * modified has and delete to work with cache in new schema * fix the merge from main * minor typo fixes * update lock file * spellcheck fix * fix init function signature * minor formatting updates * remove https protocol * change localhost to 127.0.0.1 address * update pytest to use bacj engine * verified cache tests * improved speed of has function * resolved pytest error with find function * added test for child method * make container_name variable private as _container_name * minor variable name fix * cleanup cosmos pytest and make the cosmosdb storage class operations more efficient * update cicd to use different cosmosdb emulator * test with http protocol * added pytest for clear() * add longer timeout for cosmosdb emulator startup * revert http connection back to https * add comments to cicd code for future dev usage * set to container and database clients to none upon deletion * ruff changes * add comments to cicd code * removed unneeded None statements and ruff fixes * more ruff fixes * Update test_run.py * remove unnecessary call to delete container * ruff format updates * Reverted test_run.py * fix ruff formatter errors * cleanup variable names to be more consistent * remove extra semversioner file * revert pydantic model changes * revert pydantic model change * revert pydantic model change * re-enable inline formatting rule * update documentation in dev guide --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com> Co-authored-by: Josh Bradley <joshbradley@microsoft.com>	2024-12-19 13:43:21 -06:00
Nathan Evans	c1c09bab80	Flow cleanup (#1510 ) * Move snapshots out of flows into verbs * Move degree compute out of extract_graph * Move entity/relationship df merging into extract * Move "title" to extraction source * Move text_unit_ids agg closer to extraction * Move data definition * Update test data * Semver * Update smoke tests * Fix empty degree field and update smoke tests and verb data * Move extractors (#1516) * Consolidate graph embedding and umap * Consolidate claim extraction * Consolidate graph extractor * Move graph utils * Move summarizers * Semver --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com> * Fix syntax typo --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-12-18 18:07:44 -08:00
Nathan Evans	d0543d1fd6	Move extractors (#1516 ) * Consolidate graph embedding and umap * Consolidate claim extraction * Consolidate graph extractor * Move graph utils * Move summarizers * Semver --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-12-18 16:21:41 -08:00
ex0ns	d59b397fd2	feat: move py.typed to root (#1529 ) Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-12-18 17:40:45 -06:00
Alonso Guevara	aa467f462a	Release v1.0.1 (#1534 ) v1.0.1	2024-12-18 17:24:43 -06:00
Alonso Guevara	cfe2082669	Fix/llm bugs empty extraction (#1533 ) * Add llm singleton and check for empty extraction * Semver * Tests and spellcheck * Move the singletons to a proper place * Leftover print * Ruff	2024-12-18 17:07:29 -06:00
Alonso Guevara	f7cd155dbc	Fix/encoding model config (#1527 ) * fix: include encoding_model option when initializing LLMParameters * chore: add semver patch description * Fix encoding model parsing * Fix unit tests --------- Co-authored-by: Nico Reinartz <nico.reinartz@rwth-aachen.de>	2024-12-16 21:03:56 -06:00
Alonso Guevara	329b83cf7f	Fix on_error Callbacks (#1526 )	2024-12-16 14:56:41 -06:00
Josh Bradley	983664397b	Update doc site with api overview notebook (#1509 ) update doc site	2024-12-12 16:08:24 -05:00
Alonso Guevara	2d1c27d748	Release v1.0.0 (#1501 ) v1.0.0	2024-12-11 17:47:28 -06:00
Nathan Evans	1d68af308b	Community workflow (#1495 ) * Create separate communities workflow * Add test for new workflow * Rename workflows * Collapse subflows into parents * Rename flows, reuse variables * Semver * Fix integration test * Fix smoke tests * Fix megapipeline format * Rename missed files --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-12-11 15:41:16 -06:00
Alonso Guevara	de12521405	Dependency updates (#1494 ) * Dependency updates * Semver	2024-12-10 17:25:38 -06:00
Josh Bradley	823342188d	Cleanup factory methods (#1482 ) * cleanup factory methods to have similar design pattern across codebase * add semversioner file * cleanup logging factory * update developer guide * add comment * typo fix * cleanup reporter terminology * rename reporter to logger * fix comments * update comment * instantiate factory classes correctly and update index api callback parameter --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-12-10 16:11:11 -06:00
Alonso Guevara	04405803db	Add Parent to communities in data model (#1491 ) * Add Parent to communities in data model * Semver * Pyright * Update docs * Use leiden cluster parent id * Format	2024-12-10 14:38:11 -06:00
Nathan Evans	61816e076f	Migration notebook (#1492 ) * Add migration notebook * Update migration instructions * Semver * Rename item in relationships table * Remove indexing vector store shim * Remove query shims * Remove columns from migrated data * Format * Add community parents	2024-12-10 14:23:26 -06:00
Alonso Guevara	1a13e0fd93	Release v0.9.0 (#1479 ) * Release v0.9.0 * Spellcheck v0.9.0	2024-12-06 14:29:55 -06:00
Alonso Guevara	1c3b0f34c3	Chore/lib updates (#1477 ) * Update dependencies and fix issues * Format * Semver * Fix Pyright * Pyright * More Pyright * Pyright	2024-12-06 14:08:24 -06:00
volksen	b1f2ca785e	deduplicate sources in local search context (#1468 ) Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-12-06 13:05:00 -06:00
Chris Trevino	5ff2d3c76d	Remove graphrag.llm, replace with fnllm (#1315 ) * add fnllm; remove llm folder * remove llm unit tests * update imports * update imports * formatting * enable autosave * update mockllm * update community reports extractor * move most llm usage to fnllm * update type issues * fix unit tests * type updates * update dictionary * semver * update llm construction, get integration tests working * load from llmparameters model * move ruff settings to ruff.toml * add gitattributes file * ignore ruff.toml spelling * update .gitattributes * update gitignore * update config construction * update prompt var usage * add cache adapter * use cache adapter in embeddings calls * update embedding strategy * add fnllm * add pytest-dotenv * fix some verb tests * get verbtests running * update ruff.toml for vscode * enable ruff native server in vscode * update artifact inspecting code * remove local-test update * use string.replace instead of string.format in community reprots etxractor * bump timeout * revert ruff.toml, vscode settings for another pr * revert cspell config * revert gitignore * remove json-repair, update fnllm * use fnllm generic type interfaces * update load_llm to use target models * consolidate chat parameters * add 'extra_attributes' prop to community report response * formatting * update fnllm * formatting * formatting * Add defaults to some llm params to avoid null on params hash * Formatting --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com> Co-authored-by: Josh Bradley <joshbradley@microsoft.com>	2024-12-05 18:07:47 -06:00
Alonso Guevara	d43124e576	Refactor Create Final Community reports to simplify code (#1456 ) * Optimize prep claims * Optimize community hierarchy restore * Partial optimization of prepare_community_reports * More optimization code * Fix context string generation * Filter community -1 * Fix cache, add more optimization fixes * Fix local search community ids * Cleanup * Format * Semver * Remove perf counter * Unused import * Format * Fix edge addition to reports * Add edge by edge context creation * Re-org of the optimization code * Format * Ruff * Some Ruff fixes * More pyright * More pyright * Pyright * Pyright * Update tests	2024-12-05 17:13:05 -06:00
Josh Bradley	b00142260d	Update index API + a notebook that provides a general API overview (#1454 ) * update index api to accept callbacks * fix hardcoded folder name that was creating an empty folder * add API notebook * add semversioner file * filename change --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-12-05 15:34:21 -06:00
KennyZhang1	10f84c91eb	Replace md5 hash (#1470 ) * switched hashing function helper to sha256 * refactored references to hashing util * semversioner * switched from sha256 to sha512 * new semversioner * updated tests/verbs/data folder * generated fresh parquet files in data folder * moved ignore flag	2024-12-05 13:24:35 -06:00
Nathan Evans	d17dfd01f9	Graph collapse (#1464 ) * Refactor graph creation * Semver * Spellcheck * Update integ pipeline * Fix cast * Improve pandas chaining * Cleaner apply * Use list comprehensions --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-12-05 11:57:26 -06:00
Gijs Segerink	756f5c38a7	Update search.py (#1457 ) Missing query in astream_search	2024-12-04 16:52:08 -06:00
Josh Bradley	dad2176b3c	Miscellaneous code cleanup procedures (#1452 )	2024-11-27 13:27:43 -05:00
Nathan Evans	0b2120ca45	Docs and notebooks update (#1451 ) * Fix local question gen and example notebook * Update global search notebook * Add lazy blog post * Update breaking changes doc for migration notes * Simplify Getting Started page * Semver * Spellcheck * Fix types * Add comments on cache-free migration * Update wording * Spelling --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-11-27 09:56:48 -08:00
Yuan Chai	2b7d28944d	Fix encoding issue: Ensure non-ASCII characters are correctly represe… (#1446 ) Fix encoding issue: Ensure non-ASCII characters are correctly represented in entity name key Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-11-26 17:47:06 -06:00
Alonso Guevara	ae796b99cb	Fix dynamic community selection in global search (#1450 ) * Fix dynamic community selection in global search * Format * Ruff fix	2024-11-26 15:19:50 -06:00
Alonso Guevara	6d21ef2683	Release v0.5.0 (#1415 ) v0.5.0	2024-11-18 00:06:54 -06:00
Josh Bradley	22a57d14c7	Improve CLI speed with lazy imports (#1319 )	2024-11-15 19:41:10 -05:00
Nathan Evans	9b4f24ebce	First cut at config cleanup (#1411 ) * First cut at config cleanup * Reorder top nav * Add query prompts to tuning page * Remove dynamic notebook from nav * Add more thorough yml config descriptions in docs * Further clean out the config * Semver * Add new blog post * Emphasize yaml * Clarify output * Fix unit test * Fix bullet nesting	2024-11-15 14:33:26 -08:00
Nathan Evans	425dbc60e3	Docs update (#1408 ) * Fix footer contrast * Fix broken links * Remove a few unneeded examples * Point python API example to the whole folder * Convert schema bullets to tables	2024-11-14 21:26:29 -06:00
JunHo Kim (김준호)	ec9cdcce4d	fix typo. Correct the wording "global search" to "drift search" in drift search documentation (#1383 ) Updated the wording of the example scenario from "global search" to "drift search" to accurately reflect the topic. This improves clarity and ensures the documentation accurately describes its content. Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-11-14 16:55:44 -06:00
Jeff Baumes	0a5801041a	Fix documentation for generate_indexing_prompts (#1336 ) Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-11-14 16:53:59 -06:00
Alonso Guevara	c90166ca32	Add Parquet as part of the default emitters when not present (#1407 ) Add Parquet as part of the default emitters when not present	2024-11-14 13:04:19 -06:00
Nathan Evans	51912b2e03	Move prompts (#1404 ) * Move indexing prompts to root * Move query prompts to root * Export query prompts during init * Extract general knowledge prompt * Load query prompts from disk * Semver * Fix unit tests	2024-11-14 10:45:37 -08:00
Nathan Evans	c8c354e357	Artifact cleanup (#1341 ) * Add source documents for verb tests * Remove entity_type erroneous column * Add new test data * Remove source/target degree columns * Remove top_level_node_id * Remove chunk column configs * Rename "chunk" to "text" * Rename "chunk" to "text" in base * Re-map document input to use base text units * Revert base text units as final documents dep * Update test data * Split/rename node source_id * Drop node size (dup of degree) * Drop document_ids from covariates * Remove unused document_ids from models * Remove n_tokens from covariate table * Fix missed document_ids delete * Wire base text units to final documents * Rename relationship rank as combined_degree * Add rank as first-class property to Relationship * Remove split_text operation * Fix relationships test parquet * Update test parquets * Add entity ids to community table * Remove stored graph embedding columns * Format * Semver * Fix JSON typo * Spelling * Rename lancedb * Sort lancedb * Fix unit test * Fix test to account for changing period * Update tests for separate embeddings * Format * Better assertion printing * Fix unit test for windows * Rename document.raw_content -> document.text * Remove read_documents function * Remove unused document summary from model * Remove unused imports * Format * Add new snapshots to default init * Use util to construct embeddings collection name * Align inc index model with branch changes * Update data and tests for int ids * Clean up embedding locs * Switch entity "name" to "title" for consistency * Fix short_id -> human_readable_id defaults * Format * Rework community IDs * Fix community size compute * Fix unit tests * Fix report read * Pare down nodes table output * Fix unit test * Fix merge * Fix community loading * Format * Fix community id report extraction * Update tests * Consistent short IDs and ordering * Update ordering and tests * Update incremental for new nodes model * Guard document columns loc * Match 
column ordering * Fix document guard * Update smoke tests * Fill NA on community extract * Logging for smoke test debug * Add parquet schema details doc * Fix community hierarchy guard * Use better empty hierarchy guard * Back-compat shims * Semver * Fix warning * Format * Remove default fallback * Reuse key	2024-11-13 15:11:19 -08:00
Alonso Guevara	e53422366d	Implement dynamic community selection for global search (#1396 ) * update gitignore * add dynamic community selection to updated main branch * update SearchResult to record output_tokens. * update search result * dynamic search working * format * add llm_calls_categories and prompt_tokens and output_tokens cate * update * formatting * log drift search output and prompt tokens separately * update global_search.ipynb. update operate dulce dataset and add create_final_communities. update dynamic community selection init * add .ipynb back to cspell.config.yaml * format * add notebook example on dynamic search * rearrange * update gitignore * format code * code format * code format * fix default variable --------- Co-authored-by: Bryan Li <bryanlimy@gmail.com>	2024-11-11 16:45:07 -08:00
Alonso Guevara	ba50caab4d	Release v0.4.1 (#1387 ) * Release v0.4.1 * Spellcheck v0.4.1	2024-11-08 17:59:57 -06:00
Alonso Guevara	20c120288b	Feat/update cli (#1376 ) * Add update cli option with default storage * Semver * Semver * Pyright * Format	2024-11-07 06:59:10 -06:00


342 Commits