graphrag

mirror of https://github.com/microsoft/graphrag.git synced 2025-07-24 17:31:50 +00:00

Author	SHA1	Message	Date
Nathan Evans	c8c354e357	Artifact cleanup (#1341 ) * Add source documents for verb tests * Remove entity_type erroneous column * Add new test data * Remove source/target degree columns * Remove top_level_node_id * Remove chunk column configs * Rename "chunk" to "text" * Rename "chunk" to "text" in base * Re-map document input to use base text units * Revert base text units as final documents dep * Update test data * Split/rename node source_id * Drop node size (dup of degree) * Drop document_ids from covariates * Remove unused document_ids from models * Remove n_tokens from covariate table * Fix missed document_ids delete * Wire base text units to final documents * Rename relationship rank as combined_degree * Add rank as first-class property to Relationship * Remove split_text operation * Fix relationships test parquet * Update test parquets * Add entity ids to community table * Remove stored graph embedding columns * Format * Semver * Fix JSON typo * Spelling * Rename lancedb * Sort lancedb * Fix unit test * Fix test to account for changing period * Update tests for separate embeddings * Format * Better assertion printing * Fix unit test for windows * Rename document.raw_content -> document.text * Remove read_documents function * Remove unused document summary from model * Remove unused imports * Format * Add new snapshots to default init * Use util to construct embeddings collection name * Align inc index model with branch changes * Update data and tests for int ids * Clean up embedding locs * Switch entity "name" to "title" for consistency * Fix short_id -> human_readable_id defaults * Format * Rework community IDs * Fix community size compute * Fix unit tests * Fix report read * Pare down nodes table output * Fix unit test * Fix merge * Fix community loading * Format * Fix community id report extraction * Update tests * Consistent short IDs and ordering * Update ordering and tests * Update incremental for new nodes model * Guard document columns loc * Match column ordering * Fix document guard * Update smoke tests * Fill NA on community extract * Logging for smoke test debug * Add parquet schema details doc * Fix community hierarchy guard * Use better empty hierarchy guard * Back-compat shims * Semver * Fix warning * Format * Remove default fallback * Reuse key	2024-11-13 15:11:19 -08:00
Alonso Guevara	d9f985ae52	Drift Search CLI, API, Docs and Example Notebook (#1348 ) * Drift CLI and backwards compat * Adding DRIFT Cli, Docs and example notebook * Update tests and fix ruff * Format * Small cleanup * Fix smoke tests * Update notebook * Oopsie fix * Delete duplicate img	2024-11-05 12:05:19 -06:00
Nathan Evans	634e3ed62a	Transient entity graph (#1349 ) * Make base_entity_graph transient * Add transient snapshots * Semver * Fix unit test * Fix smoke tests	2024-11-04 17:23:29 -08:00
gaudyb	17658c5df8	New workflow to generate embeddings in a single workflow (#1296 ) * New workflow to generate embeddings in a single workflow * New workflow to generate embeddings in a single workflow * version change * clean tests without any embeddings references * clean tests without any embeddings references * remove code * feedback implemented * changes in logic * feedback implemented * store in table bug fixed * smoke test for generate_text_embeddings workflow * smoke test fix * add generate_text_embeddings to the list of transient workflows * smoke tests * fix * ruff formatting updates * fix * smoke test fixed * smoke test fixed * fix lancedb import * smoke test fix * ignore sorting * smoke test fixed * smoke test fixed * check smoke test * smoke test fixed * change config for vector store * format fix * vector store changes * revert debug profile back to empty filepath * merge conflict solved * merge conflict solved * format fixed * format fixed * fix return dataframe * snapshot fix * format fix * embeddings param implemented * validation fixes * fix map * fix map * fix properties * config updates * smoke test fixed * settings change * Update collection config and rework back-compat * Repalce . with - for embedding store --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com> Co-authored-by: Josh Bradley <joshbradley@microsoft.com> Co-authored-by: Nathan Evans <github@talkswithnumbers.com>	2024-11-01 15:01:35 -07:00
Alonso Guevara	7235c6faf5	Add Incremental Indexing v1 (#1318 ) * Create entypoint for cli and api (#1067) * Add cli and api entrypoints for update index * Semver * Update docs * Run tests on feature branch main * Better /main handling in tests * Incremental indexing/file delta (#1123) * Calculate new inputs and deleted inputs on update * Semver * Clear ruff checks * Fix pyright * Fix PyRight * Ruff again * Update relationships after inc index (#1236) * Collapse create final community reports (#1227) * Remove extraneous param * Add community report mocking assertions * Collapse primary report generation * Collapse embeddings * Format * Semver * Remove extraneous check * Move option set * Collapse create base entity graph (#1233) * Collapse create_base_entity_graph * Format/typing * Semver * Fix smoke tests * Simplify assignment * Collapse create summarized entities (#1237) * Collapse entity summarize * Semver * Collapse create base extracted entities (#1235) * Set up base assertions * Replace entity_extract * Finish collapsing workflow * Semver * Update snoke tests * Incremental indexing/update final text units (#1241) * Update final text units * Format * Address comments * Add v1 community merge using time period (#1257) * Add naive community merge using time period * formatting * Query fixes * Add descriptions from merged_entities * Add summarization and embeddings * Use iso format * Ruff * Pyright and smoke tests * Pyright * Pyright * Update parquet for verb tests * Fix smoke tests * Remove sorting * Update smoke tests * Smoke tests * Smoke tests * Updated verb test to ack for latest changes on covariates * Add config for incremental index + Bug fixes (#1317) * Add config for incremental index + Bug fixes * Ruff * Fix smoke tests * Semversioner * Small refactor * Remove unused file * Ruff * Update verb tests inputs * Update verb tests inputs --------- Co-authored-by: Nathan Evans <github@talkswithnumbers.com>	2024-10-30 11:59:44 -06:00
Josh Bradley	0cc79b9cf7	Add backwards compatibility patch for vector store (#1334 )	2024-10-29 14:54:08 -04:00
Josh Bradley	083de12bcf	Auto-generate CLI doc pages (#1325 )	2024-10-25 19:00:24 -04:00
Josh Bradley	d6e6f5c077	Convert CLI to Typer app (#1305 )	2024-10-24 14:22:32 -04:00
Nathan Evans	94f1e62e5c	Rework workflow architecture (#1311 ) * Rename pipeline_storage file * Add runtime storage option to context * Fix import * Switch to memory storage for runtime * Infra for workflow runtime storage * Migrate base_text_units to runtime storage * Fix comment * Semver * Remove whitespace * Remove subflow smoke tests and ignore transient artifacts * Remove entity graph from transient list (not yet implemented) * Increase smoke runtime allotment for create_base_entity_graph * Revert format fix * Remove noqa	2024-10-24 10:20:03 -07:00
Alonso Guevara	8a6d4e66fe	DRIFT Search (#1285 ) * drift search * args for drift global query in local search * accept drift context in search base * optionally parse embeddings from df when creating CommunityReport * abstract class for drift context * pathing for drift config * drift config * add defs for drift config * formatting * capture generated tokens in token count * semversion * Formatting and ruff * Some algorithmic refactors * Ruff * Format * Use asdict() * Address comments * Update smoke tests * Update smoke tests * Update smoke tests part 2 --------- Co-authored-by: Julian Whiting <j2whitin@gmail.com>	2024-10-21 17:22:11 -06:00
KennyZhang1	e0840a2dc4	Fix vector store logic and refactor audience parameter (#1259 )	2024-10-21 16:56:56 -04:00
Matthieu Maitre	6aae386b30	Perf optimizations in map_query_to_entities() (#1276 ) * Address perf issue in map_query_to_entities() * Add semver --------- Co-authored-by: Matthieu Maitre <mmaitre@microsoft.com> Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-10-21 12:03:48 -06:00
Nathan Evans	1f70d42572	Empty workflow returns (#1291 ) * Skip emitting empty dataframes * Semver * Better empty df check	2024-10-17 09:25:36 -07:00
Nathan Evans	ce5b1207e0	Collapse graph documents workflows (#1284 ) * Copy base documents logic into final documents * Delete create_base_documents * Combine graph creation under create_base_entity_graph * Delete collapsed workflows * Migrate most graph internals to nx.Graph * Fix None edge case * Semver * Remove comment typo * Fix smoke tests	2024-10-15 13:58:58 -06:00
Andres Morales	fc9895f793	Replace current docs by mkdocs (#1263 ) * Replace docs by mkdocs-material * Fix markdown * Fix verions in gh-pages workflow * remove whitespaces * add semver * Add build docs check on python-ci * Fix command in index cli * Spellcheck * Spellcheck * remove docsite paths * clear outputs from notebook * remove dependabot npm for docsite * remove more docsite left overs * execute notebooks * Update notebooks * update poetry lock * Remove notebook build from ci * Revert dep update * Navigation tabs * Fix stylesheet * add kwds to dictionary * Turn on notebook execution * Update gitignore * Add MSR Blog posts * spellcheck * Accessibility Changes --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-10-11 13:39:03 -06:00
Nathan Evans	61b3d6d56a	Migrate helper verbs (#1248 ) * Remove genid * Move snapshot_rows * Move snapshot * Delete spread_json * Delete unzip * Delete zip * Move unpack_graph * Move compute_edge_combined_degree * Delete create_graph * Delete concat * Delete text replace * Delete text_translate * Move text_split * Inline aggregate override * Move cluster_graph * Move merge_graphs * Semver * Move text_chunk * Move layout_graph and fix some __init__s * Move extract_covariates * Rename text_split -> split_text * Move extract_entities * Move summarize_descriptions * Rename text_chunk -> chunk_text * Move community report creation * Remove verb-level packing operators * Streamline some naming * Streamline param name/order * Move mock LLM data to tests * Fixed missed rename * Update some strategy refs * Rename run_gi * Inject mock responses into integ test config	2024-10-09 13:46:44 -07:00
Nathan Evans	f5c5876dde	Reorganize flows (#1240 ) * Extract base docs and entity graph * Move extracted entities and text units * Move communities and community reports * Move covariates and final documents * Move entities, nodes, relationships * Move text_units and summarized entities * Assert all snapshot null cases * Remove disabled steps util * Remove incorrect use of input "others" * Convert text_embed_df to just return the embeddings, not update the df * Convert snapshot functions to noops * Semver * Remove lingering covariates_enabled param * Name consistency * Syntax cleanup	2024-10-02 08:57:08 -07:00
Nathan Evans	9070ea5c3c	Collapse create base extracted entities (#1235 ) * Set up base assertions * Replace entity_extract * Finish collapsing workflow * Semver * Update snoke tests	2024-09-30 17:32:56 -07:00
Nathan Evans	630679f8e3	Collapse create summarized entities (#1237 ) * Collapse entity summarize * Semver	2024-09-30 17:17:44 -07:00
Nathan Evans	5220bb7ecc	Collapse create base entity graph (#1233 ) * Collapse create_base_entity_graph * Format/typing * Semver * Fix smoke tests * Simplify assignment	2024-09-30 15:39:42 -07:00
Nathan Evans	00d5e77568	Collapse create final community reports (#1227 ) * Remove extraneous param * Add community report mocking assertions * Collapse primary report generation * Collapse embeddings * Format * Semver * Remove extraneous check * Move option set	2024-09-30 10:46:07 -07:00
Nathan Evans	ce71bcf7fb	Collapse create final entities (#1220 ) * Collapse create_final_entities * Update smoke tests * Semver * Remove prints * Update embedding assertions	2024-09-25 17:35:44 -07:00
Nathan Evans	3217013019	Revisit create final text units (#1216 ) * Add embeddings to collapsed subflow * Semver * Fix smoke tests	2024-09-25 16:55:27 -07:00
Nathan Evans	73e709b686	Collapse create final covariates (#1215 ) * Add covariate test * Add detailed mock assertions * Collapse create_final_covariates * Delete unused doc_id field * Semver * Update smoke test * Remove unused subject/object type columns	2024-09-25 16:30:22 -07:00
Nathan Evans	14750f4d37	Collapse create final documents (#1217 ) * Collapse create_final_documents * Semver	2024-09-25 15:50:46 -07:00
Nathan Evans	f518c8b80b	Collapse relationship embeddings (#1199 ) * Merge text_embed into a single relationships subflow * Update smoke tests * Semver * Spelling	2024-09-24 15:03:26 -07:00
Nathan Evans	1755afbdec	Collapse create base text units (#1178 ) * Collapse non-attribute verbs * Include document_column_attributes in collapse * Remove merge_override verb * Semver * Setup initial test and config * Collapse create_base_text_units * Semver * Spelling * Fix smoke tests * Addres PR comments --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-09-23 16:55:53 -07:00
Nathan Evans	fbc483e4e5	Collapse create base documents (#1176 ) * Collapse non-attribute verbs * Include document_column_attributes in collapse * Remove merge_override verb * Semver * Clean up some df/tests	2024-09-23 13:24:06 -07:00
Nathan Evans	f8ab1b30dc	Collapse create_final_nodes (#1171 ) * Collapse create_final_nodes * Update smoke tests * Typo --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-09-20 13:48:56 -07:00
Nathan Evans	ae094bb144	Collapse create final relationships (#1158 ) * Collapse pre/post embedding workflows * Semver * Fix smoke tests --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-09-19 17:38:01 -06:00
Derek Worthen	3b09df6e07	Migrate towards using static output directories (#1113 ) * Migrate towards using static output directories - Fixes load_config eagering resolving directories. Directories are only resolved when the output directories are local. - Add support for `--output` and `--reporting` flags for index CLI. To achieve previous output structure `index --output run1/artifacts --reports run1/reports`. - Use static output directories when initializing a new project. - Maintains backward compatibility for those using timestamp outputs locally. * fix smoke tests * update query cli to work with static directories * remove eager path resolution from load_config. Support CLI overrides that can be resolved. * add docs and output logs/artifacts to same directory * use match statement * switch back to if statement --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-09-18 17:36:50 -06:00
Nathan Evans	aa5b426f1d	Collapse final communities workflow (#1150 ) * Collapse create_final_communities * Semver * Spellcheck * Clean up filtering * Add space in title * Format * Cleanup imports and format * Spruce up the tests * Update dictionary.txt * Spellcheck --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-09-17 17:04:42 -07:00
Nathan Evans	a473265580	Collapse verbs: create_final_text_units (#1143 ) * Load default config in verb tests * Load proper workflow config * Collapse text unit pre-embedding steps * Format * Update smoke tests * Semver * Format * Merge join* subflows into create_final_text_units * Remove join_text_units_to_covariate_ids * Format * Remove join_text_units_to_entity_ids * Remove join_text_units_to_relationship_ids * Clean up merges and aggregations * Remove unnecessary cast	2024-09-17 10:32:25 -07:00
Nathan Evans	d22c0e7836	Covariate collapse (#1142 ) * Setup basic verb test runner * Replace join_text_units_to_entity_ids with subflow * Update comments * Replace join_text_units_to_relationship_ids subflow * Roll in final select * Reuse assertion util * Small fix + format * Format/typing * Semver * Format/typing * Semver * Revert format changes * Fix smoke test subworkflow count * Edit subworkflows for another smoke test * Update test parquets for covariates * Collapse covariate join * Rework subtasks for per-flow customization * Format * Semver * Fix smoke test	2024-09-16 12:35:45 -07:00
Nathan Evans	2de302ff0d	Verb merge nre1 (#1140 ) * Setup basic verb test runner * Replace join_text_units_to_entity_ids with subflow * Update comments * Replace join_text_units_to_relationship_ids subflow * Roll in final select * Reuse assertion util * Small fix + format * Format/typing * Semver * Format/typing * Semver * Revert format changes * Fix smoke test subworkflow count * Edit subworkflows for another smoke test	2024-09-16 12:10:29 -07:00
Derek Worthen	2d45ece9b6	fix setting base_dir to full paths when not using file system. (#1096 ) * fix setting base_dir to full paths when not using file system. * add general resolve_path	2024-09-04 11:33:44 -07:00
Derek Worthen	ab29cc2a7e	Consistent config load_config (#1065 ) * Consistent config load_config - Provide a consistent way to load configuration - Resolve potential timestamp directories upfront upon config object creation - Add unit tests for resolving timestamp directories - Resolves #599 - Resolves #1049 * fix formatting issues * remove unnecessary path resolution * fix smoke tests * update prompts to use load_config * Update none checks * Update none checks * Update searching for config method signature * Update unit tests * fix formatting issues	2024-09-03 16:33:16 -06:00
Alonso Guevara	cb0aae7e6b	Add graphrag_import_neo4j_cypher Notebook (#593 ) * Added graphrag_import_neo4j_cypher Notebook * changed to procedure for setting embedding property to save disk space * Reformat and cleanup * semver * Poetry lock update * Update AAIS docs * Rename contrib folder * Merge from main * Revert "Merge from main" This reverts commit a399dde97b689a5b5c62dc2e9c2290cb2503b3a4. * Fix ruff check * Add readme and fix tests * Fix community reports --------- Co-authored-by: Michael Hunger <github@jexp.de>	2024-08-23 15:18:35 -06:00
Nathan Evans	f5b4d2fea5	Ci streamline (#988 ) * Remove excess vars from gh-pages build * Delete redundant javascript ci * Pull apart testing CI * Clean up integration tests build * Move storage tests to integration CI * Take py 3.10 out of smoke tests matrix * Use minimum supported python version for most tests * Re-run main CI on any test change * Add Josh and Kenny to author list * Update auto-resolve perms	2024-08-21 15:16:15 -06:00
Nathan Evans	98cabba38b	Notebook tests (#978 ) * Fix notebook test runs * Delete old issue template * Add notebook CI action * Print temp directories * Print more env * Move printing up * Use runner_temp * Try using current directory * Try TMP env * Re-write TMP * Wrong yml * Fix echo * Only export if windows * More logging * Move export * Reformat env write * Fix braces * Switch to in-memory execution * Downgrade action perms * Unused import	2024-08-20 17:19:37 -06:00
Alonso Guevara	0b7c5a6ae9	Add cast check on schema validation for community reports (#932 ) * Add support for both float and int on schema validation for community report generation * Cast instead of type check * Add mising file * Add prompt with ints to smoke tests * Fix unit tests * Fix unit tests	2024-08-14 16:40:47 -06:00
Nathan Evans	ac504e31a0	Add stricter filtering and tests for cli data directory discovery (#910 ) * Add stricter filtering and tests for cli data directory discovery * Semver * Ignore ruff on error type * Format * Fix for windows paths * Fix for windows paths * Uncomment blob tests * Sort by timestamp name instead of modified date * Format * Add additional folder name test	2024-08-13 17:34:14 -06:00
Andres Morales	5a7dbaa051	Fix sort_context max_tokens & max_tokens param in verb (#888 ) * Fix sort_context max_tokens & max_tokens param in verb * Fix sort_context for windows test * add semversioner file --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-08-12 15:55:31 -06:00
Alonso Guevara	7fd23fa79c	Stabilize smoke tests for query community context building (#908 ) * Stabilize smoke tests for query community context building * Fix CODEOWNERS	2024-08-12 13:17:40 -06:00
Alonso Guevara	c451aa0093	Update smoke tests (#861 ) * Run smoke tests on 4o * Shorten dulce for smoke tests * Update secrets for consistency	2024-08-08 13:07:44 -06:00
Dayenne Souza	1e10bd342e	Re-enable smoke tests (#848 ) * add smoke tests again * add smoke tests separated action * add patch version * disable blob test * blob conn again * add file as cache type * remove cache type enterely * increase timeout * remove comment --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-08-07 12:23:46 -06:00
Chris Trevino	56db78ae38	system -> assistant (#773 ) * system -> assistant * semver	2024-07-29 14:56:55 -07:00
Chris Trevino	9d99f323ea	Add encoding model to entity/claim extraction config sections (#740 ) * Add encoding-model configuration to entity & claim extraction * add change note * pr updates * test fix * disable GH-based smoke tests	2024-07-26 15:05:08 -07:00
Chris Trevino	4c229afec8	add encoding model to text-chunking config (#743 ) * add encoding model to text-chunking config * revert groupby fix, handled in other pr * revert environment reader update for other pr	2024-07-26 14:15:17 -07:00
Chris Trevino	f5c9c2bee0	Add History input to cache-key, cache data (#736 ) * Update caching llm to use history inputs * formatting * linting * update glean sections to have continuous history	2024-07-26 09:26:37 -07:00

107 Commits