graphrag

mirror of https://github.com/microsoft/graphrag.git synced 2025-07-19 06:53:19 +00:00

Author	SHA1	Message	Date
Nathan Evans	66c2cfb3ce	Support JSON input files (#1777 ) * Add csv loader tests * Add test loader tests * Add json input support * Remove temp path constraint * Reuse loader cose * Semver * Set file pattern automatically based on type, if empty * Remove pattern from smoke test config * Spelling --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2025-03-10 14:04:07 -07:00
Nathan Evans	ede6a74546	Pipeline callbacks (#1729 ) * Add pipeline_start and pipeline_end callbacks * Collapse redundant callback/logger logic * Remove redundant reporting config classes * Remove a few out-of-date type ignores * Semver --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2025-02-25 15:07:51 -08:00
Alonso Guevara	7bdeaee94a	Create Language Model Providers and Registry methods. Remove fnllm coupling (#1724 ) * Base structure * Add fnllm providers and Mock LLM * Remove fnllm coupling, introduce llm providers * Ruff + Tests fix * Spellcheck * Semver * Format * Default MockChat params * Fix more tests * Fix embedding smoke test * Fix embeddings smoke test * Fix MockEmbeddingLLM * Rename LLM to model. Package organization * Fix prompt tuning * Oops * Oops II	2025-02-20 08:56:20 -06:00
Josh Bradley	f14cda2b6d	Improve default llm retry logic to be more optimized (#1701 )	2025-02-13 16:56:37 -05:00
Dayenne Souza	b94290ec2b	add option to add metadata into text chunks (#1681 ) * add new options * add metadata json into input document * remove doc change * add metadata column into text loader * prepend_metadata * run fix * fix tests and patch * fix test * add watrning for metadata tokens > config size * fix typo and run fix * fix test_integration * fix test * run check * rename and fix chunking * fix * fix * fiz test verbs * fix * fix tests * fix chunking * fix index * fix cosmos test * fix vars * fix after PR * fix	2025-02-12 09:38:03 -08:00
Nathan Evans	c02ab0984a	Streamline workflows (#1674 ) * Remove create_final_nodes * Rename final entity output to "entities" * Remove duplicate code from graph extraction * Rename create_final_relationships output to "relationships" * Rename create_final_communities output to "communities" * Combine compute_communities and create_final_communities * Rename create_final_covariates output to "covariates" * Rename create_final_community_reports output to "community_reports" * Rename create_final_text_units output to "text_units" * Rename create_final_documents output to "documents" * Remove transient snapshots config * Move create_final_entities to finalize_entities operation * Move create_final_relationships flow to finalize_relationships operation * Reuse some community report functions * Collapse most of graph and text unit-based report generation * Unify schemas files * Move community reports extractor * Move NLP report prompt to prompts folder * Fix a few pandas warnings * Rename embeddings config to embed_text * Rename claim_extraction config to extract_claims * Remove nltk from standard graph extraction * Fix verb tests * Fix extract graph config naming * Fix moved file reference * Create v1-to-v2 migration notebook * Semver * Fix smoke test artifact count * Raise tpm/rpm on smoke tests * Update drift settings for smoke tests * Reuse project directory var in api notebook * Format * Format	2025-02-07 11:11:03 -08:00
Dayenne Souza	ad5b5120ec	remove unused columns and rename document_attribute_columns (#1672 ) * remove unused columns and change property document_attribute_columns to metadata * format file * fix 'metadata' column on output * run check * fix test on nltk * remove docs changes	2025-02-03 14:37:06 -03:00
Dayenne Souza	2f2cfa7b70	Test and unify text splitter functionality (#1547 ) * add text_splitting unit test * change folder test text splitting * fix chunk fn * test new function * run formatter * run spell check * run semver * remove tiktoken mocked from tests * change progress ticker * fix ruff check	2025-01-13 18:42:44 -03:00
Nathan Evans	a35cb12741	Remove datashaper strip code (#1581 ) Remove datashaper	2025-01-03 13:59:26 -08:00
Nathan Evans	a2647da473	Simplify flow config (#1554 ) * Flatten compute_communities config * Remove cluster strategy type * Flatten create_base_text_units config * Move cluster seed to config default, leave as None in functions * Remove "prechunked" logic * Remove hard-coded encoding model * Remove unused variables * Strongly type embed_config * Simplify layout_graph config * Semver * Fix integration test * Fix config unit tests: ignore new config defaults * Remove pipeline integ test	2024-12-27 16:38:36 -08:00
KennyZhang1	8368b12532	Add Cosmos DB storage/cache option (#1431 ) * added cosmosdb constructor and database methods * added rest of abstract method headers * added cosmos db container methods * implemented has and delete methods * finished implementing abstract class methods * integrated class into storage factory * integrated cosmosdb class into cache factory * added support for new config file fields * replaced primary key cosmosdb initialization with connection strings * modified cosmosdb setter to require json * Fix non-default emitters * Format * Ruff * ruff * first successful run of cosmosdb indexing * removed extraneous container_name setting * require base_dir to be typed as str * reverted merged changed from closed branch * removed nested try statement * readded initial non-parquet emitter fix * added basic support for parquet emitter using internal conversions * merged with main and resolved conflicts * fixed more merge conflicts * added cosmosdb functionality to query pipeline * tested query for cosmosdb * collapsed cosmosdb schema to use minimal containers and databases * simplified create_database and create_container functions * ruff fixes and semversioner * spellcheck and ci fixes * updated pyproject toml and lock file * apply fixes after merge from main * add temporary comments * refactor cache factory * refactored storage factory * minor formatting * update dictionary * fix spellcheck typo * fix default value * fix pydantic model defaults * update pydantic models * fix init_content * cleanup how factory passes parameters to file storage * remove unnecessary output file type * update pydantic model * cleanup code * implemented clear method * fix merge from main * add test stub for cosmosdb * regenerate lock file * modified set method to collapse parquet rows * modified get method to collapse parquet rows * updated has and delete methods and docstrings to adhere to new schema * added prefix helper function * replaced delimiter for prefixed id * verified empty tests are passing * fix merges from main * add find test * update cicd step name * tested querying for new schema * resolved errors from merge conflicts * refactored set method to handle cache in new schema * refactored get method to handle cache in new schema * force unique ids to be written to cosmos for nodes * found bug with has and delete methods * modified has and delete to work with cache in new schema * fix the merge from main * minor typo fixes * update lock file * spellcheck fix * fix init function signature * minor formatting updates * remove https protocol * change localhost to 127.0.0.1 address * update pytest to use bacj engine * verified cache tests * improved speed of has function * resolved pytest error with find function * added test for child method * make container_name variable private as _container_name * minor variable name fix * cleanup cosmos pytest and make the cosmosdb storage class operations more efficient * update cicd to use different cosmosdb emulator * test with http protocol * added pytest for clear() * add longer timeout for cosmosdb emulator startup * revert http connection back to https * add comments to cicd code for future dev usage * set to container and database clients to none upon deletion * ruff changes * add comments to cicd code * removed unneeded None statements and ruff fixes * more ruff fixes * Update test_run.py * remove unnecessary call to delete container * ruff format updates * Reverted test_run.py * fix ruff formatter errors * cleanup variable names to be more consistent * remove extra semversioner file * revert pydantic model changes * revert pydantic model change * revert pydantic model change * re-enable inline formatting rule * update documentation in dev guide --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com> Co-authored-by: Josh Bradley <joshbradley@microsoft.com>	2024-12-19 13:43:21 -06:00
Nathan Evans	c1c09bab80	Flow cleanup (#1510 ) * Move snapshots out of flows into verbs * Move degree compute out of extract_graph * Move entity/relationship df merging into extract * Move "title" to extraction source * Move text_unit_ids agg closer to extraction * Move data definition * Update test data * Semver * Update smoke tests * Fix empty degree field and update smoke tests and verb data * Move extractors (#1516) * Consolidate graph embedding and umap * Consolidate claim extraction * Consolidate graph extractor * Move graph utils * Move summarizers * Semver --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com> * Fix syntax typo --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-12-18 18:07:44 -08:00
Nathan Evans	d0543d1fd6	Move extractors (#1516 ) * Consolidate graph embedding and umap * Consolidate claim extraction * Consolidate graph extractor * Move graph utils * Move summarizers * Semver --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-12-18 16:21:41 -08:00
Alonso Guevara	1c3b0f34c3	Chore/lib updates (#1477 ) * Update dependencies and fix issues * Format * Semver * Fix Pyright * Pyright * More Pyright * Pyright	2024-12-06 14:08:24 -06:00
Chris Trevino	5ff2d3c76d	Remove graphrag.llm, replace with fnllm (#1315 ) * add fnllm; remove llm folder * remove llm unit tests * update imports * update imports * formatting * enable autosave * update mockllm * update community reports extractor * move most llm usage to fnllm * update type issues * fix unit tests * type updates * update dictionary * semver * update llm construction, get integration tests working * load from llmparameters model * move ruff settings to ruff.toml * add gitattributes file * ignore ruff.toml spelling * update .gitattributes * update gitignore * update config construction * update prompt var usage * add cache adapter * use cache adapter in embeddings calls * update embedding strategy * add fnllm * add pytest-dotenv * fix some verb tests * get verbtests running * update ruff.toml for vscode * enable ruff native server in vscode * update artifact inspecting code * remove local-test update * use string.replace instead of string.format in community reprots etxractor * bump timeout * revert ruff.toml, vscode settings for another pr * revert cspell config * revert gitignore * remove json-repair, update fnllm * use fnllm generic type interfaces * update load_llm to use target models * consolidate chat parameters * add 'extra_attributes' prop to community report response * formatting * update fnllm * formatting * formatting * Add defaults to some llm params to avoid null on params hash * Formatting --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com> Co-authored-by: Josh Bradley <joshbradley@microsoft.com>	2024-12-05 18:07:47 -06:00
Josh Bradley	dad2176b3c	Miscellaneous code cleanup procedures (#1452 )	2024-11-27 13:27:43 -05:00
Josh Bradley	22a57d14c7	Improve CLI speed with lazy imports (#1319 )	2024-11-15 19:41:10 -05:00
Nathan Evans	9b4f24ebce	First cut at config cleanup (#1411 ) * Firsst cut at config cleanup * Reorder top nav * Add query prompts to tuning page * Remove dynamic notebook from nav * Add more thorough yml config descriptions in docs * Further clean out the config * Semver * Add new blog post * Emphasize yaml * Clarify output * Fix unit test * Fix bullet nesting	2024-11-15 14:33:26 -08:00
Nathan Evans	c8c354e357	Artifact cleanup (#1341 ) * Add source documents for verb tests * Remove entity_type erroneous column * Add new test data * Remove source/target degree columns * Remove top_level_node_id * Remove chunk column configs * Rename "chunk" to "text" * Rename "chunk" to "text" in base * Re-map document input to use base text units * Revert base text units as final documents dep * Update test data * Split/rename node source_id * Drop node size (dup of degree) * Drop document_ids from covariates * Remove unused document_ids from models * Remove n_tokens from covariate table * Fix missed document_ids delete * Wire base text units to final documents * Rename relationship rank as combined_degree * Add rank as first-class property to Relationship * Remove split_text operation * Fix relationships test parquet * Update test parquets * Add entity ids to community table * Remove stored graph embedding columns * Format * Semver * Fix JSON typo * Spelling * Rename lancedb * Sort lancedb * Fix unit test * Fix test to account for changing period * Update tests for separate embeddings * Format * Better assertion printing * Fix unit test for windows * Rename document.raw_content -> document.text * Remove read_documents function * Remove unused document summary from model * Remove unused imports * Format * Add new snapshots to default init * Use util to construct embeddings collection name * Align inc index model with branch changes * Update data and tests for int ids * Clean up embedding locs * Switch entity "name" to "title" for consistency * Fix short_id -> human_readable_id defaults * Format * Rework community IDs * Fix community size compute * Fix unit tests * Fix report read * Pare down nodes table output * Fix unit test * Fix merge * Fix community loading * Format * Fix community id report extraction * Update tests * Consistent short IDs and ordering * Update ordering and tests * Update incremental for new nodes model * Guard document columns loc * Match column ordering * Fix document guard * Update smoke tests * Fill NA on community extract * Logging for smoke test debug * Add parquet schema details doc * Fix community hierarchy guard * Use better empty hierarchy guard * Back-compat shims * Semver * Fix warning * Format * Remove default fallback * Reuse key	2024-11-13 15:11:19 -08:00
Nathan Evans	1f70d42572	Empty workflow returns (#1291 ) * Skip emitting empty dataframes * Semver * Better empty df check	2024-10-17 09:25:36 -07:00
Nathan Evans	ce5b1207e0	Collapse graph documents workflows (#1284 ) * Copy base documents logic into final documents * Delete create_base_documents * Combine graph creation under create_base_entity_graph * Delete collapsed workflows * Migrate most graph internals to nx.Graph * Fix None edge case * Semver * Remove comment typo * Fix smoke tests	2024-10-15 13:58:58 -06:00
Nathan Evans	61b3d6d56a	Migrate helper verbs (#1248 ) * Remove genid * Move snapshot_rows * Move snapshot * Delete spread_json * Delete unzip * Delete zip * Move unpack_graph * Move compute_edge_combined_degree * Delete create_graph * Delete concat * Delete text replace * Delete text_translate * Move text_split * Inline aggregate override * Move cluster_graph * Move merge_graphs * Semver * Move text_chunk * Move layout_graph and fix some __init__s * Move extract_covariates * Rename text_split -> split_text * Move extract_entities * Move summarize_descriptions * Rename text_chunk -> chunk_text * Move community report creation * Remove verb-level packing operators * Streamline some naming * Streamline param name/order * Move mock LLM data to tests * Fixed missed rename * Update some strategy refs * Rename run_gi * Inject mock responses into integ test config	2024-10-09 13:46:44 -07:00
Nathan Evans	f5b4d2fea5	Ci streamline (#988 ) * Remove excess vars from gh-pages build * Delete redundant javascript ci * Pull apart testing CI * Clean up integration tests build * Move storage tests to integration CI * Take py 3.10 out of smoke tests matrix * Use minimum supported python version for most tests * Re-run main CI on any test change * Add Josh and Kenny to author list * Update auto-resolve perms	2024-08-21 15:16:15 -06:00
Alonso Guevara	0b7c5a6ae9	Add cast check on schema validation for community reports (#932 ) * Add support for both float and int on schema validation for community report generation * Cast instead of type check * Add mising file * Add prompt with ints to smoke tests * Fix unit tests * Fix unit tests	2024-08-14 16:40:47 -06:00
Andres Morales	5a7dbaa051	Fix sort_context max_tokens & max_tokens param in verb (#888 ) * Fix sort_context max_tokens & max_tokens param in verb * Fix sort_context for windows test * add semversioner file --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-08-12 15:55:31 -06:00
Alonso Guevara	81b81cf60b	Initial Release	2024-07-01 15:25:30 -06:00

26 Commits