* Renamed the variables inside the for-loop so they no longer collide with the original `title` variable used for the dataframe. This avoids overwriting the original column name stored in `title`.
* Semver and format
---------
Co-authored-by: Gabriel Nieves-Ponce <gnievesponce@microsoft.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
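A minimal sketch of the shadowing problem described in the first bullet above, assuming a `title` variable that holds a dataframe column name. The function names, the `rows` data, and the use of plain dicts instead of a dataframe are hypothetical and are not taken from the actual change.

```python
def collect_titles_buggy(rows, title="title"):
    """Reusing `title` inside the loop overwrites the column name."""
    seen = []
    for row in rows:
        title = row["title"]  # rebinds `title` to a row value
        seen.append(title)
    return seen, title  # `title` now holds the last row's value


def collect_titles_fixed(rows, title="title"):
    """A distinct loop variable leaves the column name intact."""
    seen = []
    for row in rows:
        row_title = row[title]  # `title` is still the column name
        seen.append(row_title)
    return seen, title


# Hypothetical sample data standing in for dataframe rows.
rows = [{"title": "Alpha"}, {"title": "Beta"}]
buggy_result = collect_titles_buggy(rows)
fixed_result = collect_titles_fixed(rows)
```

In the buggy variant the second return value is a row value rather than the column name, which is the kind of corruption the rename is meant to prevent.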
* New workflow to generate embeddings in a single workflow
* New workflow to generate embeddings in a single workflow
* version change
* clean tests without any embeddings references
* clean tests without any embeddings references
* remove code
* feedback implemented
* changes in logic
* feedback implemented
* store in table bug fixed
* smoke test for generate_text_embeddings workflow
* smoke test fix
* add generate_text_embeddings to the list of transient workflows
* smoke tests
* fix
* ruff formatting updates
* fix
* smoke test fixed
* smoke test fixed
* fix lancedb import
* smoke test fix
* ignore sorting
* smoke test fixed
* smoke test fixed
* check smoke test
* smoke test fixed
* change config for vector store
* format fix
* vector store changes
* revert debug profile back to empty filepath
* merge conflict solved
* merge conflict solved
* format fixed
* format fixed
* fix return dataframe
* snapshot fix
* format fix
* embeddings param implemented
* validation fixes
* fix map
* fix map
* fix properties
* config updates
* smoke test fixed
* settings change
* Update collection config and rework back-compat
* Replace `.` with `-` for embedding store
---------
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
Co-authored-by: Josh Bradley <joshbradley@microsoft.com>
Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
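The last bullet in the list above (replacing `.` with `-` for the embedding store) can be illustrated with a short sketch. The helper name `to_collection_name` and the sample embedding name are assumptions for illustration only; the log does not show how the project actually performs the substitution.

```python
def to_collection_name(embedding_name: str) -> str:
    """Turn a dotted embedding name into a dash-separated store name.

    Hypothetical helper: only the '.' -> '-' substitution comes from
    the commit message.
    """
    return embedding_name.replace(".", "-")


# Illustrative name, not necessarily one used by the project.
collection = to_collection_name("entity.description")
```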
* move mkdocs-typer to devdeps
* add .gitattributes for toml parsing issues on Windows CI
* bump timeout
---------
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
* refactor build text unit context for better performance
* Further optimization and styling
* Remove TODO
---------
Co-authored-by: Brad Firesheets <v-bradleyf@microsoft.com>
Co-authored-by: bfirems <162185685+bfirems@users.noreply.github.com>
Co-authored-by: Josh Bradley <joshbradley@microsoft.com>
Update documentation URLs for consistency
Revised links in documentation files to remove the "posts" subdirectory for consistency and correctness.
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
* Update community_context.py to check conversation_history_context's value
In the affected code (lines 90-96), `conversation_history_context` is concatenated with `community_context`, but the case where `conversation_history_context` is empty (`""`) is not handled. When it is empty, the concatenation should be skipped; otherwise `community_context`, or each element of `community_context`, ends up with an extra `"\n\n"`.
This change introduces a `context_prefix` that checks the state of `conversation_history_context` so the concatenation is handled appropriately. When `conversation_history_context` is empty (`""`), the code uses `""` for the concatenation. When it is not empty, the code behaves like the previous version.
* Format and semver
* Code cleanup
---------
Co-authored-by: ZeyuTeng96 <96521059+ZeyuTeng96@users.noreply.github.com>
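The `context_prefix` idea from the commit message above might look roughly like the following. This is a sketch under assumptions: the function names are hypothetical, and it assumes the history is placed before the community context with `"\n\n"` between them, which the message implies but does not show.

```python
from typing import List, Union


def build_context_prefix(conversation_history_context: str) -> str:
    """Return the history plus a separator, or "" when there is no history."""
    if conversation_history_context:
        return f"{conversation_history_context}\n\n"
    return ""


def prepend_history(
    conversation_history_context: str,
    community_context: Union[str, List[str]],
) -> Union[str, List[str]]:
    """Prefix the community context (a string or a list of strings)."""
    prefix = build_context_prefix(conversation_history_context)
    if isinstance(community_context, list):
        return [f"{prefix}{chunk}" for chunk in community_context]
    return f"{prefix}{community_context}"


no_history = prepend_history("", ["report A", "report B"])
with_history = prepend_history("user: hi", "report A")
```

With an empty history the chunks come back unchanged, so no stray `"\n\n"` is added.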
Updated the configuration documentation to reflect the default filenames for the configuration file.
Default config files are `["settings.yaml", "settings.yml", "settings.json"]`
ce71bcf7fb/graphrag/config/config_file_loader.py (L15)
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
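As an illustration of the default filenames listed above, a lookup could be sketched as follows. Only the three filenames come from the commit message; the function `find_config_file`, the first-match search order, and the temporary-directory demo are assumptions and may differ from what `config_file_loader.py` actually does.

```python
import tempfile
from pathlib import Path
from typing import Optional

# Default filenames as listed in the commit message.
DEFAULT_CONFIG_FILES = ["settings.yaml", "settings.yml", "settings.json"]


def find_config_file(root: Path) -> Optional[Path]:
    """Return the first default config file found under `root`, if any.

    Hypothetical helper: checks the names in list order.
    """
    for name in DEFAULT_CONFIG_FILES:
        candidate = root / name
        if candidate.is_file():
            return candidate
    return None


# Demo in a throwaway directory containing two of the three names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "settings.yml").write_text("{}")
    (root / "settings.json").write_text("{}")
    found = find_config_file(root)
    found_name = found.name if found else None
    missing = find_config_file(root / "does-not-exist")
```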
* Update description of GRAPHRAG_CACHE_BASE_DIR in env_vars.md
Clarified that `GRAPHRAG_CACHE_BASE_DIR` refers to the base directory path for cache files rather than reporting outputs. This improves the accuracy of the documentation and helps users understand the correct usage of this environment variable.
* Update description of `GRAPHRAG_CACHE_BASE_DIR`
Simplified the description of `GRAPHRAG_CACHE_BASE_DIR` to make it clearer. Changed "base directory path" to "base path" for conciseness.
---------
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
* Move text_embed to verb-less operation
* Move embed_graph to verb-less operation
* Return embeddings from embed_graph instead of modifying df
* Semver
* Use config existence instead of bool for graph embedding
* Send clustering strategy directly
* Extract base docs and entity graph
* Move extracted entities and text units
* Move communities and community reports
* Move covariates and final documents
* Move entities, nodes, relationships
* Move text_units and summarized entities
* Assert all snapshot null cases
* Remove disabled steps util
* Remove incorrect use of input "others"
* Convert text_embed_df to just return the embeddings, not update the df
* Convert snapshot functions to noops
* Semver
* Remove lingering covariates_enabled param
* Name consistency
* Syntax cleanup
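Two bullets in the list above describe returning embeddings instead of modifying the dataframe. The contrast can be sketched as below, using a plain dict of columns in place of a dataframe to stay dependency-free. All names here (`embed_in_place`, `embed_values`, `toy_embed`) are hypothetical and do not reflect the project's actual signatures.

```python
from typing import Callable, Dict, List

Table = Dict[str, list]
Embedder = Callable[[str], List[float]]


def embed_in_place(table: Table, column: str, embed: Embedder, to: str) -> Table:
    """Older style: writes a new column into the caller's table."""
    table[to] = [embed(text) for text in table[column]]
    return table


def embed_values(table: Table, column: str, embed: Embedder) -> List[List[float]]:
    """Newer style: returns the embeddings and leaves the table untouched."""
    return [embed(text) for text in table[column]]


def toy_embed(text: str) -> List[float]:
    """Stand-in embedder: a one-dimensional vector holding the text length."""
    return [float(len(text))]


table = {"text": ["ab", "cde"]}
embeddings = embed_values(table, "text", toy_embed)
# The caller decides where the result goes, without mutating `table`.
result = {**table, "embedding": embeddings}
```

Returning the values keeps the function free of side effects on its input, and the caller chooses the output column name.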