* switched hashing function helper to sha256
* refactored references to hashing util
* semversioner
* switched from sha256 to sha512
* new semversioner
* updated tests/verbs/data folder
* generated fresh parquet files in data folder
* moved ignore flag
* First cut at config cleanup
* Reorder top nav
* Add query prompts to tuning page
* Remove dynamic notebook from nav
* Add more thorough yml config descriptions in docs
* Further clean out the config
* Semver
* Add new blog post
* Emphasize yaml
* Clarify output
* Fix unit test
* Fix bullet nesting
* Move indexing prompts to root
* Move query prompts to root
* Export query prompts during init
* Extract general knowledge prompt
* Load query prompts from disk
* Semver
* Fix unit tests
* Add source documents for verb tests
* Remove entity_type erroneous column
* Add new test data
* Remove source/target degree columns
* Remove top_level_node_id
* Remove chunk column configs
* Rename "chunk" to "text"
* Rename "chunk" to "text" in base
* Re-map document input to use base text units
* Revert base text units as final documents dep
* Update test data
* Split/rename node source_id
* Drop node size (dup of degree)
* Drop document_ids from covariates
* Remove unused document_ids from models
* Remove n_tokens from covariate table
* Fix missed document_ids delete
* Wire base text units to final documents
* Rename relationship rank as combined_degree
* Add rank as first-class property to Relationship
* Remove split_text operation
* Fix relationships test parquet
* Update test parquets
* Add entity ids to community table
* Remove stored graph embedding columns
* Format
* Semver
* Fix JSON typo
* Spelling
* Rename lancedb
* Sort lancedb
* Fix unit test
* Fix test to account for changing period
* Update tests for separate embeddings
* Format
* Better assertion printing
* Fix unit test for windows
* Rename document.raw_content -> document.text
* Remove read_documents function
* Remove unused document summary from model
* Remove unused imports
* Format
* Add new snapshots to default init
* Use util to construct embeddings collection name
* Align inc index model with branch changes
* Update data and tests for int ids
* Clean up embedding locs
* Switch entity "name" to "title" for consistency
* Fix short_id -> human_readable_id defaults
* Format
* Rework community IDs
* Fix community size compute
* Fix unit tests
* Fix report read
* Pare down nodes table output
* Fix unit test
* Fix merge
* Fix community loading
* Format
* Fix community id report extraction
* Update tests
* Consistent short IDs and ordering
* Update ordering and tests
* Update incremental for new nodes model
* Guard document columns loc
* Match column ordering
* Fix document guard
* Update smoke tests
* Fill NA on community extract
* Logging for smoke test debug
* Add parquet schema details doc
* Fix community hierarchy guard
* Use better empty hierarchy guard
* Back-compat shims
* Semver
* Fix warning
* Format
* Remove default fallback
* Reuse key
* New workflow to generate embeddings in a single workflow
* version change
* clean tests without any embeddings references
* remove code
* feedback implemented
* changes in logic
* feedback implemented
* store in table bug fixed
* smoke test for generate_text_embeddings workflow
* smoke test fix
* add generate_text_embeddings to the list of transient workflows
* smoke tests
* fix
* ruff formatting updates
* fix
* smoke test fixed
* smoke test fixed
* fix lancedb import
* smoke test fix
* ignore sorting
* smoke test fixed
* smoke test fixed
* check smoke test
* smoke test fixed
* change config for vector store
* format fix
* vector store changes
* revert debug profile back to empty filepath
* merge conflict solved
* merge conflict solved
* format fixed
* format fixed
* fix return dataframe
* snapshot fix
* format fix
* embeddings param implemented
* validation fixes
* fix map
* fix map
* fix properties
* config updates
* smoke test fixed
* settings change
* Update collection config and rework back-compat
* Replace . with - for embedding store
---------
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
Co-authored-by: Josh Bradley <joshbradley@microsoft.com>
Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
* Extract base docs and entity graph
* Move extracted entities and text units
* Move communities and community reports
* Move covariates and final documents
* Move entities, nodes, relationships
* Move text_units and summarized entities
* Assert all snapshot null cases
* Remove disabled steps util
* Remove incorrect use of input "others"
* Convert text_embed_df to just return the embeddings, not update the df
* Convert snapshot functions to noops
* Semver
* Remove lingering covariates_enabled param
* Name consistency
* Syntax cleanup
* Migrate towards using static output directories
- Fixes load_config eagerly resolving directories.
Directories are only resolved when the output
directories are local.
- Add support for `--output` and `--reporting` flags
for index CLI. To achieve previous output structure
`index --output run1/artifacts --reports run1/reports`.
- Use static output directories when initializing
a new project.
- Maintains backward compatibility for those using
timestamp outputs locally.
* fix smoke tests
* update query cli to work with static directories
* remove eager path resolution from load_config. Support CLI overrides that can be resolved.
* add docs and output logs/artifacts to same directory
* use match statement
* switch back to if statement
---------
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
* Collapse create_final_communities
* Semver
* Spellcheck
* Clean up filtering
* Add space in title
* Format
* Cleanup imports and format
* Spruce up the tests
* Update dictionary.txt
* Spellcheck
---------
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
* Setup basic verb test runner
* Replace join_text_units_to_entity_ids with subflow
* Update comments
* Replace join_text_units_to_relationship_ids subflow
* Roll in final select
* Reuse assertion util
* Small fix + format
* Format/typing
* Semver
* Format/typing
* Semver
* Revert format changes
* Fix smoke test subworkflow count
* Edit subworkflows for another smoke test
* Update test parquets for covariates
* Collapse covariate join
* Rework subtasks for per-flow customization
* Format
* Semver
* Fix smoke test
* Added graphrag_import_neo4j_cypher Notebook
* changed to procedure for setting embedding property to save disk space
* Reformat and cleanup
* semver
* Poetry lock update
* Update AAIS docs
* Rename contrib folder
* Merge from main
* Revert "Merge from main"
This reverts commit a399dde97b689a5b5c62dc2e9c2290cb2503b3a4.
* Fix ruff check
* Add readme and fix tests
* Fix community reports
---------
Co-authored-by: Michael Hunger <github@jexp.de>