7 Commits

Author SHA1 Message Date
Nathan Evans
a35cb12741
Remove datashaper strip code (#1581)
Remove datashaper
2025-01-03 13:59:26 -08:00
Nathan Evans
d17dfd01f9
Graph collapse (#1464)
* Refactor graph creation

* Semver

* Spellcheck

* Update integ pipeline

* Fix cast

* Improve pandas chaining

* Cleaner apply

* Use list comprehensions

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-12-05 11:57:26 -06:00
Nathan Evans
c8c354e357
Artifact cleanup (#1341)
* Add source documents for verb tests

* Remove entity_type erroneous column

* Add new test data

* Remove source/target degree columns

* Remove top_level_node_id

* Remove chunk column configs

* Rename "chunk" to "text"

* Rename "chunk" to "text" in base

* Re-map document input to use base text units

* Revert base text units as final documents dep

* Update test data

* Split/rename node source_id

* Drop node size (dup of degree)

* Drop document_ids from covariates

* Remove unused document_ids from models

* Remove n_tokens from covariate table

* Fix missed document_ids delete

* Wire base text units to final documents

* Rename relationship rank as combined_degree

* Add rank as first-class property to Relationship

* Remove split_text operation

* Fix relationships test parquet

* Update test parquets

* Add entity ids to community table

* Remove stored graph embedding columns

* Format

* Semver

* Fix JSON typo

* Spelling

* Rename lancedb

* Sort lancedb

* Fix unit test

* Fix test to account for changing period

* Update tests for separate embeddings

* Format

* Better assertion printing

* Fix unit test for windows

* Rename document.raw_content -> document.text

* Remove read_documents function

* Remove unused document summary from model

* Remove unused imports

* Format

* Add new snapshots to default init

* Use util to construct embeddings collection name

* Align inc index model with branch changes

* Update data and tests for int ids

* Clean up embedding locs

* Switch entity "name" to "title" for consistency

* Fix short_id -> human_readable_id defaults

* Format

* Rework community IDs

* Fix community size compute

* Fix unit tests

* Fix report read

* Pare down nodes table output

* Fix unit test

* Fix merge

* Fix community loading

* Format

* Fix community id report extraction

* Update tests

* Consistent short IDs and ordering

* Update ordering and tests

* Update incremental for new nodes model

* Guard document columns loc

* Match column ordering

* Fix document guard

* Update smoke tests

* Fill NA on community extract

* Logging for smoke test debug

* Add parquet schema details doc

* Fix community hierarchy guard

* Use better empty hierarchy guard

* Back-compat shims

* Semver

* Fix warning

* Format

* Remove default fallback

* Reuse key
2024-11-13 15:11:19 -08:00
Alonso Guevara
7235c6faf5
Add Incremental Indexing v1 (#1318)
* Create entypoint for cli and api (#1067)

* Add cli and api entrypoints for update index

* Semver

* Update docs

* Run tests on feature branch main

* Better /main handling in tests

* Incremental indexing/file delta (#1123)

* Calculate new inputs and deleted inputs on update

* Semver

* Clear ruff checks

* Fix pyright

* Fix PyRight

* Ruff again

* Update relationships after inc index (#1236)

* Collapse create final community reports (#1227)

* Remove extraneous param

* Add community report mocking assertions

* Collapse primary report generation

* Collapse embeddings

* Format

* Semver

* Remove extraneous check

* Move option set

* Collapse create base entity graph (#1233)

* Collapse create_base_entity_graph

* Format/typing

* Semver

* Fix smoke tests

* Simplify assignment

* Collapse create summarized entities (#1237)

* Collapse entity summarize

* Semver

* Collapse create base extracted entities (#1235)

* Set up base assertions

* Replace entity_extract

* Finish collapsing workflow

* Semver

* Update snoke tests

* Incremental indexing/update final text units (#1241)

* Update final text units

* Format

* Address comments

* Add v1 community merge using time period (#1257)

* Add naive community merge using time period

* formatting

* Query fixes

* Add descriptions from merged_entities

* Add summarization and embeddings

* Use iso format

* Ruff

* Pyright and smoke tests

* Pyright

* Pyright

* Update parquet for verb tests

* Fix smoke tests

* Remove sorting

* Update smoke tests

* Smoke tests

* Smoke tests

* Updated verb test to ack for latest changes on covariates

* Add config for incremental index + Bug fixes (#1317)

* Add config for incremental index + Bug fixes

* Ruff

* Fix smoke tests

* Semversioner

* Small refactor

* Remove unused file

* Ruff

* Update verb tests inputs

* Update verb tests inputs

---------

Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
2024-10-30 11:59:44 -06:00
Nathan Evans
94f1e62e5c
Rework workflow architecture (#1311)
* Rename pipeline_storage file

* Add runtime storage option to context

* Fix import

* Switch to memory storage for runtime

* Infra for workflow runtime storage

* Migrate base_text_units to runtime storage

* Fix comment

* Semver

* Remove whitespace

* Remove subflow smoke tests and ignore transient artifacts

* Remove entity graph from transient list (not yet implemented)

* Increase smoke runtime allotment for create_base_entity_graph

* Revert format fix

* Remove noqa
2024-10-24 10:20:03 -07:00
Nathan Evans
61b3d6d56a
Migrate helper verbs (#1248)
* Remove genid

* Move snapshot_rows

* Move snapshot

* Delete spread_json

* Delete unzip

* Delete zip

* Move unpack_graph

* Move compute_edge_combined_degree

* Delete create_graph

* Delete concat

* Delete text replace

* Delete text_translate

* Move text_split

* Inline aggregate override

* Move cluster_graph

* Move merge_graphs

* Semver

* Move text_chunk

* Move layout_graph and fix some __init__s

* Move extract_covariates

* Rename text_split -> split_text

* Move extract_entities

* Move summarize_descriptions

* Rename text_chunk -> chunk_text

* Move community report creation

* Remove verb-level packing operators

* Streamline some naming

* Streamline param name/order

* Move mock LLM data to tests

* Fixed missed rename

* Update some strategy refs

* Rename run_gi

* Inject mock responses into integ test config
2024-10-09 13:46:44 -07:00
Nathan Evans
73e709b686
Collapse create final covariates (#1215)
* Add covariate test

* Add detailed mock assertions

* Collapse create_final_covariates

* Delete unused doc_id field

* Semver

* Update smoke test

* Remove unused subject/object type columns
2024-09-25 16:30:22 -07:00