256 Commits

Author SHA1 Message Date
Yuan Chai
2b7d28944d
Fix encoding issue: Ensure non-ASCII characters are correctly represe… (#1446)
Fix encoding issue: Ensure non-ASCII characters are correctly represented in entity name key

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-11-26 17:47:06 -06:00
Alonso Guevara
ae796b99cb
Fix dynamic community selection in global search (#1450)
* Fix dynamic community selection in global search

* Format

* Ruff fix
2024-11-26 15:19:50 -06:00
Alonso Guevara
6d21ef2683
Release v0.5.0 (#1415) v0.5.0 2024-11-18 00:06:54 -06:00
Josh Bradley
22a57d14c7
Improve CLI speed with lazy imports (#1319) 2024-11-15 19:41:10 -05:00
Nathan Evans
9b4f24ebce
First cut at config cleanup (#1411)
* First cut at config cleanup

* Reorder top nav

* Add query prompts to tuning page

* Remove dynamic notebook from nav

* Add more thorough yml config descriptions in docs

* Further clean out the config

* Semver

* Add new blog post

* Emphasize yaml

* Clarify output

* Fix unit test

* Fix bullet nesting
2024-11-15 14:33:26 -08:00
Nathan Evans
425dbc60e3
Docs update (#1408)
* Fix footer contrast

* Fix broken links

* Remove a few unneeded examples

* Point python API example to the whole folder

* Convert schema bullets to tables
2024-11-14 21:26:29 -06:00
JunHo Kim (김준호)
ec9cdcce4d
fix typo. Correct the wording "global search" to "drift search" in drift search documentation (#1383)
Updated the wording of the example scenario from "global search" to "drift search" to accurately reflect the topic. This improves clarity and ensures the documentation accurately describes its content.

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-11-14 16:55:44 -06:00
Jeff Baumes
0a5801041a
Fix documentation for generate_indexing_prompts (#1336)
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-11-14 16:53:59 -06:00
Alonso Guevara
c90166ca32
Add Parquet as part of the default emitters when not present (#1407)
Add Parquet as part of the default emitters when not present
2024-11-14 13:04:19 -06:00
Nathan Evans
51912b2e03
Move prompts (#1404)
* Move indexing prompts to root

* Move query prompts to root

* Export query prompts during init

* Extract general knowledge prompt

* Load query prompts from disk

* Semver

* Fix unit tests
2024-11-14 10:45:37 -08:00
Nathan Evans
c8c354e357
Artifact cleanup (#1341)
* Add source documents for verb tests

* Remove entity_type erroneous column

* Add new test data

* Remove source/target degree columns

* Remove top_level_node_id

* Remove chunk column configs

* Rename "chunk" to "text"

* Rename "chunk" to "text" in base

* Re-map document input to use base text units

* Revert base text units as final documents dep

* Update test data

* Split/rename node source_id

* Drop node size (dup of degree)

* Drop document_ids from covariates

* Remove unused document_ids from models

* Remove n_tokens from covariate table

* Fix missed document_ids delete

* Wire base text units to final documents

* Rename relationship rank as combined_degree

* Add rank as first-class property to Relationship

* Remove split_text operation

* Fix relationships test parquet

* Update test parquets

* Add entity ids to community table

* Remove stored graph embedding columns

* Format

* Semver

* Fix JSON typo

* Spelling

* Rename lancedb

* Sort lancedb

* Fix unit test

* Fix test to account for changing period

* Update tests for separate embeddings

* Format

* Better assertion printing

* Fix unit test for windows

* Rename document.raw_content -> document.text

* Remove read_documents function

* Remove unused document summary from model

* Remove unused imports

* Format

* Add new snapshots to default init

* Use util to construct embeddings collection name

* Align inc index model with branch changes

* Update data and tests for int ids

* Clean up embedding locs

* Switch entity "name" to "title" for consistency

* Fix short_id -> human_readable_id defaults

* Format

* Rework community IDs

* Fix community size compute

* Fix unit tests

* Fix report read

* Pare down nodes table output

* Fix unit test

* Fix merge

* Fix community loading

* Format

* Fix community id report extraction

* Update tests

* Consistent short IDs and ordering

* Update ordering and tests

* Update incremental for new nodes model

* Guard document columns loc

* Match column ordering

* Fix document guard

* Update smoke tests

* Fill NA on community extract

* Logging for smoke test debug

* Add parquet schema details doc

* Fix community hierarchy guard

* Use better empty hierarchy guard

* Back-compat shims

* Semver

* Fix warning

* Format

* Remove default fallback

* Reuse key
2024-11-13 15:11:19 -08:00
Alonso Guevara
e53422366d
Implement dynamic community selection for global search (#1396)
* update gitignore

* add dynamic community selection to updated main branch

* update SearchResult to record output_tokens.

* update search result

* dynamic search working

* format

* add llm_calls_categories and prompt_tokens and output_tokens cate

* update

* formatting

* log drift search output and prompt tokens separately

* update global_search.ipynb. update operate dulce dataset and add create_final_communities. update dynamic community selection init

* add .ipynb back to cspell.config.yaml

* format

* add notebook example on dynamic search

* rearrange

* update gitignore

* format code

* code format

* code format

* fix default variable

---------

Co-authored-by: Bryan Li <bryanlimy@gmail.com>
2024-11-11 16:45:07 -08:00
Alonso Guevara
ba50caab4d
Release v0.4.1 (#1387)
* Release v0.4.1

* Spellcheck
v0.4.1
2024-11-08 17:59:57 -06:00
Alonso Guevara
20c120288b
Feat/update cli (#1376)
* Add update cli option with default storage

* Semver

* Semver

* Pyright

* Format
2024-11-07 06:59:10 -06:00
Kylin
baa261c8e9
[bugfix]Fix query error with --streaming (#1368)
* fix streaming output error

* add semversioner

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-11-06 17:49:06 -06:00
Alonso Guevara
3d79de96d1
Raise error on empty deltas for incremental indexing (#1375)
* Raise error on empty deltas for incremental indexing

* Format
2024-11-06 17:33:35 -06:00
Alonso Guevara
1661672569
Fix optional covariates check in incremental indexing (#1374)
* Fix optional covariates check in incremental indexing

* Oopsie fix
2024-11-06 17:22:11 -06:00
Josh Bradley
a8ccded83c
Fix file path issue in the viz guide (#1372)
* Fix a file path issue in the viz guide.

* fix formatting
2024-11-06 14:42:07 -08:00
Alonso Guevara
2047c1561c
Fix styling and misalignment on drift docs (#1373) 2024-11-06 16:29:53 -06:00
Josh Bradley
0394b55086
Update CI/CD - skip running unit tests on documentation-only PRs (#1371) 2024-11-06 14:19:21 -05:00
Josh Bradley
9762f33c1a
Add visualization guide (#1340) 2024-11-06 14:06:50 -05:00
Alonso Guevara
a6d9b0ce3d
Release v0.4.0 (#1361)
* Release v0.4.0

* Missing change track
v0.4.0
2024-11-05 18:44:07 -06:00
Alonso Guevara
635c21109f
Fix Community ID loading for DRIFT search over existing indexes (#1360) 2024-11-05 18:21:36 -06:00
Alonso Guevara
80c0c7bdd1
Update Incremental Indexing to new embeddings workflow (#1359) 2024-11-05 16:54:02 -06:00
Alonso Guevara
83bd5cefe5
Fix content embedding container name (#1358) 2024-11-05 15:56:32 -06:00
Alonso Guevara
1557ce34f9
Fix init defaults for vector store and img in drift docs (#1357)
* Fix init defaults for vector store and img in drift docs

* Add more doc

* Spellcheck

* Remove example
2024-11-05 14:14:17 -06:00
Alonso Guevara
d9f985ae52
Drift Search CLI, API, Docs and Example Notebook (#1348)
* Drift CLI and backwards compat

* Adding DRIFT Cli, Docs and example notebook

* Update tests and fix ruff

* Format

* Small cleanup

* Fix smoke tests

* Update notebook

* Oopsie fix

* Delete duplicate img
2024-11-05 12:05:19 -06:00
Gabriel Nieves-Ponce
68dfceef21
Updated the variable names within the for-loop to differentiate betwe… (#1356)
* Updated the variable names within the for-loop to differentiate between them and the original title variable used in the dataframe. This avoids corrupting the original column-name defined in the title variable.

* Semver and format

---------

Co-authored-by: Gabriel Nieves-Ponce <gnievesponce@microsoft.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-11-05 11:45:29 -06:00
Nathan Evans
634e3ed62a
Transient entity graph (#1349)
* Make base_entity_graph transient

* Add transient snapshots

* Semver

* Fix unit test

* Fix smoke tests
2024-11-04 17:23:29 -08:00
gaudyb
17658c5df8
New workflow to generate embeddings in a single workflow (#1296)
* New workflow to generate embeddings in a single workflow

* New workflow to generate embeddings in a single workflow

* version change

* clean tests without any embeddings references

* clean tests without any embeddings references

* remove code

* feedback implemented

* changes in logic

* feedback implemented

* store in table bug fixed

* smoke test for generate_text_embeddings workflow

* smoke test fix

* add generate_text_embeddings to the list of transient workflows

* smoke tests

* fix

* ruff formatting updates

* fix

* smoke test fixed

* smoke test fixed

* fix lancedb import

* smoke test fix

* ignore sorting

* smoke test fixed

* smoke test fixed

* check smoke test

* smoke test fixed

* change config for vector store

* format fix

* vector store changes

* revert debug profile back to empty filepath

* merge conflict solved

* merge conflict solved

* format fixed

* format fixed

* fix return dataframe

* snapshot fix

* format fix

* embeddings param implemented

* validation fixes

* fix map

* fix map

* fix properties

* config updates

* smoke test fixed

* settings change

* Update collection config and rework back-compat

* Replace . with - for embedding store

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
Co-authored-by: Josh Bradley <joshbradley@microsoft.com>
Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
2024-11-01 15:01:35 -07:00
Chris Trevino
8302920ac8
move mkdocs-typer to devdeps (#1331)
* move mkdocs-typer to devdeps

* add .gitattributes for toml parsing issues on Windows CI

* bump timeout

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-10-30 14:49:30 -07:00
Alonso Guevara
7235c6faf5
Add Incremental Indexing v1 (#1318)
* Create entrypoint for cli and api (#1067)

* Add cli and api entrypoints for update index

* Semver

* Update docs

* Run tests on feature branch main

* Better /main handling in tests

* Incremental indexing/file delta (#1123)

* Calculate new inputs and deleted inputs on update

* Semver

* Clear ruff checks

* Fix pyright

* Fix PyRight

* Ruff again

* Update relationships after inc index (#1236)

* Collapse create final community reports (#1227)

* Remove extraneous param

* Add community report mocking assertions

* Collapse primary report generation

* Collapse embeddings

* Format

* Semver

* Remove extraneous check

* Move option set

* Collapse create base entity graph (#1233)

* Collapse create_base_entity_graph

* Format/typing

* Semver

* Fix smoke tests

* Simplify assignment

* Collapse create summarized entities (#1237)

* Collapse entity summarize

* Semver

* Collapse create base extracted entities (#1235)

* Set up base assertions

* Replace entity_extract

* Finish collapsing workflow

* Semver

* Update smoke tests

* Incremental indexing/update final text units (#1241)

* Update final text units

* Format

* Address comments

* Add v1 community merge using time period (#1257)

* Add naive community merge using time period

* formatting

* Query fixes

* Add descriptions from merged_entities

* Add summarization and embeddings

* Use iso format

* Ruff

* Pyright and smoke tests

* Pyright

* Pyright

* Update parquet for verb tests

* Fix smoke tests

* Remove sorting

* Update smoke tests

* Smoke tests

* Smoke tests

* Updated verb test to account for latest changes on covariates

* Add config for incremental index + Bug fixes (#1317)

* Add config for incremental index + Bug fixes

* Ruff

* Fix smoke tests

* Semversioner

* Small refactor

* Remove unused file

* Ruff

* Update verb tests inputs

* Update verb tests inputs

---------

Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
2024-10-30 11:59:44 -06:00
Josh Bradley
0cc79b9cf7
Add backwards compatibility patch for vector store (#1334) 2024-10-29 14:54:08 -04:00
Alonso Guevara
83026bdb26
Remove duplicated entries from relationships and nodes (#1333) 2024-10-29 00:56:07 -04:00
Josh Bradley
083de12bcf
Auto-generate CLI doc pages (#1325) 2024-10-25 19:00:24 -04:00
Josh Bradley
d6e6f5c077
Convert CLI to Typer app (#1305) 2024-10-24 14:22:32 -04:00
Nathan Evans
94f1e62e5c
Rework workflow architecture (#1311)
* Rename pipeline_storage file

* Add runtime storage option to context

* Fix import

* Switch to memory storage for runtime

* Infra for workflow runtime storage

* Migrate base_text_units to runtime storage

* Fix comment

* Semver

* Remove whitespace

* Remove subflow smoke tests and ignore transient artifacts

* Remove entity graph from transient list (not yet implemented)

* Increase smoke runtime allotment for create_base_entity_graph

* Revert format fix

* Remove noqa
2024-10-24 10:20:03 -07:00
Alonso Guevara
ac09e0a740
Feature/optimize count relationships (#1312)
* refactor build text unit context for better performance

* Further optimization and styling

* Remove TODO

---------

Co-authored-by: Brad Firesheets <v-bradleyf@microsoft.com>
Co-authored-by: bfirems <162185685+bfirems@users.noreply.github.com>
Co-authored-by: Josh Bradley <joshbradley@microsoft.com>
2024-10-23 12:03:57 -06:00
Josh Bradley
3df6f8c65b
Allow ci/cd to skip draft PRs (#1314) 2024-10-23 12:46:00 -04:00
Alonso Guevara
77e77775ad
Fix drift search edge cases over small input sets (#1310)
* Fix edge cases over small input sets

* Ruff
2024-10-22 16:24:41 -06:00
JunHo Kim (김준호)
8d8c67d503
fix typo. Update documentation URLs for consistency (#1298)
Update documentation URLs for consistency

Revised links in documentation files to remove the "posts" subdirectory for consistency and correctness.

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-10-21 17:24:17 -06:00
Alonso Guevara
8a6d4e66fe
DRIFT Search (#1285)
* drift search

* args for drift global query in local search

* accept drift context in search base

* optionally parse embeddings from df when creating CommunityReport

* abstract class for drift context

* pathing for drift config

* drift config

* add defs for drift config

* formatting

* capture generated tokens in token count

* semversion

* Formatting and ruff

* Some algorithmic refactors

* Ruff

* Format

* Use asdict()

* Address comments

* Update smoke tests

* Update smoke tests

* Update smoke tests part 2

---------

Co-authored-by: Julian Whiting <j2whitin@gmail.com>
2024-10-21 17:22:11 -06:00
KennyZhang1
e0840a2dc4
Fix vector store logic and refactor audience parameter (#1259) 2024-10-21 16:56:56 -04:00
Matthieu Maitre
6aae386b30
Perf optimizations in map_query_to_entities() (#1276)
* Address perf issue in map_query_to_entities()

* Add semver

---------

Co-authored-by: Matthieu Maitre <mmaitre@microsoft.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-10-21 12:03:48 -06:00
Nathan Evans
1f70d42572
Empty workflow returns (#1291)
* Skip emitting empty dataframes

* Semver

* Better empty df check
2024-10-17 09:25:36 -07:00
Andres Morales
fc502ee029
Fix cookie consent script missing (#1292) 2024-10-17 09:44:14 -06:00
Nathan Evans
ce5b1207e0
Collapse graph documents workflows (#1284)
* Copy base documents logic into final documents

* Delete create_base_documents

* Combine graph creation under create_base_entity_graph

* Delete collapsed workflows

* Migrate most graph internals to nx.Graph

* Fix None edge case

* Semver

* Remove comment typo

* Fix smoke tests
2024-10-15 13:58:58 -06:00
Andres Morales
137a5cd550
Fix/docs auto prompt img (#1283)
* Fix auto prompt tuning image path
2024-10-14 09:02:31 -06:00
Alonso Guevara
cb052a742f
Dependency updates (#1272)
* Dependency updates

* Pyright update
2024-10-11 18:06:11 -06:00
Andres Morales
fc9895f793
Replace current docs by mkdocs (#1263)
* Replace docs by mkdocs-material

* Fix markdown

* Fix versions in gh-pages workflow

* remove whitespaces

* add semver

* Add build docs check on python-ci

* Fix command in index cli

* Spellcheck

* Spellcheck

* remove docsite paths

* clear outputs from notebook

* remove dependabot npm for docsite

* remove more docsite left overs

* execute notebooks

* Update notebooks

* update poetry lock

* Remove notebook build from ci

* Revert dep update

* Navigation tabs

* Fix stylesheet

* add kwds to dictionary

* Turn on notebook execution

* Update gitignore

* Add MSR Blog posts

* spellcheck

* Accessibility Changes

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-10-11 13:39:03 -06:00