278 Commits

Author SHA1 Message Date
Josh Bradley
175fee72b6 initial commit and test of refactored pipeline reporter/logger 2024-12-10 09:54:08 -08:00
Josh Bradley
affe36a157 update comment 2024-12-09 12:13:05 -08:00
Josh Bradley
1194eeebb1 fix comments 2024-12-09 12:09:02 -08:00
Josh Bradley
a066b44aca rename reporter to logger 2024-12-09 12:01:15 -08:00
Josh Bradley
92c024ff25 cleanup reporter terminology 2024-12-09 02:17:07 -08:00
Josh Bradley
e7e3ee701c typo fix 2024-12-09 02:16:11 -08:00
Josh Bradley
4f4098b000 add comment 2024-12-09 01:48:24 -08:00
Josh Bradley
27ed220887 update developer guide 2024-12-09 01:41:11 -08:00
Josh Bradley
601bc4a3c9 cleanup logging factory 2024-12-09 01:40:57 -08:00
Josh Bradley
bd97255838 add semversioner file 2024-12-08 22:49:31 -08:00
Josh Bradley
167ece56ac cleanup factory methods to have similar design pattern across codebase 2024-12-08 16:08:53 -05:00
Alonso Guevara
1a13e0fd93
Release v0.9.0 (#1479)
* Release v0.9.0

* Spellcheck
v0.9.0
2024-12-06 14:29:55 -06:00
Alonso Guevara
1c3b0f34c3
Chore/lib updates (#1477)
* Update dependencies and fix issues

* Format

* Semver

* Fix Pyright

* Pyright

* More Pyright

* Pyright
2024-12-06 14:08:24 -06:00
volksen
b1f2ca785e
deduplicate sources in local search context (#1468)
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-12-06 13:05:00 -06:00
Chris Trevino
5ff2d3c76d
Remove graphrag.llm, replace with fnllm (#1315)
* add fnllm; remove llm folder

* remove llm unit tests

* update imports

* update imports

* formatting

* enable autosave

* update mockllm

* update community reports extractor

* move most llm usage to fnllm

* update type issues

* fix unit tests

* type updates

* update dictionary

* semver

* update llm construction, get integration tests working

* load from llmparameters model

* move ruff settings to ruff.toml

* add gitattributes file

* ignore ruff.toml spelling

* update .gitattributes

* update gitignore

* update config construction

* update prompt var usage

* add cache adapter

* use cache adapter in embeddings calls

* update embedding strategy

* add fnllm

* add pytest-dotenv

* fix some verb tests

* get verbtests running

* update ruff.toml for vscode

* enable ruff native server in vscode

* update artifact inspecting code

* remove local-test update

* use string.replace instead of string.format in community reports extractor

* bump timeout

* revert ruff.toml, vscode settings for another pr

* revert cspell config

* revert gitignore

* remove json-repair, update fnllm

* use fnllm generic type interfaces

* update load_llm to use target models

* consolidate chat parameters

* add 'extra_attributes' prop to community report response

* formatting

* update fnllm

* formatting

* formatting

* Add defaults to some llm params to avoid null on params hash

* Formatting

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
Co-authored-by: Josh Bradley <joshbradley@microsoft.com>
2024-12-05 18:07:47 -06:00
Alonso Guevara
d43124e576
Refactor Create Final Community reports to simplify code (#1456)
* Optimize prep claims

* Optimize community hierarchy restore

* Partial optimization of prepare_community_reports

* More optimization code

* Fix context string generation

* Filter community -1

* Fix cache, add more optimization fixes

* Fix local search community ids

* Cleanup

* Format

* Semver

* Remove perf counter

* Unused import

* Format

* Fix edge addition to reports

* Add edge by edge context creation

* Re-org of the optimization code

* Format

* Ruff

* Some Ruff fixes

* More pyright

* More pyright

* Pyright

* Pyright

* Update tests
2024-12-05 17:13:05 -06:00
Josh Bradley
b00142260d
Update index API + a notebook that provides a general API overview (#1454)
* update index api to accept callbacks

* fix hardcoded folder name that was creating an empty folder

* add API notebook

* add semversioner file

* filename change

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-12-05 15:34:21 -06:00
KennyZhang1
10f84c91eb
Replace md5 hash (#1470)
* switched hashing function helper to sha256

* refactored references to hashing util

* semversioner

* switched from sha256 to sha512

* new semversioner

* updated tests/verbs/data folder

* generated fresh parquet files in data folder

* moved ignore flag
2024-12-05 13:24:35 -06:00
Nathan Evans
d17dfd01f9
Graph collapse (#1464)
* Refactor graph creation

* Semver

* Spellcheck

* Update integ pipeline

* Fix cast

* Improve pandas chaining

* Cleaner apply

* Use list comprehensions

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-12-05 11:57:26 -06:00
Gijs Segerink
756f5c38a7
Update search.py (#1457)
Missing query in astream_search
2024-12-04 16:52:08 -06:00
Josh Bradley
dad2176b3c
Miscellaneous code cleanup procedures (#1452) 2024-11-27 13:27:43 -05:00
Nathan Evans
0b2120ca45
Docs and notebooks update (#1451)
* Fix local question gen and example notebook

* Update global search notebook

* Add lazy blog post

* Update breaking changes doc for migration notes

* Simplify Getting Started page

* Semver

* Spellcheck

* Fix types

* Add comments on cache-free migration

* Update wording

* Spelling

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-11-27 09:56:48 -08:00
Yuan Chai
2b7d28944d
Fix encoding issue: Ensure non-ASCII characters are correctly represe… (#1446)
Fix encoding issue: Ensure non-ASCII characters are correctly represented in entity name key

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-11-26 17:47:06 -06:00
Alonso Guevara
ae796b99cb
Fix dynamic community selection in global search (#1450)
* Fix dynamic community selection in global search

* Format

* Ruff fix
2024-11-26 15:19:50 -06:00
Alonso Guevara
6d21ef2683
Release v0.5.0 (#1415) v0.5.0 2024-11-18 00:06:54 -06:00
Josh Bradley
22a57d14c7
Improve CLI speed with lazy imports (#1319) 2024-11-15 19:41:10 -05:00
Nathan Evans
9b4f24ebce
First cut at config cleanup (#1411)
* First cut at config cleanup

* Reorder top nav

* Add query prompts to tuning page

* Remove dynamic notebook from nav

* Add more thorough yml config descriptions in docs

* Further clean out the config

* Semver

* Add new blog post

* Emphasize yaml

* Clarify output

* Fix unit test

* Fix bullet nesting
2024-11-15 14:33:26 -08:00
Nathan Evans
425dbc60e3
Docs update (#1408)
* Fix footer contrast

* Fix broken links

* Remove a few unneeded examples

* Point python API example to the whole folder

* Convert schema bullets to tables
2024-11-14 21:26:29 -06:00
JunHo Kim (김준호)
ec9cdcce4d
fix typo. Correct the wording "global search" to "drift search" in drift search documentation (#1383)
Updated the wording of the example scenario from "global search" to "drift search" to accurately reflect the topic. This improves clarity and ensures the documentation accurately describes its content.

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-11-14 16:55:44 -06:00
Jeff Baumes
0a5801041a
Fix documentation for generate_indexing_prompts (#1336)
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-11-14 16:53:59 -06:00
Alonso Guevara
c90166ca32
Add Parquet as part of the default emitters when not present (#1407)
Add Parquet as part of the default emitters when not present
2024-11-14 13:04:19 -06:00
Nathan Evans
51912b2e03
Move prompts (#1404)
* Move indexing prompts to root

* Move query prompts to root

* Export query prompts during init

* Extract general knowledge prompt

* Load query prompts from disk

* Semver

* Fix unit tests
2024-11-14 10:45:37 -08:00
Nathan Evans
c8c354e357
Artifact cleanup (#1341)
* Add source documents for verb tests

* Remove entity_type erroneous column

* Add new test data

* Remove source/target degree columns

* Remove top_level_node_id

* Remove chunk column configs

* Rename "chunk" to "text"

* Rename "chunk" to "text" in base

* Re-map document input to use base text units

* Revert base text units as final documents dep

* Update test data

* Split/rename node source_id

* Drop node size (dup of degree)

* Drop document_ids from covariates

* Remove unused document_ids from models

* Remove n_tokens from covariate table

* Fix missed document_ids delete

* Wire base text units to final documents

* Rename relationship rank as combined_degree

* Add rank as first-class property to Relationship

* Remove split_text operation

* Fix relationships test parquet

* Update test parquets

* Add entity ids to community table

* Remove stored graph embedding columns

* Format

* Semver

* Fix JSON typo

* Spelling

* Rename lancedb

* Sort lancedb

* Fix unit test

* Fix test to account for changing period

* Update tests for separate embeddings

* Format

* Better assertion printing

* Fix unit test for windows

* Rename document.raw_content -> document.text

* Remove read_documents function

* Remove unused document summary from model

* Remove unused imports

* Format

* Add new snapshots to default init

* Use util to construct embeddings collection name

* Align inc index model with branch changes

* Update data and tests for int ids

* Clean up embedding locs

* Switch entity "name" to "title" for consistency

* Fix short_id -> human_readable_id defaults

* Format

* Rework community IDs

* Fix community size compute

* Fix unit tests

* Fix report read

* Pare down nodes table output

* Fix unit test

* Fix merge

* Fix community loading

* Format

* Fix community id report extraction

* Update tests

* Consistent short IDs and ordering

* Update ordering and tests

* Update incremental for new nodes model

* Guard document columns loc

* Match column ordering

* Fix document guard

* Update smoke tests

* Fill NA on community extract

* Logging for smoke test debug

* Add parquet schema details doc

* Fix community hierarchy guard

* Use better empty hierarchy guard

* Back-compat shims

* Semver

* Fix warning

* Format

* Remove default fallback

* Reuse key
2024-11-13 15:11:19 -08:00
Alonso Guevara
e53422366d
Implement dynamic community selection for global search (#1396)
* update gitignore

* add dynamic community selection to updated main branch

* update SearchResult to record output_tokens.

* update search result

* dynamic search working

* format

* add llm_calls_categories and prompt_tokens and output_tokens cate

* update

* formatting

* log drift search output and prompt tokens separately

* update global_search.ipynb. update operate dulce dataset and add create_final_communities. update dynamic community selection init

* add .ipynb back to cspell.config.yaml

* format

* add notebook example on dynamic search

* rearrange

* update gitignore

* format code

* code format

* code format

* fix default variable

---------

Co-authored-by: Bryan Li <bryanlimy@gmail.com>
2024-11-11 16:45:07 -08:00
Alonso Guevara
ba50caab4d
Release v0.4.1 (#1387)
* Release v0.4.1

* Spellcheck
v0.4.1
2024-11-08 17:59:57 -06:00
Alonso Guevara
20c120288b
Feat/update cli (#1376)
* Add update cli option with default storage

* Semver

* Semver

* Pyright

* Format
2024-11-07 06:59:10 -06:00
Kylin
baa261c8e9
[bugfix]Fix query error with --streaming (#1368)
* fix streaming output error

* add semversioner

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-11-06 17:49:06 -06:00
Alonso Guevara
3d79de96d1
Raise error on empty deltas for incremental indexing (#1375)
* Raise error on empty deltas for incremental indexing

* Format
2024-11-06 17:33:35 -06:00
Alonso Guevara
1661672569
Fix optional covariates check in incremental indexing (#1374)
* Fix optional covariates check in incremental indexing

* Oopsie fix
2024-11-06 17:22:11 -06:00
Josh Bradley
a8ccded83c
Fix file path issue in the viz guide (#1372)
* Fix a file path issue in the viz guide.

* fix formatting
2024-11-06 14:42:07 -08:00
Alonso Guevara
2047c1561c
Fix styling and misalignment on drift docs (#1373) 2024-11-06 16:29:53 -06:00
Josh Bradley
0394b55086
Update CI/CD - skip running unit tests on documentation-only PRs (#1371) 2024-11-06 14:19:21 -05:00
Josh Bradley
9762f33c1a
Add visualization guide (#1340) 2024-11-06 14:06:50 -05:00
Alonso Guevara
a6d9b0ce3d
Release v0.4.0 (#1361)
* Release v0.4.0

* Missing change track
v0.4.0
2024-11-05 18:44:07 -06:00
Alonso Guevara
635c21109f
Fix Community ID loading for DRIFT search over existing indexes (#1360) 2024-11-05 18:21:36 -06:00
Alonso Guevara
80c0c7bdd1
Update Incremental Indexing to new embeddings workflow (#1359) 2024-11-05 16:54:02 -06:00
Alonso Guevara
83bd5cefe5
Fix content embedding container name (#1358) 2024-11-05 15:56:32 -06:00
Alonso Guevara
1557ce34f9
Fix init defaults for vector store and img in drift docs (#1357)
* Fix init defaults for vector store and img in drift docs

* Add more doc

* Spellcheck

* Remove example
2024-11-05 14:14:17 -06:00
Alonso Guevara
d9f985ae52
Drift Search CLI, API, Docs and Example Notebook (#1348)
* Drift CLI and backwards compat

* Adding DRIFT Cli, Docs and example notebook

* Update tests and fix ruff

* Format

* Small cleanup

* Fix smoke tests

* Update notebook

* Oopsie fix

* Delete duplicate img
2024-11-05 12:05:19 -06:00
Gabriel Nieves-Ponce
68dfceef21
Updated the variable names within the for-loop to differentiate betwe… (#1356)
* Updated the variable names within the for-loop to differentiate between them and the original title variable used in the dataframe. This avoids corrupting the original column-name defined in the title variable.

* Semver and format

---------

Co-authored-by: Gabriel Nieves-Ponce <gnievesponce@microsoft.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2024-11-05 11:45:29 -06:00