graphrag/tests/verbs/util.py

# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

import pandas as pd
from pandas.testing import assert_series_equal

from graphrag.index.context import PipelineRunContext
from graphrag.index.run.utils import create_run_context
from graphrag.utils.storage import write_table_to_storage

pd.set_option("display.max_columns", None)


async def create_test_context(storage: list[str] | None = None) -> PipelineRunContext:
    """Create a test context with the given tables loaded into storage."""
    context = create_run_context(None, None, None)

    # always set the input docs (named input_table to avoid shadowing the builtin `input`)
    input_table = load_test_table("source_documents")
    await write_table_to_storage(input_table, "input", context.storage)

    if storage:
        for name in storage:
            table = load_test_table(name)
            # the normal storage interface insists on bytes, so write through the table helper
            await write_table_to_storage(table, name, context.storage)

    return context


def load_test_table(output: str) -> pd.DataFrame:
    """Load a test table from parquet; pass in the workflow output (generally the workflow name)."""
    return pd.read_parquet(f"tests/verbs/data/{output}.parquet")


def compare_outputs(
    actual: pd.DataFrame, expected: pd.DataFrame, columns: list[str] | None = None
) -> None:
    """Compare the actual and expected dataframes, optionally specifying columns to compare.

    This uses assert_series_equal, column by column, since we are sometimes
    intentionally omitting columns from the actual output.
    """
    cols = expected.columns if columns is None else columns

    assert len(actual) == len(expected), (
        f"Expected: {len(expected)} rows, Actual: {len(actual)} rows"
    )

    for column in cols:
        assert column in actual.columns, f"Column '{column}' missing from actual output"
        try:
            # dtypes can differ since the test data is read from parquet and our workflow runs in memory
            assert_series_equal(
                actual[column], expected[column], check_dtype=False, check_index=False
            )
        except AssertionError:
            print("Expected:")
            print(expected[column])
            print("Actual:")
            print(actual[column])
            raise