graphrag/tests/verbs/test_create_final_communities.py

# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

from graphrag.index.run.utils import create_run_context
from graphrag.index.workflows.v1.create_final_communities import (
    build_steps,
    workflow_name,
)

from .util import (
    compare_outputs,
    get_workflow_output,
    load_expected,
    load_input_tables,
)


async def test_create_final_communities():
    """Check the workflow output against the expected table, skipping volatile columns."""
    input_tables = load_input_tables([
        "workflow:create_base_entity_graph",
    ])
    expected = load_expected(workflow_name)

    context = create_run_context(None, None, None)
    await context.runtime_storage.set(
        "base_entity_graph", input_tables["workflow:create_base_entity_graph"]
    )

    steps = build_steps({})

    actual = await get_workflow_output(
        input_tables,
        {
            "steps": steps,
        },
        context=context,
    )

    # ignore the period and id columns, because they are recalculated on every run
    assert "period" in expected.columns
    assert "id" in expected.columns
    columns = list(expected.columns.values)
    columns.remove("period")
    columns.remove("id")
    compare_outputs(
        actual,
        expected,
        columns=columns,
    )