mirror of
https://github.com/OpenSPG/KAG.git
synced 2025-11-16 10:27:57 +00:00
* add path find
* fix find path
* spg guided relation extraction
* fix dict parse with same key
* rename graphalgoclient to graphclient
* file reader supports http url
* add checkpointer class
* parser supports checkpoint
* add build
* remove incorrect logs
* remove logs
* update examples
* update chain checkpointer
* vectorizer batch size set to 32
* add a ZODB-backed checkpointer
* fix ZODB-based checkpointer
* add thread for ZODB IO
* fix(common): resolve multithread conflict in ZODB IO
* fix(common): load existing ZODB checkpoints
* fix ZODB writer
* add docstring
* fix jieba version mismatch
* commit kag_config-tc.yaml: 1. rename type to register_name; 2. give register_name a unique, specific value; 3. rename reader to scanner; 4. rename parser to reader; 5. rename num_parallel to num_parallel_file and chain_level_num_paralle to num_parallel_chain_of_file; 6. rename kag_extractor to schema_free_extractor and schema_base_extractor to schema_constraint_extractor; 7. pre-define llm & vectorize_model and reference them in the yaml file. Issues to be resolved: 1. examples of event extract & spg extract; 2. indexer statistics, such as number of extracted nodes & edges and ratio of llm invocations; 3. exceptions such as "Debt account does not exist" should be thrown in llm invoke; 4. solver conf needs to be re-examined.
* fix bug in base_table_splitter
* fix bug in default_chain
* add solver
* add kag
* update outline splitter
* add main test
* add op
* code refactor
* add tools
* fix outline splitter
* fix outline prompt
* graph api pass
* commit with page rank
* add search api and graph api
* add markdown report
* fix vectorizer num batch compute
* add retry for vectorize model call
* update markdown reader
* update pdf reader
* raise extractor failure
* add default expr
* add log
* merge jc reader features
* rm import
* rm parser
* run pipeline
* add config option of whether to perform llm config check, default to false
* fix
* recover pdf reader
* several components can be null for default chain
* support full QA runs
* add if
* remove unused code
* fall back to chunks
* excluded source relation to choose
* add generate
* default recall 10
* add local memory
* exclude similar edges
* add guard
* fix concurrency issue
* add debug logger
* make topk configurable
* support chunk truncation and adjust the SPO select prompt
* add query request guard
* add force_chunk config
* fix entity linker algorithm
* add sub-query rewriting
* fix md reader dup in test
* merge knext into kag in parallel
* fix package
* fix metric regression
* scanner update
* add doc and update example scripts
* add bridge to spg server
* add format
* fix bridge
* update conf for baike
* disable ckpt for spg server runner
* llm invoke error default raise exceptions
* chore(version): bump version to X.Y.Z
* update default response generation prompt
* add method getSummarizationMetrics
* fix(common): fix project conf empty error
* fix typo
* add reporting info
* update main solver
* postprocessor support spg server
* update solver supported names
* fix language
* update chunker interface and add OpenAPI
* rename vectorizer to vectorize_model in spg server config
* generate_random_string start with gen
* add knext llm vector checker
* remove solver default values
* update yaml and register_name for baike
* remove config key check
* fix llm module
* fix knext project
* update yaml and register_name for examples
* Revert "udpate yaml and register_name for examples" (reverts commit b3fa5ca9ba749e501133ac67bd8746027ab839d9)
* update register name
* support multiple register names
* update component
* update reader register names (#183)
* fix markdown reader
* fix llm client for retry
* feat(common): add processed chunk id checkpoint (#185)
* feat(example): add example config file (#186)
* add max_workers parameter for getSummarizationMetrics to make it faster
* add csqa data generation script generate_data.py
* commit generated csqa builder and solver data
* add csqa basic project files
* adjust split_length and num_threads_per_chain to match lightrag settings
* ignore ckpt dirs
* add csqa evaluation script eval.py
* save evaluation scripts summarization_metrics.py and factual_correctness.py
* save LightRAG output csqa_lightrag_answers.json
* ignore KAG output csqa_kag_answers.json
* add README.md for CSQA
* fix(solver): fix solver pipeline conf (#191)
* update solver pipeline config
* fix project create
* update links and file paths
* reformat csqa kag_config.yaml
* reformat csqa python files
* reformat getSummarizationMetrics and compare_summarization_answers
* fix(solver): fix solver config (#192)
* fix main solver conf
* add except
* fix typo in csqa README.md
* feat(conf): support reinitializing config for calls from the java side (#199)
* revert default response generation prompt
* update project list
* add README.md for the hotpotqa, 2wiki and musique examples
* add SPO retrieval
* turn off kag config dump by default
* turn off knext schema dump by default
* add .gitignore and fix kag_config.yaml
* add README.md for the medicine example
* add README.md for the supplychain example
* bugfix for risk mining
* use exact out
* refactor(solver): format solver code (#205)
* black format

Co-authored-by: peilong <peilong.zpl@antgroup.com>
Co-authored-by: 锦呈 <zhangxinhong.zxh@antgroup.com>
Co-authored-by: zhengke.gzk <zhengke.gzk@antgroup.com>
Co-authored-by: huaidong.xhd <huaidong.xhd@antgroup.com>
170 lines
6.4 KiB
Python
# -*- coding: utf-8 -*-
# Copyright 2023 OpenSPG Authors
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
# in compliance with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied.
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Union, Type, List

import networkx as nx
from tqdm import tqdm

from knext.common.base.runnable import Runnable
from knext.common.base.restable import RESTable

class Chain(Runnable, RESTable):
    """
    Base class for creating structured sequences of calls to components.
    """

    # The execution process of the Chain, represented by a DAG structure.
    dag: nx.DiGraph
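Both `invoke` and `__rshift__` below locate a DAG's start nodes (in-degree 0) and end nodes (out-degree 0). A minimal stdlib sketch of that detection, using a plain adjacency dict as an illustrative stand-in for `nx.DiGraph` (node names are made up):

```python
# Illustrative stand-in for nx.DiGraph: node -> set of successors.
dag = {
    "reader": {"splitter"},
    "splitter": {"extractor"},
    "extractor": set(),
}

def in_degree(dag, node):
    """Number of edges pointing at `node`."""
    return sum(node in succs for succs in dag.values())

def out_degree(dag, node):
    """Number of edges leaving `node`."""
    return len(dag[node])

start_nodes = [n for n in dag if in_degree(dag, n) == 0]
end_nodes = [n for n in dag if out_degree(dag, n) == 0]
print(start_nodes, end_nodes)  # ['reader'] ['extractor']
```

With networkx these checks become `dag.in_degree(node)` and `dag.out_degree(node)`, which is exactly what the methods below call.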
    def invoke(self, input: str, max_workers, **kwargs):
        node_results = {}
        futures = []

        def execute_node(node, inputs: List[str]):
            # Run the node over all of its inputs in parallel and gather the outputs.
            with ThreadPoolExecutor(max_workers) as inner_executor:
                inner_futures = [
                    inner_executor.submit(node.invoke, inp) for inp in inputs
                ]
                result = []
                for inner_future in tqdm(
                    as_completed(inner_futures),
                    total=len(inner_futures),
                    desc=f"Processing {node.name}",
                ):
                    result.extend(inner_future.result())
            return node, result

        # Initialize a ThreadPoolExecutor shared by all DAG nodes.
        with ThreadPoolExecutor(max_workers) as executor:
            # Find the starting nodes (nodes with no predecessors).
            start_nodes = [
                node for node in self.dag.nodes if self.dag.in_degree(node) == 0
            ]

            # Initialize the first set of tasks.
            for node in start_nodes:
                futures.append(executor.submit(execute_node, node, [input]))

            # Process nodes as futures complete.
            while futures:
                for future in as_completed(list(futures)):
                    node, result = future.result()
                    node_results[node] = result
                    futures.remove(future)

                    # Submit successors for execution.
                    for successor in self.dag.successors(node):
                        # Only run a successor once all of its predecessors have finished.
                        if all(
                            pred in node_results
                            for pred in self.dag.predecessors(successor)
                        ):
                            # Gather all predecessor outputs as this successor's inputs.
                            inputs = []
                            for pred in self.dag.predecessors(successor):
                                inputs.extend(node_results[pred])
                            futures.append(
                                executor.submit(execute_node, successor, inputs)
                            )

        # Collect the final results from the output nodes (nodes with no successors).
        output_nodes = [
            node for node in self.dag.nodes if self.dag.out_degree(node) == 0
        ]
        final_output = []
        for node in output_nodes:
            if node in node_results:
                final_output.extend(node_results[node])

        return final_output
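The scheduling loop in `invoke` reduces to a small stdlib-only pattern: seed a thread pool with the start nodes, and submit a node only once every one of its predecessors has produced results. A sketch under illustrative assumptions (the diamond DAG, the `"~"` seed input, and the string-appending "work" are all made up for the demo):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Illustrative four-node diamond DAG: a -> {b, c} -> d.
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
succs = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

def run_node(name, inputs):
    # Toy "work": each node appends its own name to every input string.
    return name, [inp + name for inp in inputs]

results = {}
with ThreadPoolExecutor(max_workers=4) as pool:
    # Seed the pool with the start nodes (no predecessors).
    futures = [pool.submit(run_node, n, ["~"]) for n in preds if not preds[n]]
    while futures:
        for fut in as_completed(list(futures)):
            name, out = fut.result()
            results[name] = out
            futures.remove(fut)
            # Submit each successor once all of its predecessors are done.
            for nxt in succs[name]:
                if all(p in results for p in preds[nxt]):
                    gathered = [x for p in preds[nxt] for x in results[p]]
                    futures.append(pool.submit(run_node, nxt, gathered))

print(results["d"])  # ['~abd', '~acd']
```

The readiness check runs only in the scheduling thread, so a node is submitted exactly once, when the last of its predecessors is recorded; this mirrors the `all(pred in node_results ...)` guard in `invoke`.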
    def batch(self, inputs: List[str], max_workers, **kwargs):
        # Invoke the chain once per input and collect the outputs
        # (the previous version discarded the results).
        results = []
        for i in inputs:
            results.append(self.invoke(i, max_workers, **kwargs))
        return results
    def to_rest(self):
        # Note: incomplete in this snapshot; only imports the REST bindings.
        from knext.builder import rest  # noqa: F401
    def __rshift__(
        self,
        other: Union[
            Type["Chain"],
            List[Type["Chain"]],
            Type["Component"],
            List[Type["Component"]],
            None,
        ],
    ):
        """
        Implements the right shift operator ">>" to link Component or Chain objects.

        This method can handle single Component/Chain objects or lists of them.
        When linking Components, a new DAG (Directed Acyclic Graph) is created to represent the data flow connection.
        When linking Chain objects, the DAGs of both Chains are merged.

        Parameters:
            other (Union[Type["Chain"], List[Type["Chain"]], Type["Component"], List[Type["Component"]], None]):
                The subsequent steps to link, which can be a single Component/Chain object or a list of them.

        Returns:
            A new Chain object whose DAG represents the linked data flow between the current Chain and `other`.
        """
        from knext.common.base.component import Component

        if not other:
            return self
        # If other is not a list, convert it to a list
        if not isinstance(other, list):
            other = [other]

        dag_list = []
        for o in other:
            # Skip empty entries (the previous version dereferenced o.dag here and crashed)
            if not o:
                continue
            # If o is a Component, copy this Chain's DAG and append o to it
            if isinstance(o, Component):
                end_nodes = [
                    node
                    for node, out_degree in self.dag.out_degree()
                    if out_degree == 0 or node._last
                ]
                dag = nx.DiGraph(self.dag)
                if len(end_nodes) > 0:
                    for end_node in end_nodes:
                        dag.add_edge(end_node, o)
                dag.add_node(o)
                dag_list.append(dag)
            # If o is a Chain, merge the DAGs of self and o
            elif isinstance(o, Chain):
                combined_dag = nx.compose(self.dag, o.dag)
                end_nodes = [
                    node
                    for node, out_degree in self.dag.out_degree()
                    if out_degree == 0 or node._last
                ]
                start_nodes = [
                    node for node, in_degree in o.dag.in_degree() if in_degree == 0
                ]

                # Connect every end node of self to every start node of o
                if len(end_nodes) > 0 and len(start_nodes) > 0:
                    for end_node in end_nodes:
                        for start_node in start_nodes:
                            combined_dag.add_edge(end_node, start_node)
                # The previous version built combined_dag but never collected it
                dag_list.append(combined_dag)
        # If every entry was skipped, there is nothing to merge
        if not dag_list:
            return self
        # Merge all DAGs and create the final Chain object
        final_dag = nx.compose_all(dag_list)
        return Chain(dag=final_dag)
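The linking behaviour of `__rshift__` can be sketched with plain classes: `>>` connects every end node of the left operand to every start node of the right operand, yielding a combined graph. This is a self-contained toy (the `MiniChain` class and node names are illustrative, not the knext API), with edges stored as `(source, target)` tuples instead of an `nx.DiGraph`:

```python
class MiniChain:
    """Toy chain: a set of nodes plus directed edges, linkable with >>."""

    def __init__(self, nodes, edges=()):
        self.nodes = set(nodes)
        self.edges = set(edges)

    def _end_nodes(self):
        # End nodes never appear as an edge source (out-degree 0).
        sources = {a for a, _ in self.edges}
        return [n for n in self.nodes if n not in sources]

    def _start_nodes(self):
        # Start nodes never appear as an edge target (in-degree 0).
        targets = {b for _, b in self.edges}
        return [n for n in self.nodes if n not in targets]

    def __rshift__(self, other):
        # Merge both graphs, then wire self's ends to other's starts.
        merged = MiniChain(self.nodes | other.nodes, self.edges | other.edges)
        for end in self._end_nodes():
            for start in other._start_nodes():
                merged.edges.add((end, start))
        return merged

left = MiniChain({"reader", "splitter"}, {("reader", "splitter")})
right = MiniChain({"extractor"})
combined = left >> right
print(sorted(combined.edges))  # [('reader', 'splitter'), ('splitter', 'extractor')]
```

The real implementation does the same end-to-start wiring via `nx.compose` and then merges all resulting DAGs with `nx.compose_all`; it additionally honors a `node._last` flag that forces a node to count as an end node even when it has successors.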