Mirror of https://github.com/OpenSPG/KAG.git, synced 2025-06-27 03:20:08 +00:00

* add path find
* fix find path
* spg guided relation extraction
* fix dict parse with same key
* rename graphalgoclient to graphclient
* file reader supports http url
* add checkpointer class
* parser supports checkpoint
* add build
* remove incorrect logs
* remove logs
* update examples
* update chain checkpointer
* vectorizer batch size set to 32
* add a zodb-backed checkpointer
* fix zodb based checkpointer
* add thread for zodb IO
* fix(common): resolve multithread conflict in zodb IO
* fix(common): load existing zodb checkpoints
* fix zodb writer
* add docstring
* fix jieba version mismatch
* commit kag_config-tc.yaml
  1. rename type to register_name
  2. put a unique & specific name to register_name
  3. rename reader to scanner
  4. rename parser to reader
  5. rename num_parallel to num_parallel_file, rename chain_level_num_paralle to num_parallel_chain_of_file
  6. rename kag_extractor to schema_free_extractor, schema_base_extractor to schema_constraint_extractor
  7. pre-define llm & vectorize_model and refer to them in the yaml file
  Issues to be resolved:
  1. examples of event extract & spg extract
  2. indexer statistics, such as the number of nodes & edges extracted and the ratio of llm invocations
  3. exceptions such as "Debt" or "account does not exist" should be thrown in llm invoke
  4. conf of solver needs to be re-examined
* fix bug in base_table_splitter
* fix bug in default_chain
* add solver
* add kag
* update outline splitter
* add main test
* add op
* code refactor
* add tools
* fix outline splitter
* fix outline prompt
* graph api pass
* commit with page rank
* add search api and graph api
* add markdown report
* fix vectorizer num batch compute
* add retry for vectorize model call
* update markdown reader
* update pdf reader
* raise extractor failure
* add default expr
* add log
* merge jc reader features
* rm import
* rm parser
* run pipeline
* add config option of whether to perform llm config check, default to false
* fix
* recover pdf reader
* several components can be null for default chain
* support full qa run
* add if
* remove unused code
* fall back to chunks
* excluded source relation to choose
* add generate
* default recall 10
* add local memory
* exclude similar edges
* add safeguards
* fix concurrency issue
* add debug logger
* support parameterized topk
* support chunk truncation and adjust the spo select prompt
* add query request protection
* add force_chunk config
* fix entity linker algorithm
* add sub query rewriting
* fix md reader dup in test
* merge knext to kag parallel
* fix package
* fix metric drop issue
* scanner update
* add doc and update example scripts
* add bridge to spg server
* add format
* fix bridge
* update conf for baike
* disable ckpt for spg server runner
* llm invoke error default raise exceptions
* chore(version): bump version to X.Y.Z
* update default response generation prompt
* add method getSummarizationMetrics
* fix(common): fix project conf empty error
* fix typo
* add reporting information
* modify main solver
* postprocessor support spg server
* modify solver supported names
* fix language
* modify chunker interface, add openapi
* rename vectorizer to vectorize_model in spg server config
* generate_random_string start with gen
* add knext llm vector checker
* remove default values from solver
* update yaml and register_name for baike
* remove config key check
* fix llmmodule
* fix knext project
* update yaml and register_name for examples
* Revert "udpate yaml and register_name for examples" This reverts commit b3fa5ca9ba749e501133ac67bd8746027ab839d9.
* update register name
* support multiple register names
* update component
* update reader register names (#183)
* fix markdown reader
* fix llm client for retry
* feat(common): add processed chunk id checkpoint (#185)
* update reader register names
* add processed chunk id checkpoint
* feat(example): add example config (#186)
* add example config file
* add max_workers parameter for getSummarizationMetrics to make it faster
* add csqa data generation script generate_data.py
* commit generated csqa builder and solver data
* add csqa basic project files
* adjust split_length and num_threads_per_chain to match lightrag settings
* ignore ckpt dirs
* add csqa evaluation script eval.py
* save evaluation scripts summarization_metrics.py and factual_correctness.py
* save LightRAG output csqa_lightrag_answers.json
* ignore KAG output csqa_kag_answers.json
* add README.md for CSQA
* fix(solver): fix solver pipeline conf (#191)
* update solver pipeline config
* fix project create
* update links and file paths
* reformat csqa kag_config.yaml
* reformat csqa python files
* reformat getSummarizationMetrics and compare_summarization_answers
* fix(solver): fix solver config (#192)
* fix main solver conf
* add except
* fix typo in csqa README.md
* feat(conf): support reinitialize config for call from java side (#199)
* support reinitialize config for java call
* revert default response generation prompt
* update project list
* add README.md for the hotpotqa, 2wiki and musique examples
* add spo retrieval
* turn off kag config dump by default
* turn off knext schema dump by default
* add .gitignore and fix kag_config.yaml
* add README.md for the medicine example
* add README.md for the supplychain example
* bugfix for risk mining
* use exact out
* refactor(solver): format solver code (#205)
* black format

---------

Co-authored-by: peilong <peilong.zpl@antgroup.com>
Co-authored-by: 锦呈 <zhangxinhong.zxh@antgroup.com>
Co-authored-by: zhengke.gzk <zhengke.gzk@antgroup.com>
Co-authored-by: huaidong.xhd <huaidong.xhd@antgroup.com>
206 lines
6.3 KiB
Python
# -*- coding: utf-8 -*-
# Copyright 2023 OpenSPG Authors
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
# in compliance with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied.
import re
import sys
import json
from typing import Type, Tuple
import inspect
import os
from pathlib import Path
import importlib
from shutil import copystat, copy2
from typing import Any, Union
from jinja2 import Environment, FileSystemLoader, Template
from stat import S_IWUSR as OWNER_WRITE_PERMISSION


def _register(root, path, files, class_type):
    # Build the dotted module prefix from the directory's path relative to `root`.
    # `os.sep` is used instead of a hard-coded "/" so this also works on Windows.
    relative_path = os.path.relpath(path, root)
    module_prefix = relative_path.replace(".", "").replace(os.sep, ".")
    module_prefix = module_prefix + "." if module_prefix else ""
    for file_name in files:
        if file_name.endswith(".py"):
            module_name = module_prefix + os.path.splitext(file_name)[0]

            module = importlib.import_module(module_name)
            classes = inspect.getmembers(module, inspect.isclass)
            for class_name, class_obj in classes:
                # Register only subclasses of `class_type` defined in this module itself,
                # not classes that the module merely imports.
                if (
                    issubclass(class_obj, class_type)
                    and inspect.getmodule(class_obj) == module
                ):

                    class_type.register(
                        name=class_name,
                        local_path=os.path.join(path, file_name),
                        module_path=module_name,
                    )(class_obj)


def register_from_package(path: str, class_type: Type) -> None:
    """
    Register all classes under the given package.
    Only registered classes can be recognized by knext.
    """
    if not append_python_path(path):
        return
    for root, dirs, files in os.walk(path):
        _register(path, root, files, class_type)
    class_type._has_registered = True


def append_python_path(path: str) -> bool:
    """
    Append the given path to `sys.path`.

    Return True if the resolved path was appended, or False if it was already present.
    """
    path = Path(path).resolve()
    path = str(path)
    if path not in sys.path:
        sys.path.append(path)
        return True
    return False


def render_template(
    root_dir: Union[str, os.PathLike], file: Union[str, os.PathLike], **kwargs: Any
) -> None:
    path_obj = Path(root_dir) / file
    env = Environment(loader=FileSystemLoader(path_obj.parent))
    template = env.get_template(path_obj.name)
    content = template.render(kwargs)

    render_path = path_obj.with_suffix("") if path_obj.suffix == ".tmpl" else path_obj

    if path_obj.suffix == ".tmpl":
        path_obj.rename(render_path)

    render_path.write_text(content, "utf8")


def copytree(src: Path, dst: Path, **kwargs):
    names = [x.name for x in src.iterdir()]

    if not dst.exists():
        dst.mkdir(parents=True)

    for name in names:
        _name = Template(name).render(**kwargs)
        src_name = src / name
        dst_name = dst / _name
        if src_name.is_dir():
            copytree(src_name, dst_name, **kwargs)
        else:
            copyfile(src_name, dst_name, **kwargs)

    copystat(src, dst)
    _make_writable(dst)


def copyfile(src: Path, dst: Path, **kwargs):
    if dst.exists():
        return
    dst = Path(Template(str(dst)).render(**kwargs))
    copy2(src, dst)
    _make_writable(dst)
    if dst.suffix != ".tmpl":
        return
    render_template("/", dst, **kwargs)


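def remove_files_except(path, file, new_file):
    # Delete every regular file in `path` except `file`, then rename `file` to `new_file`.
    for filename in os.listdir(path):
        file_path = os.path.join(path, filename)
        if os.path.isfile(file_path) and filename != file:
            os.remove(file_path)
    # os.path.join accepts both str and Path, whereas `path / file` fails when `path` is a str.
    os.rename(os.path.join(path, file), os.path.join(path, new_file))

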
def _make_writable(path):
    current_permissions = os.stat(path).st_mode
    os.chmod(path, current_permissions | OWNER_WRITE_PERMISSION)


def escape_single_quotes(s: str):
    return s.replace("'", "\\'")


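def load_json(content):
    try:
        return json.loads(content)
    except json.JSONDecodeError as e:
        # Retry with the text preceding the error (e.g. a valid document followed by
        # trailing data). `e.pos` is the absolute index into `content`; `e.colno` is
        # only a column within the failing line, so it is wrong for multi-line input.
        substr = content[: e.pos]
        return json.loads(substr)

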
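The recovery strategy in `load_json` depends on slicing the input at the point where parsing failed. The following is a minimal sketch, using only the standard `json` module, of how the `pos`, `lineno`, and `colno` attributes of `json.JSONDecodeError` relate on multi-line input. The name `load_json_prefix` and the sample string are illustrative assumptions, not part of this module.

```python
import json


def load_json_prefix(content: str):
    """Parse `content`; on failure, retry with the text before the error position."""
    try:
        return json.loads(content)
    except json.JSONDecodeError as e:
        # e.pos is an absolute index into `content`;
        # e.colno is a 1-based column within line e.lineno.
        return json.loads(content[: e.pos])


# Hypothetical input: a valid JSON document followed by extra data on a second line.
two_docs = '{"a": 1}\n{"b": 2}'

try:
    json.loads(two_docs)
except json.JSONDecodeError as e:
    error_pos, error_line, error_col = e.pos, e.lineno, e.colno

first_doc = load_json_prefix(two_docs)
```

Here the error sits at absolute index 9 but at column 1 of line 2, so a slice based on the column alone would be empty, while a slice based on `pos` keeps the whole first document.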
def split_module_class_name(name: str, text: str) -> Tuple[str, str]:
    """
    Split `name` into a module name and class name pair.

    :param name: fully qualified class name, e.g. ``foo.bar.MyClass``
    :type name: str
    :param text: describes the kind of the class, used in the exception message
    :type text: str
    :rtype: Tuple[str, str]
    :raises RuntimeError: if `name` is not a fully qualified class name
    """
    i = name.rfind(".")
    if i == -1:
        message = "invalid %s class name: %s" % (text, name)
        raise RuntimeError(message)
    module_name = name[:i]
    class_name = name[i + 1 :]
    return module_name, class_name


def dynamic_import_class(name: str, text: str):
    """
    Import the class specified by `name` dynamically.

    :param name: fully qualified class name, e.g. ``foo.bar.MyClass``
    :type name: str
    :param text: describes the kind of the class, used in the exception message
    :type text: str
    :raises RuntimeError: if `name` is not a fully qualified class name, or
                          the class is not in the module specified by `name`
    :raises ModuleNotFoundError: the module specified by `name` is not found
    """
    module_name, class_name = split_module_class_name(name, text)
    module = importlib.import_module(module_name)
    class_ = getattr(module, class_name, None)
    if class_ is None:
        message = "class %r not found in module %r" % (class_name, module_name)
        raise RuntimeError(message)
    if not isinstance(class_, type):
        message = "%r is not a class" % (name,)
        raise RuntimeError(message)
    return class_


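The split-then-import pattern used by `split_module_class_name` and `dynamic_import_class` can be tried in isolation. The sketch below is a simplified stand-in that relies only on the standard library; the helper names `split_qualified_name` and `import_class` are hypothetical, and `collections.OrderedDict` is just a convenient class that exists in every Python installation.

```python
import importlib
from typing import Tuple


def split_qualified_name(name: str) -> Tuple[str, str]:
    """Split 'pkg.mod.Class' at the last dot into ('pkg.mod', 'Class')."""
    module_name, sep, class_name = name.rpartition(".")
    if not sep:
        raise RuntimeError("invalid class name: %s" % name)
    return module_name, class_name


def import_class(name: str) -> type:
    """Import the module part of `name` and return the class it names."""
    module_name, class_name = split_qualified_name(name)
    module = importlib.import_module(module_name)
    obj = getattr(module, class_name, None)
    # A missing attribute (None) and a non-class attribute are both rejected here.
    if not isinstance(obj, type):
        raise RuntimeError("%r is not a class" % (name,))
    return obj


ordered_dict_cls = import_class("collections.OrderedDict")
```

A name without a dot, or one that resolves to something other than a class (for example `os.path`, which is a module), raises `RuntimeError` in this sketch.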
def processing_phrases(phrase):
    phrase = str(phrase)
    return re.sub("[^A-Za-z0-9\u4e00-\u9fa5 ]", " ", phrase.lower()).strip()


def to_camel_case(phrase):
    s = processing_phrases(phrase).replace(" ", "_")
    return "".join(
        word.capitalize() if i != 0 else word for i, word in enumerate(s.split("_"))
    )


def to_snake_case(name):
    words = re.findall("[A-Za-z][a-z0-9]*", name)
    result = "_".join(words).lower()
    return result
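To see what the three string helpers above produce, here is a small self-contained sketch. The function bodies are condensed copies of the ones in this file so the snippet runs on its own; the sample inputs are arbitrary and chosen only for illustration.

```python
import re


def processing_phrases(phrase):
    # Lowercase, then replace anything that is not ASCII alphanumeric,
    # a CJK character in U+4E00..U+9FA5, or a space with a space.
    return re.sub("[^A-Za-z0-9\u4e00-\u9fa5 ]", " ", str(phrase).lower()).strip()


def to_camel_case(phrase):
    s = processing_phrases(phrase).replace(" ", "_")
    return "".join(
        word.capitalize() if i != 0 else word for i, word in enumerate(s.split("_"))
    )


def to_snake_case(name):
    # Each match starts at a letter and extends over lowercase letters and digits.
    return "_".join(re.findall("[A-Za-z][a-z0-9]*", name)).lower()


camel = to_camel_case("schema-free extractor")
snake = to_snake_case("getSummarizationMetrics")
acronym = to_snake_case("HTTPServer")
```

Note that `to_snake_case` treats every uppercase letter as the start of a new word, so a run of capitals such as `HTTP` is split letter by letter.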