KAG/kag/builder/component/vectorizer/batch_vectorizer.py
zhuzhongshu123 13cea5f6fe
feat(kag): update to v0.7 (#456)
* add think cost

* update csv scanner

* add final rerank

* add reasoner

* add iterative planner

* fix dpr search

* fix dpr search

* add reference data

* move odps import

* update requirement.txt

* update 2wiki

* add missing file

* fix markdown reader

* add iterative planning

* update version

* update runner

* update 2wiki example

* update bridge

* merge solver and solver_new

* add cur day

* writer delete

* update multi process

* add missing files

* fix report

* add chunk retrieved executor

* update try in stream runner result

* add path

* add math executor

* update hotpotqa example

* remove log

* fix python coder solver

* update hotpotqa example

* fix python coder solver

* update config

* fix bad

* add log

* remove unused code

* commit with task thought

* move kag model to common

* add default chat llm

* fix

* use static planner

* support chunk graph node

* add args

* support naive rag

* llm client support tool calls

* add default async

* add openai

* fix result

* fix markdown reader

* fix thinker

* update asyncio interface

* feat(solver): add mcp support (#444)

* 上传mcp client相关代码

* 1、完成一套mcp client的调用,从pipeline到planner、executor
2、允许json中传入多个mcp_server,通过大模型进行调用并选择
3、调通baidu_map_mcp的使用

* 1、schema

* bugfix:删减冗余代码

---------

Co-authored-by: wanxingyu.wxy <wanxingyu.wxy@antgroup.com>

* fix affairqa after solver refactor

* fix affairqa after solver refactor

* fix readme

* add params

* update version

* update mcp executor

* update mcp executor

* solver add mcp executor

* add missing file

* add mpc executor

* add executor

* x

* update

* fix requirement

* fix main llm config

* fix solver

* bugfix:修复invoke函数调用逻辑

* chg eva

* update example

* add kag layer

* add step task

* support dot refresh

* support dot refresh

* support dot refresh

* support dot refresh

* add retrieved num

* add retrieved num

* add pipelineconf

* update ppr

* update musique prompts

* update

* add to_dict for BuilderComponentData

* async build

* add deduce prompt

* add deduce prompt

* add deduce prompt

* fix reader

* add deduce prompt

* add page thinker report

* modify prmpt

* add step status

* add self cognition

* add self cognition

* add memory graph storage

* add now time

* update memory config

* add now time

* chg graph loader

* 添加prqa数据集和代码

* bugfix:prqa调用逻辑修复

* optimize:优化代码逻辑,生成答案规范化

* add retry py code

* update memory graph

* update memory graph

* fix

* fix ner

* add with_out_refer generator prompt

* fix

* close ckpt

* fix query

* fix query

* update version

* add llm checker

* add llm checker

* 1、上传evalutor.py以及修改gold_answer.json格式
2、优化代码逻辑
3、修改README.md文件

* update exp

* update exp

* rerank support

* add static rewrite query

* recall more chunks

* fix graph load

* add static rewrite query

* fix bugs

* add finish check

* add finish check

* add finish check

* add finish check

* 1、上传evalutor.py的结果
2、优化代码逻辑,优化readme文件

* add lf retry

* add memory graph api

* fix reader api

* add ner

* add metrics

* fix bug

* remove ner

* add reraise fo retry

* add edge prop to memory graph

* add memory graph

* 1、评测数据集结果修正
2、优化evaluator.py代码
3、删除结果不存在而gold_answer中有答案的问题

* 删除评测结果文件

* fix knext host addr

* async eva

* add lf prompt

* add lf prompt

* add config

* add retry

* add unknown check

* add rc result

* add rc result

* add rc result

* add rc result

* 依据kag pipeline格式修改代码逻辑并通过测试

* bugfix:删除冗余代码

* fix report prompt

* bugfix:触发重试机制

* bugfix:中文符号错误

* fix rethinker prompt

* update version to 0.6.2b78

* update version

* 1、修改evaluator.py,通过大模型计算准确率,符合最新调用逻辑
2、修改prompt,让没有回答的结果重复测试

* update affairqa for evaluate

* update affairqa for evaluate

* bugfix:修正数据集

* bugfix:修正数据集

* bugfix:修正数据集

* fix name conflict

* bugfix:删除错误问题

* bugfix:文件名命名错误导致evaluator失败

* update for affairqa eval

* bugfix:修改代码保持evaluate逻辑一致

* x

* update for affairqa readme

* remove temp eval scripts

* bugfix for math deduce

* merge 0.6.2_dev

* merge 0.6.2_dev

* fix

* update client addr

* updated version

* update for affairqa eval

* evaUtils 支持中文

* fix affairqa eval:

* remove unused example

* update kag config

* fix default value

* update readme

* fix init

* 注释信息修改,并添加部分class说明

* update example config

* Tc 0.7.0 (#459)

* 提交affairQA 代码

* fix affairqa eval

---------

Co-authored-by: zhengke.gzk <zhengke.gzk@antgroup.com>

* fix all examples

* reformat

---------

Co-authored-by: peilong <peilong.zpl@antgroup.com>
Co-authored-by: 锦呈 <zhangxinhong.zxh@antgroup.com>
Co-authored-by: wanxingyu.wxy <wanxingyu.wxy@antgroup.com>
Co-authored-by: zhengke.gzk <zhengke.gzk@antgroup.com>
2025-04-17 17:23:52 +08:00

318 lines
13 KiB
Python

# -*- coding: utf-8 -*-
# Copyright 2023 OpenSPG Authors
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
# in compliance with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied.
import asyncio
from collections import defaultdict
from typing import List
from tenacity import stop_after_attempt, retry
from kag.builder.model.sub_graph import SubGraph
from kag.common.conf import KAG_PROJECT_CONF
from kag.common.utils import get_vector_field_name
from kag.interface import VectorizerABC, VectorizeModelABC
from knext.schema.client import SchemaClient
from knext.schema.model.base import IndexTypeEnum
from knext.common.base.runnable import Input, Output
class EmbeddingVectorPlaceholder(object):
def __init__(self, number, properties, vector_field, property_key, property_value):
self._number = number
self._properties = properties
self._vector_field = vector_field
self._property_key = property_key
self._property_value = property_value
self._embedding_vector = None
def replace(self):
if self._embedding_vector is not None:
self._properties[self._property_key] = self._property_value
self._properties[self._vector_field] = self._embedding_vector
def __repr__(self):
return repr(self._number)
class EmbeddingVectorManager(object):
def __init__(self):
self._placeholders = []
def get_placeholder(self, properties, vector_field):
for property_key, property_value in properties.items():
field_name = get_vector_field_name(property_key)
if field_name != vector_field:
continue
if property_value is None:
return None
if not isinstance(property_value, str):
property_value = str(property_value)
num = len(self._placeholders)
placeholder = EmbeddingVectorPlaceholder(
num, properties, vector_field, property_key, property_value
)
self._placeholders.append(placeholder)
return placeholder
return None
def _get_text_batch(self):
text_batch = dict()
for placeholder in self._placeholders:
property_value = placeholder._property_value
if property_value not in text_batch:
text_batch[property_value] = list()
text_batch[property_value].append(placeholder)
return text_batch
def _generate_vectors(self, vectorizer, text_batch, batch_size=32):
texts = list(text_batch)
if not texts:
return []
if len(texts) % batch_size == 0:
n_batchs = len(texts) // batch_size
else:
n_batchs = len(texts) // batch_size + 1
embeddings = []
for idx in range(n_batchs):
start = idx * batch_size
end = min(start + batch_size, len(texts))
embeddings.extend(vectorizer.vectorize(texts[start:end]))
return embeddings
async def _agenerate_vectors(self, vectorizer, text_batch, batch_size=32):
texts = list(text_batch)
if not texts:
return []
if len(texts) % batch_size == 0:
n_batchs = len(texts) // batch_size
else:
n_batchs = len(texts) // batch_size + 1
tasks = []
for idx in range(n_batchs):
start = idx * batch_size
end = min(start + batch_size, len(texts))
tasks.append(asyncio.create_task(vectorizer.avectorize(texts[start:end])))
results = await asyncio.gather(*tasks)
return [item for sublist in results for item in sublist]
def _fill_vectors(self, vectors, text_batch):
for vector, (_text, placeholders) in zip(vectors, text_batch.items()):
for placeholder in placeholders:
placeholder._embedding_vector = vector
def batch_generate(self, vectorizer, batch_size=32):
text_batch = self._get_text_batch()
vectors = self._generate_vectors(vectorizer, text_batch, batch_size)
self._fill_vectors(vectors, text_batch)
async def abatch_generate(self, vectorizer, batch_size=32):
text_batch = self._get_text_batch()
vectors = await self._agenerate_vectors(vectorizer, text_batch, batch_size)
self._fill_vectors(vectors, text_batch)
def patch(self):
for placeholder in self._placeholders:
placeholder.replace()
class EmbeddingVectorGenerator(object):
def __init__(self, vectorizer, vector_index_meta=None, extra_labels=("Entity",)):
self._vectorizer = vectorizer
self._extra_labels = extra_labels
self._vector_index_meta = vector_index_meta or {}
def batch_generate(self, node_batch, batch_size=32):
manager = EmbeddingVectorManager()
vector_index_meta = self._vector_index_meta
for node_item in node_batch:
label, properties = node_item
labels = [label]
if self._extra_labels:
labels.extend(self._extra_labels)
for label in labels:
if label not in vector_index_meta:
continue
for vector_field in vector_index_meta[label]:
if vector_field in properties:
continue
placeholder = manager.get_placeholder(properties, vector_field)
if placeholder is not None:
properties[vector_field] = placeholder
manager.batch_generate(self._vectorizer, batch_size)
manager.patch()
async def abatch_generate(self, node_batch, batch_size=32):
manager = EmbeddingVectorManager()
vector_index_meta = self._vector_index_meta
for node_item in node_batch:
label, properties = node_item
labels = [label]
if self._extra_labels:
labels.extend(self._extra_labels)
for label in labels:
if label not in vector_index_meta:
continue
for vector_field in vector_index_meta[label]:
if vector_field in properties:
continue
placeholder = manager.get_placeholder(properties, vector_field)
if placeholder is not None:
properties[vector_field] = placeholder
await manager.abatch_generate(self._vectorizer, batch_size)
manager.patch()
@VectorizerABC.register("batch")
@VectorizerABC.register("batch_vectorizer")
class BatchVectorizer(VectorizerABC):
"""
A class for generating embedding vectors for node attributes in a SubGraph in batches.
This class inherits from VectorizerABC and provides the functionality to generate embedding vectors
for node attributes in a SubGraph in batches. It uses a specified vectorization model and processes
the nodes of a specified batch size.
Attributes:
project_id (int): The ID of the project associated with the SubGraph.
vec_meta (defaultdict): Metadata for vector fields in the SubGraph.
vectorize_model (VectorizeModelABC): The model used for generating embedding vectors.
batch_size (int): The size of the batches in which to process the nodes.
"""
def __init__(self, vectorize_model: VectorizeModelABC, batch_size: int = 32):
"""
Initializes the BatchVectorizer with the specified vectorization model and batch size.
Args:
vectorize_model (VectorizeModelABC): The model used for generating embedding vectors.
batch_size (int): The size of the batches in which to process the nodes. Defaults to 32.
"""
super().__init__()
self.project_id = KAG_PROJECT_CONF.project_id
# self._init_graph_store()
self.vec_meta = self._init_vec_meta()
self.vectorize_model = vectorize_model
self.batch_size = batch_size
def _init_vec_meta(self):
"""
Initializes the vector metadata for the SubGraph.
Returns:
defaultdict: Metadata for vector fields in the SubGraph.
"""
vec_meta = defaultdict(list)
schema_client = SchemaClient(
host_addr=KAG_PROJECT_CONF.host_addr, project_id=self.project_id
)
spg_types = schema_client.load()
for type_name, spg_type in spg_types.items():
for prop_name, prop in spg_type.properties.items():
if prop_name == "name" or prop.index_type in [
# if prop.index_type in [
IndexTypeEnum.Vector,
IndexTypeEnum.TextAndVector,
]:
vec_meta[type_name].append(get_vector_field_name(prop_name))
return vec_meta
@retry(stop=stop_after_attempt(3), reraise=True)
def _generate_embedding_vectors(self, input_subgraph: SubGraph) -> SubGraph:
"""
Generates embedding vectors for the nodes in the input SubGraph.
Args:
input_subgraph (SubGraph): The SubGraph for which to generate embedding vectors.
Returns:
SubGraph: The modified SubGraph with generated embedding vectors.
"""
node_list = []
node_batch = []
for node in input_subgraph.nodes:
if not node.id or not node.name:
continue
properties = {"id": node.id, "name": node.name}
properties.update(node.properties)
node_list.append((node, properties))
node_batch.append((node.label, properties.copy()))
generator = EmbeddingVectorGenerator(self.vectorize_model, self.vec_meta)
generator.batch_generate(node_batch, self.batch_size)
for (node, properties), (_node_label, new_properties) in zip(
node_list, node_batch
):
for key, value in properties.items():
if key in new_properties and new_properties[key] == value:
del new_properties[key]
node.properties.update(new_properties)
return input_subgraph
@retry(stop=stop_after_attempt(3), reraise=True)
async def _agenerate_embedding_vectors(self, input_subgraph: SubGraph) -> SubGraph:
"""
Generates embedding vectors for the nodes in the input SubGraph.
Args:
input_subgraph (SubGraph): The SubGraph for which to generate embedding vectors.
Returns:
SubGraph: The modified SubGraph with generated embedding vectors.
"""
node_list = []
node_batch = []
for node in input_subgraph.nodes:
if not node.id or not node.name:
continue
properties = {"id": node.id, "name": node.name}
properties.update(node.properties)
node_list.append((node, properties))
node_batch.append((node.label, properties.copy()))
generator = EmbeddingVectorGenerator(self.vectorize_model, self.vec_meta)
await generator.abatch_generate(node_batch, self.batch_size)
for (node, properties), (_node_label, new_properties) in zip(
node_list, node_batch
):
for key, value in properties.items():
if key in new_properties and new_properties[key] == value:
del new_properties[key]
node.properties.update(new_properties)
return input_subgraph
def _invoke(self, input_subgraph: Input, **kwargs) -> List[Output]:
"""
Invokes the generation of embedding vectors for the input SubGraph.
Args:
input_subgraph (Input): The SubGraph for which to generate embedding vectors.
**kwargs: Additional keyword arguments, currently unused but kept for potential future expansion.
Returns:
List[Output]: A list containing the modified SubGraph with generated embedding vectors.
"""
modified_input = self._generate_embedding_vectors(input_subgraph)
return [modified_input]
async def _ainvoke(self, input_subgraph: Input, **kwargs) -> List[Output]:
"""
Invokes the generation of embedding vectors for the input SubGraph.
Args:
input_subgraph (Input): The SubGraph for which to generate embedding vectors.
**kwargs: Additional keyword arguments, currently unused but kept for potential future expansion.
Returns:
List[Output]: A list containing the modified SubGraph with generated embedding vectors.
"""
modified_input = await self._agenerate_embedding_vectors(input_subgraph)
return [modified_input]