graphiti/graphiti_core/prompts/extract_nodes.py
HUGO SON ce9ef3ca79
Add support for non-ASCII characters in LLM prompts (#805)
* Add support for non-ASCII characters in LLM prompts

- Add ensure_ascii parameter to Graphiti class (default: True)
- Create to_prompt_json helper function for consistent JSON serialization
- Update all prompt files to use new helper function
- Preserve Korean/Japanese/Chinese characters when ensure_ascii=False
- Maintain backward compatibility with existing behavior

Fixes issue where non-ASCII characters were escaped as unicode sequences
in prompts, making them unreadable in LLM logs and potentially affecting
model understanding.
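
As a minimal sketch of the behavior described above, the snippet below shows how `ensure_ascii` changes JSON serialization. The `to_prompt_json` defined here is a hypothetical stand-in that assumes the helper simply wraps `json.dumps`; its defaults and body are assumptions, not the code from `prompt_helpers`.

```python
import json
from typing import Any


def to_prompt_json(data: Any, ensure_ascii: bool = True, indent: int = 2) -> str:
    """Hypothetical stand-in for the helper named in the commit message."""
    return json.dumps(data, ensure_ascii=ensure_ascii, indent=indent)


episode = {"speaker": "민수", "content": "안녕하세요"}

# Default: every non-ASCII character is written as a \uXXXX escape.
escaped = to_prompt_json(episode)

# ensure_ascii=False: the characters are kept as-is and stay readable in logs.
readable = to_prompt_json(episode, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data; only the text that ends up in the prompt differs.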

* Remove unused json imports after replacing with to_prompt_json helper

- Fix ruff lint errors (F401) for unused json imports
- All prompt files now use to_prompt_json helper instead of json.dumps
- Maintains clean code style and passes lint checks

* Fix ensure_ascii propagation to all LLM calls

- Add ensure_ascii parameter to maintenance operation functions that were missing it
- Update function signatures in node_operations, community_operations, temporal_operations, and edge_operations
- Ensure all llm_client.generate_response calls receive proper ensure_ascii context
- Fix hardcoded ensure_ascii: True values that prevented non-ASCII character preservation
- Maintain backward compatibility with default ensure_ascii=True
- Complete the fix for issue #804 ensuring Korean/Japanese/Chinese characters are properly handled in LLM prompts
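
The propagation fix can be pictured roughly as follows. This is a sketch under assumptions: `summarize` and `build_prompt` are hypothetical names, and only the `context.get('ensure_ascii', True)` lookup is taken from the prompt file itself.

```python
import json
from typing import Any


def build_prompt(context: dict[str, Any]) -> str:
    # Same lookup the prompt functions use: fall back to True when the
    # caller did not put 'ensure_ascii' into the context.
    ensure_ascii = context.get('ensure_ascii', True)
    return json.dumps(context['previous_episodes'], ensure_ascii=ensure_ascii, indent=2)


def summarize(episodes: list[str], ensure_ascii: bool = True) -> str:
    # Hypothetical maintenance operation: it forwards its own parameter
    # into the context instead of hardcoding 'ensure_ascii': True.
    context = {'previous_episodes': episodes, 'ensure_ascii': ensure_ascii}
    return build_prompt(context)


print(summarize(['東京で会議']))
print(summarize(['東京で会議'], ensure_ascii=False))
```

With the flag threaded through every call, the default output stays escaped (backward compatible), while callers that opt out get the original characters.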
2025-08-08 11:07:32 -04:00

321 lines
11 KiB
Python

"""
Copyright 2024, Zep Software, Inc.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""

from typing import Any, Protocol, TypedDict

from pydantic import BaseModel, Field

from .models import Message, PromptFunction, PromptVersion
from .prompt_helpers import to_prompt_json


class ExtractedEntity(BaseModel):
    name: str = Field(..., description='Name of the extracted entity')
    entity_type_id: int = Field(
        description='ID of the classified entity type. '
        'Must be one of the provided entity_type_id integers.',
    )


class ExtractedEntities(BaseModel):
    extracted_entities: list[ExtractedEntity] = Field(..., description='List of extracted entities')


class MissedEntities(BaseModel):
    missed_entities: list[str] = Field(..., description="Names of entities that weren't extracted")


class EntityClassificationTriple(BaseModel):
    uuid: str = Field(description='UUID of the entity')
    name: str = Field(description='Name of the entity')
    entity_type: str | None = Field(
        default=None, description='Type of the entity. Must be one of the provided types or None'
    )


class EntityClassification(BaseModel):
    entity_classifications: list[EntityClassificationTriple] = Field(
        ..., description='List of entities classification triples.'
    )


class EntitySummary(BaseModel):
    summary: str = Field(
        ...,
        description='Summary containing the important information about the entity. Under 250 words',
    )


class Prompt(Protocol):
    extract_message: PromptVersion
    extract_json: PromptVersion
    extract_text: PromptVersion
    reflexion: PromptVersion
    classify_nodes: PromptVersion
    extract_attributes: PromptVersion
    extract_summary: PromptVersion


class Versions(TypedDict):
    extract_message: PromptFunction
    extract_json: PromptFunction
    extract_text: PromptFunction
    reflexion: PromptFunction
    classify_nodes: PromptFunction
    extract_attributes: PromptFunction
    extract_summary: PromptFunction


def extract_message(context: dict[str, Any]) -> list[Message]:
    sys_prompt = """You are an AI assistant that extracts entity nodes from conversational messages.
    Your primary task is to extract and classify the speaker and other significant entities mentioned in the conversation."""

    user_prompt = f"""
<ENTITY TYPES>
{context['entity_types']}
</ENTITY TYPES>
<PREVIOUS MESSAGES>
{to_prompt_json([ep for ep in context['previous_episodes']], ensure_ascii=context.get('ensure_ascii', True), indent=2)}
</PREVIOUS MESSAGES>
<CURRENT MESSAGE>
{context['episode_content']}
</CURRENT MESSAGE>
Instructions:
You are given a conversation context and a CURRENT MESSAGE. Your task is to extract **entity nodes** mentioned **explicitly or implicitly** in the CURRENT MESSAGE.
Pronoun references such as he/she/they or this/that/those should be disambiguated to the names of the
reference entities.
1. **Speaker Extraction**: Always extract the speaker (the part before the colon `:` in each dialogue line) as the first entity node.
- If the speaker is mentioned again in the message, treat both mentions as a **single entity**.
2. **Entity Identification**:
- Extract all significant entities, concepts, or actors that are **explicitly or implicitly** mentioned in the CURRENT MESSAGE.
- **Exclude** entities mentioned only in the PREVIOUS MESSAGES (they are for context only).
3. **Entity Classification**:
- Use the descriptions in ENTITY TYPES to classify each extracted entity.
- Assign the appropriate `entity_type_id` for each one.
4. **Exclusions**:
- Do NOT extract entities representing relationships or actions.
- Do NOT extract dates, times, or other temporal information—these will be handled separately.
5. **Formatting**:
- Be **explicit and unambiguous** in naming entities (e.g., use full names when available).
{context['custom_prompt']}
"""
    return [
        Message(role='system', content=sys_prompt),
        Message(role='user', content=user_prompt),
    ]


def extract_json(context: dict[str, Any]) -> list[Message]:
    sys_prompt = """You are an AI assistant that extracts entity nodes from JSON.
    Your primary task is to extract and classify relevant entities from JSON files"""

    user_prompt = f"""
<ENTITY TYPES>
{context['entity_types']}
</ENTITY TYPES>
<SOURCE DESCRIPTION>:
{context['source_description']}
</SOURCE DESCRIPTION>
<JSON>
{context['episode_content']}
</JSON>
{context['custom_prompt']}
Given the above source description and JSON, extract relevant entities from the provided JSON.
For each entity extracted, also determine its entity type based on the provided ENTITY TYPES and their descriptions.
Indicate the classified entity type by providing its entity_type_id.
Guidelines:
1. Always try to extract the entity that the JSON represents. This will often be something like a "name" or "user" field
2. Do NOT extract any properties that contain dates
"""
    return [
        Message(role='system', content=sys_prompt),
        Message(role='user', content=user_prompt),
    ]


def extract_text(context: dict[str, Any]) -> list[Message]:
    sys_prompt = """You are an AI assistant that extracts entity nodes from text.
    Your primary task is to extract and classify the speaker and other significant entities mentioned in the provided text."""

    user_prompt = f"""
<ENTITY TYPES>
{context['entity_types']}
</ENTITY TYPES>
<TEXT>
{context['episode_content']}
</TEXT>
Given the above text, extract entities from the TEXT that are explicitly or implicitly mentioned.
For each entity extracted, also determine its entity type based on the provided ENTITY TYPES and their descriptions.
Indicate the classified entity type by providing its entity_type_id.
{context['custom_prompt']}
Guidelines:
1. Extract significant entities, concepts, or actors mentioned in the text.
2. Avoid creating nodes for relationships or actions.
3. Avoid creating nodes for temporal information like dates, times or years (these will be added to edges later).
4. Be as explicit as possible in your node names, using full names and avoiding abbreviations.
"""
    return [
        Message(role='system', content=sys_prompt),
        Message(role='user', content=user_prompt),
    ]


def reflexion(context: dict[str, Any]) -> list[Message]:
    sys_prompt = """You are an AI assistant that determines which entities have not been extracted from the given context"""

    user_prompt = f"""
<PREVIOUS MESSAGES>
{to_prompt_json([ep for ep in context['previous_episodes']], ensure_ascii=context.get('ensure_ascii', True), indent=2)}
</PREVIOUS MESSAGES>
<CURRENT MESSAGE>
{context['episode_content']}
</CURRENT MESSAGE>
<EXTRACTED ENTITIES>
{context['extracted_entities']}
</EXTRACTED ENTITIES>
Given the above previous messages, current message, and list of extracted entities, determine if any entities haven't been
extracted.
"""
    return [
        Message(role='system', content=sys_prompt),
        Message(role='user', content=user_prompt),
    ]


def classify_nodes(context: dict[str, Any]) -> list[Message]:
    sys_prompt = """You are an AI assistant that classifies entity nodes given the context from which they were extracted"""

    user_prompt = f"""
<PREVIOUS MESSAGES>
{to_prompt_json([ep for ep in context['previous_episodes']], ensure_ascii=context.get('ensure_ascii', True), indent=2)}
</PREVIOUS MESSAGES>
<CURRENT MESSAGE>
{context['episode_content']}
</CURRENT MESSAGE>
<EXTRACTED ENTITIES>
{context['extracted_entities']}
</EXTRACTED ENTITIES>
<ENTITY TYPES>
{context['entity_types']}
</ENTITY TYPES>
Given the above conversation, extracted entities, and provided entity types and their descriptions, classify the extracted entities.
Guidelines:
1. Each entity must have exactly one type
2. Only use the provided ENTITY TYPES as types, do not use additional types to classify entities.
3. If none of the provided entity types accurately classify an extracted node, the type should be set to None
"""
    return [
        Message(role='system', content=sys_prompt),
        Message(role='user', content=user_prompt),
    ]


def extract_attributes(context: dict[str, Any]) -> list[Message]:
    return [
        Message(
            role='system',
            content='You are a helpful assistant that extracts entity properties from the provided text.',
        ),
        Message(
            role='user',
            content=f"""
<MESSAGES>
{to_prompt_json(context['previous_episodes'], ensure_ascii=context.get('ensure_ascii', True), indent=2)}
{to_prompt_json(context['episode_content'], ensure_ascii=context.get('ensure_ascii', True), indent=2)}
</MESSAGES>
Given the above MESSAGES and the following ENTITY, update any of its attributes based on the information provided
in MESSAGES. Use the provided attribute descriptions to better understand how each attribute should be determined.
Guidelines:
1. Do not hallucinate entity property values if they cannot be found in the current context.
2. Only use the provided MESSAGES and ENTITY to set attribute values.
<ENTITY>
{context['node']}
</ENTITY>
""",
        ),
    ]


def extract_summary(context: dict[str, Any]) -> list[Message]:
    return [
        Message(
            role='system',
            content='You are a helpful assistant that extracts entity summaries from the provided text.',
        ),
        Message(
            role='user',
            content=f"""
<MESSAGES>
{to_prompt_json(context['previous_episodes'], ensure_ascii=context.get('ensure_ascii', True), indent=2)}
{to_prompt_json(context['episode_content'], ensure_ascii=context.get('ensure_ascii', True), indent=2)}
</MESSAGES>
Given the above MESSAGES and the following ENTITY, update the summary that combines relevant information about the entity
from the messages and relevant information from the existing summary.
Guidelines:
1. Do not hallucinate entity summary information if it cannot be found in the current context.
2. Only use the provided MESSAGES and ENTITY to set attribute values.
3. The summary attribute represents a summary of the ENTITY, and should be updated with new information about the Entity from the MESSAGES.
Summaries must be no longer than 250 words.
<ENTITY>
{context['node']}
</ENTITY>
""",
        ),
    ]


versions: Versions = {
    'extract_message': extract_message,
    'extract_json': extract_json,
    'extract_text': extract_text,
    'reflexion': reflexion,
    'extract_summary': extract_summary,
    'classify_nodes': classify_nodes,
    'extract_attributes': extract_attributes,
}
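
A rough, self-contained sketch of how a registry like `versions` might be used: look up a prompt function by name, pass it a context dict, and get back a system/user message pair. The `Message` dataclass and the shortened `reflexion` below are stand-ins written for this example (assumptions, not the real `.models.Message` or the full prompt), and the context values are made up.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Message:
    # Stand-in for .models.Message; only the two fields used in this file.
    role: str
    content: str


PromptFunction = Callable[[dict[str, Any]], list[Message]]


def reflexion(context: dict[str, Any]) -> list[Message]:
    # Shortened version of the reflexion prompt, for illustration only.
    sys_prompt = 'You are an AI assistant that determines which entities have not been extracted from the given context'
    user_prompt = f"""
<CURRENT MESSAGE>
{context['episode_content']}
</CURRENT MESSAGE>
<EXTRACTED ENTITIES>
{context['extracted_entities']}
</EXTRACTED ENTITIES>
"""
    return [
        Message(role='system', content=sys_prompt),
        Message(role='user', content=user_prompt),
    ]


# Name-to-function registry, mirroring the shape of the versions dict.
versions: dict[str, PromptFunction] = {'reflexion': reflexion}

messages = versions['reflexion'](
    {'episode_content': 'Alice: I met Bob in Paris.', 'extracted_entities': ['Alice', 'Bob']}
)
for message in messages:
    print(message.role, len(message.content))
```

Keeping every prompt behind the same `context -> list[Message]` signature is what lets a caller select a prompt by key without knowing which context fields it reads.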