Introduce Project ID for context management, finalizing the stateful API pipeline from file submission to graph construction.

This commit is contained in:
666ghj 2025-11-28 17:21:08 +08:00
parent 9657061b26
commit 08f417f3b7
20 changed files with 2850 additions and 1 deletions

.env.example Normal file

@@ -0,0 +1,8 @@
# Zep Cloud API Key
# Obtain from https://app.getzep.com
ZEP_API_KEY=
# LLM API Key
LLM_API_KEY=
LLM_BASE_URL=
LLM_MODEL_NAME=

.gitignore vendored

@@ -11,6 +11,7 @@
.env.development.local
.env.test.local
.env.production.local
__pycache__/
.vscode
.idea
@@ -19,4 +20,11 @@
.cursor/
mydoc/
mytest/
# Log files
backend/logs/
*.log
# Uploaded files
backend/uploads/

backend/README.md Normal file

@@ -0,0 +1,243 @@
# MiroFish Backend
Backend service for the social public-opinion simulation system, built on the Flask framework.
## Project Structure
```
backend/
├── app/
│   ├── __init__.py            # Flask application factory
│   ├── config.py              # Configuration management
│   ├── api/                   # API routes
│   │   ├── __init__.py
│   │   └── graph.py           # Graph-related endpoints
│   ├── services/              # Business logic layer
│   │   ├── ontology_generator.py  # Ontology generation service
│   │   ├── graph_builder.py       # Graph construction service
│   │   └── text_processor.py      # Text processing service
│   ├── models/                # Data models
│   │   ├── task.py            # Task state management
│   │   └── project.py         # Project context management
│   └── utils/                 # Utility modules
│       ├── file_parser.py     # File parsing
│       └── llm_client.py      # LLM client
├── requirements.txt
└── run.py                     # Entry point
```
## Installation
```bash
conda activate MiroFish
cd backend
pip install -r requirements.txt
```
## Configuration
Configure the following in `MiroFish/.env` at the project root:
```bash
# LLM settings
LLM_API_KEY=your-llm-api-key
LLM_BASE_URL=https://openrouter.ai/api/v1
LLM_MODEL_NAME=gpt-4o-mini
# Zep settings
ZEP_API_KEY=your-zep-api-key
```
## Running the Service
```bash
python run.py
```
The service runs at http://localhost:5000 by default.
---
## API Reference
### Core Workflow
```
1. Create a project (optional)
   POST /api/graph/project/create
2. Upload files + generate ontology (Endpoint 1)
   POST /api/graph/ontology/generate
   → returns project_id
3. Build the graph (Endpoint 2)
   POST /api/graph/build
   → takes project_id
   → returns task_id
4. Poll task progress
   GET /api/graph/task/{task_id}
5. Fetch graph data
   GET /api/graph/data/{graph_id}
```
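The sequence above can be driven end to end by a small client helper. This is a minimal sketch only: the endpoint paths come from this README, while the `post`/`get` callables are injected stand-ins for a real HTTP client (e.g. `requests` against `http://localhost:5000`) and the payload values are placeholders.

```python
import time

def run_workflow(post, get, files, requirement, poll_interval=2.0):
    """Drive the upload -> build -> poll sequence via injected HTTP callables.

    `post(path, **kwargs)` and `get(path)` stand in for requests.post/get
    against the backend base URL and must return the decoded JSON body.
    """
    # Step 2: upload documents and generate the ontology (Endpoint 1)
    resp = post("/api/graph/ontology/generate",
                files=files, data={"simulation_requirement": requirement})
    project_id = resp["data"]["project_id"]

    # Step 3: start the asynchronous graph build (Endpoint 2)
    resp = post("/api/graph/build", json={"project_id": project_id})
    task_id = resp["data"]["task_id"]

    # Steps 4-5: poll the task until it reaches a terminal state
    while True:
        task = get(f"/api/graph/task/{task_id}")["data"]
        if task["status"] in ("completed", "failed"):
            return task
        time.sleep(poll_interval)
```

With `requests`, `post` could be `lambda path, **kw: requests.post(base + path, **kw).json()`; the completed task's result then supplies the `graph_id` for `GET /api/graph/data/{graph_id}`.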
---
### Endpoint 1: Generate Ontology Definition
**POST** `/api/graph/ontology/generate`
Uploads documents, analyzes them, and generates entity and relation type definitions suited to social simulation.
**Request (form-data):**
| Field | Type | Required | Description |
|------|------|------|------|
| `files` | File | ✅ | PDF/MD/TXT files (multiple allowed) |
| `simulation_requirement` | Text | ✅ | Description of the simulation requirement |
| `project_name` | Text | ❌ | Project name |
| `additional_context` | Text | ❌ | Additional notes |
**Example response:**
```json
{
"success": true,
"data": {
"project_id": "proj_abc123def456",
"project_name": "Wuhan University public-opinion analysis",
"ontology": {
"entity_types": [
{
"name": "Person",
"description": "Individuals who can express opinions",
"attributes": [...]
}
],
"edge_types": [
{
"name": "AFFILIATED_WITH",
"description": "Indicates affiliation",
"source_targets": [...]
}
]
},
"analysis_summary": "Analysis summary...",
"files": [
{"filename": "report.pdf", "size": 123456}
],
"total_text_length": 20833
}
}
```
---
### Endpoint 2: Build Graph
**POST** `/api/graph/build`
Builds the Zep knowledge graph for the given `project_id` (asynchronous task).
**Request (JSON):**
```json
{
"project_id": "proj_abc123def456",
"graph_name": "Graph name",
"chunk_size": 500,
"chunk_overlap": 50
}
```
| Field | Type | Required | Description |
|------|------|------|------|
| `project_id` | string | ✅ | Returned by Endpoint 1 |
| `graph_name` | string | ❌ | Graph name |
| `chunk_size` | int | ❌ | Text chunk size (default 500) |
| `chunk_overlap` | int | ❌ | Chunk overlap in characters (default 50) |
**Response:**
```json
{
"success": true,
"data": {
"project_id": "proj_abc123def456",
"task_id": "task_xyz789",
"message": "Graph build task started"
}
}
```
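The `chunk_size` and `chunk_overlap` parameters describe fixed-size chunking in which consecutive chunks share a window of characters, so facts spanning a chunk boundary survive intact in at least one chunk. A minimal sketch of the idea, assuming this behavior; the actual `TextProcessor.split_text` may differ in detail:

```python
def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list:
    """Split text into chunk_size-character pieces; each chunk repeats
    the last `overlap` characters of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # advance by chunk_size minus the shared window
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```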
---
### Task Status Query
**GET** `/api/graph/task/{task_id}`
```json
{
"success": true,
"data": {
"task_id": "task_xyz789",
"status": "processing",
"progress": 45,
"message": "Adding text chunks (15/30)...",
"result": null
}
}
```
**Status values:**
- `pending` - queued
- `processing` - in progress
- `completed` - finished
- `failed` - failed
---
### Project Management Endpoints
| Method | Path | Description |
|------|------|------|
| POST | `/api/graph/project/create` | Create a project |
| GET | `/api/graph/project/{project_id}` | Get project details |
| GET | `/api/graph/project/list` | List all projects |
| DELETE | `/api/graph/project/{project_id}` | Delete a project |
---
### Graph Data Endpoints
| Method | Path | Description |
|------|------|------|
| GET | `/api/graph/data/{graph_id}` | Get graph nodes and edges |
| DELETE | `/api/graph/delete/{graph_id}` | Delete the Zep graph |
---
## Entity Design Principles
This system is designed specifically for social public-opinion simulation, so entities must be concrete, identifiable actors.
**✅ Allowed:**
- Specific individuals (identifiable by name)
- Registered companies, organizations, and institutions
- Media outlets
- Government departments
**❌ Not allowed:**
- Abstract concepts (e.g. "technology", "innovation")
- Emotions, opinions, or trends
- Generic groups (e.g. "users", "consumers")
This is because later stages simulate how each entity reacts to and spreads public opinion, and abstract concepts cannot take part in such a simulation.
---
## Project Status Transitions
```
created → ontology_generated → graph_building → graph_completed
                                     ↓
                                  failed
```
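Read as a transition table, the flow matches the `ProjectStatus` enum in `backend/app/models/project.py`. The explicit transition map below is an illustrative reading of the diagram, in which any in-flight state may drop to `failed` (the `/build` endpoint's `force` flag can additionally reset a project back to `ontology_generated`):

```python
from enum import Enum

class ProjectStatus(str, Enum):
    CREATED = "created"
    ONTOLOGY_GENERATED = "ontology_generated"
    GRAPH_BUILDING = "graph_building"
    GRAPH_COMPLETED = "graph_completed"
    FAILED = "failed"

# Forward transitions; every non-terminal state may also fall to FAILED.
TRANSITIONS = {
    ProjectStatus.CREATED: {ProjectStatus.ONTOLOGY_GENERATED, ProjectStatus.FAILED},
    ProjectStatus.ONTOLOGY_GENERATED: {ProjectStatus.GRAPH_BUILDING, ProjectStatus.FAILED},
    ProjectStatus.GRAPH_BUILDING: {ProjectStatus.GRAPH_COMPLETED, ProjectStatus.FAILED},
}

def can_transition(src: ProjectStatus, dst: ProjectStatus) -> bool:
    """True if dst is a legal successor of src under the map above."""
    return dst in TRANSITIONS.get(src, set())
```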

backend/app/__init__.py Normal file

@@ -0,0 +1,52 @@
"""
MiroFish Backend - Flask application factory
"""
from flask import Flask, request
from flask_cors import CORS
from .config import Config
from .utils.logger import setup_logger, get_logger
def create_app(config_class=Config):
"""Flask application factory."""
app = Flask(__name__)
app.config.from_object(config_class)
# Set up logging
logger = setup_logger('mirofish')
logger.info("=" * 50)
logger.info("MiroFish Backend starting...")
logger.info("=" * 50)
# Enable CORS
CORS(app, resources={r"/api/*": {"origins": "*"}})
# Request logging middleware
@app.before_request
def log_request():
logger = get_logger('mirofish.request')
logger.debug(f"Request: {request.method} {request.path}")
if request.content_type and 'json' in request.content_type:
logger.debug(f"Request body: {request.get_json(silent=True)}")
@app.after_request
def log_response(response):
logger = get_logger('mirofish.request')
logger.debug(f"Response: {response.status_code}")
return response
# Register blueprints
from .api import graph_bp
app.register_blueprint(graph_bp, url_prefix='/api/graph')
# Health check
@app.route('/health')
def health():
return {'status': 'ok', 'service': 'MiroFish Backend'}
logger.info("MiroFish Backend started")
return app

backend/app/api/__init__.py Normal file

@@ -0,0 +1,10 @@
"""
API route module
"""
from flask import Blueprint
graph_bp = Blueprint('graph', __name__)
from . import graph # noqa: E402, F401

backend/app/api/graph.py Normal file

@@ -0,0 +1,657 @@
"""
Graph-related API routes.
Uses a project-context mechanism (state is persisted on the server).
"""
import os
import traceback
import threading
from flask import request, jsonify
from . import graph_bp
from ..config import Config
from ..services.ontology_generator import OntologyGenerator
from ..services.graph_builder import GraphBuilderService
from ..services.text_processor import TextProcessor
from ..utils.file_parser import FileParser
from ..utils.logger import get_logger
from ..models.task import TaskManager, TaskStatus
from ..models.project import ProjectManager, ProjectStatus
# Module logger
logger = get_logger('mirofish.api')
def allowed_file(filename: str) -> bool:
"""Check whether the file extension is allowed."""
if not filename or '.' not in filename:
return False
ext = os.path.splitext(filename)[1].lower().lstrip('.')
return ext in Config.ALLOWED_EXTENSIONS
# ============== Project management endpoints ==============
@graph_bp.route('/project/create', methods=['POST'])
def create_project():
"""
Create a new project.
Request (JSON):
{
    "name": "Project name"  // optional
}
Returns:
{
"success": true,
"data": {
"project_id": "proj_xxxx",
"name": "...",
"status": "created",
...
}
}
"""
try:
data = request.get_json() or {}
name = data.get('name', 'Unnamed Project')
project = ProjectManager.create_project(name=name)
return jsonify({
"success": True,
"data": project.to_dict()
})
except Exception as e:
return jsonify({
"success": False,
"error": str(e),
"traceback": traceback.format_exc()
}), 500
@graph_bp.route('/project/<project_id>', methods=['GET'])
def get_project(project_id: str):
"""
Get project details.
"""
project = ProjectManager.get_project(project_id)
if not project:
return jsonify({
"success": False,
"error": f"Project not found: {project_id}"
}), 404
return jsonify({
"success": True,
"data": project.to_dict()
})
@graph_bp.route('/project/list', methods=['GET'])
def list_projects():
"""
List all projects.
"""
limit = request.args.get('limit', 50, type=int)
projects = ProjectManager.list_projects(limit=limit)
return jsonify({
"success": True,
"data": [p.to_dict() for p in projects],
"count": len(projects)
})
@graph_bp.route('/project/<project_id>', methods=['DELETE'])
def delete_project(project_id: str):
"""
Delete a project.
"""
success = ProjectManager.delete_project(project_id)
if not success:
return jsonify({
"success": False,
"error": f"Project not found or deletion failed: {project_id}"
}), 404
return jsonify({
"success": True,
"message": f"Project deleted: {project_id}"
})
@graph_bp.route('/project/<project_id>/reset', methods=['POST'])
def reset_project(project_id: str):
"""
Reset the project status (to rebuild the graph).
"""
project = ProjectManager.get_project(project_id)
if not project:
return jsonify({
"success": False,
"error": f"Project not found: {project_id}"
}), 404
# Reset to the ontology-generated state
if project.ontology:
project.status = ProjectStatus.ONTOLOGY_GENERATED
else:
project.status = ProjectStatus.CREATED
project.graph_id = None
project.graph_build_task_id = None
project.error = None
ProjectManager.save_project(project)
return jsonify({
"success": True,
"message": f"Project reset: {project_id}",
"data": project.to_dict()
})
# ============== Endpoint 1: upload files and generate ontology ==============
@graph_bp.route('/ontology/generate', methods=['POST'])
def generate_ontology():
"""
Endpoint 1: upload files, analyze them, and generate the ontology definition.
Request type: multipart/form-data
Parameters:
    files: uploaded files (PDF/MD/TXT, multiple allowed)
    simulation_requirement: description of the simulation requirement (required)
    project_name: project name (optional)
    additional_context: additional notes (optional)
Returns:
{
"success": true,
"data": {
"project_id": "proj_xxxx",
"ontology": {
"entity_types": [...],
"edge_types": [...],
"analysis_summary": "..."
},
"files": [...],
"total_text_length": 12345
}
}
"""
try:
logger.info("=== Generating ontology definition ===")
# Read parameters
simulation_requirement = request.form.get('simulation_requirement', '')
project_name = request.form.get('project_name', 'Unnamed Project')
additional_context = request.form.get('additional_context', '')
logger.debug(f"Project name: {project_name}")
logger.debug(f"Simulation requirement: {simulation_requirement[:100]}...")
if not simulation_requirement:
return jsonify({
"success": False,
"error": "Please provide a simulation requirement description (simulation_requirement)"
}), 400
# Read the uploaded files
uploaded_files = request.files.getlist('files')
if not uploaded_files or all(not f.filename for f in uploaded_files):
return jsonify({
"success": False,
"error": "Please upload at least one document file"
}), 400
# Create the project
project = ProjectManager.create_project(name=project_name)
project.simulation_requirement = simulation_requirement
logger.info(f"Project created: {project.project_id}")
# Save files and extract text
document_texts = []
all_text = ""
for file in uploaded_files:
if file and file.filename and allowed_file(file.filename):
# Save the file into the project directory
file_info = ProjectManager.save_file_to_project(
project.project_id,
file,
file.filename
)
project.files.append({
"filename": file_info["original_filename"],
"size": file_info["size"]
})
# Extract text
text = FileParser.extract_text(file_info["path"])
text = TextProcessor.preprocess_text(text)
document_texts.append(text)
all_text += f"\n\n=== {file_info['original_filename']} ===\n{text}"
if not document_texts:
ProjectManager.delete_project(project.project_id)
return jsonify({
"success": False,
"error": "No documents were processed successfully; please check the file formats"
}), 400
# Persist the extracted text
project.total_text_length = len(all_text)
ProjectManager.save_extracted_text(project.project_id, all_text)
logger.info(f"Text extraction finished, {len(all_text)} characters in total")
# Generate the ontology
logger.info("Calling the LLM to generate the ontology definition...")
generator = OntologyGenerator()
ontology = generator.generate(
document_texts=document_texts,
simulation_requirement=simulation_requirement,
additional_context=additional_context if additional_context else None
)
# Save the ontology to the project
entity_count = len(ontology.get("entity_types", []))
edge_count = len(ontology.get("edge_types", []))
logger.info(f"Ontology generated: {entity_count} entity types, {edge_count} edge types")
project.ontology = {
"entity_types": ontology.get("entity_types", []),
"edge_types": ontology.get("edge_types", [])
}
project.analysis_summary = ontology.get("analysis_summary", "")
project.status = ProjectStatus.ONTOLOGY_GENERATED
ProjectManager.save_project(project)
logger.info(f"=== Ontology generation complete === project_id: {project.project_id}")
return jsonify({
"success": True,
"data": {
"project_id": project.project_id,
"project_name": project.name,
"ontology": project.ontology,
"analysis_summary": project.analysis_summary,
"files": project.files,
"total_text_length": project.total_text_length
}
})
except Exception as e:
return jsonify({
"success": False,
"error": str(e),
"traceback": traceback.format_exc()
}), 500
# ============== Endpoint 2: build graph ==============
@graph_bp.route('/build', methods=['POST'])
def build_graph():
"""
Endpoint 2: build the graph for the given project_id.
Request (JSON):
{
    "project_id": "proj_xxxx",   // required; returned by Endpoint 1
    "graph_name": "Graph name",  // optional
    "chunk_size": 500,           // optional, default 500
    "chunk_overlap": 50          // optional, default 50
}
Returns:
{
    "success": true,
    "data": {
        "project_id": "proj_xxxx",
        "task_id": "task_xxxx",
        "message": "Graph build task started"
}
}
"""
try:
logger.info("=== Building graph ===")
# Validate configuration
errors = []
if not Config.ZEP_API_KEY:
errors.append("ZEP_API_KEY is not configured")
if errors:
logger.error(f"Configuration errors: {errors}")
return jsonify({
"success": False,
"error": "Configuration error: " + "; ".join(errors)
}), 500
# Parse the request
data = request.get_json() or {}
project_id = data.get('project_id')
logger.debug(f"Request params: project_id={project_id}")
if not project_id:
return jsonify({
"success": False,
"error": "Please provide project_id"
}), 400
# Load the project
project = ProjectManager.get_project(project_id)
if not project:
return jsonify({
"success": False,
"error": f"Project not found: {project_id}"
}), 404
# Check the project status
force = data.get('force', False)  # force a rebuild
if project.status == ProjectStatus.CREATED:
return jsonify({
"success": False,
"error": "The project has no ontology yet; call /ontology/generate first"
}), 400
if project.status == ProjectStatus.GRAPH_BUILDING and not force:
return jsonify({
"success": False,
"error": "The graph is already being built; do not resubmit. To force a rebuild, pass force: true",
"task_id": project.graph_build_task_id
}), 400
# On a forced rebuild, reset the state
if force and project.status in [ProjectStatus.GRAPH_BUILDING, ProjectStatus.FAILED, ProjectStatus.GRAPH_COMPLETED]:
project.status = ProjectStatus.ONTOLOGY_GENERATED
project.graph_id = None
project.graph_build_task_id = None
project.error = None
# Read the configuration
graph_name = data.get('graph_name', project.name or 'MiroFish Graph')
chunk_size = data.get('chunk_size', project.chunk_size or Config.DEFAULT_CHUNK_SIZE)
chunk_overlap = data.get('chunk_overlap', project.chunk_overlap or Config.DEFAULT_CHUNK_OVERLAP)
# Update the project configuration
project.chunk_size = chunk_size
project.chunk_overlap = chunk_overlap
# Load the extracted text
text = ProjectManager.get_extracted_text(project_id)
if not text:
return jsonify({
"success": False,
"error": "No extracted text found"
}), 400
# Load the ontology
ontology = project.ontology
if not ontology:
return jsonify({
"success": False,
"error": "No ontology definition found"
}), 400
# Create the asynchronous task
task_manager = TaskManager()
task_id = task_manager.create_task(f"Build graph: {graph_name}")
logger.info(f"Graph build task created: task_id={task_id}, project_id={project_id}")
# Update the project status
project.status = ProjectStatus.GRAPH_BUILDING
project.graph_build_task_id = task_id
ProjectManager.save_project(project)
# Launch the background task
def build_task():
build_logger = get_logger('mirofish.build')
try:
build_logger.info(f"[{task_id}] Building graph...")
task_manager.update_task(
task_id,
status=TaskStatus.PROCESSING,
message="Initializing the graph builder service..."
)
# Create the graph builder service
builder = GraphBuilderService(api_key=Config.ZEP_API_KEY)
# Split text into chunks
task_manager.update_task(
task_id,
message="Splitting text into chunks...",
progress=5
)
chunks = TextProcessor.split_text(
text,
chunk_size=chunk_size,
overlap=chunk_overlap
)
total_chunks = len(chunks)
# Create the graph
task_manager.update_task(
task_id,
message="Creating the Zep graph...",
progress=10
)
graph_id = builder.create_graph(name=graph_name)
# Record the graph_id on the project
project.graph_id = graph_id
ProjectManager.save_project(project)
# Apply the ontology
task_manager.update_task(
task_id,
message="Applying the ontology definition...",
progress=15
)
builder.set_ontology(graph_id, ontology)
# Add text (the progress_callback signature is (msg, progress_ratio))
def add_progress_callback(msg, progress_ratio):
progress = 15 + int(progress_ratio * 40) # 15% - 55%
task_manager.update_task(
task_id,
message=msg,
progress=progress
)
task_manager.update_task(
task_id,
message=f"Adding {total_chunks} text chunks...",
progress=15
)
episode_uuids = builder.add_text_batches(
graph_id,
chunks,
batch_size=3,
progress_callback=add_progress_callback
)
# Wait for Zep to finish processing (poll each episode's processed status)
task_manager.update_task(
task_id,
message="Waiting for Zep to process the data...",
progress=55
)
def wait_progress_callback(msg, progress_ratio):
progress = 55 + int(progress_ratio * 35) # 55% - 90%
task_manager.update_task(
task_id,
message=msg,
progress=progress
)
builder._wait_for_episodes(episode_uuids, wait_progress_callback)
# Fetch the graph data
task_manager.update_task(
task_id,
message="Fetching graph data...",
progress=95
)
graph_data = builder.get_graph_data(graph_id)
# Update the project status
project.status = ProjectStatus.GRAPH_COMPLETED
ProjectManager.save_project(project)
node_count = graph_data.get("node_count", 0)
edge_count = graph_data.get("edge_count", 0)
build_logger.info(f"[{task_id}] Graph build complete: graph_id={graph_id}, nodes={node_count}, edges={edge_count}")
# Done
task_manager.update_task(
task_id,
status=TaskStatus.COMPLETED,
message="Graph build complete",
progress=100,
result={
"project_id": project_id,
"graph_id": graph_id,
"node_count": node_count,
"edge_count": edge_count,
"chunk_count": total_chunks
}
)
except Exception as e:
# Mark the project as failed
build_logger.error(f"[{task_id}] Graph build failed: {str(e)}")
build_logger.debug(traceback.format_exc())
project.status = ProjectStatus.FAILED
project.error = str(e)
ProjectManager.save_project(project)
task_manager.update_task(
task_id,
status=TaskStatus.FAILED,
message=f"Build failed: {str(e)}",
error=traceback.format_exc()
)
# Start the background thread
thread = threading.Thread(target=build_task, daemon=True)
thread.start()
return jsonify({
"success": True,
"data": {
"project_id": project_id,
"task_id": task_id,
"message": "Graph build task started; poll /task/{task_id} for progress"
}
})
except Exception as e:
return jsonify({
"success": False,
"error": str(e),
"traceback": traceback.format_exc()
}), 500
# ============== Task query endpoints ==============
@graph_bp.route('/task/<task_id>', methods=['GET'])
def get_task(task_id: str):
"""
Query the task status.
"""
task = TaskManager().get_task(task_id)
if not task:
return jsonify({
"success": False,
"error": f"Task not found: {task_id}"
}), 404
return jsonify({
"success": True,
"data": task.to_dict()
})
@graph_bp.route('/tasks', methods=['GET'])
def list_tasks():
"""
List all tasks.
"""
tasks = TaskManager().list_tasks()
return jsonify({
"success": True,
"data": tasks,  # TaskManager.list_tasks() already returns dicts
"count": len(tasks)
})
# ============== Graph data endpoints ==============
@graph_bp.route('/data/<graph_id>', methods=['GET'])
def get_graph_data(graph_id: str):
"""
Get graph data (nodes and edges).
"""
try:
if not Config.ZEP_API_KEY:
return jsonify({
"success": False,
"error": "ZEP_API_KEY is not configured"
}), 500
builder = GraphBuilderService(api_key=Config.ZEP_API_KEY)
graph_data = builder.get_graph_data(graph_id)
return jsonify({
"success": True,
"data": graph_data
})
except Exception as e:
return jsonify({
"success": False,
"error": str(e),
"traceback": traceback.format_exc()
}), 500
@graph_bp.route('/delete/<graph_id>', methods=['DELETE'])
def delete_graph(graph_id: str):
"""
Delete a Zep graph.
"""
try:
if not Config.ZEP_API_KEY:
return jsonify({
"success": False,
"error": "ZEP_API_KEY is not configured"
}), 500
builder = GraphBuilderService(api_key=Config.ZEP_API_KEY)
builder.delete_graph(graph_id)
return jsonify({
"success": True,
"message": f"Graph deleted: {graph_id}"
})
except Exception as e:
return jsonify({
"success": False,
"error": str(e),
"traceback": traceback.format_exc()
}), 500

backend/app/config.py Normal file

@@ -0,0 +1,53 @@
"""
Configuration management.
Loads all configuration from the .env file at the project root.
"""
import os
from dotenv import load_dotenv
# Load the .env file from the project root
# Path: MiroFish/.env (relative to backend/app/config.py)
project_root_env = os.path.join(os.path.dirname(__file__), '../../.env')
if os.path.exists(project_root_env):
load_dotenv(project_root_env)
else:
# If the root has no .env, fall back to environment variables (for production)
load_dotenv()
class Config:
"""Flask configuration."""
# Flask settings
SECRET_KEY = os.environ.get('SECRET_KEY', 'mirofish-secret-key')
DEBUG = os.environ.get('FLASK_DEBUG', 'True').lower() == 'true'
# LLM settings (OpenAI-compatible format)
LLM_API_KEY = os.environ.get('LLM_API_KEY')
LLM_BASE_URL = os.environ.get('LLM_BASE_URL', 'https://api.openai.com/v1')
LLM_MODEL_NAME = os.environ.get('LLM_MODEL_NAME', 'gpt-4o-mini')
# Zep settings
ZEP_API_KEY = os.environ.get('ZEP_API_KEY')
# File upload settings
MAX_CONTENT_LENGTH = 50 * 1024 * 1024 # 50MB
UPLOAD_FOLDER = os.path.join(os.path.dirname(__file__), '../uploads')
ALLOWED_EXTENSIONS = {'pdf', 'md', 'txt', 'markdown'}
# Text processing settings
DEFAULT_CHUNK_SIZE = 500  # default chunk size
DEFAULT_CHUNK_OVERLAP = 50  # default overlap size
@classmethod
def validate(cls):
"""Validate required configuration."""
errors = []
if not cls.LLM_API_KEY:
errors.append("LLM_API_KEY is not configured")
if not cls.ZEP_API_KEY:
errors.append("ZEP_API_KEY is not configured")
return errors

backend/app/models/__init__.py Normal file

@@ -0,0 +1,9 @@
"""
Data model module
"""
from .task import TaskManager, TaskStatus
from .project import Project, ProjectStatus, ProjectManager
__all__ = ['TaskManager', 'TaskStatus', 'Project', 'ProjectStatus', 'ProjectManager']

backend/app/models/project.py Normal file

@@ -0,0 +1,305 @@
"""
Project context management.
Persists project state on the server so the frontend does not have to pass large payloads between endpoints.
"""
import os
import json
import uuid
import shutil
from datetime import datetime
from typing import Dict, Any, List, Optional
from enum import Enum
from dataclasses import dataclass, field, asdict
from ..config import Config
class ProjectStatus(str, Enum):
"""Project status."""
CREATED = "created"  # just created, files uploaded
ONTOLOGY_GENERATED = "ontology_generated"  # ontology generated
GRAPH_BUILDING = "graph_building"  # graph build in progress
GRAPH_COMPLETED = "graph_completed"  # graph build complete
FAILED = "failed"  # failed
@dataclass
class Project:
"""Project data model."""
project_id: str
name: str
status: ProjectStatus
created_at: str
updated_at: str
# File info
files: List[Dict[str, str]] = field(default_factory=list)  # [{filename, path, size}]
total_text_length: int = 0
# Ontology info (filled after Endpoint 1)
ontology: Optional[Dict[str, Any]] = None
analysis_summary: Optional[str] = None
# Graph info (filled after Endpoint 2)
graph_id: Optional[str] = None
graph_build_task_id: Optional[str] = None
# Configuration
simulation_requirement: Optional[str] = None
chunk_size: int = 500
chunk_overlap: int = 50
# Error info
error: Optional[str] = None
def to_dict(self) -> Dict[str, Any]:
"""Convert to a dict."""
return {
"project_id": self.project_id,
"name": self.name,
"status": self.status.value if isinstance(self.status, ProjectStatus) else self.status,
"created_at": self.created_at,
"updated_at": self.updated_at,
"files": self.files,
"total_text_length": self.total_text_length,
"ontology": self.ontology,
"analysis_summary": self.analysis_summary,
"graph_id": self.graph_id,
"graph_build_task_id": self.graph_build_task_id,
"simulation_requirement": self.simulation_requirement,
"chunk_size": self.chunk_size,
"chunk_overlap": self.chunk_overlap,
"error": self.error
}
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> 'Project':
"""Create from a dict."""
status = data.get('status', 'created')
if isinstance(status, str):
status = ProjectStatus(status)
return cls(
project_id=data['project_id'],
name=data.get('name', 'Unnamed Project'),
status=status,
created_at=data.get('created_at', ''),
updated_at=data.get('updated_at', ''),
files=data.get('files', []),
total_text_length=data.get('total_text_length', 0),
ontology=data.get('ontology'),
analysis_summary=data.get('analysis_summary'),
graph_id=data.get('graph_id'),
graph_build_task_id=data.get('graph_build_task_id'),
simulation_requirement=data.get('simulation_requirement'),
chunk_size=data.get('chunk_size', 500),
chunk_overlap=data.get('chunk_overlap', 50),
error=data.get('error')
)
class ProjectManager:
"""Project manager - persists and retrieves projects."""
# Root directory for project storage
PROJECTS_DIR = os.path.join(Config.UPLOAD_FOLDER, 'projects')
@classmethod
def _ensure_projects_dir(cls):
"""Ensure the projects directory exists."""
os.makedirs(cls.PROJECTS_DIR, exist_ok=True)
@classmethod
def _get_project_dir(cls, project_id: str) -> str:
"""Path to a project's directory."""
return os.path.join(cls.PROJECTS_DIR, project_id)
@classmethod
def _get_project_meta_path(cls, project_id: str) -> str:
"""Path to a project's metadata file."""
return os.path.join(cls._get_project_dir(project_id), 'project.json')
@classmethod
def _get_project_files_dir(cls, project_id: str) -> str:
"""Path to a project's file storage directory."""
return os.path.join(cls._get_project_dir(project_id), 'files')
@classmethod
def _get_project_text_path(cls, project_id: str) -> str:
"""Path to a project's extracted-text file."""
return os.path.join(cls._get_project_dir(project_id), 'extracted_text.txt')
@classmethod
def create_project(cls, name: str = "Unnamed Project") -> Project:
"""
Create a new project.
Args:
    name: project name
Returns:
    The newly created Project object.
"""
cls._ensure_projects_dir()
project_id = f"proj_{uuid.uuid4().hex[:12]}"
now = datetime.now().isoformat()
project = Project(
project_id=project_id,
name=name,
status=ProjectStatus.CREATED,
created_at=now,
updated_at=now
)
# Create the project directory structure
project_dir = cls._get_project_dir(project_id)
files_dir = cls._get_project_files_dir(project_id)
os.makedirs(project_dir, exist_ok=True)
os.makedirs(files_dir, exist_ok=True)
# Save the project metadata
cls.save_project(project)
return project
@classmethod
def save_project(cls, project: Project) -> None:
"""Save the project metadata."""
project.updated_at = datetime.now().isoformat()
meta_path = cls._get_project_meta_path(project.project_id)
with open(meta_path, 'w', encoding='utf-8') as f:
json.dump(project.to_dict(), f, ensure_ascii=False, indent=2)
@classmethod
def get_project(cls, project_id: str) -> Optional[Project]:
"""
Load a project.
Args:
    project_id: project ID
Returns:
    The Project object, or None if it does not exist.
"""
meta_path = cls._get_project_meta_path(project_id)
if not os.path.exists(meta_path):
return None
with open(meta_path, 'r', encoding='utf-8') as f:
data = json.load(f)
return Project.from_dict(data)
@classmethod
def list_projects(cls, limit: int = 50) -> List[Project]:
"""
List all projects.
Args:
    limit: maximum number of results
Returns:
    List of projects, newest first.
"""
cls._ensure_projects_dir()
projects = []
for project_id in os.listdir(cls.PROJECTS_DIR):
project = cls.get_project(project_id)
if project:
projects.append(project)
# Sort by creation time, newest first
projects.sort(key=lambda p: p.created_at, reverse=True)
return projects[:limit]
@classmethod
def delete_project(cls, project_id: str) -> bool:
"""
Delete a project and all of its files.
Args:
    project_id: project ID
Returns:
    Whether the deletion succeeded.
"""
project_dir = cls._get_project_dir(project_id)
if not os.path.exists(project_dir):
return False
shutil.rmtree(project_dir)
return True
@classmethod
def save_file_to_project(cls, project_id: str, file_storage, original_filename: str) -> Dict[str, str]:
"""
Save an uploaded file into the project directory.
Args:
    project_id: project ID
    file_storage: Flask FileStorage object
    original_filename: original filename
Returns:
    File info dict {filename, path, size}.
"""
files_dir = cls._get_project_files_dir(project_id)
os.makedirs(files_dir, exist_ok=True)
# Generate a safe filename
ext = os.path.splitext(original_filename)[1].lower()
safe_filename = f"{uuid.uuid4().hex[:8]}{ext}"
file_path = os.path.join(files_dir, safe_filename)
# Save the file
file_storage.save(file_path)
# Get the file size
file_size = os.path.getsize(file_path)
return {
"original_filename": original_filename,
"saved_filename": safe_filename,
"path": file_path,
"size": file_size
}
@classmethod
def save_extracted_text(cls, project_id: str, text: str) -> None:
"""Persist the extracted text."""
text_path = cls._get_project_text_path(project_id)
with open(text_path, 'w', encoding='utf-8') as f:
f.write(text)
@classmethod
def get_extracted_text(cls, project_id: str) -> Optional[str]:
"""Load the extracted text."""
text_path = cls._get_project_text_path(project_id)
if not os.path.exists(text_path):
return None
with open(text_path, 'r', encoding='utf-8') as f:
return f.read()
@classmethod
def get_project_files(cls, project_id: str) -> List[str]:
"""All file paths belonging to a project."""
files_dir = cls._get_project_files_dir(project_id)
if not os.path.exists(files_dir):
return []
return [
os.path.join(files_dir, f)
for f in os.listdir(files_dir)
if os.path.isfile(os.path.join(files_dir, f))
]

backend/app/models/task.py Normal file

@@ -0,0 +1,178 @@
"""
Task state management.
Tracks long-running tasks such as graph builds.
"""
import uuid
import threading
from datetime import datetime
from enum import Enum
from typing import Dict, Any, Optional
from dataclasses import dataclass, field
class TaskStatus(str, Enum):
"""Task status enum."""
PENDING = "pending"  # queued
PROCESSING = "processing"  # in progress
COMPLETED = "completed"  # finished
FAILED = "failed"  # failed
@dataclass
class Task:
"""Task data class."""
task_id: str
task_type: str
status: TaskStatus
created_at: datetime
updated_at: datetime
progress: int = 0  # progress percentage, 0-100
message: str = ""  # status message
result: Optional[Dict] = None  # task result
error: Optional[str] = None  # error info
metadata: Dict = field(default_factory=dict)  # extra metadata
def to_dict(self) -> Dict[str, Any]:
"""Convert to a dict."""
return {
"task_id": self.task_id,
"task_type": self.task_type,
"status": self.status.value,
"created_at": self.created_at.isoformat(),
"updated_at": self.updated_at.isoformat(),
"progress": self.progress,
"message": self.message,
"result": self.result,
"error": self.error,
"metadata": self.metadata,
}
class TaskManager:
"""
Task manager.
Thread-safe task state management.
"""
_instance = None
_lock = threading.Lock()
def __new__(cls):
"""Singleton."""
if cls._instance is None:
with cls._lock:
if cls._instance is None:
cls._instance = super().__new__(cls)
cls._instance._tasks: Dict[str, Task] = {}
cls._instance._task_lock = threading.Lock()
return cls._instance
def create_task(self, task_type: str, metadata: Optional[Dict] = None) -> str:
"""
Create a new task.
Args:
    task_type: task type
    metadata: extra metadata
Returns:
    The task ID.
"""
task_id = str(uuid.uuid4())
now = datetime.now()
task = Task(
task_id=task_id,
task_type=task_type,
status=TaskStatus.PENDING,
created_at=now,
updated_at=now,
metadata=metadata or {}
)
with self._task_lock:
self._tasks[task_id] = task
return task_id
def get_task(self, task_id: str) -> Optional[Task]:
"""Get a task."""
with self._task_lock:
return self._tasks.get(task_id)
def update_task(
self,
task_id: str,
status: Optional[TaskStatus] = None,
progress: Optional[int] = None,
message: Optional[str] = None,
result: Optional[Dict] = None,
error: Optional[str] = None
):
"""
Update a task's state.
Args:
    task_id: task ID
    status: new status
    progress: progress
    message: message
    result: result
    error: error info
"""
with self._task_lock:
task = self._tasks.get(task_id)
if task:
task.updated_at = datetime.now()
if status is not None:
task.status = status
if progress is not None:
task.progress = progress
if message is not None:
task.message = message
if result is not None:
task.result = result
if error is not None:
task.error = error
def complete_task(self, task_id: str, result: Dict):
"""Mark a task as completed."""
self.update_task(
task_id,
status=TaskStatus.COMPLETED,
progress=100,
message="Task completed",
result=result
)
def fail_task(self, task_id: str, error: str):
"""Mark a task as failed."""
self.update_task(
task_id,
status=TaskStatus.FAILED,
message="Task failed",
error=error
)
def list_tasks(self, task_type: Optional[str] = None) -> list:
"""List tasks."""
with self._task_lock:
tasks = list(self._tasks.values())
if task_type:
tasks = [t for t in tasks if t.task_type == task_type]
return [t.to_dict() for t in sorted(tasks, key=lambda x: x.created_at, reverse=True)]
def cleanup_old_tasks(self, max_age_hours: int = 24):
"""Clean up old finished tasks."""
from datetime import timedelta
cutoff = datetime.now() - timedelta(hours=max_age_hours)
with self._task_lock:
old_ids = [
tid for tid, task in self._tasks.items()
if task.created_at < cutoff and task.status in [TaskStatus.COMPLETED, TaskStatus.FAILED]
]
for tid in old_ids:
del self._tasks[tid]
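The `__new__` override above is double-checked locking: the unlocked first check keeps the common path cheap, and the second check under the lock guarantees only one instance is ever created, so every handler that constructs a `TaskManager` shares the same task table. A stripped-down sketch of the same pattern (`SingletonRegistry` is an illustrative name):

```python
import threading

class SingletonRegistry:
    """Minimal double-checked-locking singleton, mirroring TaskManager."""
    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        if cls._instance is None:              # fast path, no lock taken
            with cls._lock:
                if cls._instance is None:      # re-check under the lock
                    cls._instance = super().__new__(cls)
                    cls._instance.tasks = {}   # shared process-wide state
        return cls._instance
```

This is why `graph.py` can call `TaskManager()` both in a request handler and in the background build thread and still observe the same tasks.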

backend/app/services/__init__.py Normal file

@@ -0,0 +1,10 @@
"""
Business service module
"""
from .ontology_generator import OntologyGenerator
from .graph_builder import GraphBuilderService
from .text_processor import TextProcessor
__all__ = ['OntologyGenerator', 'GraphBuilderService', 'TextProcessor']

backend/app/services/graph_builder.py Normal file

@@ -0,0 +1,457 @@
"""
Graph construction service.
Endpoint 2: builds a standalone graph via the Zep API.
"""
import os
import uuid
import time
import threading
from typing import Dict, Any, List, Optional, Callable
from dataclasses import dataclass
from zep_cloud.client import Zep
from zep_cloud import EpisodeData, EntityEdgeSourceTarget
from ..config import Config
from ..models.task import TaskManager, TaskStatus
from .text_processor import TextProcessor
@dataclass
class GraphInfo:
"""Graph info."""
graph_id: str
node_count: int
edge_count: int
entity_types: List[str]
def to_dict(self) -> Dict[str, Any]:
return {
"graph_id": self.graph_id,
"node_count": self.node_count,
"edge_count": self.edge_count,
"entity_types": self.entity_types,
}
class GraphBuilderService:
"""
Graph builder service.
Calls the Zep API to construct the knowledge graph.
"""
def __init__(self, api_key: Optional[str] = None):
self.api_key = api_key or Config.ZEP_API_KEY
if not self.api_key:
raise ValueError("ZEP_API_KEY is not configured")
self.client = Zep(api_key=self.api_key)
self.task_manager = TaskManager()
def build_graph_async(
self,
text: str,
ontology: Dict[str, Any],
graph_name: str = "MiroFish Graph",
chunk_size: int = 500,
chunk_overlap: int = 50,
batch_size: int = 3
) -> str:
"""
Build the graph asynchronously.
Args:
    text: input text
    ontology: ontology definition (output of Endpoint 1)
    graph_name: graph name
    chunk_size: text chunk size
    chunk_overlap: chunk overlap size
    batch_size: number of chunks per batch
Returns:
    The task ID.
"""
# Create the task
task_id = self.task_manager.create_task(
task_type="graph_build",
metadata={
"graph_name": graph_name,
"chunk_size": chunk_size,
"text_length": len(text),
}
)
# 在后台线程中执行构建
thread = threading.Thread(
target=self._build_graph_worker,
args=(task_id, text, ontology, graph_name, chunk_size, chunk_overlap, batch_size)
)
thread.daemon = True
thread.start()
return task_id
def _build_graph_worker(
self,
task_id: str,
text: str,
ontology: Dict[str, Any],
graph_name: str,
chunk_size: int,
chunk_overlap: int,
batch_size: int
):
"""图谱构建工作线程"""
try:
self.task_manager.update_task(
task_id,
status=TaskStatus.PROCESSING,
progress=5,
message="开始构建图谱..."
)
# 1. 创建图谱
graph_id = self.create_graph(graph_name)
self.task_manager.update_task(
task_id,
progress=10,
message=f"图谱已创建: {graph_id}"
)
# 2. 设置本体
self.set_ontology(graph_id, ontology)
self.task_manager.update_task(
task_id,
progress=15,
message="本体已设置"
)
# 3. 文本分块
chunks = TextProcessor.split_text(text, chunk_size, chunk_overlap)
total_chunks = len(chunks)
self.task_manager.update_task(
task_id,
progress=20,
message=f"文本已分割为 {total_chunks} 个块"
)
# 4. 分批发送数据
episode_uuids = self.add_text_batches(
graph_id, chunks, batch_size,
lambda msg, prog: self.task_manager.update_task(
task_id,
progress=20 + int(prog * 40),  # 20-60%
message=msg
)
)
# 5. 等待Zep处理完成
self.task_manager.update_task(
task_id,
progress=60,
message="等待Zep处理数据..."
)
self._wait_for_episodes(
episode_uuids,
lambda msg, prog: self.task_manager.update_task(
task_id,
progress=60 + int(prog * 30),  # 60-90%
message=msg
)
)
# 6. 获取图谱信息
self.task_manager.update_task(
task_id,
progress=90,
message="获取图谱信息..."
)
graph_info = self._get_graph_info(graph_id)
# 完成
self.task_manager.complete_task(task_id, {
"graph_id": graph_id,
"graph_info": graph_info.to_dict(),
"chunks_processed": total_chunks,
})
except Exception as e:
import traceback
error_msg = f"{str(e)}\n{traceback.format_exc()}"
self.task_manager.fail_task(task_id, error_msg)
def create_graph(self, name: str) -> str:
"""创建Zep图谱公开方法"""
graph_id = f"mirofish_{uuid.uuid4().hex[:16]}"
self.client.graph.create(
graph_id=graph_id,
name=name,
description="MiroFish Social Simulation Graph"
)
return graph_id
def set_ontology(self, graph_id: str, ontology: Dict[str, Any]):
"""设置图谱本体(公开方法)"""
from pydantic import Field
from zep_cloud.external_clients.ontology import EntityModel, EntityText, EdgeModel
# Zep 保留名称,不能作为属性名
RESERVED_NAMES = {'uuid', 'name', 'group_id', 'name_embedding', 'summary', 'created_at'}
def safe_attr_name(attr_name: str) -> str:
"""将保留名称转换为安全名称"""
if attr_name.lower() in RESERVED_NAMES:
return f"entity_{attr_name}"
return attr_name
# 动态创建实体类型
entity_types = {}
for entity_def in ontology.get("entity_types", []):
name = entity_def["name"]
description = entity_def.get("description", f"A {name} entity.")
# 创建属性字典和类型注解Pydantic v2 需要)
attrs = {"__doc__": description}
annotations = {}
for attr_def in entity_def.get("attributes", []):
attr_name = safe_attr_name(attr_def["name"]) # 使用安全名称
attr_desc = attr_def.get("description", attr_name)
attrs[attr_name] = Field(description=attr_desc, default=None)
annotations[attr_name] = Optional[EntityText] # 类型注解
attrs["__annotations__"] = annotations
# 动态创建类
entity_class = type(name, (EntityModel,), attrs)
entity_class.__doc__ = description
entity_types[name] = entity_class
# 动态创建边类型
edge_definitions = {}
for edge_def in ontology.get("edge_types", []):
name = edge_def["name"]
description = edge_def.get("description", f"A {name} relationship.")
# 创建属性字典和类型注解
attrs = {"__doc__": description}
annotations = {}
for attr_def in edge_def.get("attributes", []):
attr_name = safe_attr_name(attr_def["name"]) # 使用安全名称
attr_desc = attr_def.get("description", attr_name)
attrs[attr_name] = Field(description=attr_desc, default=None)
annotations[attr_name] = Optional[str] # 边属性用str类型
attrs["__annotations__"] = annotations
# 动态创建类
class_name = ''.join(word.capitalize() for word in name.split('_'))
edge_class = type(class_name, (EdgeModel,), attrs)
edge_class.__doc__ = description
# 构建source_targets
source_targets = []
for st in edge_def.get("source_targets", []):
source_targets.append(
EntityEdgeSourceTarget(
source=st.get("source", "Entity"),
target=st.get("target", "Entity")
)
)
if source_targets:
edge_definitions[name] = (edge_class, source_targets)
# 调用Zep API设置本体
if entity_types or edge_definitions:
self.client.graph.set_ontology(
graph_ids=[graph_id],
entities=entity_types if entity_types else None,
edges=edge_definitions if edge_definitions else None,
)
def add_text_batches(
self,
graph_id: str,
chunks: List[str],
batch_size: int = 3,
progress_callback: Optional[Callable] = None
) -> List[str]:
"""分批添加文本到图谱,返回所有 episode 的 uuid 列表"""
episode_uuids = []
total_chunks = len(chunks)
for i in range(0, total_chunks, batch_size):
batch_chunks = chunks[i:i + batch_size]
batch_num = i // batch_size + 1
total_batches = (total_chunks + batch_size - 1) // batch_size
if progress_callback:
progress = (i + len(batch_chunks)) / total_chunks
progress_callback(
f"发送第 {batch_num}/{total_batches} 批数据 ({len(batch_chunks)} 块)...",
progress
)
# 构建episode数据
episodes = [
EpisodeData(data=chunk, type="text")
for chunk in batch_chunks
]
# 发送到Zep
try:
batch_result = self.client.graph.add_batch(
graph_id=graph_id,
episodes=episodes
)
# 收集返回的 episode uuid
if batch_result and isinstance(batch_result, list):
for ep in batch_result:
ep_uuid = getattr(ep, 'uuid_', None) or getattr(ep, 'uuid', None)
if ep_uuid:
episode_uuids.append(ep_uuid)
# 避免请求过快
time.sleep(1)
except Exception as e:
if progress_callback:
progress_callback(f"批次 {batch_num} 发送失败: {str(e)}", 0)
raise
return episode_uuids
def _wait_for_episodes(
self,
episode_uuids: List[str],
progress_callback: Optional[Callable] = None,
timeout: int = 600
):
"""等待所有 episode 处理完成(通过查询每个 episode 的 processed 状态)"""
if not episode_uuids:
if progress_callback:
progress_callback("无需等待(没有 episode)", 1.0)
return
start_time = time.time()
pending_episodes = set(episode_uuids)
completed_count = 0
total_episodes = len(episode_uuids)
if progress_callback:
progress_callback(f"开始等待 {total_episodes} 个文本块处理...", 0)
while pending_episodes:
if time.time() - start_time > timeout:
if progress_callback:
progress_callback(
f"部分文本块超时,已完成 {completed_count}/{total_episodes}",
completed_count / total_episodes
)
break
# 检查每个 episode 的处理状态
for ep_uuid in list(pending_episodes):
try:
episode = self.client.graph.episode.get(uuid_=ep_uuid)
is_processed = getattr(episode, 'processed', False)
if is_processed:
pending_episodes.remove(ep_uuid)
completed_count += 1
except Exception:
# 忽略单个查询错误,继续
pass
elapsed = int(time.time() - start_time)
if progress_callback:
progress_callback(
f"Zep处理中... {completed_count}/{total_episodes} 完成, {len(pending_episodes)} 待处理 ({elapsed}秒)",
completed_count / total_episodes if total_episodes > 0 else 0
)
if pending_episodes:
time.sleep(3) # 每3秒检查一次
if progress_callback:
progress_callback(f"处理完成: {completed_count}/{total_episodes}", 1.0)
def _get_graph_info(self, graph_id: str) -> GraphInfo:
"""获取图谱信息"""
# 获取节点
nodes = self.client.graph.node.get_by_graph_id(graph_id=graph_id)
# 获取边
edges = self.client.graph.edge.get_by_graph_id(graph_id=graph_id)
# 统计实体类型
entity_types = set()
for node in nodes:
if node.labels:
for label in node.labels:
if label not in ["Entity", "Node"]:
entity_types.add(label)
return GraphInfo(
graph_id=graph_id,
node_count=len(nodes),
edge_count=len(edges),
entity_types=list(entity_types)
)
def get_graph_data(self, graph_id: str) -> Dict[str, Any]:
"""
获取完整图谱数据
Args:
graph_id: 图谱ID
Returns:
包含nodes和edges的字典
"""
nodes = self.client.graph.node.get_by_graph_id(graph_id=graph_id)
edges = self.client.graph.edge.get_by_graph_id(graph_id=graph_id)
nodes_data = []
for node in nodes:
nodes_data.append({
"uuid": node.uuid_,
"name": node.name,
"labels": node.labels or [],
"summary": node.summary or "",
"attributes": node.attributes or {},
})
edges_data = []
for edge in edges:
edges_data.append({
"uuid": edge.uuid_,
"name": edge.name or "",
"fact": edge.fact or "",
"source_node_uuid": edge.source_node_uuid,
"target_node_uuid": edge.target_node_uuid,
"attributes": edge.attributes or {},
})
return {
"graph_id": graph_id,
"nodes": nodes_data,
"edges": edges_data,
"node_count": len(nodes_data),
"edge_count": len(edges_data),
}
def delete_graph(self, graph_id: str):
"""删除图谱"""
self.client.graph.delete(graph_id=graph_id)
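The worker above maps each sub-stage's 0–1 progress fraction into a fixed band of the overall percentage (20–60% for sending batches, 60–90% for waiting on Zep, per the inline comments). The band arithmetic can be sketched and checked standalone — `scale_progress` is an illustrative name, not an API in this module:

```python
def scale_progress(sub_progress: float, start: int, span: int) -> int:
    """Map a 0..1 sub-task fraction into the absolute band [start, start + span]."""
    return start + int(sub_progress * span)

# sending batches occupies 20-60%, waiting for Zep occupies 60-90%
print(scale_progress(0.0, 20, 40))  # 20
print(scale_progress(1.0, 20, 40))  # 60
print(scale_progress(0.5, 60, 30))  # 75
```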

361
backend/app/services/ontology_generator.py Normal file

@ -0,0 +1,361 @@
"""
Ontology generation service
Interface 1: analyzes text content and generates entity/relation type definitions suited to social simulation
"""
import json
from typing import Dict, Any, List, Optional
from ..utils.llm_client import LLMClient
# 本体生成的系统提示词
ONTOLOGY_SYSTEM_PROMPT = """你是一个专业的知识图谱本体设计专家。你的任务是分析给定的文本内容和模拟需求，设计适合**社交媒体舆论模拟**的实体类型和关系类型。
**重要：你必须输出有效的JSON格式数据，不要输出任何其他内容。**
## 核心任务背景
我们正在构建一个**社交媒体舆论模拟系统**。在这个系统中：
- 每个实体都是一个可以在社交媒体上发声、互动、传播信息的"账号"或"主体"
- 实体之间会相互影响（转发、评论、回应）
- 我们需要模拟舆论事件中各方的反应和信息传播路径
因此，**实体必须是现实中真实存在的、可以在社媒上发声和互动的主体**。
**可以是（鼓励多样化划分）：**
- 具体的个人：公众人物、当事人、意见领袖、专家学者
- 公司企业（包括其官方账号）
- 组织机构：大学、协会、NGO、工会等
- 政府部门、监管机构
- 媒体机构：报纸、电视台、自媒体、网站
- 社交媒体平台本身
- 特定群体代表：如校友会、粉丝团、维权群体等
**不可以是：**
- 抽象概念："舆论"、"情绪"、"趋势"
- 主题/话题："学术诚信"、"教育改革"
- 观点/态度："支持方"、"反对方"
- 泛指群体："网友"、"公众"、"学生群体"
## 输出格式
请输出JSON格式，包含以下结构：
```json
{
"entity_types": [
{
"name": "实体类型名称（英文，PascalCase）",
"description": "简短描述（英文，不超过100字符）",
"attributes": [
{
"name": "属性名（英文，snake_case）",
"type": "text",
"description": "属性描述"
}
],
"examples": ["示例实体1", "示例实体2"]
}
],
"edge_types": [
{
"name": "关系类型名称（英文，UPPER_SNAKE_CASE）",
"description": "简短描述（英文，不超过100字符）",
"source_targets": [
{"source": "源实体类型", "target": "目标实体类型"}
],
"attributes": []
}
],
"analysis_summary": "对文本内容的简要分析说明（中文）"
}
```
## 设计指南
1. **实体类型设计（重要：请尽量多划分）**
- **数量要求：至少5个，最多10个实体类型**
- 每个实体类型代表一类可以在社媒上发声的主体
- 尽量细分不同角色，例如：
- 不要只用"Person"，可以细分为"PublicFigure"、"Expert"、"Whistleblower"
- 不要只用"Organization"，可以细分为"University"、"Company"、"NGO"
- description必须清晰说明什么样的实体应该被提取
- 每个类型提供2-3个具体示例
2. **关系类型设计**
- 关系应该反映社媒互动中的真实联系
- 关注可能影响舆论传播的关系：
- 信息传播：REPORTS_ON, COMMENTS_ON, SHARES
- 组织关系：AFFILIATED_WITH, WORKS_FOR, REPRESENTS
- 互动关系：RESPONDS_TO, SUPPORTS, OPPOSES
- 关系类型5-10个为宜
3. **属性设计**
- 每个实体类型1-3个关键属性
- 属性应有助于识别实体的社媒影响力（如role、influence_level等）
## 实体类型参考（请根据文本内容灵活选择和扩展）
- Person: 普通个人
- PublicFigure: 公众人物（明星、网红、意见领袖）
- Expert: 专家学者
- Journalist: 记者
- Company: 公司企业
- University: 高校
- GovernmentAgency: 政府机构
- MediaOutlet: 传统媒体
- SelfMedia: 自媒体账号
- SocialPlatform: 社交媒体平台
- NGO: 非政府组织
- IndustryAssociation: 行业协会
- AlumniAssociation: 校友会
- FanGroup: 粉丝群体/支持群体
## 关系类型参考
- WORKS_FOR: 工作于
- AFFILIATED_WITH: 隶属于
- REPRESENTS: 代表
- REGULATES: 监管
- REPORTS_ON: 报道
- COMMENTS_ON: 评论
- RESPONDS_TO: 回应
- SUPPORTS: 支持
- OPPOSES: 反对
- COLLABORATES_WITH: 合作
- COMPETES_WITH: 竞争
"""
class OntologyGenerator:
"""
本体生成器
分析文本内容生成实体和关系类型定义
"""
def __init__(self, llm_client: Optional[LLMClient] = None):
self.llm_client = llm_client or LLMClient()
def generate(
self,
document_texts: List[str],
simulation_requirement: str,
additional_context: Optional[str] = None
) -> Dict[str, Any]:
"""
生成本体定义
Args:
document_texts: 文档文本列表
simulation_requirement: 模拟需求描述
additional_context: 额外上下文
Returns:
本体定义entity_types, edge_types等
"""
# 构建用户消息
user_message = self._build_user_message(
document_texts,
simulation_requirement,
additional_context
)
messages = [
{"role": "system", "content": ONTOLOGY_SYSTEM_PROMPT},
{"role": "user", "content": user_message}
]
# 调用LLM
result = self.llm_client.chat_json(
messages=messages,
temperature=0.3,
max_tokens=4096
)
# 验证和后处理
result = self._validate_and_process(result)
return result
# 传给 LLM 的文本最大长度5万字
MAX_TEXT_LENGTH_FOR_LLM = 50000
def _build_user_message(
self,
document_texts: List[str],
simulation_requirement: str,
additional_context: Optional[str]
) -> str:
"""构建用户消息"""
# 合并文本
combined_text = "\n\n---\n\n".join(document_texts)
original_length = len(combined_text)
# 如果文本超过5万字截断仅影响传给LLM的内容不影响图谱构建
if len(combined_text) > self.MAX_TEXT_LENGTH_FOR_LLM:
combined_text = combined_text[:self.MAX_TEXT_LENGTH_FOR_LLM]
combined_text += f"\n\n...(原文共{original_length}字,已截取前{self.MAX_TEXT_LENGTH_FOR_LLM}字用于本体分析)..."
message = f"""## 模拟需求
{simulation_requirement}
## 文档内容
{combined_text}
"""
if additional_context:
message += f"""
## 额外说明
{additional_context}
"""
message += """
请根据以上内容，设计适合社会舆论模拟的实体类型和关系类型。
记住：所有实体类型必须是现实中可以发声的主体，不能是抽象概念！
"""
return message
def _validate_and_process(self, result: Dict[str, Any]) -> Dict[str, Any]:
"""验证和后处理结果"""
# 确保必要字段存在
if "entity_types" not in result:
result["entity_types"] = []
if "edge_types" not in result:
result["edge_types"] = []
if "analysis_summary" not in result:
result["analysis_summary"] = ""
# 验证实体类型
for entity in result["entity_types"]:
if "attributes" not in entity:
entity["attributes"] = []
if "examples" not in entity:
entity["examples"] = []
# 确保description不超过100字符
if len(entity.get("description", "")) > 100:
entity["description"] = entity["description"][:97] + "..."
# 验证关系类型
for edge in result["edge_types"]:
if "source_targets" not in edge:
edge["source_targets"] = []
if "attributes" not in edge:
edge["attributes"] = []
if len(edge.get("description", "")) > 100:
edge["description"] = edge["description"][:97] + "..."
return result
def generate_python_code(self, ontology: Dict[str, Any]) -> str:
"""
将本体定义转换为Python代码类似ontology.py
Args:
ontology: 本体定义
Returns:
Python代码字符串
"""
code_lines = [
'"""',
'自定义实体类型定义',
'由MiroFish自动生成用于社会舆论模拟',
'"""',
'',
'from pydantic import Field',
'from zep_cloud.external_clients.ontology import EntityModel, EntityText, EdgeModel',
'',
'',
'# ============== 实体类型定义 ==============',
'',
]
# 生成实体类型
for entity in ontology.get("entity_types", []):
name = entity["name"]
desc = entity.get("description", f"A {name} entity.")
code_lines.append(f'class {name}(EntityModel):')
code_lines.append(f' """{desc}"""')
attrs = entity.get("attributes", [])
if attrs:
for attr in attrs:
attr_name = attr["name"]
attr_desc = attr.get("description", attr_name)
code_lines.append(f' {attr_name}: EntityText = Field(')
code_lines.append(f' description="{attr_desc}",')
code_lines.append(f' default=None')
code_lines.append(f' )')
else:
code_lines.append(' pass')
code_lines.append('')
code_lines.append('')
code_lines.append('# ============== 关系类型定义 ==============')
code_lines.append('')
# 生成关系类型
for edge in ontology.get("edge_types", []):
name = edge["name"]
# 转换为PascalCase类名
class_name = ''.join(word.capitalize() for word in name.split('_'))
desc = edge.get("description", f"A {name} relationship.")
code_lines.append(f'class {class_name}(EdgeModel):')
code_lines.append(f' """{desc}"""')
attrs = edge.get("attributes", [])
if attrs:
for attr in attrs:
attr_name = attr["name"]
attr_desc = attr.get("description", attr_name)
code_lines.append(f' {attr_name}: EntityText = Field(')
code_lines.append(f' description="{attr_desc}",')
code_lines.append(f' default=None')
code_lines.append(f' )')
else:
code_lines.append(' pass')
code_lines.append('')
code_lines.append('')
# 生成类型字典
code_lines.append('# ============== 类型配置 ==============')
code_lines.append('')
code_lines.append('ENTITY_TYPES = {')
for entity in ontology.get("entity_types", []):
name = entity["name"]
code_lines.append(f' "{name}": {name},')
code_lines.append('}')
code_lines.append('')
code_lines.append('EDGE_TYPES = {')
for edge in ontology.get("edge_types", []):
name = edge["name"]
class_name = ''.join(word.capitalize() for word in name.split('_'))
code_lines.append(f' "{name}": {class_name},')
code_lines.append('}')
code_lines.append('')
# 生成边的source_targets映射
code_lines.append('EDGE_SOURCE_TARGETS = {')
for edge in ontology.get("edge_types", []):
name = edge["name"]
source_targets = edge.get("source_targets", [])
if source_targets:
st_list = ', '.join([
f'{{"source": "{st.get("source", "Entity")}", "target": "{st.get("target", "Entity")}"}}'
for st in source_targets
])
code_lines.append(f' "{name}": [{st_list}],')
code_lines.append('}')
return '\n'.join(code_lines)
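The defaulting and truncation performed by `_validate_and_process` can be exercised in isolation. This standalone `normalize_ontology` (an illustrative name) mirrors its entity-side logic:

```python
def normalize_ontology(result: dict) -> dict:
    """Fill missing top-level fields and clamp descriptions to 100 chars."""
    result.setdefault("entity_types", [])
    result.setdefault("edge_types", [])
    result.setdefault("analysis_summary", "")
    for entity in result["entity_types"]:
        entity.setdefault("attributes", [])
        entity.setdefault("examples", [])
        if len(entity.get("description", "")) > 100:
            # 97 chars + "..." keeps the result at exactly 100
            entity["description"] = entity["description"][:97] + "..."
    return result

ont = normalize_ontology({"entity_types": [{"name": "Expert", "description": "x" * 120}]})
print(len(ont["entity_types"][0]["description"]))  # 100
```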

71
backend/app/services/text_processor.py Normal file

@ -0,0 +1,71 @@
"""
Text processing service
"""
from typing import List, Optional
from ..utils.file_parser import FileParser, split_text_into_chunks
class TextProcessor:
"""文本处理器"""
@staticmethod
def extract_from_files(file_paths: List[str]) -> str:
"""从多个文件提取文本"""
return FileParser.extract_from_multiple(file_paths)
@staticmethod
def split_text(
text: str,
chunk_size: int = 500,
overlap: int = 50
) -> List[str]:
"""
分割文本
Args:
text: 原始文本
chunk_size: 块大小
overlap: 重叠大小
Returns:
文本块列表
"""
return split_text_into_chunks(text, chunk_size, overlap)
@staticmethod
def preprocess_text(text: str) -> str:
"""
预处理文本
- 移除多余空白
- 标准化换行
Args:
text: 原始文本
Returns:
处理后的文本
"""
import re
# 标准化换行
text = text.replace('\r\n', '\n').replace('\r', '\n')
# 移除连续空行(保留最多两个换行)
text = re.sub(r'\n{3,}', '\n\n', text)
# 移除行首行尾空白
lines = [line.strip() for line in text.split('\n')]
text = '\n'.join(lines)
return text.strip()
@staticmethod
def get_text_stats(text: str) -> dict:
"""获取文本统计信息"""
return {
"total_chars": len(text),
"total_lines": text.count('\n') + 1,
"total_words": len(text.split()),
}

9
backend/app/utils/__init__.py Normal file

@ -0,0 +1,9 @@
"""
Utility modules
"""
from .file_parser import FileParser
from .llm_client import LLMClient
__all__ = ['FileParser', 'LLMClient']

141
backend/app/utils/file_parser.py Normal file

@ -0,0 +1,141 @@
"""
File parsing utilities
Supports text extraction from PDF, Markdown, and TXT files
"""
import os
from pathlib import Path
from typing import List, Optional
class FileParser:
"""文件解析器"""
SUPPORTED_EXTENSIONS = {'.pdf', '.md', '.markdown', '.txt'}
@classmethod
def extract_text(cls, file_path: str) -> str:
"""
从文件中提取文本
Args:
file_path: 文件路径
Returns:
提取的文本内容
"""
path = Path(file_path)
if not path.exists():
raise FileNotFoundError(f"文件不存在: {file_path}")
suffix = path.suffix.lower()
if suffix not in cls.SUPPORTED_EXTENSIONS:
raise ValueError(f"不支持的文件格式: {suffix}")
if suffix == '.pdf':
return cls._extract_from_pdf(file_path)
elif suffix in {'.md', '.markdown'}:
return cls._extract_from_md(file_path)
elif suffix == '.txt':
return cls._extract_from_txt(file_path)
raise ValueError(f"无法处理的文件格式: {suffix}")
@staticmethod
def _extract_from_pdf(file_path: str) -> str:
"""从PDF提取文本"""
try:
import fitz # PyMuPDF
except ImportError:
raise ImportError("需要安装PyMuPDF: pip install PyMuPDF")
text_parts = []
with fitz.open(file_path) as doc:
for page in doc:
text = page.get_text()
if text.strip():
text_parts.append(text)
return "\n\n".join(text_parts)
@staticmethod
def _extract_from_md(file_path: str) -> str:
"""从Markdown提取文本"""
with open(file_path, 'r', encoding='utf-8') as f:
return f.read()
@staticmethod
def _extract_from_txt(file_path: str) -> str:
"""从TXT提取文本"""
with open(file_path, 'r', encoding='utf-8') as f:
return f.read()
@classmethod
def extract_from_multiple(cls, file_paths: List[str]) -> str:
"""
从多个文件提取文本并合并
Args:
file_paths: 文件路径列表
Returns:
合并后的文本
"""
all_texts = []
for i, file_path in enumerate(file_paths, 1):
try:
text = cls.extract_text(file_path)
filename = Path(file_path).name
all_texts.append(f"=== 文档 {i}: {filename} ===\n{text}")
except Exception as e:
all_texts.append(f"=== 文档 {i}: {file_path} (提取失败: {str(e)}) ===")
return "\n\n".join(all_texts)
def split_text_into_chunks(
text: str,
chunk_size: int = 500,
overlap: int = 50
) -> List[str]:
"""
将文本分割成小块
Args:
text: 原始文本
chunk_size: 每块的字符数
overlap: 重叠字符数
Returns:
文本块列表
"""
if len(text) <= chunk_size:
return [text] if text.strip() else []
chunks = []
start = 0
while start < len(text):
end = start + chunk_size
# 尝试在句子边界处分割
if end < len(text):
# 查找最近的句子结束符
for sep in ['。', '！', '？', '.\n', '!\n', '?\n', '\n\n', '. ', '! ', '? ']:
last_sep = text[start:end].rfind(sep)
if last_sep != -1 and last_sep > chunk_size * 0.3:
end = start + last_sep + len(sep)
break
chunk = text[start:end].strip()
if chunk:
chunks.append(chunk)
# 下一个块从重叠位置开始
start = end - overlap if end < len(text) else len(text)
return chunks
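Stripped of the sentence-boundary search, the sliding-window arithmetic of `split_text_into_chunks` looks like this — a condensed illustration (`chunk_fixed` is not the function above):

```python
def chunk_fixed(text: str, size: int = 10, overlap: int = 3) -> list:
    """Fixed-size windows; each chunk re-reads `overlap` chars of the previous one."""
    chunks, start = [], 0
    while start < len(text):
        end = start + size
        piece = text[start:end].strip()
        if piece:
            chunks.append(piece)
        # next window starts `overlap` chars before the previous end
        start = end - overlap if end < len(text) else len(text)
    return chunks

parts = chunk_fixed("abcdefghijklmnopqrst", size=10, overlap=3)
print(parts)  # ['abcdefghij', 'hijklmnopq', 'opqrst']
```

Consecutive chunks share their last/first `overlap` characters, which keeps entities that straddle a boundary visible in at least one chunk.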

91
backend/app/utils/llm_client.py Normal file

@ -0,0 +1,91 @@
"""
LLM client wrapper
All calls go through the OpenAI-compatible API format
"""
import json
from typing import Optional, Dict, Any, List
from openai import OpenAI
from ..config import Config
class LLMClient:
"""LLM客户端"""
def __init__(
self,
api_key: Optional[str] = None,
base_url: Optional[str] = None,
model: Optional[str] = None
):
self.api_key = api_key or Config.LLM_API_KEY
self.base_url = base_url or Config.LLM_BASE_URL
self.model = model or Config.LLM_MODEL_NAME
if not self.api_key:
raise ValueError("LLM_API_KEY 未配置")
self.client = OpenAI(
api_key=self.api_key,
base_url=self.base_url
)
def chat(
self,
messages: List[Dict[str, str]],
temperature: float = 0.7,
max_tokens: int = 4096,
response_format: Optional[Dict] = None
) -> str:
"""
发送聊天请求
Args:
messages: 消息列表
temperature: 温度参数
max_tokens: 最大token数
response_format: 响应格式如JSON模式
Returns:
模型响应文本
"""
kwargs = {
"model": self.model,
"messages": messages,
"temperature": temperature,
"max_tokens": max_tokens,
}
if response_format:
kwargs["response_format"] = response_format
response = self.client.chat.completions.create(**kwargs)
return response.choices[0].message.content
def chat_json(
self,
messages: List[Dict[str, str]],
temperature: float = 0.3,
max_tokens: int = 4096
) -> Dict[str, Any]:
"""
发送聊天请求并返回JSON
Args:
messages: 消息列表
temperature: 温度参数
max_tokens: 最大token数
Returns:
解析后的JSON对象
"""
response = self.chat(
messages=messages,
temperature=temperature,
max_tokens=max_tokens,
response_format={"type": "json_object"}
)
return json.loads(response)

107
backend/app/utils/logger.py Normal file

@ -0,0 +1,107 @@
"""
Logging configuration module
Provides unified log management, writing to both console and file
"""
import os
import logging
from datetime import datetime
from logging.handlers import RotatingFileHandler
# 日志目录
LOG_DIR = os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(__file__))), 'logs')
def setup_logger(name: str = 'mirofish', level: int = logging.DEBUG) -> logging.Logger:
"""
设置日志器
Args:
name: 日志器名称
level: 日志级别
Returns:
配置好的日志器
"""
# 确保日志目录存在
os.makedirs(LOG_DIR, exist_ok=True)
# 创建日志器
logger = logging.getLogger(name)
logger.setLevel(level)
# 如果已经有处理器,不重复添加
if logger.handlers:
return logger
# 日志格式
detailed_formatter = logging.Formatter(
'[%(asctime)s] %(levelname)s [%(name)s.%(funcName)s:%(lineno)d] %(message)s',
datefmt='%Y-%m-%d %H:%M:%S'
)
simple_formatter = logging.Formatter(
'[%(asctime)s] %(levelname)s: %(message)s',
datefmt='%H:%M:%S'
)
# 1. 文件处理器 - 详细日志(按日期命名,带轮转)
log_filename = datetime.now().strftime('%Y-%m-%d') + '.log'
file_handler = RotatingFileHandler(
os.path.join(LOG_DIR, log_filename),
maxBytes=10 * 1024 * 1024, # 10MB
backupCount=5,
encoding='utf-8'
)
file_handler.setLevel(logging.DEBUG)
file_handler.setFormatter(detailed_formatter)
# 2. 控制台处理器 - 简洁日志INFO及以上
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.INFO)
console_handler.setFormatter(simple_formatter)
# 添加处理器
logger.addHandler(file_handler)
logger.addHandler(console_handler)
return logger
def get_logger(name: str = 'mirofish') -> logging.Logger:
"""
获取日志器如果不存在则创建
Args:
name: 日志器名称
Returns:
日志器实例
"""
logger = logging.getLogger(name)
if not logger.handlers:
return setup_logger(name)
return logger
# 创建默认日志器
logger = setup_logger()
# 便捷方法
def debug(msg, *args, **kwargs):
logger.debug(msg, *args, **kwargs)
def info(msg, *args, **kwargs):
logger.info(msg, *args, **kwargs)
def warning(msg, *args, **kwargs):
logger.warning(msg, *args, **kwargs)
def error(msg, *args, **kwargs):
logger.error(msg, *args, **kwargs)
def critical(msg, *args, **kwargs):
logger.critical(msg, *args, **kwargs)
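The `if logger.handlers` guard matters because `logging.getLogger(name)` returns the same object on every call; without the guard, repeated setup would attach duplicate handlers and emit every log line twice. A self-contained demonstration, using an in-memory stream instead of the rotating file handler:

```python
import io
import logging

def get_demo_logger(name: str = "mirofish.demo") -> logging.Logger:
    logger = logging.getLogger(name)
    if logger.handlers:  # same idempotency guard as setup_logger
        return logger
    handler = logging.StreamHandler(io.StringIO())
    handler.setFormatter(logging.Formatter("[%(asctime)s] %(levelname)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    return logger

a = get_demo_logger()
b = get_demo_logger()
print(a is b, len(a.handlers))  # True 1
```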

22
backend/requirements.txt Normal file

@ -0,0 +1,22 @@
# Flask framework
flask>=3.0.0
flask-cors>=4.0.0
# Zep Cloud SDK
zep-cloud>=2.0.0
# OpenAI SDK (for LLM calls)
openai>=1.0.0
# PDF processing
PyMuPDF>=1.24.0
# Environment variables
python-dotenv>=1.0.0
# Data validation
pydantic>=2.0.0
# File handling
werkzeug>=3.0.0

57
backend/run.py Normal file

@ -0,0 +1,57 @@
"""
MiroFish Backend startup entry point
"""
import os
import sys
# 添加项目根目录到路径
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from app import create_app
from app.config import Config
def main():
"""主函数"""
# 验证配置
errors = Config.validate()
if errors:
print("配置错误:")
for err in errors:
print(f" - {err}")
print("\n请检查 .env 文件中的配置")
sys.exit(1)
# 创建应用
app = create_app()
# 获取运行配置
host = os.environ.get('FLASK_HOST', '0.0.0.0')
port = int(os.environ.get('FLASK_PORT', 5001))
debug = Config.DEBUG
print(f"""
MiroFish Backend Server
Running on: http://{host}:{port}
Debug mode: {debug}
API Endpoints:
POST /api/graph/ontology/generate - 生成本体
POST /api/graph/build - 构建图谱
GET /api/graph/task/<task_id> - 查询任务
GET /api/graph/tasks - 列出任务
GET /api/graph/data/<graph_id> - 获取图数据
DELETE /api/graph/delete/<graph_id>- 删除图谱
""")
# 启动服务
app.run(host=host, port=port, debug=debug, threaded=True)
if __name__ == '__main__':
main()
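`Config.validate()` (defined in app/config.py, not shown in this diff) returns a list of error strings, and `main()` aborts when the list is non-empty. The fail-fast pattern can be sketched with hypothetical checks for the two required keys — `validate_env` and its messages are illustrative, not the real `Config.validate`:

```python
def validate_env(env: dict) -> list:
    """Return human-readable errors for missing settings; an empty list means OK."""
    errors = []
    for key in ("LLM_API_KEY", "ZEP_API_KEY"):
        if not env.get(key):
            errors.append(f"{key} is not set")
    return errors

print(validate_env({"LLM_API_KEY": "k"}))                      # ['ZEP_API_KEY is not set']
print(validate_env({"LLM_API_KEY": "k", "ZEP_API_KEY": "z"}))  # []
```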