ChatGPT-Discord-Bot/docs/CODE_INTERPRETER_GUIDE.md
cauvang32 9c180bdd89 Refactor OpenAI utilities and remove Python executor
- Removed the `analyze_data_file` function from tool definitions to streamline functionality.
- Enhanced the `execute_python_code` function description to clarify auto-installation of packages and file handling.
- Deleted the `python_executor.py` module to simplify the codebase and improve maintainability.
- Introduced a new `token_counter.py` module for efficient token counting for OpenAI API requests, including support for Discord image links and cost estimation.
2025-10-02 21:49:48 +07:00


Code Interpreter Guide

Overview

The unified code interpreter provides ChatGPT/Claude-style code execution capabilities:

  • Secure Python execution in isolated virtual environments
  • File management with automatic 48-hour expiration
  • Data analysis with pandas, numpy, matplotlib, seaborn, plotly
  • Package installation with security validation
  • Visualization generation with automatic image handling

Features

1. Code Execution

Execute Python code securely in an isolated subprocess:

from src.utils.code_interpreter import execute_code

result = await execute_code(
    code="print('Hello, world!')",
    user_id=123456789
)

# Result:
# {
#     "success": True,
#     "output": "Hello, world!\n",
#     "error": "",
#     "execution_time": 0.05,
#     "return_code": 0
# }

2. File Upload & Management

Upload files for code to access:

from src.utils.code_interpreter import upload_file, list_user_files

# Upload a CSV file
with open('data.csv', 'rb') as f:
    result = await upload_file(
        user_id=123456789,
        file_data=f.read(),
        filename='data.csv',
        file_type='csv',
        db_handler=db
    )

file_id = result['file_id']

# List user's files
files = await list_user_files(user_id=123456789, db_handler=db)

3. Code with File Access

Access uploaded files in code:

# Upload a CSV file first
result = await upload_file(user_id=123, file_data=csv_bytes, filename='sales.csv', db_handler=db)
file_id = result['file_id']

# Execute code that uses the file
code = f"""
# load_file() is automatically available
df = load_file('{file_id}')
print(df.head())
print(f"Total rows: {{len(df)}}")
"""

result = await execute_code(
    code=code,
    user_id=123,
    user_files=[file_id],
    db_handler=db
)

4. Package Installation

Install approved packages on-demand:

result = await execute_code(
    code="""
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')
plt.figure(figsize=(10, 6))
sns.scatterplot(data=tips, x='total_bill', y='tip')
plt.savefig('plot.png')
print('Plot saved!')
""",
    user_id=123,
    install_packages=['seaborn', 'matplotlib']
)

5. Data Analysis

Automatic data loading and analysis:

# The load_file() helper automatically detects file types
code = """
# Load CSV
df = load_file('file_id_here')

# Basic analysis
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(df.describe())

# Correlation analysis
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.savefig('correlation.png')
"""

result = await execute_code(code=code, user_id=123, user_files=['file_id_here'])

# Visualizations are returned in result['generated_files']
for file in result.get('generated_files', []):
    print(f"Generated: {file['filename']}")
    # file['data'] contains the image bytes

File Expiration

Automatic Cleanup (48 Hours)

Files automatically expire after 48 hours:

from src.utils.code_interpreter import cleanup_expired_files

# Run cleanup (should be scheduled periodically)
deleted_count = await cleanup_expired_files(db_handler=db)
print(f"Cleaned up {deleted_count} expired files")

Manual File Deletion

Delete files manually:

from src.utils.code_interpreter import delete_user_file

success = await delete_user_file(
    file_id='user_123_1234567890_abc123',
    user_id=123,
    db_handler=db
)

Security Features

Approved Packages

Only approved packages can be installed:

  • Data Science: numpy, pandas, scipy, scikit-learn, statsmodels
  • Visualization: matplotlib, seaborn, plotly, bokeh, altair
  • Image Processing: pillow, imageio, scikit-image
  • Machine Learning: tensorflow, keras, torch, xgboost, lightgbm
  • NLP: nltk, spacy, gensim, wordcloud
  • Math/Science: sympy, networkx, numba

Blocked Operations

Code is validated against dangerous operations:

  • File system writes (outside execution dir)
  • Network operations (socket, requests, urllib)
  • Process spawning (subprocess)
  • System access (os.system, eval, exec)
  • Dangerous built-ins (__import__, globals, locals)
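
A minimal sketch of this pattern-based validation, assuming a regex blocklist like the one implied by the error messages later in this guide (the real patterns live in code_interpreter.py):

```python
import re

# Hypothetical blocklist for illustration only; the authoritative
# patterns live in src/utils/code_interpreter.py.
BLOCKED_PATTERNS = [
    r"import\s+subprocess",
    r"import\s+socket",
    r"os\.system",
    r"\beval\s*\(",
    r"\bexec\s*\(",
]

def validate_code(code: str) -> tuple[bool, str]:
    """Return (ok, reason); reject code matching any blocked pattern."""
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, code):
            return False, f"Blocked unsafe operation: {pattern}"
    return True, ""
```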

Execution Limits

  • Timeout: 60 seconds (configurable)
  • Output Size: 100KB max (truncated if larger)
  • File Size: 50MB max per file
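
These limits are enforced around the subprocess call itself. The sketch below shows one way to do that with asyncio; run_with_limits is a hypothetical helper for illustration, not the module's API:

```python
import asyncio

MAX_OUTPUT_BYTES = 100 * 1024  # 100KB cap before truncation

async def run_with_limits(python_path: str, script: str, timeout: float = 60.0) -> dict:
    """Run a script in a subprocess, enforcing a timeout and output cap."""
    proc = await asyncio.create_subprocess_exec(
        python_path, "-c", script,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()  # reap the killed process
        return {
            "success": False,
            "error": f"Execution timeout after {timeout:.0f} seconds",
            "return_code": -1,
        }
    output = stdout.decode(errors="replace")
    if len(output) > MAX_OUTPUT_BYTES:
        output = output[:MAX_OUTPUT_BYTES] + "\n... (output truncated)"
    return {
        "success": proc.returncode == 0,
        "output": output,
        "error": stderr.decode(errors="replace"),
        "return_code": proc.returncode,
    }
```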

Environment Management

Persistent Virtual Environment

The code interpreter uses a persistent venv:

  • Location: /tmp/bot_code_interpreter/venv
  • Cleanup: Automatically recreated every 7 days
  • Packages: Cached and reused across executions
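
The 7-day recreation policy could be implemented as a simple age check on the venv's interpreter. This is a hypothetical sketch only; the real interpreter tracks its last cleanup in its own metadata:

```python
import time
from pathlib import Path

VENV_MAX_AGE_DAYS = 7  # matches the cleanup window described above

def venv_needs_recreation(venv_path: Path, max_age_days: int = VENV_MAX_AGE_DAYS) -> bool:
    """Return True if the venv is missing or older than the cleanup window."""
    python_bin = venv_path / "bin" / "python"
    if not python_bin.exists():
        return True
    age_seconds = time.time() - python_bin.stat().st_mtime
    return age_seconds > max_age_days * 86400
```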

Status Check

Get interpreter status:

from src.utils.code_interpreter import get_interpreter_status

status = await get_interpreter_status(db_handler=db)

# Returns:
# {
#     "venv_exists": True,
#     "python_path": "/tmp/bot_code_interpreter/venv/bin/python",
#     "installed_packages": ["numpy", "pandas", "matplotlib", ...],
#     "package_count": 15,
#     "last_cleanup": "2024-01-15T10:30:00",
#     "total_user_files": 42,
#     "total_file_size_mb": 125.5,
#     "file_expiration_hours": 48,
#     "max_file_size_mb": 50
# }

Database Schema

user_files Collection

{
  "file_id": "user_123_1234567890_abc123",
  "user_id": 123456789,
  "filename": "sales_data.csv",
  "file_path": "/tmp/bot_code_interpreter/user_files/123456789/user_123_1234567890_abc123.csv",
  "file_size": 1024000,
  "file_type": "csv",
  "uploaded_at": "2024-01-15T10:30:00",
  "expires_at": "2024-01-17T10:30:00"  // 48 hours later
}
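
A sketch of how such a document could be assembled, mirroring the file_id format and the 48-hour expires_at window shown above (build_file_record is a hypothetical helper, not part of the module's API):

```python
import secrets
import time
from datetime import datetime, timedelta, timezone

FILE_EXPIRATION_HOURS = 48

def build_file_record(user_id: int, filename: str, size: int) -> dict:
    """Assemble a user_files document; the 'user_<id>_<timestamp>_<random>'
    ID format mirrors the example file_id in the schema above."""
    now = datetime.now(timezone.utc)
    file_id = f"user_{user_id}_{int(time.time())}_{secrets.token_hex(3)}"
    return {
        "file_id": file_id,
        "user_id": user_id,
        "filename": filename,
        "file_size": size,
        "uploaded_at": now.isoformat(),
        "expires_at": (now + timedelta(hours=FILE_EXPIRATION_HOURS)).isoformat(),
    }
```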

Indexes

Automatically created for performance:

# Compound index for user queries
await db.user_files.create_index([("user_id", 1), ("expires_at", -1)])

# Unique index for file lookups
await db.user_files.create_index("file_id", unique=True)

# Index for cleanup queries
await db.user_files.create_index("expires_at")

Integration Example

Complete example integrating code interpreter:

from src.utils.code_interpreter import (
    execute_code,
    upload_file,
    list_user_files,
    cleanup_expired_files
)

async def handle_user_request(user_id: int, code: str, files: list, db):
    """Handle a code execution request from a user."""
    
    # Upload any files the user provided
    uploaded_files = []
    for file_data, filename in files:
        result = await upload_file(
            user_id=user_id,
            file_data=file_data,
            filename=filename,
            db_handler=db
        )
        if result['success']:
            uploaded_files.append(result['file_id'])
    
    # Execute the code with file access
    result = await execute_code(
        code=code,
        user_id=user_id,
        user_files=uploaded_files,
        install_packages=['pandas', 'matplotlib'],
        timeout=60,
        db_handler=db
    )
    
    # Check for errors
    if not result['success']:
        return f"❌ Error: {result['error']}"
    
    # Format output
    response = f"✅ Execution completed in {result['execution_time']:.2f}s\n\n"
    
    if result['output']:
        response += f"**Output:**\n```\n{result['output']}\n```\n"
    
    # Handle generated images
    for file in result.get('generated_files', []):
        if file['type'] == 'image':
            response += f"\n📊 Generated: {file['filename']}\n"
            # file['data'] contains image bytes - save or send to Discord
    
    return response

# Periodic cleanup (run every hour)
async def scheduled_cleanup(db):
    """Clean up expired files."""
    deleted = await cleanup_expired_files(db_handler=db)
    if deleted > 0:
        logging.info(f"Cleaned up {deleted} expired files")

Error Handling

Common Errors

Security Validation Failed

result = {
    "success": False,
    "error": "Security validation failed: Blocked unsafe operation: import\s+subprocess"
}

Timeout

result = {
    "success": False,
    "error": "Execution timeout after 60 seconds",
    "execution_time": 60,
    "return_code": -1
}

Package Not Approved

result = {
    "success": False,
    "error": "Package 'requests' is not in the approved list"
}

File Too Large

result = {
    "success": False,
    "error": "File too large. Maximum size is 50MB"
}

Best Practices

  1. Always provide db_handler for file management
  2. Set reasonable timeouts for long-running code
  3. Handle generated_files in results (images, etc.)
  4. Run cleanup_expired_files() periodically (hourly recommended)
  5. Validate user input before passing to execute_code()
  6. Check result['success'] before using output
  7. Display execution_time to users for transparency

Architecture

Components

  1. FileManager: Handles file upload/download, expiration, cleanup
  2. PackageManager: Manages venv, installs packages, caches installations
  3. CodeExecutor: Executes code securely, provides file access helpers

Execution Flow

User Code Request
    ↓
Security Validation (blocked patterns)
    ↓
Ensure venv Ready (create if needed)
    ↓
Install Packages (if requested)
    ↓
Create Temp Execution Dir
    ↓
Inject File Access Helpers (load_file, FILES dict)
    ↓
Execute Code (isolated subprocess)
    ↓
Collect Output + Generated Files
    ↓
Cleanup Temp Dir
    ↓
Return Results

Comparison to Old System

Old System (3 separate files)

  • code_interpreter.py - Router/dispatcher
  • python_executor.py - Execution logic
  • data_analyzer.py - Data analysis templates

New System (1 unified file)

  • All functionality in code_interpreter.py
  • 48-hour file expiration (like images)
  • Persistent venv with package caching
  • Better security validation
  • Automatic data loading helpers
  • Unified API with async/await
  • MongoDB integration for file tracking
  • Automatic cleanup scheduling

Troubleshooting

Venv Creation Fails

Check disk space and permissions:

df -h /tmp
ls -la /tmp/bot_code_interpreter

Packages Won't Install

Check if package is approved:

from src.utils.code_interpreter import get_package_manager

pm = get_package_manager()
is_approved, reason = pm.is_package_approved('package_name')
print(f"Approved: {is_approved}, Reason: {reason}")

Files Not Found

Check expiration:

from src.utils.code_interpreter import get_file_manager

fm = get_file_manager(db_handler=db)
file_meta = await fm.get_file(file_id, user_id)

if not file_meta:
    print("File expired or doesn't exist")
else:
    print(f"Expires at: {file_meta['expires_at']}")

Performance Issues

Check status and cleanup:

status = await get_interpreter_status(db_handler=db)
print(f"Total files: {status['total_user_files']}")
print(f"Total size: {status['total_file_size_mb']} MB")

# Force cleanup
deleted = await cleanup_expired_files(db_handler=db)
print(f"Cleaned up: {deleted} files")

Migration from Old System

If migrating from the old 3-file system:

  1. Replace imports:

    # Old
    from src.utils.python_executor import execute_python_code
    from src.utils.data_analyzer import analyze_data_file
    
    # New
    from src.utils.code_interpreter import execute_code
    
  2. Update function calls:

    # Old
    result = await execute_python_code({
        "code": code,
        "user_id": user_id
    })
    
    # New
    result = await execute_code(
        code=code,
        user_id=user_id,
        db_handler=db
    )
    
  3. Handle file uploads:

    # New file handling
    result = await upload_file(
        user_id=user_id,
        file_data=bytes,
        filename=name,
        db_handler=db
    )
    
  4. Schedule cleanup:

    # Add to bot startup
    @tasks.loop(hours=1)
    async def cleanup_task():
        await cleanup_expired_files(db_handler=db)
    

Summary

The unified code interpreter provides:

  • 🔒 Security: Validated patterns, approved packages only
  • ⏱️ Expiration: Automatic 48-hour file cleanup
  • 📦 Packages: Persistent venv with caching
  • 📊 Analysis: Built-in data loading helpers
  • 🎨 Visualizations: Automatic image generation handling
  • 🔄 Integration: Clean async API with MongoDB
  • 📈 Status: Real-time monitoring and metrics

All in one file: src/utils/code_interpreter.py