# Code Interpreter Guide
## Overview
The unified code interpreter provides ChatGPT/Claude-style code execution capabilities:
- **Secure Python execution** in isolated virtual environments
- **File management** with automatic 48-hour expiration
- **Data analysis** with pandas, numpy, matplotlib, seaborn, plotly
- **Package installation** with security validation
- **Visualization generation** with automatic image handling
## Features
### 1. Code Execution
Execute arbitrary Python code securely:
```python
from src.utils.code_interpreter import execute_code

result = await execute_code(
    code="print('Hello, world!')",
    user_id=123456789
)

# Result:
# {
#     "success": True,
#     "output": "Hello, world!\n",
#     "error": "",
#     "execution_time": 0.05,
#     "return_code": 0
# }
```
### 2. File Upload & Management
Upload files for code to access:
```python
from src.utils.code_interpreter import upload_file, list_user_files

# Upload a CSV file
with open('data.csv', 'rb') as f:
    result = await upload_file(
        user_id=123456789,
        file_data=f.read(),
        filename='data.csv',
        file_type='csv',
        db_handler=db
    )
file_id = result['file_id']

# List the user's files
files = await list_user_files(user_id=123456789, db_handler=db)
```
### 3. Code with File Access
Access uploaded files in code:
```python
# Upload a CSV file first
result = await upload_file(user_id=123, file_data=csv_bytes, filename='sales.csv')
file_id = result['file_id']

# Execute code that uses the file
code = """
# load_file() is automatically available
df = load_file('""" + file_id + """')
print(df.head())
print(f"Total rows: {len(df)}")
"""

result = await execute_code(
    code=code,
    user_id=123,
    user_files=[file_id],
    db_handler=db
)
```
### 4. Package Installation
Install approved packages on-demand:
```python
result = await execute_code(
    code="""
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')
plt.figure(figsize=(10, 6))
sns.scatterplot(data=tips, x='total_bill', y='tip')
plt.savefig('plot.png')
print('Plot saved!')
""",
    user_id=123,
    install_packages=['seaborn', 'matplotlib']
)
```
### 5. Data Analysis
Automatic data loading and analysis:
```python
# The load_file() helper automatically detects file types
code = """
# Load CSV
df = load_file('file_id_here')

# Basic analysis
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(df.describe())

# Correlation analysis
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.savefig('correlation.png')
"""

result = await execute_code(code=code, user_id=123, user_files=['file_id_here'])

# Visualizations are returned in result['generated_files']
for file in result.get('generated_files', []):
    print(f"Generated: {file['filename']}")
    # file['data'] contains the image bytes
```
## File Expiration
### Automatic Cleanup (48 Hours)
Files automatically expire after 48 hours:
```python
from src.utils.code_interpreter import cleanup_expired_files
# Run cleanup (should be scheduled periodically)
deleted_count = await cleanup_expired_files(db_handler=db)
print(f"Cleaned up {deleted_count} expired files")
```
### Manual File Deletion
Delete files manually:
```python
from src.utils.code_interpreter import delete_user_file

success = await delete_user_file(
    file_id='user_123_1234567890_abc123',
    user_id=123,
    db_handler=db
)
```
## Security Features
### Approved Packages
Only approved packages can be installed:
- **Data Science**: numpy, pandas, scipy, scikit-learn, statsmodels
- **Visualization**: matplotlib, seaborn, plotly, bokeh, altair
- **Image Processing**: pillow, imageio, scikit-image
- **Machine Learning**: tensorflow, keras, torch, xgboost, lightgbm
- **NLP**: nltk, spacy, gensim, wordcloud
- **Math/Science**: sympy, networkx, numba
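An allow-list like this is typically just a set lookup before anything is passed to pip. Here is a minimal sketch (the set below is an illustrative subset, and `is_package_approved` is a hypothetical helper, not necessarily the module's internals):

```python
# Illustrative subset of the allow-list; the real module may organize this differently.
APPROVED_PACKAGES = {
    # Data science
    "numpy", "pandas", "scipy", "scikit-learn", "statsmodels",
    # Visualization
    "matplotlib", "seaborn", "plotly", "bokeh", "altair",
}

def is_package_approved(name: str) -> tuple[bool, str]:
    """Return (approved, reason) for a requested package name."""
    normalized = name.strip().lower()
    if normalized in APPROVED_PACKAGES:
        return True, "package is on the allow-list"
    return False, f"Package '{normalized}' is not in the approved list"
```

Checking before installing keeps arbitrary PyPI names (and their install-time code) out of the venv entirely.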
### Blocked Operations
Code is validated against dangerous operations:
- ❌ File system writes (outside the execution dir)
- ❌ Network operations (`socket`, `requests`, `urllib`)
- ❌ Process spawning (`subprocess`)
- ❌ System access (`os.system`, `eval`, `exec`)
- ❌ Dangerous functions (`__import__`, `globals`, `locals`)
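Validation of this kind is usually a regex scan over the source before execution, which is also where error strings like `Blocked unsafe operation: import\s+subprocess` come from. A minimal sketch, with an illustrative subset of patterns:

```python
import re

# Illustrative subset of blocked patterns; the real validator likely covers more.
BLOCKED_PATTERNS = [
    r"import\s+subprocess",
    r"import\s+socket",
    r"os\.system",
    r"\beval\s*\(",
    r"\bexec\s*\(",
    r"__import__",
]

def validate_code(code: str) -> tuple[bool, str]:
    """Scan code for blocked patterns before it is executed."""
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, code):
            return False, f"Blocked unsafe operation: {pattern}"
    return True, ""
```

Pattern scanning is a pragmatic first line of defense, not a sandbox by itself, which is why execution also happens in an isolated subprocess with resource limits.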
### Execution Limits
- **Timeout**: 60 seconds (configurable)
- **Output Size**: 100KB max (truncated if larger)
- **File Size**: 50MB max per file
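The output cap can be enforced with a simple truncation step after capturing stdout. A sketch of that step, assuming truncation happens on the encoded byte length (the helper name is hypothetical):

```python
MAX_OUTPUT_BYTES = 100 * 1024  # 100KB cap described above

def truncate_output(output: str) -> str:
    """Truncate captured stdout to the configured byte limit."""
    encoded = output.encode("utf-8")
    if len(encoded) <= MAX_OUTPUT_BYTES:
        return output
    # Drop any partial multi-byte character at the cut point
    truncated = encoded[:MAX_OUTPUT_BYTES].decode("utf-8", errors="ignore")
    return truncated + "\n... [output truncated at 100KB]"
```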
## Environment Management
### Persistent Virtual Environment
The code interpreter uses a persistent venv:
- **Location**: `/tmp/bot_code_interpreter/venv`
- **Cleanup**: Automatically recreated every 7 days
- **Packages**: Cached and reused across executions
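A minimal sketch of how such a persistent venv might be maintained with the stdlib `venv` module (the path and 7-day policy come from the description above; `ensure_venv` is a hypothetical helper):

```python
import shutil
import time
import venv
from pathlib import Path

VENV_DIR = Path("/tmp/bot_code_interpreter/venv")
MAX_VENV_AGE_SECONDS = 7 * 24 * 3600  # recreate weekly, per the policy above

def ensure_venv(venv_dir: Path = VENV_DIR, with_pip: bool = True) -> Path:
    """Create the venv if missing or stale; return its python binary path."""
    python = venv_dir / "bin" / "python"
    if venv_dir.exists() and time.time() - venv_dir.stat().st_mtime > MAX_VENV_AGE_SECONDS:
        shutil.rmtree(venv_dir)  # stale: recreate from scratch
    if not python.exists():
        venv.EnvBuilder(with_pip=with_pip).create(venv_dir)
    return python
```

Reusing one venv across executions is what makes package caching possible: an already-installed package costs nothing on the next run.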
### Status Check
Get interpreter status:
```python
from src.utils.code_interpreter import get_interpreter_status

status = await get_interpreter_status(db_handler=db)

# Returns:
# {
#     "venv_exists": True,
#     "python_path": "/tmp/bot_code_interpreter/venv/bin/python",
#     "installed_packages": ["numpy", "pandas", "matplotlib", ...],
#     "package_count": 15,
#     "last_cleanup": "2024-01-15T10:30:00",
#     "total_user_files": 42,
#     "total_file_size_mb": 125.5,
#     "file_expiration_hours": 48,
#     "max_file_size_mb": 50
# }
```
## Database Schema
### user_files Collection
```javascript
{
    "file_id": "user_123_1234567890_abc123",
    "user_id": 123456789,
    "filename": "sales_data.csv",
    "file_path": "/tmp/bot_code_interpreter/user_files/123456789/user_123_1234567890_abc123.csv",
    "file_size": 1024000,
    "file_type": "csv",
    "uploaded_at": "2024-01-15T10:30:00",
    "expires_at": "2024-01-17T10:30:00"  // 48 hours after upload
}
```
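Building such a document is mostly a matter of stamping `expires_at` 48 hours after upload. A sketch, assuming `build_file_doc` is a hypothetical helper and the `file_id` format matches the example above:

```python
import secrets
import time
from datetime import datetime, timedelta

FILE_EXPIRATION_HOURS = 48  # matches the expiration policy above

def build_file_doc(user_id: int, filename: str, file_size: int, file_type: str) -> dict:
    """Build a user_files document with a 48-hour expires_at."""
    now = datetime.utcnow()
    file_id = f"user_{user_id}_{int(time.time())}_{secrets.token_hex(3)}"
    return {
        "file_id": file_id,
        "user_id": user_id,
        "filename": filename,
        "file_size": file_size,
        "file_type": file_type,
        "uploaded_at": now.isoformat(),
        "expires_at": (now + timedelta(hours=FILE_EXPIRATION_HOURS)).isoformat(),
    }
```

Storing `expires_at` explicitly (rather than computing it at query time) is what lets cleanup run as a single indexed range query on `expires_at`.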
### Indexes
Automatically created for performance:
```python
# Compound index for user queries
await db.user_files.create_index([("user_id", 1), ("expires_at", -1)])
# Unique index for file lookups
await db.user_files.create_index("file_id", unique=True)
# Index for cleanup queries
await db.user_files.create_index("expires_at")
```
## Integration Example
Complete example integrating code interpreter:
```python
import logging

from src.utils.code_interpreter import (
    execute_code,
    upload_file,
    list_user_files,
    cleanup_expired_files
)

async def handle_user_request(user_id: int, code: str, files: list, db):
    """Handle a code execution request from a user."""
    # Upload any files the user provided
    uploaded_files = []
    for file_data, filename in files:
        result = await upload_file(
            user_id=user_id,
            file_data=file_data,
            filename=filename,
            db_handler=db
        )
        if result['success']:
            uploaded_files.append(result['file_id'])

    # Execute the code with file access
    result = await execute_code(
        code=code,
        user_id=user_id,
        user_files=uploaded_files,
        install_packages=['pandas', 'matplotlib'],
        timeout=60,
        db_handler=db
    )

    # Check for errors
    if not result['success']:
        return f"❌ Error: {result['error']}"

    # Format the output
    response = f"✅ Execution completed in {result['execution_time']:.2f}s\n\n"
    if result['output']:
        response += f"**Output:**\n```\n{result['output']}\n```\n"

    # Handle generated images
    for file in result.get('generated_files', []):
        if file['type'] == 'image':
            response += f"\n📊 Generated: {file['filename']}\n"
            # file['data'] contains image bytes - save or send to Discord

    return response

# Periodic cleanup (run every hour)
async def scheduled_cleanup(db):
    """Clean up expired files."""
    deleted = await cleanup_expired_files(db_handler=db)
    if deleted > 0:
        logging.info(f"Cleaned up {deleted} expired files")
```
## Error Handling
### Common Errors
**Security Validation Failed**
```python
result = {
    "success": False,
    "error": "Security validation failed: Blocked unsafe operation: import\s+subprocess"
}
```
**Timeout**
```python
result = {
    "success": False,
    "error": "Execution timeout after 60 seconds",
    "execution_time": 60,
    "return_code": -1
}
```
**Package Not Approved**
```python
result = {
    "success": False,
    "error": "Package 'requests' is not in the approved list"
}
```
**File Too Large**
```python
result = {
    "success": False,
    "error": "File too large. Maximum size is 50MB"
}
```
## Best Practices
1. **Always provide db_handler** for file management
2. **Set reasonable timeouts** for long-running code
3. **Handle generated_files** in results (images, etc.)
4. **Run cleanup_expired_files()** periodically (hourly recommended)
5. **Validate user input** before passing to execute_code()
6. **Check result['success']** before using output
7. **Display execution_time** to users for transparency
## Architecture
### Components
1. **FileManager**: Handles file upload/download, expiration, cleanup
2. **PackageManager**: Manages venv, installs packages, caches installations
3. **CodeExecutor**: Executes code securely, provides file access helpers
### Execution Flow
```
User Code Request
        ↓
Security Validation (blocked patterns)
        ↓
Ensure venv Ready (create if needed)
        ↓
Install Packages (if requested)
        ↓
Create Temp Execution Dir
        ↓
Inject File Access Helpers (load_file, FILES dict)
        ↓
Execute Code (isolated subprocess)
        ↓
Collect Output + Generated Files
        ↓
Cleanup Temp Dir
        ↓
Return Results
```
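The core of this flow (validate, run in an isolated subprocess with a timeout, collect output) can be sketched in a few lines. This is a simplified illustration, not the module's actual implementation; the venv, package, and file-injection steps are omitted, and `run_user_code` is a hypothetical name:

```python
import asyncio
import sys

async def run_user_code(code: str, timeout: float = 60.0) -> dict:
    """Validate, execute in a subprocess, and collect output (simplified)."""
    # 1. Security validation (sketch: block an obvious unsafe import)
    if "subprocess" in code:
        return {"success": False, "error": "Security validation failed"}
    # 2. Execute in an isolated subprocess, enforcing the timeout
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", code,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        out, err = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()
        return {"success": False, "error": f"Execution timeout after {timeout} seconds"}
    # 3. Collect the results
    return {
        "success": proc.returncode == 0,
        "output": out.decode(),
        "error": err.decode(),
        "return_code": proc.returncode,
    }
```

Running user code in a separate process (rather than `exec` in the bot's interpreter) is what makes the timeout enforceable and keeps a crash in user code from taking down the bot.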
## Comparison to Old System
### Old System (3 separate files)
- `code_interpreter.py` - Router/dispatcher
- `python_executor.py` - Execution logic
- `data_analyzer.py` - Data analysis templates
### New System (1 unified file)
- ✅ All functionality in `code_interpreter.py`
- ✅ 48-hour file expiration (like images)
- ✅ Persistent venv with package caching
- ✅ Better security validation
- ✅ Automatic data loading helpers
- ✅ Unified API with async/await
- ✅ MongoDB integration for file tracking
- ✅ Automatic cleanup scheduling
## Troubleshooting
### Venv Creation Fails
Check disk space and permissions:
```bash
df -h /tmp
ls -la /tmp/bot_code_interpreter
```
### Packages Won't Install
Check if package is approved:
```python
from src.utils.code_interpreter import get_package_manager
pm = get_package_manager()
is_approved, reason = pm.is_package_approved('package_name')
print(f"Approved: {is_approved}, Reason: {reason}")
```
### Files Not Found
Check expiration:
```python
from src.utils.code_interpreter import get_file_manager

fm = get_file_manager(db_handler=db)
file_meta = await fm.get_file(file_id, user_id)

if not file_meta:
    print("File expired or doesn't exist")
else:
    print(f"Expires at: {file_meta['expires_at']}")
```
### Performance Issues
Check status and cleanup:
```python
status = await get_interpreter_status(db_handler=db)
print(f"Total files: {status['total_user_files']}")
print(f"Total size: {status['total_file_size_mb']} MB")
# Force cleanup
deleted = await cleanup_expired_files(db_handler=db)
print(f"Cleaned up: {deleted} files")
```
## Migration from Old System
If migrating from the old 3-file system:
1. **Replace imports**:
```python
# Old
from src.utils.python_executor import execute_python_code
from src.utils.data_analyzer import analyze_data_file
# New
from src.utils.code_interpreter import execute_code
```
2. **Update function calls**:
```python
# Old
result = await execute_python_code({
    "code": code,
    "user_id": user_id
})

# New
result = await execute_code(
    code=code,
    user_id=user_id,
    db_handler=db
)
```
3. **Handle file uploads**:
```python
# New file handling
result = await upload_file(
    user_id=user_id,
    file_data=file_bytes,
    filename=name,
    db_handler=db
)
```
4. **Schedule cleanup**:
```python
# Add to bot startup
from discord.ext import tasks

@tasks.loop(hours=1)
async def cleanup_task():
    await cleanup_expired_files(db_handler=db)
```
## Summary
The unified code interpreter provides:
- 🔒 **Security**: Validated patterns, approved packages only
- ⏱️ **Expiration**: Automatic 48-hour file cleanup
- 📦 **Packages**: Persistent venv with caching
- 📊 **Analysis**: Built-in data loading helpers
- 🎨 **Visualizations**: Automatic image generation handling
- 🔄 **Integration**: Clean async API with MongoDB
- 📈 **Status**: Real-time monitoring and metrics
All in one file: `src/utils/code_interpreter.py`