# Code Interpreter Guide
## Overview
The unified code interpreter provides ChatGPT/Claude-style code execution capabilities:
- **Secure Python execution** in isolated virtual environments
- **File management** with automatic 48-hour expiration
- **Data analysis** with pandas, numpy, matplotlib, seaborn, plotly
- **Package installation** with security validation
- **Visualization generation** with automatic image handling
## Features
### 1. Code Execution
Execute arbitrary Python code securely:
```python
from src.utils.code_interpreter import execute_code

result = await execute_code(
    code="print('Hello, world!')",
    user_id=123456789
)

# Result:
# {
#     "success": True,
#     "output": "Hello, world!\n",
#     "error": "",
#     "execution_time": 0.05,
#     "return_code": 0
# }
```
### 2. File Upload & Management
Upload files for code to access:
```python
from src.utils.code_interpreter import upload_file, list_user_files

# Upload a CSV file
with open('data.csv', 'rb') as f:
    result = await upload_file(
        user_id=123456789,
        file_data=f.read(),
        filename='data.csv',
        file_type='csv',
        db_handler=db
    )
file_id = result['file_id']

# List the user's files
files = await list_user_files(user_id=123456789, db_handler=db)
```
### 3. Code with File Access
Access uploaded files in code:
```python
# Upload a CSV file first
result = await upload_file(user_id=123, file_data=csv_bytes, filename='sales.csv')
file_id = result['file_id']

# Execute code that uses the file
code = """
# load_file() is automatically available
df = load_file('""" + file_id + """')
print(df.head())
print(f"Total rows: {len(df)}")
"""

result = await execute_code(
    code=code,
    user_id=123,
    user_files=[file_id],
    db_handler=db
)
```
### 4. Package Installation
Install approved packages on-demand:
```python
result = await execute_code(
    code="""
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')
plt.figure(figsize=(10, 6))
sns.scatterplot(data=tips, x='total_bill', y='tip')
plt.savefig('plot.png')
print('Plot saved!')
""",
    user_id=123,
    install_packages=['seaborn', 'matplotlib']
)
```
### 5. Data Analysis
Automatic data loading and analysis:
```python
# The load_file() helper automatically detects file types
code = """
# Load CSV
df = load_file('file_id_here')

# Basic analysis
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(df.describe())

# Correlation analysis
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.savefig('correlation.png')
"""

result = await execute_code(code=code, user_id=123, user_files=['file_id_here'])

# Visualizations are returned in result['generated_files']
for file in result.get('generated_files', []):
    print(f"Generated: {file['filename']}")
    # file['data'] contains the image bytes
```
## File Expiration
### Automatic Cleanup (48 Hours)
Files automatically expire after 48 hours:
```python
from src.utils.code_interpreter import cleanup_expired_files
# Run cleanup (should be scheduled periodically)
deleted_count = await cleanup_expired_files(db_handler=db)
print(f"Cleaned up {deleted_count} expired files")
```
### Manual File Deletion
Delete files manually:
```python
from src.utils.code_interpreter import delete_user_file

success = await delete_user_file(
    file_id='user_123_1234567890_abc123',
    user_id=123,
    db_handler=db
)
```
## Security Features
### Approved Packages
Only approved packages can be installed:
- **Data Science**: numpy, pandas, scipy, scikit-learn, statsmodels
- **Visualization**: matplotlib, seaborn, plotly, bokeh, altair
- **Image Processing**: pillow, imageio, scikit-image
- **Machine Learning**: tensorflow, keras, torch, xgboost, lightgbm
- **NLP**: nltk, spacy, gensim, wordcloud
- **Math/Science**: sympy, networkx, numba
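An allow-list like this is typically just a set lookup before anything is passed to pip. Here is a minimal sketch (the set below is an illustrative subset, and `is_package_approved` is a hypothetical helper, not necessarily the module's internals):

```python
# Illustrative subset of the allow-list; the real module may organize this differently.
APPROVED_PACKAGES = {
    # Data science
    "numpy", "pandas", "scipy", "scikit-learn", "statsmodels",
    # Visualization
    "matplotlib", "seaborn", "plotly", "bokeh", "altair",
}

def is_package_approved(name: str) -> tuple[bool, str]:
    """Return (approved, reason) for a requested package name."""
    normalized = name.strip().lower()
    if normalized in APPROVED_PACKAGES:
        return True, "package is on the allow-list"
    return False, f"Package '{normalized}' is not in the approved list"
```

Checking before installing keeps arbitrary PyPI names (and their install-time code) out of the venv entirely.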
### Blocked Operations
Code is validated against dangerous operations:
- ❌ File system writes (outside the execution dir)
- ❌ Network operations (`socket`, `requests`, `urllib`)
- ❌ Process spawning (`subprocess`)
- ❌ System access (`os.system`, `eval`, `exec`)
- ❌ Dangerous functions (`__import__`, `globals`, `locals`)
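Validation of this kind is usually a regex scan over the source before execution, which is also where error strings like `Blocked unsafe operation: import\s+subprocess` come from. A minimal sketch, with an illustrative subset of patterns:

```python
import re

# Illustrative subset of blocked patterns; the real validator likely covers more.
BLOCKED_PATTERNS = [
    r"import\s+subprocess",
    r"import\s+socket",
    r"os\.system",
    r"\beval\s*\(",
    r"\bexec\s*\(",
    r"__import__",
]

def validate_code(code: str) -> tuple[bool, str]:
    """Scan code for blocked patterns before it is executed."""
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, code):
            return False, f"Blocked unsafe operation: {pattern}"
    return True, ""
```

Pattern scanning is a pragmatic first line of defense, not a sandbox by itself, which is why execution also happens in an isolated subprocess with resource limits.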
### Execution Limits
- **Timeout**: 60 seconds (configurable)
- **Output Size**: 100KB max (truncated if larger)
- **File Size**: 50MB max per file
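The output cap can be enforced with a simple truncation step after capturing stdout. A sketch of that step, assuming truncation happens on the encoded byte length (the helper name is hypothetical):

```python
MAX_OUTPUT_BYTES = 100 * 1024  # 100KB cap described above

def truncate_output(output: str) -> str:
    """Truncate captured stdout to the configured byte limit."""
    encoded = output.encode("utf-8")
    if len(encoded) <= MAX_OUTPUT_BYTES:
        return output
    # Drop any partial multi-byte character at the cut point
    truncated = encoded[:MAX_OUTPUT_BYTES].decode("utf-8", errors="ignore")
    return truncated + "\n... [output truncated at 100KB]"
```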
## Environment Management
### Persistent Virtual Environment
The code interpreter uses a persistent venv:
- **Location**: `/tmp/bot_code_interpreter/venv`
- **Cleanup**: Automatically recreated every 7 days
- **Packages**: Cached and reused across executions
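A minimal sketch of how such a persistent venv might be maintained with the stdlib `venv` module (the path and 7-day policy come from the description above; `ensure_venv` is a hypothetical helper):

```python
import shutil
import time
import venv
from pathlib import Path

VENV_DIR = Path("/tmp/bot_code_interpreter/venv")
MAX_VENV_AGE_SECONDS = 7 * 24 * 3600  # recreate weekly, per the policy above

def ensure_venv(venv_dir: Path = VENV_DIR, with_pip: bool = True) -> Path:
    """Create the venv if missing or stale; return its python binary path."""
    python = venv_dir / "bin" / "python"
    if venv_dir.exists() and time.time() - venv_dir.stat().st_mtime > MAX_VENV_AGE_SECONDS:
        shutil.rmtree(venv_dir)  # stale: recreate from scratch
    if not python.exists():
        venv.EnvBuilder(with_pip=with_pip).create(venv_dir)
    return python
```

Reusing one venv across executions is what makes package caching possible: an already-installed package costs nothing on the next run.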
### Status Check
Get interpreter status:
```python
from src.utils.code_interpreter import get_interpreter_status

status = await get_interpreter_status(db_handler=db)

# Returns:
# {
#     "venv_exists": True,
#     "python_path": "/tmp/bot_code_interpreter/venv/bin/python",
#     "installed_packages": ["numpy", "pandas", "matplotlib", ...],
#     "package_count": 15,
#     "last_cleanup": "2024-01-15T10:30:00",
#     "total_user_files": 42,
#     "total_file_size_mb": 125.5,
#     "file_expiration_hours": 48,
#     "max_file_size_mb": 50
# }
```
## Database Schema
### user_files Collection
```javascript
{
    "file_id": "user_123_1234567890_abc123",
    "user_id": 123456789,
    "filename": "sales_data.csv",
    "file_path": "/tmp/bot_code_interpreter/user_files/123456789/user_123_1234567890_abc123.csv",
    "file_size": 1024000,
    "file_type": "csv",
    "uploaded_at": "2024-01-15T10:30:00",
    "expires_at": "2024-01-17T10:30:00"  // 48 hours after upload
}
```
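Building such a document is mostly a matter of stamping `expires_at` 48 hours after upload. A sketch, assuming `build_file_doc` is a hypothetical helper and the `file_id` format matches the example above:

```python
import secrets
import time
from datetime import datetime, timedelta

FILE_EXPIRATION_HOURS = 48  # matches the expiration policy above

def build_file_doc(user_id: int, filename: str, file_size: int, file_type: str) -> dict:
    """Build a user_files document with a 48-hour expires_at."""
    now = datetime.utcnow()
    file_id = f"user_{user_id}_{int(time.time())}_{secrets.token_hex(3)}"
    return {
        "file_id": file_id,
        "user_id": user_id,
        "filename": filename,
        "file_size": file_size,
        "file_type": file_type,
        "uploaded_at": now.isoformat(),
        "expires_at": (now + timedelta(hours=FILE_EXPIRATION_HOURS)).isoformat(),
    }
```

Storing `expires_at` explicitly (rather than computing it at query time) is what lets cleanup run as a single indexed range query on `expires_at`.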
### Indexes
Automatically created for performance:
```python
# Compound index for user queries
await db.user_files.create_index([("user_id", 1), ("expires_at", -1)])
# Unique index for file lookups
await db.user_files.create_index("file_id", unique=True)
# Index for cleanup queries
await db.user_files.create_index("expires_at")
```
## Integration Example
Complete example integrating code interpreter:
```python
import logging

from src.utils.code_interpreter import (
    execute_code,
    upload_file,
    list_user_files,
    cleanup_expired_files
)

async def handle_user_request(user_id: int, code: str, files: list, db):
    """Handle a code execution request from a user."""
    # Upload any files the user provided
    uploaded_files = []
    for file_data, filename in files:
        result = await upload_file(
            user_id=user_id,
            file_data=file_data,
            filename=filename,
            db_handler=db
        )
        if result['success']:
            uploaded_files.append(result['file_id'])

    # Execute the code with file access
    result = await execute_code(
        code=code,
        user_id=user_id,
        user_files=uploaded_files,
        install_packages=['pandas', 'matplotlib'],
        timeout=60,
        db_handler=db
    )

    # Check for errors
    if not result['success']:
        return f"❌ Error: {result['error']}"

    # Format the output
    response = f"✅ Execution completed in {result['execution_time']:.2f}s\n\n"
    if result['output']:
        response += f"**Output:**\n```\n{result['output']}\n```\n"

    # Handle generated images
    for file in result.get('generated_files', []):
        if file['type'] == 'image':
            response += f"\n📊 Generated: {file['filename']}\n"
            # file['data'] contains image bytes - save or send to Discord

    return response

# Periodic cleanup (run every hour)
async def scheduled_cleanup(db):
    """Clean up expired files."""
    deleted = await cleanup_expired_files(db_handler=db)
    if deleted > 0:
        logging.info(f"Cleaned up {deleted} expired files")
```
## Error Handling
### Common Errors
**Security Validation Failed**
```python
result = {
    "success": False,
    "error": "Security validation failed: Blocked unsafe operation: import\s+subprocess"
}
```
**Timeout**
```python
result = {
    "success": False,
    "error": "Execution timeout after 60 seconds",
    "execution_time": 60,
    "return_code": -1
}
```
**Package Not Approved**
```python
result = {
    "success": False,
    "error": "Package 'requests' is not in the approved list"
}
```
**File Too Large**
```python
result = {
    "success": False,
    "error": "File too large. Maximum size is 50MB"
}
```
## Best Practices
1. **Always provide db_handler** for file management
2. **Set reasonable timeouts** for long-running code
3. **Handle generated_files** in results (images, etc.)
4. **Run cleanup_expired_files()** periodically (hourly recommended)
5. **Validate user input** before passing to execute_code()
6. **Check result['success']** before using output
7. **Display execution_time** to users for transparency
## Architecture
### Components
1. **FileManager**: Handles file upload/download, expiration, cleanup
2. **PackageManager**: Manages venv, installs packages, caches installations
3. **CodeExecutor**: Executes code securely, provides file access helpers
### Execution Flow
```
User Code Request
        ↓
Security Validation (blocked patterns)
        ↓
Ensure venv Ready (create if needed)
        ↓
Install Packages (if requested)
        ↓
Create Temp Execution Dir
        ↓
Inject File Access Helpers (load_file, FILES dict)
        ↓
Execute Code (isolated subprocess)
        ↓
Collect Output + Generated Files
        ↓
Cleanup Temp Dir
        ↓
Return Results
```
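The core of this flow (validate, run in an isolated subprocess with a timeout, collect output) can be sketched in a few lines. This is a simplified illustration, not the module's actual implementation; the venv, package, and file-injection steps are omitted, and `run_user_code` is a hypothetical name:

```python
import asyncio
import sys

async def run_user_code(code: str, timeout: float = 60.0) -> dict:
    """Validate, execute in a subprocess, and collect output (simplified)."""
    # 1. Security validation (sketch: block an obvious unsafe import)
    if "subprocess" in code:
        return {"success": False, "error": "Security validation failed"}
    # 2. Execute in an isolated subprocess, enforcing the timeout
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", code,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        out, err = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()
        return {"success": False, "error": f"Execution timeout after {timeout} seconds"}
    # 3. Collect the results
    return {
        "success": proc.returncode == 0,
        "output": out.decode(),
        "error": err.decode(),
        "return_code": proc.returncode,
    }
```

Running user code in a separate process (rather than `exec` in the bot's interpreter) is what makes the timeout enforceable and keeps a crash in user code from taking down the bot.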
## Comparison to Old System
### Old System (3 separate files)
- `code_interpreter.py` - Router/dispatcher
- `python_executor.py` - Execution logic
- `data_analyzer.py` - Data analysis templates
### New System (1 unified file)
- ✅ All functionality in `code_interpreter.py`
- ✅ 48-hour file expiration (like images)
- ✅ Persistent venv with package caching
- ✅ Better security validation
- ✅ Automatic data loading helpers
- ✅ Unified API with async/await
- ✅ MongoDB integration for file tracking
- ✅ Automatic cleanup scheduling
## Troubleshooting
### Venv Creation Fails
Check disk space and permissions:
```bash
df -h /tmp
ls -la /tmp/bot_code_interpreter
```
### Packages Won't Install
Check if package is approved:
```python
from src.utils.code_interpreter import get_package_manager
pm = get_package_manager()
is_approved, reason = pm.is_package_approved('package_name')
print(f"Approved: {is_approved}, Reason: {reason}")
```
### Files Not Found
Check expiration:
```python
from src.utils.code_interpreter import get_file_manager

fm = get_file_manager(db_handler=db)
file_meta = await fm.get_file(file_id, user_id)

if not file_meta:
    print("File expired or doesn't exist")
else:
    print(f"Expires at: {file_meta['expires_at']}")
```
### Performance Issues
Check status and cleanup:
```python
status = await get_interpreter_status(db_handler=db)
print(f"Total files: {status['total_user_files']}")
print(f"Total size: {status['total_file_size_mb']} MB")
# Force cleanup
deleted = await cleanup_expired_files(db_handler=db)
print(f"Cleaned up: {deleted} files")
```
## Migration from Old System
If migrating from the old 3-file system:
1. **Replace imports**:
```python
# Old
from src.utils.python_executor import execute_python_code
from src.utils.data_analyzer import analyze_data_file
# New
from src.utils.code_interpreter import execute_code
```
2. **Update function calls**:
```python
# Old
result = await execute_python_code({
    "code": code,
    "user_id": user_id
})

# New
result = await execute_code(
    code=code,
    user_id=user_id,
    db_handler=db
)
```
3. **Handle file uploads**:
```python
# New file handling
result = await upload_file(
    user_id=user_id,
    file_data=file_bytes,
    filename=name,
    db_handler=db
)
```
4. **Schedule cleanup**:
```python
# Add to bot startup
from discord.ext import tasks

@tasks.loop(hours=1)
async def cleanup_task():
    await cleanup_expired_files(db_handler=db)
```
## Summary
The unified code interpreter provides:
- 🔒 **Security**: Validated patterns, approved packages only
- ⏱️ **Expiration**: Automatic 48-hour file cleanup
- 📦 **Packages**: Persistent venv with caching
- 📊 **Analysis**: Built-in data loading helpers
- 🎨 **Visualizations**: Automatic image generation handling
- 🔄 **Integration**: Clean async API with MongoDB
- 📈 **Status**: Real-time monitoring and metrics
All in one file: `src/utils/code_interpreter.py`