cauvang32 9c180bdd89 Refactor OpenAI utilities and remove Python executor

- Removed the `analyze_data_file` function from tool definitions to streamline functionality.
- Enhanced the `execute_python_code` function description to clarify auto-installation of packages and file handling.
- Deleted the `python_executor.py` module to simplify the codebase and improve maintainability.
- Introduced a new `token_counter.py` module for efficient token counting for OpenAI API requests, including support for Discord image links and cost estimation.

2025-10-02 21:49:48 +07:00

16 KiB

Complete Implementation Summary

✅ All Requirements Implemented

1. ✅ File Storage with User Limits

Location: /tmp/bot_code_interpreter/user_files/{user_id}/
Per-User Limit: MAX_FILES_PER_USER in .env (default: 20 files)
Auto-Cleanup: When limit reached, oldest file automatically deleted
Expiration: Files expire after FILE_EXPIRATION_HOURS (default: 48 hours, -1 for permanent)
Metadata: MongoDB stores file_id, filename, file_type, expires_at, etc.
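
The limit and expiration rules above can be sketched in a few lines. This is only an illustration under stated assumptions, not the bot's actual code: the helper names `compute_expiry` and `files_to_evict` are hypothetical, and the metadata records are modelled as plain dicts rather than MongoDB documents.

```python
from datetime import datetime, timedelta, timezone

MAX_FILES_PER_USER = 20        # mirrors the .env default
FILE_EXPIRATION_HOURS = 48     # -1 means files never expire


def compute_expiry(uploaded_at, hours=FILE_EXPIRATION_HOURS):
    """Return the expiry timestamp, or None when hours is -1 (permanent)."""
    if hours == -1:
        return None
    return uploaded_at + timedelta(hours=hours)


def files_to_evict(existing, limit=MAX_FILES_PER_USER):
    """Return the oldest records that must go to make room for one new upload.

    `existing` is a list of dicts with at least `file_id` and `uploaded_at`
    (a hypothetical shape chosen for this sketch).
    """
    overflow = len(existing) + 1 - limit
    if overflow <= 0:
        return []
    oldest_first = sorted(existing, key=lambda rec: rec["uploaded_at"])
    return oldest_first[:overflow]
```

A caller would delete each returned record from disk and from the metadata store before saving the new upload.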

2. ✅ Universal File Access

By Code Interpreter: All files accessible via load_file(file_id)
By AI Model: File info in conversation context with file_id
Smart Loading: Auto-detects file type and loads appropriately
200+ File Types: CSV, Excel, JSON, Parquet, HDF5, NumPy, Images, Audio, Video, etc.
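
One way to picture the smart loading described above is an extension-to-loader table. The sketch below is an assumption about the approach, not the bot's implementation: it uses only the standard library, covers three extensions, and the names `LOADERS` and `make_load_file` are hypothetical.

```python
import csv
import json
from pathlib import Path


def _read_csv(path):
    # Rows come back as a list of dicts keyed by the header line.
    with open(path, newline="") as fh:
        return list(csv.DictReader(fh))


def _read_json(path):
    with open(path) as fh:
        return json.load(fh)


def _read_text(path):
    return Path(path).read_text()


# Hypothetical extension table; the real loader is described as covering far more types.
LOADERS = {".csv": _read_csv, ".json": _read_json, ".txt": _read_text}


def make_load_file(files):
    """Build a load_file() bound to a {file_id: path} mapping."""
    def load_file(file_id):
        if file_id not in files:
            raise KeyError(f"Unknown file_id: {file_id!r}")
        path = files[file_id]
        suffix = Path(path).suffix.lower()
        loader = LOADERS.get(suffix)
        if loader is None:
            raise ValueError(f"No loader registered for {suffix!r}")
        return loader(path)
    return load_file
```

Binding the mapping in a closure keeps user code limited to the files it was given, since only known `file_id` values resolve to a path.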

3. ✅ All Work Through Code Interpreter

Single Execution Path: Everything runs through execute_python_code
Removed: Deprecated analyze_data_file tool
Unified: Data analysis, Python code, file processing - all in one place
Auto-Install: Packages the code imports are installed automatically before execution
Auto-Capture: Generated files automatically sent to user
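
Auto-capture could work by comparing the working directory before and after a run. The following is a minimal sketch under that assumption; `snapshot` and `new_or_changed` are hypothetical helper names, and the source does not say how the bot actually detects generated files.

```python
from pathlib import Path


def snapshot(workdir):
    """Map each file under workdir to its modification time in nanoseconds."""
    root = Path(workdir)
    return {
        p.relative_to(root): p.stat().st_mtime_ns
        for p in root.rglob("*")
        if p.is_file()
    }


def new_or_changed(before, after):
    """Return paths that appeared or were modified between two snapshots."""
    return sorted(p for p, mtime in after.items() if before.get(p) != mtime)
```

Taking one snapshot just before the code runs and one just after yields the files (for example `plot.png`) that would be sent back to the user.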

4. ✅ 200+ File Types Support

Tabular: CSV, Excel, Parquet, Feather, etc.
Structured: JSON, YAML, XML, TOML, etc.
Binary: HDF5, Pickle, NumPy, MATLAB, etc.
Media: Images, Audio, Video (20+ formats each)
Code: 50+ programming languages
Scientific: DICOM, NIfTI, FITS, VTK, etc.
Geospatial: GeoJSON, Shapefile, KML, etc.
Archives: ZIP, TAR, 7Z, etc.

5. ✅ Configurable Code Execution Timeout

Configuration: CODE_EXECUTION_TIMEOUT in .env (default: 300 seconds)
Smart Timeout: Only counts actual code execution time
Excluded from Timeout:
- Environment setup
- Package installation
- File upload/download
- Result collection
User-Friendly: Clear timeout error messages
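
To show how a timeout can cover only the code run, here is a sketch that applies it to the child process alone, so any setup done beforehand is not counted. It assumes the code is executed as a script in a subprocess; the function name `run_with_timeout` and the result keys are hypothetical.

```python
import subprocess
import sys
import time


def run_with_timeout(script_path, timeout_seconds, python=sys.executable):
    """Run a script, applying the timeout to the child process only."""
    started = time.monotonic()
    try:
        proc = subprocess.run(
            [python, script_path],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # the child is killed when this elapses
        )
    except subprocess.TimeoutExpired:
        return {
            "success": False,
            "error": f"Execution timed out after {timeout_seconds} seconds",
            "execution_time": time.monotonic() - started,
        }
    return {
        "success": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "execution_time": time.monotonic() - started,
    }
```

Environment setup and package installation would happen before this call, and result collection after it, which keeps them outside the timed window.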

📊 Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                         User Uploads File                        │
│                    (Any of 200+ file types)                     │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ↓
┌─────────────────────────────────────────────────────────────────┐
│                    upload_discord_attachment()                   │
│  • Detects file type (200+ types)                               │
│  • Checks user file limit (MAX_FILES_PER_USER)                  │
│  • Deletes oldest if limit reached                              │
│  • Saves to /tmp/bot_code_interpreter/user_files/{user_id}/    │
│  • Stores metadata in MongoDB                                   │
│  • Sets expiration (FILE_EXPIRATION_HOURS)                      │
│  • Returns file_id                                              │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ↓
┌─────────────────────────────────────────────────────────────────┐
│                      MongoDB (Metadata)                          │
│  {                                                               │
│    file_id: "abc123",                                            │
│    user_id: "12345",                                             │
│    filename: "data.csv",                                         │
│    file_type: "csv",                                             │
│    file_size: 1234567,                                           │
│    file_path: "/tmp/.../abc123.csv",                            │
│    uploaded_at: "2025-10-02T10:00:00",                          │
│    expires_at: "2025-10-04T10:00:00"                            │
│  }                                                               │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ↓
┌─────────────────────────────────────────────────────────────────┐
│                  User Asks to Process File                       │
│              "Analyze this data", "Create plots", etc.          │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ↓
┌─────────────────────────────────────────────────────────────────┐
│                        AI Model (GPT-4)                          │
│  • Sees file context with file_id in conversation               │
│  • Generates Python code:                                       │
│    df = load_file('abc123')                                     │
│    df.describe()                                                │
│    plt.plot(df['x'], df['y'])                                   │
│    plt.savefig('plot.png')                                      │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ↓
┌─────────────────────────────────────────────────────────────────┐
│                    execute_python_code()                         │
│  1. Validate code security                                       │
│  2. Ensure venv ready (NOT counted in timeout)                  │
│  3. Install packages if needed (NOT counted in timeout)         │
│  4. Fetch all user files from DB                                │
│  5. Inject load_file() function with file_id mappings           │
│  6. Write code to temp file                                     │
│  7. ⏱️  START TIMEOUT TIMER                                     │
│  8. Execute Python code in isolated venv                        │
│  9. ⏱️  END TIMEOUT TIMER                                       │
│  10. Capture stdout, stderr, generated files                    │
│  11. Return results                                             │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ↓
┌─────────────────────────────────────────────────────────────────┐
│                   Isolated Python Execution                      │
│                                                                  │
│  FILES = {'abc123': '/tmp/.../abc123.csv'}                      │
│                                                                  │
│  def load_file(file_id):                                        │
│      path = FILES[file_id]                                      │
│      # Smart auto-detection:                                    │
│      if path.endswith('.csv'):                                  │
│          return pd.read_csv(path)                               │
│      elif path.endswith('.xlsx'):                               │
│          return pd.read_excel(path)                             │
│      elif path.endswith('.parquet'):                            │
│          return pd.read_parquet(path)                           │
│      # ... 200+ file types handled ...                          │
│                                                                  │
│  # User's code executes here with timeout                       │
│  df = load_file('abc123')  # Auto: pd.read_csv()                │
│  print(df.describe())                                           │
│  plt.plot(df['x'], df['y'])                                     │
│  plt.savefig('plot.png')  # Auto-captured!                      │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ↓
┌─────────────────────────────────────────────────────────────────┐
│                      Auto-Capture Results                        │
│  • stdout/stderr output                                          │
│  • Generated files: plot.png, results.csv, etc.                 │
│  • Execution time                                               │
│  • Success/error status                                         │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ↓
┌─────────────────────────────────────────────────────────────────┐
│                   Send Results to Discord                        │
│  • Text output (stdout)                                          │
│  • Generated files as attachments                               │
│  • Error messages if any                                        │
│  • Execution time                                               │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ↓
┌─────────────────────────────────────────────────────────────────┐
│                     Background Cleanup                           │
│  • After FILE_EXPIRATION_HOURS: Delete expired files            │
│  • When user exceeds MAX_FILES_PER_USER: Delete oldest          │
│  • Remove from disk and MongoDB                                 │
└─────────────────────────────────────────────────────────────────┘
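
The background cleanup step at the end of the flow might look like the sketch below. It is an illustration only: records are plain dicts with the `file_path` and `expires_at` fields from the metadata example, `cleanup_expired` is a hypothetical name, and a real version would also delete the matching MongoDB documents.

```python
from datetime import datetime, timezone
from pathlib import Path


def cleanup_expired(records, now=None):
    """Delete expired files from disk and return the surviving records.

    A record whose `expires_at` is None is treated as permanent.
    """
    now = now or datetime.now(timezone.utc)
    kept = []
    for rec in records:
        expires_at = rec.get("expires_at")
        if expires_at is not None and expires_at <= now:
            # missing_ok tolerates files that were already removed
            Path(rec["file_path"]).unlink(missing_ok=True)
            continue
        kept.append(rec)
    return kept
```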

📝 Configuration (.env)

# Discord & API Keys
DISCORD_TOKEN=your_token_here
OPENAI_API_KEY=your_api_key_here
OPENAI_BASE_URL=https://models.github.ai/inference
MONGODB_URI=your_mongodb_uri_here

# File Management
FILE_EXPIRATION_HOURS=48        # Files expire after 48 hours (-1 = never)
MAX_FILES_PER_USER=20           # Maximum 20 files per user

# Code Execution
CODE_EXECUTION_TIMEOUT=300      # 5 minutes timeout for code execution
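
For illustration, these settings could be read in Python roughly as follows. This is a sketch, not the bot's configuration loader: `read_int_setting` and `load_settings` are hypothetical names, the defaults are the ones documented above, and it assumes the values have already been placed in the process environment.

```python
import os


def read_int_setting(name, default, env=os.environ):
    """Read an integer setting, falling back to the default when unset or blank."""
    # Drop any trailing inline comment in case the loader did not strip it.
    raw = env.get(name, "").split("#", 1)[0].strip()
    if not raw:
        return default
    return int(raw)


def load_settings(env=os.environ):
    hours = read_int_setting("FILE_EXPIRATION_HOURS", 48, env)
    return {
        # None stands for "never expires" (-1 in the .env file)
        "file_expiration_hours": None if hours == -1 else hours,
        "max_files_per_user": read_int_setting("MAX_FILES_PER_USER", 20, env),
        "code_execution_timeout": read_int_setting("CODE_EXECUTION_TIMEOUT", 300, env),
    }
```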

🎯 Key Features

1. Universal File Support

✅ 200+ file types
✅ Smart auto-detection
✅ Automatic loading

2. Intelligent File Management

✅ Per-user limits
✅ Automatic cleanup
✅ Expiration handling

3. Unified Execution

✅ Single code interpreter
✅ Auto-install packages
✅ Auto-capture outputs

4. Smart Timeout

✅ Configurable duration
✅ Only counts code runtime
✅ Excludes setup/install

5. Production Ready

✅ Security validation
✅ Error handling
✅ Resource management

🧪 Testing Examples

Test 1: CSV File Analysis

# Upload data.csv
# Ask: "Analyze this CSV file"

# AI generates:
import pandas as pd
import matplotlib.pyplot as plt

df = load_file('file_id')  # Auto: pd.read_csv()
print(df.describe())
df.hist(figsize=(12, 8))
plt.savefig('histograms.png')

Test 2: Parquet File Processing

# Upload large_data.parquet
# Ask: "Show correlations"

# AI generates:
import pandas as pd
import matplotlib.pyplot as plt  # needed for plt.savefig below
import seaborn as sns

df = load_file('file_id')  # Auto: pd.read_parquet()
corr = df.corr()
sns.heatmap(corr, annot=True)
plt.savefig('correlation.png')

Test 3: Multiple File Types

# Upload: data.csv, config.yaml, model.pkl
# Ask: "Load all files and process"

# AI generates:
import pandas as pd
import yaml
import pickle

df = load_file('csv_id')      # Auto: pd.read_csv()
config = load_file('yaml_id')  # Auto: yaml.safe_load()
model = load_file('pkl_id')    # Auto: pickle.load()

predictions = model.predict(df)
results = pd.DataFrame({'predictions': predictions})
results.to_csv('predictions.csv')

Test 4: Timeout Handling

# Set CODE_EXECUTION_TIMEOUT=60
# Upload data.csv
# Ask: "Run complex computation"

# AI generates code that takes 70 seconds
# Result: TimeoutError after 60 seconds with clear message

📚 Documentation Files

UNIFIED_FILE_SYSTEM_SUMMARY.md - Complete file system overview
ALL_FILE_TYPES_AND_TIMEOUT_UPDATE.md - Detailed implementation
QUICK_REFERENCE_FILE_TYPES_TIMEOUT.md - Quick reference guide
THIS FILE - Complete summary

✅ Verification Checklist

Files saved to code_interpreter system
Per-user file limits enforced (MAX_FILES_PER_USER)
Files expire automatically (FILE_EXPIRATION_HOURS)
200+ file types supported
Files accessible via file_id
Smart load_file() auto-detection
All work runs through code_interpreter
Removed deprecated analyze_data_file
Configurable timeout (CODE_EXECUTION_TIMEOUT)
Timeout only counts code execution
Auto-install packages
Auto-capture generated files
MongoDB stores metadata only
Disk cleanup on expiration
Clear error messages
Production-ready security

🎉 Result

The bot now has a production-ready, ChatGPT-like file handling system:

✅ Upload any file (200+ types)
✅ Automatic management (limits, expiration, cleanup)
✅ Smart loading (auto-detects type)
✅ Unified execution (one code interpreter)
✅ Configurable timeout (smart timing)
✅ Auto-everything (packages, outputs, cleanup)

Simple. Powerful. Production-Ready. 🚀
