- Removed the `analyze_data_file` function from tool definitions to streamline functionality.
- Enhanced the `execute_python_code` function description to clarify auto-installation of packages and file handling.
- Deleted the `python_executor.py` module to simplify the codebase and improve maintainability.
- Introduced a new `token_counter.py` module for efficient token counting for OpenAI API requests, including support for Discord image links and cost estimation.
Complete Implementation Summary
✅ All Requirements Implemented
1. ✅ File Storage with User Limits
- Location: `/tmp/bot_code_interpreter/user_files/{user_id}/`
- Per-User Limit: `MAX_FILES_PER_USER` in `.env` (default: 20 files)
- Auto-Cleanup: When the limit is reached, the oldest file is deleted automatically
- Expiration: Files expire after `FILE_EXPIRATION_HOURS` (default: 48 hours, -1 for permanent)
- Metadata: MongoDB stores file_id, filename, file_type, expires_at, etc.
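The auto-cleanup rule above (delete the oldest file once the limit is reached) can be sketched as a pure function. This is only an illustration: `FileRecord` and `files_to_evict` are hypothetical names, not the bot's actual code, and the real implementation would query MongoDB rather than receive a list.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FileRecord:
    """Hypothetical stand-in for one file metadata document."""
    file_id: str
    uploaded_at: datetime


def files_to_evict(existing: list[FileRecord], max_files: int) -> list[FileRecord]:
    """Return the oldest records that must be deleted so one new upload fits."""
    if max_files < 1:
        raise ValueError("max_files must be at least 1")
    # After the new upload the user would hold len(existing) + 1 files.
    overflow = len(existing) + 1 - max_files
    if overflow <= 0:
        return []
    oldest_first = sorted(existing, key=lambda record: record.uploaded_at)
    return oldest_first[:overflow]
```

Returning the records instead of deleting them keeps the policy separate from the disk and database calls, which makes it easy to test.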
2. ✅ Universal File Access
- By Code Interpreter: All files accessible via `load_file(file_id)`
- By AI Model: File info in conversation context with file_id
- Smart Loading: Auto-detects file type and loads appropriately
- 200+ File Types: CSV, Excel, JSON, Parquet, HDF5, NumPy, Images, Audio, Video, etc.
3. ✅ All Work Through Code Interpreter
- Single Execution Path: Everything runs through `execute_python_code`
- Removed: Deprecated `analyze_data_file` tool
- Unified: Data analysis, Python code, file processing - all in one place
- Auto-Install: Packages auto-install when imported
- Auto-Capture: Generated files automatically sent to user
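One common way to implement auto-capture is to snapshot the working directory before and after the run and report whatever is new or changed. The sketch below assumes that approach; `snapshot` and `generated_files` are hypothetical helpers, and the bot's actual mechanism may differ.

```python
from pathlib import Path


def snapshot(workdir: Path) -> dict[str, int]:
    """Map each file's path (relative to workdir) to its modification time in ns."""
    return {
        str(path.relative_to(workdir)): path.stat().st_mtime_ns
        for path in workdir.rglob("*")
        if path.is_file()
    }


def generated_files(before: dict[str, int], after: dict[str, int]) -> list[str]:
    """Return files that are new or modified between two snapshots, sorted by name."""
    return sorted(
        name for name, mtime in after.items()
        if before.get(name) != mtime
    )
```

Taking `snapshot(workdir)` right before and right after the user's code runs, then passing both to `generated_files`, yields the list of files to attach to the reply.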
4. ✅ 200+ File Types Support
- Tabular: CSV, Excel, Parquet, Feather, etc.
- Structured: JSON, YAML, XML, TOML, etc.
- Binary: HDF5, Pickle, NumPy, MATLAB, etc.
- Media: Images, Audio, Video (20+ formats each)
- Code: 50+ programming languages
- Scientific: DICOM, NIfTI, FITS, VTK, etc.
- Geospatial: GeoJSON, Shapefile, KML, etc.
- Archives: ZIP, TAR, 7Z, etc.
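Type detection by extension can be kept maintainable with a lookup table of loaders instead of a long `if`/`elif` chain. The sketch below is an assumption about one possible structure, limited to three standard-library formats; `LOADERS` and `load_by_extension` are hypothetical names, and the real `load_file` covers many more types.

```python
import csv
import json
from pathlib import Path
from typing import Any, Callable


def _read_json(path: Path) -> Any:
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def _read_csv(path: Path) -> list[dict[str, str]]:
    with path.open(newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


def _read_text(path: Path) -> str:
    return path.read_text(encoding="utf-8")


# Extension -> loader. Supporting a new type means adding one entry here.
LOADERS: dict[str, Callable[[Path], Any]] = {
    ".json": _read_json,
    ".csv": _read_csv,
    ".txt": _read_text,
}


def load_by_extension(path: str) -> Any:
    """Pick a loader from the (case-insensitive) file extension."""
    file_path = Path(path)
    loader = LOADERS.get(file_path.suffix.lower())
    if loader is None:
        raise ValueError(f"unsupported file type: {file_path.suffix or '(none)'}")
    return loader(file_path)
```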
5. ✅ Configurable Code Execution Timeout
- Configuration: `CODE_EXECUTION_TIMEOUT` in `.env` (default: 300 seconds)
- Smart Timeout: Only counts actual code execution time
- Excluded from Timeout:
  - Environment setup
  - Package installation
  - File upload/download
  - Result collection
- User-Friendly: Clear timeout error messages
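A timeout that covers only code execution can be approximated by applying it to the child process alone, after setup and package installation have finished. The following is a minimal sketch under that assumption; `run_with_timeout` and the shape of its result dictionary are hypothetical, not the bot's actual API.

```python
import subprocess
import sys
import time


def run_with_timeout(script_path: str, timeout_s: float) -> dict:
    """Run a script in a child interpreter; only this call is subject to the timeout."""
    started = time.monotonic()
    try:
        proc = subprocess.run(
            [sys.executable, script_path],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when this elapses
        )
    except subprocess.TimeoutExpired:
        return {
            "success": False,
            "error": f"Execution timed out after {timeout_s:g} seconds",
            "execution_time": time.monotonic() - started,
        }
    return {
        "success": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "execution_time": time.monotonic() - started,
    }
```

Because the timer starts inside this function, any work done earlier (creating the venv, installing packages, writing the script to disk) never counts against the limit.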
📊 Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ User Uploads File │
│ (Any of 200+ file types) │
└────────────────────────────┬────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────────┐
│ upload_discord_attachment() │
│ • Detects file type (200+ types) │
│ • Checks user file limit (MAX_FILES_PER_USER) │
│ • Deletes oldest if limit reached │
│ • Saves to /tmp/bot_code_interpreter/user_files/{user_id}/ │
│ • Stores metadata in MongoDB │
│ • Sets expiration (FILE_EXPIRATION_HOURS) │
│ • Returns file_id │
└────────────────────────────┬────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────────┐
│ MongoDB (Metadata) │
│ { │
│ file_id: "abc123", │
│ user_id: "12345", │
│ filename: "data.csv", │
│ file_type: "csv", │
│ file_size: 1234567, │
│ file_path: "/tmp/.../abc123.csv", │
│ uploaded_at: "2025-10-02T10:00:00", │
│ expires_at: "2025-10-04T10:00:00" │
│ } │
└────────────────────────────┬────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────────┐
│ User Asks to Process File │
│ "Analyze this data", "Create plots", etc. │
└────────────────────────────┬────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────────┐
│ AI Model (GPT-4) │
│ • Sees file context with file_id in conversation │
│ • Generates Python code: │
│ df = load_file('abc123') │
│ df.describe() │
│ plt.plot(df['x'], df['y']) │
│ plt.savefig('plot.png') │
└────────────────────────────┬────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────────┐
│ execute_python_code() │
│ 1. Validate code security │
│ 2. Ensure venv ready (NOT counted in timeout) │
│ 3. Install packages if needed (NOT counted in timeout) │
│ 4. Fetch all user files from DB │
│ 5. Inject load_file() function with file_id mappings │
│ 6. Write code to temp file │
│ 7. ⏱️ START TIMEOUT TIMER │
│ 8. Execute Python code in isolated venv │
│ 9. ⏱️ END TIMEOUT TIMER │
│ 10. Capture stdout, stderr, generated files │
│ 11. Return results │
└────────────────────────────┬────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────────┐
│ Isolated Python Execution │
│ │
│ FILES = {'abc123': '/tmp/.../abc123.csv'} │
│ │
│ def load_file(file_id): │
│ path = FILES[file_id] │
│ # Smart auto-detection: │
│ if path.endswith('.csv'): │
│ return pd.read_csv(path) │
│ elif path.endswith('.xlsx'): │
│ return pd.read_excel(path) │
│ elif path.endswith('.parquet'): │
│ return pd.read_parquet(path) │
│ # ... 200+ file types handled ... │
│ │
│ # User's code executes here with timeout │
│ df = load_file('abc123') # Auto: pd.read_csv() │
│ print(df.describe()) │
│ plt.plot(df['x'], df['y']) │
│ plt.savefig('plot.png') # Auto-captured! │
└────────────────────────────┬────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────────┐
│ Auto-Capture Results │
│ • stdout/stderr output │
│ • Generated files: plot.png, results.csv, etc. │
│ • Execution time │
│ • Success/error status │
└────────────────────────────┬────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────────┐
│ Send Results to Discord │
│ • Text output (stdout) │
│ • Generated files as attachments │
│ • Error messages if any │
│ • Execution time │
└─────────────────────────────────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────────┐
│ Background Cleanup │
│ • After FILE_EXPIRATION_HOURS: Delete expired files │
│ • When user exceeds MAX_FILES_PER_USER: Delete oldest │
│ • Remove from disk and MongoDB │
└─────────────────────────────────────────────────────────────────┘
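The expiration rule used by the upload and cleanup steps in the diagram can be sketched as two small functions. The names `compute_expires_at` and `is_expired` are hypothetical; the only behavior taken from this document is the 48-hour default and the `-1` sentinel for permanent files.

```python
from datetime import datetime, timedelta
from typing import Optional


def compute_expires_at(uploaded_at: datetime, expiration_hours: int) -> Optional[datetime]:
    """Return the expiry timestamp, or None when files are permanent (-1)."""
    if expiration_hours == -1:
        return None
    return uploaded_at + timedelta(hours=expiration_hours)


def is_expired(expires_at: Optional[datetime], now: datetime) -> bool:
    """A file with no expiry never expires; otherwise compare against now."""
    return expires_at is not None and now >= expires_at
```

With the sample metadata above, an upload at 2025-10-02T10:00:00 and a 48-hour setting gives an `expires_at` of 2025-10-04T10:00:00.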
📝 Configuration (.env)
# Discord & API Keys
DISCORD_TOKEN=your_token_here
OPENAI_API_KEY=your_api_key_here
OPENAI_BASE_URL=https://models.github.ai/inference
MONGODB_URI=your_mongodb_uri_here
# File Management
FILE_EXPIRATION_HOURS=48 # Files expire after 48 hours (-1 = never)
MAX_FILES_PER_USER=20 # Maximum 20 files per user
# Code Execution
CODE_EXECUTION_TIMEOUT=300 # 5 minutes timeout for code execution
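One way to read these settings with the documented defaults is shown below. It is a sketch, not the bot's actual config module: `read_int_setting` and `load_settings` are hypothetical names, and the `.env` file is assumed to have been loaded into the process environment already.

```python
import os
from typing import Mapping


def read_int_setting(env: Mapping[str, str], name: str, default: int) -> int:
    """Read an integer setting, falling back to the default when unset or empty."""
    raw = env.get(name, "").strip()
    if not raw:
        return default
    try:
        return int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None


def load_settings(env: Mapping[str, str] = os.environ) -> dict[str, int]:
    """Collect the file-management and execution settings with their defaults."""
    return {
        "FILE_EXPIRATION_HOURS": read_int_setting(env, "FILE_EXPIRATION_HOURS", 48),
        "MAX_FILES_PER_USER": read_int_setting(env, "MAX_FILES_PER_USER", 20),
        "CODE_EXECUTION_TIMEOUT": read_int_setting(env, "CODE_EXECUTION_TIMEOUT", 300),
    }
```

Failing fast on a non-integer value surfaces a typo in `.env` at startup rather than at the first upload or code run.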
🎯 Key Features
1. Universal File Support
- ✅ 200+ file types
- ✅ Smart auto-detection
- ✅ Automatic loading
2. Intelligent File Management
- ✅ Per-user limits
- ✅ Automatic cleanup
- ✅ Expiration handling
3. Unified Execution
- ✅ Single code interpreter
- ✅ Auto-install packages
- ✅ Auto-capture outputs
4. Smart Timeout
- ✅ Configurable duration
- ✅ Only counts code runtime
- ✅ Excludes setup/install
5. Production Ready
- ✅ Security validation
- ✅ Error handling
- ✅ Resource management
🧪 Testing Examples
Test 1: CSV File Analysis
# Upload data.csv
# Ask: "Analyze this CSV file"
# AI generates:
import pandas as pd
import matplotlib.pyplot as plt
df = load_file('file_id') # Auto: pd.read_csv()
print(df.describe())
df.hist(figsize=(12, 8))
plt.savefig('histograms.png')
Test 2: Parquet File Processing
# Upload large_data.parquet
# Ask: "Show correlations"
# AI generates:
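import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt  # needed for plt.savefig below
df = load_file('file_id') # Auto: pd.read_parquet()
corr = df.corr(numeric_only=True)  # skip non-numeric columns
sns.heatmap(corr, annot=True)
plt.savefig('correlation.png')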
Test 3: Multiple File Types
# Upload: data.csv, config.yaml, model.pkl
# Ask: "Load all files and process"
# AI generates:
import pandas as pd
import yaml
import pickle
df = load_file('csv_id') # Auto: pd.read_csv()
config = load_file('yaml_id') # Auto: yaml.safe_load()
model = load_file('pkl_id') # Auto: pickle.load()
predictions = model.predict(df)
results = pd.DataFrame({'predictions': predictions})
results.to_csv('predictions.csv')
Test 4: Timeout Handling
# Set CODE_EXECUTION_TIMEOUT=60
# Upload data.csv
# Ask: "Run complex computation"
# AI generates code that takes 70 seconds
# Result: TimeoutError after 60 seconds with clear message
📚 Documentation Files
- UNIFIED_FILE_SYSTEM_SUMMARY.md - Complete file system overview
- ALL_FILE_TYPES_AND_TIMEOUT_UPDATE.md - Detailed implementation
- QUICK_REFERENCE_FILE_TYPES_TIMEOUT.md - Quick reference guide
- THIS FILE - Complete summary
✅ Verification Checklist
- Files saved to code_interpreter system
- Per-user file limits enforced (MAX_FILES_PER_USER)
- Files expire automatically (FILE_EXPIRATION_HOURS)
- 200+ file types supported
- Files accessible via file_id
- Smart load_file() auto-detection
- All work runs through code_interpreter
- Removed deprecated analyze_data_file
- Configurable timeout (CODE_EXECUTION_TIMEOUT)
- Timeout only counts code execution
- Auto-install packages
- Auto-capture generated files
- MongoDB stores metadata only
- Disk cleanup on expiration
- Clear error messages
- Production-ready security
🎉 Result
The bot now has a production-ready, ChatGPT-like file handling system:
- ✅ Upload any file (200+ types)
- ✅ Automatic management (limits, expiration, cleanup)
- ✅ Smart loading (auto-detects type)
- ✅ Unified execution (one code interpreter)
- ✅ Configurable timeout (smart timing)
- ✅ Auto-everything (packages, outputs, cleanup)
Simple. Powerful. Production-Ready. 🚀