- Removed the `analyze_data_file` function from tool definitions to streamline functionality.
- Enhanced the `execute_python_code` function description to clarify package auto-installation and file handling.
- Deleted the `python_executor.py` module to simplify the codebase and improve maintainability.
- Introduced a new `token_counter.py` module for efficient token counting of OpenAI API requests, including support for Discord image links and cost estimation.
Unified File System - Complete Implementation Summary
🎯 Overview
The bot now has a fully unified file management system where:
- ✅ All files saved with per-user limits (configurable in `.env`)
- ✅ All files accessible by code_interpreter and AI models via `file_id`
- ✅ All work (data analysis, Python code, etc.) runs through `code_interpreter`
📋 Key Features
1. File Storage & Limits
- Location: `/tmp/bot_code_interpreter/user_files/{user_id}/`
- Metadata: MongoDB (file_id, filename, file_type, file_size, expires_at, etc.)
- Per-User Limit: Configurable via `MAX_FILES_PER_USER` in `.env` (default: 20)
- Auto-Cleanup: When the limit is reached, the oldest file is automatically deleted (see the sketch below)
- Expiration: Files expire after `FILE_EXPIRATION_HOURS` (default: 48 hours; -1 for permanent)
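A minimal sketch of how the upload path can enforce these limits, assuming a pymongo `user_files` collection; the helper name `save_user_file` is illustrative, not the actual implementation:

```python
import os
import uuid
from datetime import datetime, timedelta, timezone

BASE_DIR = "/tmp/bot_code_interpreter/user_files"
MAX_FILES_PER_USER = int(os.getenv("MAX_FILES_PER_USER", "20"))
FILE_EXPIRATION_HOURS = float(os.getenv("FILE_EXPIRATION_HOURS", "48"))

def save_user_file(db, user_id: str, filename: str, data: bytes) -> str:
    """Save an upload under the user's directory, evicting the oldest file at the limit."""
    if db.user_files.count_documents({"user_id": user_id}) >= MAX_FILES_PER_USER:
        oldest = db.user_files.find_one({"user_id": user_id}, sort=[("uploaded_at", 1)])
        if oldest:
            try:
                os.remove(oldest["file_path"])
            except FileNotFoundError:
                pass
            db.user_files.delete_one({"_id": oldest["_id"]})

    file_id = uuid.uuid4().hex[:12]
    user_dir = os.path.join(BASE_DIR, user_id)
    os.makedirs(user_dir, exist_ok=True)
    ext = os.path.splitext(filename)[1].lower()
    file_path = os.path.join(user_dir, file_id + ext)
    with open(file_path, "wb") as f:
        f.write(data)

    now = datetime.now(timezone.utc)
    # A negative FILE_EXPIRATION_HOURS means the file never expires.
    expires_at = None if FILE_EXPIRATION_HOURS < 0 else now + timedelta(hours=FILE_EXPIRATION_HOURS)
    db.user_files.insert_one({
        "file_id": file_id, "user_id": user_id, "filename": filename,
        "file_path": file_path, "file_type": ext.lstrip("."),
        "file_size": len(data), "uploaded_at": now, "expires_at": expires_at,
    })
    return file_id
```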
2. Supported File Types (80+ types)
```
# Tabular Data
.csv, .tsv, .xlsx, .xls, .xlsm, .xlsb, .ods

# Structured Data
.json, .jsonl, .ndjson, .xml, .yaml, .yml, .toml

# Database
.db, .sqlite, .sqlite3, .sql

# Scientific/Binary
.parquet, .feather, .hdf, .hdf5, .h5, .pickle, .pkl,
.joblib, .npy, .npz, .mat, .sav, .dta, .sas7bdat

# Text/Code
.txt, .log, .py, .r, .R

# Geospatial
.geojson, .shp, .kml, .gpx
```
3. File Access in Code
All user files are automatically accessible via:
```python
# AI generates code like this:
df = load_file('file_id_abc123')  # Auto-detects type!

# Automatically handles:
# - CSV → pd.read_csv()
# - Excel → pd.read_excel()
# - JSON → json.load() or pd.read_json()
# - Parquet → pd.read_parquet()
# - HDF5 → pd.read_hdf()
# - And 75+ more types!
```
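Under the hood, `load_file()` can be as simple as an extension-to-reader dispatch table. The sketch below is illustrative only; the `FILES` mapping (injected by the interpreter) and the exact loader set are assumptions:

```python
import os
import pandas as pd

# Hypothetical mapping injected by the interpreter: file_id -> absolute path.
FILES = {"file_id_abc123": "/tmp/bot_code_interpreter/user_files/123/abc123.csv"}

LOADERS = {
    ".csv": pd.read_csv,
    ".tsv": lambda p: pd.read_csv(p, sep="\t"),
    ".xlsx": pd.read_excel,
    ".xls": pd.read_excel,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
    ".feather": pd.read_feather,
    ".pkl": pd.read_pickle,
}

def load_file(file_id):
    """Resolve a file_id to its on-disk path and load it with a type-appropriate reader."""
    path = FILES[file_id]
    ext = os.path.splitext(path)[1].lower()
    loader = LOADERS.get(ext)
    if loader is None:
        # Unknown extension: fall back to raw text.
        with open(path, "r", errors="replace") as f:
            return f.read()
    return loader(path)
```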
4. Unified Execution Path
```
User uploads file (ANY type)
        ↓
upload_discord_attachment()
        ↓
Saved to /tmp/bot_code_interpreter/user_files/{user_id}/
        ↓
MongoDB: file_id, expires_at, metadata
        ↓
User asks AI to analyze
        ↓
AI generates Python code with load_file('file_id')
        ↓
execute_python_code() runs via code_interpreter
        ↓
Files auto-loaded, packages auto-installed
        ↓
Generated files (plots, CSVs, etc.) auto-sent to user
        ↓
After expiration → Auto-deleted (disk + DB)
```
⚙️ Configuration (.env)
```bash
# File expiration (hours)
FILE_EXPIRATION_HOURS=48      # Files expire after 48 hours
# FILE_EXPIRATION_HOURS=-1    # Or set to -1 for permanent storage

# Maximum files per user
MAX_FILES_PER_USER=20         # Each user can have up to 20 files
```
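A minimal sketch of how `src/config/config.py` might parse these values; the helper name `_env_float` is an assumption, only the two variable names come from the actual `.env`:

```python
import os

def _env_float(name: str, default: float) -> float:
    """Read a numeric env var, tolerating fractional hours like 0.1."""
    try:
        return float(os.getenv(name, str(default)))
    except ValueError:
        return default

# Any negative value is treated as "never expires".
FILE_EXPIRATION_HOURS = _env_float("FILE_EXPIRATION_HOURS", 48)
MAX_FILES_PER_USER = int(_env_float("MAX_FILES_PER_USER", 20))
```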
🔧 Implementation Details
Updated Files
1. `src/module/message_handler.py`
   - ✅ Removed the `analyze_data_file` tool (deprecated)
   - ✅ Updated `DATA_FILE_EXTENSIONS` to support 80+ types
   - ✅ Rewrote `_download_and_save_data_file()` to use `upload_discord_attachment()`
   - ✅ Rewrote `_handle_data_file()` to show detailed upload info
   - ✅ Updated `_execute_python_code()` to fetch all user files from the DB (sketched below)
   - ✅ Files are passed as a `user_files` array to code_interpreter

2. `src/config/config.py`
   - ✅ Added the `FILE_EXPIRATION_HOURS` config
   - ✅ Added the `MAX_FILES_PER_USER` config
   - ✅ Updated `NORMAL_CHAT_PROMPT` to reflect the new file system
   - ✅ Removed references to the deprecated `analyze_data_file` tool

3. `src/utils/openai_utils.py`
   - ✅ Removed the `analyze_data_file` tool definition
   - ✅ Only the `execute_python_code` tool remains for all code execution

4. `.env`
   - ✅ Added `MAX_FILES_PER_USER=20`
   - ✅ Already had `FILE_EXPIRATION_HOURS=48`
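As a rough sketch, the rewritten `_execute_python_code()` could look like the following; `self.db`, the method signature, and the exact parameters of `code_interpreter.execute_code()` are assumptions for illustration:

```python
async def _execute_python_code(self, code: str, user_id: str) -> dict:
    """Run AI-generated code with all of the user's files attached."""
    # Collect every file_id the user owns; the interpreter maps IDs to paths.
    docs = self.db.user_files.find({"user_id": user_id}, {"file_id": 1})
    user_files = [doc["file_id"] for doc in docs]

    return await code_interpreter.execute_code(
        code=code,
        user_id=user_id,
        user_files=user_files,  # makes load_file('<id>') work inside the sandbox
    )
```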
📊 User Experience
File Upload
```
📊 File Uploaded Successfully!

📁 Name: data.csv
📦 Type: CSV
💾 Size: 1.2 MB
🆔 File ID: abc123xyz789
⏰ Expires: 2025-10-04 10:30:00
📂 Your Files: 3/20

✅ Ready for processing! You can now:
• Ask me to analyze this data
• Request visualizations or insights
• Write Python code to process it
• The file is automatically accessible in code execution

💡 Examples:
"Analyze this data and show key statistics"
"Create visualizations from this file"
"Show me the first 10 rows"
"Plot correlations between all numeric columns"
```
Code Execution
```python
# AI automatically generates code like:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the user's file (file_id from context)
df = load_file('abc123xyz789')  # Auto-detects CSV!

# Analyze
print(df.describe())
print(f"\nShape: {df.shape}")

# Visualize (numeric_only avoids errors on non-numeric columns)
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.savefig('correlation_heatmap.png')

# Export results
df.describe().to_csv('statistics.csv')
```
All generated files are automatically sent to the user!
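One plausible way to capture those outputs is to diff the working directory before and after execution; the snapshot approach below is an assumption, not necessarily the actual mechanism:

```python
import os

def run_and_capture(workdir: str, run_code) -> list[str]:
    """Execute code in workdir and return the paths of any files it created."""
    before = set(os.listdir(workdir))
    run_code()  # execute the user's script with cwd=workdir
    after = set(os.listdir(workdir))
    # Anything new in the working directory is treated as output for Discord.
    return [os.path.join(workdir, name) for name in sorted(after - before)]
```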
🔒 Security & Limits
Per-User Limits
- Max Files: 20 (configurable)
- Auto-Cleanup: Oldest file deleted when limit reached
- Expiration: 48 hours (configurable)
File Validation
- ✅ File type detection
- ✅ Size validation
- ✅ Extension checking
- ✅ Malicious file prevention
Isolation
- ✅ Each user has separate directory
- ✅ Code executed in an isolated venv (see the sketch below)
- ✅ Files only accessible to owner
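A rough sketch of the venv-based isolation, with illustrative paths and an assumed wall-clock limit:

```python
import subprocess
import venv

def run_isolated(script_path: str, env_dir: str, timeout: int = 60) -> subprocess.CompletedProcess:
    """Run a script with a dedicated virtualenv's interpreter, capturing output."""
    venv.create(env_dir, with_pip=True)  # separate environment per execution
    return subprocess.run(
        [f"{env_dir}/bin/python", script_path],
        capture_output=True,
        text=True,
        timeout=timeout,  # hard limit on user code runtime
    )
```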
🚀 Benefits
For Users
- Simple Upload: Just drag & drop any data file
- Natural Interaction: "Analyze this file" - AI handles the rest
- Multiple Files: Up to 20 files, automatically managed
- Auto-Cleanup: Files expire automatically, no manual deletion needed
- Rich Output: Get plots, CSVs, reports automatically
For System
- Unified: One code execution system for everything
- Scalable: Per-user limits prevent abuse
- Efficient: Auto-cleanup prevents disk bloat
- Flexible: Supports 80+ file types
- Simple: AI just writes normal Python code
For AI Model
- Natural: Just use `load_file('file_id')`
- Auto-Install: Import any package and it is installed automatically (see the sketch below)
- Auto-Output: Files the code creates are automatically shared
- Context-Aware: Knows about the user's uploaded files
- Powerful: Full pandas/numpy/scipy/sklearn/tensorflow stack
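The auto-install behavior is commonly implemented by retrying on `ModuleNotFoundError`; this sketch assumes that pattern rather than describing the actual code:

```python
import subprocess
import sys

def exec_with_autoinstall(code: str, namespace: dict, max_retries: int = 5) -> None:
    """Execute code, pip-installing any missing module and retrying."""
    for _ in range(max_retries):
        try:
            exec(code, namespace)
            return
        except ModuleNotFoundError as e:
            # e.name is the missing top-level module; install it and retry.
            subprocess.check_call([sys.executable, "-m", "pip", "install", e.name])
    exec(code, namespace)  # final attempt; let any error propagate
```

In practice this also needs a small alias table (e.g. `cv2` → `opencv-python`, `sklearn` → `scikit-learn`), since import names don't always match PyPI package names.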
🧪 Testing
Test File Upload
- Upload a CSV file → should show detailed info with a file_id
- Check the `📂 Your Files: 1/20` counter
- Ask "analyze this data"
- AI should generate code with `load_file('file_id')`
- Code executes, results are sent back
Test File Limit
- Upload 20 files
- Upload a 21st file → the oldest should be auto-deleted
- Counter should show `20/20`
Test File Types
- CSV: `pd.read_csv()` auto-detected
- Excel: `pd.read_excel()` auto-detected
- JSON: `json.load()` or `pd.read_json()` auto-detected
- Parquet: `pd.read_parquet()` auto-detected
- etc.
Test Expiration
- Set `FILE_EXPIRATION_HOURS=0.1` (6 minutes)
- Upload a file
- Wait 6+ minutes
- The file should be auto-deleted (the cleanup job is sketched below)
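Expiration is typically enforced by a periodic cleanup task; this sketch assumes an asyncio loop over the same `user_files` collection described above:

```python
import asyncio
import os
from datetime import datetime, timezone

async def cleanup_expired_files(db, interval_seconds: int = 300) -> None:
    """Periodically delete expired files from disk and their metadata from MongoDB."""
    while True:
        now = datetime.now(timezone.utc)
        # expires_at is None for permanent files (FILE_EXPIRATION_HOURS=-1).
        for doc in db.user_files.find({"expires_at": {"$ne": None, "$lt": now}}):
            try:
                os.remove(doc["file_path"])
            except FileNotFoundError:
                pass  # already gone from disk; still drop the metadata
            db.user_files.delete_one({"_id": doc["_id"]})
        await asyncio.sleep(interval_seconds)
```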
📚 Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ Discord User │
└────────────────────────┬────────────────────────────────────┘
│ Upload file
↓
┌─────────────────────────────────────────────────────────────┐
│ message_handler.py │
│ - _handle_data_file() │
│ - _download_and_save_data_file() │
│ - Enforces MAX_FILES_PER_USER limit │
└────────────────────────┬────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────┐
│ code_interpreter.py │
│ - upload_discord_attachment() │
│ - Saves to /tmp/bot_code_interpreter/user_files/ │
│ - Stores metadata in MongoDB │
│ - Returns file_id │
└────────────────────────┬────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────┐
│ MongoDB │
│ Collection: user_files │
│ { │
│ file_id: "abc123", │
│ user_id: "878573881449906208", │
│ filename: "data.csv", │
│ file_path: "/tmp/.../abc123.csv", │
│ file_type: "csv", │
│ file_size: 1234567, │
│ uploaded_at: "2025-10-02T10:30:00", │
│ expires_at: "2025-10-04T10:30:00" │
│ } │
└─────────────────────────────────────────────────────────────┘
│
│ User asks to analyze
↓
┌─────────────────────────────────────────────────────────────┐
│ AI Model │
│ - Sees file_id in conversation context │
│ - Generates Python code: │
│ df = load_file('abc123') │
└────────────────────────┬────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────┐
│ message_handler.py │
│ - _execute_python_code() │
│ - Fetches all user files from DB │
│ - Passes user_files=[file_id1, file_id2, ...] │
└────────────────────────┬────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────┐
│ code_interpreter.py │
│ - execute_code() │
│ - Injects load_file() function │
│ - Maps file_id → file_path │
│ - Auto-installs packages │
│ - Captures generated files │
└────────────────────────┬────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────┐
│ Isolated venv │
│ FILES = {'abc123': '/tmp/.../abc123.csv'} │
│ │
│ def load_file(file_id): │
│ path = FILES[file_id] │
│ # Auto-detect: CSV, Excel, JSON, etc. │
│ return pd.read_csv(path) # or appropriate loader │
│ │
│ # User's code executes here │
└────────────────────────┬────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────────┐
│ Generated Files │
│ - plots.png │
│ - results.csv │
│ - report.txt │
│ → Auto-captured and sent to Discord user │
└─────────────────────────────────────────────────────────────┘
```
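The "Injects load_file() function" step in the diagram can be realized by prepending a preamble to the user's code before it is written into the sandbox; the template below is purely illustrative and complements the dispatch sketch shown earlier:

```python
import json

PREAMBLE = """\
import os, pandas as pd

FILES = {files_json}

def load_file(file_id):
    path = FILES[file_id]
    ext = os.path.splitext(path)[1].lower()
    if ext in ('.xlsx', '.xls'):
        return pd.read_excel(path)
    if ext == '.parquet':
        return pd.read_parquet(path)
    return pd.read_csv(path)  # default for tabular text
"""

def build_script(user_code: str, file_map: dict) -> str:
    """Prepend the file-access preamble so AI-generated code can call load_file()."""
    return PREAMBLE.format(files_json=json.dumps(file_map)) + "\n" + user_code
```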
✅ Verification Checklist
- Files saved to code_interpreter system
- Files expire after configured hours
- Per-user file limits enforced
- 80+ file types supported
- Files accessible via file_id
- All analysis runs through execute_python_code
- Removed deprecated analyze_data_file tool
- Auto-installs packages on import
- Auto-captures generated files
- MongoDB stores only metadata
- Disk cleanup on expiration
- Oldest file deleted when limit reached
- Detailed upload confirmation shown
- File context added to conversation
- AI prompt updated with new system
🎉 Result
Before: Separate tools, temp directories, manual cleanup, limited file types.
After: One unified system, automatic everything, 80+ file types, production-ready!
The system now works much like ChatGPT's file handling: simple, powerful, and automatic! 🚀