
cauvang32 9c180bdd89 Refactor OpenAI utilities and remove Python executor

- Removed the `analyze_data_file` function from tool definitions to streamline functionality.
- Enhanced the `execute_python_code` function description to clarify auto-installation of packages and file handling.
- Deleted the `python_executor.py` module to simplify the codebase and improve maintainability.
- Introduced a new `token_counter.py` module for efficient token counting for OpenAI API requests, including support for Discord image links and cost estimation.

2025-10-02 21:49:48 +07:00

15 KiB


Unified File System - Complete Implementation Summary

🎯 Overview

The bot now has a fully unified file management system where:

✅ All files are saved under per-user limits (configurable in .env)
✅ All files are accessible to code_interpreter and AI models via file_id
✅ All work (data analysis, Python code, etc.) runs through code_interpreter

📋 Key Features

1. File Storage & Limits

Location: /tmp/bot_code_interpreter/user_files/{user_id}/
Metadata: MongoDB (file_id, filename, file_type, file_size, expires_at, etc.)
Per-User Limit: Configurable via MAX_FILES_PER_USER in .env (default: 20)
Auto-Cleanup: When the limit is reached, the oldest file is automatically deleted
Expiration: Files expire after FILE_EXPIRATION_HOURS (default: 48 hours, -1 for permanent)
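
The limit-and-cleanup rule above can be illustrated with a minimal sketch. It assumes an in-memory list standing in for the MongoDB metadata; FileRecord and add_file are hypothetical names for illustration, not the bot's actual API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class FileRecord:
    """Hypothetical stand-in for one metadata document."""
    file_id: str
    filename: str
    uploaded_at: datetime
    expires_at: Optional[datetime]  # None means permanent storage


def add_file(records: List[FileRecord], new: FileRecord,
             max_files: int = 20) -> List[FileRecord]:
    """Add `new`, then evict the oldest uploads until at most `max_files` remain.

    Returns the evicted records; a real caller would also delete them
    from disk and from the database.
    """
    if max_files < 1:
        raise ValueError("max_files must be at least 1")
    records.append(new)
    records.sort(key=lambda r: r.uploaded_at)  # oldest first
    overflow = max(0, len(records) - max_files)
    evicted = records[:overflow]
    del records[:overflow]
    return evicted


if __name__ == "__main__":
    start = datetime(2025, 10, 2, 10, 30)
    files: List[FileRecord] = []
    evicted: List[FileRecord] = []
    for i in range(21):
        rec = FileRecord(f"id{i}", f"f{i}.csv", start + timedelta(minutes=i), None)
        evicted = add_file(files, rec, max_files=20)
    print(len(files), [r.file_id for r in evicted])  # 20 ['id0']
```

Evicting after the append keeps the rule simple: the 21st upload always succeeds and the oldest record is the one that goes.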

2. Supported File Types (80+ types; selection below)

# Tabular Data
.csv, .tsv, .xlsx, .xls, .xlsm, .xlsb, .ods

# Structured Data
.json, .jsonl, .ndjson, .xml, .yaml, .yml, .toml

# Database
.db, .sqlite, .sqlite3, .sql

# Scientific/Binary
.parquet, .feather, .hdf, .hdf5, .h5, .pickle, .pkl,
.joblib, .npy, .npz, .mat, .sav, .dta, .sas7bdat

# Text/Code
.txt, .log, .py, .r, .R

# Geospatial
.geojson, .shp, .kml, .gpx

3. File Access in Code

All user files are automatically accessible via:

# AI generates code like this:
df = load_file('file_id_abc123')  # Auto-detects type!

# Automatically handles:
# - CSV → pd.read_csv()
# - Excel → pd.read_excel()
# - JSON → json.load() or pd.read_json()
# - Parquet → pd.read_parquet()
# - HDF5 → pd.read_hdf()
# - And 75+ more types!
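
One way such auto-detection could work is a dispatch table keyed by file extension. The sketch below is an assumption about the approach, not the bot's actual load_file; LOADERS, pick_loader, and load_path are hypothetical names, and only a handful of the listed types are covered.

```python
from pathlib import Path
from typing import Any, Dict, Tuple

# Hypothetical table: extension -> (pandas function name, default keyword arguments).
LOADERS: Dict[str, Tuple[str, Dict[str, Any]]] = {
    ".csv": ("read_csv", {}),
    ".tsv": ("read_csv", {"sep": "\t"}),
    ".xlsx": ("read_excel", {}),
    ".xls": ("read_excel", {}),
    ".json": ("read_json", {}),
    ".jsonl": ("read_json", {"lines": True}),
    ".parquet": ("read_parquet", {}),
    ".h5": ("read_hdf", {}),
    ".hdf5": ("read_hdf", {}),
}


def pick_loader(filename: str) -> Tuple[str, Dict[str, Any]]:
    """Return the loader name and default kwargs for a filename."""
    ext = Path(filename).suffix.lower()
    if ext not in LOADERS:
        raise ValueError(f"no loader registered for {ext!r}")
    return LOADERS[ext]


def load_path(path: str, **kwargs: Any) -> Any:
    """Load a file with the matching pandas reader."""
    import pandas as pd  # imported lazily so pick_loader works without pandas

    name, defaults = pick_loader(path)
    return getattr(pd, name)(path, **{**defaults, **kwargs})
```

A table like this keeps each new file type to a one-line addition, which matters when the list runs to dozens of extensions.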

4. Unified Execution Path

User uploads file (ANY type)
    ↓
upload_discord_attachment()
    ↓
Saved to /tmp/bot_code_interpreter/user_files/{user_id}/
    ↓
MongoDB: file_id, expires_at, metadata
    ↓
User asks AI to analyze
    ↓
AI generates Python code with load_file('file_id')
    ↓
execute_python_code() runs via code_interpreter
    ↓
Files auto-loaded, packages auto-installed
    ↓
Generated files (plots, CSVs, etc.) auto-sent to user
    ↓
After expiration → Auto-deleted (disk + DB)
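
The final step of the flow, deleting expired files from both disk and the database, might look roughly like the sketch below. A plain list of dicts stands in for the MongoDB collection, and purge_expired is a hypothetical name.

```python
import os
import tempfile
from datetime import datetime, timedelta
from typing import Any, Dict, List


def purge_expired(records: List[Dict[str, Any]], now: datetime) -> List[str]:
    """Remove expired files from disk and drop their records.

    `records` stands in for the metadata collection. A record whose
    `expires_at` is None is treated as permanent and never purged.
    Returns the file_ids that were removed.
    """
    removed = []
    for rec in list(records):  # iterate over a copy while mutating
        expires_at = rec.get("expires_at")
        if expires_at is None or expires_at > now:
            continue
        try:
            os.remove(rec["file_path"])
        except FileNotFoundError:
            pass  # already gone from disk; still drop the record
        records.remove(rec)
        removed.append(rec["file_id"])
    return removed


if __name__ == "__main__":
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    now = datetime(2025, 10, 4, 10, 30)
    recs = [
        {"file_id": "old", "file_path": path, "expires_at": now - timedelta(hours=1)},
        {"file_id": "keep", "file_path": "/nonexistent/keep.csv", "expires_at": None},
    ]
    print(purge_expired(recs, now), os.path.exists(path))  # ['old'] False
```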

⚙️ Configuration (.env)

# File expiration (hours)
FILE_EXPIRATION_HOURS=48    # Files expire after 48 hours
# FILE_EXPIRATION_HOURS=-1  # Or set to -1 for permanent storage

# Maximum files per user
MAX_FILES_PER_USER=20       # Each user can have up to 20 files
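
As a rough illustration of how these two settings could be read and applied, the sketch below parses them from the environment and derives an expiry timestamp. load_file_settings and compute_expires_at are hypothetical helpers, and accepting fractional hours is an assumption.

```python
import os
from datetime import datetime, timedelta
from typing import Mapping, Optional, Tuple


def load_file_settings(env: Mapping[str, str] = os.environ) -> Tuple[float, int]:
    """Read both settings, falling back to the documented defaults."""
    hours = float(env.get("FILE_EXPIRATION_HOURS", "48"))
    max_files = int(env.get("MAX_FILES_PER_USER", "20"))
    return hours, max_files


def compute_expires_at(uploaded_at: datetime, hours: float) -> Optional[datetime]:
    """Return the expiry time, or None when hours is -1 (permanent storage)."""
    if hours == -1:
        return None
    return uploaded_at + timedelta(hours=hours)


if __name__ == "__main__":
    hours, max_files = load_file_settings({})
    uploaded = datetime(2025, 10, 2, 10, 30)
    print(compute_expires_at(uploaded, hours))  # 2025-10-04 10:30:00
```

With the default of 48 hours, a file uploaded at 2025-10-02 10:30 expires at 2025-10-04 10:30.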

🔧 Implementation Details

Updated Files

1. src/module/message_handler.py

✅ Removed analyze_data_file tool (deprecated)
✅ Updated DATA_FILE_EXTENSIONS to support 80+ types
✅ Rewrote _download_and_save_data_file() to use upload_discord_attachment()
✅ Rewrote _handle_data_file() to show detailed upload info
✅ Updated _execute_python_code() to fetch all user files from DB
✅ Files passed as user_files array to code_interpreter

2. src/config/config.py

✅ Added FILE_EXPIRATION_HOURS config
✅ Added MAX_FILES_PER_USER config
✅ Updated NORMAL_CHAT_PROMPT to reflect new file system
✅ Removed references to deprecated analyze_data_file tool

3. src/utils/openai_utils.py

✅ Removed analyze_data_file tool definition
✅ Only execute_python_code tool remains for all code execution

4. .env

✅ Added MAX_FILES_PER_USER=20
✅ Already had FILE_EXPIRATION_HOURS=48

📊 User Experience

File Upload

📊 File Uploaded Successfully!

📁 Name: data.csv
📦 Type: CSV
💾 Size: 1.2 MB
🆔 File ID: abc123xyz789
⏰ Expires: 2025-10-04 10:30:00
📂 Your Files: 3/20

✅ Ready for processing! You can now:
• Ask me to analyze this data
• Request visualizations or insights
• Write Python code to process it
• The file is automatically accessible in code execution

💡 Examples:
Analyze this data and show key statistics
Create visualizations from this file
Show me the first 10 rows
Plot correlations between all numeric columns

Code Execution

# AI automatically generates code like:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load user's file (file_id from context)
df = load_file('abc123xyz789')  # Auto-detects CSV!

# Analyze
print(df.describe())
print(f"\nShape: {df.shape}")

# Visualize
sns.heatmap(df.corr(numeric_only=True), annot=True)  # numeric columns only; avoids errors on text columns
plt.savefig('correlation_heatmap.png')

# Export results
df.describe().to_csv('statistics.csv')

All generated files are automatically sent to the user!

🔒 Security & Limits

Per-User Limits

Max Files: 20 (configurable)
Auto-Cleanup: Oldest file deleted when limit reached
Expiration: 48 hours (configurable)

File Validation

✅ File type detection
✅ Size validation
✅ Extension checking
✅ Malicious file prevention

Isolation

✅ Each user has separate directory
✅ Code executed in isolated venv
✅ Files only accessible to owner
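
Per-user directory isolation usually relies on rejecting any path that resolves outside the owner's directory. The sketch below shows one such check; user_file_path is a hypothetical helper, and treating user IDs as purely numeric is an assumption based on the Discord ID shown in this document.

```python
from pathlib import Path

BASE_DIR = Path("/tmp/bot_code_interpreter/user_files")


def user_file_path(user_id: str, filename: str, base_dir: Path = BASE_DIR) -> Path:
    """Resolve `filename` inside the user's directory, rejecting escapes.

    Raises ValueError for non-numeric user IDs and for filenames that
    would resolve outside (or to) the user's own directory.
    """
    if not user_id.isdigit():
        raise ValueError("user_id must be numeric")
    user_dir = (base_dir / user_id).resolve()
    candidate = (user_dir / filename).resolve()
    # The user's directory must be a strict ancestor of the resolved path.
    if user_dir not in candidate.parents:
        raise ValueError(f"path escapes user directory: {filename!r}")
    return candidate


if __name__ == "__main__":
    print(user_file_path("878573881449906208", "data.csv"))
    for bad in ("../1/data.csv", "/etc/passwd"):
        try:
            user_file_path("878573881449906208", bad)
        except ValueError as exc:
            print("rejected:", exc)
```

Resolving both paths before comparing them handles `..` segments and absolute filenames in one check.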

🚀 Benefits

For Users

Simple Upload: Just drag & drop any data file
Natural Interaction: "Analyze this file" - AI handles the rest
Multiple Files: Up to 20 files, automatically managed
Auto-Cleanup: Files expire automatically, no manual deletion needed
Rich Output: Get plots, CSVs, reports automatically

For System

Unified: One code execution system for everything
Scalable: Per-user limits prevent abuse
Efficient: Auto-cleanup prevents disk bloat
Flexible: Support 80+ file types
Simple: AI just writes normal Python code

For AI Model

Natural: Just use load_file('file_id')
Auto-Install: Import any package, auto-installs
Auto-Output: Create files, automatically shared
Context-Aware: Knows about user's uploaded files
Powerful: Full pandas/numpy/scipy/sklearn/tensorflow stack

🧪 Testing

Test File Upload

Upload CSV file → Should show detailed info with file_id
Check 📂 Your Files: 1/20 counter
Ask "analyze this data"
AI should generate code with load_file('file_id')
Code executes, results sent back

Test File Limit

Upload 20 files
Upload 21st file → Oldest should be auto-deleted
Counter should show 20/20

Test File Types

CSV: pd.read_csv() auto-detected
Excel: pd.read_excel() auto-detected
JSON: json.load() or pd.read_json() auto-detected
Parquet: pd.read_parquet() auto-detected
etc.

Test Expiration

Set FILE_EXPIRATION_HOURS=0.1 (6 minutes)
Upload file
Wait 6+ minutes
File should be auto-deleted

📚 Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Discord User                           │
└────────────────────────┬────────────────────────────────────┘
                         │ Upload file
                         ↓
┌─────────────────────────────────────────────────────────────┐
│                message_handler.py                           │
│  - _handle_data_file()                                      │
│  - _download_and_save_data_file()                           │
│  - Enforces MAX_FILES_PER_USER limit                        │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ↓
┌─────────────────────────────────────────────────────────────┐
│             code_interpreter.py                             │
│  - upload_discord_attachment()                              │
│  - Saves to /tmp/bot_code_interpreter/user_files/           │
│  - Stores metadata in MongoDB                               │
│  - Returns file_id                                          │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ↓
┌─────────────────────────────────────────────────────────────┐
│                    MongoDB                                  │
│  Collection: user_files                                     │
│  {                                                          │
│    file_id: "abc123",                                       │
│    user_id: "878573881449906208",                           │
│    filename: "data.csv",                                    │
│    file_path: "/tmp/.../abc123.csv",                        │
│    file_type: "csv",                                        │
│    file_size: 1234567,                                      │
│    uploaded_at: "2025-10-02T10:30:00",                      │
│    expires_at: "2025-10-04T10:30:00"                        │
│  }                                                          │
└─────────────────────────────────────────────────────────────┘
                         │
                         │ User asks to analyze
                         ↓
┌─────────────────────────────────────────────────────────────┐
│                    AI Model                                 │
│  - Sees file_id in conversation context                     │
│  - Generates Python code:                                   │
│    df = load_file('abc123')                                 │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ↓
┌─────────────────────────────────────────────────────────────┐
│            message_handler.py                               │
│  - _execute_python_code()                                   │
│  - Fetches all user files from DB                           │
│  - Passes user_files=[file_id1, file_id2, ...]              │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ↓
┌─────────────────────────────────────────────────────────────┐
│             code_interpreter.py                             │
│  - execute_code()                                           │
│  - Injects load_file() function                             │
│  - Maps file_id → file_path                                 │
│  - Auto-installs packages                                   │
│  - Captures generated files                                 │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ↓
┌─────────────────────────────────────────────────────────────┐
│                 Isolated venv                               │
│  FILES = {'abc123': '/tmp/.../abc123.csv'}                  │
│                                                             │
│  def load_file(file_id):                                    │
│      path = FILES[file_id]                                  │
│      # Auto-detect: CSV, Excel, JSON, etc.                  │
│      return pd.read_csv(path)  # or appropriate loader      │
│                                                             │
│  # User's code executes here                                │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ↓
┌─────────────────────────────────────────────────────────────┐
│              Generated Files                                │
│  - plots.png                                                │
│  - results.csv                                              │
│  - report.txt                                               │
│  → Auto-captured and sent to Discord user                   │
└─────────────────────────────────────────────────────────────┘
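
The injection step in the diagram (a FILES map plus a load_file helper placed ahead of the user's code) can be sketched as follows. This is an illustration under stated assumptions, not the code_interpreter implementation: run_user_code and PREAMBLE are hypothetical names, the current Python interpreter stands in for the isolated venv, and stdlib csv/json readers stand in for the pandas loaders.

```python
import os
import subprocess
import sys
import tempfile
from typing import Dict

# Hypothetical preamble prepended to the user's code. Stdlib readers are
# used here as stand-ins for the pandas loaders shown in the diagram.
PREAMBLE = '''
import csv, json

FILES = {files!r}

def load_file(file_id):
    path = FILES[file_id]
    if path.endswith(".json"):
        with open(path) as fh:
            return json.load(fh)
    with open(path, newline="") as fh:
        return list(csv.DictReader(fh))
'''


def run_user_code(user_code: str, files: Dict[str, str], timeout: float = 30.0) -> str:
    """Run `user_code` in a child interpreter with FILES and load_file injected."""
    script = PREAMBLE.format(files=files) + "\n" + user_code
    result = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True, text=True, timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr)
    return result.stdout


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "abc123.csv")
        with open(path, "w", newline="") as fh:
            fh.write("a,b\n1,2\n3,4\n")
        print(run_user_code("print(len(load_file('abc123')))", {"abc123": path}))
```

Running the code in a separate process keeps the bot's own state out of reach of generated code, though a child process alone is not a security sandbox.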

✅ Verification Checklist

Files saved to code_interpreter system
Files expire after configured hours
Per-user file limits enforced
80+ file types supported
Files accessible via file_id
All analysis runs through execute_python_code
Removed deprecated analyze_data_file tool
Auto-installs packages on import
Auto-captures generated files
MongoDB stores only metadata
Disk cleanup on expiration
Oldest file deleted when limit reached
Detailed upload confirmation shown
File context added to conversation
AI prompt updated with new system

🎉 Result

Before: Separate tools, temp directories, manual cleanup, limited file types
After: One unified system, automatic everything, 80+ file types, production-ready!

The system now works much like ChatGPT's file handling - simple, powerful, and automatic! 🚀
