Files
ChatGPT-Discord-Bot/docs/UNIFIED_FILE_SYSTEM_SUMMARY.md
cauvang32 9c180bdd89 Refactor OpenAI utilities and remove Python executor
- Removed the `analyze_data_file` function from tool definitions to streamline functionality.
- Enhanced the `execute_python_code` function description to clarify auto-installation of packages and file handling.
- Deleted the `python_executor.py` module to simplify the codebase and improve maintainability.
- Introduced a new `token_counter.py` module for efficient token counting for OpenAI API requests, including support for Discord image links and cost estimation.
2025-10-02 21:49:48 +07:00

15 KiB

Unified File System - Complete Implementation Summary

🎯 Overview

The bot now has a fully unified file management system where:

  1. All files saved with per-user limits (configurable in .env)
  2. All files accessible by code_interpreter and AI models via file_id
  3. All work (data analysis, Python code, etc.) runs through code_interpreter

📋 Key Features

1. File Storage & Limits

  • Location: /tmp/bot_code_interpreter/user_files/{user_id}/
  • Metadata: MongoDB (file_id, filename, file_type, file_size, expires_at, etc.)
  • Per-User Limit: Configurable via MAX_FILES_PER_USER in .env (default: 20)
  • Auto-Cleanup: When limit reached, oldest file is automatically deleted
  • Expiration: Files expire after FILE_EXPIRATION_HOURS (default: 48 hours, -1 for permanent)

2. Supported File Types (80+ types)

# Tabular Data
.csv, .tsv, .xlsx, .xls, .xlsm, .xlsb, .ods

# Structured Data
.json, .jsonl, .ndjson, .xml, .yaml, .yml, .toml

# Database
.db, .sqlite, .sqlite3, .sql

# Scientific/Binary
.parquet, .feather, .hdf, .hdf5, .h5, .pickle, .pkl,
.joblib, .npy, .npz, .mat, .sav, .dta, .sas7bdat

# Text/Code
.txt, .log, .py, .r, .R

# Geospatial
.geojson, .shp, .kml, .gpx

3. File Access in Code

All user files are automatically accessible via:

# AI generates code like this:
df = load_file('file_id_abc123')  # Auto-detects type!

# Automatically handles:
# - CSV → pd.read_csv()
# - Excel → pd.read_excel()
# - JSON → json.load() or pd.read_json()
# - Parquet → pd.read_parquet()
# - HDF5 → pd.read_hdf()
# - And 75+ more types!

4. Unified Execution Path

User uploads file (ANY type)
    ↓
upload_discord_attachment()
    ↓
Saved to /tmp/bot_code_interpreter/user_files/{user_id}/
    ↓
MongoDB: file_id, expires_at, metadata
    ↓
User asks AI to analyze
    ↓
AI generates Python code with load_file('file_id')
    ↓
execute_python_code() runs via code_interpreter
    ↓
Files auto-loaded, packages auto-installed
    ↓
Generated files (plots, CSVs, etc.) auto-sent to user
    ↓
After expiration → Auto-deleted (disk + DB)

⚙️ Configuration (.env)

# File expiration (hours)
FILE_EXPIRATION_HOURS=48    # Files expire after 48 hours
# FILE_EXPIRATION_HOURS=-1  # Or set to -1 for permanent storage

# Maximum files per user
MAX_FILES_PER_USER=20       # Each user can have up to 20 files

🔧 Implementation Details

Updated Files

1. src/module/message_handler.py

  • Removed analyze_data_file tool (deprecated)
  • Updated DATA_FILE_EXTENSIONS to support 80+ types
  • Rewrote _download_and_save_data_file() to use upload_discord_attachment()
  • Rewrote _handle_data_file() to show detailed upload info
  • Updated _execute_python_code() to fetch all user files from DB
  • Files passed as user_files array to code_interpreter

2. src/config/config.py

  • Added FILE_EXPIRATION_HOURS config
  • Added MAX_FILES_PER_USER config
  • Updated NORMAL_CHAT_PROMPT to reflect new file system
  • Removed references to deprecated analyze_data_file tool

3. src/utils/openai_utils.py

  • Removed analyze_data_file tool definition
  • Only execute_python_code tool remains for all code execution

4. .env

  • Added MAX_FILES_PER_USER=20
  • Already had FILE_EXPIRATION_HOURS=48

📊 User Experience

File Upload

📊 File Uploaded Successfully!

📁 Name: data.csv
📦 Type: CSV
💾 Size: 1.2 MB
🆔 File ID: abc123xyz789
⏰ Expires: 2025-10-04 10:30:00
📂 Your Files: 3/20

✅ Ready for processing! You can now:
• Ask me to analyze this data
• Request visualizations or insights
• Write Python code to process it
• The file is automatically accessible in code execution

💡 Examples:
Analyze this data and show key statistics
Create visualizations from this file
Show me the first 10 rows
Plot correlations between all numeric columns

Code Execution

# AI automatically generates code like:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load user's file (file_id from context)
df = load_file('abc123xyz789')  # Auto-detects CSV!

# Analyze
print(df.describe())
print(f"\nShape: {df.shape}")

# Visualize
sns.heatmap(df.corr(), annot=True)
plt.savefig('correlation_heatmap.png')

# Export results
df.describe().to_csv('statistics.csv')

All generated files are automatically sent to the user!


🔒 Security & Limits

Per-User Limits

  • Max Files: 20 (configurable)
  • Auto-Cleanup: Oldest file deleted when limit reached
  • Expiration: 48 hours (configurable)

File Validation

  • File type detection
  • Size validation
  • Extension checking
  • Malicious file prevention

Isolation

  • Each user has separate directory
  • Code executed in isolated venv
  • Files only accessible to owner

🚀 Benefits

For Users

  1. Simple Upload: Just drag & drop any data file
  2. Natural Interaction: "Analyze this file" - AI handles the rest
  3. Multiple Files: Up to 20 files, automatically managed
  4. Auto-Cleanup: Files expire automatically, no manual deletion needed
  5. Rich Output: Get plots, CSVs, reports automatically

For System

  1. Unified: One code execution system for everything
  2. Scalable: Per-user limits prevent abuse
  3. Efficient: Auto-cleanup prevents disk bloat
  4. Flexible: Support 80+ file types
  5. Simple: AI just writes normal Python code

For AI Model

  1. Natural: Just use load_file('file_id')
  2. Auto-Install: Import any package, auto-installs
  3. Auto-Output: Create files, automatically shared
  4. Context-Aware: Knows about user's uploaded files
  5. Powerful: Full pandas/numpy/scipy/sklearn/tensorflow stack

🧪 Testing

Test File Upload

  1. Upload CSV file → Should show detailed info with file_id
  2. Check 📂 Your Files: 1/20 counter
  3. Ask "analyze this data"
  4. AI should generate code with load_file('file_id')
  5. Code executes, results sent back

Test File Limit

  1. Upload 20 files
  2. Upload 21st file → Oldest should be auto-deleted
  3. Counter should show 20/20

Test File Types

  • CSV: pd.read_csv() auto-detected
  • Excel: pd.read_excel() auto-detected
  • JSON: json.load() or pd.read_json() auto-detected
  • Parquet: pd.read_parquet() auto-detected
  • etc.

Test Expiration

  1. Set FILE_EXPIRATION_HOURS=0.1 (6 minutes)
  2. Upload file
  3. Wait 6+ minutes
  4. File should be auto-deleted

📚 Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Discord User                           │
└────────────────────────┬────────────────────────────────────┘
                         │ Upload file
                         ↓
┌─────────────────────────────────────────────────────────────┐
│                message_handler.py                           │
│  - _handle_data_file()                                      │
│  - _download_and_save_data_file()                           │
│  - Enforces MAX_FILES_PER_USER limit                        │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ↓
┌─────────────────────────────────────────────────────────────┐
│             code_interpreter.py                             │
│  - upload_discord_attachment()                              │
│  - Saves to /tmp/bot_code_interpreter/user_files/          │
│  - Stores metadata in MongoDB                               │
│  - Returns file_id                                          │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ↓
┌─────────────────────────────────────────────────────────────┐
│                    MongoDB                                  │
│  Collection: user_files                                     │
│  {                                                          │
│    file_id: "abc123",                                       │
│    user_id: "878573881449906208",                           │
│    filename: "data.csv",                                    │
│    file_path: "/tmp/.../abc123.csv",                        │
│    file_type: "csv",                                        │
│    file_size: 1234567,                                      │
│    uploaded_at: "2025-10-02T10:30:00",                      │
│    expires_at: "2025-10-04T10:30:00"                        │
│  }                                                          │
└─────────────────────────────────────────────────────────────┘
                         │
                         │ User asks to analyze
                         ↓
┌─────────────────────────────────────────────────────────────┐
│                    AI Model                                 │
│  - Sees file_id in conversation context                     │
│  - Generates Python code:                                   │
│    df = load_file('abc123')                                 │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ↓
┌─────────────────────────────────────────────────────────────┐
│            message_handler.py                               │
│  - _execute_python_code()                                   │
│  - Fetches all user files from DB                           │
│  - Passes user_files=[file_id1, file_id2, ...]              │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ↓
┌─────────────────────────────────────────────────────────────┐
│             code_interpreter.py                             │
│  - execute_code()                                           │
│  - Injects load_file() function                             │
│  - Maps file_id → file_path                                 │
│  - Auto-installs packages                                   │
│  - Captures generated files                                 │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ↓
┌─────────────────────────────────────────────────────────────┐
│                 Isolated venv                               │
│  FILES = {'abc123': '/tmp/.../abc123.csv'}                  │
│                                                             │
│  def load_file(file_id):                                    │
│      path = FILES[file_id]                                  │
│      # Auto-detect: CSV, Excel, JSON, etc.                  │
│      return pd.read_csv(path)  # or appropriate loader      │
│                                                             │
│  # User's code executes here                                │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ↓
┌─────────────────────────────────────────────────────────────┐
│              Generated Files                                │
│  - plots.png                                                │
│  - results.csv                                              │
│  - report.txt                                               │
│  → Auto-captured and sent to Discord user                   │
└─────────────────────────────────────────────────────────────┘

Verification Checklist

  • Files saved to code_interpreter system
  • Files expire after configured hours
  • Per-user file limits enforced
  • 80+ file types supported
  • Files accessible via file_id
  • All analysis runs through execute_python_code
  • Removed deprecated analyze_data_file tool
  • Auto-installs packages on import
  • Auto-captures generated files
  • MongoDB stores only metadata
  • Disk cleanup on expiration
  • Oldest file deleted when limit reached
  • Detailed upload confirmation shown
  • File context added to conversation
  • AI prompt updated with new system

🎉 Result

Before: Separate tools, temp directories, manual cleanup, limited file types After: One unified system, automatic everything, 80+ file types, production-ready!

The system now works exactly like ChatGPT's file handling - simple, powerful, and automatic! 🚀