An intelligent file organization system that leverages semantic understanding to categorize documents.
FileSense is a local file management utility that organizes documents by their semantic content rather than by filename patterns or extensions. It uses SentenceTransformers to compute vector embeddings and FAISS to index them for similarity search.
Key capabilities include:
- Semantic Classification: Maps files to categories based on meaning (e.g., "thermodynamics_notes.pdf" → "Physics").
- Automated Labeling: Integrates with Google Gemini to propose new categories for documents that do not match existing templates.
- OCR Integration: Supports text extraction from images and scanned PDFs via pdfplumber and Tesseract.
- Concurrency: Parallel processing for handling large datasets.
- Monitoring: Real-time folder watching for automated sorting of new downloads.
- Core Engine: classify_process_file.py handles the main classification pipeline.
- Vector Database: Uses FAISS for high-performance similarity search.
- Inference: Supports various SentenceTransformer models (default: bge-base-en-v1.5).
- RL Module: Optional Reinforcement Learning module for policy-based classification improvement.
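The flow these components describe (embed the document, compare it against each category, take the nearest match) can be sketched roughly as follows. This is an illustration only, not the code in classify_process_file.py: a toy bag-of-words vector stands in for the SentenceTransformer embedding, a linear scan stands in for the FAISS index, and the category descriptions, the `classify` helper, and the 0.2 threshold are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical category descriptions; the real ones live in folder_labels.json.
CATEGORIES = {
    "Physics": "thermodynamics entropy heat energy mechanics quantum",
    "Finance": "invoice tax budget payment bank statement",
}


def embed(text):
    """Toy stand-in for a sentence embedding: L2-normalized word counts."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(a, b):
    """Dot product of two normalized sparse vectors (equals cosine similarity)."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def classify(text, threshold=0.2):
    """Return (category, score), or (None, score) if nothing is close enough."""
    doc = embed(text)
    scores = {name: cosine(doc, embed(desc)) for name, desc in CATEGORIES.items()}
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        # No existing category fits; a new label would have to be proposed.
        return None, scores[best]
    return best, scores[best]
```

With real embeddings the same ranking can be obtained by normalizing the vectors and searching an inner-product index (for example FAISS `IndexFlatIP`), since the inner product of unit vectors is their cosine similarity. The below-threshold branch corresponds to documents that match no existing category and would need a newly proposed label.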
FileSense/
├── scripts/
│ ├── RL/ # Reinforcement Learning & SFT logic
│ ├── logger/ # Unified logging system
│ ├── main.py # CLI entry point (replaces script.py)
│ ├── watcher.py # Filesystem monitor (replaces watcher_script.py)
│ ├── launcher.py # Tkinter-based GUI
│ └── ... # Utility modules (OCR, Labelling, Indexing)
├── folder_labels.json # Category definitions and keyword mappings
└── folder_embeddings.faiss # Pre-computed vector index
- Python 3.8+
- Tesseract OCR (Optional, for OCR support)
- Google Gemini API Key (Optional, for automated category generation)
pip install -r requirements.txt
Create a .env file in the root directory:
API_KEY=your_google_gemini_api_key
To initialize a knowledge base from preseeded data:
- Copy preseeded.json to folder_labels.json.
- Generate the vector index:
python scripts/create_index.py
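For reference, the API_KEY entry in the .env file created during setup could be read along these lines. This is a dependency-free sketch under stated assumptions: the helper names `load_env_file` and `get_api_key` are hypothetical, and the project itself may well load the file differently (for instance through a dotenv-style library).

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines into a dict, skipping blanks and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(path=".env"):
    """Prefer a real environment variable, then fall back to the .env file."""
    return os.environ.get("API_KEY") or load_env_file(path).get("API_KEY")
```

Because the Gemini key is optional, `get_api_key` returns `None` when no key is configured, so a caller can skip automated category generation instead of failing.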
Launch the GUI:
python scripts/launcher.py
Run the CLI on a directory:
python scripts/main.py --dir ./path/to/files --threads 4
Watch a folder for new files:
python scripts/watcher.py --dir ./Downloads
FileSense operates strictly locally by default. The Reinforcement Learning (RL) features, which log classification metrics for policy optimization, are disabled by default. Use the --enable-rl flag to opt in to these features.
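To make the flags above concrete, here is a rough sketch of how a CLI with --dir, --threads, and --enable-rl might parse its arguments and fan work out across threads. It is not the contents of scripts/main.py: `build_parser`, `process_file`, and `run` are hypothetical names, the default of 4 threads is an assumption, and the placeholder worker only reports file sizes where the real tool would classify each file.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_parser():
    parser = argparse.ArgumentParser(description="Sort files by semantic content.")
    parser.add_argument("--dir", required=True, help="Directory of files to organize")
    # Default thread count is an assumption made for this sketch.
    parser.add_argument("--threads", type=int, default=4, help="Number of worker threads")
    parser.add_argument("--enable-rl", action="store_true",
                        help="Opt in to RL metric logging (off by default)")
    return parser


def process_file(path):
    """Placeholder for the real classification step; returns (name, size)."""
    return path.name, path.stat().st_size


def run(argv=None):
    """Parse arguments and process every file in --dir using a thread pool."""
    args = build_parser().parse_args(argv)
    files = sorted(p for p in Path(args.dir).iterdir() if p.is_file())
    with ThreadPoolExecutor(max_workers=args.threads) as pool:
        results = list(pool.map(process_file, files))  # keeps input order
    return args, results
```

Calling `run(["--dir", "./path/to/files", "--threads", "4"])` mirrors the CLI invocation shown above. Using `action="store_true"` keeps the RL option off unless the flag is passed, which matches the opt-in behavior described.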
MIT License
- Automated Knowledge Optimization: Implement scripts to iteratively refine category descriptions until semantic similarity thresholds are met across training datasets.
- Self-Optimizing Prompts: Enable the model to return revised instructions after each classification update to improve zero-shot accuracy.
- Reinforcement Learning Feedback: Allow users to contribute manual classification corrections to optimize the underlying RL policy.
- Model Comparative Analysis: Document the specific advantages of using SentenceTransformers versus traditional text classifiers for cross-domain file sorting.