FileSense

An intelligent file organization system that leverages semantic understanding to categorize documents.

Introduction

FileSense is a local file management utility that organizes documents by their semantic content rather than by filename patterns or extensions. It uses SentenceTransformers to compute vector embeddings and FAISS to index them for efficient similarity search.

Key capabilities include:

Semantic Classification: Maps files to categories based on meaning (e.g., "thermodynamics_notes.pdf" → "Physics").
Automated Labeling: Integrates with Google Gemini to propose new categories for documents that do not match any existing category.
OCR Integration: Extracts text from PDFs via pdfplumber and from images and scanned PDFs via Tesseract.
Concurrency: Processes files in parallel across worker threads (see the --threads option) to handle large directories.
Monitoring: Real-time folder watching for automated sorting of new downloads.
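
The semantic classification idea in the list above can be illustrated with a minimal sketch. It assumes every category and every document is already represented as an embedding vector (toy 3-dimensional vectors stand in for SentenceTransformer output here), and that a document is handed off for label generation when no category is similar enough. The function names, the 0.5 threshold, and the example vectors are illustrative assumptions, not FileSense's actual code or settings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify(doc_vec, category_vecs, threshold=0.5):
    """Return (best_category, score), or (None, score) if nothing clears the threshold."""
    best_name, best_score = None, -1.0
    for name, vec in category_vecs.items():
        score = cosine(doc_vec, vec)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        # No category is close enough: the caller could request a new label here.
        return None, best_score
    return best_name, best_score

# Toy vectors (assumed): a real run would use model embeddings instead.
categories = {"Physics": [0.9, 0.1, 0.0], "History": [0.0, 0.2, 0.9]}
label, score = classify([0.8, 0.2, 0.1], categories)
```

A result of `None` is the point where a tool like this would fall back to proposing a new category rather than forcing a poor match.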

Technical Architecture

Component Overview

Core Engine: classify_process_file.py handles the main classification pipeline.
Vector Database: Uses FAISS for high-performance similarity search.
Inference: Supports various SentenceTransformer models (default: bge-base-en-v1.5).
RL Module: Optional Reinforcement Learning module for policy-based classification improvement.

Project Structure

FileSense/
├── scripts/
│   ├── RL/                 # Reinforcement Learning & SFT logic
│   ├── logger/             # Unified logging system
│   ├── main.py             # CLI entry point (replaces script.py)
│   ├── watcher.py          # Filesystem monitor (replaces watcher_script.py)
│   ├── launcher.py         # Tkinter-based GUI
│   └── ...                 # Utility modules (OCR, Labelling, Indexing)
├── folder_labels.json      # Category definitions and keyword mappings
└── folder_embeddings.faiss # Pre-computed vector index

Setup and Installation

Prerequisites

Python 3.8+
Tesseract OCR (Optional, needed only for text extraction from images and scanned PDFs)
Google Gemini API Key (Optional, for automated category generation)

Installation

pip install -r requirements.txt

Environment Configuration

Create a .env file in the root directory:

API_KEY=your_google_gemini_api_key
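
How the key might be read at runtime can be sketched with the standard library alone. This is an assumption about the loading step, not the project's actual code: FileSense may well use a package such as python-dotenv instead, and the helper names `load_env_file` and `get_api_key` are hypothetical.

```python
import os

def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env-style file into a dict."""
    values = {}
    if not os.path.exists(path):
        return values
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            # Skip blanks, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values

def get_api_key(path=".env"):
    """Prefer a real environment variable, then fall back to the .env file."""
    return os.environ.get("API_KEY") or load_env_file(path).get("API_KEY")
```

Because the Gemini key is optional, `get_api_key` returns `None` when no key is configured, which lets a caller skip automated category generation instead of failing.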

Initialization

To initialize a knowledge base from preseeded data:

Copy preseeded.json to folder_labels.json.
Generate the vector index:
```
python scripts/create_index.py
```
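
The copy step can also be scripted so that it behaves the same on every platform. The sketch below is a convenience helper under stated assumptions: the name `seed_labels` and its `overwrite` parameter are hypothetical, and it assumes both files live in the repository root.

```python
import shutil
from pathlib import Path

def seed_labels(root=".", source="preseeded.json", target="folder_labels.json", overwrite=False):
    """Copy the preseeded labels to the active labels file and return the target path."""
    src = Path(root) / source
    dst = Path(root) / target
    if not src.is_file():
        raise FileNotFoundError(f"missing seed file: {src}")
    if dst.exists() and not overwrite:
        # Leave an existing knowledge base untouched unless explicitly asked.
        return dst
    shutil.copyfile(src, dst)
    return dst
```

After seeding, the index still has to be regenerated with `python scripts/create_index.py` so that it matches the new labels.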

Usage

GUI Launcher

python scripts/launcher.py

Command Line Interface

python scripts/main.py --dir ./path/to/files --threads 4
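
A rough sketch of how a CLI with these two options might fan work out to a thread pool follows. Only the `--dir` and `--threads` flags come from the command above. The default of 4 threads, the `process_file` placeholder, and the `run` helper are assumptions for illustration and do not describe what main.py actually does.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Sort files in a directory.")
    parser.add_argument("--dir", required=True, help="directory to process")
    # Default of 4 is an assumption; the real default is not documented here.
    parser.add_argument("--threads", type=int, default=4, help="worker threads")
    return parser.parse_args(argv)

def process_file(path):
    """Placeholder for the real classification step (hypothetical)."""
    return path.name, path.suffix.lower() or "(none)"

def run(directory, threads):
    """Process every regular file in `directory` using a pool of worker threads."""
    files = sorted(p for p in Path(directory).iterdir() if p.is_file())
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() preserves input order even though work runs concurrently.
        return list(pool.map(process_file, files))
```

Threads suit this kind of workload when most of the time is spent on file I/O or in native libraries, since those parts can overlap even under Python's global interpreter lock.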

Background Monitor

python scripts/watcher.py --dir ./Downloads
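
The monitoring loop can be approximated with simple polling, as in the sketch below. This is a stand-in built on the standard library only: the real watcher.py may rely on an event-based filesystem library instead, and the names `scan_new_files` and `watch`, the 2-second interval, and the `max_cycles` parameter are all assumptions.

```python
import time
from pathlib import Path

def scan_new_files(directory, seen):
    """Return files in `directory` not yet in `seen`, and record them."""
    new_files = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path not in seen:
            seen.add(path)
            new_files.append(path)
    return new_files

def watch(directory, handle, interval=2.0, max_cycles=None):
    """Poll `directory` and call `handle(path)` for each newly seen file."""
    seen = set()
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        for path in scan_new_files(directory, seen):
            handle(path)
        cycles += 1
        # Sleep only if another polling cycle will follow.
        if max_cycles is None or cycles < max_cycles:
            time.sleep(interval)
```

Note that the first cycle treats files already present in the folder as new. A production watcher would also need to wait until a download has finished writing before moving the file.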

Privacy and RL

FileSense operates strictly locally by default. The Reinforcement Learning (RL) features, which log classification metrics for policy optimization, are disabled by default. Use the --enable-rl flag to opt in to these features.
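
An opt-in flag of this kind is commonly wired up as shown in the sketch below. Only the `--enable-rl` flag name comes from the text above. The `build_parser` and `log_metric` helpers and the shape of the logged record are hypothetical.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser()
    # store_true makes the flag default to False, so RL logging is opt-in.
    parser.add_argument("--enable-rl", action="store_true",
                        help="log classification metrics for RL policy optimization")
    return parser

def log_metric(enabled, sink, record):
    """Append `record` to `sink` only when RL logging was explicitly enabled."""
    if enabled:
        sink.append(record)
    return enabled
```

Keeping the check in one small function means no metric is recorded anywhere unless the user passed the flag.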

License

MIT License

Roadmap & Future Enhancements

Automated Knowledge Optimization: Implement scripts to iteratively refine category descriptions until semantic similarity thresholds are met across training datasets.
Self-Optimizing Prompts: Enable the model to return revised instructions after each classification update to improve zero-shot accuracy.
Reinforcement Learning Feedback: Allow users to contribute manual classification corrections to optimize the underlying RL policy.
Model Comparative Analysis: Document the specific advantages of using SentenceTransformers versus traditional text classifiers for cross-domain file sorting.
