MXESCO-DOCKER is a robust and modular application designed for processing audio files. It includes features for transcription, phonemization, metadata generation, and storage in a MongoDB database. Built with FastAPI, it leverages advanced libraries such as OpenAI's Whisper and Hugging Face's Wav2Vec2 for speech and phoneme recognition.
- Audio Transcription: Extracts text from audio files with word-level timestamps.
- Phonemization: Converts audio data into phonemes with detailed character offsets.
- Metadata Generation: Includes information about the transcriber model, phonemizer, and timestamps.
- Data Storage: Stores processed data and raw audio in MongoDB using GridFS.
- REST API: Exposes endpoints for uploading and processing audio files.
- Containerization: Dockerized for ease of deployment.
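The metadata and storage features above can be sketched as one step: store the raw audio in GridFS, then insert a metadata document that references it. This is a minimal sketch, not the code in `database.py`. The function name `save_processed_audio` and every document field name are hypothetical. The function accepts any object with a GridFS-style `put()` method and any collection with a pymongo-style `insert_one()` method, so it can be exercised without a running database.

```python
from datetime import datetime, timezone
from typing import Any, Protocol


class FileStore(Protocol):
    """Anything with a GridFS-style put(); gridfs.GridFS(db) satisfies this."""

    def put(self, data: bytes, **kwargs: Any) -> Any: ...


class Collection(Protocol):
    """Anything with a pymongo-style insert_one()."""

    def insert_one(self, document: dict) -> Any: ...


def save_processed_audio(
    fs: FileStore,
    collection: Collection,
    audio: bytes,
    filename: str,
    words: list[dict],
    phonemes: list[dict],
    transcriber: str,
    phonemizer: str,
) -> Any:
    """Store raw audio in GridFS plus a metadata document pointing to it.

    Field names are illustrative (hypothetical schema).
    """
    # GridFS splits the bytes into chunks, so large recordings are not
    # limited by MongoDB's 16 MB maximum BSON document size.
    audio_id = fs.put(audio, filename=filename)
    document = {
        "filename": filename,
        "audio_file_id": audio_id,
        "transcriber": transcriber,
        "phonemizer": phonemizer,
        # UTC timestamp in ISO 8601 format, e.g. "2024-01-01T12:00:00+00:00"
        "created_at": datetime.now(timezone.utc).isoformat(),
        "words": words,
        "phonemes": phonemes,
    }
    return collection.insert_one(document)
```

With a live database, `fs` would be `gridfs.GridFS(db)` and `collection` a pymongo collection such as `db["recordings"]` (a placeholder name). In that case `insert_one` returns an `InsertOneResult` whose `inserted_id` identifies the new metadata document.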
MXESCO-DOCKER/
├── app/
│ ├── routes/
│ │ ├── __init__.py
│ │ ├── audio_routes.py
│ ├── services/
│ │ ├── __init__.py
│ │ ├── audio_processing.py
│ │ ├── corpus_app.py
│ │ ├── database.py
│ ├── utils/
│ │ ├── __init__.py
│ │ ├── phonemization.py
│ │ ├── timestamps.py
│ │ ├── transcription.py
│ ├── main.py
├── docker-compose.yml
├── Dockerfile
├── LICENSE
├── README.md
├── requirements.txt
- main.py: Entry point for the FastAPI application.
- audio_routes.py: Defines the API endpoints for processing audio files.
- audio_processing.py: Handles transcription, phonemization, and metadata generation.
- database.py: Saves metadata and audio files to MongoDB.
- corpus_app.py: Processes word and phoneme data to produce enriched metadata.
- utils/: Utility functions for timestamps, transcription, and phonemization.
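The exact logic in `corpus_app.py` is not described here. One straightforward way to combine word and phoneme data is to attach each phoneme to the word whose timestamps contain the phoneme's midpoint. The minimal sketch below uses that midpoint rule, which is an assumption rather than the project's confirmed method. The function names and the phoneme record shape (`{"phoneme", "start", "end"}` in seconds) are hypothetical. The word fields follow what openai-whisper returns when `transcribe()` is called with `word_timestamps=True`.

```python
from typing import Any


def flatten_words(result: dict[str, Any]) -> list[dict]:
    """Collect word-level timestamps from a Whisper transcribe() result.

    Assumes transcribe(..., word_timestamps=True), where each segment has a
    "words" list of {"word", "start", "end", ...} entries.
    """
    words = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            # Whisper words usually carry a leading space, e.g. " hola"
            words.append({"word": w["word"].strip(), "start": w["start"], "end": w["end"]})
    return words


def attach_phonemes(words: list[dict], phonemes: list[dict]) -> list[dict]:
    """Assign each phoneme to the word whose [start, end) span holds its midpoint.

    Phonemes that fall between words (e.g. in a pause) stay unassigned.
    Returns new dicts; the input word dicts are not modified.
    """
    enriched = [dict(w, phonemes=[]) for w in words]
    for p in phonemes:
        mid = (p["start"] + p["end"]) / 2
        for w in enriched:
            if w["start"] <= mid < w["end"]:
                w["phonemes"].append(p["phoneme"])
                break
    return enriched
```

The midpoint rule tolerates small boundary disagreements between the two models, because a phoneme only needs to overlap a word for more than half of its own duration to be assigned to it.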
- Docker and Docker Compose installed
- Python 3.9+
- Clone the repository:
git clone <repository_url>
cd mxesco-docker
- Build and run the Docker containers:
docker compose up --build
- The API will be available at http://localhost:8000.
- Alternatively, to run without Docker, install dependencies:
pip install -r requirements.txt
- Start the FastAPI server:
uvicorn app.main:app --reload
- Endpoint: /api/process-audio/
- Method: POST
- Description: Uploads an audio file for processing.
- Example Request:
curl -X POST "http://127.0.0.1:8000/api/process-audio/" \
  -F "file=@example_audio.mp3"
- Response:
{ "status": "success", "message": "Audio processed and saved successfully." }
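For scripting without curl, you can send the same request from Python. The following minimal sketch uses only the standard library. It assumes the server is running locally and that the endpoint reads the upload from a form field named `file`, as in the curl example. The helper names `build_multipart` and `process_audio` are hypothetical.

```python
import json
import mimetypes
import uuid
from pathlib import Path
from urllib import request

API_URL = "http://127.0.0.1:8000/api/process-audio/"


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode a single file field as multipart/form-data (like curl -F "file=@...")."""
    boundary = uuid.uuid4().hex
    ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {ctype}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def process_audio(path: str, url: str = API_URL) -> dict:
    """Upload an audio file and return the decoded JSON response."""
    file_path = Path(path)
    body, content_type = build_multipart("file", file_path.name, file_path.read_bytes())
    req = request.Request(url, data=body, headers={"Content-Type": content_type}, method="POST")
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `process_audio("example_audio.mp3")` should return a dict like the response shown above. Note that `urlopen` raises `urllib.error.HTTPError` when the server answers with an error status such as 422 or 500.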
- Swagger UI: http://127.0.0.1:8000/docs
- ReDoc: http://127.0.0.1:8000/redoc
- FastAPI: Web framework for building APIs.
- PyTorch: For handling audio data and Wav2Vec2 model inference.
- Whisper: OpenAI's speech-to-text library.
- MongoDB & GridFS: For data persistence.
- Docker: For containerized deployment.
- Pydub: For audio file manipulation.
- Phonemizer: For generating phonemes from text.
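The README does not say how phoneme offsets become timestamps. In Hugging Face Transformers, decoding Wav2Vec2 CTC output with `processor.batch_decode(predicted_ids, output_char_offsets=True)` yields, for each batch item, offsets measured in logit frames (`{"char", "start_offset", "end_offset"}`). The Transformers documentation converts these offsets to seconds with `model.config.inputs_to_logits_ratio / sampling_rate`. Below is a minimal sketch of that conversion. It assumes a 16 kHz model with the default wav2vec2 convolutional front end, whose ratio is 320. The function name and output field names are hypothetical and may not match `timestamps.py`.

```python
def offsets_to_seconds(
    char_offsets: list[dict],
    inputs_to_logits_ratio: int = 320,
    sampling_rate: int = 16_000,
) -> list[dict]:
    """Convert CTC frame offsets (one batch item) to times in seconds.

    Each logit frame covers inputs_to_logits_ratio input samples, so one
    frame lasts inputs_to_logits_ratio / sampling_rate seconds (0.02 s with
    the defaults).
    """
    frame_seconds = inputs_to_logits_ratio / sampling_rate
    return [
        {
            "phoneme": o["char"],
            # round() hides float noise such as 0.30000000000000004
            "start": round(o["start_offset"] * frame_seconds, 3),
            "end": round(o["end_offset"] * frame_seconds, 3),
        }
        for o in char_offsets
    ]
```

In practice, pass `model.config.inputs_to_logits_ratio` and the feature extractor's `sampling_rate` explicitly instead of relying on the defaults.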
docker-compose.yml:
- Defines two services:
  - app: The FastAPI application.
  - mongo: MongoDB database.
- Exposes port 8000 for the application and port 27017 for MongoDB.
To customize settings, modify the environment variables in the docker-compose.yml file or create a .env file.
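The README does not list which variable names the application reads. The following hypothetical sketch shows one way to load settings from the environment. The names `MONGO_URI`, `MONGO_DB`, and `WHISPER_MODEL`, along with their defaults, are assumptions rather than the project's actual configuration. Docker Compose places services on a shared network where each service is reachable by its service name, which is why the default host is `mongo`. When running uvicorn locally, set `MONGO_URI=mongodb://localhost:27017` instead.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Runtime settings (hypothetical variable names)."""

    mongo_uri: str
    mongo_db: str
    whisper_model: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from environment variables, falling back to defaults.

    The default URI targets the Compose service name "mongo"; pass a
    mapping instead of os.environ to override values (e.g. in tests).
    """
    env = os.environ if env is None else env
    return Settings(
        mongo_uri=env.get("MONGO_URI", "mongodb://mongo:27017"),
        mongo_db=env.get("MONGO_DB", "mxesco"),
        whisper_model=env.get("WHISPER_MODEL", "base"),
    )
```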
This project is licensed under the MIT License. See the LICENSE file for details.
Contributions are welcome! To contribute:
- Fork the repository.
- Create a feature branch.
- Commit your changes.
- Submit a pull request.
- OpenAI for Whisper
- Hugging Face for Wav2Vec2
- MongoDB for efficient data handling
- Maestría en Ciencia de Datos (Master's in Data Science), Universidad de Sonora (GitHub Repository)