A modular FastAPI-based application for audio processing, featuring transcription, phonemization, metadata generation, and MongoDB storage. Powered by Whisper, Wav2Vec2, and Docker.

roglz/mxesco-docker

MXESCO-DOCKER

MXESCO-DOCKER is a modular application for processing audio files. It transcribes audio, extracts phonemes, generates metadata, and stores the results in a MongoDB database. It is built with FastAPI and uses OpenAI's Whisper for speech recognition and Hugging Face's Wav2Vec2 for phoneme recognition.

Features

  • Audio Transcription: Extracts text from audio files with word-level timestamps.
  • Phonemization: Converts audio data into phonemes with detailed character offsets.
  • Metadata Generation: Includes information about the transcriber model, phonemizer, and timestamps.
  • Data Storage: Stores processed data and raw audio in MongoDB using GridFS.
  • REST API: Exposes endpoints for uploading and processing audio files.
  • Containerization: Dockerized for ease of deployment.
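To make the features above more concrete, here is a minimal sketch of what a processed record could look like before it is stored. The class names, field names, and model labels are hypothetical and are not taken from the project's code; only the general contents (word-level timestamps, phoneme offsets, model metadata) come from the feature list.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class WordStamp:
    word: str
    start: float  # seconds from the beginning of the audio
    end: float


@dataclass
class PhonemeStamp:
    phoneme: str
    start_offset: int  # offset units as reported by the phoneme model
    end_offset: int


@dataclass
class AudioRecord:
    # All field names here are illustrative assumptions.
    filename: str
    transcriber: str  # label for the transcription model
    phonemizer: str   # label for the phoneme model
    words: List[WordStamp] = field(default_factory=list)
    phonemes: List[PhonemeStamp] = field(default_factory=list)
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


record = AudioRecord(
    filename="example_audio.mp3",
    transcriber="whisper",
    phonemizer="wav2vec2",
    words=[WordStamp("hola", 0.0, 0.42)],
    phonemes=[PhonemeStamp("o", 3, 5)],
)

# asdict() recursively converts the dataclasses into plain dicts and lists,
# which is the kind of value a document database can accept.
document = asdict(record)
```

Keeping the record as dataclasses until the last step makes the expected fields explicit, while the final plain dict stays independent of any particular database driver.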

Project Structure

MXESCO-DOCKER/
├── app/
│   ├── routes/
│   │   ├── __init__.py
│   │   ├── audio_routes.py
│   ├── services/
│   │   ├── __init__.py
│   │   ├── audio_processing.py
│   │   ├── corpus_app.py
│   │   ├── database.py
│   ├── utils/
│   │   ├── __init__.py
│   │   ├── phonemization.py
│   │   ├── timestamps.py
│   │   ├── transcription.py
│   ├── main.py
├── docker-compose.yml
├── Dockerfile
├── LICENSE
├── README.md
├── requirements.txt

Key Files

  • main.py: Entry point for the FastAPI application.
  • audio_routes.py: Defines API endpoints for processing audio files.
  • audio_processing.py: Handles transcription, phonemization, and metadata generation.
  • database.py: Saves metadata and audio files to MongoDB.
  • corpus_app.py: Processes word and phoneme data for enriched metadata.
  • utils/: Utility functions for timestamps, transcription, and phonemization.
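As an illustration of how word and phoneme data might be combined, the sketch below converts frame-based phoneme offsets to seconds and attaches each phoneme to the word whose time span contains it. This is an assumption about the general approach, not the code in corpus_app.py or timestamps.py. The function names are hypothetical, the offset dictionaries follow the `char`, `start_offset`, `end_offset` layout that Hugging Face tokenizers return when character offsets are requested, and the stride of 320 samples at 16 kHz is a typical Wav2Vec2 value that should be read from the model configuration in practice.

```python
from typing import Dict, List


def offsets_to_seconds(
    char_offsets: List[Dict],
    frame_stride: int = 320,    # assumed samples per model output frame
    sample_rate: int = 16000,   # assumed audio sampling rate in Hz
) -> List[Dict]:
    """Convert frame-based offsets into start/end times in seconds."""
    seconds_per_frame = frame_stride / sample_rate
    return [
        {
            "phoneme": item["char"],
            "start": round(item["start_offset"] * seconds_per_frame, 3),
            "end": round(item["end_offset"] * seconds_per_frame, 3),
        }
        for item in char_offsets
    ]


def attach_phonemes(words: List[Dict], phonemes: List[Dict]) -> List[Dict]:
    """Attach each phoneme to the word whose span contains its midpoint."""
    enriched = []
    for word in words:
        inside = [
            p["phoneme"]
            for p in phonemes
            if word["start"] <= (p["start"] + p["end"]) / 2 < word["end"]
        ]
        enriched.append({**word, "phonemes": inside})
    return enriched


words = [
    {"word": "hola", "start": 0.0, "end": 0.4},
    {"word": "mundo", "start": 0.5, "end": 1.0},
]
char_offsets = [
    {"char": "o", "start_offset": 5, "end_offset": 8},
    {"char": "m", "start_offset": 26, "end_offset": 28},
]
enriched = attach_phonemes(words, offsets_to_seconds(char_offsets))
```

Matching on the phoneme midpoint is a simple heuristic: a phoneme that straddles a word boundary is assigned to exactly one word, and phonemes that fall in silence between words are dropped.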

Installation

Prerequisites

  • Docker and Docker Compose installed
  • Python 3.9+

Steps

  1. Clone the repository:
    git clone <repository_url>
    cd mxesco-docker
  2. Build and run the Docker containers:
    docker compose up --build
  3. The API will be available at http://localhost:8000.

Running Locally Without Docker

  1. Install dependencies:
    pip install -r requirements.txt
  2. Start the FastAPI server (a MongoDB instance must also be reachable, since processed data is stored there):
    uvicorn app.main:app --reload

Usage

API Endpoints

Process Audio File

  • Endpoint: /api/process-audio/
  • Method: POST
  • Description: Uploads an audio file for processing.
  • Example Request:
    curl -X POST "http://127.0.0.1:8000/api/process-audio/" \
         -F "file=@example_audio.mp3"
  • Response:
    {
        "status": "success",
        "message": "Audio processed and saved successfully."
    }
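The same request can be sent from Python. The sketch below uses only the standard library and builds the multipart body by hand; the helper names `build_multipart` and `process_audio` are illustrative, and the form field name `file` is taken from the curl example above. It assumes the server is running at the address shown and returns JSON.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field_name: str, filename: str, payload: bytes):
    """Return (body, content_type) for a single-file multipart/form-data upload."""
    boundary = uuid.uuid4().hex
    file_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {file_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def process_audio(path: str, base_url: str = "http://127.0.0.1:8000") -> dict:
    """Upload an audio file to /api/process-audio/ and return the parsed JSON."""
    audio = Path(path)
    body, content_type = build_multipart("file", audio.name, audio.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/api/process-audio/",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `process_audio("example_audio.mp3")` should return the response dictionary shown above. A third-party HTTP client would shorten this considerably; the standard library is used here only to keep the sketch free of extra dependencies.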

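Interactive API Documentation

FastAPI serves interactive documentation by default, so unless the application disables it, Swagger UI should be available at http://localhost:8000/docs and ReDoc at http://localhost:8000/redoc.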

Technologies Used

  • FastAPI: Web framework for building APIs.
  • PyTorch: For handling audio data and Wav2Vec2 model inference.
  • Whisper: OpenAI's speech-to-text library.
  • MongoDB & GridFS: For data persistence.
  • Docker: For containerized deployment.
  • Pydub: For audio file manipulation.
  • Phonemizer: For generating phonemes from text.

Configuration

Docker

  • docker-compose.yml:
    • Defines two services:
      1. app: The FastAPI application.
      2. mongo: MongoDB database.
    • Exposes port 8000 for the application and port 27017 for MongoDB.

Environment Variables

To customize settings, modify the environment variables in the docker-compose.yml file or create a .env file.
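As a rough sketch of how such settings could be read on the application side, the snippet below loads values from the environment with fallbacks. The variable names `MONGO_URI`, `MONGO_DB`, and `WHISPER_MODEL`, and their default values, are hypothetical; check docker-compose.yml and the application code for the names the project actually uses. The default host `mongo` and port 27017 simply mirror the service name and port listed above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    mongo_uri: str
    mongo_db: str
    whisper_model: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from the environment, falling back to assumed defaults."""
    env = os.environ if env is None else env
    return Settings(
        # Hypothetical variable names; the defaults are illustrative only.
        mongo_uri=env.get("MONGO_URI", "mongodb://mongo:27017"),
        mongo_db=env.get("MONGO_DB", "mxesco"),
        whisper_model=env.get("WHISPER_MODEL", "base"),
    )


settings = load_settings()
```

Passing a mapping explicitly, for example `load_settings({"MONGO_URI": "mongodb://localhost:27017"})`, makes the function easy to exercise in tests without touching the real environment.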

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributing

Contributions are welcome! To contribute:

  1. Fork the repository.
  2. Create a feature branch.
  3. Commit your changes.
  4. Submit a pull request.

Acknowledgements

  • OpenAI for Whisper
  • Hugging Face for Wav2Vec2
  • MongoDB for efficient data handling
  • Maestría en Ciencia de Datos, Universidad de Sonora
