vehladkid/DNA-DFA-using-python

🧬 DNA Pattern Matching using DFA & Aho-Corasick

Theory of Computation Project | Slot: C1+TC1 | Faculty: Dr Amutha S


📋 Overview

Fast DNA pattern matching using a Deterministic Finite Automaton (DFA) for single patterns and the Aho-Corasick algorithm for multiple patterns. A web-based frontend visualizes pattern matches, motif analysis, and performance benchmarks.

⚑ Why This Project?

  • O(n) Linear Time Complexity - No backtracking, process each character once
  • Real-World Applications - Gene discovery, restriction mapping, disease detection
  • Educational - Learn automata theory, string algorithms, and complexity analysis
  • Production-Ready - Fully tested with 76+ test cases

🚀 Features

| Feature | Description | Algorithm |
|---|---|---|
| Pattern Search | Single-pattern matching | DFA (KMP-based) |
| Motif Analysis | Search known DNA motifs | DFA |
| Multi-Pattern | Search multiple patterns simultaneously | Aho-Corasick |
| Benchmarks | Performance testing & O(n) verification | Custom analyzer |
| Web Interface | Interactive Streamlit dashboard | Frontend |
| Validation | DNA sequence validation (ATCG + N) | SequenceHandler |

⚙️ Tech Stack

| Layer | Tech |
|---|---|
| Frontend | Streamlit, Plotly, Matplotlib |
| Backend | Python 3.9+, BioPython |
| Testing | Pytest (76+ tests), Coverage analysis |
| Algorithms | Custom DFA, Aho-Corasick (from scratch) |

📁 Project Structure

├── src/
│   ├── dfa_engine.py              # DFA pattern matching (KMP-based)
│   ├── aho_corasick.py            # Multi-pattern matching
│   ├── sequence_handler.py        # DNA validation & processing
│   ├── motif_database.py          # Biological motif database
│   ├── benchmark.py               # Performance testing
│   └── performance_analyzer.py    # Complexity verification
│
├── app/
│   └── app.py                     # Streamlit web interface (5 pages)
│
├── tests/                         # 76+ test cases ✓
│   ├── test_dfa_engine.py        # DFA tests
│   ├── test_sequence_handler.py  # Validation tests
│   ├── test_motif_database.py    # Motif tests
│   └── test_benchmark.py         # Performance tests
│
└── requirements.txt

🎯 Core Algorithms

1️⃣ DFA (Deterministic Finite Automaton)

  • Best for: Single pattern matching
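  • Time: O(n) scan after O(m) preprocessing | Space: O(m)
  • Key: KMP failure function for smart state transitions
  • No backtracking - the text is read left to right, one character at a time; failure-function fallbacks add at most O(n) extra comparisons in total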
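
To make the link between the KMP failure function and DFA state transitions concrete, here is a minimal sketch. It is not the code in `src/dfa_engine.py`: the names `failure_function`, `build_dfa`, and `find_all` are hypothetical, and it assumes patterns over the four bases A, C, G, T and 0-based match positions.

```python
from typing import Dict, List

DNA_ALPHABET = "ACGT"  # assumption: patterns contain only these four bases


def failure_function(pattern: str) -> List[int]:
    """fail[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it (the classic KMP table)."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def build_dfa(pattern: str) -> List[Dict[str, int]]:
    """Return delta, where delta[state][base] is the next state.
    Being in state s means the last s characters read equal pattern[:s]."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    m = len(pattern)
    fail = failure_function(pattern)
    delta: List[Dict[str, int]] = []
    for state in range(m + 1):
        row: Dict[str, int] = {}
        for base in DNA_ALPHABET:
            if state < m and pattern[state] == base:
                row[base] = state + 1          # pattern continues
            elif state == 0:
                row[base] = 0                  # nothing matched yet
            else:
                # Copy the transition of the fallback state, whose row
                # was filled in earlier because fail[state - 1] < state.
                row[base] = delta[fail[state - 1]][base]
        delta.append(row)
    return delta


def find_all(pattern: str, text: str) -> List[int]:
    """Return the 0-based start index of every (possibly overlapping) match."""
    delta = build_dfa(pattern)
    m = len(pattern)
    state = 0
    hits = []
    for i, base in enumerate(text):
        # One table lookup per character; symbols outside the alphabet
        # (for example N) reset the automaton to the start state.
        state = delta[state].get(base, 0)
        if state == m:
            hits.append(i - m + 1)
    return hits


print(find_all("ACG", "AACGTACG"))
```

Precomputing the full transition table costs O(m × 4) for DNA, in exchange for exactly one lookup per text character during the scan.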

2️⃣ Aho-Corasick Algorithm

  • Best for: Multiple patterns simultaneously
  • Time: O(n + m + z) | Space: O(m×k)
  • Key: Trie with failure links
  • Why: Find all motifs in one pass
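
The trie-plus-failure-links idea can be sketched as follows. This is an illustrative version, not the contents of `src/aho_corasick.py`; the function names `build_automaton` and `search` and the `(position, pattern)` result format are assumptions made for the example.

```python
from collections import deque
from typing import Dict, List, Tuple


def build_automaton(patterns: List[str]):
    """Build the trie (goto), failure links (fail), and output lists (out)."""
    goto: List[Dict[str, int]] = [{}]  # trie edges per node; node 0 is the root
    fail: List[int] = [0]              # failure link per node
    out: List[List[str]] = [[]]        # patterns recognised at each node

    # Phase 1: insert every pattern into the trie.
    for pat in patterns:
        if not pat:
            continue  # skip empty patterns
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)

    # Phase 2: breadth-first pass to set failure links.
    queue = deque(goto[0].values())    # depth-1 nodes fall back to the root
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit matches that end at the fallback node.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def search(patterns: List[str], text: str) -> List[Tuple[int, str]]:
    """Return (0-based start, pattern) for every match, in one pass over text."""
    goto, fail, out = build_automaton(patterns)
    node = 0
    hits = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits


print(search(["ACG", "CG", "GT"], "AACGTACG"))
```

Because the text pointer only moves forward, the scan cost does not depend on how many patterns were loaded, apart from the time spent reporting matches.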

3️⃣ Performance Comparison

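Here n is the text length, m the pattern length, k the number of patterns, and z the number of matches reported.

| Algorithm | Single Pattern | Multiple Patterns | Space | Backtrack |
|---|---|---|---|---|
| Naive | O(n×m) | O(n×k×m) | O(1) | Yes |
| DFA | O(n) | O(n×k) | O(m) | No ✓ |
| Aho-Corasick | O(n) | O(n+z) | O(m×k) | No ✓ |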
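
For reference, the naive baseline in the first row can be sketched as below. The function name `naive_find_all` and the comparison counter are hypothetical additions for illustration; the counter makes the O(n×m) worst case visible on a repetitive input.

```python
from typing import List, Tuple


def naive_find_all(pattern: str, text: str) -> Tuple[List[int], int]:
    """Try every start position; return (0-based match starts, comparisons made)."""
    n, m = len(text), len(pattern)
    hits: List[int] = []
    comparisons = 0
    for start in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[start + j] != pattern[j]:
                break  # mismatch: restart one position later (backtracking)
            j += 1
        if j == m:
            hits.append(start)
    return hits, comparisons


# Worst-case style input: almost every window matches until its last base.
hits, comparisons = naive_find_all("AAAT", "A" * 20)
print(hits, comparisons)
```

On this input every one of the 17 windows needs 4 comparisons, so the work grows with the product of text and pattern length rather than their sum.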

🏃 Getting Started

Prerequisites

Python 3.9+
pip
Git

Installation

# Clone repository
git clone <repo-url>
cd DNA-DFA-using-python

# Virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Run Tests

# All tests (76+)
pytest tests/ -v

# Specific test file
pytest tests/test_dfa_engine.py -v

# With coverage
pytest tests/ --cov=src --cov-report=html

Launch Web App

cd app
streamlit run app.py
# Opens: http://localhost:8501

📖 Usage

Example 1: Find Pattern

from src.dfa_engine import DFAStateMachine

# Create DFA for pattern
dfa = DFAStateMachine("ACG")

# Find matches in sequence
matches = dfa.match("AACGTACG")
# Output: [{'position': 1, 'sequence': 'ACG', 'score': 1.0}, ...]

Example 2: Motif Analysis

from src.motif_database import MotifDatabase
from src.dfa_engine import DFAStateMachine

# Get TATA box motif
motif = MotifDatabase.PROMOTER_MOTIFS['TATA_BOX']['sequence']

# Search in sequence
dfa = DFAStateMachine(motif)
results = dfa.match("ATGCTATAAACGATGC")
# Finds TATAAA at position 4 (0-based, as in Example 1), i.e. the 5th base

Example 3: Validate Sequence

from src.sequence_handler import SequenceHandler

# Check if valid
is_valid = SequenceHandler.validate_sequence("ATGC")  # True
is_valid = SequenceHandler.validate_sequence("ATGCX")  # False
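
For readers curious what such a check involves, a validator along these lines would reproduce the results above. This is a hypothetical stand-in, not the `SequenceHandler` implementation: the name `is_valid_dna`, the case-insensitive comparison, and the rejection of empty input are all assumptions.

```python
VALID_BASES = frozenset("ACGTN")  # the four bases plus N for an unknown base


def is_valid_dna(sequence: str) -> bool:
    """Return True if the sequence is non-empty and uses only A, C, G, T, N."""
    if not sequence:
        return False  # assumption: an empty sequence is treated as invalid
    return set(sequence.upper()) <= VALID_BASES


print(is_valid_dna("ATGC"))   # valid bases only
print(is_valid_dna("ATGCX"))  # X is not a DNA base
```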

🧪 Testing

Test Coverage

| Module | Tests | Coverage |
|---|---|---|
| test_dfa_engine.py | 17 tests | DFA matching, failure function |
| test_sequence_handler.py | 21 tests | Validation, FASTA loading |
| test_motif_database.py | 13 tests | Motif retrieval |
| test_benchmark.py | 25 tests | Performance testing |
| Total | 76+ tests | ✓ All passing |

Sample Test Output

pytest tests/ -v
========================= 76 passed in 2.34s =========================
✓ DFA pattern matching tests
✓ Sequence validation tests
✓ Motif database tests
✓ Benchmark performance tests

📊 Complexity Analysis

DFA: O(n + m)

Build failure function: O(m)
Scan text: O(n) - each char processed once
Total: O(n + m) ✓
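
One way to check the O(n) scan bound empirically is to count character comparisons during a KMP scan and confirm they never exceed 2n. The sketch below is an illustration under that assumption; `kmp_scan_count` is a hypothetical helper, not part of `src/performance_analyzer.py`.

```python
from typing import List, Tuple


def kmp_scan_count(pattern: str, text: str) -> Tuple[List[int], int]:
    """KMP scan returning (0-based match starts, character comparisons)."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    m = len(pattern)

    # Build the failure function in O(m).
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k

    # Scan the text; the index i only ever moves forward.
    hits: List[int] = []
    comparisons = 0
    k = 0
    for i, ch in enumerate(text):
        while True:
            comparisons += 1
            if ch == pattern[k]:
                k += 1
                break
            if k == 0:
                break
            k = fail[k - 1]  # fall back without re-reading earlier text
        if k == m:
            hits.append(i - m + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return hits, comparisons


text = "A" * 20
hits, comparisons = kmp_scan_count("AAAT", text)
print(hits, comparisons, comparisons <= 2 * len(text))
```

Each comparison either consumes a text character or shortens the current partial match, and the partial match can only grow once per character, which is where the 2n ceiling comes from.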

Aho-Corasick: O(n + m + z)

Trie construction: O(m)
Text scanning: O(n)
Output matches: O(z)
Total: O(n + m + z)

Space Complexity

| Algorithm | Space |
|---|---|
| DFA | O(m) |
| Aho-Corasick | O(m × k) |
| Naive | O(1) |

🎓 Learning Outcomes

After this project:

  • ✅ Understand DFA construction and execution
  • ✅ Master KMP and Aho-Corasick algorithms
  • ✅ Work with trie data structures
  • ✅ Analyze O(n) vs O(n²) complexity
  • ✅ Apply automata theory to real problems
  • ✅ Write production-quality code with tests

👥 Team

| ID | Name |
|---|---|
| 24BAI1040 | Tejvir Singh |
| 24BAI1049 | Mouli Gupta |
| 24BAI1629 | Chitwan Singh |
| 24BAI1631 | Sreeansh Dash |

Last Updated: January 2026 | Python: 3.9+ | License: MIT
