Theory of Computation Project | Slot: C1+TC1 | Faculty: Dr Amutha S
- Features β’ Installation β’ Usage β’ Testing β’ Architecture
Fast, efficient DNA pattern matching using Deterministic Finite Automaton (DFA) and Aho-Corasick Algorithm. Built with a web-based frontend for easy visualization of pattern matching, motif analysis, and performance benchmarking.
- O(n) Linear Time Complexity - No backtracking, process each character once
- Real-World Applications - Gene discovery, restriction mapping, disease detection
- Educational - Learn automata theory, string algorithms, and complexity analysis
- Production-Ready - Fully tested with 76+ test cases
| Feature | Description | Algorithm |
|---|---|---|
| Pattern Search | Single-pattern matching | DFA (KMP-based) |
| Motif Analysis | Search known DNA motifs | DFA |
| Multi-Pattern | Search multiple patterns simultaneously | Aho-Corasick |
| Benchmarks | Performance testing & O(n) verification | Custom analyzer |
| Web Interface | Interactive Streamlit dashboard | Frontend |
| Validation | DNA sequence validation (ATCG + N) | SequenceHandler |
| Layer | Tech |
|---|---|
| Frontend | Streamlit, Plotly, Matplotlib |
| Backend | Python 3.9+, BioPython |
| Testing | Pytest (76+ tests), Coverage analysis |
| Algorithms | Custom DFA, Aho-Corasick (from scratch) |
βββ src/
β βββ dfa_engine.py # DFA pattern matching (KMP-based)
β βββ aho_corasick.py # Multi-pattern matching
β βββ sequence_handler.py # DNA validation & processing
β βββ motif_database.py # Biological motif database
β βββ benchmark.py # Performance testing
β βββ performance_analyzer.py # Complexity verification
β
βββ app/
β βββ app.py # Streamlit web interface (5 pages)
β
βββ tests/ # 76+ test cases β
β βββ test_dfa_engine.py # DFA tests
β βββ test_sequence_handler.py # Validation tests
β βββ test_motif_database.py # Motif tests
β βββ test_benchmark.py # Performance tests
β
βββ requirements.txt
- Best for: Single pattern matching
- Time: O(n) | Space: O(m)
- Key: KMP failure function for smart state transitions
- No backtracking - each character processed exactly once
- Best for: Multiple patterns simultaneously
- Time: O(n + m + z) | Space: O(mΓk)
- Key: Trie with failure links
- Why: Find all motifs in one pass
| Algorithm | Single Pattern | Multiple Patterns | Space | Backtrack |
|---|---|---|---|---|
| Naive | O(nΓm) | O(nΓkΓm) | O(1) | Yes |
| DFA | O(n) | O(nΓk) | O(m) | No β |
| Aho-Corasick | O(n) | O(n+z) | O(m) | No β |
Python 3.9+
pip
Git# Clone repository
git clone <repo-url>
cd DNA-DFA-using-python
# Virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt# All tests (76+)
pytest tests/ -v
# Specific test file
pytest tests/test_dfa_engine.py -v
# With coverage
pytest tests/ --cov=src --cov-report=htmlcd app
streamlit run app.py
# Opens: http://localhost:8501from src.dfa_engine import DFAStateMachine
# Create DFA for pattern
dfa = DFAStateMachine("ACG")
# Find matches in sequence
matches = dfa.match("AACGTACG")
# Output: [{'position': 1, 'sequence': 'ACG', 'score': 1.0}, ...]from src.motif_database import MotifDatabase
from src.dfa_engine import DFAStateMachine
# Get TATA box motif
motif = MotifDatabase.PROMOTER_MOTIFS['TATA_BOX']['sequence']
# Search in sequence
dfa = DFAStateMachine(motif)
results = dfa.match("ATGCTATAAACGATGC")
# Found TATAAA at position 5from src.sequence_handler import SequenceHandler
# Check if valid
is_valid = SequenceHandler.validate_sequence("ATGC") # True
is_valid = SequenceHandler.validate_sequence("ATGCX") # False| Module | Tests | Coverage |
|---|---|---|
| test_dfa_engine.py | 17 tests | DFA matching, failure function |
| test_sequence_handler.py | 21 tests | Validation, FASTA loading |
| test_motif_database.py | 13 tests | Motif retrieval |
| test_benchmark.py | 25 tests | Performance testing |
| Total | 76+ tests | β All passing |
pytest tests/ -v
========================= 76 passed in 2.34s =========================
β DFA pattern matching tests
β Sequence validation tests
β Motif database tests
β Benchmark performance testsBuild failure function: O(m)
Scan text: O(n) - each char processed once
Total: O(n + m) β
Trie construction: O(m)
Text scanning: O(n)
Output matches: O(z)
Total: O(n + m + z)
| Algorithm | Space |
|---|---|
| DFA | O(m) |
| Aho-Corasick | O(m Γ k) |
| Naive | O(1) |
After this project:
- β Understand DFA construction and execution
- β Master KMP and Aho-Corasick algorithms
- β Work with trie data structures
- β Analyze O(n) vs O(nΒ²) complexity
- β Apply automata theory to real problems
- β Write production-quality code with tests
Algorithms
Tools
| ID | Name |
|---|---|
| 24BAI1040 | Tejvir Singh |
| 24BAI1049 | Mouli Gupta |
| 24BAI1629 | Chitwan Singh |
| 24BAI1631 | Sreeansh Dash |
Last Updated: January 2026 | Python: 3.9+ | License: MIT