ALPHABET provides haplogroup branch estimates
ALPHABET is a tool for detecting mtDNA haplogroup branches in low-coverage or highly contaminated ancient DNA samples. It uses the PhyloTree 17 phylogeny to compute coverage statistics for each haplogroup node and to find the haplogroup branch(es) with the highest support.
ALPHABET attempts to find the branches with the highest support; it is not a haplogroup caller!
- singularity
- nextflow v22.10 (or newer)
The input for ALPHABET is a folder with BAM-files. These BAM-files should contain human mtDNA sequences (e.g. the Hominidae 'Extracted Reads' provided by the quicksand pipeline).
ALPHABET maps these sequences to the RSRS, extracts deaminated sequences (based on C-to-T damage) and creates summary statistics for each Haplogroup-Node in the PhyloTree 17 release (see workflow section below).
--split DIR Directory containing BAM-files with human mtDNA sequences
--min_support N filter haplogroup branches below support of N (default: 70)
--max_gaps N filter haplogroup branches with more than N consecutive unsupported (or missing) intermediate nodes (default: 3)
(Work in Progress)
The ALPHABET output tells you which parts of the mtDNA haplogroup phylogeny are most compatible with the data, and why. It produces three files for each input file (NAME) in out/06_haplogroups/:
- Full table for each node in the tree (NAME.raw.tsv)
- Filtered table, showing the path to haplogroups with the lowest penalty (NAME.best.tsv)
- Filtered table, showing only nodes with >=70% branch support AND <= 3 consecutive gaps (NAME.70.0perc_3gaps.tsv)
NAME.raw.tsv
Full table containing all haplogroup nodes in PhyloTree 17 (unfiltered) and the coverage stats.
NAME.best.tsv
Filtered table showing the path(s) through the tree with the lowest overall penalty, representing the best-supported haplogroup branches.
NAME.70.0perc_3gaps.tsv
This file corresponds to the default filtering thresholds and is intended for manual inspection of other supported branches.
The table shows only haplogroup nodes that:
- Have ≥ 70% accumulated branch support, and
- Contain ≤ 3 consecutive unsupported (or missing) intermediate nodes
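The two default thresholds can be illustrated with a minimal Python sketch. It assumes that the gap criterion is evaluated on the RequiredGaps column and the support criterion on BranchSupportPercent; the node names and values in the example table are made up, and the real files contain more columns.

```python
import csv
import io


def filter_nodes(rows, min_support=70.0, max_gaps=3):
    """Keep rows that pass both thresholds (assumed column names)."""
    kept = []
    for row in rows:
        support = float(row["BranchSupportPercent"])
        gaps = int(row["RequiredGaps"])  # assumption: this column holds the consecutive gaps
        if support >= min_support and gaps <= max_gaps:
            kept.append(row)
    return kept


# Hypothetical, heavily reduced table for illustration only
example_tsv = (
    "PhyloTree\tBranchSupportPercent\tRequiredGaps\n"
    "node_a\t85.0\t1\n"
    "node_b\t60.0\t0\n"
    "node_c\t90.0\t4\n"
    "node_d\t70.0\t3\n"
)

rows = list(csv.DictReader(io.StringIO(example_tsv), delimiter="\t"))
kept_names = [row["PhyloTree"] for row in filter_nodes(rows)]
print(kept_names)
```

In this made-up table, node_b fails the support threshold and node_c exceeds the gap limit, so only node_a and node_d remain.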
- Order: Incrementing index reflecting the original row order in the table.
- Parent: The immediate parent haplogroup of the node (useful for custom parsing or tree reconstruction).
- PhyloTree: The haplogroup path as defined in PhyloTree 17.
- Penalty: Score used to find the best nodes. Lower values indicate better-supported branches. Calculated as SumOfGaps + TotalMismatch + DistanceToBest.
- RequiredGaps: Number of intermediate haplogroup nodes skipped since the last supported node on the branch.
- SumOfGaps: Total number of skipped intermediate haplogroup nodes along the entire branch from the root to this node.
- BranchSupport: Accumulated number of supported haplogroup-defining positions along the branch.
- TotalMismatch: Absolute number of mismatches along the branch (covered diagnostic positions minus supported positions). Used to calculate Penalty.
- DistanceToBest: The maximum number of supported positions observed in any branch minus the number of supported positions accumulated up to this node. Indicates missing unexplained variation in the branch and is used to calculate the Penalty.
- BranchSupportPercent: Accumulated branch support (as percentage).
- PositionSupport: The number of haplogroup-defining positions that are covered by sequences and share the required state (see 'ReadCoverage')
- SequenceSupport: Coverage support for each diagnostic position. Shows the number of sequences that cover the position and support the diagnostic state.
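As a worked example of the Penalty column, the following Python sketch applies the three definitions given above (TotalMismatch, DistanceToBest, Penalty) to two made-up nodes. The dictionary keys `covered` and `supported` are hypothetical helper names for the accumulated covered and supported diagnostic positions; this is not ALPHABET's actual code.

```python
def total_mismatch(covered, supported):
    """Covered diagnostic positions minus supported positions."""
    return covered - supported


def distance_to_best(max_supported, supported):
    """Best support seen in any branch minus support accumulated at this node."""
    return max_supported - supported


def penalty(sum_of_gaps, mismatch, distance):
    """Penalty = SumOfGaps + TotalMismatch + DistanceToBest."""
    return sum_of_gaps + mismatch + distance


# Hypothetical accumulated values for two nodes
nodes = [
    {"name": "node_a", "SumOfGaps": 0, "covered": 10, "supported": 9},
    {"name": "node_b", "SumOfGaps": 2, "covered": 8, "supported": 5},
]

max_supported = max(node["supported"] for node in nodes)

scores = {}
for node in nodes:
    mismatch = total_mismatch(node["covered"], node["supported"])
    distance = distance_to_best(max_supported, node["supported"])
    scores[node["name"]] = penalty(node["SumOfGaps"], mismatch, distance)

best_node = min(scores, key=scores.get)  # lowest penalty wins
print(scores, best_node)
```

Here node_a has one mismatch and no gaps (penalty 1), while node_b accumulates 2 gaps, 3 mismatches and a distance of 4 to the best branch (penalty 9).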
Files are (re-)mapped to the RSRS (Reconstructed Sapiens Reference Sequence) with BWA and saved to out/01_bwa/
Files are filtered for minimum mapping quality (25) and minimum length (35). Then the alignment is sorted using samtools.
PCR duplicates are removed with bam-rmdup and saved to out/02_uniq/
Sequences that overlap low-complexity poly-C stretches (positions 303-315, 513-576, 3565-3576 and 16184-16193) are removed from the alignment with bedtools intersect. These poly-C stretches can introduce fake C-to-T deamination patterns.
The filtered alignment is saved to out/03_bedfilter/
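The overlap test behind this step can be sketched in a few lines of Python. The sketch assumes 1-based, inclusive coordinates for both the regions and the reads (bedtools itself works on 0-based, half-open BED intervals), and the function name is hypothetical.

```python
# Poly-C stretches listed above, as 1-based inclusive (start, end) pairs
POLY_C_REGIONS = [(303, 315), (513, 576), (3565, 3576), (16184, 16193)]


def overlaps_poly_c(read_start, read_end, regions=POLY_C_REGIONS):
    """Return True if the read [read_start, read_end] touches any region."""
    return any(
        read_start <= region_end and read_end >= region_start
        for region_start, region_end in regions
    )


# Example reads as (start, end); only those without overlap are kept
reads = [(300, 340), (316, 350), (500, 513), (1000, 1050)]
kept = [read for read in reads if not overlaps_poly_c(*read)]
print(kept)
```

A read is discarded as soon as a single base falls inside one of the stretches, which matches the behaviour described for this step.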
Variable positions can cause C-to-T differences in the first and last 3 bases because of haplogroup differences, rather than DNA damage.
Positions in the alignment are masked if the majority of sequences (but at least 2 sequences) show a different base than the reference.
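A minimal Python sketch of this masking rule is shown below. It assumes that "majority" means strictly more than half of the sequences covering the position; the function name and the input format (a list of observed bases per position) are hypothetical.

```python
def should_mask(ref_base, observed_bases, min_seqs=2):
    """Mask a position if most covering sequences differ from the reference.

    Assumption: 'majority' is a strict majority (> 50%), and at least
    `min_seqs` sequences must carry a non-reference base.
    """
    differing = sum(
        1 for base in observed_bases if base.upper() != ref_base.upper()
    )
    return differing >= min_seqs and differing > len(observed_bases) / 2


# Two of three sequences differ: masked
print(should_mask("C", ["T", "T", "C"]))
# Only one sequence covers the position: never masked
print(should_mask("C", ["T"]))
```

The minimum of two sequences prevents a single damaged read from masking a position on its own.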
The pileup and the extracted positions are saved to out/04_pileup/. The positions are only masked for the extraction of deaminated sequences.
Sequences are extracted that have a C-to-T substitution in the first or last three bases (unless this substitution is at one of the masked positions). Deaminated sequences are saved to out/05_deaminated/
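The extraction criterion can be sketched in Python as follows. This is a simplified illustration, not the pipeline's implementation: it assumes gap-free alignments (no indels), sequences given in reference orientation, and a hypothetical set of masked 1-based reference positions.

```python
def is_deaminated(ref_seq, read_seq, ref_start,
                  masked_positions=frozenset(), n_terminal=3):
    """Check for a C-to-T substitution in the first or last n_terminal bases.

    ref_seq and read_seq are equal-length, gap-free aligned strings;
    ref_start is the 1-based reference position of the first base.
    """
    length = len(read_seq)
    terminal = set(range(min(n_terminal, length)))
    terminal |= set(range(max(length - n_terminal, 0), length))
    for i in sorted(terminal):
        if ref_start + i in masked_positions:
            continue  # substitution may reflect a haplogroup difference
        if ref_seq[i].upper() == "C" and read_seq[i].upper() == "T":
            return True
    return False


# C-to-T at the first base: counted as deaminated
print(is_deaminated("CACGTACGTA", "TACGTACGTA", ref_start=100))
# Same read, but position 100 is masked: not counted
print(is_deaminated("CACGTACGTA", "TACGTACGTA", ref_start=100,
                    masked_positions={100}))
```

Substitutions in the interior of a read are ignored, since only the terminal bases are informative for damage in this step.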
Set the mapping quality score of the first and last three T-bases of all deaminated sequences to 0. They are ignored in the phylotree-analysis.
Walk through the PhyloTree-file and create summary statistics for each haplogroup node. The resulting tables are saved to out/06_haplogroups/
This repository includes the RSRS-based PhyloTree17 XML-file provided under MIT License by the Institute of Genetic Epidemiology, Innsbruck