A tree-based approach to detecting haplogroups in low-coverage human mtDNA sequences

merszym/alphabet

ALPHABET

ALPHABET provides haplogroup branch estimates

ALPHABET is a tool for detecting mtDNA haplogroup branches in low-coverage or highly contaminated ancient DNA samples. It uses the PhyloTree 17 phylogeny to compute coverage statistics for each haplogroup node and to find the haplogroup branch(es) with the highest support.

ALPHABET attempts to find the branches with the highest support; it is not a haplogroup caller!

Requirements

  • singularity
  • nextflow v22.10 (or later)

Usage

The input for ALPHABET is a folder with BAM-files. These BAM-files should contain human mtDNA sequences (e.g. the Hominidae 'Extracted Reads' provided by the quicksand pipeline).

ALPHABET maps these sequences to the RSRS, extracts deaminated sequences (based on C-to-T damage) and creates summary statistics for each Haplogroup-Node in the PhyloTree 17 release (see workflow section below).

flags

--split          DIR     Directory containing BAM-files with human mtDNA sequences
--min_support    N       Filter out haplogroup branches with less than N% branch support (default: 70)
--max_gaps       N       Filter out haplogroup branches with more than N consecutive unsupported (or missing) intermediate nodes (default: 3)

quickstart

(Work in Progress)

Output files

The alphabet output tells you which parts of the mtDNA haplogroup phylogeny are most compatible with the data, and why. It produces three files for each input-file (NAME) (in out/06_haplogroups/):

  1. Full table for each node in the tree (NAME.raw.tsv)
  2. Filtered table, showing the path to haplogroups with the lowest penalty (NAME.best.tsv)
  3. Filtered table, showing only nodes with >=70% branch-support AND <= 3 consecutive gaps (NAME.70.0perc_3gaps.tsv)

NAME.raw.tsv

Full table containing all haplogroup nodes in PhyloTree 17 (unfiltered) and the coverage stats.

NAME.best.tsv

Filtered table showing the path(s) through the tree with the lowest overall penalty, representing the best-supported haplogroup branches.

NAME.70.0perc_3gaps.tsv

This file corresponds to the default filtering thresholds and is intended for manual inspection of other supported branches. The table shows only haplogroup nodes that:

  • Have ≥ 70% accumulated branch support, and
  • Contain ≤ 3 consecutive unsupported (or missing) intermediate nodes
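The two default thresholds can be illustrated with a minimal Python sketch. It assumes that the filter is applied to the BranchSupportPercent and RequiredGaps columns described below; that mapping, the helper name `passes_default_filter` and the example rows are illustrative assumptions, not taken from the pipeline code.

```python
def passes_default_filter(row, min_support=70.0, max_gaps=3):
    """Return True if a table row meets both thresholds (assumed column mapping)."""
    support = float(row["BranchSupportPercent"])
    gaps = int(row["RequiredGaps"])
    return support >= min_support and gaps <= max_gaps

# Made-up rows, shaped like lines read from a tab-separated table.
rows = [
    {"PhyloTree": "node_a", "BranchSupportPercent": "85.0", "RequiredGaps": "1"},
    {"PhyloTree": "node_b", "BranchSupportPercent": "65.0", "RequiredGaps": "0"},
    {"PhyloTree": "node_c", "BranchSupportPercent": "90.0", "RequiredGaps": "4"},
]

kept = [r["PhyloTree"] for r in rows if passes_default_filter(r)]
print(kept)
```

Only node_a passes here: node_b falls below 70% support and node_c exceeds three gaps.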

Column description

  • Order: Incrementing index reflecting the original row order in the table.
  • Parent: The immediate parent haplogroup of the node (useful for custom parsing or tree reconstruction).
  • PhyloTree: The haplogroup path as defined in PhyloTree 17.
  • Penalty: Score used to find the best nodes. Lower values indicate better-supported branches. Calculated as SumOfGaps + TotalMismatch + DistanceToBest.
  • RequiredGaps: Number of intermediate haplogroup nodes skipped since the last supported node on the branch.
  • SumOfGaps: Total number of skipped intermediate haplogroup nodes along the entire branch from the root to this node.
  • BranchSupport: Accumulated number of supported haplogroup-defining positions along the branch.
  • TotalMismatch: Absolute number of mismatches along the branch (covered diagnostic positions minus supported positions). Used to calculate Penalty.
  • DistanceToBest: The maximum number of supported positions observed in any branch minus the number of supported positions accumulated up to this node. Indicates missing unexplained variation in the branch and is used to calculate the Penalty.
  • BranchSupportPercent: Accumulated branch support (as percentage).
  • PositionSupport: The number of haplogroup-defining positions that are covered by sequences and share the required state (see 'ReadCoverage').
  • SequenceSupport: Coverage support for each diagnostic position. Shows the number of sequences that cover the position and support the diagnostic state.
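The Penalty arithmetic described in the column list can be written out as a short Python sketch. The function and argument names are hypothetical, and the numbers in the example are made up for illustration.

```python
def total_mismatch(covered, supported):
    """Covered diagnostic positions minus supported positions along the branch."""
    return covered - supported

def distance_to_best(best_supported, supported):
    """Maximum support seen in any branch minus the support accumulated so far."""
    return best_supported - supported

def penalty(sum_of_gaps, covered, supported, best_supported):
    """Penalty = SumOfGaps + TotalMismatch + DistanceToBest."""
    return (
        sum_of_gaps
        + total_mismatch(covered, supported)
        + distance_to_best(best_supported, supported)
    )

# Branch with 2 skipped nodes, 10 covered and 8 supported positions,
# while the best branch in the tree has 12 supported positions.
example = penalty(sum_of_gaps=2, covered=10, supported=8, best_supported=12)
print(example)  # 2 + (10 - 8) + (12 - 8) = 8
```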

Workflow

1. Mapping with BWA

Files are (re-)mapped to the RSRS (Reconstructed Sapiens Reference Sequence) with BWA and saved to out/01_bwa/

2. Filter Alignment

Files are filtered for minimum mapping quality (25) and minimum length (35). Then the alignment is sorted using samtools
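As an illustration of the two thresholds, the following Python sketch applies them to simplified read records. The `Read` type and `keep_read` helper are hypothetical stand-ins for illustration only. The sketch also assumes both minimums are inclusive.

```python
from typing import NamedTuple

class Read(NamedTuple):
    name: str
    mapq: int       # mapping quality of the alignment
    sequence: str   # read bases

MIN_MAPQ = 25
MIN_LENGTH = 35

def keep_read(read):
    """Keep reads that reach both minimums (assumed to be inclusive)."""
    return read.mapq >= MIN_MAPQ and len(read.sequence) >= MIN_LENGTH

reads = [
    Read("r1", 37, "A" * 40),  # passes both thresholds
    Read("r2", 20, "A" * 40),  # mapping quality too low
    Read("r3", 37, "A" * 30),  # too short
]
kept_names = [r.name for r in reads if keep_read(r)]
print(kept_names)
```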

3. Duplicate Removal

PCR duplicates are removed with bam-rmdup and saved to out/02_uniq/

4. Remove Poly-C Stretches

Sequences that overlap low-complexity poly-C stretches (positions 303-315, 513-576, 3565-3576 and 16184-16193) are removed from the alignment with bedtools intersect. These poly-C stretches can introduce false C-to-T deamination patterns.

The filtered alignment is saved to out/03_bedfilter/
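The overlap test behind this step can be sketched in a few lines of Python. This is only an illustration of the interval logic, not the bedtools call used by the pipeline. It assumes 1-based, inclusive coordinates for both the reads and the listed regions, and the helper name `overlaps_poly_c` is hypothetical.

```python
# Poly-C regions as listed above (assumed 1-based, inclusive).
POLY_C_REGIONS = [(303, 315), (513, 576), (3565, 3576), (16184, 16193)]

def overlaps_poly_c(start, end, regions=POLY_C_REGIONS):
    """Return True if the read interval [start, end] touches any region."""
    return any(start <= r_end and end >= r_start for r_start, r_end in regions)

print(overlaps_poly_c(290, 310))  # read reaches into 303-315
print(overlaps_poly_c(316, 400))  # read starts just after 303-315
```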

5. Mask variable positions

Variable positions can cause C-to-T differences in the first and last 3 bases because of haplogroup differences, rather than DNA damage.

Positions in the alignment are masked if the majority of sequences (but at least 2 sequences) show a different base than the reference.

The pileup and the extracted positions are saved to out/04_pileup/. The positions are only masked for the extraction of deaminated sequences.
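One possible reading of the masking rule, as a Python sketch: a position is masked when at least two sequences differ from the reference and the differing sequences outnumber the matching ones. The function name and the strict-majority interpretation are assumptions.

```python
def is_variable_position(ref_base, pileup_bases, min_alt=2):
    """Return True if most bases (and at least min_alt) differ from the reference."""
    alt = sum(1 for base in pileup_bases if base.upper() != ref_base.upper())
    return alt >= min_alt and alt > len(pileup_bases) / 2

print(is_variable_position("C", "TTTC"))  # 3 of 4 sequences differ
print(is_variable_position("C", "TC"))    # only 1 sequence differs
```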

6. Extract deaminated sequences

Sequences that have a C-to-T substitution in the first or last three bases are extracted (unless this substitution is at one of the masked positions). Deaminated sequences are saved to out/05_deaminated/
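The terminal C-to-T check can be sketched as follows. This is a simplified illustration under stated assumptions: the read and reference are given as gap-free aligned strings of equal length, strand handling is omitted, and positions are 1-based reference coordinates. The function and parameter names are hypothetical.

```python
def is_deaminated(ref_seq, read_seq, ref_start, masked_positions=frozenset(), n_terminal=3):
    """Return True if an unmasked C-to-T difference lies in the terminal bases.

    ref_seq and read_seq are aligned, gap-free strings of equal length;
    ref_start is the 1-based reference position of the first base.
    """
    length = len(read_seq)
    head = range(min(n_terminal, length))
    tail = range(max(length - n_terminal, 0), length)
    for i in set(head) | set(tail):
        if ref_start + i in masked_positions:
            continue  # variable position, not counted as damage
        if ref_seq[i] == "C" and read_seq[i] == "T":
            return True
    return False

print(is_deaminated("CAGTACGT", "TAGTACGT", 100))         # C-to-T at the first base
print(is_deaminated("CAGTACGT", "TAGTACGT", 100, {100}))  # same site, but masked
```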

7. Mask first and last three bases

The base quality score of every T within the first and last three bases of each deaminated sequence is set to 0, so these bases are ignored in the PhyloTree analysis.
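A minimal Python sketch of this masking step, assuming it operates on per-base quality values and affects only T bases within the first and last three positions of a read. The helper name `mask_terminal_t` is hypothetical.

```python
def mask_terminal_t(sequence, qualities, n_terminal=3):
    """Return a copy of qualities with terminal T bases set to 0."""
    masked = list(qualities)
    length = len(sequence)
    for i, base in enumerate(sequence):
        is_terminal = i < n_terminal or i >= length - n_terminal
        if is_terminal and base == "T":
            masked[i] = 0
    return masked

# The T at index 4 is not within the terminal three bases and keeps its quality.
masked = mask_terminal_t("TACGTTGT", [30] * 8)
print(masked)
```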

8. Haplogroup Statistics

Walk through the PhyloTree-file and create summary statistics for each haplogroup node. The resulting tables are saved to out/06_haplogroups/
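How branch statistics could accumulate during such a walk is sketched below. This is one possible reading of the BranchSupport, RequiredGaps and SumOfGaps columns, not the pipeline's implementation. It treats a node without supported positions as a gap, does not count the root as a gap, and uses a toy tree with made-up node names and counts.

```python
def walk_tree(children, supported, root="root"):
    """Accumulate per-node branch statistics from the root downwards.

    children:  dict mapping a node to its list of child nodes
    supported: dict mapping a node to its number of supported positions
    """
    stats = {}
    # Stack entries: node, support above the node, consecutive unsupported
    # ancestors, and total unsupported ancestors on the branch.
    stack = [(root, 0, 0, 0)]
    while stack:
        node, support_above, gap_run, gap_total = stack.pop()
        own = supported.get(node, 0)
        stats[node] = {
            "BranchSupport": support_above + own,
            "RequiredGaps": gap_run,
            "SumOfGaps": gap_total,
        }
        is_gap = own == 0 and node != root
        next_run = gap_run + 1 if is_gap else 0
        next_total = gap_total + 1 if is_gap else gap_total
        for child in children.get(node, []):
            stack.append((child, support_above + own, next_run, next_total))
    return stats

# Toy branch: root -> A -> A1 -> A1a, where A1 has no supported positions.
children = {"root": ["A"], "A": ["A1"], "A1": ["A1a"], "A1a": []}
supported = {"A": 2, "A1": 0, "A1a": 3}
stats = walk_tree(children, supported)
print(stats["A1a"])
```

In this toy branch, A1a accumulates five supported positions and is reached by skipping one unsupported node (A1).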

Resources

This repository includes the RSRS-based PhyloTree17 XML-file provided under MIT License by the Institute of Genetic Epidemiology, Innsbruck.
