Functional and ecological determinants of evolutionary dynamics in crAss-like phages

This project was directed at microbial / viral ecology, metagenomics, focusing on gene-level and codon-level variation in reference genomes (of phages) and the mapping of multiple metagenomic samples to those references.

In the first phase I did wide net phage ecology from mining.
In in the second phase I focused on a common phage crAssphage in the gut microbiome. I wrote scripts to calculates selective pressure and micro diversity measures of genes of reference genomes. Variation of genes was found by mapping metagenomic reads of samples against these reference genomes.

contains Python code.
It could be merged with Linux bash scripts, which call the Python programs and therefore also contain clear examples on parameter usage.

to do

installation instructions
all Conda dependencies (refer to generated file by Conda?)

tip: consider using github commit messages as part of the documentation

Summary

This pipeline calculates selective pressure and micro diversity measures of genes of reference genomes. Variation of genes is found by mapping metagenomic reads of samples against these reference genomes. Reference genomes, paired-end fastq files of samples and gene annotation information should be provided. This project also contains a specific gene annotation pipeline, which can be replaced by any other method. The main pipeline uses DiversiTools to calculate measures on codon level, which are then further aggregated on gene and gene family level.

Examples

Accompanying shell scripts are in the mgx folder of: https://github.com/resharp/phages_scripts

dependencies

numpy, pandas
matplotlib.pyplot, seaborn
scipy.stats
biopython, Bio.SeqIO

Installation

this repository can be best put in local directory:

source/phages

the shell scripts reference the Python scripts by this path

Annotation pipeline

AnnotateCrassGenomes.py
- All results of annotation pipeline are integrated, input:
  - HMM search results for Yutin HMM profiles
  - searched all genes against HMM profiles of pVOG database (Grazziotin 2017, date accessed 14/1/2020).
  - to do: (all-to-all pairwise blastp search between the genes)

help scripts

create_proteins_from_gbk.py
- We extracted the proteins from NC_024711.1 (genus 1)
- output: crassphage_refseq.proteins.faa
- used for Yutin hmm-searching, pairwise blastp and pVOG searches
extract_positions_from_prodigal_predictions.py
- generates gene position files (as input for DiversiTools) from the prodigal gene predictions
create_categories_from_gbk.py
- For genus 1 we created the gene positions from the GenBank file

measures calculated per sample

CalcDiversiMeasures.py
- aggregate the output of DiversiTools
- calculated the following measures per gene per sample
- (also per sliding window/bin per sample):
  - Mean and standard deviation (dev) of coverage
  - Mean and standard dev of top amino acid (AA) count, second AA count and third AA count
  - Sum of synonymous and non-synonymous counts
  - Total number of SCPs
  - dN/dS
  - entropy
  - pN/pS
MakeGenePlots.py
- aggregates the output of CalcDiversiMeasures.py. It contains
  - analysis for one reference genome
  - analysis for gene families based on multiple ref genomes
- statistical testing and plotting
- (uses matplotlib.pyplot and seaborn)

creates figures 2 and 3 of main text
and supplementary tables [to do]

measures calculated across samples

CalcCodonMeasures.py
- For cross-sample measures all reads of the samples were first stacked on top of each other, ...
MakeDiversityPlots.py
- ... after which we calculated the following measures for every codon position
  - the overall coverage
  - the total number of possible codon types in all samples (equal to one when no variation is present; we accepted a codon polymorphism (SCP) if it occurred at least in three reads in all samples and at a minimum of 1% of the total depth at that codon position.)
  - the entropy
  - pN/pS

creates figure 4 of main text
and supplementary tables [to do]

help files

MakeSynProbabilitiesTable.py
- creates table for codon bias
- output: codon_syn_non_syn_probabilities.txt

macro diversity

MakeSamplePlots.py
- calculated macro diversity per sample based on the sample/genome mapping statistics produced by samtools idxstats (Li 2009) on the bam files

creates figure 1 and 5 of main text

Summary table

MakeGeneSummary.py
- integrates different statistics for genus 1 genes and gene families in one table
- also contains a pangenome analysis based on the hmm hits from the protein family profiles

extra stuff

Scripts used for selecting samples

ExtractSraMetadata.py
- parses metadata files in MGXDB database
- extract samples containg a certain taxonid from metadata files User for proof of concept:
DownloadMgnifySamples.py
- download metadata for samples and runs based on some properties
  - human gut fecal samples
  - only metagenomics data sets (and being able to filter on this in runs)
  - also add information from analysis:
  - number of reads
download_fastq.py
- download ...

clustering proteins and prophages from prophage predictions by VirSorter

extra dependencies

igraph, unittest2

scipts

This is obsolete, and should be moved to different GIT repository:

MclClusterEvaluation.py
PhageSharedContent.py
PhageClusterEvaluation.py
VirsorterStats.py
VirsorterStatsTest.py
- also depends on MyTestCase.py
PhageContentPlots.py
MyTestCase.py
- parent class of all test case classes (contains generic code for logging start of function)

Code for learning Pandas and python:

IctvStats.py
- code for getting statistics from:
- ICTV Master Species List 2018b.v1.csv
  - downloaded from https://talk.ictvonline.org/
- prokaryotes.cv
IctvStatsTest.py
- accompanying tests

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Functional and ecological determinants of evolutionary dynamics in crAss-like phages

Summary

Examples

dependencies

Installation

Annotation pipeline

measures calculated per sample

measures calculated across samples

macro diversity

Summary table

extra stuff

clustering proteins and prophages from prophage predictions by VirSorter

extra dependencies

scipts

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 216 Commits
examples		examples
input_files		input_files
.gitignore		.gitignore
AnnotateCrassGenomes.py		AnnotateCrassGenomes.py
CalcCodonMeasures.py		CalcCodonMeasures.py
CalcDiversiMeasures.py		CalcDiversiMeasures.py
DownloadMgnifySamples.py		DownloadMgnifySamples.py
ExtractSraMetadata.py		ExtractSraMetadata.py
IctvStats.py		IctvStats.py
IctvStatsTest.py		IctvStatsTest.py
MakeDiversityPlots.py		MakeDiversityPlots.py
MakeGenePlots.py		MakeGenePlots.py
MakeGeneSummary.py		MakeGeneSummary.py
MakeSamplePlots.py		MakeSamplePlots.py
MakeSynProbabilitiesTable.py		MakeSynProbabilitiesTable.py
MclClusterEvaluation.py		MclClusterEvaluation.py
MyTestCase.py		MyTestCase.py
PhageClusterEvaluation.py		PhageClusterEvaluation.py
PhageContentPlots.py		PhageContentPlots.py
PhageSharedContent.py		PhageSharedContent.py
README.md		README.md
VirsorterStats.py		VirsorterStats.py
VirsorterStatsTest.py		VirsorterStatsTest.py
codon_syn_non_syn_probabilities.txt		codon_syn_non_syn_probabilities.txt
create_categories_from_gbk.py		create_categories_from_gbk.py
create_proteins_from_gbk.py		create_proteins_from_gbk.py
download_fastq.py		download_fastq.py
extract_positions_from_prodigal_predictions.py		extract_positions_from_prodigal_predictions.py

resharp/phages

Folders and files

Latest commit

History

Repository files navigation

Functional and ecological determinants of evolutionary dynamics in crAss-like phages

Summary

Examples

dependencies

Installation

Annotation pipeline

measures calculated per sample

measures calculated across samples

macro diversity

Summary table

extra stuff

clustering proteins and prophages from prophage predictions by VirSorter

extra dependencies

scipts

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages