Notes:
This folder contains the main Bash script and the R scripts for each processing step. Details of each step are explained in the workflow figure and description below.
The data_processing.sh script processes raw FASTQ files using the nf-core/rnavar pipeline, which performs quality control, alignment with STAR, base quality score recalibration, and variant calling with GATK4 HaplotypeCaller.
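As a rough illustration of the step above, the sketch below builds a typical nf-core/rnavar command line and prints it without launching anything. The helper name `rnavar_cmd`, the sample sheet and output paths, and the `docker` profile are assumptions for illustration only; the options actually used are those in data_processing.sh.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build (but do not run) an nf-core/rnavar command line.
# $1: sample sheet CSV, $2: output directory (both hypothetical paths).
rnavar_cmd() {
  local samplesheet="$1" outdir="$2"
  echo nextflow run nf-core/rnavar \
    -profile docker \
    --input "$samplesheet" \
    --outdir "$outdir" \
    --genome GRCh38
}

# Dry run: print the command so it can be inspected before launching.
rnavar_cmd samplesheet.csv results
```

Printing the command first makes it easy to check paths and the reference genome before committing compute time to a full run.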
Gene and transposable element (TE) expression is quantified from the aligned BAM files using featureCounts v2.0, with genes annotated against GRCh38 and TEs annotated against RepeatMasker. Gene counts are normalised to transcripts per million (TPM), and TE counts are normalised using the variance-stabilising transformation (VST) at the class and family levels.
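To make the TPM normalisation concrete, here is a minimal sketch that converts a single-sample featureCounts table into TPM values with awk. It assumes the standard featureCounts layout (a leading `#` comment line, a `Geneid` header row, gene length in column 6 and counts in column 7); the function name `counts_to_tpm` is hypothetical, and the R scripts in this folder may compute TPM differently.

```shell
#!/usr/bin/env bash
set -euo pipefail

# counts_to_tpm FILE
# Reads a single-sample featureCounts table and prints "Geneid<TAB>TPM".
# TPM_i = (count_i / length_i) / sum_j(count_j / length_j) * 1e6
counts_to_tpm() {
  awk -F'\t' '
    /^#/ { next }              # skip the featureCounts command-line comment
    $1 == "Geneid" { next }    # skip the header row
    {
      rate[$1] = $7 / $6       # reads per base for this gene
      order[++n] = $1          # remember input order
      total += rate[$1]
    }
    END {
      for (i = 1; i <= n; i++) {
        tpm = (total > 0) ? rate[order[i]] / total * 1e6 : 0
        printf "%s\t%.2f\n", order[i], tpm
      }
    }
  ' "$1"
}
```

For example, `counts_to_tpm gene_counts.txt > gene_tpm.tsv` would write one TPM value per gene, where `gene_counts.txt` is a placeholder file name. Dividing by gene length before rescaling is what distinguishes TPM from a simple counts-per-million.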
Variant call format (VCF) files generated by HaplotypeCaller are annotated with clinical relevance and mutation information using VEP, with reference to ClinVar (20241111) and COSMIC (v102). The annotated VCFs are then converted into MAF files and merged at the cohort level. Finally, to reduce artefacts and low-confidence calls, the filtering strategy described by Jakobsen et al. [1] is applied, retaining only high-confidence pathogenic somatic variants.
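The cohort-level merge can be pictured as concatenating the per-sample MAF files while keeping a single header row. The sketch below is one possible way to do this with awk, assuming tab-delimited MAFs that may start with `#` comment lines and share the same columns; the function name `merge_mafs` and the file names are hypothetical, and it does not implement the filtering from Jakobsen et al.

```shell
#!/usr/bin/env bash
set -euo pipefail

# merge_mafs FILE...
# Concatenates MAF files, dropping comment lines and repeated header rows.
merge_mafs() {
  awk '
    /^#/ { next }                  # drop "#version"-style comment lines
    seen[FILENAME]++ == 0 {        # first non-comment line of each file is its header
      if (!printed++) print        # keep only the first header encountered
      next
    }
    { print }                      # variant rows pass through unchanged
  ' "$@"
}
```

A call such as `merge_mafs sample1.maf sample2.maf > cohort.maf` would then produce a single table with one header row followed by the variant rows from every sample.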
[1] Jakobsen, N. A. et al. Selective advantage of mutant stem cells in human clonal hematopoiesis is associated with attenuated response to inflammation and aging. Cell Stem Cell 31, 1127–1144.e1117 (2024).
This folder contains the scripts and data used to generate the figures for the project.
Notes:
- Each figure has its corresponding script and data file(s).
- For detailed explanations of the analyses, please refer to the Methods section and the figure legends in the manuscript.