This repository contains information and scripts to train and benchmark genomic deep learning models on their performance in cell-type specific regulatory regions. It makes use of pre-trained Enformer and Sei models, which can be downloaded from the linked sources. It also makes use of scripts from the Basenji repository for training additional models to benchmark different training decisions.
We have organized the repository by figure. Within the generate_figures/ directory, there is a notebook per figure to reproduce the analyses. Additional preprocessing and model training scripts are included in the relevant directories.
This code has been tested on Python 3.7 and uses TensorFlow 2.1. Please set up a conda environment and install the packages listed in the requirements.txt file.
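The setup described above could look roughly like the following. This is a minimal sketch, not the authors' exact procedure: the environment name `celltype-benchmark`, the `DRY_RUN` switch and the `run` and `setup_env` helpers are assumptions introduced for illustration. By default the snippet only prints the commands it would run.

```shell
#!/usr/bin/env bash
# ENV_NAME is a hypothetical environment name; any name works.
ENV_NAME="${ENV_NAME:-celltype-benchmark}"
# DRY_RUN=1 (the default here) prints commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

# Echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

setup_env() {
    # Create an environment pinned to the tested Python version.
    run conda create -y -n "$ENV_NAME" python=3.7 &&
    # Install the listed packages inside that environment without activating it.
    run conda run -n "$ENV_NAME" python -m pip install -r requirements.txt
}

setup_env
```

Running the snippet with `DRY_RUN=0` from the repository root executes the two commands instead of printing them.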
Several different datasets and pre-trained models are used in these analyses. The following instructions can be used to download the relevant resources.
Processed data and resources used in this repository will be made available for download soon. This includes the model parameters and model weights of tissue-specific models, and various data used for evaluating performance in different peak regions. Details on how to download additional resources, such as pre-trained models and large datasets hosted elsewhere, are described below.
- The pre-trained Enformer model can be downloaded from TFhub (link)
- Enformer training, validation and test data can be downloaded from Google Cloud (link). Note: This data is ~320 GB and is in a requester pays bucket.
- Pre-computed variant effect predictions for all frequent variants in the 1000 Genomes cohort (MAF>0.05% in any population) can be downloaded from Google Cloud (link). Note: This data is ~100 GB.
- The pre-trained Sei model and relevant resources can be downloaded from Zenodo (model, resources)
- Sei test data and predictions can be downloaded from S3 using the following instructions. Note: This data is 186 GB compressed and 612 GB decompressed.
wget https://sei-files.s3.amazonaws.com/performance_curves.tar.gz
tar -xzvf performance_curves.tar.gz
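Given the sizes quoted above, it may be worth checking free disk space before starting the Sei download. The following is a sketch under stated assumptions: `has_free_gb` is a hypothetical helper, 1 GB is treated as 1024 × 1024 KB, and the 798 GB threshold assumes the 186 GB archive is kept on disk until the 612 GB extraction finishes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if DIR has at least NEEDED_GB gigabytes free.
has_free_gb() {
    local dir="$1" needed_gb="$2" avail_kb
    # df -Pk prints one POSIX-format line per filesystem in 1 KB blocks;
    # the fourth column of the second line is the available space.
    avail_kb="$(df -Pk "$dir" | awk 'NR==2 {print $4}')"
    [ "$avail_kb" -ge $((needed_gb * 1024 * 1024)) ]
}

# 186 GB (archive) + 612 GB (extracted) = 798 GB.
if has_free_gb . 798; then
    echo "Enough free space: proceed with wget and tar."
else
    echo "Less than 798 GB free in $(pwd)." >&2
fi
```

If the archive is deleted or stored on a different volume, a lower threshold passed to `has_free_gb` would be appropriate.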
Calderon et al. data (reference)
- Bigwig files used to train models can be downloaded from S3 using the download_bigwigs.sh script (link)
- Cell type specific peaks and allelic imbalance data can be found in Supplementary Table 1 of Calderon et al. (2019). Cell type specific peaks are found in the sheet lineage_groups and allelic imbalance data is found in the sheet significant_ASCs.
- These data are available upon request and will be made public upon publication of Loeb, et al. Variants in tubule epithelial regulatory elements mediate most heritable differences in human kidney function. (Submitted).
- GTEx SuSiE fine-mapped eQTL data from Wang et al. (2021) and Avsec et al. (2021) can be downloaded from Google Cloud (link)
- UK Biobank GWAS summary statistics can be downloaded from the Neale lab server using the download_gwas_sumstats.sh script (link)
Within the scripts/ directory, a README within each subfolder describes how to perform the relevant analysis.