ni-lab/CellTypeSpecificAccessibilityPrediction

CellTypeSpecificAccessibilityPrediction

This repository contains information and scripts to train and benchmark genomic deep learning models on their performance in cell-type specific regulatory regions. It makes use of pre-trained Enformer and Sei models, which can be downloaded from the linked sources. It also makes use of scripts from the Basenji repository for training additional models to benchmark different training decisions.

Overview

We have organized the repository by figure. Within the generate_figures/ directory, there is a notebook per figure to reproduce the analyses. Additional preprocessing and model training scripts are included in the relevant directories.

This code has been tested on Python 3.7 with TensorFlow 2.1. To get started, set up a conda environment and install the packages listed in requirements.txt.
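
Before running the notebooks, it can help to confirm that the active environment matches the tested versions. The following is a minimal sketch and not part of the repository: `matches_tested` and `installed_tf_version` are hypothetical helper names, and the sketch assumes TensorFlow is importable as `tensorflow`.

```python
import sys

TESTED_PYTHON = (3, 7)  # stated above: tested on Python 3.7
TESTED_TF = (2, 1)      # stated above: TensorFlow 2.1


def matches_tested(python_version, tf_version):
    """Return True if the (major, minor) Python version and the TensorFlow
    version string both match the tested versions."""
    if tuple(python_version[:2]) != TESTED_PYTHON:
        return False
    if not tf_version:
        return False
    parts = tf_version.split(".")
    try:
        # Compare numerically so that "2.10.0" is not mistaken for 2.1.
        return (int(parts[0]), int(parts[1])) == TESTED_TF
    except (IndexError, ValueError):
        return False


def installed_tf_version():
    """Return the installed TensorFlow version string, or None if it is missing."""
    try:
        import tensorflow as tf  # imported lazily so the check runs without it
    except ImportError:
        return None
    return tf.__version__


if __name__ == "__main__":
    ok = matches_tested(sys.version_info, installed_tf_version())
    print("environment matches tested versions:", ok)
```

Other versions may well work; the check only reports whether the environment differs from the tested one.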

Data and models

Several different datasets and pre-trained models are used in these analyses. The following instructions can be used to download the relevant resources.

Processed data and resources used in this repository will be made available for download soon. This includes the parameters and weights of the tissue-specific models, and various data used for evaluating performance in different peak regions. Details on how to download additional resources, such as pre-trained models and large datasets hosted elsewhere, are described below.

Enformer

  • The pre-trained Enformer model can be downloaded from TFhub (link)
  • Enformer training, validation and test data can be downloaded from Google Cloud (link). Note: This data is ~320 GB and is in a requester-pays bucket.
  • Pre-computed variant effect predictions for all frequent variants in the 1000 Genomes cohort (MAF>0.05% in any population) can be downloaded from Google Cloud (link). Note: This data is ~100 GB.
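
Because the Enformer training data sits in a requester-pays bucket, the download has to name a Google Cloud project to bill. The sketch below builds such a command using gsutil's top-level `-u` (billing project) and `-m` (parallel transfer) options. The function name is hypothetical, and the bucket URL and project ID are left as arguments: take the URL from the link above and supply your own project.

```python
def requester_pays_copy_command(billing_project, source_url, destination):
    """Build a gsutil command that recursively copies from a requester-pays
    bucket, billing the transfer to `billing_project`."""
    if not source_url.startswith("gs://"):
        raise ValueError("source_url must be a gs:// URL")
    return [
        "gsutil",
        "-u", billing_project,  # project charged for the requester-pays access
        "-m",                   # run the copy with parallel transfers
        "cp", "-r",
        source_url,
        destination,
    ]
```

The returned list can be passed to `subprocess.run(command, check=True)` once gsutil is installed and authenticated. At ~320 GB, transfer charges to the billing project may be noticeable.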

Sei

  • The pre-trained Sei model and relevant resources can be downloaded from Zenodo (model, resources)
  • Sei test data and predictions can be downloaded from S3 using the following commands. Note: This data is 186 GB compressed and 612 GB decompressed.
    wget https://sei-files.s3.amazonaws.com/performance_curves.tar.gz
    tar -xzvf performance_curves.tar.gz
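
Given the sizes quoted above, it may be worth checking free disk space before starting the download. This is a minimal sketch using only the standard library. The helper names are hypothetical, 1 GB is taken as 10^9 bytes, and the estimate assumes the archive stays on disk while it is extracted.

```python
import shutil

SEI_COMPRESSED_GB = 186    # archive size stated above
SEI_DECOMPRESSED_GB = 612  # extracted size stated above


def free_gb(path="."):
    """Return the free space, in GB (10**9 bytes), on the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_room(available_gb, keep_archive=True):
    """Return True if `available_gb` covers the extracted data, plus the
    archive itself when it remains on disk during extraction."""
    needed = SEI_DECOMPRESSED_GB + (SEI_COMPRESSED_GB if keep_archive else 0)
    return available_gb >= needed


if __name__ == "__main__":
    print("enough space for the Sei data:", has_room(free_gb(".")))
```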
    

Calderon et al. data (reference)

  • Bigwig files used to train models can be downloaded from S3 using the download_bigwigs.sh script (link)
  • Cell type specific peaks and allelic imbalance data can be found in Supplementary Table 1 of Calderon et al. (2019). Cell type specific peaks are found in the sheet lineage_groups and allelic imbalance data is found in the sheet significant_ASCs.
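
The two sheets named above can be read in one call with pandas. This sketch is illustrative only: the function name and the file path passed to it are hypothetical, the header layout of the sheets is not assumed (inspect the result before use), and reading .xlsx files requires an Excel engine such as openpyxl to be installed alongside pandas.

```python
# Sheet names as given above for Supplementary Table 1 of Calderon et al. (2019).
SHEETS = ["lineage_groups", "significant_ASCs"]


def load_calderon_sheets(xlsx_path):
    """Read both sheets from a local copy of the supplementary table.

    Returns a dict mapping each sheet name to a pandas DataFrame.
    """
    import pandas as pd  # imported lazily; pandas is an assumed dependency

    return pd.read_excel(xlsx_path, sheet_name=SHEETS)
```

For example, `load_calderon_sheets(path)["lineage_groups"]` would hold the cell type specific peaks, where `path` points at your downloaded copy of the table.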

Loeb et al. data

  • These data are available upon request and will be made public upon publication of Loeb, et al. Variants in tubule epithelial regulatory elements mediate most heritable differences in human kidney function. (Submitted).

Additional benchmark datasets

  • GTEx SuSiE fine-mapped eQTL data from Wang et al. (2021) and Avsec et al. (2021) can be downloaded from Google Cloud (link)
  • UK Biobank GWAS summary statistics can be downloaded from the Neale lab server using the download_gwas_sumstats.sh script (link)

Analysis

Within the scripts/ directory, a README within each subfolder describes how to perform the relevant analysis.
