Instructions
We will be working with two datasets in this course: a yeast DNA dataset with some thymidines substituted by BrdU and a human dataset of methylated DNA. The yeast DNA was sequenced using an R9 workflow, whereas the human DNA was sequenced using an R10 workflow. The yeast dataset is at this link and the human dataset is at this link. We will infer single-molecule DNA replication dynamics from BrdU-incorporation patterns in the yeast DNA dataset, and infer methylation patterns from the human dataset. Please read on to see how to download the datasets and prepare them for use.
If you are a participant in the Earlham Institute training course
You do not need to read any further than this section. You will be given login details to virtual machines by the organisers or the trainers. The input data you need has already been downloaded onto your computer. You need to copy it into the correct location. Please execute the following commands
mkdir -p ~/nanomod_course_data
cd ~/nanomod_course_data
cp -r /mnt/trainings/BaseMod2024/* .
You should now see two directories called yeast
and human
.
Data preparation is now complete.
If you are a self-study student
Yeast
Please download the yeast dataset from this link. Please put it in a suitable folder and untar it using the command
filename= # substitute filename suitably
tar -xzvf $filename
Then, move the contents of the resultant for_ckan
folder to a suitable location.
In the course, we place it under ~/nanomod_course_data/yeast
, but you
can use any suitable location.
Just remember to use the correct paths when executing the commands in each session.
Human
The human dataset we are using is > 1 TB but we will be using only a small fraction of it. As any human dataset is of a sensitive nature, we are unable to provide you the subset we are using directly and you will have to recreate it using the commands below. You will need to download tens of GB and use tens of minutes of computational time.
Please make a directory called ~/nanomod_course_data/human
to store the data.
As stated in the yeast subsection above, you can use any other directory.
Just remember to use the correct paths when executing commands in the course.
We first make a subset of a mod BAM file as shown below.
cd ~/nanomod_course_data/human
mod_bam=http://ont-open-data.s3.amazonaws.com/cliveome_kit14_2022.05/gdna/basecalls/PAM63974/bonito_calls.bam
samtools view -h -b -e 'rlen>=30000' $mod_bam chr20:58000000-60000000 > bonito_calls.subset.bam
Using samtools sort
and samtools index
,
sort and index the file to form bonito_calls.subset.sorted.bam
and bonito_calls.subset.sorted.bam.bai
.
You will learn how to run these commands on day 1, so if you do not know how to run them,
just come back to this section after day 1.
Next, we make a random subset of nanopore currents from a fast5 file after converting it to pod5.
cd ~/nanomod_course_data/human
wget http://ont-open-data.s3.amazonaws.com/cliveome_kit14_2022.05/gdna/flowcells/ONLA29134/20220510_1127_5H_PAM63974_a5e7a202/fast5_pass/PAM63974_pass_58881fec_60.fast5
pod5 convert fast5 ./PAM63974_pass_58881fec_60.fast5 --output ./PAM63974_pass_58881fec_60.pod5
pod5 view ./PAM63974_pass_58881fec_60.pod5 --ids --no-header | shuf | head -n 20 > twenty_read_ids.txt
pod5 filter ./PAM63974_pass_58881fec_60.pod5 --ids twenty_read_ids.txt --output ./PAM63974_pass_58881fec_60.twenty_random_reads.pod5
Next, we run pycoQC to judge the quality of the dataset. In the course run by the Earlham Institute, we run the program ourselves and give the results to the course participants to minimize computer runtime. Unfortunately, we cannot host these files on the internet and give them to you. The commands below require downloading tens of GB of data and tens of minutes of computational time.
cd ~/nanomod_course_data/human
wget http://ont-open-data.s3.amazonaws.com/cliveome_kit14_2022.05/gdna/flowcells/ONLA29134/20220510_1127_5H_PAM63974_a5e7a202/sequencing_summary_PAM63974_58881fec.txt
wget http://ont-open-data.s3.amazonaws.com/cliveome_kit14_2022.05/gdna/basecalls/PAM63974/bonito_calls.bam
wget http://ont-open-data.s3.amazonaws.com/cliveome_kit14_2022.05/gdna/basecalls/PAM63974/bonito_calls.bam.bai
pycoQC -f sequencing_summary_PAM63974_58881fec.txt\
-a bonito_calls.bam -o ./analysis.html -j ./analysis.json
Data preparation is now complete.