OK, so my enthusiasm is a bit more reserved now that I've reached the final phase: joint calling! My naive approach was to hack up the lovely featured WDLs to skip BQSR, etc. and produce first-round VCFs. What I failed to realize is that FireCloud can't gracefully handle 4,000 scatter jobs. My first run failed, and I can't even load the page to see what happened.
Scratching my head, I went back to the drawing board and the message boards. One way of solving this would be to build a 'pseudo genome' by joining my smaller contigs with runs of Ns, but that would require realignment and a bunch of other stuff I didn't feel like doing. Instead, I spent the past two days banging my head against the keyboard and came up with a workflow in which each "task" loops over an interval list, so that each process handles about 16 Mb of the genome. WDL needs some really special care to avoid bash variable expansion issues, so it took a lot of iteration to get something working.
The obvious solution would be for GenomicsDBImport to handle more than one interval at a time, which I hear is in development. Any updates on when this may happen?
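To make the "loop over an interval list" idea concrete, here is a minimal plain-bash sketch of just the looping part, outside of WDL. It assumes a Picard-style interval list (header lines starting with '@', then tab-separated contig, start, end columns); the function name intervals_to_regions and the demo contigs are made up for illustration.

```shell
# Print one "contig:start-end" string per record of an interval list.
# Header lines (starting with '@') are skipped.
intervals_to_regions() {
    grep -v '^@' "$1" | awk '{ printf "%s:%s-%s\n", $1, $2, $3 }'
}

# Tiny made-up interval list, just to exercise the function.
demo_list=$(mktemp)
printf '@HD\tVN:1.5\n@SQ\tSN:scaffold_1\tLN:5000\nscaffold_1\t1\t5000\t+\t.\nscaffold_2\t1\t1200\t+\t.\n' > "$demo_list"

# Each region would be handed to one GATK invocation inside the task.
i=0
intervals_to_regions "$demo_list" | while read -r region; do
    i=$((i + 1))
    echo "interval $i: $region"
done
rm -f "$demo_list"
```

In the real tasks the echo is replaced by a GATK call that takes the region as its -L argument.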
Here's my general approach. The contortions to get bash to play nice with WDL were a bit tricky, so if someone has suggestions for a 'better way' I'm all ears.
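For anyone puzzling over the ${dollar}{i} bit in the WDL: inside a command <<< >>> block this version of WDL interpolates ${...} itself, so a literal bash ${i} has to be smuggled in, here by defining a WDL string dollar = "$". After interpolation bash just sees sed "${i}q;d", which prints line i of its input. A plain-bash sketch of that lookup (nth_line and the file names are made up for the demo):

```shell
# Print line N of a file: 'd' deletes every line without printing it,
# except that on line N the 'q' command runs first, which prints that
# line and quits.
nth_line() {
    sed "${1}q;d" "$2"
}

demo=$(mktemp)
printf 'genomicsdb_1.tar\ngenomicsdb_2.tar\ngenomicsdb_3.tar\n' > "$demo"
i=2
nth_line "$i" "$demo"    # prints genomicsdb_2.tar
rm -f "$demo"
```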
I'm testing on FC right now and can report back if things get dicey. So far, the runtimes look reasonable. FWIW, I'm documenting all of this on GitHub: https://github.com/msuefishlab/pkings_firecloud
The general workflow goes something like this:
Run SplitIntervals on the original genome (using the appropriate mode to get roughly equal-sized chunks, for best performance on scatter):
./gatk SplitIntervals \
    -R draft_assembly.fasta \
    -L ./intervals.list \
    -scatter 50 \
    -O interval-files \
    -mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION_WITH_OVERFLOW
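The WDL below reads list_of_scatter_files with read_lines, so it needs a text file with one scatter-file path per line. A minimal sketch of building that list, assuming the split files have been copied somewhere the workflow can see; the helper name list_scatter_files and the bucket path in the usage comment are hypothetical.

```shell
# Print one path per file in a directory, in sorted order. An optional
# second argument (for example a bucket URL) replaces the local directory
# in the printed paths.
list_scatter_files() {
    dir=$1
    prefix=${2:-$1}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        printf '%s/%s\n' "$prefix" "$(basename "$f")"
    done
}

# Usage (gs://my-bucket/intervals is a made-up location):
#   list_scatter_files interval-files gs://my-bucket/intervals > list_of_scatter_files.txt
```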
Run the following WDL:
## Simplified Joint Genotyping With Contig 'Batching'
## Jason Gallant
## Modified From gatk/joint-discovery-gatk4 snapshot6 on Firecloud (bshifaw@broadinstitute.org)
workflow JointGenotyping {
  File list_of_scatter_files
  String callset_name
  File sample_name_map
  File ref_fasta
  File ref_fasta_index
  File ref_dict
  String gatk_docker
  String gatk_path
  String python_docker
  Int small_disk
  Int medium_disk
  Int huge_disk

  Array[File] scatter_array = read_lines(list_of_scatter_files)
  # should be a list of 50 lists of intervals, approximately same length (here 15 mbp)

  scatter (idx in range(length(scatter_array))) {
    call ImportGVCFs {
      input:
        sample_name_map = sample_name_map,
        interval_list = scatter_array[idx],
        workspace_dir_name = "genomicsdb",
        disk_size = medium_disk,
        docker_image = gatk_docker,
        gatk_path = gatk_path,
        batch_size = 50
    }
    output {
      ImportGVCFs.output_genomicsdb
    }
    call GenotypeGVCFs {
      input:
        workspace_tar = ImportGVCFs.output_genomicsdb,
        interval_list = scatter_array[idx],
        ref_fasta = ref_fasta,
        ref_fasta_index = ref_fasta_index,
        ref_dict = ref_dict,
        disk_size = medium_disk,
        docker_image = gatk_docker,
        gatk_path = gatk_path
    }
  } # end scatter

  call GatherVcfs as FinalGatherVcf {
    input:
      input_vcfs_fofn = write_lines(GenotypeGVCFs.output_vcf),
      output_vcf_name = callset_name + ".raw.vcf.gz",
      disk_size = medium_disk,
      docker_image = gatk_docker,
      gatk_path = gatk_path
  }

  output {
    # outputs from the small callset path through the wdl
    FinalGatherVcf.output_vcf
    FinalGatherVcf.output_vcf_index
  }
} # end workflow
task ImportGVCFs {
  File sample_name_map
  File interval_list
  String workspace_dir_name
  String java_opt
  String gatk_path
  String docker_image
  Int disk_size
  String mem_size
  Int preemptibles
  Int batch_size
  String dollar = "$"

  command <<<
    # Build one GenomicsDB workspace per interval; header lines ('@...') are skipped.
    i=0 && \
    grep -v '^@' ${interval_list} | while read -r line ; do
      let "i++"
      # Zero-pad the counter so that glob() (lexicographic order) returns the
      # tarballs in the same order as the intervals.
      idx=$(printf '%04d' "$i")
      the_interval=`echo $line | awk '{printf "%s:%s-%s\n", $1,$2,$3}'`
      echo "working on the $i th file..., interval $the_interval..."
      ${gatk_path} --java-options "${java_opt}" \
        GenomicsDBImport \
        --genomicsdb-workspace-path ${workspace_dir_name}_$idx \
        --batch-size ${batch_size} \
        -L $the_interval \
        --sample-name-map ${sample_name_map} \
        --reader-threads 5 \
        -ip 500
      tar -cf ${workspace_dir_name}_$idx.tar ${workspace_dir_name}_$idx
    done
  >>>

  output {
    Array[File] output_genomicsdb = glob("${workspace_dir_name}_*.tar")
  }

  runtime {
    docker: docker_image
    memory: mem_size
    cpu: "2"
    disks: "local-disk " + disk_size + " HDD"
    preemptible: preemptibles
  }
}
task GenotypeGVCFs {
  Array[File] workspace_tar
  File interval_list
  String gatk_path
  String java_opt
  File ref_fasta
  File ref_fasta_index
  File ref_dict
  String docker_image
  Int disk_size
  String mem_size
  Int preemptibles
  String dollar = "$"

  command <<<
    i=0 && \
    grep -v '^@' ${interval_list} | while read -r line ; do
      let "i++"
      # Zero-pad the counter so the output VCFs glob back in interval order.
      idx=$(printf '%04d' "$i")
      the_interval=`echo $line | awk '{printf "%s:%s-%s\n", $1,$2,$3}'`
      # Pick the i-th workspace tarball; this assumes workspace_tar is in the
      # same order as the intervals in interval_list.
      the_wkspc=$(cat ${write_lines(workspace_tar)} | sed "${dollar}{i}q;d")
      echo "working on the $i th file..., interval $the_interval..., and the workspace $the_wkspc"
      tar -xf $the_wkspc
      WORKSPACE=$( basename $the_wkspc .tar)
      ${gatk_path} --java-options "${java_opt}" \
        GenotypeGVCFs \
        -R ${ref_fasta} \
        -O output_$idx.vcf.gz \
        -G StandardAnnotation \
        --only-output-calls-starting-in-intervals \
        -new-qual \
        -V gendb://$WORKSPACE \
        -L $the_interval
    done
  >>>

  runtime {
    docker: docker_image
    memory: mem_size
    cpu: "2"
    disks: "local-disk " + disk_size + " HDD"
    preemptible: preemptibles
  }

  output {
    Array[File] output_vcf = glob("output_*.vcf.gz")
    Array[File] output_vcf_index = glob("output_*.vcf.gz.tbi")
  }
}
task GatherVcfs {
  File input_vcfs_fofn
  String output_vcf_name
  String gatk_path
  String java_opt
  String docker_image
  Int disk_size
  String mem_size
  Int preemptibles

  command <<<
    tr '\t' '\n' < ${input_vcfs_fofn} > inputs.args
    #cat inputs.args
    # ignoreSafetyChecks make a big performance difference so we include it in our invocation
    ${gatk_path} --java-options "${java_opt}" \
      GatherVcfsCloud \
      --ignore-safety-checks \
      --gather-type BLOCK \
      --input inputs.args \
      --output ${output_vcf_name}
    ${gatk_path} --java-options "-Xmx6g -Xms6g" \
      IndexFeatureFile \
      --feature-file ${output_vcf_name}
  >>>

  runtime {
    docker: docker_image
    memory: mem_size
    cpu: "1"
    disks: "local-disk " + disk_size + " HDD"
    preemptible: preemptibles
  }

  output {
    File output_vcf = "${output_vcf_name}"
    File output_vcf_index = "${output_vcf_name}.tbi"
  }
}
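A note on the glob() outputs: shell globs expand in sorted (lexicographic) order, and as far as I know Cromwell hands back glob() matches the same way, so a name ending in _10 sorts ahead of one ending in _2. Anything that pairs the i-th file from a glob with the i-th interval therefore depends on the loop counter being zero-padded. A small sketch of the effect (pad_index and the file names are made up for the demo):

```shell
# Zero-pad a loop counter so lexicographic (glob) order matches numeric order.
pad_index() {
    printf '%04d' "$1"
}

workdir=$(mktemp -d)
for i in 1 2 10; do
    : > "$workdir/plain_$i.tar"
    : > "$workdir/padded_$(pad_index "$i").tar"
done

# Unpadded names: plain_10.tar is listed before plain_2.tar.
( cd "$workdir" && echo plain_*.tar )
# Padded names come back in numeric order: 0001, 0002, 0010.
( cd "$workdir" && echo padded_*.tar )
rm -rf "$workdir"
```

Four digits is an arbitrary width; it only needs to be wide enough for the largest number of intervals in any one list.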