I'm trying to run the WDL posted on the gatk-workflows GitHub page, under the gatk4-germline-snps-indels repository. The WDL is "haplotypecaller-gvcf-gatk4.wdl".
I'm attempting to run this WDL locally on my computer. The script uses GATK in Docker containers to execute tools such as HaplotypeCaller and MergeVcfs. I'm using Cromwell in "run mode" to run the WDL script, with the exact inputs listed in the haplotypecaller-gvcf-gatk4.hg38.wgs.inputs.json file.
The BAM file is NA12878_24RG_small.hg38.bam, which is about 5 GB in size.
The FASTA file is Homo_sapiens_assembly38.fasta, which is about 3 GB in size.
Every time I run this, I eventually get out-of-memory errors. It looks like 50 GATK Docker containers get spun up and run HaplotypeCaller in parallel. I think this is because of the number of interval lists declared in hg38_wgs_scattered_calling_intervals.txt.
I'm running it on a machine with 32 GB of RAM and 512 GB of disk space. My questions are basically:
- How much RAM is needed to run this workflow?
- Should I set a limit on how much memory each docker container can use in the Cromwell configuration file, and if so, how much should I set it to?
- What should the Java heap size be set to?
- It looks like it is using the "scatter-gather" technique for parallelization. Does this require me to set up a cluster of servers to run the workflow? I'm not sure if I can run it like this on just my local computer.
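To put rough numbers on the memory question, here is a back-of-envelope sketch of why 50 parallel containers would not fit in 32 GB. The per-container figure (4 GB) and the OS reserve (4 GB) are placeholder assumptions of mine, not values taken from the WDL, and the function name is just for illustration.

```python
def max_concurrent_jobs(total_ram_gb: float,
                        per_container_gb: float,
                        reserve_gb: float = 4.0) -> int:
    """Return how many containers fit in RAM at once.

    reserve_gb is memory held back for the OS and Cromwell itself
    (an assumed value, not something I measured).
    """
    usable_gb = total_ram_gb - reserve_gb
    return max(0, int(usable_gb // per_container_gb))


total_ram_gb = 32        # my machine
scatter_count = 50       # containers I see running in parallel
per_container_gb = 4     # hypothetical per-container memory, NOT from the WDL

# Memory needed if all shards run at the same time.
peak_demand_gb = scatter_count * per_container_gb

# Number of shards that could run at once under these assumptions.
fit = max_concurrent_jobs(total_ram_gb, per_container_gb)

print(f"Peak demand if all {scatter_count} shards run at once: {peak_demand_gb} GB")
print(f"Shards that fit in {total_ram_gb} GB at once: {fit}")
```

If those placeholder numbers are anywhere near right, only a handful of shards could run at the same time on my machine, which would explain the errors.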
Any insight would be greatly appreciated. Thank you!