Hi,
I am testing BaseRecalibratorSpark on GCP through wdl-runner (i.e. "gcloud alpha genomics pipelines run"), since I would like to include the Spark version of this tool in my future WDL scripts once it is out of beta. However, I have encountered the following issue:
When I allocate disk space using dynamic sizing, the tool fails because it runs out of disk space; to avoid this I have to request roughly double the expected local disk size. Reading the info written to stderr, I found that the Spark engine copies all input files (including the reference FASTA and the dbSNP VCF, which are quite large) to a temp folder under /cromwell_root, even though the files are already in another sub-folder of /cromwell_root (they are copied there when the VM is created). This double copy makes running the tool on the cloud more expensive, both because of the substantial extra disk space required and because of the extra copying time.
An example of this:
19/02/12 12:39:20 INFO SparkContext: Added file /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta at file:/cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta with timestamp 1549975160856
19/02/12 12:39:20 INFO Utils: Copying /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta to /cromwell_root/spark-639f237f-74bc-41f4-86c7-af2f294f9c0c/userFiles-7a595035-e731-49fd-a493-f129381ff1e7/Homo_sapiens_assembly38.fasta
Or:
19/02/12 12:39:49 INFO SparkContext: Added file /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf at file:/cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf with timestamp 1549975189045
19/02/12 12:39:49 INFO Utils: Copying /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf to /cromwell_root/spark-639f237f-74bc-41f4-86c7-af2f294f9c0c/userFiles-7a595035-e731-49fd-a493-f129381ff1e7/Homo_sapiens_assembly38.dbsnp138.vcf
This also happens when I run the tool locally on my laptop.
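For now, my workaround is simply to double the dynamically computed disk size in the task's runtime block, roughly like this (just a sketch of my own task, not from any reference workflow; the input names and the extra headroom are placeholders):

Float ref_size = size(ref_fasta, "GB") + size(dbSNP_vcf, "GB")
# doubled to leave room for Spark's second copy of the inputs under /cromwell_root
Int disk_size = ceil(2 * (size(input_bam, "GB") + ref_size)) + 20

runtime {
  disks: "local-disk " + disk_size + " HDD"
}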
Is there a way to configure the tool to prevent this "double copying" of files? Could the NIO implementation be used here to access the big reference files directly from the Broad's public buckets (i.e. without copying them to the VM at all)?
This is the command used:
gatk --java-options "-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -XX:+PrintFlagsFinal -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:gc_log.log -Xms${command_mem}m" \
BaseRecalibratorSpark \
-R ${ref_fasta} \
-I ${input_bam} \
--use-original-qualities \
-O ${recalibration_report_filename} \
--known-sites ${dbSNP_vcf} \
--known-sites ${sep=" --known-sites " known_indels_sites_VCFs} \
-L ${sequence_group_interval} \
-- --spark-master 'local[*]'
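Regarding the NIO question above: if it is supported, I imagine the large reference inputs could be declared as String inputs holding gs:// URIs (so Cromwell would not localize them) and passed straight to the tool, something like the sketch below. The bucket paths are the public hg38 ones that appear in the log, but I am only assuming BaseRecalibratorSpark accepts gs:// paths via NIO:

gatk --java-options "-Xms${command_mem}m" \
BaseRecalibratorSpark \
-R gs://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta \
-I ${input_bam} \
--use-original-qualities \
-O ${recalibration_report_filename} \
--known-sites gs://broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf \
-L ${sequence_group_interval} \
-- --spark-master 'local[*]'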
Thanks in advance