Hi,
I am testing BaseRecalibratorSpark on GCP through wdl-runner (i.e. "gcloud alpha genomics pipelines run"), since I would like to include the Spark version of this tool in my future WDL scripts once it is out of beta. However, I have encountered the following issue:
When I allocate disk space using dynamic sizing, the tool fails because it runs out of disk space; to avoid this I have to request roughly double the expected local disk size. Reading the info written to stderr, I found that the Spark engine copies all input files (including the reference FASTA and the dbSNP VCF, which are quite large) to a temp folder under /cromwell_root, even though the files are already in another sub-folder of /cromwell_root (they are copied there when the VM is created). This double copy makes running the tool on the cloud more expensive, both because of the substantial extra disk space required and because of the extra copying time.
An example of this:
19/02/12 12:39:20 INFO SparkContext: Added file /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta at file:/cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta with timestamp 1549975160856
19/02/12 12:39:20 INFO Utils: Copying /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta to /cromwell_root/spark-639f237f-74bc-41f4-86c7-af2f294f9c0c/userFiles-7a595035-e731-49fd-a493-f129381ff1e7/Homo_sapiens_assembly38.fasta
Or:
19/02/12 12:39:49 INFO SparkContext: Added file /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf at file:/cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf with timestamp 1549975189045
19/02/12 12:39:49 INFO Utils: Copying /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf to /cromwell_root/spark-639f237f-74bc-41f4-86c7-af2f294f9c0c/userFiles-7a595035-e731-49fd-a493-f129381ff1e7/Homo_sapiens_assembly38.dbsnp138.vcf
This also happens when I run the tool locally on my laptop.
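For now, my workaround is simply to double the dynamically computed disk size in the task's runtime block, roughly like this (just a sketch of my own task, not from any reference workflow; the input names and the extra headroom are placeholders):

Float ref_size = size(ref_fasta, "GB") + size(dbSNP_vcf, "GB")
# doubled to leave room for Spark's second copy of the inputs under /cromwell_root
Int disk_size = ceil(2 * (size(input_bam, "GB") + ref_size)) + 20

runtime {
  disks: "local-disk " + disk_size + " HDD"
}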
Is there a way to configure the tool to prevent this "double copying" of files? Could the NIO implementation be used here to access the big reference files directly from the Broad's public buckets (i.e. without copying them to the VM at all)?
This is the command used:
gatk --java-options "-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -XX:+PrintFlagsFinal -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:gc_log.log -Xms${command_mem}m" \
BaseRecalibratorSpark \
-R ${ref_fasta} \
-I ${input_bam} \
--use-original-qualities \
-O ${recalibration_report_filename} \
--known-sites ${dbSNP_vcf} \
--known-sites ${sep=" --known-sites " known_indels_sites_VCFs} \
-L ${sequence_group_interval} \
-- --spark-master 'local[*]'
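Regarding the NIO question above: if it is supported, I imagine the large reference inputs could be declared as String inputs holding gs:// URIs (so Cromwell would not localize them) and passed straight to the tool, something like the sketch below. The bucket paths are the public hg38 ones that appear in the log, but I am only assuming BaseRecalibratorSpark accepts gs:// paths via NIO:

gatk --java-options "-Xms${command_mem}m" \
BaseRecalibratorSpark \
-R gs://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta \
-I ${input_bam} \
--use-original-qualities \
-O ${recalibration_report_filename} \
--known-sites gs://broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf \
-L ${sequence_group_interval} \
-- --spark-master 'local[*]'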
Thanks in advance