Hi,
I am working with WGS data, and since the files are so large (upwards of 300 GB in some cases), I'd like to avoid localizing the entire BAM for each shard when I scatter across many instances. Instead, I'd like each scatter instance to operate only on the portion of the BAM corresponding to the interval I've assigned to it. To this end, I'm trying to use samtools to view only certain regions of the BAM. I'm trying to follow the instructions listed here, but can't get it to work: http://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/data/data2/data_in_GCS.html
These are the commands being run in my instance, along with the resulting output and error message.
>> gcloud auth print-access-token
+ GCS_OAUTH_TOKEN=*******[redacted]********
>> samtools view gs://fc-47b16dc3-db04-48f5-a26a-ddec3c09c578/workspace_name/RP-1476/WGS/MSK-004_T_P1/v7/MSK-004_T_P1.bam 1:1-15000000
open: No such file or directory
[main_samview] fail to open "gs://fc-47b16dc3-db04-48f5-a26a-ddec3c09c578/workspace_name/RP-1476/WGS/MSK-004_T_P1/v7/MSK-004_T_P1.bam" for reading.
And this is the WDL task that generated those commands:
task ProportionalCoverage_WGS_Task {
    File reference
    File referenceDict
    File referenceIndex
    File inputBamLocation
    String sampleID
    Int memoryGb
    Int diskSpaceGb
    File targetsIntervalList
    Int preemptible

    command <<<
        samtools view ${inputBamLocation} $(head -n1 ${targetsIntervalList}) >> bam_section.bam
        samtools index bam_section.bam
        java -jar /gatk/gatk.jar CalculateTargetCoverage \
            -L ${targetsIntervalList} \
            --output ${sampleID}.pcov \
            --groupBy SAMPLE \
            --transform PCOV \
            --input bam_section.bam \
            --reference ${reference}
    >>>

    output {
        File pcov = "${sampleID}.pcov"
    }

    runtime {
        docker: "broadinstitute/gatk:4.beta.6"
        memory: "${memoryGb} GB"
        cpu: "1"
        disks: "local-disk ${diskSpaceGb} HDD"
        preemptible: preemptible
    }
}
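Stripped of the WDL wrapper, the streaming step I am after looks roughly like the shell sketch below. It is only a sketch under assumptions that are not confirmed anywhere in this post: that the samtools/htslib build in the image can open gs:// URLs at all, that the .bai index sits next to the BAM in the bucket, and that the interval file holds either ready-made `contig:start-end` regions or Picard-style tab-separated intervals. The function names `first_region` and `slice_bam` are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: stream one region of a remote BAM instead of localizing it.
# Assumptions: samtools/htslib was built with remote (libcurl) support so it
# can open gs:// URLs, and the BAM index (.bai) is stored next to the BAM.

# first_region FILE: print the first interval of FILE as a samtools region.
# Lines starting with "@" (Picard interval_list header) are skipped;
# tab-separated "contig start end ..." becomes "contig:start-end";
# a line that is already "contig:start-end" is printed unchanged.
first_region() {
  awk '!/^@/ && NF { if (NF >= 3) print $1 ":" $2 "-" $3; else print $1; exit }' "$1"
}

# slice_bam BAM_URL INTERVAL_FILE OUT_BAM (hypothetical helper):
# write only the reads overlapping the first interval, then index the slice.
slice_bam() {
  local bam_url=$1 intervals=$2 out=$3
  # htslib reads the token from the environment, so it has to be exported,
  # not just assigned as a shell variable.
  export GCS_OAUTH_TOKEN="$(gcloud auth print-access-token)"
  # -b keeps the output in BAM format so that "samtools index" can read it.
  samtools view -b -o "$out" "$bam_url" "$(first_region "$intervals")"
  samtools index "$out"
}
```

Intended usage would be something like `slice_bam gs://bucket/path/sample.bam targets.interval_list bam_section.bam` (bucket and path are placeholders). Inside a WDL task the BAM would presumably have to be passed as a String rather than a File, since a File input gets localized in full before the command runs.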
How can I do this? It would save me countless hours while developing my workflows for WGS, and I'm sure it would be very useful to others in the community.
Thanks,
Eric