Channel: Ask the FireCloud Team — GATK-Forum

Workaround for the GenomicsDBImport + Draft Assemblies + Firecloud = World of Pain Equation


OK, so my enthusiasm is a bit reserved now that I've gotten to the final phase, joint calling! My naive approach was to hack up the lovely featured WDLs to skip BQSR, etc., and produce first-round VCFs. What I neglected to realize is that FireCloud can't handle 4,000 scatter jobs gracefully. My first run failed, and I can't even load the page to see what happened.

Scratching my head, I went back to the drawing board and the message boards. One way to solve this would be to build a 'pseudo-genome' by linking my smaller contigs with runs of N's, but that would require realignment and a bunch of other work I didn't feel like doing. Instead, I spent the past two days banging my head against the keyboard and came up with a workflow in which each "task" loops over an interval list, so that each process handles about 16 Mbp of the genome. WDL command blocks need some really special care to avoid bash variable expansion issues, so there was a lot of iteration to get something to work.
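The core of the per-task loop is turning each non-header line of a Picard-style interval list into the contig:start-end form that GATK accepts on -L. Here's that conversion in plain bash, outside of WDL (the interval-list record below is made up for illustration):

```shell
# Hypothetical interval-list record (tab-separated: contig, start, stop, strand, name);
# in the tasks below, header lines starting with '@' are skipped via grep -v '^@'.
line=$(printf 'contig_0042\t1\t1500000\t+\ttarget_1')

# awk reformats the first three columns as contig:start-end for GATK's -L argument.
the_interval=$(echo "$line" | awk '{printf "%s:%s-%s\n", $1,$2,$3}')
echo "$the_interval"
```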

The obvious solution to this would be to get GenomicsDBImport to handle more than one interval at a time, which I hear is in development. Any updates on when this may happen?

Here's my general approach. The contortions to get bash to play nice with WDL were a bit tricky, so if someone has suggestions for a 'better way' I'm all ears.

I'm testing on FireCloud right now and can report back if things get dicey. So far, I'm getting some reasonable runtimes. I'm documenting all of this, FWIW, on GitHub: https://github.com/msuefishlab/pkings_firecloud.

The general workflow goes something like this:

Run SplitIntervals on the original genome, using the appropriate subdivision mode to get equal-sized chunks for best scatter performance:

./gatk SplitIntervals \
  -R draft_assembly.fasta \
  -L ./intervals.list \
  -scatter 50 \
  -O interval-files \
  -mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION_WITH_OVERFLOW
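SplitIntervals writes numbered interval files into the interval-files directory, and the workflow's list_of_scatter_files input then needs a file listing those paths, one per line. A sketch of wiring that up — the filenames here are stand-ins I created for illustration, so check what your GATK version actually emits before relying on the glob:

```shell
# Stand-in for the SplitIntervals output directory (a real run produces these files;
# here we create empty placeholders just to show the file-of-filenames step).
mkdir -p interval-files
touch interval-files/0000-scattered.interval_list
touch interval-files/0001-scattered.interval_list

# Build the file-of-filenames consumed by read_lines(list_of_scatter_files) in the WDL.
ls interval-files/*scattered* | sort > scatter_files.list
cat scatter_files.list
```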

Run the following WDL:

## Simplified Joint Genotyping With Contig 'Batching'
## Jason Gallant
## Modified From gatk/joint-discovery-gatk4 snapshot6 on Firecloud (bshifaw@broadinstitute.org)

workflow JointGenotyping {
  File list_of_scatter_files

  String callset_name
  File sample_name_map

  File ref_fasta
  File ref_fasta_index
  File ref_dict

  String gatk_docker
  String gatk_path
  String python_docker

  Int small_disk
  Int medium_disk
  Int huge_disk


  Array[File] scatter_array = read_lines(list_of_scatter_files)


  # scatter_array should be a list of ~50 interval lists of approximately equal total length (here ~15 Mbp each)
  scatter (idx in range(length(scatter_array))) {

    call ImportGVCFs {
      input:
        sample_name_map = sample_name_map,
        interval_list = scatter_array[idx],
        workspace_dir_name = "genomicsdb",
        disk_size = medium_disk,
        docker_image = gatk_docker,
        gatk_path = gatk_path,
        batch_size = 50
    }

    call GenotypeGVCFs {
      input:
        workspace_tar = ImportGVCFs.output_genomicsdb,
        interval_list = scatter_array[idx],
        ref_fasta = ref_fasta,
        ref_fasta_index = ref_fasta_index,
        ref_dict = ref_dict,
        disk_size = medium_disk,
        docker_image = gatk_docker,
        gatk_path = gatk_path
    }

  } # end scatter

call GatherVcfs as FinalGatherVcf {
  input:
    input_vcfs_fofn = write_lines(flatten(GenotypeGVCFs.output_vcf)),
    output_vcf_name = callset_name + ".raw.vcf.gz",
    disk_size = medium_disk,
    docker_image = gatk_docker,
    gatk_path = gatk_path
}

output {
  # outputs from the small callset path through the wdl
  FinalGatherVcf.output_vcf
  FinalGatherVcf.output_vcf_index
}

} # end workflow

task ImportGVCFs {
  File sample_name_map
  File interval_list

  String workspace_dir_name

  String java_opt
  String gatk_path

  String docker_image
  Int disk_size
  String mem_size
  Int preemptibles
  Int batch_size
  String dollar = "$"

  command <<<
    i=0
    grep -v '^@' ${interval_list} | while read -r line ; do
      let "i++"
      the_interval=$(echo "$line" | awk '{printf "%s:%s-%s\n", $1,$2,$3}')
      echo "working on file $i, interval $the_interval..."
      ${gatk_path} --java-options "${java_opt}" \
        GenomicsDBImport \
        --genomicsdb-workspace-path ${workspace_dir_name}_$i \
        --batch-size ${batch_size} \
        -L "$the_interval" \
        --sample-name-map ${sample_name_map} \
        --reader-threads 5 \
        -ip 500

      tar -cf ${workspace_dir_name}_$i.tar ${workspace_dir_name}_$i
    done

  >>>

  output {
          Array[File] output_genomicsdb = glob("${workspace_dir_name}_*.tar")
  }
  runtime {
    docker: docker_image
    memory: mem_size
    cpu: "2"
    disks: "local-disk " + disk_size + " HDD"
    preemptible: preemptibles
  }
}

task GenotypeGVCFs {
  Array[File] workspace_tar
  File interval_list
  String gatk_path
  String java_opt

  File ref_fasta
  File ref_fasta_index
  File ref_dict

  String docker_image
  Int disk_size
  String mem_size
  Int preemptibles
  String dollar = "$"


  command <<<

    i=0
    grep -v '^@' ${interval_list} | while read -r line ; do
      let "i++"
      the_interval=$(echo "$line" | awk '{printf "%s:%s-%s\n", $1,$2,$3}')
      # pick the i-th workspace tarball from the file of filenames
      the_wkspc=$(sed "${dollar}{i}q;d" ${write_lines(workspace_tar)})
      echo "working on file $i, interval $the_interval, workspace $the_wkspc"
      tar -xf "$the_wkspc"
      WORKSPACE=$(basename "$the_wkspc" .tar)

      ${gatk_path} --java-options "${java_opt}" \
        GenotypeGVCFs \
        -R ${ref_fasta} \
        -O output_$i.vcf.gz \
        -G StandardAnnotation \
        --only-output-calls-starting-in-intervals \
        -new-qual \
        -V gendb://$WORKSPACE \
        -L "$the_interval"
    done
  >>>
  runtime {
    docker: docker_image
    memory: mem_size
    cpu: "2"
    disks: "local-disk " + disk_size + " HDD"
    preemptible: preemptibles
  }
  output {
    Array[File] output_vcf = glob("output_*.vcf.gz")
    Array[File] output_vcf_index = glob("output_*.vcf.gz.tbi")
  }
}
task GatherVcfs {
  File input_vcfs_fofn
  String output_vcf_name

  String gatk_path
  String java_opt

  String docker_image
  Int disk_size
  String mem_size
  Int preemptibles

  command <<<

    tr '\t' '\n' < ${input_vcfs_fofn} > inputs.args
    #cat inputs.args

    # --ignore-safety-checks makes a big performance difference, so we include it in our invocation
    ${gatk_path} --java-options "${java_opt}" \
    GatherVcfsCloud \
    --ignore-safety-checks \
    --gather-type BLOCK \
    --input inputs.args \
    --output ${output_vcf_name}

    ${gatk_path} --java-options "-Xmx6g -Xms6g" \
    IndexFeatureFile \
    --feature-file ${output_vcf_name}
  >>>
  runtime {
    docker: docker_image
    memory: mem_size
    cpu: "1"
    disks: "local-disk " + disk_size + " HDD"
    preemptible: preemptibles
  }
  output {
    File output_vcf = "${output_vcf_name}"
    File output_vcf_index = "${output_vcf_name}.tbi"
  }
}
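One WDL quirk worth calling out: inside a command block, ${...} is WDL interpolation, so bash's own ${i} has to be smuggled in via the String dollar = "$" declaration — writing ${dollar}{i} in the WDL emits a literal ${i} into the rendered bash script. The sed "${dollar}{i}q;d" line in GenotypeGVCFs is then just the classic "print the i-th line" idiom. In plain bash it looks like this (the file contents are made up to stand in for write_lines(workspace_tar)):

```shell
# Fake file-of-filenames standing in for the write_lines(workspace_tar) output.
printf '%s\n' genomicsdb_1.tar genomicsdb_2.tar genomicsdb_3.tar > workspaces.txt

# Pick the i-th line: 'd' deletes every line, but '2q' prints line 2 and quits first.
i=2
the_wkspc=$(sed "${i}q;d" workspaces.txt)
echo "$the_wkspc"
```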
