Channel: Ask the FireCloud Team — GATK-Forum

Why does this family of tasks with array inputs repeatedly fail to cache?


Hi again FC team,

I've noticed an unwelcome behavior in a very specific type of task: it always fails to hit the call cache.

As best as I have been able to determine, the only feature that distinguishes this family of "always-cache-miss" tasks from others is the presence of an Array[String] or Array[File] input that:

  • (1) is being generated as an output by earlier task(s) in the same workflow; and

  • (2) is referenced within the command block of the WDL through write_lines(), ${sep=' ' my_input_array}, or a similar construction.

I've encountered this at several points in a recent workflow I've been developing, and this behavior (missing the cache) is remarkably consistent, even when rerun twice with all variables held constant (inputs, snapshot #, entity, etc).

Here is a toy/example task that would fit the case I'm highlighting above:

task make_list_of_gs_paths {
  Array[String] gs_paths_to_write_to_file

  command <<<
    cat ${write_lines(gs_paths_to_write_to_file)} > combined_gs_paths.txt
  >>>

  output {
    File combined_paths_list = "combined_gs_paths.txt"
  }

  runtime {
    ...etc...
  }
}

In my experience, this task always fails to hit the cache when the Array[String] input is generated by tasks upstream in the same workflow.

Based on testing I've done, you should also encounter this same behavior if you replace the write_lines() syntax in the command block with something like:

  command <<<
    echo -e "${sep='\n' gs_paths_to_write_to_file}" > combined_gs_paths.txt
  >>>

My naive guess is that FC copies the outputs of the upstream tasks into their new execution buckets when those tasks hit the cache. So even though the upstream tasks hit the cache (as they should), the actual gs:// paths in the Array[] input to the task in question change, which in turn causes the cache to miss.
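To make that guess concrete, here is a toy Python sketch. It is not FireCloud's or Cromwell's actual hashing code: the function toy_cache_key and the gs:// paths are made up for illustration, and it assumes the cache key depends on the literal string values of the inputs. Under that assumption, the same files copied to a new execution path produce a different key:

```python
import hashlib
from typing import List


def toy_cache_key(command_template: str, string_inputs: List[str]) -> str:
    """Hypothetical cache key: hash the command template plus the literal
    value of every string input. This is an assumption for illustration,
    not the real call-caching implementation."""
    h = hashlib.sha256()
    h.update(command_template.encode("utf-8"))
    for value in string_inputs:
        h.update(b"\0")  # separator so adjacent values cannot run together
        h.update(value.encode("utf-8"))
    return h.hexdigest()


COMMAND = "cat ${write_lines(gs_paths_to_write_to_file)} > combined_gs_paths.txt"

# Made-up paths: identical upstream outputs, but the second run's copies
# live under a different execution directory.
run1_paths = [
    "gs://example-bucket/run-1/call-upstream/a.txt",
    "gs://example-bucket/run-1/call-upstream/b.txt",
]
run2_paths = [
    "gs://example-bucket/run-2/call-upstream/a.txt",
    "gs://example-bucket/run-2/call-upstream/b.txt",
]

key1 = toy_cache_key(COMMAND, run1_paths)
key2 = toy_cache_key(COMMAND, run2_paths)
cache_hit = key1 == key2

print("cache hit:", cache_hit)  # prints "cache hit: False"
```

If the key were instead derived from the contents of the files rather than their paths, the two runs would match in this toy model, which is why the path-copying guess seems worth checking.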

Is this a plausible explanation? If so, is this the intended behavior? And irrespective of whether or not this is the intended behavior, is there any way to stop this from happening?

I think this caching behavior is undesirable because, when one of these always-cache-miss tasks sits fairly early in a workflow, all of the downstream tasks (which may be the compute-heavy ones, as they are in my case) will also fail to cache.

Any thoughts on this? I mentioned this briefly to Alex B at the end of last week, and I know DSP is out on their retreat today, but it would be great to get some insight into this once you get a chance.

As always, I'm happy to provide specific workspaces / workflows / other details if useful.

Thanks,
Ryan

