What I want to do
One of the standard evaluations of somatic variant callers is normal-normal calling, where you take a bunch of replicate normal (non-tumor) BAMs from the same individual, e.g. NA12878, and run your caller over every pair of samples, assigning one sample as the "tumor" and the other as the "normal."
If I were writing a WDL outside of Firecloud I could implement this with the cross product:
    workflow NormalNormal {
      Array[File] bams
      scatter (pair in cross(bams, bams)) {
        call Mutect2 { input: tumor = pair.left, normal = pair.right }
      }
    }
My understanding of why this is tricky
As I understand the data model, it's baked into Firecloud that you run a method over each sample, which means the method doesn't know about the other samples. That is, you can't perform the cross because your input is a File, e.g. sample.bam, and not an Array[File], e.g. sample_set.bams.
Hacky solution 1
I suppose one could set up a bunch of pairs, basically by implementing the cross manually and then uploading the resulting data model, and run the analysis over a pair set. Besides being really ugly, this is not very maintainable, because the part of the workflow that forms all pairs out of the list of samples lives outside of Firecloud.
What I mean is that even though the natural data model of the problem is
sample | bam |
---|---|
sample1 | sample1.bam |
sample2 | sample2.bam |
sample3 | sample3.bam |
I would run on pairs:
pair | tumor_sample | normal_sample |
---|---|---|
pair1 | sample1 | sample2 |
pair2 | sample2 | sample1 |
pair3 | sample1 | sample3 |
pair4 | sample3 | sample1 |
pair5 | sample2 | sample3 |
pair6 | sample3 | sample2 |
Hacky solution 2
I could also imagine the following hack. The data model would be a single dummy sample with the attribute "bam_paths", which is a file with the path to a different bam on each line. Then the method would take in this FoFN (file of file names), use read_lines (rather than read_tsv, which returns an Array[Array[String]]) to get the Array[File] of bams, and then scatter over the cross.
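For concreteness, here is a minimal sketch of what I have in mind for this hack, written in the same draft-2 style WDL as the workflow above. CallPair is a hypothetical placeholder standing in for Mutect2 (it only records which bam plays which role), and the input name bam_paths and the ubuntu image are assumptions on my part.

```wdl
# Hypothetical stand-in for the real Mutect2 task.
task CallPair {
  File tumor
  File normal

  command {
    echo "tumor=${tumor} normal=${normal}" > pair.txt
  }
  runtime {
    # Assumption: any image with a shell would do for this placeholder.
    docker: "ubuntu:18.04"
  }
  output {
    File summary = "pair.txt"
  }
}

workflow NormalNormalFromFofn {
  # Assumed attribute on the dummy sample: a text file with one bam path per line.
  File bam_paths

  # read_lines returns Array[String]; declaring Array[File] coerces each path to a File.
  Array[File] bams = read_lines(bam_paths)

  # cross() yields every ordered pair, including each bam paired with itself.
  scatter (pair in cross(bams, bams)) {
    call CallPair { input: tumor = pair.left, normal = pair.right }
  }

  output {
    Array[File] summaries = CallPair.summary
  }
}
```

Note that, unlike the pairs table above, the plain cross also produces the self-pairs (sample1 against sample1, and so on), so those would have to be skipped or ignored downstream.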
Similarly, there's another analysis I want to do that scatters Mutect2 over samples linearly (not sure what the right word is, but I mean trivially, without a cross or anything) and then does a cross over the outputs, namely the pairwise overlap between callsets. I'm also not sure how to do that.
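Outside of Firecloud, I imagine this second analysis would look roughly like the sketch below. Both tasks are hypothetical placeholders: CallSample stands in for a per-sample Mutect2 run, and Overlap stands in for a real callset comparison (here it just keeps the lines that appear in both files, assuming neither file repeats a line). The ubuntu image is also an assumption.

```wdl
# Hypothetical stand-in for running Mutect2 on a single sample.
task CallSample {
  File bam

  command {
    echo "calls for ${bam}" > calls.txt
  }
  runtime {
    docker: "ubuntu:18.04"
  }
  output {
    File callset = "calls.txt"
  }
}

# Hypothetical stand-in for comparing two callsets.
task Overlap {
  File callset_a
  File callset_b

  command {
    # Lines that occur in both inputs show up as duplicates after sorting them together.
    sort ${callset_a} ${callset_b} | uniq -d > shared.txt
  }
  runtime {
    docker: "ubuntu:18.04"
  }
  output {
    File shared = "shared.txt"
  }
}

workflow PairwiseOverlap {
  Array[File] bams

  # First stage: a plain scatter, one call per sample.
  scatter (bam in bams) {
    call CallSample { input: bam = bam }
  }

  # Outside the scatter, the call output is gathered into one Array[File].
  Array[File] callsets = CallSample.callset

  # Second stage: cross over the gathered outputs (self-pairs included).
  scatter (pair in cross(callsets, callsets)) {
    call Overlap { input: callset_a = pair.left, callset_b = pair.right }
  }

  output {
    Array[File] overlaps = Overlap.shared
  }
}
```

The part I don't see how to express in the Firecloud data model is the Array[File] bams input and the gather step between the two scatters.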
Is there a good way to do these things?