It's a common paradigm for tools to expect files to be organized in a certain way. Is there a simple way to preserve the initial organization of files? For example, the files are loaded into a directory in the workspace bucket and scattered for some pre-processing, and I then want to gather them back into the same initial organization so they can be processed together.
I have some example code for a method I've come up with to do this. In this case I'm simply unzipping a bunch of individual files, returning them to their original directory structure, then zipping them together en masse so they are passed, pre-organized, to whatever comes next.
task gunzip {
    File archive
    String file = sub(archive, "\\.gz$", "")
    command {
        set -euo pipefail
        # According to example 2 of
        # https://github.com/openwdl/wdl/blob/develop/SPEC.md#string-substring-string-string
        # the mkdir shouldn't be necessary, but this fails without it.
        mkdir -p $(dirname ${file})
        zcat -f ${archive} > ${file}
    }
    output {
        File files = file
    }
    runtime {
        docker : "broadgdac/firecloud-ubuntu:16.04"
    }
    meta {
        author : "David Heiman"
        email : "dheiman@broadinstitute.org"
    }
}
task zip {
    Array[File] files
    command <<<
        set -euo pipefail
        strtdir=`dirname ${select_first(files)}`
        # The original base directory is the first directory after .*/shard-[0-9]+/(execution/)?
        basedir=$(basename `echo $strtdir | sed 's|^.*/shard-[0-9]\{1,\}/\(execution/\)\{0,1\}\([^/]\{1,\}\)/.*$|\2|'`)
        mkdir $basedir
        # Start the search for base directories one directory above shard-[0-9]+/
        # Copy/symlink the contents recursively to the base directory in the working directory
        rootdir=`echo $strtdir | sed 's|^\(.*\)/shard-[0-9].*$|\1|'`
        find $rootdir -name $basedir -type d -exec bash -c 'cp -r -s "$1"/* "$2"' Cp {} $basedir \;
        # Create an archive preserving the recreated file paths
        zip -r files_archive.zip $basedir
    >>>
    output {
        File files_archive="files_archive.zip"
    }
    runtime {
        docker : "broadgdac/firecloud-ubuntu:16.04"
    }
    meta {
        author : "David Heiman"
        email : "dheiman@broadinstitute.org"
    }
}
workflow merge_archives {
    Array[File] archives
    scatter (archive in archives) {
        call gunzip {input: archive=archive}
    }
    call zip {input: files=gunzip.files}
    output {zip.files_archive}
}
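To make the path handling in the zip task concrete, here is a minimal sketch of its two sed expressions run on a made-up path. The bucket, workflow ID, and directory names are hypothetical, and real localized paths will differ depending on the backend, so treat this as an illustration rather than what any particular run produces.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical directory of the first localized file (not from a real run).
strtdir="/cromwell_root/my-bucket/merge_archives/1234/call-gunzip/shard-0/data/sub"

# Base directory: the first path component after shard-N/ (and an optional execution/).
basedir=$(basename "$(echo "$strtdir" | sed 's|^.*/shard-[0-9]\{1,\}/\(execution/\)\{0,1\}\([^/]\{1,\}\)/.*$|\2|')")

# Root directory: everything above the shard-N directory.
rootdir=$(echo "$strtdir" | sed 's|^\(.*\)/shard-[0-9].*$|\1|')

echo "basedir=$basedir"
echo "rootdir=$rootdir"
```

For this example path, basedir comes out as data and rootdir as the call-gunzip directory, which is where the find in the zip task starts looking for every shard's copy of data.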
I feel like there must/should be a simpler way to do this.
Thanks!