Channel: Ask the FireCloud Team — GATK-Forum

Is there a simple way to regenerate the directory structure of scattered files?

Tools commonly expect their input files to be organized in a particular directory layout. Is there a simple way to preserve the initial organization of files across a scatter? For example, the files are loaded into a directory in the workspace bucket and scattered for some pre-processing, and I then want to gather them back into the same directory structure so they can be processed together.

I have some example code for a method I've come up with to do this. In this case I simply unzip a set of individual files, return them to their original directory structure, and then zip them together en masse so they can be passed, pre-organized, to whatever comes next.

task gunzip {
    File archive
    String file = sub(archive, "\\.gz$", "")

    command {
        set -euo pipefail
        # According to example 2 of
        # https://github.com/openwdl/wdl/blob/develop/SPEC.md#string-substring-string-string
        # the mkdir shouldn't be necessary, but this fails without it.
        mkdir -p $(dirname ${file})

        zcat -f ${archive} > ${file}
    }

    output {
        File files = file
    }

    runtime {
        docker : "broadgdac/firecloud-ubuntu:16.04"
    }

    meta {
        author : "David Heiman"
        email : "dheiman@broadinstitute.org"
    }
}

task zip {
    Array[File] files

    command <<<
        set -euo pipefail

        strtdir=`dirname ${select_first(files)}`

        # The original base directory is the first directory after .*/shard-[0-9]+/(execution/)?
        basedir=$(basename `echo $strtdir | sed 's|^.*/shard-[0-9]\{1,\}/\(execution/\)\{0,1\}\([^/]\{1,\}\)/.*$|\2|'`)
        mkdir $basedir

        # Start the search for base directories one directory above shard-[0-9]+/
        # Copy/symlink the contents recursively to the base directory in the working directory
        rootdir=`echo $strtdir | sed 's|^\(.*\)/shard-[0-9].*$|\1|'`
        find $rootdir -name $basedir -type d -exec bash -c 'cp -r -s "$1"/* "$2"' Cp {} $basedir \;

        # Create an archive preserving the recreated file paths
        zip -r files_archive.zip $basedir
    >>>

    output {
        File files_archive="files_archive.zip"
    }

    runtime {
        docker : "broadgdac/firecloud-ubuntu:16.04"
    }

    meta {
        author : "David Heiman"
        email : "dheiman@broadinstitute.org"
    }
}

workflow merge_archives {
    Array[File] archives

    scatter (archive in archives) {
        call gunzip {input: archive=archive}
    }

    call zip {input: files=gunzip.files}

    output {zip.files_archive}
}
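
To make the two sed expressions in the zip task easier to follow, here is a minimal sketch that runs them outside of WDL on a single path. The path itself is hypothetical (the bucket name, workflow ID, and directory names are assumptions, and the real layout depends on the backend); only the sed expressions are taken from the task above.

```shell
# Hypothetical directory of one gunzip output as seen by the zip task.
# This layout is an assumption for illustration, not a guaranteed Cromwell path.
strtdir="/cromwell_root/bucket/wf-id/call-gunzip/shard-0/mydata/sub"

# Same expression as in the zip task: take the first directory below shard-N/.
basedir=$(basename "$(echo "$strtdir" | sed 's|^.*/shard-[0-9]\{1,\}/\(execution/\)\{0,1\}\([^/]\{1,\}\)/.*$|\2|')")

# Same expression as in the zip task: everything above shard-N/.
rootdir=$(echo "$strtdir" | sed 's|^\(.*\)/shard-[0-9].*$|\1|')

echo "basedir=$basedir"
echo "rootdir=$rootdir"
```

For this sample path, basedir comes out as mydata and rootdir as /cromwell_root/bucket/wf-id/call-gunzip, which is the directory the find command then searches for every shard's copy of mydata.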

I feel like there must/should be a simpler way to do this.

Thanks!

