Channel: Ask the FireCloud Team — GATK-Forum

gcsfuse on Firecloud


In many cases, an algorithm wants to access just a small portion of a really large file. Scatter jobs operating on a BAM are a major example, but this also occurs when dealing with reference files. The naive solutions to this have involved extra (sometimes punishing) compute time and cost.

NIO has been proposed as a solution to this. Apart from the logistics of manually getting the appropriate credentials to the Firecloud VM, it also requires that the algorithm interface with NIO. While GATK has been rewritten to handle this natively, many important algorithms do not live within GATK.

gcsfuse appears to be a ready-made solution. It can serve as an adaptor between the NIO interface on one side and a file system interface on the other. If data is accessed read-only, it is streamed right from the bucket on demand at basically the same speed as gsutil, leaving no footprint on the filesystem and allowing the algorithm to seek to and use whatever portion of the file it needs. While gcsfuse is no longer officially supported, we have been told it is unofficially supported because our application is important, and indeed it has been receiving bug fixes and has only a tiny number of open issues.
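To illustrate what this would look like from the algorithm's side, here is a minimal Python sketch. It assumes the bucket has already been mounted with gcsfuse somewhere inside the container; the function name read_range and the path /mnt/bucket/sample.bam are hypothetical and only for illustration. The point is that the code uses nothing but ordinary file calls, so an algorithm that knows nothing about NIO or buckets should work unchanged.

```python
def read_range(path, offset, length):
    """Return up to `length` bytes of `path`, starting at byte `offset`.

    Only standard file operations are used. On a FUSE-mounted bucket the
    seek/read pair is expected to fetch just the requested region rather
    than the whole object (an assumption about the mount, not verified here).
    """
    with open(path, "rb") as handle:
        handle.seek(offset)          # jump to the region of interest
        return handle.read(length)   # read only the bytes we need


if __name__ == "__main__":
    # Hypothetical mount point and file name; adjust to the real mount.
    chunk = read_range("/mnt/bucket/sample.bam", 1_000_000, 4096)
    print(len(chunk), "bytes read")
```

The same function works on any local file, which makes it easy to test the algorithm outside Firecloud before pointing it at a mounted bucket.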

In March, I believe Dimitri tried running gcsfuse in a Firecloud docker, but was blocked because Firecloud omitted a certain flag when starting up the container, which prohibited adding mount points. If I understand right, one of the issues was not wanting to convey credentials with overly broad permissions to the running docker, but that seems to be an issue that can be solved independently.

What can be done to move forward with a general adaptor between NIO and the Docker containers' internal file system?

Gordon

