Channel: Ask the FireCloud Team — GATK-Forum

Bulky stderr causes node to hang


I was trying to run SamToFastq, using a Docker image that I have run successfully in the past, via a standalone instance of Cromwell with a JES backend. I believe it is Cromwell version 31, though the Swagger API call reports version 30.

When fed a new (somehow malformed) BAM, it emitted a 100-character error message to stderr for every read, causing the stderr file to swell to multiple GB. Since I had left the boot disk size at the default 10 GB, this caused the node to hang: log files were not copied out, the node did not respond to login attempts, and Cromwell appeared unable to kill it when I aborted the submission. Bumping the boot disk size up to 1 TB avoids this hang, accommodating a 49 GB stderr file. It also reveals boot disk usage growing at 4x the rate (!) of the stderr file (observed by ssh'ing into the running node). Since nothing is writing to /tmp or anywhere else outside the output directory that would cause the bloat, there must be multiple copies of the stderr file being kept.
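To put rough numbers on this, here is a back-of-the-envelope sketch in Python. It assumes decimal gigabytes, about 100 bytes per message (one byte per character, ignoring any newline), and that the observed 4x growth factor holds throughout the run. The function names are made up for illustration and are not part of Cromwell or any other tool.

```python
GB = 10**9  # assumption: decimal gigabytes


def messages_in(stderr_bytes, bytes_per_message=100):
    """Approximate number of error messages in a stderr file of this size."""
    return stderr_bytes // bytes_per_message


def boot_disk_usage(stderr_bytes, amplification=4):
    """Boot disk space consumed if usage grows at `amplification` times stderr."""
    return stderr_bytes * amplification


def stderr_at_exhaustion(boot_disk_bytes, amplification=4):
    """Upper bound on stderr size when the boot disk fills up.

    Ignores space already taken by the OS and the Docker image,
    so the real limit is lower.
    """
    return boot_disk_bytes / amplification


stderr_bytes = 49 * GB
print(messages_in(stderr_bytes))            # 490000000
print(boot_disk_usage(stderr_bytes) / GB)   # 196.0
print(stderr_at_exhaustion(10 * GB) / GB)   # 2.5
```

Under those assumptions, the 49 GB stderr file corresponds to roughly 490 million messages and about 196 GB of boot disk usage, and a default 10 GB boot disk would fill by the time stderr reaches at most about 2.5 GB.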

If stdout and stderr files were stored on the data disk rather than the boot disk, excessive output would crash only the algorithm and allow the error messages to propagate off of the VM, rather than crashing the whole VM. We shouldn't have to size the boot disk to accommodate such worst-case scenarios.
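Until something like that changes, one possible stopgap is to cap how much stderr the tool is allowed to write from inside the task's own command. The following is only a sketch of that idea in Python, not anything Cromwell provides: `run_with_capped_stderr` and its parameters are hypothetical names, and whether a wrapper like this fits a given Docker image and task command is an assumption to check.

```python
import subprocess


def run_with_capped_stderr(cmd, stderr_path, max_bytes=10 * 1024 * 1024,
                           chunk_size=64 * 1024):
    """Run cmd, keeping only the first max_bytes of its stderr in stderr_path.

    The pipe is still drained after the cap is reached, so the child
    process never blocks on a full pipe; the excess is simply discarded.
    Returns (exit_code, total_stderr_bytes_seen).
    """
    total = 0
    written = 0
    proc = subprocess.Popen(cmd, stderr=subprocess.PIPE)
    with open(stderr_path, "wb") as out:
        while True:
            chunk = proc.stderr.read(chunk_size)
            if not chunk:  # EOF: the child closed its stderr
                break
            total += len(chunk)
            if written < max_bytes:
                keep = chunk[: max_bytes - written]
                out.write(keep)
                written += len(keep)
    proc.stderr.close()
    return proc.wait(), total
```

The return value reports how many bytes the tool tried to write, so a caller can still tell that output was truncated and fail the task with a short message instead of a multi-GB log.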

