I ran a submission that launched parallel workflows on 462 samples. Each workflow simply downloads a single file from the GDC. These files are relatively small (150 MB to 400 MB), and each workflow should take on the order of 5 minutes once it enters the running state.
Over half of the workflows got stuck in the "submitted" state. While each of these stuck workflows was assigned a workflow id, the workflow's single task was never called, and no corresponding folder was created on the workspace bucket.
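To put a number on this, the per-workflow statuses of a submission can be tallied along the lines below. This is only a sketch: the `workflowId` and `status` field names and the sample records are assumptions made for illustration, not something taken from FireCloud's actual response format.

```python
from collections import Counter


def summarize_statuses(workflows):
    """Count workflows per status and collect the ids still in 'Submitted'.

    `workflows` is a list of dicts with hypothetical 'workflowId' and
    'status' keys (assumed field names, for illustration only).
    """
    counts = Counter(wf["status"] for wf in workflows)
    stuck = [wf["workflowId"] for wf in workflows if wf["status"] == "Submitted"]
    return counts, stuck


# Made-up records standing in for a real submission's workflow list.
workflows = [
    {"workflowId": "wf-1", "status": "Succeeded"},
    {"workflowId": "wf-2", "status": "Submitted"},
    {"workflowId": "wf-3", "status": "Submitted"},
]

counts, stuck = summarize_statuses(workflows)
print(dict(counts))
print(stuck)
```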
FireCloud reports the following queue status: "Workflows: 0 Queued; 1958 Active; 0 ahead of yours".
If I run the workflow again on one of the samples that failed in the previous "bulk" launch, it runs fine.
As a workaround, I can alter my workflow to use call caching and then launch the submission (against the sample set) repeatedly until all files have been downloaded, but I don't view this as satisfactory from a user's perspective. Large numbers of workflows should not get stuck in the submitted state with no feedback from the system about why they are stuck.
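For concreteness, the relaunch-until-done workaround amounts to something like the loop below. It is a minimal sketch under stated assumptions: `submit` and `get_status` are hypothetical callables standing in for whatever launches a submission and reads back a workflow's final status, and the "Succeeded" and "Submitted" strings are assumed status names. A real version would also need to wait for each round to finish before checking statuses.

```python
def relaunch_until_done(samples, submit, get_status, max_rounds=5):
    """Resubmit samples that have not succeeded, for at most max_rounds rounds.

    submit(batch)      -- hypothetical callable that launches workflows for a batch
    get_status(sample) -- hypothetical callable returning a status string
    Returns the samples that still have not succeeded.
    """
    pending = list(samples)
    for _ in range(max_rounds):
        if not pending:
            break
        submit(pending)
        # Keep only the samples whose workflow did not succeed in this round.
        pending = [s for s in pending if get_status(s) != "Succeeded"]
    return pending


# Fake backend for illustration: odd-numbered samples need two attempts.
attempts = {}


def fake_submit(batch):
    for s in batch:
        attempts[s] = attempts.get(s, 0) + 1


def fake_status(s):
    needed = 1 if s % 2 == 0 else 2
    return "Succeeded" if attempts[s] >= needed else "Submitted"


remaining = relaunch_until_done(range(6), fake_submit, fake_status)
print(remaining, attempts)
```

The cap on rounds is there so the loop cannot spin forever if some workflows never leave the submitted state, which is exactly the failure described here.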
As part of our GDC integration strategy, we plan to rely heavily on workflow-based retrieval of files from the GDC. The reliable launching of large numbers of simple workflows in parallel is crucial to this strategy.
I've attached some screenshots showing about 50% of my workflows stuck in the submitted state.