Hi FC team,
Our team has been unable to get any workflows to successfully complete since mid-afternoon yesterday, Tues, Oct 2. We have experienced these issues across multiple workspaces, workflows, and configs.
There is no single consistent error, but here is a sampling of the issues we have encountered:
- Workflow died because of a "temporary server error" (example: https://portal.firecloud.org/#workspaces/talkowski-sv-gnomad-wgs-v2/SV_Talkowski_GNOMAD_WGS-V2/monitor/43065914-ef9b-4634-81b2-8675b1176ca5/b02e2598-e568-447c-8eac-43ba6bf40185)
- Tasks in workflows with call caching disabled suddenly are spending 1-2 hours in a CheckingCacheEntryExistence state (example: task CleanVCF.Clean4.combine_multi_IDs in Call #2 here: https://portal.firecloud.org/#workspaces/talkowski-sv-gnomad-wgs-v2/SV_Talkowski_GNOMAD_WGS-V2/monitor/4678e997-47d2-4fbe-8c4f-4c65120e341b/d9c779da-d6eb-4a31-b5a5-0321b6b46823)
- Workflows with call caching enabled not launching for over 12 hours (example: https://portal.firecloud.org/#workspaces/talkowski-sv-gnomad-wgs-v2/SV_Talkowski_GNOMAD_WGS-V2/monitor/555bda8d-6cdd-45ec-b91d-7b89d7ac83b6/1357a70f-afad-4a77-8dcb-9f6a7804ac29)
- Tasks failing because they aren't able to find outputs from previous tasks, despite these outputs existing in the gs:// bucket and looking correct when downloaded & investigated locally (example: task CleanVCF.cleanvcf5 in Call #2 here: https://portal.firecloud.org/#workspaces/talkowski-sv-gnomad-wgs-v2/SV_Talkowski_GNOMAD_WGS-V2/monitor/4678e997-47d2-4fbe-8c4f-4c65120e341b/d9c779da-d6eb-4a31-b5a5-0321b6b46823)
I suspect these issues could be related to the following two posts from yesterday by @jgould and @Chip:
https://gatkforums.broadinstitute.org/firecloud/discussion/13147/long-wait-times#latest
https://gatkforums.broadinstitute.org/firecloud/discussion/13148/2-hours-per-task-to-check-call-cache#latest
At this point, we are completely stalled on all workspaces, and don't want to launch any new workflows due to these unpredictable errors and long queue times.
Any idea what could be going on, or how long we should expect this behavior to persist?
Thanks a lot,
Ryan & the Talkowski lab