A submission made on March 10, 2017, at 5:14 PM, with 19 entities in its pairset, had 3 of them fail within minutes. The failure was easier to mitigate because it 1) occurred quickly and 2) was easily visible, but this is an area where the system could be made more robust. There was no useful error message, and manually relaunching the 3 failed jobs worked fine.
submission id: 475120fa-2411-4c24-9d7c-de11c956174b
One of the three failing workflows:
workflow id: d5b4ddd2-eb3c-4d3a-afa0-01ddd5114b9e
operation id: operations/EI7kqNKrKxjj6YnzjubghakBIP3g3tG1AioPcHJvZHVjdGlvblF1ZXVl
Lines from the workflow log showing the error and the line before it:
2017-03-10 22:15:13,590 INFO - JesAsyncBackendJobExecutionActor [UUID(d5b4ddd2)pcawg_full_workflow.pcawg_full:NA:1]: JesAsyncBackendJobExecutionActor [UUID(d5b4ddd2):pcawg_full_workflow.pcawg_full:NA:1] Status change from - to Running
2017-03-10 22:15:59,151 INFO - JesAsyncBackendJobExecutionActor [UUID(d5b4ddd2)pcawg_full_workflow.pcawg_full:NA:1]: JesAsyncBackendJobExecutionActor [UUID(d5b4ddd2):pcawg_full_workflow.pcawg_full:NA:1] Status change from Running to Failed
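Since the log gives no reason for the failure, one way to triage similar cases is to scan workflow logs for "Status change" lines and measure how long each job ran before it failed. The following is only a rough sketch: the regular expression is inferred from the two log lines above and may not match other log formats, and the function names (`parse_status_changes`, `seconds_running_before_failure`) are hypothetical helpers, not part of any existing tool.

```python
import re
from datetime import datetime

# The two log lines quoted above, used as sample input.
LOG_LINES = [
    "2017-03-10 22:15:13,590 INFO - JesAsyncBackendJobExecutionActor [UUID(d5b4ddd2)pcawg_full_workflow.pcawg_full:NA:1]: JesAsyncBackendJobExecutionActor [UUID(d5b4ddd2):pcawg_full_workflow.pcawg_full:NA:1] Status change from - to Running",
    "2017-03-10 22:15:59,151 INFO - JesAsyncBackendJobExecutionActor [UUID(d5b4ddd2)pcawg_full_workflow.pcawg_full:NA:1]: JesAsyncBackendJobExecutionActor [UUID(d5b4ddd2):pcawg_full_workflow.pcawg_full:NA:1] Status change from Running to Failed",
]

# Assumed line shape: timestamp, short workflow UUID, then the status transition.
STATUS_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*?"
    r"UUID\((?P<uuid>[0-9a-f]+)\).*"
    r"Status change from (?P<old>\S+) to (?P<new>\S+)\s*$"
)


def parse_status_changes(lines):
    """Return (timestamp, short_uuid, old_status, new_status) for each matching line."""
    events = []
    for line in lines:
        m = STATUS_RE.match(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            events.append((ts, m.group("uuid"), m.group("old"), m.group("new")))
    return events


def seconds_running_before_failure(events):
    """Map each short UUID to the seconds between entering Running and entering Failed."""
    started = {}
    durations = {}
    for ts, uuid, _old, new in events:
        if new == "Running":
            started[uuid] = ts
        elif new == "Failed" and uuid in started:
            durations[uuid] = (ts - started[uuid]).total_seconds()
    return durations


if __name__ == "__main__":
    for uuid, secs in seconds_running_before_failure(parse_status_changes(LOG_LINES)).items():
        print(f"{uuid}: failed after {secs:.3f} s in Running")
```

For the workflow above, this reports that the job spent roughly 45.6 seconds in the Running state before being marked Failed, which is consistent with the failures showing up within minutes of submission.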