Hi Team,
We are facing another issue with large-file (68 GB) processing: the workflow is aborted abruptly, without any exception, on Cromwell with the AWS Batch backend. The behaviour is intermittent; the same run sometimes succeeds and sometimes fails on another attempt.
We have tried increasing the memory assigned to the Docker container and the JVM -Xms values (roughly following the pattern sketched below), but with no luck. We have also tried increasing the timeout settings in the Cromwell config.
Most of the time it happens in the SamSplitter task (which splits the large file) and the SamToFastqAndBwaMemAndMba task. The timeout settings we maintain in Cromwell.config are copied below.
Please help us resolve this issue.
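For reference, this is roughly how we size the container memory and the Java heap in these tasks. It is only a minimal sketch of the pattern, not our actual task definition: the docker image, memory figures, and Picard arguments are placeholders.

version 1.0

task SamSplitter {
  input {
    File input_bam
    Int n_files
  }
  command {
    set -e
    mkdir output_dir
    # The JVM heap (-Xms/-Xmx) is kept below the container memory requested
    # in the runtime block so the JVM is not killed by the container limit.
    java -Xms6g -Xmx6g -jar /usr/picard/picard.jar SplitSamByNumberOfReads \
      INPUT=~{input_bam} \
      OUTPUT=output_dir \
      SPLIT_TO_N_FILES=~{n_files}
  }
  runtime {
    docker: "broadinstitute/picard"   # placeholder image
    memory: "8 GiB"                   # container memory requested from AWS Batch
    cpu: 2
  }
  output {
    Array[File] split_bams = glob("output_dir/*.bam")
  }
}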
Timeout settings in the Cromwell config:
akka {
  http {
    server {
      request-timeout = 1800s
      idle-timeout = 2400s
    }
    client {
      request-timeout = 1800s
      connecting-timeout = 300s
    }
  }
}
Cromwell log showing the workflow being stopped and aborted, without any exception, while a task was still running:
[2019-05-03 14:34:00,51] [info] AwsBatchAsyncBackendJobExecutionActor [c68f0e1bSLRG.SamSplitter:NA:1]: Status change from Initializing to Running
[2019-05-03 15:27:11,95] [info] AwsBatchAsyncBackendJobExecutionActor [461a6066UBTAB.CQYM:0:1]: Status change from Running to Succeeded
[2019-05-03 15:43:25,43] [info] Workflow polling stopped
[2019-05-03 15:43:25,45] [info] Shutting down WorkflowStoreActor - Timeout = 5 seconds
[2019-05-03 15:43:25,45] [info] Shutting down WorkflowLogCopyRouter - Timeout = 5 seconds
[2019-05-03 15:43:25,45] [info] 0 workflows released by cromid-abdb07d
[2019-05-03 15:43:25,46] [info] Aborting all running workflows.
[2019-05-03 15:43:25,46] [info] Shutting down JobExecutionTokenDispenser - Timeout = 5 seconds
[2019-05-03 15:43:25,47] [info] JobExecutionTokenDispenser stopped
[2019-05-03 15:43:25,47] [info] WorkflowStoreActor stopped
[2019-05-03 15:43:25,47] [info] Shutting down WorkflowManagerActor - Timeout = 3600 seconds
[2019-05-03 15:43:25,47] [info] WorkflowManagerActor Aborting all workflows
[2019-05-03 15:43:25,47] [info] WorkflowExecutionActor-155c13d0-09e9-4ad7-b4c4-9cd2b1099c14 [155c13d0]: Aborting workflow
[2019-05-03 15:43:25,47] [info] WorkflowLogCopyRouter stopped
[2019-05-03 15:43:25,47] [info] 461a6066-0463-485c-9de3-763d0658f236-SubWorkflowActor-SubWorkflow-UBTAB:-1:1 [461a6066]: Aborting workflow
[2019-05-03 15:43:25,47] [info] c68f0e1b-9998-4cca-9749-40586e3d097f-SubWorkflowActor-SubWorkflow-SplitRG:0:1 [c68f0e1b]: Aborting workflow
[2019-05-03 15:43:25,55] [info] Attempted CancelJob operation in AWS Batch for Job ID fd00c634-7a5f-453c-afad-9dbdead31a91. There were no errors during the operation
[2019-05-03 15:43:25,55] [info] We have normality. Anything you still can't cope with is therefore your own problem
[2019-05-03 15:43:25,55] [info] https://www.youtube.com/watch?v=YCRxnjE7JVs
[2019-05-03 15:43:25,55] [info] AwsBatchAsyncBackendJobExecutionActor [c68f0e1bSLRG.SamSplitter:NA:1]: AwsBatchAsyncBackendJobExecutionActor [c68f0e1b:SLRG.SamSplitter:NA:1] Aborted StandardAsyncJob(fd00c634-7a5f-453c-afad-9dbdead31a91)
[2019-05-03 15:56:31,10] [info] AwsBatchAsyncBackendJobExecutionActor [c68f0e1bSLRG.SamSplitter:NA:1]: Status change from Running to Succeeded
[2019-05-03 15:56:31,81] [info] 461a6066-0463-485c-9de3-763d0658f236-SubWorkflowActor-SubWorkflow-UBTAB:-1:1 [461a6066]: WorkflowExecutionActor [461a6066] aborted: SubWorkflow-SplitRG:0:1
[2019-05-03 15:56:32,09] [info] WorkflowExecutionActor-155c13d0-09e9-4ad7-b4c4-9cd2b1099c14 [155c13d0]: WorkflowExecutionActor [155c13d0] aborted: SubWorkflow-UBTAB:-1:1
[2019-05-03 15:56:32,84] [info] WorkflowManagerActor All workflows are aborted
[2019-05-03 15:56:32,84] [info] WorkflowManagerActor All workflows finished
[2019-05-03 15:56:32,84] [info] WorkflowManagerActor stopped
[2019-05-03 15:56:33,07] [info] Connection pools shut down
[2019-05-03 15:56:33,07] [info] Shutting down SubWorkflowStoreActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] Shutting down JobStoreActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] SubWorkflowStoreActor stopped
[2019-05-03 15:56:33,07] [info] Shutting down CallCacheWriteActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] Shutting down ServiceRegistryActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] Shutting down DockerHashActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] CallCacheWriteActor Shutting down: 0 queued messages to process
[2019-05-03 15:56:33,07] [info] JobStoreActor stopped
[2019-05-03 15:56:33,07] [info] CallCacheWriteActor stopped
[2019-05-03 15:56:33,07] [info] Shutting down IoProxy - Timeout = 1800 seconds
[2019-05-03 15:56:33,08] [info] DockerHashActor stopped
[2019-05-03 15:56:33,08] [info] IoProxy stopped
[2019-05-03 15:56:33,08] [info] Shutting down connection pool: curAllocated=1 idleQueues.size=1 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2019-05-03 15:56:33,08] [info] Shutting down connection pool: curAllocated=0 idleQueues.size=0 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2019-05-03 15:56:33,08] [info] WriteMetadataActor Shutting down: 72 queued messages to process
[2019-05-03 15:56:33,08] [info] KvWriteActor Shutting down: 0 queued messages to process
[2019-05-03 15:56:33,09] [info] Shutting down connection pool: curAllocated=0 idleQueues.size=0 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2019-05-03 15:56:33,09] [info] WriteMetadataActor Shutting down: processing 0 queued messages
[2019-05-03 15:56:33,09] [info] ServiceRegistryActor stopped
[2019-05-03 15:56:33,11] [info] Database closed
[2019-05-03 15:56:33,11] [info] Stream materializer shut down
[2019-05-03 15:56:33,12] [info] WDL HTTP import resolver closed