Hi Team,
We are facing another issue with large-file (68 GB) processing: the workflow is aborted abruptly, without any exception, on Cromwell with the AWS Batch backend. The behaviour is intermittent; the same run sometimes succeeds and sometimes fails on another attempt.
We have tried increasing the memory assigned to the Docker container and the JVM -Xms values (roughly following the pattern sketched below), but with no luck. We have also tried increasing the timeout settings in the Cromwell config.
Most of the time it happens in the SamSplitter task (which splits the large file) and the SamToFastqAndBwaMemAndMba task. The timeout settings we maintain in Cromwell.config are copied below.
Please help us resolve this issue.
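For reference, this is roughly how we size the container memory and the Java heap in these tasks. It is only a minimal sketch of the pattern, not our actual task definition: the docker image, memory figures, and Picard arguments are placeholders.

version 1.0

task SamSplitter {
  input {
    File input_bam
    Int n_files
  }
  command {
    set -e
    mkdir output_dir
    # The JVM heap (-Xms/-Xmx) is kept below the container memory requested
    # in the runtime block so the JVM is not killed by the container limit.
    java -Xms6g -Xmx6g -jar /usr/picard/picard.jar SplitSamByNumberOfReads \
      INPUT=~{input_bam} \
      OUTPUT=output_dir \
      SPLIT_TO_N_FILES=~{n_files}
  }
  runtime {
    docker: "broadinstitute/picard"   # placeholder image
    memory: "8 GiB"                   # container memory requested from AWS Batch
    cpu: 2
  }
  output {
    Array[File] split_bams = glob("output_dir/*.bam")
  }
}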
Timeout settings in the Cromwell config:
akka {
  http {
    server {
      request-timeout = 1800s
      idle-timeout = 2400s
    }
    client {
      request-timeout = 1800s
      connecting-timeout = 300s
    }
  }
}
Cromwell log showing the workflow being stopped and aborted, without any exception, while a task was still running:
[2019-05-03 14:34:00,51] [info] AwsBatchAsyncBackendJobExecutionActor [c68f0e1bSLRG.SamSplitter:NA:1]: Status change from Initializing to Running
[2019-05-03 15:27:11,95] [info] AwsBatchAsyncBackendJobExecutionActor [461a6066UBTAB.CQYM:0:1]: Status change from Running to Succeeded
[2019-05-03 15:43:25,43] [info] Workflow polling stopped
[2019-05-03 15:43:25,45] [info] Shutting down WorkflowStoreActor - Timeout = 5 seconds
[2019-05-03 15:43:25,45] [info] Shutting down WorkflowLogCopyRouter - Timeout = 5 seconds
[2019-05-03 15:43:25,45] [info] 0 workflows released by cromid-abdb07d
[2019-05-03 15:43:25,46] [info] Aborting all running workflows.
[2019-05-03 15:43:25,46] [info] Shutting down JobExecutionTokenDispenser - Timeout = 5 seconds
[2019-05-03 15:43:25,47] [info] JobExecutionTokenDispenser stopped
[2019-05-03 15:43:25,47] [info] WorkflowStoreActor stopped
[2019-05-03 15:43:25,47] [info] Shutting down WorkflowManagerActor - Timeout = 3600 seconds
[2019-05-03 15:43:25,47] [info] WorkflowManagerActor Aborting all workflows
[2019-05-03 15:43:25,47] [info] WorkflowExecutionActor-155c13d0-09e9-4ad7-b4c4-9cd2b1099c14 [155c13d0]: Aborting workflow
[2019-05-03 15:43:25,47] [info] WorkflowLogCopyRouter stopped
[2019-05-03 15:43:25,47] [info] 461a6066-0463-485c-9de3-763d0658f236-SubWorkflowActor-SubWorkflow-UBTAB:-1:1 [461a6066]: Aborting workflow
[2019-05-03 15:43:25,47] [info] c68f0e1b-9998-4cca-9749-40586e3d097f-SubWorkflowActor-SubWorkflow-SplitRG:0:1 [c68f0e1b]: Aborting workflow
[2019-05-03 15:43:25,55] [info] Attempted CancelJob operation in AWS Batch for Job ID fd00c634-7a5f-453c-afad-9dbdead31a91. There were no errors during the operation
[2019-05-03 15:43:25,55] [info] We have normality. Anything you still can't cope with is therefore your own problem
[2019-05-03 15:43:25,55] [info] https://www.youtube.com/watch?v=YCRxnjE7JVs
[2019-05-03 15:43:25,55] [info] AwsBatchAsyncBackendJobExecutionActor [c68f0e1bSLRG.SamSplitter:NA:1]: AwsBatchAsyncBackendJobExecutionActor [c68f0e1b:SLRG.SamSplitter:NA:1] Aborted StandardAsyncJob(fd00c634-7a5f-453c-afad-9dbdead31a91)
[2019-05-03 15:56:31,10] [info] AwsBatchAsyncBackendJobExecutionActor [c68f0e1bSLRG.SamSplitter:NA:1]: Status change from Running to Succeeded
[2019-05-03 15:56:31,81] [info] 461a6066-0463-485c-9de3-763d0658f236-SubWorkflowActor-SubWorkflow-UBTAB:-1:1 [461a6066]: WorkflowExecutionActor [461a6066] aborted: SubWorkflow-SplitRG:0:1
[2019-05-03 15:56:32,09] [info] WorkflowExecutionActor-155c13d0-09e9-4ad7-b4c4-9cd2b1099c14 [155c13d0]: WorkflowExecutionActor [155c13d0] aborted: SubWorkflow-UBTAB:-1:1
[2019-05-03 15:56:32,84] [info] WorkflowManagerActor All workflows are aborted
[2019-05-03 15:56:32,84] [info] WorkflowManagerActor All workflows finished
[2019-05-03 15:56:32,84] [info] WorkflowManagerActor stopped
[2019-05-03 15:56:33,07] [info] Connection pools shut down
[2019-05-03 15:56:33,07] [info] Shutting down SubWorkflowStoreActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] Shutting down JobStoreActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] SubWorkflowStoreActor stopped
[2019-05-03 15:56:33,07] [info] Shutting down CallCacheWriteActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] Shutting down ServiceRegistryActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] Shutting down DockerHashActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] CallCacheWriteActor Shutting down: 0 queued messages to process
[2019-05-03 15:56:33,07] [info] JobStoreActor stopped
[2019-05-03 15:56:33,07] [info] CallCacheWriteActor stopped
[2019-05-03 15:56:33,07] [info] Shutting down IoProxy - Timeout = 1800 seconds
[2019-05-03 15:56:33,08] [info] DockerHashActor stopped
[2019-05-03 15:56:33,08] [info] IoProxy stopped
[2019-05-03 15:56:33,08] [info] Shutting down connection pool: curAllocated=1 idleQueues.size=1 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2019-05-03 15:56:33,08] [info] Shutting down connection pool: curAllocated=0 idleQueues.size=0 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2019-05-03 15:56:33,08] [info] WriteMetadataActor Shutting down: 72 queued messages to process
[2019-05-03 15:56:33,08] [info] KvWriteActor Shutting down: 0 queued messages to process
[2019-05-03 15:56:33,09] [info] Shutting down connection pool: curAllocated=0 idleQueues.size=0 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2019-05-03 15:56:33,09] [info] WriteMetadataActor Shutting down: processing 0 queued messages
[2019-05-03 15:56:33,09] [info] ServiceRegistryActor stopped
[2019-05-03 15:56:33,11] [info] Database closed
[2019-05-03 15:56:33,11] [info] Stream materializer shut down
[2019-05-03 15:56:33,12] [info] WDL HTTP import resolver closed