We launched a number of jobs on Feb 24 that are still running; it is not unexpected for them to take a few days. However, they appear to be getting restarted while the old ones are still running. The stdout file contains a start datestamp, and that datestamp changes between successive downloads of the same file.
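For reference, this is roughly how the jumping datestamp was observed; a minimal sketch, assuming the google-cloud-storage Python client (recent enough for download_as_text) and with the bucket name and object path as hypothetical placeholders rather than the real workspace paths:

```python
# Sketch: download the task's stdout twice and compare the first line (the
# start datestamp). Bucket and path below are hypothetical placeholders.
from google.cloud import storage

BUCKET = "example-workspace-bucket"         # hypothetical
STDOUT_PATH = "submissions/example/stdout"  # hypothetical

blob = storage.Client().bucket(BUCKET).blob(STDOUT_PATH)
first = blob.download_as_text().splitlines()[0]
second = blob.download_as_text().splitlines()[0]
print(first)
print(second)  # differs from `first` when another operation has rewritten stdout
```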
Eddie did more digging, and it appears that in this case there are three operation IDs writing to the same stdout: one started on the 24th, another on the 27th, and a third today. This would indicate that three 32-core machines are busy working on the same thing and localizing to the same bucket path; it is not clear what will happen with the attributes when one of them finishes (does one win, are some jobs already forgotten, etc.).
submission id: 4872ddf0-0b47-4da4-9381-7d3bc3ed678b
non-preemptible operation IDs from one example task, all writing into the same bucket location (a status-check sketch follows the list):
operations/EKeB89KoKxjW39jp_qmSvI4BIP3g3tG1AioPcHJvZHVjdGlvblF1ZXVl
operations/EJb7mIioKxi728nH_Y3ajbEBIP3g3tG1AioPcHJvZHVjdGlvblF1ZXVl
operations/EPjWm5OnKxiy2dSLq8z-s-kBIP3g3tG1AioPcHJvZHVjdGlvblF1ZXVl
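To confirm that all three operations are still running, and when each one started, something like the following might work. This is a sketch, assuming the google-api-python-client package, application default credentials with access to the Google Genomics API, and that these operations live under the v1alpha2 endpoint:

```python
# Sketch: check status and start time of the three operations listed above.
from googleapiclient.discovery import build

OPERATION_NAMES = [
    "operations/EKeB89KoKxjW39jp_qmSvI4BIP3g3tG1AioPcHJvZHVjdGlvblF1ZXVl",
    "operations/EJb7mIioKxi728nH_Y3ajbEBIP3g3tG1AioPcHJvZHVjdGlvblF1ZXVl",
    "operations/EPjWm5OnKxiy2dSLq8z-s-kBIP3g3tG1AioPcHJvZHVjdGlvblF1ZXVl",
]

genomics = build("genomics", "v1alpha2")
for name in OPERATION_NAMES:
    op = genomics.operations().get(name=name).execute()
    meta = op.get("metadata", {})
    status = "done" if op.get("done") else "running"
    print(name, status, meta.get("createTime"), meta.get("startTime"))
```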
Is this related to FireCloud restarts?
Given the uncertainty about when or whether these will finish correctly, at this point we are more interested in killing (reaping?) these jobs and restarting them with an algorithm tweak that should decrease the runtime.
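If directly cancelling the stray operations turns out to be the right way to reap them (as opposed to aborting the submission through FireCloud, which is presumably the safer route), a sketch along these lines might work, again assuming the v1alpha2 operations endpoint and the same client setup as above:

```python
# Sketch: cancel the duplicate operations directly via operations.cancel.
# Note that cancelling operations underneath FireCloud/Cromwell may leave
# the submission's own bookkeeping in an inconsistent state.
from googleapiclient.discovery import build

OPERATION_NAMES = [
    "operations/EKeB89KoKxjW39jp_qmSvI4BIP3g3tG1AioPcHJvZHVjdGlvblF1ZXVl",
    "operations/EJb7mIioKxi728nH_Y3ajbEBIP3g3tG1AioPcHJvZHVjdGlvblF1ZXVl",
    "operations/EPjWm5OnKxiy2dSLq8z-s-kBIP3g3tG1AioPcHJvZHVjdGlvblF1ZXVl",
]

genomics = build("genomics", "v1alpha2")
for name in OPERATION_NAMES:
    genomics.operations().cancel(name=name, body={}).execute()
```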