ECS and Rapid Failure

Yesterday I created a condition where my Tasks were dying and restarting in rapid succession. When the container started, it failed immediately. Because of this rapid failure loop, the host ran out of disk space as the images built up, and eventually the tasks were no longer able to deploy. So, what to do?

ECS does clean up old images on the host, but that cleanup interval defaults to 3 hours. When the application fails after a second, the deployed images stack up faster than they get cleaned. There are a number of container agent settings you can set through the launch configuration if you need to, and they are detailed here:

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/automated_image_cleanup.html
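
For reference, those image cleanup knobs are just lines in the agent's /etc/ecs/ecs.config on the container instance (usually written there by the launch configuration's user data). As I read that page, these are the relevant settings and what I believe are their defaults:

    # /etc/ecs/ecs.config -- image cleanup settings (defaults as I understand them)
    ECS_DISABLE_IMAGE_CLEANUP=false         # automated cleanup stays enabled
    ECS_IMAGE_CLEANUP_INTERVAL=30m          # how often a cleanup cycle runs
    ECS_IMAGE_MINIMUM_CLEANUP_AGE=1h        # images newer than this are never removed
    ECS_NUM_IMAGES_DELETE_PER_CYCLE=5       # cap on images removed per cycle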

There is one specific setting I am going to try tomorrow that I hope will protect me from this resource exhaustion: ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION seems to be what I am looking for. The open question is what impact setting it lower than the 3 hour default will have; I guess I will find that out tomorrow.
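
For what it's worth, here is the change I am planning to try, sketched as the ecs.config line I would append via the launch configuration's user data. The 15 minute value is my own guess, not an AWS recommendation:

    # /etc/ecs/ecs.config -- keep stopped tasks (and so their images) around for
    # a shorter window before the agent cleans them up. 15m is a guess on my part.
    ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=15m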

As an aside, I got into this situation because I was using the wrong access keys for the environment. When the application starts, it reads from Secrets Manager before wiring up configuration in the IoC container, so the bad credentials killed the container right away (there is a rough sketch of that startup flow below). The only fixes were to wait 3 hours, which was not an option, or to spin up another host and let the tasks redeploy onto it. Hopefully I will have good news to report back!
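
For the curious, here is that startup flow as a Python/boto3 sketch rather than my actual code, just to show why bad credentials kill the container within a second of starting:

    # Illustrative only: the shape of a startup that reads secrets before wiring
    # up the IoC container. boto3 picks up credentials from the environment or
    # task role; with the wrong access keys, get_secret_value throws, the process
    # exits non-zero, and ECS immediately schedules a replacement -- the crash loop.
    import json
    import sys

    import boto3


    def load_config(secret_id: str) -> dict:
        client = boto3.client("secretsmanager")
        response = client.get_secret_value(SecretId=secret_id)  # fails fast on bad credentials
        return json.loads(response["SecretString"])  # assumes the secret stores JSON


    def main() -> None:
        try:
            config = load_config("my-app/production")  # hypothetical secret name
        except Exception as exc:
            # Any credential or permission error lands here before the IoC
            # container is even configured, so the task dies almost instantly.
            print(f"startup failed reading secrets: {exc}", file=sys.stderr)
            sys.exit(1)
        # ... register config with the IoC container and start the app ...


    if __name__ == "__main__":
        main()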
