Response Slowness – Post Mortem Update

Issue Summary
Transaction processing (eligibility, alerts, estimates) was intermittently slower than usual or timed out.

Timeline
The first slowdowns appeared at approximately 12:45 p.m. CT on Tuesday, 5/26/2015; the issue ended at 2:30 p.m. CT on Wednesday, 5/27/2015.

Scope
On average, 6% of transactions were delayed or timed out. The exact rate was slightly higher or lower for a given client, depending on how that client's traffic was routed.

Root Cause
A server stopped responding to requests while remaining in a state that our health monitors did not detect as unhealthy. Although this server was part of a group of redundant servers that handled the majority of the load, some traffic continued to be routed to the troubled server.
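
The failure mode is worth illustrating: a liveness check that only confirms the server answers can pass while the application is hung or too slow to serve real traffic. Below is a minimal sketch of a latency-aware probe; the endpoint URL, timeout, and threshold are illustrative assumptions, not our actual monitoring configuration.

```python
# Hypothetical sketch: a health probe that checks response latency, not just
# liveness. A plain "did the port answer?" check can pass while the app is
# hung or responding too slowly to serve real transactions -- the state our
# monitors originally missed. URL and thresholds below are assumed values.
import time
import urllib.request

HEALTH_URL = "http://app-server:8080/health"  # assumed endpoint
MAX_LATENCY_SECONDS = 2.0                     # assumed SLA threshold

def probe() -> bool:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            ok = resp.status == 200
    except OSError:
        return False  # connection refused, DNS failure, or hard timeout
    elapsed = time.monotonic() - start
    # Fail the check if the server answered but answered too slowly.
    return ok and elapsed <= MAX_LATENCY_SECONDS

if __name__ == "__main__":
    print("healthy" if probe() else "unhealthy")
```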

Resolution and Recovery
Once we identified that the rogue server did not have our standard performance tuning installed, we made the necessary changes (max threads and other app pool settings) to restore processing to normal.
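
Because the underlying problem was configuration drift (one server missing the standard tuning), a periodic comparison of each server's settings against a known-good baseline can catch this class of issue before it affects traffic. The sketch below assumes hypothetical setting names and values; the actual tuning involved max threads and other app pool settings.

```python
# Hypothetical sketch: detecting configuration drift across a server pool by
# comparing each server's live settings to a known-good baseline. Setting
# names and values are illustrative assumptions, not our real configuration.
BASELINE = {
    "maxWorkerThreads": 400,   # assumed value
    "queueLength": 5000,       # assumed value
}

def find_drift(server_settings: dict[str, dict]) -> dict[str, dict]:
    """Return {server: {setting: (expected, actual)}} for mismatches."""
    drift = {}
    for server, settings in server_settings.items():
        diffs = {
            key: (expected, settings.get(key))
            for key, expected in BASELINE.items()
            if settings.get(key) != expected
        }
        if diffs:
            drift[server] = diffs
    return drift

if __name__ == "__main__":
    # Example: app03 is missing the thread tuning entirely.
    pool = {
        "app01": {"maxWorkerThreads": 400, "queueLength": 5000},
        "app02": {"maxWorkerThreads": 400, "queueLength": 5000},
        "app03": {"queueLength": 1000},
    }
    for server, diffs in find_drift(pool).items():
        print(f"{server}: {diffs}")
```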

Corrective and Preventative Measures
We implemented additional performance tracking and active monitoring for these servers.
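
As a rough illustration of what active monitoring can mean here, the sketch below tracks the fraction of slow or timed-out transactions over a rolling window and raises an alert when that fraction crosses a threshold. The window size and thresholds are assumptions for the example, not production values.

```python
# Hypothetical sketch of transaction-level monitoring: alert when the share
# of slow or timed-out transactions in a rolling window exceeds a threshold.
# WINDOW, SLOW_SECONDS, and ALERT_RATE are illustrative assumptions.
from collections import deque

WINDOW = 500                # assumed: last 500 transactions
SLOW_SECONDS = 2.0          # assumed: delayed beyond this counts as slow
ALERT_RATE = 0.02           # assumed: alert if more than 2% are slow

class TransactionMonitor:
    def __init__(self) -> None:
        self.samples: deque[bool] = deque(maxlen=WINDOW)

    def record(self, duration_seconds: float, timed_out: bool) -> None:
        # Store True for each transaction that was slow or timed out.
        self.samples.append(timed_out or duration_seconds > SLOW_SECONDS)

    def should_alert(self) -> bool:
        if not self.samples:
            return False
        return sum(self.samples) / len(self.samples) > ALERT_RATE

monitor = TransactionMonitor()
monitor.record(0.3, timed_out=False)
monitor.record(5.1, timed_out=False)   # slow transaction
monitor.record(0.0, timed_out=True)    # timeout
print("alert" if monitor.should_alert() else "ok")  # alert (2/3 > 2%)
```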
