Last week I reviewed the IIS log files and noticed a cluster of HTTP Status Code 500 errors logged at approximately the same across many web servers in the farm. Opening up one of the log files I noticed that transactions during the same period were running into thousands of milliseconds, rather than the expected sub-second response times. I spent some time that day and the next sharing that observation with Ops and other developers, scoured error and application logs across several tiers of the farm to try and identify the index event.
A few days later I was checking all the web servers in production for integrity. The integrity check talkes a few minutes, so I sat back to wait for it to finish. While waiting my mailbox chimed. The subject line: Zomg, we're getting an error spew in Production!
Pulling up our near-real-time error viewer and thinking for a minute, it dawned on me I had just caused this--my long-running transaction was blocking other transactions using a shared resource! Okay, my bad, I'll write up a defect on this so it gets fixed. Nevermind that I'd both caused it by running the integrity checker--I was also the fellow who WROTE the integrity checker. Doubly-bad, only slightly mitigated by being the discoverer of the cause of the error spew, and the fellow who then developed and tested the fix.
At least I've got a new theory to check out, that the problem of last week was caused by the integrity checker. Going back to the error logs, I search for evidence that same transaction had been called at about the time of the error code cluster and long-running transaction times. Sure enough, there it is in the middle of that group of IIS log entries.
The index event I looked for the other day but couldn't find--there it is in the MIDDLE of the other log entries--not at the beginning. Huh, wot?!
I knew that IIS logs are queued to the log in the order of completion, not in the order of request. So if a non-blocking long-running transaction begins at the same time or just before a fast transaction, the fast transaction will be logged first. In order to better understand the order of events on my web servers, I'd do better adding a new datetime entity to the log that is the calculation of the time minus the time-taken entities, and then sorting on the calculated column.
Right, I get it that this is likely obvious to lots of engineers. But I have to admit that I've been working on web application for years, and even though I don't often review IIS log files, it's taken this long for to come to this trivial realization--for business intelligence, resort to re-sort!
No comments:
Post a Comment