Friday, January 13, 2012

Keeping on Top of Tool Friction

I love work when things flow--when I'm able to work on a problem and make steady progress. I get in a groove of discovering, coding, testing. If I get blocked, a quick search on the internet or an elbow-bump with a colleague resolves things and I'm back to coding and iterating without upset or delay, "getting it done." Happy hours flow by...sooner than expected the day is done; I'm hungry but feeling satisfied about what got accomplished. I look forward to heading home, cuddling with the cat, and coming back for more the next day. That is the flow I'm talking about. Measuring the number of days that flow against those that don't is one of my personal indicators of work-life happiness, and of how well the job fits me.

Beyond happiness, when I'm in the flow is when I'm most productive.  And not just me--my co-workers are happiest and most productive when they are in their own flow--and when we're flowing together? Look out, engineering miracles suddenly become possible.

This is why I pay attention to things that pop me out of flow.  Yes, partly it is selfish--my happiness is impacted, after all.  But the other part is that my sense of efficiency and productivity is piqued.  And we can't have that!

Which brings me closer to my point. Yesterday I was looking over the shoulder of one of our Ops engineers--you know, those guys who cover the systems 24 hours a day, 7 days a week, ensuring we reach the difficult goal of zero downtime?  He was launching the IIS Manager (IIS7) in the Staging or Production environment to validate a configuration setting.  It took two full minutes until the local machine node was populated in the UI.  Imagine how my flow was interrupted during this wait.  Now think about how the flow of our Ops engineers is interrupted every time they do this on dozens of boxes in all the environments they manage (Dev1/Dev2/Test1/Test2/Staging/Production). Yup...every time!  Two minutes!  Sorry, but my flow was interrupted watching their flow being interrupted.

I went back to my desk with the hint from the Chief of Ops that the IIS Manager was "probably phoning home," and the reminder that for security reasons the web server tier of the web farm is not permitted to initiate any outward-bound connections to the internet.  I began formulating a cunning plan (thanks, Baldrick of Blackadder fame).  To be thoroughly pedantic, here is my plan:
  1. Verify the problem per environment.  Operationalize the fail.
  2. If verified, use these tools to research the problem
    --Internet Search for something like "IIS7 Manager Slow to Load"
    --Network Monitor
    --Process Monitor
  3. Develop a changeset based on the research results
  4. Develop a test to prove the change
  5. Apply the change
  6. Test the change
  7. Verify the test if possible by rolling back, testing the prior problem, rolling forward, re-testing the fix.
  8. Communicate the results to Operations staff.
Rolling up my sleeves...I first check my own development web server--no repro.  Then I check each of the dev and test environments...no repro, huh, wot?!  Finally I check Staging (finally for me--I don't have access to Production).  At last...I too see the long delay.  Quickly checking back with Ops verifies the problem is only in these two most critical environments.  Back to my desk.

Now I operationalize the failure case.  I itemize each trivial step so that someone who doesn't know the problem space can reproduce it too.  Or so I can later, knowing I'll forget exactly what I did.  Screenshots help, so I take them with the Snipping Tool built into the Windows OS and highlight the interesting UI bits.

My internet search reveals others with the same problem...but no solutions offered.  Logging back into Staging, I attempt to run Network Monitor.  Fail--not sure why; it might be a bad version of the tool on that box.  Staying in flow, I don't stop to problem-solve this, but I add it to my list of things to investigate later.

Next...what does Process Monitor reveal?  Trivial use of this tool can be learned in just a few minutes--and the trivial case is just what I need.  I load up Process Monitor and start a new capture.  I launch IIS Manager, click on the machine node and wait-wait-wait until the screen populates.  Stopping the capture in Process Monitor, I scroll the list and see that the InetMgr.exe process name looks interesting.  Filtering on only that process and quickly scrolling through the thousands of entries while watching the timestamp column, I can see periods where it stops/blocks/waits in 5-second intervals.  Just before each block I see this registry value queried:
HKCU\Software\Microsoft\Windows\CurrentVersion\Internet Settings\Connections\DefaultConnectionSettings
I bet what is happening, just as Ops suggested, is that IIS Manager is "phoning home" to Microsoft--probably for very good reasons--throwing its request out and then sitting and waiting for a response that is never going to come, because there is no outward-bound internet connection on this box.
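
Out of curiosity, that same value can be peeked at from a command prompt--it's just the raw binary blob the Internet Options dialog maintains:
    REM Dump the connection-settings blob that InetMgr appears to be consulting
    reg query "HKCU\Software\Microsoft\Windows\CurrentVersion\Internet Settings\Connections" /v DefaultConnectionSettings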

On a hunch I pull up the "Internet Options" control panel applet...click on the LAN settings button...and make these changes--my goal is to make the request fail fast rather than fail slow:
  • [x] Use a proxy server for your LAN
  • Address: localhost
  • Port: 8080
Committing those changes, I find that I can now pull up IIS Manager and it's usable within 10 seconds--a six- to twelve-fold improvement.  I reverse the changes, re-verify the fail case, re-apply the changes, and re-verify the success.
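
If Ops would rather script the change than click through the dialog on every box, the equivalent per-user tweak should be roughly the two standard WinINET proxy values below--a sketch, worth verifying on a test box first, since the dialog also maintains the binary DefaultConnectionSettings blob alongside them:
    REM Point WinINET at a proxy that refuses connections immediately, so the "phone home" fails fast
    reg add "HKCU\Software\Microsoft\Windows\CurrentVersion\Internet Settings" /v ProxyEnable /t REG_DWORD /d 1 /f
    reg add "HKCU\Software\Microsoft\Windows\CurrentVersion\Internet Settings" /v ProxyServer /t REG_SZ /d "localhost:8080" /f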

The operations personnel are now armed with this information, and can decide for themselves whether they want the fast or the slow IIS Manager experience.  They can decide within their workgroup whether to "push" the change to all domain users, or let each user manage the setting individually on each box.  They've got choices, and choices are what make them happiest.

For another day I'll investigate what it takes to successfully use Network Monitor in this locked-down environment, and look into what prevents IIS7 Manager from being usable in under 3 seconds, as it is on my box (perhaps my solid-state drives make the difference?).  Until then, I'm looking for the next task to re-enter the flow experience.  Soon enough it'll be dark again, I'll be hungry, and my cat will need some cuddling.

Thursday, January 12, 2012

Website Slowdown -- Discovering Index Event

Last week I reviewed the IIS log files and noticed a cluster of HTTP Status Code 500 errors logged at approximately the same time across many web servers in the farm.  Opening up one of the log files, I noticed that transactions during the same period were running into thousands of milliseconds, rather than the expected sub-second response times.  I spent some time that day and the next sharing that observation with Ops and other developers, and scoured error and application logs across several tiers of the farm to try to identify the index event.
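
For what it's worth, pulling that cluster out of the raw logs doesn't take much--a Log Parser one-liner along these lines counts the 500s per minute (a sketch only; the path is a placeholder, and sc-status and friends are the standard IISW3C field names):
    REM Count HTTP 500s per minute to find the spike
    logparser -i:IISW3C "SELECT QUANTIZE(TO_TIMESTAMP(date, time), 60) AS Minute, COUNT(*) AS Errors FROM C:\IISLogs\*.log WHERE sc-status = 500 GROUP BY Minute ORDER BY Errors DESC"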

A few days later I was checking all the web servers in production for integrity.  The integrity check takes a few minutes, so I sat back to wait for it to finish.  While I waited, my mailbox chimed.  The subject line: Zomg, we're getting an error spew in Production!

Pulling up our near-real-time error viewer and thinking for a minute, it dawned on me that I had just caused this--my long-running transaction was blocking other transactions using a shared resource!  Okay, my bad, I'll write up a defect on this so it gets fixed.  Never mind that I'd caused it by running the integrity checker--I was also the fellow who WROTE the integrity checker.  Doubly bad, only slightly mitigated by being the discoverer of the cause of the error spew, and the fellow who then developed and tested the fix.

At least I've got a new theory to check out: that last week's problem was caused by the integrity checker.  Going back to the error logs, I search for evidence that the same transaction had been called at about the time of the error-code cluster and the long transaction times.  Sure enough, there it is, in the middle of that group of IIS log entries.

The index event I looked for the other day but couldn't find--there it is in the MIDDLE of the other log entries--not at the beginning.  Huh, wot?!

I then recalled that IIS log entries are queued to the log in order of completion, not in order of request.  So if a non-blocking, long-running transaction begins at the same time as (or just before) a fast transaction, the fast transaction will be logged first.  To better understand the order of events on my web servers, I'd do better to add a calculated datetime column--the time field minus the time-taken field--and then sort on that calculated column.
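
In Excel terms (my usual tool for poking at these logs), and assuming the time field parsed into column B as a time value and time-taken (milliseconds) into column N--the column letters are hypothetical--the calculated start time is just:
    =B2-(N2/86400000)
Excel stores times as fractions of a day and there are 86,400,000 milliseconds in a day; format the new column as a time, sort ascending, and the entries line up in the order the requests actually began.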

Right, I get that this is likely obvious to lots of engineers.  But I have to admit that I've been working on web applications for years, and even though I don't often review IIS log files, it's taken this long for me to come to this trivial realization--for business intelligence, resort to re-sort!

Wednesday, January 4, 2012

SQL Injection Happen? Nope!

Having pulled all those IIS logs has already paid off.  Within minutes of being asked if we were vulnerable to a new SQL injection attack, I was able to quickly rediscover how to "grep" in Windows.

The quoted string portion is a subset of the attack (FINDSTR's /S switch recurses into subfolders, /N prints line numbers, and /I makes the match case-insensitive):
    FINDSTR /S /N /I "73657420616e73695f7761726e696e6773206f66662044454" *.log
Within a minute, gigabytes of log files were evaluated, and it was definitive: the attack had neither hit nor propagated on our servers.

In truth, we were safe from this attack anyway, because we'd invested years ago in ensuring that all of our DB access goes through parameterized stored procedures. It's tough for those who haven't yet protected themselves--the effort to move away from dynamically concatenated SQL can be significant.

IIS Log File Analysis Breadcrumbs

One of the web applications I work on has significant seasonal variability, with far more load during the Holiday season than at other times of the year. Having just passed the peak season, now is the time to look back and check logs to spot areas to improve or refactor in anticipation of the peak loads expected next year. This particular web farm (IIS on Windows Server 2003 and Windows Server 2008 R2) has no centralized logging, and the files have to be fetched from another network. Plus, the individual log files can be quite large, so even on a relatively fast network it is time-consuming and tedious to fetch them. I've not spent much time closely scrutinizing web server logs, so my skills are limited and/or rusty. That's the big picture...let's see if I can break this down to a problem statement:
How can I analyse and monitor website activity in real time--centralized, ad hoc, with triggers--for the least effort and cost?
Glad to have gotten that question off my chest...because I'm a small-picture guy, I need to think about what I can do within my scope of control, with the fewest changes required to existing systems, no budget, and limited time.  Right, here's my initial list of steps to begin surveying these logs:
  1. Grab a couple logs (copy them to my desktop machine)
  2. Inspect the logs (doh! they're too big to view in Notepad...huh, wot?!)
  3. Look for interesting patterns
  4. Come up with a scalable solution for analysing tens-to-hundreds of log files
Luckily Microsoft's excellent Excel application is quite happy loading huge files.  I haven't tested Excel's limits lately, but it's got no problem at all loading the 5 gigabyte files I'm looking at today.  Right...so my desktop machine is running Win2008R2, has 24GB of RAM, and a nice fat SSD--all of which makes loading/parsing big fat files in Excel fairly pain-free.
Visually scanning the hundreds of thousands of rows quickly reveals some of our clients are misbehaving and spamming us with unnecessary requests. Great to know...add that to the research-and-fix list.
Next, I know that Excel has a nice "text-to-columns" button in the Data ribbon that quickly parses the IIS log entries into columns, and filtering is easily accomplished with the Filter button.  Now I can filter the list to look at particular source IPs, user agents, query string values, etc. All of this is simple enough...and I'm ready to consider how to scale this up to gigabytes-per-day of log files:
  1. Identify all the servers and paths to the target website log files
  2. Write a batch file to copy those files across the network to my local machine (a sketch follows this list)
  3. Learn how to use Microsoft's Log Parser 2.2
  4. Learn how to use Lizard Labs' useful Log Parser Lizard GUI application
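For step 2, the copy script needn't be fancy--something along these lines, where the server names and paths are placeholders for the real farm:
    REM fetch_logs.cmd -- pull IIS logs from each web server to the local analysis box
    REM /Z makes copies restartable across the network, /NP suppresses per-file progress output
    robocopy \\WEB01\d$\LogFiles\W3SVC1 C:\IISLogs\WEB01 *.log /Z /NP
    robocopy \\WEB02\d$\LogFiles\W3SVC1 C:\IISLogs\WEB02 *.log /Z /NP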
None of that took too long--and as long as I limit my queries to one day, the queries complete fairly quickly.  The questions I had were:
  1. Do transaction times increase with load? (see the sketch after this list) [no, or not by much]
  2. Do a significant number of transactions block or take a long time to complete? [no, few appear to be blocking; those taking longer than a couple of seconds appear to be byte-size related--that means clients with low bandwidth. Consider ways to minimize the size of responses]
  3. Is our caching strategy (caching headers) working correctly? [thank goodness, yes]
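For the first question, a query along these lines does the job (a sketch rather than the exact query I ran; the field names are standard IISW3C fields and the path is wherever the copied logs landed):
    REM Requests and average time-taken (milliseconds) per hour, across all copied logs
    logparser -i:IISW3C "SELECT QUANTIZE(TO_TIMESTAMP(date, time), 3600) AS Hour, COUNT(*) AS Requests, AVG(time-taken) AS AvgMs FROM C:\IISLogs\WEB01\*.log GROUP BY Hour ORDER BY Hour"
Comparing the hourly request counts against AvgMs answers question 1 at a glance.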
Other than the misbehaving clients I identified by visually scanning the log files, the parser tools made it easy to answer the high-level questions, so they were well worth the few hours required to get up to speed. I'll try and find time to share the specific Log Parser queries. Shucks, I'll have to rediscover them next year if I don't post them.  That should be motivation enough to extend this article at some point.