Friday, January 13, 2012

Keeping on Top of Tool Friction

I love work when things flow--when I'm able to work on a problem and make steady progress. I get in a groove of discovering, coding, testing. If I get blocked then a quick search on the internet or an elbow-bump with a colleague resolves things and I'm back to coding and iterating without upset or delay, "getting it done." Happy hours flow by...Sooner than expected the day is done, I'm hungry but feeling satisfied about what got accomplished. I look forward to heading home, cuddling with the cat, and coming back for more the next day. That is the flow I'm talking about. Measuring the number of days that flow against those that don't is one of my personal indicators of work life happiness, and of the goodness-of-fit of me to the job.

Beyond happiness, when I'm in the flow is when I'm most productive.  And not just me--my co-workers are happiest and most productive when they are in their own flow--and when we're flowing together? Look out, engineering miracles suddenly become possible.

This is why I pay attention to things that pop me out of flow.  Yes, partly it is selfish--my happiness is impacted after all.  But the other part is my sense of efficiency and productivity is piqued.  And we can't have that!

Which brings me closer to my point. Yesterday I was looking over the shoulder of one of our Ops engineers--you know, those guys that cover the systems 24 hours a day, 7 days a week, ensuring we reach the difficult goal of zero downtime?  He was launching the IIS Manager (IIS7) in the Staging or Production environment to validate a configuration setting.  It took two full minutes until the local machine node was populated in the UI. Imagine how my flow was interrupted during this wait.  Now think about how the flow of our Ops engineers is interrupted every time they do this on dozens of boxes in all the environments they manage (Dev1/Dev2/Test1/Test2/Staging/Production). Yup...every time!  Two minutes!  Sorry, but my flow was interrupted watching their flow being interrupted.

I went back to my desk, with the hint from the Chief of Ops that the IIS Manager was "probably phoning home," and the reminder that for security reasons the web server tier of the web farm is not permitted to initiate any outward-bound connections to the internet.  I began formulating a cunning plan (thanks, Baldrick of Blackadder fame).  To be thoroughly pedantic, here is my plan:
  1. Verify the problem per environment.  Operationalize the fail.
  2. If verified, consider the tools to research the problem
    --Internet Search for something like "IIS7 Manager Slow to Load"
    --Network Monitor
    --Process Monitor
  3. Develop a changeset based on the research results
  4. Develop a test to prove the change
  5. Apply the change
  6. Test the change
  7. Verify the test if possible by rolling back, testing the prior problem, rolling forward, re-testing the fix.
  8. Communicate the results to Operations staff.
Rolling up my sleeves...I first check my own development web server--no repro.  Then I check each of the dev and test environments...no repro, huh, wot?!  Finally I check staging (finally for me--I don't have access to Production).  At last...I too see the long delay.  Quickly checking back with Ops verifies the problem is only in these two most critical environments, Staging and Production.  Back to my desk.

Now I operationalize the failure case.  I itemize each trivial step so that someone who doesn't know the problem space can reproduce it too.  Or so that I can later, knowing I'll forget exactly what I did.  Screenshots help, so I take them using the Snipping Tool built into the Windows OS and highlight the interesting UI bits.

My internet search reveals others with the same problem...but no solutions offered.  Logging back into Staging, I attempt to run Network Monitor.  Fail--not sure why, might be a bad version of the tool on that box.  Staying with flow I didn't stop to problem solve this, but I add it to my list of things to investigate later.

Next...what does Process Monitor reveal?  Trivial use of this tool can be learned in just a few minutes--and it was just the trivial case I needed.  I load up Process Monitor and start a new capture.  I launch IIS Manager, click on the machine node and wait-wait-wait until the screen populates.  Stopping the capture in Process Monitor, I scroll the list and see the InetMgr.exe process name looks interesting.  Filtering on only that process and quickly scrolling through the thousands of entries while watching the time stamp column--I can see periods where it stops/blocks/waits for 5 second intervals.  Just before the blocking happens each time I see this registry key queried:
HKCU\Software\Microsoft\Windows\CurrentVersion\Internet Settings\Connections\DefaultConnectionSettings
I bet what is happening, just as Ops suggested, is that the IIS Manager is "phoning home" to Microsoft, probably for very good reasons: it throws its request out and then sits and waits for a response that is never going to happen because there is no outward-bound internet connection on this box.
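The timestamp scan described above was done by eye, but it can be sketched in a few lines of Python. This is only a sketch under assumptions: it assumes the capture was saved from Process Monitor as CSV with "Time of Day" and "Process Name" columns, and the helper names (load_rows, find_stalls) and the sample rows are made up for illustration, not taken from the real capture.

```python
import csv
import io
from datetime import datetime

# Illustrative rows only (NOT the real capture). The column names are an
# assumption about how the CSV export is laid out.
SAMPLE_CSV = r'''
"Time of Day","Process Name","PID","Operation","Path","Result","Detail"
"10:15:01.0000000 AM","InetMgr.exe","1234","RegOpenKey","HKCU\Software\Microsoft\Windows\CurrentVersion\Internet Settings\Connections","SUCCESS",""
"10:15:01.0020000 AM","InetMgr.exe","1234","RegQueryValue","HKCU\Software\Microsoft\Windows\CurrentVersion\Internet Settings\Connections\DefaultConnectionSettings","SUCCESS",""
"10:15:03.0000000 AM","Explorer.EXE","999","ReadFile","C:\Example\other.dat","SUCCESS",""
"10:15:06.0020000 AM","InetMgr.exe","1234","ReadFile","C:\Example\placeholder.dll","SUCCESS",""
'''


def load_rows(text):
    """Turn CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text.strip())))


def parse_time_of_day(text):
    """Parse a time such as '10:15:02.1234567 AM' (date is ignored)."""
    clock, _, meridiem = text.strip().partition(" ")
    whole, _, fraction = clock.partition(".")
    fraction = (fraction + "000000")[:6]  # strptime's %f takes at most 6 digits
    if meridiem:
        return datetime.strptime(f"{whole}.{fraction} {meridiem}", "%I:%M:%S.%f %p")
    return datetime.strptime(f"{whole}.{fraction}", "%H:%M:%S.%f")


def find_stalls(rows, process_name="InetMgr.exe", min_gap_seconds=4.0):
    """Return (gap_seconds, last_row_before_gap) for each long pause.

    Only rows for process_name are considered. A capture that crosses
    midnight is not handled in this sketch.
    """
    stalls = []
    previous = None
    for row in rows:
        if row["Process Name"].lower() != process_name.lower():
            continue
        stamp = parse_time_of_day(row["Time of Day"])
        if previous is not None:
            gap = (stamp - previous[0]).total_seconds()
            if gap >= min_gap_seconds:
                stalls.append((gap, previous[1]))
        previous = (stamp, row)
    return stalls


if __name__ == "__main__":
    for gap, row in find_stalls(load_rows(SAMPLE_CSV)):
        print(f"{gap:.1f}s pause after {row['Operation']} {row['Path']}")
```

For a real export, something like load_rows(open(path, encoding="utf-8-sig").read()) should do, where utf-8-sig simply tolerates a byte-order mark if the file has one. The event just before each pause is the one worth staring at, which is how the registry key above caught my eye.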

On a hunch I pull up the "Internet Options" control panel applet...click on the LAN settings button...and make these changes--my goal is to make the request fail fast rather than fail slow:
  • [x] Use a proxy server for your LAN
  • Address: localhost
  • Port: 8080
Committing those changes, I find that I can now pull up IIS Manager and it's usable within 10 seconds--a six-to-12-fold improvement.  I reverse the changes, re-verify the fail case, re-apply the changes and re-verify the success.
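The fail-fast idea can be illustrated in isolation. The sketch below is not what IIS Manager does internally (I never confirmed that); it only shows, under that assumption, why pointing a request at a local port with nothing listening tends to come back with an error quickly, while a request sent toward an unreachable host has to wait out its timeout. The function names time_connect and unused_local_port are hypothetical helpers for this example.

```python
import socket
import time


def time_connect(host, port, timeout_seconds=5.0):
    """Try one TCP connection; return (outcome, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            outcome = "connected"
    except socket.timeout:
        # The slow failure: nothing answered before the timeout expired.
        outcome = "timed out"
    except OSError as exc:
        # The fast failure: the target actively refused, or was unreachable.
        outcome = f"failed: {type(exc).__name__}"
    return outcome, time.monotonic() - start


def unused_local_port():
    """Ask the OS for a free port, then release it so nothing is listening."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind(("127.0.0.1", 0))
        return probe.getsockname()[1]


if __name__ == "__main__":
    # Stand-in for "proxy at localhost:8080 with no proxy running".
    outcome, elapsed = time_connect("127.0.0.1", unused_local_port())
    print(f"local dead port: {outcome} after {elapsed:.2f}s")
```

How quickly the refusal arrives depends on the operating system, so treat the timing as something to measure on the box in question rather than a promise.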

The operations personnel are now armed with this information, and can decide for themselves if they want the fast or slow IIS Manager experience.  They can decide within their workgroup if they want to "push" the change to all domain users, or let each user manage the setting individually on each box.  They got choices, and choices is what makes them happiest.

For another day I'll investigate further what it takes for me to successfully use Network Monitor in this locked-down environment.  And look into what prevents IIS7 Manager from being usable in less than 3 seconds as it is on my box (perhaps my solid-state drives make the difference?).  Until then, I'm looking for the next task to re-enter the flow experience.  Soon enough it'll be dark again, I'll be hungry, and my cat will need some cuddling.