Friday, November 2, 2012

Using URLRewriter in IIS to Munge a Cookie Header

Updating a header using the URL Rewrite module in IIS is fairly straightforward--but distilling how to do it from the documentation is a bit of a pain.  A friend of mine was working on it and ended up with 30 browser windows open as he sorted out how to do it.  So in honor of that monumental effort, this humble post distills what was learned into a simple example of how to remove a cookie from the Cookie header.  For example, imagine a request being made from the client to the web server with the following Cookie header:
Cookie: foo1=x; foo2=y; foo3=z
 The goal is to remove "foo2=y; " so the header changes to this:
Cookie: foo1=x; foo3=z
Here is a rule that will do it:
<rule name="Change 'foo2' Cookie Value" stopProcessing="false">
  <match url="^(.*)" />
  <conditions>
    <add input="{HTTP_COOKIE}" pattern="(.*)(foo2=.\; )(.*)" />
  </conditions>
  <serverVariables>
    <set name="HTTP_COOKIE" value="{c:1}{c:3}" />
  </serverVariables>
  <action type="Rewrite" url="{R:0}" />
</rule>
Note that sometimes HTTP_COOKIE is surrounded by curly braces and sometimes not: as a condition input it is referenced as the server variable {HTTP_COOKIE}, while in the serverVariables set element it is named directly. Also, the action type is "Rewrite"--even though the URL itself is not being changed. Finally, HTTP_COOKIE must be an "allowed server variable".  You can use the URL Rewrite feature in IIS Manager to add it, or run something like this from PowerShell:
Try{$NewServerVariable = Add-WebConfiguration /system.webServer/rewrite/allowedServerVariables -atIndex 0 -value @{name="HTTP_COOKIE"}}Catch{}
or from an administrative command line prompt:
%WINDIR%\System32\inetsrv\appcmd.exe set config -section:system.webServer/rewrite/allowedServerVariables /+"[name='HTTP_COOKIE']" /commit:apphost

I'll leave it to the reader to improve the rule to handle cases where there are no other keys in the cookie, or where the value of foo2 is longer than one character, etc.
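A sketch of that improvement, in Python rather than rewrite-rule syntax (illustrative only--the strip_cookie helper and its regex are my own, not part of the IIS rule):

```python
import re

def strip_cookie(header: str, name: str) -> str:
    """Remove one name=value pair from a Cookie header string.

    Unlike the one-character pattern in the rule above, this tolerates
    values of any length and pairs at the start, middle, or end.
    """
    # Match the pair plus the separator that precedes it (or the
    # start of the header), stopping the value at the next ";".
    pattern = r'(?:^|;\s*)' + re.escape(name) + r'=[^;]*'
    stripped = re.sub(pattern, '', header)
    # If the first pair was removed, a leading "; " is left behind.
    return stripped.lstrip('; ')

strip_cookie("foo1=x; foo2=y; foo3=z", "foo2")   # "foo1=x; foo3=z"
```

The same idea carries back to the IIS rule: the point is only that `[^;]*` handles multi-character values where `.` does not.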

Sunday, June 3, 2012

Scaling Websites: Maximizing HTTP Caching


Abstract

Caching of content is an important consideration in maximizing the scalability of most computer applications. HTTP caching utilizes the technology built into the internet itself--in web servers and user agents (browsers)--to scale performance. Below I discuss the headers that control HTTP caching on an example website, beginning with the serving of static content, and then moving to the serving of dynamic content. Special considerations for caching under SSL are noted. (For additional information see the List_of_HTTP_header_fields.) I finish the discussion with recommendations for a particular website--based on my "huh, wot?!" experience after looking at its use of headers.

Caching Static Content

Static resources consist of files served as-is, with extensions such as .htm, .js, .css, .gif, .jpg, etc. When a resource is initially requested from a browser, the raw HTTP/1.1 request looks like this (as seen in Fiddler):

Exhibit 1: Request
GET https://myserver.com/test.htm?aa=myorg&app=myapp HTTP/1.1
Accept: text/html, application/xhtml+xml, */*
Accept-Language: en-US
User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Host: myserver.com
Exhibit 2: Response (headers only, body elided)
HTTP/1.1 200 OK
Cache-Control: max-age=10800
Content-Type: text/html
Last-Modified: Mon, 30 Apr 2012 19:21:29 GMT
Accept-Ranges: bytes
ETag: "afa0ca71627cd1:0"
Date: Sat, 02 Jun 2012 17:13:14 GMT
Content-Length: 53751
Server: myserver.com
The response headers from this server show the following: the Cache-Control max-age has been specifically configured (in IIS7) in applicationHost.config (or overridden in the site's web.config) to expire content in 3 hours (10800 seconds), the Last-Modified value corresponds to the Date Modified property of the file on the web server, and the ETag value is calculated automatically by IIS (derived from the file's timestamp and a configuration change number).
During the three-hour max-age window the requested resource will be served from the client's local cache (default IE setting). If the user presses "refresh" in their browser, the request is pushed back to the server. The headers in that case are as follows--notice how the If-Modified-Since and If-None-Match values match the Last-Modified and ETag headers above:

Exhibit 3: Request
GET https://myserver.com/test.htm?aa=myorg&app=myapp HTTP/1.1
Accept: text/html, application/xhtml+xml, */*
Accept-Language: en-US
User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
Accept-Encoding: gzip, deflate
Host: myserver.com
If-Modified-Since: Mon, 30 Apr 2012 19:21:29 GMT
If-None-Match: "afa0ca71627cd1:0"
Connection: Keep-Alive
Because the file hasn’t changed, the server response to that request consists of only headers—this is the complete response—and the client will continue to use the cached copy:

Exhibit 4: Response
HTTP/1.1 304 Not Modified
Cache-Control: max-age=10800
Last-Modified: Mon, 30 Apr 2012 19:21:29 GMT
Accept-Ranges: bytes
ETag: "afa0ca71627cd1:0"
Date: Sat, 02 Jun 2012 17:22:38 GMT
Server: myserver.com
Until the file changes, the calculated ETag and Last-Modified values remain the same. At the end of three hours the client will once again make a request as in Exhibit 3; the response will be the same as in Exhibit 4, except that the Date header will be updated. The client will update the corresponding Last Checked property on the cached entry (as seen in the Temporary Internet Files pseudo-folder) and continue serving the original copy from the cache.
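The revalidation decision the server makes in Exhibits 3 and 4 can be modeled in a few lines of Python (a sketch of the logic, not IIS code; per HTTP semantics, If-None-Match takes precedence over If-Modified-Since):

```python
from email.utils import parsedate_to_datetime

def revalidate(request_headers: dict, etag: str, last_modified: str) -> int:
    """Decide 304 vs 200 for a conditional GET against one resource."""
    inm = request_headers.get("If-None-Match")
    if inm is not None:
        # The ETag comparison wins when both validators are present.
        return 304 if inm == etag else 200
    ims = request_headers.get("If-Modified-Since")
    if ims is not None:
        # Not modified if the cached copy is at least as new as the file.
        if parsedate_to_datetime(ims) >= parsedate_to_datetime(last_modified):
            return 304
    return 200
```

With the Exhibit 3 headers and an unchanged file, this returns 304 and the body is never re-sent.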
Applications that serve and cache static content (usually small files) scale easily to hundreds, thousands, and millions of users, even with fairly limited server-side hardware. Small increases in hardware and networking further scale the delivery of static content. Additionally, non-SSL applications can make use of intermediate proxy servers that further distribute request servicing, limiting direct connections to the origin servers.

Static Content: Overriding IIS7

This customer's website has been specifically tuned to cache static resources as served from the web farm. These settings include setting the Cache-Control header for static content. In IIS6 the ETag suffix value had to be forced as well--IIS7 automatically forces the suffix to ":0". Below is the command line to set the max-age value in IIS7 to three hours:
appcmd.exe set config -section:system.webServer/staticContent /clientCache.cacheControlMaxAge:"03:00:00" /commit:apphost
Because content from this customer is served under SSL the content is not cached by proxies—it is only cached on the client.

Caching Dynamic Content

Dynamic content consists of web pages, or other files or streams of information, that are generated (usually) at the time of access or that change as a result of interaction with the user. Because the content changes, the headers issued with dynamic content typically disable caching.  My customer's website emits a typical request-response pair; notice the Cache-Control header in the response, and the absence of ETag and Last-Modified headers:

Exhibit 6: Request for dynamic content
GET http://myserver.com/test.aspx?aa=myorg&app=myapp HTTP/1.1
Accept: text/html, application/xhtml+xml, */*
Accept-Language: en-US
User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Host: myserver.com

Exhibit 7: Response
HTTP/1.1 200 OK
Cache-Control: private, no-store
Content-Type: text/html; charset=utf-8
Date: Sat, 02 Jun 2012 23:47:20 GMT
Content-Length: 4747
Server: myserver.com
Subsequent identical requests have similar responses--as there is no caching, the client round-trips to the server each time; code is executed to re-create the content and stream it back to the client every time, even when the request is identical to prior requests.
Because there is no caching, code is executed for each request, and the database is accessed for every request, the ability of this application to scale is limited by the capacity of the web server to dynamically generate the content ("CPU") and the responsiveness of the backing database server and storage.

Dynamic Content: Overridden Header

Looking through the code, I see this website currently forces the Cache-Control header for nearly all dynamic content as noted above--this line of code is typical:
Response.Cache.SetNoStore();

Caching Proxies

As indicated above, non-SSL content can be stored by intermediary servers between the original content provider (web server) and the user (see Caching Proxy for details). Because traffic is encrypted between this server and the user, caching proxies are not available for this application.  However, as this website is part of a web farm that supports SSL offloading, there is an opportunity to utilize caching proxies on a reverse proxy router.  Fortunately IIS7 with URL Rewrite and Application Request Routing (ARR) can act as an HTTP caching proxy, and this can be trivially implemented on each web server.

The Future is Bright

I'll be working with this customer to improve their strategy for HTTP caching of dynamic content--upgrading to IIS7 and rolling out ARR are enabling technologies that position this customer for improvements in scale and uptime in the near term. Separately, it's time to begin implementing a distributed memory cache for intermediate results. Both technologies involve caching, but they sit at different layers of the caching strategy: the distributed cache addresses partial results rather than the complete responses discussed here.

Friday, January 13, 2012

Keeping on Top of Tool Friction

I love work when things flow--when I'm able to work on a problem and make steady progress. I get in a groove of discovering, coding, testing. If I get blocked then a quick search on the internet or an elbow-bump with a colleague resolves things and I'm back to coding and iterating without upset or delay, "getting it done." Happy hours flow by...Sooner than expected the day is done, I'm hungry but feeling satisfied about what got accomplished. I look forward to heading home, cuddling with the cat, and coming back for more the next day. That is the flow I'm talking about. Measuring the number of days that flow against those that don't is one of my personal indicators of work life happiness, and of the goodness-of-fit of me to the job.

Beyond happiness, when I'm in the flow is when I'm most productive.  And not just me--my co-workers are happiest and most productive when they are in their own flow--and when we're flowing together? Look out, engineering miracles suddenly become possible.

This is why I pay attention to things that pop me out of flow.  Yes, partly it is selfish--my happiness is impacted after all.  But the other part is my sense of efficiency and productivity is piqued.  And we can't have that!

Which brings me closer to my point. Yesterday I was looking over the shoulder of one of our Ops engineers--you know, those guys who cover the systems 24 hours a day, 7 days a week, ensuring we reach the difficult goal of zero downtime?  He was launching IIS Manager (IIS7) in the Staging or Production environment to validate a configuration setting.  It took two full minutes until the local machine node was populated in the UI. Imagine how my flow was interrupted during this wait.  Now think about how the flow of our Ops engineers is interrupted every time they do this on dozens of boxes in all the environments they manage (Dev1/Dev2/Test1/Test2/Staging/Production). Yup...every time!  Two minutes!  Sorry, but my flow was interrupted watching their flow being interrupted.

I went back to my desk, with the hint from the Chief of Ops that IIS Manager was "probably phoning home," and the reminder that for security reasons the web server tier of the web farm is not permitted to initiate any outbound connections to the internet.  I began formulating a cunning plan (thanks, Baldrick of Blackadder fame).  To be thoroughly pedantic, here is my plan:
  1. Verify the problem per environment.  Operationalize the fail.
  2. If verified, use these tools to research the problem:
    --Internet search for something like "IIS7 Manager Slow to Load"
    --Network Monitor
    --Process Monitor
  3. Develop a changeset based on the research results
  4. Develop a test to prove the change
  5. Apply the change
  6. Test the change
  7. Verify the test, if possible, by rolling back, testing the prior problem, rolling forward, and re-testing the fix.
  8. Communicate the results to Operations staff.
Rolling up my sleeves...I first check my own development web server--no repro.  Then I check each of the dev and test environments...no repro, huh, wot?!  Finally I check Staging (finally for me--I don't have access to Production).  At last...I too see the long delay.  Quickly checking back with Ops verifies the problem exists only in these two most critical environments.  Back to my desk.

Now I operationalize the failure case.  I itemize each trivial step so that someone who doesn't know the problem space can reproduce it too.  Or so I can later, knowing I'll forget exactly what I did.  Screenshots help, so I take them using the Snipping Tool built into the Windows OS and highlight the interesting UI bits.

My internet search reveals others with the same problem...but no solutions offered.  Logging back into Staging, I attempt to run Network Monitor.  Fail--not sure why; it might be a bad version of the tool on that box.  Staying with the flow, I didn't stop to problem-solve this, but I added it to my list of things to investigate later.

Next...what does Process Monitor reveal?  Trivial use of this tool can be learned in just a few minutes--and the trivial case was just what I needed.  I load up Process Monitor and start a new capture.  I launch IIS Manager, click on the machine node, and wait-wait-wait until the screen populates.  Stopping the capture in Process Monitor, I scroll the list and see that the InetMgr.exe process name looks interesting.  Filtering on only that process and quickly scrolling through the thousands of entries while watching the timestamp column, I can see periods where it stops/blocks/waits in 5-second intervals.  Just before the blocking happens, each time I see this registry key queried:
HKCU\Software\Microsoft\Windows\CurrentVersion\Internet Settings\Connections\DefaultConnectionSettings
I bet what is happening, just as Ops suggested, is that IIS Manager is "phoning home" to Microsoft, probably for very good reasons; it throws its request out and then sits and waits for a response that is never going to come, because there is no outbound internet connection on this box.

On a hunch I pull up the "Internet Options" control panel applet...click on the LAN Settings button...and make these changes--my goal is to make the request fail fast rather than fail slow:
  • [x] Use a proxy server for your LAN
  • Address: localhost
  • Port: 8080
Committing those changes, I find that I can now pull up IIS Manager and it's usable within 10 seconds--a six- to twelve-fold improvement.  I reverse the changes, re-verify the fail case, re-apply the changes, and re-verify the success.
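Why the bogus localhost proxy helps: a connection to a local port with no listener is refused immediately, while one to an unreachable internet host waits out a long timeout. A small Python sketch of that fail-fast behavior (my own illustration, unrelated to IIS Manager internals):

```python
import socket
import time

def connect_latency(host: str, port: int, timeout: float = 5.0) -> float:
    """Time one TCP connect attempt, success or failure, capped by timeout."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError:
        pass  # refused or timed out -- we only care how long it took
    return time.monotonic() - start

# Find a local port with no listener: bind to port 0, note it, close it.
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
free_port = probe.getsockname()[1]
probe.close()

# A closed local port fails in milliseconds -- the "fail fast" effect
# the localhost proxy setting buys us.
elapsed = connect_latency("127.0.0.1", free_port)
```

A firewalled or unroutable remote host, by contrast, would burn the full timeout before giving up.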

The operations personnel are now armed with this information and can decide for themselves whether they want the fast or the slow IIS Manager experience.  They can decide within their workgroup whether to "push" the change to all domain users, or let each user manage the setting individually on each box.  They've got choices, and choices are what make them happiest.

For another day, I'll investigate what it takes to successfully use Network Monitor in this locked-down environment, and look into what prevents IIS7 Manager from being usable in less than 3 seconds, as it is on my box (perhaps my solid-state drives make the difference?).  Until then, I'm looking for the next task to re-enter the flow experience.  Soon enough it'll be dark again, I'll be hungry, and my cat will need some cuddling.

Thursday, January 12, 2012

Website Slowdown -- Discovering Index Event

Last week I reviewed the IIS log files and noticed a cluster of HTTP status code 500 errors logged at approximately the same time across many web servers in the farm.  Opening one of the log files, I noticed that transactions during the same period were running into thousands of milliseconds, rather than the expected sub-second response times. I spent some time that day and the next sharing that observation with Ops and other developers, and scoured error and application logs across several tiers of the farm to try to identify the index event.

A few days later I was checking all the web servers in production for integrity.  The integrity check takes a few minutes, so I sat back to wait for it to finish.  While waiting, my mailbox chimed.  The subject line: Zomg, we're getting an error spew in Production!

Pulling up our near-real-time error viewer and thinking for a minute, it dawned on me that I had just caused this--my long-running transaction was blocking other transactions using a shared resource!  Okay, my bad; I'll write up a defect on this so it gets fixed.  Never mind that I'd caused it by running the integrity checker--I was also the fellow who WROTE the integrity checker.  Doubly bad, only slightly mitigated by being the discoverer of the cause of the error spew, and the fellow who then developed and tested the fix.

At least I've got a new theory to check out: that last week's problem was caused by the integrity checker.  Going back to the error logs, I search for evidence that the same transaction had been called at about the time of the error-code cluster and long-running transaction times.  Sure enough, there it is, in the middle of that group of IIS log entries.

The index event I looked for the other day but couldn't find--there it is in the MIDDLE of the other log entries--not at the beginning.  Huh, wot?!

I knew that IIS log entries are queued to the log in order of completion, not in order of request.  So if a non-blocking, long-running transaction begins at the same time as, or just before, a fast transaction, the fast transaction will be logged first.  To better understand the order of events on my web servers, I'd do better to add a new datetime column--calculated as the time field minus the time-taken field--and then sort on the calculated column.
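That calculated column is easy to sketch. Assuming, as described above, that the logged time stamp marks completion and that time-taken is in milliseconds, a hypothetical Python sort key looks like this (field positions are from an assumed #Fields layout; adjust for yours):

```python
from datetime import datetime, timedelta

def start_time(line: str) -> datetime:
    """Completion time minus time-taken = approximate request start."""
    fields = line.split()
    finished = datetime.strptime(fields[0] + " " + fields[1],
                                 "%Y-%m-%d %H:%M:%S")
    taken_ms = int(fields[-1])  # time-taken assumed to be the last field
    return finished - timedelta(milliseconds=taken_ms)

# Two entries that completed in the same second; the slow one started first.
lines = [
    "2012-01-12 10:00:05 GET /fast.aspx 200 40",
    "2012-01-12 10:00:05 GET /slow.aspx 200 9000",
]
ordered = sorted(lines, key=start_time)  # /slow.aspx sorts first
```

Sorting on the computed start time puts the long-running transaction back at the head of the cluster, where the index event actually sits.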

Right, I get that this is likely obvious to lots of engineers.  But I have to admit that I've been working on web applications for years, and even though I don't often review IIS log files, it's taken this long for me to come to this trivial realization--for business intelligence, resort to re-sort!

Wednesday, January 4, 2012

SQL Injection Happen? Nope!

Having pulled all those IIS logs has already paid off.  Within minutes of being asked whether we were vulnerable to a new SQL injection attack, I was able to quickly rediscover how to "grep" in Windows.

The quoted string portion is a subset of the attack:
    FINDSTR /S /N /I "73657420616e73695f7761726e696e6773206f66662044454" *.log
Within a minute, gigabytes of log files were evaluated, and it was definitive that the attack had not hit nor propagated on our servers.
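For reference, the same sweep can be scripted where FINDSTR isn't at hand--a minimal Python stand-in (the function name and tuple shape are my own):

```python
import os

def find_in_logs(root: str, needle: str) -> list:
    """Case-insensitive recursive search of *.log files,
    roughly FINDSTR /S /N /I needle *.log."""
    hits = []
    needle = needle.lower()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith(".log"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors="replace") as fh:
                for lineno, line in enumerate(fh, 1):
                    if needle in line.lower():
                        hits.append((path, lineno, line.rstrip("\n")))
    return hits
```

An empty result over the whole log tree is the "we weren't hit" answer; any hit gives you the file and line number to investigate.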

In truth, we were safe from this attack anyway, because we'd invested years ago in ensuring that all of our DB access was through parameterized stored procedures. It's tough for those who haven't yet protected themselves--the effort to change from the other coding practice can be significant.

IIS Log File Analysis Breadcrumbs


One of the web applications I work on has significant seasonal variability, with much more load experienced during the holiday season than at other times of the year. Having just passed the peak season, now is the time to look back and check logs to spot areas to improve or refactor in anticipation of the peak loads expected next year. This particular web farm (IIS on Windows Server 2003 and Windows Server 2008 R2) has no centralized logging, and the files have to be fetched from another network. Plus, the individual log files can be quite large, so even on a relatively fast network it is time consuming and tedious to fetch them. I've not spent much time closely scrutinizing web server logs, so my skills are limited and/or rusty. That's the big picture...let's see if I can break this down to a problem statement:
How to analyze and monitor website activity in real time--centralized, ad hoc, with triggers--for the least effort and cost?
Glad to have gotten that question off my chest...because I'm a small-picture guy, I need to think about what I can do within my scope of control, with the fewest change requirements to existing systems, no budget, and limited time.  Right, here's my initial list of steps to begin surveying these logs:
  1. Grab a couple logs (copy them to my desktop machine)
  2. Inspect the logs (doh! they're too big to view in Notepad...huh, wot?!)
  3. Look for interesting patterns
  4. Come up with a scalable solution for analyzing tens-to-hundreds of log files
Luckily Microsoft's excellent Excel application is quite happy loading huge files.  I haven't tested Excel's limits lately (worksheets cap out at 1,048,576 rows), but it had no problem loading the big files I'm looking at today.  Right...so my desktop machine is running Win2008R2, has 24GB of RAM, and a nice fat SSD--all of this makes loading/parsing big fat files in Excel fairly pain free.
Visually scanning the hundreds of thousands of rows quickly reveals that some of our clients are misbehaving and spamming us with unnecessary requests. Great to know...add that to the research-and-fix list.
Next, I know that Excel has a nice "Text to Columns" button in the Data ribbon that quickly parses the IIS log files into tidy columns.  Filtering is quickly accomplished using the Filter button.  Now I can filter the list to look at particular source IPs, user agents, query string values, etc. All of this is simple enough...and I'm ready to consider how to scale this up to gigabytes-per-day of log files:
  1. Identify all the servers and paths to the target website log files
  2. Write a batch file to copy those files across the network to my local machine
  3. Learn how to use Microsoft's Log Parser 2.2
  4. Learn how to use Lizard Lab's useful Log Parser Lizard GUI application
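Before reaching for Log Parser, the Excel text-to-columns-and-filter step itself is scriptable--a minimal Python sketch over a made-up W3C fragment (the #Fields line drives the column names, just as in real IIS logs):

```python
LOG = """#Software: Microsoft Internet Information Services
#Fields: date time c-ip cs-method cs-uri-stem sc-status time-taken
2012-01-12 10:00:01 10.0.0.5 GET /api/poll 200 15
2012-01-12 10:00:01 10.0.0.5 GET /api/poll 200 12
2012-01-12 10:00:02 10.0.0.9 GET /home 500 2043
"""

def parse_w3c(text: str) -> list:
    """Split a W3C log into dict rows keyed by the #Fields header --
    the text-to-columns step, minus the mouse."""
    rows, fields = [], []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            fields = line.split()[1:]
        elif line and not line.startswith("#"):
            rows.append(dict(zip(fields, line.split())))
    return rows

rows = parse_w3c(LOG)
errors = [r for r in rows if r["sc-status"] == "500"]   # the 500 cluster
spammy = [r for r in rows if r["c-ip"] == "10.0.0.5"]   # a chatty client
```

The same keyed-row shape is what Log Parser's SQL-like queries give you, just without the per-file copy step.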
None of that took too long--and as long as I limit my queries to one day, they complete fairly quickly.  The questions I had were:
  1. Do transaction times increase with load? [no, or not by much]
  2. Do a significant number of transactions block/take long times to complete? [no, few appear to be blocking, those taking longer than a couple seconds appear to be byte-size related--that means clients with low bandwidth. Consider ways to minimize size of responses]
  3. Is our caching strategy (caching headers) working correctly? [thank goodness, yes]
Other than the misbehaving clients that I identified by visually scanning the log files, the parser tools made it easy to answer the high-level questions, so it was well worth the few hours required to get up to speed with them. I'll try to find time to share the specific Log Parser queries--shucks, I'll have to rediscover them next year if I don't post them.  That should be motivation enough to extend this article at some point.