Boost PHP Speed With “If-Modified-Since” [3/4]

October 16, 2006


Improve PHP Performance

Welcome to the third installment of our four-part series on how webmasters can reduce or even eliminate unnecessary server traffic. The heart of this tutorial is the If-Modified-Since (IMS) header and a technique called the conditional GET.
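For readers new to the mechanism, here is what a conditional GET looks like on the wire: the client echoes back the timestamp it saved from an earlier Last-Modified response header, and when nothing has changed the server answers with a bodiless 304 (the host and dates below are purely illustrative):

```
GET /feed/ HTTP/1.1
Host: example.com
If-Modified-Since: Mon, 16 Oct 2006 08:00:00 GMT

HTTP/1.1 304 Not Modified
Date: Mon, 16 Oct 2006 09:00:00 GMT
```

The 304 carries no body, which is where the bandwidth savings come from.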

Series Index

This section covers the impact of conditional GETs on RSS feeds.

Part III: Using IMS For Optimized RSS (and other) Feeds

Up to now, I’ve focused on how browsers can use IMS for conditional GETs, but in reality, browsers are only a small part of the picture. I learned last month that the real beauty of this technique has to do with making web crawlers work better.

RSS seemed to get the most attention here, so I wanted to look closely at how feed readers have been affecting the Vibe Technology website. What I found is that (for us, at least) RSS crawlers aren’t the culprits – search engine indexers are the big bandwidth hogs.

You can improve performance by optimizing RSS feeds, but you may find more impressive gains when leveraging the same conditional GET techniques with the full pages that are indexed by search engines.

However, before we talk about fixing the problem, let’s understand the problem through some web log analysis…

Required Tools for Web Log Analysis

Are you aware of just how many times a web crawler visits your website? Honestly, I wasn’t either, but since I’d heard it can be several times a day, I decided to find out.

Last month I focused on click tracking software that can give webmasters a picture of visitor behavior (Comparing Crazy Egg, Google and MyBlogLog). While I use MyBlogLog and Google Analytics on a daily basis, I needed to dig deeper for details on how spiders, worms and crawlers were affecting VibeTalk.

[sawmill logo]

To accomplish this, I used an excellent tool called Sawmill Professional by Flowerfire. A 30-day trial version is available for download on their website.

Focus Analysis on Relevant Data

Web logs can be misleading at first glance. For example, Vibe Technology’s September logs show nearly 140,000 “hits”. Strictly speaking, this is true, but understand that *every* element of every requested page is a “hit”. Sawmill does a great job of grouping hits into pages and sessions, but make sure you are tracking bandwidth data at the server level (see Configure IIS section below).

Configure Sawmill Data Filters

Because of website development, much of our traffic was internal. If you have the same issue, try the following log filter for Sawmill (change the IP addresses and domain name accordingly):

 if (c_ip eq "") or 
  (c_ip eq "") or
  (ends_with(hostname, "")) then "reject"

The third test only works if your server resolves hostnames. Also, remember to rebuild the database after creating Log Filters.

Configure IIS for Bandwidth Reporting

[IIS Log Settings]

Only after poring over the reports for an hour or so did I realize why I couldn’t get bandwidth data – we didn’t have that level of tracking enabled at the server level. This resulted in some guesstimating of the impact, so I recommend fixing this in your install if you haven’t already. The default installation appears to have Bytes Sent and Bytes Received unchecked.

To calculate the bandwidth impact from feeds, first filter the dataset to specific content areas. For VibeTalk, I filtered traffic to that coming from these URLs:


Web Log Analysis Results

[RSS stats]

(screenshot: RSS statistics for 6 days)

Because of my settings problem, I only had a few days of feed-related bandwidth info. However, it was pretty obvious that RSS crawlers pose no significant issue – we currently get about 50 visits and transfer less than 1/4 of a megabyte of data per day (one of the benefits of gzip).

It is insignificant, really.

RSS activity could become a bear, though, if the site were ever slashdotted.
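The gzip figure above is easy to sanity-check with PHP’s zlib functions – repetitive feed XML compresses dramatically. A quick illustration (the feed content below is fabricated purely to show the ratio):

```php
<?php
// Fabricated, highly repetitive feed XML, just to demonstrate the ratio.
$xml = str_repeat("<item><title>Post</title><link>http://example.com/</link></item>\n", 500);
$gz  = gzencode($xml, 9); // same compression a gzip-enabled server applies

printf("plain: %d bytes, gzipped: %d bytes (%.0f%% smaller)\n",
       strlen($xml), strlen($gz), 100 * (1 - strlen($gz) / strlen($xml)));
```

On a live feed you wouldn’t compress by hand; wrapping output with ob_start('ob_gzhandler') (or enabling compression in IIS/Apache) does it transparently for clients that send Accept-Encoding: gzip.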

[crawler stats]

(screenshot: Crawler statistics for September)

The majority of activity appeared to be coming from other PHP pages fetched by web crawlers. For September, spiders accounted for nearly 12,000 page views! At 100 KB per page (a conservative average), that’s over a gigabyte of data.

By looking at the response codes, I determined that most of those pages weren’t honoring IMS/conditional GET requests.

There were only eight new articles last month, so even with several edits, page content shouldn’t have changed much – meaning significant improvements are still within reach. Maybe there’s some hope after all!
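Serving those crawlers a 304 isn’t complicated in principle. Here is a minimal, illustrative sketch of the idea in PHP – the function name and the use of filemtime() as a stand-in content timestamp are my own inventions, and Part IV will cover doing this properly inside WordPress:

```php
<?php
// Sketch: honor If-Modified-Since for a dynamically generated page.
// In real code, $lastModified would come from your content's own
// timestamp (e.g. a post's last-updated time).

function isNotModified($ifModifiedSince, $lastModified) {
    // True when the client's cached copy is at least as new as ours.
    if ($ifModifiedSince === null) {
        return false;
    }
    $clientTime = strtotime($ifModifiedSince);
    return $clientTime !== false && $clientTime >= $lastModified;
}

$lastModified = filemtime(__FILE__); // stand-in for the page's real timestamp
$ims = isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])
     ? $_SERVER['HTTP_IF_MODIFIED_SINCE'] : null;

header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');

if (isNotModified($ims, $lastModified)) {
    header('HTTP/1.1 304 Not Modified');
    exit; // skip generating the page body entirely
}
// ...otherwise generate and send the full page as usual...
```

The payoff is in the exit: when the crawler’s copy is current, the server skips both page generation and the transfer of the body.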


Even though my research showed little to gain for RSS feeds, there’s still plenty of reason to serve crawlers and aggregators more efficiently. When performing traffic analysis for your own site, follow these general steps:

  • Choose a web log analytics package that processes raw server log files
  • Quantify overall traffic generated by web crawlers
  • Compare expected and actual 200 / 304 response ratios (full pages vs. not-modified replies)
  • Calculate bandwidth impact of implementing conditional GET mechanism
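The third step above – comparing 200s against 304s – can be sketched with a few lines of PHP. The status-code field position and the sample lines below are assumptions on my part; match them to the #Fields directive in your own IIS logs:

```php
<?php
// Tally response codes from W3C-format (IIS) log lines to see how
// often conditional GETs are being answered with 304s. The status
// field position (index 10 here) depends on your #Fields directive.

function tallyStatusCodes(array $logLines, $statusField = 10) {
    $counts = array();
    foreach ($logLines as $line) {
        $line = trim($line);
        if ($line === '' || $line[0] === '#') {
            continue; // skip W3C directive lines like #Fields
        }
        $fields = preg_split('/\s+/', $line);
        if (!isset($fields[$statusField])) {
            continue;
        }
        $status = $fields[$statusField];
        $counts[$status] = isset($counts[$status]) ? $counts[$status] + 1 : 1;
    }
    arsort($counts);
    return $counts;
}

// Two fabricated log lines, matching a typical #Fields layout:
$sample = array(
    '#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status',
    '2006-09-01 00:00:01 10.0.0.1 GET /index.php - 80 - 66.249.66.1 Googlebot 200',
    '2006-09-01 00:00:02 10.0.0.1 GET /index.php - 80 - 66.249.66.1 Googlebot 304',
);
print_r(tallyStatusCodes($sample));
```

A crawler-friendly site should show a healthy share of 304s; if nearly everything is a 200, conditional GET isn’t being honored.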

Now that we understand where the site’s traffic is generated, the next step is to make sure PHP content is regenerated only when it is new or updated. In the next and final installment of this series, I hope to solve this problem once and for all in our WordPress installation. Stay tuned for more!

« Previous Part II: Watching IMS In Action
Next » Part IV: Implementing IMS On WordPress



You are currently reading Boost PHP Speed With “If-Modified-Since” [3/4] at VibeTalk by Vibe Technology.

