Boost PHP Speed With “If-Modified-Since” [3/4]
October 16, 2006
Improve PHP Performance
Welcome to the third installment of our four-part series on how webmasters can reduce or even eliminate unnecessary server traffic. The heart of this tutorial is the If-Modified-Since (IMS) header and a technique called conditional GET.
- Part I: Understanding IMS
- Part II: Watching IMS In Action
- Part III: Using IMS For Optimized RSS Feeds (you are here)
- Part IV: Implementing IMS On WordPress
This section covers the impact of conditional GETs on RSS feeds.
Part III: Using IMS For Optimized RSS (and other) Feeds
Up to now, I’ve focused on how browsers can use IMS for conditional GETs, but in reality, browsers are only a small part of the picture. I learned last month that the real beauty of this technique is how much it can lighten the load from web crawlers.
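Before digging into the logs, it helps to see the conditional GET handshake in code. The sketch below is in Python for illustration only (this series ultimately targets PHP/WordPress); it shows the server-side decision: return 304 Not Modified when the client’s If-Modified-Since date is at least as recent as the page’s last change, otherwise send the full page with a Last-Modified header.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional, Tuple

def respond(last_modified: datetime,
            if_modified_since: Optional[str]) -> Tuple[int, Optional[str]]:
    """Decide between a full 200 response and a 304 Not Modified.

    last_modified: when the page content last changed (UTC).
    if_modified_since: the raw IMS header sent by the client, or None.
    Returns (status code, Last-Modified header value to send, if any).
    """
    if if_modified_since:
        try:
            ims = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            ims = None  # malformed date: fall through to a full response
        # HTTP dates have one-second resolution, so drop sub-second precision.
        if ims is not None and last_modified.replace(microsecond=0) <= ims:
            return 304, None  # client's cached copy is still current
    # Full response; send Last-Modified so the client can be conditional next time.
    return 200, format_datetime(last_modified, usegmt=True)
```

The 304 branch is where the savings come from: the server sends headers only, no body, so a crawler that revisits an unchanged page costs almost no bandwidth.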
RSS seemed to get the most attention here, so I paid particular attention to how feed readers have been affecting the Vibe Technology website. What I found is that (for us, at least) RSS crawlers aren’t the culprits; search engine indexers are the big bandwidth hogs.
You can improve performance by optimizing RSS feeds, but you may find more impressive gains by applying the same conditional GET techniques to the full pages that search engines index.
However, before we talk about fixing the problem, let’s understand the problem through some web log analysis…
Required Tools for Web Log Analysis
Are you aware of just how many times a web crawler visits your website? Honestly, I wasn’t either, but since I’d heard it can be several times a day, I decided to find out.
Last month I focused on click tracking software that can give webmasters a picture of visitor behavior (Comparing Crazy Egg, Google and MyBlogLog). While I use MyBlogLog and Google Analytics on a daily basis, I needed to dig deeper for details on how spiders, worms and crawlers were affecting VibeTalk.
To accomplish this, I used an excellent tool called Sawmill Professional by Flowerfire. A 30-day trial version is available for download on their website.
Focus Analysis on Relevant Data
Web logs can be misleading at first glance. For example, Vibe Technology’s September logs show nearly 140,000 “hits”. Strictly speaking, this is true, but understand that *every* element of every requested page is a “hit”. Sawmill does a great job of grouping hits into pages and sessions, but make sure you are tracking bandwidth data at the server level (see Configure IIS section below).
Configure Sawmill Data Filters
Because of ongoing website development, much of our traffic was internal. If you have the same issue, try the following log filter for Sawmill (change the IP addresses and domain name accordingly):
if (c_ip eq "127.0.0.1") or (c_ip eq "126.96.36.199") or (ends_with(hostname, "vibetechnology.com")) then "reject"
The third test only works if your server resolves hostnames. Also, remember to rebuild the database after creating Log Filters.
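For readers who aren’t using Sawmill, the same reject rule can be expressed as a small predicate. This Python sketch mirrors the filter above; the IP addresses and domain are the post’s examples and should be replaced with your own:

```python
INTERNAL_IPS = {"127.0.0.1", "126.96.36.199"}  # your server/office addresses

def is_internal(client_ip: str, hostname: str = "") -> bool:
    """Mirror of the Sawmill filter: True for traffic that should be rejected.

    The hostname check only works if your server resolves client hostnames.
    """
    return client_ip in INTERNAL_IPS or hostname.endswith("vibetechnology.com")
```

Applied to each log record before counting, this keeps your own development traffic from inflating the numbers.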
Configure IIS for Bandwidth Reporting
Only after poring over the reports for an hour or so did I realize why I couldn’t get bandwidth data: we didn’t have that level of tracking enabled at the server level. This left me guesstimating the impact, so I recommend enabling it in your install if you haven’t already. The default IIS installation appears to have Bytes Sent and Bytes Received unchecked in the logging properties.
To calculate the bandwidth impact from feeds, first filter the dataset to specific content areas. For VibeTalk, I filtered traffic to that coming from these URLs:
/vt/wp-feed.php /vt/wp-atom.php /vt/wp-rss*.php /vt/feed/
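If you are slicing a raw log yourself instead of filtering inside Sawmill, those URL patterns can be matched programmatically. A minimal Python sketch (the paths assume the post’s /vt/ install prefix; adjust for your own site):

```python
from fnmatch import fnmatch

# The feed endpoints from the post; the trailing-slash entry is treated as a prefix.
FEED_PATTERNS = ("/vt/wp-feed.php", "/vt/wp-atom.php", "/vt/wp-rss*.php", "/vt/feed/*")

def is_feed_request(path: str) -> bool:
    """True when a logged request path matches one of the feed URLs."""
    return any(fnmatch(path, pattern) for pattern in FEED_PATTERNS)
```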
Web Log Analysis Results
(screenshot: RSS statistics for 6 days)
Because of my settings problem, I only had a few days of feed-related bandwidth info. However, it was pretty obvious that RSS crawlers pose no significant issue – we currently get about 50 visits and transfer less than 1/4 of a megabyte of data per day (one of the benefits of gzip).
It is insignificant, really, though RSS activity could be a bear if the site were ever to be slashdotted.
(screenshot: Crawler statistics for September)
The majority of activity came from other PHP pages fetched by web crawlers. For September, spiders accounted for nearly 12,000 page views! At 100 KB per page (a conservative average), that’s over a gigabyte of data.
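That back-of-the-envelope estimate checks out (figures taken from the September numbers in this post):

```python
page_views = 12_000            # crawler page views for September
avg_page_bytes = 100 * 1024    # conservative 100 KB average page size

total_bytes = page_views * avg_page_bytes
total_gb = total_bytes / (1024 ** 3)
print(f"{total_gb:.2f} GB transferred to crawlers")  # about 1.14 GB
```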
By looking at the response codes, I determined that most of these requests were answered with full 200 responses rather than 304 Not Modified, meaning the pages weren’t honoring IMS/conditional GET.
There were only eight new articles last month, so even with several edits, page content shouldn’t have changed that much; significant improvements are still within reach. Maybe there’s some hope after all!
Even though my research showed little to gain for RSS feeds, there’s plenty of reason to make web aggregators more efficient. When performing traffic analysis for your own site, follow these general steps:
- Choose a web log analytics package that processes raw server log files
- Quantify overall traffic generated by web crawlers
- Compare the expected and actual ratio of 200 (full) to 304 (Not Modified) responses
- Calculate bandwidth impact of implementing conditional GET mechanism
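The third step above can be sketched by tallying status codes straight from the access log. The field position below assumes common log format, where the status code is the second-to-last field; adjust for your server’s layout:

```python
from collections import Counter

def status_counts(log_lines):
    """Tally HTTP status codes; a high 200:304 ratio suggests IMS isn't honored."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        # Common log format: ... "GET /path HTTP/1.1" <status> <bytes>
        if len(fields) >= 2 and fields[-2].isdigit():
            counts[fields[-2]] += 1
    return counts

sample = [
    '1.2.3.4 - - [16/Oct/2006:00:00:00 +0000] "GET /vt/index.php HTTP/1.1" 200 10240',
    '1.2.3.4 - - [16/Oct/2006:00:05:00 +0000] "GET /vt/index.php HTTP/1.1" 304 0',
]
```

On a site with few content changes, a well-behaved crawler should produce mostly 304s; a log dominated by 200s is exactly the symptom described above.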
Now that we understand where traffic is generated on the site, the next step is to make sure PHP content is only regenerated when it is new or updated. In the next and final installment of this series, I hope to solve this problem once and for all in our WordPress installation. Stay tuned for more!
« Previous Part II: Watching IMS In Action
Next » Part IV: Implementing IMS On WordPress