Bean Blog
The Mother Load
Open source collaboration and the all-you-can-eat nature of compiling a website nowadays have created a landscape that can be both blessing and curse.
On one hand, thousands of developers are actively contributing code to major open source projects and all sorts of sites have opened up their content via APIs for your use. This means you can add modules into your CMS and pull content from ten other sites in ways impossible just a few years ago. But on the other hand, open source code comes with the risks of trusting in the community of developers, while tapping into other content means that you may also be dramatically ramping up the number of possible failure points. If nothing else, the combination of all these elements can produce an unstable end product.
Case in point: we recently launched a high-profile news site that incorporated several add-on modules within its Drupal CMS. Adding to the complexity, the website incorporates video hosted on an external server using a proprietary video player, multiple RSS feeds, blogging tools, jQuery calls and a host of dynamic content pulled together using Drupal’s CCK. Again, these sorts of combinations are the norm, but some combinations work better than others.
Like many other topical sites, this site experienced significant spikes in traffic tied to its weekly schedule. Most of the week was spent with modest levels of users, but certain times saw these figures leap to several hundred times higher than normal.
While a simple solution would be to throw more hardware at the application in order to handle the load, there’s a catch: in addition to considering how this one site handles traffic, we had to consider how our contributing systems, such as the external video player, would handle an overflow of traffic. As it turns out, the video system is designed to throttle down in the face of traffic surges. That works great to reduce the load on the video server, but it caused our system to sit, churning, waiting for a response back. We needed to do some load testing to see what thresholds existed and how we could improve upon them. With a system that reaches across multiple domains, performing multiple processes and using multiple modules all as add-ons, plus some custom code, it was vital to be able to turn various components off and on, test, then retest. And we needed to do it immediately.
Luckily, the old days of more manual load testing are gone. With a web-based load testing service, we were able to first throw 50 users into the mix at no cost. However, since 50 is such a small number, we couldn’t force even the smallest of hiccups. So we ramped up to 250 users, again not a huge number, but throwing that many users at once on a heavy processing page (like a video page) allowed us to actively see where the lag occurred. At first the site was deathly slow, so we took this as our baseline. We turned off all caching, used the standard Drupal install with no page or script compression and kept on full site logging, aka watchdog logs. Our goal was not to push it until it died but to push it until it did something measurable, which it did. Next we disabled all external calls and ran the same test. This time, the site flew. So we knew the code and scripts were pretty well optimized. Our issue was coming from a bottleneck between the various components. So now we had to figure out what that was.
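For readers who want to picture what "throwing users at a page at once" means, here is a minimal sketch in Python. It is not the web-based service we used; the names `run_load_test`, `summarize` and `fake_page` are hypothetical, and `fake_page` only sleeps to stand in for a real page request so the example runs without a network.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def run_load_test(request_fn, users):
    """Call request_fn once per simulated user, all at roughly the same time.

    Returns the list of per-request latencies in seconds.
    """
    def timed_call(_):
        start = time.perf_counter()
        request_fn()
        return time.perf_counter() - start

    # One worker thread per simulated user so the calls overlap.
    with ThreadPoolExecutor(max_workers=users) as pool:
        return list(pool.map(timed_call, range(users)))


def summarize(latencies):
    """Reduce raw latencies to a few numbers worth comparing between runs."""
    ordered = sorted(latencies)
    return {
        "requests": len(ordered),
        "median_s": median(ordered),
        "max_s": ordered[-1],
    }


def fake_page():
    """Hypothetical stand-in for fetching a heavy page (assumption: ~10 ms)."""
    time.sleep(0.01)


if __name__ == "__main__":
    # Baseline run, then repeat after toggling a component off or on.
    print(summarize(run_load_test(fake_page, users=50)))
```

The useful part is less the tool than the habit: run it once for a baseline, change exactly one thing, and run it again so the two summaries can be compared.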
We knew video was a requirement and we also knew that we had no control over the external video player code. Turning on Drupal’s own throttling mechanism would work but at the expense of removing the video when the site came under heavy load… which is exactly why the traffic surge occurs in the first place. Since video had to remain, we kept it turned on and knew we would have to live with the possibility that a surge on the video player would cause it to throttle down. However, the general responsiveness was tweaked and acceptable. Then we moved on to the next item, external website ads.
This was a real problem for the site. Putting the ads back in almost brought the site to its knees. We could do one of two things: we could throttle the ads or we could use a different delivery mechanism. We decided to turn on throttling and once again, the site was almost killed. Apache’s processes showed some processor spiking but definitely not enough to warrant a near-crash. So we removed all but one ad, tested it again, and the test stayed steady. We added another back in, and nothing. Upon adding the final one, the site finally responded by timing out. A quick review of the logs showed that the system was hiccupping on a file include for one particular ad type. We modified that ad type using the second solution, a different delivery mechanism, and again, we seemed to be on our way.
Several hours passed and we came back and ran the process again. Unfortunately it didn’t respond as we would’ve hoped, and we seemed to be back at square one. A quick review of the logs from this test revealed that the server was fine until it was called after sitting virtually idle and attempted to recache every component on the page simultaneously. Drupal is wonderful in that you can cache all the way down to the block level, but this can cause major headaches when ads, blocks, memcache, scripts and code all attempt to call a set of objects, clear the cache, rebuild the cache, log the activity AND send a response back to the user at precisely the same time. In short, it’s too many things happening all at once. What to do?
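The "everything recaches at once" problem can be illustrated outside of Drupal. The sketch below is a generic Python illustration, not our fix and not Drupal's cache API: `SingleFlightCache` and `demo` are hypothetical names, and the 50 ms sleep is an assumed cost for rebuilding one block. The idea shown is that when many requests find the same entry missing, only one of them rebuilds it while the rest wait for the result.

```python
import threading
import time


class SingleFlightCache:
    """In-memory cache where concurrent misses on a key trigger one rebuild."""

    def __init__(self):
        self._values = {}
        self._locks = {}
        self._guard = threading.Lock()  # protects the per-key lock table

    def get(self, key, rebuild):
        if key in self._values:
            return self._values[key]
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have rebuilt while we waited.
            if key not in self._values:
                self._values[key] = rebuild()
            return self._values[key]


def demo(threads=20):
    """Hit a cold cache from many threads; return (rebuild count, results)."""
    calls = []
    results = []
    cache = SingleFlightCache()

    def rebuild():
        calls.append(1)
        time.sleep(0.05)  # assumed cost of rebuilding one block
        return "<div>ad block</div>"

    def worker():
        results.append(cache.get("ad_block", rebuild))

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return len(calls), results


if __name__ == "__main__":
    rebuilds, results = demo()
    print(rebuilds, len(results))
```

Without the per-key lock, every one of those requests would run `rebuild` itself, which is the pile-up the logs described.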
Because we could quickly run load tests, we tried turning various caching mechanisms off and on until we found the right combination. We also came up with a cron job to take some of this caching off of Drupal, which often waits for a user request before it begins caching, as specified in the bootstrap process. Finally, we had to add an intermediary caching module for the ads coming from the ad server, whose caching was conflicting with the Drupal block cache. All this only became apparent through piecemeal testing.
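The cron idea is to rebuild cached pieces on a schedule, one after another, instead of leaving it to the first unlucky visitor. Here is a minimal sketch of that pattern in Python. It is not the job we wrote and it does not touch Drupal: `warm_cache`, the `builders` mapping, the cache entry layout and the `margin` default are all assumptions made for illustration.

```python
def warm_cache(cache, builders, ttl, now, margin=60):
    """Rebuild entries that are missing or expire within `margin` seconds.

    cache    -- dict of key -> {"value": ..., "expires": timestamp}
    builders -- dict of key -> zero-argument function that rebuilds the value
    ttl      -- lifetime in seconds given to each refreshed entry
    now      -- current timestamp, passed in so runs are repeatable
    Returns the list of keys that were refreshed, in the order processed.
    """
    refreshed = []
    for key, build in builders.items():
        entry = cache.get(key)
        if entry is None or entry["expires"] - now <= margin:
            # Entries are rebuilt sequentially, so they never pile up together.
            cache[key] = {"value": build(), "expires": now + ttl}
            refreshed.append(key)
    return refreshed


if __name__ == "__main__":
    import time

    cache = {}
    builders = {
        "ad_block": lambda: "<div>ad</div>",
        "video_block": lambda: "<div>video</div>",
    }
    # A scheduler such as cron would run this every few minutes.
    print(warm_cache(cache, builders, ttl=300, now=time.time()))
```

Scheduled more often than the cache lifetime, a job like this means a visitor arriving after a quiet stretch should find the entries already warm.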
In the end, a series of basically innocent individual components combined to form an overall process that could be a mother of a problem. Thankfully, load testing and the ability to rapidly recombine elements of the site revealed the issues with their interconnectivity, allowing us to make the necessary fixes. And that is the mother lode: finding the core of the issue buried in all the surrounding rock.