Update: Here's a link to the project page on drupal.org
Every Drupal installation needs to run maintenance tasks regularly, such as cleaning up log files, checking for updates, and updating the site's search index. More often than not, the Unix cron daemon is used to trigger these tasks, which is why the process is usually referred to as a cron run. Since larger sites have more maintenance tasks to perform, the cron run often times out or hangs on a particular function, preventing some operations from completing. A common, albeit hacky, solution is to create custom cron implementations that separate out the different tasks. With the release of the Cron Multi-Threaded module I developed, the need for custom implementations is eliminated. This post explains the inspiration behind the module as well as the technical details of how it increases efficiency and adds reliability to the Drupal cron run.
I recently attended an inspiring meeting sponsored by bostonphp.org that centered on the ground-breaking technology behind the Barack Obama presidential campaign. More specifically, the company Blue State Digital spoke about how they used PHP and MySQL to digitize the canvassing process and send a staggering number of emails to voters. They addressed the scaling issues that came with storing over one billion rows of data in a MySQL database and gave some insight into how PHP was scaled in the unusual environments it was running in.
Although websites were the interface canvassers and supporters used to organize and send donations, a lot of applications were built outside of the web space to handle various tasks such as personalizing emails, sending batch messages, and aiding in database replication. All of the daemons were written purely in PHP and utilized the "process control" extension to create true SMP solutions providing near-linear scalability (in other words, doubling their hardware allowed them to process roughly twice as much data).
The acronym SMP stands for "symmetric multiprocessing". In systems that have multiple CPUs and use an SMP architecture, tasks can be moved between processors to balance the workload efficiently. Since web page scripts usually run for less than a second, SMP systems can distribute the requests across their processors. However, the work of a single PHP process cannot be split across processors, since the language has no native support for multi-threading. In long-running processes such as Drupal's cron run, which may take minutes to complete, a single processor can be tied up for some time while the others remain idle.
As Blue State Digital did with their applications, the Cron Multi-Threaded module for Drupal utilizes the process control extension to fork the process running the PHP script. In computing terms, forking refers to a process making a copy of itself. The resulting replica is called the child process, and it is free to be distributed to another CPU by the system. This technique allows Cron Multi-Threaded to assign different tasks to the child processes, enabling the system to handle the Drupal cron run much more efficiently.
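To make the idea of forking concrete, here is a minimal sketch of what a fork looks like with the process control (pcntl) extension. It is not taken from the module's source; the helper name run_in_child() is made up for illustration, and the script assumes PHP is run from the command line on a Unix-like system with pcntl enabled.

```php
<?php
// Hypothetical helper (not part of the module): run a callable in a forked
// child process and return the child's exit status to the parent.
function run_in_child(callable $task): int
{
    $pid = pcntl_fork();

    if ($pid === -1) {
        // The fork failed, so no child process was created.
        throw new RuntimeException('Could not fork');
    }

    if ($pid === 0) {
        // Child: a copy of the parent that continues from this point.
        // It runs the task and exits with the task's return value.
        exit((int) $task());
    }

    // Parent: $pid is the child's process ID. Block until the child exits.
    pcntl_waitpid($pid, $status);

    return pcntl_wifexited($status) ? pcntl_wexitstatus($status) : -1;
}

$code = run_in_child(function () {
    echo 'task running in child process ', getmypid(), PHP_EOL;
    return 0;
});

echo 'child exited with status ', $code, PHP_EOL;
```

The return value of pcntl_fork() is what tells the two copies apart: the child sees 0, while the parent sees the child's process ID and can use it to wait for the child.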
Cron MT first compiles a list of the installed modules that have maintenance tasks to perform. It then takes a module off of the stack and forks itself. The child process executes the maintenance operation while the parent process pulls another module from the stack, repeating the cycle. The site administrator can configure the number of processes that are allowed to run at once so as not to overload the system, but conceivably the individual tasks can be processed by separate CPUs at the same time. If one operation hangs, it will not prevent the others from running, since they are executed separately. The only job of the parent process is to dispatch tasks to its children, thus eliminating the Achilles' heel of the Drupal cron run.
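The dispatch loop described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the module's actual code: dispatch_tasks() and the sample task names are hypothetical, and the tasks are plain closures standing in for whatever each module's cron implementation does. It requires PHP 7.3 or later (for array_key_last()) with the pcntl extension on the command line.

```php
<?php
// Hypothetical dispatcher (not the module's code): run each task in its own
// child process, with at most $maxChildren children alive at any time.
// Returns an array mapping task name => child exit status.
function dispatch_tasks(array $tasks, int $maxChildren): array
{
    $running = [];  // child pid => task name
    $results = [];

    while ($tasks || $running) {
        // Fork children until the limit is reached or the stack is empty.
        while ($tasks && count($running) < $maxChildren) {
            $name = array_key_last($tasks);
            $task = array_pop($tasks);

            $pid = pcntl_fork();
            if ($pid === -1) {
                throw new RuntimeException("Could not fork for task $name");
            }
            if ($pid === 0) {
                // Child: perform one task, then exit with its return value.
                exit((int) $task());
            }
            // Parent: remember which task this child is handling.
            $running[$pid] = $name;
        }

        // Parent: block until any one child exits, freeing a slot.
        $pid = pcntl_wait($status);
        if ($pid === -1) {
            break;  // No children left to wait for.
        }
        if (isset($running[$pid])) {
            $results[$running[$pid]] = pcntl_wifexited($status)
                ? pcntl_wexitstatus($status)
                : -1;
            unset($running[$pid]);
        }
    }

    return $results;
}

// Example run with made-up tasks; the slow one does not block the others.
$results = dispatch_tasks([
    'search'     => function () { return 0; },
    'update'     => function () { return 0; },
    'slow_task'  => function () { sleep(1); return 0; },
], 2);

ksort($results);
print_r($results);
```

Because the parent only forks and waits, a task that hangs ties up a single slot instead of the whole run. A production version would presumably also need a timeout for stuck children and fresh database connections in each child, neither of which is shown here.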
Blue State Digital has proved that PHP can deliver enterprise-level scalability in the most critical environments. With the scope of PHP applications expanding, there is room for Drupal to emerge as a platform for building applications outside of the web space. The performance gained from the techniques used in Cron Multi-Threaded could allow Drupal to compete in areas currently dominated by more traditional programming languages.