July 2010 Update: Please review Charlie Arehart’s comments below about important ColdFusion 9-specific updates to this function.
Ok, so I know what you’re thinking. You’re thinking I must be crazy for suggesting that a simple built-in unique randomization function could somehow be instrumental in crashing a ColdFusion server, right? Well, before you go calling me a lunatic, let me assure you that the ColdFusion createUUID() function DOES, in fact, represent a threat to the stability of your server – and I intend to prove it to you.
Before I go too far into this topic, I should point out that this threat really only exists if you happen to be running ColdFusion on a Windows server. For those of you who are running on any sort of *NIX-based server, you can move on – nothing to see here.
If you’re still with me at this point, then I need to do a little explaining.
First off, you should be aware of a known issue with ColdFusion on Windows servers as documented in this Adobe TechNote article. This TechNote basically explains that there’s a bug with certain JVM versions under the hood of ColdFusion that results in the potential for the createUUID() function to actually increase the speed of the system clock within Windows. Now, this acceleration of the Windows system clock is only likely to occur in situations where createUUID() is “heavily utilized”. The tricky part here is that I’m not entirely certain how “heavily utilized” is classified as referenced in the TechNote. Later in this article, however, I will give you an example of a particular customer scenario that was creating this server-crashing behavior as a point of reference. For now, just note that heavy usage of the createUUID() function might cause the Windows system clock to speed ahead a matter of seconds or even minutes within the span of an hour or two depending the volume with which the function calls are made.
Now, the second vital bit of of information comes down to the inner mechanics of how the createUUID() function actually works. The function more or less generates a unique hash/GUID based on a combination of the server machine ID, a cryptographically strong random number, and the present date and time. Now, we all know that calls to the createUUID() function on the same server will all have the same machine ID input value. And even though the chance is extremely slim, fact remains there’s a fraction of a fraction of a chance that two calls to generate a “cryptographically strong random number” could result in returning the same value. That’s where the importance of the present date and time consideration made by createUUID() becomes important. If you think about how quickly many ColdFusion templates execute (quite often literally within milliseconds), you can quickly see how a series of calls to simply get the current server date and time (even if to the precision of milliseconds) might return the exact same date and time. For this very reason, the createUUID() method makes a call during its execution to another internal method – uniqueTOD(). The purpose of uniqueTOD() is to return a Unique Time Of Day – as the name suggests. But as we already discussed, it’s possible that extremely rapid calls to return the system date and time may not be separated by more than a millisecond, resulting in a return of the same date and time. So, in order to ensure that this doesn’t take place, the uniqueTOD() function has been designed as a “synchronized” method. In the Java world, that just means that the method is single-threaded – it can only be run one invocation at a time, and any attempts to invoke it while it’s already in the process of running will be blocked until execution of the running invocation completes. In addition to this, the uniqueTOD() also includes a Thread.sleep(1) statement which effectively causes the thread to sleep for 1 millisecond by designating a specific date and time at which to wake and resume execution. The combination of these two concepts ensures that dates and times returned by the uniqueTOD() function will always be separated by at least 1 millisecond.
If you need proof that this takes place, take the following bit of sample code, toss it on your server, enable server debugging output, point your browser at it, and take a look at the execution time spent.
<cfloop from="1" to="1000" index="i"> <cfset x = createUUID()> </cfloop>
When you run this code, you’ll quickly notice that the execution time takes roughly 1 second or 1,000 milliseconds in direct correlation with the to argument value of 1,000 for the cfloop. This correlates directly to the 1ms sleep time being incurred with each of the 1,000 calls within the loop to createUUID().
So now that we understand all those mechanics, it’s time for me to unveil where this might become a server killing problem. There’s a handy bit of operating system housekeeping that most of us take completely for granted these days – automatic internet-based time synchronization. Consider the following scenario with me.
- Your server starts off at 1:00:00.000pm with a date and time that are synchronized to an internet time server of some sort.
- Your ColdFusion application makes “heavy utilization” of the createUUID() function over the course of 2 hours, gradually advancing your server clock by 200 seconds or so as a result of the behavior/bugs I’ve explained above.
- Your ColdFusion server is in the middle of a series of high-volume calls to the createUUID() function.
- Your server makes a call to the configured internet time server, realizes the system clock is currently running about 200 seconds fast, and resets the system clock appropriately to 3:00:00.000pm.
- As the system clock was turned back to the proper time, a single internal call to the uniqueTOD() function had already assigned a “wake time” of 3:03:20.001pm (based on having previously been 200 seconds ahead).
In the scenario outlined above, you now have a createUUID() call waiting on an internal uniqueTOD() call that won’t actually complete execution until 3:03:20.001pm. Prior to the call to synchronize the system clock with the internet time server, this would have only been 1 millisecond away. But now that the server clock has been corrected, we’re suddenly 3 minutes, 20 seconds, and 1 millisecond away from that “wake time”. And what makes matters worse is that the waiting uniqueTOD() invocation is also synchronized. This means that all other invocations to uniqueTOD() or to any other functions that include the use of uniqueTOD() must now sit and wait until the running/waiting invocation “wakes” and completes at 3:03:20.001pm.
Wait a minute. What’s that I hear in the background? Oh, yes… that’s the sound of thread activity grinding to a halt on your production server as threads are suddenly being blocked for 3 minutes, 20 seconds and 1 millisecond until that uniqueTOD() invocation finally wakes and completes in order to generate that UUID that was requested.
I actually tracked this situation down for a Webapper customer who had been experiencing frequent yet unpredictable “mystery crashes” of their ColdFusion instances. Our customer was very technically proficient and had already looked under most of the “rocks” we would typically turn over in this sort of situation. During our initial kick-off call with the customer, we were all in consensus that this was likely a result of some sort of database transaction locking and blocking that we’d need to track down. After a little investigation and profiling on the servers in question, I realized that was simply a red herring. With the help of SeeFusion, we were able to snag stack traces from the server as the crashes were actually taking place and quickly determine that the entire system was hanging up WAY before ever getting to any database activity. In fact, the application seemed to be hung on the first line of their Application.cfm file – the <cfapplication> tag.
That’s when it hit me. The customer’s application was architected in such a way that there were very high-volume rapid-succession web service calls being made to a ColdFusion component. By high-volume rapid-succession, I specifically mean 2 or 3 calls every 50ms. That component, by design, was including the root Application.cfm file. And the <cfapplication> call was hanging because the clientmanagement argument had been set to “true” and the customer had enabled the “Use UUID for CFTOKEN” checkbox in the ColdFusion Administrator. Each web service call was resulting in the assignment of a new CFID/CFTOKEN pair – an entirely separate blog-worthy issue. Suddenly, the whole situation made perfect sense – well, at least in theory. The real test, of course, would be to have the customer uncheck the “Use UUID for CFTOKEN” setting and see if the “mystery crashes” came to an end. I’m happy to report that they haven’t had a system hang since disabling that ColdFusion setting.
So, if you have any code in your server that might be creating opportunities for high-volume utilization of the createUUID() function, then you just might be one internet clock synchronization cycle away from a complete ColdFusion meltdown.