How I almost crashed the Adobe MAX 2008 Keynote with Tour de Flex
For the past 3 months, several of us have been working crazy hours to get Tour de Flex ready for Adobe MAX. Kevin Lynch, Adobe’s CTO, told us a week before MAX that he was going to use Tour de Flex in his Monday morning keynote to demonstrate several of the cloud APIs now available to developers. This was exciting news to us, so we put in a good last sprint and had the product ready for its prime-time debut on the 20 ft screen at MAX in front of over 5,000 people. I was very confident that things were ready to go.
On Monday morning, two hours before Kevin’s keynote, I was having breakfast with some guys from Avoka, an Adobe partner, and decided to give them a quick demo of Tour de Flex. I quickly noticed that all of the samples requiring an Internet connection were indicating that they were offline, including the cloud API samples! After confirming that I did indeed have a working Internet connection, I immediately turned my attention to the instance of Apache running on the Tour de Flex server and as I feared, I found it had crashed with the following error (timestamps are EST):
Apache had crashed, rendering many samples in Tour de Flex offline! There were a few comments on my blog article and emails from two others reporting the problem. As an experiment, I raised ThreadsPerChild to 80 (default is 64 on Windows) and restarted Apache. Within three minutes, it crashed again! I wasn’t comfortable simply raising this number randomly until we figured out exactly what was going on.
Tour de Flex background: Tour de Flex is a desktop application that runs on Adobe AIR and provides 217 sample Flex applications to show off various components and techniques. Many of these samples are remotely hosted and require an Internet connection. There is a URLMonitor in the app that polls our server every two seconds, and if the URLMonitor finds the URL unreachable, the app goes into offline mode and all remote samples display a nice message that reads, “An Internet connection is required”.
Earlier in the day, at 12:01am, Tour de Flex went live on flex.org/tour and the Flex Developer Center. A few of us evangelists blogged about it which resulted in a fairly significant number of early morning downloads. The resulting URLMonitors running all over the world caused the crash.
So, there I sat in the Marriott restaurant feeling a sense of panic fall over me. My mind raced… Could I fix this before Kevin’s keynote? Would another adjustment to the suggested ThreadsPerChild setting suffice or would it only delay a repeat crash? Should I try to change the code and rush backstage with a special build? How long should I work on this before I make the dreaded abort phone call? I was suddenly not hungry anymore. It’s times like this that you have to take a deep breath, take a step back and analyze the problem. Otherwise, you’ll usually just make it worse.
As you can imagine, breakfast quickly turned into a triage session. Everyone at the table offered suggestions as we quickly read through various articles on the web that referred to the error message. A few minutes later, James Ward walked up and started helping us as well. At this point, I decided that James and I should go find a better internet connection and try to think this thing through rather than randomly changing parameters.
During the elevator ride up to my room, James and I started discussing the error. I told James about the URLMonitor and he immediately suspected it was a keep-alive-related issue. He asked what the default KeepAliveTimeout was on our Apache server. KeepAliveTimeout is the number of seconds Apache will wait for a subsequent request before closing the connection. We quickly searched the docs and found that the default is five seconds. The URLMonitor was polling every two seconds! So, basically, every running instance of Tour de Flex was maintaining a dedicated connection to the server. It was doomed to crash!
We quickly changed the KeepAliveTimeout to one second and increased ThreadsPerChild to 512 (no analysis went into this… we just wanted to give ourselves some good padding). The Apache server has been running ever since.
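For anyone who wants to see the arithmetic behind this, here is a rough sketch in Python. It is a simplified model, not something we ran at the time: it assumes every client reuses its keep-alive connection, that each connection ties up one worker thread until the timeout expires, and that the time spent actually serving a poll is negligible. The function name `max_polling_clients` and its parameters are made up for illustration.

```python
import math


def max_polling_clients(threads_per_child, keepalive_timeout_s, poll_interval_s):
    """Estimate how many polling clients a fixed pool of worker threads can carry.

    Simplified model (an assumption, not measured behavior): after each poll,
    the connection occupies one thread for up to keepalive_timeout_s seconds.
    """
    if keepalive_timeout_s <= 0 or poll_interval_s <= 0:
        raise ValueError("timeout and interval must be positive")

    # If the next poll arrives before the keep-alive timeout expires, the
    # connection never closes, so each client holds one thread permanently.
    if poll_interval_s <= keepalive_timeout_s:
        return threads_per_child

    # Otherwise a thread is busy for keepalive_timeout_s out of every
    # poll_interval_s seconds, so it can be shared across several clients.
    return math.floor(threads_per_child * poll_interval_s / keepalive_timeout_s)


# Settings at the time of the crash: every client pins a thread.
print(max_polling_clients(64, 5, 2))     # 64
# My first experiment only bought room for 16 more clients.
print(max_polling_clients(80, 5, 2))     # 80
# The fix: a timeout shorter than the polling interval, plus more threads.
print(max_polling_clients(512, 1, 2))    # 1024
# The same server settings with a 15-second polling interval.
print(max_polling_clients(512, 1, 15))   # 7680
```

Under these assumptions, raising ThreadsPerChild by itself only moves the ceiling by the number of threads added, which fits with how quickly the second crash came. Getting the timeout below the polling interval is what lets threads be shared between clients at all.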
During Kevin Lynch’s keynote, I was constantly hitting the Apache server with my iPhone to confirm that it was still up. I was also ready to jump into an awesome little iPhone application called WinAdmin (Windows remote desktop for iPhone) and sneak in a restart of Apache if needed. I’m glad it wasn’t needed, although I have to admit, it would have been fun to write an article titled, “How I saved the day with my iPhone”!
Here’s a little video I shot of the few minutes that Kevin used Tour de Flex:
Tour de Flex performed flawlessly during the keynote. I finally started breathing again.
In retrospect, here are the stupid things I did:
- Putting the application in an untested scenario before the keynote. I should have insisted that we not launch until immediately after the keynote.
- Setting up a URLMonitor that polls too often. We will most likely change this to 15-30 seconds in the next release.
- Not reviewing the Apache default settings more closely… although I’m not sure I would have ever anticipated this specific problem.
Thanks to the guys at Avoka for helping out during breakfast. I owe you guys another breakfast!
Thanks to James Ward for having the incredible instincts to suggest checking the “Keep Alive Timeout”. This is what truly led to a resolution.
Thanks to Apache for providing useful error messages with suggested changes.
Finally, a big thank you to Murphy’s Law for reminding me how real you really are!