Why is Windows Azure Deployment Slow Compared to Heroku?

Saturday, January 28 2012

UPDATE [6/9/2012]: New features are available in Windows Azure that change things. Check out my new post on Windows Azure Web Sites

Earlier this month I got an email from a friend of mine who was trying out the Tutorial of running Node.js on Windows Azure. He had gotten to the point where he was deploying the code to Azure using the PowerShell command and was surprised to see it taking several minutes (like about 10). Having been using Heroku for some of his other work he thought he might be doing something wrong since deployments to Heroku takes only a few seconds ("with typing" as he put it).

So, why does it take so long to deploy to Windows Azure when Heroku only takes seconds? It has to do with the differences between the two platforms. Both Heroku and Windows Azure are Platform as a Service (PaaS) providers, meaning they abstract a large portion of their infrastructure away from you so that all you need to focus on is the application code and data, but the two platforms have different levels of abstraction.


Let's start with Heroku. First off, I'm not a Heroku expert. What follows is my understanding of the Heroku platform and how it works based on some conversations I've had, a presentation I attended by James Ward, research from their site and other articles. If I get something wrong, please let me know in the comments. Also, cloud computing platforms advance and change at an amazing pace, so what's written in this post is as of the time of it's posting. If you are reading this six months after it's posted the landscape may look a lot different.

The Heroku platform runs on Linux machines that have their resources partitioned by the open source resource partitioning layer LCX. This partitioning scheme breaks up the machines into what Heroku has termed "dynos". A dyno is a compute unit and the unit in which you are billed. You can think of them as "instances" of your application. They are completely isolated from one another so as to ensure that there is no sharing of data between the different dynos that may be running on the same physical server. A dyno has a set of resources dedicated to it from the physical server, so it gets a slice of CPU cycles, the disk subsystem, network IO, etc.
 
When you deploy an application to Heroku it gets packaged into what is called a "slug" (you can upload pure code to Heroku and let the platform compile it for you, or you can upload compiled bits, but either way a slug gets created and this is the internal package that Heroku uses to deploy to dynos). When a slug is deployed to a dyno the Heroku Dyno Manifold (the controlling software of all deployments and resourcing in the Heroku environment) finds a physical server, or servers if you are deploying multiple instances, that has available dynos and it drops the slug into it/them. Basically, they are copying your slug (deployable code) onto a machine that is already running and calling a process to start up. This is obviously pretty quick because the time is mostly in copying the code and the process start up. This is why you see average deployment times in the seconds for Heroku.
 
Now let's look at how Windows Azure works. The deployment unit to Windows Azure is a package file, which contains your compiled code and/or files. This is basically a zip file and probably not too much unlike the slug file used by Heroku. When you perform a deployment the package file is uploaded and the Windows Azure Fabric Controllers (the equivalent of the Dyno Manifold in Heroku) inspects the service definition file included in the package to determine the compute resources the deployment needs. The Fabric Controllers then select a physical server, or servers if you are deploying multiple instances, that has space on it for the deployment. A full virtual machine is brought up on the target physical servers and then your code is deployed to that VM as well. So, unlike the dynos in Heroku the virtual machines in Windows Azure aren't already fully running just ready for code (the host machines likely are). This is the case because as you are assigned a virtual machine in Windows Azure rather than just a process and you have a lot more control over that virtual machine. For example, you may run start up commands to configure the IIS server with something non-standard, or you may need to register 3rd party components before the system responds to traffic.  The abstraction Windows Azure provides starts at the Virutal Machine layer, where for Heroku the abstraction is down to the process.
 
In short, you could compare the differences between Heroku and Windows Azure deployments by thinking of Heroku as copying code to an already running machine and starting a process vs. spinning up a full virtual machine from a stopped state and then copying the code. Obviously, one is going to take longer than the other. This is why my friend was seeing a deployment to Windows Azure taking about 10 minutes. Either way, this is still faster than you can procure a server for deployment in a non-cloud environment.
 
So, if you were just looking at deployment times between the platforms Heroku wins hands down, but just as with all choices in cloud computing you have to be aware of what the differences are in order to make the right decision for your application. The approach that Heroku has taken is on the higher end of the PaaS model where they are abstracting the idea of Virtual Machines and servers as a whole away from you. Windows Azure abstracts away the infrastructure, but still provides you a lot more control over the virtual machine your application code is running on. So, for our discussion here, we are trading speed for control. Just like if you decided to use a full on Infrastructure as a Service (IaaS) provider (like EC2 instances from Amazon which Heroku actually runs on) instead of a PaaS provider you are trading less configuration and infrastructure concerns in the PaaS world for a LOT more control over your environment and infrastructure in the IaaS world.
 
My friend said he spent a few hours trying to research why there was such a discrepancy in deployment times between the platforms Hopefully this post will save others time in the future.
 

UPDATE: Added presentation by James Ward to my list of Heroku research. Thanks for coming to Cincy!