The slides and video of my talk at the Open Networking Summit 2013 are up.
You will need to register to access the archive.
I talked on Wednesday in the SDN for Cloud Datacenters section. Don't miss Bruce Davie's presentation too.
I've posted my slides in PDF here.
It is that time of the year. I have converted this blog to a static one based on Nikola. Hopefully, I will find the time and constancy to keep it updated.
eBay is starting to talk a bit more about our cloud infrastructure. Here is the first talk, still high level, describing why we are building a cloud and what issues we had to solve to change the eBay infrastructure to implement cloud properties.
One of the common issues in cloud computing, and more generally in the management of distributed systems, is the delivery of a large number of potentially very large files to a very large number of devices for the purpose of OS provisioning, software release, or just plain content distribution. To give you a sense of the scale, we are talking about thousands of small files (under 1 MB) distributed to thousands of devices, or hundreds of very large files (up to 50 GB) distributed to hundreds of devices. All of that across multiple datacenters.
One approach is to use a peer-to-peer file transfer protocol, like BitTorrent, to distribute content more efficiently to this large number of clients. Efficiency in our context means reducing the time required to propagate the files and limiting the peak network bandwidth usage.
The following are examples of the utilization of BitTorrent applied to datacenter automation:
- Twitter recently released an open source version of Murder, the BitTorrent-based system they use for release management. More information is available in that blog entry.
- SystemImager supports a BitTorrent-based transport option. A benchmark published on the SystemImager site shows that almost 1200 servers could be provisioned in 15 minutes.
- The Rocks Avalanche installer uses BitTorrent to distribute packages to cluster nodes. It achieves almost identical install times for 1 node and for 128 nodes (12 min vs. 15 min) while implementing several throttling techniques.
- The CERN VM Kiosk is backed by BitTorrent and SCP (as seen in this presentation). As a side note, CERN is also developing an image distribution tool based on SCP, SCP Tsunami, borrowing some of the BitTorrent properties.
All of that is very well, but the deployment of BitTorrent in datacenters needs to be done carefully. The following points have to be considered:
- The BitTorrent protocol relies on a tracker service to maintain the list of peers. Clients need access to this service at the start of the download, which makes the tracker a potential single point of failure.
- Datacenter topology has to be taken into consideration in order to optimize the bandwidth usage going across the core network layer and across datacenters.
- Initialization of the transfer requires the creation of a seed. The location and number of initial seeds are critical to ensure the best efficiency.
This first post will focus on the first problem, the availability of the tracker. Other posts will address the two remaining problems.
There are basically two strategies to address the dependency of BitTorrent clients on the tracker. The first one is to simply use the trackerless mode, relying on a distributed hash table (such as Kademlia), hence the name BitTorrent DHT. The second one is to use multiple trackers, either by simply using multi-tracker torrents or by implementing different distribution techniques.
As far as we know, Murder, SystemImager and Rocks do not use BitTorrent DHT. This is however something that should be explored in this specific use case, as the distance calculation in DHT could be modified to be topology aware, as discussed in this paper or this presentation (by default, distance in key space does not represent network or geographical distance). This would help solve the topology challenge mentioned earlier in this post.
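To illustrate the parenthetical remark about key-space distance, here is a minimal Python sketch of the XOR metric that Kademlia uses to compare node identifiers. The IP addresses and the way identifiers are derived from them (a SHA-1 hash of the address string) are assumptions made for the example; real clients generate their IDs differently. The sketch only shows that nothing in the metric knows about racks or sites.

```python
import hashlib

def node_id(label: str) -> int:
    """Derive a 160-bit identifier from a label (illustrative assumption)."""
    return int.from_bytes(hashlib.sha1(label.encode()).digest(), "big")

def xor_distance(a: int, b: int) -> int:
    """Kademlia distance: bitwise XOR of the two IDs, read as an integer."""
    return a ^ b

# Hypothetical peers: two in the same rack of site 1, one in site 2.
peers = {
    "10.1.1.10": node_id("10.1.1.10"),
    "10.1.1.11": node_id("10.1.1.11"),
    "10.2.7.42": node_id("10.2.7.42"),
}

origin = "10.1.1.10"
for addr, nid in peers.items():
    if addr != origin:
        d = xor_distance(peers[origin], nid)
        # The number of significant bits is a rough measure of "how far".
        print(addr, "distance bits:", d.bit_length())
```

A topology-aware variant would have to replace either the identifier assignment or the distance function so that peers sharing a network domain end up close to each other.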
Assuming that a tracker will be used, if just for priming the swarm, we need to explore the distribution options. We can consider two flavors: the first one creates a partition of the peer space, the second one creates a virtually centralized tracker, or HA tracker.
Our use case for BitTorrent is a bit different from the best-known one, namely distributing legal or illegal files to the Internet population at large. In our case, the partitioning of the swarm is an interesting property, as it could be used to contain traffic within a network domain, one of the other problems we have to address. Let's explore how this would work.
A naive placement for the BitTorrent tracker can be described as follows:
The tracker is connected at the distribution layer level, like other infrastructure components would be. In this scenario, the clients will be configured with a single tracker. Clients in Site 1 and 2 will contact this single tracker, creating a single swarm.
This setup is problematic in multiple ways. In addition to the fact that this tracker is now a single point of failure, the Site 1 clients may try to get files from peers not only in Site 1 but also in Site 2, creating traffic at the core network layer and across datacenters. This is not unlike the inter-ISP traffic generated by BitTorrent.
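A rough way to size the problem is to estimate how many of the peers handed to a client sit in the other site. The sketch below assumes that the tracker picks peers uniformly at random from the whole swarm, which is an assumption about tracker behavior and not something measured here; the swarm sizes are made up.

```python
def remote_peer_fraction(local_peers: int, remote_peers: int) -> float:
    """Expected share of remote peers seen by one local client.

    Assumes uniform random peer selection by a single shared tracker.
    """
    others = (local_peers - 1) + remote_peers  # the client does not get itself
    return remote_peers / others if others else 0.0

# Hypothetical swarm: 500 clients in Site 1, 500 in Site 2.
print(remote_peer_fraction(500, 500))
```

Under that assumption, a balanced two-site swarm sends roughly half of each client's connections across the core.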
A better setup would be, at a minimum, to deploy another tracker in Site 2 and to have different torrent configurations for the two sites, with the primary tracker being the one in the datacenter where the torrents are published.
This would mean, for example, a configuration like this:
Site 1 torrents : d['announce-list'] = [ [tracker1-s1], [tracker1-s2] ]
Site 2 torrents : d['announce-list'] = [ [tracker1-s2], [tracker1-s1] ]
With this setup, clients would first try the primary in their datacenter, and if it does not respond, try the one in the other datacenter.
Since it's unlikely that the two trackers will be kept in sync, this would create split swarms, each sub swarm with the clients from a given datacenter.
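The per-site configuration above can be generated rather than maintained by hand. The following Python sketch builds the announce-list tiers for a given site, putting the local trackers in the first tier. The tracker URLs, the port and the site names are hypothetical, and writing the result into an actual torrent file (bencoding) is left out.

```python
def announce_list(local_site: str, sites: dict[str, list[str]]) -> list[list[str]]:
    """Return tiers with the local site's trackers first, then the other sites."""
    tiers = [list(sites[local_site])]
    tiers += [list(urls) for name, urls in sites.items() if name != local_site]
    return tiers

# Hypothetical tracker URLs, one tracker per site.
SITES = {
    "s1": ["http://tracker1-s1.example.com:6969/announce"],
    "s2": ["http://tracker1-s2.example.com:6969/announce"],
}

for site in SITES:
    print(site, announce_list(site, SITES))
```

Each tier is a separate inner list, so a client only falls back to the remote tracker when the local tier fails.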
If there were a large number of nodes within each distribution network, one could envision having one tracker per distribution "bubble". The resulting topology would look like the following:
Even though we have seen that distributing trackers may be desirable, each distributed instance should still be as available as possible. There does not seem to be a standard defining tracker clustering or synchronization, but some trackers implement their own, like opentracker, which uses UDP multicast between members of the cluster. Then, in order to balance the load between members of the cluster, the multi-tracker configuration should be used, but with the multiple trackers specified in the same tier. If each tracker in the above topology were deployed in a cluster, we would have this kind of configuration in the torrent files, assuming a cluster of 2 nodes:
Site 1 / distribution 1 torrents : d['announce-list'] = [ [tracker1-s1.ds1, tracker2-s1.ds1], [tracker1-s2.ds1, tracker2-s2.ds1] ]
Clients will shuffle the list and try trackers one after the other, while keeping track of successes and failures in order to keep the same order for subsequent requests. This behavior is specific to BitTornado, and the actual behavior of the selected client will have to be verified to avoid unexpected cross-datacenter traffic or an imbalance in the usage of the members of the cluster.
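The client behavior described above can be sketched in a few lines of Python. This is an illustration of the selection logic only, not BitTornado's code: `try_announce` is a hypothetical callable standing in for the real tracker request, and the tracker names come from the example configuration.

```python
import random
from typing import Callable, Optional

def prepare(tiers: list[list[str]], rng: random.Random) -> list[list[str]]:
    """Shuffle the trackers inside each tier once, when the torrent is loaded."""
    shuffled = [list(tier) for tier in tiers]
    for tier in shuffled:
        rng.shuffle(tier)
    return shuffled

def announce(tiers: list[list[str]],
             try_announce: Callable[[str], bool]) -> Optional[str]:
    """Try tiers in order; promote the first tracker that answers within its tier."""
    for tier in tiers:
        for i, url in enumerate(tier):
            if try_announce(url):
                tier.insert(0, tier.pop(i))  # remember the success for next time
                return url
    return None  # every tracker in every tier failed

tiers = prepare(
    [["tracker1-s1.ds1", "tracker2-s1.ds1"],
     ["tracker1-s2.ds1", "tracker2-s2.ds1"]],
    random.Random(0),
)
down = {"tracker1-s1.ds1"}  # simulate one failed cluster member
print(announce(tiers, lambda url: url not in down))
```

Because the shuffle happens inside a tier, load is spread across the members of the local cluster, and the second tier is only reached when the whole local cluster is unavailable.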
Because of the characteristics of each of the techniques mentioned in this section, it is likely that a combination of partitioning, clustering and even DHT will have to be used. While not applied to the same problem space, this paper reaches a similar conclusion. The same paper also notes that tracker availability is a real issue, one that should be treated with care when deploying BitTorrent for mission-critical use cases.
A paper titled "Hey, You, Get Off of My Cloud: Exploring Information Leakage in Third-Party Compute Clouds", soon to be released at CCS'09, explores the threats resulting from sharing physical compute resources in public clouds. After demonstrating that, despite the likely large number of physical machines in any given public cloud, it is possible to place hostile VMs next to targeted VMs, the authors list methods that take advantage of information leaking out through shared physical resources.
The paper concludes that the only foolproof solution is to limit sharing with potentially hostile tenants:
A user might insist on using physical machines populated only with their own VMs and, in exchange, bear the opportunity costs of leaving some of these machines under-utilized. For an optimal assignment policy, this additional overhead should never need to exceed the cost of a single physical machine, so large users — consuming the cycles of many servers — would incur only minor penalties as a fraction of their total cost.
Regardless, we believe such an option is the only foolproof solution to this problem and thus is likely to be demanded by customers with strong privacy requirements.
I have one issue with this recommendation: colocating many VMs from the same tenant on fewer physical hosts increases the risk of creating single points of failure. Assuming 8 small instances per physical machine (based on the paper's estimates), and given the default limit of 20 active VMs per account, most accounts will need at most 3 physical servers, which limits the spread across availability zones. At that point, the tradeoff is between availability, security and cost.
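The bound can be checked quickly: 20 / 8 = 2.5, which rounds up to 3 dedicated hosts for an account at the limit. The short Python sketch below does the same arithmetic and also shows the share of an account's VMs lost when one fully packed host fails. The figures are the ones quoted above; the function names are only for illustration.

```python
import math

def hosts_needed(vm_count: int, vms_per_host: int) -> int:
    """Minimum number of dedicated hosts for a tenant."""
    return math.ceil(vm_count / vms_per_host)

def worst_case_loss(vm_count: int, vms_per_host: int) -> float:
    """Share of the tenant's VMs lost if one fully packed host fails."""
    return min(vm_count, vms_per_host) / vm_count

print(hosts_needed(20, 8))     # 3
print(worst_case_loss(20, 8))  # 0.4
```

In other words, a tenant at the default limit who insists on dedicated hosts can lose up to 40% of its instances with a single hardware failure.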
As I mentioned in a previous post, I've recently upgraded a PC to the latest OpenSolaris release and had to port some of the applications over. One of these is the fast and efficient rtorrent client. I did not find recent packages in the repositories and had to compile it myself. I found that a future version of OpenSolaris may have the client integrated, and a case for the SFW consolidation was recently submitted by Huawei Zhang with all the required patches included.
The first step in the install is to make sure that the development environment is set up correctly. From the base OpenSolaris, I installed the following:
$ pfexec pkg install SUNWncurses
$ pfexec pkg install SUNWcurl
$ pfexec pkg install SUNWgnome-common-devel
$ pfexec pkg install SUNWgmake
$ pfexec pkg install SUNWgcc
$ pfexec pkg install SUNWgnu-automake-110
$ pfexec pkg install SUNWlibtool
$ pfexec pkg install SUNWaconf
The next step is to install libsigc++ 2.0, which is required by rlibtorrent. Your mileage may vary, but I had better luck using gmake for all the builds. Note: you will find the library in the repositories, but I had compilation issues with it and had to build it myself.
$ wget http://ftp.gnome.org/pub/GNOME/sources/libsigc++/2.0/libsigc++-2.0.18.tar.gz
$ gzip -dc libsigc++-2.0.18.tar.gz | tar xvf -
$ cd libsigc++-2.0.18
$ ./configure
$ gmake
$ pfexec gmake install
If you do not change the default location, the libsigc++ library should be installed under /usr/local.
Setting the following will help later when building rlibtorrent and rtorrent itself.
$ export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:/usr/lib/pkgconfig
Next, on to installing rlibtorrent. This is where I would recommend that you take the version submitted to SFW, with the associated patches.
$ wget http://cr.opensolaris.org/~alz/bittorrent/raw_files/new/usr/src/lib/libtorrent/libtorrent-0.12.2.tar.gz
$ gzip -dc libtorrent-0.12.2.tar.gz | tar xvf -
$ mkdir patches
$ cd patches
$ wget -r -l1 -nd -A.diff http://cr.opensolaris.org/~alz/bittorrent/raw_files/new/usr/src/lib/libtorrent/patches/
$ cd ../libtorrent-0.12.2
$ cat ../patches/rlibtorrent-* | gpatch -p1
The following is required because some .am files were modified by the patching process.
$ aclocal-1.10 -I./scripts -I.
$ libtoolize --automake --copy --force
$ ./configure --enable-shared --disable-static --with-ports --disable-libtool-lock
$ pfexec gmake install
The same principles apply to finally build the rtorrent client.
$ wget http://cr.opensolaris.org/~alz/bittorrent/raw_files/new/usr/src/cmd/rtorrent/rtorrent-0.8.2.tar.gz
$ gzip -dc rtorrent-0.8.2.tar.gz | tar xvf -
$ cd patches
$ wget -r -l1 -nd -A.diff http://cr.opensolaris.org/~alz/bittorrent/raw_files/new/usr/src/cmd/rtorrent/patches/
$ cd ../rtorrent-0.8.2
$ cat ../patches/rtorrent-0* | gpatch -p1
$ export LDFLAGS='-Wl,-zignore -Wl,-zcombreloc -Wl,-Bdirect -L/usr/sfw/lib -R/usr/sfw/lib -L/usr/gnu/lib -R/usr/gnu/lib -L/usr/lib/'
$ export CXXFLAGS=-I/usr/include/ncurses
$ aclocal-1.10 -I./scripts -I.
$ libtoolize --automake --copy --force
$ ./configure
$ pfexec gmake install
There you have it. Hopefully I did not forget any step or make mistakes while capturing the commands, but you should have enough of a base to start from and successfully build rtorrent. Do not hesitate to post a comment with your experience.
Updated on 09/07/2009 to add SUNWlibtool that I forgot. Thanks to Gustavo for pointing it out.
Now that the shipping date for Snow Leopard is approaching, I have come to the realization that I will not get rid of my Solaris-based NAS. It's been running flawlessly for the past 2 or 3 years (well, I lost several disks and a controller, but never lost any data), but I was hoping to consolidate everything on a Mac Pro with 8 cores. Since ZFS is not going to be there, apparently not until the next major release, I will likely upgrade my PC to a better setup in order to keep running ZFS. Even the open source effort on Mac OS Forge seems to be going nowhere ...
In the meantime, I just upgraded my box to OpenSolaris 2009.06 and spent some time compiling the tools that I needed on the box. More on that later.
The Puppet community has split
There was a split in the Puppet community, and a new project was born as a result: Chef. Chef describes itself as follows:
Chef is a systems integration framework, built to bring the benefits of configuration management to your entire infrastructure. With Chef, you can:
- Manage your servers by writing code, not by running commands. (via Cookbooks)
- Integrate tightly with your applications, databases, LDAP directories, and more. (via Libraries)
- Easily configure applications that require knowledge about your entire infrastructure ("What systems are running my application?" "What is the current master database server?")
More details about the Chef differentiators can be found here.
In a future post, I'll explore in more detail the challenges around configuration automation, and the procedural approach.
Reductive Labs received funding
Reductive Labs, the company responsible for Puppet, has received $2 million in funding. Puppet has been gaining traction against cfengine, but it will be interesting to see how Reductive Labs uses its funding, and how the new Chef solution affects this progression.
Cloud Computing brought configuration automation into the spotlight
One of the cornerstones of Cloud Computing is the automation of the infrastructure configuration, whether you want to build a highly automated infrastructure supporting cloud users or you are putting your application in the cloud. In both cases, infrastructure and application configuration has to be captured, maintained and automatically provisioned. This enables rapid scale out, failover and, more generally, deployment and redeployment of the managed components.
After a very long hiatus, I feel like I have more to share, and decided to do some summer cleanup on this blog. This post is the first in a long time, and many things have changed:
- I am changing hosting. I was squatting in a friend's colocation and need to move out. I want to thank Fred for all the resources I used and everything I've learned in the process.
- I'm switching from Pebble to Wordpress. I really liked Pebble, it served me well, I learned a lot, but Java hosting is harder to find. By the way, I admire Pebble's author, Simon Brown. If you have a chance, stop by his site. I wish I could move to Jersey too.
- I've changed employers. I'm now working at eBay, as an architect responsible for the datacenter automation area and the implementation of cloud properties on the site. More on that later.
- I lost weight, a lot of it (50 pounds), and I've rowed 2 half marathons (indoor), and recently ran one (outdoor). Occasionally, I'll post about this here too.
I think that's it. I hope to be able to post more than I did in the past two years, at least before going through more changes.
In a previous post, I mentioned the announcement of Sun's Ops Center product, targeted at the management of virtual environments. In that post, I said that Ops Center was a re-branded N1 System Manager. In fact, it seems to be a merger of Sun Connection and N1 System Manager into one tool:
A highly scalable datacenter automation tool merging discover, update, provisioning, monitoring, and reporting technologies from Sun Connection and N1SM into one tool.
However, judging from the Oracle World demo, the UI seems radically different from that of the N1 System Manager (is the embedded CLI gone?).
Also, judging from the supported platforms, it seems that Windows is not supported anymore:
From a centralized management console, customers can provision Solaris, Linux, and Windows with a simple drag-and-drop, and monitors the health of systems in an efficient manner.
The comprehensive, highly scalable Linux and Solaris life cycle management tool.
The good news is that Sun Ops Center will be delivered as open source too:
Building on Sun's commitment to open standards and customer choice, Sun will continue to innovate the Sun xVM platform and collaborate with open source communities. The first of Sun's contributions will be the Common Agent Container (CAC) code to the OpenxVM.org community under GPLv3. The CAC is the heart of the management infrastructure for many of Sun's products, including the Sun xVM Ops Center. In addition, Sun plans to make the entire code base used by Sun xVM Ops Center available to the OpenxVM.org community in the first quarter of 2008.
It's not clear however if this means the end of life for the N1 System Manager, since right now, the Ops Center does not provide a complete replacement.