Large Scale file distribution in clouds using BitTorrent – part 1

One of the common issues in cloud computing, and more generally in the management of distributed systems, is the delivery of large number of potentially very large files to a very large number of devices for the purpose of OS provisioning, software release, or just plain content distribution. To give you a sense of the scale, we are talking about thousands of small files (under 1Mb)  distributed to thousands of devices, or hundreds of very large files (up to 50Gb) distributed to hundreds of devices. All of that across multiple datacenters.

An approach is to use a peer to peer file transfer protocol, like BitTorrent,  to more efficiently distribute content to this large number of clients. Efficiency in our context is about reducing the time required to propagate the files, and limiting the peak network bandwidth usage.

The following are examples of the utilization of BitTorrent applied to datacenter automation:

  • Twitter recently released an open source version of the bittorrent based system they use for release management, called Murder, more information is available in that blog entry.
  • SystemImager supports a BitTorrent based transport option. A benchmark published on the SystemImager site shows that almost 1200 servers could be provisioned in 15mn.
  • The Rocks Avalanche installer uses BitTorrent to distribute packages to cluster nodes. The Avalanche Installer allows almost identical install time for 1 node compared to 128 (12min vs. 15min) while implementing several throttling techniques.
  • The CERN VM Kiosk is backed by BitTorrent and SCP (as seen in this presentation). As a side note, the CERN is also developing an image distribution based on SCP,  SCP Tsunami, borrowing some of the BitTorrent properties.

All of that is very well, but the deployment of BitTorrent in datacenters need to be done carefully. The following points have to be considered:

  • BitTorrent protocol relies on a tracker service to maintain the list of peers. Clients will have to have access to this service at the start of the download. This tracker service has the potential of being a single point of failure.
  • Datacenter topology has to be taken into consideration in order to optimize the bandwidth usage going across the core network layer and across datacenters.
  • Initialization of the transfer requires the creation of a Seed. The location and number of initial seeds is critical to ensured the best efficiency.

This first post will focus on the first problem, the availability of the tracker. Other posts will address the two remaining problems.

Tracker Availability

There are basically two strategies to address the dependency of BitTorrent clients to the tracker. The first one, is to simply use the trackerless mode, relying on distributed hash tables (i.e. Kademlia), hence the name BitTorrent DHT. The second one is to use multiple trackers, either by simply using multi-tracker torrents or by implementing different distribution techniques .

As far as we know, murder, SystemImager or Rocks do not use BitTorrent DHT. This is however something that should be explored in this specific use case as the distance calculation in DHT could be modified to be topology aware as discussed in this paper or this presentation (distance calculation in key space is not  a representation of network or geographical distance). This would help solving the other challenge mentioned in this post.

Assuming that a tracker will be used, if just for priming the swarm, we need to explore the distribution options. We can consider two flavors: the first one creates a partition of the peer space, the second one creates a virtually centralized tracker, or HA tracker.

Tracker distribution

Our use case for BitTorrent is a bit different than the most notorious one, namely distributing legal or illegal files to internet population at large. In our case, the partitioning of the swarm is an interesting property as it could be used to contain traffic within a network domain, one of the other problems we have to address. Let’s explore how this would work.

A naive placement for the BitTorrent tracker can be described as follow:

BitTorrent scenario 1

The tracker is connected at the distribution layer level, like other infrastructure components would be. In this scenario, the clients will be configured with a single tracker. Clients in Site 1 and 2 will contact this single tracker, creating a single swarm.

This setup is problematic in multiple ways. In addition to the fact that this tracker is now a single point of failure, the Site 1 clients may potentially try to get files from peers not only in Site 1, but also in Site 2, creating traffic at the core network layer, and also cross datacenters. This is not unlike the inter ISP traffic generated by BitTorrent.

A better setup would be, at the minimum to deploy another tracker in Site 2, and have different torrent configurations for both sites, with the primary tracker being the one in the datacenter where the torrents are published.

This would mean for example a configuration like this :

Site 1 torrents : d['announce-list'] = [ [tracker1-s1], [tracker1-s2] ]
Site 2 torrents : d['announce-list'] = [ [tracker1-s2], [tracker1-s1] ]

With this setup, clients would first try the primary in their datacenter, and if it does not respond, try the one in the other datacenter.

Since it’s unlikely that the two trackers will be kept in sync, this would create split swarms, each sub swarm with the clients from a given datacenter.

If there were a large number of nodes within each distribution network, it could be envisioned to have one tracker per distribution “bubble”. The resulting topology would be like the following :

BitTorrent Distributed trackers

HA Tracker

Even though we have seen that distribution of trackers may be desirable, it remains that each distributed instance should be as available as possible. It does not seem that there is a standard defining tracker clustering or synchronization, but some tracker will implement one, like opentracker, which uses UDP multicast between members of the cluster. Then, in order to balance the load between members of the clusters, the multi-tracker configuration should be used, but with the multiple trackers specified in the same tier. If each tracker in the above topology was deployed in a cluster, we would have this kind of configuration in the torrent files, assuming a cluster of 2 nodes:

Site 1 / distribution 1 torrents : d['announce-list'] = [ [tracker1-s1.ds1, tracker2-s1.ds1], [tracker1-s2.ds1, tracker2.s2.ds1] ]

Clients will shuffle the list, try trackers one after the other while keeping track of success and failure to keep the same order for subsequent requests. This behavior is specific to BitTornado, and actual behavior of the selected client will have to be verified to avoid unexpected cross datacenter traffic or unbalance in the usage of the members of the cluster.

Conclusion

Because of the characteristics of each of the techniques mentioned in this section, it will be likely that a combination of partitioning, clustering and even DHT will have to be used. While not applied to the same problem space, this paper, is reaching a similar conclusion. Also, from the same paper, it is interesting to note that tracker availability is a real issue and should treated with care in the deployment of BitTorrent  for mission critical use cases.

Tags: , ,

  1. bmullan’s avatar

    This is a great topic. I’ve been following the work at twitter in use of bittorrent/cap/murder for provisioning and its a very interesting idea and seems a good application of the torrent concepts.

    I’ve been hoping to see someone implement fabric as an alternative to capistrano so the whole solution could be python based.

    I look forward to you next article on this.