The Growing Diversity inside OpenStack Object Storage

October 9th, 2013 — 4:20am

Of all OpenStack projects, Object Storage (also known as Swift) has always been considered mature or, in other words, a place where new things rarely happen. I’ve been watching the Object Storage project closely, and I’m happy to report that a lot of exciting things are happening in Swift, specifically around community participation and a growing ecosystem.

The total number of contributors to OpenStack Object Storage has reached 136, with as many as 16 different people committing code in a single week of July 2013. Of those, 64 have participated in the Havana cycle, 30 of whom are new contributors to Swift. The charts show a strong upward trend in total authors per week, in the number of different people filing new bugs (the Closers/Openers chart), and in the variety of people filing, triaging, prioritizing, and fixing bugs (the Changers chart). The top contributors (by patch count) come from 6 different companies: SwiftStack, Red Hat, Rackspace, UnitedStack, IBM, and eNovance.

Features are also growing: in Havana we’ll get global clusters. This allows deployers to build a single Swift storage system that spans a wide geographic area. For example, a deployer can build a Swift storage cluster that keeps different replicas in different regions, either for DR or for low-latency regional access. SwiftStack, SoftLayer, and Mirantis all contributed to the global clusters feature. More details on what’s coming are in the CHANGELOG. Get to the Summit in Hong Kong to hear how Concur set up their global Swift cluster.

More new and cool features are also coming: SwiftStack, Box, and Intel are working on an erasure coding storage policy. Rackspace is working on improving replication. Red Hat is working on making Swift’s interface to storage volumes more dynamic. Work on these features has started, and they will be a major topic of discussion in Hong Kong.

Because of this broad base of contributors, the major feature development addressing real-world use cases, and the proven performance at scale, OpenStack Object Storage is being widely deployed and is powering some of the world’s largest storage clouds. I’m tremendously excited about Swift’s progress and its future trajectory.

Comment » | Communication, community, Development

OpenStack Melbourne Australia Meetup May 15

May 21st, 2012 — 8:41pm

Last Tuesday night (15/5) in Melbourne I attended the 2nd meetup of the Melbourne contingent of the Australian OpenStack User Group. It was a fantastic night with great speakers and a great group of people who are passionate about OpenStack.
There was a mix of people with varying OpenStack experience, from the NeCTAR team, who have made contributions to the source, to IT staff from various companies trying to get a handle on the new technology that everyone is talking about.

The evening started with Tristan Goode (Aptira) welcoming the group to the meetup, then sharing his impressions of the recent OpenStack conference in San Francisco, which he believes was the best conference he has been to in over 20 years in the IT industry.
Tom Fifield (NeCTAR) gave an energetic talk about the OpenStack project. It’s the fastest-moving open source project he has ever seen, and he is glad to see the openness of the project being protected by the founding of a non-profit organisation that will hold all the IP and trademarks of the project. An open attitude to the project will lead to a better product in the end. OpenStack is real and it’s ready for use now; the current Essex release will have long-term support on Ubuntu.
Angus Salkeld (Red Hat) gave a demo of project Heat. Heat provides a mechanism to provision PaaS configurations via templates. It supports versioning of templates and of the full PaaS installation. His WordPress demo brought up a MySQL instance and an Apache instance in approximately 5 minutes.
The final speaker of the night was John Dickinson (Rackspace), discussing the Swift component of OpenStack. He provided some history of the project and described where Rackspace was when it decided to embark on what is now Swift: a massively scalable object store that lets you keep unstructured data without bounds. He also described the design of the system, providing details on the core components of Swift: the object server, the proxy server, the consistency servers, and the rings.
One of the principles of the design was to reduce the impact on operational staff by running with minimal manual intervention. That is impressive considering it aims to store data reliably on unreliable hardware.
John is also very passionate about open source; he sees OpenStack as a way to overcome data sovereignty issues, which are a growing concern in Australia and other parts of the world. It is amazing that people of this calibre are willing to make themselves available to a relatively small user group.

The meetup is an unbelievable opportunity to meet not just other people who use the technology, but the people who build it and contribute to it. Everyone is so enthusiastic about where OpenStack is right now and where it plans to be in the future. If you’re using OpenStack now, you really need to come along to the next meetup; if you’re just playing around with OpenStack, you definitely need to be there; and if you just want to find out more about OpenStack, there is no better place to be than the next meetup.

Look forward to seeing you there.
Evan Watson
Chief Software Architect

2 comments » | community, Event, Meetup

Under the hood of Swift: the Ring

February 15th, 2012 — 3:45am

This is the first post in a series that summarizes our analysis of the Swift architecture. We’ve tried to highlight some points that are not clear enough in the official documentation. Our analysis is based primarily on an in-depth look into the source code.

The following material applies to version 1.4.6 of Swift.

The Ring is a vital part of the Swift architecture. This half-database, half-configuration file keeps track of where all data resides in the cluster. For each possible path to any stored entity in the cluster, the Ring points to the particular device on the particular physical node.

Swift recognizes three types of entities: accounts, containers, and objects. Each type has a ring of its own, but all three rings are built the same way. Swift services use the same source code to create and query all three rings. Two Swift classes are responsible for these tasks: RingBuilder and Ring, respectively.

Ring data structure

Each of the three rings in Swift is a structure consisting of three elements:

  • a list of devices in the cluster, also known as devs in the Ring class;
  • a list of lists of device ids indicating partition-to-device assignments, stored in a variable named _replica2part2dev_id;
  • an integer number of bits to shift an MD5-hashed path to the account/container/object to calculate the partition index for the hash (partition shift value, part_shift).

List of devices

A list of devices includes all storage devices (disks) known to the ring. Each element of this list is a dictionary of the following structure:

  • id (integer) – index of the device in the devices list
  • zone (integer) – zone the device resides in
  • weight (float) – relative weight of the device compared to the other devices in the ring
  • ip (string) – IP address of the server containing the device
  • port (integer) – TCP port the server uses to serve requests for the device
  • device (string) – disk name of the device in the host system, e.g. sda1; used to identify the disk mount point under /srv/node on the host system
  • meta (string) – general-use field for storing arbitrary information about the device; not used by the servers directly

Some device management can be performed using values in this list. First, the entry for a removed device is set to None, and device IDs are generally not reused. Second, setting a device’s weight to 0.0 temporarily disables it, as no partitions will be assigned to that device.
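To make the device list concrete, here is a minimal sketch of what a devs list might look like, with all values (IPs, names) purely illustrative. It shows a removed device (a None slot, so IDs are not reused) and a device temporarily disabled by a zero weight:

```python
# A hypothetical devs list for a tiny ring (values are illustrative).
devs = [
    {'id': 0, 'zone': 1, 'weight': 100.0, 'ip': '10.0.0.1',
     'port': 6000, 'device': 'sda1', 'meta': ''},
    None,  # device id 1 was removed; the slot stays so ids are not reused
    {'id': 2, 'zone': 2, 'weight': 0.0, 'ip': '10.0.0.2',
     'port': 6000, 'device': 'sdb1', 'meta': 'temporarily disabled'},
]

# Devices eligible for partition assignment: present and with non-zero weight.
active = [d for d in devs if d is not None and d['weight'] > 0]
```

Here only the first device would receive partition assignments.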

Partitions assignment list

This data structure is a list of N elements, where N is the replica count for the cluster (3 by default). Each element of the partitions assignment list is an array('H') — Python’s compact, efficient array of unsigned short integers. These values are indexes into the list of devices (see the previous section). So each array('H') in the partitions assignment list represents a mapping from partitions to device IDs.

The ring takes a configurable number of bits from a path’s MD5 hash and converts it to an integer. This number is used as an index into the array('H'); the array element at that index is the ID of the device to which the partition is mapped. The number of bits kept from the hash is known as the partition power, and 2 raised to the partition power is the partition count.
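The hash-to-partition step can be sketched in a few lines of Python. This is a simplified version: the real Swift code also mixes a per-cluster hash path suffix into the hashed string, which is omitted here.

```python
import hashlib
import struct

def partition_for(path, part_power):
    """Map a storage path to a partition index (simplified sketch)."""
    # Keep only the top `part_power` bits of the 32-bit prefix of the MD5 hash.
    part_shift = 32 - part_power
    digest = hashlib.md5(path.encode('utf-8')).digest()
    # '>I' reads the first 4 bytes of the digest as a big-endian unsigned int.
    return struct.unpack_from('>I', digest)[0] >> part_shift

# With partition power 18 the cluster has 2**18 = 262144 partitions.
part = partition_for('/account/container/object', 18)
```

The same path always hashes to the same partition, which is what makes the mapping stable across all nodes that share the same ring.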

For a given partition number, each replica’s device will not be in the same zone as any other replica’s device. Zones can be used to group devices based on physical locations, power separations, network separations, or any other attribute that could make multiple replicas unavailable at the same time.

Partition Shift Value

The partition shift value is the number of low-order bits dropped from the 32-bit prefix of the MD5 hash of a '/account[/container[/object]]' path when calculating the partition index; it equals 32 minus the partition power. The partition index is calculated by translating the kept portion of the hash into an integer.

Ring operation

The structure described above is stored as a pickled (see Python pickle) and gzipped (see Python gzip.GzipFile) file. There are three files, one per ring, and usually their names are:

account.ring.gz
container.ring.gz
object.ring.gz
These files must exist in the /etc/swift directory on every Swift cluster node, both Proxy and Storage, as the services on all these nodes use them to locate entities in the cluster. Moreover, the ring files on all nodes in the cluster must have the same contents so the cluster can function properly.

There are no internal Swift mechanisms that guarantee that the ring is consistent, i.e. that the gzip file is not corrupt and can be read. Swift services have no way to tell whether all nodes have the same version of the rings. Maintenance of the ring files is the administrator’s responsibility. These tasks can, of course, be automated by means external to Swift itself.
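The serialization format itself is plain Python: a dictionary holding the three elements described above, pickled and gzipped. The following sketch round-trips a toy ring structure (not a real Swift ring file; the field values and the flat dictionary layout are illustrative assumptions):

```python
import gzip
import os
import pickle
import tempfile

# A toy ring structure mirroring the three elements described above.
ring_data = {
    'devs': [{'id': 0, 'zone': 1, 'weight': 100.0,
              'ip': '10.0.0.1', 'port': 6000, 'device': 'sda1', 'meta': ''}],
    'replica2part2dev_id': [[0, 0, 0, 0]],  # 1 replica, 4 partitions
    'part_shift': 30,                        # 32 - partition power of 2
}

# Write it the way a ring file is stored: pickled, then gzipped.
path = os.path.join(tempfile.mkdtemp(), 'object.ring.gz')
with gzip.open(path, 'wb') as f:
    pickle.dump(ring_data, f)

# Any node holding a copy of this file can reconstruct the full mapping.
with gzip.open(path, 'rb') as f:
    loaded = pickle.load(f)
```

Because the file is just a gzipped pickle, a corrupt or truncated copy fails only at read time, which is why ring consistency checks have to live outside Swift.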

The Ring allows any Swift service to identify which Storage nodes to query for a particular storage entity. The method Ring.get_nodes(account, container=None, obj=None) is used to identify the target Storage nodes for a given path (/account[/container[/object]]). It returns a tuple of the partition and a list of node dictionaries. The partition is used for constructing the local path to the object file or the account/container database. The node dictionaries have the same structure as the entries in the list of devices (see above).
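The lookup logic can be imitated with the three ring elements alone. Below is a hedged, self-contained sketch of what get_nodes does (the real implementation also applies a hash path suffix and handles removed devices; the toy ring values here are invented for illustration):

```python
import hashlib
import struct
from array import array

def get_nodes(devs, replica2part2dev_id, part_shift,
              account, container=None, obj=None):
    """Simplified imitation of Ring.get_nodes: path -> (partition, nodes)."""
    # Build the storage path and hash it.
    path = '/' + '/'.join(p for p in (account, container, obj) if p)
    digest = hashlib.md5(path.encode('utf-8')).digest()
    part = struct.unpack_from('>I', digest)[0] >> part_shift
    # One device per replica for this partition.
    nodes = [devs[r2p2d[part]] for r2p2d in replica2part2dev_id]
    return part, nodes

# Toy ring: partition power 2 (4 partitions), 2 replicas, 2 devices.
devs = [
    {'id': 0, 'zone': 1, 'ip': '10.0.0.1', 'port': 6000, 'device': 'sda1'},
    {'id': 1, 'zone': 2, 'ip': '10.0.0.2', 'port': 6000, 'device': 'sdb1'},
]
replica2part2dev_id = [array('H', [0, 1, 0, 1]),
                       array('H', [1, 0, 1, 0])]

part, nodes = get_nodes(devs, replica2part2dev_id, 32 - 2,
                        'AUTH_demo', 'photos', 'cat.jpg')
```

Note how each replica's lookup row sends the same partition to a device in a different zone, matching the zone-separation rule described earlier.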

Ring management

Swift services cannot change the Ring; it is managed by the swift-ring-builder script. When a new Ring is created, the administrator first specifies the builder file and the main parameters of the Ring: the partition power (or partition shift value), the number of replicas of each partition in the cluster, and the minimum time in hours before a given partition can be moved again:

swift-ring-builder <builder_file> create <part_power> <replicas> <min_part_hours>

Once the builder file structure is created, the administrator should add devices to the Ring. For each device, the required values are the zone number, the IP address of the Storage node, the port on which the server is listening, the device name (e.g. sdb1), optional device metadata (e.g. model name, installation date, or anything else), and the device weight:

swift-ring-builder <builder_file> add z<zone>-<ip>:<port>/<device_name>_<meta> <weight>

The device weight is used to distribute partitions between the devices: the greater the weight, the more partitions are assigned to that device. The recommended initial approach is to use same-size devices across the cluster and give each a weight of 100.0. For devices added later, the weight should be proportional to capacity; for example, if the original disks have weight 100.0, a disk twice their size should get 200.0. At this point, all devices that will initially be in the cluster should be added to the Ring. The consistency of the builder file can be verified before creating the actual Ring file:

swift-ring-builder <builder_file>

If verification succeeds, the next step is to distribute partitions between the devices and create the actual Ring file. This is called rebalancing the Ring. The process is designed to move as few partitions as possible to minimize the data exchange between nodes, so it is important that all necessary changes to the Ring are made before rebalancing it:

swift-ring-builder <builder_file> rebalance

The whole procedure must be repeated for all three rings: account, container, and object. The resulting .ring.gz files should be pushed to all nodes in the cluster. The builder files are also needed for future changes to the rings, so they should be backed up and kept in a safe place. One approach is to store them in Swift itself as ordinary objects.
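Putting the steps above together, a complete session for one ring might look like the following. All concrete values here (partition power, IP addresses, zone layout) are invented for illustration, not a recommended production configuration:

```shell
# Hypothetical object ring for a 4-node cluster: partition power 18,
# 3 replicas, at least 1 hour between moves of the same partition.
swift-ring-builder object.builder create 18 3 1

# One same-size disk per zone, each with the initial weight of 100.
swift-ring-builder object.builder add z1-10.0.0.1:6000/sda1 100
swift-ring-builder object.builder add z2-10.0.0.2:6000/sda1 100
swift-ring-builder object.builder add z3-10.0.0.3:6000/sda1 100
swift-ring-builder object.builder add z4-10.0.0.4:6000/sda1 100

# Verify the builder file, then assign partitions and emit object.ring.gz.
swift-ring-builder object.builder
swift-ring-builder object.builder rebalance

# Repeat for account.builder and container.builder, then copy the
# resulting .ring.gz files to /etc/swift on every node.
```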

Physical disk usage

A partition is essentially a block of data stored in the cluster. This does not mean, however, that disk usage is constant across partitions. Objects are distributed between partitions based on the hash of the object path, not the object size or other parameters. Objects are not themselves partitioned, which means that an object is kept as a single file in the storage node’s file system (except very large objects, greater than 5 GB, which can be uploaded in segments – see the Swift documentation).

A partition mapped to a storage device is actually a directory in the structure under /srv/node/<dev_name>. The disk space used by this directory may vary from partition to partition, depending on the size of the objects that the Ring’s hash mapping has placed in that partition.
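The on-disk layout for objects can be sketched as a path-building helper. This reflects the general objects/<partition>/<hash suffix>/<hash>/<timestamp>.data layout used by the object server; the exact timestamp format and hash values below are illustrative:

```python
import os

def object_path(device, partition, name_hash, timestamp):
    """Build the local path for an object file (illustrative sketch)."""
    # The last three characters of the path hash form the suffix directory.
    suffix = name_hash[-3:]
    return os.path.join('/srv/node', device, 'objects', str(partition),
                        suffix, name_hash, timestamp + '.data')

p = object_path('sda1', 1024,
                'd41d8cd98f00b204e9800998ecf8427e', '1329343600.00000')
```

Since objects land in partition directories purely by path hash, a partition holding a few multi-gigabyte objects can use far more disk than one holding thousands of tiny objects.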

In conclusion, it should be said that the Swift Ring is a beautiful structure, though it lacks a degree of automation and synchronization between nodes. I’m going to write about how to solve these problems in one of the following posts.

More information

More information about the Swift Ring can be found in the following sources:
  • Official Swift documentation – the base source for the description of the data structure
  • Swift Ring source code on GitHub – the code base of the Ring and RingBuilder classes
  • Chmouel Boudjnah’s blog – contains useful Swift hints

Comment » | Documentation

Developer Weekly (August 12)

August 12th, 2011 — 4:57pm

Many people have asked for more insight into developer activities for OpenStack, as the large number of code changes and proposals makes it difficult to monitor everything that is happening. In hopes of exposing more of the developer activities, I plan to post a weekly or biweekly blog post on the latest development activities. If you have any ideas for this blog post, please email me at stephen.spector@openstack.org. I am always ready to listen to the community for new ideas.


Developer Mailing List (archive: https://lists.launchpad.net/openstack/)

This is a select list of topics discussed this week on the developer mailing list, not a complete one. Please visit the archive to see all the topics.

  • Tenants and Service Relationship… – Liem Manh Ngueyn asks, “can I have a tenant associated with the “swift” service in Region X and another “swift” service in Region Y?” Yogeshwar Srikrishnan replies that Keystone would have a different endpoint_template for each of those regions and provides an example.
  • Monitoring RabbitMQ Messages – Joshua Harlow asks if there is a tool to see all the messages passing through RabbitMQ. Craig Vyvial suggested changing the config options for RabbitMQ (http://www.rabbitmq.com/management.html#configuration). Narayan Desai suggested using rabbitmqctl list_queues to see the queue depth for each Nova service.
  • Problems connecting Dashboard and Nova – Mauricio Arango submitted the error information produced when the Dashboard fails to connect to Nova. Several developers offered various ideas to solve the problem – Mark Gius, Rafael Duran Castaneda, Joseph Heck, Arvind Somya, and Vish Ishaya. The complete flow of ideas and responses is at https://lists.launchpad.net/openstack/msg03456.html.


For the latest on development activities on OpenStack please check these sites for more details:

Comment » | Communication, Development, Governance, Newsletter

Back to top