The OpenStack Blog

Author Archive

Boris Renski of Mirantis presents: What’s new in OpenStack Folsom | Webcast 4 October 2012

Date: Thursday 4 October 2012 Time: 9am PT/12 Noon ET/6pm CET
Sign up here.

As many of you know, the Folsom release marks OpenStack’s transition from a service provider platform to an enterprise-ready solution, with its baseline feature set and enterprise hardening in place.

I’d like to invite you to join me and Piotr Siwczak, Senior Staff Engineer at Mirantis and contributor to OpenStack, for a technical overview of what’s new in the Folsom release this Thursday, October 4th, at 9am Pacific.

Here’s what we’ll cover:

  • Synopsis of market developments since the April Essex release
  • New capabilities and user features: Nova, Cinder, Keystone, hypervisor support
  • Quantum and Load Balancer as a Service
  • Under-the-covers with key new architectural features
  • Q&A

The webcast is targeted at both experienced OpenStack users and cloud infrastructure teams considering new deployments. Click here for details on signing up. Seats are limited.

Date: Thursday 4 October 2012
Time: 9am PT/12 Noon ET/6pm CET
You can review the Mirantis Privacy Policy here.  

Boris Renski, EVP and co-founder of Mirantis, is a member of the OpenStack Foundation Board.

 

Here is what happens inside Nova when you provision a VM

At the Essex summit and conference this past month, we presented a session on OpenStack Essex architecture. As part of that workshop, we visually demonstrated the request flow for provisioning a VM and walked through the Essex architecture. There was a lot of interest in this material; it’s now posted on Slideshare:

In fact, we’ve packaged up the architecture survey/overview as part of our 2-day Bootcamp for OpenStack. The next session is scheduled for 14-15 June. This time around we will hold the training at the Santa Clara, CA offices of our friends at Nexenta. The last course was delivered at our Mountain View office right before the OpenStack summit in April, to a sold-out crowd. You can find more information about the course at www.mirantis.com/training.

I hear the Essex Train a-coming

With the Essex train in the wilds of testing, and the intended Essex release date less than 10 days away, we are pretty excited about everyone descending on San Francisco — practically our home town — for the Design Summit and Conference.

Here at Mirantis, the company famous across the OpenStack community for distributing vodka bottles at OpenStack meetups, we are gearing up in a big way for the summit and conference. If you haven’t seen the agenda, here’s what we’ve got teed up:

(1) We’ll start the frenzy with Just-in-time-Training: we have a few seats left at our 2-day OpenStack Boot Camp, crammed into the weekend of April 14-15, right before the summit and conference. REGISTER HERE and come to the event fully prepared to torment speakers and presenters with insidious technical questions about OpenStack technology and its future.

(2) Our team will participate in / moderate a few exciting sessions during the conference: OpenStack and Block Storage, OpenStack and High Performance Computing, Expanding the Community. Please be sure to pay us a visit.

(3) …and just to show how happy we are to have you here, we invite everyone at the conference to join the Mirantis Summit Kick-Off Party. This is how we party at Mirantis! Vodka bottles and fun times in the best traditions of all our events are guaranteed. Be sure not to miss it.

Looking forward to receiving everyone at the 2012 OpenStack Design Summit and Conference.

Under the hood of Swift: the Ring

This is the first post in a series that summarizes our analysis of the Swift architecture. We’ve tried to highlight some points that are not clear enough in the official documentation. Our primary source was an in-depth look at the source code.

The following material applies to version 1.4.6 of Swift.

The Ring is a vital part of the Swift architecture. This half database, half configuration file keeps track of where all data resides in the cluster. For each possible path to any stored entity in the cluster, the Ring points to the particular device on the particular physical node.

There are three types of entities that Swift recognizes: accounts, containers and objects. Each type has a ring of its own, but all three rings are built the same way. Swift services use the same source code to create and query all three rings. Two Swift classes are responsible for these tasks: RingBuilder and Ring, respectively.

Ring data structure

Each of the three rings in Swift is a structure consisting of three elements (a short Python sketch follows the list):

  • a list of devices in the cluster, also known as devs in the Ring class;
  • a list of lists of device ids indicating partition-to-device assignments, stored in a variable named _replica2part2dev_id;
  • an integer number of bits to shift an MD5-hashed path to the account/container/object by, in order to calculate the partition index for the hash (the partition shift value, part_shift).
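
To make this concrete, here is a minimal Python sketch of the three elements. This is an illustration of the layout only, not the actual Swift source; all values are invented, and the device fields are explained in the next section.

from array import array

part_shift = 25  # 32 - partition power; here the partition power is 7

# Element 1: the list of devices.
devs = [
    {'id': 0, 'zone': 1, 'weight': 100.0, 'ip': '10.0.0.1',
     'port': 6000, 'device': 'sda1', 'meta': ''},
    {'id': 1, 'zone': 2, 'weight': 100.0, 'ip': '10.0.0.2',
     'port': 6000, 'device': 'sdb1', 'meta': ''},
    {'id': 2, 'zone': 3, 'weight': 100.0, 'ip': '10.0.0.3',
     'port': 6000, 'device': 'sdc1', 'meta': ''},
]

# Element 2: one array('H') per replica, each mapping partition -> device id.
# This toy assignment simply walks around the three devices.
replicas = 3
partition_count = 2 ** 7
_replica2part2dev_id = [
    array('H', [(part + replica) % len(devs) for part in range(partition_count)])
    for replica in range(replicas)
]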
List of devices

A list of devices includes all storage devices (disks) known to the ring. Each element of this list is a dictionary of the following structure:

Key Type Value
id integer Index of the devices list
zone integer Zone the device resides in
weight float The relative weight of the device to the other devices in the ring
ip string IP address of server containing the device
port integer TCP port the server uses to serve requests for the device
device string Disk name of the device in the host system, e.g. sda1. It is used to identify disk mount point under /srv/node on the host system
meta string General-use field for storing arbitrary information about the device. Not used by servers directly

Some device management can be performed using values in the list. First, for removed devices, the 'id' value is set to 'None'. Device IDs are generally not reused. Second, setting 'weight' to 0.0 disables the device temporarily, as no partitions will be assigned to that device.

Partitions assignment list

This data structure is a list of N elements, where N is the replica count for the cluster. The default replica count is 3. Each element of the partitions assignment list is an array('H'), a compact and efficient Python array of unsigned short integers. These values are actually indexes into the list of devices (see the previous section). So, each array('H') in the partitions assignment list represents a mapping from partitions to device IDs.

The ring takes a configurable number of bits from a path’s MD5 hash and converts it to an integer. This number is used as an index into the array('H'). The index points to the array element that holds the ID of the device to which the partition is mapped. The number of bits kept from the hash is known as the partition power, and 2 raised to the partition power is the partition count.
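
For illustration, the conversion from a path to a partition index looks roughly like this in Python (a sketch only; the real Swift code also mixes a cluster-wide hash path suffix into the path before taking the MD5). It reuses part_shift from the sketch above:

import hashlib
from struct import unpack_from

def get_partition(path, part_shift):
    # Take the top 4 bytes of the MD5 digest as a big-endian integer,
    # then drop the low bits, keeping only 'partition power' bits.
    digest = hashlib.md5(path.encode('utf-8')).digest()
    return unpack_from('>I', digest)[0] >> part_shift

part = get_partition('/account/container/object', part_shift)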

For a given partition number, each replica’s device will not be in the same zone as any other replica’s device. Zones can be used to group devices based on physical locations, power separations, network separations, or any other attribute that could make multiple replicas unavailable at the same time.
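
Continuing the sketch, resolving all replicas of a partition is then a plain lookup into each replica’s array; the builder guarantees that the resulting devices live in different zones:

# One device per replica of the partition computed above.
replica_devs = [devs[part2dev_id[part]]
                for part2dev_id in _replica2part2dev_id]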

Partition Shift Value

This is the number of bits taken from the MD5 hash of the '/account[/container[/object]]' path to calculate the partition index for that path. The partition index is calculated by translating the kept portion of the hash into an integer, as shown in the sketch above.

Ring operation

The structure described above is stored as a pickled (see Python pickle) and gzipped (see Python gzip.GzipFile) file. There are three files, one per ring, and usually their names are:

account.ring.gz
container.ring.gz
object.ring.gz

These files must exist in the /etc/swift directory on every Swift cluster node, both Proxy and Storage, as services on all these nodes use them to locate entities in the cluster. Moreover, the ring files on all nodes in the cluster must have the same contents, so the cluster can function properly.

There are no internal Swift mechanisms that can guarantee that the ring is consistent, i.e., that the gzip file is not corrupt and can be read. Swift services have no way to tell whether all nodes have the same version of the rings. Maintenance of the ring files is the administrator’s responsibility. These tasks can be automated by means external to Swift itself, of course.
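
For example, one trivial external check is to compare checksums of the ring files across all nodes (a sketch; any configuration management or monitoring tool can automate it):

import glob
import hashlib

# Print a checksum per ring file; run on each node and compare the output.
for path in sorted(glob.glob('/etc/swift/*.ring.gz')):
    with open(path, 'rb') as f:
        print('%s  %s' % (hashlib.md5(f.read()).hexdigest(), path))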

The Ring allows any Swift service to identify which Storage nodes to query for a particular storage entity. The method Ring.get_nodes(account, container=None, obj=None) is used to identify the target Storage nodes for a given path (/account[/container[/object]]). It returns a tuple of the partition and a list of node dictionaries. The partition is used for constructing the local path to the object file or account/container database. The node dictionaries have the same structure as the devices in the list of devices (see above).
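
Here is a hedged usage sketch (the account, container and object names are invented; the standard /etc/swift location is assumed):

from swift.common.ring import Ring

object_ring = Ring('/etc/swift/object.ring.gz')
partition, nodes = object_ring.get_nodes('AUTH_test', 'photos', 'cat.jpg')
for node in nodes:
    print('%s:%s/%s holds partition %s'
          % (node['ip'], node['port'], node['device'], partition))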

Ring management

Swift services cannot change the Ring. The Ring is managed by the swift-ring-builder script. When a new Ring is created, the administrator first specifies the builder file and the main parameters of the Ring: the partition power (or partition shift value), the number of replicas of each partition in the cluster, and the minimum time in hours before a specific partition can be moved more than once:

swift-ring-builder <builder_file> create <part_power> <replicas> <min_part_hours>

When the temporary builder file structure has been created, the administrator should add devices to the Ring. For each device, the required values are the zone number, the IP address of the Storage node, the port the server listens on, the device name (e.g. sdb1), optional device metadata (e.g., model name, installation date, or anything else) and the device weight:

swift-ring-builder <builder_file> add z<zone>-<ip>:<port>/<device_name>_<meta> <weight>

The device weight is used to distribute partitions between the devices: the higher the weight, the more partitions are assigned to that device. The recommended initial approach is to use devices of the same size across the cluster and assign a weight of 100.0 to each of them. For devices added later, the weight should be proportional to their capacity. At this point, all devices that will initially be in the cluster should be added to the Ring. The consistency of the builder file can be verified before creating the actual Ring file:

swift-ring-builder <builder_file>

If verification succeeds, the next step is to distribute partitions between the devices and create the actual Ring file; this is called ‘rebalancing’ the Ring. The process is designed to move as few partitions as possible, to minimize the data exchange between nodes, so it is important that all necessary changes to the Ring are made before rebalancing it:

swift-ring-builder <builder_file> rebalance
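
Putting the steps together, an end-to-end example for the object ring might look like this (a sketch with a partition power of 18, 3 replicas, a 1-hour minimum between moves of the same partition, and invented addresses and device names):

swift-ring-builder object.builder create 18 3 1
swift-ring-builder object.builder add z1-10.0.0.1:6000/sda1 100.0
swift-ring-builder object.builder add z2-10.0.0.2:6000/sdb1 100.0
swift-ring-builder object.builder add z3-10.0.0.3:6000/sdc1 100.0
swift-ring-builder object.builder
swift-ring-builder object.builder rebalance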

The whole procedure must be repeated for all three rings: account, container and object. The resulting .ring.gz files should then be pushed to all nodes in the cluster. The builder files are also needed for future changes to the rings, so they should be backed up and kept in a safe place. One approach is to store them in the Swift cluster itself as ordinary objects.

Physical disk usage

A partition is essentially a block of data stored in the cluster. This does not mean, however, that disk usage is constant across partitions. The distribution of objects between partitions is based on the hash of the object path, not on the object size or other parameters. Objects are not partitioned, which means that an object is kept as a single file in the storage node’s file system (except for very large objects, greater than 5 GB, which can be uploaded in segments – see the Swift documentation).

A partition mapped to a storage device is actually a directory in the structure under /srv/node/<dev_name>. The disk space used by this directory may vary from partition to partition, depending on the size of the objects that the Ring’s hash mapping has placed into that partition.
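
For example, an object replica typically ends up in a path of the following shape (shown schematically; the exact layout is defined by the object server code):

/srv/node/<dev_name>/objects/<partition>/<hash suffix>/<path hash>/<timestamp>.data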

In conclusion, it should be said that the Swift Ring is a beautiful structure, though it lacks a degree of automation and synchronization between nodes. I’m going to write about how to solve these problems in one of the following posts.

More information

More information about the Swift Ring can be found in the following sources:

  • Official Swift documentation – the base source for the description of the data structures
  • Swift Ring source code on GitHub – the code base of the Ring and RingBuilder classes
  • Blog of Chmouel Boudjnah – contains useful Swift hints

OpenStack Party @ CloudConnect 2012

For those attending CloudConnect 2012 in Santa Clara – join stackers from all over the world at the OpenStack CloudConnect 2012 party at Fahrenheit Lounge, hosted by Mirantis, Rackspace and Cloudscaling.

Open bar, hors d’oeuvres and music all night long. This is the place to be at CloudConnect on Wednesday night.

We’ll have shuttle buses available every 30 minutes, traveling between the Santa Clara Convention Center parking lot and Fahrenheit Lounge, starting at 8pm, immediately after the Cloudscaling cocktail reception.

Registration is first come, first served, and space is limited. Visit openstackparty.eventbrite.com to register.

OpenStack in Production – Event Highlights

As a matter of tradition at this point, we offer a photo report covering the OpenStack event series that Mirantis hosts. Our December 14th event focused on sharing experience around running OpenStack in production. I moderated a panel consisting of Ken Pepple, director of cloud development at Internap; Ray O’Brien, CTO of IT at NASA; and Rodrigo Benzaquen, R&D director at MercadoLibre.

This time we went all out and even recorded the video of the event: http://vimeo.com/33982906

For those who are not in the mood to watch the 50-minute panel video, here is a quick photo report:


We served wine and beer with pizza, salad and desserts…


…While people ate, drank, and mingled…


…and then they drank some more…


We started the panel with me saying smart stuff about OpenStack. After the intro, we kicked off with questions to the panel.


The panelists talked…


…and talked…


…and then talked some more.


Meanwhile, the audience listened…


…and listened.


Everyone in our US team was sporting these OpenStack shirts.


At the end we gave out 5 signed copies of “Deploying OpenStack”, written by one of our panelists – Ken Pepple. Roman (pictured above) did not get a copy.

Clustered LVM on DRBD resource in Fedora Linux

(Crossposted from Mirantis Official Blog)

As Florian Haas pointed out in a comment on my previous post, our shared storage configuration requires special precautions to avoid data corruption when two hosts connected via DRBD try to manage LVM volumes simultaneously. Generally, these precautions amount to locking LVM metadata operations while running DRBD in ‘dual-primary’ mode.

Let’s examine this in detail. The LVM locking mechanism is configured in the [global] section of /etc/lvm/lvm.conf. The ‘locking_type’ parameter is the most important one here: it defines which locking mechanism LVM uses while changing metadata. It can be set to:

  • '0': disables locking completely – dangerous to use;
  • '1': the default, local file-based locking. It knows nothing about the cluster and possible conflicting metadata changes;
  • '2': uses an external shared library, defined by the 'locking_library' parameter;
  • '3': uses built-in LVM clustered locking;
  • '4': read-only locking, which forbids any metadata changes.

The simplest way is to use local locking on one of the DRBD peers and to disable metadata operations on the other. This has a serious drawback, though: Volume Groups and Logical Volumes will not be activated automatically upon creation on the other, ‘passive’ peer. That is not good for a production environment and cannot be automated easily.
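
For reference, that simple setup boils down to something like the following in /etc/lvm/lvm.conf on each peer (a sketch of the relevant lines only):

# on the 'active' peer: default local file-based locking
global {
  locking_type = 1
}

# on the 'passive' peer: read-only locking, metadata changes forbidden
global {
  locking_type = 4
}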

But there is another, more sophisticated way: we can use Linux-HA (Heartbeat) coupled with the LVM Resource Agent. It automates activation of newly created LVM resources on the shared storage, but still provides no locking mechanism suitable for ‘dual-primary’ DRBD operation.

It should be noted that full support of clustered locking for LVM can be achieved with the lvm2-cluster Fedora RPM package from the standard repository. It contains the clvmd service, which runs on all hosts in the cluster and coordinates LVM locking on the shared storage. In our case, the cluster consists of just the two DRBD peers.

clvmd requires a cluster engine in order to function properly. It is provided by the cman service, which is installed as a dependency of lvm2-cluster (other dependencies may vary from installation to installation):

(drbd-node1)# yum install clvmd
...
Dependencies Resolved

================================================================================
Package Arch Version Repository Size
================================================================================
Installing:
lvm2-cluster x86_64 2.02.84-1.fc15 fedora 331 k
Installing for dependencies:
clusterlib x86_64 3.1.1-1.fc15 fedora 70 k
cman x86_64 3.1.1-1.fc15 fedora 364 k
fence-agents x86_64 3.1.4-1.fc15 updates 182 k
fence-virt x86_64 0.2.1-4.fc15 fedora 33 k
ipmitool x86_64 1.8.11-6.fc15 fedora 273 k
lm_sensors-libs x86_64 3.3.0-2.fc15 fedora 36 k
modcluster x86_64 0.18.7-1.fc15 fedora 187 k
net-snmp-libs x86_64 1:5.6.1-7.fc15 fedora 1.6 M
net-snmp-utils x86_64 1:5.6.1-7.fc15 fedora 180 k
oddjob x86_64 0.31-2.fc15 fedora 61 k
openais x86_64 1.1.4-2.fc15 fedora 190 k
openaislib x86_64 1.1.4-2.fc15 fedora 88 k
perl-Net-Telnet noarch 3.03-12.fc15 fedora 55 k
pexpect noarch 2.3-6.fc15 fedora 141 k
python-suds noarch 0.3.9-3.fc15 fedora 195 k
ricci x86_64 0.18.7-1.fc15 fedora 584 k
sg3_utils x86_64 1.29-3.fc15 fedora 465 k
sg3_utils-libs x86_64 1.29-3.fc15 fedora 54 k

 

Transaction Summary
================================================================================
Install 19 Package(s)

The only thing we need the cluster for is the use of clvmd; the configuration of the cluster itself is pretty basic. Since we don’t need advanced features like automated fencing yet, we specify manual handling. As we have only 2 nodes in the cluster, we can tell cman about it. The configuration for cman resides in the /etc/cluster/cluster.conf file:

<?xml version="1.0"?>
<cluster name="cluster" config_version="1">
  <!-- post_join_delay: number of seconds the daemon will wait before
        fencing any victims after a node joins the domain
  post_fail_delay: number of seconds the daemon will wait before
        fencing any victims after a domain member fails
  clean_start    : prevent any startup fencing the daemon might do.
        It indicates that the daemon should assume all nodes
        are in a clean state to start. -->
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
   <clusternode name="drbd-node1" votes="1" nodeid="1">
    <fence>
    <!-- Handle fencing manually -->
     <method name="human">
      <device name="human" nodename="drbd-node1"/>
     </method>
    </fence>
   </clusternode>
   <clusternode name="drbd-node2" votes="1" nodeid="2">
    <fence>
    <!-- Handle fencing manually -->
     <method name="human">
      <device name="human" nodename="drbd-node2"/>
     </method>
    </fence>
   </clusternode>
  </clusternodes>
  <!-- cman two nodes specification -->
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
  <!-- Define manual fencing -->
   <fencedevice name="human" agent="fence_manual"/>
  </fencedevices>
</cluster>

The clusternode name should be a fully qualified domain name that is resolvable via DNS or present in /etc/hosts. The number of votes is used to determine the quorum of the cluster. In this case, we have two nodes with one vote each, and one vote is enough to make the cluster run (to have a quorum), as configured by the cman expected_votes attribute.
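
For example, with the nodes connected back-to-back, the relevant /etc/hosts entries could look like this (the addresses are illustrative):

10.0.0.1 drbd-node1
10.0.0.2 drbd-node2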

The second thing we need to configure is the cluster engine (corosync). Its configuration goes to /etc/corosync/corosync.conf:

compatibility: whitetank
totem {
  version: 2
  secauth: off
  threads: 0
  interface {
    ringnumber: 0
    bindnetaddr: 10.0.0.0
    mcastaddr: 226.94.1.1
    mcastport: 5405
  }
}
logging {
  fileline: off
  to_stderr: no
  to_logfile: yes
  to_syslog: yes
  # the pathname of the log file
  logfile: /var/log/cluster/corosync.log
  debug: off
  timestamp: on
  logger_subsys {
    subsys: AMF
    debug: off
  }
}
amf {
  mode: disabled
}

The bindnetaddr parameter must contain a network address. We configure corosync to work on the eth1 interfaces, which connect our nodes back-to-back over a 1 Gbps network. Also, we should configure iptables to accept multicast traffic on both hosts.
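
For example (a sketch: corosync uses the configured mcastport and the port just below it, so both are opened here; adjust the interface name to your setup):

(both nodes)# iptables -A INPUT -i eth1 -p udp --dport 5404:5405 -j ACCEPT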

It’s noteworthy that these configurations should be identical on both cluster nodes.

After the cluster has been prepared, we can change the LVM locking type in /etc/lvm/lvm.conf on both drbd-connected nodes:

global {
  ...
  locking_type = 3
  ...
}

Start the cman and clvmd services on the DRBD peers to get the cluster ready for action:

(drbd-node1)# service cman start
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Unfencing self... [ OK ]
Joining fence domain... [ OK ]
(drbd-node1)# service clvmd start
Starting clvmd:
Activating VG(s): 2 logical volume(s) in volume group "vg-sys" now active
2 logical volume(s) in volume group "vg_shared" now active
[ OK ]

Now, as we already have a Volume Group on the shared storage, we can easily make it cluster-aware:

(drbd-node1)# vgchange -c y vg_shared

Now we see the ‘c’ flag in VG Attributes:

(drbd-node1)# vgs
VG        #PV #LV #SN Attr    VSize   VFree
vg_shared   1   3   0 wz--nc  1.29t   1.04t
vg_sys      1   2   0 wz--n-  19.97g  5.97g

As a result, Logical Volumes created in the vg_shared volume group will be active on both nodes, and clustered locking is enabled for operations with volumes in this group. LVM commands can be issued on both hosts and clvmd takes care of possible concurrent metadata changes.

OpenStack Nova: basic disaster recovery

We have published a new blog post about handling basic issues and virtual machine recovery methods. From the blog:

Today, I want to take a look at some possible issues that may be encountered while using OpenStack. The purpose of this topic is to share our experience dealing with the hardware or software failures which definitely would be faced by anyone who attempts to run OpenStack in production.

Read the complete blog at http://mirantis.blogspot.com/2011/06/openstack-nova-basic-disaster-recovery.html.
