OpenStack community members are voting on presentations for the OpenStack Summit, November 3-7, in Paris, France. We received hundreds of high-quality submissions, and your votes can help us determine which ones to include in the schedule.
Apache Hadoop is an open source data processing framework that is usually deployed on bare-metal commodity servers. Recently, however, more Hadoop clusters are being deployed in cloud environments on virtual machines, most prominently for ease of deployment and scalability. Cloud environments offer several advantages over bare-metal ones but introduce their own challenges for Hadoop clusters. The main one is data locality: in a cloud environment, a virtual machine might be created on one physical host while its disks (volumes) are created on a different physical host. This separation between the compute and storage components of a virtual machine introduces delays and network congestion whenever the virtual machine accesses its non-local disks over the network. In this work we propose a solution for data locality in Hadoop clusters deployed on OpenStack. Our solution uses the extensible scheduling frameworks in OpenStack Nova and OpenStack Cinder to select the best physical host for a virtual machine based on its storage requirements and to ensure that any disks attached to the virtual machine are local disks. We'll also present how we used this solution within the OpenStack Sahara project, which makes provisioning Hadoop clusters on OpenStack easier and more efficient. The solution is not limited to Hadoop: any cluster of machines that needs fast local disk access, such as Elasticsearch or Cassandra, could benefit from it.
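The placement idea in the abstract can be illustrated with a small filter-and-weigh sketch. This is not the presenters' implementation and it does not call the Nova or Cinder scheduler APIs. The names Host, Request, host_passes, weigh and pick_host are hypothetical, and the weighing policy (prefer the host with the most local disk left over) is an assumption made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    """Hypothetical view of a physical host (not a Nova or Cinder object)."""
    name: str
    free_vcpus: int
    free_ram_mb: int
    free_local_disk_gb: int


@dataclass
class Request:
    """Hypothetical request: one VM plus the volumes it will attach."""
    vcpus: int
    ram_mb: int
    volume_sizes_gb: List[int]


def host_passes(host: Host, req: Request) -> bool:
    """Filter step: keep hosts that fit the VM and all of its volumes locally."""
    return (
        host.free_vcpus >= req.vcpus
        and host.free_ram_mb >= req.ram_mb
        and host.free_local_disk_gb >= sum(req.volume_sizes_gb)
    )


def weigh(host: Host, req: Request) -> int:
    """Weigh step (assumed policy): local disk left after placing the volumes."""
    return host.free_local_disk_gb - sum(req.volume_sizes_gb)


def pick_host(hosts: List[Host], req: Request) -> Optional[Host]:
    """Return the best host for both the VM and its disks, or None if none fits."""
    candidates = [h for h in hosts if host_passes(h, req)]
    if not candidates:
        return None
    return max(candidates, key=lambda h: weigh(h, req))
```

Because a single host is chosen for the virtual machine and its volumes together, a request is rejected (None) rather than split across hosts, which is the property that keeps disk access local in this sketch.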
After several years building web-based solutions at various French companies, Yann Degat now works at Numergy, a French public cloud based on OpenStack, where he contributes to work on big data and PaaS in cloud computing environments.
Adrien Vergé is an engineer who graduated from the École Polytechnique (France) in 2012. He has done research on tracing optimization on ARM systems at École Polytechnique Montréal (Canada), in the lab where the Linux Trace Toolkit (LTTng) was created. He has a patent pending for optimizing the Tor privacy-preserving network, based on work with Technicolor in 2012. He has published on ARM code disassembly. He now works at Numergy, a French public cloud based on OpenStack, where he contributes to various subjects around cloud computing.
Abbass was one of the first lead members of the Data Chanel network within Alcatel-Lucent and an R&D developer at Internet Memory Research. He joined VirtualScale in 2013, where he developed the Chef code and architecture for Hadoop in OpenStack environments.