We describe the Massachusetts Open Cloud (MOC) Big Data as a Service (BDaaS) solution that we built on top of OpenStack. BDaaS allows users to access public datasets and to stand up Hadoop and Spark environments on demand to work on them. We use Cloud Dataverse, an open-source framework that can store data in Ceph, as our data repository. Ceph’s RADOS Gateway (RGW) serves as the gateway between the Big Data analysis tasks and the Ceph storage service. To improve the performance of the Big Data environments, we modified RGW to cache data on SSDs attached to a server local to each rack. All requests for data are automatically directed by the network to the nearest RGW. Users can browse, investigate, and download datasets at MOC Dataverse and run analytics on any of them by clicking a button to provision a Big Data processing environment. BDaaS then prefetches the data from Ceph into the caches and invokes OpenStack Sahara to create the on-demand environment.
We describe the high-performance Big Data as a Service (BDaaS) framework we built on top of OpenStack for use in the Massachusetts Open Cloud. Our BDaaS framework enables Hadoop and Spark jobs to compute on large datasets on demand without first downloading them to local storage. Users can browse datasets, select relevant ones, and run analytics on them at the touch of a button. The framework avoids slow accesses to remote storage by caching frequently used datasets (or portions of datasets) on per-rack SSDs and redirecting requests to the closest cache. It uses Cloud Dataverse, an open-source framework that stores data in Ceph, as its data repository and implements the caching tier within Ceph’s RADOS Gateway.
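
The caching idea can be illustrated with a minimal sketch. It is not the modified RGW code: the class `RackCache`, its `prefetch` method, and the `fetch_from_backend` callback are hypothetical names introduced here, and the least-recently-used eviction policy is an assumption, since the text does not state which policy the cache uses. The sketch models one per-rack cache as a read-through store in front of a slower backend and shows how prefetching a dataset before the analytics environment starts turns later reads into cache hits.

```python
from collections import OrderedDict
from typing import Callable, Iterable


class RackCache:
    """Toy read-through cache standing in for one per-rack SSD cache.

    Hypothetical illustration only; LRU eviction is an assumption.
    """

    def __init__(self, capacity: int, fetch_from_backend: Callable[[str], bytes]):
        self.capacity = capacity                      # max number of cached objects
        self.fetch_from_backend = fetch_from_backend  # slow path (e.g. remote storage)
        self._objects: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def _admit(self, key: str) -> bytes:
        """Load an object from the backend and evict the LRU entry if full."""
        data = self.fetch_from_backend(key)
        self._objects[key] = data
        if len(self._objects) > self.capacity:
            self._objects.popitem(last=False)  # drop least recently used
        return data

    def get(self, key: str) -> bytes:
        """Serve from the cache when possible, otherwise read through."""
        if key in self._objects:
            self._objects.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._objects[key]
        self.misses += 1
        return self._admit(key)

    def prefetch(self, keys: Iterable[str]) -> None:
        """Warm the cache before the analytics environment is provisioned."""
        for key in keys:
            if key not in self._objects:
                self._admit(key)


if __name__ == "__main__":
    backend = {"part-0": b"aaa", "part-1": b"bbb", "part-2": b"ccc"}
    cache = RackCache(capacity=2, fetch_from_backend=backend.__getitem__)
    cache.prefetch(["part-0", "part-1"])
    cache.get("part-0")  # served from the cache
    cache.get("part-2")  # read through; evicts the least recently used entry
    print(cache.hits, cache.misses)
```

In this toy model, redirecting a request to the closest cache would amount to choosing which `RackCache` instance handles the `get` call; the sketch leaves that routing out because the text attributes it to the network.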