Cloud Dataverse is a new service for accessing and processing public datasets in an OpenStack cloud. It is based on Dataverse, a popular framework for sharing, preserving, and analyzing research data. Cloud Dataverse extends Dataverse to replicate datasets from per-institution repositories to a cloud-based repository and to store the data in Swift, so that applications running in the cloud can access the data in situ. We use OpenStack Sahara to launch on-demand Big Data applications that use Swift as a data source for analytics jobs running on Hadoop, Spark, or Pig.
We follow the user's journey through Cloud Dataverse: browsing datasets, the harvesting and replication process, viewing files in the object store, and using the compute resources provided by Sahara. To improve the Sahara user experience, we plan to generate default cluster templates automatically through a new UI, giving users the option to bypass the complexity of Horizon.
- The features of the existing Dataverse project
- The relevant new functionality that allows the integration of Dataverse with OpenStack
- The basics of OpenStack Sahara
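
As a rough illustration of the first two steps of the journey described above (browsing datasets, then locating a replicated file in the object store), the sketch below builds a Dataverse Search API URL and a Swift object URL. It is a minimal sketch under stated assumptions, not Cloud Dataverse's actual code: the host names, the account `AUTH_demo`, and the container and object names are hypothetical, and how Cloud Dataverse maps datasets to Swift containers is not specified in this abstract.

```python
from urllib.parse import quote, urlencode


def dataverse_search_url(base_url, query, per_page=10):
    """Build a Dataverse Search API URL that returns only datasets."""
    params = urlencode({"q": query, "type": "dataset", "per_page": per_page})
    return f"{base_url.rstrip('/')}/api/search?{params}"


def swift_object_url(storage_url, container, object_name):
    """Build the URL of an object in Swift.

    Swift addresses objects as <storage_url>/<container>/<object>, where the
    storage URL already ends in /v1/<account>. The container and object names
    used in the example below are hypothetical.
    """
    return "/".join([
        storage_url.rstrip("/"),
        quote(container, safe=""),     # container names cannot contain "/"
        quote(object_name, safe="/"),  # object names may use "/" as a pseudo-path
    ])


if __name__ == "__main__":
    # Hypothetical endpoints, for illustration only.
    print(dataverse_search_url("https://dataverse.example.org", "climate data", 5))
    print(swift_object_url("https://swift.example.org/v1/AUTH_demo",
                           "dataset-42", "raw/survey 2016.csv"))
```

An analytics job launched through Sahara would then be pointed at the Swift location as its data source, rather than at a downloaded copy of the file.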