Working with Data in Swift Object Storage

OpenStack Swift is a cloud object storage service. Swift stores arbitrary objects in namespaces called containers. Objects can be text, documents, images, or any other type of file. A container groups a set of objects, much as a folder holds a set of files.

There are several ways to store (import) files in Swift and retrieve (export) them:

Import/Export Swift Objects via Web Interface

Import Swift objects

Export Swift objects

Import/Export Swift Objects via GUI Clients

Several GUI clients that run on Windows or Mac can be used, for example CloudBerry Explorer and Cyberduck.

Set up CloudBerry Explorer

Click File -> New Openstack Account. Enter the username, password, and other information as follows.

Then select kitwai to connect to the Swift service.

The containers under the default project are shown.

Set up Cyberduck

Click File -> Open Connection. Enter the project, username, password, and other information as follows.

The containers are shown.

How to Access Swift Objects directly from Spark Programs

Swift object storage is not intended to be the working area for data-intensive jobs. If performance is the priority, use HDFS instead.

Spark can access Swift objects through the Hadoop file system layer. The URL of a Swift object has the following form:

swift://container_name.sahara/path/file

For example, if the object churn-bigml-80.csv is stored in the container mycontainer under the path /dataset/churn, the URL is:

swift://mycontainer.sahara/dataset/churn/churn-bigml-80.csv
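
The URL layout above can also be assembled programmatically. The following is only a minimal sketch: the helper name swift_url is hypothetical (it is not part of Spark or Swift), and the sahara suffix is simply taken from the examples in this section.

```python
def swift_url(container, path, filename, provider="sahara"):
    """Build a URL of the form swift://<container>.<provider>/<path>/<file>."""
    # Strip surrounding slashes so the pieces join with exactly one "/" each.
    clean_path = path.strip("/")
    # Skip empty pieces, e.g. when the object sits at the container root.
    parts = [p for p in (clean_path, filename) if p]
    return "swift://{}.{}/{}".format(container, provider, "/".join(parts))


print(swift_url("mycontainer", "/dataset/churn", "churn-bigml-80.csv"))
# prints swift://mycontainer.sahara/dataset/churn/churn-bigml-80.csv
```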

Wildcards can also be used, e.g.

swift://mycontainer.sahara/dataset/churn/churn-bigml-*.csv
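
To get a feel for which objects such a pattern selects, the sketch below applies the same pattern to a list of object names using Python's fnmatch module. This is only an approximation for illustration: Spark resolves the wildcard itself, and its matching rules may differ in detail (for instance, fnmatch lets * match across "/" separators). The object names listed are hypothetical.

```python
from fnmatch import fnmatchcase

# Path part of the wildcard URL, relative to the container.
pattern = "dataset/churn/churn-bigml-*.csv"

# Hypothetical object names in the container, for illustration only.
objects = [
    "dataset/churn/churn-bigml-80.csv",
    "dataset/churn/churn-bigml-20.csv",
    "dataset/churn/readme.txt",
]

# Keep only the names that match the wildcard pattern (case-sensitive).
matched = [name for name in objects if fnmatchcase(name, pattern)]
print(matched)
```

Here the two CSV objects match the pattern and readme.txt does not.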

To use a Swift object in Spark, pass its URL as the input path.

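# sqlContext is assumed to exist already (for example, in the PySpark shell).
# The com.databricks.spark.csv format comes from the external spark-csv package.
train_data = sqlContext.read.load(
    'swift://mycontainer.sahara/dataset/churn/churn-bigml-80.csv',
    format='com.databricks.spark.csv',
    header='true',        # the first line contains the column names
    inferSchema='true')   # infer column types from the data

train_data.cache()        # keep the DataFrame in memory for reuse
train_data.printSchema()  # print the inferred column names and types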
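
On Spark 2.0 and later, CSV reading is built into Spark, so the external com.databricks.spark.csv format is not required. A rough equivalent of the example above might look like the following sketch. It assumes an existing SparkSession is passed in as spark; the function name load_swift_csv is hypothetical.

```python
def load_swift_csv(spark, url):
    """Read a CSV object from Swift into a DataFrame with the built-in CSV reader."""
    # header=True: the first line holds the column names.
    # inferSchema=True: let Spark guess the column types from the data.
    return spark.read.csv(url, header=True, inferSchema=True)


# Example usage (requires a running Spark cluster with Swift access configured):
# train_data = load_swift_csv(
#     spark, 'swift://mycontainer.sahara/dataset/churn/churn-bigml-80.csv')
# train_data.printSchema()
```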