Hadoop FS Support
Hadoop FS Support is an implementation of the Apache Hadoop FileSystem interface. It opens several opportunities for you to work with the platform and other industry-standard big data tools, while reducing the amount of custom code you need to write to do so. A few examples include:
- Read from and write to an object store layer using Hadoop FS Support in Spark (see the tutorial)
- Bring your data into the HERE platform using an object store layer (see the tutorial)
- Connect an object store layer to open-source processing and analytics tools outside the platform, such as Apache Spark, Apache Drill, Presto, AWS EMR, and others.
Layer support
Hadoop FS Support is only available for the objectstore layer type.
Hadoop FileSystem interface support
The following Hadoop FileSystem methods are supported:
- String getScheme() returns the scheme, which is blobfs.
- URI getUri() returns the URI, for example blobfs://hrn:here:data::olp-here:blobfs-test:test-data.
- void initialize(URI name, Configuration conf) initializes the BlobFs FileSystem.
- FSDataInputStream open(Path f, int bufferSize) provides an InputStream to read data from an object stored in the Object Store layer.
- FileStatus getFileStatus(Path f) provides information about a file.
- FileStatus[] listStatus(Path f) provides a list of files along with their respective information.
- FSDataOutputStream create(Path f, FsPermission permission, boolean overwrite, int bufferSize, short replication, long blockSize, Progressable progress) provides an OutputStream to write data to. The parameters progress, permission, and replication are not implemented.
- boolean rename(Path src, Path dst) renames the path.
- boolean delete(Path f, boolean recursive) deletes a path; if the recursive flag is set to true, all subdirectories and files are also deleted.
- boolean mkdirs(Path f, FsPermission permission) creates directories for the given path. The parameter permission is not implemented.
- void close() closes the file system.
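As an illustration, these methods can be exercised through the standard Hadoop FileSystem API. The following is a sketch, not an official sample: the HRN and layer ID are the documentation's placeholders, and it assumes BlobFs is on the classpath with fs.blobfs.impl configured as described later in this document.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Placeholder URI; substitute your own catalog HRN and layer ID.
val uri = new URI("blobfs://hrn:here:data::olp-here:blobfs-test:test-data")

val conf = new Configuration()
// Assumed to be needed when core-site.xml does not already declare BlobFs.
conf.set("fs.blobfs.impl",
  "com.here.platform.data.client.hdfs.DataServiceBlobHadoopFileSystem")

val fs = FileSystem.get(uri, conf)

// Write an object, list its directory, read it back, then clean up.
val path = new Path("/examples/hello.txt")
val out = fs.create(path, /* overwrite = */ true)
out.write("hello".getBytes("UTF-8"))
out.close()

fs.listStatus(new Path("/examples")).foreach(status => println(status.getPath))

val in = fs.open(path)
// ... read from the InputStream as usual ...
in.close()

fs.delete(path, /* recursive = */ false)
fs.close()
```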
Usage
For the catalog HRN hrn:here:data::olp-here:blobfs-test and the layer ID test-data, the URL to be used in Hadoop/Spark/Drill is blobfs://hrn:here:data::olp-here:blobfs-test:test-data.
Spark
Your Spark application will need to have a dependency on the hadoop-fs-support package.
val catalogHrn = "hrn:here:data::olp-here:blobfs-test"
val layerId = "test-data"
val sourcePath = s"blobfs://$catalogHrn:$layerId/source"
val destinationPath = s"blobfs://$catalogHrn:$layerId/destination"
val sourceRdd = sparkContext.textFile(sourcePath)
sourceRdd.saveAsTextFile(destinationPath)
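The same blobfs paths should also work with the higher-level Spark SQL API, since Spark accepts any Hadoop-compatible URI as a data source path. A hedged sketch, assuming a SparkSession and that the layer already contains CSV data under a hypothetical input-csv prefix:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("blobfs-example")
  .getOrCreate()

val catalogHrn = "hrn:here:data::olp-here:blobfs-test"
val layerId = "test-data"

// Read CSV data from the layer and write it back as Parquet.
// The input-csv and output-parquet prefixes are illustrative placeholders.
val df = spark.read
  .option("header", "true")
  .csv(s"blobfs://$catalogHrn:$layerId/input-csv")
df.write.parquet(s"blobfs://$catalogHrn:$layerId/output-parquet")
```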
Hadoop Fs Shell
You can use the Hadoop File System Shell to explore the contents of the Object Store layer. The operations supported from the Hadoop File System Shell are the following:
- cat
- copyFromLocal
- copyToLocal
- count
- cp
- ls
- mkdir
- moveFromLocal
- moveToLocal
- mv
- put
- rm
- rmdir
- rmr
- test
- text
- touchz
- truncate
- usage
export HADOOP_CLASSPATH="hadoop-fs-support_2.12-${VERSION}-assembly.jar"
hadoop fs -mkdir blobfs://hrn:here:data::olp-here:blobfs-test:test-data/directory1
hadoop fs -cp file.txt blobfs://hrn:here:data::olp-here:blobfs-test:test-data/directory1
hadoop fs -ls blobfs://hrn:here:data::olp-here:blobfs-test:test-data/directory1
Hadoop configurations
BlobFs supports the following Hadoop configurations:
- fs.blobfs.multipart.part-upload-parallelism: BlobFs uploads an object by splitting it into parts and using the multipart upload functionality of Object Store. This configuration defines how many parts of a single object can be uploaded simultaneously. The minimum allowed parallelism is 1. The default value is 2. Upload speed can increase with higher parallelism, but higher parallelism is also more costly, because each uploaded part is buffered in memory.
- fs.blobfs.multipart.part-size: The size, in bytes, of each uploaded part of the object. The minimum part size allowed is 5242880 (5 MiB). The maximum part size allowed is 100663296 (96 MiB). The default value is 100663296.
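These configurations can be set like any other Hadoop property, for example on a Spark application's Hadoop configuration. A sketch (the chosen values are illustrative, not recommendations):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("blobfs-tuning").getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration

// Upload up to 4 parts of one object concurrently (default is 2).
hadoopConf.set("fs.blobfs.multipart.part-upload-parallelism", "4")
// Use 32 MiB parts; must be between 5242880 and 100663296 bytes.
hadoopConf.set("fs.blobfs.multipart.part-size", (32 * 1024 * 1024).toString)
```

Note that memory usage grows with both settings, since up to parallelism × part-size bytes are buffered per object upload.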
Authentication
For instructions on how to set up HERE credentials, see Get Your Credentials.
Additionally, HERE credentials can also be passed as Hadoop configurations:
- fs.blobfs.accessKeyId: HERE access key ID. This configuration is the same as here.access.key.id in the credentials.properties file.
- fs.blobfs.accessClientId: HERE client ID. This configuration is the same as here.client.id in the credentials.properties file.
- fs.blobfs.accessKeySecret: HERE access key secret. This configuration is the same as here.access.key.secret in the credentials.properties file.
- fs.blobfs.accessEndpointUrl: HERE token endpoint URL. This configuration is the same as here.token.endpoint.url in the credentials.properties file.
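Passing the credentials programmatically might look like the following sketch. All values shown are placeholders; in practice, read them from a secure store rather than hard-coding them in source.

```scala
import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
// Placeholder values: substitute the fields from your credentials.properties.
conf.set("fs.blobfs.accessKeyId", "<your-access-key-id>")
conf.set("fs.blobfs.accessClientId", "<your-client-id>")
conf.set("fs.blobfs.accessKeySecret", "<your-access-key-secret>")
conf.set("fs.blobfs.accessEndpointUrl", "<your-token-endpoint-url>")
```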
Configuring Hadoop installations
EMR
In order to run on EMR, you will need to create an EMR cluster with the additional configurations parameter, for example:
aws emr create-cluster \
--name "$cluster_name" \
--release-label "emr-5.17.0" \
--applications Name="Spark" \
--region ${region} \
--log-uri ${log_location} \
--instance-type "m4.large" \
--instance-count 4 \
--service-role "EMR_DefaultRole" \
--ec2-attributes KeyName="some-key",InstanceProfile="EMR_EC2_DefaultRole",AdditionalMasterSecurityGroups="sg-xxxx",AdditionalSlaveSecurityGroups="sg-xxxxx",SubnetId="subnet-xxxx" \
--configurations file://emr-config.json
The content of the emr-config.json file is the following:
[
{
"Classification": "core-site",
"Properties": {
"fs.blobfs.impl": "com.here.platform.data.client.hdfs.DataServiceBlobHadoopFileSystem"
}
}
]
Standalone Hadoop installations
- The BlobFS fat jar needs to be included on the Hadoop classpath.
- In some cases, the following property needs to be added to the core-site.xml file:
<property>
<name>fs.blobfs.impl</name>
<value>com.here.platform.data.client.hdfs.DataServiceBlobHadoopFileSystem</value>
</property>
Drill
The BlobFS fat jar needs to be included on the Hadoop classpath.
Notes
Object Store is not a true file system. It is a distributed key-value store and, as such, does not behave exactly like a file system. A file system expects operations such as delete or rename to be atomic; in Object Store, these operations only finish eventually. A file system also expects that while a file is being read or written, its content is not changed and the file is not deleted; Object Store does not assure this behavior. You need to take precautions against these behavioral differences yourself.
Hadoop FS Support implements the Hadoop FileSystem interface version 2.7.3. BlobFs is likely to work with Hadoop versions up to 2.9.x, but this is not guaranteed. Hadoop version 3.x.x is not yet supported.
Compatibility issues with HRNs: BlobFs requires the HRN as the authority of the URI, but some tools create incorrect URIs for HRNs. To work around such cases, you can also pass the hex-encoded value of the HRN as the authority of the URI. For the catalog HRN hrn:here:data::olp-here:blobfs-test and the layer ID test-data, this looks as follows:
import org.apache.commons.codec.binary.Hex

val hrnHex = Hex.encodeHexString("hrn:here:data::olp-here:blobfs-test:test-data".getBytes)
val blobFsUri = s"blobfs://$hrnHex"
Hadoop FS Support does not support Apache Flink.