R and Hadoop

Apache Hadoop is written in Java, and Java is the main programming language for Hadoop. However, if you don’t know Java or don’t want to work with it, you can still write MapReduce (MR) jobs in another language such as Python, R, or Ruby by using the Hadoop Streaming API. In this blog post, I am going to show how to integrate R and Hadoop using the rmr2 package. This tutorial assumes that you have a working Hadoop ecosystem with R and RStudio Server installed. If you don’t have one, go through my previous blog post: http://www.crackstats.in/setting-up-rstudio-server-on-cloudera-quickstart-vms/

The most common way to link R and Hadoop is to use HDFS (potentially managed by Hive or HBase) as the long-term store for all data, and to use MapReduce jobs (potentially submitted from Hive, Pig, or Oozie) to encode, enrich, and sample data sets from HDFS into R. You can then perform complex modeling exercises in R on a subset of prepared data.
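As a rough illustration of this sample-then-model pattern, the sketch below simulates it locally in base R, without Hadoop. It is only a stand-in: the built-in mtcars data set plays the role of a large HDFS data set, and sample_chunk() and keep_fraction are hypothetical names chosen for this example.

```r
# Local, Hadoop-free simulation of "sample in the map phase, model in R".
# Assumptions: mtcars stands in for a large HDFS data set, and
# sample_chunk() and keep_fraction are hypothetical names.
set.seed(42)  # make the random sample reproducible

keep_fraction <- 0.5

# Plays the role of a map function: keep each row with probability keep_fraction.
sample_chunk <- function(chunk) {
  chunk[runif(nrow(chunk)) < keep_fraction, , drop = FALSE]
}

# Split the data into chunks of 8 rows, roughly the way input splits reach mappers.
chunk_id <- ceiling(seq_len(nrow(mtcars)) / 8)
chunks <- split(mtcars, chunk_id)

# "Map" every chunk, then collect the sampled rows into one data frame.
sampled <- do.call(rbind, lapply(chunks, sample_chunk))

# Model the smaller sample in plain R.
fit <- lm(mpg ~ wt, data = sampled)
print(coef(fit))
```

In a real rmr2 job, the sampling logic would live inside the map function, and from.dfs() would bring the sampled rows into the R session before modeling.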

Revolution Analytics released RHadoop to integrate R and Hadoop. RHadoop is a collection of three R packages that allow users to manage and analyze data with Hadoop. It consists of the following packages, which are available for download at https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads:

rmr2 – functions providing Hadoop MapReduce functionality in R
rhdfs – functions providing file management of the HDFS from within R
rhbase – functions providing database management for the HBase distributed database from within R

So let’s begin.

Step 1: Open a terminal in the virtual machine and set the following environment variables:

export HADOOP_HOME=/usr/lib/hadoop-0.20-mapreduce
export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_STREAMING=/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.5.0.jar

Note: Update the path in HADOOP_STREAMING to match your Hadoop version.
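If you are unsure where the streaming jar lives on your machine, one way to look for it is from an R session. The sketch below is a minimal helper, not part of RHadoop: find_streaming_jar() is a hypothetical name, and the default root of /usr/lib is an assumption based on the paths used in this post.

```r
# Search a directory tree for Hadoop Streaming jars.
# find_streaming_jar() is a hypothetical helper; "/usr/lib" is an assumed
# starting point taken from the paths used in this post.
find_streaming_jar <- function(root = "/usr/lib") {
  list.files(
    path = root,
    pattern = "^hadoop-streaming.*\\.jar$",  # matched against file names only
    recursive = TRUE,
    full.names = TRUE
  )
}

jars <- find_streaming_jar()
print(jars)  # character(0) means no jar was found under root
```

If more than one jar is listed, pick the one that matches the MapReduce version your cluster runs.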

Step 2: Download the rmr2 package from https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads, then open RStudio Server and execute the following code:

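# With repos = NULL, R does not fetch dependencies, so the packages rmr2 depends on must already be installed.
install.packages("/home/username/Downloads/rmr2_3.2.0.tar.gz", repos = NULL, type = "source")
# Tell rmr2 where to find Hadoop and the streaming jar.
Sys.setenv(HADOOP_HOME = "/usr/lib/hadoop-0.20-mapreduce")
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.5.0.jar")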
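Before moving on, it can help to confirm that these variables are visible to the R session and point at paths that exist. The following is a minimal sketch in base R; check_hadoop_env() is a hypothetical helper written for this post, not an rmr2 function.

```r
# Report, for each Hadoop variable, whether it is set in this R session
# and whether the path it names exists on disk.
# check_hadoop_env() is a hypothetical helper, not part of rmr2.
check_hadoop_env <- function(vars = c("HADOOP_HOME", "HADOOP_CMD", "HADOOP_STREAMING")) {
  values <- unname(Sys.getenv(vars, unset = NA))
  is_set <- !is.na(values)
  data.frame(
    variable = vars,
    value = values,
    is_set = is_set,
    # Unset variables are checked as "", which never exists.
    path_exists = is_set & file.exists(ifelse(is_set, values, "")),
    stringsAsFactors = FALSE
  )
}

print(check_hadoop_env())
```

A row with is_set or path_exists equal to FALSE usually means a typo in the path or a Hadoop layout that differs from the one assumed here.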

Step 3: Now verify that RHadoop can run a MapReduce job by executing the following commands:

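library(rmr2)
# Write the integers 1 to 1000 to HDFS
small.ints <- to.dfs(1:1000)
# Map-only job: emit each integer as the key and its square as the value
out <- mapreduce(input = small.ints, map = function(k, v) keyval(v, v^2))
# Read the result back from HDFS and inspect the first rows
df <- as.data.frame(from.dfs(out))
head(df)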
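To judge whether the job returned sensible results, you can build the same table in base R and compare. This sketch runs without Hadoop; square_map() is a hypothetical stand-in for the map function above, and it assumes the data frame from the job has the columns key and val.

```r
# Base-R equivalent of the rmr2 test job, for comparison only.
# square_map() is a hypothetical stand-in for the map function; it returns
# a data frame with the key/val columns the rmr2 job is expected to produce.
square_map <- function(v) {
  data.frame(key = v, val = v^2)
}

expected <- square_map(1:1000)

# Hadoop does not guarantee output order, so sort by key before comparing.
expected <- expected[order(expected$key), ]
head(expected)
```

Sorting df the same way, with df[order(df$key), ], makes the two tables directly comparable. If the Hadoop job fails, rmr2 also offers a local backend, rmr.options(backend = "local"), which runs the same code without the cluster and can help separate R problems from Hadoop problems.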

Reference: http://blogr-cs.blogspot.in/2012/12/integration-of-r-rstudio-and-hadoop-in.html