Tuesday, December 22, 2015

XGBoost in R. Error in xgb.iter.update SoftmaxMultiClassObj

XGBoost in R. For regression, classification and ranking

XGBoost is short for the eXtreme Gradient Boosting package; it supports regression, classification and ranking.

Install the latest version of XGBoost from GitHub:

devtools::install_github('dmlc/xgboost',subdir='R-package')


Error

clf <- xgboost(
    data        = data.matrix(train[, feature.names]),
    label       = train$Survived,
    booster     = "gblinear",
    nrounds     = 20,
    objective   = "multi:softprob",
    num_class   = 2,
    eval_metric = "merror"
)
Error in xgb.iter.update(bst$handle, dtrain, i - 1, obj) : SoftmaxMultiClassObj: label must be in [0, num_class), num_class=2 but found 2 in label
Solution

The response variable needed to be converted from a factor to numeric, so that the labels become 0 and 1:

train$Survived<-as.numeric(levels(train$Survived))[train$Survived]

Also install the "libx11-dev" system package:
solai@vm1$ sudo apt-get install libx11-dev

then try again.
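A quick sanity check after the conversion (a minimal sketch; train and feature.names come from the code above) confirms the labels are 0-based, as multi:softprob requires:

#labels for multi:softprob must lie in [0, num_class); for num_class = 2 that means 0 or 1
stopifnot(all(train$Survived %in% c(0, 1)))
table(train$Survived)   #should now show only 0 and 1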


Sunday, December 20, 2015

Top 9 Rule : Code of conduct for Data Science professional

Top 9 Rule : Code of conduct for Data Science professional
previous post - Rattle : GUI for R

Top 9 Rule : Code of conduct for Data Science professional

Rule 1 - know the Terminology

Data Scientist - Client Relationship
Rule 2 - Competence
Rule 3 - Scope of Data Science Professional Services Between Client and Data Scientist
Rule 4 - Communication with Clients
Rule 5 - Confidential Information
Rule 6 - Conflicts of Interest
Rule 7 - Duties to Prospective Client
Rule 8 - Data Science Evidence, Quality of Data and Quality of Evidence

Maintaining the Integrity of the Data Science Profession
Rule 9 - Misconduct

For more details, refer to the full report.

Thursday, December 10, 2015

How to Install rattle package : Graphical User Interface for Data Mining in R

How to Install rattle package : Graphical User Interface for Data Mining in R

Steps to install the rattle package in R

First, install the "libgtk2.0-dev" package in the OS:

bigdata@solai:~$ sudo apt-get install libgtk2.0-dev

then install "RGtk2" and "rattle" in R environment
install.packages("RGtk2")
install.packages("rattle")

Load the library for the session with:
library(rattle)
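Once the library is loaded, the GUI itself is started by calling rattle():

rattle()   #opens the Rattle data-mining GUI window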

Saturday, November 28, 2015

Apache Spark : how to start worker across cluster

Apache Spark : how to start worker across cluster
previous post - working on RPostgresql

Apache Spark cluster setup. Refer to the Spark-over-YARN post if you face any issues.

How to start a worker node on newly added Spark slaves?
Spark has two slave start-up scripts under the sbin/ directory:

start-slaves.sh - starts all the workers across the slave machines. This should be run from the master node.

start-slave.sh - starts the worker daemon on an individual slave. This should be run from each slave node. Example:

sbin/start-slave.sh spark://10.184.48.55:7077
The above command needs to be run from the slave machine; here 10.184.48.55 is where the Spark master is running.
Error
shutting down Netty transport
sbin/start-slave.sh spark://10.184.48.55:7077
15/11/27 19:53:53 ERROR NettyTransport: failed to bind to /10.184.48.183:0, shutting down Netty transport
Solution
The error is due to improper configuration in /etc/hosts. Set SPARK_LOCAL_IP to point to the local worker system.

export SPARK_LOCAL_IP=127.0.0.1

OR
export SPARK_LOCAL_IP=IP_ADDR_OF_THE_SYSTEM

Friday, November 27, 2015

RPostgresql : how to pass dynamic parameter to dbGetQuery statement

RPostgresql : R and PostgreSQL Database

Working with RPostgreSQL package

How to pass a dynamic / runtime parameter to dbGetQuery in RPostgreSQL?
#use stri_paste to form a query and pass it into dbGetQuery
icd <- 'A09'
require(stringi)

qry <- stri_paste("SELECT * FROM visualisation.ipd_disease_datamart WHERE icd ='", icd, "'",collapse="")

rs1 <- dbGetQuery(con, qry)
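An equivalent sketch using base R's sprintf instead of stringi, with the same illustrative icd value:

#build the query with sprintf; icd is the runtime parameter
icd <- 'A09'
qry <- sprintf("SELECT * FROM visualisation.ipd_disease_datamart WHERE icd = '%s'", icd)
rs1 <- dbGetQuery(con, qry)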
Error
 
 Error in postgresqlNewConnection(drv, ...) : 
  RS-DBI driver: (cannot allocate a new connection -- 
maximum of 16 connections already opened)
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")

con <- dbConnect(drv, dbname="DBName", host="127.0.0.1",port=5432,user="yes",password="yes")
Solution
Close unused connections. A maximum of 16 connections can be established from R to PostgreSQL; exceeding this limit throws the error above.
##list all the connections
dbListConnections(drv)

## Closes the connection
dbDisconnect(con)

## Frees all the resources on the driver
dbUnloadDriver(drv)
#OR on.exit(dbUnloadDriver(drv), add = TRUE)
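If several connections are already open, a one-liner (a sketch assuming the same drv driver object) closes them all before retrying:

##disconnect every open RPostgreSQL connection registered with the driver
lapply(dbListConnections(drv), dbDisconnect)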


How to close/drop all the connection Postgresql session.?

We can terminate PostgreSQL connections using the "pg_terminate_backend" SQL function.
In my case I had opened 16 connections using RPostgreSQL and unfortunately forgot to release them,
so I ended up exceeding the maximum connection limit.
SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE client_addr = '10.184.36.131' and pid > 20613 AND pid <> pg_backend_pid();
In the above query, pg_stat_activity returns the list of all active connections.
I terminated only the connections from the R session made from the (client_addr) IP 10.184.36.181.

Friday, November 20, 2015

Working with Association in R : arules and arulesViz package

working with association in R : arules package
previous post - working with RHadoop

Working with association rules in R using the arules and arulesViz packages

When I tried to visualize the top five rules:
plot(highLiftRules,method="graph",control=list(type="items"))
Error in as.double(y) : 
  cannot coerce type 'S4' to vector of type 'double'
Solution
load the library "arulesViz" into R session
library(arulesViz)
During the load the library I got error,
Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : namespace "lattice" 0.20-24 is already loaded, but >= 0.20.27 is required Error: package or namespace load failed for "arulesViz"
Solution
The error says that I have an outdated package that needs to be upgraded. To check the status of the installed packages:
inst = packageStatus()$inst

inst[inst$Status != "ok", c("Package", "Version", "Status")]

#lists every package whose installed version is not up to date

old.packages()

#lists the installed version and the current version of each outdated package

unloadNamespace("lattice")
#then restart the R session will solve the error.

detach_package("lattice", TRUE)
#will unload the package with out restarting R session
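The upgrade itself is a single call (a sketch; run it from a fresh session so the old "lattice" is not locked):

#reinstall the outdated package, then restart R and reload arulesViz
install.packages("lattice")
#or upgrade everything that old.packages() reported
update.packages(ask = FALSE)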
Finally got the output of the plot.


Thursday, November 19, 2015

Working with RHadoop

working with RHadoop

working with RHadoop

Error

hdfs.ls("/tamil")

Error in .jcall("java/lang/Class",
"Ljava/lang/Class;", "forName", cl, :
No running JVM detected. Maybe .jinit() would help. Error in .jfindClass(as.character(class)) : No running JVM detected. Maybe .jinit() would help.

Solution


hdfs.init()

hdfs.ls("/")

Error

hdfs.init()
sh: 1: /media/bdalab/bdalab/sw/hadoop-2.7.1/bin: Permission denied
Error in .jnew("org/apache/hadoop/conf/Configuration") : java.lang.ClassNotFoundException In addition: Warning message: running command '/media/bdalab/bdalab/sw/hadoop-2.7.1/bin classpath' had status 126
Solution


Sys.setenv(HADOOP_CMD='/hadoop-2.7.1/bin/hadoop')
Sys.setenv(JAVA_HOME='/jdk1.8.0_60/')
To check an environment variable:
Sys.getenv("HADOOP_CMD")
hdfs.init()



RHadoop integration issues

RHadoop integration issues

Installing the RHadoop packages for working with R and Hadoop

Installing the rJava package in R
install.packages("rJava_0.9-7.tar.gz", repos = NULL)
Error
configure: error: Java Development Kit (JDK) is missing or not registered in R Make sure R is configured with full Java support (including JDK). Run R CMD javareconf as root to add Java support to R. If you don't have root privileges, run R CMD javareconf -e to set all Java-related variables and then install rJava. ERROR: configuration failed for package 'rJava' * removing '/home/bdalab/R/x86_64-pc-linux-gnu-library/3.1/rJava'
Solution
install
sudo apt-get install r-cran-rjava
Then try again. I had installed it successfully on my system but still got the above error; I then checked java -version and it was pointing to OpenJDK instead of Oracle Java HotSpot, so I changed it to HotSpot.
Error
library(rJava)
##fails only in RStudio; in the terminal it works fine

Error : .onLoad failed in loadNamespace() 
for 'rJava', details: call: dyn.load(file, DLLpath = DLLpath, ...) error: unable to load shared object
'x86_64-pc-linux-gnu-library/3.1/rJava/libs/rJava.so': libjvm.so: cannot open shared object file:
No such file or directory Error: loading failed Execution halted ERROR: loading failed
Solution
First make sure r-cran-rjava is installed (as above); if the error persists, locate libjvm.so and create a shared link:
sudo ln -s /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/libjvm.so /usr/lib/
Error
R CMD javareconf

/usr/lib/R/bin/javareconf: 405: 
/usr/lib/R/bin/javareconf: cannot create
 /usr/lib/R/etc/Makeconf.new: Permission denied
Solution
sudo -i R CMD javareconf
OR
R CMD javareconf JAVA=jdk1.8.0_60/jre/bin/java JAVA_HOME=jdk1.8.0_60/ JAVAC=jdk1.8.0_60/bin/javac JAR=jdk1.8.0_60/bin/jar JAVAH=jdk1.8.0_60/bin/javah


Wednesday, September 30, 2015

3D Data visualization using R - configure: error: X11 not found but required. missing required header GL/gl.h

3D Data visualization using R

3D Data Visualization using R: using the rgl package, rglplot plots a graph in 3D. The plot can be zoomed, rotated and shifted, but the coordinates of the vertices are fixed.

Installing RGL package in R

install.packages("rgl")


Error

configure: using libpng-config configure: using libpng dynamic linkage checking for X... no configure: error: X11 not found but required, configure aborted. ERROR: configuration failed for package ‘rgl’ * removing ‘/home/bdalab/R/x86_64-pc-linux-gnu-library/3.1/rgl’ Warning in install.packages :
Solution

install
sudo apt-get install xorg
sudo apt-get install libx11-dev

then try again.
Error

In my case I then got:
configure: error: missing required header GL/gl.h
Solution

install
sudo apt-get install libglu1-mesa-dev
then try again.
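Once the package installs cleanly, a minimal sketch (using the iris dataset that ships with R) confirms the 3D device works:

library(rgl)
#interactive 3D scatter plot: drag to rotate, scroll to zoom
plot3d(iris$Sepal.Length, iris$Petal.Length, iris$Petal.Width,
       col = as.integer(iris$Species), size = 5,
       xlab = "Sepal.Length", ylab = "Petal.Length", zlab = "Petal.Width")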

Next : RHadoop

Friday, September 4, 2015

Running SQL Query on Hadoop : Apache Hive Alternatives

Running SQL Query on Hadoop : Apache Hive Alternatives

Hive is an SQL-programmer-friendly tool for running SQL queries on the Hadoop HDFS file system. While running a query, Hive converts the SQL-like query into MapReduce.

Hive is not the only tool that does this. This post gives a synopsis of the open-source alternatives to Hive.


1) Spark SQL (previously Shark - SQL on Spark) - arguably the best alternative to Hive on Spark. Spark SQL is Spark's module for working with structured data.
2) Cloudera Impala - like Hive, but it uses its own execution daemons, which need to be installed on every DataNode in the Hadoop cluster. Impala does BI-style queries on Hadoop.
3) Facebook Presto - like Impala, it needs to be installed on all DataNodes. Presto is an open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.
4) Apache Drill - schema-free SQL for Hadoop. It supports multiple datastores: HDFS, MongoDB and HBase.


Monday, August 31, 2015

Installing Mahout with Apache Spark 1.4.1 : Issues and Solution

Installing Mahout with Apache Spark 1.4.1 : Issues and Solution

In this blog I will discuss the errors you may get during the installation and how to resolve them.

The errors listed here are in the sequence in which I hit them during my installation.

Cannot find Spark class path. Is 'SPARK_HOME' set?

cd $MAHOUT_HOME

bin/mahout spark-shell

Got error Cannot find Spark class path. Is 'SPARK_HOME' set?

Solution
The issue is in the bin/mahout file: it points to compute-classpath.sh under the $SPARK_HOME/bin directory, but my $SPARK_HOME/bin did not contain any such file.

Add compute-classpath.sh under the $SPARK_HOME/bin directory.

In my case I just copied it from an older version, i.e. Spark 1.1.


ERROR: Could not find mahout-examples-*.job

cd $MAHOUT_HOME

bin/mahout spark-shell

ERROR: Could not find mahout-examples-*.job in /media/bdalab/bdalab/sw/mahout or /media/bdalab/bdalab/sw/mahout/examples/target, please run 'mvn install' to create the .job file

Solution
Set the MAHOUT_LOCAL variable to true to avoid the error:

export MAHOUT_LOCAL=true


Error: Could not find or load main class org.apache.mahout.driver.MahoutDriver

cd $MAHOUT_HOME

bin/mahout spark-shell

MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.

MAHOUT_LOCAL is set, running locally

Error: Could not find or load main class org.apache.mahout.driver.MahoutDriver

Solution
It indicates that the Mahout driver needs to be built and installed:

root@solai[bin]# mvn -DskipTests -X clean install

[INFO] Scanning for projects...

[INFO] ------------------------------

[ERROR] BUILD FAILURE

[INFO] ---------------------------------

[INFO] Unable to build project '/media/bdalab/bdalab/sw/mahout/pom.xml; it requires Maven version 3.3.3
I downloaded the latest version of Maven (3.3.3) from the repository, unpacked it, and ran the previous command using the new Maven bin:

root@solai[bin]# $MAVEN_HOME/bin/mvn -DskipTests -X clean install

org.apache.maven.enforcer.rule.api.EnforcerRuleException: Detected JDK Version: 1.8.0-60 is not in the allowed range [1.7,1.8).
Then I changed Java from 1.8.0_60 to 1.7. Next I got this error:
root@solai[bin]# $MAVEN_HOME/bin/mvn -DskipTests -X clean install

[INFO] Mahout Build Tools ..... SUCCESS [02:42 min]

[INFO] Apache Mahout ..... SUCCESS [ 0.041 s]

[INFO] Mahout Math ......FAILURE [01:45 min]

[INFO] Mahout HDFS ........ SKIPPED

[INFO] Mahout Map-Reduce ..... SKIPPED

[INFO] Mahout Integration ..... SKIPPED

[INFO] Mahout Examples .........SKIPPED

[INFO] Mahout Math Scala bindings ..... SKIPPED

[INFO] Mahout H2O backend ...... SKIPPED

[INFO] Mahout Spark bindings ..... SKIPPED

[INFO] Mahout Spark bindings shell ..... SKIPPED

[INFO] Mahout Release Package ..... SKIPPED

Caused by: org.eclipse.aether.transfer.ArtifactTransferException: Could not transfer artifact org.apache.maven:maven-core:jar:2.0.6 from/to central (https://repo.maven.apache.org/maven2): GET request of: org/apache/maven/maven-core/2.0.6/maven-core-2.0.6.jar from central failed
I suspected the error was caused by a networking issue, so I ran the same command again.
As I guessed, the installation completed successfully.

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "main"
After successful installation I was trying to get the mahout> prompt:

cd $MAHOUT_HOME

bin/mahout spark-shell

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "main"

Solution
export JAVA_TOOL_OPTIONS="-Xmx2048m -XX:MaxPermSize=1024m -Xms1024m"

Sunday, August 30, 2015

Installing Mahout on Spark 1.4.1

Installing Mahout and Spark

In this blog I will describe the steps to install Mahout with Apache Spark 1.4.1 (the latest version), and list the possible errors and remedies.

Installing Mahout & Spark on your local machine

1) Download Apache Spark 1.4.1 and unpack the archive file

2) Change to the directory where you unpacked Spark and type sbt/sbt assembly to build it

3) Make sure the right version of Maven (3.3) is installed on your system. If not, install mvn before building Mahout

4) Create a directory for Mahout somewhere on your machine, change to it and check out the master branch of Apache Mahout from GitHub: git clone https://github.com/apache/mahout mahout

5) Change to the mahout directory and build mahout using mvn -DskipTests clean install


Starting Mahout's Spark shell

1) Go to the directory where you unpacked Spark and type sbin/start-all.sh to start Spark locally

2) Open a browser, point it to http://localhost:8080/ to check whether Spark successfully started. Copy the url of the spark master at the top of the page (it starts with spark://)

3) Define the following environment variables:

export MAHOUT_HOME=[directory into which you checked out Mahout]

export SPARK_HOME=[directory where you unpacked Spark]

export MASTER=[url of the Spark master]

4) Finally, change to the directory where you unpacked Mahout and type bin/mahout spark-shell, you should see the shell starting and get the prompt mahout>



The next blog will discuss the possible errors while installing Mahout, with solutions.

Next : Resolved issues - Installing Mahout 0.11.0 with Spark 1.4.1

Wednesday, July 29, 2015

what india said at #KalamSir - social media reflection about Former president Dr.APJ Kalam

Here is a snippet of what people were saying about former president Dr. APJ Kalam on the micro-blogging site Twitter.



The word cloud image above highlights "amazing", "scientist", "best" and "india"; most of the tweets mentioned Dr. APJ Kalam.

Next to that, many tweets also mentioned words like "rip", "martyr", "great" and "human".

note : Due to a restriction on my Twitter account, I got only 1500 tweets from 27/07/2015. The word cloud posted above is based on those 1500 tweets.

kalasir_rip = searchTwitter("#KalamSir", n=500000, lang="en" , since='2015-07-27' , until='2015-07-28')
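For reference, a rough sketch of how a word cloud like the one above can be built from the tweets, assuming the twitteR, tm and wordcloud packages (object names are illustrative):

library(twitteR)
library(tm)
library(wordcloud)

#pull the text out of the status objects returned by searchTwitter
tweet_text <- sapply(kalasir_rip, function(t) t$getText())

corpus <- Corpus(VectorSource(tweet_text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), "kalamsir"))

#term frequencies drive the word cloud
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordcloud(names(freq), freq, max.words = 100, random.order = FALSE)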

Thursday, May 14, 2015

Hive configuration at Zeppelin - error and way to solve

Hive configuration at Zeppelin - error and way to solve

I have successfully compiled the latest version of Zeppelin from source. You can get the Apache Zeppelin binary on request: give your mail id and I will provide a link to download it.

The Hive interpreter is not available in my Apache Zeppelin; what should I do?

Install Hive and Set HIVE_HOME.

If you have installed Hive on your system, Zeppelin will automatically recognize and list the Hive interpreter. If Hive is installed but no interpreter is listed in Zeppelin, check HIVE_HOME and set it properly.

If you have not installed Hive, install it; Zeppelin will pick it up automatically on the next restart.

Error

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
Solution

Add export HADOOP_USER_CLASSPATH_FIRST=true in HIVE_HOME/conf/hive-env.sh


Thursday, May 7, 2015

Configuring Hive at HUE (Hadoop ui)

Hive configuration at HUE
previous post - Hue installation error

I have successfully compiled the latest version of Hue from source. When I accessed it at http://localhost:8000/ I saw a warning that the Hive editor was not configured.

Hive Editor - The application won't work without a running HiveServer2.

to start Hive

bdalab@solai:/opt$ $HIVE_HOME/bin/hive --service hiveserver2

in another terminal

bdalab@solai:/opt$ $HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10000

I got a permission denied error.

Error

Error: Failed to open new session: java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=APP, access=EXECUTE, inode="/tmp/hive/APP/704d9b8d-a9d5-4be6-a8f1-082e8eba9a0c":hdfs:supergroup:drwx------
Solution

Then I changed the Derby username (javax.jdo.option.ConnectionUserName) from APP to hdfs in hive-default.xml under HIVE_HOME/conf.

Wednesday, May 6, 2015

Hue installation Error

Hue installation Error

I have been installing a recent version of HUE from source. During compilation I hit a few issues; here are the steps to resolve them.

Error

    Unable to get dependency information: Unable to read the metadata file for artifact 'com.sun.jersey:jersey-core:jar': Cannot find parent: net.java:jvnet-parent for project: com.sun.jersey:jersey-project:pom:1.9 for project com.sun.jersey:jersey-project:pom:1.9 com.sun.jersey:jersey-core:jar:1.9

....
Path to dependency: 1) com.cloudera.hue:hue-plugins:jar:3.8.1-SNAPSHOT
2) org.apache.hadoop:hadoop-client:jar:2.6.0-mr1-cdh5.5.0-SNAPSHOT
3) org.apache.hadoop:hadoop-hdfs:jar:2.6.0-cdh5.5.0-SNAPSHOT


Solution

Update Maven. By default mvn -version shows 2.x, but Hue needs Maven version 3 or above.
Download, extract and install Maven 3 (if not already installed) and add Maven to PATH:

export PATH=$PATH:$MAVEN_HOME/bin

Even after I updated the PATH variable with the latest Maven, mvn -version still showed 2.x,
so I manually updated Maven using update-alternatives:

  bdalab@solai:/opt$ sudo update-alternatives --install /usr/bin/mvn mvn $MAVEN_HOME/bin/mvn 1

  bdalab@solai:/opt$ sudo update-alternatives --config mvn

Now select the number referring to the recent Maven 3 installation from the list of choices.

Error

    Failed to obtain user group information: org.apache.hadoop.security.authorize.AuthorizationException: Unauthorized connection for super-user: hue (error 401)
Solution

Configure Hue as a proxy user for all other users and groups, meaning it may submit requests on behalf of any other user. Add the properties below to core-site.xml within the configuration tags:
<property>
  <name>hadoop.proxyuser.hue.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hue.groups</name>
  <value>*</value>
</property>

Thursday, April 23, 2015

return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

I created a partitioned table in Hive. When I did an insert into the partitioned table I was stuck with this error:

Error

    Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

....
[Fatal Error] Operator FS_2 (id=2): Number of dynamic partitions exceeded hive.exec.max.dynamic.partitions.pernode.

The error in the first line was a little confusing; if you scroll up the console you will find the second line, which was the real cause of the exception.

Solution

  bdalab@solai:/opt$ hive

  hive> set hive.exec.max.dynamic.partitions.pernode=500;

By default hive.exec.max.dynamic.partitions.pernode is set to 100; if the number of partitions exceeds that limit, you will get this error. Just change the default value based on your requirement to get rid of it.

Wednesday, April 22, 2015

how to : Execute HDFS commands from DataNode

how to : Work on NameNode (HDFS) from DataNode

These commands operate on the NameNode from a DataNode or any other Hadoop-installed system (which may or may not be part of the Hadoop cluster).

Execute an HDFS 'fs' command from a DataNode

    bdalab@solai:/opt$ hadoop fs -fs hdfs://masterNodeIP:9000/ -rm /input/log.csv


The above command is executed from the DataNode; the file '/input/log.csv' is removed from the NameNode.
here, masterNodeIP -> IP address of the remote (NameNode) system

List/show all the files in NameNode from DataNode

    bdalab@solai:/opt$ hadoop fs -fs hdfs://masterNodeIP:9000/ -ls /


Create dir 'pjt' in NameNode from DataNode

    bdalab@solai:/opt$ hadoop fs -fs hdfs://masterNodeIP:9000/ -mkdir /pjt

All the above commands are run from the DataNode and executed on the NameNode.

Tuesday, April 21, 2015

How to : Move the file from DataNode to NameNode

1 step to Move the file from DataNode to NameNode


In some cases we need to move a file that is available on a DataNode, but not yet in HDFS, into the HDFS cluster.

The same command is also useful when you need to move a file from any Hadoop-installed system (which may or may not be part of the Hadoop cluster) to a NameNode.

move file from DataNode system to Hadoop cluster

    bdalab@solai:/opt$ hadoop fs -fs NameNodeIP:9000/ -put /FileToMove /hadoopFolderName/


here,
NameNodeIP -> IP address of NameNode system
FileToMove -> is a file to be moved to HDFS

OR
    bdalab@solai:/opt$ hadoop fs -fs hdfs://10.0.18.269:9000/ -put /FileToMove /hadoopFolderName/

Friday, April 17, 2015

1 step to Move the file from Hadoop HDFS to remote system

1 step to Move the file from Hadoop HDFS to remote system

command to move the file from Hadoop HDFS cluster to remote system.

In some cases we need to move a MapReduce output file from
Hadoop HDFS to a system where Hadoop is not installed.

move file from Hadoop cluster to remote system ( Non Hadoop system )

    bdalab@solai:/opt$ hadoop dfs -cat hdfs://NameNodeIP:9000/user/part-* | ssh userName@remoteSystemIP 'cat - > /home/hadoop/MRop'


here,
part-* -> is a file to be moved from HDFS
userName -> userName of remote system
remoteSystemIP -> IP address of remote system

1 step to Move the file from remote system to Hadoop HDFS

1 step to Move the file from remote system to Hadoop HDFS
1 command to move the file from remote system to Hadoop HDFS cluster.

In some cases we need to move a file from a system without Hadoop
installed into the Hadoop cluster.

move file from Non Hadoop system to Hadoop cluster

    bdalab@solai:/opt$ cat moveToHdfs.txt | ssh userName@NameNodeIP "hadoop dfs -put - hadoopDirName/"


here,
moveToHdfs.txt -> is a file to be moved to HDFS
userName -> userName of NameNode system
NameNodeIP -> IP address of NameNode
hadoopDirName -> Dir in HDFS


If you face any error like

    bash: hadoop: command not found
re-run the above command with the full path to hadoop:

    bdalab@solai:/opt$ cat moveToHdfs.txt | ssh userName@NameNodeIP "/opt/hadoop-2.6.0/bin/hadoop dfs -put - hadoopDirName/"

Saturday, April 11, 2015

4 best tools for Big Data visualization

4 best tools for Big Data analytics and visualization
Visualization will play the major role in big data analytics. The human role in
visualization is limited to:
      identifying visual patterns and anomalies
      seeing patterns across groups.
Once you have done big data analytics using your favourite tools (Hadoop,
Spark or machine learning), the next step is to impress the customer with dashboards/graphics that support better business decisions.


Big Data visualization tools from Apache

Zeppelin

Kylin

Note: both are currently Apache Incubator projects.


Other interesting tools



I have experience with Gephi, and am now working with the Apache Incubator projects Zeppelin and Kylin. I will post updates with a working model.

Tuesday, April 7, 2015

Configure UBER mode - MapReduce job for small dataset

Uber job configuration in YARN - Hadoop2
previous post - what is Uber mode

How to configure an uber job?

    To enable uber jobs, set the following properties in yarn-site.xml:
    mapreduce.job.ubertask.enable=true
    mapreduce.job.ubertask.maxmaps=9 (default 9)
    mapreduce.job.ubertask.maxreduces=0 (default 1)
    mapreduce.job.ubertask.maxbytes=4096

mapreduce.job.ubertask.maxbytes

    The value above is 4 MB; the default value is the block size. The total input size of a job must be less than or equal to this value for the job to be uberized.
   For example, if you have a dataset of 5 MB but have set mapreduce.job.ubertask.maxbytes to 4 MB, then uber mode will not be enabled.
    If you omit this property, the block size is used by default (12 MB here), so a 50 MB dataset will not run in uber mode.

UBER mode in YARN Hadoop2 - Running MapReduce jobs in small dataset

what is Uber mode in YARN - Hadoop2

You might have seen this line while running MapReduce in Hadoop 2:
mapreduce.Job: Job job_1387204213494_0005 running in uber mode : false

what is UBER mode in Hadoop2?

    Normally, mappers and reducers run in containers requested from the ResourceManager (RM); the RM creates separate containers for the mappers and reducers.
    The uber configuration allows the mappers and reducers to run in the same process as the ApplicationMaster (AM).

Uber jobs :

    Uber jobs are jobs executed within the MapReduce ApplicationMaster, rather than communicating with the RM to create the mapper and reducer containers.
    The AM runs the map and reduce tasks within its own process and avoids the overhead of launching and communicating with remote containers.
Why

    If you have a small dataset and want to run MapReduce on a small amount of data, the uber configuration will help you by cutting the extra time MapReduce normally spends launching the mapper and reducer containers.
Can I configure uber mode for all MapReduce jobs?

    As of now, only
       map-only jobs and
       jobs with a single reducer are supported.

Saturday, April 4, 2015

Easy way to recover the deleted files/dir in Hadoop hdfs

Easy way to recover the deleted files/dir in hdfs
In some cases files or directories get deleted accidentally.
Is there any way to recover them?

   By default Hadoop deletes files/directories permanently. It has a Trash feature, which is not enabled by default.

   Configuring fs.trash.interval and fs.trash.checkpoint.interval in Hadoop's core-site.xml will move deleted files/directories into the .Trash folder.

   The .Trash folder is located in HDFS at /user/$USER/.Trash

configuring core-site.xml

<property>
<name>fs.trash.interval</name>
<value>120</value> 
</property>

<property>
<name>fs.trash.checkpoint.interval</name>
<value>45</value>
</property>

   With the above configuration, all deleted files/directories are moved to the .Trash folder and kept for two hours.
   A checkpoint check is performed every 45 minutes and deletes from the .Trash folder every file/directory that is more than 2 hours old.

Restart Hadoop
    Once you modify core-site.xml, stop and start Hadoop.

   Here is an example of the remove-directory command:
hadoop@solai# hadoop fs -rmr /testTrash
15/04/05 01:10:14 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 120 minutes, Emptier interval = 45 minutes. Moved: 'hdfs://127.0.0.1:9000/testTrash' to trash at: hdfs://127.0.0.1:9000/user/bdalab/.Trash/Current
   You can clearly see the message saying that the deleted folder is moved to /user/bdalab/.Trash/Current, and the data is kept for 2 hours with a checkpoint interval of 45 minutes.

List the deleted files/dir in the .Trash folder using -ls:
hadoop@solai# hadoop fs -ls hdfs://127.0.0.1:9000/user/bdalab/.Trash/Current/testTrash
You can view the content (with -cat) or move the files back to their original path:
hadoop@solai# hadoop fs -mv hdfs://127.0.0.1:9000/user/bdalab/.Trash/Current/testTrash /testTrash

3 simple steps to resolve linux read-only file system to read-write - ubuntu

3 simple steps to resolve linux read-only file system to read-write - ubuntu
My internal partition, which had read-write permission earlier,
turned back to read-only mode.

issue

When I tried to change it to full permission with
root@boss[bin]# sudo chmod 777 -R /media/bdtools
chmod complained about changing permissions on a read-only file system.
Then I tried to re-mount the partition, as many people had discussed on the web:
root@boss[bin]# mount -o remount,rw /dev/sda9 /media/bdalab/bdtools
and the filesystem/drive was reported as write-protected.
Solution
I went through the dmesg log and saw an error like:
ext4_put_super:792: Couldn't clean up the journal

I followed the approach below to overcome the read-only filesystem issue:
1) unmount the partition
2) fsck /dev/sda9
3) remount the partition


Note : before running fsck, it is advisable to understand what it does first.


Tuesday, March 3, 2015

Hive 1.0 Status KILLED Aggregation is not enabled

HIVE partition : Status KILLED Aggregation is not enabled
While creating partitions in Hive, the job was killed frequently with status KILLED.
I went through the log file and found the actual root cause of the issue.

Error 1)

#ERROR

1) Status KILLED Aggregation is not enabled. Try the nodemanager at localhost:54455

2) return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask MapReduce Jobs Launched:

3) [Fatal Error] Operator FS_2 (id=2): Number of dynamic partitions exceeded hive.exec.max.dynamic.partitions.pernode.

Solution
The error shows that the number of dynamic partitions on a single node was exceeded. I increased the number of partitions per node to 200 (by default it is 100):
root@solai[bin]# hive
hive> set hive.exec.max.dynamic.partitions.pernode=200;
The above setting applies only to the current session; once you exit the session, the hive.exec.max.dynamic.partitions.pernode property reverts to 100.
If you want the change to be permanent, open the hive-site.xml file under $HIVE_HOME/conf and set the property there:


root@solai[bin]# sudo vim.tiny $HIVE_HOME/conf/hive-site.xml

Hadoop 2.5.x/2.6.x Multi Node cluster setup on Ubuntu / lubuntu

Thursday, February 5, 2015

Sentiment Analysis : YennaiArinthal Tamil movie review

Sentiment Analysis on YennaiArinthal, Thala 55 movie review
In this blog I will show sentiment analysis of the Thala 55 movie, YennaiArinthal, from the Twitter hashtag #YennaiArinthal.

#After successful authentication, get the tweets:
bdalab@bdalabsys:/$ ya = searchTwitter("#YennaiArinthal", n=500, lang="en")
#here I have taken 500 tweets; they were analysed at 12:15 on 05/02/2015

#After necessary data cleansing, use a Naive Bayes classifier to classify each tweet into one of 7 emotions:
#anger, disgust, fear, joy, sadness, surprise (plus a best_fit column)
#In the next step we plot the obtained results (a data frame) using ggplot2

Detailed step-by-step instructions for sentiment analysis with R will be posted soon;
for now my ultimate aim is to show the review graph.
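Until that detailed post is ready, here is a rough sketch of the classification and plotting steps, assuming the (now-archived) sentiment package and a cleaned character vector ya_text of tweet text; the object names are illustrative:

library(sentiment)   #archived CRAN package providing classify_emotion()
library(ggplot2)

#Naive Bayes classification of each tweet into an emotion
emo <- classify_emotion(ya_text, algorithm = "bayes", prior = 1.0)
emotion <- emo[, "BEST_FIT"]
emotion[is.na(emotion)] <- "unknown"

#bar chart of the emotion distribution
df <- data.frame(emotion = emotion)
ggplot(df, aes(x = emotion)) +
  geom_bar() +
  xlab("Emotion") + ylab("Number of tweets") +
  ggtitle("#YennaiArinthal tweet emotions")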

Sunday, January 4, 2015

Setup Apache Flink on cluster mode - Ubuntu

Configure / Setup Apache Flink on Hadoop cluster - ubuntu
This is a continuation of the previous post on installing Flink locally. In this blog we will see how to set up Apache Flink on a cluster with Hadoop; once that is done, we will execute/run a Flink job on files stored in HDFS.

If you are new to Hadoop, see the earlier post on setting up a Hadoop cluster.

#start Hadoop cluster- HDFS
bdalab@bdalabsys: HADOOP_HOME/$ ./sbin/start-dfs.sh
#start Hadoop cluster- YARN MR2
bdalab@bdalabsys: HADOOP_HOME/$ ./sbin/start-yarn.sh
Make sure all the Hadoop daemons are up and running.

#Download the latest Flink (matching to your Hadoop version) and un-tar the file.
bdalab@bdalabsys:/$ tar -xvzf flink-0.8-incubating-SNAPSHOT-bin-hadoop2.tgz
#rename the folder
bdalab@bdalabsys:/$ mv flink-0.8-incubating-SNAPSHOT/ flink-0.8
#move the working dir into flink_home
bdalab@bdalabsys:/$ cd flink-0.8
#similar to the HDFS configuration, edit the file $FLINK_HOME/conf/slaves and enter the IP/hostname of each worker node

#Enable password-less ssh from the master to all the slaves
bdalab@bdalabsys:flink-0.8/$ ssh-keygen -t rsa -P ""
bdalab@bdalabsys:flink-0.8/$ ssh-copy-id -i /home/bdalab/.ssh/id_rsa.pub bdalab@slave1
repeat the last step for every slave mentioned in the conf/slaves file

#run flink on cluster mode
bdalab@bdalabsys:flink-0.8/$ ./bin/start-cluster.sh
....
Starting job manager
Starting task manager on host bdalabsys
.....
#The JobManager is started by the above command. Check the status with:
bdalab@bdalabsys:flink-0.8/$ jps
6740 Jps
6725 JobManager
6895 TaskManager

Flink cluster mode works with both local files and HDFS. If you want to process
data from HDFS, make sure all the HDFS daemons are up and running.