Saturday, December 27, 2014

Apache Flink setup on Ubuntu

    Apache Flink

  • Combines features from RDBMS (query optimization capabilities)
    and MapReduce (scalability)
  • Write like a programming language, execute like a database
  • Like Spark, Flink has an execution engine that aggressively uses
    in-memory execution, but gracefully degrades to
    disk-based execution when memory is not sufficient
  • Flink supports storage systems such as HDFS, HBase, local FS, S3, and JDBC.
  • Runs in local, cluster, and YARN modes
In this blog we will see how to set up Apache Flink in local mode;
once that is done, we will run a Flink job on files stored in HDFS.

#Download the latest Flink and un-tar the file.
bdalab@bdalabsys:/$ tar -xvzf flink-0.8-incubating-SNAPSHOT-bin-hadoop2.tgz
#rename the folder
bdalab@bdalabsys:/$ mv flink-0.8-incubating-SNAPSHOT/ flink-0.8
#move the working dir into flink_home
bdalab@bdalabsys:/$ cd flink-0.8
#start Flink on local mode
bdalab@bdalabsys:flink-0.8/$ ./bin/start-local.sh
#The JobManager is started by the above command. Check its status by
bdalab@bdalabsys:flink-0.8/$ jps
6740 Jps
6725 JobManager
#The JobManager web UI starts by default on port 8081.
Now we have everything up and running; let's try to run a job.
As we are all aware of the familiar WordCount example in distributed
computing, let's begin with WordCount in Flink.

#The *-WordCount.jar file is available under $FLINK_HOME/examples
bdalab@bdalabsys:flink-0.8/$ bin/flink run examples/flink-java-examples-0.8-incubating-SNAPSHOT-WordCount.jar /home/ipPath /home/flinkop
The above command reads the input file from the local file system and stores
the result back to the local file system.

#If we want to process the same in HDFS
bdalab@bdalabsys:flink-0.8/$ bin/flink run examples/flink-java-examples-0.8-incubating-SNAPSHOT-WordCount.jar hdfs://localhost:9000/ip/tvvote hdfs://localhost:9000/op/
Make sure the HDFS daemons are up and running, otherwise you will get an error.
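As a quick sanity check (a hedged helper, not part of the original post, assuming $HADOOP_PREFIX points to your Hadoop installation):
#verify that the HDFS daemons are running; start DFS if they are missing
bdalab@bdalabsys:flink-0.8/$ jps | grep -E 'NameNode|DataNode'
bdalab@bdalabsys:flink-0.8/$ $HADOOP_PREFIX/sbin/start-dfs.sh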
#bin/flink has 4 major actions:
  • run #runs a program
  • info #displays information about a program
  • list #lists running (-r) and scheduled (-s) programs
  • cancel #cancels a running program by its job ID (-i)
#Display the running and scheduled job IDs by
bdalab@bdalabsys:flink-0.8/$bin/flink list -r -s
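For completeness, a hedged sketch of the other actions; the jar is the same example jar used above, the job ID placeholder is hypothetical, and exact flags may vary between Flink versions (check bin/flink --help):
#display information about a program
bdalab@bdalabsys:flink-0.8/$ bin/flink info examples/flink-java-examples-0.8-incubating-SNAPSHOT-WordCount.jar /home/ipPath /home/flinkop
#cancel a running job by the ID reported by "bin/flink list -r"
bdalab@bdalabsys:flink-0.8/$ bin/flink cancel -i <jobID>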


In the next blog I will explain how to set up Flink in cluster mode.

Tuesday, December 16, 2014

Simple way to configure MySQL or PostgreSQL RDBMS as the Hive metastore


Hive stores its metadata (i.e., table and column information, much like an RDBMS does)
outside of HDFS, while it processes the data that lives in HDFS.

By default Hive stores its metastore in Derby, a lightweight database
that serves only a single instance at a time. If you try to start multiple instances of Hive, you will get an error like
"Another instance of Derby may have already booted the database".

In this post we will see how to configure another RDBMS (MySQL or PostgreSQL) as the Hive metastore.


Copy / rename hive-default.xml.template to hive-site.xml under $HIVE_HOME/conf and edit it:
hadoop@solai# vim.tiny $HIVE_HOME/conf/hive-site.xml


Change the values of the following properties:



<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hivedb</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>mysqlroot</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive@123</value>
</property>




Download the MySQL JDBC connector, extract it, and place "mysql-connector-java-5.x.xx-bin.jar" in $HIVE_HOME/lib
hadoop@solai# mv /home/hadoop/Downloads/mysql-connector-java-5.1.31-bin.jar $HIVE_HOME/lib
In MySQL, create the database "hivedb" and load the Hive schema into it:
mysql> create database hivedb;
mysql> use hivedb;

## following will create hive schema in mysql database.
mysql> SOURCE $HIVE_HOME/scripts/metastore/upgrade/mysql/hive-schema-0.12.0.mysql.sql

It is important to restrict the user from altering / deleting hivedb.
mysql> CREATE USER 'mysqlroot'@'localhost' IDENTIFIED BY 'hive@123';

mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'mysqlroot'@'localhost';

mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON hivedb.* TO 'mysqlroot'@'localhost';

mysql> FLUSH PRIVILEGES;

mysql> quit;
Enter the Hive CLI and create a table:
hadoop@solai#$HIVE_HOME/bin/hive

hive> create table testHiveMysql(uname string, uplace string);
Enter MySQL to check the schema information created from the Hive side. The following queries return the table and column information:
mysql> select * from TBLS;
mysql> select * from COLUMNS_V2;
mysql> show tables;
"show tables" will return all the tables pertaining to the Hive schema.
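Since the title also mentions PostgreSQL, here is a minimal sketch of the equivalent PostgreSQL setup. It assumes a local PostgreSQL server, a bundled PostgreSQL schema script for your Hive version, and example user/database names; adjust everything to your environment.
## create the metastore database and a restricted user (names are examples)
hadoop@solai# sudo -u postgres createuser -P hiveuser
hadoop@solai# sudo -u postgres createdb -O hiveuser hivedb
## load the Hive schema shipped with your Hive release (script name/version may differ)
hadoop@solai# sudo -u postgres psql -d hivedb -f $HIVE_HOME/scripts/metastore/upgrade/postgres/hive-schema-0.12.0.postgres.sql
## in hive-site.xml, swap the MySQL values for:
##   javax.jdo.option.ConnectionURL        -> jdbc:postgresql://localhost:5432/hivedb
##   javax.jdo.option.ConnectionDriverName -> org.postgresql.Driver
## and place the PostgreSQL JDBC driver jar under $HIVE_HOME/lib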

Thursday, November 27, 2014

Upgrade Hadoop to the latest version - simple steps

Upgrade the Hadoop NameNode to the latest version - simple steps

Here I've listed a few simple steps to upgrade the Hadoop NameNode without losing existing data in the cluster.

It's advisable to take a backup of the Hadoop metadata stored under the dfs.namenode.name.dir (or dfs.name.dir on older releases) directory.
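For example, a hedged sketch of such a backup; the metadata path and backup location below are assumptions, so substitute whatever your dfs.namenode.name.dir actually points to:
# stop HDFS first so the fsimage/edits are quiescent, then archive the NameNode dir
$HADOOP_PREFIX/sbin/stop-dfs.sh
tar -czf /backup/namenode-metadata-$(date +%F).tar.gz /app/hadoop2/namenode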


Steps

1) stop-yarn.sh


2) stop-dfs.sh

3) Download and configure the latest version of Hadoop

4) cd $HADOOP_PREFIX/etc/hadoop
    in hdfs-site.xml,
       change dfs.namenode.name.dir and (in the case of a pseudo-distributed node, also) dfs.datanode.data.dir to point to the old Hadoop version's paths

5) ./sbin/hadoop-daemon.sh start namenode -upgrade

6) You will see the following message in the web UI (namenodeIP:50070): "Upgrade in progress. Not yet finalized." and safe mode is ON

7) ./bin/hdfs dfsadmin -finalizeUpgrade

8) Investigate the NameNode log, which should contain information like:
 
Upgrade of local storage directories.
   old LV = -57; old CTime = 0.
   new LV = -57; new CTime = 1417064332016


9) Safe mode will go off automatically once you complete all these steps.

10) start the DFS
    ./sbin/start-dfs.sh --config $HADOOP_PREFIX/etc/hadoop

11) start the Yarn
    ./sbin/start-yarn.sh --config $HADOOP_PREFIX/etc/hadoop

Friday, October 17, 2014

Error and Solution : Detailed step-by-step instructions on Spark over YARN - Part 2

Exceptions during Apache Spark deployment
This is a continuation post; find Spark issues part 1 here.

I have a Hadoop cluster set up and decided to deploy Apache Spark over YARN.
As a test case I tried different options to submit a Spark job.
Here I have discussed a few exceptions / issues encountered during
Spark deployment on YARN.

Error 1)

:19: error: value saveAsTextFile is not a member of Array[(String, Int)] arr.saveAsTextFile("hdfs://localhost:9000/sparkhadoop/sp1")

Step to reproduce

val file = sc.textFile("hdfs://master:9000/sparkdata/file2.txt")

val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

val arr = counts.collect()

arr.saveAsTextFile("hdfs://master:9000/sparkhadoop/sp1")

Solution

The error is caused by the bolded line above. It comes from trying to store a plain Scala array to HDFS: saveAsTextFile is a method on Spark RDDs (Resilient Distributed Datasets), not on arrays. In this case just convert the array back into an RDD (replace the bolded line with the line below); equivalently, you can skip the collect() and call saveAsTextFile directly on the counts RDD.
sc.makeRDD(arr).saveAsTextFile("hdfs://master:9000/sparkhadoop/sp1")


Error 2)

When I ran the above WordCount example, I also got this error:
WARN TaskSetManager: Lost task 1.1 in stage 5.0 (TID 47, boss): org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1474416393-10.184.36.194-1406741613979:blk_1073742690_1866 file=sparkdata/file2.txt

Solution

I was reading data from HDFS and my DataNode was down. I just started the DataNode alone by

root@boss:/opt/hadoop-2.2.0# ./sbin/hadoop-daemon.sh start datanode


Error 3)
My NodeManager kept going down. I tried many times to start it up by

root@solaiv[hadoop-2.5.1]# ./sbin/yarn-daemon.sh start nodemanager

FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager java.lang.NoClassDefFoundError: org/apache/hadoop/http/HttpServer2$Builder
Solution

     I checked the hadoop classpath
root@boss:/opt/hadoop-2.5.1# ./bin/hadoop classpath
A few jar files were still referring to the old version of Hadoop, i.e. hadoop-2.2.0. Corrected by
pointing HADOOP_HOME to the latest hadoop-2.5.1 installation.
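A hedged sketch of that correction (paths are taken from this post; adjust them to your layout):
# point the environment at the new install and re-check the classpath
root@boss:/opt/hadoop-2.5.1# export HADOOP_HOME=/opt/hadoop-2.5.1
root@boss:/opt/hadoop-2.5.1# export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
root@boss:/opt/hadoop-2.5.1# ./bin/hadoop classpath | tr ':' '\n' | grep hadoop-2.2.0   # should print nothing now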

Related posts

Few more issues Apache Spark on Yarn

Error and Solution : Detailed step-by-step instructions on Spark over YARN - Part 1

Apache SPARK deployment on YARN
I have a Hadoop cluster set up and decided to deploy Apache Spark over YARN.
As a test case I tried different options to submit a Spark job.
Here I have discussed a few exceptions / issues encountered during
Spark deployment on YARN.

Error 1)
Initially I connected to Spark via the shell using "local" mode; everything worked great.
root@boss:/opt/spark-1.1.0-bin-hadoop2.4# ./bin/spark-shell --master local[2]
When I tried to connect via "master" (standalone) mode,
root@boss:/opt/spark-1.1.0-bin-hadoop2.4# ./bin/spark-shell --master spark://boss:7077
I could enter the spark-shell without problems. Then I submitted the job;
the ResourceManager (domainname:8088) accepted my job but never ran it.
After waiting quite a long time I decided to check the log files.

14/09/25 15:54:59 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

14/09/25 15:55:14 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
.....

Solution
There was no running worker for the master to assign the job to. I started a worker for the master running on server "solai" and port "7077":
root@boss:/opt/spark-1.1.0-bin-hadoop2.4# ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://solai:7077


Error 2)
Then I decided to start the spark-shell via YARN.
root@boss:/opt/spark-1.1.0-bin-hadoop2.4# ./bin/spark-shell --master yarn-client

Spark assembly has been built with Hive, including Datanucleus jars on classpath Exception in thread "main" java.lang.Exception: When running with master 'yarn-client' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. at org.apache.spark.deploy.SparkSubmitArguments.
checkRequiredArguments(SparkSubmitArguments.scala:182) at org.apache.spark.deploy.SparkSubmitArguments.(SparkSubmitArguments.scala:62) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.
scala:70) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Solution
As the error says, set HADOOP_CONF_DIR in $SPARK_HOME/conf/spark-env.sh (create this file from conf/spark-env.sh.template):

HADOOP_CONF_DIR=/opt/hadoop-2.5.1/etc/hadoop
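A minimal sketch of creating that file (the template name is the one shipped with Spark 1.1.0):
root@boss:/opt/spark-1.1.0-bin-hadoop2.4# cp conf/spark-env.sh.template conf/spark-env.sh
root@boss:/opt/spark-1.1.0-bin-hadoop2.4# echo 'export HADOOP_CONF_DIR=/opt/hadoop-2.5.1/etc/hadoop' >> conf/spark-env.sh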


Error 3)
Successfully deployed Spark on top of YARN. Next, I tried to submit a job in yarn-cluster mode.

root@slavedk:/opt/spark-1.1.0-bin-hadoop2.4# ./bin/spark-submit --class org.apache.spark.examples.JavaWordCount --master yarn-cluster lib/spark-examples-1.1.0-hadoop2.4.0.jar /sip/ /op/sh2

Container launch failed for container_1412078568642_0005_01_000003 : org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. This token is expired. current time is 1412083584173 found 1412082408578

OR

Application application_1411747873135_0002 failed 2 times due to AM Container for appattempt_1411747873135_0002_000002 exited with exitCode: -100 due to: Container expired since it was unused.Failing this attempt.. Failing the application
Solution
     One option is to increase the lifespan of containers by raising the container-allocation expiry interval (the YARN default is 600000 ms; here it is raised to 2000000 ms).
Add the property below to the Hadoop yarn-site.xml file; in my case /opt/hadoop-2.5.1/etc/hadoop/yarn-site.xml:



<property>
  <name>yarn.resourcemanager.rm.container-allocation.expiry-interval-ms</name>
  <value>2000000</value>
</property>


Related posts

Few more issues Apache Spark on Yarn

Friday, August 22, 2014

Working with Hadoop ecosystem tools : Exceptions and Solutions

Error & Solution : Hadoop Eco-System tools
This is a continuation of the error & solution posts on setting up Hadoop HA.

Here I have discussed a few errors / issues encountered while working with Hadoop ecosystem tools (Sqoop, Pig, Hive) on top of the Hadoop setup.

Error 1)
Application application_1406816598739_0003 failed 2 times due to Error launching appattempt_1406816598739_0003_000002. Got exception: java.io.IOException: Failed on local exception: java.net.SocketException: Network is unreachable; Host Details : local host is: "localhost/127.0.0.1"; destination host is: "boss":32939;
Solution
No IP address was assigned to the node. I assigned one manually, like:
root@boss[bin]#sudo ifconfig eth0 10.184.36.194


Error 2)

root@solaiv[sqoop]# bin/sqoop import --connect jdbc:mysql://localhost/hadoop --table movies --username root -P --split-by id

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
Solution
Download and put "sqoop-1.4.4-hadoop200.jar" into $SQOOP_HOME


Error 3)
Pig, while executing a Pig statement from the grunt shell:
grunt> cnt = foreach grpd generate group, count(words) as nos; Failed to generate logical plan. Nested exception: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve count using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]

Solution
count in the Pig statement should be uppercase COUNT:
grunt> cnt = foreach grpd generate group, COUNT(words) as nos;


Error 4)
Using Sqoop: MySQL to Hive
root@boss:/opt/sqoop-1.4.4# bin/sqoop import --connect jdbc:mysql://localhost:3306/hadoop --table csurvey --target-dir /sqoop-hive --hive-import --split-by pname --hive-table csurveysu -username root -P

ERROR security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot delete /tmp/hadoop-yarn/staging/root/.staging/job_1407336913887_0001. Name node is in safe mode. The reported blocks 197 has reached the threshold 0.9990 of total blocks 197. The number of live datanodes 1 has reached the minimum number 2 . Safe mode will be turned off automatically in 3 seconds.
Solution
My Hadoop NameNode was in safe mode, i.e. read-only mode for the HDFS cluster: it cannot write until enough DataNodes are up for replication. After starting enough DataNodes my NameNode was still in safe mode, so I turned safe mode off manually by
root@localhost:/opt/hadoop-2.2.0# bin/hadoop dfsadmin -safemode leave
Then check the HDFS cluster for inconsistencies by

root@localhost:/opt/hadoop-2.2.0# bin/hadoop fsck /


Error 5)
Hive "show tables" does not display the table "sqooptest", which was imported by Sqoop.
hive> show tables;
I imported a MySQL table into Hive using Sqoop; once it was done I could see the data in the Hadoop file system (HDFS), but when I ran "show tables" from the Hive console, the table was not listed.

Solution
Just enter the Hive shell from the same directory where you ran the sqoop import command.
root@solaiv[bin]# cd /opt
root@solaiv[opt]# $SQOOP_HOME/bin/sqoop import --connect jdbc:mysql://localhost:3306/hadoop --table huser --hive-import --split-by name --hive-table sqooptest -username root -P
Execute hive from /opt; if you enter it from any other directory, you will not be able to see the table.
root@solaiv[opt]# $HIVE_HOME/bin/hive
hive> show tables;


More clarity:
By default Hive metadata is stored in a Derby database, and Derby stores its data in the current working directory. This can be customized by the user in hive-site.xml.
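A hedged illustration of that behaviour (the directories used below are just examples):
# Derby creates metastore_db/ and derby.log in whatever directory Hive is launched from
hadoop@solai# cd /opt && $HIVE_HOME/bin/hive -e "show tables;"
hadoop@solai# ls /opt                  # a metastore_db/ directory and derby.log appear here
hadoop@solai# cd /tmp && $HIVE_HOME/bin/hive -e "show tables;"   # this session sees a different, empty metastore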

Monday, July 7, 2014

Hadoop High Availability - Daemons overview

Hadoop High Availability - Daemons overview
Here I discuss a few concepts which I came across while setting up Hadoop cluster High Availability.

Role of StandBy Node

  Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.

  In order to provide fast fail-over, the Standby node keeps its state synchronized with the Active node: both nodes communicate with a group of separate daemons called "JournalNodes" (JNs). When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs.

 The Standby node is capable of reading the edits from the JNs, and is constantly watching them for changes to the edit log. As the Standby Node sees the edits, it applies them to its own namespace.

  In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.


DataNode configuration in Hadoop HA

In order to provide a fast fail-over, it is also necessary that the Standby node has up-to-date information regarding the location of blocks in the cluster and status of each DataNode.

 In order to achieve this, all the DataNodes are configured with the location of both NameNodes (Active & Standby), and they send block location information and heartbeats to both NameNodes.
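For illustration, a hedged sketch of the hdfs-site.xml entries that tell DataNodes (and clients) where both NameNodes live; the nameservice name, NameNode IDs, hosts, and port are assumptions borrowed from the HA posts later in this blog:
# these properties go inside <configuration> in hdfs-site.xml on every node
cat <<'EOF'
<property><name>dfs.nameservices</name><value>mycluster</value></property>
<property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>master:8020</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>standby:8020</value></property>
EOF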


Umm.. what is the Secondary NameNode?
I will explain starting from the NameNode for more clarity.

NameNode
 The NameNode holds the metadata for HDFS, like namespace information, block information, etc. When in use, all this information is stored in main memory; it is also persisted to disk for durable storage.
  • fsimage -> the snapshot of the filesystem when the NameNode started

  • edit logs -> the sequence of changes made to the filesystem after the NameNode started. Only on a restart of the NameNode are the edit logs applied to the fsimage to get the latest snapshot of the file system. But NameNode restarts are rare in production clusters, which means edit logs can grow very large on clusters where the NameNode runs for a long period of time. In this situation we encounter the following issues:

    • The edit log becomes very large, which is challenging to manage
    • NameNode restart takes a long time because a lot of changes have to be merged
    • In case of a crash we lose a huge amount of metadata, since the fsimage is very old. To overcome these issues we need a mechanism that keeps the edit log at a manageable size and the fsimage up to date, so that the load on the NameNode is reduced.
Secondary NameNode
 The Secondary NameNode helps to overcome the above issues by taking over the responsibility of merging the edit logs with the fsimage.
  • It gets the edit logs from the NameNode at regular intervals and applies them to the fsimage (i.e. builds a new image)
  • Once it has the new fsimage, it copies it back to the NameNode
  • The NameNode will use this fsimage on the next restart, which reduces the startup time



Why Secondary NameNode not needed in HA Hadoop cluster

In an HA cluster, the Standby NameNode also performs checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an error.


What is split-brain scenario in Hadoop HA

  In an HA cluster only one of the NameNodes may be active at a time; otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called "split-brain scenario," the JournalNodes will only ever allow a single NameNode to be a writer at a time.
  During a failover, the NameNode which is to become active simply takes over the role of writing to the JournalNodes, which effectively prevents the other NameNode from continuing in the Active state.

Related posts

HA Issues

Tuesday, June 24, 2014

Issues : Login to RStudio, SWIRL package installation and R 3.1.0 installation

Issues & Solution : R and RStudio
Tried to access RStudio via HTTP using the default port 8787: http://localhost:8787.

Error
On my first login I gave root as the username and password and got the error below:
cannot login to RStudio server on Debian (RStudio initialization error) Unable to connect to service (RStudio initialization error)
Solution
Create a new user called ruser:
root@solaiv[bin]# adduser ruser
Adding user `ruser' ...
Adding new group `ruser' (1002) ...
Adding new user `ruser' (1002) with group `ruser' ...
Creating home directory `/home/ruser' ...
Copying files from `/etc/skel' ...
Enter new UNIX password: 
Retype new UNIX password: 
passwd: password updated successfully 

Refresh the browser and log in with the newly created user credentials.


Error
R version 3.1.0, while configuring R:
root@solaiv[R-3.1.0]# ./configure

configure: error: --with-readline=yes (default) and headers/libs are not available configure: error: --with-x=yes (default) and X11 headers/libs are not available
Solution
If you do not need R to be built with these libraries, you can simply set with-x and with-readline to "no":
root@solaiv[R-3.1.0]# ./configure --with-x=no --with-readline=no


Error
While installing the "swirl" package:
> install.packages("swirl")

Cannot find curl-config
ERROR: configuration failed for package "RCurl"
* removing "/usr/local/lib/R/library/RCurl"
ERROR: dependency "RCurl" is not available for 
package "httr"
* removing "/usr/local/lib/R/library/httr" ERROR: dependencies "httr", "RCurl" are not
available for package "swirl"
* removing "/usr/local/lib/R/library/swirl"
Solution
> install.packages("swirl", dependencies=TRUE)

If you are still getting the same error, install the system libraries below and retry:

root@solaiv[R-3.1.0]# apt-get install libcurl4-openssl-dev

root@solaiv[R-3.1.0]# apt-get install libxml2-dev

root@solaiv[R-3.1.0]# R

> install.packages("swirl", dependencies=TRUE)

Related posts

Install R on debian

Wednesday, June 18, 2014

Error & Solution : Automatic Failover configuration (HDFS High Availability for Hadoop 2.X)

This is a continuation of the error & solution posts on setting up Hadoop HA.

Here I have discussed a few errors / issues encountered during Automatic Failover configuration,
a part of the Hadoop HA setup.

Error 1)
If you are converting a non-HA NameNode to be HA, you should run the command "hdfs namenode -initializeSharedEdits", which will initialize the JournalNodes with the edits data from the local NameNode edits directories
root@solaiv[bin]#./hdfs namenode -initializeSharedEdits

ERROR namenode.NameNode: Could not initialize shared edits dir java.io.IOException: Cannot start an HA namenode with name dirs that need recovery. Dir: Storage Directory /app/hadoop2/namenode state: NON_EXISTENT
Solution
Create the namenode dir:
root@boss[bin]#mkdir -p /app/hadoop2/namenode


Error 2)

root@solaiv[bin]#./hdfs namenode -initializeSharedEdits

namenode.NameNode: Could not initialize shared edits dir The directory is already locked;
Solution
Make sure the hadoop dirs for the namenode, datanode, and journalnode have full permissions:
root@boss[bin]#chmod 777 -R /app/hadoop2/
I have configured all the dirs under /app/hadoop2

root@boss[bin]#ls -l /app/hadoop2/

drwxrwxrwx 2 root root 4096 Nov 29 12:27 datanode
drwxrwxrwx 3 root root 4096 Nov 28 19:38 jn
drwxrwxrwx 3 root root 4096 Nov 29 12:32 namenode


Error 3)
This time, when I ran initializeSharedEdits on the standby node:
root@standby[bin]#hdfs namenode -initializeSharedEdits

14/06/03 14:42:28 ERROR namenode.NameNode: Could not initialize shared edits dir java.io.FileNotFoundException: No valid image files found at org.apache.hadoop.hdfs.server.namenode.
FSImageTransactionalStorageInspector.
getLatestImages(FSImageTransactionalStorageInspector.java:144)
Solution
The error occurred because the standby node couldn't sync with the active NameNode.
Format the standby NameNode:
standby@hadoop[bin]#hdfs namenode -format


Error 4)
In order to initialize the standby node, format its NameNode and copy the latest checkpoint (fsimage) from the master to the standby by executing the following command:
root@standby[bin]#hdfs namenode -bootstrapStandby
This command connects with master node to get the namespace metadata and the checkpointed fsimage. This command also ensures that standby node receives sufficient editlogs from the JournalNodes (corresponding to the fsimage). This command fails if JournalNodes are not correctly initialized and cannot provide the required editlogs.
root@standby[bin]#hdfs namenode -bootstrapStandby

org.apache.hadoop.hdfs.qjournal.protocol.
JournalNotFormattedException: Journal Storage Directory /app/hadoop2/jn/mycluster not formatted

10.184.39.147:8485: Journal Storage Directory /app/hadoop2/jn/mycluster not formatted at org.apache.hadoop.hdfs.qjournal.server.Journal.
checkFormatted(Journal.java:453) at org.apache.hadoop.hdfs.qjournal.server.Journal.
getEditLogManifest(Journal.java:636) at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.
getEditLogManifest(JournalNodeRpcServer.java:181) …... FATAL ha.BootstrapStandby: Unable to read transaction ids 3-13784 from the configured shared edits storage qjournal://master:8485;standby:8485/mycluster. Please copy these logs into the shared edits storage or call saveNamespace on the active node. Error: Gap in transactions. Expected to be able to read up until at least txid 13784 but unable to find any edit logs containing txid 3
Solution
I finally solved this by copying the data from a 'good' JournalNode (i.e. from 'master') to the unformatted one (i.e. standby, where I was getting the error):
root@master[bin]#scp -r /app/hadoop2/jn/mycluster/ root@standby:/app/hadoop2/jn/
Then I restarted the JournalNode and re-ran the bootstrap:

root@standby[bin]#../sbin/hadoop-daemon.sh start journalnode

root@standby[bin]#hdfs namenode -bootstrapStandby


Related posts

Error and Solution - Hadoop HA
distributed Hadoop setup
Issue while setup Hadoop cluster

Issues & Solution : HDFS High Availability for Hadoop 2.X

Here I have discussed a few errors / issues encountered during the Hadoop HA setup.

Error 1)
When I started the ResourceManager from the active NameNode in Hadoop HA:
root@master:/opt/hadoop-2.2.0# sbin/yarn-daemon.sh start resourcemanager

Problem binding to [master:8025] java.net.BindException: Cannot assign requested address;
Solution
Check your /etc/hosts file. If you have multiple entries for the same IP/hostname, delete them and keep only one valid entry.
I just removed all other entries for the IP '10.184.39.167' from /etc/hosts:

10.184.39.167 standby
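A hedged way to double-check the result (assuming, as above, that the hostname in question is "standby"):
# the hostname should now resolve to exactly one address
root@master:/opt/hadoop-2.2.0# getent hosts standby
10.184.39.167   standby
root@master:/opt/hadoop-2.2.0# grep -c '10.184.39.167' /etc/hosts    # expect 1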


Error 2)
Once I had configured Hadoop HA, I started the Hadoop cluster:
root@master[bin]#hdfs namenode -format

FATAL namenode.NameNode:Exception in namenode join

org.apache.hadoop.hdfs.qjournal.client.QuorumException: Unable to check if JNs are ready for formatting. 1 exceptions thrown:

10.184.39.67:8485: Call From standby/10.184.39.62 to master:8485 failed on connection exception: java.net.ConnectException:
Solution
Start all the configured JournalNodes, then format the Hadoop NameNode:
root@master[bin]#../sbin/hadoop-daemon.sh start journalnode
root@master[bin]#hadoop namenode -format


Error 3)
As mentioned in Solution 2 above, the JournalNodes started and the Hadoop NameNode was formatted successfully.
But when I tried to start DFS:
root@master[bin]#../sbin/start-dfs.sh

java.io.IOException: Cannot start an HA namenode with name dirs that need recovery. Dir: Storage Directory /app/hadoop2/namenode state: NOT_FORMATTED
Solution
  The error is due to data sync between the active node and the standby node; if your distributed Hadoop setup is fresh you may not get this error.
I copied the dfs directory from the active NameNode to the standby NameNode:
root@standby[hadoop-2.2.0]#scp -r /app/hadoop2/namenode/* root@master:/app/hadoop2/namenode/
Make sure the hadoop dirs have full permissions:
root@master[bin]#chmod 777 -R /app/hadoop2/
Then restarted DFS:
root@master[bin]#../sbin/start-dfs.sh


Error 4)
NameNode doesn't start in Hadoop2.x
root@master[bin]#../sbin/start-dfs.sh

Incorrect configuration namenode address dfs.namenode.servicerpc-address or dfs.namenode.rpc-address is not configured. Starting namenodes on [] …. …..
Solution
This error is caused by a problem in a *-site.xml file. I checked my *-site.xml files and everything seemed correct, but I had mistakenly commented out fs.default.name in core-site.xml. Restoring it fixed the issue:
 
<property>
   <name>fs.default.name</name>
         <value>hdfs://master:9000</value>
</property>



Related posts

Error and Solution - Hadoop HA Automatic Failover
distributed Hadoop setup
Issue while setup Hadoop cluster

Tuesday, June 17, 2014

Indexing and Searching using IDOL OnDemand

IDOL OnDemand API Tutorial
IDOL OnDemand delivers a rich set of web service APIs that enable developers to create ground-breaking data-driven apps. To get more details on each HP IDOL OnDemand API, click here.
In this tutorial we are going to cover only 5 APIs,
  • Create Index
  • Store Object
  • View Document
  • Find Related Concepts
  • Find Similar
For all APIs, calls are made using jQuery:
$('div.api').bind('click', function() {
    $.ajax({
        url: URL,
        success: function(result) {
            // convert the result into formatted JSON text
            var third4 = JSON.stringify(result, undefined, 2);
            // process the data
        },
        error: function(data) {
            // handle the error
        }
    });
});


Create Index : The Create Text Index API allows you to create a text index that you can use to add your own content to IDOL OnDemand. Here we are creating an index called 'Cancer Prevention'.
URL = https://api.idolondemand.com/1/api/sync/createtextindex
/v1?index=Cancer+Prevention&flavor=explorer&
apikey=dac630d2-4aed-45b7-8fc7-99fa87858460
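The same call can be issued from the command line with curl (the URL and apikey are taken verbatim from above):
curl "https://api.idolondemand.com/1/api/sync/createtextindex/v1?index=Cancer+Prevention&flavor=explorer&apikey=dac630d2-4aed-45b7-8fc7-99fa87858460"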


Store Object : The Store Object API takes a file, reference, or input URL and stores the contents of the document for use in other APIs. It returns the object store reference, which you can pass to other APIs to process the stored document. In this tutorial the API takes the file from the URL http://www.cancer.org/acs/groups/cid/documents/webcontent/002577-pdf.pdf
URL = https://api.idolondemand.com/1/api/sync/storeobject
/v1?url=http%3A%2F%2Fwww.cancer.org%2Facs%2Fgroups%2Fcid%2F
documents%2Fwebcontent%2F002577-pdf.pdf&
apikey=dac630d2-4aed-45b7-8fc7-99fa87858460


View Document : Converts a document to HTML format for viewing in a web browser. In this snippet we execute a View Document call and highlight the phrase 'physical activity' in the document.
URL = https://api.idolondemand.com/1/api/sync/viewdocument/
v1?url=http%3A%2F%2Fwww.cancer.org%2Facs%2Fgroups%2Fcid%2F
documents%2Fwebcontent%2F002577-pdf.pdf&highlight_expression
=physical+activity&start_tag=%3Cb%3E&raw_html=true&
apikey=dac630d2-4aed-45b7-8fc7-99fa87858460


Find Related Concepts : Returns the best terms and phrases in documents that match the specified query.
URL = https://api.idolondemand.com/1/api/sync/findrelatedconcepts/
v1?reference=b856d643-bc0a-48f6-88a2-e7489aa33be9&
sample_size=5&apikey=dac630d2-4aed-45b7-8fc7-99fa87858460


Find Similar API : Finds documents that are conceptually similar to your text or a document. Returns documents in the IDOL OnDemand databases that are similar to text or a document that we provide. We can submit text, an index reference, a file, a Store Object reference, or a URL.
 URL = https://api.idolondemand.com/1/api/sync/findsimilar
/v1?text=Hello+World&indexes=wiki_eng&apikey=
dac630d2-4aed-45b7-8fc7-99fa87858460

Related posts

Few more API calls using JQuery

IDOL OnDemand API Call using Java

IDOL OnDemand API call using JAVA

IDOL OnDemand API Call using Java
IDOL OnDemand delivers a rich set of web service APIs that enable developers to create ground-breaking data-driven apps. To view the complete set of APIs, click here.
In this tutorial we are going to cover only 3 APIs,
  • Find Similar
  • OCR Document
  • Sentiment Analysis
Find Similar API : Finds documents that are conceptually similar to your text or a document. Returns documents in the IDOL OnDemand databases that are similar to text or a document that we provide. We can submit text, an index reference, a file, an object store reference, or a URL.
In this tutorial, the Find Similar API is called against Wikipedia with the phrase 'Hello World'.

OCR Document API : The OCR Document API extracts text from an image that we provide. The API returns the extracted text, along with information about the location of the detected text in the original image.
In this tutorial, the OCR Document API is called on the following image: 'http://www.java-made-easy.com/images/hello-world.jpg' and the extracted text is displayed.

Sentiment Analysis API : The Sentiment Analysis API analyzes text to return the sentiment as positive, negative, or neutral. It contains a dictionary of positive and negative words of different types, and defines patterns that describe how to combine these words to form positive and negative phrases.
In this tutorial, the Sentiment Analysis API is called with the reference link from the Find Similar (Hello World) response, and the sentiment score is displayed along with the sentiment rating.

In the Java web application's web.xml:

<welcome-file-list>
        <welcome-file>test.jsp</welcome-file>
    </welcome-file-list>
    
    <servlet>
        <servlet-name>findsimilar</servlet-name>
        <servlet-class>idol.api.HelloWorld</servlet-class> 
    </servlet>

    <servlet-mapping>
        <servlet-name>findsimilar</servlet-name> 
        <url-pattern>/findsimilar</url-pattern> 
    </servlet-mapping>

In test.jsp:


<div class="section1">
<h3>In this tutorial</h3>

<div class="block">
<img src="images/similar.jpg" alt="" width="40" height="55" />
<h6>  <a href="findsimilar?api=api1"> servlet </a>  .</h6>
</div>

<div class="block">
<img src="images/ocr.jpg" alt="" width="40" height="55"/>
<h6>  <a href="findsimilar?api=api2"> servlet </a>  </h6>
</div>

<div class="block">
<img src="images/senti.jpg" alt="" width="40" height="55"/>
<h6> <a href="findsimilar?api=api3"> servlet </a>   </h6>
</div>

</div>

In servlet HelloWorld.java
String reference = "";	
String url3 = "https://api.idolondemand.com/1/api/sync/analyzesentiment/v1?url="+reference+"&apikey=dac630d2-4aed-45b7-8fc7-99fa87858460";
	
 protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
       
     
    	String url = "";  	
    	
        String url1 = "https://api.idolondemand.com/1/api/sync/findsimilar/v1?text=Hello+World&indexes=wiki_eng&apikey=dac630d2-4aed-45b7-8fc7-99fa87858460";
        String url2 = "https://api.idolondemand.com/1/api/sync/ocrdocument/v1?url=http%3A%2F%2Fwww.java-made-easy.com%2Fimages%2Fhello-world.jpg&apikey=dac630d2-4aed-45b7-8fc7-99fa87858460";
        
       	String api = request.getParameter("api");	
		
		if (api.equalsIgnoreCase("api1")){
			url = url1;
			}
		else if(api.equalsIgnoreCase("api2")){
			url = url2;
		}
		else{
			url = url3;			
		}
		
           Client1 cl1 = new Client1(); 
			String str =  cl1.run(url);			
			
			if (api.equalsIgnoreCase("api1")){
				 String []a =  str.split("reference");
				 String []b = a[1].split("weight");
				 reference =  b[0].substring(b[0].indexOf("http"), (b[0].indexOf(",",b[0].indexOf("http")))-1 );				 
				 url3 = "https://api.idolondemand.com/1/api/sync/analyzesentiment/v1?url="+reference+"&apikey=dac630d2-4aed-45b7-8fc7-99fa87858460";
				 request.setAttribute("ref", reference);
				}
			
			if (api.equalsIgnoreCase("api2")){			     
				 String []a = str.split("text", 3);
				 String []b = a[2].split("left");				
				 str =  b[0].substring((b[0].indexOf("="))+1, (b[0].indexOf(",",b[0].indexOf("left")))-1 );				 				
				}
			
						
			if (api.equalsIgnoreCase("api3")){
			     String []a = str.split("aggregate", 3);			     
			     if (a.length !=1){
				 str = a[1];}				 
				}
			
			response.setContentType("text/html");
			request.setAttribute("servletName", str); 		    
	    	
	      getServletConfig().getServletContext().getRequestDispatcher(
	    		  "/test2.jsp").forward(request,response); 
	    
    }


In HelloWorld.java we use the global variable reference; it stores the link returned by the Find Similar API, which is then used for the Sentiment Analysis API call.

In Client1.java:

public String run(String url){
		
		String body = "";
		 HttpClient httpclient = new DefaultHttpClient();
		 HttpGet httpget = new HttpGet(url);
		 System.out.println("httpget : "+httpget);		 
		 BasicHttpResponse response = null;
		 
		 try {
			response = (BasicHttpResponse) httpclient.execute(httpget);			
			StatusLine statusLine = response.getStatusLine();					
			 
			HttpEntity entity = response.getEntity(); 
		    if (entity != null) {		    	
		    	body = EntityUtils.toString(entity);		        	
		    }
		 }catch (Exception e) {
			// TODO: handle exception
			 e.printStackTrace();
		} 
		 finally {
			    System.out.println("Finally !!!");
		 }
		return body;	 
	}



<div class="section2">
<h3> API Response is </h3>
<div style="height: 227px; overflow:scroll;width:632px">
<pre>
<% 
out.print(request.getAttribute("servletName").toString()); 
%>
</pre>
</div>



Output of Sentiment Analysis API

"aggregate": {
    "sentiment": "neutral",
    "score": -0.06540204218347706
  }
}

Monday, June 16, 2014

IDOL OnDemand API call using JavaScript

IDOL OnDemand API Tutorial
IDOL OnDemand delivers a rich set of web service APIs that enable developers to create ground-breaking data-driven apps. To view the complete set of APIs, click here.
In this tutorial we are going to cover only 3 APIs,
  • Find Similar
  • OCR Document
  • Sentiment Analysis


Find Similar API : Finds documents that are conceptually similar to your text or a document. Returns documents in the IDOL OnDemand databases that are similar to text or a document that we provide. We can submit text, an index reference, a file, an object store reference, or a URL.
In this tutorial, the Find Similar API is called against Wikipedia with the phrase 'Hello World'.
 
$.ajax({url:"https://api.idolondemand.com
/1/api/sync/findsimilar/v1?text=Hello+World&indexes=wiki_eng&
apikey=dac630d2-4aed-45b7-8fc7-99fa87858460",
success:function(result)
{window.reference = result.documents[0].reference;..}
Output response will be
  "documents": [
    {
      "reference": "http://en.wikipedia.org/wiki/Hell",
      "weight": 89.25,
      "links": [
        "HELL",
        "WORLD"
      ],
      "index": "wiki_eng",
      "title": "Hell"
    },
    {
      "reference": "http://en.wikipedia.org/wiki/Hello world program",
      "weight": 89.07,
      "links": [
        "HELL",
        "WORLD"
      ],
      "index": "wiki_eng",
      "title": "Hello world program"
    }, 


OCR Document API : The OCR Document API extracts text from an image that we provide. The API returns the extracted text, along with information about the location of the detected text in the original image.
In this tutorial, the OCR Document API is called on the following image: 'http://www.java-made-easy.com/images/hello-world.jpg' and the extracted text is displayed.
$.ajax({url:"https://api.idolondemand.com/1/api/sync
/ocrdocument/v1?url=http%3A%2F%2Fwww.java-made-easy.com%2Fimages%2F
hello-world.jpg&apikey=dac630d2-4aed-45b7-8fc7-99fa87858460",
success:function(result){ ..}
Output response will be
{
  "text_block": [
    {
      "text": "= 5\nJR '\npublic class Hellouorld {\n/ # 'X;
\n* @param args\n*!\n; public static void main
(String[] argsj {\n// TODO Auto-generated method stub El\nSystem
.out.printlnl"Hello world!"1;\n}\n}\n< ) i",
      "left": 0,
      "top": 0,
      "width": 562,
      "height": 472
    }
  ]
}


Sentiment Analysis API : The Sentiment Analysis API analyzes text to return the sentiment as positive, negative, or neutral. It contains a dictionary of positive and negative words of different types, and defines patterns that describe how to combine these words to form positive and negative phrases.
In this tutorial, the Sentiment Analysis API is called with the reference link from the Find Similar (Hello World) response, and the sentiment score is displayed along with the sentiment rating.
 var url3 = 'https://api.idolondemand.com/1/api/
sync/analyzesentiment/v1?url='+reference+'&
apikey=dac630d2-4aed-45b7-8fc7-99fa87858460';
$.ajax({url:url3,success:function(result){ var sentiment = result.aggregate.sentiment; var score = result.aggregate.score;}
Output response will be
 "aggregate": {
    "sentiment": "neutral",
    "score": -0.06540204218347706
  }

Thursday, June 5, 2014

Install R in Debian squeeze/BOSS OS

Install R/Rstudio in Debian squeeze/BOSS OS

Install R on Debian Squeeze

R is a Free software environment for Data Analysis and Graphics
  • Programming Language
  • Data Visualization
By default, Debian Squeeze ships R packages; you can verify by
if you were not find anything add an appropriate entry in
root@solaiv[~]#vi /etc/apt/sources.list
The newest R release can then be installed using a command sequence like
root@solaiv[~]#apt-get update
root@solaiv[~]#apt-get install r-base
root@solaiv[~]#apt-get install r-base-dev
To enter the R console:
root@solaiv[~]#R
>
A simple addition example in R:
> a <- 5;
> b <-10;
> c <- a+b;
> c
[1] 15
R packages may then be installed by the local user/admin from the CRAN source packages, typically from inside R, as shown below for working with additional R packages.

Install packages from CRAN
> install.packages("RJDBC")
Install new packages from a downloaded file
> install.packages("file.tar.gz")
Loading packages into a session
> library("strings")
> require("strings")

Related posts

Install RStudio
issues cannot login to RStudio server

Monday, June 2, 2014

HBase HMaster cannot start / aborts in Debian/CentOS

Failed to start HBase HMaster in Debian/CentOS
When I started HBase, the HMaster started successfully, but after some time (within one minute) the HMaster aborted while all the other daemons were still running. Here is the log:
 
2014-05-21 15:45:05,075 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)

FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.

java.net.ConnectException: Call to localhost/127.0.0.1:8020 failed on connection exception: java.net.ConnectException: Connection refused
2014-05-21 15:45:16,794 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
2014-05-21 15:45:16,794 DEBUG org.apache.hadoop.hbase.master.HMaster: Stopping service threads
2014-05-21 15:45:16,794 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60000
2014-05-21 15:45:16,794 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 0 on 60000: exiting
2014-05-21 15:45:16,794 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 5 on 60000: exiting
2014-05-21 15:45:16,794 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 6 on 60000: exiting
java.lang.RuntimeException: HMaster Aborted
at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:160)
at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:104)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java

SOLUTION
In my hbase-site.xml I had missed the hbase.rootdir property:
<property> 
     <name>hbase.rootdir</name> 
     <value>hdfs://localhost:54310/hbase</value> 
</property> 
Once I added it and restarted HBase, the HMaster ran fine.
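A hedged sanity check: the host and port in hbase.rootdir must match the NameNode address configured in Hadoop's core-site.xml. The paths below assume a Hadoop 2.x etc/hadoop layout and an $HBASE_HOME variable; adjust them to your installation.
# compare the NameNode address used by HDFS with the one HBase points at
grep -A1 'fs.default' $HADOOP_HOME/etc/hadoop/core-site.xml
grep -A1 'hbase.rootdir' $HBASE_HOME/conf/hbase-site.xml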


Thursday, May 29, 2014

Step by step instructions on how to start/stop & manage each Hadoop daemon in a distributed Hadoop environment

Step by step instructions on how to start/stop Hadoop daemons
 Refer to setting up pseudo/single-node Hadoop 2.X
This post will help you start & stop the Hadoop daemons from the master & slave nodes.
All the scripts are available under $HADOOP_PATH/sbin
start-all.sh & stop-all.sh 
           Used to start and stop hadoop daemons all at once. Issuing it on the master machine will start/stop the daemons on all the nodes of a cluster. 
Hadoop daemons are
NameNode                 
SecondaryNameNode   
ResourceManager
JobHistoryServer
DataNode
NodeManager

start-dfs.sh, stop-dfs.sh 
           start/stop only  HDFS daemons separately on all the nodes from the master machine. (HDFS Daemons are NameNode , SecondaryNameNode and DataNode )
in master node
NameNode 
SecondaryNameNode  
in slave node
DataNode
start-yarn.sh, stop-yarn.sh 
           start/stop YARN daemons separately on all the nodes from the master machine. (YARN daemons are ResourceManager and NodeManager )
in master node
ResourceManager
in slave node
NodeManager
Start individual Hadoop daemons
hadoop-daemon.sh namenode/datanode &
yarn-daemon.sh resourcemanager/nodemanager 
         To start individual daemons on an individual machine manually, you need to go to that particular node and issue these commands.
sbin/hadoop-daemon.sh start datanode 
Use case : In a distributed Hadoop cluster, suppose you have added a new DataNode and need to start the DataNode daemon only on that machine.
All the DataNodes in the cluster can be started from the master by
sbin/hadoop-daemons.sh start datanode
Use case : In a distributed Hadoop cluster, suppose you want to stop/start all the DataNodes in your cluster from the master node.
To start the history server:
 sbin/mr-jobhistory-daemon.sh start historyserver
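Putting it together, a hedged sketch of a typical full start sequence, run from the Hadoop installation directory on the master:
sbin/start-dfs.sh
sbin/start-yarn.sh
sbin/mr-jobhistory-daemon.sh start historyserver
jps    # on the master you should now see NameNode, SecondaryNameNode, ResourceManager and JobHistoryServer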

Note :    1) To start/stop DataNodes and NodeManagers from the master, the script is *-daemons.sh and not *-daemon.sh; *-daemon.sh does not look up the slaves file and hence will only start processes on the master.
          2) You should have SSH enabled if you want to start all the daemons on all the nodes from one machine.