
Friday, June 22, 2018

Big Data / Data Analytics Job

Data Analytics and Big Data jobs in the UK

Data Architect (3-7+ years)
Salary: £70,000 - £75,000 per annum | Location: Central London
Posted: 26/06/2018 | Job Type: Permanent | Job Ref: 439101143

Data Scientist (3-7+ years)
Salary: £232 - £298 per day | Location: London
Posted: 18/06/2018 | Job Type: Temporary | Job Ref: 775109025

Saturday, October 1, 2016

Weka Hadoop Integration - weka read/write data from HDFS

Weka Hadoop Integration - weka read/write data from HDFS
Weka Hadoop Integration using distributedWekaHadoop package

In Weka, go to Tools --> Package Manager, search for "distributedWekaHadoop" and install the package.

Now go back to the Weka KnowledgeFlow; you will find HDFSLoader and HDFSSaver in the DataLoader and DataSink sections, which you can use to read data from and write data to HDFS.
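If you prefer the command line, the same package can also be installed with Weka's package manager class; this is a hedged sketch (it assumes weka.jar is on your classpath and the machine has internet access):

java weka.core.WekaPackageManager -install-package distributedWekaHadoop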



Tuesday, January 5, 2016

RHadoop - rJava installation Error and solution

RHadoop - Error and solution

Installing the RHadoop packages for working with R and Hadoop for large-scale data analytics.

Install the rJava package from a local file:

install.packages("/home/bdlnn/Downloads/rJava_0.9-7.tar.gz", repos = NULL)

configure: error: Java interpreter '/usr/lib/jvm/default-java/jre/bin/java' does not work
ERROR: configuration failed for package "rJava"
* removing "/usr/local/lib/R/site-library/rJava"

Solution

Set JAVA_HOME for R

solai@vm1$ sudo R CMD javareconf JAVA_HOME='/home/bdlnn/Software/jdk1.7.0_79'

Note: do not put a trailing '/' at the end of the JAVA_HOME path.


You can also try these alternative options:
solai@vm1$ sudo apt-get install r-cran-rjava

Or, inside R:
Sys.setenv(JAVA_HOME='/opt/jdk1.7.0_79/jre')
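Once the reconfiguration is done you can quickly check that rJava can start a JVM; a minimal, hedged check run from the shell (assumes Rscript is on your PATH):

solai@vm1$ Rscript -e 'library(rJava); .jinit(); cat(.jcall("java/lang/System", "S", "getProperty", "java.version"), "\n")'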

Thursday, November 19, 2015

Working with RHadoop

working with RHadoop


Error

hdfs.ls("/tamil")

Error in .jcall("java/lang/Class",
"Ljava/lang/Class;", "forName", cl, :
No running JVM detected. Maybe .jinit() would help. Error in .jfindClass(as.character(class)) : No running JVM detected. Maybe .jinit() would help.

Solution


hdfs.init()

hdfs.ls("/")

Error

hdfs.init()
sh: 1: /media/bdalab/bdalab/sw/hadoop-2.7.1/bin: Permission denied
Error in .jnew("org/apache/hadoop/conf/Configuration") : java.lang.ClassNotFoundException In addition: Warning message: running command '/media/bdalab/bdalab/sw/hadoop-2.7.1/bin classpath' had status 126
Solution


Sys.setenv(HADOOP_CMD='/hadoop-2.7.1/bin/hadoop')
Sys.setenv(JAVA_HOME='/jdk1.8.0_60/')
To check an environment variable:
Sys.getenv("HADOOP_CMD")
hdfs.init()
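Putting the pieces together, here is a minimal, hedged rhdfs session run from the shell; the HADOOP_CMD and JAVA_HOME paths are the ones from this post, so adjust them to your own system:

Rscript -e 'Sys.setenv(HADOOP_CMD="/media/bdalab/bdalab/sw/hadoop-2.7.1/bin/hadoop"); Sys.setenv(JAVA_HOME="/jdk1.8.0_60/"); library(rhdfs); hdfs.init(); print(hdfs.ls("/"))'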


Next :

RHadoop integration issues

RHadoop integration issues

Installing the RHadoop packages for working with R and Hadoop.

Installing the rJava package in R:
install.packages("rJava_0.9-7.tar.gz", repos = NULL)
Error
configure: error: Java Development Kit (JDK) is missing or not registered in R
Make sure R is configured with full Java support (including JDK). Run
R CMD javareconf
as root to add Java support to R. If you don't have root privileges, run
R CMD javareconf -e
to set all Java-related variables and then install rJava.
ERROR: configuration failed for package 'rJava'
* removing '/home/bdalab/R/x86_64-pc-linux-gnu-library/3.1/rJava'
Solution
Install the Ubuntu package:
sudo apt-get install r-cran-rjava
Then try again. I had installed it successfully on my system but still got the above error; I then checked java -version and found it was pointing to OpenJDK instead of Oracle Java HotSpot, so I changed it to HotSpot.
Error
library(rJava)
# fails only in RStudio; in the terminal it works fine

Error : .onLoad failed in loadNamespace() for 'rJava', details:
  call: dyn.load(file, DLLpath = DLLpath, ...)
  error: unable to load shared object 'x86_64-pc-linux-gnu-library/3.1/rJava/libs/rJava.so':
  libjvm.so: cannot open shared object file: No such file or directory
Error: loading failed
Execution halted
ERROR: loading failed
Solution 1
Install the Ubuntu package:
sudo apt-get install r-cran-rjava

Solution 2
Locate libjvm.so and create a shared link:
sudo ln -s /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/libjvm.so /usr/lib/
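An alternative to the symlink (a hedged sketch, assuming the same OpenJDK path as above) is to register the JVM library directory with the dynamic linker:

echo "/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server" | sudo tee /etc/ld.so.conf.d/rjava-jvm.conf
sudo ldconfig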
Error
R CMD javareconf

/usr/lib/R/bin/javareconf: 405: /usr/lib/R/bin/javareconf: cannot create /usr/lib/R/etc/Makeconf.new: Permission denied
Solution
sudo -i R CMD javareconf
OR
R CMD javareconf JAVA=jdk1.8.0_60/jre/bin/java JAVA_HOME=jdk1.8.0_60/ JAVAC=jdk1.8.0_60/bin/javac JAR=jdk1.8.0_60/bin/jar JAVAH=jdk1.8.0_60/bin/javah


Monday, August 31, 2015

Installing Mahout with Apache Spark 1.4.1 : Issues and Solution

Installing Mahout with Apache Spark 1.4.1 : Issues and Solution

In this blog I will discuss the possible errors you may get during the installation and how to resolve them.

The errors are listed here in the sequence in which I hit them during my installation.

Cannot find Spark class path. Is 'SPARK_HOME' set?

cd $MAHOUT_HOME

bin/mahout spark-shell

Got error Cannot find Spark class path. Is 'SPARK_HOME' set?

Solution
The issue is in the bin/mahout script: it points to compute-classpath.sh under the $SPARK_HOME/bin directory, but in my $SPARK_HOME/bin there was no such file.

Add compute-classpath.sh under the $SPARK_HOME/bin directory.

In my case I just copied it from an older version, i.e. Spark 1.1.
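For example (a hedged sketch; the Spark 1.1 path is an assumption, use wherever you unpacked the older release):

cp /opt/spark-1.1.0/bin/compute-classpath.sh $SPARK_HOME/bin/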


ERROR: Could not find mahout-examples-*.job

cd $MAHOUT_HOME

bin/mahout spark-shell

ERROR: Could not find mahout-examples-*.job in /media/bdalab/bdalab/sw/mahout or /media/bdalab/bdalab/sw/mahout/examples/target, please run 'mvn install' to create the .job file

Solution
Set the MAHOUT_LOCAL variable to true to avoid the error.

export MAHOUT_LOCAL=true


Error: Could not find or load main class org.apache.mahout.driver.MahoutDriver

cd $MAHOUT_HOME

bin/mahout spark-shell

MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.

MAHOUT_LOCAL is set, running locally

Error: Could not find or load main class org.apache.mahout.driver.MahoutDriver

Solution
This indicates that the Mahout driver needs to be built and installed.

root@solai[bin]# mvn -DskipTests -X clean install

[INFO] Scanning for projects...

[INFO] ------------------------------

[ERROR] BUILD FAILURE

[INFO] ---------------------------------

[INFO] Unable to build project '/media/bdalab/bdalab/sw/mahout/pom.xml; it requires Maven version 3.3.3
I downloaded the latest version of Maven (3.3.3) from the repository, unpacked it, and ran the previous command using the new Maven's bin directory:

root@solai[bin]# $MAVEN_HOME/bin/mvn -DskipTests -X clean install

org.apache.maven.enforcer.rule.api.EnforcerRuleException: Detected JDK Version: 1.8.0-60 is not in the allowed range [1.7,1.8).

So I changed Java from 1.8.0_60 to 1.7. Then I got this error:
root@solai[bin]# $MAVEN_HOME/bin/mvn -DskipTests -X clean install

[INFO] Mahout Build Tools ..... SUCCESS [02:42 min]

[INFO] Apache Mahout ..... SUCCESS [ 0.041 s]

[INFO] Mahout Math ......FAILURE [01:45 min]

[INFO] Mahout HDFS ........ SKIPPED

[INFO] Mahout Map-Reduce ..... SKIPPED

[INFO] Mahout Integration ..... SKIPPED

[INFO] Mahout Examples .........SKIPPED

[INFO] Mahout Math Scala bindings ..... SKIPPED

[INFO] Mahout H2O backend ...... SKIPPED

[INFO] Mahout Spark bindings ..... SKIPPED

[INFO] Mahout Spark bindings shell ..... SKIPPED

[INFO] Mahout Release Package ..... SKIPPED

Caused by: org.eclipse.aether.transfer.ArtifactTransferException: Could not transfer artifact org.apache.maven:maven-core:jar:2.0.6 from/to central (https://repo.maven.apache.org/maven2): GET request of: org/apache/maven/maven-core/2.0.6/maven-core-2.0.6.jar from central failed
I suspected the error was caused by a networking issue, so I ran the same command again.
As I had guessed, the installation completed successfully.

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "main"
After the successful installation I was trying to get the mahout> prompt:

cd $MAHOUT_HOME

bin/mahout spark-shell

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "main"

Solution
export JAVA_TOOL_OPTIONS="-Xmx2048m -XX:MaxPermSize=1024m -Xms1024m"

Sunday, August 30, 2015

Installing Mahout on Spark 1.4.1

Installing Mahout and Spark

In this blog I will describe the steps to install Mahout with Apache Spark 1.4.1 (the latest version at the time of writing), and also list the possible errors and remedies.

Installing Mahout & Spark on your local machine

1) Download Apache Spark 1.4.1 and unpack the archive file

2) Change to the directory where you unpacked Spark and type sbt/sbt assembly to build it

3) Make sure the right version of Maven (3.3) is installed on your system. If not, install mvn before building Mahout

4) Create a directory for Mahout somewhere on your machine, change to it and check out the master branch of Apache Mahout from GitHub: git clone https://github.com/apache/mahout mahout

5) Change to the mahout directory and build Mahout using mvn -DskipTests clean install


Starting Mahout's Spark shell

1) Go to the directory where you unpacked Spark and type sbin/start-all.sh to start Spark locally

2) Open a browser and point it to http://localhost:8080/ to check whether Spark started successfully. Copy the URL of the Spark master at the top of the page (it starts with spark://)

3) Define the following environment variables:

export MAHOUT_HOME=[directory into which you checked out Mahout]

export SPARK_HOME=[directory where you unpacked Spark]

export MASTER=[url of the Spark master]

4) Finally, change to the directory where you unpacked Mahout and type bin/mahout spark-shell; you should see the shell starting and get the prompt mahout>
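For example, a hedged sketch with hypothetical paths and master URL (substitute your own install locations and the URL copied from the Spark UI):

export MAHOUT_HOME=/opt/mahout
export SPARK_HOME=/opt/spark-1.4.1
export MASTER=spark://localhost:7077
cd $MAHOUT_HOME && bin/mahout spark-shell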



In the next blog I will discuss the possible errors while installing Mahout, with solutions.

Next : Resolved issues - Installing Mahout 0.11.0 with Spark 1.4.1

Thursday, May 14, 2015

Hive configuration at Zeppelin - error and way to solve

Hive configuration at Zeppelin - error and way to solve

I have successfully compiled the latest version of Zeppelin from source. You can get the Apache Zeppelin binary on request: leave your mail id and I will provide a link to download it.

The Hive interpreter is not available in my Apache Zeppelin. What should I do?

Install Hive and set HIVE_HOME.

If Hive is installed on your system, Zeppelin will automatically recognize it and list the Hive interpreter. If Hive is installed but no interpreter is listed in Zeppelin, check HIVE_HOME and make sure it is set properly.

If Hive is not installed, install it; Zeppelin will pick it up automatically on the next restart.

Error

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
Solution

Add export HADOOP_USER_CLASSPATH_FIRST=true to HIVE_HOME/conf/hive-env.sh
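A quick, hedged way to append the setting (assumes HIVE_HOME is set and hive-env.sh already exists, e.g. copied from hive-env.sh.template):

bdalab@solai:/opt$ echo 'export HADOOP_USER_CLASSPATH_FIRST=true' >> $HIVE_HOME/conf/hive-env.sh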


Thursday, May 7, 2015

Configuring Hive at HUE (Hadoop ui)

Hive configuration at HUE
previous post - Hue installation error

I have successfully compiled the latest version of Hue from source. When I accessed it at http://localhost:8000/ I saw a warning that the Hive editor was not configured.

Hive Editor - The application won't work without a running HiveServer2.

To start HiveServer2:

bdalab@solai:/opt$ $HIVE_HOME/bin/hive --service hiveserver2

In another terminal:

bdalab@solai:/opt$ $HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10000

I got a permission denied error. Error

Error: Failed to open new session: java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=APP, access=EXECUTE, inode="/tmp/hive/APP/704d9b8d-a9d5-4be6-a8f1-082e8eba9a0c":hdfs:supergroup:drwx------
Solution

I then changed the Derby username (javax.jdo.option.ConnectionUserName) from APP to hdfs in hive-default.xml under HIVE_HOME/conf.
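For reference, this is roughly how the changed property looks in the configuration file (a hedged sketch; only the value is changed to hdfs):

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hdfs</value>
</property>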

Wednesday, May 6, 2015

Hue installation Error

Hue installation Error

I have been installing a recent version of HUE from source. During compilation I hit a few issues; here are the steps to resolve them.

Error

    Unable to get dependency information: Unable to read the metadata file for artifact 'com.sun.jersey:jersey-core:jar': Cannot find parent: net.java:jvnet-parent for project: com.sun.jersey:jersey-project:pom:1.9 for project com.sun.jersey:jersey-project:pom:1.9 com.sun.jersey:jersey-core:jar:1.9

....
Path to dependency: 1) com.cloudera.hue:hue-plugins:jar:3.8.1-SNAPSHOT
2) org.apache.hadoop:hadoop-client:jar:2.6.0-mr1-cdh5.5.0-SNAPSHOT
3) org.apache.hadoop:hadoop-hdfs:jar:2.6.0-cdh5.5.0-SNAPSHOT


Solution

Update Maven: by default mvn -version shows 2.x, but Hue needs Maven version 3 or above.
Download, extract and install Maven 3 (if not already installed) and add Maven to the PATH:

export PATH=$PATH:$MAVEN_HOME/bin

Even after I updated the PATH variable with the latest Maven, mvn -version still showed 2.x,
so I manually updated Maven using update-alternatives:

  bdalab@solai:/opt$ sudo update-alternatives --install /usr/bin/mvn mvn $MAVEN_HOME/bin/mvn 1

  bdalab@solai:/opt$ sudo update-alternatives --config mvn

Now select the number referring to the recent Maven 3 installation from the list of choices.

Error

    Failed to obtain user group information: org.apache.hadoop.security.authorize.AuthorizationException: Unauthorized connection for super-user: hue (error 401)
Solution

Configure Hue as a proxy user for all other users and groups, meaning it may submit a request on behalf of any other user. Add the properties below to core-site.xml, within the configuration tags:
<property>
  <name>hadoop.proxyuser.hue.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hue.groups</name>
  <value>*</value>
</property>
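After editing core-site.xml the NameNode has to pick up the new proxy-user settings; a hedged sketch is to either restart HDFS or refresh the configuration in place:

bdalab@solai:/opt$ hdfs dfsadmin -refreshSuperUserGroupsConfiguration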

Wednesday, April 22, 2015

how to : Execute HDFS commands from DataNode

how to : Work on NameNode (HDFS) from DataNode

These commands work on the NameNode from a DataNode, or from any other system with Hadoop installed (which may or may not be part of the Hadoop cluster).

Execute an HDFS 'fs' command from a DataNode

    bdalab@solai:/opt$ hadoop fs -fs hdfs://masterNodeIP:9000/ -rm /input/log.csv


The above command is executed from the DataNode; the file '/input/log.csv' will be removed on the NameNode.
here, masterNodeIP -> IP address of the remote NameNode system

List/show all the files in NameNode from DataNode

    bdalab@solai:/opt$ hadoop fs -fs hdfs://masterNodeIP:9000/ -ls /


Create dir 'pjt' in NameNode from DataNode

    bdalab@solai:/opt$ hadoop fs -fs hdfs://masterNodeIP:9000/ -mkdir /pjt

All of the above commands are run from the DataNode and executed on the NameNode.
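An equivalent, hedged alternative (assuming Hadoop 2.x) is to override fs.defaultFS per command with the generic -D option instead of -fs:

    bdalab@solai:/opt$ hadoop fs -D fs.defaultFS=hdfs://masterNodeIP:9000/ -ls /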

Tuesday, April 21, 2015

How to : Move the file from DataNode to NameNode

1 step to Move the file from DataNode to NameNode


In some cases we would need to move a file that is available on a DataNode, but not yet in HDFS, into the HDFS cluster.

The same command is also useful when you have a file on any system with Hadoop installed (which may or may not be part of the Hadoop cluster) and want to move it to a NameNode.

move file from DataNode system to Hadoop cluster

    bdalab@solai:/opt$ hadoop fs -fs hdfs://NameNodeIP:9000/ -put /FileToMove /hadoopFolderName/


here,
NameNodeIP -> IP address of NameNode system
FileToMove -> is a file to be moved to HDFS

OR
    bdalab@solai:/opt$ hadoop fs -fs hdfs://10.0.18.269:9000/ -put /FileToMove /hadoopFolderName/

Friday, April 17, 2015

1 step to Move the file from Hadoop HDFS to remote system

1 step to Move the file from Hadoop HDFS to remote system

A command to move a file from the Hadoop HDFS cluster to a remote system.

In some cases we would need to move MapReduce output files from Hadoop HDFS
to a system where Hadoop is not installed.

move file from Hadoop cluster to remote system ( Non Hadoop system )

    bdalab@solai:/opt$ hadoop dfs -cat hdfs://NameNodeIP:9000/user/part-* | ssh userName@remoteSystemIP 'cat - > /home/hadoop/MRop'


here,
part-* -> is a file to be moved from HDFS
userName -> userName of remote system
remoteSystemIP -> IP address of remote system
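A hedged alternative is to copy the files to the local disk first and then transfer them with scp (the /tmp staging directory and the destination directory are just examples):

    bdalab@solai:/opt$ hadoop fs -get hdfs://NameNodeIP:9000/user/part-* /tmp/
    bdalab@solai:/opt$ scp /tmp/part-* userName@remoteSystemIP:/home/hadoop/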

1 step to Move the file from remote system to Hadoop HDFS

1 step to Move the file from remote system to Hadoop HDFS
One command to move a file from a remote system to the Hadoop HDFS cluster.

In some cases we would need to move a file from a system where Hadoop
is not installed to the Hadoop cluster.

move file from Non Hadoop system to Hadoop cluster

    bdalab@solai:/opt$ cat moveToHdfs.txt | ssh userName@NameNodeIP "hadoop dfs -put - hadoopDirName/"


here,
moveToHdfs.txt -> is a file to be moved to HDFS
userName -> userName of NameNode system
NameNodeIP -> IP address of NameNode
hadoopDirName -> Dir in HDFS


If you face an error like

    bash: hadoop: command not found

re-run the above command with the full path to hadoop:

    bdalab@solai:/opt$ cat moveToHdfs.txt | ssh userName@NameNodeIP "/opt/hadoop-2.6.0/bin/hadoop dfs -put - hadoopDirName/"

Tuesday, April 7, 2015

Configure UBER mode - MapReduce job for small dataset

Uber job configuration in YARN - Hadoop2
previous post - what is Uber mode

How to configure an uber job?

    To enable uber jobs, set the following properties (they are MapReduce job properties, normally placed in mapred-site.xml or set per job):
    mapreduce.job.ubertask.enable=true
    mapreduce.job.ubertask.maxmaps=9 (default 9)
    mapreduce.job.ubertask.maxreduces=0 (default 1)
    mapreduce.job.ubertask.maxbytes=4194304 (4 MB; the value is in bytes)

mapreduce.job.ubertask.maxbytes

    The value is given in bytes; the setting above corresponds to 4 MB, and the default is the HDFS block size. The total input size of a job must be less than or equal to this value for the job to be uberized.
    Ex.: say you have a dataset of 5 MB but have set mapreduce.job.ubertask.maxbytes to 4 MB; then uber mode will not be used.
    If you omit this property, the block size is used as the default, so whether a larger dataset (say 50 MB) runs in uber mode depends on your block size.
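A hedged sketch of the same settings as XML properties (assuming they go in mapred-site.xml, inside the configuration tags):

<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxmaps</name>
  <value>9</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxreduces</name>
  <value>0</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxbytes</name>
  <value>4194304</value>
</property>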

UBER mode in YARN Hadoop2 - Running MapReduce jobs in small dataset

what is Uber mode in YARN - Hadoop2

You might have seen these lines while running MapReduce in Hadoop2:

mapreduce.Job: Job job_1387204213494_0005 running in uber mode : false

What is UBER mode in Hadoop2?

    Normally, mappers and reducers run in containers allocated by the ResourceManager (RM): a separate container is created for each mapper and reducer.
    With the uber configuration, the mappers and reducers run in the same process as the ApplicationMaster (AM).

Uber jobs :

    Uber jobs are jobs that are executed within the MapReduce ApplicationMaster, rather than communicating with the RM to create the mapper and reducer containers.
    The AM runs the map and reduce tasks within its own process and avoids the overhead of launching and communicating with remote containers.
Why?

    If you have a small dataset and want to run MapReduce on a small amount of data, the uber configuration will help you by cutting the extra time that MapReduce normally spends launching and managing the mapper and reducer containers.
Can I configure uber mode for all MapReduce jobs?

    As of now, only
       map-only jobs and
       jobs with one reducer are supported.

Saturday, April 4, 2015

Easy way to recover the deleted files/dir in Hadoop hdfs

Easy way to recover deleted files/dirs in HDFS
In some cases files or dirs are deleted accidentally.
Is there any way to recover them?

   By default Hadoop deletes the files/dirs forever. It has a Trash feature, but it is not enabled by default.

   By configuring fs.trash.interval and fs.trash.checkpoint.interval in Hadoop's core-site.xml, deleted files/dirs are moved into the .Trash folder instead.

   The location of the .Trash folder in HDFS is /user/$USER/.Trash

configuring core-site.xml

<property>
<name>fs.trash.interval</name>
<value>120</value> 
</property>

<property>
<name>fs.trash.checkpoint.interval</name>
<value>45</value>
</property>

   With the above configuration, all deleted files/dirs are moved to the .Trash folder and the data is kept for two hours.
   The checkpoint check is performed every 45 minutes and deletes from the .Trash folder any files/dirs that are more than 2 hours old.

Restart Hadoop
    Once you modify core-site.xml, stop and start Hadoop.

   Here is an example of the remove-dir command:
hadoop@solai# hadoop fs -rmr /testTrash
15/04/05 01:10:14 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 120 minutes, Emptier interval = 45 minutes. Moved: 'hdfs://127.0.0.1:9000/testTrash' to trash at: hdfs://127.0.0.1:9000/user/bdalab/.Trash/Current
   You can clearly see the message saying that the deleted folder is moved to /user/bdalab/.Trash/Current, and the data is kept for 2 hours with a checkpoint interval of 45 minutes.

List the deleted files/dirs in the .Trash folder using -ls:
hadoop@solai# hadoop fs -ls hdfs://127.0.0.1:9000/user/bdalab/.Trash/Current/testTrash
You can view the content (with -cat) or move the files back to their original path:
hadoop@solai# hadoop fs -mv hdfs://127.0.0.1:9000/user/bdalab/.Trash/Current/testTrash /testTrash
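If you want to empty the trash immediately instead of waiting for the checkpoint interval, a hedged extra is the expunge command:

hadoop@solai# hadoop fs -expunge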

Friday, August 22, 2014

Working with Hadoop Eco systems tools : Exception and Solution

Error & Solution : Hadoop Eco-System tools
This is a continuation of the Error & Solution posts from setting up Hadoop HA.

Here I discuss a few errors/issues hit during the Automatic Fail-over configuration, which is part of the Hadoop HA setup.

Error 1)
Application application_1406816598739_0003 failed 2 times due to Error launching appattempt_1406816598739_0003_000002. Got exception: java.io.IOException: Failed on local exception: java.net.SocketException: Network is unreachable; Host Details : local host is: "localhost/127.0.0.1"; destination host is: "boss":32939;
Solution
No IP address was assigned to the node. Assign one manually, for example:
root@boss[bin]# sudo ifconfig eth0 10.184.36.194


Error 2)

root@solaiv[sqoop]# bin/sqoop import --connect jdbc:mysql://localhost/hadoop --table movies --username root -P --split-by id

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
Solution
Download the "sqoop-1.4.4-hadoop200.jar" and put it in SQOOP_HOME.


Error 3)
Pig, while executing a Pig statement from the Grunt shell:
grunt> cnt = foreach grpd generate group, count(words) as nos; Failed to generate logical plan. Nested exception: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve count using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]

Solution
count in the Pig statement should be upper case: COUNT
grunt> cnt = foreach grpd generate group, COUNT(words) as nos;


Error 4)
using SQOOP: MySQL to HIVE
root@boss:/opt/sqoop-1.4.4# bin/sqoop import --connect jdbc:mysql://localhost:3306/hadoop --table csurvey --target-dir /sqoop-hive --hive-import --split-by pname --hive-table csurveysu -username root -P

ERROR security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot delete /tmp/hadoop-yarn/staging/root/.staging/job_1407336913887_0001. Name node is in safe mode. The reported blocks 197 has reached the threshold 0.9990 of total blocks 197. The number of live datanodes 1 has reached the minimum number 2 . Safe mode will be turned off automatically in 3 seconds.
Solution
My Hadoop NameNode was in safe mode, i.e. read-only mode for the HDFS cluster: it cannot write until enough DataNodes are up for replication. Even after enough DataNodes were started my NameNode was still in safe mode, so I turned safe mode off manually:
root@localhost:/opt/hadoop-2.2.0# bin/hadoop dfsadmin -safemode leave
Then check the consistency of the HDFS cluster with fsck:

root@localhost:/opt/hadoop-2.2.0# bin/hadoop fsck /


Error 5)
Hive "show tables" does not display the table "sqooptest", which was imported by Sqoop
I imported a MySQL table into Hive using Sqoop. Once it was done I could see the data in the Hadoop file system (HDFS), but when I ran "show tables" from the Hive console it reported that no such table exists.

Solution
Enter the Hive shell from the same directory where you ran the sqoop import command.

root@solaiv[bin]# cd /opt
root@solaiv[opt]# $SQOOP_HOME/bin/sqoop import --connect jdbc:mysql://localhost:3306/hadoop --table huser --hive-import --split-by name --hive-table sqooptest -username root -P

Execute hive from /opt; if you start it from any other directory you will not be able to see the table.

root@solaiv[opt]# $HIVE_HOME/bin/hive
hive> show tables;


More clarity:
By default Hive metadata is stored in a Derby database, and Derby stores its data in the current working directory. This can be customized in "hive-default.xml".
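A hedged sketch of how the Derby location can be pinned to an absolute path in the Hive configuration, so the metastore no longer depends on the working directory (the /opt/hive-metastore path is only an example):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/opt/hive-metastore/metastore_db;create=true</value>
</property>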

Monday, July 7, 2014

Hadoop High Availability - Daemons overview

Hadoop High Availability - Daemons overview
A few concepts I came across while setting up Hadoop cluster High Availability are discussed here.

Role of StandBy Node

  Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.

  In order to provide a fast fail-over, the Standby node keeps its state synchronized with the Active node: both nodes communicate with a group of separate daemons called "JournalNodes" (JNs). When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs.

 The Standby node is capable of reading the edits from the JNs, and is constantly watching them for changes to the edit log. As the Standby Node sees the edits, it applies them to its own namespace.

  In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.


DataNode configuration in Hadoop HA

In order to provide a fast fail-over, it is also necessary that the Standby node has up-to-date information regarding the location of blocks in the cluster and status of each DataNode.

 In order to achieve this, all the DataNodes are configured with the location of both NameNodes (Active & Standby), and they send block location information and heartbeats to both NameNodes.


Umm.. what is the Secondary NameNode?
I will explain starting from the NameNode, for more clarity.

Namenode 
 The NameNode holds the metadata for HDFS, such as namespace information, block information, etc. When in use, all of this information is stored in main memory, but it is also written to disk for persistent storage.
  • fsimage -> a snapshot of the filesystem when the NameNode started

  • edit logs -> the sequence of changes made to the filesystem after the NameNode started. Only on a restart of the NameNode are the edit logs applied to the fsimage to get the latest snapshot of the file system. But NameNode restarts are rare in production clusters, which means the edit logs can grow very large on clusters where the NameNode runs for a long period of time. In that situation we encounter the following issues:

    • The edit log becomes very large, which is challenging to manage
    • NameNode restart takes a long time because a lot of changes have to be merged
    • In the case of a crash we would lose a huge amount of metadata, since the fsimage is very old
  So to overcome these issues we need a mechanism that keeps the edit log size manageable and the fsimage up to date, so that the load on the NameNode is reduced.
Secondary NameNode
 The Secondary NameNode helps to overcome the above issues by taking over the responsibility of merging the edit logs with the fsimage.
  • It gets the edit logs from the NameNode at regular intervals and applies them to the fsimage (i.e. builds a new image)
  • Once it has the new fsimage, it copies it back to the NameNode
  • The NameNode will use this fsimage on the next restart, which reduces the startup time



Why Secondary NameNode not needed in HA Hadoop cluster

In an HA cluster, the Standby NameNode also performs checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an error.


What is split-brain scenario in Hadoop HA

  In an HA cluster, only one of the NameNodes may be active at a time; otherwise the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called "split-brain scenario," the JournalNodes will only ever allow a single NameNode to be a writer at a time.
  During a failover, the NameNode which is to become active simply takes over the role of writing to the JournalNodes, which effectively prevents the other NameNode from continuing in the Active state.
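A hedged way to observe and exercise this from the command line, assuming the two NameNodes are configured with the (hypothetical) service IDs nn1 and nn2:

bdalab@solai:/opt$ hdfs haadmin -getServiceState nn1
bdalab@solai:/opt$ hdfs haadmin -getServiceState nn2
bdalab@solai:/opt$ hdfs haadmin -failover nn1 nn2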

Related posts

HA Issues
