
Friday, June 22, 2018

Big Data / Data Analytics Job

Data Analytics and Big Data jobs in the UK

Data Architect (3-7+ years)
Salary: £70,000 - £75,000 per annum | Location: Central London
Posted: 26/06/2018 | Job Type: Permanent | Job Ref: 439101143

Data Scientist (3-7+ years)
Salary: £232 - £298 per day | Location: London
Posted: 18/06/2018 | Job Type: Temporary | Job Ref: 775109025

Saturday, October 1, 2016

Weka Hadoop Integration - weka read/write data from HDFS

Weka Hadoop Integration - weka read/write data from HDFS
Weka Hadoop Integration using distributedWekaHadoop package

In Weka, go to Tools --> Package Manager, search for "distributedWekaHadoop" and install the package.

Now go back to the Weka KnowledgeFlow; you will find HDFSLoader and HDFSSaver in the DataLoader and DataSink sections, which you can use to read data from and write data to HDFS.
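If you prefer the command line, the same package can also be installed with Weka's package manager class; this is a hedged sketch (it assumes weka.jar is on your classpath and the machine has internet access):

java weka.core.WekaPackageManager -install-package distributedWekaHadoop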



Tuesday, January 5, 2016

RHadoop - rJava installation Error and solution

RHadoop - Error and solution

Installing the RHadoop packages for working with R and Hadoop for large-scale data analytics.

Install the rJava package from a local file:

install.packages("/home/bdlnn/Downloads/rJava_0.9-7.tar.gz", repos = NULL)

configure: error: Java interpreter '/usr/lib/jvm/default-java/jre/bin/java' does not work
ERROR: configuration failed for package "rJava"
* removing "/usr/local/lib/R/site-library/rJava"

Solution

Set JAVA_HOME for R

solai@vm1$ sudo R CMD javareconf JAVA_HOME='/home/bdlnn/Software/jdk1.7.0_79'

Note: do not put a trailing '/' at the end of the JAVA_HOME path.


You can also try these alternative options:
solai@vm1$ sudo apt-get install r-cran-rjava

Or, inside R:
Sys.setenv(JAVA_HOME='/opt/jdk1.7.0_79/jre')
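Once the reconfiguration is done you can quickly check that rJava can start a JVM; a minimal, hedged check run from the shell (assumes Rscript is on your PATH):

solai@vm1$ Rscript -e 'library(rJava); .jinit(); cat(.jcall("java/lang/System", "S", "getProperty", "java.version"), "\n")'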

Thursday, November 19, 2015

Working with RHadoop

working with RHadoop


Error

hdfs.ls("/tamil")

Error in .jcall("java/lang/Class",
"Ljava/lang/Class;", "forName", cl, :
No running JVM detected. Maybe .jinit() would help. Error in .jfindClass(as.character(class)) : No running JVM detected. Maybe .jinit() would help.

Solution


hdfs.init()

hdfs.ls("/")

Error

hdfs.init()
sh: 1: /media/bdalab/bdalab/sw/hadoop-2.7.1/bin: Permission denied
Error in .jnew("org/apache/hadoop/conf/Configuration") : java.lang.ClassNotFoundException In addition: Warning message: running command '/media/bdalab/bdalab/sw/hadoop-2.7.1/bin classpath' had status 126
Solution


Sys.setenv(HADOOP_CMD='/hadoop-2.7.1/bin/hadoop')
Sys.setenv(JAVA_HOME='/jdk1.8.0_60/')
To check an environment variable:
Sys.getenv("HADOOP_CMD")
hdfs.init()
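Putting the pieces together, here is a minimal, hedged rhdfs session run from the shell; the HADOOP_CMD and JAVA_HOME paths are the ones from this post, so adjust them to your own system:

Rscript -e 'Sys.setenv(HADOOP_CMD="/media/bdalab/bdalab/sw/hadoop-2.7.1/bin/hadoop"); Sys.setenv(JAVA_HOME="/jdk1.8.0_60/"); library(rhdfs); hdfs.init(); print(hdfs.ls("/"))'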


Next :

RHadoop integration issues

RHadoop integration issues

Installing the RHadoop packages for working with R and Hadoop.

Installing the rJava package in R:
install.packages("rJava_0.9-7.tar.gz", repos = NULL)
Error
configure: error: Java Development Kit (JDK) is missing or not registered in R
Make sure R is configured with full Java support (including JDK). Run
R CMD javareconf
as root to add Java support to R. If you don't have root privileges, run
R CMD javareconf -e
to set all Java-related variables and then install rJava.
ERROR: configuration failed for package 'rJava'
* removing '/home/bdalab/R/x86_64-pc-linux-gnu-library/3.1/rJava'
Solution
Install the Ubuntu package:
sudo apt-get install r-cran-rjava
Then try again. I had installed it successfully on my system but still got the above error; I then checked java -version and found it was pointing to OpenJDK instead of Oracle Java HotSpot, so I changed it to HotSpot.
Error
library(rJava)
# fails only in RStudio; in the terminal it works fine

Error : .onLoad failed in loadNamespace() for 'rJava', details:
  call: dyn.load(file, DLLpath = DLLpath, ...)
  error: unable to load shared object 'x86_64-pc-linux-gnu-library/3.1/rJava/libs/rJava.so':
  libjvm.so: cannot open shared object file: No such file or directory
Error: loading failed
Execution halted
ERROR: loading failed
Solution 1
Install the Ubuntu package:
sudo apt-get install r-cran-rjava

Solution 2
Locate libjvm.so and create a shared link:
sudo ln -s /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/libjvm.so /usr/lib/
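An alternative to the symlink (a hedged sketch, assuming the same OpenJDK path as above) is to register the JVM library directory with the dynamic linker:

echo "/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server" | sudo tee /etc/ld.so.conf.d/rjava-jvm.conf
sudo ldconfig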
Error
R CMD javareconf

/usr/lib/R/bin/javareconf: 405: /usr/lib/R/bin/javareconf: cannot create /usr/lib/R/etc/Makeconf.new: Permission denied
Solution
sudo -i R CMD javareconf
OR
R CMD javareconf JAVA=jdk1.8.0_60/jre/bin/java JAVA_HOME=jdk1.8.0_60/ JAVAC=jdk1.8.0_60/bin/javac JAR=jdk1.8.0_60/bin/jar JAVAH=jdk1.8.0_60/bin/javah


Monday, August 31, 2015

Installing Mahout with Apache Spark 1.4.1 : Issues and Solution

Installing Mahout with Apache Spark 1.4.1 : Issues and Solution

In this blog I will discuss the possible errors you may get during the installation and how to resolve them.

The errors are listed here in the sequence in which I hit them during my installation.

Cannot find Spark class path. Is 'SPARK_HOME' set?

cd $MAHOUT_HOME

bin/mahout spark-shell

Got error Cannot find Spark class path. Is 'SPARK_HOME' set?

Solution
The issue is in the bin/mahout script: it points to compute-classpath.sh under the $SPARK_HOME/bin directory, but in my $SPARK_HOME/bin there was no such file.

Add compute-classpath.sh under the $SPARK_HOME/bin directory.

In my case I just copied it from an older version, i.e. Spark 1.1.
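For example (a hedged sketch; the Spark 1.1 path is an assumption, use wherever you unpacked the older release):

cp /opt/spark-1.1.0/bin/compute-classpath.sh $SPARK_HOME/bin/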


ERROR: Could not find mahout-examples-*.job

cd $MAHOUT_HOME

bin/mahout spark-shell

ERROR: Could not find mahout-examples-*.job in /media/bdalab/bdalab/sw/mahout or /media/bdalab/bdalab/sw/mahout/examples/target, please run 'mvn install' to create the .job file

Solution
Set the MAHOUT_LOCAL variable to true to avoid the error.

export MAHOUT_LOCAL=true


Error: Could not find or load main class org.apache.mahout.driver.MahoutDriver

cd $MAHOUT_HOME

bin/mahout spark-shell

MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.

MAHOUT_LOCAL is set, running locally

Error: Could not find or load main class org.apache.mahout.driver.MahoutDriver

Solution
This indicates that the Mahout driver needs to be built and installed.

root@solai[bin]# mvn -DskipTests -X clean install

[INFO] Scanning for projects...

[INFO] ------------------------------

[ERROR] BUILD FAILURE

[INFO] ---------------------------------

[INFO] Unable to build project '/media/bdalab/bdalab/sw/mahout/pom.xml; it requires Maven version 3.3.3
I downloaded the latest version of Maven (3.3.3) from the repository, unpacked it, and ran the previous command using the new Maven's bin directory:

root@solai[bin]# $MAVEN_HOME/bin/mvn -DskipTests -X clean install

org.apache.maven.enforcer.rule.api.EnforcerRuleException: Detected JDK Version: 1.8.0-60 is not in the allowed range [1.7,1.8).

So I changed Java from 1.8.0_60 to 1.7. Then I got this error:
root@solai[bin]# $MAVEN_HOME/bin/mvn -DskipTests -X clean install

[INFO] Mahout Build Tools ..... SUCCESS [02:42 min]

[INFO] Apache Mahout ..... SUCCESS [ 0.041 s]

[INFO] Mahout Math ......FAILURE [01:45 min]

[INFO] Mahout HDFS ........ SKIPPED

[INFO] Mahout Map-Reduce ..... SKIPPED

[INFO] Mahout Integration ..... SKIPPED

[INFO] Mahout Examples .........SKIPPED

[INFO] Mahout Math Scala bindings ..... SKIPPED

[INFO] Mahout H2O backend ...... SKIPPED

[INFO] Mahout Spark bindings ..... SKIPPED

[INFO] Mahout Spark bindings shell ..... SKIPPED

[INFO] Mahout Release Package ..... SKIPPED

Caused by: org.eclipse.aether.transfer.ArtifactTransferException: Could not transfer artifact org.apache.maven:maven-core:jar:2.0.6 from/to central (https://repo.maven.apache.org/maven2): GET request of: org/apache/maven/maven-core/2.0.6/maven-core-2.0.6.jar from central failed
I suspected the error was caused by a networking issue, so I ran the same command again.
As I had guessed, the installation completed successfully.

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "main"
After the successful installation I was trying to get the mahout> prompt:

cd $MAHOUT_HOME

bin/mahout spark-shell

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "main"

Solution
export JAVA_TOOL_OPTIONS="-Xmx2048m -XX:MaxPermSize=1024m -Xms1024m"

Sunday, August 30, 2015

Installing Mahout on Spark 1.4.1

Installing Mahout and Spark

In this blog I will describe the steps to install Mahout with Apache Spark 1.4.1 (the latest version at the time of writing), and also list the possible errors and remedies.

Installing Mahout & Spark on your local machine

1) Download Apache Spark 1.4.1 and unpack the archive file

2) Change to the directory where you unpacked Spark and type sbt/sbt assembly to build it

3) Make sure the right version of Maven (3.3) is installed on your system. If not, install mvn before building Mahout

4) Create a directory for Mahout somewhere on your machine, change to it and check out the master branch of Apache Mahout from GitHub: git clone https://github.com/apache/mahout mahout

5) Change to the mahout directory and build Mahout using mvn -DskipTests clean install


Starting Mahout's Spark shell

1) Go to the directory where you unpacked Spark and type sbin/start-all.sh to start Spark locally

2) Open a browser and point it to http://localhost:8080/ to check whether Spark started successfully. Copy the URL of the Spark master at the top of the page (it starts with spark://)

3) Define the following environment variables:

export MAHOUT_HOME=[directory into which you checked out Mahout]

export SPARK_HOME=[directory where you unpacked Spark]

export MASTER=[url of the Spark master]

4) Finally, change to the directory where you unpacked Mahout and type bin/mahout spark-shell; you should see the shell starting and get the prompt mahout>
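For example, a hedged sketch with hypothetical paths and master URL (substitute your own install locations and the URL copied from the Spark UI):

export MAHOUT_HOME=/opt/mahout
export SPARK_HOME=/opt/spark-1.4.1
export MASTER=spark://localhost:7077
cd $MAHOUT_HOME && bin/mahout spark-shell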



In the next blog I will discuss the possible errors while installing Mahout, with solutions.

Next : Resolved issues - Installing Mahout 0.11.0 with Spark 1.4.1

Thursday, May 14, 2015

Hive configuration at Zeppelin - error and way to solve

Hive configuration at Zeppelin - error and way to solve

I have successfully compiled the latest version of Zeppelin from source. You can get the Apache Zeppelin binary on request: leave your mail id and I will provide a link to download it.

The Hive interpreter is not available in my Apache Zeppelin. What should I do?

Install Hive and set HIVE_HOME.

If Hive is installed on your system, Zeppelin will automatically recognize it and list the Hive interpreter. If Hive is installed but no interpreter is listed in Zeppelin, check HIVE_HOME and make sure it is set properly.

If Hive is not installed, install it; Zeppelin will pick it up automatically on the next restart.

Error

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
Solution

Add export HADOOP_USER_CLASSPATH_FIRST=true to HIVE_HOME/conf/hive-env.sh
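A quick, hedged way to append the setting (assumes HIVE_HOME is set and hive-env.sh already exists, e.g. copied from hive-env.sh.template):

bdalab@solai:/opt$ echo 'export HADOOP_USER_CLASSPATH_FIRST=true' >> $HIVE_HOME/conf/hive-env.sh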


Thursday, May 7, 2015

Configuring Hive at HUE (Hadoop ui)

Hive configuration at HUE
previous post - Hue installation error

I have successfully compiled the latest version of Hue from source. When I accessed it at http://localhost:8000/ I saw a warning that the Hive editor was not configured.

Hive Editor - The application won't work without a running HiveServer2.

To start HiveServer2:

bdalab@solai:/opt$ $HIVE_HOME/bin/hive --service hiveserver2

In another terminal:

bdalab@solai:/opt$ $HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10000

I got a permission denied error. Error

Error: Failed to open new session: java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=APP, access=EXECUTE, inode="/tmp/hive/APP/704d9b8d-a9d5-4be6-a8f1-082e8eba9a0c":hdfs:supergroup:drwx------
Solution

I then changed the Derby username (javax.jdo.option.ConnectionUserName) from APP to hdfs in hive-default.xml under HIVE_HOME/conf.
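For reference, this is roughly how the changed property looks in the configuration file (a hedged sketch; only the value is changed to hdfs):

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hdfs</value>
</property>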

Wednesday, May 6, 2015

Hue installation Error

Hue installation Error

I have been installing a recent version of HUE from source. During compilation I hit a few issues; here are the steps to resolve them.

Error

    Unable to get dependency information: Unable to read the metadata file for artifact 'com.sun.jersey:jersey-core:jar': Cannot find parent: net.java:jvnet-parent for project: com.sun.jersey:jersey-project:pom:1.9 for project com.sun.jersey:jersey-project:pom:1.9 com.sun.jersey:jersey-core:jar:1.9

....
Path to dependency: 1) com.cloudera.hue:hue-plugins:jar:3.8.1-SNAPSHOT
2) org.apache.hadoop:hadoop-client:jar:2.6.0-mr1-cdh5.5.0-SNAPSHOT
3) org.apache.hadoop:hadoop-hdfs:jar:2.6.0-cdh5.5.0-SNAPSHOT


Solution

Update Maven: by default mvn -version shows 2.x, but Hue needs Maven version 3 or above.
Download, extract and install Maven 3 (if not already installed) and add Maven to the PATH:

export PATH=$PATH:$MAVEN_HOME/bin

Even after I updated the PATH variable with the latest Maven, mvn -version still showed 2.x,
so I manually updated Maven using update-alternatives:

  bdalab@solai:/opt$ sudo update-alternatives --install /usr/bin/mvn mvn $MAVEN_HOME/bin/mvn 1

  bdalab@solai:/opt$ sudo update-alternatives --config mvn

Now select the number referring to the recent Maven 3 installation from the list of choices.

Error

    Failed to obtain user group information: org.apache.hadoop.security.authorize.AuthorizationException: Unauthorized connection for super-user: hue (error 401)
Solution

Configure Hue as a proxy user for all other users and groups, meaning it may submit a request on behalf of any other user. Add the properties below to core-site.xml, within the configuration tags:
<property>
  <name>hadoop.proxyuser.hue.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hue.groups</name>
  <value>*</value>
</property>
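After editing core-site.xml the NameNode has to pick up the new proxy-user settings; a hedged sketch is to either restart HDFS or refresh the configuration in place:

bdalab@solai:/opt$ hdfs dfsadmin -refreshSuperUserGroupsConfiguration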

Wednesday, April 22, 2015

how to : Execute HDFS commands from DataNode

how to : Work on NameNode (HDFS) from DataNode

These commands work on the NameNode from a DataNode, or from any other system with Hadoop installed (which may or may not be part of the Hadoop cluster).

Execute an HDFS 'fs' command from a DataNode

    bdalab@solai:/opt$ hadoop fs -fs hdfs://masterNodeIP:9000/ -rm /input/log.csv


The above command is executed from the DataNode; the file '/input/log.csv' will be removed on the NameNode.
here, masterNodeIP -> IP address of the remote NameNode system

List/show all the files in NameNode from DataNode

    bdalab@solai:/opt$ hadoop fs -fs hdfs://masterNodeIP:9000/ -ls /


Create dir 'pjt' in NameNode from DataNode

    bdalab@solai:/opt$ hadoop fs -fs hdfs://masterNodeIP:9000/ -mkdir /pjt

All of the above commands are run from the DataNode and executed on the NameNode.
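An equivalent, hedged alternative (assuming Hadoop 2.x) is to override fs.defaultFS per command with the generic -D option instead of -fs:

    bdalab@solai:/opt$ hadoop fs -D fs.defaultFS=hdfs://masterNodeIP:9000/ -ls /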

Tuesday, April 21, 2015

How to : Move the file from DataNode to NameNode

1 step to Move the file from DataNode to NameNode


In some cases we would need to move a file that is available on a DataNode, but not yet in HDFS, into the HDFS cluster.

The same command is also useful when you have a file on any system with Hadoop installed (which may or may not be part of the Hadoop cluster) and want to move it to a NameNode.

move file from DataNode system to Hadoop cluster

    bdalab@solai:/opt$ hadoop fs -fs hdfs://NameNodeIP:9000/ -put /FileToMove /hadoopFolderName/


here,
NameNodeIP -> IP address of NameNode system
FileToMove -> is a file to be moved to HDFS

OR
    bdalab@solai:/opt$ hadoop fs -fs hdfs://10.0.18.269:9000/ -put /FileToMove /hadoopFolderName/

Friday, April 17, 2015

1 step to Move the file from Hadoop HDFS to remote system

1 step to Move the file from Hadoop HDFS to remote system

A command to move a file from the Hadoop HDFS cluster to a remote system.

In some cases we would need to move MapReduce output files from Hadoop HDFS
to a system where Hadoop is not installed.

move file from Hadoop cluster to remote system ( Non Hadoop system )

    bdalab@solai:/opt$ hadoop dfs -cat hdfs://NameNodeIP:9000/user/part-* | ssh userName@remoteSystemIP 'cat - > /home/hadoop/MRop'


here,
part-* -> is a file to be moved from HDFS
userName -> userName of remote system
remoteSystemIP -> IP address of remote system
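A hedged alternative is to copy the files to the local disk first and then transfer them with scp (the /tmp staging directory and the destination directory are just examples):

    bdalab@solai:/opt$ hadoop fs -get hdfs://NameNodeIP:9000/user/part-* /tmp/
    bdalab@solai:/opt$ scp /tmp/part-* userName@remoteSystemIP:/home/hadoop/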

1 step to Move the file from remote system to Hadoop HDFS

1 step to Move the file from remote system to Hadoop HDFS
One command to move a file from a remote system to the Hadoop HDFS cluster.

In some cases we would need to move a file from a system where Hadoop
is not installed to the Hadoop cluster.

move file from Non Hadoop system to Hadoop cluster

    bdalab@solai:/opt$ cat moveToHdfs.txt | ssh userName@NameNodeIP "hadoop dfs -put - hadoopDirName/"


here,
moveToHdfs.txt -> is a file to be moved to HDFS
userName -> userName of NameNode system
NameNodeIP -> IP address of NameNode
hadoopDirName -> Dir in HDFS


If you face an error like

    bash: hadoop: command not found

re-run the above command with the full path to hadoop:

    bdalab@solai:/opt$ cat moveToHdfs.txt | ssh userName@NameNodeIP "/opt/hadoop-2.6.0/bin/hadoop dfs -put - hadoopDirName/"

Tuesday, April 7, 2015

Configure UBER mode - MapReduce job for small dataset

Uber job configuration in YARN - Hadoop2
previous post - what is Uber mode

How to configure an uber job?

    To enable uber jobs, set the following properties (they are MapReduce job properties, normally placed in mapred-site.xml or set per job):
    mapreduce.job.ubertask.enable=true
    mapreduce.job.ubertask.maxmaps=9 (default 9)
    mapreduce.job.ubertask.maxreduces=0 (default 1)
    mapreduce.job.ubertask.maxbytes=4194304 (4 MB; the value is in bytes)

mapreduce.job.ubertask.maxbytes

    The value is given in bytes; the setting above corresponds to 4 MB, and the default is the HDFS block size. The total input size of a job must be less than or equal to this value for the job to be uberized.
    Ex.: say you have a dataset of 5 MB but have set mapreduce.job.ubertask.maxbytes to 4 MB; then uber mode will not be used.
    If you omit this property, the block size is used as the default, so whether a larger dataset (say 50 MB) runs in uber mode depends on your block size.
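A hedged sketch of the same settings as XML properties (assuming they go in mapred-site.xml, inside the configuration tags):

<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxmaps</name>
  <value>9</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxreduces</name>
  <value>0</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxbytes</name>
  <value>4194304</value>
</property>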

UBER mode in YARN Hadoop2 - Running MapReduce jobs in small dataset

what is Uber mode in YARN - Hadoop2

You might have seen these lines while running MapReduce in Hadoop2:

mapreduce.Job: Job job_1387204213494_0005 running in uber mode : false

What is UBER mode in Hadoop2?

    Normally, mappers and reducers run in containers allocated by the ResourceManager (RM): a separate container is created for each mapper and reducer.
    With the uber configuration, the mappers and reducers run in the same process as the ApplicationMaster (AM).

Uber jobs :

    Uber jobs are jobs that are executed within the MapReduce ApplicationMaster, rather than communicating with the RM to create the mapper and reducer containers.
    The AM runs the map and reduce tasks within its own process and avoids the overhead of launching and communicating with remote containers.
Why?

    If you have a small dataset and want to run MapReduce on a small amount of data, the uber configuration will help you by cutting the extra time that MapReduce normally spends launching and managing the mapper and reducer containers.
Can I configure uber mode for all MapReduce jobs?

    As of now, only
       map-only jobs and
       jobs with one reducer are supported.

Saturday, April 4, 2015

Easy way to recover the deleted files/dir in Hadoop hdfs

Easy way to recover deleted files/dirs in HDFS
In some cases files or dirs are deleted accidentally.
Is there any way to recover them?

   By default Hadoop deletes the files/dirs forever. It has a Trash feature, but it is not enabled by default.

   By configuring fs.trash.interval and fs.trash.checkpoint.interval in Hadoop's core-site.xml, deleted files/dirs are moved into the .Trash folder instead.

   The location of the .Trash folder in HDFS is /user/$USER/.Trash

configuring core-site.xml

<property>
<name>fs.trash.interval</name>
<value>120</value> 
</property>

<property>
<name>fs.trash.checkpoint.interval</name>
<value>45</value>
</property>

   With the above configuration, all deleted files/dirs are moved to the .Trash folder and the data is kept for two hours.
   The checkpoint check is performed every 45 minutes and deletes from the .Trash folder any files/dirs that are more than 2 hours old.

Restart Hadoop
    Once you modify core-site.xml, stop and start Hadoop.

   Here is an example of the remove-dir command:
hadoop@solai# hadoop fs -rmr /testTrash
15/04/05 01:10:14 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 120 minutes, Emptier interval = 45 minutes. Moved: 'hdfs://127.0.0.1:9000/testTrash' to trash at: hdfs://127.0.0.1:9000/user/bdalab/.Trash/Current
   You can clearly see the message saying that the deleted folder is moved to /user/bdalab/.Trash/Current, and the data is kept for 2 hours with a checkpoint interval of 45 minutes.

List the deleted files/dirs in the .Trash folder using -ls:
hadoop@solai# hadoop fs -ls hdfs://127.0.0.1:9000/user/bdalab/.Trash/Current/testTrash
You can view the content (with -cat) or move the files back to their original path:
hadoop@solai# hadoop fs -mv hdfs://127.0.0.1:9000/user/bdalab/.Trash/Current/testTrash /testTrash
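If you want to empty the trash immediately instead of waiting for the checkpoint interval, a hedged extra is the expunge command:

hadoop@solai# hadoop fs -expunge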

Friday, August 22, 2014

Working with Hadoop Eco systems tools : Exception and Solution

Error & Solution : Hadoop Eco-System tools
This is a continuation of the Error & Solution posts from setting up Hadoop HA.

Here I discuss a few errors/issues hit during the Automatic Fail-over configuration, which is part of the Hadoop HA setup.

Error 1)
Application application_1406816598739_0003 failed 2 times due to Error launching appattempt_1406816598739_0003_000002. Got exception: java.io.IOException: Failed on local exception: java.net.SocketException: Network is unreachable; Host Details : local host is: "localhost/127.0.0.1"; destination host is: "boss":32939;
Solution
No IP address was assigned to the node. Assign one manually, for example:
root@boss[bin]# sudo ifconfig eth0 10.184.36.194


Error 2)

root@solaiv[sqoop]# bin/sqoop import --connect jdbc:mysql://localhost/hadoop --table movies --username root -P --split-by id

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
Solution
Download the "sqoop-1.4.4-hadoop200.jar" and put it in SQOOP_HOME.


Error 3)
Pig, while executing a Pig statement from the Grunt shell:
grunt> cnt = foreach grpd generate group, count(words) as nos; Failed to generate logical plan. Nested exception: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve count using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]

Solution
count in the Pig statement should be upper case: COUNT
grunt> cnt = foreach grpd generate group, COUNT(words) as nos;


Error 4)
using SQOOP: MySQL to HIVE
root@boss:/opt/sqoop-1.4.4# bin/sqoop import --connect jdbc:mysql://localhost:3306/hadoop --table csurvey --target-dir /sqoop-hive --hive-import --split-by pname --hive-table csurveysu -username root -P

ERROR security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot delete /tmp/hadoop-yarn/staging/root/.staging/job_1407336913887_0001. Name node is in safe mode. The reported blocks 197 has reached the threshold 0.9990 of total blocks 197. The number of live datanodes 1 has reached the minimum number 2 . Safe mode will be turned off automatically in 3 seconds.
Solution
My Hadoop NameNode was in safe mode, i.e. read-only mode for the HDFS cluster: it cannot write until enough DataNodes are up for replication. Even after enough DataNodes were started my NameNode was still in safe mode, so I turned safe mode off manually:
root@localhost:/opt/hadoop-2.2.0# bin/hadoop dfsadmin -safemode leave
Then check the consistency of the HDFS cluster with fsck:

root@localhost:/opt/hadoop-2.2.0# bin/hadoop fsck /


Error 5)
Hive "show tables" does not display the table "sqooptest", which was imported by Sqoop
I imported a MySQL table into Hive using Sqoop. Once it was done I could see the data in the Hadoop file system (HDFS), but when I ran "show tables" from the Hive console it reported that no such table exists.

Solution
Enter the Hive shell from the same directory where you ran the sqoop import command.

root@solaiv[bin]# cd /opt
root@solaiv[opt]# $SQOOP_HOME/bin/sqoop import --connect jdbc:mysql://localhost:3306/hadoop --table huser --hive-import --split-by name --hive-table sqooptest -username root -P

Execute hive from /opt; if you start it from any other directory you will not be able to see the table.

root@solaiv[opt]# $HIVE_HOME/bin/hive
hive> show tables;


More clarity:
By default Hive metadata is stored in a Derby database, and Derby stores its data in the current working directory. This can be customized in "hive-default.xml".
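A hedged sketch of how the Derby location can be pinned to an absolute path in the Hive configuration, so the metastore no longer depends on the working directory (the /opt/hive-metastore path is only an example):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/opt/hive-metastore/metastore_db;create=true</value>
</property>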

Monday, July 7, 2014

Hadoop High Availability - Daemons overview

Hadoop High Availability - Daemons overview
A few concepts I came across while setting up Hadoop cluster High Availability are discussed here.

Role of StandBy Node

  Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.

  In order to provide a fast fail-over, the Standby node keeps its state synchronized with the Active node: both nodes communicate with a group of separate daemons called "JournalNodes" (JNs). When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs.

 The Standby node is capable of reading the edits from the JNs, and is constantly watching them for changes to the edit log. As the Standby Node sees the edits, it applies them to its own namespace.

  In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.


DataNode configuration in Hadoop HA

In order to provide a fast fail-over, it is also necessary that the Standby node has up-to-date information regarding the location of blocks in the cluster and status of each DataNode.

 In order to achieve this, all the DataNodes are configured with the location of both NameNodes (Active & Standby), and they send block location information and heartbeats to both NameNodes.


Umm.. what is the Secondary NameNode?
I will explain starting from the NameNode, for more clarity.

Namenode 
 The NameNode holds the metadata for HDFS, such as namespace information, block information, etc. When in use, all of this information is stored in main memory, but it is also written to disk for persistent storage.
  • fsimage -> a snapshot of the filesystem when the NameNode started

  • edit logs -> the sequence of changes made to the filesystem after the NameNode started. Only on a restart of the NameNode are the edit logs applied to the fsimage to get the latest snapshot of the file system. But NameNode restarts are rare in production clusters, which means the edit logs can grow very large on clusters where the NameNode runs for a long period of time. In that situation we encounter the following issues:

    • The edit log becomes very large, which is challenging to manage
    • NameNode restart takes a long time because a lot of changes have to be merged
    • In the case of a crash we would lose a huge amount of metadata, since the fsimage is very old
  So to overcome these issues we need a mechanism that keeps the edit log size manageable and the fsimage up to date, so that the load on the NameNode is reduced.
Secondary NameNode
 The Secondary NameNode helps to overcome the above issues by taking over the responsibility of merging the edit logs with the fsimage.
  • It gets the edit logs from the NameNode at regular intervals and applies them to the fsimage (i.e. builds a new image)
  • Once it has the new fsimage, it copies it back to the NameNode
  • The NameNode will use this fsimage on the next restart, which reduces the startup time



Why Secondary NameNode not needed in HA Hadoop cluster

In an HA cluster, the Standby NameNode also performs checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an error.


What is split-brain scenario in Hadoop HA

  In an HA cluster, only one of the NameNodes may be active at a time; otherwise the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called "split-brain scenario," the JournalNodes will only ever allow a single NameNode to be a writer at a time.
  During a failover, the NameNode which is to become active simply takes over the role of writing to the JournalNodes, which effectively prevents the other NameNode from continuing in the Active state.
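A hedged way to observe and exercise this from the command line, assuming the two NameNodes are configured with the (hypothetical) service IDs nn1 and nn2:

bdalab@solai:/opt$ hdfs haadmin -getServiceState nn1
bdalab@solai:/opt$ hdfs haadmin -getServiceState nn2
bdalab@solai:/opt$ hdfs haadmin -failover nn1 nn2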

Related posts

HA Issues
