Friday, August 30, 2013

Installing PostgreSQL on Windows: The "Secondary Logon" service is not running.


Error 


The "Secondary Logon" service is not running. This service is required for the installer to initialize the database. Please start this service and try again.

Solution 
    As the error clearly states, the PostgreSQL installer needs the Secondary Logon service to be up and running. To start the service, go to:


home-> Control panel -> administrative tools -> services 

then find the Secondary Logon service and start it via right-click.
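Alternatively, the service can be started from an elevated command prompt; a quick sketch, assuming the standard internal service name seclogon:

net start seclogon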

Then continue with your installation.
     

Thursday, August 29, 2013

MongoDB-Hadoop integration and data processing with Apache Pig: running a Pig script with Node.js


I have weather data in a MongoDB database. As an experiment, I'm porting that data into the Hadoop environment and processing it with Pig.

Using the mongodb-hadoop connector, Pig can read from and write to MongoDB. Once the output is written back into MongoDB, users can view the result data from a browser.

I've created a simple web application with MongoDB and Node.js using the Express framework, and the Pig script is also executed from within Node.



trades collection data

>  use week6                                      
> db.trades.findOne()
{
 "_id" : ObjectId("51b05fbce3600f7b48448eda"),
 "ticker" : "abcd",
 "time" : ISODate("2012-03-03T04:24:13.003Z"),
 "price" : 110,
 "shares" : 200,
 "ticket" : "z135" // this key used for group in pig script
}



Pig script


root@boss[bin]#vi cntWthr.pig

-- MongoDB Java driver
REGISTER  /opt/hadoop-1.1.0/lib/mongo-2.10.1.jar;
-- Core Mongo-Hadoop Library
REGISTER /opt/bigdata/mongo-hadoop/core/target/mongo-hadoop-core_1.1.2-1.1.0.jar;
-- mongo-hadoop pig support
REGISTER /opt/bigdata/mongo-hadoop/pig/target/mongo-hadoop-pig_1.1.2-1.1.0.jar;

trades = LOAD 'mongodb://localhost:27017/week6.trades' using com.mongodb.hadoop.pig.MongoLoader; 
grp = GROUP trades by $0#'ticket';
cnt = FOREACH grp GENERATE group,COUNT(trades);
--dump cnt;
STORE cnt INTO 'mongodb://localhost:27017/mongo_hadoop.yield_historical.outt' USING com.mongodb.hadoop.pig.MongoInsertStorage('group:float,cnt:int', 'group');


In the above script:

  • LOAD the MongoDB data from the week6 database, trades collection.
  • GROUP the loaded data on the key ticket.
  • GENERATE the COUNT for each ticket.
  • Instead of displaying the result on the console or storing it into HDFS, STORE it back into MongoDB, into the mongo_hadoop database and the yield_historical.outt collection (a standalone run command is shown below).
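Before wiring this into Node, the script can be run on its own from $PIG_HOME/bin to confirm the MongoDB load and store work; a minimal check, assuming pig is on the PATH and Hadoop is running as configured above:

root@boss[bin]#pig cntWthr.pig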
Node.js 

root@boss[bin]#vi mongoHadoopNode.js


// This script must be run from $PIG_HOME/bin and the file cntWthr.pig must exist in the same path.
var express = require('express'),     
    app = express(),
    cons = require('consolidate'), 
    mongoClient = require('mongodb').MongoClient,
    Server  = require('mongodb').Server;
// Configuring view template
app.engine('html',cons.swig);
app.set('view engine','html');
app.set('views', __dirname + "/views");
// Running the Pig script file
var spawn = require('child_process').spawn,
    runPig = spawn('pig',['cntWthr.pig']);

// Handling the output of the Pig process
runPig.stdout.on('data',function(data){
    console.log('stdout : '+data);
});
runPig.stderr.on('data',function(data){
    console.log('stderr : '+data );
});

// Read the result document written by the Pig job and render it
app.get('/',function(req,res){
    mongoClient.connect('mongodb://localhost:27017/mongo_hadoop',function(err,db){
        if(err) throw err;
        db.collection('yield_historical.outt').findOne({},function(err,doc){
            res.render('template',{'Group':doc.group,'Value':doc.val_0});
            db.close();
        });
    });
});

app.get('*',function(req,res){
    res.send("Page Not Found !!!! ");
});

// Open a client connection, then start the Express server
var mongo = new mongoClient(new Server('localhost',27017));
mongo.open(function(err,mongo){
    if(err) throw err;
    app.listen(8000);
    console.log("Express server started successfully localhost:8000");
});

A simple HTML page ( views/template.html )

MongoDB-Hadoop integration and Job processing with PIG 
Group : {{Group}}  Value : {{Value}}

Run Node.js:
root@boss[bin]#node mongoHadoopNode.js
Express server started successfully localhost:8000

A new MongoDB collection is created under the mongo_hadoop database:
>  use mongo_hadoop
> db.yield_historical.outt.findOne()
{ "_id" : ObjectId("521f0ebf908dfe1853af7c01"), "group" : "z135", "val_0" : NumberLong(1667) }

Enter the URL in a browser, which will show output like:
http://localhost:8000

MongoDB-Hadoop integration and Job processing with PIG
Group : z447 
Value : 834
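The same page can also be fetched from a shell instead of a browser, if that is easier to check:

curl http://localhost:8000/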


Executing an Apache Pig script from Node.js

My previous post describes how to create the mongodb-hadoop connector.

Once that is done,
  •  put the mongo-hadoop-core_1.1.2-1.1.0 connector jar into $HADOOP_HOME/lib
  •  download the latest version of the MongoDB Java driver and put it into $HADOOP_HOME/lib 
In both cases, the Node script (i.e. the runPig.js file) should be in $PIG_HOME/bin.
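For reference, the copy steps might look something like this; the paths follow the layout used earlier in this post and the driver jar is assumed to be in the current directory:

root@boss[mongo-hadoop]#cp core/target/mongo-hadoop-core_1.1.2-1.1.0.jar $HADOOP_HOME/lib/
root@boss[mongo-hadoop]#cp mongo-2.10.1.jar $HADOOP_HOME/lib/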

method 1
root@boss:/opt/bigdata/pig-0.11.1/bin>vi runPig.js

// Spawn the Pig process with the script file as its argument
var spawn = require('child_process').spawn;
var runPig = spawn('pig',['cntWthr.pig']);

// Stream the Pig logs to the Node console as they are produced
runPig.stdout.on('data',function(data){
    console.log('stdout : '+data);
});
runPig.stderr.on('data',function(data){
    console.log('stderr : '+data + " process Home : "+process.env.HOME);
});

root@boss:/opt/bigdata/pig-0.11.1/bin>node runPig.js

method 2

root@boss:/opt/bigdata/pig-0.11.1/bin>vi runPig.js
// Run the whole script via a shell command and print its stdout when it finishes
var sys = require('sys');
var exec = require('child_process').exec;
function puts(error, stdout, stderr) { sys.puts(stdout); }
exec("pig -f cntWthr.pig", puts);

root@boss:/opt/bigdata/pig-0.11.1/bin>node runPig.js

Method 2 will not show any log in the console, but method 1 will show the log as it executes in the Pig environment.

Tuesday, August 27, 2013

Issues with MongoDB and Hadoop Integration


I just tried to create the connector for the MongoDB NoSQL database with Hadoop 0.21.0.

While doing so I got a few errors that were solvable, and a hadoop-core unresolved dependency that was not. I then decided to create the connector for Hadoop 1.1.0.

After downloading it from git, I did:

ERROR 1)


root@boss[mongo-hadoop]# ./sbt package


java.lang.RuntimeException: Hadoop Release '%s' is an invalid/unsupported release.  Valid entries are in 0.21.0
 at scala.sys.package$.error(package.scala:27)
 at MongoHadoopBuild$$anonfun$streamingSettings$6$$anonfun$apply$8.apply(MongoHadoopBuild.scala:176)
 at MongoHadoopBuild$$anonfun$streamingSettings$6$$anonfun$apply$8.apply(MongoHadoopBuild.scala:176)
 at scala.collection.MapLike$class.getOrElse(MapLike.scala:122)
 at scala.collection.immutable.HashMap.getOrElse(HashMap.scala:38)
 at MongoHadoopBuild$$anonfun$streamingSettings$6.apply(MongoHadoopBuild.scala:176)
 at MongoHadoopBuild$$anonfun$streamingSettings$6.apply(MongoHadoopBuild.scala:175)
 at scala.Function1$$anonfun$compose$1.apply(Function1.scala:49)
 at scala.Function1$$anonfun$compose$1.apply(Function1.scala:49)
 at sbt.EvaluateSettings$$anonfun$sbt$EvaluateSettings$$single$1.apply(INode.scala:159)
 at sbt.EvaluateSettings$$anonfun$sbt$EvaluateSettings$$single$1.apply(INode.scala:159)
 at sbt.EvaluateSettings$MixedNode.evaluate0(INode.scala:177)
 at sbt.EvaluateSettings$INode.evaluate(INode.scala:132)
 at sbt.EvaluateSettings$$anonfun$sbt$EvaluateSettings$$submitEvaluate$1.apply$mcV$sp(INode.scala:64)
 at sbt.EvaluateSettings.sbt$EvaluateSettings$$run0(INode.scala:73)
 at sbt.EvaluateSettings$$anon$3.run(INode.scala:69)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:619)
[error] Hadoop Release '%s' is an invalid/unsupported release.  Valid entries are in 0.21.0
[error] Use 'last' for the full log.

SOLUTION
I simply changed "0.21.0" to "0.21" in build.sbt:



root@boss[mongo-hadoop]#vi build.sbt 
.......
.......
hadoopRelease in ThisBuild := "0.21"


ERROR 2)
Then I executed the same command:

root@boss[mongo-hadoop]# ./sbt package


module not found: org.apache.hadoop#hadoop-core;0.21.0
[warn] ==== local: tried
[warn]   /root/.ivy2/local/org.apache.hadoop/hadoop-core/0.21.0/ivys/ivy.xml
[warn] ==== Simile Repo at MIT: tried
[warn]   http://simile.mit.edu/maven/org/apache/hadoop/hadoop-core/0.21.0/hadoop-core-0.21.0.pom
[warn] ==== Cloudera Repository: tried
[warn]   https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/hadoop/hadoop-core/0.21.0/hadoop-core-0.21.0.pom
[warn] ==== Maven.Org Repository: tried
[warn]   http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-core/0.21.0/hadoop-core-0.21.0.pom
[warn] ==== releases: tried
[warn]   https://oss.sonatype.org/content/repositories/releases/org/apache/hadoop/hadoop-core/0.21.0/hadoop-core-0.21.0.pom
[warn] ==== public: tried
[warn]   http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-core/0.21.0/hadoop-core-0.21.0.pom
[info] Resolving org.specs2#specs2-scalaz-core_2.9.2;6.0.1 ...
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  ::          UNRESOLVED DEPENDENCIES         ::
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  :: org.apache.hadoop#hadoop-core;0.21.0: not found
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
sbt.ResolveException: unresolved dependency: org.apache.hadoop#hadoop-core;0.21.0: not found

SOLUTION

Surfing around the web, I concluded the issue was related to port 443, but even after port 443 was opened the error remained.

Then I went through all of the repository links above; none of them actually has a hadoop-core-0.21.0.jar file.

---

Then I tried to create the connector for Hadoop version 1.1.0, and it was created successfully. You can also download the MongoDB connector for Hadoop 1.1 here.
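For Hadoop 1.1.0, the change is presumably again the hadoopRelease setting in build.sbt; a sketch, assuming "1.1" is an accepted value (if it is not, sbt prints the list of valid entries, as in ERROR 1 above):

root@boss[mongo-hadoop]#vi build.sbt
.......
hadoopRelease in ThisBuild := "1.1"

root@boss[mongo-hadoop]# ./sbt package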

I'm working with MongoDB + Hadoop with Pig and will post any issues or an example workflow.
 

---

Working with MongoDB + Hadoop with Pig, I followed the Treasury Yield Calculation example from here.

1) mongoimport of the .json data (a sample command is sketched after this list)
2) downloaded the piggybank-0.3-amzn jar from s3://elasticmapreduce/libs/pig/0.3/piggybank-0.3-amzn.jar
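The import step might look like this; the database, collection and file names follow the Treasury Yield example and are assumptions here:

mongoimport --db mongo_hadoop --collection yield_historical.in --file yield_historical_in.json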

While executing

grunt> REGISTER /opt/bigdata/mongo-hadoop/piggybank-0.3-amzn.jar  
...... 
   grunt> date_tenyear = foreach  raw generate UnixToISO($0#'_id'), $0#'bc10Year';

I got the error below:


ERROR 1070: Could not resolve org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]

Then I looked inside piggybank-0.3-amzn.jar; there is nothing like UnixToISO() in it.

This time I registered piggybank.jar from $PIG_HOME/contrib/piggybank/java:


grunt> REGISTER /opt/bigdata/pig-0.10.0/contrib/piggybank/java/piggybank.jar;
....

After this, all the steps from the example mentioned earlier worked; Pig is able to read from and write to MongoDB.






Thursday, August 15, 2013

Data transfer from one table in a PostgreSQL database to another table in a different database.


Transfer Data between databases with PostgreSQL
      
      pg_dump sourceDB  -t fromTbl -c -s | psql -h 192.16.3.2 targetDB;

(Note: -s dumps the schema only; drop that flag if the table's data should be transferred as well.)

 Problem with pg_dump
    pg_dump: server version: 9.x.x ; pg_dump version: 9.y.y
    pg_dump: aborting because of server version mismatch

The above command is well suited when both servers run the same version of PostgreSQL.
If the source and target servers run different versions, you have to use the COPY command to
copy data from one table in a PostgreSQL database to the corresponding table in a different database running on a different server,

i.e. a cross-database copy/transfer of data in PostgreSQL.

General syntax for the cross-database copy command:

psql -c "copy (select list of column  from table_name ) to stdin " dbanme | psql -c "table_name(specify the column ) from stdout " targetDB

This covers cross-database data transfer even when:
  • the target table already exists (it must exist beforehand)
  • the table/relation names differ
  • the source and target databases follow different table schemas
  • only a few columns need to be copied/transferred from the source table to the target database
  • the servers run different versions of PostgreSQL

Example:
In sourceDB: table employee(eid, ename, esalary, edesignation)
In targetDB: table staff(sid, sname, spay)
Now we would like to transfer data between these two databases:


psql -c " copy ( select eid, ename, esalary from employee) to stdin " sourceDB | psql -c " copy staff(sid,sname,spay) from stdout " targetDB


Data from the employee table in the source database is copied/transferred to the staff table in targetDB.
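A quick way to confirm the transfer, using the table and database names from the example above:

psql -c "select count(*) from staff" targetDB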
 note :
  • a single command does both the backup from one database and the restore into another
  • we moved only 3 columns from the employee table, not all of them
  • the data types of the corresponding columns in the two tables must be the same
  • if the source and target databases are on different servers, use the -h option in psql
  • ex :
    • psql -h 127.0.0.1 -c "copy (select eid,ename from emp ) to stdout " sourceDB | psql -h 192.168.37.2 -c "copy staff(sid,sname) from stdin " targetDB