Macros and parameters

As a follow-up to this previous post, I want to quote a detail from the Pig 0.10.0 documentation:

Parameter substitution cannot be used inside of macros. Parameters should be explicitly passed to macros and parameter substitution used only at the top level. [Source]

So it appears that the macro example from Alan Gates’ Programming Pig indeed does not work.
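To make the distinction concrete, here is a minimal sketch of how I read that rule (the macro, the threshold argument and the $CUTOFF parameter are my own, not from the documentation): substitution parameters are resolved only at the top level, e.g. with pig -param CUTOFF=0.5, and their values are then handed to the macro as ordinary macro arguments.

-- param_vs_macro.pig (sketch)
-- inside the macro body only macro parameters like $rel and $threshold may appear
DEFINE dividends_above(rel, threshold)
RETURNS big {
    $big = FILTER $rel BY dividends > $threshold;
};

-- top level: $CUTOFF is a substitution parameter and gets resolved here,
-- then passed into the macro as a plain value
divs = LOAD 'NYSE_dividends' AS (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
big_divs = dividends_above(divs, $CUTOFF);
DUMP big_divs;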

A special thanks to Gargi for pointing that out.

Advanced Pig Latin — Macros

Today we worked with two datasets and learned about advanced Pig Latin functionality such as macros.

This post is inspired by this post by the HortonWorks blogging community and this book chapter on advanced Pig Latin by Alan Gates. By the way, I really liked the spirit behind this book: the open feedback publishing system. I haven’t had the opportunity to read many books using this system, where readers can leave comments at the end of any paragraph and the author can reply and/or revise the paragraph based on the reader’s feedback.

My little Pig script:
-- macro_tester.pig
IMPORT 'common.macro';
X = LOAD 'NYSE_dividends' AS (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
-- Y = row_count(X);
Y = row_count_by(X, symbol, 4);
DUMP Y;

Note that the IMPORT statement requires single quotes and a semicolon.

The macro file:
-- common.macro

-- row_count
/*
* Given a relation rel, row_count returns
* the COUNT_STAR of rel, i.e. the number of
* rows including empty rows.
*/
DEFINE row_count(rel)
RETURNS counted {
    grouped = GROUP $rel ALL;
    $counted = FOREACH grouped GENERATE COUNT($rel);
};

-- row_count_by
/*
* Given a relation rel, a column name col and
* a parallelization parameter par, row_count_by
* returns the COUNT_STAR of rel grouped by col.
*/
DEFINE row_count_by(rel, col, par)
RETURNS counted {
    grouped = GROUP $rel BY $col PARALLEL $par;
    $counted = FOREACH grouped GENERATE group, COUNT($rel);
};

To see what your script looks like right before compilation, you can do a dry run by passing -dryrun or -r on the command line (for example, pig -dryrun macro_tester.pig).

Note that the alias names from within the macro are renamed to avoid collisions with alias names at the site where the macro is expanded. The expanded file looks like this:
X = LOAD 'NYSE_dividends' AS (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
macro_row_count_by_grouped_0 = GROUP X BY (symbol) PARALLEL 4;
Y = FOREACH macro_row_count_by_grouped_0 GENERATE group, COUNT(X);
dump Y

Surprisingly, the expanded file does not include the final semicolon. I have no idea why.

In the example from Programming Pig, the following line generates an error in Pig 0.10.0-cdh4.1.2:
$analyzed = foreach jnd generate dailythisyear::$daily_symbol, $daily_close - $daily_open;

Error message: Unexpected character '$' at dailythisyear::$daily_symbol

A simple way to avoid this error:
$analyzed = foreach jnd generate divsthisyear::symbol, $daily_close - $daily_open;

Because of the join, both columns should contain the same values, I guess.
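Another way around it (a sketch of my own, not from the book; the macro and parameter names are hypothetical, and I am assuming the parser only rejects the '$' directly after '::'): rename the parameterized columns with AS before the join, so the disambiguation operator is never followed by a macro parameter.

DEFINE daily_delta(daily, daily_symbol, daily_open, daily_close)
RETURNS analyzed {
    -- give the parameterized columns fixed names up front
    d = FOREACH $daily GENERATE $daily_symbol AS sym, $daily_open AS open_p, $daily_close AS close_p;
    divs = LOAD 'NYSE_dividends' AS (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
    jnd = JOIN divs BY symbol, d BY sym;
    -- no '$' needed after '::' anymore
    $analyzed = FOREACH jnd GENERATE d::sym, d::close_p - d::open_p;
};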

First results with Pig and HBase

Results of the day:
Obama

The number of occurrences of the word ‘Obama’, matched with .*obama.* (case insensitive), computed with Pig on my pseudo-distributed cluster. The data used is a collection of forum posts written between 21.01.2007 and 22.09.2012, stored in an HBase table.

Hadoop version: 2.0.0-cdh4.1.2
Pig version: 0.10.0-cdh4.1.2
Features: GROUP_BY, ORDER_BY, FILTER
Processor: Intel Core 2 Duo T7500 2.20GHz (64-bit)
Memory: 4GB
#rows: 1,731,121
#column families: 1
#columns: 15
#relevant columns: 2
Time: 21m49s

Interesting: the peaks around his election on 04.11.2008, the presidential inauguration on 20.01.2009 and the third anniversary of his taking office on 20.01.2012, as well as the slow rise in the number of occurrences toward the end of the data, i.e. toward the 2012 presidential election.
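For reference, here is a sketch of roughly what such a script can look like; the HBase table name, column family and column names are hypothetical (this is not my actual script), and lower-casing the text before the match is just one way to make it case insensitive.

-- obama_count.pig (sketch, hypothetical table and column names)
posts = LOAD 'hbase://forum_posts'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('post:date post:text')
        AS (date:chararray, text:chararray);
obama = FILTER posts BY LOWER(text) MATCHES '.*obama.*';
by_day = GROUP obama BY date;
counts = FOREACH by_day GENERATE group AS day, COUNT(obama) AS n;
ordered = ORDER counts BY day;
STORE ordered INTO 'obama_counts';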

HBaseStorage and Pig

A very useful post to get started with HBaseStorage and Pig: link

If you’re not as successful as the author after typing $ pig -x local hbase_sample.pig in the terminal, you may receive an error message like the following:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. org/apache/hadoop/hbase/filter/Filter

If that is the case, your $PIG_CLASSPATH is probably not set correctly; please refer to my previous post for some tips on how to solve this error.
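An alternative to setting PIG_CLASSPATH that I have seen is to REGISTER the required jars at the top of the Pig script itself. A minimal sketch, assuming a CDH-style layout like the one in my .profile below; the exact jar paths, and whether extra jars (e.g. the ZooKeeper one) are needed, depend on your installation.

-- sketch: make the HBase classes visible to Pig without touching PIG_CLASSPATH
-- (jar locations are assumptions and will differ per installation)
REGISTER /usr/lib/hbase/hbase.jar;
REGISTER /usr/lib/zookeeper/zookeeper.jar;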

.profile

I had some trouble finding the proper lines to append to my ~/.profile to get Pig running, until I posted my problem on stackoverflow.com and somebody helped (thank you!).

So, in case somebody wants an example of something that works (but is certainly far from perfect):

# if running bash
if [ -n "$BASH_VERSION" ]; then
    # include .bashrc if it exists
    if [ -f "$HOME/.bashrc" ]; then
        . "$HOME/.bashrc"
    fi
fi

# set PATH so it includes user's private bin if it exists
if [ -d "$HOME/bin" ] ; then
    PATH="$HOME/bin:$PATH"
fi

# set PATH so it includes the system bin if it exists
if [ -d "/bin" ] ; then
    PATH="/bin:$PATH"
fi

export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HBASE_HOME=/usr/lib/hbase
export HBASE_CONF_DIR=/etc/hbase/conf
export PIG_HOME=/usr/lib/pig
export PIG_CONF_DIR=/etc/pig/conf

export PATH="$HADOOP_HOME/bin:$HBASE_HOME/bin:$HADOOP_MAPRED_HOME/bin:$PIG_HOME/bin:$PATH"
export PIG_CLASSPATH="`${HBASE_HOME}/bin/hbase classpath`:$PIG_CLASSPATH"

Note that the backticks in the definition of PIG_CLASSPATH are actual backticks (shell command substitution), and not single quotes!

Ubuntu and proprietary graphic card drivers

Just don’t.

I guess some of you may smile while reading this post. But as I wrote earlier, I’m new to Ubuntu. I lost the equivalent of 14 days of work by installing the proprietary NVIDIA driver for my graphics card, as my computer wasn’t able to drive two non-mirrored displays without it. Big mistake: my system eventually crashed, and after trying for hours and hours to get it working again, I basically had to do a clean install of Ubuntu. No chance.

So if any of you Ubuntu newbies like me are contemplating installing a proprietary driver for your graphics card: think twice, google it, and think once more.

Anyway, I’m now just using the basic Nouveau driver because I don’t want to risk losing my computer a second time. And two screens are just not that fancy anymore :-P

(Note to self: get back in there and install this driver properly! You can do it!)

/etc/hosts

I have heard/read many times that I should get rid of localhost for my cluster to work with fewer problems, but I’m still trying to understand how to set up my hosts file without it. The last time I tried, I got this error message:

INFO ipc.HBaseRPC: Problem connecting to server: localhost/85.183.###.#:60020

But I have absolutely no idea where this IP address comes from; it’s not my computer’s… or is it? Anyway, my /etc/hosts still looks like this:

127.0.0.1       localhost
127.0.0.1       hadoop hbase

Another source of the problem is that I cannot use a static IP, as I take my laptop around and my IP address changes regularly.