Macros and parameters

As a follow-up to this previous post, I want to quote a detail from the Pig 0.10.0 documentation:

Parameter substitution cannot be used inside of macros. Parameters should be explicitly passed to macros and parameter substitution used only at the top level. [Source]

So it appears that the macro example from Alan Gates’ Programming Pig indeed does not work.
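To make the distinction concrete, here is a minimal sketch of how I read that rule (the macro, the threshold argument and the $CUTOFF parameter are my own, not from the documentation): substitution parameters are resolved only at the top level, e.g. with pig -param CUTOFF=0.5, and their values are then handed to the macro as ordinary macro arguments.

-- param_vs_macro.pig (sketch)
-- inside the macro body only macro parameters like $rel and $threshold may appear
DEFINE dividends_above(rel, threshold)
RETURNS big {
    $big = FILTER $rel BY dividends > $threshold;
};

-- top level: $CUTOFF is a substitution parameter and gets resolved here,
-- then passed into the macro as a plain value
divs = LOAD 'NYSE_dividends' AS (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
big_divs = dividends_above(divs, $CUTOFF);
DUMP big_divs;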

A special thanks to Gargi for pointing that out.

Advanced Pig Latin — Macros

Today we worked with two datasets and learned about advanced Pig Latin functionality such as macros.

This post is inspired by this post by the HortonWorks blogging community and this book chapter on advanced Pig Latin by Alan Gates. By the way, I really liked the spirit behind this book: the open feedback publishing system. I haven’t had the opportunity to read many books using this system, where readers can leave comments at the end of any paragraph and the author can reply and/or revise the paragraph based on the reader’s feedback.

My little Pig script:
-- macro_tester.pig
IMPORT 'common.macro';
X = LOAD 'NYSE_dividends' AS (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
-- Y = row_count(X);
Y = row_count_by(X, symbol, 4);
DUMP Y;

Note that the IMPORT statement requires single quotes and a semicolon.

The macro file:
-- common.macro

-- row_count
/*
* Given a relation rel, row_count returns
* the COUNT_STAR of rel, i.e. the number of
* rows including empty rows.
*/
DEFINE row_count(rel)
RETURNS counted {
    grouped = GROUP $rel ALL;
    $counted = FOREACH grouped GENERATE COUNT($rel);
};

-- row_count_by
/*
* Given a relation rel, a column name col and
* a parallelization parameter par, row_count_by
* returns the COUNT_STAR of rel grouped by col.
*/
DEFINE row_count_by(rel, col, par)
RETURNS counted {
    grouped = GROUP $rel BY $col PARALLEL $par;
    $counted = FOREACH grouped GENERATE group, COUNT($rel);
};

To see what your script looks like right before compilation, you can do a dry run by passing -dryrun or -r on the command line (for example, pig -dryrun macro_tester.pig).

Note that the alias names from within the macro are renamed to avoid collisions with alias names at the site where the macro is expanded. The expanded file looks like this:
X = LOAD 'NYSE_dividends' AS (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
macro_row_count_by_grouped_0 = GROUP X BY (symbol) PARALLEL 4;
Y = FOREACH macro_row_count_by_grouped_0 GENERATE group, COUNT(X);
dump Y

Surprisingly, the expanded file does not include the final semicolon. I have no idea why.

In the example from Programming Pig, the following line generates an error in Pig 0.10.0-cdh4.1.2:
$analyzed = foreach jnd generate dailythisyear::$daily_symbol, $daily_close - $daily_open;

Error message: Unexpected character '$' at dailythisyear::$daily_symbol

A simple way to avoid this error:
$analyzed = foreach jnd generate divsthisyear::symbol, $daily_close - $daily_open;

Because of the join, both columns should contain the same values, I guess.
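Another way around it (a sketch of my own, not from the book; the macro and parameter names are hypothetical, and I am assuming the parser only rejects the '$' directly after '::'): rename the parameterized columns with AS before the join, so the disambiguation operator is never followed by a macro parameter.

DEFINE daily_delta(daily, daily_symbol, daily_open, daily_close)
RETURNS analyzed {
    -- give the parameterized columns fixed names up front
    d = FOREACH $daily GENERATE $daily_symbol AS sym, $daily_open AS open_p, $daily_close AS close_p;
    divs = LOAD 'NYSE_dividends' AS (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
    jnd = JOIN divs BY symbol, d BY sym;
    -- no '$' needed after '::' anymore
    $analyzed = FOREACH jnd GENERATE d::sym, d::close_p - d::open_p;
};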

First results with Pig and HBase

Results of the day:
Obama

The number of occurrences of the word ‘Obama’, matched with .*obama.* (case insensitive), computed with Pig on my pseudo-distributed cluster. The data used is a collection of forum posts written between 21.01.2007 and 22.09.2012, stored in an HBase table.

Hadoop version: 2.0.0-cdh4.1.2
Pig version: 0.10.0-cdh4.1.2
Features: GROUP_BY, ORDER_BY, FILTER
Processor: Intel Core 2 Duo T7500 2.20GHz (64-bit)
Memory: 4GB
#rows: 1,731,121
#column families: 1
#columns: 15
#relevant columns: 2
Time: 21m49s

Interesting: the peaks around his election on 04.11.2008, the presidential inauguration on 20.01.2009 and the third anniversary of his taking office on 20.01.2012, as well as the slow rise in the number of occurrences toward the end of the data, i.e. toward the 2012 presidential election.
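For reference, here is a sketch of roughly what such a script can look like; the HBase table name, column family and column names are hypothetical (this is not my actual script), and lower-casing the text before the match is just one way to make it case insensitive.

-- obama_count.pig (sketch, hypothetical table and column names)
posts = LOAD 'hbase://forum_posts'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('post:date post:text')
        AS (date:chararray, text:chararray);
obama = FILTER posts BY LOWER(text) MATCHES '.*obama.*';
by_day = GROUP obama BY date;
counts = FOREACH by_day GENERATE group AS day, COUNT(obama) AS n;
ordered = ORDER counts BY day;
STORE ordered INTO 'obama_counts';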

HBaseStorage and Pig

A very useful post to get started with HBaseStorage and Pig: link

If you’re not as successful as the author after typing $ pig -x local hbase_sample.pig in the terminal, you may receive an error message like the following:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. org/apache/hadoop/hbase/filter/Filter

If that is the case, your $PIG_CLASSPATH is probably not set correctly; please refer to my previous post for some tips on how to solve this error.
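An alternative to setting PIG_CLASSPATH that I have seen is to REGISTER the required jars at the top of the Pig script itself. A minimal sketch, assuming a CDH-style layout like the one in my .profile below; the exact jar paths, and whether extra jars (e.g. the ZooKeeper one) are needed, depend on your installation.

-- sketch: make the HBase classes visible to Pig without touching PIG_CLASSPATH
-- (jar locations are assumptions and will differ per installation)
REGISTER /usr/lib/hbase/hbase.jar;
REGISTER /usr/lib/zookeeper/zookeeper.jar;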

.profile

I had some trouble finding the proper lines to append to my ~/.profile to get Pig running, until I posted my problem on stackoverflow.com and somebody helped (thank you!).

So, in case somebody wants an example of something that works (but is certainly far from perfect):

# if running bash
if [ -n "$BASH_VERSION" ]; then
    # include .bashrc if it exists
    if [ -f "$HOME/.bashrc" ]; then
        . "$HOME/.bashrc"
    fi
fi

# set PATH so it includes user's private bin if it exists
if [ -d "$HOME/bin" ] ; then
    PATH="$HOME/bin:$PATH"
fi

# set PATH so it includes the system bin if it exists
if [ -d "/bin" ] ; then
    PATH="/bin:$PATH"
fi

export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HBASE_HOME=/usr/lib/hbase
export HBASE_CONF_DIR=/etc/hbase/conf
export PIG_HOME=/usr/lib/pig
export PIG_CONF_DIR=/etc/pig/conf

export PATH="$HADOOP_HOME/bin:$HBASE_HOME/bin:$HADOOP_MAPRED_HOME/bin:$PIG_HOME/bin:$PATH"
export PIG_CLASSPATH="`${HBASE_HOME}/bin/hbase classpath`:$PIG_CLASSPATH"

Note that the backticks in the definition of PIG_CLASSPATH are actual backticks (shell command substitution), and not single quotes!

Ubuntu and proprietary graphic card drivers

Just don’t.

I guess some of you may smile while reading this post. But as I wrote earlier, I’m new to Ubuntu. I lost the equivalent of 14 days of work by installing the proprietary NVIDIA driver for my graphics card, as my computer wasn’t able to drive two non-mirrored displays without it. Big mistake: my system eventually crashed, and after trying for hours and hours to get it working again, I basically had to do a clean install of Ubuntu. No chance.

So if any of you Ubuntu newbies like me are contemplating installing a proprietary driver for your graphics card: think twice, google it, and think once more.

Anyway, I’m now just using the basic Nouveau driver because I don’t want to risk losing my computer a second time. And two screens are just not that fancy anymore :-P

(Note to self: get back in there and install this driver properly! You can do it!)

/etc/hosts

I have heard/read many times that I should get rid of localhost for my cluster to work with fewer problems, but I’m still trying to understand how to set up my hosts file without it. The last time I tried, I got this error message:

INFO ipc.HBaseRPC: Problem connecting to server: localhost/85.183.###.#:60020

But I have absolutely no idea where this IP address comes from; it’s not my computer’s… or is it? Anyway, my /etc/hosts still looks like this:

127.0.0.1       localhost
127.0.0.1       hadoop hbase

Another source of the problem is that I cannot use a static IP, as I take my laptop around and my IP address changes regularly.