Today we worked with two datasets and learned about advanced Pig Latin features such as macros.
This post is inspired by a post from the Hortonworks blogging community and by the chapter on advanced Pig Latin in Alan Gates's book. BTW: I really liked the spirit behind this book: an open feedback publishing system. I haven't had the opportunity to read many books published this way, where readers can leave comments at the end of any paragraph, and the author can answer and/or revise the paragraph in response.
My little Pig script:
-- macro_tester.pig
IMPORT 'common.macro';
X = LOAD 'NYSE_dividends' AS (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
-- Y = row_count(X);
Y = row_count_by(X, symbol, 4);
DUMP Y;
Note that the IMPORT statement requires single quotes and a semicolon.
The macro file:
-- common.macro
-- row_count
/*
* Given a relation rel, row_count returns
* the COUNT of rel, i.e. the number of rows
* whose first field is not null.
*/
DEFINE row_count(rel)
RETURNS counted {
grouped = GROUP $rel ALL;
$counted = FOREACH grouped GENERATE COUNT($rel);
};
-- row_count_by
/*
* Given a relation rel, a column name col and
* a parallelization parameter par, row_count_by
* returns the COUNT of rel grouped by col.
*/
DEFINE row_count_by(rel, col, par)
RETURNS counted {
grouped = GROUP $rel BY $col PARALLEL $par;
$counted = FOREACH grouped GENERATE group, COUNT($rel);
};
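Both macros call COUNT, which skips tuples whose first field is null. If rows with a null first field should be counted as well, a variant built on COUNT_STAR might look like the sketch below. The macro name row_count_star is hypothetical and is not part of the original common.macro.

```pig
-- row_count_star (hypothetical name, added for illustration only)
/*
* Given a relation rel, row_count_star returns
* the COUNT_STAR of rel, i.e. the number of rows
* including rows whose first field is null.
*/
DEFINE row_count_star(rel)
RETURNS counted {
    -- Put every tuple of rel into a single bag.
    grouped = GROUP $rel ALL;
    -- COUNT_STAR counts all tuples in the bag, nulls included.
    $counted = FOREACH grouped GENERATE COUNT_STAR($rel);
};
```

It would be called from macro_tester.pig the same way as the others, for example Z = row_count_star(X);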
To see what your script looks like right before compilation, you can do a dry run by adding -dryrun or -r to the command line.
Note that alias names inside the macro are renamed to avoid collisions with alias names at the place where the macro is expanded. The expanded file:
X = LOAD 'NYSE_dividends' AS (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
macro_row_count_by_grouped_0 = GROUP X BY (symbol) PARALLEL 4;
Y = FOREACH macro_row_count_by_grouped_0 GENERATE group, COUNT(X);
dump Y
Surprisingly, the expanded file does not include the final semicolon. No idea why.
In the example from Programming Pig, the following line generates an error in Pig 0.10.0-cdh4.1.2:
$analyzed = foreach jnd generate dailythisyear::$daily_symbol, $daily_close - $daily_open;
Error message: Unexpected character '$' at dailythisyear::$daily_symbol
A simple way to avoid this error:
$analyzed = foreach jnd generate divsthisyear::symbol, $daily_close - $daily_open;
Because the join is on the symbol, both columns should hold the same value in every joined row, I guess.
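For context, here is a sketch of how the whole macro might look with the workaround applied. Only the last statement comes from this post. The macro name, its parameter list, and the LOAD, FILTER and JOIN statements are my reconstruction of the book's example and should be treated as assumptions, so check them against the book before relying on them.

```pig
-- Sketch only: everything except the final statement is an assumed
-- reconstruction of the Programming Pig example.
DEFINE dividend_analysis(daily, year, daily_symbol, daily_open, daily_close)
RETURNS analyzed {
    divs          = LOAD 'NYSE_dividends' AS (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
    divsthisyear  = FILTER divs BY date MATCHES '$year-.*';
    dailythisyear = FILTER $daily BY date MATCHES '$year-.*';
    -- Inner join on the symbol: in every output row, divsthisyear::symbol
    -- equals the daily symbol column, so either one can be projected.
    jnd           = JOIN divsthisyear BY symbol, dailythisyear BY $daily_symbol;
    -- Workaround: project the dividends-side column, which needs no
    -- macro parameter after the :: operator.
    $analyzed     = FOREACH jnd GENERATE divsthisyear::symbol, $daily_close - $daily_open;
};
```

The workaround relies on the join being an inner join. With an outer join the two symbol columns could differ (one side may be null), and the substitution would no longer be equivalent.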