Pig Scripts Flashcards
How to filter
result = FILTER dataset BY expression;
Generate the min/max value of a column
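A minimal sketch of a filter, assuming a hypothetical comma-separated file 'grades.csv' with the fields id, exam and project (the file name, the exam field and the threshold of 70 are illustrative and not from the original cards):

```pig
-- Hypothetical input: grades.csv with columns id, exam, project
grades = LOAD 'grades.csv' USING PigStorage(',') AS (id:int, exam:int, project:int);

-- Keep only the rows whose exam score is at least 70 (threshold chosen for illustration)
high_exam_students = FILTER grades BY exam >= 70;

-- Print the filtered relation to the console
DUMP high_exam_students;
```

Conditions can be combined with AND, OR and NOT inside the same BY expression.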
FOREACH (GROUP dataset ALL) GENERATE MAX(dataset.column) AS var_name;
Calculate the difference between a max and a min value
max_project = FOREACH (GROUP grades ALL) GENERATE MAX(grades.project) AS max_project;
-- Calculate the worst 'project' value
min_project_high_exam = FOREACH (GROUP high_exam_students ALL) GENERATE MIN(high_exam_students.project) AS min_project_high_exam;
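-- Calculate the difference (min_project_high_exam has a single row, so its field can be read as a scalar)
difference = FOREACH max_project GENERATE max_project - min_project_high_exam.min_project_high_exam AS project_difference;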
Storing results for further processing
STORE var_name INTO 'table' USING PigStorage(',');
Load dataset with schema
X = LOAD 'table' USING PigStorage(',') AS (ID:int, ca1:int);
Key functions {average, concatenate, count, difference, max/min, sum}
AVG
CONCAT
COUNT
DIFF
MAX/MIN
SUM
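As a rough illustration of how these functions are called, the sketch below assumes the same hypothetical 'grades.csv' file with fields id, exam and project (names not from the original cards). The aggregates operate on a bag, so the relation is grouped first. DIFF, which compares two bags, is left out to keep the example short.

```pig
-- Hypothetical input: grades.csv with columns id, exam, project
grades = LOAD 'grades.csv' USING PigStorage(',') AS (id:int, exam:int, project:int);

-- Collect every row into one bag so the aggregate functions can run over it
all_grades = GROUP grades ALL;

stats = FOREACH all_grades GENERATE
    COUNT(grades)       AS n_rows,
    AVG(grades.project) AS avg_project,
    SUM(grades.project) AS total_project,
    MIN(grades.project) AS min_project,
    MAX(grades.project) AS max_project;

-- CONCAT joins chararray values, so the int id is cast first
labels = FOREACH grades GENERATE CONCAT('student_', (chararray)id) AS label;

DUMP stats;
DUMP labels;
```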
For Loop
FOREACH X GENERATE expression; (each expression can be a field, a tuple, or a bag)
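A small sketch of FOREACH applied row by row, again assuming the hypothetical 'grades.csv' file and field names used only for illustration:

```pig
-- Hypothetical input: grades.csv with columns id, exam, project
grades = LOAD 'grades.csv' USING PigStorage(',') AS (id:int, exam:int, project:int);

-- For every row, keep the id and compute a new field from two existing ones
totals = FOREACH grades GENERATE id, exam + project AS total;

DUMP totals;
```

FOREACH visits each row of the relation once, so it plays the role of a loop body that projects or transforms fields.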
How to group
GROUP dataset BY column;
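A grouped relation is usually followed by a FOREACH that aggregates each group. The sketch below assumes the hypothetical 'grades.csv' file with fields id, exam and project (illustrative names, not from the original cards):

```pig
-- Hypothetical input: grades.csv with columns id, exam, project
grades = LOAD 'grades.csv' USING PigStorage(',') AS (id:int, exam:int, project:int);

-- One output row per distinct exam score; each row holds the key ("group")
-- and a bag named after the input relation ("grades")
by_exam = GROUP grades BY exam;

-- Aggregate inside each group
per_exam = FOREACH by_exam GENERATE
    group               AS exam,
    COUNT(grades)       AS n_students,
    AVG(grades.project) AS avg_project;

DUMP per_exam;
```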
When to define the schema
When loading the dataset
How to sort
ORDER dataset BY column [ASC|DESC];
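A short sketch of sorting, assuming the same hypothetical 'grades.csv' file and field names; the LIMIT step is an optional extra that is not part of the original card:

```pig
-- Hypothetical input: grades.csv with columns id, exam, project
grades = LOAD 'grades.csv' USING PigStorage(',') AS (id:int, exam:int, project:int);

-- Sort by project score, highest first (ASC is the default when no direction is given)
sorted = ORDER grades BY project DESC;

-- Optionally keep only the first five rows of the sorted relation
top5 = LIMIT sorted 5;

DUMP top5;
```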