Archive for the 'Hadoop' Category


San Diego Hadoop User Group Starting

0

We’re starting a San Diego Hadoop User Group. Please send me an email ( darugar at gmail ) if you’re interested or leave a comment here. Details to be worked out, but we’ll likely will have the first meeting towards the end of February.

Cloud Computing Hadoop Slides

0

Slides from my Data Processing in the Cloud talk:

Cloud Computing: Hadoop

View SlideShare presentation or Upload your own. (tags: pig cloud)

Python Scripts For Dumping Oracle Data And Loading Onto Hadoop DFS

2

There have been several requests for this, so I might as well post it here for general use. I put together a simple system for dumping data out of Oracle databases and loading onto Hadoop DFS. The slightly interesting part is the parallelism – Python’s Processing library is used to dump partitions in parallel and copy and load them onto DFS in parallel. This helps when dumping large amounts of data from partitioned Oracle tables.

The database interaction is handled by db.py . There are a couple of helper functions for finding table partitions, etc. DBDumper dumps the requested fields from the requested table:


dumper = db.DBDumper('username/password@yourhost:9999/DB', 'table_name',
      ('field1', 'field2', 'field3'), 'owner', 'partition', 'output_dir', 10)
dumper.dump(cp)

Where 10 is the level of concurrency, owner is the owner of the table, and partition is the name of the partitions you’re interested in (can be None).

dfs.py copies the dumped files over in parallel, again using PyProcessing. It’s simply a wrapper around “cat | ssh | hadoop dfs -put”.

DBDumper and dfs are tied together via a callback – when each partition is dumped, the callback is invoked, triggering the dfs copy.

Here’s a complete example of using these to dump and copy data:


import db
import dfs

fs = dfs.RemoteDFS('address.of.remote.machine')

def cp(arg):
    print "CALLBACK:", arg
    fs.cp(arg[0], '/some/directory/' + arg[1] + '/' + arg[2])

dumper = db.DBDumper('username/password@yourhost:9999/DB', 'table_name',
     ('field1', 'field2', 'field3'), 'owner', 'partition', 'output_dir', 10)
dumper.dump(cp)

Speaking at DataServices World on Hadoop

3

I’m giving a talk on Data Processing In The Cloud on November 20th 2008 at DataServices World in the Fairmont in San Jose. I hope to see you there. Here’s the abstract:

Hadoop, an open source implementation of map/reduce, has garnered tremendous momentum in large scale data processing, marting, and on occasion warehousing. This session will examine:

  • The current state and industry adoption of Hadoop and cloud-based data processing
  • The programming model, capabilities, common patterns, and best-practices for Hadoop deployment and usage
  • The ecology of value-add technologies and services in the grid computing and data processing world
  • Models for using grid-based data processing alongside traditional technologies and techniques.

I’ll also be participating in UCSD Hackweek coming up in about 2 weeks.

Happy: Hadoop with Python (Jython)

0

The Freebase folks have open sourced their Python (Jython) based Hadoop framework, calling it Happy. Looks interesting, will need to give it a whirl when I get a chance.

Pig (Hadoop) Commands And Sample Results

3

I find seeing the results of Pig commands on sample data a good companion to the PigLatin language reference, so I setup some simple sample data and ran commands, capturing the results.Here’s the sample data as well as the commands:

/data/one:


a	A	1
b	B	2
c	C	3
a	AA	11
a	AAA	111
b	BB	22

/data/two:


x	X	a
y	Y	b
x	XX	b
z	Z	c

Pig commands and their results:


one = load 'data/one' using PigStorage();
two = load 'data/two' using PigStorage();

generated = FOREACH one GENERATE $0, $2;
(a, 1)
(b, 2)
(c, 3)
(a, 11)
(a, 111)
(b, 22)

grouped = GROUP one BY $0;
(a, {(a, A, 1), (a, AA, 11), (a, AAA, 111)})
(b, {(b, B, 2), (b, BB, 22)})
(c, {(c, C, 3)})

grouped2 = GROUP one BY ($0, $1);
((a, A), {(a, A, 1)})
((a, AA), {(a, AA, 11)})
((a, AAA), {(a, AAA, 111)})
((b, B), {(b, B, 2)})
((b, BB), {(b, BB, 22)})
((c, C), {(c, C, 3)})

summed = FOREACH grouped GENERATE group, SUM(one.$2);
(a, 123.0)
(b, 24.0)
(c, 3.0)

counted = FOREACH grouped GENERATE group, COUNT(one);
(a, 3)
(b, 2)
(c, 1)

flat = FOREACH grouped GENERATE FLATTEN(one);
(a, A, 1)
(a, AA, 11)
(a, AAA, 111)
(b, B, 2)
(b, BB, 22)
(c, C, 3)

cogrouped = COGROUP one BY $0, two BY $2;
(a, {(a, A, 1), (a, AA, 11), (a, AAA, 111)}, {(x, X, a)})
(b, {(b, B, 2), (b, BB, 22)}, {(y, Y, b), (x, XX, b)})
(c, {(c, C, 3)}, {(z, Z, c)})

flatc = FOREACH cogrouped GENERATE FLATTEN(one.($0,$2)), FLATTEN(two.$1);
(a, 1, X)
(a, 11, X)
(a, 111, X)
(b, 2, Y)
(b, 22, Y)
(b, 2, XX)
(b, 22, XX)
(c, 3, Z)

joined = JOIN one BY $0, two BY $2;
(a, A, 1, x, X, a)
(a, AA, 11, x, X, a)
(a, AAA, 111, x, X, a)
(b, B, 2, y, Y, b)
(b, BB, 22, y, Y, b)
(b, B, 2, x, XX, b)
(b, BB, 22, x, XX, b)
(c, C, 3, z, Z, c)

crossed = CROSS one, two;
(a, AA, 11, z, Z, c)
(a, AA, 11, x, XX, b)
(a, AA, 11, y, Y, b)
(a, AA, 11, x, X, a)
(c, C, 3, z, Z, c)
(c, C, 3, x, XX, b)
(c, C, 3, y, Y, b)
(c, C, 3, x, X, a)
(b, BB, 22, z, Z, c)
(b, BB, 22, x, XX, b)
(b, BB, 22, y, Y, b)
(b, BB, 22, x, X, a)
(a, AAA, 111, x, XX, b)
(b, B, 2, x, XX, b)
(a, AAA, 111, z, Z, c)
(b, B, 2, z, Z, c)
(a, AAA, 111, y, Y, b)
(b, B, 2, y, Y, b)
(b, B, 2, x, X, a)
(a, AAA, 111, x, X, a)
(a, A, 1, z, Z, c)
(a, A, 1, x, XX, b)
(a, A, 1, y, Y, b)
(a, A, 1, x, X, a)

SPLIT one INTO one_under IF $2 < 10, one_over IF $2 >= 10;
-- one_under:
(a, A, 1)
(b, B, 2)
(c, C, 3)

Hadoop Is the Linux of Data Processing

0

Mentioned in passing by Roberto today. Sounds about right to me.