Cloud Computing Hadoop Slides
0Slides from my Data Processing in the Cloud talk:
Slides from my Data Processing in the Cloud talk:
There have been several requests for this, so I might as well post it here for general use. I put together a simple system for dumping data out of Oracle databases and loading onto Hadoop DFS. The slightly interesting part is the parallelism - Python’s Processing library is used to dump partitions in parallel and copy and load them onto DFS in parallel. This helps when dumping large amounts of data from partitioned Oracle tables.
The database interaction is handled by db.py . There are a couple of helper functions for finding table partitions, etc. DBDumper dumps the requested fields from the requested table:
dumper = db.DBDumper('username/password@yourhost:9999/DB', 'table_name',
('field1', 'field2', 'field3'), 'owner', 'partition', 'output_dir', 10)
dumper.dump(cp)
Where 10 is the level of concurrency, owner is the owner of the table, and partition is the name of the partitions you’re interested in (can be None).
dfs.py copies the dumped files over in parallel, again using PyProcessing. It’s simply a wrapper around “cat | ssh | hadoop dfs -put”.
DBDumper and dfs are tied together via a callback - when each partition is dumped, the callback is invoked, triggering the dfs copy.
Here’s a complete example of using these to dump and copy data:
import db
import dfs
fs = dfs.RemoteDFS('address.of.remote.machine')
def cp(arg):
print "CALLBACK:", arg
fs.cp(arg[0], ‘/some/directory/’ + arg[1] + ‘/’ + arg[2])
dumper = db.DBDumper(’username/password@yourhost:9999/DB’, ‘table_name’,
(’field1′, ‘field2′, ‘field3′), ‘owner’, ‘partition’, ‘output_dir’, 10)
dumper.dump(cp)
I’m giving a talk on Data Processing In The Cloud on November 20th 2008 at DataServices World in the Fairmont in San Jose. I hope to see you there. Here’s the abstract:
Hadoop, an open source implementation of map/reduce, has garnered tremendous momentum in large scale data processing, marting, and on occasion warehousing. This session will examine:
I’ll also be participating in UCSD Hackweek coming up in about 2 weeks.
I find seeing the results of Pig commands on sample data a good companion to the PigLatin language reference, so I setup some simple sample data and ran commands, capturing the results.Here’s the sample data as well as the commands:
/data/one:
a A 1
b B 2
c C 3
a AA 11
a AAA 111
b BB 22
/data/two:
x X a
y Y b
x XX b
z Z c
Pig commands and their results:
one = load 'data/one' using PigStorage();
two = load 'data/two' using PigStorage();
generated = FOREACH one GENERATE $0, $2;
(a, 1)
(b, 2)
(c, 3)
(a, 11)
(a, 111)
(b, 22)
grouped = GROUP one BY $0;
(a, {(a, A, 1), (a, AA, 11), (a, AAA, 111)})
(b, {(b, B, 2), (b, BB, 22)})
(c, {(c, C, 3)})
grouped2 = GROUP one BY ($0, $1);
((a, A), {(a, A, 1)})
((a, AA), {(a, AA, 11)})
((a, AAA), {(a, AAA, 111)})
((b, B), {(b, B, 2)})
((b, BB), {(b, BB, 22)})
((c, C), {(c, C, 3)})
summed = FOREACH grouped GENERATE group, SUM(one.$2);
(a, 123.0)
(b, 24.0)
(c, 3.0)
counted = FOREACH grouped GENERATE group, COUNT(one);
(a, 3)
(b, 2)
(c, 1)
flat = FOREACH grouped GENERATE FLATTEN(one);
(a, A, 1)
(a, AA, 11)
(a, AAA, 111)
(b, B, 2)
(b, BB, 22)
(c, C, 3)
cogrouped = COGROUP one BY $0, two BY $2;
(a, {(a, A, 1), (a, AA, 11), (a, AAA, 111)}, {(x, X, a)})
(b, {(b, B, 2), (b, BB, 22)}, {(y, Y, b), (x, XX, b)})
(c, {(c, C, 3)}, {(z, Z, c)})
flatc = FOREACH cogrouped GENERATE FLATTEN(one.($0,$2)), FLATTEN(two.$1);
(a, 1, X)
(a, 11, X)
(a, 111, X)
(b, 2, Y)
(b, 22, Y)
(b, 2, XX)
(b, 22, XX)
(c, 3, Z)
joined = JOIN one BY $0, two BY $2;
(a, A, 1, x, X, a)
(a, AA, 11, x, X, a)
(a, AAA, 111, x, X, a)
(b, B, 2, y, Y, b)
(b, BB, 22, y, Y, b)
(b, B, 2, x, XX, b)
(b, BB, 22, x, XX, b)
(c, C, 3, z, Z, c)
crossed = CROSS one, two;
(a, AA, 11, z, Z, c)
(a, AA, 11, x, XX, b)
(a, AA, 11, y, Y, b)
(a, AA, 11, x, X, a)
(c, C, 3, z, Z, c)
(c, C, 3, x, XX, b)
(c, C, 3, y, Y, b)
(c, C, 3, x, X, a)
(b, BB, 22, z, Z, c)
(b, BB, 22, x, XX, b)
(b, BB, 22, y, Y, b)
(b, BB, 22, x, X, a)
(a, AAA, 111, x, XX, b)
(b, B, 2, x, XX, b)
(a, AAA, 111, z, Z, c)
(b, B, 2, z, Z, c)
(a, AAA, 111, y, Y, b)
(b, B, 2, y, Y, b)
(b, B, 2, x, X, a)
(a, AAA, 111, x, X, a)
(a, A, 1, z, Z, c)
(a, A, 1, x, XX, b)
(a, A, 1, y, Y, b)
(a, A, 1, x, X, a)
SPLIT one INTO one_under IF $2 < 10, one_over IF $2 >= 10;
– one_under:
(a, A, 1)
(b, B, 2)
(c, C, 3)
Mentioned in passing by Roberto today. Sounds about right to me.