Archive for the 'Uncategorized' Category


Map/Reduce (Hadoop) First Impressions

I’m finally getting a chance to actually implement map/reduce instead of just reading or writing about it. General impressions so far:

  • Hadoop is fairly easy to install and get running.
  • The choice of Java as the default programming language feels strange to me. It’d feel more natural in Perl, Python, or Ruby, since most of what you do is read and massage records. (I’m actually using Python with Hadoop Streaming.)
  • The map/reduce paradigm is very nice, but doesn’t fit everything. In fact, so far it hasn’t been a perfect fit for anything I’ve tried. It works, but it always feels like you’re shoe-horning the problem into a map/reduce model. I’m wondering how well it’d work to drop the fixed map/reduce model and make Hadoop a general work distribution mechanism, with map and reduce as easy add-ons. So if I only need a map, or only a reduce, or just a sort, I can do only that. Or, if my map actually produces 2 different sets of output for processing by 2 different sets of reducers, there should be an easy way to do that too.
  • Pig is promising; I haven’t actually used it hands-on yet. A higher level language seems like the right way to go.
  • Hadoop does scale as advertised, at least to the number of boxes I’ve tried so far (30). It’s great to see it crunch through something that used to take 30 minutes in 1 1/2 minutes. I’ll be trying larger clusters soon.
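
For anyone who hasn’t seen the Streaming style mentioned above: a map step reads input lines and emits tab-separated key/value lines, Hadoop sorts them by key, and a reduce step folds each run of equal keys. Here’s a rough sketch using a generic word count. It is not the job described in this post; the function names and the local “simulate the sort with sorted()” trick are assumptions for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Map step: emit one 'word<TAB>1' line for every word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Reduce step: sum the counts for each key.

    Assumes the input is already sorted by key, which is what
    Hadoop does between the map and reduce phases.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Local simulation only: sorted() stands in for Hadoop's shuffle/sort.
    sample = ["a b a", "b c"]  # made-up input
    for line in reducer(sorted(mapper(sample))):
        print(line)
```

In a real Streaming job each function would presumably live in its own script that reads sys.stdin and prints to stdout, with the two scripts handed to Hadoop as the mapper and the reducer.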

New Look, New Day

The blog went bonkers and couldn’t be easily fixed, so we have a new WordPress install and a new theme. Lots of things are probably broken; let me know what and I’ll fix ’em.

Gmail Contact List API?

UPDATE: Google has released an official contact list API for Gmail. This should supersede the various libraries out there. Really folks, I mean it. Instead of asking for the library, please check out the Google Contacts APIs; they’re the right way to get at contacts.

I’m looking for an API to extract contact lists from Gmail accounts. I’ve tried both libgmail and gmail.py, and neither works for me: they return “HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.” and “LoginFailure: Wrong username or password.” respectively.

Ian Murdock’s post on the topic brings up the very intriguing possibility of using XMPP to access the contacts, but doesn’t get into details. I might give that a try. If you’ve tried it or have a code sample for how to get at the contacts please leave a comment with details.

There’s also OpenSocial’s People Data API, but I don’t immediately see the path to getting the contact list, and it doesn’t look like Gmail supports the API yet (I’m not particularly interested in the Orkut friends list). Perhaps there’s a way in there somewhere.

So if you have a way of getting at the contact list from Gmail (or Hotmail, MSN, or Yahoo for that matter), leave a comment and clue me in.

awk Example

This example captures most of the common things I end up doing with awk, so I’m noting it here for future reference:

awk -F, '/00:00:/ { print $3-past, $1, $2, $3; past=$3 }' state9930.rate | sort -n -r > daily.sorted.rate

Which is saying:

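  • The field separator is “,” (i.e. the input file fields are separated by commas).
  • For each line that matches “00:00:”
    • “$3-past” subtracts the “past” value from the current third field. On the first matching line “past” is still unset, which awk treats as 0.
    • Print the difference, followed by the first three fields.
    • After printing, “past” is set to the current third field.
  • The output is piped through “sort -n -r”, which sorts numerically on the leading difference, largest first, and the result is written to daily.sorted.rate.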
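
If awk isn’t your thing, roughly the same logic can be sketched in Python. This is an approximation under stated assumptions: the function name and the sample rows are made up (the real layout of state9930.rate isn’t shown here), float() is stricter than awk’s lenient string-to-number conversion, and ties may come out in a different order than sort would produce.

```python
def daily_deltas(lines):
    """Approximate the awk one-liner plus `sort -n -r`.

    For each line containing '00:00:', record
    (field3 - previous field3, field1, field2, field3),
    then return the records sorted by the difference, largest first.
    """
    past = 0.0  # awk treats an unset variable as 0 in arithmetic
    rows = []
    for line in lines:
        if "00:00:" not in line:
            continue
        fields = line.rstrip("\n").split(",")
        value = float(fields[2])
        rows.append((value - past, fields[0], fields[1], fields[2]))
        past = value
    return sorted(rows, key=lambda row: row[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical sample data: date, time, cumulative counter.
    sample = [
        "2008-01-01,00:00:01,100",
        "2008-01-01,12:30:00,150",
        "2008-01-02,00:00:02,180",
        "2008-01-03,00:00:03,210",
    ]
    for row in daily_deltas(sample):
        print(*row)
```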

Why No Picasa Plugin API?

[Update] Picasa eventually gained APIs, see the comments for pointers to the location and details.

It’s a little odd that in this day and age of Web 2.0 and plug-everything-into-everything-else, Picasa doesn’t have a plugin API. What’s the story with that?

Inspired by Matt Croydon’s Backing up Flickr Photos with Amazon S3 post, I figured I’d automate the whole process: upload all new photos from Picasa to S3 for backup, and send the ones with a special tag (or keyword, as Picasa calls them) to Flickr. The Flickr part is incredibly simple, as Matt shows. The S3 part is relatively simple. Picasa? No way to connect to it. Shame, shame. Reason enough to dump Picasa, perhaps. Any good alternatives out there?

How Yahoo and Google Make Money

How do Yahoo and Google make money? This is a frequently asked question, so let me give a high level overview.

The short answer is targeted advertising. Why and how does it work?

What do you use Google for? Search. Let’s look at search. Say the user searches for “mountain bike San Diego”. Chances are he’s looking for a place to go mountain biking. Or, perhaps he’s looking to buy a mountain bike. Google will go and find the most relevant Web pages on that topic.

Now imagine you own a mountain bike store. If somebody told you they would let you show your advertisement to this specific user exactly at the moment he’s expressed interest in mountain bikes, while it’s foremost in his mind, would you be interested? Sure you would. That’s what Yahoo and Google do: not only do they find the most relevant Web pages, they also find the most relevant ads and show them to the user in the form of Sponsored Links.

The sponsored links are actually relevant, so some percentage of the people who see them click on them. Note that this is very different from banner ads, which are generic ads targeted at a demographic (or sometimes not targeted at all).

Each time a user clicks on your ad, Google or Yahoo has effectively sent you a lead, someone who’s likely to buy something from your mountain bike store.

This lead is valuable to you and you’re willing to pay for it. But how much?

Turns out you’re not the only mountain bike store in San Diego. I own a store too, and I want that same lead. I’m willing to pay for it too.

So how do we determine the price? Very approximately speaking, by bidding on it. It’s sort of an auction.

I want to advertise my mountain bikes. I go to Yahoo or Google and tell them: every time somebody searches on “mountain bike”, show my ad. I do this by specifying a bunch of terms related to mountain bikes, and I provide the text for my ad, and a link to my Web page. Something like:

keyword: mountain bike, offroad bike, offroad bicycle

advertisement: buy my wonderful mountain bikes, they’re the best

url: www.mywonderfulmountainbikes.com

You own a mountain bike store too, so you do a similar thing.
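
To make the keyword idea concrete, here’s a toy sketch of an ad record and the matching step. It is only an illustration of the concept: the Ad class, the matching_ads function, and the plain substring test are all assumptions of mine, and real ad systems match queries in far more sophisticated ways.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Ad:
    keywords: List[str]  # phrases the advertiser wants to match
    text: str            # the advertisement copy
    url: str             # where a click leads


def matching_ads(query, ads):
    """Return the ads that have at least one keyword phrase in the query."""
    q = query.lower()
    return [ad for ad in ads if any(k.lower() in q for k in ad.keywords)]


if __name__ == "__main__":
    mine = Ad(
        keywords=["mountain bike", "offroad bike", "offroad bicycle"],
        text="buy my wonderful mountain bikes, they're the best",
        url="www.mywonderfulmountainbikes.com",
    )
    for ad in matching_ads("mountain bike San Diego", [mine]):
        print(ad.text, "->", ad.url)
```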

Along with my advertisement, I specify how much I’m willing to pay for each lead. What is a lead? It’s defined as a click on my ad. So my bid says I’m willing to pay X each time someone clicks on my ad.

Note that I pay for clicks, not for my ad being shown (aka impressions). It’s a good deal – I only pay if the user is interested enough in what I offer to click on it.

I specify a bid. You specify a bid too. Approximately speaking, the higher the bid, the more prominently/more frequently the ad is shown. Other factors go into picking the actual ordering of the ads, but bids play a big part.
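
Here’s a deliberately oversimplified sketch of the bidding and pay-per-click idea. The assumptions are mine: ads are ordered purely by bid (ignoring the other factors mentioned above), each click is charged at the advertiser’s full bid, amounts are in cents, and the function and store names are invented. Real systems use more involved ranking and pricing rules.

```python
def rank_by_bid(bids):
    """Order advertisers by bid per click, highest bid first.

    bids maps advertiser name -> bid in cents.
    """
    return sorted(bids, key=bids.get, reverse=True)


def charges(bids, clicks):
    """Pay-per-click billing: an advertiser owes bid * clicks.

    Impressions cost nothing, so an ad that is shown but never
    clicked results in a charge of 0.
    """
    return {name: bids[name] * clicks.get(name, 0) for name in bids}


if __name__ == "__main__":
    bids = {"my_store": 50, "your_store": 75}  # hypothetical bids, in cents
    clicks = {"your_store": 3}                 # my ad was shown, never clicked
    print(rank_by_bid(bids))
    print(charges(bids, clicks))
```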

So there you have it. Lots of advertisers bidding on many millions of keywords, and hundreds of millions of users doing billions of searches, a small charge each time somebody clicks on an ad, and you get a big business.

There are also contextual ads, the ads you see on blogs and other random web pages. Same idea, except instead of using the user’s search term to select the ads, the contents of the page you’re looking at are used. So if you’re looking at a page about mountain bikes, you’d see ads related to mountain bikes.

There’s more to it than this, but to a first-order approximation, this is the core of the business.

I Blog

I have nothing interesting to say. Yet the urge to share my inner-most superficial thoughts with you, dear stranger, is so strong, I must start blogging. All the cool kids are doing it. And so I blog.
