Archive for October, 2008

93% Marginal Tax Rate Under Obama

Let me start by saying: I’m voting Obama.

However, this analysis by Greg Mankiw, a Harvard economist who wrote “the book” on macroeconomics, is bringing tears to my eyes:

If there were no taxes, … then $1 earned today would yield my kids $28. That is simply the miracle of compounding.

Under the McCain plan, … a dollar earned today yields my kids $4.81. That is, even under the low-tax McCain plan, my incentive to work is cut by 83 percent compared to the situation without taxes.

Under the Obama plan, … a dollar earned today yields my kids $1.85. That is, Obama’s proposed tax hikes reduce my incentive to work by 62 percent compared to the McCain plan and by 93 percent compared to the no-tax scenario. In a sense, putting the various pieces of the tax system together, I would be facing a marginal tax rate of 93 percent.

The bottom line: If you are one of those people out there trying to induce me to do some work for you, there is a good chance I will turn you down. And the likelihood will go up after President Obama puts his tax plan in place. I expect to spend more time playing with my kids. They will be poorer when they grow up, but perhaps they will have a few more happy memories.
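
The percentages in the quote follow directly from the three dollar figures. A quick check, using only the numbers quoted above (the helper name `incentive_cut` is mine, not Mankiw's):

```python
# What $1 earned today yields the kids under each scenario (figures quoted above).
no_tax = 28.00
mccain = 4.81
obama = 1.85

def incentive_cut(after, before):
    """Fractional reduction when the payoff drops from `before` to `after`."""
    return 1 - after / before

print(round(incentive_cut(mccain, no_tax) * 100))  # 83: McCain vs. no taxes
print(round(incentive_cut(obama, mccain) * 100))   # 62: Obama vs. McCain
print(round(incentive_cut(obama, no_tax) * 100))   # 93: Obama vs. no taxes
```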

Python Dot Notation Dictionary Access

In most cases I prefer dot notation over bracket notation for dictionary access. That is, I prefer mydictionary.myfield over mydictionary['myfield']. I also prefer attempted access to undefined keys to return None instead of raising an exception.

With the help of this thread, this is what I’ve been using:



class dotdict(dict):
    def __getattr__(self, attr):
        return self.get(attr, None)
    __setattr__= dict.__setitem__
    __delattr__= dict.__delitem__

>>> dd = dotdict()
>>> dd.a
>>> dd.a = 'one'
>>> dd.a
'one'
>>> dd.keys()
['a']

>>> existing = {'a':'A', 'b':'B'}
>>> dot_existing = dotdict(existing)
>>> dot_existing.a
'A'
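
Two behaviors of this recipe are worth knowing before relying on it. The sketch below repeats the class so it runs on its own (written for Python 3; the key names are made up for illustration):

```python
class dotdict(dict):
    """Same class as above, repeated so this snippet is self-contained."""
    def __getattr__(self, attr):
        return self.get(attr, None)
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__

dd = dotdict(keys='K', a='A')

# __getattr__ only runs when normal attribute lookup fails, so names that
# dict already defines (keys, items, get, ...) still resolve to the methods.
print(callable(dd.keys))   # True: the dict method, not the stored value
print(dd['keys'])          # K: bracket access still reaches the value

# Missing keys read as None, but deleting one goes through
# dict.__delitem__ and therefore raises KeyError.
print(dd.missing)          # None
try:
    del dd.missing
except KeyError:
    print('KeyError')
```

So dot access is convenient for keys you control, but bracket access is still needed for any key that collides with a dict method name.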

Python Scripts For Dumping Oracle Data And Loading Onto Hadoop DFS

There have been several requests for this, so I might as well post it here for general use. I put together a simple system for dumping data out of Oracle databases and loading onto Hadoop DFS. The slightly interesting part is the parallelism – Python’s Processing library is used to dump partitions in parallel and copy and load them onto DFS in parallel. This helps when dumping large amounts of data from partitioned Oracle tables.

The database interaction is handled by db.py. There are a couple of helper functions for finding table partitions, etc. DBDumper dumps the requested fields from the requested table:


dumper = db.DBDumper('username/password@yourhost:9999/DB', 'table_name',
      ('field1', 'field2', 'field3'), 'owner', 'partition', 'output_dir', 10)
dumper.dump(cp)

Where 10 is the level of concurrency, owner is the owner of the table, and partition is the name of the partitions you’re interested in (can be None).

dfs.py copies the dumped files over in parallel, again using PyProcessing. It’s simply a wrapper around “cat | ssh | hadoop dfs -put”.
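
For a rough idea of what such a wrapper builds, here is a minimal sketch. The function name and paths are hypothetical and not taken from dfs.py, and it assumes the installed hadoop accepts "-" as the source for -put (meaning "read from stdin"):

```python
import shlex

def build_put_command(local_path, host, remote_path):
    """Build a `cat | ssh | hadoop dfs -put` pipeline as a shell string.

    Hypothetical helper, not taken from dfs.py. The "-" after -put is
    assumed to make hadoop read the file contents from stdin.
    """
    remote = 'hadoop dfs -put - ' + shlex.quote(remote_path)
    return 'cat {} | ssh {} {}'.format(
        shlex.quote(local_path), shlex.quote(host), shlex.quote(remote))

print(build_put_command('output_dir/part1.dat', 'address.of.remote.machine',
                        '/some/directory/part1.dat'))
```

The string can then be handed to a shell, one per dumped file, which is what makes it easy to run several copies side by side.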

DBDumper and dfs are tied together via a callback – when each partition is dumped, the callback is invoked, triggering the dfs copy.

Here’s a complete example of using these to dump and copy data:


import db
import dfs

fs = dfs.RemoteDFS('address.of.remote.machine')

def cp(arg):
    print "CALLBACK:", arg
    fs.cp(arg[0], '/some/directory/' + arg[1] + '/' + arg[2])

dumper = db.DBDumper('username/password@yourhost:9999/DB', 'table_name',
     ('field1', 'field2', 'field3'), 'owner', 'partition', 'output_dir', 10)
dumper.dump(cp)
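
The dump-then-callback pattern itself can be tried without Oracle or Hadoop. The sketch below is not the code from db.py: it uses the standard library's multiprocessing.pool.ThreadPool (threads rather than processes, to keep the example self-contained), and the partition names and the fake dump function are made up:

```python
from multiprocessing.pool import ThreadPool

def dump_partition(partition):
    """Stand-in for dumping one partition; returns (local_file, partition)."""
    return ('output_dir/%s.dat' % partition, partition)

copied = []

def on_dumped(result):
    """Callback run as each dump finishes; a real one would start the DFS copy."""
    copied.append(result)

partitions = ['p2008_08', 'p2008_09', 'p2008_10']  # hypothetical names

pool = ThreadPool(2)  # 2 plays the role of the concurrency level
for p in partitions:
    pool.apply_async(dump_partition, (p,), callback=on_dumped)
pool.close()
pool.join()  # returns once every dump has run

print(sorted(copied))
```

Swapping ThreadPool for multiprocessing.Pool gives separate processes with the same apply_async call, which is closer to what the Processing library did.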

GMail Contact Groups: Add By Tag / Search

I’m trying to set up small ad-hoc mailing lists using GMail. They have a contact groups feature that mostly does what I need. However, the interface for adding contacts is quite limited – you pick from a list.

It’d make a lot more sense to allow addition of contacts to groups based on searching – e.g. look for everyone labeled “tester” and add them to the “tester” contact group en masse, or search for any message containing the term “paintball” and add all the senders/recipients to the “paintballers” group.

So there you go GMail group, please get to work.

Yahoo Live Surprisingly Good

I’m using Yahoo Live to get a look at the Yahoo UCSD Hack Day event, and I’m quite surprised at how well it works.

Seeing someone’s face via their webcam is an absolute waste of time, particularly on video conference calls, but here I’m looking at a view of the room in the larger window, and at several of the hackers in smaller windows. This is actually a pretty workable virtual office. I can see Paul eating pizza, Rasmus helping people, and the group I was sitting with mostly still there. I half feel like driving back over and re-joining the group.

Why is this useful? Well, for one, I can see people working, and it motivates me to stay up and work. Compare this to having no communication, just email, or IRC, and you’ll see it provides a much more tactile, real-world feel. I can see and hear the activity.

I could actually imagine a technology like this being useful for a virtual office. Kind of like hanging out on IRC/IM, but with a more human feel.

RESTful URL Design For Search And Collections

I’m trying to find the appropriate RESTful URL design for search and for collections of items.

The setup: we have two models, Cars and Garages, where Cars can be in Garages. Base URLs:


/car/xxx           (xxx = car id)
/garage/yyy     (yyy = garage id)

Now we want to provide a search for cars – e.g. show me all the blue sedans with 4 doors. What’s the appropriate URL?


1  - /cars/color/blue/type/sedan/doors/4
2  - /cars/color:blue/type:sedan/doors:4
3  - /cars/?color=blue&type=sedan&doors=4
4  - /car/search/...

None of these are satisfying.

1 through 3 use “cars” as the base (as opposed to “car”). So the pattern for doing searches / collections would be to pluralize the model. This seems ok.

1 has arbitrary ordering of the fields and no good way to distinguish fields versus their values. 2 is slightly better, but still doesn’t seem right.

3 uses the QUERYSTRING for the parameters instead of the PATHINFO, and frankly looks better to me, but I’ve heard of objections to using QUERYSTRING. The problem I have with it is it’s not consistent – if I was searching on a single field my URL would probably be: /cars/color/red or something like that. Having the URL drastically change form just because there are more search parameters seems wrong.

4 uses the “car” base url along with the verb “search”. That seems wrong – verbs shouldn’t be part of the URL, right? It’s been suggested several times though.
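
To make the ordering complaint about option 1 concrete, here is a small sketch (function names are mine; Python 3 standard library only) that reads the same search out of option 1 and option 3 URLs:

```python
from urllib.parse import parse_qsl, urlsplit

def filters_from_path(url):
    """Option 1: read /cars/field/value/... as alternating field, value."""
    parts = [p for p in urlsplit(url).path.split('/') if p][1:]  # drop "cars"
    return dict(zip(parts[0::2], parts[1::2]))

def filters_from_query(url):
    """Option 3: read /cars/?field=value&... from the query string."""
    return dict(parse_qsl(urlsplit(url).query))

a = filters_from_path('/cars/color/blue/type/sedan/doors/4')
b = filters_from_path('/cars/doors/4/color/blue/type/sedan')
c = filters_from_query('/cars/?color=blue&type=sedan&doors=4')
print(a == b == c)  # True: three URLs, one search
```

With option 1 the same search has one distinct path per field ordering, and the parser has to trust that segments really do alternate between field and value. The query string form carries the field/value pairing explicitly.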

Now a slightly different case – let’s find all the cars in a given garage:


1  - /garage/yyy/cars
2  - /cars/?garage=yyy

1 seems pretty good in this case.
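
One way to look at the two garage URLs is that the nested form is just a nicer spelling of a filtered collection. A rough sketch of a resolver that treats them as equivalent (the function name is hypothetical, and this is not tied to any particular web framework):

```python
import re
from urllib.parse import parse_qsl, urlsplit

def car_filters(url):
    """Return the car filters implied by either garage-style URL."""
    parts = urlsplit(url)
    nested = re.fullmatch(r'/garage/([^/]+)/cars/?', parts.path)
    if nested:
        return {'garage': nested.group(1)}
    if parts.path.rstrip('/') == '/cars':
        return dict(parse_qsl(parts.query))
    raise ValueError('unrecognized URL: ' + url)

print(car_filters('/garage/yyy/cars') == car_filters('/cars/?garage=yyy'))  # True
```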

Please chime in with your thoughts, either in the comments here or in the Stackoverflow thread.

Python Script For Finding And Removing Duplicate Files

My image, mp3, and ebook collections were a mess after years of copying to various servers, consolidating, and re-copying. I had lots of duplicates.

I looked for an app to find and remove duplicates but surprisingly didn’t find anything very good. So I had to write my own.

This is a very simple script – it scans the directory tree you specify, looks for exact duplicates, and removes the duplicates.

It’s not very smart about which copy it removes. It’s not smart about finding files that are “similar” – it only finds exact matches. It ignores small files (intentionally – it’s easy to make it deal with small files).

It uses /temp for its output and cache files, so it’s targeting Windows. Change that to /tmp if you’re running Unix.

I built in a caching mechanism to save the results of scanning the disk, but it turned out not to be too useful and the script ran faster than I expected, so the caching is commented out.

Here it is: FileInfo.py.
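
For readers who only want the gist, the approach described above can be sketched in a few lines. This is not FileInfo.py: the size cutoff, the choice of SHA-256, and the function names are all assumptions, it is written for Python 3, and it only reports duplicates rather than removing them:

```python
import hashlib
import os
from collections import defaultdict

MIN_SIZE = 1024  # assumed cutoff for "small" files, in bytes

def file_digest(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks so large files don't fill memory."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root, min_size=MIN_SIZE):
    """Return lists of paths under `root` whose contents are identical."""
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and os.path.getsize(path) >= min_size:
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have an exact duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Grouping by size first means most files never need to be read at all, which may be part of why a scan like this runs faster than expected.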

Freemium Works If You Do It Right

I am an incredibly cheap bastard, er, frugal. I don’t like to pay for things.

Yet I just paid for Flickr without hesitation or a moment of doubt. Odd.

Here’s why, I think:

  • Cheap: $25/year is not a lot of money.
  • Excellent service. I’ve never had a problem with the service; they’re always available and fast.
  • Nice, easy to use product.
  • Trust. I trust that Flickr won’t do anything bad with my account, with my photos, with my credit card, or with anything else. I trust them because I believe they respect their customers:
    • They don’t automatically charge renewals to your account – you have to take an action each time you pay them, so you know they’re not going behind your back to charge your credit card.
    • If you decide not to pay, they still provide a reasonable product (only your most recent 200 photos are available) instead of leaving you high and dry. If you decide to pay at a later time you get your full account back.
    • They provide full, easy access to all your data via a variety of APIs and tools at all times. They’re not trying to tie you down by trapping your data.

I feel comfortable spending money on Flickr in the same way I feel comfortable spending at Costco – I feel they’ll make things right for me without giving me a lot of trouble. So I spend.

This is the way to build a sustainable freemium business – make gaining your customer’s trust your top priority.

Small Biz Minimal CRM Thoughts

As the “computer guy” for my wife, various friends, and my own activities, I often see a need for a very simple “CRM” system. But it’s not really CRM; it’s more of a contact management system together with a way to send emails to various groups of friends / customers, a way to catalog past exchanges, and perhaps a simple to-do system.

Periodically I look at the various available options (there are many), give one or more of them a try, and give up, generally because they’re too complex and too capable.

The most promising is High Rise, but I don’t think it provides a way to send emails, so it doesn’t qualify.

Here are my requirements for my imaginary Pony CRM:

  • Simple. Must be simple enough that my wife will actually use it.
  • Integration with Yahoo Mail / GMail. This should be generalized to POP/IMAP integration, but I find almost everyone uses one of the large providers, either Yahoo or Gmail.
  • Automatic synchronization with Yahoo/Gmail. There shouldn’t be an “import” option – it should always be sync’d automatically. A contact in Gmail should be immediately available in Pony CRM, and vice versa.
  • Easy cataloging / collection of emails. It’d be good to be able to catalog and tag email exchanges.
  • Easy mass email. This is the most important feature – without exception, the most common use case is: I want to email a subset of contacts and I want to track who replied, looked at what I sent, etc. For example, all of your clients that have certain investments, or all customers who’ve requested feature X, all customers who signed up before date X, and so forth, should receive a nice email.
  • Easy graphical email composition. Everyone wants to send a nicely formatted email with graphics that’s readable in the majority of email readers, but no one can figure out how. This should be built in to the CRM tool in some way. There should be a way to store the email templates for future use.
  • History and Audit Trails. You should be able to figure out who you sent what to, when.
  • Extensibility. APIs for access to the CRM system as well as a way to integrate external feeds/ APIs/ widgets into the CRM system.
  • Free/Cheap Hosting with Simple Full Export. If it’s cheap enough, I can push everyone to pay the monthly fee and get the support issue off of my back. However, there must be a simple full-export option as well as an alternative means of using the exported data – eg. an open source version that could be deployed should the hosted version go out of business. Can’t have lock-in if your business depends on the data.

I think this covers just about everything needed. Someone please either build or point me to Pony CRM so I can be happy.

Access Python Dictionary Keys As Properties

Say you want to access the values of your dictionary via dot notation instead of the dictionary syntax. That is, you have:


d = {'name':'Joe', 'mood':'grumpy'}

And you want to get at “name” and “mood” via


d.name
d.mood

instead of the usual


d['name']
d['mood']

Why would you want to do this? Maybe you’re fond of the Javascript Way. Or you find it more aesthetic. In my case I need to have the same piece of code deal with items that are either instances of Django models or plain dictionaries, so I need to provide a uniform way of getting at the attributes.

Turns out it’s pretty simple:



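class DictObj(object):
    """Wrap a dict so its keys can be read as attributes."""
    def __init__(self, d):
        self.d = d

    def __getattr__(self, m):
        # only called when normal attribute lookup fails;
        # missing keys come back as None instead of raising
        return self.d.get(m, None)

d = DictObj(d)
print(d.name)
# prints Joe
print(d.mood)
# prints grumpy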
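
The mixed case mentioned above (model instances and plain dictionaries in the same list) might then look like the sketch below. Person is a hypothetical stand-in for a Django model and as_obj is a helper name I made up; the wrapper is repeated so the snippet runs on its own under Python 3:

```python
class DictObj(object):
    """Same wrapper as above, repeated so the snippet is self-contained."""
    def __init__(self, d):
        self.d = d

    def __getattr__(self, m):
        return self.d.get(m, None)

class Person(object):
    """Hypothetical stand-in for a Django model instance."""
    def __init__(self, name, mood):
        self.name = name
        self.mood = mood

def as_obj(item):
    """Wrap plain dicts; pass anything else through unchanged."""
    return DictObj(item) if isinstance(item, dict) else item

items = [Person('Ann', 'cheerful'), {'name': 'Joe', 'mood': 'grumpy'}]
for item in map(as_obj, items):
    print(item.name, item.mood)
# Ann cheerful
# Joe grumpy
```

One caveat with this wrapper: a dictionary key named "d" is hidden by the wrapper's own self.d attribute, so it cannot be read via dot notation.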

Beanstalkd / Python Basic Tutorial

(First install beanstalkd and pybeanstalk)

Beanstalkd is an in-memory queuing system. It supports named queues (called ‘tubes’), priorities, and delayed delivery of messages.

Terminology: a message is called a job, and queues are called tubes.

Let’s look at an example scenario. Say you want to create 2 tubes, one called “orders” and another called “emails”, place orders into the first tube (or queue) and emails into the second, and have different processes handle orders and emails.

You can create queues by simply naming and putting messages into them. On the producer side:


from beanstalk import serverconn
c = serverconn.ServerConn('localhost', 99988)

# put a message (or job) into the default queue:
c.put('first message, into default tube')

# now start using a named tube:
c.use('orders')
c.put('second message, into orders tube')

Now on the consumer:


from beanstalk import serverconn
c = serverconn.ServerConn('localhost', 99988)

# by default your connection will be listening on the 'default' tube.
# switch it to use the 'orders' tube.
# This should return the 'orders' message and ignore the 'default' message:
c.watchlist = ['orders']
j = c.reserve()
print j
# {'data': 'second message, into orders tube', 'jid': 39, 'bytes': 32, 'state': 'ok'}

You can similarly set up another consumer to listen on only the ‘emails’ tube, or both, or any other scenario you want.

Beanstalkd also supports priorities, with 0 being highest priority and higher numbers meaning lower priority. You define message priority with:


c.put('low priority message', pri=999 )
c.put('high priority message', pri=0 )

j = c.reserve()
print j
# {'data': 'high priority message', 'jid': 41, 'bytes': 21, 'state': 'ok'}
# the high priority message was delivered before the low priority message, even
# though the low priority message was first into the queue

The beanstalkd consumption model is to “reserve” a message (or job), process it, and then tell beanstalkd you’ve successfully dealt with the message so it can be thrown away. When you first get the job via c.reserve() you haven’t actually fully consumed it; you’ve just reserved it for processing.

What does this mean? Imagine a scenario where you reserve a message but your process dies before you have a chance to fully process it. Beanstalkd holds your message in reserve for a period of time, but since it hasn’t heard from you confirming you’ve successfully dealt with the message, it eventually removes the reservation and makes the message available once again for the next consumer to grab. This is a basic handshake between the consumer and the server to allow for some level of resiliency.

So once you’re finished dealing with the message you’ve reserved, be sure to “delete” it, letting the beanstalkd server know it can throw that message away:


j = c.reserve()
# do some processing with j
c.delete(j['jid'])
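
Put together, a consumer loop following this handshake could look like the sketch below. It relies only on the reserve() and delete() calls shown above; FakeConn is a made-up stand-in so the loop can be tried without a running server, and the code is written for Python 3:

```python
def work(conn, handle, max_jobs):
    """Reserve up to max_jobs jobs; delete each only if handle() succeeds."""
    done = 0
    for _ in range(max_jobs):
        job = conn.reserve()
        try:
            handle(job['data'])
        except Exception as exc:
            # No delete: the reservation lapses and the job is offered again.
            print('job %s failed: %s' % (job['jid'], exc))
            continue
        conn.delete(job['jid'])
        done += 1
    return done


class FakeConn(object):
    """Hypothetical stand-in for a server connection, for trying the loop."""
    def __init__(self, messages):
        self.jobs = [{'jid': i, 'data': m} for i, m in enumerate(messages)]
        self.deleted = []

    def reserve(self):
        return self.jobs.pop(0)

    def delete(self, jid):
        self.deleted.append(jid)


def handle(data):
    if data == 'bad':
        raise ValueError('cannot process')

conn = FakeConn(['order 1', 'bad', 'order 2'])
print(work(conn, handle, 3), conn.deleted)  # 2 [0, 2]
```

The important part is the ordering: delete comes after the work, never before, so a crash in the middle leaves the job on the server.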

So there you have the basics. Let me know if you’re interested and I can cover a few more topics.

Setting Up Beanstalkd on Ubuntu for Python

beanstalkd is a promising in-memory queuing system in the mold of memcached (minimal configuration, just works) with client libraries in a variety of languages. The following worked for me for installing it on Ubuntu 8.04:


mkdir ~/packages

# pre-requisite: libevent.
cd ~/packages
wget http://monkey.org/~provos/libevent-1.4.8-stable.tar.gz
tar zxvf libevent-1.4.8-stable.tar.gz
cd libevent-1.4.8-stable
./configure
make
sudo make install

# add /usr/local/lib to your load library path so beanstalkd can find libevent:
# open ~/.bashrc in an editor (e.g. vi ~/.bashrc) and add the following line near the end:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib

# then save, exit the editor, and reload it:
source ~/.bashrc

# need git in order to get latest code for beanstalkd
cd ~/packages
sudo apt-get install git-core

# grab beanstalkd
git clone http://xph.us/src/beanstalkd.git
cd beanstalkd
make

# now you should be able to start the beanstalkd daemon
./beanstalkd -d -p 99988

# get the python beanstalkd client
cd ~/packages
svn checkout http://pybeanstalk.googlecode.com/svn/trunk/ pybeanstalk-read-only

cd pybeanstalk-read-only
sudo python setup.py install

# get pyyaml, a pre-requisite for the python beanstalkd client
cd ~/packages
wget http://pyyaml.org/download/pyyaml/PyYAML-3.06.tar.gz
tar zxvf PyYAML-3.06.tar.gz
cd PyYAML-3.06
sudo python setup.py install

# open two different shells (or use screen) and run one of the following in each:
cd ~/packages/pybeanstalk-read-only/examples
python simple_clients.py producer localhost 99988
python simple_clients.py consumer localhost 99988

UCSD Hack Week: I’ll Be There

I’m participating in the UCSD Yahoo Hack U week this coming week (Oct 13th-17th) with a talk on Wed and hopefully sticking around through most of the actual hack day (Thurs/Fri). If you’re going to be on campus let me know by emailing me ( darugar at gmail ) or leaving a comment below.

Thankful For The Stupidity of Virus Authors

For about a week now the network at home has been slow. Just as I’d get frustrated enough to do something about it, it’d get fast again, so I procrastinated.

Meanwhile I was cursing Time Warner, Linksys, Apple, and just about everybody else whose hardware I own. The Linksys wireless router is 5 years old, but come on, it should still work!

This morning my wife complained of web access on the desktop being slow, so I sat down to take a look. Sure enough, very slow. It was odd that the computer itself wasn’t as responsive as it used to be, but then it is a Windows box, so curse Microsoft.

As I was dialing Time Warner to schedule someone to come out and take a look I started to look for a bandwidth tester so I could prove how crappy their service was. Oddly the search results didn’t actually go to a bandwidth meter, but rather to an intermediary page with lots of links to bandwidth meters.

And that’s when I realized: I gots me a virus. Fantastic. It’s sitting there eating my bandwidth, probably spamming the world.

Point is, I was ready to blame everyone in the world instead of thinking it’s a virus. If the virus authors hadn’t put in the search redirection they could’ve probably lived on for another month sending their merry spam.

I’m certain a clever virus could infect a very large percentage of the world’s computers without anyone noticing. Probably already has.

Back Up No More?

My laptop suddenly stopped working, throwing me into panic – what if it’s a problem with the motherboard, the same problem that killed my last laptop?

It turned out to be the power adapter, fairly easily remedied, but it got me thinking about backups.

I realized I actually don’t have a lot to back up. All of my important work stuff is already under version control, available from another server. Losing my pictures would suck, but the majority of the ones I like I’ve already put on Flickr.

My email is already on gmail and Yahoo mail, as are my contacts and calendar. My personal documents are mostly on Google docs, and the others are available from various servers here and there.

Just about everything I do is already in the cloud. Which makes it very nice – practically speaking, if I had to rebuild my working environment the majority of the effort would go into setting up the various development tools instead of recovering things from backup.

Quite nice, and quite different from 5 years ago.

Stackoverflow: Surprisingly Good Source of Technical Answers

Stackoverflow is a newish service for developers – ask a question, get answers, vote on answers, build reputation. Sort of like Yahoo! Answers but for developers.

I’m surprised at how good the service is so far. I asked a question on IRC and the same question on Stackoverflow. Within 2 minutes I got an incorrect answer on IRC and two correct answers on Stackoverflow. The voting and reputation system seems to really work and there are no IRC trolls / egos to deal with.

I’m hoping the site will maintain its usefulness as it grows. Well worth checking out. I’m here.

Speaking at DataServices World on Hadoop

I’m giving a talk on Data Processing In The Cloud on November 20th 2008 at DataServices World in the Fairmont in San Jose. I hope to see you there. Here’s the abstract:

Hadoop, an open source implementation of map/reduce, has garnered tremendous momentum in large scale data processing, marting, and on occasion warehousing. This session will examine:

  • The current state and industry adoption of Hadoop and cloud-based data processing
  • The programming model, capabilities, common patterns, and best-practices for Hadoop deployment and usage
  • The ecology of value-add technologies and services in the grid computing and data processing world
  • Models for using grid-based data processing alongside traditional technologies and techniques.

I’ll also be participating in UCSD Hackweek coming up in about 2 weeks.

The Big Picture Adopts ReaderScroll Navigation

This is an excellent development – The Big Picture from boston.com, a beautiful photo blog and one of the big reasons for putting ReaderScroll together, has implemented j/k navigation, inspired by ReaderScroll. Nicely done – now you can navigate the pictures via the j/k keys.

One down, many more to go. First step towards j/k navigation everywhere.

Btw, several people have pointed out vi used j/k navigation before Google Reader. True enough. j/k was also used in very old keyboard based games. However, the use of j/k to center web content was first brought to my attention in Google Reader, so I’m sticking with “Google Reader Style” instead of “vi style”.

Depressed

Watching the Palin/Biden debate and the subsequent babbling by the TV talking heads has me depressed. Nothing of substance was said during the entire debate. What we witnessed was an audition for a supporting role in a B movie, not a debate between potential seconds-in-command of the most powerful country in the world.

Apparently these folks haven’t noticed, but we’re in the midst of the most serious economic downturn since the Great Depression. We’re neck deep in two active wars. There are serious problems to solve. Are sound bites, repeated claims of “maverick”, and milf references really the best we have to offer?

What is depressing is not that these folks talk in pre-cooked, substance-free parrotings, but that we really and truly are stupid enough to not only tolerate, but to encourage this.

Please, give me something to vote on beyond how folksy, how old, how female, how dark, and how maverick these people are. I’m not trying to pick the person who looks the most like the president, but the one who will act the most presidential.

Bah. Humbug.

Nookiller

Is it me or does Palin struggle to say “nookiller” instead of nuclear? It seems like she knows the correct pronunciation but goes out of her way to say it incorrectly. Some sort of Bushism or Republicanism maybe?

Actually it’d be interesting to draw party lines along pronunciation. I say potato, you say potAto.