Archive for March, 2009

Encarta Bites The Dust, Stirs Nostalgia

My first real job was consulting for Encyclopaedia Britannica. It was fantastic work, bringing the giant of encyclopedias onto this new platform called the Web.

What most people don’t know is that Britannica was actually a tremendous innovator in the early days of the Web. Harold Kester, a very smart guy, a good friend, and later CTO of Websense, ran the advanced technology group here in San Diego. His team was smart enough to spot the importance of the Web very early (we’re talking NCSA Mosaic days, before Netscape existed), and made fantastic advances, particularly in the field of search. I was lucky enough to work with them.

Unfortunately Britannica’s core business, selling encyclopedias door-to-door, was getting killed by this CD-based encyclopedia from Microsoft called Encarta. Who would pay over a thousand dollars for something you could get practically free?

Britannica’s business model didn’t survive the age of the CD, but the company did manage to transform itself and leap to the next technology, the Web.

Then it ran into this thing called Wikipedia.

All of this nostalgia was stirred up by reading that Microsoft has decided to close down the Encarta business. A technology and model that killed a long-running, well-established business was itself killed by a newer, shinier model.

Development Stalemate

Sometimes all you have left are large contiguous pieces of work requiring concentration and small fragmented pieces of time full of interruption.

I Have All This Power

My older son is a very gentle human being – he doesn’t want to hurt anyone or anything. When he was 4 he realized that meat comes from killing cows and declared he was a vegetarian.

Yesterday we were out at Mission Bay and he was getting thrown around by a 5-year-old. He’s almost 7, quite strong, and takes wrestling, so I was perturbed. I pulled him aside and asked him about it – how come you’re getting beaten up by a little kid?

His answer really struck me. He said

What are you worried about? I have all this power. But I don’t want to use it against him – he’s just a little kid.

I was dumbstruck and impressed. Sounds like something Superman would say.

OS X Takes Over

I am a hypocrite. I keep complaining about OS X and Macs, but every new computer I get ends up with OS X installed on it. OS X on the Dell Mini 9 is a beautiful thing, and I finally managed to get my beefy new server up and running. Alas, even I am becoming Mr. Apple Fanboy.

Are Schemas A Thing Of The Past?

Schemas often turn out to be significant barriers to innovation – adding a feature that requires a schema change brings with it the difficult and time-consuming task of actually changing the schema in your live database and migrating historical data. I know from first-hand experience with large systems that this frequently turns out to be the most time-consuming and complex aspect of launching new versions of software.

Lately I’ve been more and more tempted by data stores with relaxed schemas. In an attempt to better understand the pros and cons of schemas I’m enumerating my thoughts, mostly from a user’s perspective, and hoping that some of my DB expert readers (I know you’re reading this, if you don’t respond here I’ll bug you in person!) will chime in with their thoughts on the pros of strict schemas.

The Aesthetics. From a developer’s perspective I find strict-schema systems (general RDBMSes) confining – some rows simply want to have more columns than others. I can model it with relationships and foreign keys, but the multi-way joins are somehow ugly to me, and I know they make my performance-minded DBA buddies upset. It’s also just not pretty, splitting up this field that naturally wants to sit in a row with the rest of his friends into his own separate table.

The Exceptions. I’ve also seen, in literally every large scale data analysis I’ve done, that even in strict-schema systems junk gets into the system. Just last week some analysis blew up because a field that could not be NULL was in fact NULL in some odd cases. The DBA can’t explain why, but it’s there in the data. 

The Discontinuity. Changes do in fact have to happen even in strict schema systems. When they do, there’s an ugly breakage – tools and code deal only with before-change or after-change versions of the schema, not both, because they’re built assuming the schema is well defined and static. I’m looking at the effects of this right now – in one project the current analysis tools can’t be applied before a certain date, because the schema changed on that date.

The Static In A Dynamic System. We’ve found freedom in dynamic languages, allowing us to define our data structures in fluid, natural ways. Strict schemas feel like an injection of Java in a Python system: over-defined, confining, and unnatural. The data store should be as flexible as the programming language – data is code and code is data, or something like that.

To be fair, schema-less systems have plenty of cons as well. Truth be told I don’t have very much real-world experience with them, so most of the cons are probably hidden from me, but I do know that schema-less systems are generally a bear to deal with after they’ve existed for a few years. Zombie objects and data infest your data store and crumb up the works. Tooling, general knowledge, and best practices are also severely lacking compared to static-schema systems, at least at this point in time.

So I’m very tempted to go and try a system with less strict schema requirements. Perhaps something like the FriendFeed MySQL setup. If you can enlighten me on the benefits I’d be forgoing by going schema-less, please leave a comment.
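
For concreteness, here is a minimal sketch of the kind of relaxed-schema store I have in mind: each entity is a JSON blob in a single table, so different rows can carry different fields without a schema change. It is only loosely modeled on my understanding of the FriendFeed approach. The `entities` table, the `open_store`/`save_entity`/`load_entity` helpers, and the choice of SQLite and JSON are all assumptions for illustration, not anyone's production design.

```python
import json
import sqlite3
import uuid

def open_store(path=":memory:"):
    """Open a store with one table: an id plus an opaque JSON body."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entities "
        "(id TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn

def save_entity(conn, entity):
    """Insert or overwrite an entity; assigns an id if it has none."""
    entity = dict(entity)
    entity.setdefault("id", uuid.uuid4().hex)
    conn.execute(
        "INSERT OR REPLACE INTO entities (id, body) VALUES (?, ?)",
        (entity["id"], json.dumps(entity)),
    )
    conn.commit()
    return entity["id"]

def load_entity(conn, entity_id):
    """Return the entity as a dict, or None if it does not exist."""
    row = conn.execute(
        "SELECT body FROM entities WHERE id = ?", (entity_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None

conn = open_store()
# Two "rows" with different sets of fields, no ALTER TABLE required.
a = save_entity(conn, {"type": "user", "name": "alice"})
b = save_entity(conn, {"type": "user", "name": "bob", "twitter": "@bob"})
old = load_entity(conn, a)
new = load_entity(conn, b)
# Readers have to tolerate missing fields, which is where the cons start.
print(old.get("twitter"), new.get("twitter"))
```

The flexibility is real, but so is the cost: every reader needs a `.get()` and a default for fields that older entities never had, and nothing stops junk from accumulating in the blobs.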

Dell Mini 9 First Impressions

My Dell Mini 9 arrived this morning and I finally have a chance to play with it. First impressions:

  • Cute little thing. It’s tiny. Fatter than I’d expected, but tiny.
  • The keyboard is definitely not comfortable. I have to do a quasi pecking action to type on it; my hands don’t fit.
  • Pretty clear that the Windows interface as it stands is not a great fit for a netbook (nor are Linux/GNOME or OS X for that matter). The screen real estate is too tight; window decorations need to be far thinner and apps need to be much more space conscious.
  • The trackpad is also difficult to use; my son was trying to play a game and the difference between the built-in trackpad and external mouse was ridiculously obvious.
  • My immediate reaction to the reduced screen real estate was to desire a setup where every app is full screen and you can switch between apps with keyboard shortcuts. There isn’t enough room for multiple windows on screen. I might have to build something with AutoHotkey to make this happen.
  • Chrome seems to be the best browser for it because it wastes the least screen real estate.

Overall I like it a lot, but for no reason I can put into words other than it’s cute. Maybe that’s enough reason; we’ll see.

Python S3 Library For Chunked / Streaming Download

Born of a need to deal with multi-gig files on S3, I’ve modified the Python Amazon S3 library to allow you to read data in chunks, and added a simple file-like object that lets you read the file one line at a time (à la for line in f ).

The plan was to create this as a patch against the official Python S3 library from Amazon (it’s only a small change), and possibly even do a github thing, but it’s evident I’ll never get around to it, so I’m simply uploading it here.

The small change is the addition of an optional readbody argument to AWSAuthConnection.get that tells the library not to read the body of the message, and an S3File class that provides the line interface. Here’s an example of using the S3File class:


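import S3  # the modified Amazon S3 library described above

if use_S3:
    # Stream the object from S3 through the file-like wrapper
    f = S3.S3File(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, BUCKET, FILE)
else:
    # Fall back to a local copy of the same file
    f = open(conf.local_location + '/' + FILE)

for line in f:
    pass  # do something with each line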
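
The line interface is conceptually simple: read a chunk, split on newlines, and hold back the trailing partial line until the next chunk arrives. The following is a minimal sketch of that technique, not the actual S3File code. `ChunkedLineReader` is a hypothetical name, and `io.BytesIO` stands in for the unread HTTP response body; anything with a `read(size)` method should work the same way.

```python
import io

class ChunkedLineReader:
    """Iterate over the lines of a byte stream, reading fixed-size chunks."""

    def __init__(self, stream, chunk_size=64 * 1024):
        self.stream = stream          # anything with a read(size) method
        self.chunk_size = chunk_size

    def __iter__(self):
        buffered = b""
        while True:
            chunk = self.stream.read(self.chunk_size)
            if not chunk:
                break
            buffered += chunk
            lines = buffered.split(b"\n")
            # The last piece may be an incomplete line; keep it for later.
            buffered = lines.pop()
            for line in lines:
                yield line + b"\n"
        if buffered:
            yield buffered  # final line without a trailing newline

# io.BytesIO stands in for the HTTP response body here.
body = io.BytesIO(b"first\nsecond line\nthird")
lines = list(ChunkedLineReader(body, chunk_size=4))
print(lines)
```

Only one chunk plus one partial line is held in memory at a time, which is the whole point when the file is several gigabytes.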

Use Python’s Marshal For Faster Serialization

I’d been using Python’s cPickle module for serializing my data structures to disk. As my data size got larger I started to care about performance, and after a few searches ended up at the marshal module. It comes with a few caveats, but is working great, much faster for me than cPickle. So if you’re looking for performance and can live with the caveats, give marshal a try.
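
If you want to check this on your own data, here is a rough sketch of a round-trip comparison. It assumes Python 3, where the C implementation that used to be cPickle is used automatically by `pickle`; the sample data and the `timed` helper are made up for illustration, and a single run like this is not a proper benchmark.

```python
import marshal
import pickle
import time

# Caveats: marshal only handles built-in types (dicts, lists, strings,
# numbers, ...) and its format is not guaranteed stable across Python versions.
data = {"counts": list(range(100000)), "label": "example", "ratio": 0.25}

def timed(dumps, loads, obj):
    """Return (round-tripped object, elapsed seconds) for one serializer."""
    start = time.perf_counter()
    restored = loads(dumps(obj))
    return restored, time.perf_counter() - start

from_marshal, marshal_secs = timed(marshal.dumps, marshal.loads, data)
from_pickle, pickle_secs = timed(pickle.dumps, pickle.loads, data)

print("marshal: %.4fs  pickle: %.4fs" % (marshal_secs, pickle_secs))
```

Timings depend heavily on the shape of the data and the Python version, so measure with something representative before switching.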

Alex Shah’s iPhone Interview on Mixergy

I had the opportunity to introduce one friend, Alex Shah, to another, Andrew Warner, leading to this fun interview on how to approach and launch iPhone apps. Stop by and take a quick listen.

Alex is a lot of fun to talk to, even more fun to argue with, and he’s never afraid of telling you what’s on his mind. In fact I remember years ago I asked him how a panel he was on went, and he said “disaster – I couldn’t get anyone to disagree with me!”

If you run into Alex ask him about iPhone apps, or better yet, ask him how he feels about the high-speed train initiative.

How Do You Speed Up Django Unit Test Execution?

As I’ve mentioned before I’m a fan of unit tests, mostly because of the freedom they give you to redesign and clean up code without fear of breaking functionality.

However, they’re also a source of distraction for my distracted mind. In particular, they take a not-insignificant amount of time to run when they involve hitting the database on my local Windows/SQLite setup.

This time is just long enough for me to switch over and check email / twitter / whatever. And just like that, I’m off on a tangent, I’ve lost context, and I’ve wasted my own time.

Of course the solution is for me to exert restraint and patience. But I’m impatient. 

So, is there a way to make unit tests in Django go blindingly fast?

JSON faster than Thrift and Protocol Buffers?

This is a bit surprising. According to Justin, compressed JSON is faster than both Thrift and Protocol Buffers when used with Python (via Dion). I had previously asked about a performance comparison of Thrift and Protocol Buffers on StackOverflow, and had assumed JSON would be significantly slower. Maybe not. I’d love to see more on this.
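
For reference, this is roughly what I take "compressed JSON" to mean: serialize with `json`, then compress with `zlib`. That reading is my assumption, and this is not the benchmark code from the post; `pack`, `unpack`, and the sample record are hypothetical.

```python
import json
import zlib

def pack(obj, level=6):
    """Serialize obj to JSON and compress it with zlib."""
    return zlib.compress(json.dumps(obj).encode("utf-8"), level)

def unpack(blob):
    """Reverse of pack(): decompress and parse the JSON."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

record = {"id": 42, "tags": ["a", "b", "c"], "scores": [1.5, 2.5], "name": "example"}
records = [record] * 1000
blob = pack(records)
print(len(json.dumps(records)), "bytes raw ->", len(blob), "bytes compressed")
```

Repeated field names are exactly what a compressor is good at squeezing out, which may explain part of why JSON holds up better than expected.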

The Growth of Laptop Sized Problems

I like the term laptop-sized problem for referring to problems that can be effectively solved using a laptop. As in: x million rows in your database? x million pages per day? That’s a laptop-sized problem.

I like it because it’s a good common-sense check – if you’re implementing a complex solution to a laptop-sized problem you’re probably doing it wrong. 

I’m amazed by how the size of laptop-sized problems has grown. These days most laptops are multi-core and can easily be outfitted with 4 GB of memory. You can solve some pretty large problems with just a laptop.
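
A back-of-the-envelope version of the check, as a sketch: multiply rows by bytes per row and compare against a fraction of RAM. The `fits_in_memory` helper, the 50% headroom, and the example workload are all hypothetical numbers picked for illustration.

```python
def fits_in_memory(rows, bytes_per_row, ram_gb=4.0, headroom=0.5):
    """Rough check: does the raw data fit in a fraction of available RAM?"""
    needed = rows * bytes_per_row
    budget = ram_gb * 1024 ** 3 * headroom  # leave room for the OS and copies
    return needed <= budget, needed / 1024 ** 3

# Hypothetical workload: 20 million rows of ten 8-byte floats each.
ok, gigabytes = fits_in_memory(20_000_000, 10 * 8)
print(ok, round(gigabytes, 2))
```

If the answer is yes with room to spare, the simple in-memory approach is probably worth trying first.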

Doing data analysis on some fairly large data sets, I decided to first implement a basic Python/NumPy solution before rolling out the Hadoop/EC2 version. Turns out the laptop version does make my laptop groan, but is workable even with the full data size. I can skip being smart and just do the simple thing.

Imagine the size of problems we’ll be able to tackle with simple solutions on a laptop in a few years.