Development Stalemate
Sometimes all you have left are large contiguous pieces of work requiring concentration and small fragmented pieces of time full of interruption.
Schemas often turn out to be significant barriers to innovation – adding a feature that requires a schema change brings with it the difficult and time-consuming task of actually changing the schema in your live database and migrating historical data. I know from first-hand experience with large systems that this frequently turns out to be the most time-consuming and complex aspect of launching new versions of software.
Lately I’ve been more and more tempted by data stores with relaxed schemas. In an attempt to better understand the pros and cons of schemas I’m enumerating my thoughts, mostly from a user’s perspective, and hoping that some of my DB expert readers (I know you’re reading this, if you don’t respond here I’ll bug you in person!) will chime in with their thoughts on the pros of strict schemas.
The Aesthetics. From a developer’s perspective I find strict-schema systems (general RDBMSes) confining – some rows simply want to have more columns than others. I can model it with relationships and foreign keys, but the multi-way joins are somehow ugly to me, and I know they make my performance minded DBA buddies upset. It’s also just not pretty, splitting up this field that naturally wants to sit in a row with the rest of his friends into his own separate table.
The Exceptions. I’ve also seen, in literally every large scale data analysis I’ve done, that even in strict-schema systems junk gets into the system. Just last week some analysis blew up because a field that could not be NULL was in fact NULL in some odd cases. The DBA can’t explain why, but it’s there in the data.
The Discontinuity. Changes do in fact have to happen even in strict schema systems. When they do, there’s an ugly breakage – tools and code deal only with before-change or after-change versions of the schema, not both, because they’re built assuming the schema is well defined and static. I’m looking at the effects of this right now – in one project the current analysis tools can’t be applied before a certain date, because the schema changed on that date.
The Static In A Dynamic System. We’ve found freedom in dynamic languages, allowing us to define our data structures in fluid, natural ways. Strict schemas feel like an injection of Java in a Python system: over-defined, confining, and unnatural. The data store should be as flexible as the programming language – data is code and code is data, or something like that.
…
To be fair, schema-less systems have plenty of cons as well. Truth be told I don’t have very much real-world experience with them, so most of the cons are probably hidden from me, but I do know that schema-less systems are generally a bear to deal with after they’ve existed for a few years. Zombie objects and data infest your data store and gum up the works. Tooling, general knowledge, and best practices are also severely lacking compared to static-schema systems, at least at this point in time.
…
So I’m very tempted to go and try a system with less strict schema requirements. Perhaps something like the FriendFeed MySQL setup. If you can enlighten me on the benefits I’d be foregoing by going schema-less, please leave a comment.
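To make the trade-off concrete, here is a minimal sketch of a relaxed-schema store layered on a relational table. It is not a description of the FriendFeed setup; the table name `entities`, the JSON encoding, and the helper names are all assumptions made for illustration, and it uses SQLite only so it runs anywhere.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Create a one-table store: an id plus an opaque JSON body (hypothetical layout)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entities ("
        "id TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn


def put(conn, entity_id, fields):
    """Store any dict of fields; a new key needs no ALTER TABLE."""
    conn.execute(
        "INSERT OR REPLACE INTO entities (id, body) VALUES (?, ?)",
        (entity_id, json.dumps(fields)),
    )


def get(conn, entity_id):
    """Return the stored dict, or None if the id is unknown."""
    row = conn.execute(
        "SELECT body FROM entities WHERE id = ?", (entity_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = open_store()
put(conn, "car-1", {"color": "blue", "doors": 4})
# This row "wants more columns" than the first one, and simply gets them.
put(conn, "car-2", {"color": "red", "doors": 2, "sunroof": True})
```

The cost shows up on the read side: every consumer has to cope with `sunroof` being absent, which is exactly the zombie-data problem mentioned above.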
As I’ve mentioned before I’m a fan of unit tests, mostly because of the freedom they give you to redesign and clean up code without fear of breaking functionality.
However, they’re also a source of distraction for my distracted mind. In particular, they take a not-insignificant amount of time to run when they involve hitting the database on my local windows/sqlite setup.
This time is just long enough for me to switch over and check email / twitter / whatever. And just like that, I’m off on a tangent, I’ve lost context, and I’ve wasted my own time.
Of course the solution is for me to exert restraint and patience. But I’m impatient.
So, is there a way to make unit tests in Django go blindingly fast?
This is a bit surprising. According to Justin compressed JSON is faster than both Thrift and Protocol Buffers when used with Python (via Dion). I had previously asked about performance comparison of Thrift and Protocol Buffers on StackOverflow, but I had assumed JSON would be significantly slower. Maybe not. I’d love to see more on this.
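For anyone who wants to poke at the compressed-JSON side of that comparison, a rough sketch follows. It assumes "compressed JSON" means `json` plus `zlib` from the standard library, which may not match the setup in the original benchmark, and the sample record is made up. It measures nothing about Thrift or Protocol Buffers.

```python
import json
import time
import zlib


def encode(record):
    """Serialize to JSON, then compress the UTF-8 bytes."""
    return zlib.compress(json.dumps(record).encode("utf-8"))


def decode(blob):
    """Reverse of encode: decompress, then parse the JSON."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


def time_round_trips(record, n=1000):
    """Return the seconds taken for n encode/decode round trips."""
    start = time.perf_counter()
    for _ in range(n):
        decode(encode(record))
    return time.perf_counter() - start


# Hypothetical sample record; real payloads will behave differently.
sample = {"id": 42, "name": "example", "tags": ["a", "b", "c"], "score": 3.5}
blob = encode(sample)
elapsed = time_round_trips(sample)
```

Timings from something this small say little on their own; the point is only to have a baseline to run next to the other formats on the same data.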
I like the term laptop-sized problem for referring to problems that can be effectively solved using a laptop. As in: x million rows in your database? x million pages per day? That’s a laptop-sized problem.
I like it because it’s a good common-sense check – if you’re implementing a complex solution to a laptop-sized problem you’re probably doing it wrong.
I’m amazed by how the size of laptop-sized problems has grown. These days most laptops are multi-core and can easily be outfitted with 4 GB of memory. You can solve some pretty large problems with just a laptop.
Doing data analysis on some fairly large data sets I decided to first implement a basic python/numpy solution before rolling out the hadoop/EC2 version. Turns out the laptop version does make my laptop groan, but is workable even with the full data size. I can skip being smart and just do the simple thing.
Imagine the size of problems we’ll be able to tackle with simple solutions on a laptop in a few years.
Due to aforementioned OS X stupidity with the Home and End keys, my new favorite Eclipse hotkey is Ctrl-Q. It takes you to the location of your last edit. So next time you hit the End key to go to the end of the line but end up at the end of the file and curse Jobs and his cult, remember to hit Ctrl-Q to go back to where you should be.
I’m trying to find the appropriate RESTful URL design for search and for collections of items.
The setup: we have two models, Cars and Garages, where Cars can be in Garages. Base URLs:
/car/xxx (xxx = car id)
/garage/yyy (yyy = garage id)
Now we want to provide a search for cars – e.g. show me all the blue sedans with 4 doors. What’s the appropriate URL?
1 - /cars/color/blue/type/sedan/doors/4
2 - /cars/color:blue/type:sedan/doors:4
3 - /cars/?color=blue&type=sedan&doors=4
4 - /car/search/...
None of these are satisfying.
1 through 3 use “cars” as the base (as opposed to “car”). So the pattern for doing searches / collections would be to pluralize the model. This seems ok.
1 has arbitrary ordering of the fields and no good way to distinguish fields versus their values. 2 is slightly better, but still doesn’t seem right.
3 uses the QUERYSTRING for the parameters instead of the PATHINFO, and frankly looks better to me, but I’ve heard of objections to using QUERYSTRING. The problem I have with it is it’s not consistent – if I was searching on a single field my URL would probably be: /cars/color/red or something like that. Having the URL drastically change form just because there are more search parameters seems wrong.
4 uses the “car” base url along with the verb “search”. That seems wrong – verbs shouldn’t be part of the URL, right? It’s been suggested several times though.
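As a way of weighing option 3, here is a small sketch of how a handler might turn the query string into filters. The car records and the `search` helper are hypothetical, and this is only one possible reading of the design, not a recommendation. It does show that the same URL form can serve one search parameter or several.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical in-memory data; values are strings so they compare
# directly against query-string values.
CARS = [
    {"id": 1, "color": "blue", "type": "sedan", "doors": "4"},
    {"id": 2, "color": "blue", "type": "coupe", "doors": "2"},
    {"id": 3, "color": "red", "type": "sedan", "doors": "4"},
]


def search(url, items):
    """Return the items matching every field=value pair in the query string."""
    filters = dict(parse_qsl(urlsplit(url).query))
    return [
        item
        for item in items
        if all(item.get(field) == value for field, value in filters.items())
    ]


one_field = search("/cars/?color=red", CARS)
three_fields = search("/cars/?color=blue&type=sedan&doors=4", CARS)
```

Field order in the query string does not matter here, which sidesteps the arbitrary-ordering complaint about options 1 and 2.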
Now a slightly different case – let’s find all the cars in a given garage:
1 - /garage/yyy/cars
2 - /cars/?garage=yyy
1 seems pretty good in this case.
Please chime in with your thoughts, either in the comments here or in the Stackoverflow thread.
beanstalkd is a promising in-memory queuing system in the mold of memcached (minimal configuration, just works) with client libraries in a variety of languages. The following worked for me for installing it on Ubuntu 8.04:
mkdir ~/packages
# pre-requisite: libevent.
cd ~/packages
wget http://monkey.org/~provos/libevent-1.4.8-stable.tar.gz
tar zxvf libevent-1.4.8-stable.tar.gz
cd libevent-1.4.8-stable
./configure
make
sudo make install
# add /usr/local/lib to your load library path so beanstalkd can find libevent
vi ~/.bashrc (add the following somewhere near the end):
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
(exit vi)
source ~/.bashrc
# need git in order to get latest code for beanstalkd
cd ~/packages
sudo apt-get install git-core
# grab beanstalkd
git clone http://xph.us/src/beanstalkd.git
cd beanstalkd
make
# now you should be able to start the beanstalkd daemon
./beanstalkd -d -p 99988
# get the python beanstalkd client
cd ~/packages
svn checkout http://pybeanstalk.googlecode.com/svn/trunk/ pybeanstalk-read-only
cd pybeanstalk-read-only
sudo python setup.py install
# get pyyaml, a pre-requisite for the python beanstalkd client
cd ~/packages
wget http://pyyaml.org/download/pyyaml/PyYAML-3.06.tar.gz
tar zxvf PyYAML-3.06.tar.gz
cd PyYAML-3.06
sudo python setup.py install
# open two shells (or use screen); cd to the examples directory in both, then run the producer in one and the consumer in the other:
cd ~/packages/pybeanstalk-read-only/examples
python simple_clients.py producer localhost 99988
python simple_clients.py consumer localhost 99988
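To get a feel for what the producer is sending, here is a sketch of the wire format for beanstalkd's `put` command, written without any client library. It does not use pybeanstalk and does not open a socket; the function names and the default priority, delay, and time-to-run values are assumptions for illustration.

```python
def build_put(body, priority=1024, delay=0, ttr=60):
    """Frame a job body (bytes) as a beanstalkd `put` command."""
    header = "put %d %d %d %d\r\n" % (priority, delay, ttr, len(body))
    return header.encode("ascii") + body + b"\r\n"


def parse_inserted(reply):
    """Return the job id from an `INSERTED <id>` reply, else raise ValueError."""
    words = reply.strip().split()
    if len(words) != 2 or words[0] != b"INSERTED":
        raise ValueError("unexpected reply: %r" % reply)
    return int(words[1])


command = build_put(b"hello")
job_id = parse_inserted(b"INSERTED 7\r\n")
```

The protocol is plain text over TCP in the memcached style, which is a large part of why client libraries exist in so many languages.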

Stackoverflow is a newish service for developers – ask a question, get answers, vote on answers, build reputation. Sort of like Yahoo! Answers but for developers.
I’m surprised at how good the service is so far. I asked a question on IRC and the same question on Stackoverflow. Within 2 minutes I got an incorrect answer on IRC and two correct answers on Stackoverflow. The voting and reputation seem to really work and there are no IRC trolls / egos to deal with.
I’m hoping the site will maintain its usefulness as it grows. Well worth checking out. I’m here.
Drizzle is interesting:
Drizzle: A High-Performance Microkernel DBMS for Scale-Out Applications
Drizzle is a community-driven project based on the popular MySQL DBMS that is focused on MySQL’s original goals of ease-of-use, reliability and performance.
Headed up by Brian Aker, Director of Architecture at MySQL AB. Take a look at the MySQL Differences page and you’ll mostly see features removed and cleaned up, which is great. Designed for high levels of concurrency, targeted to “cloud” applications. Monty and Brian’s posts offer motivation for the project.
Something to keep an eye on.
Three.
The number of programmers who will write most of the code in a system developed by a team of 24 engineers, two project managers, three group leaders, a quality lead and an office manager.
From Russ Olsen.
Mozilla announces TraceMonkey, a just-in-time compiler for Javascript. If you’ve watched Steve Yegge’s talk on Dynamic Languages (transcript) you’ve already had a taste of what the future could look like for dynamic languages – namely, performance on par with today’s low level languages.
Javascript started as an ugly language but has been steadily shedding its bad parts and adopting a beautiful functional style. With the performance piece figured out and a tremendously large number of installations and runtimes (just about every browser in existence has a Javascript engine), it could become the most important programming language of the near future.
All the cool kids are into Git these days and I’ve been reading plenty of articles about how good it is and how to use it. The problem is, I don’t really have a problem with Subversion. I know I should, because all the cool kids do, but I just don’t run into a lot of issues with it. In the absence of a problem to solve it would simply be peer-pressure to give Git a shot.
So in an attempt to remain ever-so-independent I’m going to try a distributed system, but not Git. I’m going with Mercurial.
Actually mainly it’s because Mercurial seems significantly simpler, and I’m a simple guy. It’s also written in Python, which gives me a warm and fuzzy. And I’m finally motivated to try it because I’m going to try a code path which may not work out, and I understand these distributed systems deal with that well.
Hmm. The main thing I’d want from a source code control system would be a bit of packaging and deployment intelligence built in. Maybe something to minify and join my javascript files and mend the files that reference them. I’m extremely pleased not to have a “make” step anywhere in my process, but I do miss some of the capabilities.
If I’m making a mistake and I should go with Git, or perhaps CVS, do let me know.
Excellent article on Simple Top-Down Parsing in Python. The nud and led business could be better explained, but the rest of the article and code is great. I learned several things I hope to employ shortly.
I’m trying to remember if we studied this in compiler class or not. I think not, although I have terrible memory, so it’s possible we did.
Anyway, a companion tutorial article that would approach this strictly from the perspective of using the toolkit Effbot built in his article would be nice. In other words, knowing the under-the-hood details is fantastic and informative, but given the tools and helper functions built in the article one could fairly easily build a parser without worrying about how the helper functions are implemented. Sort of a user manual for building parsers given the helper functions.
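In the meantime, here is my own compressed sketch of the nud/led idea. It is not the toolkit from the article: it keeps the names `nud`, `led`, `expression`, and binding powers, but puts them on a single hypothetical `Parser` class, handles only integers, `+`, `-`, `*`, and parentheses, and evaluates as it parses instead of building a tree.

```python
import re

TOKEN = re.compile(r"\s*(\d+|[-+*()])")

# Left binding power: how tightly an infix token grabs its left operand.
LBP = {"+": 10, "-": 10, "*": 20}


def tokenize(text):
    """Yield number and operator tokens, then a final "(end)" marker."""
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError("bad input at position %d" % pos)
        yield match.group(1)
        pos = match.end()
    yield "(end)"


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.token = next(self.tokens)

    def advance(self, expected=None):
        if expected is not None and self.token != expected:
            raise SyntaxError("expected %r, got %r" % (expected, self.token))
        self.token = next(self.tokens, "(end)")

    def nud(self, token):
        """Null denotation: the token starts an expression (nothing to its left)."""
        if token.isdigit():
            return int(token)
        if token == "-":  # prefix minus binds tighter than any infix operator
            return -self.expression(100)
        if token == "(":
            value = self.expression(0)
            self.advance(")")
            return value
        raise SyntaxError("unexpected %r" % token)

    def led(self, token, left):
        """Left denotation: the token sits between a left operand and the rest."""
        right = self.expression(LBP[token])
        if token == "+":
            return left + right
        if token == "-":
            return left - right
        return left * right  # "*" is the only other key in LBP

    def expression(self, rbp=0):
        """Parse while the next token binds tighter than rbp."""
        token = self.token
        self.advance()
        left = self.nud(token)
        while rbp < LBP.get(self.token, 0):
            token = self.token
            self.advance()
            left = self.led(token, left)
        return left


def parse(text):
    parser = Parser(text)
    value = parser.expression()
    if parser.token != "(end)":
        raise SyntaxError("unexpected %r" % parser.token)
    return value
```

The whole trick is in `expression`: call `nud` on the first token, then keep calling `led` for as long as the upcoming operator binds more tightly than the caller's `rbp`.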
Another one in the category of always-forget-how-to-do-this-so-noting-here:
class Entry(models.Model):
    blog = models.ForeignKey(Blog)

b = Blog.objects.get(id=1)
b.entry_set.all()  # Returns all Entry objects related to Blog.

# b.entry_set is a Manager that returns QuerySets.
b.entry_set.filter(headline__contains='Lennon')
b.entry_set.count()
Full docs here.
Interesting new open source release from Google called Protocol Buffers. Language neutral data serialization and exchange via protocol definition and generated code for C++, Java, and Python.
Apparently Protocol Buffers are heavily used inside Google, so they look to be a robust implementation. Should be a good format for wire protocols.
They compare it to XML and tout its size and speed advantages. In a client/server implementation, however, JSON is the more likely alternative. I wonder how the size and speed compare.
In most programming languages (Java, C, Python, Perl) I’m generally thinking “I’ll put this thing on this shelf here, then I’ll do x, then I’ll pick up that thing, do some work on it, put the result over here,” and so forth.
With Javascript, particularly when used correctly, which for me means in the Way Of JQuery, the thought process is more like “When some event happens, this guy will wake up and he’ll know what to do. He’ll remember his name, what he was supposed to work on, and he’ll be carrying his own tools. He might get blocked at some point, but then he’ll just wait around and when he’s ready to go he’ll remember who he is, what he was doing, and how far along doing it he was. And when he’s done he’ll go away and along with him will go his tools and any other mess he made”.
Javascript is a lot more “guy with the thing” thinking instead of “what’s on this shelf here?” thinking. I guess that’s called closures, or something like that. Anyway, I’m liking it.
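The "guy with the thing" idea is not specific to Javascript. Here is the same shape sketched in Python, with a hypothetical `make_worker` function: each handler it returns remembers its own name, carries its own tools, and keeps track of how far along it is.

```python
def make_worker(name, tools):
    """Return a handler that carries its own name, tools, and progress."""
    progress = 0  # private state; only the handler below can see it

    def on_event(amount):
        # "He'll remember who he is, what he was doing, and how far along he was."
        nonlocal progress
        progress += amount
        return "%s used %s, progress %d" % (name, tools[0], progress)

    return on_event


alice = make_worker("alice", ["hammer"])
bob = make_worker("bob", ["saw"])

first = alice(1)
second = alice(2)
other = bob(5)
```

When nothing references `alice` any more, the name, the tools, and the progress counter all go away with it.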
Photo by St-Even.

A friend forwarded me an email about yet another group using a database to implement what’s really a queue. Not surprisingly, performance is an issue.
Queues are still not part of the average developer’s standard set of tools. At least the Java world has a standard API and several good implementations to pick from. The scripting world is a hodge-podge, and I still haven’t found a great choice despite a good bit of looking.
I’m looking forward to a simple, commonly used queue interface / implementation that people can wrap their heads around and employ widely. Use of queues is one of the basic techniques for achieving scale, and we’re still lacking the basic tools to use it.
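For a sense of how small the basic pattern is, here is a sketch using the in-process queue from Python's standard library. It is not a substitute for a networked queue service, and the `run_pipeline` name, the squaring "work", and the `None` sentinel are all made up for illustration.

```python
import queue
import threading


def run_pipeline(jobs, n_workers=2):
    """Push jobs through a queue to worker threads and collect the results."""
    work = queue.Queue()
    results = queue.Queue()

    def worker():
        while True:
            job = work.get()  # blocks until a job arrives; no polling
            if job is None:  # sentinel: no more work for this worker
                break
            results.put(job * job)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for job in jobs:
        work.put(job)
    for _ in threads:
        work.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()

    collected = []
    while not results.empty():
        collected.append(results.get())
    return sorted(collected)


squares = run_pipeline([1, 2, 3, 4])
```

A database table standing in for `work` has to be polled and locked row by row to get the same hand-off, which is usually where the performance trouble starts.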
Photo by Sean Dreilinger.
This is kind of pretty:
void log(lazy char[] dg)
{
    if (logging)
        fwritefln(logfile, dg());
}

void foo(int i)
{
    log("Entering foo() with i set to " ~ toString(i));
}
Note the lazy keyword in the definition of the log function, which tells D to only evaluate the argument if it is actually needed (i.e. lazily).
Nice. Smells a little like Twisted’s deferred business, except different.
Via Raganwald.
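Python has no lazy keyword, but the same effect can be approximated by passing a callable instead of a finished string. This is a sketch with a hypothetical `Logger` class and `describe` helper, not an idiom from any particular logging library.

```python
class Logger:
    """Collects log lines, but only builds a message when enabled."""

    def __init__(self, enabled):
        self.enabled = enabled
        self.lines = []

    def log(self, make_message):
        # make_message is a zero-argument callable; it runs only if needed.
        if self.enabled:
            self.lines.append(make_message())


built = []  # records which messages were actually constructed


def describe(i):
    built.append(i)
    return "Entering foo() with i set to %d" % i


quiet = Logger(enabled=False)
loud = Logger(enabled=True)

quiet.log(lambda: describe(1))  # describe(1) is never called
loud.log(lambda: describe(2))   # describe(2) runs inside log()
```

The D version is prettier because the caller writes an ordinary expression and the compiler wraps it; here the caller has to spell out the `lambda`.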
I have a need to verify user email addresses, which I’ve been doing the traditional way – sending an email with a secret to the user’s address and having them reply or click on a URL.
Unfortunately this is not optimal – emails tend to not make it to the user, go into bulk/spam buckets, and are less real-time than I’d like. I’m looking for a better way.
I’m hoping OpenID will help me. I mainly care about Yahoo, Google, and Hotmail, all of which support OpenID to some extent.
I believe OpenID Simple Registration is what I’m looking for. I have a lot of homework to do to see which providers support SREG, how to use them, etc. I’ll post my progress here, and if you have knowledge / experience with this, please leave a comment below.