Archive for the 'Programming' Category


Understanding OAuth: An Overview


The scenario: user Alice wants to allow SimpleService to access Twitter on her behalf.

Before OAuth this would be done by having Alice give SimpleService her login credentials for Twitter. She would type her Twitter username and password into SimpleService, and SimpleService would use those credentials to access Twitter on her behalf. This was bad because SimpleService now knew Alice’s credentials, and could do malicious things or carelessly leak her credentials to malicious people.

With OAuth Alice doesn’t have to give SimpleService her Twitter credentials. Instead she goes through a process wherein she tells Twitter that SimpleService is allowed to act on her behalf.

The flow is:

First, before Alice is involved, Twitter and SimpleService exchange secret information: SimpleService requests Twitter to provide it with a consumer token and a secret. The consumer token will be used to identify SimpleService to Twitter, and the secret will be used to secure the communications to prevent others from pretending to be SimpleService.

This is set up once, allowing SimpleService and Twitter to communicate in general.
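As a rough illustration of how a shared secret can stop others from pretending to be SimpleService, here is a minimal Python sketch. It is not the exact OAuth signature algorithm (the real one defines a precise base string and parameter encoding). The key, secret, URL, and function names are made up for the example.

```python
import hashlib
import hmac
import time

# Hypothetical values Twitter would hand to SimpleService during setup.
CONSUMER_KEY = "simpleservice-key"
CONSUMER_SECRET = b"simpleservice-secret"


def sign_request(method, url, params, secret):
    """Return a hex HMAC-SHA1 signature over a simplified request summary."""
    # Sort parameters so both sides build the identical string to sign.
    summary = "&".join(
        [method.upper(), url] + ["%s=%s" % (k, params[k]) for k in sorted(params)]
    )
    return hmac.new(secret, summary.encode("utf-8"), hashlib.sha1).hexdigest()


def verify_request(method, url, params, signature, secret):
    """Provider side: recompute the signature and compare in constant time."""
    expected = sign_request(method, url, params, secret)
    return hmac.compare_digest(expected, signature)


url = "https://provider.example/request_token"  # placeholder URL
params = {"consumer_key": CONSUMER_KEY, "timestamp": str(int(time.time()))}
signature = sign_request("POST", url, params, CONSUMER_SECRET)
```

Only someone holding the secret can produce a signature the provider will accept, and the timestamp gives the provider a way to reject old, replayed requests.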

Once set up, for each user that wants to allow SimpleService to access Twitter on their behalf, the following happens:

  • The user Alice goes to the SimpleService website and requests SimpleService to access her Twitter account.
  • SimpleService calls Twitter with a request_token call. This tells Twitter SimpleService is about to have a conversation with it regarding authorization.
  • Twitter responds with a request_token.
  • SimpleService, having received the token from Twitter, redirects the user’s browser to a Twitter authorize page, passing along the request_token as part of the URL.
  • Twitter grabs the token (along with various signatures and timestamps to verify the request is not forged).
  • Twitter displays a page to the user asking her if she wants to give SimpleService access to her Twitter account.
  • If the user says no the game is over, and the token is not authorized.
  • If the user says yes, Twitter redirects the user back to SimpleService, passing along an authorization token, letting SimpleService know the user authorized access.
  • SimpleService now exchanges the authorization token for an access token: it calls Twitter with the authorization token, and requests Twitter to give it an access token.
  • Twitter examines the token SimpleService sent it, verifies that it’s not forged and that the user Alice had earlier authorized that token for access. It now believes that Alice wants to grant SimpleService access to her Twitter account. Twitter responds to SimpleService with an Access token.
  • SimpleService grabs the access token and stores it, associating it with the user Alice.

Now the initial authorization dance is done: SimpleService has an access token that allows it to access Twitter on behalf of Alice.

To actually access Twitter on behalf of Alice, SimpleService includes the access token with each call it makes to Twitter on behalf of Alice. Twitter checks the token, verifies that it is valid, and allows SimpleService to access Alice’s resources.

To be an OAuth service provider (that is, play the part of Twitter), you need to:

  • Have a way to exchange consumer tokens and secret keys with third parties (eg. SimpleService) that want to access your APIs.
  • Provide a request_token service that provides tokens that start the conversation for each authorization.
  • Provide an authorization page that tells the user that a third party (eg. SimpleService) is requesting access to her resources, and allows her to accept or reject the request.
  • Make a callback to the service (eg. SimpleService), informing it that the request has been authorized.
  • Provide an access_token service that accepts a request token that’s been authorized and provides an access token to the third party (eg. SimpleService).
  • Store the access token for the user that authorized it.
  • Accept access tokens as an authentication mechanism for API calls, verifying the token’s validity and authenticating the user associated with that token.
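The provider duties listed above can be sketched as a small in-memory Python class. This is a toy under stated assumptions, not a real OAuth implementation: signature and timestamp checks, the redirect and callback, and persistent storage are all left out, and the class and method names are invented for illustration.

```python
import secrets


class ToyProvider:
    """In-memory stand-in for the provider (the Twitter role). Not production code."""

    def __init__(self):
        self.consumers = {}        # consumer key -> consumer secret
        self.request_tokens = {}   # request token -> {"consumer", "user"}
        self.access_tokens = {}    # access token -> {"consumer", "user"}

    def register_consumer(self, name):
        """One-time setup: hand a key and secret to a third party."""
        key, secret = name + "-key", secrets.token_hex(16)
        self.consumers[key] = secret
        return key, secret

    def request_token(self, consumer_key):
        """Start an authorization conversation for a known consumer."""
        if consumer_key not in self.consumers:
            raise PermissionError("unknown consumer")
        token = secrets.token_hex(8)
        self.request_tokens[token] = {"consumer": consumer_key, "user": None}
        return token

    def authorize(self, token, user, approved):
        """Called from the authorization page once the user answers."""
        if approved:
            self.request_tokens[token]["user"] = user
        return approved

    def access_token(self, consumer_key, token):
        """Exchange an authorized request token for an access token."""
        entry = self.request_tokens.pop(token, None)
        if entry is None or entry["consumer"] != consumer_key or entry["user"] is None:
            raise PermissionError("request token not authorized")
        access = secrets.token_hex(8)
        self.access_tokens[access] = entry
        return access

    def user_for(self, access):
        """API calls use this to authenticate the user behind a token."""
        return self.access_tokens[access]["user"]


provider = ToyProvider()
key, _secret = provider.register_consumer("SimpleService")
req = provider.request_token(key)
provider.authorize(req, "alice", approved=True)
access = provider.access_token(key, req)
```

After this runs, provider.user_for(access) gives back "alice", which is the lookup a real provider would do on every API call that carries the access token.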

A Request for Android, Tim Bray, and Google


Mr. Bray has joined the empire of no evil, the Android group no less, and has been writing useful things about Android and Nexus One. He also has deep roots in the world of scripting, has been an advocate, and has been fearless in his experiments with languages new fangled and old.

So I’d like to make a request of Tim, one that I think would make a tremendous impact. Tim: please help bring scripting to Android development.

I know that a lot of people like Java and find the current development environment just dandy. Which is great. But many other reasonable people would prefer to keep their hands clean of Java and feel a greater degree of productivity using higher level languages.

Imagine a scenario where you could write a Python, Ruby, or Javascript script, get it onto the phone using a simple interface (eg. just upload it to a url), and have a native app. Imagine how many more people would be developing apps, and how much more quickly.

I’m looking for something like this: supported and documented as a standard part of the Android SDK, all reasonable APIs needed to develop native apps exposed as Javascript (and/or Python or Ruby, but Javascript is likely the widest reaching bet). And a reasonable packaging process that is only slightly more complex than tar.

There is absolutely no reason this can’t be achieved. In fact projects like Appcelerator and PhoneGap have already made tremendous strides in this direction. All it takes is a believer to take the initiative and make it happen.

I feel Tim is that believer. And so I humbly submit, Mr. Bray, that the most important impact you could have on Android is to embed a love and support of scripting languages into the SDK. Pretty please.

How To Use curl To Upload a File While Limiting Bandwidth


For future reference:

I needed to simulate a slow connection for testing an HTTP file upload, time the results, and see how reliable it was. Turns out it’s all doable with curl using the right set of incantations. Here they are:


curl -F file=@/tmp/sample-large-image.jpg -F some_parameter=1027504 \
    -u myusername:mypassword -w "\nTIME: %{time_total}\n" \
    --limit-rate 10k http://somewebsite.com/api/upload/

What this is saying is:

  • Upload the file /tmp/sample-large-image.jpg . Note the “@” symbol – that’s what tells curl this is a file upload.
  • Set the parameter some_parameter to 1027504
  • Use HTTP basic auth to login, with user “myusername” and password “mypassword”
  • Include the word “TIME” followed by how long the upload took in the output
  • Limit the upload bandwidth to 10k bytes per second
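To sanity check the TIME value curl reports, it helps to know roughly how long the upload should take at the chosen limit. Here is a small Python sketch of that arithmetic. It assumes curl treats the “k” suffix as 1024 bytes, and the function name and the 2 MiB file size are made up for the example.

```python
def expected_upload_seconds(size_bytes, limit_rate="10k"):
    """Estimate the minimum upload time at a curl-style --limit-rate value."""
    # Assumption: suffixes are powers of 1024, as curl documents them.
    multipliers = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    suffix = limit_rate[-1].lower()
    if suffix in multipliers:
        bytes_per_second = float(limit_rate[:-1]) * multipliers[suffix]
    else:
        bytes_per_second = float(limit_rate)
    return size_bytes / bytes_per_second


# A hypothetical 2 MiB image uploaded with --limit-rate 10k.
estimate = expected_upload_seconds(2 * 1024 * 1024, "10k")
```

If the measured time comes out far above the estimate, the extra is coming from somewhere other than the rate limit, such as the server or the network.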

curl continues to amaze with its flexibility.

Django-mptt: Tree Storage in Django: A Brief Overview


django-mptt is a library for storing tree oriented data using the Django ORM. It allows you to place your model instances into a tree structure and efficiently query for ancestors and children.

Here’s a brief tutorial on how to use it:

After installing, you’ll need to modify your model to include a “parent” field, and register it with mptt:

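from django.db import models
import mptt

# Contact is another model defined elsewhere in the app
class Person(models.Model):
    contact   = models.ForeignKey( Contact, db_index=True )
    role      = models.CharField(max_length=20, blank=True)
    # The tree link: null for a root node, otherwise the parent Person
    parent    = models.ForeignKey('self', null=True, blank=True, related_name='children')

    def __unicode__(self):
        return "Person: <%s>" % (self.contact.email, )

# Register the model so mptt adds its tree fields and methods
mptt.register(Person)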

mptt dynamically adds fields to your model, so you’ll need to syncdb after you’ve added the parent attribute and the mptt.register call to your model.

The basics are fairly easy to use:

To move a node to the root of the tree, use move_to with a target of None:

person1.move_to(None)
person1.save()

To make a node the child of another, set its parent:

person2.parent = person1
person2.save()

To find the children of a node, use the children field:

>>>person1.children.all()
[<Person: Person: <test2@testing.com>>, <Person: Person: <test3@testing.com>>]

Here’s a little snippet of code to setup a 15 node tree where each node has two child nodes:

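[UPDATE] An earlier version of this snippet was not correct: you have to save each node as you update it, then look it up again. You can’t modify a node, save it, then keep using the reference you already have for it. The version below looks each parent up again before using it.

# mod is the app's models module
contacts = []
people = []
for n in range(15):
    c = mod.Contact(email="test" + str(n) + "@testing.com")
    c.save()
    contacts.append(c)
    p = mod.Person(contact=c)
    p.save()
    people.append(p)

people[0].move_to(None)  # Root
people[0].save()
for n in range(1,15):
    # Look the parent up again so its tree fields are current;
    # // keeps the index an integer
    people[n].parent = mod.Person.objects.get(pk=people[(n-1)//2].pk)
    people[n].save()

# Reload every node so the references used below are fresh
people = [mod.Person.objects.get(pk=p.pk) for p in people]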

Now let’s take a look around:

>>>people[7].parent
<Person: Person: <test3@testing.com>>

>>>people[3].children.all()
[<Person: Person: <test7@testing.com>>, <Person: Person: <test8@testing.com>>]

Now let’s move things around a bit; we’ll take person3, which is 2 levels down from the root, and make it a direct child of the root:

>>>people[3].parent = people[0]
>>>people[3].save()

>>>people[0].children.all()
[<Person: Person: <test1@testing.com>>, <Person: Person: <test2@testing.com>>, <Person: Person: <test3@testing.com>>]

And we can look at the ancestors of a given node:

people[14].get_ancestors()
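For a feel of why ancestor queries can be cheap, here is a plain Python sketch of the general modified preorder tree traversal idea that gives mptt its name: each node gets a left and a right number, and a node’s ancestors are exactly the nodes whose numbers enclose its own. This illustrates the technique only and is not django-mptt’s actual code. The function names are invented, and the tree has the same shape as the 15 node example above.

```python
def build_mptt(children, root):
    """Assign (left, right) numbers with a depth-first walk, starting at 1."""
    bounds = {}
    counter = 1

    def visit(node):
        nonlocal counter
        left = counter
        counter += 1
        for child in children.get(node, []):
            visit(child)
        bounds[node] = (left, counter)
        counter += 1

    visit(root)
    return bounds


def ancestors(bounds, node):
    """Ancestors are the nodes whose interval strictly encloses the node's."""
    left, right = bounds[node]
    found = [n for n, (l, r) in bounds.items() if l < left and r > right]
    return sorted(found, key=lambda n: bounds[n][0])  # root first


# Node n hangs off node (n - 1) // 2, giving a 15 node binary tree.
children = {}
for n in range(1, 15):
    children.setdefault((n - 1) // 2, []).append(n)

bounds = build_mptt(children, 0)
print(ancestors(bounds, 14))
```

In a database the same comparison becomes a single range condition on two indexed columns, so finding ancestors does not require walking parent links one at a time.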

Move Files Older Than X Days To Another Directory


Here’s a little script for finding files modified more than 7 days ago and moving them to another directory:


find . -type f -mtime +7 -print > /tmp/old_files.txt
while IFS= read -r line; do mv "$line" ../old_files/ ; done < /tmp/old_files.txt
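If you would rather do the same thing from Python, here is a rough equivalent. It is a sketch, not a drop-in replacement: the function name is made up, the age cutoff is a plain 7 × 24 hours rather than find’s whole-day rounding, and files with the same name in different subdirectories end up under one name in the destination.

```python
import shutil
import time
from pathlib import Path


def move_old_files(source, destination, days=7):
    """Move regular files under source last modified more than `days` days ago."""
    source, destination = Path(source), Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    cutoff = time.time() - days * 24 * 60 * 60
    moved = []
    # Collect first so moving files does not disturb the directory walk.
    for path in [p for p in source.rglob("*") if p.is_file()]:
        if destination in path.parents:
            continue  # skip files that already live in the destination
        if path.stat().st_mtime < cutoff:
            target = destination / path.name
            shutil.move(str(path), str(target))
            moved.append(target)
    return moved
```

Calling move_old_files(".", "../old_files") mirrors the shell version above.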

Eclipse + PyDev : I Recommend It


I used to be a vi guy who finally made the move to graphical editors. I looked for the simplest, lightest possible solutions, using ConTEXT for quite a while.

Some years ago I was forced into using Eclipse for reasons I can’t quite recall; probably Java development. I didn’t like it – the forced Project concept, the bloat, the general slowness.

Eventually I got comfortable with it, got PyDev installed, and made it my primary development environment. These days most of my development lives in Eclipse.

With the 1.5 release PyDev included quite a few previously pay-only features in the free / open source version. Since then I’ve found I’ve become even more productive in the environment, and now actually enjoy it.

In particular, the code analysis is very useful. I love the fact that it points out unused imports and variables as well as syntax errors. Going through old code I was surprised at how many spurious imports I had, as well as a few actual errors in code that had been in production for several years in rarely exercised branches.

If you’re doing Python development I recommend you take a look at Eclipse+PyDev. I was surprised at the level of increased productivity it brought me.

Tim Bray on Design Patterns, Threads, and COM


From Tim Bray:

My experience suggests that there are few surer ways to doom a big software project than via the Design Patterns religion. Also, that multi-threading is part of the problem, not part of the solution; that essentially no application programmer understands threads well enough to avoid deadlocks and races and horrible non-repeatable bugs. And that COM was one of the most colossal piles of crap my profession ever foisted on itself.

You may have read my own two cents on design patterns (hate them) and threads (not a huge fan).

Finding and Fixing Slow MySQL Queries


Notes to self, for future reference.

Edit /etc/mysql/my.cnf . Uncomment the following lines:

log_slow_queries        = /var/log/mysql/mysql-slow.log
long_query_time = 2
log-queries-not-using-indexes

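This enables logging of slow queries (with long_query_time = 2, anything that takes longer than 2 seconds) and of queries not making use of indexes. On newer MySQL versions log_slow_queries has been replaced by slow_query_log and slow_query_log_file.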

Now tail -f the mysql-slow.log file. You’ll see the slow and non-index-using queries.

Grab a query that you’d like to examine. Open a mysql shell and run “explain” on it:

explain your_query_here;

You’ll see output that looks like:

+----+-------------+---------------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table               | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+---------------------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | some_table          | ALL  | NULL          | NULL | NULL    | NULL |  166 | Using where |
+----+-------------+---------------------+------+---------------+------+---------+------+------+-------------+

Look at the rows and key fields in particular. rows shows MySQL’s estimate of how many rows it has to examine for the query – you don’t want to see a high number. If you’re not using an index you may be doing a full table scan (type ALL, as in the output above), examining every single row to find the value you’re looking for.

key shows which index, if any, was used. If you see NULL then no index was used.

To make the query go faster you may need an index. Look at the query and see what you’re selecting based on, and then create the corresponding index. For example, if we’re doing the selection based on the “myfield” field, you could create the index:

CREATE INDEX myapp_mytable_myfield_idx on myapp_mytable(myfield);

Once you create the index you should see that query no longer appearing in the slow query log.
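The before and after effect of an index is easy to try out without a MySQL server. The Python sketch below uses the standard library’s sqlite3 module and SQLite’s EXPLAIN QUERY PLAN as a stand-in for MySQL’s explain. The output format differs from the MySQL table above, and the table contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myapp_mytable (id INTEGER PRIMARY KEY, myfield TEXT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO myapp_mytable (myfield, payload) VALUES (?, ?)",
    [("value%d" % n, "x" * 20) for n in range(1000)],
)


def query_plan(connection, sql, args=()):
    """Return the human readable detail column of EXPLAIN QUERY PLAN."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, args).fetchall()
    return " | ".join(str(row[-1]) for row in rows)


sql = "SELECT * FROM myapp_mytable WHERE myfield = ?"
before = query_plan(conn, sql, ("value42",))   # no index yet: a full scan
conn.execute("CREATE INDEX myapp_mytable_myfield_idx ON myapp_mytable(myfield)")
after = query_plan(conn, sql, ("value42",))    # the plan should now name the index
print(before)
print(after)
```

The first plan reports a scan of the whole table, and the second reports a search using myapp_mytable_myfield_idx, which is the same change you are looking for in the key column of the MySQL output.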

Rinse and repeat for other queries.

XML The Evil Adult


Entertaining if long winded rant from Erik Naggum:

When the markup overhead exceeds 200%, when attributes values and element
contents compete for the information, when the distance between 99%
of the “tags” is /zero/, when the character set is Unicode, and when
validation takes more time than processing, not to mention the sorry
fact that information longevity is more /threatened/ by XML than by
any other data representation in the history of computing, then SGML
has gone from good kid, via bad teenager, to malfunctioning, evil
adult as XML.

Easiest Way To Create UML Sequence Diagrams


In need of creating various UML Sequence Diagrams and in no mood to deal with Visio or its other graphical editor ilk, I did some research and ended up at a really nice solution.

What I wanted was to be able to specify my sequence using a simple text description and have a tool turn that into a sequence diagram. What I found was a tool called Sequence by Alex Moffat that does exactly that. The syntax takes a minute to get used to, but once you see what’s going on it’s very easy to use. The generated images are reasonably attractive and easily exportable to png, gif, etc.

Here’s a sample of the syntax:


objectOne.methodOne {
  objectTwo.methodTwo -> value {
    objectThree.methodThree -> anotherValue;
    objectFour.methodFour->finalValue;
  }
}

And here’s the resulting diagram:

UML Sequence Diagram

Once you download the jar and run it, hit the “Help” option and look at the examples. They’re self-explanatory and fairly complete.

Preserve JavascriptDB: Yet Another Non-Traditional Data Store


Non-traditional data stores are coming fast and furious these days. Here’s another interesting one: Preserve with JavascriptDB. This one I’d like to check out.

Proxies For Request Modification?


Interesting post from igvita on Ruby Proxies for Scale and Monitoring discussing the use of Ruby and EventMachine to create simple proxies for monitoring, benchmarking, content examination, and even request modification.

I’ve always wanted to do benchmarking as Ilya suggests. Real production traffic is the best way to test. Good stuff.

I’m tempted by the beanstalkd use case as well – he uses his proxy to detect and route certain requests to an archiving MySQL instead of to his beanstalkd instance. I’m leery of maintainability issues however – I’ve generally found indirection, particularly at the wire protocol level, can quickly lead to hard-to-find bugs.

Something to experiment with at some point.

Notes On Distributed Key Stores


Leonard Lin posts his notes on distributed key stores. His requirements are fairly similar to mine so I read with interest. Short version: he likes Tokyo Cabinet / Tokyo Tyrant with his own consistent hashing scheme thrown on top. 

Btw, it’s interesting how much interest there suddenly is in distributed key-value stores – everyone I know is using or evaluating one. How did we live without them for so long? Gasp.

Distributed Database Talk


Very informative PyCon talk on various fancy distributed data stores, including BigTable, Dynamo, Cassandra, and several others.

 

Tokyo Cabinet Observations


I’m using Tokyo Cabinet with Python tc for a decent sized amount of data (~19G in a single hash table) on OS X. A few observations and oddities:

  • Writes slow down significantly as the database size grows. I’m writing 97 roughly equal sized batches to the tch table. The first batch takes ~40 seconds, and processing time seems to increase fairly linearly, with the last taking ~14 minutes. Not sure why this would be the case, but it’s discouraging. I’ll probably write a simple partitioning scheme to split the data into multiple databases and keep the size of each small, but it seems like this should be handled out of the box for me.
  • [Update] I implemented a simple partitioning scheme, and sure enough it makes a big difference. Apparently keeping the file size small (where small is < 500G) is important. Surprising – why doesn’t TC implement partitioning if it’s susceptible to performance issues with larger file sizes? Is this a python tc issue or a Tokyo Cabinet issue?
  • [Also] Seems I can only open 53-54 tc.HDB()’s before I get an ‘mmap error’, limiting how much I can partition.
  • Reading records that have already been read from the tch seems to go much faster on the second access (like an order of magnitude faster). I suspect this is the disk cache at work, but if anyone has extra info on this please enlighten me.

Another somewhat surprising aspect: using the tc library you’re essentially embedding Tokyo Cabinet in your app; I had assumed it was going to be network based access, but it’s not. You can do network access either using the memcached protocol or using pytyrant.
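For reference, a simple partitioning scheme of the kind mentioned in the update can be sketched in a few lines of Python. This is a guess at the general shape, not the code I actually ran: plain dicts stand in for the Tokyo Cabinet hash databases, and the class name and the partition count of 16 are made up. In practice the count would have to stay under the 53-54 open handle limit noted above.

```python
import hashlib


class PartitionedStore:
    """Spread keys over a fixed number of smaller stores by hashing the key."""

    def __init__(self, partitions=16, open_partition=dict):
        # open_partition() returns a dict-like store. With Tokyo Cabinet it
        # would open one hash database file per partition instead (not shown).
        self.stores = [open_partition() for _ in range(partitions)]

    def _store_for(self, key):
        # A stable hash, so the same key always lands in the same partition.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.stores[int(digest, 16) % len(self.stores)]

    def __setitem__(self, key, value):
        self._store_for(key)[key] = value

    def __getitem__(self, key):
        return self._store_for(key)[key]


store = PartitionedStore(partitions=16)
for n in range(1000):
    store["key%d" % n] = "value%d" % n
```

The catch with a fixed modulo like this is that changing the number of partitions moves most keys, so the count needs to be chosen up front.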

Development Stalemate

0

Sometimes all you have left are large contiguous pieces of work requiring concentration and small fragmented pieces of time full of interruption.

Are Schemas A Thing Of The Past?


Schemas often turn out to be significant barriers to innovation – adding a feature that requires a schema change brings with it the difficult and time consuming task of actually changing the schema in your live database and migrating historical data. I know from first-hand experience with large systems that this frequently turns out to be the most time consuming and complex aspect of launching new versions of software.

Lately I’ve been more and more tempted by data stores with relaxed schemas. In an attempt to better understand the pros and cons of schemas I’m enumerating my thoughts, mostly from a user’s perspective, and hoping that some of my DB expert readers (I know you’re reading this, if you don’t respond here I’ll bug you in person!) will chime in with their thoughts on the pros of strict schemas.

The Aesthetics. From a developer’s perspective I find strict-schema systems (general RDBMSes) confining – some rows simply want to have more columns than others. I can model it with relationships and foreign keys, but the multi-way joins are somehow ugly to me, and I know they make my performance minded DBA buddies upset. It’s also just not pretty, splitting up this field that naturally wants to sit in a row with the rest of his friends into his own separate table. 

The Exceptions. I’ve also seen, in literally every large scale data analysis I’ve done, that even in strict-schema systems junk gets into the system. Just last week some analysis blew up because a field that could not be NULL was in fact NULL in some odd cases. The DBA can’t explain why, but it’s there in the data. 

The Discontinuity. Changes do in fact have to happen even in strict schema systems. When they do, there’s an ugly breakage – tools and code deal only with before-change or after-change versions of the schema, not both, because they’re built assuming the schema is well defined and static. I’m looking at the effects of this right now – in one project the current analysis tools can’t be applied before a certain date, because the schema changed on that date.

The Static In A Dynamic System. We’ve found freedom in dynamic languages, allowing us to define our data structures in fluid, natural ways. Strict schemas feel like an injection of Java in a Python system: over-defined, confining, and unnatural. The data store should be as flexible as the programming language – data is code and code is data, or something like that.

To be fair, schema-less systems have plenty of cons as well. Truth be told I don’t have very much real-world experience with them, so most of the cons are probably hidden from me, but I do know that schema-less systems are generally a bear to deal with after they’ve existed for a few years. Zombie objects and data infest your data store and crumb up the works. Tooling, general knowledge, and best practices are also severely lacking compared to static-schema systems, at least at this point in time.

So I’m very tempted to go and try a system with less strict schema requirements. Perhaps something like the FriendFeed MySQL setup. If you can enlighten me on the benefits I’d be foregoing by going schema-less, please leave a comment.

How Do You Speed Up Django Unit Test Execution?


As I’ve mentioned before I’m a fan of unit tests, mostly because of the freedom they give you to redesign and clean up code without fear of breaking functionality.

However, they’re also a source of distraction for my distracted mind. In particular, they take a not-insignificant amount of time to run when they involve hitting the database on my local Windows/SQLite setup.

This time is just long enough for me to switch over and check email / twitter / whatever. And just like that, I’m off on a tangent, I’ve lost context, and I’ve wasted my own time.

Of course the solution is for me to exert restraint and patience. But I’m impatient. 

So, is there a way to make unit tests in Django go blindingly fast?

JSON faster than Thrift and Protocol Buffers?


This is a bit surprising. According to Justin compressed JSON is faster than both Thrift and Protocol Buffers when used with Python (via Dion). I had previously asked about performance comparison of Thrift and Protocol Buffers on StackOverflow, but I had assumed JSON would be significantly slower. Maybe not. I’d love to see more on this.

The Growth of Laptop Sized Problems


I like the term laptop-sized problem for referring to problems that can be effectively solved using a laptop. As in: x million rows in your database? x million pages per day? That’s a laptop-sized problem.

I like it because it’s a good common-sense check – if you’re implementing a complex solution to a laptop-sized problem you’re probably doing it wrong. 

I’m amazed by how the size of laptop-sized problems has grown. These days most laptops are multi-core and can easily be outfitted with 4G of memory. You can solve some pretty large problems with just a laptop.

Doing data analysis on some fairly large data sets I decided to first implement a basic Python/NumPy solution before rolling out the Hadoop/EC2 version. Turns out the laptop version does make my laptop groan, but is workable even with the full data size. I can skip being smart and just do the simple thing.

Imagine the size of problems we’ll be able to tackle with simple solutions on a laptop in a few years.
