Archive for the 'Python' Category


Python Based Key Sniffer In 10 Lines

I love Autohotkey but I’m not crazy about its programming language, so I decided to investigate building an alternative with a simpler language, namely Python.

Turns out the key sniffing portion is actually quite easy to do. Here’s a simple script from the Keylogger in Python thread that does it in 10 lines using pyHook and Mark Hammond’s Win32 Extensions:


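import pyHook
import pythoncom

def OnKeyboardEvent(event):
    print event.Ascii
    return True  # pass the event on to other handlers

hm = pyHook.HookManager()
hm.KeyDown = OnKeyboardEvent
hm.HookKeyboard()

# PumpMessages blocks and dispatches hook events until the process exits
pythoncom.PumpMessages()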

Mindtrove’s post has further details including code and examples for event filtering.

Python Dot Notation Dictionary Access

In most cases I prefer dot notation over bracket notation for dictionary access. That is, I prefer mydictionary.myfield over mydictionary['myfield']. I also prefer attempted access to undefined keys to return None instead of raising an exception.

With the help of this thread, this is what I’ve been using:



class dotdict(dict):
    def __getattr__(self, attr):
        return self.get(attr, None)
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__

>>> dd = dotdict()
>>> dd.a
>>> dd.a = 'one'
>>> dd.a
'one'
>>> dd.keys()
['a']

>>> existing = {'a':'A', 'b':'B'}
>>> dot_existing = dotdict(existing)
>>> dot_existing.a
'A'
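
One caveat, sketched here in Python 3 syntax with the same class as above (the key names are only for illustration): normal attribute lookup finds the real dict methods before it ever reaches __getattr__, so a key that collides with a method name can only be read back with brackets.

```python
class dotdict(dict):
    """dict whose keys can also be read, set and deleted as attributes."""
    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails
        return self.get(attr)
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__

dd = dotdict(a='one')
dd.items = 'shadowed'                 # stored as the key 'items'
missing = dd.no_such_key              # None, no exception
still_method = callable(dd.items)     # dd.items is still the dict method
via_brackets = dd['items']            # the stored value
```

So it works nicely for field-like keys, but keys such as items, keys or get still need dd['items'].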

Python Scripts For Dumping Oracle Data And Loading Onto Hadoop DFS

There have been several requests for this, so I might as well post it here for general use. I put together a simple system for dumping data out of Oracle databases and loading onto Hadoop DFS. The slightly interesting part is the parallelism - Python’s Processing library is used to dump partitions in parallel and copy and load them onto DFS in parallel. This helps when dumping large amounts of data from partitioned Oracle tables.

The database interaction is handled by db.py. There are a couple of helper functions for finding table partitions, etc. DBDumper dumps the requested fields from the requested table:


dumper = db.DBDumper('username/password@yourhost:9999/DB', 'table_name',
      ('field1', 'field2', 'field3'), 'owner', 'partition', 'output_dir', 10)
dumper.dump(cp)

Where 10 is the level of concurrency, owner is the owner of the table, and partition is the name of the partitions you’re interested in (can be None).

dfs.py copies the dumped files over in parallel, again using PyProcessing. It’s simply a wrapper around “cat | ssh | hadoop dfs -put”.

DBDumper and dfs are tied together via a callback - when each partition is dumped, the callback is invoked, triggering the dfs copy.
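
The callback pattern can be sketched roughly as follows. This is not the code from db.py or dfs.py: dump_partition, on_dumped and dump_all are made-up names, the dump is faked, and ThreadPool stands in for a process pool to keep the example short (it has the same apply_async interface as multiprocessing.Pool, the standard-library descendant of Processing).

```python
from multiprocessing.pool import ThreadPool

def dump_partition(partition):
    """Stand-in for the real dump: pretend one file was written per partition."""
    return ('/tmp/%s.dat' % partition, partition)

copied = []

def on_dumped(result):
    """Callback: runs once per finished dump, where the DFS copy would start."""
    local_path, partition = result
    copied.append((local_path, '/some/directory/' + partition))

def dump_all(partitions, concurrency=10):
    pool = ThreadPool(concurrency)
    for partition in partitions:
        pool.apply_async(dump_partition, (partition,), callback=on_dumped)
    pool.close()
    pool.join()   # wait for every dump and its callback to finish

dump_all(['p2008_01', 'p2008_02', 'p2008_03'])
```

The order in which partitions finish is not guaranteed, so the callback should not assume any ordering.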

Here’s a complete example of using these to dump and copy data:


import db
import dfs

fs = dfs.RemoteDFS('address.of.remote.machine')

def cp(arg):
    print "CALLBACK:", arg
    fs.cp(arg[0], '/some/directory/' + arg[1] + '/' + arg[2])

dumper = db.DBDumper('username/password@yourhost:9999/DB', 'table_name',
     ('field1', 'field2', 'field3'), 'owner', 'partition', 'output_dir', 10)
dumper.dump(cp)

Python Script For Finding And Removing Duplicate Files

My image, mp3, and ebook collections were a mess after years of copying to various servers, consolidating, and re-copying. I had lots of duplicates.

I looked for an app to find and remove duplicates but surprisingly didn’t find anything very good. So I had to write my own.

This is a very simple script - it scans the directory tree you specify, looks for exact duplicates, and removes the duplicates.

It’s not very smart about which copy it removes. It’s not smart about finding files that are “similar” - it only finds exact matches. It ignores small files (intentionally - it’s easy to make it deal with small files).

It uses /temp for its output and cache files, so it’s targeting Windows. Change that to /tmp if you’re running Unix.

I built in a caching mechanism to save the results of scanning the disk, but it turned out not to be too useful and the script ran faster than I expected, so the caching is commented out.

Here it is: FileInfo.py .
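
For readers who just want the idea, here is a minimal sketch of the approach described above, written for Python 3. It is not the contents of FileInfo.py: the function names, the 1 KB size cutoff and the use of SHA-1 are my assumptions. It groups files by size first, hashes only files that share a size, and reports the groups without deleting anything.

```python
import hashlib
import os
from collections import defaultdict

def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are never fully in memory."""
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root, min_size=1024):
    """Return lists of paths under root whose contents are identical."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size >= min_size:          # ignore small files
                by_size[size].append(path)
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue                      # unique size, cannot be a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Removing the extras would then be a matter of calling os.remove on every path after the first in each group, once you have decided which copy to keep.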

Access Python Dictionary Keys As Properties

Say you want to access the values of your dictionary via the dot notation instead of the dictionary syntax. That is, you have:


d = {'name':'Joe', 'mood':'grumpy'}

And you want to get at “name” and “mood” via


d.name
d.mood

instead of the usual


d['name']
d['mood']

Why would you want to do this? Maybe you’re fond of the Javascript Way. Or you find it more aesthetic. In my case I need to have the same piece of code deal with items that are either instances of Django models or plain dictionaries, so I need to provide a uniform way of getting at the attributes.

Turns out it’s pretty simple:



class DictObj(object):
    def __init__(self, d):
        self.d = d

    def __getattr__(self, m):
        return self.d.get(m, None)

d = DictObj(d)
d.name
# prints Joe
d.mood
# prints grumpy
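
To illustrate the uniform-access use case, here is a small sketch in Python 3 syntax. Person is a made-up stand-in for a Django model instance and describe is a hypothetical function; the point is only that the same code can read .name and .mood from either kind of item.

```python
class DictObj(object):
    def __init__(self, d):
        self.d = d

    def __getattr__(self, m):
        # Called only for attributes not found normally, i.e. the dict keys
        return self.d.get(m, None)

class Person(object):
    """Hypothetical stand-in for a Django model instance."""
    def __init__(self, name, mood):
        self.name = name
        self.mood = mood

def describe(item):
    # Works for anything exposing .name and .mood
    return '%s is %s' % (item.name, item.mood)

rows = [Person('Ann', 'cheerful'), DictObj({'name': 'Joe', 'mood': 'grumpy'})]
descriptions = [describe(row) for row in rows]
```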

Beanstalkd / Python Basic Tutorial

(First install beanstalkd and pybeanstalk)

Beanstalkd is an in-memory queuing system. It supports named queues (called ‘tubes’), priorities, and delayed delivery of messages.

Terminology: a message is called a job, and queues are called tubes.

Let’s look at an example scenario. Say you want to create 2 tubes, one called “orders” and another called “emails”, place orders into the first tube (or queue) and emails into the second, and have different processes handle orders and emails.

You can create queues by simply naming and putting messages into them. On the producer side:


from beanstalk import serverconn
c = serverconn.ServerConn('localhost', 99988)

# put a message (or job) into the default queue:
c.put('first message, into default tube')

# now start using a named tube:
c.use('orders')
c.put('second message, into orders tube')

Now on the consumer:


from beanstalk import serverconn
c = serverconn.ServerConn('localhost', 99988)

# by default your connection will be listening on the 'default' tube.
# switch it to use the 'orders' tube.
# This should return the 'orders' message and ignore the 'default' message:
c.watchlist = ['orders']
j = c.reserve()
print j
# {'data': 'second message, into orders tube', 'jid': 39, 'bytes': 32, 'state': 'ok'}

You can similarly set up another consumer to listen on only the ‘emails’ tube, or both, or any other scenario you want.

Beanstalkd also supports priorities, with 0 being highest priority and higher numbers meaning lower priority. You define message priority with:


c.put('low priority message', pri=999 )
c.put('high priority message', pri=0 )

j = c.reserve()
print j
# {'data': 'high priority message', 'jid': 41, 'bytes': 21, 'state': 'ok'}
# the high priority message was delivered before the low priority message, even
# though the low priority message was first into the queue

The beanstalkd consumption model is to “reserve” a message (or job), process it, and then tell beanstalkd you’ve successfully dealt with the message so it can be thrown away. When you first get the job via c.reserve() you haven’t actually fully consumed it; you’ve just reserved it for processing.

What does this mean? Imagine a scenario where you reserve a message but your process dies before you have a chance to fully process it. Beanstalkd holds your message in reserve for a period of time, but since it hasn’t heard from you confirming you’ve successfully dealt with the message, it eventually removes the reservation and makes the message available once again for the next consumer to grab. This is a basic handshake between the consumer and the server to allow for some level of resiliency.

So once you’re finished dealing with the message you’ve reserved, be sure to “delete” it, letting the beanstalkd server know it can throw that message away:


j = c.reserve()
# do some processing with j
c.delete(j['jid'])
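
If the reserve/delete handshake is still fuzzy, the toy model below may help. It is an in-memory simulation in Python 3, not the pybeanstalk API and not how the server is implemented: ToyTube and its methods are invented for illustration, and time is passed in explicitly as now so the behaviour is easy to follow. The ttr argument mirrors beanstalkd’s “time to run”, the period a reservation is held.

```python
import heapq
import itertools

class ToyTube(object):
    """In-memory model of one tube: put, reserve, delete, reservation expiry."""
    def __init__(self):
        self._ready = []                  # heap of (pri, jid, data)
        self._reserved = {}               # jid -> (deadline, pri, data)
        self._ids = itertools.count(1)

    def put(self, data, pri=100):
        jid = next(self._ids)
        heapq.heappush(self._ready, (pri, jid, data))
        return jid

    def reserve(self, now, ttr=60):
        self._expire(now)
        if not self._ready:
            return None
        pri, jid, data = heapq.heappop(self._ready)   # lowest pri number first
        self._reserved[jid] = (now + ttr, pri, data)
        return {'jid': jid, 'data': data}

    def delete(self, jid):
        # The consumer confirms it is done; the job is gone for good
        del self._reserved[jid]

    def _expire(self, now):
        # Reservations never confirmed in time go back on the ready queue
        for jid, (deadline, pri, data) in list(self._reserved.items()):
            if deadline <= now:
                del self._reserved[jid]
                heapq.heappush(self._ready, (pri, jid, data))
```

A job that is reserved but never deleted shows up again for the next reserve call once its deadline has passed, which is the resiliency described above.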

So there you have the basics. Let me know if you’re interested and I can cover a few more topics.

Setting Up Beanstalkd on Ubuntu for Python

beanstalkd is a promising in-memory queuing system in the mold of memcached (minimal configuration, just works) with client libraries in a variety of languages. The following worked for me for installing it on Ubuntu 8.04:


mkdir ~/packages

# pre-requisite: libevent.
cd ~/packages
wget http://monkey.org/~provos/libevent-1.4.8-stable.tar.gz
tar zxvf libevent-1.4.8-stable.tar.gz
cd libevent-1.4.8-stable
./configure
make
sudo make install

# add /usr/local/lib to your load library path so beanstalkd can find libevent
vi ~/.bashrc   # add the following line somewhere near the end, then save and exit vi:
#   export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib

source ~/.bashrc

# need git in order to get latest code for beanstalkd
cd ~/packages
sudo apt-get install git-core

# grab beanstalkd
git clone http://xph.us/src/beanstalkd.git
cd beanstalkd
make

# now you should be able to start the beanstalkd daemon
./beanstalkd -d -p 99988

# get the python beanstalkd client
cd ~/packages
svn checkout http://pybeanstalk.googlecode.com/svn/trunk/ pybeanstalk-read-only

cd pybeanstalk-read-only
sudo python setup.py install

# get pyyaml, a pre-requisite for the python beanstalkd client
cd ~/packages
wget http://pyyaml.org/download/pyyaml/PyYAML-3.06.tar.gz
tar zxvf PyYAML-3.06.tar.gz
cd PyYAML-3.06
sudo python setup.py install

# open two different shells (or use screen) and run one of the python commands below in each:
cd ~/packages/pybeanstalk-read-only/examples
python simple_clients.py producer localhost 99988
python simple_clients.py consumer localhost 99988

Happy: Hadoop with Python (Jython)

The Freebase folks have open sourced their Python (Jython) based Hadoop framework, calling it Happy. Looks interesting, will need to give it a whirl when I get a chance.

Disco: Erlang/Python Based Map-Reduce

Disco is a map-reduce framework written in Erlang and Python. Seems reasonable - I definitely prefer Python to Java for writing maps and reduces, and Erlang is rumored to be good at parallel stuff.

Interestingly no mention of an underlying distributed file system.

Via High Scalability.

Python Generator Expressions

I wasn’t aware of generator expressions:


wwwlog     = open("access-log")
bytecolumn = (line.rsplit(None, 1)[1] for line in wwwlog)
bytes      = (int(x) for x in bytecolumn if x != '-')
print "Total", sum(bytes)

Similar to list comprehensions, but evaluated lazily. Voidspace describes it:

None of the generators are consumed until the final call to sum. As it iterates a line at a time (not keeping the log file in memory) it can handle huge log files - and as a bonus it runs faster than a typical solution with loops!
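
A self-contained way to see the laziness, in Python 3 syntax and with made-up log lines instead of a real access-log (total_bytes and tracked are names I invented for the sketch):

```python
def total_bytes(lines):
    """Sum the last column of each log line, skipping '-' entries."""
    bytecolumn = (line.rsplit(None, 1)[1] for line in lines)
    sizes = (int(x) for x in bytecolumn if x != '-')
    return sum(sizes)

sample = [
    '1.2.3.4 - - [31/Jan/2008:17:56:13 -0500] "GET / HTTP/1.0" 200 512',
    '1.2.3.4 - - [31/Jan/2008:17:56:14 -0500] "GET /x HTTP/1.0" 304 -',
    '5.6.7.8 - - [31/Jan/2008:17:56:15 -0500] "GET /y HTTP/1.0" 200 1024',
]
total = total_bytes(sample)

seen = []

def tracked(lines):
    """Record each line at the moment it is actually pulled."""
    for line in lines:
        seen.append(line)
        yield line

lengths = (len(line) for line in tracked(sample))
consumed_before = len(seen)     # nothing has been read yet
first_length = next(lengths)    # pulls exactly one line through the chain
consumed_after = len(seen)
```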

Neat.

A Beautiful Python Twitter API

Mike Verdone’s Python Twitter Tools is less popular and harder to find than DeWitt Clinton’s python-twitter (I only found the former from an email on the latter’s mailing list), but it’s a beautiful library. 125 lines, most of which are comments. It implements the full API by implementing a single call class that handles everything (and that class is only a few lines). A good library to use if you’re trying to access Twitter via Python, and a good source for learning and inspiration if you’re writing a library / wrapper.

Parsing (Top-Down) in Python

Excellent article on Simple Top-Down Parsing in Python. The nud and led business could be better explained, but the rest of the article and code is great. I learned several things I hope to employ shortly.

I’m trying to remember if we studied this in compiler class or not. I think not, although I have terrible memory, so it’s possible we did.

Anyway, a companion tutorial article that would approach this strictly from the perspective of using the toolkit Effbot built in his article would be nice. In other words, knowing the under-the-hood details is fantastic and informative, but given the tools and helper functions built in the article one could fairly easily build a parser without worrying about how the helper functions are implemented. Sort of a user manual for building parsers given the helper functions.
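
For what it’s worth, here is my own minimal take on the nud and led idea, in Python 3. It is a sketch and not the code from the article: the class names are mine, it evaluates as it parses instead of building a tree, and it only knows integers, +, - and *. nud (“null denotation”) is what a token does when it starts an expression; led (“left denotation”) is what it does when it appears after a left-hand operand; lbp is its left binding power.

```python
import re

class Literal(object):
    lbp = 0
    def __init__(self, value):
        self.value = value
    def nud(self, parser):            # a number starting an expression is itself
        return self.value

class Infix(object):
    def __init__(self, lbp, fn):
        self.lbp = lbp
        self.fn = fn
    def led(self, left, parser):      # combine the left operand with what follows
        return self.fn(left, parser.expression(self.lbp))

class Minus(Infix):
    def nud(self, parser):            # prefix use, as in "-2"
        return -parser.expression(100)

class End(object):
    lbp = 0

OPERATORS = {
    '+': lambda: Infix(10, lambda a, b: a + b),
    '-': lambda: Minus(10, lambda a, b: a - b),
    '*': lambda: Infix(20, lambda a, b: a * b),
}

class Parser(object):
    def __init__(self, text):
        self.tokens = self.tokenize(text)
        self.token = next(self.tokens)

    def tokenize(self, text):
        for number, op in re.findall(r'(\d+)|(\S)', text):
            yield Literal(int(number)) if number else OPERATORS[op]()
        yield End()

    def expression(self, rbp=0):
        t, self.token = self.token, next(self.tokens)
        left = t.nud(self)
        # Keep absorbing operators that bind tighter than our caller
        while rbp < self.token.lbp:
            t, self.token = self.token, next(self.tokens)
            left = t.led(left, self)
        return left

def evaluate(text):
    return Parser(text).expression()
```

The whole precedence mechanism is the while loop in expression: an operator’s led is only invoked when its lbp is higher than the binding power of whatever is waiting on the left.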

Using Django Signals To Watch For Changes To Instances

Say you want to monitor changes to instances of a model and update something based on the changes. In my example I wanted to maintain a sum of the values that had certain characteristics. You can accomplish this with Django Signals.

Signals are events that fire at various pre-defined moments - for example, before an instance is saved, after it’s saved, etc. You can subscribe to these events, allowing your callback handler to be called at those moments.

The code below subscribes to the post_init and post_save signals. post_init gets triggered when a model’s __init__ method is done executing, which generally means when a model instance is created for the first time or instantiated from a query to the DB. This is actually too frequent for the use case I have in mind (checking the before-modification and after-modification values of certain fields), but seems to be the only place I can hook in to get the pre-modification values.

post_save gets triggered after the instance is saved to the DB. The code below stores the pre-modification values in pre_save when it gets triggered by the post_init signal, and checks them against the post-modification values when it gets triggered by the post_save signal.

Note that you’ll probably want to clean up pre_save periodically. Unfortunately post_init and post_save are not symmetrical (you’ll get a post_init anytime an instance is created, for example when you query the DB), so you can’t simply delete from pre_save when the post_save signal gets triggered.


from django.dispatch import dispatcher
from django.db.models import signals

pre_save = {}

def change_watcher(sender, instance, signal, *args, **kwargs):
    print "SIGNAL:", sender, instance.report, signal, args, kwargs
    if signal == signals.post_init:
        pre_save[instance.id] = (instance.field1, instance.field2)
    else:
        if pre_save[instance.id][0] != instance.field1:
            print "Changed field1"
        if pre_save[instance.id][1] != instance.field2:
            print "Changed field2"

for signal in (signals.post_init, signals.post_save):
    dispatcher.connect(change_watcher, sender = Expense, signal = signal)

Running Shell Scripts from Python

There are about a million different ways to execute an external program from Python. Here’s the right way:


import subprocess

subprocess.call('''zcat %s | tr '\02\03' '\r\n' | ssh %s "hadoop dfs -put - %s"''' %
     (local, DEST, remote), shell=True)

Where what you put inside the call statement is what gets executed in a shell. In this case I used pipes, escaping, etc, just to show a slightly complicated example; you could do something as simple as “ls” instead.
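
When the command includes values you don’t control, interpolating them into a shell string is risky, and the same kind of pipeline can be wired up without a shell by connecting two processes directly. A rough sketch in Python 3, using throwaway Python one-liners as the two commands so it runs anywhere (they are placeholders, not part of the example above):

```python
import subprocess
import sys

# First process: writes three lines to its stdout
producer = subprocess.Popen(
    [sys.executable, '-c', "print('a'); print('b'); print('c')"],
    stdout=subprocess.PIPE)

# Second process: reads the first one's output on stdin and counts lines
consumer = subprocess.run(
    [sys.executable, '-c', "import sys; print(len(sys.stdin.readlines()))"],
    stdin=producer.stdout, stdout=subprocess.PIPE, universal_newlines=True)

producer.stdout.close()   # our copy of the pipe is no longer needed
producer.wait()
line_count = int(consumer.stdout)
```

Each argument is passed as a list element, so nothing gets re-parsed by a shell.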

Lots more info in Doug Hellmann’s PyMOTW article.

Multi-Process vs. Multi-Threaded Python Performance

Interesting benchmarks in PEP 371 — Addition of the Processing module to the standard library. In short, standard Python threads perform worse than non-threaded in almost every benchmark, while the multi-process Processing module generally does as you’d expect and reduces processing time proportional to the number of processors / cores.

You could argue this just means Python has a crappy thread implementation or that the benchmarks are biased, but it’s interesting. I actually have an occasion for a multi-headed app, so perhaps I’ll give Processing a try.

Via Doug Hellmann.

MemCache as a General Protocol

I’ve been messing around with Starling. Working well so far, although I haven’t done anything fancy with it yet.

I also need to create several simple, small, single purpose servers (eg. one to translate IP addresses to country codes). I have the python code, I just need to turn them into network accessible servers.

HTTP/JSON would’ve been my default choice for the service interface. However, taking inspiration from Starling, I’m thinking the memcache protocol is a great general purpose way of exposing services.

The memcache API is simple - get and set - and the data is (key, value). The set and get translate well to my interpretation of RESTful GET and POST, and I don’t have to worry about the data encoding. There are many memcache client libraries available for many languages, and they’ve been built with efficiency and speed in mind. Seems like a winner to me.
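
To get a feel for how little is needed, here is a rough Python 3 sketch of a server that answers only the memcache text protocol’s get command by calling a plain Python function. make_server and the COUNTRIES table are made up for illustration, and a real client may send commands this does not handle (set, version and so on), which simply get an ERROR reply.

```python
import socketserver

def make_server(lookup, host='127.0.0.1', port=0):
    """Return a TCP server that answers 'get <key>...' with lookup(key)."""
    class Handler(socketserver.StreamRequestHandler):
        def handle(self):
            for raw in self.rfile:                  # one command per line
                parts = raw.decode('ascii').split()
                if not parts:
                    continue
                if parts[0] == 'quit':
                    break
                if parts[0] != 'get':
                    self.wfile.write(b'ERROR\r\n')
                    continue
                for key in parts[1:]:
                    value = lookup(key)
                    if value is None:
                        continue                    # misses are simply omitted
                    data = value.encode('utf-8')
                    header = 'VALUE %s 0 %d\r\n' % (key, len(data))
                    self.wfile.write(header.encode('ascii') + data + b'\r\n')
                self.wfile.write(b'END\r\n')
    return socketserver.ThreadingTCPServer((host, port), Handler)

# Made-up sample data standing in for an IP-to-country lookup
COUNTRIES = {'192.0.2.1': 'US', '198.51.100.7': 'DE'}
```

Running make_server(COUNTRIES.get, port=11211).serve_forever() would expose the lookup on the usual memcached port; port=0 lets the OS pick a free one.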

So all I’d need is a framework for easily putting a memcache interface on a python code base. Know of any?

Starling in Python?

Starling looks very interesting - it’s a “light-weight persistent queue server that speaks the MemCache protocol”. To use it you fire up your regular memcached client library, point it at the Starling server, and do a regular set to put an item on the queue, and a get to read an item from the queue.


  # Start the Starling server as a daemonized process:
  starling -h 192.168.1.1 -d

  # Put messages onto a queue:
  require 'memcache'
  starling = MemCache.new('192.168.1.1:22122')
  starling.set('my_queue', 12345)

  # Get messages from the queue:
  require 'memcache'
  starling = MemCache.new('192.168.1.1:22122')
  loop { puts starling.get('my_queue') }

This thing is nice in many ways: it’s very simple with practically no configuration, ala memcached; it’s stable and scalable, running Twitter’s production backend clusters; and it speaks a simple and universally available protocol (memcache), meaning you can use any of the existing client libraries to access it.

This answers half of my request for a Python or Ruby messaging server (it does the work-queue half, doesn’t do the pub/sub half). I think I’m going to give it a try. Let me know if you’ve tried it.

It also has me thinking - with all of the multitude of async-IO capabilities out there for Python, why isn’t there something like this implemented in Python? Between Twisted, asynchat, Eventlet, and the 19 other libraries and toolkits out there, surely somebody smart could whip something together in short order?

Parsing and Normalizing Dates with Timezones in Python

This was a bit painful and not well documented, so documenting here for future reference.

Say you want to parse and normalize dates with timezones (e.g. dates in email headers, I believe based on RFC 822). Here’s what you do:

Install pytz.


import email, time, datetime
import pytz
utctimestamp = email.Utils.mktime_tz(email.Utils.parsedate_tz( msg['Date'] ))
utcdate= datetime.datetime.fromtimestamp( utctimestamp, pytz.utc )
pacificdate = utcdate.astimezone(pytz.timezone('US/Pacific'))

parsedate_tz produces a tuple that can be digested by mktime_tz, which in turn spits out a timestamp based on the UTC timezone. You can turn this into a datetime via fromtimestamp and set its timezone to UTC. Once you have the TZ aware datetime you can manipulate it to your heart’s content; the final line above converts it to a US/Pacific date.

Full example:


>>> import email, time, datetime
>>> import pytz
>>> date_eastern = 'Thu, 31 Jan 2008 17:56:13 -0500'
>>> utctimestamp = email.Utils.mktime_tz(email.Utils.parsedate_tz( date_eastern ))
>>> utcdate= datetime.datetime.fromtimestamp( utctimestamp, pytz.utc )
>>> utcdate
datetime.datetime(2008, 1, 31, 22, 56, 13, tzinfo=<UTC>)
>>> utcdate.astimezone(pytz.timezone('US/Pacific'))
datetime.datetime(2008, 1, 31, 14, 56, 13, tzinfo=<DstTzInfo 'US/Pacific' PST-1 day, 16:00:00 STD>)
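
On Python 3 the module is spelled email.utils (lowercase), and the same steps work with only the standard library. The sketch below makes one simplifying assumption: it uses a fixed UTC-8 offset as a stand-in for US/Pacific, which is only right in winter; pytz or another timezone database is still what you want for DST-aware conversion.

```python
from datetime import datetime, timedelta, timezone
from email.utils import mktime_tz, parsedate_tz

date_eastern = 'Thu, 31 Jan 2008 17:56:13 -0500'

# parsedate_tz -> tuple including the offset; mktime_tz -> UTC timestamp
utctimestamp = mktime_tz(parsedate_tz(date_eastern))
utcdate = datetime.fromtimestamp(utctimestamp, timezone.utc)

# Fixed-offset stand-in for US/Pacific in winter; it knows nothing about DST
pst = timezone(timedelta(hours=-8), 'PST')
pacificdate = utcdate.astimezone(pst)
```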

Pathalog: Find User Paths by Analyzing Your HTTP Log Files

I needed a better way to visualize how people are using my site, beyond what the typical log analyzers report on. So I cooked up a quick hack to grab user paths from the http log files and display them in a no-frills report. It’s already been useful - I found 3 distinct parts of the site that were obviously confusing in retrospect and improved them.

This is not a general purpose log analyzer - it doesn’t report on number of page views, bandwidth, etc. Instead it can be used to see what pages your users click on and in what order. It’s useful for sites that have a natural flow (are “applications”). For example, you can see what leads your users to sign up and what leads them to confusion.

I’m making the code available in its current form in case people find it useful; if I waited till I packaged it up properly and made it friendly it’d join my long list of other never released projects. It requires Python, probably version 2.5, and can be invoked as:

python pathalog.py /path/to/your/access.log > paths.log

You can see a sample report here and grab the code from here under the MIT license.

The configuration is at the top of the pathalog.py file. You have the option of doing reverse DNS to get hostnames from the IP addresses in the log files, but reverse DNS can be quite time consuming, so you can turn it off. Note that the reverse DNS results are cached so subsequent runs are much faster.

I’ve tried it on Windows, Linux, and OS X. On Windows you’ll need to create a /tmp directory, or modify reversedns.py to use an alternative directory.
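
The core idea fits in a few lines. The sketch below is not pathalog’s code: user_paths and report are hypothetical names, the requests are made up, and it treats each host as one user with no session timeout, which is a simplification.

```python
from collections import OrderedDict

def user_paths(requests):
    """Group (host, path) pairs, already in log order, into one path per host."""
    paths = OrderedDict()
    for host, path in requests:
        paths.setdefault(host, []).append(path)
    return paths

def report(paths):
    """One no-frills line per host showing the pages in click order."""
    return ['%s: %s' % (host, ' -> '.join(pages))
            for host, pages in paths.items()]

requests = [
    ('10.0.0.1', '/'), ('10.0.0.2', '/'), ('10.0.0.1', '/pricing'),
    ('10.0.0.1', '/signup'), ('10.0.0.2', '/help'),
]
lines = report(user_paths(requests))
```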

Parsing HTTP Log Files with Python

Couple of quick snippets on parsing apache/http log files (common format) with Python. This is the regular expression for parsing each line of the log:


combined_format_re = re.compile(r'''(?P.*?) -(?P.*?)- \[(?P.*?)\] "(?P.*?)(?P.*?)(?P\?.*?)? (?P.*?)" (?P\d*) (?P.*?) "(?P.*?)""(?P.*?)"''')

You can use it ala:


match = combined_format_re.search(line)

And you can get the matches in a convenient hash form via:


fields = match.groupdict()
print fields['useragent']

And while we’re at it, here’s how you parse the timestamp into a python datetime object:


import datetime
timestamp = datetime.datetime.strptime(fields['date'].split()[0], '%d/%b/%Y:%H:%M:%S')
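
Since the regular expression above lost its group names in formatting, here is a fresh sketch in Python 3 that does the same job for the combined log format. It is not the original expression: apart from 'date' and 'useragent', which the snippets above rely on, the group names are my own guesses, and the sample line is made up.

```python
import datetime
import re

# Only 'date' and 'useragent' are names confirmed by the snippets above;
# the other group names are assumptions.
LOG_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<date>.*?)\] '
    r'"(?P<request>.*?)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>.*?)" "(?P<useragent>.*?)"')

def parse_line(line):
    """Return a dict of fields for one log line, or None if it doesn't match."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Drop the UTC offset ("-0700") and parse the rest
    fields['timestamp'] = datetime.datetime.strptime(
        fields['date'].split()[0], '%d/%b/%Y:%H:%M:%S')
    return fields

line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"')
fields = parse_line(line)
```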

Wordpress formatting will probably mess up some of the code above; in theory I’ll be releasing a small piece of code soon that uses all of these so you can get the source.