Archive for the 'Scaling' Category


Proxies For Request Modification?

0

Interesing post from igvita on Ruby Proxies for Scale and Monitoring discussing the use of Ruby and EventMachine to create simple proxies for monitoring, benchmarking, content examination, and even request modification.

I’ve always wanted to do benchmarking as Ilya suggests. Real production traffic is the best way to test. Good stuff.

I’m tempted by the beanstalkd use case as well – he uses his proxy to detect and route certain requests to an archiving mysql instead of to his beanstalkd instance. I’m leary of maintainability issues however – I’ve generally found indirection, particularly at wire protocol level, can quickly lead to hard to find bugs.

Something to experiment with at some point.

Distributed Database Talk

0

Very informative PyCon talk on various fancy distributed data stores, including BigTable, Dynamo, Cassandra, and several others.

 

If you have enough traffic, the cost of servers outweighs the cost of programmers

1

Quote from Bill Venners (via):

If you have enough traffic, at some point the cost of servers outweighs the cost of programmers

Absolutely true, which is why places like Yahoo and Google are among the last bastions of very skilled C/C++ programmers.

Of course I should mention: you are not at that point. You really aren’t. So for now ignore this quote.

Queueing Benefits

0

Queues are nice things. We should be using more of them.

A couple of benefits that I knew of but didn’t really appreciate until Alex started using beanstalkd in his application:

- Having a worker-pulls-jobs-from-queue model provides near optimal use of the machine and prevents overload and thrashing. Setup as many simultaneous workers as your resources can handle and let them go. You have a controlled number of workers, preventing thrashing, and your workers work continuously. You don’t have to worry about a load distribution strategy – workers pull jobs as fast as they can.

- Provisioning new workers into the system becomes trivial. Want to add another box into the mix? Just set it up and have the workers start pulling jobs. You don’t have to worry about registering the new box and getting it into the load distribution system – so long as it knows how to connect to the queue it can grab jobs. Alex commented that scaling his system is as easy as bringing up another virtual machine – as soon as it’s up it starts pulling jobs.

These are both very important operational benefits that I had largely ignored.

Happy: Hadoop with Python (Jython)

0

The Freebase folks have open sourced their Python (Jython) based Hadoop framework, calling it Happy. Looks interesting, will need to give it a whirl when I get a chance.

Disco: Erlang/Python Based Map-Reduce

1

Disco is a map-reduce framework written in Erlang and Python. Seems reasonable – I definitely prefer Python to Java for writing maps and reduces, and Erlang is rumored to be good at parallel stuff.

Interestingly no mention of an underlying distributed file system.

Via High Scalability.

Drizzle: MySQL Based Slim / Cloud-Oriented DB

1

Drizzle is interesting:

Drizzle: A High-Performance Microkernel DBMS for Scale-Out Applications
Drizzle is a community-driven project based on the popular MySQL DBMS that is focused on MySQL’s original goals of ease-of-use, reliability and performance.

Headed up by Brian Aker, Director of Architecture at MySQL AB. Take a look at the MySQL Differences page and you’ll mostly see features removed and cleaned up, which is great. Designed for high levels of concurrency, targeted to “cloud” applications. Monty and Brian’s posts offer motivation for the project.

Something to keep an eye on.

The Performance Penalty of Virtualization

0

If you’ve spent any time with virtualized environments you know how effective and productive they are. The process of expanding capacity for FaceDouble, for example, became significantly simpler once they moved to depolying virtual servers, and SmugMug has been singing the praises of Amazon’s EC2 with a clever system to provision and remove capacity based on load. My own experiments with Hadoop and EC2 have been similarly fruitful.

So I’m wondering what the downside to aggressively going virtual is – why not make all servers virtual?

The main issue that comes to mind is performance, or the loss thereof. Presumably the performance of a virtual server is less than that of the same server running directly on the native OS.

Just how much of a performance difference is there, say in terms of per request latency and capacity, for a web server, a database server, and a cpu-bound heavy computation server, for any of the common virtualization systems (Xen, VMWare, etc)? I haven’t seen any good materials on this, so if you have knowledge or pointers please let me know.

Flickr Capacity Planning Presentation

0

Unfortunately I missed the Web2.0 Expo this year, but I’ve been catching up on slides and presentations. I had John Allspaw’s Capacity Planning For Web Operations open in a tab for several days and finally got to it. Turned out to be much more interesting than I’d anticipated. Slide 9 – “Normal” growth: 4x increase in photo requests/sec. That’s pretty obscene. Slide 43 – diagonal scaling: replacing 67 dual core servers with 18 dual quads results in ~half the load per server. Slide 45: ~70% less power usage, 49U less space. I’d been curious about that last stat (power usage of horizontally scaled servers versus multitude of smaller servers), good to see some real numbers for it.