Scalability vs Performance

I have been reading articles on scalability and performance. This article summarizes some of my understanding of the topic.

Scalability is the ability of a system to handle higher load by adding resources; performance determines the throughput of the system under a given load [1]. In theory, scalability and performance are orthogonal: you can handle higher load either by scaling the system or by improving the performance of its individual components. In practice, scaling and performance improvements are used together to improve the overall system.

Suppose a single machine can handle a load of N. If it is possible to handle a load of 2N by adding another machine, or kN by adding another k-1 machines, then the application is designed to be scalable. On the other hand, you can always try to optimize your application or buy more expensive hardware to make it handle a load of 2N on a single machine. Clearly, there is a limit to the performance gain possible on a single machine. Also, for the same amount of overall improvement, scaling the system by adding redundancy is typically much cheaper than improving the performance of a single machine through optimization or more expensive hardware.

This seems to indicate that scaling should always be preferred. Unfortunately, designing your application for scalability is not trivial. As an example, Google AppEngine (GAE) is designed to be scalable, but not necessarily high-performance [2]. On the other hand, a relational database such as MySQL can be optimized for high performance, but designing your application to scale with MySQL is a challenge; in most web applications, the database server eventually becomes the bottleneck at high load. Google's Bigtable, in contrast, is designed to be scalable. The tradeoff is that the GAE API does not support many relational database features such as joins, and hence requires programmers to learn a new way of storing and accessing data.

Horizontal scalability refers to adding more machines to handle the load, whereas vertical scalability (which we call high performance here) refers to adding hardware components to existing machines, such as more memory or a faster CPU, to handle higher load [3].

High Scalability Techniques
Partitioning the data is the most common scalability technique. It allows you to distribute different partitions to different servers. Consistent hashing has been used in distributed hash tables and distributed server farms to assist in partitioning and replicating data in the presence of high churn, when machines come and go frequently.
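As a rough illustration of consistent hashing (not taken from any particular system; the node and key names below are made up), a minimal Python sketch could look like this:

    import bisect, hashlib

    def _point(key):
        # map any string to a numeric point on the ring
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class ConsistentHashRing:
        def __init__(self, nodes=(), replicas=100):
            self.replicas = replicas      # virtual nodes per server, smooths the distribution
            self._ring = []               # sorted list of (point, node)
            for node in nodes:
                self.add(node)

        def add(self, node):              # a machine joins: only nearby keys move to it
            for i in range(self.replicas):
                bisect.insort(self._ring, (_point("%s#%d" % (node, i)), node))

        def remove(self, node):           # a machine leaves: its keys fall to the next node
            self._ring = [(p, n) for (p, n) in self._ring if n != node]

        def node_for(self, key):
            # first virtual node clockwise from the key's point
            i = bisect.bisect(self._ring, (_point(key), ""))
            return self._ring[i % len(self._ring)][1]

    ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
    print(ring.node_for("user:1234"))     # the same key always maps to the same server

The point of the virtual nodes (replicas) is that when a server is added or removed, only roughly 1/k of the keys move, instead of nearly all of them as with simple modulo hashing.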

Stateless servers are much more scalable than stateful ones, because stateful servers may need to communicate with each other or share state, which limits scalability. Web servers and SIP proxy servers are easy to make stateless, whereas conference servers, presence servers, and gateways are difficult to make stateless. Many applications also require stateful processing at the server, e.g., web applications that need stateful database storage. This concept can be combined with partitioning to build a two-stage server farm, where first-stage stateless servers just do load balancing and second-stage stateful servers each work on a small data partition. Unfortunately, some applications such as presence or publish-subscribe are too complex to partition easily.
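As a hypothetical sketch of the two-stage idea, a stateless first-stage server only decides which stateful second-stage server owns a given key, e.g., by hashing a user identifier (the backend names and partitioning scheme below are made up for illustration):

    import hashlib

    BACKENDS = ["stateful-1:5060", "stateful-2:5060", "stateful-3:5060"]   # assumed hosts

    def pick_backend(user_id):
        # stateless decision: every first-stage server computes the same answer,
        # so the first stage shares no state and scales by just adding machines
        digest = int(hashlib.sha1(user_id.encode()).hexdigest(), 16)
        return BACKENDS[digest % len(BACKENDS)]

    # any first-stage instance routes "alice" to the same stateful partition
    print(pick_backend("alice"))

In practice the lookup is often done with consistent hashing (as above) rather than simple modulo, so that adding a second-stage server does not reshuffle every partition.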

High Performance Techniques
The C10K problem [4] describes the typical web server limitation of only about ten thousand simultaneous connections due to operating system and software constraints, and presents several references on improving performance. The usual software performance bottlenecks are data copies, context switches, memory allocation, and lock contention. Various techniques to handle these problems are summarized in [5].

Asynchronous and non-blocking IO are commonly used to convert blocking/synchronous methods to event-based ones. Although asynchronous and non-blocking refer to almost the same thing, there are certain crucial differences in the API [6]. Non-blocking means that your methods do not block and hence return immediately, e.g., with an error code indicating that the operation is not complete. Typically, an additional method is available to query the state of the IO. For example, the socket API allows a non-blocking mode, and the application can use select to check whether the socket is ready for read or write. Thus, the application program has full control over when the read is done and in which thread/stack. Asynchronous APIs, on the other hand, are more event-based: the application registers a method handler for an event, and the system calls that method when the event occurs, either from within a system thread or by posting the event to the main application's handler loop.
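For instance, a minimal non-blocking connect-and-read using the socket API and select might look like the following Python sketch (the host, port, and timeouts are placeholders):

    import select, socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)                    # calls now return immediately
    try:
        sock.connect(("example.com", 80))      # non-blocking connect
    except BlockingIOError:
        pass                                   # "in progress" is expected, not an error

    # the application decides when and where to wait: select reports when
    # the socket is writable (connect finished) or readable (data available)
    _, writable, _ = select.select([], [sock], [], 5.0)
    if writable:
        sock.send(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")
        readable, _, _ = select.select([sock], [], [], 5.0)
        if readable:
            print(sock.recv(4096))             # data is ready, recv will not block
    sock.close()

Note how the control flow stays in the application: it chooses when to call select and in which thread, which is exactly the difference from an asynchronous API that would instead invoke a registered callback.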

A well-known topic of debate is whether event-based systems or threads are more suitable for high-performance servers. Theoretically, the two are equivalent given non-preemptive threads and co-operative multitasking. But in practice, due to the way threads are implemented and the resources they need, event-based systems have performed better on single-CPU machines. Unfortunately, it is difficult for pure event-based systems to take advantage of multi-CPU hardware.

Thread pools and process pools have been used to improve system performance and take advantage of multi-CPU hardware. Both multi-process and multi-threaded systems have been built in practice. The advantage of a multi-process implementation is that multiple processes can listen for incoming connections on the same socket, whereas in a multi-threaded implementation only one thread can be listening on a socket. The problem with a multi-process implementation is that it needs explicit inter-process communication using message passing or shared memory, whereas in a multi-threaded implementation it is easy to use global variables with mutexes and condition variables to share state. With respect to event-based systems, there are two design patterns: a reactor pattern lets the application register for a "ready" event and perform the read operation itself when the event is received, whereas a proactor pattern lets the application register for a "complete" event and receive the incoming data along with that event [7].
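As a rough sketch of the reactor pattern (not any particular server's implementation), a single-threaded event loop can register callbacks for "ready" events and perform the reads itself when notified; Python's selectors module makes this concise (the port number is arbitrary):

    import selectors, socket

    selector = selectors.DefaultSelector()

    def accept(server):
        conn, _ = server.accept()
        conn.setblocking(False)
        selector.register(conn, selectors.EVENT_READ, handle_read)   # reactor: wait for "ready"

    def handle_read(conn):
        data = conn.recv(4096)          # the application performs the read itself
        if data:
            conn.send(data)             # simple echo
        else:
            selector.unregister(conn)
            conn.close()

    server = socket.socket()
    server.bind(("", 8080))             # port chosen arbitrarily for the sketch
    server.listen()
    server.setblocking(False)
    selector.register(server, selectors.EVENT_READ, accept)

    while True:                         # single-threaded event loop
        for key, _ in selector.select():
            key.data(key.fileobj)       # invoke the registered callback

A proactor-style API would instead deliver the received bytes directly to the callback, so the application never calls recv itself.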

The thundering herd problem in operating systems is that when an IO event arrives, all the waiting threads are woken up, but only one thread handles the event and the others go back to sleep, wasting CPU cycles. The problem and a solution are presented in [8].

For a high-performance server implementation, the general consensus is to always use non-blocking IO, and to use a thread or process pool with a minimum number of threads/processes, ideally one thread/process per CPU. This paper [9] describes a SIP server architecture that can maintain a few hundred thousand active TCP connections. For pure network IO it is always possible to use non-blocking IO on commodity hardware, whereas for disk IO it is not so easy. Hence, a thread-pool model with worker threads that wait on disk IO completion has been used with success in the past.
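One hypothetical way to combine these ideas is to keep the network path non-blocking and hand blocking disk reads to a small worker pool; the sketch below uses Python's concurrent.futures, with a made-up file path:

    import os
    from concurrent.futures import ThreadPoolExecutor

    # roughly one worker per CPU, per the guideline above (disk-bound work may
    # justify a few more); the blocking read happens inside the pool, so the
    # main event loop is never stalled waiting on the disk
    pool = ThreadPoolExecutor(max_workers=os.cpu_count())

    def read_file(path):
        with open(path, "rb") as f:
            return f.read()

    def on_done(future):
        data = future.result()
        # hand the data back to the event loop / connection handler here
        print("read %d bytes" % len(data))

    future = pool.submit(read_file, "/var/log/example.log")   # placeholder path
    future.add_done_callback(on_done)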

Lessons in starting a software project

This article presents my thoughts on the DOs and DONTs of starting a new software project. Many lessons listed here are already well known or common sense, but not always followed!

DOs
  1. Brainstorm often: During the initial phases of software growth, or even before writing a single line of code, you should do several sessions of brainstorming. It could be on validating your idea, figuring out the competition, predicting the future, picking a programming language, potential learning, etc. This is the difference between a carefully planned birth and an unexpected pregnancy. Just because you can write some software, should you? Especially if better alternatives exist?
  2. Use a good version control system: Even for the most trivial projects, you should try to use a version control system. I like SVN (Subversion) for my open-source projects, but if you can afford git, it works better for complex project management. If you are starting an open-source project, consider code.google.com for hosting your SVN repository -- it is fast, simple, and hassle-free. It is like a good home for your baby software.
  3. Document all ideas: While the software is evolving you will have many ideas for new features, doing things differently, or incorporating competing features. Obviously, due to lack of resources and time, you won't be able to incorporate all of them. But you must document all the ideas, and if possible prioritize them. Keep a single list of ideas. Usually the software will evolve on its own to attract new features. Implement only the most crucial ideas and features, and resist the temptation to add many features.
  4. Few developers during growth: Keep the core set of excellent developers to one, two, or at most three while the project is growing. Every major piece of software should have only one excellent developer. This avoids unnecessary friction and induces a feeling of ownership. Software is like a baby, which needs a good parent to raise it before it can mature and face the world. You wouldn't want to raise your software in a foster home where nobody feels ownership, i.e., in an organization with an engineering "team".
  5. Pick the right language and tools: Every programming language has strengths and weaknesses. Make sure you select the right language: one that is quick to develop with and maintain, and that works well for your target application. For example, with low-level C/C++ you get performance, and with high-level Java or Python you get portability. Over the years I have liked Python for most of my applications. Unfortunately, in corporate environments, Java is the pet child because there are many times more software developers and managers who know Java well. For modern Internet and web applications, Python, Ruby, Erlang, and ActionScript are becoming more popular.
  6. Include testing and defensive programming: To be successful, sooner or later your software project will need to get out of demo mode and face the real world. By then it might be too late to worry about scalability or glaring bugs if fixing them involves redesigning your software. It saves a lot of time and energy to use common techniques such as good logging, unit testing, performance best practices, and defensive programming from day one. Also maintain an issue tracker and log even the tiniest of issues with your software; sooner or later you will need to address them.
DONTs
  1. Don't procrastinate: If you have an idea to work on, don't procrastinate. Just get started, write something up, and try to get a prototype going. Most successful projects need a complete re-write at least once, so don't be afraid to write throwaway code.
  2. Don't document before coding: While software engineering people will say that you should follow good software process -- writing a requirements specification, design document, test cases, etc. -- those can be written later too! Source code is what makes or breaks software. You can write detailed specification and design documents after you already have a prototype and want to document it or propose a change. In my experience, any design document written before the code is incorrect, and needs to change drastically after the source code is written.
  3. Don't spend time on one-off items: For your software, there are some items which are directly related, and then there are one-off items. For example, for a VoIP client, the protocol implementation, good voice quality, etc., are directly related. On the other hand, a user signup page, instant messaging text chat, file sharing, etc., are one-off items which are not directly related, but indirectly assist users of the VoIP client. When you start a project, do not spend time on one-off items; work on the directly related items first.
  4. Don't wait too long for the 1.0 release: There is an 80% difference between 80%-complete software and released software. When you formally release your software, you have to take care of the user manual, the getting-started guide, and the installer, as well as fix those last annoying bugs. Software projects are very easy to start but very difficult to finish. There is always an endless list of features that need to be completed before the release, and hence your release never happens -- unless you make it happen. You will have to make a firm decision about which bugs are important and which can remain as known issues for version 1.0.