Thursday, October 29, 2009

Reliability and scalability in P2P-SIP

In this article I list the important definitions that contribute to reliability and scalability of distributed systems, especially P2P-SIP. Scalability is defined as the ease with which a system can handle growth of demand (load, users, requests, etc.), and reliability is defined as the ease with which a system handles failure or loss (fault tolerant, availability).

Stateless: The amount of state stored in each networked component limits the scalability (as well as robustness). A truly stateless protocol or service does not need to store any state in the system. At the application level, a session oriented protocol such as RTSP or RTMP is stateful, whereas HTTP is stateless which can be made stateful using session cookies. At the transport level, TCP is stateful whereas UDP is stateless. SIP networked components come in several flavors: stateless, transaction stateful and session stateful. A stateful component has limited scalability not because of storage requirement of the state, but due to matching of a new request against the existing states -- this requires exclusive or read-write-locked access to shared resources and hence is slow. On the other hand a stateless request has all the information that is needed for handling that request. Secondly, a stateful component has limited reliability because if the component fails the state is lost and must be re-established, e.g., using new SIP transaction or new HTTP authentication. If you must use stateful component, then some form of distributed or replicated shared state is desirable.

Partitioning: Data or service partitioning among multiple machines helps in reducing load on each machine. A naive approach of using a hash function 'H(data.key) mod N' works for distributing data based on the lookup key among N identical and robust servers. However, the hashing algorithm remaps majority of the data if N changes -- new server is added or old one fails. On the other hand, consistent hashing using a large hash space, e.g., MD5, maps both data using H(data.key) and machines using id=H(machine.name) in the same identifier space. A machine can then store the data whose keys are close to its own id. This principle is used in several structured P2P algorithms such as Chord, Pastry, CAN, and works especially well with large number of machines with significant amount of churn. Partitioning can be applied to services as well. Partitioning improves system scalability, but comes with an overhead of maintaining the correct partition when machines come and go.

Replication: Replication of data improves its reliability (also known as availability) as well as scalability to some extent. In a simple master-slave replication of data, write is done to the master which replicates the command to the slave, and read can be done from either master or slave servers. With higher failure probability, such as in P2P, you need more number of replicas. With N replica of some data, you can still get the data if N-1 machines holding the replica fail. Replication interacts with partitioning -- you want to keep access to the replica almost as fast as the original data. Typically a machine can replicate the data to its N (typically small, 2-8) neighbors and when the machine fails, the next data query automatically gets to its ex-neighbors. Note that another approach where a different hashing function, e.g., H(i+data.key), is used to store the i'th replica does not work well, because replication comes with an overhead of having the machines to keep the replicas up-to-date.

Redundancy: Replication is an example of redundancy of data. Redundancy can be applied to servers and services as well. A stateless protocol or service helps in easily deploying redundant server nodes. Data redundancy goes beyond simple replication of data object to N places. Suppose a large file is split into M chunks such that only N chunks are needed to reconstruct the original file, N<=M. Assuming that many data sources are fast, this approach not only improves reliability but also performance in terms of how quickly you can download the file. Such techniques are used in P2P file sharing applications, and can be applied to P2P-SIP as well, e.g., for storing video mails or live streaming. Redundancy improves reliability as well as scalability of the system.

Load sharing: Load sharing is one type of redundancy where the load of the system is shared among N redundant machines. Although not required, load sharing typically works in conjunction with partitioning, e.g., in two-stage SIP server farm where the first stage servers forward the requests to the second stage server clusters based on H(data.key). As with redundancy, a stateless protocol or service can be easily shared whereas a stateful system requires more work. P2P systems are inherently load shared among the participating peers. Load sharing improves system scalability.

Iterative vs recursive: Iterative request routing is one where a client sends a request to one destination, receives a response that redirects it to second destination, and so on. Recursive request routing is one where a client sends a request to one node, which sends request to another node, and so on. Iterative vs recursive is also called as redirect vs proxy. Clearly, iterative poses less load on the networked element but more on client, whereas recursive is opposite of that. If scalability of networked element is desired then iterative should be preferred. However, NATs and firewalls make iterative request processing difficult on the Internet. Secondly, the topology, bandwidth and connections among the networked elements sometimes make recursive routing faster and more efficient in practice than iterative. The decision to go iterative vs recursive affects the number of message that needs to be handled as well as the state carried in the message or stored in the networked element. It is easier to incorporate redundancy in iterative mode, since only the client needs to re-try to redundant destinations.

Keep-alive: Network protocols usually have periodic keep-alive messages to ensure connectivity or to detect failures. Stateful Transport protocol such as TCP has built-in keep-alive mechanisms that can be activated using socket API. Application protocols employ some kind of application level keep-alive, e.g., XMPP has an extension to do ping, SIP has session timers. These not only detect failure due to network but also due to server software crash. The keep-alive messages help in improving the reliability of the system by quickly detecting complete or partial failures. Keep-alives are especially important in P2P network because of the large number of network paths that can fail.

Exponential back-off: After a failure has been detected, a reconnection or resend is attempted. The time for such attempts needs to be backed-off exponentially -- if the failure happens at time 0, then send first attempt at t, say t=0.5s, and subsequent ones at 2*t, 4*t, 8*t, and so on, until it reaches a cut-off, say 5 min. After that keep attempting periodically, say, every 5 min. If the failure is transient or one time, then this mechanism quickly reconnects. On the other hand, if the failure is longer term, then it reduces the load and bandwidth for reconnection attempts. SIP uses exponential back-off when sending subsequent requests in the event of failure.

Request-response: Application protocols typically come in two flavors: send-and-forget and request-response. Most signaling and control protocols such as HTTP and SIP follow the request-response architecture. Media streaming protocols such as RTP usually don't send response for every message. A request-response protocol automatically makes the entity sending the request a client and the entity receiving the request a server. The client-server distinction may be on per-transaction basis, e.g., SIP has each endpoint act as user-agent-client and user-agent-server. Similarly, in P2P network every peer acts as both client and server, while there may be some nodes behind NAT that can act only as client. A request-response architecture is needed where reliability of the message delivery is important or when RPC semantics is desired in the protocol.

Redundant connections: Redundant connections are another form of redundancy found in distributed systems. For example, a client may be connected to multiple servers. It periodically pings the server and selects the best server to actually send the request to. If the best server fails, it can fail-over to the next best server. Redundant connections improve both reliability and performance of the system. Redundant connections are also useful with geographically distributed server farms to locate the closest server. The idea is to dynamically adapt instead of having statically configured connections.

Bi-directional master-slave: Master-slave replication is used for data reliability and to improve scalability of read-dominated applications, as mentioned before. In a bi-directional master-slave configuration, both machines act as both master and slave at the same time. Any write to any of the machine gets propagated to the other, and hence any read can be done from any of the machines. This improves scalability for write-dominated applications also, such as SIP server. The bi-directional replication can be extended to more than two machines by incorporating a circular ring topology of replica machines. The bi-directional replication comes with the over head of having to maintain replicas on all the machines, and some way to solve a race condition where two updates to two different machines result in eventual consistency. For certain types of data, such as SIP contact locations of the users, it is possible to achieve such consistency.

Vertical vs horizontal scalability: A lot of time you will hear people talking about vertical vs horizontal scalability. Vertical scalability implies that when the load increases you identify the bottleneck and add a new component (e.g., CPU, memory, disk) to your machine to improve the scalability. Horizontal scalability implies that you design the system such that when the load increases you add another machine in the network to handle the load, e.g., another server in the server farm. P2P systems are horizontally scalable. Clearly vertical scalability has a limit beyond which it may not scale, whereas server farms can scale linearly by partitioning.

Proxy and Cache: Caching improves performance and scalability of the system. DNS has epitomized the concept. Caching is also used in web and media streaming protocols. The idea is to install a cache in an intermediate networked element which uses the cache instead of sending subsequent requests to the actual destination. HTTP is designed to heavily use caching to improve performance and scalability of the servers. Caching a negative response is more challenging since the time-to-live for the cached entry is unknown. On the other hand, real-time communication protocols such as SIP and RTP have limited use for caching in the network. Nevertheless caching of data improves the performance and scalability of the SIP server, e.g., by using in-memory cache of the SIP contact locations instead of reading from the database every time. With caching comes the overhead of maintaining consistency among redundant data.

Crash vs byzantine failure: Crash and byzantine failure are two types of failure models: a crash indicates that the system has stopped working and it may be possible to detect its failure, e.g., using keep-alive; whereas a byzantine failure indicates that the system does not behave consistently as per the agreed protocol or algorithm all the time. Hence, it is very difficult to detect byzantine failures, especially if it is due to malicious intention. Byzantine failure and malicious node problem is especially important in P2P since the peers may not trust each other. Both crash and byzantine failure reduce the reliability of the system. Various mechanisms mentioned in this article help mitigate the crash failure, but do not help much with the byzantine failure.

Readings:
[1] Singh, K. and Schulzrinne, H., "Failover, load sharing and server architecture in SIP telephony", Computer Communication 30, 5 (Mar. 2007), 927-942. DOI= http://dx.doi.org/10.1016/j.comcom.2006.08.037 [Author's copy]
[2] K.Singh, "Reliable, scalable and interoperable Internet telephony", PhD Thesis, Computer Science Department, Columbia University, New York, NY 10027, June, 2006.

No comments: