SIP vs XMPP or SIP and XMPP?

(This post is unrelated to P2P, and describes the differences between the two sets of protocols SIP and XMPP. I have implemented both SIP and XMPP, as well as used several existing libraries for SIP and XMPP, so I can comment on the two sets of standards from a developer point of view as well)

SIP was invented to provide rendezvous for session establishment and negotiation on the Internet. XMPP (or Jabber) was invented to do structured data exchange such as synchronous or active presence and text communication among group of people. XMPP evolved from instant messaging and presence, whereas SIP evolved from Internet voice/video communication. Later, XMPP added support for session negotiation using the Jingle extension, and SIP community added extensions such as SIMPLE to support instant messaging and presence.

Technically comparing SIP and XMPP is like comparing apples and oranges because the core protocols serve different purposes: session randevous/establishment vs structured data exchange. On the other hand, because of the extensions invented in both the protocol worlds, SIMPLE and Jingle, they now have overlapping functions, and can be compared. When one compares SIP vs XMPP, actually the comparison is SIP/SIMPLE vs XMPP for IM and presence and/or SIP/SDP vs XMPP/Jingle for session negotiation. Even though the goals of the two sets of protocols are converging, there are fundamental architectural differences that I will enumerate in this article. There are other articles on SIP vs XMPP [1, 2, 3].

Differences: SIP vs XMPP
The following table lists the crucial differences between the two sets of protocols.

PurposeProvide rendezvous for session establishment and negotiation where the actual session is independent, e.g., over RTP media transport.Provide a streaming pipe for structured data exchange between group of clients with the help of server(s), e.g., for instant messaging and presence
ProtocolText-based request-response protocol similar to HTTP, where core attributes are signaled using headers, and additional data using message body, e.g., session description of capabilities.
XML-based client-server protocol to create a streaming pipe on which it sends request, response, indication or error using XML stanza between client and server, and between servers.
TransportUsually implemented in connection-less UDP as well as connection-oriented TCP transport. Also works over secure TLS transport.Works over connection-oriented TCP or TLS transport.
ConnectionA user-agent is both client and server, hence can send or receive connections, in case of TCP or TLS. This does not work well with NATs and firewalls, hence extensions are defined to use reverse connections when server wants to send message to client.The client initiates the connection to the server, which works well with NATs and firewalls. Additionally, extensions are defined such as BOSH to carry XMPP stanza over HTTP to work with very restricted firewalls

There are many other differences, e.g., the way a URI is represented, or how authentication is done, or what kinds of messages are supported. I will not go into details of those since they tend to become too specific for the kind of application and we miss the important points. From a developer's point of view 'ease of programming' is very important.

Ease of programming
Both SIP and XMPP are easy to implement. My 39 peers project has modules for both in few thousand lines of Python code. Although the basic protocol is easy to implement, a complete system such as a collaboration client with audio/video and messaging/presence support is very complex.

Because of the way these protocols have originated, they are well suited for certain kinds of applications. For example, if you want to build an audio/video communication system, it is better to start with SIP. Features such as interoperability with other VoIP phones, incorporating any-cast call distribution, or using existing VoIP provider for trunking are easy and readily available using SIP. If you want to build an instant messaging and presence client, it is better to start with XMPP. Features such as friends roster, group chat, blocking a user, storing offline messages, etc., are readily available using XMPP. Any advanced communication or collaboration system needs to include both these kinds of features.

XMPP has solved the application's problems and has defined mechanisms for several commonly used features in an instant messenger-type or shared state-type application, e.g., group chat, visiting card, avatars, etc. The emphasis is on application design, use cases, and practical solutions.

I think there are two main reasons for SIP's difficulty among developers: (1) the emphasis of SIP is on interoperability rather than application and feature design, and (2) the emphasis in SIP community is to have one protocol solve one problem, which requires implementing a plethora of protocols for a complete system. Let me explain these further.

When a new VoIP features is implemented by one phone, it must interoperate with another phone or VoIP service provider. Hence most SIP extensions focus on wire-protocol and interoperability mechanisms. Although specifications of several SIP extensions are available, there are no evaluation or open reference implementation on how they fit in the overall design. More recently efforts have been made, including my RFC 5638 (Simple SIP Usage Scenario for Applications in the Endpoints), to simplify the specifications for certain types of SIP applications -- those endpoints that want to work in web and Internet world without the legacy of the traditional telephony systems.

Secondly, SIP community tries to keep one protocol to solve one problem. Some extensions deviate from this guideline, but they are exceptions. The problem comes when this design principle involves implementing several distinct protocols just to get a complete system. For example, a SIP system incorporates other external mechanisms such as STUN, TURN, ICE, reverse-connection-reuse and rport-based symmetric request routing to solve the NAT and firewall traversal problem, and still does not guarantee media connectivity in all scenarios unless HTTPS/TCP tunnel in used. Implementing instant messaging and presence involves implementing several RFCs and drafts related to Event, PUBLISH, CPIM, PIDF, XCAP, MSRP, and still the application does not have all the features of commonly available XMPP client. In summary the SIP community has created numerous extensions for solving several problems in a way that scares away a new developer!

As you can see, both these reasons (emphasis on interoperability and one-protocol-one-problem) are ideal in theory. So what is wrong? The practice. To solve these problems, (1) IETF working groups should not proceed with a draft without an open-source and simple reference implementation, (2) IETF working groups should build reference applications combining several protocols for different kinds of applications and evaluate (a) consistency and (b) ease of programming.

Consistency indicates whether the new extension is consistent with existing guidelines, best practices, protocol format, as well as design principles. For example, if an extension incorporates a new processing in the server which could have been done in the endpoint, then it is against the principle of intelligence in the endpoints. Such extensions should be marked as such so that developers know the trade-off. There are only a few good design principles, hence creating a consistency matrix of extensions against principles should be easy.

Ease of programming is determined by three things: (1) how easy it is to implement the set of protocols, (2) how easy it is to build a real application using those protocols, and (3) how easy it is to build the real application using existing platforms and tools. The first is usually available as a software library, the second as an application and the third is re-usability. It should be easy to not only build the library but also use the library to build a usable application. Every new extension adds new things to the library, which cause more interaction in the application and hence more complexity. When a software project is started, usually the interoperability is not the highest requirement, but the re-usability, short development time and real prototype application are crucial requirements. Once the project is started on one path, it is very difficult to change the path by changing the core communication protocol. If there are reference implementations then not only they help you get started quickly but it also becomes easy to see how much additional complexity a particular SIP extension brings to the application. An important programming quote: less is better than more!

The flexibility of SIP also comes with its limitations. For example, SIP is flexible to support both UDP and TCP transport. However, UDP is treated as a second-class citizen by many programming languages or libraries even today, e.g., Tcl didn't support built-in UDP socket when it came out, and Adobe ActionScript does not have built-in UDP sockets for Flash Player even now. This prevents a developer from building a complete SIP stack as Flash application, for example. However, if you peek further, you would expect that if UDP is not supported then the platform is not suitable for real-time communication anyway. However, this does not prevent web-style developers to implement XMPP in ActionScript, and perhaps tweak it to support signaling of media sessions as well. The result is a broken or non-interoperable software application.

Reviewing the evolution of SIP vs XMPP specifications, I think XMPP has defined an architecture that allows adding new extensions easily and hence reduces the application complexity, whereas SIP extensions have focused on interoperability and wire-protocol without much needed attention to application design. While application design may seem unnecessary for protocol specification, it is very important in the short term. Consider a developer who uses some data structures for representing protocol elements. If a new extension is defined in XMPP, and it reuses the existing XML format that gets readily mapped to the data structures, it becomes very easy to incorporate this new extension in his source code. If a new extension is defined in SIP or SDP, which re-uses an existing mechanism of another protocol for which there is no real implementation available, then the developer will first have to implement that other mechanism, then integrate it with SIP or SDP. The mechanism may have its own formatting which needs to be incorporated in the data structures. Essentially the developer will have to spend more time implementing such an extension. In the end, the actual format of the message whether text-based or XML-based is not terribly difficult once you have a library for message formatting and parsing. However, if an extension uses a different format, connections, sessions, etc., that are not readily available in existing libraries and tools, complexity arises. For example, adding ICE to SIP/SDP created custom format whereas ICE in XMPP/Jingle re-used XML. Another example is how an particular endpoint is identified in XMPP vs SIP. In XMPP the URI itself is extended to include the resource, e.g., "user@domain/resource", whereas in SIP new extension such as globally routable user-agent URI (GRUU) is defined which is, well, more programming effort!

Scalability and performance
SIP is inherently a peer-to-peer protocol whereas XMPP is inherently client-server. Tasks that are easy in client-server systems such as shared state, roster storage on server, or offline messages on server, are well accomplished with XMPP. On the other hand, one of the primary goal of SIP is to keep the intelligence in the endpoint. Ideally, a SIP proxy server does not even maintain the session state for the SIP dialog. Few messages in SIP such as REGISTER and PUBLISH are intended for client-server communication. In XMPP, server is a must and all signaling communication goes through the server. There are message semantics defined for the types of messages, e.g., client-server information query, client-server-client message sending, client-server event publishing and server-client event notifications. Clearly client-server applications are limited by scalability and performance of the server. For example, an instant messaging session need not go through the SIP server saving bandwidth and processing at the server. But that means you lose the offline message storage feature at the server. In real SIP applications today, servers have become an integral part of the system and hence the scalability difference diminishes. In fact, the bulky message format of SIMPLE makes it less scalable than XMPP for presence updates that go through the server. Note also that although P2P-SIP is possible, a P2P-XMPP is not easy because XMPP is inherently client-server.

Once we know this, we understand that SIP and XMPP systems solve two different problems, are designed for two different architectures and have evolved with two different guidelines. From here, you can do two things: either try to incorporate/translate all the features of one system to the other and eventually give up, or try to design your system that uses best of both worlds.

Interworking and co-location
There have been interworking attempts to inter-operate SIP/SIMPLE and XMPP, especially the IM and presence part [draft-saintandre-sip-xmpp-*, draft-veikkolainen-sip-voip-xmpp-*]. The first reference shows how to implement a gateway to connect between SIP and XMPP networks, and the second shows how to implement a client that can support both SIP and XMPP and co-relate the two protocol messages if the user is connected to both servers by the same provider. The popular OpenSER (now OpenSIPs and Kamailio) SIP server has a Jabber module to inter-work with XMPP network. People have developed clients that can understand both SIP and XMPP. Interworking is complex, and not all features can be completely translated or used from one protocol to another, unless the protocol is changed a lot with custom hacks.

Industry experts predict that both SIP and XMPP will stay for a long time. Rather than arguing about the differences or trying to mend the protocols to be like each other, one could build systems that use both these protocols for what each is good at. XMPP is good at creating application level streaming/secure/client-server pipes that can be used for shared state, one-to-many message delivery and publish-subscribe-notify-type use cases. SIP is good at rendezvous of session establishment and negotiation of session parameters for a separate session establishment.

To interwork between XMPP and SIP, you could (1) use a gateway at the server to translate the basic functions, (2) learn or send SIP parameters over XMPP message from a client, or (3) use SIP to establish XMPP chat session with a client. For example, a multi-protocol client of user may be talking to over SIP, and discover that both clients support XMPP, and then add each other in XMPP roster or start an XMPP chat session. Alternatively, if they are chatting over XMPP and discover that the other supports SIP as well, then they initiate a SIP session to do multimedia call. Implementing both the protocols in the client is better than in the gateway for scalability and robustness. There are other interworking architectures possible, e.g., having two XMPP servers use SIP to communicate with each other or talk to a trunking provider, or having an integrated SIP-XMPP server that allows both SIP and XMPP users to seamlessly communicate with each other. These modes, however, are not interesting from a P2P point of view.


CaoimhĂ­n said...

Excellent article - thank you

varun said...

Good comparision. With market flooded with speculation about google voice supporting sip/ more devices, certainly its very interesting.

Your article is top ten result on google on sip-xmpp search :)

Unknown said...

Very good article! easy to read, but rich of interesting information.

Ricardo Camacho said...

I found your article most illuminating, Great Job! :)

Ricardo Camacho said...

I found your article highly enlightening, great job!

GP said...

Very simple & precise explanation about the two protocols that always tend to be interchangeable and redundant in what they do

john murimi said...

Excellent artcle...and where does mqtt lie on the same kind of comparison ie. mqtt vs sip, mqtt vs xmpp

Marwan said...

regarding the suggestion to have 2 XMPP identify support for SIP, how do you envision this to be done? DNS SRV record queries or standard SIP port queries?

bazz said...

Great article. But you forget about the option of combining the best of both worlds and not giving up.

Jaime Casero said...

Great comparison. This kind of insight about protocol original purpose is quite clarifying, while leveraging lower level details.
This kind of discussion is getting more important, now that IoT is building each own stacks, potentially overlapping again, and trying to solve some of the problems dicussed here...

Unknown said...

What opensource servers for XMPP would you suggest? I have looked at some but most have not been updated recently. Is it a sign that I should move to SIP?
I want to provide IM, Group IM, Image based messages in my application. Which server would be best?