REST and SIP

This article describes a RESTful SIP application server architecture.

Why do we need this?
SIP is the protocol of choice for Internet session initiation and control, such as for VoIP or multimedia calls. Although SIP is similar to HTTP in many respects, there are crucial differences in the design. Two of the major difficulties web developers face in adopting SIP are (1) the lack of SIP-based web tools comparable to the programming libraries for HTTP and XMPP on Flash Player, and (2) the huge initial cost of getting a basic working system started, given the many specifications involved, e.g., for NAT and firewall traversal. On the other hand, web developers are used to building applications on top of HTTP, which works for most cases out of the box. More recently, RESTful architectures have been gaining popularity among web services. In the absence of easy-to-use web tools for SIP, and faced with the large set of specifications for a SIP system, web developers tend to resort to quick and dirty hacks which in the end are short term and not interoperable. Hence there is a need for an easy-to-use RESTful architecture for SIP-based systems that allows quick application development by web developers. This article proposes such an architecture.

What exactly is difficult?
SIP supports both UDP and TCP transports. Many earlier systems implemented only UDP, whereas both transports are a must for SIP proxy servers. In client-server communication, with several clients behind NATs and firewalls, UDP causes problems. Secondly, with UDP you also need the reliability of transactions and hence the transaction state machines in SIP. The SIP request forking and early media features have created a lot of stir and confusion among developers. Several other telephony-style features are also not needed for many Internet-oriented SIP applications that do not talk to a phone network. NAT and firewall traversal are defined outside core SIP, e.g., using rport and sip-outbound. A developer usually prefers to have an integrated application library and API that is quick and easy to use. Moreover, with lots of RFCs related to SIP, it becomes difficult to figure out which specifications are core and which are optional for a particular use case. A number of new web-based video communication systems use proprietary technologies, such as those on Flash Player, because no ready-to-use SIP library satisfies their needs.
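
To give a feel for the transaction machinery that UDP forces on a SIP stack, here is a minimal sketch of the retransmission schedule for an INVITE client transaction. It is based on my reading of the RFC 3261 timers (the retransmit interval starts at T1 and doubles, and the transaction gives up after 64*T1); the function name and the default T1 of 0.5 seconds are assumptions of this sketch, and a real stack also needs the full state machine around it.

```python
def invite_retransmit_times(t1=0.5):
    """Return (retransmit_times, timeout) in seconds for an INVITE sent over UDP.

    Assumed from RFC 3261: the retransmit interval starts at T1 and doubles
    after every retransmission, and the transaction times out after 64*T1.
    """
    timeout = 64 * t1
    times = []
    elapsed, interval = 0.0, t1
    # Keep scheduling retransmissions while the next one fires before the timeout.
    while elapsed + interval < timeout:
        elapsed += interval
        times.append(elapsed)
        interval *= 2
    return times, timeout


if __name__ == "__main__":
    times, timeout = invite_retransmit_times()
    print("retransmit at", times, "give up at", timeout)
```

None of this is needed when the client-server hop runs over a reliable transport such as TCP, which is one reason the proposal in this article builds on HTTP.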

To solve the difficulties faced by web developers, a subset of the core features of SIP is needed as an easy-to-use API. Such an API could be available as a built-in browser feature or a plugin. Once the core set of resources is identified, the rest of the API can be customized by the application server providers and developers, or in separate communities.

What use cases are considered?
SIP is designed to be used consistently in different use cases such as client-to-client, client-to-server and server-to-server communication. Core SIP says that each SIP user agent (application client) has both a UAC (client) and a UAS (server). In this article I use client to mean a user agent and server to mean an application server, which differs from SIP terminology. Since the target audience for the proposal is application developers, only the client-server interface needs to be considered. The backend application server can translate the client-server request to appropriate SIP messaging for the server-to-server case if needed, e.g., a service provider's network may need high-performance UDP-based server-to-server SIP messages.

What are the SIP-related resources?
Once we focus on a small subset of the problem -- defining a RESTful API for client-server communication to access a SIP application server -- the rest of the solution falls into place naturally. In particular, the SIP application server will provide two core resources: "/login" and "/call", to represent the list of currently logged in users and the list of active calls. Additionally, it can provide user profiles of signed up users at "/user", which internally may contain things like voicemail resources for the user. The client uses standard HTTP requests, with some additional methods as shown below, to access the resources and interact with others. One difference from a standard RESTful architecture is that the client-server connection may be long lived, and is also used for notification from server to client. In that sense it is not purely RESTful.

Login: SIP registration and unregistration are mapped to the "/login/{email}" resource, e.g., "/login/kundan@example.net". A "POST /login/{email}" with a message body containing your contacts can be used to REGISTER. The response returns your unique identifier for the login resource, e.g., "/login/{email}/{contact_id}". Later, you can use "DELETE /login/{email}/{contact_id}" to un-REGISTER, or a subsequent "PUT /login/{email}/{contact_id}" to do a REGISTER refresh. The actual representation of the login contact information can be in XML, JSON or plain text and is application dependent. For example, one could combine the presence update, including rich presence, with the registration method. Clearly the login update requires appropriate authentication.
 POST /login/kundan@example.net      -- new registration
request-body: {"contact": "sip:kundan@192.1.2.3:5062"}
response-body: {"url": "/login/kundan@example.net", "id": 1, "expires": 3600}

PUT /login/kundan@example.net/1 -- registration refresh
request-body: "sip:kundan@192.1.2.3:5062"

DELETE /login/kundan@example.net/1 -- unregister

GET /login/kundan@example.net -- get list of contact locations
response-body: [{"id": 1, "contact": "sip:kundan@192.1.2.3:5062", ...},...]
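
A minimal sketch of how a server might back the "/login" resource, assuming an in-memory store and a fixed expiry. The class name LoginRegistry, its method names and the "expires_at" bookkeeping are hypothetical and only illustrate the mapping from the requests above to registration state; authentication is left out.

```python
import itertools
import time


class LoginRegistry:
    """In-memory stand-in for the "/login" resource tree (hypothetical helper)."""

    def __init__(self, default_expires=3600):
        self.default_expires = default_expires
        self._contacts = {}                 # email -> {contact_id: record}
        self._next_id = itertools.count(1)  # contact ids are unique per registry

    def post(self, email, contact, now=None):
        """POST /login/{email}: add a contact, i.e., a new registration."""
        now = time.time() if now is None else now
        contact_id = next(self._next_id)
        record = {"id": contact_id, "contact": contact,
                  "expires_at": now + self.default_expires}
        self._contacts.setdefault(email, {})[contact_id] = record
        return {"url": "/login/%s" % email, "id": contact_id,
                "expires": self.default_expires}

    def put(self, email, contact_id, contact, now=None):
        """PUT /login/{email}/{contact_id}: registration refresh."""
        now = time.time() if now is None else now
        record = self._contacts[email][contact_id]  # KeyError would map to a 404
        record["contact"] = contact
        record["expires_at"] = now + self.default_expires

    def delete(self, email, contact_id):
        """DELETE /login/{email}/{contact_id}: unregister."""
        del self._contacts[email][contact_id]

    def get(self, email, now=None):
        """GET /login/{email}: list the contacts that have not expired."""
        now = time.time() if now is None else now
        live = [r for r in self._contacts.get(email, {}).values()
                if r["expires_at"] > now]
        return [{"id": r["id"], "contact": r["contact"]} for r in live]
```

The "now" parameter exists only so that expiry can be exercised without waiting; a real server would take the time from the clock and the sender from the HTTP credentials.
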
Call: The call is split into two parts: the conference resource and the invitation. The conference is represented using a "/call/{call_id}" resource, where a client can "POST /call" to create a new call identifier, or "POST /call/{call_id}" to join an existing call. The conference resource represents the list of participants in a call.
 POST /call             -- create a new call context
request-body: {"subject": "some discussion topic", ...}
response-body: {"id": "123", "url": "/call/123" }

POST /call/123 -- join a call
request-body: {"url": "/login/kundan@example.net", "session": "rtsp://...", ...}
response-body: {"id": 2, "url": "/call/123/2", ...}

GET /call/123 -- get participant list and call info
response-body: {"subject": "some discussion topic",
"children": [{"url": "/call/123/2", "session": "rtsp://..."}]
}
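
The conference resource can be sketched the same way. This is a minimal in-memory illustration under my own assumptions: the class name CallStore, the "login" key and the sequential identifiers are hypothetical and are not part of the proposal itself.

```python
import itertools


class CallStore:
    """In-memory stand-in for the "/call" resource tree (hypothetical helper)."""

    def __init__(self):
        self._calls = {}
        self._call_ids = itertools.count(1)

    def create(self, info):
        """POST /call: create a new call context and return its identifier."""
        call_id = str(next(self._call_ids))
        self._calls[call_id] = {"info": dict(info), "children": [],
                                "child_ids": itertools.count(1)}
        return {"id": call_id, "url": "/call/" + call_id}

    def join(self, call_id, participant):
        """POST /call/{call_id}: add a participant to an existing call."""
        call = self._calls[call_id]  # KeyError would map to a 404
        child_id = next(call["child_ids"])
        url = "/call/%s/%d" % (call_id, child_id)
        call["children"].append({"url": url,
                                 "login": participant.get("url"),
                                 "session": participant.get("session")})
        return {"id": child_id, "url": url}

    def get(self, call_id):
        """GET /call/{call_id}: call info plus the participant list."""
        call = self._calls[call_id]
        result = dict(call["info"])
        result["children"] = [{"url": c["url"], "session": c["session"]}
                              for c in call["children"]]
        return result
```

Note that the server only stores the session description it is given (here an opaque URL); it does not need to understand it.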

Invite: Call invitation requires a new method such as "SEND". For example, "SEND /login/{email}" sends the given message body to the target logged in user. Similarly, "CANCEL /login/{email}/1" cancels a previously sent message if it has not already been delivered. The message body gives additional details, such as whether the message is a call invitation or an instant message. The message body is application dependent. The SIP application server does not need to understand the message body, as long as it can send a SEND message from one client to another. This makes a SEND closer to an XMPP message than to a SIP INVITE. If the callee wants to accept the call invitation, it joins the particular session URL independently.
 SEND /login/alok@example.net     -- send call invitation
request-body: {"command": "invite", "url": "/call/123", "id": 567}

SEND /login/alok@example.net -- cancel an invitation
request-body: {"command": "cancel", "url": "/call/123", "id": 567}

SEND /login/kundan@example.net -- sending a response
request-body: {"command": "reject", "url": "/call/123", "id": 567, "reason": ...}
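
The point that the server relays a SEND without understanding its body can be sketched as follows. The names MessageRelay and handle_incoming are hypothetical, and I assume here that the sender identity comes from the HTTP credentials of the request; only the receiving client parses the body.

```python
import json
from collections import defaultdict, deque


class MessageRelay:
    """Server side of SEND: queue opaque bodies per target (hypothetical helper)."""

    def __init__(self):
        self._pending = defaultdict(deque)  # target email -> queued messages

    def send(self, sender, target_email, raw_body):
        """SEND /login/{email}: store the body unchanged, without parsing it."""
        self._pending[target_email].append({"from": sender, "body": raw_body})

    def next_message(self, email):
        """Deliver the oldest pending message for a logged in user, if any."""
        inbox = self._pending[email]
        return inbox.popleft() if inbox else None


def handle_incoming(message):
    """Client side: only the endpoints interpret the message body."""
    body = json.loads(message["body"])
    if body.get("command") == "invite":
        # Accepting an invitation means joining the call URL independently.
        return "join " + body["url"]
    return "ignore"
```

Because the relay never looks inside the body, the same path carries call invitations, cancellations, responses and instant messages.
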
Event: SIP includes an event subscription and notification mechanism which can be used in several applications, including presence updates and conference membership updates. Similarly, one needs to define a new mechanism to subscribe to any resource and get notified of a change. This gives rise to a concept known as an active resource. The idea is as follows: if a client does a GET on an active resource and does not terminate the connection, then the client first receives the initial state of the resource, and then keeps receiving any future updates until the connection is terminated. The future updates may include the full state or a difference, depending on the request parameter.
 GET /call/123          -- keep track of membership information
response 1: ... -- initial membership information
response 2: ... -- any addition or deletion in the membership

GET /login/kundan@example.net -- keep track of presence updates
response 1: ... -- initial presence information
response 2: ... -- subsequent presence updates.
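
A minimal sketch of the active-resource idea, assuming each long-lived GET is modeled as a queue that the server fills. The class and method names are hypothetical, and this version always pushes the full state rather than a difference.

```python
import queue


class ActiveResource:
    """A resource whose GET stays open and streams changes (hypothetical helper)."""

    def __init__(self, state):
        self._state = state
        self._subscribers = []

    def subscribe(self):
        """Model a long-lived GET: the queue first holds the initial state."""
        q = queue.Queue()
        q.put(self._state)
        self._subscribers.append(q)
        return q

    def unsubscribe(self, q):
        """Model the client terminating the connection."""
        self._subscribers.remove(q)

    def update(self, state):
        """Change the resource and push the new full state to every open GET."""
        self._state = state
        for q in self._subscribers:
            q.put(state)
```

In a real server each queue would be drained by the handler that writes responses on the open connection.
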
Profile and messages: The SIP application server will host user profile at "/user/{user_id}". The concept of user identifier will be implementation dependent. In particular, the client could "POST /user" to create a new user account, and get the identifier in the response. It can then do a "GET /user/{user_id}" to know various URLs to get contact location of this user. It can then do a GET on that URL to fetch the contacts or do a SEND on that URL to send a message or call invitation.
 POST /user                            -- signup with a new account
request-body: {"email": "kundan@example.net", ...}
response-body: {"id": "kundan@example.net", "url": "/user/kundan@example.net" }

POST /user/kundan@example.net/message -- send offline messages (voice/video mail)
request-body: {"url": "rtsp://..."}

GET /user/kundan@example.net/message -- retrieve list of messages
response-body: [{"url": "rtsp://...", ...}, ...]
Miscellaneous: There are several other design questions that are left unanswered in the above text. Most of these can be answered intuitively. For example, the HTTP authentication credential defines the sender of a message, i.e., the SIP "From" header. Sequential or parallel forking is a decision left to the client application. The decision whether to use an SDP or XML-based session description is application and implementation dependent. For example, if the client is creating a conference on an RTSP server, it will just send the RTSP URL in the call invitations. Similarly, for Flash Player conferencing it will send an RTMP URL in the call invitation. A call property such as a participant's session description can be fetched by accessing the call resource on the server. Thus, whether an RTSP/RTMP server or a multicast address is used to host a conference is entirely client or application dependent. The application server will provide tools to allow such freedom.

Conclusion: A RESTful interface to a SIP application server is an interesting idea, described in this article. The idea looks feasible using existing software and tools, and hopefully will benefit both web developers and the SIP community by encouraging wider usage of SIP systems. The goal is not to replace SIP, but to provide a new mechanism that allows web-centric applications to use the services of a SIP application server, and to allow building such an easy-to-use SIP application server.

Several of the pieces described in this article are already implemented in Python, e.g., RESTful server tools, video conferencing application server, SIP-RTMP translation and SIP server and client library. The next step would be to combine these pieces to build a complete REST and SIP project. If you are interested in doing the project feel free to get in touch with me!

REST, RESTful and restlite

This post announces a new open source software: http://code.google.com/p/restlite/

What is restlite? Restlite is a light-weight Python implementation of server tools for quick prototyping of your RESTful web service. Instead of building a complex framework, it aims at providing functions and classes that allow you to build your own application.

restlite = REST + Python + JSON + XML + SQLite + authentication

Features
  1. Very lightweight module with single file in pure Python and no other dependencies hence ideal for quick prototyping.
  2. Two levels of API: one is not intrusive (for low level WSGI) and other is intrusive (for high level @resource).
  3. High level API can conveniently use sqlite3 database for resource storage.
  4. Common list and tuple-based representation that is converted to JSON and/or XML.
  5. Supports pure REST as well as allows browser and Flash Player access (with GET, POST only).
  6. Integrates unit testing using doctest module.
  7. Handles HTTP cookies and authentication.
  8. Integrates well with WSGI compliant applications.

Motivation: As you may have noticed, the software provides tools such as (1) a WSGI compliant router that does regular expression based request matching and dispatching, (2) high-level resource representation using a decorator and variable binding, (3) functions for converting from a unified list representation to JSON and XML, and (4) data model and authentication classes. These tools can be used independently of each other. For example, you just need the router function to implement RESTful web services. If you also want to do high-level definitions of your resources, you can use the @resource decorator, or bind functions to convert your function or object to a WSGI compliant application that can be given to the router. You can return any representation from your application. However, if you want to support multiple consistent representations of XML and JSON, you can use the represent function of request.response method to do so. Finally, you can have any data model you like, but implementations of a common SQL style data model and of HTTP basic and cookie based authentication are provided for you to use if needed.
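
To illustrate the first of these tools, here is a minimal sketch of a regular expression based WSGI router written with only the standard library. It is not the restlite API: the names router, get_login and the "router.args" environ key are my own for this illustration, so please check the project page for the real function signatures.

```python
import json
import re


def router(routes):
    """Build a WSGI application from (method, regex pattern, app) tuples."""
    compiled = [(method, re.compile(pattern), app)
                for method, pattern, app in routes]

    def application(environ, start_response):
        for method, pattern, app in compiled:
            match = pattern.fullmatch(environ["PATH_INFO"])
            if match and method == environ["REQUEST_METHOD"]:
                # Expose named groups to the matched application.
                environ["router.args"] = match.groupdict()
                return app(environ, start_response)
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]

    return application


def get_login(environ, start_response):
    """Example resource: echo the login URL as JSON."""
    email = environ["router.args"]["email"]
    body = json.dumps({"url": "/login/" + email}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]


app = router([("GET", r"/login/(?P<email>[^/]+)", get_login)])
```

The resulting app object can be served by any WSGI server, for example wsgiref.simple_server.make_server from the standard library.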

This software is provided in the hope that it helps you quickly realize RESTful services in your application without having to deal with the burden of large and complex frameworks. Any feedback is appreciated. If you have trouble using the software or want to learn more about how to use it, feel free to send me a note!

Protocol Jungle of Internet multimedia communication

The diagram shows several protocols for Internet multimedia communication. In the protocol jungle, a protocol is analogous to a species, and its real-world implementation or deployment is an animal of the species. Some animals or species compete with each other for survival. Some animals live with each other in harmony. Some animals do not care about or interact with each other since they live in different places, i.e., applications or domains. Evolution and mutation result in long lasting survival of some species whereas others become extinct. I use the metaphor of a protocol jungle instead of a protocol zoo because, unlike in a closely guarded and nurtured zoo, there is real competition between protocols when big companies have invested in certain protocols.




The diagram shows the species and its relationship with other species, e.g., whether A uses B or whether A and B are friendly. Due to space constraint, some items are grouped together, e.g., all the audio/video codecs, and some relationships are missing, e.g., RTMP is friendly with Speex. Ideally, we need a multi-dimensional representation to show multiple aspects of the jungle and how they are related. The following text lists the protocols that serve similar or common functions, and usually are competing within that function.

Function: Protocols
Structured data encoding: XML, ASN.1, RFC822, others
Audio encoding: G.711, G.723.1, G.722, G.726, G.728, G.729, MP3, Speex, Nellymoser, AMR, Silk, GIPS, etc.
Video encoding: H.261, H.263, H.264, MPEG, Sorenson, Vidyo, etc.
Media transport: RTP/RTCP, SRTP, ZRTP, Skype, IAX, RTMP, RTMFP
Rendezvous: SIP, H.323, Skype, Stratus/RTMFP
Session description: SDP, H.245, Jingle
Session negotiation: SIP/SDP, H.245, Jingle, Skype, RTMFP
Call signaling and control: SIP, H.225/Q.931, Skype, IAX, MGCP, SCCP (Skinny), RTMFP
Streaming media control: RTSP, RTMP
Session announcement: SAP
Connectivity: ICE/STUN/TURN, Skype, RTMFP
Remote Procedure Call: SOAP, XMLRPC, REST, RTMP
Programming calls: CGI, CPL, CCXML, MSCML
Programming voice dialog: VoiceXML
Instant messaging: XMPP, SIMPLE, MSRP
Presence: XMPP, SIMPLE
Shared resource access: REST, XMPP, XCAP
Shared state: XMPP, RTMP, HTTP


As you can see, a SIP system typically employs one protocol for one task or a few related tasks, but integrated monolithic systems such as those based on RTMP/RTMFP, Skype or IAX tend to combine multiple functions in a single protocol. I have not listed H.32x protocols other than H.323 because those are intended for non-IP networks. Nevertheless, there are several H.32x systems, e.g., for room based video conferencing or for carrying voice among carriers.

Interworking

With multiple protocols available for the same function, interoperability or interworking among those becomes important. I have talked about SIP and XMPP interworking in the last post. I have hands-on experience with several of the interworking scenarios among protocols shown in the diagram.

H.323-H.324: One of my projects in my first job was interworking between H.323 and H.324. Since both these systems use H.245 as the main session description and negotiation protocol, the interworking task is relatively simple. I also worked on part of an H.320 system to try to build H.323-H.320 interworking, but did not complete it.

SIP-H.323: One of my first projects during my M.S. at Columbia University was SIP-H.323 interworking. I have written the sip323 software and a couple of Internet drafts and papers [1] on this. My PhD thesis gives a complete interworking procedure for basic call setup and registration. The conclusion was that while basic call setup and registration are easy to interwork, full interworking of all the supplementary services is not feasible and not even needed in many cases. Since both SIP and H.323 use RTP/RTCP for media transport and can use the same set of codecs, the signaling gateway is efficient. The company SIPquest, which productized my software, demonstrated 10k simultaneous calls (this article).

SIP-RTSP: These protocols serve different purposes, but it is possible to build a system that needs both these functions in a standard compliant way. The sipum software is a voice mail and answering machine that uses SIP for calls and RTSP for recording and playback of media. Since both these use RTP/RTCP for media transport and can use the same set of codecs, the software is efficient as the media path can bypass the software. Please see my papers [1] for details.

SIP-RTMP: There have been several attempts at implementing Flash based SIP systems, and a SIP-RTMP translator is one of the approaches. Some existing projects that implement this are siprtmp, gtalk2voip, red5phone and flaphone. Since RTMP is an integrated streaming protocol which can also do control and RPC, the translator is inefficient because it needs to handle the media path as well.

SIP-Skype: Being a proprietary protocol, it is not easy to interwork with Skype. However, Skype itself uses SIP to allow trunking with PSTN providers, and recently there was some news about SIP-based Skype gateway for enterprise.

SIP-IAX: Although IAX is open, it is an integrated protocol that combines media and signaling in the same connection, and hence suffers from the same scalability problem as other integrated protocols like RTMP. Asterisk also has a SIP gateway so that it can talk to SIP-enabled devices, especially carrier equipment.

SIP-XMPP: There is an interest group that discusses this in depth. My last post gives more links about the interworking scenarios using a gateway or co-location in the client.

SIP-RTMFP: Given the P2P promise of RTMFP, a gateway between these two protocols will be able to connect the proprietary Adobe protocol with the rest of the world for a true web-based end-to-end media path. I haven't seen any system that does this.

SIP-H.320: This gateway is particularly useful for existing room based video conferencing systems that want to connect with more Internet-style SIP devices. The idea is similar to SIP-H.323 translator, and in fact a real deployment may use two gateways: SIP-H.323 and H.323-H.320 in practice.

RTMP-XMPP: Since RTMP and XMPP serve two completely different functions, there is no need to interoperate. However, people have built systems that use XMPP for messaging and signaling while using RTMP for the media path. Unfortunately, since the Jingle extension wants to define its own end-to-end session, it is not very useful for exchanging RTMP server session information. In particular, such systems use custom XMPP extensions based on presence and message to rendezvous, but do session control and call management in RTMP itself.

XMPP-SIMPLE: The SIP-XMPP interest group is also looking at SIMPLE-XMPP translation. However, given the disconnect between the two protocols, it is likely that all the presence and message updates will go through the gateway, and hence the translation will not be as efficient as one would want for presence and instant messages.

RTMP-Skype: Now this is going to be really tough because, firstly, Skype is still a proprietary protocol, and secondly, both of these are integrated protocols, hence requiring complete conversion of signaling and media. A specific example could be allowing people to access Skype from web pages, e.g., by having a simple RTMP server in the Skype application itself. This works if Skype is running on your local computer. Alternatively, you need the Flash application to connect using RTMP to Skype super-nodes running on public computers. This poses a security risk and is inefficient. Why inefficient? Because RTMP over TCP means that only applications on the public Internet will be able to receive the connection, and RTMP is not really good for real-time interactive communication because of its latency and buffering. However, if such gateways are incorporated in Skype, then it truly becomes ubiquitous to web applications.

RTMP-RTSP: These are two competing streaming protocols. Instead of having a gateway that translates between the two protocols, it might be better to build an integrated client or integrated server -- you can record using RTMP and view using Quicktime (RTSP), or you can use the same client to access real-time streams from RTMP or RTSP. Since RTMP incorporates RPC along with streaming control and media path, whereas RTSP is just streaming control, a complete translation of all the functions may not be feasible.

ASN.1-XML: There has been effort to standardize this, e.g., XER. The proposed H.325 standard by ITU-T will use XML while allowing compatibility with some of the predecessors which are in ASN.1 PER. ASN.1 and XML are just data formats and for the purpose of P2P-SIP, they are not very significant.

If you have data about the usage in real deployment for particular protocol(s), feel free to post your comment.

[1] My publication page http://kundansingh.com/#papers

SIP vs XMPP or SIP and XMPP?

(This post is unrelated to P2P, and describes the differences between the two sets of protocols SIP and XMPP. I have implemented both SIP and XMPP, as well as used several existing libraries for SIP and XMPP, so I can comment on the two sets of standards from a developer point of view as well)

History
SIP was invented to provide rendezvous for session establishment and negotiation on the Internet. XMPP (or Jabber) was invented to do structured data exchange such as synchronous or active presence and text communication among group of people. XMPP evolved from instant messaging and presence, whereas SIP evolved from Internet voice/video communication. Later, XMPP added support for session negotiation using the Jingle extension, and SIP community added extensions such as SIMPLE to support instant messaging and presence.

Technically, comparing SIP and XMPP is like comparing apples and oranges because the core protocols serve different purposes: session rendezvous/establishment vs structured data exchange. On the other hand, because of the extensions invented in both protocol worlds, SIMPLE and Jingle, they now have overlapping functions and can be compared. When one compares SIP vs XMPP, the comparison is actually SIP/SIMPLE vs XMPP for IM and presence, and/or SIP/SDP vs XMPP/Jingle for session negotiation. Even though the goals of the two sets of protocols are converging, there are fundamental architectural differences that I will enumerate in this article. There are other articles on SIP vs XMPP [1, 2, 3].

Differences: SIP vs XMPP
The following table lists the crucial differences between the two sets of protocols.


Purpose
  SIP: Provide rendezvous for session establishment and negotiation where the actual session is independent, e.g., over RTP media transport.
  XMPP: Provide a streaming pipe for structured data exchange between group of clients with the help of server(s), e.g., for instant messaging and presence.

Protocol
  SIP: Text-based request-response protocol similar to HTTP, where core attributes are signaled using headers, and additional data using message body, e.g., session description of capabilities.
  XMPP: XML-based client-server protocol to create a streaming pipe on which it sends request, response, indication or error using XML stanza between client and server, and between servers.

Transport
  SIP: Usually implemented in connection-less UDP as well as connection-oriented TCP transport. Also works over secure TLS transport.
  XMPP: Works over connection-oriented TCP or TLS transport.

Connection
  SIP: A user-agent is both client and server, hence can send or receive connections, in case of TCP or TLS. This does not work well with NATs and firewalls, hence extensions are defined to use reverse connections when server wants to send message to client.
  XMPP: The client initiates the connection to the server, which works well with NATs and firewalls. Additionally, extensions are defined such as BOSH to carry XMPP stanza over HTTP to work with very restricted firewalls.


There are many other differences, e.g., the way a URI is represented, or how authentication is done, or what kinds of messages are supported. I will not go into details of those since they tend to become too specific for the kind of application and we miss the important points. From a developer's point of view 'ease of programming' is very important.

Ease of programming
Both SIP and XMPP are easy to implement. My 39 peers project has modules for both in a few thousand lines of Python code. Although the basic protocol is easy to implement, a complete system such as a collaboration client with audio/video and messaging/presence support is very complex.

Because of the way these protocols have originated, they are well suited for certain kinds of applications. For example, if you want to build an audio/video communication system, it is better to start with SIP. Features such as interoperability with other VoIP phones, incorporating any-cast call distribution, or using existing VoIP provider for trunking are easy and readily available using SIP. If you want to build an instant messaging and presence client, it is better to start with XMPP. Features such as friends roster, group chat, blocking a user, storing offline messages, etc., are readily available using XMPP. Any advanced communication or collaboration system needs to include both these kinds of features.

XMPP has solved the application's problems and has defined mechanisms for several commonly used features in an instant messenger-type or shared state-type application, e.g., group chat, visiting card, avatars, etc. The emphasis is on application design, use cases, and practical solutions.

I think there are two main reasons for SIP's difficulty among developers: (1) the emphasis of SIP is on interoperability rather than application and feature design, and (2) the emphasis in SIP community is to have one protocol solve one problem, which requires implementing a plethora of protocols for a complete system. Let me explain these further.

When a new VoIP feature is implemented by one phone, it must interoperate with another phone or VoIP service provider. Hence most SIP extensions focus on the wire protocol and interoperability mechanisms. Although specifications of several SIP extensions are available, there is no evaluation or open reference implementation showing how they fit in the overall design. More recently efforts have been made, including my RFC 5638 (Simple SIP Usage Scenario for Applications in the Endpoints), to simplify the specifications for certain types of SIP applications -- those endpoints that want to work in the web and Internet world without the legacy of the traditional telephony systems.

Secondly, the SIP community tries to keep one protocol to solve one problem. Some extensions deviate from this guideline, but they are exceptions. The problem comes when this design principle involves implementing several distinct protocols just to get a complete system. For example, a SIP system incorporates other external mechanisms such as STUN, TURN, ICE, reverse-connection-reuse and rport-based symmetric request routing to solve the NAT and firewall traversal problem, and still does not guarantee media connectivity in all scenarios unless an HTTPS/TCP tunnel is used. Implementing instant messaging and presence involves implementing several RFCs and drafts related to Event, PUBLISH, CPIM, PIDF, XCAP and MSRP, and still the application does not have all the features of a commonly available XMPP client. In summary, the SIP community has created numerous extensions for solving several problems in a way that scares away a new developer!

As you can see, both these reasons (emphasis on interoperability and one-protocol-one-problem) are ideal in theory. So what is wrong? The practice. To solve these problems, (1) IETF working groups should not proceed with a draft without an open-source and simple reference implementation, (2) IETF working groups should build reference applications combining several protocols for different kinds of applications and evaluate (a) consistency and (b) ease of programming.

Consistency indicates whether the new extension is consistent with existing guidelines, best practices, protocol format, as well as design principles. For example, if an extension incorporates a new processing in the server which could have been done in the endpoint, then it is against the principle of intelligence in the endpoints. Such extensions should be marked as such so that developers know the trade-off. There are only a few good design principles, hence creating a consistency matrix of extensions against principles should be easy.

Ease of programming is determined by three things: (1) how easy it is to implement the set of protocols, (2) how easy it is to build a real application using those protocols, and (3) how easy it is to build the real application using existing platforms and tools. The first is usually available as a software library, the second as an application, and the third is re-usability. It should be easy not only to build the library but also to use the library to build a usable application. Every new extension adds new things to the library, which cause more interaction in the application and hence more complexity. When a software project is started, interoperability is usually not the highest requirement; re-usability, short development time and a real prototype application are the crucial requirements. Once the project is started on one path, it is very difficult to change the path by changing the core communication protocol. If there are reference implementations, then they not only help you get started quickly but also make it easy to see how much additional complexity a particular SIP extension brings to the application. An important programming quote: less is better than more!

The flexibility of SIP also comes with its limitations. For example, SIP is flexible enough to support both UDP and TCP transports. However, UDP is treated as a second-class citizen by many programming languages or libraries even today, e.g., Tcl didn't support a built-in UDP socket when it came out, and Adobe ActionScript does not have built-in UDP sockets for Flash Player even now. This prevents a developer from building a complete SIP stack as a Flash application, for example. If you look further, you would expect that if UDP is not supported then the platform is not suitable for real-time communication anyway. However, this does not prevent web-style developers from implementing XMPP in ActionScript, and perhaps tweaking it to support signaling of media sessions as well. The result is a broken or non-interoperable software application.

Reviewing the evolution of SIP vs XMPP specifications, I think XMPP has defined an architecture that allows adding new extensions easily and hence reduces the application complexity, whereas SIP extensions have focused on interoperability and the wire protocol without the much-needed attention to application design. While application design may seem unnecessary for a protocol specification, it is very important in the short term. Consider a developer who uses some data structures for representing protocol elements. If a new extension is defined in XMPP, and it reuses the existing XML format that gets readily mapped to the data structures, it becomes very easy to incorporate this new extension in the developer's source code. If a new extension is defined in SIP or SDP, and it re-uses an existing mechanism of another protocol for which there is no real implementation available, then the developer will first have to implement that other mechanism, then integrate it with SIP or SDP. The mechanism may have its own formatting which needs to be incorporated in the data structures. Essentially the developer will have to spend more time implementing such an extension. In the end, the actual format of the message, whether text-based or XML-based, is not terribly difficult once you have a library for message formatting and parsing. However, if an extension uses a different format, connections, sessions, etc., that are not readily available in existing libraries and tools, complexity arises. For example, adding ICE to SIP/SDP created a custom format whereas ICE in XMPP/Jingle re-used XML. Another example is how a particular endpoint is identified in XMPP vs SIP. In XMPP the URI itself is extended to include the resource, e.g., "user@domain/resource", whereas in SIP a new extension such as the globally routable user-agent URI (GRUU) is defined which is, well, more programming effort!
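As a rough illustration of the difference in programming effort, the following Python sketch splits an XMPP address into its parts and, for comparison, pulls the instance identifier out of a SIP GRUU. The function and type names are made up for this example, the sample addresses are illustrative, and the SIP parsing is deliberately simplified: it assumes the GRUU is carried in a "gr" URI parameter and ignores passwords and escaping.

```python
from typing import NamedTuple, Optional


class Jid(NamedTuple):
    """Parts of an XMPP address of the form [node@]domain[/resource]."""
    node: Optional[str]
    domain: str
    resource: Optional[str]


def parse_jid(text: str) -> Jid:
    # The resource is everything after the first "/"; it identifies
    # one particular endpoint (device or session) of the user.
    bare, slash, resource = text.partition("/")
    if "@" in bare:
        node, _, domain = bare.partition("@")
    else:
        node, domain = None, bare
    return Jid(node, domain, resource if slash else None)


def gruu_instance(uri: str) -> Optional[str]:
    """Return the value of the "gr" URI parameter, or None if absent.

    Simplified sketch: drops the scheme and any "?headers" part, then
    scans the ";name=value" parameters. A "gr" parameter without a
    value yields an empty string.
    """
    _scheme, _, rest = uri.partition(":")
    rest = rest.partition("?")[0]
    for param in rest.split(";")[1:]:
        name, _, value = param.partition("=")
        if name.lower() == "gr":
            return value
    return None
```

In the XMPP case the endpoint is part of the address that every library already parses, whereas in the SIP case the application has to know about one more URI parameter and the registration procedure that produces it.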

Scalability and performance
SIP is inherently a peer-to-peer protocol whereas XMPP is inherently client-server. Tasks that are easy in client-server systems, such as shared state, roster storage on the server, or offline messages on the server, are well accomplished with XMPP. On the other hand, one of the primary goals of SIP is to keep the intelligence in the endpoint. Ideally, a SIP proxy server does not even maintain the session state for the SIP dialog. Only a few messages in SIP, such as REGISTER and PUBLISH, are intended for client-server communication. In XMPP, the server is a must and all signaling communication goes through the server. There are message semantics defined for the types of messages, e.g., client-server information query, client-server-client message sending, client-server event publishing and server-client event notifications. Clearly client-server applications are limited by the scalability and performance of the server. For example, an instant messaging session in SIP need not go through the SIP server, which saves bandwidth and processing at the server. But that means you lose the offline message storage feature at the server. In real SIP applications today, servers have become an integral part of the system and hence the scalability difference diminishes. In fact, the bulky message format of SIMPLE makes it less scalable than XMPP for presence updates that go through the server. Note also that although P2P-SIP is possible, P2P-XMPP is not easy because XMPP is inherently client-server.
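The XMPP message semantics listed above map roughly onto its three top-level stanza kinds: iq, message and presence. The sketch below, using only the Python standard library, classifies a stanza by its element name. The table wording and the function name are mine, and tying event publishing and notification to presence alone is a simplification, since publish-subscribe extensions typically ride on iq and message stanzas.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified mapping from stanza name to the semantics
# described in the text.
STANZA_SEMANTICS = {
    "iq": "client-server information query",
    "message": "client-server-client message sending",
    "presence": "event publishing and notification",
}


def classify_stanza(stanza_xml: str) -> str:
    """Return the semantics of a single top-level XMPP stanza."""
    root = ET.fromstring(stanza_xml)
    # ElementTree writes namespaced tags as "{namespace}name";
    # keep only the local name.
    local_name = root.tag.rsplit("}", 1)[-1]
    return STANZA_SEMANTICS.get(local_name, "unknown")
```

A server that routes on these three names alone is one reason the server-side logic in XMPP stays uniform as extensions are added inside the stanzas.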

Once we know this, we understand that SIP and XMPP systems solve two different problems, are designed for two different architectures and have evolved under two different guidelines. From here, you can do two things: either try to incorporate or translate all the features of one system into the other and eventually give up, or design your system to use the best of both worlds.

Interworking and co-location
There have been attempts at interworking between SIP/SIMPLE and XMPP, especially for the IM and presence part [draft-saintandre-sip-xmpp-*, draft-veikkolainen-sip-voip-xmpp-*]. The first reference shows how to implement a gateway that connects SIP and XMPP networks, and the second shows how to implement a client that supports both SIP and XMPP and correlates the messages of the two protocols if the user is connected to both servers by the same provider. The popular OpenSER (now OpenSIPS and Kamailio) SIP server has a Jabber module to interwork with an XMPP network. People have developed clients that can understand both SIP and XMPP. Interworking is complex, and not all features can be completely translated or used from one protocol to another, unless the protocol is changed a lot with custom hacks.

Conclusion
Industry experts predict that both SIP and XMPP will stay for a long time. Rather than arguing about the differences or trying to mend the protocols to be like each other, one could build systems that use both these protocols for what each is good at. XMPP is good at creating application-level streaming, secure, client-server pipes that can be used for shared state, one-to-many message delivery and publish-subscribe-notify-type use cases. SIP is good at rendezvous for session establishment and at negotiating the parameters of a session that is then established separately.

To interwork between XMPP and SIP, you could (1) use a gateway at the server to translate the basic functions, (2) learn or send SIP parameters in an XMPP message from a client, or (3) use SIP to establish an XMPP chat session with a client. For example, a multi-protocol client of user alice@example.net may be talking to bob@home.com over SIP, discover that both clients support XMPP, and then add the other user to its XMPP roster or start an XMPP chat session. Alternatively, if they are chatting over XMPP and discover that the other side supports SIP as well, they can initiate a SIP session for a multimedia call. Implementing both protocols in the client is better than in the gateway for scalability and robustness. Other interworking architectures are possible, e.g., having two XMPP servers use SIP to communicate with each other or talk to a trunking provider, or having an integrated SIP-XMPP server that allows both SIP and XMPP users to seamlessly communicate with each other. These modes, however, are not interesting from a P2P point of view.
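Option (2) above could look roughly like the following Python sketch, in which one client advertises its SIP address inside an ordinary XMPP chat message and the other client extracts it. The namespace "urn:example:sip-contact", the element name "sip" and both function names are hypothetical and invented for this illustration; they are not taken from any XMPP extension or from the drafts cited earlier.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical namespace for the illustrative extension element.
SIP_CONTACT_NS = "urn:example:sip-contact"


def build_sip_advert(to_jid: str, sip_uri: str) -> str:
    """Build a chat message stanza that carries the sender's SIP URI."""
    message = ET.Element("message", {"to": to_jid, "type": "chat"})
    ET.SubElement(message, f"{{{SIP_CONTACT_NS}}}sip", {"uri": sip_uri})
    return ET.tostring(message, encoding="unicode")


def extract_sip_uri(stanza_xml: str) -> Optional[str]:
    """Return the advertised SIP URI, or None if the stanza has none."""
    root = ET.fromstring(stanza_xml)
    child = root.find(f"{{{SIP_CONTACT_NS}}}sip")
    return child.get("uri") if child is not None else None
```

A client that receives such a message and also has a SIP stack could then offer the user a multimedia call to the extracted address, while a client that does not understand the extra element can simply ignore it.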