Kundan Singh: Translating H.264 between Flash Player and SIP/RTP

Our SIP-RTMP gateway, part of the rtmplite project, translates the H.264 packetization between Flash Player's RTMP and SIP/RTP. There are some hurdles, but it is doable! In this article I present what it takes to achieve this interoperability. If you are interested in the implementation, please see the _rtmp2rtpH264 and _rtp2rtmpH264 functions in the siprtmp.py module.

Before jumping into the details, let us briefly review H.264 packetization. The encoder encodes a sequence of frames or pictures to generate the encoded stream, which is consumed by the decoder to re-create the video. The encoder generates what is called a NALU, or network abstraction layer unit. The decoder works on a single NALU at a time and needs a sequence of NALUs to decode. Each frame can have one or more slices. Each slice can be encoded in one or more NALUs. Certain pieces of information remain the same for all or many frames. For example, the sequence parameter set (SPS) and picture parameter set (PPS) are like configuration elements that need to be sent once or only occasionally instead of with every frame or NALU. The configuration parameters apply to the encoder, whereas the decoder should be able to decode any configuration.

RTMP Payload

Flash Player 11+ is capable of capturing from the camera and encoding in H.264 to send to an RTMP stream. Each RTMP message contains a header and data (or payload): the header contains crucial information such as the timestamp and stream identifier, and the payload contains the encoded video NALUs or the configuration data. The format of the payload is the same as that of the F4V/FLV tag for H.264 video in an FLV file. Each RTMP message contains one frame but may contain more than one NALU. The first byte contains the encoding type, which for H.264 is either 0x17 (intra frame) or 0x27 (non-intra frame). The second byte contains the packet type and is either 0x00 (configuration data) or 0x01 (picture data). The configuration data contains both SPS and PPS as described below.

  rtmp-payload := enc-type[1B] | type[1B] | remaining
  enc-type := is-intra[4b] | codec-type[4b]
  is-intra := 1 if intra and 2 if non-intra
  codec-type := 7 for H.264/AVC
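
As a minimal sketch of the grammar above, the first two bytes can be split as follows. The function name `parse_rtmp_video_header` is hypothetical and is not taken from siprtmp.py.

```python
def parse_rtmp_video_header(payload: bytes):
    """Split the first two bytes of an RTMP video payload into their fields.

    Returns (is_intra, codec_type, packet_type, remaining).
    """
    if len(payload) < 2:
        raise ValueError("RTMP video payload needs at least two bytes")
    enc_type, packet_type = payload[0], payload[1]
    is_intra = (enc_type >> 4) == 1   # upper 4 bits: 1 = intra, 2 = non-intra
    codec_type = enc_type & 0x0F      # lower 4 bits: 7 = H.264/AVC
    return is_intra, codec_type, packet_type, payload[2:]
```

A payload starting with 0x17 0x00 is therefore an intra H.264 message carrying configuration data, and one starting with 0x27 0x01 is a non-intra picture.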

If the type is configuration data then, following the FLV tag layout, the next three bytes are the same 24-bit delay field that picture data carries, set to zero. The next four bytes are the configuration version (0x01), the profile index, the profile compatibility and the level index. This is followed by one byte whose least-significant two bits determine the number of bytes used for the length of each NALU in subsequent picture data messages. For example, if the bits are 11b then it indicates 3+1=4 bytes of NALU length, and if the bits are 01b then it indicates 1+1=2 bytes of NALU length. Let's call this the length-size; the possible values are 1, 2 or 4. This is followed by a byte whose least-significant 5 bits give the number of subsequent SPS blocks. The remaining high bits of these two bytes are reserved and are normally set to ones, so they typically appear as 0xFF and 0xE1. Each SPS block is prefixed by a 16-bit length followed by the bit-wise encoding of the SPS as per the H.264 specification. This is followed by a byte containing the number of subsequent PPS blocks. Each PPS block is prefixed by a 16-bit length followed by the bit-wise encoding of the PPS as per the H.264 specification. Typically only one SPS block and one PPS block are present.


  remaining for config := delay[3B] | version[1B] | profile-idc[1B]
      | profile-compat[1B]
      | level-idc[1B]
      | length-flag[1B]
      | sps-count[1B] | sps0 ...
      | pps-count[1B] | pps0 ...
  delay := 0 for configuration data
  length-flag := reserved[6b] | value[2b] where value + 1 is length-size
  sps-count := reserved[3b] | count[5b] where count is number of sps
  pps-count := number of pps elements
  sps(n) := length[2B] | sps
  pps(n) := length[2B] | pps
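
The configuration record can be parsed along these lines. This is only a sketch: the name `parse_avc_config` is hypothetical, and the function expects the bytes starting at the version byte, leaving it to the caller to strip whatever precedes the record in the RTMP payload.

```python
import struct

def parse_avc_config(record: bytes):
    """Parse a configuration record that starts at the version byte.

    Returns (profile_idc, profile_compat, level_idc, length_size,
             sps_list, pps_list).
    """
    version, profile_idc, profile_compat, level_idc, length_flag, sps_count = \
        struct.unpack_from(">6B", record, 0)
    if version != 1:
        raise ValueError("unsupported configuration version %d" % version)
    length_size = (length_flag & 0x03) + 1   # only the low two bits matter

    def read_blocks(count, pos):
        # Each block is a 16-bit big-endian length followed by that many bytes.
        blocks = []
        for _ in range(count):
            (size,) = struct.unpack_from(">H", record, pos)
            pos += 2
            block = record[pos:pos + size]
            if len(block) != size:
                raise ValueError("truncated parameter set")
            blocks.append(block)
            pos += size
        return blocks, pos

    sps_list, pos = read_blocks(sps_count & 0x1F, 6)   # low five bits = count
    pps_count = record[pos]
    pps_list, pos = read_blocks(pps_count, pos + 1)
    return profile_idc, profile_compat, level_idc, length_size, sps_list, pps_list
```

The masks make the parser indifferent to the value of the unused high bits in the length-flag and sps-count bytes.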

If the type is picture data then the next three bytes contain a 24-bit number for the decoder delay value of the frame, which is applicable only to B-frames. The default baseline profile does not include B-frames. Thus the first five bytes of the picture data payload act as a header. This is followed by one or more NALU blocks. Each NALU block is prefixed by the length of the NALU that follows. The number of bytes used to encode this length is determined by the length-size mentioned earlier. The NALU itself is encoded as per the H.264 specification.


  remaining-picture := delay[3B] | nalu0 | nalu1 ...
  nalu(n) := length | nalu
  length := number in length-size bytes
  nalu := NAL unit as per H.264
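
A short sketch of how the picture data can be split into its NALUs, assuming the caller has already removed the two leading bytes (enc-type and type). The name `split_picture_nalus` is hypothetical, and the delay is read as an unsigned number for simplicity.

```python
def split_picture_nalus(body: bytes, length_size: int):
    """Split picture data (starting at the 3-byte delay) into NALUs.

    Returns (delay, [nalu, ...]).
    """
    if len(body) < 3:
        raise ValueError("picture data needs a 3-byte delay field")
    delay = int.from_bytes(body[:3], "big")
    nalus, pos = [], 3
    while pos < len(body):
        if pos + length_size > len(body):
            raise ValueError("truncated NALU length")
        size = int.from_bytes(body[pos:pos + length_size], "big")
        pos += length_size
        if pos + size > len(body):
            raise ValueError("truncated NALU")
        nalus.append(body[pos:pos + size])
        pos += size
    return delay, nalus
```

The `length_size` argument is the value learned from the most recent configuration data, so the translator has to remember it between messages.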

Each NALU starts with one byte of flags. The flags contain the most-significant bit of forbidden, the next 2 bits of nri (NAL reference index) and the final 5 least-significant bits of nal-type. There are several nal-types, such as 0x01 for non-intra regular pictures and 0x05 for intra pictures. Please see the H.264 specification for the complete list.
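
For illustration, the flags byte can be unpacked with a few shifts and masks. The helper name `parse_nal_header` is hypothetical.

```python
def parse_nal_header(first_byte: int):
    """Return (forbidden, nri, nal_type) from the first byte of a NALU."""
    forbidden = (first_byte >> 7) & 0x01   # most-significant bit
    nri = (first_byte >> 5) & 0x03         # next two bits
    nal_type = first_byte & 0x1F           # five least-significant bits
    return forbidden, nri, nal_type
```

For example, a first byte of 0x65 yields nri 3 and nal-type 5, and 0x41 yields nri 2 and nal-type 1.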

The camera-captured data encoded by Flash Player contains three NALUs in each RTMP message: the access unit delimiter (nal-type 0x09), the timing information carried as supplemental enhancement information (nal-type 0x06) and the picture slice (nal-type 0x01 or 0x05). Flash Player is capable of decoding other nal-types as well, and does not require the access unit delimiter or timing-information NALUs for decoding. I haven't seen any support for aggregated or fragmented NALUs in Flash Player.

RTP Payload

The RTP payload format for H.264 is specified in RFC 6184 and is typically supported in SIP-based video phones. The RTP header contains the crucial information such as the payload type, the timing data, and the sequence number, whereas the actual configuration and picture NALUs are sent in the payload as specified by this RFC. The first byte is the type containing one bit forbidden, two bits of nri and 5 bits of nal-type.


  nalu := nal-flags[1B] | encoded-data
  nal-flags := forbidden[1b] | nri[2b] | nal-type [5b]

In addition to the base nal-types of H.264, the RFC defines new nal-types for fragmentation and aggregation. The Internet, plagued by middle-boxes, NATs and firewalls, imposes a practical limit on the size of a UDP packet, and the typical MTU is around 1400-1500 bytes. The H.264 encoder can generate much larger encoded frames, which in many cases cannot be sent successfully as one frame per RTP packet over UDP. On the other hand, some encoded frames may be much smaller than the MTU, so the RTP headers add proportionally more overhead. These small frames can be aggregated for efficiency.

Many SIP video phones configure their H.264 encoders to use multiple slice NALUs in a single frame, unlike Flash Player which generates one picture NALU per frame. Thus traditional SIP video phones can keep each encoded payload small enough to be sent in a single RTP/UDP packet without the fragmentation of RFC 6184.

When a large encoded frame is fragmented into smaller fragments, nal-type=28 is used in the first byte of each fragment, followed by a second byte containing the actual nal-type of the frame as well as the start and end markers. This is followed by the actual encoded data. The RTP headers of all these fragments contain the same timestamp value. The last fragment of the frame has the marker set to true, whereas all the previous ones have it set to false. When multiple smaller frames are aggregated, nal-type 24 is used in the first byte of the aggregate payload, followed by one or more NAL data. Each NAL data is prefixed by the 16-bit length of the encoded NALU. There are non-trivial rules on how the nri is obtained and we refer you to the RFC for the details.


To fragment:
  let encoded-data = fragment0 | fragment1 | fragment2...
  encoded-data of fragment(n) := orig-nal-flags[1B] | fragment(n)
  orig-nal-flags := start[1b] | end[1b] | ignore[1b] 
     | orig-nal-type[5b]
  start := 1 if first fragment else 0
  end := 1 if last fragment else 0

To aggregate:
  encoded-data of aggregate := nalu0 | nalu1 | nalu2 ...
  nalu(n) := length[2B] | orig-nalu(n)
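
The two grammars above can be sketched as a pair of helpers. The names `fragment_nalu` and `aggregate_nalus` are hypothetical, and the aggregate header here takes the largest nri among the aggregated NALUs, which is my reading of the RFC rule. Since the RFC does not allow the start and end bits in the same fragment, the sketch refuses to fragment a NALU that fits in one piece.

```python
import struct

def fragment_nalu(nalu: bytes, max_fragment: int):
    """Split one NALU into fragmentation payloads (nal-type 28)."""
    header, data = nalu[0], nalu[1:]        # original flags byte is not repeated
    indicator = (header & 0xE0) | 28        # keep forbidden and nri bits
    nal_type = header & 0x1F
    chunks = [data[i:i + max_fragment] for i in range(0, len(data), max_fragment)]
    if len(chunks) < 2:
        raise ValueError("NALU fits in one fragment; send it as is")
    payloads = []
    for i, chunk in enumerate(chunks):
        start = 0x80 if i == 0 else 0
        end = 0x40 if i == len(chunks) - 1 else 0
        payloads.append(bytes([indicator, start | end | nal_type]) + chunk)
    return payloads

def aggregate_nalus(nalus):
    """Combine several small NALUs into one aggregate payload (nal-type 24)."""
    forbidden = 0x80 if any(n[0] & 0x80 for n in nalus) else 0
    nri = max(n[0] & 0x60 for n in nalus)   # assumption: largest nri wins
    out = bytes([forbidden | nri | 24])
    for n in nalus:
        out += struct.pack(">H", len(n)) + n
    return out
```

Each fragment payload then travels in its own RTP packet with the same timestamp, while an aggregate payload occupies a single RTP packet.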

In addition to sending the SPS and PPS packets in RTP, the video phones also negotiate the configuration data via an external protocol such as SIP/SDP. Since Flash Player does not do that, we will not discuss it further.

Translating

Now that we understand the packetization of H.264 for Flash Player as well as SIP/RTP, let us go over the details of the translation process.

The configuration data is sent periodically by Flash Player before every intra-picture frame. However, SIP phones may not send the configuration data periodically. It is desirable to cache the configuration data received from both sides, and re-use it when the other side connects. The first packet sent must contain the configuration data. It is also desirable to periodically send the configuration data to both the Flash Player and SIP sides from the translator, irrespective of whether the configuration data is received periodically. In our translator we send the configuration data before every intra frame.
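
Once the SPS and PPS are cached, an RTMP configuration payload can be rebuilt from them whenever the translator decides to resend it. The sketch below is not the code from siprtmp.py; the name `build_rtmp_config` is hypothetical. It assumes the layout of the FLV specification, where three zero bytes follow the packet type, and it sets the unused high bits of the length-flag and sps-count bytes to ones as ISO/IEC 14496-15 does. The profile, compatibility and level bytes are copied from bytes 1 to 3 of the SPS NALU.

```python
import struct

def build_rtmp_config(sps: bytes, pps: bytes, length_size: int = 4) -> bytes:
    """Build an RTMP configuration payload from one cached SPS and one PPS."""
    if len(sps) < 4:
        raise ValueError("SPS too short to carry profile and level")
    if length_size not in (1, 2, 4):
        raise ValueError("length_size must be 1, 2 or 4")
    out = b"\x17\x00"                          # intra + H.264, configuration data
    out += b"\x00\x00\x00"                     # assumption: zero delay bytes (FLV layout)
    out += b"\x01" + sps[1:4]                  # version, profile, compat, level
    out += bytes([0xFC | (length_size - 1)])   # reserved ones + length-size - 1
    out += bytes([0xE0 | 1]) + struct.pack(">H", len(sps)) + sps
    out += bytes([1]) + struct.pack(">H", len(pps)) + pps
    return out
```

Keeping the builder separate from the cache makes it easy to send the same configuration to a Flash Player that connects later.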

In the Flash Player to SIP/RTP direction, when the configuration data is received on RTMP, it is sent in two RTP packets, one for SPS and one for PPS. Both use the same timestamp and set the marker to true. When picture data is received on RTMP and needs to be sent to the RTP side, it is dropped until the configuration data has been sent to the RTP side. If the picture data is not dropped, all the NALUs are extracted. The last of the three NALUs per RTMP message is the actual picture NALU, which is sent to the RTP side as follows. Only nal-types 1 and 5 are used, whereas others are ignored. If the NAL size is less than 1500 bytes, it is used as is in the RTP payload with the marker set to true. If the NAL size is more, it is fragmented into smaller fragments, each of size at most 1500 bytes. Multiple fragmented RTP packets are generated as per the RFC. All but the last fragment have the marker set to false. An RTP marker of true indicates the end of the frame. All the fragments use the same timestamp value.
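
The picture path of this direction might look roughly like the following. It is a sketch under assumptions, not the logic of _rtmp2rtpH264: the name `picture_to_rtp` is hypothetical, the input is the list of NALUs already extracted from one RTMP message, and each fragment carries at most `max_size - 2` data bytes so that the two extra fragment bytes still fit within `max_size`.

```python
def picture_to_rtp(nalus, max_size=1500):
    """Return [(rtp_payload, marker), ...] for the NALUs of one RTMP message."""
    packets = []
    for nalu in nalus:
        nal_type = nalu[0] & 0x1F
        if nal_type not in (1, 5):
            continue                          # drop delimiter, timing, etc.
        if len(nalu) <= max_size:
            packets.append((nalu, True))      # small enough: send as is
            continue
        indicator = (nalu[0] & 0xE0) | 28     # fragmentation nal-type
        data, step = nalu[1:], max_size - 2
        chunks = [data[i:i + step] for i in range(0, len(data), step)]
        for i, chunk in enumerate(chunks):
            last = i == len(chunks) - 1
            flags = (0x80 if i == 0 else 0) | (0x40 if last else 0)
            # Marker is true only on the final fragment of the frame.
            packets.append((bytes([indicator, flags | nal_type]) + chunk, last))
    return packets
```

The caller is expected to put every returned payload in its own RTP packet, reuse one timestamp for all of them, and skip the call entirely until the configuration data has gone out.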

In the SIP/RTP to Flash Player direction, the configuration data is received in multiple RTP packets and is cached by the translator. When both SPS and PPS payloads have been received from the RTP side, we are ready to start streaming to the Flash Player side. Any incoming RTP packet is put in a queue. When the last packet in the queue (the one most recently received) has the marker set to true, the queue is examined and RTMP messages are created to be sent to the Flash Player side. Since Flash Player handles complete frames in each RTMP message, we need to wait until the marker is set to true so that we only send complete frames to Flash Player. If the RTMP stream is ready but we have not received the configuration data from RTP, or we have neither sent nor are currently sending the first intra frame to RTMP, then the received packets are dropped. If no intra frames are received for 5 seconds, then we send a fast-intra-update (FIR) request to the SIP/RTP side, so that it triggers the SIP phone to send an intra frame.

Once we decide that we can send packets to RTMP from the received RTP queue, we divide the queue into groups of packets with the same timestamp and the same nal-type while preserving the order of the packets. If the nal-type is 5, indicating that an intra frame is being sent to RTMP, then we also send the configuration data before the actual picture data. The configuration payload format was explained earlier and contains both PPS and SPS along with other elements. Each group of packets with the same timestamp and the same nal-type is sent as a single RTMP message, in the same order, containing one or more NALUs. If the nal-type is 1 or 5, the NALU from the RTP payload is used as is in the RTMP payload with the five bytes of header explained earlier. If the nal-type is 28, indicating fragmented packets, then all the fragmented payloads are combined into a single NALU. If the nal-type is 24, indicating an aggregated packet, then it is split into individual NALU data. The sequence of NALUs generated from this group of packets is then combined into a single RTMP payload to be sent to Flash Player.
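
A simplified sketch of this step follows. It handles one group of RTP payloads at a time and leaves out the queueing, the grouping by timestamp and the configuration message. The names `rtp_payloads_to_nalus` and `build_rtmp_picture` are hypothetical and not those used in siprtmp.py, and a length-size of 4 is assumed by default.

```python
import struct

def rtp_payloads_to_nalus(payloads):
    """Turn the ordered RTP payloads of one group into plain NALUs."""
    nalus, fragments = [], None
    for p in payloads:
        nal_type = p[0] & 0x1F
        if nal_type == 28:                       # fragmented NALU
            fu = p[1]
            if fu & 0x80:                        # start: rebuild original flags byte
                fragments = [bytes([(p[0] & 0xE0) | (fu & 0x1F)])]
            if fragments is not None:            # ignore pieces without a start
                fragments.append(p[2:])
                if fu & 0x40:                    # end: NALU is complete
                    nalus.append(b"".join(fragments))
                    fragments = None
        elif nal_type == 24:                     # aggregated NALUs
            pos = 1
            while pos + 2 <= len(p):
                (size,) = struct.unpack_from(">H", p, pos)
                nalus.append(p[pos + 2:pos + 2 + size])
                pos += 2 + size
        else:                                    # single NALU, used as is
            nalus.append(p)
    return nalus

def build_rtmp_picture(nalus, length_size=4):
    """Build an RTMP picture payload: 5 header bytes, then length-prefixed NALUs."""
    intra = any((n[0] & 0x1F) == 5 for n in nalus)
    out = bytes([0x17 if intra else 0x27, 0x01]) + b"\x00\x00\x00"
    for n in nalus:
        out += len(n).to_bytes(length_size, "big") + n
    return out
```

The `length_size` used here has to match the value announced in the configuration data sent to Flash Player.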

Gotchas

As mentioned in my previous article, there are a few gotchas. You must use the new-style RTMP handshake, otherwise Flash Player will not decode or display the received H.264 stream. You must use Flash Player 11.2 (beta) or later when using "live" mode, otherwise Flash Player does not accept multiple slice NALUs of a single frame. If audio and video are enabled, then the timestamp of video must be synchronized with the timestamp of audio sent to RTMP. Note that RTP picks a random initial timestamp for each media stream, so the audio and video RTP timestamp values are not easily correlated unless using RTCP or an external mechanism. You need to correlate the RTP timestamps of audio and video to the single timestamp clock of RTMP.
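
One simple way to put both streams on a single millisecond clock is sketched below. It is an illustration, not the approach taken in siprtmp.py: the class name `RtpToRtmpClock` is hypothetical, the clock rates (90000 for video, 8000 for audio in the usage note) and the `offset_ms` start offset are assumptions supplied by the caller, and it does not use RTCP.

```python
class RtpToRtmpClock:
    """Map one stream's RTP timestamps onto a shared millisecond clock."""

    def __init__(self, clock_rate, offset_ms=0):
        self.clock_rate = clock_rate   # RTP ticks per second for this stream
        self.offset_ms = offset_ms     # when this stream started on the shared clock
        self.first = None              # first RTP timestamp seen (random per RFC)

    def to_ms(self, rtp_ts):
        if self.first is None:
            self.first = rtp_ts
        # Modulo-2^32 difference tolerates wraparound of the RTP timestamp.
        # A packet older than the first one seen would map to a huge value.
        elapsed = (rtp_ts - self.first) & 0xFFFFFFFF
        return self.offset_ms + elapsed * 1000 // self.clock_rate
```

In use, the translator would keep one instance per stream, for example `RtpToRtmpClock(90000)` for video and `RtpToRtmpClock(8000, offset_ms=...)` for audio, with the offset taken from the arrival time of each stream's first packet.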

Conclusion

It is possible to do re-packetization of H.264 between Flash Player's RTMP and standard SIP/RTP without having to do actual video transcoding. This article explains the tricks and gotchas of doing so!

The implementation works between Flash Player 11.2 and a few SIP video phones such as Ekiga and Bria 3.

References

[1] Source code of SIP-RTMP translation.

[2] Three problems in interoperating with H.264 of Flash Player.

[3] Flash Player bug 2991202 fixed in version 11.2 (beta).

[4] RFC 6184: RTP payload format for H.264 video.

[5] F4V/FLV video file format specification.

[6] ITU-T recommendation H.264, "advanced video coding for generic audiovisual services", March 2010.

[7] ISO/IEC International Standard 14496-10:2008.

8 comments:

Srikanth Vavilapalli said...: Hi Kundan

Thanks for explaining the translation procedure in detail. Appreciate your efforts..I have a question on RTP to RTMP direction. Why does the translator need to queue all the RTP packets until the packet with marker=1 is received. As per your article I understood that flash clients only support single NAL packetization mode rt? So if each RTP packet received with a single NAL/aggregated/fragmented packetization mode gets translated in to a single NAL RTMP packet and sent to flash client, will the flash client not decode the NALs?

Srikanth; 12:23 PM
Kundan Singh said...: In my experiment I found that Flash Player does not display the picture if all NALUs of a frame are not present in a single RTMP message. Bria 3's RTP stack splits a frame into multiple RTP packets with last packet containing marker=true for a frame. So collecting all RTP packets in a queue until marker=true means that you will have complete frame with all the NALUs to compose the RTMP message payload. A single RTP packet from Bria 3 contains one NALU which may contain only partial frame.

My experiment was before 11.2 so things might have changed since then.; 1:42 PM
Srikanth Vavilapalli said...: Thanks Kundan

Will you be able to share with us the wireshark traces for a RTMP h264 video call capturing the packets from a flash player 11?

So as per your article, flash player11 embed only one h264 picture data NAL in one RTMP message and when translated to RTP the entire NAL may fit in a single MTU(<1500byte) packet rt?

I understood from your article that the flash client is sending 3 NALs (delimiter, timing and picture data NALs) in every RTMP message. Is it as per any h264 specification? I thought if RTMP and RTP clients negotiate the h264 profile and levels, why would an intermediate entity still need to bother about payload translation?; 5:02 AM
nakib said...: HI Kundan,

first of all, great work you have here!!

If I understand correctly, flash will only send packets in packetizationMode="1". Right?

Or is there any configuration to force flash to send small packets instead?

André; 7:31 AM
Kundan Singh said...: Hi Srikanth,

Regarding the traces, if you can send me an email to kundan10@gmail.com I will remember to send the tcpdump traces.

"..when translated to RTP the entire NAL may _not_ fit in a single MTU packet (for video)"

The delimiter is just a separator for an access unit, which is probably a single frame to display. So it is not violating the decoder, I would say. The timing information is optional so can be ignored. But since the SIP side does not send these nal-types I drop those when translating from RTMP to RTP. But I am not thoroughly familiar with the H.264 specification so cannot say if these nal-types violate the standard. You could try sending these nal-types to SIP and see if things still work fine...

Thanks!; 2:03 PM
Kundan Singh said...: HI Andre,

From what I recall, the packetization mode of 1 allows fragmentation and aggregation nal-types too. But Flash Player does not generate/receive that as per my previous experiment. Since the packetization mode is applicable only for RTP payload packetization (datagram oriented), it does not apply to RTMP (stream oriented) where the payload can be of much larger size so you don't need fragmentation or aggregation.

As per RFC6184 and given the current payload format of H264/RTMP, it is possible to translate to/from RTP packetization mode=1. But I don't think it is possible to translate to/from RTP packetization mode=0 (assuming MTU of 1500 bytes).

It will be interesting to see how RTMFP does packetization. Most likely it uses the chunk encoding mechanism already available in the protocol instead of RFC6184-based fragmentation and aggregation.

I didn't see any configuration to force flash to send small nal-sizes instead.

Thanks!; 2:12 PM
Srikanth Vavilapalli said...: Hi Kundan

I have sent a mail to your gmail ID requesting the flash client11 tcpdump traces. Can you plz provide me that?

Regards
Srikanth; 8:12 AM
Daniel Tremblay said...: Hi
Great article... i am not programer and i almost understand....

i do live rtmp streaming, i look for link aggregation and bonding but they are quite inefficient or very expensive solutions... i need to send one rtmp stream over multiple nics or 3g/4g modems... for redundancy and bandwidth addition of connections...
so, one question: based on p2p-sip can it be possible to get an application at the encoder splitting packets and dispatching the upload to all connections according to their efficiency, and at the server side an application reordering packets or frames before streaming...?

thanks

Daniel; 5:14 PM