Getting Real with the Real-time Transport Protocol

I’ve written quite a few blogs where I mention that SIP media is sent by something called RTP, but I’ve never described what that means.  Well, today I’ve decided to do something about that.

RTP stands for Real-time Transport Protocol and, like the bulk of the standards used by and with SIP, it is managed by the Internet Engineering Task Force (IETF).  Like other IETF protocols, RTP has its own RFC: RFC 3550.  It’s actually a fairly easy RFC to read and comprehend and I invite you to do so, but I think that over the next several paragraphs I can tell you just about everything you really need to know about RTP.

RTP was developed as a way to deliver real-time media across an IP network.  For the most part, that real-time media will be either voice or video.  One of the cool things about RTP is that it is used by both H.323 and SIP.  In other words, it’s possible for an H.323 client to communicate with a SIP client as long as you have something in the middle to translate between the two signaling protocols.  Provided both sides agree on the same codec, the media streams wouldn’t require transcoding because the RTP is exactly the same.

The protocol itself is quite skinny.  In fact, an RTP header can be as small as 12 bytes.  The aspects that you need to understand are the following:

Sequence Number:  The sequence number is a 16-bit value that puts an identifying number on each RTP packet sent.  The sender starts at a random value, increments the number by one for each new packet, and wraps back to zero after 65,535.  RTP is normally sent over an unreliable datagram protocol (e.g., UDP), so there are no retransmissions of lost packets.  However, the receiver can use the sequence number to learn if a packet has been dropped by the network or has arrived out of order.

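Timestamp:  The timestamp records the sampling instant of the first media sample in the packet, counted in units of the codec’s clock rate (8,000 Hz for narrowband voice codecs such as G.711 and G.729).  The receiver uses it to play back the packets at the appropriate intervals and to measure jitter.  With 20 ms packets at 8,000 Hz, for example, the timestamp advances by 160 from one packet to the next.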

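Payload Type:  This seven-bit value identifies the format (codec) of the media carried by RTP.  For instance, this is where G.711, G.729, or H.264 would be indicated.  Some codecs have fixed, well-known values (0 for G.711 mu-law, 8 for G.711 A-law, 18 for G.729), while others, such as H.264, use a dynamic value from 96 to 127 that is agreed upon in the SDP.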

RTP Payload:  This is the media itself, and the amount of data sent depends on the codec and the packetization interval.  For example, it might be 20 bytes of G.729 when each packet carries 20 ms of voice.  G.711 with that same 20 ms interval would yield 160 bytes of data.  The important thing to realize is that any codec’s data (G.729a, G.711, iLBC, etc.) will be contained here.
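To make the fields above concrete, here is a minimal Python sketch that unpacks the 12-byte fixed header described in RFC 3550.  The names RtpHeader, parse_rtp_header, and packets_missing are my own for illustration and do not come from any library.  The sketch assumes there are no CSRC entries or header extensions to skip, and the gap check assumes packets arrive in order.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    """Fields of the 12-byte fixed RTP header (hypothetical helper type)."""
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the fixed header; CSRC list and extensions are not handled."""
    if len(packet) < 12:
        raise ValueError("an RTP packet needs at least 12 header bytes")
    # Network byte order: two single bytes, a 16-bit sequence number,
    # a 32-bit timestamp, and a 32-bit SSRC.
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=b0 >> 6,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=b0 & 0x0F,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,  # the seven-bit payload type
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
    )


def packets_missing(prev_seq: int, curr_seq: int) -> int:
    """Packets skipped between two in-order sequence numbers (16-bit wrap)."""
    return (curr_seq - prev_seq - 1) & 0xFFFF


# A made-up G.729 packet: version 2, payload type 18, 20 bytes of payload.
sample = struct.pack("!BBHII", 0x80, 18, 1000, 160000, 0x1234ABCD) + bytes(20)
header = parse_rtp_header(sample)
print(header.payload_type, header.sequence_number, len(sample) - 12)
```

Feeding consecutive sequence numbers to packets_missing gives 0, and a jump from 1000 to 1003 gives 2.  A real receiver would also have to cope with reordered packets, which this sketch does not attempt.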

For an in-depth explanation of payload size, please refer to this article.
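If you want to check the payload arithmetic yourself, the following Python sketch works it out from the codec bit rate and the packetization interval.  The function names are hypothetical, and the 40 bytes of overhead assume an IPv4 header (20), a UDP header (8), and a bare RTP header (12) with no layer 2 framing counted.

```python
def payload_bytes(bitrate_bps: int, ptime_ms: int) -> int:
    """Media bytes in one packet for a constant bit rate codec."""
    return bitrate_bps * ptime_ms // 8000  # bits/s * ms / (8 bits * 1000 ms)


def timestamp_step(clock_rate_hz: int, ptime_ms: int) -> int:
    """How far the RTP timestamp advances from one packet to the next."""
    return clock_rate_hz * ptime_ms // 1000


def ip_bandwidth_bps(bitrate_bps: int, ptime_ms: int) -> int:
    """Bandwidth at the IP layer, assuming IPv4 + UDP + 12-byte RTP header."""
    overhead = 20 + 8 + 12
    packets_per_second = 1000 // ptime_ms
    packet_size = payload_bytes(bitrate_bps, ptime_ms) + overhead
    return packet_size * 8 * packets_per_second


# G.711 runs at 64 kbps and G.729 at 8 kbps; both use an 8,000 Hz RTP clock.
for name, rate in (("G.711", 64000), ("G.729", 8000)):
    print(name, payload_bytes(rate, 20), timestamp_step(8000, 20),
          ip_bandwidth_bps(rate, 20))
```

At 20 ms this gives 160 bytes for G.711 and 20 bytes for G.729, which is why the headers cost proportionally far more on a G.729 call than on a G.711 call.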


RTP has a sister protocol.  The RTP Control Protocol (RTCP) is sent periodically alongside an RTP stream to transport control and QoS information.  RTCP reports can tell you how many packets were sent and lost, what the jitter is, and, with a little arithmetic, what the round-trip delay is.  RTCP packets might help you find voice or video quality issues in your network.  For more information on QoS, please refer to my blog No Shirt, No Shoes, No Quality of Service.
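The jitter value that RTCP reports is a running estimate that RFC 3550 defines as J = J + (|D| - J)/16, where D is the change in transit time between two consecutive packets.  Here is a rough Python sketch of that calculation.  The function name and the (arrival time, RTP timestamp) input format are my own assumptions, and the sketch ignores timestamp wraparound.

```python
from typing import Iterable, Tuple


def interarrival_jitter(packets: Iterable[Tuple[float, int]],
                        clock_rate_hz: int = 8000) -> float:
    """Running jitter estimate in RTP timestamp units.

    packets holds (arrival_time_in_seconds, rtp_timestamp) pairs in
    arrival order.  Timestamp wraparound is not handled in this sketch.
    """
    jitter = 0.0
    prev_transit = None
    for arrival_s, rtp_ts in packets:
        # Transit time: arrival converted to clock units minus the timestamp.
        transit = arrival_s * clock_rate_hz - rtp_ts
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0  # smoothing from RFC 3550
        prev_transit = transit
    return jitter


# Three 20 ms packets; the third one shows up 5 ms late.
late_stream = [(0.000, 0), (0.020, 160), (0.045, 320)]
jitter_units = interarrival_jitter(late_stream)
print(jitter_units, jitter_units / 8000 * 1000)  # clock units, then ms
```

Dividing by 16 means a single late packet only nudges the estimate, so the reported jitter reflects a trend rather than one bad moment.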

Both RTP and RTCP can be encrypted.  SRTP (Secure RTP) prevents the bad guys from sniffing your network and capturing your conversations.  While not nearly as important to security, SRTCP (Secure RTCP) hides the QoS information about those calls.  For more about security, please refer to my blog Practicing Safe SIP.

There are a few more odds and ends involved with RTP, but this is pretty much all you need to know to be dangerous.  In the SIP class I teach, I have my students gather RTP packets with Wireshark and play back voice calls.  Of course, they couldn’t do that if the calls were established with SRTP, but that’s the whole point of security, isn’t it?



  1. Hi Andrew,

    Just spotted in your notes above: RTP Payload:

    “This is 20 bytes of data. For example, this might be 20 bytes of G.711 or G.729. In other words, this is the actual media that RTP is transporting. ”

    This surely is a 20ms sample you were thinking of here – which for G.711 @ 8kHz sampling would be 160 bytes. (1 sample of 8 bits – 1 byte – taking 125us to capture – so 20ms worth is 20ms/125us = 160 samples => 160 bytes). Of course that’s assuming a 20ms packetisation time – this of course can vary as it is normally set on the device or gateway, and can vary from 10ms to 50ms in practice. The packetisation interval of course affects the quality (delay), but the more samples per packet the less overhead. 20ms is really the point we’ve all got to as a trade off of delay vs. overhead.


    1. Neill, you are absolutely right and I corrected the text. I’ve grown too used to G.729 at 20 ms (which is what I use in my SIP class). Thanks for pointing out the error of my ways. Thanks also for your continued support of my humble efforts. 🙂

  2. […] I want to show you today is how to get to the actual media.  As you may recall, media is sent in RTP packets and since RTP is just another kind of IP packet, Wireshark captures those, too.  Go back to […]

  3. SBoobathy

    Andrew… Could you pls explain the RTP headers Padding, Extension, Marker, and what is the use of these headers…

    1. If you want to be an RTP expert, I suggest you read this. You will know everything there is to know about RTP.

    2. Honestly, this is really all you need to know. Unless you are writing a driver, this should be enough.

      1. SBoobathy

        Thank you AP….. 🙂

  4. Thank you Andrew for this valuable info
    Could you please dig deep into the RTP payloads and what it does exactly with its types
    RTP payload types helped me a lot fixing DTMF issues and video conferencing issues but I don’t know what it does to fix these problems.

