Welcome to part three in my series of “Writing Your First WebRTC Application” articles. Part one described the basic components involved in a WebRTC solution and part two presented the differences between Chrome and Firefox. Today, I want to spend some time on the client code from the standpoint of the WebRTC flow and API calls.
As I wrote in part two, Chrome and Firefox took slightly different approaches to the WebRTC API object names. I expect that those differences will go away once the specification is ratified, but until that happens, it’s best to create wrapper code that hides them from the main call flow logic.
Before you go any further, it would be best if you read the following articles:
Writing Your First WebRTC Application: Part One
Writing Your First WebRTC Application: Part Two
Understanding WebRTC Media Connections: ICE, STUN, and TURN
The goal of WebRTC is to enable multimedia calls to and from web browsers without the need for plug-ins such as Adobe’s Flash. This is accomplished by adding support for those connections directly into HTML5. In this way, the JavaScript that runs in a web page can directly invoke the underlying WebRTC infrastructure.
In all WebRTC solutions, there will be a calling party and a called party. While much of what each half does is identical, there are some very important differences.
A common flow that both caller and called will follow goes like this:
Connect Users by way of a Signaling Server
This can be accomplished in any number of ways, but the easiest method is for both users to visit the same website and connect via a shared signaling server. Users exchange some sort of name or token that allows for the unique identification of the session. This shared token might be a room number or a conversation ID.
The most common way for WebRTC clients to connect to a Signaling Server is the WebSocket API.
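As a rough sketch of this first step, a client might open a WebSocket and announce which conversation it wants to join. WebRTC does not define any of this, so the "join" message type, the field names, and the injectable WebSocketImpl parameter are all assumptions made for illustration.

```javascript
// Hypothetical helper: builds the first message a client sends once the
// WebSocket is open. The "join" type and the field names are assumptions.
function buildJoinMessage(roomToken, userName) {
  return JSON.stringify({ type: "join", room: roomToken, user: userName });
}

// Opens a connection to a signaling server and joins the room on open.
// WebSocketImpl is optional; it lets the function run outside a browser.
function connectToSignalingServer(url, roomToken, userName, WebSocketImpl) {
  var Impl = WebSocketImpl || WebSocket;
  var socket = new Impl(url);
  socket.onopen = function () {
    socket.send(buildJoinMessage(roomToken, userName));
  };
  return socket;
}
```

In a browser, a call might look like connectToSignalingServer("wss://signaling.example.com", "room-42", "alice"), where the URL and token are placeholders.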
Start the signaling between the two sides
Once the clients have shared a token to identify their conversation, they can start exchanging signaling messages through the WebSocket connection established above. Since WebRTC does not specify any specific signaling protocol, it is up to the solution developer to come up with one of his or her own.
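Because the signaling protocol is left to the developer, one simple option is a small JSON envelope. The sketch below is only one possible design; the message types ("join", "offer", "answer", "candidate", "bye") and the envelope fields are invented for this example.

```javascript
// Message types for a hypothetical signaling protocol (all assumed names).
var MESSAGE_TYPES = ["join", "offer", "answer", "candidate", "bye"];

// Wraps a payload in an envelope tagged with the conversation token.
function encodeSignal(type, room, payload) {
  if (MESSAGE_TYPES.indexOf(type) === -1) {
    throw new Error("Unknown signaling message type: " + type);
  }
  return JSON.stringify({ type: type, room: room, payload: payload });
}

// Parses an incoming envelope and rejects anything malformed.
function decodeSignal(raw) {
  var message = JSON.parse(raw);
  if (MESSAGE_TYPES.indexOf(message.type) === -1 || typeof message.room !== "string") {
    throw new Error("Malformed signaling message");
  }
  return message;
}
```

Keeping the type list in one place means both peers reject unexpected messages in the same way, which makes a home-grown protocol easier to debug.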
Each side will exchange information about its networks and how it can be contacted
This step is often referred to as “finding candidates” and its purpose is to allow web browsers to exchange the network information required to send direct media. Since most clients sit behind Network Address Translation (NAT) and use private IP addresses, they need a way to discover an address that their peer can actually reach.
To find a publicly addressable IP address, WebRTC will make use of STUN and/or TURN servers. These servers provide a client with an IP address that can be shared with its peer for media connections.
This process of using STUN and TURN servers is known as the Interactive Connectivity Establishment (ICE) framework. ICE will first attempt to use addresses discovered through STUN, and if a direct path is not possible, media will be relayed through a TURN server.
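To make the candidate idea more concrete, each ICE candidate is carried as a line of text that includes a "typ" field: host for a local address, srflx for an address learned through STUN, prflx for one learned during connectivity checks, and relay for an address on a TURN server. The helper names below are hypothetical, and the sample addresses come from documentation-only ranges.

```javascript
// Returns the candidate type from an ICE candidate line, or null if absent.
function candidateType(candidateLine) {
  var match = / typ (host|srflx|prflx|relay)\b/.exec(candidateLine);
  return match ? match[1] : null;
}

// Translates the candidate type into a plain-English description.
function describeCandidate(candidateLine) {
  switch (candidateType(candidateLine)) {
    case "host": return "local address";
    case "srflx": return "public address discovered through STUN";
    case "prflx": return "address discovered during connectivity checks";
    case "relay": return "relayed address allocated on a TURN server";
    default: return "unknown";
  }
}

// Sample candidate lines (illustrative values only).
var sampleCandidates = [
  "candidate:1 1 udp 2122260223 192.168.1.5 54321 typ host",
  "candidate:2 1 udp 1686052607 203.0.113.7 46154 typ srflx raddr 192.168.1.5 rport 54321",
  "candidate:3 1 udp 41885439 198.51.100.9 60000 typ relay raddr 203.0.113.7 rport 46154"
];
sampleCandidates.forEach(function (line) {
  console.log(candidateType(line) + ": " + describeCandidate(line));
});
```

Logging the type of each candidate as it arrives is a handy way to see whether a call is likely to go direct or through a relay.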
Negotiate media sessions
Once clients know how to communicate with each other, they need to agree upon the type and format of the media. This is accomplished with something called JavaScript Session Establishment Protocol (JSEP). JSEP uses Session Description Protocol (SDP) to identify the codec, resolution, bitrate, frame size, etc. of the media supported by a client.
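For a feel of what SDP looks like, the sketch below pulls the codec names out of a session description. In SDP, each "m=" line starts a media section and each "a=rtpmap" line names a codec for that section. The function name listCodecs and the sample SDP fragment are made up for this example, and a real description contains many more lines.

```javascript
// Collects codec names per media section from an SDP string.
function listCodecs(sdp) {
  var codecs = {};
  var currentMedia = null;
  sdp.split(/\r?\n/).forEach(function (line) {
    // "m=audio ..." or "m=video ..." starts a new media section.
    var media = /^m=(\w+) /.exec(line);
    if (media) {
      currentMedia = media[1];
      codecs[currentMedia] = codecs[currentMedia] || [];
      return;
    }
    // "a=rtpmap:<payload type> <codec>/<clock rate>" names a codec.
    var rtpmap = /^a=rtpmap:\d+ ([^\/]+)\//.exec(line);
    if (rtpmap && currentMedia) {
      codecs[currentMedia].push(rtpmap[1]);
    }
  });
  return codecs;
}

// Trimmed-down sample description (illustrative only).
var sampleSdp = [
  "v=0",
  "m=audio 9 UDP/TLS/RTP/SAVPF 111 0",
  "a=rtpmap:111 opus/48000/2",
  "a=rtpmap:0 PCMU/8000",
  "m=video 9 UDP/TLS/RTP/SAVPF 96",
  "a=rtpmap:96 VP8/90000"
].join("\r\n");

console.log(listCodecs(sampleSdp));
```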
Start RTCPeerConnection streams
After the signaling connection has been established and the clients have completed negotiating their media capabilities, they can start streaming media. This is accomplished with the WebRTC construct, RTCPeerConnection.
The RTCPeerConnection API
The RTCPeerConnection API is where the real work of establishing a peer-to-peer connection between the two web browsers occurs. It deals with the ICE handler, media streams, and the JSEP offer and answer processes. Access to the local microphone and camera comes from getUserMedia(), and the captured stream is then handed to the connection.
A web browser will create an RTCPeerConnection similar to the following:
var myPeerConnection = new RTCPeerConnection(configuration);
The configuration variable contains the key iceServers which consists of an array of STUN and TURN servers.
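A configuration object might be assembled along these lines. The server URLs and credentials are placeholders rather than real services, and buildConfiguration is a hypothetical helper; the urls, username, and credential keys are the standard fields of an iceServers entry.

```javascript
// Builds a configuration with one STUN server and an optional TURN server.
function buildConfiguration(stunUrl, turnUrl, turnUser, turnPassword) {
  var iceServers = [{ urls: stunUrl }];
  if (turnUrl) {
    // TURN servers normally require credentials.
    iceServers.push({ urls: turnUrl, username: turnUser, credential: turnPassword });
  }
  return { iceServers: iceServers };
}

// Placeholder servers and credentials, for illustration only.
var configuration = buildConfiguration(
  "stun:stun.example.org:3478",
  "turn:turn.example.org:3478",
  "demo-user",
  "demo-password"
);

// In a browser: var myPeerConnection = new RTCPeerConnection(configuration);
```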
The myPeerConnection object is used by both the calling and called parties, but its usage is slightly different.
For the calling party:
- Register an onicecandidate handler.
The onicecandidate handler sends ICE candidates to the caller’s peer using the signaling channel.
- Register an onaddstream handler.
The onaddstream handler displays the video stream once it is received from the called party.
- Register a message handler.
A message handler is used to process messages received from the called party. For example, if the message contained an RTCIceCandidate, it would be added to the myPeerConnection object using the addIceCandidate() method. If the message contained an RTCSessionDescription object, it would be added to myPeerConnection using the setRemoteDescription() method.
- Gain access to the local camera and microphone.
The function getUserMedia() captures the local media stream which can then be displayed on the local page. That stream must then be added to myPeerConnection using the addStream() method.
- Negotiate media.
In this step, a web browser starts the JSEP offer/answer process by creating an offer with the myPeerConnection method createOffer(). The callback passed to createOffer() receives the resulting RTCSessionDescription and adds it to myPeerConnection using the method setLocalDescription(). Finally, the RTCSessionDescription is sent to the remote peer via the signaling channel. The end result is that the caller’s SDP is set as the local description on the caller and as the remote description on the called party.
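The caller’s steps above can be sketched roughly as follows. This version swaps in the promise-based forms of createOffer() and setLocalDescription(), and the newer addTrack()/ontrack pair in place of addStream() and the stream handler, so treat it as one possible shape rather than the definitive code. The sendSignal and onRemoteStream callbacks and the message field names are assumptions.

```javascript
// pc is an RTCPeerConnection (or a stand-in with the same methods).
// sendSignal forwards an object to the peer over the signaling channel
// and onRemoteStream updates the page; both are hypothetical callbacks.
function startCall(pc, localStream, sendSignal, onRemoteStream) {
  // Forward each local ICE candidate to the peer as it is discovered.
  pc.onicecandidate = function (event) {
    if (event.candidate) {
      sendSignal({ type: "candidate", candidate: event.candidate });
    }
  };
  // Hand remote media to the page when it arrives.
  pc.ontrack = function (event) {
    onRemoteStream(event.streams[0]);
  };
  // Attach the local camera and microphone tracks.
  localStream.getTracks().forEach(function (track) {
    pc.addTrack(track, localStream);
  });
  // JSEP offer: create it, store it locally, then send it to the peer.
  return pc.createOffer()
    .then(function (offer) {
      return pc.setLocalDescription(offer).then(function () { return offer; });
    })
    .then(function (offer) {
      sendSignal({ type: "offer", description: offer });
    });
}

// Message handler for the caller: applies whatever the called party sends.
function handleCallerMessage(pc, message) {
  if (message.type === "candidate") {
    return pc.addIceCandidate(message.candidate);
  }
  if (message.type === "answer") {
    return pc.setRemoteDescription(message.description);
  }
  return Promise.resolve();
}
```

In a browser, localStream would come from getUserMedia() and pc from new RTCPeerConnection(configuration).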
For the called party:
The steps for the called party are very similar to the caller’s flow with the exception that the called party responds with an answer to the caller’s offer.
- Register an onicecandidate handler.
The onicecandidate handler sends ICE candidates to the called party’s peer using the signaling channel.
- Register an onaddstream handler.
The onaddstream handler displays the video stream once it is received from the calling party.
- Register a message handler.
A message handler is used to process messages received from the caller. For example, if the message contained an RTCIceCandidate, it would be added to the myPeerConnection object using the addIceCandidate() method. If the message contained an RTCSessionDescription object, it would be added to myPeerConnection using the setRemoteDescription() method.
- Gain access to the local camera and microphone.
The function getUserMedia() captures the local media stream which can then be displayed on the local page. That stream can then be added to myPeerConnection using the addStream() method.
- Negotiate media.
This is where the big differences between the caller and the called occur. When a “new session description” is received on the signaling channel, the called party will set the remote description with myPeerConnection.setRemoteDescription(). Next, myPeerConnection.createAnswer() is invoked to return the called party’s session description. That answer is then set with myPeerConnection.setLocalDescription() and sent back to the caller via the signaling channel.
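The called party’s side of the negotiation might look something like the sketch below, again using the promise-based forms of the RTCPeerConnection methods. Storing the answer with setLocalDescription() and sending it back are included because the caller needs the answer to finish the negotiation. The sendSignal callback and the message shape are assumptions.

```javascript
// pc is an RTCPeerConnection (or a stand-in with the same methods).
// remoteOffer is the session description received from the caller, and
// sendSignal is a hypothetical function that writes to the signaling channel.
function answerCall(pc, remoteOffer, sendSignal) {
  // Apply the caller's offer first; an answer cannot be created without it.
  return pc.setRemoteDescription(remoteOffer)
    .then(function () {
      return pc.createAnswer();
    })
    .then(function (answer) {
      // Store the answer locally before sending it back.
      return pc.setLocalDescription(answer).then(function () { return answer; });
    })
    .then(function (answer) {
      sendSignal({ type: "answer", description: answer });
      return answer;
    });
}
```

The ordering matters here: the offer is applied, the answer is created from it, and only then is the answer stored and sent.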
After the two sides have established a signaling connection and exchanged media descriptions, media can flow end-to-end. The session can then be managed (held, released, etc.) through the signaling channel.
Mischief Managed
Whew! I am going to stop here to allow everything to soak in before I start presenting the JavaScript to accomplish all of the above. Stay tuned for Part Four and oodles more fun.