Building a Resilient SIP Solution

I am the father of three boys. More accurately, I am the father of three men since the oldest is 29 and the youngest is 22. However, for most of their existence they were boys and to me that meant years of worrying about broken bones, cuts, bruises, and much worse. They all came through adolescence and the teenage years in relatively good shape, but I spent more hours in emergency rooms and urgent care clinics than I care to think about.

I want to think that a lot of the reason why they survived into their twenties is that I was always insistent on doing things safely. They wore helmets when they snowboarded or rode their bikes. Driving without seatbelts was punishable by a denial of car privileges. While I wasn’t overbearing, I did my best to ensure that they understood the risks of their activities and took the proper precautions.

All of this leads me to what I want to write about today – risk management for SIP.

There are a number of places in a SIP infrastructure where things can go wrong. There are network elements that move SIP sessions in, out, and around your enterprise. There are users who connect their SIP endpoints to that network. There are SIP servers that route sessions from place to place. There are trunks to and from your SIP carrier. Fortunately, there are just as many ways to ensure that things either don’t go wrong in the first place or heal themselves as quickly as possible.

Network First

The quality of your SIP traffic is directly related to the quality of your network. Before I have anyone move to SIP, I insist on an assessment to ensure that all routers and switches are of a vintage that can handle real-time media and all are correctly configured. Don’t just trust that your network is ready to support quality of service. Go through the proper tests to make sure that all the i’s are dotted and the t’s are crossed. Believe me, it will be money well spent.

In terms of SIP and trunks, I highly recommend that you have multiple ingress and egress points. This may mean multiple MPLS drops to a single location. It may also mean having a primary circuit in your main data center and a backup circuit in a disaster recovery site. The important thing is to guarantee that if one connection drops there is another one ready to take over.

You may even want to consider multiple providers. For instance, you might want to use AT&T for your primary circuits and Verizon for your backup. This will complicate your dial plan, but the cost of losing all call services may well justify the effort.
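To make the idea of a circuit that is "ready to take over" concrete, here is a minimal Python sketch of priority-based trunk selection. The trunk names, the priority numbers, and the `healthy` flag are all hypothetical; in a real deployment the health state would come from something like periodic SIP OPTIONS pings, and the selection logic would live in your SBC or session manager rather than in a script.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trunk:
    name: str       # hypothetical label: carrier plus site
    priority: int   # lower number is tried first
    healthy: bool   # assumed to be maintained by an external health check


def select_trunk(trunks: List[Trunk]) -> Optional[Trunk]:
    """Return the highest-priority healthy trunk, or None if every trunk is down."""
    candidates = [t for t in trunks if t.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda t: t.priority)


# Example layout: primary circuit, DR-site circuit, second-carrier backup.
trunks = [
    Trunk("carrier-a-main-dc", priority=1, healthy=True),
    Trunk("carrier-a-dr-site", priority=2, healthy=True),
    Trunk("carrier-b-backup", priority=3, healthy=True),
]
```

With all three circuits up, calls use the main data center trunk. Mark it unhealthy and the same function returns the DR-site trunk without any change to the caller.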

On the Border

The first line of SIP defense is your Session Border Controller (SBC). Every enterprise-grade SBC on the market today offers a high availability (HA) option. In an SBC HA pair, one box runs in active mode processing all incoming and outgoing SIP sessions. The second box is in standby mode waiting to be told to take over that SIP traffic. There is a link between the two that keeps the standby box synchronized with the active box. This allows the standby box to hit the ground running whenever the main SBC dies. It knows everything that was occurring at the time of failure and all active sessions continue on as if nothing happened.
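The active/standby behavior described above can be modeled in a few lines of Python. This is only a toy model under stated assumptions: the `Sbc` and `HaPair` classes are invented for illustration, session state is a plain dictionary, and the "sync link" is just a copy operation. No vendor's actual replication mechanism is implied.

```python
class Sbc:
    """A toy SBC that only tracks session state keyed by Call-ID."""

    def __init__(self, name):
        self.name = name
        self.sessions = {}


class HaPair:
    """Active/standby pair; every state change is mirrored to the standby."""

    def __init__(self, active, standby):
        self.active = active
        self.standby = standby

    def start_session(self, call_id, state):
        self.active.sessions[call_id] = state
        if self.standby is not None:
            # Stand-in for replication over the sync link.
            self.standby.sessions[call_id] = dict(state)

    def end_session(self, call_id):
        self.active.sessions.pop(call_id, None)
        if self.standby is not None:
            self.standby.sessions.pop(call_id, None)

    def fail_over(self):
        """Promote the standby; it already holds every replicated session."""
        if self.standby is None:
            raise RuntimeError("no standby available")
        self.active, self.standby = self.standby, None
```

The point of the model is the last method: because state was copied as each session started, promotion needs no recovery step, which is why established calls can survive the failure.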

An important aspect of HA SBCs today is that the active and standby must be connected by a Layer-2 network. This means that they must be on the same subnet. Since many primary and backup data centers only support Layer-3 between them, HA SBC pairs cannot be geographically split. This doesn’t negate the need for HA; it simply means that HA can only be campus-based and the two boxes must be local to one another.
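A quick sanity check on the same-subnet requirement can be done with Python's standard `ipaddress` module. The interface addresses below are made up, and matching subnets are only a necessary condition here: two addresses can share a prefix on paper without actually sharing a Layer-2 segment, so treat this as a planning aid, not proof.

```python
import ipaddress


def same_subnet(interface_a: str, interface_b: str) -> bool:
    """True when both 'address/prefix' strings fall in the same IP network."""
    net_a = ipaddress.ip_interface(interface_a).network
    net_b = ipaddress.ip_interface(interface_b).network
    return net_a == net_b


# Hypothetical HA interfaces for a campus pair and a split pair.
campus_pair = same_subnet("10.10.1.10/24", "10.10.1.11/24")
split_pair = same_subnet("10.10.1.10/24", "10.20.1.10/24")
```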

For a deeper dive into this, please see my article, SBC Resiliency.

Session Management

Session management is what turns SIP the protocol into SIP the architecture. I’ve written extensively about session management (please see my articles, Session Management and Avaya Session Manager vs. Session Border Controller), but the main thing to understand here is that it loosely connects disparate SIP services, servers, and endpoints together to form a cohesive whole.

Unlike SBCs, Avaya Aura Session Manager supports HA in an active-active manner. This means that all Session Managers are active at all times and nothing runs in standby mode. Session Managers are grouped into communities where each member understands what the others in the community are doing. This facilitates a seamless failover if one of the members dies or is taken off-line.

Another difference between the Aura Session Manager and an SBC is that Session Managers do not require a Layer-2 network between them. This allows an enterprise to spread Session Managers across the world and still maintain a single logical SIP solution.
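To contrast active-active with the active/standby model, here is a small Python sketch in which every member of a group takes traffic and a failed member is simply skipped. The `Community` class and the round-robin policy are assumptions made for illustration; the article does not describe how Session Manager actually distributes sessions among members.

```python
class Community:
    """Toy active-active group: all members serve traffic, none sits idle."""

    def __init__(self, members):
        self.up = {m: True for m in members}
        self._next = 0  # round-robin cursor (assumed policy)

    def mark_down(self, member):
        self.up[member] = False

    def mark_up(self, member):
        self.up[member] = True

    def route(self):
        """Return the next live member, skipping any that are down."""
        members = list(self.up)
        for _ in range(len(members)):
            candidate = members[self._next % len(members)]
            self._next += 1
            if self.up[candidate]:
                return candidate
        raise RuntimeError("no Session Manager available")
```

Note that there is no promotion step as there is with an HA pair. When a member drops out, the remaining members just absorb its share of new sessions.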

The Endpoints

The last layer of resiliency resides with the endpoints. As I wrote in In the Beginning: SIP Registration, SIP allows a single user to register multiple devices simultaneously. I always have at least two devices registered. When sitting at my desk, I take calls on my Avaya 9641 telephone, but those same calls will also ring One-X Mobile for iOS on my iPhone. It doesn’t matter if I am stationary or mobile. I never miss a call.
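Under the hood, multiple registration means the registrar keeps more than one contact binding per address-of-record, and an incoming call is offered to all of them. A minimal Python sketch follows; the `Registrar` class and the example URIs are hypothetical, and real registrars also track expiry times and priorities that are left out here.

```python
from collections import defaultdict


class Registrar:
    """Toy location service: one address-of-record maps to many contacts."""

    def __init__(self):
        self._bindings = defaultdict(set)

    def register(self, aor, contact):
        self._bindings[aor].add(contact)

    def unregister(self, aor, contact):
        self._bindings[aor].discard(contact)

    def targets_for(self, aor):
        """Every contact an incoming call to this user should ring."""
        return sorted(self._bindings.get(aor, ()))


registrar = Registrar()
registrar.register("sip:user@example.com", "sip:user@10.0.0.5")     # desk phone
registrar.register("sip:user@example.com", "sip:user@192.0.2.10")   # mobile client
```

If the desk phone drops off the network and its binding is removed, `targets_for` still returns the mobile client, so the call has somewhere to go.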

Lastly, Avaya telephones support something called SIP Outbound. This allows a single device to simultaneously register to multiple SIP servers. My 9641 can register to three different Session Managers at the same time (two in the core and one at the branch) and if one goes down, I am already registered to a backup. This provides for a seamless failover with no dropped calls.
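In the IETF specification for SIP Outbound (RFC 5626), a device ties its registrations together by sending the same `+sip.instance` value with a different `reg-id` on each flow. The Python sketch below builds Contact header values in that style for three servers. The server names, user, IP address, transport, and instance UUID are placeholders, and whether a given Avaya phone formats its headers exactly this way is an assumption, not something confirmed here.

```python
def outbound_contacts(user, device_ip, servers, instance_id):
    """Build one Contact header value per server, RFC 5626 style.

    The same instance_id identifies the device on every flow, while
    reg-id distinguishes the individual registrations.
    """
    contacts = {}
    for reg_id, server in enumerate(servers, start=1):
        contacts[server] = (
            f"<sip:{user}@{device_ip};transport=tls>"
            f";reg-id={reg_id}"
            f';+sip.instance="<urn:uuid:{instance_id}>"'
        )
    return contacts


# Hypothetical layout: two core servers and one branch server.
contacts = outbound_contacts(
    user="user",
    device_ip="192.0.2.10",
    servers=["sm-core-1", "sm-core-2", "sm-branch"],
    instance_id="00000000-0000-1000-8000-000A95A0E128",
)
```

Because all three registrations already exist before anything fails, losing one server leaves two working flows in place, with no need to re-register first.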

No Single Points of Failure

If you follow these guidelines you will create a resilient SIP architecture with no single points of failure. I haven’t done the math, but I can easily envision five or six nines of availability. Remember, downtime is lost money and who can afford that these days? So, like I always told my kids, wear your helmets and buckle up your seatbelts. No one wants to take their business to the emergency room.
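For anyone who does want to do the math, the standard approach is to multiply availabilities for components in series and multiply unavailabilities for redundant components in parallel. The Python sketch below does exactly that. The 99.9% per-box figure and the four-layer layout are illustrative assumptions, not measurements, and the model assumes failures are independent, which real networks rarely guarantee.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def parallel(*availabilities):
    """Redundant components: the layer is down only if all of them are down."""
    unavailable = 1.0
    for a in availabilities:
        unavailable *= 1.0 - a
    return 1.0 - unavailable


def series(*availabilities):
    """Chained layers: the service is up only if every layer is up."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total


def downtime_minutes_per_year(availability):
    return (1.0 - availability) * MINUTES_PER_YEAR


# Assumed figures: each individual box or circuit is 99.9% available,
# and each of four layers (trunks, SBCs, session managers, endpoints)
# is deployed as a redundant pair.
layer = parallel(0.999, 0.999)
end_to_end = series(layer, layer, layer, layer)
```

Under those assumptions each paired layer reaches about six nines, and the four layers chained together still land above five nines, which works out to roughly two minutes of downtime a year.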

Tags: Avaya, SBC, Session Border Controller, Session Management

8 comments

David · September 11, 2014 - 9:39 am

What’s your experience with an SBC behind a NAT’d firewall? I can get my SBC to work without the firewall, and the firewall to work without the SBC, with an IP Office system. But the two together break something and I cannot put my finger on it.
1. Andrew Prokop · September 11, 2014 - 9:50 am
  
  I am not sure if I can help, but would you mind telling me what SBC you have? I may know someone who knows someone. 🙂
David · September 11, 2014 - 9:57 am

Avaya SBCe Portwell v6.2

I would really like to turn NAT off on the SBC and let my firewall do the NAT. This has become a big headache for me, the dual NAT is breaking my SIP registration and killing my calls.
1. Andrew Prokop · September 11, 2014 - 9:59 am
  
  No promises, but I will see if I can find someone who can help.
  1. David · September 11, 2014 - 10:02 am
    
    Great, maybe we could converse offline. I’m willing to pay for help, and Avaya won’t open a case.
2. Andrew Prokop · September 11, 2014 - 10:11 am
  
  I have asked a contact at Avaya, but off hand I see some issues. There are IP addresses in SIP messages that will need to be translated. Typically, you want the SBC to do that. If you want the firewall to do the NAT you will need some type of SIP ALG. SIP ALGs can mess things up for SBCs.
  1. David · September 11, 2014 - 10:16 am
    
    Well, the way it’s set up now, they both do NAT, and the public IP is on the firewall. If I take the firewall out and move the public IP to the SBC outside interface, everything works. Most design guides show firewall-SBC-IPO, which I have not been successful with. I’ve tried turning off ALG on the firewall, and that doesn’t work. I’ll run it with a dual NAT, but there’s got to be one silly piece I’m missing. My ideal solution is dual NAT, but it seems to break things, so that’s why I brought up the possibility of only one NAT.
3. Andrew Prokop · September 11, 2014 - 10:25 am
  
  This is the answer I got back from Avaya:
  
  “The SBCE doesn’t do network natting, the SBCE does NAT traversal.
  
  For sip natting we recommend the FW not have SIP ALG’s on as they cause issues and you put the natted ip in public ip field on the SBCE for us to nat the sip messages. But if for some reason they won’t disable sip ALG’s and want FW to do the sip natting then don’t put the nat IP in the public IP field in the SBC. The customer will need to closely look at the FW to ensure it nat’s and unnat’s for all sip messages and that it doesn’t block any sip messages that it shouldn’t, we have seen both cases from FW’s.”
  
  Good luck!