
Advanced Persistent Threats

November 23, 2016


Coming to a network near you, or maybe your network!

 

There are things that go bump in the night, and that is all they do. But once in a while things not only go bump in the night; they can hurt you. Sometimes they make no bump at all! They hurt you before you even realize that you're hurt. No, we are not talking about monsters under the bed or real home intruders; we are talking about Advanced Persistent Threats. This is a trend that has been growing at a terrifying pace across the globe. It targets not the typical servers in the DMZ or the data center, but the devices at the edge. More importantly, it targets the human at the interface. In short, the target is you.

Now I say 'you' to highlight the fact that it is you, the user, who is the weakest link in the security chain. And like all chains, the security chain is only as good as its weakest link. I also want to emphasize that it is not you alone; it is me, or anyone, or any device for that matter that accesses the network and uses its resources. The edge is the focus of the APT. Don't get me wrong, if they can get in elsewhere they will. They will use whatever avenue they find available. That is the other point: the persistence. They will not go away. They will keep at it until they eventually find a hole, however small, and exploit it. Once inside, however, they will be as quiet as a mouse. Remaining unknown and undetected is the APT's biggest asset.

How long this next phase lasts is not determinable; it is very case specific. Many times it is months, if not years. The reason is that this is not about attacking; it is about exfiltrating information from your network and its resources and/or totally compromising your systems and holding you hostage. This will obviously be specific to your line of business. In the last article we made it plain that, regardless of the line of business, there are common rules and practices that can be applied to data discovery. This article hopes to achieve the same goal: to not only edify you as to what the APT is, but to illustrate its various methods and, of course, provide advice for mitigation.

We will obviously speak to the strong benefits of SDN Fx and Fabric Connect to the overall security model. But as in the last article, they will take a back seat, because what matters most is the primary practices and use of technology, regardless of its type, as well as the people, policies and practices that are mandated. In other words, a proper security practice is a holistic phenomenon that is transient and only as good as the moment in space and time it occupies. We will talk to our ability, and perhaps soon the ability of artificial intelligence (AI), to think beyond the current threat landscape and even learn to better predict the next steps of the APT. This is how we will close. So, this will be an interesting ride. But it's time you took it.

What is the Advanced Persistent Threat?

In the past we have dealt with all sorts of viruses, Trojans and worms. Many folks ask, what is different now? Well, in a nutshell, in the past these things were largely automated pieces of software that were not really discerning about the actual target. If you were a worm meant for a particular OS or application and you found a target that was not updated with the appropriate protection, you nested there. You installed and then looked to 'pivot' or 'propagate' within the infected domain. In other words, this malicious software was opportunistic and non-discretionary in the way it worked. The major difference with the APT is that it is targeted. APTs are also typically staffed and operated by a dark IT infrastructure. They will still use the tools, the viruses, the Trojans and the worms, but they will do so with stealth, and the intent is not to kill but to compromise, exfiltrate and even establish control. They will often set traps so that, once it is clear they have been discovered, they run a ransomware exploit as they leave the target. This gives them a lasting influence and an extension of impact.

In short, this is a different type of threat. It is like moving from the marching columns of ancient Roman armies to the fast, flexible mounted assaults of the steppe peoples out of Asia. The two were not well suited to one another. In the open lands, horseback was optimal; in the populated farm areas, and particularly in the cities, the Roman method proved superior. This went on for centuries until history and biology decided the outcome. But afterwards there was a new morphing: the mounted knight, a method which took the best from both worlds, combined them, and by doing so created a military system that lasted for almost a thousand years. So we have to say that it had a degree of success and staying power.

We face a similar dilemma. The players are different, as are the weapons, but the scenario is largely the same. The old is passing away and the new is the threat on the horizon. But I also want to emphasize that no one throughout the evolution of warfare threw a weapon away unless it was hopelessly broken. Folks still used swords and bows long after guns were invented. The point is that the APT will use all weapons and all methods of approach until it succeeds. So how do you succeed against it?

Well, this comes back to another dilemma. Most folks cannot account for what is on their networks. As a result, they have no idea what a normal baseline of behavior is. If you do not have that awareness, how do you think you will catch and see the transient anomalies of the APT? This is the intention of this article: to get you to think in a different mode.

The reality of it is that APTs can come from anywhere. They can come from any country, even from inside your own organization! They can serve any purpose: monetary, political, and so on. They will also tend to source their attacks from within the country where the target resides and use the ambiguity of DNS mapping to obscure any trace back 'home'. This is what makes them advanced. They have very well educated and trained staffs who mount a series of strong phases of attack against your infrastructure. Their goal is to gain command and control (C2) channels in order to either exfiltrate information or take actual control of certain subsystems. They are not out to expose themselves by creating issues. As a curious parallel, there has been a noted decrease in DoS and DDoS attacks on networks as the APT trend has evolved. It's not that these attacks aren't used anymore; it's just that they are now used in a very limited and targeted fashion, which makes them far more dangerous: often to cover up some other clandestine activity that the APT is executing, and even then only as a very last resort. For the APT, stealth is key to long-term success, so the decrease in these types of attacks makes sense when looked at holistically. But note that a major IoT DDoS attack recently occurred using home video surveillance equipment. Was it just an isolated DDoS, or was it meant to draw everyone's attention to it? We may never know. These organizations may be nation states, political or terrorist groups, even corporations involved in industrial espionage. The APT has the potential to be anywhere, and it can put its targets on anything, anywhere, at any time according to its directives. The reason they are so dangerous is that they are actual people who are organized and who use their intelligence and planning against you. In short, if they know more about your network than you do, you lose. Pure and simple.

So what are the methods?

There has been a lot of research on the methods that APTs will use. Because this is largely driven by humans, the range can be very wide and dynamic. Basically, it all comes down to extending the traditional kill chain, a concept first devised by Lockheed Martin to footprint a typical cyber-attack. This is shown in the illustration below.


Figure 1. The traditional ‘kill chain’
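For teams that want to map their own telemetry onto this model, it can help to represent the chain programmatically so that every alert gets tagged with the stage it most likely belongs to. Below is a minimal, illustrative Python sketch; the stage names and the example alert-to-stage mapping are my own assumptions for illustration, not a standard taxonomy or any vendor's schema.

    from enum import Enum, auto
    from typing import Optional

    class KillChainStage(Enum):
        RECONNAISSANCE = auto()
        INFILTRATION = auto()
        EXPLOITATION = auto()
        EXECUTION = auto()            # weaponization / tool staging
        EXFILTRATION = auto()
        COMMAND_AND_CONTROL = auto()

    # Hypothetical mapping from alert names (as your own tooling labels them)
    # to the stage they most likely indicate. Adjust to your environment.
    ALERT_TO_STAGE = {
        "external_port_scan": KillChainStage.RECONNAISSANCE,
        "phishing_url_click": KillChainStage.INFILTRATION,
        "privilege_escalation": KillChainStage.EXPLOITATION,
        "new_scheduled_task": KillChainStage.EXECUTION,
        "dns_tunnel_suspected": KillChainStage.EXFILTRATION,
        "beaconing_detected": KillChainStage.COMMAND_AND_CONTROL,
    }

    def stage_for(alert_name: str) -> Optional[KillChainStage]:
        # Return the kill-chain stage an alert most likely maps to, if known.
        return ALERT_TO_STAGE.get(alert_name)

Tagging alerts this way makes it easier to see, at a glance, how far along the chain an intrusion appears to have progressed.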

The concept of infiltration needs to occur in a certain fashion. An attacker can't just willy-nilly their way into a network. Depending on the type of technology, the chain might be rather long. As an example, compare a simple WEP hacking exercise against a full enterprise-grade WPA2 implementation with strong micro-segmentation. There are many degrees of difference in the complexity of the two, yet many still run WEP. The APT will choose the easiest and most transparent method.

Reconnaissance

In the first phase, that of identifying a target, a dark IT staff is called together as a team. This is known as the reconnaissance, or information gathering, phase. In the past, this was treated lightly at best by security solutions. Even now, with heightened interest in this area from security solutions, it tends to be the main extended avenue of knowledge acquisition. The reason is that much of this intelligence gathering can take place 'off line'. There is no need to inject probes or pivots at this point; that would be like shooting into a dark room and hoping you hit something. Instead, the method is to gain as much intelligence about the targets as possible. This may go on for months or even years, and it continues as the next step and even the ones after it occur. Note how I say 'targets'. This indicates that the target, when analyzed, will yield a series of potential target systems. In the past these were typically servers, but now this may not be the case. The APT is more interested in the users and edge devices. These devices are typically more mobile, with a wider range of access media types. There is also another key thing about many of these devices: they have you or me at the interface.
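One practical defensive counter is to periodically audit your own public footprint, since that is exactly what the reconnaissance phase feeds on. The sketch below assumes a hypothetical domain and wordlist of your choosing; it simply checks which host names under your domain resolve in public DNS. Anything that resolves is information an attacker can gather without ever touching your network.

    import socket

    def public_hosts(domain, candidate_names):
        # Return candidate host names under `domain` that resolve in public DNS.
        found = []
        for name in candidate_names:
            fqdn = "%s.%s" % (name, domain)
            try:
                addr = socket.gethostbyname(fqdn)   # public resolution only
                found.append((fqdn, addr))
            except socket.gaierror:
                pass                                # does not resolve publicly
        return found

    # Example usage (hypothetical domain and wordlist):
    # for fqdn, addr in public_hosts("example.com", ["vpn", "mail", "dev", "git"]):
    #     print(fqdn, "->", addr)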

Infiltration

Once the attackers feel that they have enough to move forward, the next step is to try to establish a beachhead in the target. In the past this was typically a server somewhere, but folks have been listening and following the advice of the security community. They have been hardening their systems and keeping up to date and consistent with code releases. Score one for us.

There is the other side of the network, though. This is more of a Wild West type of scenario. In the Old West of the United States, the law was a tentative thing. If you were in a town out in the middle of nowhere and some dark character came into town, your safety was only as good as the sheriff, who typically didn't last the first night. Your defense was 'thin'. Our endpoints are much the same way. As a result, truly persistent, professional teams that are advanced in nature will target the edge, and more specifically the human at the edge. No one is immune. In the past a phishing attempt was easier to see. This has changed recently in that many of these attempts are launched from a disguised email or other correspondence with an urgent request. The correspondence appears very legitimate. Remember, the APT has done its research. It appears to have the right format and headers; it is also from your manager. He is referring to a project that you are currently working on, with a link, and indicating that he needs to hear back immediately because he is in a board meeting. The link might be a spreadsheet, a Word document... the list goes on. Many people would click on this well-devised phish. Many have. There are also many other ways, some of which, in the right circumstances, do not even require the user to click.

There are also methods to create 'watering holes', which is basically the infiltration of websites that are known to be popular with, or required by, the target. Cross-site scripting is a very common set of methods to make this jump. Once the site is visited, the proper scripts are run and the infiltration begins. On a positive note, this has fallen off due to improvements in the JRE.

There are also physical means: USB 'jump sticks'. These devices can carry malware that can literally jump into any targeted system interface. There is no need to log on to the computer; only access to the USB port is necessary, and even then only momentarily. In the right circumstances a visitor could wreak a huge amount of damage. In the past this would have been felt immediately. Now you might not feel anything at all. But it is now inside your network. It is wreaking no damage. It remains invisible.

Exploitation (now the truth of the matter is that it’s complicated)

When the APT does what it does, if it is successful, you will not know it. The exploit will occur and, if undiscovered, continue on. It is a scary point to note that most APT infiltrations are only pointed out to the target after the fact by a third party such as a service provider or law enforcement. This is sad. It means that both the infiltration and exploitation capabilities of the APT are very high. The question is, how does this get accomplished? The reality is that each phase in the chain will yield information and demand decisions as to the next best steps in the attack. In other words, exploitation is the next step in the tree. As shown in the figure below, there are multiple possible exploits and further infiltrations that can be leveraged off of the initial vector. It is in reality a series of decisions that takes the intruder closer and closer to its target.


Figure 2. The Attack Tree

Depending upon what the APT finds as it moves forward, its strategy will change and optimize over time. In reality it will morph to your environment in a very specific and targeted way. So while many folks think that exploitation is the end of it, it really is not. In the past it was visible; now it's not. The exploitation phase is used to implant further into the network.

 

Execution or Weaponization

In this step some method is established for the final phase, which is either data exfiltration or complete command and control (C2). Note again that these steps may be linked and traced back. This is important, as we shall see shortly. Note that execution is a process that can use a multitude of methods, ranging from complete encryption (ransomware) to simple probes or port and keyboard mappers that gain yet further intelligence. Nothing is done to expose its presence. Ideally, it gains access to the right information and then begins the next phase.

 

Exfiltration

This is one of the options. The other is command and control (C2), which to some degree is required for exfiltration anyway, so APTs will do both. Hey, why not? Seeing as you are already in the belly of the beast, why not leverage every avenue available to you? It turns out that both require a common trait: an outbound traffic requirement. At this point, if the APT wants to pull the desired data out of the target, it must establish an outbound communication. This is also referred to as a 'phone home' or 'call back'. These channels are often very stealthy; they are typically encrypted and mixed within the profile of the normal data flow. Remember, while there are well-known ports assigned that we all should comply with, an individual with even limited skills can generate a payload with 'counterfeit' port mappings. DNS, ICMP and SMTP are three very common protocols for this type of behavior. It is key to look for anomalies in behavior at these levels. The reality is that you need some sort of normalized baseline before you can judge whether there is an anomaly. This makes total sense.

If you brought me to the edge of a river and said, "Ed, tell me the high and low water levels", I could not reliably provide you with that information based on what I am seeing at that moment. I would need to monitor the river for a length of time, to 'normalize' it, in order to tell you the highs and the lows, and even then with the possibility of extreme outliers. It is very much the same with security. We need to normalize our environments in order to see anomalies. If we can see these odd outbound behaviors early, then we can cut the intruder off and prevent the exploit from completing.
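In concrete terms, 'normalizing the river' can start as simply as keeping a per-host history of outbound volume and flagging hours that sit far outside that history. The sketch below is illustrative only; it assumes you already export per-host, per-hour outbound byte counts from flow records, and the three-standard-deviation threshold is an arbitrary starting point to tune, not a recommendation.

    from statistics import mean, stdev

    def outbound_anomalies(history, current, threshold=3.0):
        # history: dict of host -> list of past hourly outbound byte counts
        # current: dict of host -> outbound bytes in the hour being checked
        # Returns hosts whose current volume deviates strongly from their own baseline.
        flagged = []
        for host, observed in current.items():
            past = history.get(host, [])
            if len(past) < 24:            # not enough history to judge yet
                continue
            mu, sigma = mean(past), stdev(past)
            if sigma == 0:
                continue
            z = (observed - mu) / sigma
            if z > threshold:             # unusually large outbound volume
                flagged.append((host, observed, round(z, 1)))
        return flagged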

The APT needs systems to communicate in order for its tools to work. This means that it needs to leave some sort of 'footprint' as it looks to establish outbound channels. It will often use encryption to maintain a cloak of darkness over the transport of the data.

Remember, unlike the typical traditional threat, which you are probably well prepared for, the APT will look to establish a 'permanent' outbound channel. The reason I put quotes around permanent is that these channels may often jump sessions, port behaviors or even whole transit nodes if the APT has built enough supporting malicious infrastructure into your network. Looking at the figure below, if the APT has compromised a series of systems, it has a choice in how to establish outbound behaviors.


Figure 3. Established exfiltration channels

The larger the footprint the APT has, the better it can adjust and randomize its outbound behaviors, which makes them much more difficult to tease out. So catching the APT early is key; otherwise it is much like trying to stamp out a fire that is growing out of control.

 

Command and Control (C2)

This is the second option. Sometimes the APT wants more than just data from you; sometimes it wants to establish C2 channels. This can be for multiple purposes. As in the case above, it might be to establish a stealthy outbound channel network to support exfiltration of data. At the other end of the spectrum it might be complete command and control. Think power grids, high-security military systems, intelligent traffic management systems, automated manufacturing, subways, trains, airlines. The list goes on and on.

The reality of it is that once the APT is inside most networks it can move laterally. This could be through the network directly, but it might also be through social venues that traverse normal segment boundaries. So the lateral movement could be at the user account level, at the device level, or completely random based on a set of rules. Also, let's not forget the old list of viruses, web bots and worms that the APT can use internally within the target on a very focused basis. It has the vectors for transport and execution. Note that I do not say outright propagation; in this case it is much more controlled. As noted above, once the APT has established a presence at multiple toeholds it is very tough to knock it out of the network. A truly comprehensive approach is required to mitigate these behaviors. It starts at the beginning, the infiltration. Ideally we need to catch it there. But the reality is that in some instances this will not be the case. I have written about this in the past. With the offense there is the element of surprise. The APT can come up with a novel method that has not been seen before by us. So we are always vulnerable to infiltration to some degree. But if we cannot cut it off before it enters, we can work to prevent the exploit and the later phases of attack. While not perfect, this has merit. If we can make the infiltration limited and transient in its nature, the later steps become much more difficult to accomplish. We will speak to this later, as it is a very key defense tactic that, if done properly, is very difficult to penetrate past. Clearly these outbound behaviors are not the time to finally detect something, particularly if you pick it out of weeks of logs. By then the APT has already established its infrastructure and you are in reaction mode.

The overall pattern (hint – it's data centric)

By now, hopefully, you are seeing a strong pattern. It is still nebulous, and quite frankly it always will be. The offense still has a lot of flexibility. For us to think that the APT will not evolve is foolish. So we need to figure out a way to somehow co-exist with its constant and impinging presence. Due to its advanced and persistent nature (hence the APT acronym), the threat cannot be absolutely eliminated. To do so would require making systems totally isolated. And while this might be desired to a certain level for certain systems, as we will cover later, we have to expose some systems to the external Internet if we wish to have any public presence.

Perhaps this is another realization: we should strongly limit our public systems and strongly segment them, with no confidential data access. When you get down to it, the APT is not about running a DDoS attack on your point of sale. It is not even about absconding with credit card data in a one-time hit. None of these are good for you, obviously. But the establishment of a persistent, dark, covert channel out of your network is one of the worst scenarios that could evolve. By this time you should be seeing a pattern. It's all about the data. They are not after general communications or other such data unless they are doing further reconnaissance. They are about moving specific forms of information out or executing C2 on specific systems within the environment. Once we recognize this, we see that the intent of the APT is long-term residence, preferably totally stealthy. The figure below shows a totally different way to view these decision trees.


Figure 4. A set of scoped and defined decision trees

Each layer, from outer to center, represents a different phase in the extended kill chain. As can be seen, they move from external (access), to internal (pivot compromise), to target-compromise kill chains. You can also see that the external points are exposed vulnerabilities that the APT could leverage. These might be targeted and tailored email phishing or extensive watering holes. There may also be explicit attacks against discovered service points. The goal is to establish a network of pivot points that can allow for better exposure of the target. The series of decision trees all fall inward towards the target and, if the APT gets its way and goes undiscovered, this will be the footprint of its web within the target. It is always looking to expand and extend that web, but not at the cost of losing secrecy. Its major strength lies in its invisibility.

So the concept of a linear flow to the attack has to go out the window. Again, this is key to persistence. It is very cyclic in the way it evolves over time. The OODA loop comes to mind, which is typically taught to military pilots and quick-response forces: Observe, Orient, Decide, Act. The logic that the APT uses is very similar, because it is raw constructive logic. Trying to break OODA down any further becomes counterproductive; believe me, many have tried. So you can see that the OODA principle is well established in the APT: remain stealthy, morph and move. But common to all of this is the target. Note how everything revolves around that center set of goals. If you are starting to see a strategy of mitigation and you haven't read my previous article, then my hat is off to you. If you have read my article and see the strategy, then my hat is off to you as well. If you have not read my article and are puzzled, hang on. If you have read my last article and you are still puzzled, I need to say it emphatically: it's all about the data!

We should also start to see and understand another pattern. As shown in simpler terms in the diagram above, there is an inbound, a lateral and an outbound movement to the APT. This is the signature of the APT. While it looks simple, the mesh of pivots that the APT establishes can be quite sophisticated. But from this we can begin to discern that, if we have enough knowledge of how our network normally behaves, we can perhaps tease out these anomalies, which obviously did not exist before the APT gained residence. Note the statement I just made. Normalization means normalization against a known secure environment. A good time to establish this might be after compliance testing, for example. You want to see the network as it should be.

Once you have that, you should, with the right technologies and due diligence, be able to see any anomalies. We will talk about these in detail later, but they can range from odd DNS behavior to random encrypted outbound channels. We will speak to methods of mitigation and detection, and provide a strategic roadmap of goals against the APT, realizing that we have limited resources available in our IT budgets.
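As one example of what 'odd DNS behavior' can mean in practice, DNS tunnels tend to produce long, high-entropy query labels at a rate a normal client never would. A rough, illustrative check is sketched below; the length and entropy cut-offs are assumptions you would tune against your own normalized baseline, not universal constants.

    import math
    from collections import Counter

    def label_entropy(label):
        # Shannon entropy (bits per character) of a DNS label.
        counts = Counter(label)
        total = len(label)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def suspicious_queries(queries, min_len=30, min_entropy=3.5):
        # Flag query names whose first label is unusually long and random-looking.
        flagged = []
        for qname in queries:
            first_label = qname.split(".")[0]
            if len(first_label) >= min_len and label_entropy(first_label) >= min_entropy:
                flagged.append(qname)
        return flagged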

So is this the end of IT Security as we know it?

Given all of the trends that we have seen in the industry, one is tempted to throw up one's arms and give up. Firewalls have been shown to have shortcomings and compromises; encryption has been adopted as a normal mode of operation by the APT. What good is anti-virus in any of this? Many senior executives are questioning the value of the investment that they have made in security infrastructure, particularly if they lead an organization that has recently been compromised.

After all, encryption is now being used by the bad guys, as are many other 'security' approaches. The target has shifted from the server to the edge. Does this mean that we jettison all of what we have built because it is no longer up to the challenge? Absolutely not! It does, however, indicate that we need to rethink how we are using these technologies and how they can be combined with newer technologies that are coming into existence. Basically, the concept of what a perimeter is needs to change, and we will discuss this in detail later on; additionally, we need to start thinking more aggressively in our security practice. We can no longer be sheep sitting behind our fences. We must learn to be more like the wolves. This may sound parochial, but take a look at the recent news on the tracking and isolation of several APT groups, not only down to the country of origin but to the actual organization and, in some instances, even the site. This is starting to change the rules on the attackers.

But this is the stuff of advanced nation-state cyber-warfare; what can the 'normal' IT practitioner do to combat this rising threat? Well, it turns out there is quite a bit. And it turns out that, aside from launching your own attacks (which you obviously shouldn't do), there is not much that the nation states can do that you can't. So let's put on some different hats for this article. Let's make them not black, but a very nice dark gray. The reason I say this is that, in order to be really effective in security today, you need to think like the attacker. You need to do research; you should attempt penetration and exploitation yourself (in a nice, safe, ISOLATED lab of course!). In short, you need to know them better than they know you, because in the end it's all about information. We will return to this very shortly. But we also need to realize that we need to create a security practice that is 'data centric'. It needs to place the most investment in the protection of critical data assets, which are often tiered in importance. Gone are the days of the hard static perimeter and the soft gooey core. We need to carry micro-segmentation to the nth degree. The micro-segments need to correspond not only strongly but exactly to the tiers of risk assets mentioned earlier. Assets with more risk should be 'deeper' and 'darker' and should require stronger authentication and much more granular monitoring and scrutiny. All of this makes sense, but only if you have your data in order and have knowledge of its usage, movement and residence. This gets back to the subject of my previous article, and it sets the stage well for this next conversation. If you have not read it, I strongly urge you to do so before you continue.

 

Information and warfare

This is a relationship that is very ancient, as ancient as warfare itself. The basic premise is threefold. First, aggressors (and hence weapon technology, to a large part) have had the advantage in the theory of basic conflict. After all, it is difficult to design defenses against weapons that you do not yet know about. But that doesn't mean the defense lacks the ability to innovate. As a matter of fact, with a little ingenuity, almost anything used in offense can be used for defense as well. So we need to think aggressively in defense. We cannot be passive sheep. Second, victory is about expectation: expectation of a plan, of a strategy of some sort, to achieve an end goal. In essence, very few aggressive conflicts have no rationale; there is always a reason and a goal. Third, information is king. To a very large degree it will dictate the winners and the losers in any conflict, whether Neolithic or modern-day cyberspace. If the attacker knows more than you do, then you are likely to lose.

OK, Ed! You might be saying, wow, we are talking spears and swords here! Well, the point is that not much has changed since the inception of conflict itself. Spying and espionage go back as far as history, perhaps further. Let us not forget that it was espionage, according to legend, that was the downfall of the Spartan 300. I can give you dozens (and dozens) of examples of espionage throughout history, right up to modern times. Clandestine practice is certainly nothing new. But there may be a lot of things that we as security folk have forgotten along the way, things that the attackers might still remember. In today's world, if the APT knows more about your network and applications than you do, if they know more about your data than you do, you are going to lose.

Here you may be startled at the comment. How dare I. But extend the question to: "Do you have a comprehensive data inventory? Is it mapped to relevant systems and validated? Do you know where its residence is? Who has access?" Many cannot answer these questions. The problem is that the APT can. They know where your data is and they know how it moves through your network, or at least they are in a constant effort to understand that. They also understand where they can perform exfiltration of the data. If they know and you don't, they could be pulling information for quite a long time and you will not know it. Do you think I am kidding? Well, consider this: about 90% of the information compromises that occur are not discovered by internal IT security staff; they are reported to them by third parties such as their service providers or law enforcement agencies. Here is another sobering fact: the APT on average has had residence in the victim's network for 256 days.

So clearly things are changing. The ground, as it were, is shifting underneath our feet. The traditional methods of security are somehow falling short. Or perhaps they always were and we just didn't realize it until the rules changed. In any event, the old 'keep 'em out' strategy is no longer sufficient. We need to realize that our networks will at some point be compromised. We will talk a little later about some of the methods. Because of this, we need to shift our focus to detection. We need to identify the foreign entity and, hopefully, remove it before it does too much damage or gains too much knowledge. So IT security as we know it will not go away. We will still require firewalls and DMZs, we will still require encryption and strong identity policy management as well as intrusion detection technologies. We will just need to learn to use them differently than we have in the past. We also have to utilize new technologies and concepts to create discrete visibility into the secure data environments. New architectures and practices will evolve over time to address these imminent demands. This article is intended to provide baseline insight into these issues and how they can be addressed.

 

It’s all about the user (and I’m not talking about IT quality of experience!)

Whenever you see a movie about hacking, you always see someone standing in front of several consoles, cracking into various servers and doing their mischief. It's fast moving and very intense. I always laugh, because this is most definitely not how it works. Slow and steady is always best, and the server is most definitely not the place to start. It's you. You are the starting point.

Think about it: you move around. You have multiple devices. You probably have less stringent security practices than the IT staff that maintains the server. You are also human. You are the weakest link in the security chain. Now, I've spoken about this before, but it has always been from the perspective of IT professionals who are not as diligent as they should be in the security practice of their roles. Here we are talking about the normal user, who may not be very technically savvy at all. Also, let's consider that as humans we are all different. Some are more impulsive. Some are more trusting. Some simply don't care. This is the major avenue, or rather set of avenues, that an attacker could use to gain compromised access into the network. Let's look at a couple.

Deep Sea Phishing –

Many folks are aware of the typical 'phishing' email that says, 'Hey, you've won a prize! Click on the URL below!' Hopefully, most folks now know not to click on the URL. But the problem is that this has moved into new dimensions, with orders of magnitude more intelligence behind these types of attacks. As I indicated earlier, much of the reconnaissance that an APT does is totally off of your network. They use publicly posted information: news updates, social media, blog posts (yikes, I'm writing one now!). They will not stop there either. There is a lot of financial data and profiling, as well as the tagging of individuals to certain organizational chains and projects. Once the right chain is identified, the phishing attack is launched. The target user receives a rather normal-looking email from his or her boss. The email is about a project that they are currently working on, and the boss needs to hear back on some new numbers that are being crunched: could they take a look at the attached spreadsheet and get back by the end of the day? Time is of the essence, as we are coming to the end of the quarter. Many would open the spreadsheet, and understandably so. HTML-enabled email makes it even worse, in that the SMTP service chain is obscured, making it difficult to see the odd routing, and even then many users wouldn't notice it. Many data breaches have occurred in just such a scenario. Once the URL is clicked or the document is opened, the malicious code goes to work and establishes two things. The first is command and control back to the attacker; the second is evasion and resilience. From that point of presence the attacker will usually escalate privilege on the local machine and then utilize it as a launching point to gain access to other systems.
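Some of this can be caught mechanically before a human ever has to exercise judgment. The sketch below uses Python's standard email parser to pull the Received chain out of a message and flag mail whose claimed sender domain never appears anywhere in that chain. This is a crude illustrative heuristic of my own, not a substitute for SPF/DKIM/DMARC enforcement, and the domain-matching logic is deliberately simplistic.

    from email import message_from_string
    from email.utils import parseaddr

    def sender_domain_missing_from_chain(raw_message):
        # Crude check: does the From: domain appear nowhere in the Received chain?
        msg = message_from_string(raw_message)
        _, from_addr = parseaddr(msg.get("From", ""))
        if "@" not in from_addr:
            return True                       # malformed From: is itself suspicious
        from_domain = from_addr.split("@", 1)[1].lower()
        received = msg.get_all("Received") or []
        chain = " ".join(received).lower()
        return from_domain not in chain       # True means worth a closer look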

The Poisoned Watering Hole or Hot Spot –

We all go out on the web, and we all probably have sites that we hit regularly. We all go out to lunch, and most of us probably go to our favorite places regularly. This is another thing attackers can leverage: the fact that we are creatures of habit. So let's change the scenario. Let's say that the attacker gets a good profile of the target's web behavior. They also learn where the target goes for lunch. But they don't even need to know that; typically they will select a place that is popular with multiple individuals in the target organization, so that the probability of a hit is greater. Then they will emulate the local hot spot with aggressive parameters to force the targets to associate with it. Once that occurs, the targets gain internet access as always, but now the attacker is in the middle. As the targets go about using the web, they can be redirected to poisoned sites. Once the hit occurs, the attacker shuts down the rogue hot spot and then waits for the malicious code that is now resident on the targets to dial back. From the target user's perspective, the WLAN dropped and they simply re-associate to the real hot spot. Once the users go back to work, they log on and, as part of it, establish an outbound encrypted TCP connection to the APT. These will not be full standing sessions, however, but intermittent. This makes the behavior seem more innocuous. The last thing that the APT wants is to stand out. From there the scenario proceeds much as before.

In both of these scenarios the user is the target. There are dozens of other examples that could be given, but I think these two suffice. The human behavioral dimension is just too wide to expect technology to fill the role, at least at this point. Until then we need firm, clear policies that are well understood by everyone in the organization. There also needs to be firm enforcement of the policies in order for them to be effective. This is all in the human domain, not in the technology domain. But technology can help.

 

It’s all about having a goal as well

When an advanced persistent threat organization first puts you in its sights, it usually has a pretty good idea of what it is looking for or what it wants to do. Only amateurs gain compromised access and then rummage or blunder about. It's not that an APT wouldn't take information that it comes across if it found it useful, but they usually have a solid goal and a corresponding target data set. What that is depends on what the target does. Credit card data is often a candidate, but it could be patient record data, or confidential financial or research information; the list can be endless. We discussed this in my previous article on data discovery and micro-segmentation practices. It is critical that the critical data gets identified and accounted for, because you can bet that the APT has done so.

This means that there is deliberate action on the part of the APT. Again, only amateurs are going to bungle about. The other thing is that time is, unlike in the movies, not of the essence! The average residency number that I quoted earlier illustrates this. In short, they are highly intelligent about their targets, they are very persistent, they will wait many months for the right opportunity to move, and they are very quiet.

This means that you need to get your house in order regarding the critical data that you need to protect. You need to know how it moves through your organization, and you need to establish a solid idea of what normal is within those data flows. Then you need to move to fight to protect it.

The Internet – The ultimate steel cage

When you think about it, you are in the ultimate steel cage. You have to have a network. You have to have an Internet presence of some sort. You need to use it. You cannot go away; if you do, you will go out of business. You are always there, and so is the APT. The APT also will not go away. It will try and wait, and wait and try, and go on and on until it succeeds in compromising access. This paradigm means that you cannot win. No matter what you as a security professional do in your practice, the war can never be won. But the APT can win. It can win big. It can win to the point of putting you out of business. This creates a very interesting set of gaming rules, if you are interested in that sort of thing. In a normal zero-sum game, there is a set of pieces or tokens that can be won. Two players can sit down and, with some sort of rules and maybe some random devices such as dice, play the game. The winner is the first player to win all of the tokens. But if we remove the dice we have a game more like chess, where the players lose or win pieces based on skill. This is much more akin to the type of 'game' that we like to think we play in information security. Most security architects I know do not use dice in their practice. Now, in a normal game of chess, each player is more or less equal, with the only real delta being skill. But remember, you are sitting at the board with the APT. So here are the new rules. You cannot win all of his or her pieces. You may win some, but even if you come down to the last one, you need to give it back. What's more, there will not be just one. There will be 'some number' of pieces that you cannot win. Let's say that a quarter, or maybe even half, of the pieces are 'unwinnable'. Well, it is pretty clear that you are in a losing proposition. You cannot win. The best you can do is stay at the board for as long as you can. Then also consider that the APT's skill and resources may be just as great as yours, if not greater. Does that help put things in perspective?

So the scenario is stark, but it is not hopeless. The game can actually go on for quite some time if you are smart in the way you play. Remember, I said 'some number' of pieces that you cannot win; I did not say which types. If you look at a chess board, you will note that the power pieces and the pawns are exactly half the count each. This means that you could win all or most of the power pieces and leave the opponent with a far diminished ability to do damage to you, as long as you aren't careless. So mathematically the scenario is not hopeless, but it is not bright either. While you can never win, you can establish a position of strength that allows you to stand indefinitely.

Realize that the perimeter is now everywhere

Again, the old notion that we can somehow draw a line around our network and systems is becoming antiquated. The trends in BYOD, mobility, virtualization and cloud have forever changed what a security perimeter is. We have to realize that we are in a world of extreme mobility. Users crop up everywhere, demanding access from almost anywhere, with almost any consumer device. These devices are also of consumer grade, with little or no thought given to systems security. As a result, these devices, if not handled correctly with the appropriate security practices, become a very attractive vector for malicious behavior.

This means that the traditional idea of a network perimeter that can be protected is no longer sufficient. We need to realize that there are many perimeters, and that they can be dynamic due to the demands of wireless mobility. This doesn't mean that firewalls and security demarcations are no longer of any use; it just means that we need to relook at the way we use them and pair them with new technologies that can vastly empower them.

It is becoming more and more accepted that micro-segmentation is one of the best strategies for a comprehensive security practice and for making things as difficult as possible for the APT. But this can't be a simple set of segments off of a single firewall; it has to be multiple tiered segments with traffic inspection points that can view the isolated data sets within. The segmentation provides two things. First, it creates a series of hurdles for the attacker, both on the way in and on the way out as they seek the exfiltration of data. Second, and perhaps less obviously, segmentation provides isolated traffic patterns with very narrow application profiles and sets of interacting systems. In short, these isolated segments are much easier to 'normalize' from a security perspective. Why is this important? It is important because in the current environment 100% prevention is not a realistic proposition. If an APT has targeted you, they will get in. You are dealing with a very different beast here. The new motto you need to learn is that "Prevention is an ideal, but detection is a MUST!"

In order to detect, you need to know what is normal. To make this clear, let's use the mundane example of a shoplifter in a store. The shoplifter wants to look like any other normal shopper; they will browse and try on various items like anyone else. In other words, they strongly desire to blend into the normal behavior of the rest of the shoppers in the store. An APT is no different. They want to blend into the user community and appear like any other user on the network. As a matter of fact, they will often commandeer normal users' machines by the methods discussed earlier. They will learn the normal patterns of behavior and try as much as possible to match them. But at some point, in order to shoplift, the shopper needs to diverge from the normal behavior. They need to use some sort of method to take items out of the store undetected. In order to do this, they need to avoid the direct view of video surveillance and find a moment when they can 'lift' the items. But regardless of the technique, there has to be a delta: point A, product; point B, no product. The question is whether it will be noticed. This is what detection is all about. In a retail environment it is also accepted that a certain amount of loss needs to be 'accepted' as the normal business risk of operations. The reason is that there is a cost point beyond which further expense on prevention and detection does not make any fiscal sense.

It is very much the same thing with APTs. You simply cannot seal off your borders. They will get in. The questions are how far they penetrate, how much they are able to discover about you, and what information they are able to pull out. There is a common joke in the security industry. It goes like this: "If you want a totally secure computer, unplug all network connections. Seal it off physically with thick walls, including any and all RF, with no entrance. Then take several armed guards and an equivalent number of very large attack dogs and place them around the perimeter 24x7. Also be sure that you have total independence of power, which means you need a totally separate micro-grid that in turn cannot be compromised, secured using the above methods." Like all tech sector jokes, the humor is dry at best and serves to show the irony of a thought process. Such a perfectly secure computer would be perfectly useless! We, like the shop owner, need to assume and accept a certain amount of risk and exposure to be online. It is simply the reality of the situation, hence the steel cage analogy I used earlier. So detection is of absolutely key importance to the overall security model.

How to catch a thief

So the next question is: how do you detect that an APT is in your network? Additionally, how do you do it as early as possible, taking into consideration that time is on the attacker's side, not yours? Once again, it serves to revisit the analogy of the shoplifter. Retail outfits usually have store detectives. These individuals are specialists in retail security. They know the patterns of behavior and inflections of movement that will draw attention to a certain individual. Many of them have a background in psychology and have been specifically trained to watch for telltale signs. Note that such indicators cannot cause arrest or even ejection from the store. They can only serve to highlight that additional attention is needed on a certain individual. Going further, there are often controls around dressing rooms, such as counting items before entry and upon exit. This can be viewed as both a preventive and a detective measure. There are also usually RF tags that will trigger an alarm if the item is removed from the premises. Often these tags are ink-loaded so that they will spoil the product if removal is attempted without the correct tool. All of this can be more or less replicated in the cyber environment. The key is what to look for and how to spot it.

A compromised system

This is the obvious thing to look for, as it generally all starts here. But the problem is that APTs are pretty good at hiding and staying under cover until the right time. So the key is to look for patterns of behavior that are unusual from a historical standpoint. This gets back to the concept of normalization. In order to know that a user's behavior is abnormal, it is important to have a good idea of what the normal behavior profile is. Some things to look for are unusual patterns of session activity: lots of peer-to-peer activity where in the past there was little or none. Port scanning and the use of discovery methods should be monitored as well. Look for unusual TCP connections, particularly peer-to-peer or outbound encrypted connections.

Remember that there is a theory to all types of intrusion. First, an attacker needs to compromise the perimeter to gain access to the network. Unless the attacker is very lucky, they will not land where they need or want to be. This means that a series of lateral and northbound moves will be required in order to establish a foothold and command and control. This is why it is not always a good idea to take a suspicious or malicious node off of the network; you can gain quite a bit by watching it. As an example, if a newly compromised system begins to implement a series of scans and no other behavior, then it is probably an isolated or early compromise. If the same behavior is accompanied by a series of encrypted TCP sessions, then there is a good probability that the attacker has an established footprint and is working to expand their presence.
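A simple way to turn this into something operational is to watch, per internal source, how many distinct destinations and ports it touches in a short window; a workstation that suddenly fans out across the address space behaves very differently from one doing its normal job. The sketch below works on generic flow records, and the thresholds are placeholders to be tuned against your own baseline rather than recommended values.

    from collections import defaultdict

    def scan_like_sources(flows, max_hosts=50, max_ports=100):
        # flows: iterable of (src_ip, dst_ip, dst_port) tuples for one time window.
        # Returns sources that fan out to unusually many hosts or host/port pairs.
        hosts = defaultdict(set)
        ports = defaultdict(set)
        for src, dst, dport in flows:
            hosts[src].add(dst)
            ports[src].add((dst, dport))
        return [src for src in hosts
                if len(hosts[src]) > max_hosts or len(ports[src]) > max_ports]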

Malicious or suspicious activities

Once again, normalization is required in order to flag unusual activities on the network. If you can set up a lab to provide an idealized 'clean' runtime environment, a known-good pattern and corresponding signature can be developed. This idealized implementation provides a clean reference that is normalized by its very nature. After all, you don't want to normalize an environment with an APT in it, now do you? Once this clean template is created, it is easy to spot deltas and unusual patterns of behavior. These should be investigated immediately. Systems should be located and identified, along with the corresponding user if appropriate. There may or may not be confiscation of equipment. As pointed out earlier, sometimes it is desirable to monitor the activity in a controlled fashion, with the option of quarantine at any point.

 

Exfiltration & C2 – There must be some kind of way out of here (said the joker to the thief)

In order for any information to leave your organization, there has to be an outbound exfiltration channel that has been set up beforehand. Obviously, this is something that the APT has been working to accomplish in the initial phases of compromise. Again, going back to the analogy of the shoplifter, this is another area where the APT has to diverge from the normal behavior of a user. The APT needs to establish a series of outbound channels to move the data out of the organization. In earlier days, a single outbound encrypted TCP channel would be established to move data as quickly as possible. But now that most threat protection systems are privy to this, they tend to build networks that utilize a series of shorter-lived outbound sessions, moving only smaller portions of the data so as to blend in with the normal activities of the network. But even with this improvement in technique, they still have to diverge from the normal user pattern. If you are watching closely enough, you will catch it. But you have to watch closely and you have to watch 24x7.
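One divergence that survives even this 'low and slow' technique is timing regularity: call-backs scheduled by malware tend to be far more evenly spaced than human-driven traffic. A rough illustrative check is below; it assumes you can extract per-destination connection timestamps from your logs, and the variation cut-off is an assumption to be tuned, not a standard.

    from statistics import mean, pstdev

    def regular_beacons(conn_times_by_dst, min_connections=8, max_cv=0.2):
        # conn_times_by_dst: dict of destination -> sorted list of connection
        # timestamps (seconds). Flags destinations contacted at suspiciously even
        # intervals (low coefficient of variation), a common beaconing signature.
        flagged = []
        for dst, times in conn_times_by_dst.items():
            if len(times) < min_connections:
                continue
            gaps = [b - a for a, b in zip(times, times[1:])]
            avg = mean(gaps)
            if avg <= 0:
                continue
            cv = pstdev(gaps) / avg          # 0 means perfectly periodic
            if cv < max_cv:
                flagged.append((dst, round(avg, 1), round(cv, 3)))
        return flagged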

Here is a list of things that you want to look for:

1). Logon activity

Logons to new or unusual systems can be a flag of malicious behavior. New or unusual session types are also an important flag to watch for, particularly new or unusual outbound encrypted sessions. Other flags are unusual times of day or locations. Watch also for jumps in activity or velocity, as well as shared account usage or the use of privileged accounts.

2). Program execution

Look for new or unusual program executions, the execution of programs at unusual times of the day or from new or unusual locations, or the execution of a program from a privileged account rather than a normal user account.

3). File access

You want to catch data acquisition attempts before they succeed, but if you can't, you at least want to catch the data as it attempts to leave the network. Look for unusually high-volume access to file servers or unusual file access patterns. Also be sure to monitor cloud-based sharing uploads, as these are a very good way to hide in the flurry of other activity.

4). Network activity

New IP addresses or secondary addresses can be a flag. Unusual DNS queries should be looked into, particularly those to domains with a bad reputation or no reputation at all. Look for correlation between the above points and new or unusual network connection activity. Also look for unusual or suspicious application behaviors. These could be dark outbound connections that leverage lateral movement internally; many C2 channels are established in this fashion.

5). Database access

Most users do not need to access the database directly, so direct access is an obvious flag, but also look for manipulated application calls that perform sensitive table access, modifications or deletions. Also be sure to lock down the database environment by disabling many of the added options that most modern databases provide; be aware that many of them are enabled by default. Know which services are exposed out of the database environment. An application proxy service should be implemented to prevent direct access in a general fashion.

6). Data Loss Prevention methods

Always monitor sensitive data movement. As pointed out in the last article, if you have performed your segmentation design correctly according to the confidential data footprint, then you should already have isolated communities of interest that you can monitor very tightly, particularly at the ingress and egress of the micro-segments. Always monitor FTP usage as well as, as mentioned earlier, cloud services.

Analysis, but avoid the paralysis

The goal is to arrive at a risk score based on the aggregate of the above. This involves the session serialization of hosts as they access resources. As an example, a new secondary IP address is created and an outbound encrypted session is established to a cloud service, but earlier in the day, or perhaps during the wee hours, that same system accessed several sensitive file servers with the administrator profile. Now, this is a very obvious set of flags; they can and will be increasingly more subtle and difficult to tease out. This is where security analytics enters the picture. There are many vendors out there who can provide products and solutions in this space, and several firms and consortiums that provide ratings for those vendors, so we will not attempt to replicate that here. The goal of this section is to explain how to use it.
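A minimal version of such a risk score can be nothing more than a weighted sum of the indicator categories above, rolled up per host per day. The weights and threshold below are invented purely for illustration; in practice they come from tuning against your own environment and incident history.

    # Hypothetical indicator weights for the categories discussed above.
    WEIGHTS = {
        "unusual_logon": 2,
        "unusual_program_execution": 3,
        "bulk_file_access": 4,
        "suspicious_network_activity": 3,
        "direct_database_access": 5,
        "sensitive_data_movement": 5,
    }

    def risk_score(indicators):
        # indicators: dict of indicator name -> count observed for one host/day.
        return sum(WEIGHTS.get(name, 1) * count for name, count in indicators.items())

    def hosts_to_investigate(per_host_indicators, threshold=10):
        # Return (host, score) pairs above the (tunable) threshold, highest first.
        scored = [(host, risk_score(ind)) for host, ind in per_host_indicators.items()]
        return sorted([hs for hs in scored if hs[1] >= threshold],
                      key=lambda hs: hs[1], reverse=True)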

The problem with us humans is that if we are barraged with tons of data and forced to pick out the significant items, we are woefully inefficient. First of all, we have a very large capacity for missing certain data sets. How often have you heard the saying, "Another set of eyes"? It's true; though we don't like to admit it, when faced with large data sets we can miss certain patterns that others will see, and vice-versa. This brings two lessons. First, never manually analyze data alone; always have another set of eyes go over it. Second, perhaps we are not the best choice for this type of activity. There is another reason to consider, though. It's called bias. We are emotional beings. While we like to think we are always intellectual in our decisions, this has been shown not to be the case. As a matter of fact, many neurology researchers are saying that without emotions we really can't make a decision at all. At its root, decision making for us is an emotional endeavor.

So enter computers and the science of data analytics. Computers and algorithms do not exhibit the same shortcomings as us humans, but they exhibit others. They are extremely good at sifting through large sets of data, identifying patterns and then analyzing them against certain rules such as those noted above. They are also extremely fast at these tasks when compared to us. What they offer will be unadulterated and pure, without bias, IF and only if the algorithms are written correctly and do not induce any bias in their design. This whole subject warrants another blog article sometime, but for now suffice it to say that algorithms, theories of operation and application designs are all created by us. So the real fact of the matter is that there will be biases embedded in any solution. But there is one thing that computers do not do well yet. They can't look at patterns and emotionally 'suspect' an activity while 'knowing' the normal behavior of a user; for example, to say to themselves, "Fred just wouldn't do this type of thing. Perhaps his machine has been compromised. I think I should give him a call before I escalate this. We can confiscate the machine if this is true, get him a replacement and get the compromised unit into forensics." Note that I say for now. Artificial intelligence is moving forward at a rapid pace, but who is to say that AI will not eventually hit the same roadblock of bias that we have? Many cognitive researchers are now coming to this conclusion. So it is clear that we and computers will be co-dependent for the foreseeable future, each side keeping the other from invoking bias. The real fact is that there will always be false negatives and false positives. The cyber-security universe simply moves too fast to assume otherwise. So the concept of setting and forgetting is not valid here. These systems will need assistance from humans, particularly once a system has been identified as 'suspect'.

Automation and Security

At Avaya we have developed a shortest path bridging network fabric, which we refer to as SDN Fx, that is based on three basic, mutually complementary security principles.

Hyper-segmentation

This is a new term that we have coined to indicate the primary differences between this new approach and traditional network micro-segmentation. First, hyper-segments are extremely dynamic and lend themselves well to automation and dynamic service chaining, as is often required with software defined networks. Second, they are not based on IP routing and therefore do not require traditional route policies or access control lists to constrict access to the micro-segment. These two traits create a service that is well suited to security automation.

 

Stealth

We have spoken to this many times in the past. Because SDN Fx is not based on IP, it is dark from an IP discovery perspective. Many of the topological aspects of the network, which are of key importance to an APT, simply cannot be discovered by traditional port scanning and discovery techniques. So the hyper-segment holds the user or intruder within a narrow, dark community that has little or no communications capability with the outside world except through well-defined security analytic inspection points.

Elasticity

This refers to the dynamic component. Because we are not dependent on IP routing to establish service paths, we can extend or retract secure hyper-segments based on authentication and proper authorization. Just as easily, SDN Fx can retract a hyper-segment, perhaps based on an alert from security analytics that something is amiss with the suspect system. But as we recall, we may not want to simply cut the intruder off, but rather place them into a forensic environment where we can watch their behavior and perhaps gain insight into the methods used. There may even be the desire to redirect them into honey pot environments, where whole networks can be replicated in SDN Fx at little or no cost from a networking perspective.

Welcome to my web (It’s coated with honey! Yum!)

If we take the concept of the honey pot and extend it with SDN Fx, we can create a situation where the APT no longer has complete confidence of where they are and whether they are looking at real systems. Recall that the APT relies on shifting techniques that evolve over time, even during a single attack scenario. There is no reason why you could not do the same. Modern virtualization of servers and storage, along with the dynamic attributes of SDN Fx, creates an environment where we can keep the APT guessing and ALWAYS without a total scope of knowledge about the network. Using SDN Fx we can automate paths within the fabric to redirect suspect or known malicious systems to whatever type of forensic or honey pot service is required.
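To make the idea concrete, here is a minimal sketch of how an analytics alert might drive an automated redirection of a suspect host into a forensic or honey pot hyper-segment. The controller client and the I-SID values (FabricClient, FORENSICS_ISID) are hypothetical placeholders for illustration, not an actual SDN Fx API.

# Hypothetical sketch: redirect a suspect host into a forensic hyper-segment.
# FabricClient and the I-SID values are placeholders, not a real SDN Fx API.

PRODUCTION_ISID = 20010   # assumed I-SID of the normal user segment
FORENSICS_ISID = 66601    # assumed I-SID of the instrumented honey pot segment


class FabricClient:
    """Stand-in for whatever programmatic interface the fabric controller exposes."""

    def move_host(self, mac, from_isid, to_isid):
        # In a real deployment this would call the controller; here we just log it.
        print(f"moving {mac}: I-SID {from_isid} -> {to_isid}")


def handle_alert(alert, fabric):
    """On a high-severity analytics alert, quietly shift the host to forensics."""
    if alert["severity"] >= 8:
        fabric.move_host(alert["mac"], PRODUCTION_ISID, FORENSICS_ISID)
        return "redirected-to-forensics"
    return "no-action"


if __name__ == "__main__":
    alert = {"mac": "00:11:22:33:44:55", "severity": 9, "reason": "suspect exfiltration"}
    print(handle_alert(alert, FabricClient()))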

Avaya has been very active in building out the security ecosystem in an open system approach with a networking fabric based on IEEE standards. The concept of closed loop security now becomes a reality. But we need to take it further. Humans still need to communicate and interact about these threats on a real time basis. The ability to alert staff to threats, and even set up automated conferencing where staff can compare data and decide on the next best course of action, is now possible, as such services can be rendered in only a couple of minutes in an automated fashion.

figure-5
Figure 6. Hyper-segmentation, Stealth and Elasticity to create the ‘Everywhere Perimeter’

All of this places the APT in a much more difficult position. As the illustration above shows, hyper-segmentation creates a series of hurdles that need to be compromised before access to a given resource is possible. Then it becomes necessary to create outbound channels for the exfiltration of data across the various hyper-segment boundaries and associated security inspection points. Also note that, as the figure above illustrates, you can create hyper-segments where there simply is no connectivity to the outside world. For all intents and purposes they are totally and completely orthogonal. The only way to gain access is to actually log into the segment. This creates even more difficulty for the APT, as exfiltration becomes harder and, if you are watching, easier to catch.

In summary

One could say, and most probably should say, that this was an occurrence that was bound to happen. While I don’t like the term ‘destined’, I must admit that it is particularly true here. As our ability to communicate and compute has increased, it has created a new avenue for illegal and illegitimate usage. The lesson here is that the Internet does not make us better people. It only makes us better at being what we already are. It can provide immense transformative power to convert folks to perform unspeakable acts, and it can, on a few hours’ notice, bring a global enterprise to its knees.

But it can also be a force for a very powerful good. As an example, I am proud to be involved in the effort on behalf of colleagues such as Mark Fletcher and Avaya in the wider sense to support Kari’s law for the consistent behavior of 9-1-1 emergency services. Mark is also actively engaged abroad in the subject of emergency response as I am for security. The two go hand in hand in many respects because the next thing the APT will attempt is to take out our ability to respond. The battle is not over. Far from it.

Establishing a confidential Service Boundary with Avaya’s SDN Fx

June 10, 2016

Cover

 

Security is a global requirement. It is also global in the fashion in which it needs to be addressed. But the truth is, regardless of the vertical, the basic components of a security infrastructure do not change. There are firewalls, intrusion detection systems, encryption, networking policies and session border controllers for real time communications. These components also plug together in rather standard fashions or service chains that look largely the same regardless of the vertical or vendor in question. Yes, there are some differences but by and large these modifications are minor.

So the question begs, why is security so difficult? As it turns out, it is not really the complexities of the technology components themselves, although they certainly have those. It turns out that the real challenge is deciding exactly what to protect, and here each vertical will be drastically different. Fortunately, the methods for identifying confidential data or critical control systems are also rather consistent even though the data and applications being protected may vary greatly.

In order for micro-segmentation as a security strategy to succeed, you have to know where the data you need to protect resides. You also need to know how it flows through your organization, which systems are involved and which ones aren’t. If this information is not readily available, it needs to be created by data discovery techniques and then validated as factual.

This article is intended to provide a series of guideposts on how to go about establishing a confidential footprint for such networks of systems. As we move forward into the new era of the Internet of Things and the advent of networked critical infrastructure it is more important than ever before to have at least a basic understanding of the methods involved.

Data Discovery

Obviously the first step in establishing a confidential footprint is identifying the systems, and the data they exchange, that need to be protected. Sometimes this can be a rather obvious thing. A good example is credit card data and PCI. The data and the systems involved in the interchange are fairly well understood, and the pattern of movement or flow of data is rather consistent. Other examples might be more difficult to determine. A good example of this is the protection of intellectual property. Who is to say what classifies as intellectual property? Who is to establish a risk value for a given piece of IPR? In many instances this type of information may be in disparate locations and stored with various methods and probably various levels of security. If you do not have a quantified idea of the volume and location of such data, you will probably not have a proper handle on the issue.

Data Discovery is a set of techniques to establish a confidential data footprint. This is the first established phase of identifying exactly what you are trying to protect. There are many products on the market that can perform this function. There are also consulting firms that can be hired to perform a data inventory. Fortunately, this is something that can be handled internally if you have the right individuals with proper domain expertise. As an example, if you are performing data discovery on oil and gas geologic data, it is best to have a geologist involved with the proper background in the oil and gas vertical. Why? Because they would have the best understanding of what data is critical, confidential or superfluous and inconsequential.

Data Discovery is also critical in establishing a secure IoT deployment. Sensors may be generating data that is critical to the feedback actuation of programmable logic controllers. The PLCs themselves might also generate information on their own performance. It is important to understand that much of process automation has to do with closed loop feedback mechanisms. The feedback loops are critical for the proper functioning of the automated IoT framework. An individual who could intercept or modify the information within this closed loop environment could adversely affect the performance of the system; even to the point of making it do exactly the opposite of what was intended.

As pointed out earlier though, fortunately there are some well understood methods in establishing a confidential service boundary. It all starts with a simple checklist.

Establishing a Confidential Data Footprint – IoT Security Checklist for Data

1). What is creating the data?

2). What is the method for transmission?

3). What is receiving the data?

4). How/where is it stored?

5). What systems are using the data?

6). What are they using it for?

7). Do the systems generate ‘emergent’ data?

8). If yes, then is that data sent, stored, or used?

9). If yes, then is that data confidential or critical?

10). If so, then go to step 1.

No, step 10 is not a sick joke. When dealing with creating secure footprints for IoT frameworks it is important to realize that your data discovery will often loop back on itself. With closed loop system feedback this is the nature of the beast. Also be prepared to do this several times, as these feedback loops can be relatively complex in fully automated systems environments. So it gets down to some basic detective work. Let’s grab our magnifier and get going. But before we begin, let’s take a closer look at each step in the discovery process.
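The loop in step 10 can be expressed as a simple worklist traversal: every time a system turns out to emit confidential data, its downstream consumers are pushed back onto the queue for another pass. The inventory and system names below are invented for illustration only.

# Sketch of the discovery loop from the checklist above (names are illustrative).
from collections import deque

# Hypothetical inventory: each system lists the systems it sends data to and
# whether the data it emits is considered confidential or critical.
INVENTORY = {
    "sensor-S1":  {"sends_to": ["scm"], "confidential": True},
    "sensor-S2":  {"sends_to": ["scm"], "confidential": True},
    "scm":        {"sends_to": ["controller", "historian"], "confidential": True},
    "controller": {"sends_to": ["sensor-S2"], "confidential": True},  # loop-back
    "historian":  {"sends_to": [], "confidential": False},
}


def confidential_footprint(start):
    """Walk the data flow from `start`, looping back until nothing new appears."""
    footprint, queue = set(), deque([start])
    while queue:
        system = queue.popleft()
        if system in footprint:
            continue                       # already examined: closes the loop
        footprint.add(system)
        entry = INVENTORY[system]
        if entry["confidential"]:          # step 9: confidential? then go to step 1
            queue.extend(entry["sends_to"])
    return footprint


if __name__ == "__main__":
    print(sorted(confidential_footprint("sensor-S1")))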

What is sending the Data?

This is the start of the confidential data chain. Usually it will be a sensor of some type or a controller that has a sensing function embedded in it. It could also be something as simple as a point of sale location for credit card data. Another possible case would be medical equipment relaying both critical and confidential data. This is where domain expertise is a key attribute that you need on your team. These individuals will understand what starts the information service chain from an application services perspective. This information will be crucial in establishing the start of the ‘cookie crumb’ trail.

What is the method of transmission?

Obviously if something is creating data there are three choices. First, the device will store the data. Second, the device may use the data to actuate an action or control. Third, the device will transmit the data. Sometimes a device will do all three. Using video as an example, a wildlife camera off in the woods will usually store the data that it generates until some wildlife manager or hunter comes to access the content, whereas a video surveillance camera will usually transmit the data to a server, a digital video recorder or a human viewer in a real time fashion. Some video surveillance cameras may also store recent clips or even feed back into the physical security system to lock down an entry or exit zone. When something transmits the information, it is important to establish the methods used. Is it IP or another protocol? Is it unicast or multicast? Is it UDP (connectionless) or is it TCP (connection oriented)? Is the data encrypted during transit? If so, how? If it is encrypted, is there a proper chain of trust established and validated? In short, if the information moves out of the device and you have deemed that data to be confidential or critical, then it is important to quantify the nature of the transmission paths and the nature of (or lack of) security applied to them.
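One simple way to keep the answers to these questions consistent across the inventory is to record a transmission profile per flow. The fields below mirror the questions in this step and are only a suggested starting point, not a standard schema.

# Suggested record for answering the transmission questions above (illustrative only).
from dataclasses import dataclass


@dataclass
class TransmissionProfile:
    source: str
    destination: str
    protocol: str          # e.g. "IP" or a non-IP field bus
    transport: str         # "TCP" (connection oriented) or "UDP" (connectionless)
    delivery: str          # "unicast" or "multicast"
    encrypted: bool
    cipher: str = ""       # how it is encrypted, if at all
    trust_validated: bool = False  # is the chain of trust established and checked?

    def needs_attention(self):
        """Flag confidential flows that leave the device without vetted encryption."""
        return not (self.encrypted and self.trust_validated)


if __name__ == "__main__":
    cam = TransmissionProfile("surveillance-cam-07", "video-recorder",
                              "IP", "TCP", "unicast", encrypted=False)
    print(cam.needs_attention())   # True: this flow should be quantified and fixed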

What is receiving the data?

Obviously if the first system element is transmitting data then there has to be a system or set of systems receiving it. Again, this may be fairly simple and linear, such as the movement of credit card data from a point of sale system to an application server in the data center. In other instances, particularly in IoT frameworks, the information flow will be convoluted and loop back on itself to facilitate the closed loop communication required for systems automation. In other words, as you begin to extend your discovery you will begin to discern characteristics or a ‘signature’ to the data footprint. Establishing the transmitting and receiving systems is a critical part of this process. A bit later in the paper we will take a look at a simple linear data flow and compare it to a simple closed loop data flow in order to clarify this precept.

Is the data stored? How is it stored?

When folks think about storage, they typically think about hard drives, solid state storage or storage area networks. So there are considerations that need to be made here. Is the storage a structured database or is it a simple NAS? Perhaps it might be something based on Google File System (GFS) or Hadoop for data analytics. But the reality is that data storage is much broader than that. Any device that holds data in memory is in actuality storing it. Sometimes the data may be transient. In other words, it might be a numerical data point that represents an intermediate mathematical step for an end calculation. Once the calculation is completed the data is no longer needed and the memory space is flushed. But is it really flushed? As an example, some earlier vendor applications for credit card information did not properly flush the system of PINs or CVC values from prior transactions. If transient data is being created, it needs to be determined whether that data is critical or confidential and should be deleted upon termination of the session or, if stored, stored with the appropriate security considerations. In comparison, the transient numerical value for a mathematical function may not be confidential because outside of its context that data value would be meaningless. But also keep in mind that this might not be the case as well. Only someone with domain expertise will know. Are you starting to see some common threads?
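A small, hedged illustration of the transient-data point: sensitive values held in memory should be explicitly cleared once the calculation is done rather than left for the garbage collector. The example uses a mutable bytearray precisely because immutable strings cannot be overwritten in place; it is a sketch of the principle, not a complete secure-coding recipe.

# Illustration only: clearing transient sensitive data after use.
# A bytearray is used because, unlike a str, it can be overwritten in place.

def authorize(pin: bytearray) -> bool:
    """Pretend check against a stored value; the PIN is wiped before returning."""
    try:
        return pin == bytearray(b"4242")   # placeholder comparison
    finally:
        for i in range(len(pin)):          # explicitly flush the transient value
            pin[i] = 0


if __name__ == "__main__":
    entered = bytearray(b"4242")
    print(authorize(entered))   # True
    print(entered)              # bytearray(b'\x00\x00\x00\x00') - flushed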

What systems are using the data and what are they using it for?

Again, this may sound like an obvious question, but there are subtle issues and most probably assumptions that need to be validated and vetted. A good example might be data science and analytics. As devices generate data, that data needs to be analyzed for traits and trends. In the case of credit card data it might be analysis for fraudulent transactions. In the case of IoT for automated production it might be the use of sensor data to tune and actuate controllers, with an analytic process in the middle to tease out pertinent metrics for systems optimization. In the former example, it is an extension of a linear data flow; in the latter, the analytics process is embedded into the closed loopback data flow. Knowing these relationships allows one to establish the proposed ‘limits’ of the data footprint. Systems beyond this footprint simply have no need to access the data and consequently no access to it should be provided.

Do those systems generate ‘emergent’ data?

I get occasional strange looks when I use this term. Emergent data is data that did not exist prior to the start of the compute/data flow. Examples of emergent data are transient numerical values that are used for internal computation in a particular algorithmic process. Others are intermediate data metrics that provide actual input into a closed loop behavior pattern. In the area of data analysis this is referred to as ‘shuffle’. Shuffle is the movement of data across the top of rack environment in an east/west fashion to facilitate the mathematical computation that often accompanies data science analytics. Any of the resultant data from the analysis process is ‘new’, or emergent, data.

If yes, is that data sent, stored or used?

Unless you have a very poorly designed solution set, any system that generates emergent data will do something with it (one of the three previously mentioned above). If you find that this is not the case then the data is superfluous and the process could possibly be eliminated out of the end to end data flow. So let’s assume that the system in question will do at least one of the three. In the case of a programmable logic controller it may use the data to more finely tune its integral and atomic process. The same system (or its manager) may store at least a certain span of data for historical context and systems logs. In the case of tuning, the data may be generated by an intermediate analytics process that would arrive at more optimal settings for the controllers’ actuation and control. So remember these data metrics could come from anywhere in the looped feedback system.

If yes, then is that data confidential or critical?

If your answer to this question is yes, then the whole process of investigation needs to begin again until all possible avenues of inter-system communications are exhausted and validated. So in reality we are stepping into another closed loop of systems interaction and information flow within the confidential footprint. Logic dictates that if all of the data up until this point is confidential or critical then it is highly likely that this loop will be as well. It is highly unlikely that one would go through a complex loop process with confidential data and say that they have no security concerns on the emergent data or actions that result out of the system. Typically, if things start as confidential and critical, they usually – but not always – will end up as such within an end to end data flow. Particularly if it is something as critical as the meaning of the universe which we all know is ‘42’.

 

Linear versus closed loop data flows

First, let’s remove the argument of semantics. All data flows that are acknowledged are closed loops. A very good example is TCP. There are acknowledgements to transmissions. This is a closed loop in its proper definition. But what we mean here in this discussion is a bit broader. Here we are talking about the general aspects of the confidential data flow, not the protocol mechanics used to move the data. That was addressed already in step two. Again, a very good example of a linear confidential data flow is PCI. Whereas automation frameworks provide for a good example of looped confidential data flows.

Linear Data Flows

Let’s take a moment and look at a standard data flow for PCI. First you have the start of the confidential data chain which is obviously the point of sale system. From the point of sale system the data is either encrypted or more recently tokenized into a transaction identifier by the credit card firm in question. This tokenization provides yet another degree of abstraction to avoid the need to transmit actual credit card data. From there the data flows up to the data center demarcation where the flow is inspected and validated by firewalls and intrusion detection systems and then handed to the data center environment where a server running an appropriately designed PCA DSS application to handle the card and transaction data. In most instances this is where it stops. From there the data is uploaded to the bank by a dedicated and encrypted services channel. Most credit card merchants to do not store card holder data. As a matter of fact PCI V3.0 advises against it unless there are strong warrants for such practice because there are extended practices to protect stored card holder data which further complicates compliance. Again, examples might be to analyze for fraudulent practice. When this is the case the data analytics sandbox needs to be considered as an extension of the actual PCI card holder data domain. But even then, it is a linear extension to the data flow. Any feedback is likely to end up in a report meant for human consumption and follow up. In the case of an actual credit card vendor however this may be different. There may be the ability and need to automatically disable a card based on the recognition of fraudulent behavior. In that instance the data analytics is actually a closed loop data flow at the end of the linear data flow. The close in the loop is the analytics system flagging to the card management system that the card in question be disabled.

Looped Data Flows

In the case of a true closed loop IoT framework, a good simplified example is a simple three loop public water distribution system. The first loop is created by a flow sensor that measures the gallons per second flow coming into the tank. The second loop is created by a flow sensor that measures the gallons per second flow out of the tank. Obviously the two loops feed back on one another and actuate pumps and drain flow valves to maintain a match to the overall flow of the system, with a slight favor to the tank filling loop. After all, it’s not just a water distribution system but a water storage system as well. But in ideal working situations, as the tank reaches the full point the ingress sensor feeds back to reduce the speed and even shut down the pump. There is also a third loop involved. This is a failsafe that will actuate a ‘pop off’ valve in the case that a mismatch develops due to systems failure (the failure of one of the drain valves, for instance). Once the fill level of the tank or the tank’s pressure reaches a level that was established beforehand, the pop off valve is actuated, thereby relieving the system of additional pressure that could cause further damage and even complete system failure. It is obviously critical for the three loops to have continuous and stable communications. These data paths also have to be secure, as anyone who could gain access into the network could mount a denial of service attack on one of the feedback loops. Additionally, if actual systems access is obtained then the rules and policies could be modified to horrific results. A good example is that of a public employee a few years ago who was laid off and consequently gained access and modified certain rules in the metro sewer management system. The attack resulted in sewage backups that went on for months until the attack and malicious modifications were recognized and addressed. So this brings us now to the aspect of systems access and control.
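The third-loop failsafe is ultimately a simple threshold rule, which is exactly why unauthorized modification of the rules is so dangerous. The sketch below is illustrative only, with made-up units and limits.

# Sketch of the third-loop failsafe described above; values are illustrative.
POP_OFF_PRESSURE = 85.0   # preset limit (psi) established before operation


def failsafe(pressure_psi: float, fill_fraction: float) -> bool:
    """Return True if the pop-off valve should actuate to relieve the system."""
    return pressure_psi >= POP_OFF_PRESSURE or fill_fraction >= 0.98


if __name__ == "__main__":
    print(failsafe(pressure_psi=60.0, fill_fraction=0.90))  # False: normal operation
    print(failsafe(pressure_psi=90.0, fill_fraction=0.90))  # True: relieve pressure
    # If an intruder could quietly raise POP_OFF_PRESSURE, the second call would
    # return False and the protection would never trigger.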

 

But you’re not done yet…

You might have noticed that certain confidential data may be required to leave your administrative boundary. This could be anything from uploading credit card transactions to a bank or sharing confidential or classified information between agencies for law enforcement or homeland defense. In either case this classifies as an extension to the confidential data boundary and needs to be properly scrutinized as a part of it. But the question is how?

This tends to be one of the biggest challenges in establishing control of your data. When you give it to someone else, how do you know that it is being treated with due diligence and is not being stored or transferred in a non-secure fashion, or worse yet being sold for revenue? Well, fortunately there are things that you can do to assure that ‘partners’ are using proper security enforcement practices.

1). A contract

The first obvious thing is to get some sort of assurance contract put in place that holds the partner to certain practices in the handling of your data. Ask your partner to provide you with documentation as to how those practices are enforced and what technologies are in place for assurance. It might also be a good idea to request a visit to the partner’s facilities to meet directly with staff and tour the site in question.

2). Post Contract

Once the contract is signed and you begin doing business, it is always wise to do a regular check on your partner to ensure that there has been no ‘float’ between what is assumed in the contract and what is reality. Short of the onerous requirement of a full scale security audit (and note that there may be some instances where that may very well be required), there are some things that you can do to ensure the integrity and security of your data. It is probably a good idea to establish regular or semi-regular meetings with your partner to review the service that they provide (i.e. transfer, storage, or compute) and its adherence to the initial contract agreement. In some instances it might even warrant setting up direct site visits in an ad hoc fashion so that there is little or no notification. This will provide better assurance of the proper observance of ‘day to day’ practice. Finally, be sure to have a procedure in place to address any infractions of the agreement, as well as contingency plans on alternative tactical methods to provide assurance.

 

Systems and Control – Access logic flow

So now that we have established a proper scope for the confidential or critical data footprint, what about the systems? The relationship between data and systems is very strongly analogous to musculature and skeletal structure in animals. In animals there is a very strong synergy between muscle structure and skeletal processes. Simply, muscles only attach to skeletal processes and skeletal processes do not develop in areas where muscles do not attach. You can think of the data as the muscles and the systems that use or generate the data as the processes.

This also should have become evident in the data discovery section above. Identifying the participating systems is a key point of the discovery process. This gives you a pre-defined list of systems elements involved in the confidential footprint. But it is not always just a simple one to one assumption. The confidential footprint may be encompassed by a single L3 VSN, but it may not. As a matter of fact, in IoT closed loop frameworks this most probably will not be the case. These frameworks will often require tiered L2 VSNs to keep certain data loops from ‘seeing’ other data loops. A very good example of this is production automation frameworks, where there may be a higher level Flow Management VSN and then, tiered ‘below’ it, several automation managers within smaller dedicated VSNs that communicate with the higher level management environment. At the lowest level you would have very small VSNs or, in some instances, dedicated ports to the robotics drive. Obviously it’s of key importance to make sure that the systems are authenticated and authorized to be placed into the proper L2 VSN within the overall automation hierarchy. Again, someone with systems and domain experience will be required to provide this type of information.
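A tiered VSN layout like the one just described can be captured as a small hierarchy before anything is provisioned. The sketch below uses invented service names, I-SIDs and device names purely to show the shape of such a design.

# Illustrative model of a tiered automation VSN hierarchy (names and I-SIDs invented).
VSN_HIERARCHY = {
    "flow-management": {            # higher-level L3 management environment
        "type": "L3-VSN", "isid": 30001,
        "children": {
            "automation-mgr-A": {"type": "L2-VSN", "isid": 20011,
                                 "members": ["robot-drive-1", "robot-drive-2"]},
            "automation-mgr-B": {"type": "L2-VSN", "isid": 20012,
                                 "members": ["robot-drive-3"]},
        },
    }
}


def allowed(peer_a, peer_b, hierarchy=VSN_HIERARCHY):
    """Two edge devices may communicate only if they share the same L2 VSN."""
    for mgmt in hierarchy.values():
        for vsn in mgmt["children"].values():
            if peer_a in vsn["members"] and peer_b in vsn["members"]:
                return True
    return False


if __name__ == "__main__":
    print(allowed("robot-drive-1", "robot-drive-2"))  # True: same tiered L2 VSN
    print(allowed("robot-drive-1", "robot-drive-3"))  # False: kept from 'seeing' it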

Below is a higher level logic flow diagram of systems and access control within SDN Fx. Take a quick look at the diagram and we will touch on each point in the logic flow in further detail.

Picture1

Figure 1. SDN Fx Systems & Access Control

There are a few things to note in the diagram above. First, in the earlier stages of classifying a device or system there is a wide variety of potential methods available, which the process winnows down to a single method on which validation and access occur. It is also important to point out that all of these methods could be used concurrently within a given Fabric Connect network. It is best, however, to be consistent in the methods that you use to access the confidential data footprint and the corresponding Stealth environment that will eventually encompass it. Let’s take a moment and look a little closer at the overall logic flow.

Device Classification

When a device first comes on line in a network it is a link state on the port and a MAC address. There is generally no quantified idea of what the system is unless the environment is manually provisioned and record keeping scrupulously maintained. This is not a real world proposition so there is the need to classify the device, its nature and its capabilities. We see that there are two main initial paths. Is it a user device, like a PC or a tablet? Or is it just a device? Keep in mind that this could still be a fairly wide array of potential types. It could be a server, or it could be a switch or WLAN access point. It could also be a sensor or controller such as a video surveillance camera.

User Device Access

This is a fairly well understood paradigm. For details, please reference the many TCGs and documents that exist on Avaya’s Identity Engines and its operation. There is no need to recreate them here. At a high level, IDE can provide for varying degrees and types of authentication. As an example, normal user access might be based on a simple password or token, but other more sensitive types of access might require stronger authentication such as RSA. In addition, there may be guest users that are allowed temporary access to guest portal type services.

Auto Attach Device Access

Auto-attach (IEEE 802.1Qcj), known in Avaya as Fabric Attach, supports a secure LLDP signaling dialog between the edge device running the Fabric Attach or auto attach client and the Fabric Attach proxy or server, depending upon topology and configuration. IDE may or may not be involved in the Fabric Attach process. In the case of a device that supports auto attach there are two main modes of operation. First is the pre-provisioning of VLAN/I-SID relationships on the edge device in question. IDE can be used to validate that the particular device warrants access to the requested service. There is also a NULL mode in which the device does not present a VLAN/I-SID combination request but instead lets IDE handle all or part of the decision (i.e. Null/Null or VLAN/Null). This might be the mode that a video surveillance camera or sensor system that supports auto attach would use. There are also some enhanced security methods used within the FA signaling that significantly mitigate the possibility of MAC spoofing and provide for security of the signaling data flows.

802.1X

Obviously 802.1X is used in many instances of user device access. It can also be used for devices alone as well. A very good example again is video surveillance cameras that support it. 802.1X is based on three major elements: supplicants (those wishing to gain access), authenticators (those providing the access, such as an edge switch) and an authentication server, which for our purposes would be IDE. From the supplicant to the authenticator, the Extensible Authentication Protocol or EAP (or one of its variants) is used. The authenticator and the authentication server support a RADIUS request/challenge dialog on the back end. Once the device is authenticated, it is then authorized and provisioned into whatever network service is dictated by IDE, whether stealth and confidential or otherwise.

MAC Authentication

If we arrive to this point in the logic flow, we know that it is a non-user device and that it does not support auto attach or 802.1X. At this point the only method left is simple MAC authentication. Note that this box is highlighted in red due to the concerns for valid access security, particularly to the confidential network. MAC authentication can be spoofed by fairly simple methods. Consequently, it is generally not recommended as a network access into secure networks.

Null Access

This is actually the starting point in the logic flow as well as a termination. Every device that attaches to the edge when using IDE gets access for authentication alone. If the loop fails (whether FA or 802.1X), the network state reverts back to this mode. There is no network access provided but there is the ability to address possible configuration issues. Once those are addressed, the authentication loop would again proceed with access granted as a result. On the other hand, if this chain in the logic flow is arrived at due to the fact that nothing else is supported or provisioned then manual configuration is the last viable option.

Manual Provisioning

While this is certainly a valid method for providing access, it is generally not recommended. Even if the environment is accurately documented and the record keeping scrupulously maintained, there is still the risk of exposure. This is because VLANs are statically provisioned at the service edge. There is no inspection and no device authentication. Anyone could plug into the edge port and, if DHCP is configured on the VLAN, they are on the network and no one is the wiser. Compare this with the use of IDE in tandem with Fabric Connect, where someone could unplug a system and then plug their own system in to try to gain access. This will obviously fail. As a result this box is shown in red as well, as it is not a recommended method for stealth network access.
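The logic flow in Figure 1 can be summarized as a cascade of checks, each winnowing down to a single access method. The sketch below is a simplified paraphrase of that flow with invented field names; it is not Identity Engines code.

# Simplified paraphrase of the Figure 1 access logic (field names are invented).

def classify_access(device):
    """Return the access method a device would fall through to, per Figure 1."""
    if device.get("is_user_device"):
        return "user-access-via-IDE"           # password, token or stronger (e.g. RSA)
    if device.get("supports_auto_attach"):
        return "fabric-attach"                 # LLDP dialog, VLAN/I-SID or NULL mode
    if device.get("supports_dot1x"):
        return "802.1X"                        # EAP to authenticator, RADIUS to IDE
    if device.get("mac_auth_allowed"):
        return "mac-authentication (caution)"  # spoofable: flagged red in the figure
    return "null-access"                       # authentication only, no network service


if __name__ == "__main__":
    camera = {"is_user_device": False, "supports_auto_attach": True}
    legacy_sensor = {"is_user_device": False, "mac_auth_allowed": True}
    print(classify_access(camera))         # fabric-attach
    print(classify_access(legacy_sensor))  # mac-authentication (caution)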

 

How do I design the Virtual Service Networks required?

Up until now we have been focusing on the abstract notions of data flow and footprint. At some point someone has to sit down and design how the VSNs will be implemented and what relationships, if any, exist between them. Well at this point, if you have done due diligence in the data discovery process that was outlined earlier, you should have:

1). A list of transmitting and receiving systems

2). How those systems are related and their respective roles

a). Edge Systems (sensors, controllers, users)

b). Application Server environments (App., DB, Web)

c). Data Storage

3). A resulting flow diagram that illustrates how data moves through the network

a). Linear data flows

b). Closed loop (feedback) data flows

4). Identification of preferred or required communication domains.

a). Which elements need to ‘see’ and communicate with one another?

b). Which elements need to be isolated and should not communicate directly?

As an example of linear data flows, see the diagram below. It illustrates a typical PCI data footprint. Notice how the data flow is primarily from the point of sale systems to the bank. While there are some minor flows of other data in the footprint, it is by and large dominated by the credit card transaction data as it moves to the data center and then to the bank, or even directly to the bank.

Picture2

Figure 2. Linear PCI Data Footprint

Given the fact that the linear footprint is monolithic, the point of sale network can be handled by one L3 IP VPN Virtual Service Network. This VSN would terminate at a standard security demarcation point with a mapping of a single dedicated port. In the data center a single L2 Virtual Service Network could provide the required environment for the PCI server application and the uplink to the financial institution. Alternatively, some customers have utilized Stealth L2 VSNs to provide connectivity to the point of sale systems, which are in turn collapsed to the security demarcation.

Picture3

Figure 3. Stealth L2 Virtual Service Network

Picture4

Figure 4. L3 Virtual Service Network

A Stealth L2 VSN is nothing more than a normal L2 VSN that has no IP addresses assigned at the VLAN service termination points. As a result, the systems within it are much more difficult to discover and hence exploit. L3 VSNs, which are I-SIDs associated with VRFs, are stealth by nature. The I-SID replaces traditional VRF peering methods, creating a much simpler service construct.

To look at looped data flows, let’s use a simple two layer automation framework, as shown in the figure below.

Picture5

Figure 5. Looped Data Footprint for Automation

We can see that we have three main element types in the system: two sensors (S1 & S2), a controller or actuator (A/C) and a sensor/controller manager, which we will refer to as the SCM. We can also see that the sensor feeds information on the actual or effective state of the control system to the SCM. For the sake of clarity let’s say that it is a flood gate. So the sensor (S2) can measure whether the gate is open or closed or in any intermediate position. The SCM can in turn control the state of the gate by actuating the controller. The system might even be more sophisticated, in that it not only manages the local gate but manages it according to upstream water level conditions. As such there would also be a dedicated sensor element that allows the system to monitor the water level as well; this is sensor S1. So we see a closed loop framework, but we also see some consistent patterns, in that the sensors never talk directly to the controllers. Even S2 does not talk to the controller; it measures the effective state of it. Only the SCM talks to the controller, and the sensors only talk to the SCM. As a result we begin to see a framework of data flow and which elements within the end to end system need to see and communicate with one another. This in turn will provide us with insight as to how to design the supporting Virtual Service Network environment, as shown below.

Picture6

Figure 6. Looped Virtual Service Network design

Note that the design is self-similar, in that it is replicated at various points of the watercourse that it is meant to monitor and control. Each site location provides three L2 VSN environments for S1, S2 and A/C. Each of these is fed up to the SCM, which coordinates the local sensor/control feedback. Note that S1, S2 and A/C have no way to communicate directly, only through the coordination of the SCM. There may be several of these loopback cells at each site location, all feeding back into the site SCM, but also note that there is a higher level communication channel provided by the SCM L3 VSN which allows SCM sites to communicate upstream state information to downstream flood control infrastructure.

The whole system becomes a series of interrelated atomic networks that have no way to communicate directly and yet have the ability to convey a state of awareness of the overall end to end system, which can be monitored and controlled in a very predictable fashion, as long as it is within the engineered limits of the system. But also note that each critical element is effectively isolated from any inbound or outbound communication other than that which is required for the system to operate. Now it becomes easy to implement intrusion detection and firewalls with a very narrow profile on what is acceptable within the given data footprint. Anything outside it is dropped, pure and simple.
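The ‘who may see whom’ rules of the looped design in Figure 6 can be written down as a small communication matrix, which is also exactly the narrow profile that the intrusion detection and firewall policies should enforce. Element names follow the figure; everything else is illustrative.

# Illustrative communication matrix for the looped design in Figure 6.
# Only the SCM talks to the sensors and the actuator/controller; S1, S2 and A/C
# never talk to each other, and SCMs talk upstream/downstream over the L3 VSN.
ALLOWED_FLOWS = {
    ("S1", "SCM"), ("S2", "SCM"),      # sensors report to the site SCM only
    ("SCM", "A/C"),                    # only the SCM actuates the controller
    ("SCM", "SCM-downstream"),         # inter-site state over the SCM L3 VSN
}


def permitted(src, dst):
    """Anything outside the engineered matrix is dropped, pure and simple."""
    return (src, dst) in ALLOWED_FLOWS


if __name__ == "__main__":
    print(permitted("S2", "SCM"))   # True
    print(permitted("S1", "A/C"))   # False: sensors never talk to the controller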

 

Know who is who (and when they were there (and what they did))!

The prior statement applies not only to looped automation flows but also to any confidential data footprint. It is important not only to consider the validation of the systems but also the users who will access them. But it goes much further than network and systems access control. It touches into proper auditing of that access and associated change control. This becomes a much stickier wicket and there is still no easy answer. It really comes down to a coordination of resources, both cyber and human. Be sure to think out your access control policies in respect to the confidential footprint. Be prepared to buck standard access policies or demands from users that all services need to be available everywhere. As an example, it is not acceptable to mix UC and PCI point of sale communications in one logical network. This does not mean that a sales clerk cannot have a phone, and of course we assume that a contact center worker has a phone. It means that UC communications will traverse a different logical footprint than the PCI point of sale data. The two systems might be co-resident at various locations, but they are ships in the night from a network connectivity perspective. As a customer recently commented to me, “Well, with everything that has been going on, users will just need to accept that it’s a new world.” He was right. In order to properly lock down information domains there needs to be stricter management of user access to those domains and exactly what they can and cannot do within them. It may even make sense to have whole alternate user IDs with alternate, stronger methods of authentication. This provides an added hurdle to a would-be attacker that might have gained a general user’s access account. Alternate user accounts also provide for easier and clearer auditing of those users’ activities within the confidential data domain. Providing a common policy and directory resource for both network and systems access controls can allow for consolidation of audits and logs. By syncing all systems to a common clock and using tools such as the E.L.K. stack (Elasticsearch, Logstash and Kibana), entries can be easily searched against those alternate user IDs and the systems that are touched or modified. There is still some extra work to generate the appropriate reports, but having the data in an easily searchable utility is a great help.
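As a sketch of the audit search described above, the query below asks Elasticsearch for everything a given alternate user ID touched in a time window. The index name, field names and user ID are assumptions about how the logs were shipped by Logstash, not a fixed schema.

# Sketch of searching consolidated audit logs for an alternate user ID.
# Index and field names are assumptions about the Logstash pipeline in use.
import json
import urllib.request

QUERY = {
    "query": {
        "bool": {
            "must": [
                {"term": {"user.keyword": "jsmith-secure"}},   # alternate user ID
                {"range": {"@timestamp": {"gte": "now-24h"}}},  # last day only
            ]
        }
    },
    "sort": [{"@timestamp": {"order": "asc"}}],
}


def search_audit_trail(es_url="http://localhost:9200", index="audit-*"):
    """POST the query to Elasticsearch's _search endpoint and return the hits."""
    req = urllib.request.Request(
        f"{es_url}/{index}/_search",
        data=json.dumps(QUERY).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["hits"]["hits"]


if __name__ == "__main__":
    for hit in search_audit_trail():
        print(hit["_source"].get("@timestamp"), hit["_source"].get("message"))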

Putting you ‘under the microscope’

Even in the best of circumstances there are times when a user or a device will begin to exhibit suspicious or abnormal behaviors. As previously established, having an isolated information domain allows for anomaly based detection to function with a very high degree of accuracy. When exceptions are found they can be flagged and highlighted. A very powerful capability of Avaya’s SDN Fx is its unique ability to leverage stealth networking services to move the offending system into a ‘forensics environment’ where it is still allowed to perform its normal functions but it is monitored to assure proper behavior or determine the cause of the anomaly. In the case of malicious activity, the offending device can be placed into quarantine with the right forensics trail. Today we have many customers who use this feature on a daily basis in a manual fashion. A security architect can take a system and place it into a forensics environment and then monitor the system for suspect activity. But the human needs to be at the console and see the alert. Recently, Avaya has been working with SDN Fx and the Breeze development workspace to create an automated framework. Working with various security systems partners, Avaya is creating an automated systems framework to protect the micro-segmented domains of interest. Micro-segmentation not only provides for the isolated environment for anomaly detection, but also for the ability to lock down and isolate suspected offenders.

Micro-segmentation ‘on the fly’ – No man is an island… but a network can be!

Sometimes there is the need to move confidential data quickly and in a totally secret and isolated manner. In response to this need, there arose a series of secure web services known as Tor or Onion sites. These sites were initially introduced and intended for research and development groups, but over time they have been co-opted by drug cartels and terrorist organizations. The result has become known as the ‘dark web’. The use of strong encryption in these services is now a concern among the likes of the NSA and FBI, as well as many corporations and enterprises. These sites are now often blocked at security demarcations due to concerns about masked malicious activity and content. Additionally, many organizations now forbid strong encryption on laptops or other devices as concerns for their misuse have grown significantly. But clearly, there is a strong benefit to closed networks that are able to move information and provide communications with total security. There has to be some compromise that could allow for this type of service but provide it in a manner that is well mandated and governed by an organization’s IT department. Avaya has been doing research into this area as well. Dynamic team formation can be facilitated once again with SDN Fx and the Breeze development workspace. Due to the programmatic nature of SDN Fx, completely isolated Stealth network environments can be established in a very quick and dynamic fashion. The Breeze development platform is used to create a self-provisioning portal where users can securely create a dynamic stealth network with required network services. These services would include required utilities such as DHCP, but also optional services such as secure file services, Scopia video conferencing, and internal security resources to ensure proper behavior within the dynamic segment. A secure invitation is sent out to the invitees with a URL attachment to join the dynamic portal with authenticated access. During the course of the session, the members are able to work in a totally secure and isolated environment where confidential information and data can be exchanged, discussed and modified with total assurance. From the outside, the network does not exist. It cannot be discovered and cannot be intruded into. Once users are finished with the resource they simply log out of the portal and are automatically placed back into their original networks. Additionally, the dynamic Virtual Service Network can be encrypted at the network edge, either on a device like Avaya’s new Open Network Adapter or by a partner such as Senetas, who is able to provide secure encryption at the I-SID level. With this type of solution, the security of Tor and Onion sites can be provided, but in a well-managed fashion that does not require strong encryption on the laptops. Below is an illustration of the demonstration that was publicly held at the recent Avaya Technology Forums across the globe.

Picture7

Figure 7. I-SID level encryption demonstrated by Senetas

In summary

Many security analysts, including those out of the likes of the NSA, are saying that micro-segmentation is a key element in a proper cyber-security practice. It is not a hard point to understand. Micro-segmentation can limit east-west movement of malicious individuals and content. It can also provide for isolated environments that are an inherently strong complement to traditional security technologies. The issue that most folks have with micro-segmentation is not the technology itself but deciding what to protect and how to design the network to do so. Avaya’s SDN Fx Fabric Connect can drastically ease the deployment of a micro-segmented network design. Virtual Service Networks are inherently simple service constructs that lend themselves well to software defined functions. It cannot assist in deciding what needs to be protected, however. Hopefully, this article has provided insight into methods that any organization can adopt to do the proper data discovery and arrive at the scope of the confidential data footprint. From there the design of the Virtual Service Networks to support it is extremely straightforward.

As we move forward into the new world of the Internet of Things and Smart infrastructures micro-segmentation will be the name of the game. Without it, your systems are simply sitting ducks once the security demarcation has been compromised or worse yet the malice comes from within.

‘Dark Horse’ Networking – Private Networks for the control of Data

September 14, 2013

Dark Horse

Next Generation Virtualization Demands for Critical Infrastructure and Public Services

 

Introduction

In recent decades communication technologies have realized significant advancement. These technologies now touch almost every part of our lives, sometimes in ways that we do not even realize. As this evolution has and continues to occur, many systems that have previously been treated as discrete are now networked. Examples of these systems are power grids, metro transit systems, water authorities and many other public services.

While this evolution has brought a very large benefit to both those managing and those using the services, there is the rising spectre of security concerns and a precedent of documented attacks on these systems. This has brought about strong concerns about this convergence and what it portends for the future. This paper will begin by discussing these infrastructure environments, which, while varied, have surprisingly common theories of operation and actually use the same set or class of protocols. Next we will take a look at the security issues and some of the reasons why they exist. We will provide some insight into some of the attacks that have occurred and what impacts they have had. Then we will discuss the traditional methods for mitigation.

Another class of public services is more focused on the consumer space but can also be used to provide services to ‘critical’ devices. This mix and mash of ‘cloud’ within these areas is causing a rise in concern among security and risk analysts. The problem is that the trend is well under way. It is probably best to start by examining the challenges of a typical metro transit service. Obviously the primary need is to control the trains and subways. These systems need to be isolated, or at the very least very secure. The transit authority also needs to provide for fare services, employee communications and of course public internet access for passengers. We will discuss these different needs and the protocols involved in providing for these services. Interestingly, we will see some paradigms of reasoning as we do this review, and these will in turn reveal many of the underlying causes of vulnerability. We will also see that as these different requirements converge onto common infrastructures, conflicts arise that are often resolved by completely separate network infrastructures. This leads to increasing cost and complexity, as well as the increasing risk of the two systems being linked at some point in a way that would be difficult to determine. It is here where the backdoor of vulnerability can occur. Finally, we will look at new and innovative ways to address these challenges and how they can take our infrastructure security to a new level without abandoning the advancement that remote communications has offered. The fact is, sometimes you do NOT want certain systems and/or protocols to ‘see’ one another. Or at the very least there is the need to have very firm control over where and how they can see one another and inter-communicate. So, this is a big subject and it straddles many different facets. Strap yourself in; it will be an interesting ride!

Supervisory Control and Data Acquisition (SCADA)

Most process automation systems are based on a closed loop control theory. A simple example of a closed loop control theory is a gadget I rigged up as a youth. It consisted of a relay that would open when someone opened the door to my room. The drop in voltage would trigger another relay to close causing a mechanical lever to push a button on a camera. As a result I would get a snapshot of anyone coming into my room. It worked fairly well once I worked out the kinks (they were all on the mechanical side by the way). With multiple siblings it came in handy. This is a very simple example of a closed loop control system. The system is actuated by the action of the door (data acquisition) and the end result is the taking of a photograph (control). While this system is arguably very primitive it still demonstrates the concept well and we will see that the paradigm does not really change much as we move from 1970’s adolescent bedroom security to modern metro transit systems.

In the automation and control arena there are a series of defined protocols that are of both standards based and proprietary nature. These protocols are referred to as SCADA, which is short for Supervisory Control and Data Acquisition. Examples of these protocols on the proprietary side are Modbus, BACnet and LonWorks. Industry standard examples are IEC 61131 and 60870-5-101[IEC101]. Using the established simple example of a closed loop control we will take the concept further by looking at a water storage and distribution system. The figure below shows a simple schematic of such a system. It demonstrates the concepts of SCADA effectively. We will then use that basis to extend it further to other uses.

Figure 1

Figure 1. A simple SCADA system for water storage and distribution

The figure above illustrates a closed loop system. Actually, it is comprised of two closed loops that exchange state information between them. The central element of the system is the water tank (T). Its level is measured by sensor L1 (which could be as simple as a mechanical float attached to a potentiometer). As long as the level of the tank is within a certain range it will keep the LEVEL trace ON. This trace is provided to a device called a Programmable Logic Controller (PLC) or Remote Terminal Unit (RTU). In the case of the diagram it is provided to PLC2. As a result, PLC2 sends a signal to a valve servo (V1) to keep it in the OPEN state. If the level were to fall below a defined value in the tank, then the PLC would turn the valve off. There may be additional ‘blow off’ valves that the PLC might invoke if the level of the tank grew too high. But this would be a precautionary emergency action. In normal working conditions this would be handled by the other closed loop. In that loop there is a flow meter (F1) that provides feedback to PLC1. As long as PLC1 is receiving a positive flow signal from the sensor it will keep the pump (P1) running and hence feeding water into the system. If the rate on F1 falls below a certain value, then it is determined that the tank is nearing full and PLC1 will tell the pump to shut down. As an additional precaution there may be an alternate feed from sensor L1 that will flag the pump to shut down if the tank level reaches full. This is known as a second loop failsafe. As a result, we have a closed loop, self-monitoring system that in theory should run on its own without any human intervention. Such systems do. But they are usually monitored by Human Machine Interfaces (HMI). In many instances these will literally be the schematic of the system with a series of colors (as an example, yellow for off, orange & red for warning & alarm, green for running). In this way, an operator has visibility into the ‘state’ of the working system. HMIs can also offer human control of the system. As an example, an operator might shut off the pump and override the automatic valve close in order to drain the system for maintenance. So in that example the closed loop would be extended to include a human who could provide an ‘ad hoc’ input to the system.
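A compressed simulation of the two loops in Figure 1 helps make the feedback explicit: PLC2 closes the valve when the level falls too low, PLC1 stops the pump when the flow rate suggests the tank is nearing full, and the L1 full signal acts as the second loop failsafe. All thresholds are invented for illustration.

# Toy simulation of the two-loop water system in Figure 1 (thresholds invented).

LEVEL_LOW, LEVEL_FULL = 0.20, 0.95   # tank level as a fraction of capacity
FLOW_MIN = 0.05                      # below this, PLC1 decides the tank is nearly full


def plc2_valve(level):
    """Loop 2: keep the valve (V1) open while the tank level is adequate."""
    return "OPEN" if level >= LEVEL_LOW else "CLOSED"


def plc1_pump(flow, level):
    """Loop 1: run the pump while flow persists; the L1 full signal is the failsafe."""
    if level >= LEVEL_FULL:          # second-loop failsafe from sensor L1
        return "OFF"
    return "ON" if flow >= FLOW_MIN else "OFF"


if __name__ == "__main__":
    for level, flow in [(0.50, 0.30), (0.96, 0.30), (0.15, 0.02)]:
        print(f"level={level:.2f} flow={flow:.2f} "
              f"pump={plc1_pump(flow, level)} valve={plc2_valve(level)}")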

The utility of these protocols is obvious. They control everything from water supplies to electrical power grid components. They are networked, and need to be, due to the very large geographic area that they often are required to cover. This is as opposed to my bedroom security system (it was never really intended for security – it was just a kick to get photos of folks who were unaware), which was a ‘discrete’ system. In such a system, the elements are hardwired and physically isolated. It is hard to get into such a room to circumvent the system. One would literally have to climb in through the window. This offers a good analogy of what SCADA-like systems are experiencing. But one also has to realize that discrete systems are very limited. As an example, it would be a big stretch to use a discrete system to manage a municipal water supply. One would argue that it would be so costly as to make no sense. So SCADA systems are a part of our lives. They can bring great benefit but there is still the spectre of security vulnerability.

Security issues with SCADA

Given that SCADA systems are used to control facilities such as oil, power and public transportation, it is important to ensure that they are robust and have the connectivity to the right control systems and staff. In other words they must be networked. Many implementations of SCADA are L2, using only Ethernet for transport, as an example. More recently, there are TCP/IP extensions to SCADA that allow for true Internet connectivity. One would think that this is where the initial concerns for security would lie, but actually they are just a further addition to the system’s vulnerabilities. There are a number of reasons for this.

First, there was a general lack of concern for security, as many of these environments were at one time fairly discrete. As an example, a PLC is usually used in local control scenarios. A Remote Terminal Unit does just what it says: it creates a remote PLC that can be controlled over the network. While this extension of geography has obvious benefits, along with it creeps in the window of unauthorized access.

Second, there was and still is the general belief that SCADA systems are obscure and not well known. Their protocol constructs are not widely published, particularly in the proprietary versions. But as is well known, ‘security by obscurity’ is a partial security concept at best, and many true security specialists would say it is a flawed premise.

Third, initially these systems had no connectivity to the Internet, but this is changing. Worse yet, it does not have to be the SCADA system itself that is exposed. All an attacker needs is access to a system that can in turn reach the SCADA system. This brings about a much larger problem.

Finally, because these networks are physically secure, it was assumed that some form of cyber-security was realized as well. As the previous point shows, this is a flawed and dangerous assumption.

Given that SCADA systems control some of our most sensitive and critical systems, it should be no surprise that there have been several attacks. One example is a SCADA control for sewer flow where a disgruntled ex-employee gained access to the system and reversed certain control rules. The end result was a series of sewage flooding events into local residential and park areas. Initially, it was thought to be a system malfunction, but the hacker’s access was eventually found out and the culprit was nabbed. This can even reach an international scale. As critical systems such as power grids become networked, the security concern can grow to the level of national security interests.

While these issues are not new, they are now well known. Security by Obscurity is no longer a viable option. Systems isolation is the only real answer to the problem.

 

The Bonjour Protocol

On the other side of the spectrum we have a service that is often required at public locations and is the antithesis of the prior discussion. This is a protocol that WANTS services visibility. This protocol is known as Bonjour. Created by Apple™, it is an open system protocol that allows for services resolution. Again it is best to give a case-in-point example. Let’s say that you are a student at a university and you want to print a document from your iPad. You can simply hit the print icon and the Bonjour service will send an SRV query for @PRINTER to the Bonjour multicast address of 224.0.0.251. The receiver of the multicast group address is the Bonjour DNS resolution service, which will reply to the request with a series of local printer resources for the student to use. To go further, if the student were to look for an off-site resource such as a software upgrade or application, the Bonjour service would respond and provide a URL to an Apple download site. The diagram shows a simple Bonjour service exchange.

Figure 2

Figure 2. A Bonjour Service Exchange

Bonjour also has a way for services to ‘register’ to Bonjour as well. A good example, as shown above, is the case of iMusic. As can be seen, the player system can register to the local Bonjour service as @Musicforme. Now when a user wishes to listen they simply query the Bonjour service for @Musicforme and the system will respond with the URL of the player system. This paradigm has obvious merits in the consumer space. But we need to realize that the consumer space is rapidly spilling over into the IT environment. This is the trend that we typically hear of as ‘Bring Your Own Device’ or BYOD. The university example is easy to see, but many corporations and public service agencies are dealing with the same pressures. Additionally, some true IT-level systems are now implementing the Bonjour protocol as an effective way to advertise services and/or locate and use them. As an example, some video surveillance cameras will use the Bonjour service to perform software upgrades or for discovery. Take note that Bonjour really has no conventions for security other than the published SRV. All of this has the security world in a maelstrom. In essence, we have disparate protocols evolving out of completely different environments for totally different purposes coming to nest in a shared environment that can be of a very critical nature. This has the makings of a Dan Brown novel!
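
For the curious, Bonjour rides on multicast DNS (UDP port 5353 to 224.0.0.251) together with DNS Service Discovery. Below is a minimal Python sketch of such a query using only the standard library; the service name "_ipp._tcp.local" is an assumed, illustrative printer service type and the reply is not parsed.

```python
# A minimal sketch of a Bonjour/mDNS service query using only the
# standard library. The service type "_ipp._tcp.local" is illustrative.
import socket, struct

def mdns_query(service="_ipp._tcp.local", timeout=2.0):
    # DNS header: ID=0, flags=0, 1 question, 0 answer/authority/additional
    header = struct.pack("!6H", 0, 0, 1, 0, 0, 0)
    # Question: QNAME as length-prefixed labels, QTYPE=PTR(12), QCLASS=IN(1)
    qname = b"".join(bytes([len(p)]) + p.encode() for p in service.split("."))
    question = qname + b"\x00" + struct.pack("!2H", 12, 1)

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.settimeout(timeout)
    sock.sendto(header + question, ("224.0.0.251", 5353))
    try:
        data, peer = sock.recvfrom(4096)   # raw DNS response, not parsed here
        print("reply from", peer, "-", len(data), "bytes")
    except socket.timeout:
        print("no Bonjour responders for", service)
    finally:
        sock.close()

mdns_query()
```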

 

 

Meanwhile, back at the train station…

Let’s now return to our Transit Authority, which runs a high-speed commuter rail service as part of its offerings. As a part of this service they offer business services such as Internet access and local business office services such as printing and scanning. They also have a SCADA system to monitor and control the railways. In addition they obviously have a video surveillance system and, you guessed it, those cameras use the Bonjour service for software upgrades and discovery. They also have the requirement to run Bonjour for the business services as well.

In legacy approaches the organization would need to either implement totally separate networks or a multi-services architecture via the use of Multi-Protocol Label Switching, or MPLS. This is an incredibly complex suite of protocols with well-known, and high, CapEx and OpEx requirements. Running an MPLS network is arguably one of the most challenging financial endeavors an IT organization can take on. The figure below illustrates the complexity of the MPLS suite. Note that it also shows a comparison to Shortest Path Bridging (IEEE 802.1aq and RFC 6329) as well as the IETF drafts to extend L3 services across the Shortest Path Bridging fabric.

Figure 3

Figure 3. A comparison between MPLS and SPB

There are two major points to note. First, there is a dramatic consolidation of the dependent overlay control planes into a single integrated one provided by IS-IS. Second, as a result of this consolidation the mutual dependence of the service layers is broken into mutually independent service constructs. An underlying benefit is that services are also extremely simple to construct and provision. Another benefit is that these service constructs are correspondingly simpler from an elemental perspective. Rather than requiring a complex and coordinated set of service overlays, SPB/IS-IS provides a single integrated service construct element known as the I-Component Service ID, or I-SID.

In previous articles we have discussed how an I-SID is used to emulate end to end L2 service domains as well as true L3 IP VPN environments. Additionally, we covered how I-SIDs can be used dynamically to provide solicited demand services for IP multicast. In this article, we will be focusing on their inherent traits of services separation and control as well as how these traits can be used to enhance a given security practice.

For this particular project we developed the concept of three different network types. Each network type is used to provide for certain protocol instances that require services separation and control. They are listed as follows:

1). Layer 3 Virtual Service Networks

These IP VPN services are used to create a general services network for office access and Internet access.

2). Local User Subnets (within the L3 VSN)

These are local L2 broadcast domains that provide normal Internet ‘guest’ access for railway passengers. These networks can also support ‘localized’ Bonjour services for the passengers, but the service is limited to the station scope and is not allowed to be advertised or resolved outside of that local subnet boundary.

3). Layer 2 Virtual Service Networks

These L2 domains are used at a more global level. Due to SPB’s capability to extend L2 service domains across large geographies without the need to support end to end flooding, L2 VSNs become very useful for supporting extended L2 protocol environments. Here we are using dedicated L2 VSNs to support both the SCADA and Bonjour protocols. Each protocol will enjoy a private, non-IP-routed L2 environment that can be placed anywhere within the end to end SPB domain. As such, they can provide global L2 service-separated domains simply by not assigning IP addresses to the VLANs. IP can still run over the environment, as Bonjour requires it, but that IP network will not be visible or reachable within the IS-IS link state database (LSDB) via VRF0.

Figure 4

Figure 4. Different Virtual Service Networks to provide for separation and control.

The figure above illustrates the use of these networks in a symbolic fashion. As can be seen, there are two different L3 VSNs. The blue L3 VSN is used for internal transit authority employees and services. The red L3 VSN is used for railway passenger Internet access. Note that there are two things of significance here. First, this is a one way network for these users. They are given a default network gateway to the Internet and that is it. There is no connectivity from this L3 VSN to any other network or system in the environment. Second, each local subnet also allows for local Bonjour services so that users can use their different personal device services without concern that they will go beyond the local station or interfere with any other service at that station.

There are then two L2 VSNs that are used to provide inter-station connectivity for the transit authority’s use. The green L2 VSN is used to provide the SCADA protocol environment while the yellow L2 VSN provides for the Bonjour protocol. Note that unlike the other Bonjour L2 service domains for the passengers, this L2 domain can be distributed not only within the stations but between the stations as well. As a result, we have five different types of service domains, each one separated, scoped and controlled over a single network infrastructure. Note that in the case of a passenger at a station who is bringing up their Bonjour client, they will only see other local resources, not any of the video surveillance cameras that also use Bonjour but do so in a totally separate L2 service domain that has absolutely no connectivity to any other network or service. Note also that the station clerk has a totally separate network service environment that gives them confidential access to email, UC and other internal applications that tie back into the central data center resources. In contrast, the passengers at the station are provided Internet access only for general browsing or VPN usage. There is no viable vector for any would-be attacker in this network.
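
To make the separation concrete, here is a small hypothetical sketch of how those five service domains might be modeled. The I-SID values and endpoint names are invented for illustration only; the point is simply that two endpoints can only communicate if they share a service.

```python
# A sketch of the five service domains described above. The I-SID
# values and names are hypothetical.
SERVICES = {
    "blue-L3VSN-internal":   {"isid": 1001, "type": "L3 VSN"},
    "red-L3VSN-passenger":   {"isid": 1002, "type": "L3 VSN"},
    "passenger-bonjour-sub": {"isid": 1003, "type": "local L2 subnet"},
    "green-L2VSN-scada":     {"isid": 2001, "type": "L2 VSN"},
    "yellow-L2VSN-bonjour":  {"isid": 2002, "type": "L2 VSN"},
}

ENDPOINTS = {
    "passenger-ipad":   "passenger-bonjour-sub",
    "station-camera":   "yellow-L2VSN-bonjour",
    "plc-track-sensor": "green-L2VSN-scada",
    "station-clerk-pc": "blue-L3VSN-internal",
}

def can_talk(a, b):
    """Two endpoints communicate only if they sit in the same I-SID."""
    return SERVICES[ENDPOINTS[a]]["isid"] == SERVICES[ENDPOINTS[b]]["isid"]

print(can_talk("passenger-ipad", "station-camera"))    # False: separate Bonjour domains
print(can_talk("station-camera", "plc-track-sensor"))  # False: SCADA stays isolated
```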

Now the transit authority enjoys the ability to deploy these service environments at will, anywhere they are required. Additionally, if requirements for new service domains come up (entry and exit systems, for example), they can be easily created and distributed without a major upheaval of the existing networks that have been provisioned.

 

Seeing and Controlling are two different things…

Sometimes one service can step on another. High bandwidth, resource intense services such as multicast based video surveillance can tend to break latency sensitive services such as SCADA. In a different example project, these two applications were in direct conflict. The IP multicast environment was unstable, causing loss of camera feeds and recordings in the video surveillance application, while the SCADA based traffic light control systems experienced daily outages. In a traditional PIM protocol overlay we require multiple state machines that run in the CPU. Additionally, these state machines are full time, meaning that they need to consider each IP packet separately and forward accordingly. For multicast packets there is an additional state machine requirement where there may be various modes of behavior based on whether it is a source or a receiver and whether or not the tree is currently established or extended. These state machines are complex and they must run for every multicast group being serviced.

Figure 5

Figure 5. Legacy PIM overlay

Each PIM router needs to perform this hop by hop computation, and this needs to be done by the various state machines in a coordinated fashion. In most applications this is acceptable. As an example, for IP television delivery there is a relatively high probability that someone is watching the channels being multicast (if not, they are usually promptly identified and removed; ratings will determine the most viewed groups). In this model, if there is a change to the group membership, it is minor and at the edge – minor in the sense that one single IP set-top box has changed the channel. The point here is that this is a minor topological change to the PIM tree and might not even impact it at all. Also, the number of sources is relatively small compared to the community of viewers (200-500 channels to thousands if not tens of thousands of subscribers).

The problem with video surveillance is that this model reverses many of these assumptions, and this causes havoc with PIM. First, the ratio of sources to receivers is reversed, and the degree of the ratio changes as well. As an example, in a typical surveillance project of 600 cameras there could be instances as high as 1,200 sources, with transient spikes that will go higher during state transitions. Additionally, video surveillance applications typically have the phenomenon of ‘sweeps’, where a given receiver that is currently viewing a large group of cameras (16 to 64) will suddenly change and request another set of groups.

At these points the amount of required state change in PIM can be significant. Further, there may be multiple instances of this occurring at the same time in the PIM domain. These instances could be humans at viewing consoles or they could be DVR type resources that automatically sweep through sets of camera feeds on a cyclic basis. So as we can see, this can be a very heavy lift for PIM, and tests have validated this. SPB offers a far superior method for delivering IP multicast.
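
A quick back-of-the-envelope sketch shows why. The numbers below are illustrative assumptions loosely based on the figures quoted above, not measurements from any particular deployment.

```python
# Rough arithmetic for PIM (S,G) state churn during surveillance 'sweeps'.
# All values are illustrative assumptions.
cameras           = 600
streams_per_cam   = 2      # e.g. main + sub stream, giving ~1,200 sources
view_size         = 64     # groups shown on one console/DVR at a time
concurrent_sweeps = 10     # consoles/DVRs sweeping at the same time

total_sg_state   = cameras * streams_per_cam
# A sweep prunes one set of groups and joins another: 2 * view_size events
events_per_sweep = 2 * view_size

print("steady-state (S,G) entries   :", total_sg_state)
print("join/prune events per sweep  :", events_per_sweep)
print("events with concurrent sweeps:", events_per_sweep * concurrent_sweeps)
```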

Now let us consider the second application: the use of SCADA to control traffic lights, often referred to as Intelligent Traffic Systems or ITS. Like all closed loop applications, there is a failsafe instance, which is the familiar red and yellow flashing lights that we see occasionally during storms and other impediments to the system. This is to assure that the traffic light will never fail in a state of permanent green or permanent red. As soon as communication times out, the failsafe loop is engaged and maintained until communications are restored.

During normal working hours the traffic light is obviously controlled by some sort of algorithm. In certain high volume intersections this algorithm may be very complex and based on the hour of the day. In most other instances the algorithm is rather dynamic and based on demand. This is accomplished by placing a sensing loop at the intersection (older systems were weight based while newer systems are optical). As a vehicle pulls up to the intersection its presence is registered and a ‘wait set’ period is engaged. This presumably allows enough time for passing traffic to move through the intersection. In rural intersections this wait set period will be ‘fair’: each direction will have equal wait sets. In urban situations where minor roads intersect with major routes, the wait set period will strongly favor the major route, with a relatively large wait set period for the minor road. The point in all of this is that these loops are expected to be fairly low latency and there is not expected to be a lot of loss in the transmission channel. Consequently, SCADA tends towards very small packets that expect a very fast round trip with minimal or no loss. You can see where I am going here. The two applications do not play well together. They require separation and control.
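
The failsafe behavior just described can be sketched in a few lines. The timeout value and state names below are illustrative assumptions, not taken from any real ITS controller.

```python
# A minimal sketch of the ITS failsafe: if SCADA keep-alives stop arriving
# within a timeout, the controller drops into the flashing red/yellow
# failsafe state until communications return.
import time

COMM_TIMEOUT = 2.0   # seconds without a SCADA message before failsafe (assumed)

class IntersectionController:
    def __init__(self):
        self.last_msg = time.monotonic()
        self.state = "NORMAL_CYCLE"

    def on_scada_message(self):
        self.last_msg = time.monotonic()
        if self.state == "FAILSAFE_FLASH":
            self.state = "NORMAL_CYCLE"      # communications restored

    def tick(self):
        if time.monotonic() - self.last_msg > COMM_TIMEOUT:
            self.state = "FAILSAFE_FLASH"    # flashing red/yellow
        return self.state

ctl = IntersectionController()
print(ctl.tick())          # NORMAL_CYCLE
time.sleep(2.1)
print(ctl.tick())          # FAILSAFE_FLASH until comms resume
```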

Figure 6

Figure 6. Separation of IP multicast and SCADA traffic by the use of I-SIDs

As was covered in a previous article (circa June 2012) and also shown in the illustration above, SPB uses dynamically built I-SIDs with a value greater than 16M to establish IP multicast distribution trees. Each multicast group uses a discrete and individual I-SID to create a deterministic reverse path forwarding environment. Note also that the SCADA is delivered via a discrete L2 VSN that is not enabled for IP multicast, or any IP configuration for that matter. As a result, the SCADA elements are totally separated from any IP multicast or unicast activity. There is no way for any traffic from the global IP route or IP VPN environment to get forwarded into the SCADA L2 VSN. There is simply no IP forwarding path available. The figure above illustrates a logical view of the two services.

The end result of the conversion changed the environment drastically. Since then they have not lost a single camera or had any issues with SCADA control. This is a direct testament to the forwarding plane separation that occurs with SPB. As such, both applications can be supported with no issues or concerns that one will ‘step on’ the other. It also enhances security for the SCADA control system. As there is no IP configuration on the L2 VSN (note that IP could still ‘run’ within the L2 VSN, as is possible, for example, with the SCADA HMI control consoles), there is no viable path for spoofing or launching a DoS attack.

What about IP extensions for SCADA?

As was mentioned earlier in the article, there are methods to provide TCP/IP extensions for SCADA. Due to the criticality of the system, however, this is seldom used because of the cost of securing the IP network from threat and risk. As with any normal IP network, protecting them to the required degree is difficult and costly, particularly since the intention of the protocol overlay is to provide for things like mobile and remote access to the system. Doing this with traditional legacy IP networking would be a big task.

With SPB, L3 VSNs can be used to establish a separated IP forwarding environment that can then be directed to appropriate secure ‘touch points’ at a predefined point in the topology of the network. Typically, this will be a data center or a secured DMZ adjunct to it. There, all remote access is facilitated through a well-defined series of security devices: firewalls, IPS/IDS and VPN service points. As this is the only valid ingress into the L3 virtual service environment, it is hence much easier and less costly to monitor and mitigate any threats to the system, with clear forensics in the aftermath. The illustration below shows this concept. The message is that while SPB is not a security technology in and of itself, it is clearly a very strong complement to those technologies. If used properly it can provide the first three of the ‘series of gates’ in the layered defense approach. The diagram below shows how this operates.

Figure 7

Figure 7. SPB and the ‘series of gates’ security concept

In a very early article on this blog I spoke to the issues and paradigms of trust and assurance (see Aspects and characteristics of Trust and its impact on Human Dynamics and E-Commerce – June 2009). There I introduced the concept of composite identities and the fact that all identities in cyber-space are as such. This basic concept is rather obvious when it speaks to elemental constructs of device/user combinations, but it gets smeared when the concept extends to applications or services. Or it can extend further to elements such as location or the systems that a user is logged into. These are all elements of a composite instance of a user and they are contained within a space/time context. As an example, I may allow user ‘A’ to access application ‘A’ from location ‘A’ with device ‘A’. But any other location, device or even time combination may require a totally different authentication and consequent access approach. This composite approach is very powerful, particularly when combined with the rather strong path control capabilities of SPB. This combination yields an ability to determine network placement based on user behavior patterns – those expected and within profile, but more importantly those that are unusual and outside the normal user’s profile. These instances require additional challenges and consequent authentications.

As noted in the figure above, the series of gates concept merges well with this construct. The first gate provides identification of a particular user/device combination. From this elemental composite, network access is provided according to a policy. From there the user is limited to the particular paths that provide access to a normal profile. As a user goes to invoke a certain secure application, the network responds with an additional challenge. This may be an additional password or perhaps a certain secure token and biometric signature to reconfirm identity for the added degree of trust. This is all normal. But in the normal environment the access is provided at the systems level, thereby increasing the ‘smear’ of the user’s identity. A critical difference in the approach I am referring to is that the whole network placement profile of the user changes. In other words, in the previous network profile the system that provides the said application is not even available by any viable network path. It is by the renewal of challenge and additional tiers of authentication that such connectivity is granted. Note how I do not say access but connectivity. Certainly systems access controls would remain, but by and large they would be the last and final gate. At the user edge, whole logical topology changes occur that place the user into a dark horse IP VPN environment where secure access to the application can be obtained.
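
The following sketch illustrates the composite identity idea in its simplest form. The profile data, VSN names and challenge logic are all hypothetical; the point is that the (user, device, location, time) tuple drives network placement, and an out-of-profile combination triggers a further challenge before any secure connectivity is granted.

```python
# A sketch of composite-identity-driven network placement.
# Profiles and VSN names are hypothetical.
NORMAL_PROFILE = {
    "alice": {"device": "corp-laptop-17",
              "locations": {"HQ", "Station-3"},
              "hours": range(7, 19)},
}

def place_user(user, device, location, hour, passed_second_factor=False):
    p = NORMAL_PROFILE.get(user)
    in_profile = (p and device == p["device"]
                  and location in p["locations"] and hour in p["hours"])
    if in_profile and passed_second_factor:
        return "secure-app-L3VSN"        # dark horse VPN holding the application
    if in_profile:
        return "general-office-L3VSN"    # normal placement; app not reachable
    return "challenge-and-quarantine"    # out of profile: re-authenticate first

print(place_user("alice", "corp-laptop-17", "HQ", 10))
print(place_user("alice", "corp-laptop-17", "HQ", 10, passed_second_factor=True))
print(place_user("alice", "unknown-tablet", "Cafe", 23))
```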

Wow! The noise is gone

In this whole model something significant occurs. Users are now in communities of interest where only certain traffic pattern profiles are expected. As a result, zero day alerts from anomaly based IPS/IDS systems become something other than white noise. They become very discrete resources with an expected monitoring profile, and any anomalies outside of that profile will flag as a true alert that should be investigated. This enables zero day threat systems to work far more optimally, as their theory of operation is to look for patterns outside of the expected behaviors that are normally seen in the network. SPB complements this by keeping communities strictly separate when required. With a smaller isolated community it is far easier to use such systems accurately. The diagram below illustrates the value of this virtualized security perimeter. Note how any end point is logically on the ‘outer’ network connectivity side. Even though I-SIDs traverse a common network footprint they are ‘ships in the night’ in that they never see one another or have the opportunity to inter-communicate except by formal monitored means.

Figure 8

Figure 8. An established ‘virtual’ Security Perimeter

Firewalls are also notoriously complex when they are used for community separation or multi-tenant applications. The reason for this is that all of the separation is dependent on the security policy database (SPD) and how well it covers all given applications and port calls. If a new application is introduced and it needs to be isolated, the SPD must be modified to reflect it. If this gets missed or the settings are not correct, the application is not isolated and no longer secure. Again, SPB and dark horse networking help in controlling users’ paths and keeping communities separate. Now the firewall is white listed, with a default deny-all policy after that. As new applications get installed, unless they are added to the white list they will be isolated by default within the community in which they reside. There is far less manipulation of the individual SPDs and far less risk of an attack surface developing in the security perimeter due to a missed policy statement.
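
A sketch of the white list plus default deny model looks something like the following. The entries are hypothetical; the behavior to note is that anything not explicitly listed stays contained in its community.

```python
# White list plus default deny at the virtual security perimeter.
# (community, application, port) tuples are hypothetical examples.
WHITELIST = {
    ("clerk-L3VSN", "email", 443),
    ("clerk-L3VSN", "uc-signalling", 5061),
    ("scada-L2VSN", "hmi-console", 20000),
}

def permit(community, app, port):
    """Only explicitly listed flows cross the perimeter; everything else is denied."""
    return (community, app, port) in WHITELIST

print(permit("clerk-L3VSN", "email", 443))         # True
print(permit("clerk-L3VSN", "new-erp-app", 8443))  # False until deliberately white-listed
```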

 

Time to move…

There is another set of traits that is very attractive about SPB, and particularly what we have done with it at Avaya in our Fabric Connect. It is something termed mutability. In the last article on E-911 evolution we touched on this a little bit. Here I would like to go into it in a little more detail. IP VPN services are nothing new. MPLS has been providing such services for years. Unlike MPLS however, SPB is very dynamic in the way it handles new services or changes to existing services. Where the typical MPLS infrastructure might require hours or even days for the provisioning process, SPB can accomplish the same service in a matter of minutes or even seconds. This does not take into account that MPLS also requires the manual provisioning of alternate paths. With SPB, not only are the service instances intelligently extended across the network by the shortest path, they are also provided all redundancy and resilience by virtue of the SPB fabric. If alternate routes are available they will be used automatically during times of failure. They do not have to be manually provisioned ahead of time. The fabric has the intelligence to reroute by the shortest path automatically. At Avaya, we have tested our fabric to a reliable convergence of 100ms or under, with the majority of instances falling into the 50ms range. As such, mutability becomes a trait that Avaya alone can truly claim. But in order to establish what that is, let’s recognize that there are two forms.

1). Services mutability

This was covered to some degree in the previous article but to review the salient points. It really boils down to the fact that a given L3 VSN can be extended anywhere in the SPB network in minutes. The principles pointed out from the previous article illustrate that membership to a given dark horse network can be rather dynamic and can not only be extended but retracted as required. This is something that comes as part and parcel with Avaya’s Fabric Connect. While MPLS based solutions may provide equivalent type services, none are as nimble, quick or accurate in prompt services deployment as Avaya’s Fabric Connect based on IEEE 802.1aq Shortest Path Bridging.

2). Nodal mutability

This is something very interesting and if you ever have the chance to get hands-on experience, please try it. It is very, very profound. Recall from previous articles that each node holds a resident ‘link state database’ generated by IS-IS that reflects its knowledge of the fabric from its own relative perspective. This knowledge not only scopes topology but resident provisioned services as well as those of other nodes. This creates a situation of nodal mutability. Nodal mutability means that a technician out at the far edge of the network can accidentally swap the two (or more) uplink ports and the node will still join the network successfully. Alternatively, if a node were already up and running and for some reason port adjacencies needed to change, it could be accommodated very easily with only a small configuration change. (Try it in a lab. It is very cool!) Going further on this logic, the illustration below shows that a given provisioned node could be unplugged from the network and then driven hundreds of kilometers to another location.

Figure 9

Figure 9. Nodal and Services Mutability

At that location, they could plug the node back into the SPB network and the fabric will automatically register the node and all of its provisioned services. If all of these services are dark horse then there will be authentication challenges into the various networks that the node provides as users access services. This means in essence that dark horse networks can be extremely dynamic. They can be mobile as well. This is useful in many applications where mobility is desired but the need to re-provision is frowned upon or simply impossible. Use cases such as emergency response, military operations or mobile broadcasting are just a few areas where this technology would be useful. But there are many others and the number will increase as time moves forward. There is no corresponding MPLS service that can provide for both nodal and services mutability. SPB is the only technology that allows for it via IS-IS, and Avaya’s Fabric Connect is the only solution that can provide this for not only L2 but L3 services as well as for IP VPN and multicast.

Some other use cases…

Other areas where dark horse networks are useful are in networks that require full privacy for PCI or HIPAA compliance. L3 Virtual Service Networks are perfect for these types of applications or solution requirements. Figure 8 could easily be an illustration for a PCI compliant environment in which all subsystems are within a totally closed L3 VSN IP VPN environment. The only ingress and egress are through well-defined virtual security perimeters that allow for the full monitoring of all allowed traffic. This combination yields an environment that, when properly designed, will easily pass PCI compliance scanning and analysis. In addition, these networks are not only private – they are invisible to external would-be attackers. The attack surface is reduced to the virtual security perimeter only. As such, it is practically non-existent.

In summary

While private IP VPN environments have been around for years they are typically clumsy and difficult to provision. This is particularly true for environments where quick dynamic changes are required. As an example, a typical MPLS IP VPN provisioning instance will require approximately 200 to 250 command lines depending on the vendor and the topology. Interestingly, much of this CLI activity is not in provisioning MPLS itself but in provisioning other supporting protocols such as IGPs and BGP. Also, consider that all of this is for just the initial service path. Any redundant service paths must then be manually configured. Compare this with Avaya’s Fabric Connect, which can provide the same service type with as little as a dozen commands. Additionally, there is no requirement to engineer and provision redundant service paths, as they are already provided by SPB’s intelligent fabric.

As a result, IP VPNs can be provisioned in minutes and be very dynamically moved or extended according to requirements. Again, the last article on the evolution of E-911 speaks to how an IP VPN morphs over the duration of a given emergency, with different agencies and individuals coming into and out of the IP VPN environment on a fairly dynamic basis based on their identity, role and group associations.

Furthermore, SPB nodes are themselves mutable. Once again, IS-IS provides for this feature. An SPB node can be unplugged from the network and moved to the opposite end of the topology, which can be hundreds or even thousands of kilometers away. There it can plug back in and IS-IS will communicate the nodal topology information as well as all provisioned services on the node. The SPB network will in turn extend those services out to the node, thereby giving complete portability to that node as well as its resident services.

In addition, SPB can provide separation for non-IP data environments as well. Protocols such as SCADA can enjoy an isolated non-IP environment through the use of L2 VSNs, and further, they can be isolated so that there is simply no viable path into the environment for would-be hackers.

This combination of privacy and fast mutability of both services and topology lends itself to what I term a Dark Horse Network. They are dark, so that they cannot be seen or attacked due to the lack of surface for such an endeavor. They are swift in the way they can morph by services extensions and they are extremely mobile, providing the ability for nodes to make wholesale changes to the topology and still be able to connect to relevant provisioned services without any need to re-configure. Any other IP VPN technology would be very hard pressed to make such claims, if indeed it can make them at all! Avaya’s Fabric Connect based on IEEE 802.1aq sets the foundation for the true private cloud.

Feel free to visit my new YouTube channel! Learn how to set up and enable Avaya’s Fabric Path Technology in a few short step-by-step videos.

http://www.youtube.com/channel/UCn8AhOZU3ZFQI-YWwUUWSJQ

How would you like to do IP Multicast without PIM or RP’s? Seriously, let’s use Shortest Path Bridging and make it easy!

June 8, 2012

 

Why do we need to do this? What’s wrong with today’s network?

Anyone who has deployed or managed a large PIM multicast environment will relate to the response to this question. PIM works on the assumption of an overlay protocol model. PIM stands for Protocol Independent Multicast, which means that it can utilize any IP routing table to establish a reverse path forwarding tree. These routes can be created with any independent unicast routing protocol such as RIP or OSPF, or even be static routes or combinations thereof. In essence, there is an overlay of the different protocols to establish a pseudo-state within the network for the forwarding of multicast data. As any network engineer who has worked with large PIM deployments will attest, they are sensitive beasts that do not lend themselves well to topology changes or expansions of the network delivery system. The key word in all of this is the term ‘state’. If it is lost, then the tree truncates and the distribution service for that length of the tree is effectively lost. Consequently, changes need to be done carefully and be well tested and planned. And this is all due to the fact that the state of IP multicast services is effectively built upon a foundation of sand.

The first major point to realize is that most of today’s Ethernet switching technology still operates with the same basic theory of operation as the original IEEE 802.1d bridges. Sure, there have been enhancements such as VLANs and tagged trunking that allow us to slice a multi-port bridge (which is what an Ethernet switch really is) up into virtual broadcast domains and extend those domains outside of the switch and between other switches. But by and large the original operational process is based on ‘learning’. The concept of a learning bridge is shown in the simple illustration below. As a port on a bridge receives an Ethernet frame, it remembers the source MAC address as well as the port it came in on. If the destination MAC address is known, it will forward out the port that the destination was last known to be on. As shown in the example below, source MAC “A” is received on port 1. As the destination MAC “B” is known to be on port 2, the bridge will forward accordingly.

 

Figure 1. Known Forwarding

But MAC “A” also sends out a frame to destination MAC “C”. Since MAC “C” is unknown to the bridge, it will flood the frame to all ports. As a result of the flooding, MAC “C” responds and is found to be on port 3. The bridge records the information into its forwarding information base and forwards the frame accordingly from that point on. Hence, this method of bridging is known as ‘flood based learning’. As one can readily see, it is a critical function for normal local area network behavior. No one argues the value or even the necessity of learning in the bridged or switched environment. The problem is that the example above was circa 1990.

 Figure 2. Unknown Flooding
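
For reference, the flood and learn behavior just described can be captured in a few lines of Python. This is a conceptual sketch of a learning bridge, not a model of any particular switch implementation.

```python
# A minimal sketch of classic 'flood and learn' bridging: remember which
# port a source MAC arrived on, forward known destinations out that port,
# and flood unknown destinations to all other ports.
class LearningBridge:
    def __init__(self, ports):
        self.ports = set(ports)
        self.fib = {}                            # MAC -> port (forwarding table)

    def receive(self, in_port, src_mac, dst_mac):
        self.fib[src_mac] = in_port              # learn the source
        if dst_mac in self.fib:
            return [self.fib[dst_mac]]           # known: forward to one port
        return sorted(self.ports - {in_port})    # unknown: flood everywhere else

br = LearningBridge(ports=[1, 2, 3])
print(br.receive(1, "A", "C"))   # C unknown: flood to ports 2 and 3
print(br.receive(3, "C", "A"))   # C learned on 3; A known: forward to [1]
print(br.receive(1, "A", "C"))   # now a known forward to [3]
```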

As the figure below shows, adding in virtual LANs and multi-port high speed switches makes things much more complex. The reality of it is that as the networking core grows larger, the switches in the middle get busier and busier. The forwarding tables need to be larger and larger, to the point where end to end VLANs are no longer tractable, so layer 3 boundaries via IP routing are introduced to segment the network domains. In the end, little MAC “A” is just one of the tens of thousands of addresses that traverse the core. In essence, there is no ‘state’ for MAC “A” (or any other MAC address for that matter).

 Figure 3. Unknown Flooding in a Routed VLAN topology

Additionally, recall that multicast is a destination address paradigm. IP multicast groups translate to destination MAC addresses at the Ethernet forwarding level. Due to the fact that it is a destination address, there needs to be a resolution to a unicast source address. This is not a straightforward process. It involves the overlay of services on top of the Ethernet forwarding environment. These services provide for the resolution of the source as well as the building of a reverse path forwarding environment and the joining of that path to any pre-existing distribution tree. In essence, these overlay services embed a sort of ‘state’ into the multicast forwarding service. These overlays are also very dependent on timers for the operating protocols and the fine tuning of these timers according to established best practice to maintain the state of the service. When this state is lost or becomes ambiguous, however, nasty things happen to the multicast service. This is the primary reason why multicast is so problematic in today’s typical enterprise environment.

The protocol most often used to establish the unicast routing service is OSPFv2 or v3 (Open Shortest Path First – v2 being for IPv4 and v3 for IPv6), which establishes the unicast routing tables for IP. OSPF runs over Ethernet and establishes end to end forwarding paths on top of the stateless, frame based flood and learn environment below. On top of this, PIM (Protocol Independent Multicast) is run to establish the actual multicast forwarding service. Source resolution is provided by a function known as an ‘RP’ or Rendezvous Point. This is an established service that registers sources for multicast and provides the ‘well known’ point within the PIM domain to establish source resolution. As a result, in PIM sparse mode all first joins to a multicast group from a given edge router are always via the RP. Once the edge router begins to receive packets it is able to discern the actual unicast IP address of the sending source. With this information the edge PIM router, or designated router (DR), will then build a reverse path forwarding tree back to the source or the closest topological leg of an existing distribution tree. At the L2 edge, end stations signal their interest in a given service via a protocol known as Internet Group Management Protocol, or simply IGMP. In addition, most L2 switches can be aware of this protocol and actually allow for selective forwarding to interested receivers without flooding to all ports in a given VLAN. This process is known as IGMP snooping. In PIM sparse mode, the version of IGMP typically used is IGMPv2, which is non-source-specific (this is *,G mode, where * means that the source address is not known). Once the source is resolved by the RP the state changes to S,G – where the source is now known. All of this is shown in the diagram below.

 

Figure 4. Protocol Independent Multicast Overlay Model

As can be readily seen, this is a complex mix of technologies to establish a single service offering. As a result, large multicast environments tend to be touchy and require a comparatively large operational budget and staff to keep running. Large changes to network topology can wreak havoc with IP multicast environments. As a result such changes need to be thought through and carefully planned out. Not all changes are planned, however. Network outages force topological changes that can often adversely affect the stability of the IP multicast service. The reason for this is the degree of protocol overlay and the need for correlation of the exact state of the network. As an example, a flapping unicast route could adversely affect an end to end multicast service. Additionally, this problem could be caused at the switch element level by a faulty link, port or module. Mutual dependencies in these types of solutions lend themselves to difficult troubleshooting and diagnostics. This translates to longer mean time to repair and overall higher operational expense.
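
To make the sparse mode join sequence a little more concrete, here is a conceptual sketch of the (*,G) to (S,G) transition. It is a model of the idea, not a PIM implementation, and the addresses are invented for illustration.

```python
# A conceptual sketch of PIM sparse mode state on an edge router:
# the first join for a group points towards the RP as (*,G); once
# packets reveal the real source, state moves to (S,G) towards it.
class EdgePimRouter:
    def __init__(self, rp):
        self.rp = rp
        self.mroutes = {}                        # group -> (source, upstream)

    def igmp_join(self, group):
        # (*,G): source unknown, join towards the Rendezvous Point
        self.mroutes.setdefault(group, ("*", self.rp))

    def first_packet(self, group, source):
        # source now resolved: switch to (S,G) towards the source itself
        self.mroutes[group] = (source, source)

dr = EdgePimRouter(rp="10.0.0.1")
dr.igmp_join("239.1.1.1")
print(dr.mroutes["239.1.1.1"])      # ('*', '10.0.0.1')
dr.first_packet("239.1.1.1", "10.10.10.10")
print(dr.mroutes["239.1.1.1"])      # ('10.10.10.10', '10.10.10.10')
```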

 

 There must be a better way…

As we noted previously, IP multicast is all about state. Yet at the lowest forwarding element level the operational aspects are stateless. It seems that a valid path forward is to evolve this lowest level to become more stateful and deterministic in the manner in which traffic is handled. In essence, the control plane of Ethernet Switching needs to evolve.

Control Plane Evolution

IEEE has established a set of standards that allows for the evolution of the Ethernet switching control plane into a much more stateful and deterministic model. There are three main innovations that enable this evolution.

Link State Topology Awareness – IS-IS

Universal Forwarding Label –The B-MAC

Provisioned Service Paths – Individual Service Identifiers

This is all achieved by introducing link state protocol (IS-IS) to Ethernet switching as well as the concept of provisioned service paths. These innovations, when combined with a MAC encapsulation method known as MAC in MAC (IEEE 802.1ah) allow for a radical change to the Ethernet switching control plane without abandoning its native dichotomy of control and data forwarding within the network element itself. This means that the switch remains an autonomous forwarding element, able to make its own decisions as to how to forward data most effectively. Yet, at the same time the new stateful nature of the control plane allows for very deterministic control of the data forwarding environment. The end result is a vast simplification of the Ethernet control plane that yields a very stateful and deterministic environment. This environment can then optionally be equipped with a provisioning server infrastructure that provides an API environment between the switching network and any applications that require resources from it. As applications communicate their requirements through the API, the server instructs the network on how to provision paths and resources. Yet importantly, if the network experiences failures, the switch elements know how to behave and have no need to communicate back to the provisioning server. They will automatically find the best path to facilitate any existing sessions and will use this modified topology for any new considerations.  In this model the best of both worlds is found. There is deterministic control of network services, but the network elements remain in control of how to forward data and react to changes in network topology.

 Figure 5. Stateful topology with the use of IS-IS

This technology is known as Shortest Path Bridging, the IEEE standard 802.1aq. As its name implies, it is an Ethernet switching technology that switches by the shortest available path between two end points. The analogy here is to the IP link state routing protocols OSPFv2 for IPv4 and OSPFv3 for IPv6. In link state protocols each node advertises its state as well as any extended reachability. Through these updates, each node gains a complete perspective of the network topology. Each element then runs the Dijkstra shortest path algorithm to identify the shortest loop-free path to every point within the network.

When one looks at the stateless methods of Ethernet forwarding and the need for antiquated protocols such as Spanning Tree, one cannot help but see link state as a path of promise. The problem is that OSPFv2 and OSPFv3 are ‘monolithic’ routing protocols, meaning that they were designed exclusively to route IP. IEEE knew this of course and found a very good link state protocol that was open and extensible. That protocol is IS-IS (Intermediate System to Intermediate System) from the OSI suite. One of the first areas of interest is that IS-IS establishes adjacencies with L2 Hellos, NOT L3 LSAs like OSPF. The second is that it uses extensible type-length-value (TLV) constructs to move information between switch elements, such as topology, provisioned paths or even L3 network reachability. In other words, the switches are ‘topology aware’. Once we have this stateful topology of Ethernet switches, we can now determine what network path data are to take for different application services.
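
For readers who want to see the mechanics, here is a small Dijkstra sketch over the kind of node and link topology IS-IS carries in its link state database. The topology and link costs are hypothetical; the point is that every node can compute its own loop-free shortest path to every other node.

```python
# Dijkstra shortest path over a hypothetical SPB topology.
import heapq

def shortest_paths(topology, root):
    dist, prev = {root: 0}, {}
    heap = [(0, root)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                              # stale heap entry
        for neighbour, cost in topology[node].items():
            nd = d + cost
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour], prev[neighbour] = nd, node
                heapq.heappush(heap, (nd, neighbour))
    return dist, prev

topology = {                                      # node -> {neighbour: link cost}
    "BEB-1": {"BCB-A": 10, "BCB-B": 10},
    "BCB-A": {"BEB-1": 10, "BEB-2": 10},
    "BCB-B": {"BEB-1": 10, "BEB-2": 20},
    "BEB-2": {"BCB-A": 10, "BCB-B": 20},
}
dist, prev = shortest_paths(topology, "BEB-1")
print(dist["BEB-2"], prev["BEB-2"])               # cost 20 via BCB-A
```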

The next step IEEE had to deal with was implementing a universal labelling scheme for the network that provides all of the information that a switch element needs to forward the data. Fortunately, there was a pre-existing standard, IEEE 802.1ah (MAC-in-MAC) that provides just this type of functionality. The standard was initially established as a provider/customer demarcation for metro Ethernet managed service offerings. The standard works on the concept of encapsulation of the outer edge (customer) Ethernet frame (C-MAC) into an inner core (provider) frame (B-MAC) that is transported and then stripped off on the other end of the inner core to yield a totally transparent end to end service. This process is shown in the illustration below.

 

Figure 6. The use of 802.1ah B-MAC as a universal forwarding label in conjunction with IS-IS

The benefit of this model is the immense amount of scalability and optimization that happens in the network core. Once a data frame is encapsulated, it can be transported anywhere within the SPB domain without the need to learn. How this is accomplished is by combining 802.1ah and IS-IS together with another modification and extension of virtualization. We will cover this next.

Recall that IS-IS allows for the establishment of adjacencies at the L2 Hello level and that information moves through these updates by the use of type-length-values, or TLVs. As we pointed out earlier, some of these TLVs are used for network reachability of those adjacencies. Well, these adjacencies are all based on the B-MACs of the SPB switches within the domain. Only those addresses are populated into the forwarding information databases at the establishment of adjacency and the running of the Dijkstra algorithm to establish loop-free shortest paths to every point on the network. As a result, the core Link State Database (LSDB) is very small and is only updated at new adjacencies such as new interfaces or switches. The important point is that it is NOT updated with end system MAC addresses. As a result, a core can support tens of thousands of outer C-MACs while only requiring a hundred or so B-MACs in the network core. The end result is that any switch in the SPB network can look at the B-MAC frame and know exactly what to do with it without the need to flood and learn or reference some higher level fabric controller.
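
A simplified sketch of the encapsulation itself is shown below. The field layout is abbreviated for illustration (flag bits are omitted); the backbone ethertypes shown are the 802.1ad S-TAG (0x88A8) and the 802.1ah I-TAG (0x88E7), and all addresses and values are invented examples.

```python
# A simplified sketch of IEEE 802.1ah (MAC-in-MAC) encapsulation: the
# customer frame is wrapped in a backbone header carrying B-DA, B-SA,
# a B-TAG (B-VID) and an I-TAG holding the 24-bit I-SID.
import struct

def mac(s):                      # "aa:bb:cc:dd:ee:ff" -> 6 bytes
    return bytes(int(b, 16) for b in s.split(":"))

def encapsulate(b_da, b_sa, b_vid, i_sid, customer_frame):
    b_tag = struct.pack("!HH", 0x88A8, b_vid & 0x0FFF)      # S-TAG TPID + B-VID
    i_tag = struct.pack("!HI", 0x88E7, i_sid & 0x00FFFFFF)  # I-TAG TPID + I-SID (flags omitted)
    return mac(b_da) + mac(b_sa) + b_tag + i_tag + customer_frame

# An example customer frame: C-DA, C-SA, IPv4 ethertype, payload.
customer = mac("00:00:00:00:00:0c") + mac("00:00:00:00:00:0a") + b"\x08\x00" + b"payload"
frame = encapsulate("02:bb:00:00:00:02", "02:bb:00:00:00:01",
                    b_vid=4051, i_sid=16000100, customer_frame=customer)
print(len(frame), "bytes on the backbone; core switches read only the outer header")
```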

There is one last thing required, however. Remember that we still need to learn MACs. At the edge of the SPB network we need to assume that there are normal IEEE 802.3 switches and end systems that need to be supported. So how does one end system establish connectivity across the SPB domain without flooding? This is where the concept of constrained multicast comes in. The simplest way to discuss constrained multicast is based on the concept of provisioned service paths. These provisioned paths, or I-SIDs (Individual Service Identifiers), are similar to VLANs in that they contain a broadcast domain, but they operate differently as they are based on subsets of the Dijkstra forwarding trees mentioned previously. As the example below shows, now when a station wishes to communicate with another end system, it simply sends out an ARP request. That ARP request is then forwarded out to all required points for the associated I-SID.

 

Figure 7. The ‘Constrained Multicast’ Model using 802.1ah and IS-IS

The end system on the other side receives the request and then responds, establishing a unicast session over the same shortest path. As a result, the normal Ethernet ‘flood and learn’ process can still be facilitated on the outside of the SPB domain without the need to flood and learn in the core. This vastly simplifies the network core, allows for deterministic forwarding behavior and provides the ability for separated virtual network services. The reason for this is shown in the diagram below, with a little better detail on the B-MAC for SPB and the legacy standards that it builds upon. As can be seen, the concept of the I-SID is a pseudo evolution of the parent Q tag in the 802.1Q-in-Q standard. The I-SID value is contained within the actual B-MAC header and consequently tells a core switch everything it needs to know, including whether or not it needs to replicate the frame for constrained multicast functionality. Note that the two most difficult problems of multicast distribution are solved: the first being source resolution and the second being the RPF build.

 

Figure 8. IEEE 802.1ah and its relation to other ‘Q’ standards

Once these technologies were merged together into a cohesive standard framework known as IEEE 802.1aq Shortest Path Bridging (MAC-in-MAC), or SPBm, the result is a very stateful and scalable switching infrastructure that lends itself very well to the building and distribution of multicast services. In addition, SPB can offer many other different types of services ranging from full IP routing to private IP VPN services, all provisioned at the edge as a series of managed services across the network core. With these layer three services comes the need for the distribution of multicast services across L3 boundaries. This is true L3 IP multicast routing. Interestingly, SPBm provides some very unique approaches to solving the problem. Again, let us take note that the two most important problems have already been solved.

The figure below shows an SPBm network that is providing multicast distribution between two IP subnets. One of the subnets is a L2 VSN (an I-SID that is associated with VLANs). The other subnet is a peripheral network that is reachable by IP shortcuts via IS-IS. Note that as a stream becomes active in the network, the BEB that has the source dynamically allocates an I-SID to the multicast stream and that information becomes known via the distribution of IS-IS TLVs. At the edge of the network the Backbone Edge Bridges (BEBs) are running IGMP snooping out to the L2 Ethernet edge. The edge SPB BEB in effect becomes the querier for the L2 edge. As receivers signal their interest in a given IP multicast group they are handled by the BEB to which they are connected, which looks in the IS-IS LSDB (Link State Database) for advertisements of the multicast stream within the context of the VSN to which the receiver belongs. Once the BEB advertising the stream and the I-SID are found in the LSDB, the BEB connected to the receiver uses standard IS-IS-SPB TLVs to receive traffic for the stream. The dynamically assigned I-SID values start at 16,000,001 and work upward. Provisioned services use values less than 16,000,000. In the case of L3 traversal, the I-SID is dynamically extended to provide for the build of the L3 multicast distribution tree. 802.1aq supports up to 16,777,215 I-SIDs.
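
The dynamic allocation idea can be sketched as follows. The allocation logic and VSN names are illustrative assumptions; the only facts carried over from the text are the dynamic range starting at 16,000,001 and the 16,777,215 ceiling.

```python
# A sketch of dynamic I-SID allocation: each active (source, group)
# stream in a given VSN is handed its own I-SID from the dynamic range,
# giving every multicast stream its own deterministic distribution tree.
DYNAMIC_ISID_START = 16_000_001
MAX_ISID = 16_777_215

class IsidAllocator:
    def __init__(self):
        self.next_isid = DYNAMIC_ISID_START
        self.streams = {}                        # (vsn, source, group) -> I-SID

    def allocate(self, vsn, source, group):
        key = (vsn, source, group)
        if key not in self.streams:
            if self.next_isid > MAX_ISID:
                raise RuntimeError("dynamic I-SID range exhausted")
            self.streams[key] = self.next_isid
            self.next_isid += 1
        return self.streams[key]

alloc = IsidAllocator()
print(alloc.allocate("L2VSN-10", "10.10.10.10", "239.1.1.1"))   # 16000001
print(alloc.allocate("L2VSN-10", "10.10.10.10", "239.1.1.2"))   # 16000002
```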

Figure 9. IP Multicast with SPB/IS-IS using IP Shortcuts and L2 VSN

As the diagram above shows, for an end station to receive multicast from the source, it merely uses this dynamic I-SID to extend the service to end stations 10.10.10.11 and 10.10.10.12, which are members of the same subnet over the L2 VSN. Conversely, receiver 10.10.11.10 will use the same dynamic I-SID, built using the information provided by IS-IS, to establish the end to end reverse forwarding path. In this model, IP multicast becomes much more stateful and integrated into the switch forwarding element. This results in a far greater build-out capacity for the multicast service. It also provides for a much more agile multicast environment when dealing with topology changes and network outages. Switch element failures are handled with ease because the layered mutual dependence model has been removed. If a failure occurs within the core or edge of the network, the service is able to heal seamlessly due to the fact that the information required to preserve service is already known by all of the elements involved. Due to the fact that the complete SPBm domain is topology aware, each switch member knows what it has to do in order to maintain established service. As long as a path exists between the two end points, Shortest Path Bridging will use it to maintain service. This is the result of true integration of link state routing into the Ethernet forwarding control plane.

What goes on behind closed doors…

In addition to providing constrained and L3 multicast, SPB also provides the ability to deliver ‘ship in the night’ IP VPN environments. With SPBm’s native capabilities it becomes very easy to extend multicast distribution into these environments as well. Normally, multicast distribution within an IP VPN environment is notoriously complex, dealing with yet more overlays of technology. Within SPBm networks, however, the task is comparatively simple. As the diagram below illustrates, a L3 VSN (IP VPN) is nothing more than a set of VRFs that are associated with a common I-SID. Here we run IGMP on the routed interfaces that connect to the edge VLANs. Note that IGMP snooping is not used here as the local BEB interface will be a router. IGMP, SPB and IS-IS perform as before and the dynamic I-SID simply uses the established Dijkstra path to provide the multicast service between the VRFs. Important to note, though, is that this service is invisible to the rest of the IP forwarding environment. It is a dark network that has no routes in and no routes out. Such networks are useful for video surveillance networks that require absolute separation from the rest of the networking environment. Note though that some services may be required from the outside world. This can be accommodated by policy based routing.

 

Figure 10. IP Multicast with SPB/IS-IS using L3 VPN

As the figure illustrates, the users within the L3 VSN have access to subnets 10.10.120.0/24, 10.10.130.0/24, 10.10.140.0/24 and 10.10.150.0/24 within the network, which is useful for services that require complete secure isolation such as IP multicast based video surveillance. The end result is a very secure closed system multicast environment that would be very difficult to build with legacy technology approaches.

I can see clearly now…

Going back to figure 4 that illustrates the legacy PIM overlay approach, we see that there are several demarcations of technology that tend to obscure the end to end service path. This creates complexities in troubleshooting and overall operations and maintenance. Note that at the edge we are dealing with L2 Ethernet switching and IGMP snooping, then we hop across the DR to the world of OSPF unicast routing. Over this and at the same demarcation we have the PIM protocol. Each demarcation and layer introduces another level of obscurity where the service has to be ‘traced and mapped’ into each technology domain. As a result, intermittent multicast problems can go on for quite some time until the right forensics are gathered to resolve the root cause of the problem.

With SPB, many if not all of these demarcations and overlays are eliminated. As a result, something that is somewhat of a Holy Grail in networking occurs. This is called ‘services transparency’. The end to end network path for a given service can be readily established and diagnosed without referring to protocol demarcations and ‘stitch points’. As previously shown, IP multicast services are a primary beneficiary of this network evolution. The elimination of protocol overlays provides for a stateful data forwarding model at the level where it makes the most sense: at the data forwarding element itself.

Network diagnostics become vastly simplified as a result. Measuring end to end latency and connectivity becomes a very straightforward endeavor. Additionally, diagnosing the multicast service path, something that is notoriously nasty with PIM, becomes very straightforward and even predictable. Tools such as IEEE 802.1ag and ITU Y.1731 provide diagnostics on network paths, end to end and nodal latencies, and all of this can be established end to end along the service path without any technology demarcations.

In Summary

IEEE 802.1aq Shortest Path Bridging is proving itself to be much more than a next generation data center mesh protocol. As previous articles have shown, the extensive reach of the technology lends itself well to metro and regional distribution as well as the true wide area. Additional capabilities added to SPB, such as the ability to deliver true L3 IP multicast without the use of a multicast routing overlay such as PIM, clearly demonstrate the extensibility of the protocol as well as its extremely practical implementation uses. The convergence of the routing intelligence directly into the switch forwarding logic results in an environment which can provide for extremely fast (sub-second) stateful convergence, which is of definite benefit to the IP multicast service model. As such, IP multicast environments can benefit from enhanced state, which in turn results in increased performance and scale.

End to end services transparency provides for a clear diagnostic environment that eliminates the complexities of protocol overlay models. This drastic simplification of the protocol architecture results in direct end to end visibility of IP multicast services for the first time.

So when someone asks “IP Multicast without PIM? No more RP’s?” You can respond with “With Shortest Path Bridging, of course!”

I would also urge you to follow the blog of my esteemed colleague Paul Unbehagen, Chair and Author of the IEEE 802.1aq “Shortest Path Bridging” Standard. You can find it at:

http://paul.unbehagen.net/

 

For more information please feel free to visit http://www.avaya.com/networking

Also please visit our VENA video on YouTube, which provides further detail and insight. You can find it at: http://www.youtube.com/watch?v=ZSbycaOvy5I

 

IPv6 Deployment Practices and Recommendations

June 7, 2010

Communications technologies are evolving rapidly. While slowed somewhat by economic circumstances, this evolution still moves forward at a dramatic pace. This is indicative of the fact that, while the ‘bubble’ of the 1990’s is past, society and business as a whole have arrived at the point where communications technologies and their evolution are a requirement for proper and timely interaction with the human environment.

This has a profound impact on a number of foundations upon which the premise of these technologies rests. One of the key issues is that of the Internet Protocol, commonly referred to simply as ‘IP’. The current widely accepted version of IP is version 4. The protocol, referred to as IPv4, has served as the foundation of the current Internet since its practical inception in the public arena. As the success of the Internet attests, IPv4 has performed its job well and has provided the evolutionary scope to adapt over the twenty years that have transpired. Like all technologies though, IPv4 is reaching the point where further evolution will become difficult and cumbersome, if not impossible. As a result, IPv6 was created as a next generation evolution of the IP protocol to address these issues.

Many critics cite the length of time that IPv6 has been in development. It is, after all, a project with over a ten year history in the standards process. However, when one considers the breadth and complexity of the standards involved, a certain maturity is conveyed that the industry can now leverage. The protocol has evolved significantly since the first proposals for its predecessor, IPng. Many, if not most, of the initial shortcomings and pitfalls have been addressed to the point where actual deployment is a very tractable proposition. Along this evolution several benefits have been added to the suite that directly benefit the network staff and end user populace. Some of these benefits are listed below. Note that this is not an exhaustive list.

  • Increased Addressing Space
  • Superior mobility
  • Enhanced end to end security
  • Better transparency for next generation multimedia applications & services

Recently, there has been quite a bit of renewed activity and excitement around IP version 6. The announcements by the United States Federal Government for IPv6 deployment by 2008 and the White House civilian agency mandate by 2012 have helped greatly to fuel this. Also, many if not most of the latest projects being implemented by providers in the Asia Pacific regions are calling for mandatory IPv6 support. Clearly the protocol’s time is coming. We are seeing the two vectors of maturity and demand meeting to result in market and industry readiness.

There is a cloud on this next generation horizon however. It is known as IPv4. From a practical context all existing networks are either based on, or in some way leverage, IPv4 communications. Clearly, if IPv6 is to succeed, it must do so in a phased approach that allows hybrid co-existence with IPv4. Fortunately, many in the standards community have put forth transition techniques and methodologies that allow for this co-existence. A key issue to consider in all of this is that the benefits of IPv6 are somewhat (sometimes severely) compromised by the use of these techniques. However, like all technologies, if usage requirements and deployment considerations are weighed prior to implementation, the proposition is realistic and valid.

Setting the Foundation

IPv6 has several issues and dependencies in common with IPv4. However, the differences in address format and methods of acquisition require modifications that need to be considered. Much of the hype in the industry focuses on support within the networking equipment. While this is of obvious importance, it is critical to realize that there are other aspects that need to be addressed to assure a successful deployment.

The first Block – DNS & DHCP Services

While IPv6 supports auto-configuration of addresses, it also allows for managed address services. DNS does not, from a technical standpoint, require DHCP, but the two are often offered in the same product suite.

When considering the new address format (128-bit, colon delimited hexadecimal), it is clear that it is not human friendly. A Domain Name System (DNS) infrastructure is needed for successful coexistence because of the prevalent use of names (rather than addresses) to refer to network resources. Upgrading the DNS infrastructure consists of populating the DNS servers with records to support IPv6 name-to-address and address-to-name resolution. After the addresses are obtained using a DNS name query, the sending node must select which addresses are used for communication. This is important to consider both from the perspective of the service (which address is offered as primary) and the application (which address is used). It is obviously important to consider how a dual addressing architecture will work with naming services. Again, the appropriate due diligence needs to be done by investigating product plans, but also in limited and isolated test bed environments, to assure predictable and stable behavior with the operating systems as well as the applications that are being looked at.
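As a small illustration of the name-to-address and address-to-name checks described above, the sketch below uses only Python’s standard socket library. The host name is a placeholder, and the lookups will only succeed where AAAA and PTR records actually exist in the test bed DNS.

    import socket

    host = "app.example.com"   # placeholder; substitute a dual-addressed host in the test bed

    # Forward resolution: ask the DNS for IPv6 (AAAA) records only.
    for family, _, _, _, sockaddr in socket.getaddrinfo(host, None, socket.AF_INET6):
        addr = sockaddr[0]
        print("AAAA:", addr)
        # Reverse resolution: confirm the address-to-name (PTR) record exists.
        name, _ = socket.getnameinfo((addr, 0), socket.NI_NAMEREQD)
        print("PTR :", name)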

As mentioned earlier, DHCP services are often offered in tandem with DNS services in many products. In instances where IPv6 DHCP services are not supported but DNS services are, it is important to verify that the DNS service will work with standard auto-configuration options.

The second Block – Operating Systems

Any operating system being considered for use in the IPv6 deployment should be investigated for compliance and tested so that the operations staff are familiar with any new processes or procedures that IPv6 will require. Tests should also occur between the operating systems and the DNS/DHCP services using simple network utilities such as ping and FTP to assure that all of the operating elements, including the operating systems, interoperate at the lowest common denominator of the common IP applications. A rough reachability check along these lines is sketched below.
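This kind of reachability matrix can be scripted with nothing more than the standard Python socket library; the host names and port below are placeholders for services known to be running in the test bed.

    import socket

    HOSTS = ["dns1.example.com", "app1.example.com"]   # placeholder test bed hosts
    PORT = 22                                          # any TCP service known to be up

    for host in HOSTS:
        for family, label in ((socket.AF_INET, "IPv4"), (socket.AF_INET6, "IPv6")):
            try:
                info = socket.getaddrinfo(host, PORT, family, socket.SOCK_STREAM)[0]
                with socket.socket(family, socket.SOCK_STREAM) as s:
                    s.settimeout(2.0)
                    s.connect(info[4])
                print(f"{host:24s} {label}: reachable")
            except OSError as exc:
                print(f"{host:24s} {label}: FAILED ({exc})")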

It is important to test behaviors of dual stack hosts (hosts that support both IPv4 and IPv6). Much of the industry supports a dual stack approach as being the most stable and tractable approach to IPv6 deployments. Later points in this article will illustrate why this is the case.

The third Block – Applications

Applications should be considered first to establish the scope of operating systems and the extent to which IPv6 connectivity needs to be offered. Detailed analysis and testing, however, should occur last, after the validation of network services and operating systems. The reason for this is that the applications are the most specific testing instances and strongly depend on the stable and consistent operation of the other two foundation blocks. It is also important to replicate the exact intended mode of usage for the application so that the networking support staff are aware of any particular issues around configuration and/or particular feature support. On that note, it is important to consider whether there are any features that do not work over IPv6 and what impact they will have on the intended mode of usage for the application. Finally, considerations need to be made for dual stack configurations and how precedence is set for which IP address to use.
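As a rough illustration of how an application can handle that precedence on a dual stack host, the sketch below asks the resolver for all address families and simply tries the candidates in the order the system’s precedence rules return them, falling back to the next candidate on failure. The host name in the commented call is a placeholder.

    import socket

    def connect_preferred(host, port):
        """Try each resolved address in the system's precedence order."""
        errors = []
        for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
                host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
            s = socket.socket(family, socktype, proto)
            s.settimeout(3.0)
            try:
                s.connect(sockaddr)
                return s                      # first working candidate wins
            except OSError as exc:
                s.close()
                errors.append(exc)
        raise OSError(f"no usable address for {host}: {errors}")

    # conn = connect_preferred("app.example.com", 443)   # placeholder host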

The fourth Block – Networking Equipment

Up to this point, all of the validation activity referred to can be performed on a ‘link local’ basis. As a result, a typical layer two Ethernet switch would suffice. A real world deployment requires quite a bit more, however. It is at this point where the networking hardware needs to be considered. It is important to note that many pieces of equipment, particularly layer two type devices, will forward IPv6 data. If explicit management via IPv6 is not a requirement, then these devices can be used in the transition plans provided they are used appropriately in the network design.

Other devices such as routers, layer three switches, firewalls and layer 4 through 7 devices will require significant upgrades and modification to meet requirements and perform effectively. Due diligence should be done with the network equipment provider to assure that requirements are met and that delivery timelines align with the project deployment schedule.

As noted previously in the other foundation blocks, dual stack support is highly recommended and will greatly ease transition difficulties, as will be shown later. With networking equipment things are a little more complex in that, in addition to meeting host system requirements for IPv6 communications of the managed element, the requirements of data forwarding, route computation and rules bases need to be considered. Again, it is important to consider any features that will not be supported in IPv6 and the impact that this will have on the deployment. The figure below illustrates an IPv6 functional stack for networking equipment.

Figure 1. IPv6 network element functional blocks

As shown above, there are many modifications that need to occur at various layers within a given device. The number of layers, as well as the specific functions implemented within each layer, is largely determined by the type of networking element in question. Simpler layer two devices are only required to provide dual host stack support, primarily for management purposes, while products like routers and firewalls will be much more complex. When looking at IPv6 support in equipment it makes sense to establish the role that the device performs in the network. This role based approach will best enable an accurate assessment of the real requirements and features that need to be supported, rather than industry or vendor hype.

The burden of legacy – Dual stack or translation?

The successful deployment of IPv6 will strongly depend on a solid plan for co-existence and interoperability with existing IPv4 environments. As covered earlier, the use of dual stack configurations wherever possible will greatly ease transition. Today the burden falls on any device supporting IPv6 to speak to IPv4 devices; as time moves on, however, the burden will shift to the IPv4 devices to speak to IPv6 devices. As we shall see, only a certain set of applications require dual stack down to the end point. Most client/server applications will work fine in a server-only dual stack environment that serves both IPv4 only and IPv6 only clients, as shown in the figure below.

Figure 2. A dual stack client server implementation

As shown above, both IPv4 and IPv6 client communities have access to the same application server, each served by their own native protocol. In the next figure, however, we see that there are some additional complexities that occur with certain applications and protocols, such as multimedia and SIP. In the illustration below we see that there are not only client/server dialogs but client to client dialogs as well. In this instance, at least one of the clients needs to support a dual stack configuration in order to establish the actual media exchange.

Figure 3. A peer to peer dual stack implementation

As shown above, with one end point supporting a dual stack configuration and the appropriate logic to determine protocol selection, end to end multimedia communications can occur. Note that this scenario will typically apply until IPv6 only devices become more prevalent over time.

There are many benefits to the dual stack approach. By analyzing applications and mandating dual stack usage, a very workable transition deployment can be attained. A minimal sketch of the dual stack service model is shown below.
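The sketch below, which assumes Python’s standard socket module and an arbitrary port number, shows how a single IPv6 listening socket with IPV6_V6ONLY cleared can serve IPv6 clients natively and IPv4 clients as IPv4-mapped addresses. Platform defaults for that option vary, which is why it is set explicitly.

    import socket

    srv = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    # Clear IPV6_V6ONLY so the IPv6 wildcard socket also accepts IPv4 clients
    # (they appear as ::ffff:a.b.c.d mapped addresses).
    srv.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("::", 8080))        # wildcard bind covers both protocol families
    srv.listen(5)

    while True:
        conn, peer = srv.accept()
        print("client address:", peer[0])   # IPv6 literal or an IPv4-mapped form
        conn.sendall(b"hello from the dual stack server\n")
        conn.close()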

There are arguments that address space, one of the primary benefits of IPv6, is drastically compromised by this approach. After all, by using dual stack you do not remove any IPv4 addresses; in fact, you are forced to add IPv4 addresses to accommodate an IPv6 deployment. The truth of this is directly related to the logic of the deployment approach. By understanding the nature of the applications and giving preference to the innovative (IPv6 only) population, these arguments can be mitigated. The reason for this is that you are only adding IPv6 addresses to existing IPv4 hosts that require communication with IPv6. If this happens to be the whole IPv4 population, so be it. There are plenty of IPv6 addresses to go around! As new hosts and devices are deployed they should preferentially be IPv6 only, or dual stack if required, but NOT IPv4 only.

An alternative to the dual stack approach is the use of intermediate gateway technologies to translate between IPv6 and IPv4 environments. This approach is known as NAT-PT. The diagram below illustrates a particular architecture for NAT-PT usage that will provide for the multimedia scenario used previously.

Figure 4. Translation Application Layer Gateway

In this approach the server supports a dual stack configuration and uses native protocols for the client/server dialogs to each end point. Each end point is single stack; one is IPv4, the other is IPv6. In order to establish end to end multimedia communications, an intermediate NAT-PT gateway function provides the translation between IPv4 and IPv6. There are many issues and caveats with this approach, which can be researched in the IETF records. As a result, there is work towards deprecating NAT-PT to experimental status. It should be noted that a recent draft revision has been submitted, so it is worth keeping on the radar.

Tunnel Vision

There has been quite a bit of activity around another set of transition methods known as tunneling. In a typical configuration, there are two IPv6 sites that require connectivity across an IPv4 network. Tunneling involves the encapsulation of the IPv6 data frames into IPv4 transport; all IPv6 traffic between the two sites traverses this IPv4 tunnel. It is a simple and elegant, but correspondingly limited, approach that provides co-existence, not necessarily interoperability, between IPv4 and IPv6. In order to achieve the latter we need to invoke one of the approaches (dual stack or NAT-PT) discussed earlier. Tunneling by itself only provides the ability to link IPv6 sites and networks over IPv4.

This is a very important point. Taken to its logical conclusion, it indicates that if the network deployment is appropriately engineered, the use of transition tunneling methods can be greatly reduced and controlled, if not eliminated. Before we take this course in logic, however, it is important to consider the technical aspects of tunneling and why it is something that needs to be thought out prior to use.

The high level use of tunneling is reviewed in RFC 2893 for those interested in further details. Basically there are two types of tunnels. The first is configured tunnels: IPv6-in-IPv4 tunnels that are set up manually on a point to point basis. Because of this, configured tunnels are typically used in router to router scenarios. The second type is automatic tunnels. Automatic tunnels use various methods to derive IPv4/IPv6 address mappings on a dynamic basis in order to support automatic tunnel setup and operation. As a result, automatic tunnels can be used not only for router to router scenarios but for host to router or even host to host tunneling as well. This allows us to build a high level summary table of the major accepted tunneling methods.

Method                  Usage                                Risk

Configured tunnels      Router to router                     Low

Automatic (6 to 4)      Router to router / Host to router    Medium

Automatic (ISATAP)      Host to host                         High

Without going into deep technical detail on each automatic tunneling method’s behavior, we can assume that there is some sort of promiscuous behavior that will activate the tunneling process on recognition of a particular pattern (IP protocol 41, IPv6 in IPv4). This promiscuous behavior is what warrants the increased security risk associated with the automatic methods. RFC 3964 goes into detail on the security related issues around automatic tunneling methods. At a high level there is the potential for denial of service attacks on the tunnel routers as well as the ability to spoof addresses into the tunnel for an integrity breach. The document provides recommendations on risk reduction practices, but they are difficult to implement and maintain properly.
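To make the automatic mapping concrete, here is a small sketch, using only Python’s standard ipaddress module, of how a 6 to 4 site prefix is derived from a router’s public IPv4 address per RFC 3056; the example uses a documentation address. It is this fixed mapping, together with the need to accept any IP protocol 41 traffic arriving at the tunnel endpoint, that drives the risk ratings in the table above.

    import ipaddress

    def derive_6to4_prefix(public_v4):
        """Return the RFC 3056 6to4 /48 site prefix for a public IPv4 address."""
        v4 = int(ipaddress.IPv4Address(public_v4))
        # 2002::/16 followed by the 32-bit IPv4 address forms the /48 site prefix.
        return ipaddress.IPv6Network((0x2002 << 112 | v4 << 80, 48))

    print(derive_6to4_prefix("192.0.2.1"))   # 2002:c000:201::/48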

An effective workaround to these security issues is to use IPSec VPN branch routing over IPv4 to establish secure encrypted site to site connectivity and then run the automatic tunneling method inside the IPv4 IPSec tunnel.

The figure below shows a scenario where two 6 to 4 routers have a tunnel set up to establish site to site connectivity inside an IPv4 IPSec VPN tunnel. With this approach, any IP traffic will have site to site connectivity via the VPN branch office tunnels. The IPv6 hosts would have access to one another via the 6 to 4 tunnels. Any promiscuous activity required by 6 to 4 can now occur with relative assurance of integrity and security. The drawback to this approach is that additional features or devices are required to complete the solution.

Figure 5. Using Automatic Tunneling inside IPv4 IPSec VPN

The primary reason for using transition tunnel methods is to transport IPv6 data over IPv4 networks. In essence, the approach ties together islands of IPv6 across IPv4 and allows for connectivity to the IPv6 network. If we follow this logic, then the use of transition tunneling can be reduced or even eliminated by getting direct connectivity to the IPv6 Internet via at least one IPv6 enabled router in a given organization’s network. The figures below illustrate the difference between the two approaches. In the top example, the organization does not have direct access to the IPv6 Internet; as a result, transition tunneling must be used to attain connectivity. In the lower example, the organization has a router that is directly attached to the IPv6 Internet, so there is no need to invoke transition tunneling. By using layer two technologies such as virtual LAN’s, IPv6 hosts can acquire connectivity to the dual stack native IPv6 router.


Figure 6. Using transition tunneling to extend IPv6 connectivity

Figure 7. Using L2 VLAN’s to extend IPv6 connectivity


Within the organization – Use what you already have

As we established, by providing direct connectivity to the IPv6 Internet the use of transition tunneling can be eliminated on the public side. Within the organization, prior to implementing transition tunneling, it makes sense to review the methods that may already exist in the network to attain connectivity.

All of the issues in dealing with IPv6 transition revolve around the use of layer 3 approaches. By using layer 2 networking technologies, transparent transport can be provided. There are multiple technologies that can be used for this approach. Some of these are listed below:

  • Optical Ethernet
  • Ethernet Virtual LAN’s
  • ATM
  • Frame Relay

As listed above, there are many layer two technologies that can be used to extend IPv6 connectivity within an organization’s network.

Virtual LAN’s can be used to extend link local connectivity to IPv6 enabled routers in a campus environment. The data will traverse the IPv4 network without the complexities of layer 3 transition methods. For the regional and wide area, optical technologies can extend the L2 virtual LAN’s across significant distances and geographies, again with the goal of reaching an IPv6 enabled router. Similarly, traditional L2 WAN technologies such as ATM and frame relay can extend IPv6 local links across circuit switched topologies. As the diagram above illustrates, by placing the IPv6 dual stack routers strategically within the network and interconnecting them with L2 networking topologies, an IPv6 deployment can be implemented that co-exists with IPv4 without any transition tunnel or NAT-PT methods.

The catch is, of course, that these layer two paths cannot traverse any IPv4 only routers or layer 3 switches. As long as this topology rule is adhered to, this simplified approach is totally feasible. By incorporating dual stack routers, both IPv4 and IPv6 virtual LAN boundaries can effectively be terminated and in turn propagated further with virtual LAN’s or other layer two technologies on the other side of the routed element. A further evolution on this is to use policy based virtual LAN’s that determine membership according to the IP version of the data received on a given edge port; a simple classification sketch follows the figure below. As the figure below illustrates, dual stack hosts will have access to all required resources in both protocol environments.

Figure 8. Using Policy Based VLAN’s to support dual stack hosts
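As a toy illustration of the policy based classification just described (the VLAN IDs are arbitrary examples, and real switches make this decision in hardware), membership can be decided simply by inspecting the Ethernet type field: 0x0800 for IPv4 and 0x86DD for IPv6.

    ETHERTYPE_IPV4 = 0x0800
    ETHERTYPE_IPV6 = 0x86DD

    POLICY = {ETHERTYPE_IPV4: 110,   # hypothetical IPv4 VLAN ID
              ETHERTYPE_IPV6: 120}   # hypothetical IPv6 VLAN ID

    def classify(frame):
        """Return the VLAN ID for an untagged Ethernet frame, or None if no policy matches."""
        ethertype = int.from_bytes(frame[12:14], "big")
        return POLICY.get(ethertype)

    # A stub frame: 12 bytes of MAC addresses followed by the IPv6 EtherType.
    stub = bytes(12) + (0x86DD).to_bytes(2, "big")
    print(classify(stub))   # 120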

In essence, where dual stack capability is provided end to end, layer three transition methods can be avoided altogether. While it is unlikely that this can be made to occur in most networks, such logic can greatly reduce any layer three transition tunnel usage. By taking additional considerations regarding application network behaviors and characteristics, as noted in the beginning of this article, the use of intermediate protocol and address translation methods like NAT-PT can also be mitigated.

In conclusion

This article was written to clarify deployment issues for IPv6 with a particular focus on interoperability and co-existence with IPv4. A step by step summary of the deployment considerations can now be summarized as follows:

1). Build the foundation

There are four basic foundation blocks that need to be established prior to deployment consideration. Details on each particular foundation block are provided. In summary they are:

1). DNS/DHCP services

2). Network Operating Systems

3). Applications

4). Network Equipment

As pointed out several times, plan for dual stack support wherever possible in all of the foundation blocks. Such an approach will greatly ease the transition issues around deployment. Ongoing work on multiple routing and forwarding planes, such as OSPF-MT (http://www.ietf.org/internet-drafts/draft-ietf-ospf-mt-04.txt) and Multi-protocol BGP (MBGP), may have beneficial and simplifying merits: it can interconnect dual stack routing elements, identify them explicitly, and build forwarding overlays or route policies based on the traffic type (IPv4 vs. IPv6). While the OSPF-MT work is in the preliminary draft phases, it has very strong merits in that, in combination with MBGP, it can effectively displace MPLS type approaches to accomplish the same goal. Again, no transition methods would be required within the OSPF-MT boundary as long as overlay routes exist between the dual stack routing elements.

2). Establish connectivity

Once the foundations have been provided for, the next step is to establish how connectivity will be made between different sites. Assuming that dual stack routers are available, it makes sense to closely analyze campus topologies and establish how connectivity can be provided in concert with layer two networking technologies. Only once all available methods have been exhausted, and it is clear that one is dealing with an IPv6 ‘island’, should one look at using one of the IPv6 transition tunneling methods, with configured tunneling being the most secure and conservative approach and the most appropriate for this type of site to site usage. Host to router tunneling may have valid usage in remote access VPN applications, particularly where local Internet providers do not offer IPv6 networking services. Host to host tunneling should be used only in initial test bed or pilot environments and, because of manageability and scaling issues, is not recommended for general practice.

To connect sites across a wide area network, layer two circuit switched technologies such as frame relay and ATM can extend connectivity between the dual stack enabled sites. In some next generation wide area deployments, layer two virtual LAN’s can be extended across RPR optical cores to accomplish the end to end connectivity requirements. Again, only after all other options have been exhausted should the use of IPv6 transition tunneling methods be entertained.

At this point, a dual stack native mode deployment has been achieved with only minimal use of tunneling methods. It is only at this point that the use of any NAT-PT functions should be entertained, to accommodate any applications that do not comply with the deployment. It is strongly urged that such an approach be used in a very limited form and be relatively temporary in the overall deployment. Timelines should be established to move away from the temporary usage by incorporating a dual stack native approach as soon as feasible.

3). Test, test, test

As noted at several points throughout this article, testing is critical to deployment success. The reason for this is that requirements are layered and interdependent. Consequently, it is important to validate all embodiments of an implementation. Considerations need to be made according to node type, operating system and application, as well as any variations required for legacy components. It is like the great law of Murphy: it is the implementation that you do not test that will be the one to have problems.