Establishing a Confidential Service Boundary with Avaya’s SDN Fx

June 10, 2016



Security is a global requirement. It is also global in the fashion in which it needs to be addressed. But the truth is, regardless of the vertical, the basic components of a security infrastructure do not change. There are firewalls, intrusion detection systems, encryption, networking policies and session border controllers for real time communications. These components also plug together in rather standard fashions or service chains that look largely the same regardless of the vertical or vendor in question. Yes, there are some differences but by and large these modifications are minor.

So the question begs: why is security so difficult? As it turns out, it is not really the complexity of the technology components themselves, although they certainly have their share of it. The real challenge is deciding exactly what to protect, and here each vertical will be drastically different. Fortunately, the methods for identifying confidential data or critical control systems are also rather consistent, even though the data and applications being protected may vary greatly.

In order for micro-segmentation as a security strategy to succeed, you have to know where the data you need to protect resides. You also need to know how it flows through your organization, which systems are involved and which ones are not. If this information is not readily available, it needs to be created by data discovery techniques and then validated as factual.

This article is intended to provide a series of guideposts on how to go about establishing a confidential footprint for such networks of systems. As we move forward into the new era of the Internet of Things and the advent of networked critical infrastructure it is more important than ever before to have at least a basic understanding of the methods involved.

Data Discovery

Obviously the first step in establishing a confidential footprint is identifying the systems, and the data they exchange, that need to be protected. Sometimes this can be a rather obvious thing. A good example is credit card data and PCI. The data and the systems involved in the interchange are fairly well understood, and the pattern of movement or flow of data is rather consistent. Other examples might be more difficult to determine. A good example of this is the protection of intellectual property. Who is to say what classifies as intellectual property? Who is to establish a risk value for a given piece of IPR? In many instances this type of information may be in disparate locations and stored with various methods and probably various levels of security. If you do not have a quantified idea of the volume and location of such data, you will probably not have a proper handle on the issue.

Data Discovery is a set of techniques to establish a confidential data footprint. This is the first established phase of identifying exactly what you are trying to protect. There are many products on the market that can perform this function. There are also consulting firms that can be hired to perform a data inventory. Fortunately, this is something that can be handled internally if you have the right individuals with proper domain expertise. As an example, if you are performing data discovery on oil and gas geologic data, it is best to have a geologist involved with the proper background in the oil and gas vertical. Why? Because they would have the best understanding of what data is critical, confidential or superfluous and inconsequential.

Data Discovery is also critical in establishing a secure IoT deployment. Sensors may be generating data that is critical to the feedback actuation of programmable logic controllers. The PLCs themselves might also generate information on their own performance. It is important to understand that much of process automation has to do with closed loop feedback mechanisms. These feedback loops are critical for the proper functioning of the automated IoT framework. An individual who could intercept or modify the information within this closed loop environment could adversely affect the performance of the system, even to the point of making it do exactly the opposite of what was intended.

As pointed out earlier, there are fortunately some well understood methods for establishing a confidential service boundary. It all starts with a simple checklist.

Establishing a Confidential Data Footprint – IoT Security Checklist for Data

1). What is creating the data?

2). What is the method for transmission?

3). What is receiving the data?

4). How/where is it stored?

5). What systems are using the data?

6). What are they using it for?

7). Do the systems generate ‘emergent’ data?

8). If yes, then is that data sent, stored, or used?

9). If yes, then is that data confidential or critical?

10). If so, then go to step 1.

No, step 10 is not a sick joke. When dealing with creating secure footprints for IoT frameworks it is important to realize that your data discovery will often loop back on itself. With closed loop system feedback this is the nature of the beast. Also be prepared to do this several times, as these feedback loops can be relatively complex in fully automated systems environments. So it comes down to some basic detective work. Let’s grab our magnifier and get going. But before we begin, let’s take a closer look at each step in the discovery process.
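
As a rough illustration of how the checklist loops back on itself, here is a minimal sketch of an iterative discovery pass. The system names, fields and the `inspect` helper are invented for illustration only; a real inventory would come from interviews, flow captures and domain experts.

```python
# Minimal sketch of an iterative data-discovery pass (checklist steps 1-10).
# All names and fields here are hypothetical.

def discover_confidential_footprint(seed_systems, inspect):
    """Walk the data flow starting from known producers of confidential data.

    `inspect(system)` is assumed to return a dict describing what the system
    sends, stores and generates (steps 1-9 of the checklist).
    """
    footprint = set()          # systems confirmed to handle confidential data
    to_visit = list(seed_systems)

    while to_visit:            # step 10: loop until no new systems appear
        system = to_visit.pop()
        if system in footprint:
            continue           # already examined; feedback loops terminate here
        info = inspect(system)
        if not (info["confidential"] or info["critical"]):
            continue           # data is superfluous to the footprint
        footprint.add(system)
        # Steps 2-3: follow every transmission to its receivers.
        to_visit.extend(info["receivers"])
        # Steps 7-9: emergent data restarts the checklist at its consumers.
        if info["emergent_data"]:
            to_visit.extend(info["emergent_consumers"])
    return footprint
```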

What is sending the data?

This is the start of the confidential data chain. Usually it will be a sensor of some type or a controller that has a sensing function embedded in it. It could also be something as simple as a point of sale location for credit card data. Another possible case would be medical equipment relaying both critical and confidential data. This is where domain expertise is a key attribute that you need on your team. These individuals will understand what starts the information service chain from an application services perspective. This information will be crucial in establishing the start of the ‘cookie crumb’ trail.

What is the method of transmission?

Obviously if something is creating data there are three choices. First, the device may store the data. Second, the device may use the data to actuate an action or control. Third, the device may transmit the data. Sometimes a device will do all three. Using video as an example, a wildlife camera off in the woods will usually store the data that it generates until a wildlife manager or hunter comes to access the content, whereas a video surveillance camera will usually transmit the data to a server, a digital video recorder or a human viewer in real time. Some video surveillance cameras may also store recent clips or even feed back into the physical security system to lock down an entry or exit zone. When a device transmits the information, it is important to establish the methods used. Is it IP or another protocol? Is it unicast or multicast? Is it UDP (connectionless) or is it TCP (connection oriented)? Is the data encrypted during transit? If so, how? If it is encrypted, is there a proper chain of trust established and validated? In short, if the information moves out of the device and you have deemed that data to be confidential or critical, then it is important to quantify the transmission paths and the nature of, or lack of, security applied to them.
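
One way to force those questions to be answered consistently is to record each observed flow in a fixed structure. The sketch below is a hypothetical flow-inventory record; the field names and the example camera are illustrative, not taken from any particular tool.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FlowRecord:
    """One confidential/critical data flow, as characterized in step 2."""
    source: str                  # what creates or sends the data
    destination: str             # what receives it
    protocol: str                # "IP", or another protocol family
    transport: str               # "TCP" (connection oriented) or "UDP" (connectionless)
    cast: str                    # "unicast" or "multicast"
    encrypted_in_transit: bool
    cipher: Optional[str] = None                   # how it is encrypted, if it is
    trust_chain_validated: Optional[bool] = None   # is the chain of trust verified?

# Example entry for a hypothetical surveillance camera feed:
camera_feed = FlowRecord(
    source="surveillance-cam-12", destination="video-recorder-01",
    protocol="IP", transport="UDP", cast="multicast",
    encrypted_in_transit=False)
```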

What is receiving the data?

Obviously if the first system element is transmitting data then there has to be a system or set of systems receiving it. Again, this may be fairly simple and linear, such as the movement of credit card data from a point of sale system to an application server in the data center. In other instances, particularly in IoT frameworks, the information flow will be convoluted and loop back on itself to facilitate the closed loop communication required for systems automation. In other words, as you extend your discovery you will begin to discern characteristics or a ‘signature’ to the data footprint. Establishing the transmitting and receiving systems is a critical part of this process. A bit later in the paper we will take a look at a simple linear data flow and compare it to a simple closed loop data flow in order to clarify this precept.

Is the data stored? How is it stored?

When folks think about storage, they typically think about hard drives, solid state storage or storage area networks. There are considerations to be made here. Is the storage a structured database or a simple NAS? Perhaps it might be something based on Google File System (GFS) or Hadoop for data analytics. But the reality is that data storage is much broader than that. Any device that holds data in memory is in actuality storing it. Sometimes the data may be transient. In other words, it might be a numerical data point that represents an intermediate mathematical step for an end calculation. Once the calculation is completed the data is no longer needed and the memory space is flushed. But is it really flushed? As an example, some earlier vendor applications for credit card information did not properly flush the system of PINs or CVC values from prior transactions. If transient data is being created, it needs to be determined whether that data is critical or confidential and should be deleted upon termination of the session, or, if stored, stored with the appropriate security considerations. In comparison, the transient numerical value for a mathematical function may not be confidential, because outside of its context that value would be meaningless. But keep in mind that this might not always be the case. Only someone with domain expertise will know. Are you starting to see some common threads?
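
To make the transient-data point concrete, here is a minimal sketch of treating a short-lived secret as something to be explicitly flushed. It illustrates the principle only and is not PCI DSS compliant code; the `check` callback is a hypothetical stand-in for whatever verification service would actually be used.

```python
# Minimal sketch: use a transient secret, then overwrite it so it does not
# linger in memory after the session ends.

def verify_pin(entered_pin: bytearray, check) -> bool:
    """Pass the PIN to the (hypothetical) verifier, then flush the buffer."""
    try:
        return check(bytes(entered_pin))   # the only place the value is used
    finally:
        for i in range(len(entered_pin)):  # explicit flush of the buffer
            entered_pin[i] = 0
        # Note: a managed runtime may still hold copies elsewhere (temporary
        # objects, swap, logs); flushing is necessary but not sufficient.
```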

What systems are using the data and what are they using it for?

Again, this may sound like an obvious question, but there are subtle issues and most probably assumptions that need to be validated and vetted. A good example might be data science and analytics. As devices generate data, that data needs to be analyzed for traits and trends. In the case of credit card data it might be analysis for fraudulent transactions. In the case of IoT for automated production it might be the use of sensor data to tune and actuate controllers, with an analytic process in the middle to tease out pertinent metrics for systems optimization. In the former example, it is an extension of a linear data flow; in the latter, the analytics process is embedded into the closed loopback data flow. Knowing these relationships allows one to establish the proposed ‘limits’ of the data footprint. Systems beyond this footprint simply have no need to access the data and consequently no access to it should be provided.

Do those systems generate ‘emergent’ data?

I get occasional strange looks when I use this term. Emergent data is data that simply did not exist prior to the start of the compute/data flow. Examples of emergent data are transient numerical values that are used for internal computation within a particular algorithmic process. Others are intermediate data metrics that provide actual input into a closed loop behavior pattern. In the area of data analysis this is referred to as ‘shuffle’. Shuffle is the movement of data across the top of rack environment in an east/west fashion to facilitate the mathematical computation that often accompanies data science analytics. Any of the resultant data from the analysis process is ‘new’ or ‘emergent’ data.

If yes, is that data sent, stored or used?

Unless you have a very poorly designed solution set, any system that generates emergent data will do something with it (one of the three options previously mentioned above). If you find that this is not the case, then the data is superfluous and the process could possibly be eliminated from the end to end data flow. So let’s assume that the system in question will do at least one of the three. In the case of a programmable logic controller, it may use the data to more finely tune its integral and atomic process. The same system (or its manager) may store at least a certain span of data for historical context and systems logs. In the case of tuning, the data may be generated by an intermediate analytics process that arrives at more optimal settings for the controller’s actuation and control. So remember, these data metrics could come from anywhere in the looped feedback system.

If yes, then is that data confidential or critical?

If your answer to this question is yes, then the whole process of investigation needs to begin again until all possible avenues of inter-system communications are exhausted and validated. So in reality we are stepping into another closed loop of systems interaction and information flow within the confidential footprint. Logic dictates that if all of the data up until this point is confidential or critical then it is highly likely that this loop will be as well. It is highly unlikely that one would go through a complex loop process with confidential data and say that they have no security concerns on the emergent data or actions that result out of the system. Typically, if things start as confidential and critical, they usually – but not always – will end up as such within an end to end data flow. Particularly if it is something as critical as the meaning of the universe which we all know is ‘42’.


Linear versus closed loop data flows

First, let’s remove the argument of semantics. All data flows that are acknowledged are closed loops. A very good example is TCP: there are acknowledgements to transmissions, which is a closed loop in its proper definition. But what we mean here is a bit broader. Here we are talking about the general aspects of the confidential data flow, not the protocol mechanics used to move the data; that was addressed already in step two. Again, a very good example of a linear confidential data flow is PCI, whereas automation frameworks provide a good example of looped confidential data flows.

Linear Data Flows

Let’s take a moment and look at a standard data flow for PCI. First you have the start of the confidential data chain, which is obviously the point of sale system. From the point of sale system the data is either encrypted or, more recently, tokenized into a transaction identifier by the credit card firm in question. This tokenization provides yet another degree of abstraction to avoid the need to transmit actual credit card data. From there the data flows up to the data center demarcation, where the flow is inspected and validated by firewalls and intrusion detection systems, and is then handed to the data center environment, where a server running an appropriately designed PCI DSS application handles the card and transaction data. In most instances this is where it stops. From there the data is uploaded to the bank by a dedicated and encrypted services channel. Most credit card merchants do not store card holder data. As a matter of fact, PCI v3.0 advises against it unless there are strong warrants for such practice, because there are extended practices required to protect stored card holder data which further complicate compliance. Again, an example might be analyzing for fraudulent practice. When this is the case, the data analytics sandbox needs to be considered as an extension of the actual PCI card holder data domain. But even then, it is a linear extension to the data flow. Any feedback is likely to end up in a report meant for human consumption and follow up. In the case of an actual credit card vendor, however, this may be different. There may be the ability and need to automatically disable a card based on the recognition of fraudulent behavior. In that instance the data analytics is actually a closed loop data flow at the end of the linear data flow. The closing of the loop is the analytics system flagging to the card management system that the card in question be disabled.
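
Since tokenization carries much of the weight in that flow, here is a toy sketch of the idea: downstream systems only ever see a surrogate value, never the real card number. This is purely illustrative and not a compliant implementation; the in-memory dictionary stands in for what would really be a hardened, audited vault service.

```python
import secrets

class ToyTokenVault:
    """Toy illustration of tokenization; not a PCI DSS compliant design."""
    def __init__(self):
        self._vault = {}                 # token -> PAN; real vaults are hardened services

    def tokenize(self, pan: str) -> str:
        token = "tok_" + secrets.token_hex(8)   # random surrogate, no relation to the PAN
        self._vault[token] = pan
        return token

    def detokenize(self, token: str) -> str:
        return self._vault[token]        # only the card processor's side may do this

vault = ToyTokenVault()
t = vault.tokenize("4111111111111111")
# 't' can flow through the merchant network and the analytics sandbox;
# the real card number never leaves the vault boundary.
```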

Looped Data Flows

In the case of a true closed loop IoT framework, a good simplified example is a simple three loop public water distribution system. The first loop is created by a flow sensor that measures the gallons per second flowing into the tank. The second loop is created by a flow sensor that measures the gallons per second flowing out of the tank. Obviously the two loops feed back on one another and actuate pumps and drain flow valves to maintain a match to the overall flow of the system, with a slight favor to the tank filling loop. After all, it’s not just a water distribution system but a water storage system as well. In ideal working situations, as the tank reaches the full point the ingress sensor feeds back to reduce the speed of, and even shut down, the pump. There is also a third loop involved. This is a failsafe that will actuate a ‘pop off’ valve in the case that a mismatch develops due to systems failure (the failure of one of the drain valves, for instance). Once the fill level of the tank or the tank’s pressure reaches a pre-established level, the pop off valve is actuated, thereby relieving the system of additional pressure that could cause further damage and even complete system failure. It is obviously critical for the three loops to have continuous and stable communications. These data paths also have to be secure, as anyone who could gain access into the network could mount a denial of service attack on one of the feedback loops. Additionally, if actual systems access is obtained then the rules and policies could be modified with horrific results. A good example is that of a public employee a few years ago who was laid off and consequently gained access and modified certain rules in the metro sewer management system. The attack resulted in sewage backups that went on for months until the attack and malicious modifications were recognized and addressed. This brings us now to the aspect of systems access and control.
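
To make the three loops concrete, the toy simulation below models the fill loop, the drain loop and the failsafe pop-off valve described above. The thresholds and flow rates are invented for illustration; the point is simply that each loop acts only on its own sensor.

```python
# Toy simulation of the three-loop tank example: fill loop, drain loop, failsafe.
# All numbers are arbitrary.

TANK_CAPACITY = 1000.0      # gallons
POP_OFF_LEVEL = 980.0       # failsafe threshold, established prior

def step(level, pump_on, drain_rate):
    fill_rate = 8.0 if pump_on else 0.0          # loop 1: ingress flow sensor + pump
    level += fill_rate - drain_rate              # loop 2: egress flow sensor + valves

    # Loop 1 feedback: slow/stop the pump as the tank approaches full.
    pump_on = level < 0.9 * TANK_CAPACITY

    # Loop 3 failsafe: pop-off valve actuates on a mismatch (e.g. a stuck drain valve).
    if level >= POP_OFF_LEVEL:
        level -= 25.0                            # relieve pressure to prevent damage
    return max(level, 0.0), pump_on

level, pump_on = 500.0, True
for _ in range(200):
    level, pump_on = step(level, pump_on, drain_rate=5.0)
print(f"steady-state level approx. {level:.0f} gallons")
```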


But you’re not done yet…

You might have noticed that certain confidential data may be required to leave your administrative boundary. This could be anything from uploading credit card transactions to a bank, to sharing confidential or classified information between agencies for law enforcement or homeland defense. In either case this classifies as an extension to the confidential data boundary and needs to be properly scrutinized as a part of it. But the question is how?

This tends to be one of the biggest challenges in establishing control of your data. When you give it to someone else, how do you know that it is being treated with due diligence and is not being stored or transferred in a non-secure fashion, or worse yet being sold for revenue? Well, fortunately there are things that you can do to assure that ‘partners’ are using proper security enforcement practices.

1). A contract

The first obvious thing is to get some sort of assurance contract put in place that holds the partner to certain practices in the handling of your data. Ask your partner to provide you with documentation as to how those practices are enforced and what technologies are in place for assurance. It might also be a good idea to request a visit to the partner’s facilities to meet directly with staff and tour the site in question.

2). Post Contract

Once the contract is signed and you begin doing business, it is always wise to do a regular check on your partner to ensure that there has been no ‘float’ between what is assumed in the contract and what is reality. Short of the onerous requirement of a full scale security audit (and note that there may be some instances where that may very well be required), there are some things that you can do to ensure the integrity and security of your data. It is probably a good idea to establish regular or semi-regular meetings with your partner to review the service that they provide (i.e. transfer, storage, or compute) and its adherence to the initial contract agreement. In some instances it might even warrant setting up direct site visits in an ad hoc fashion so that there is little or no notification. This will provide better assurance of the proper observance of ‘day to day’ practice. Finally, be sure to have a procedure in place to address any infractions of the agreement as well as contingency plans on alternative tactical methods to provide assurance.


Systems and Control – Access logic flow

So now that we have established a proper scope for the confidential or critical data footprint, what about the systems? The relationship between data and systems is very strongly analogous to musculature and skeletal structure in animals. In animals there is a very strong synergy between muscle structure and skeletal processes. Simply, muscles only attach to skeletal processes and skeletal processes do not develop in areas where muscles do not attach. You can think of the data as the muscles and the systems that use or generate the data as the processes.

This should also have become evident in the data discovery section above. Identifying the participating systems is a key point of the discovery process. This gives you a pre-defined list of system elements involved in the confidential footprint. But it is not always just a simple one to one assumption. The confidential footprint may be encompassed by a single L3 VSN, but it may not. As a matter of fact, in IoT closed loop frameworks this most probably will not be the case. These frameworks will often require tiered L2 VSNs to keep certain data loops from ‘seeing’ other data loops. A very good example of this is production automation frameworks, where there may be a higher level Flow Management VSN and then, tiered ‘below’ it, several automation managers within smaller dedicated VSNs that communicate with the higher level management environment. At the lowest level you would have very small VSNs or, in some instances, dedicated ports to the robotics drive. Obviously it is of key importance to make sure that the systems are authenticated and authorized to be placed into the proper L2 VSN within the overall automation hierarchy. Again, someone with systems and domain experience will be required to provide this type of information.

Below is a higher level logic flow diagram of systems and access control within SDN Fx. Take a quick look at the diagram and we will touch on each point in the logic flow in further detail.


Figure 1. SDN Fx Systems & Access Control

There are a few things to note in the diagram above. First, in the earlier stages of classifying a device or system there is a wide variety of potential methods available, which the process winnows down to a single method by which validation and access occur. It is also important to point out that all of these methods could be used concurrently within a given Fabric Connect network. It is best, however, to be consistent in the methods that you use to access the confidential data footprint and the corresponding Stealth environment that will eventually encompass it. Let’s take a closer look at the overall logic flow.

Device Classification

When a device first comes on line in a network it is a link state on the port and a MAC address. There is generally no quantified idea of what the system is unless the environment is manually provisioned and record keeping scrupulously maintained. This is not a real world proposition so there is the need to classify the device, its nature and its capabilities. We see that there are two main initial paths. Is it a user device, like a PC or a tablet? Or is it just a device? Keep in mind that this could still be a fairly wide array of potential types. It could be a server, or it could be a switch or WLAN access point. It could also be a sensor or controller such as a video surveillance camera.

User Device Access

This is a fairly well understood paradigm. For details, please reference the many TCGs and documents that exist on Avaya’s Identity Engines (IDE) and its operation. There is no need to recreate them here. At a high level, IDE can provide for varying degrees and types of authentication. As an example, normal user access might be based on a simple password or token, but other more sensitive types of access might require stronger authentication such as RSA. In addition, there may be guest users that are allowed temporary access to guest portal type services.

Auto Attach Device Access

Auto-attach (IEEE 802.1Qcj), known in Avaya parlance as Fabric Attach, supports a secure LLDP signaling dialog between the edge device running the Fabric Attach (auto attach) client and the Fabric Attach proxy or server, depending upon topology and configuration. IDE may or may not be involved in the Fabric Attach process. For a device that supports auto attach there are two main modes of operation. The first is the pre-provisioning of VLAN/I-SID relationships on the edge device in question; IDE can be used to validate that the particular device warrants access to the requested service. There is also a NULL mode in which the device does not present a VLAN/I-SID combination request but instead lets IDE handle all or part of the decision (i.e. Null/Null or VLAN/Null). This might be the mode that a video surveillance camera or sensor system supporting auto attach would use. There are also some enhanced security methods used within the FA signaling that significantly mitigate the possibility of MAC spoofing and provide for security of the signaling data flows.


802.1X Device Access

Obviously 802.1X is used in many instances of user device access. It can also be used for non-user devices as well; a very good example again is video surveillance cameras that support it. 802.1X is based on three major elements: supplicants (those wishing to gain access), authenticators (those providing the access, such as an edge switch) and an authentication server, which for our purposes would be IDE. From the supplicant to the authenticator, the Extensible Authentication Protocol or EAP (or one of its variants) is used. The authenticator and the authentication server support a RADIUS request/challenge dialog on the back end. Once the device is authenticated, it is then authorized and provisioned into whatever network service is dictated by IDE, whether stealth and confidential or otherwise.

MAC Authentication

If we arrive at this point in the logic flow, we know that it is a non-user device and that it does not support auto attach or 802.1X. At this point the only method left is simple MAC authentication. Note that this box is highlighted in red due to concerns about valid access security, particularly to the confidential network. MAC authentication can be spoofed by fairly simple methods. Consequently, it is generally not recommended as an access method into secure networks.

Null Access

This is actually the starting point in the logic flow as well as a termination. Every device that attaches to the edge when using IDE gets access for authentication alone. If the loop fails (whether FA or 802.1X), the network state reverts to this mode. There is no network access provided, but there is the ability to address possible configuration issues. Once those are addressed, the authentication loop would again proceed, with access granted as a result. On the other hand, if this point in the logic flow is reached because nothing else is supported or provisioned, then manual configuration is the last viable option.

Manual Provisioning

While this is certainly a valid method for providing access, it is generally not recommended. Even if the environment is accurately documented and the record keeping is scrupulously maintained, there is still the risk of exposure. This is because VLANs are statically provisioned at the service edge. There is no inspection and no device authentication. Anyone could plug into the edge port and, if DHCP is configured on the VLAN, they are on the network and no one is the wiser. Compare this with the use of IDE in tandem with Fabric Connect, where someone who unplugs a system and then plugs their own system in to try to gain access will obviously fail. As a result this box is shown in red as well; it is not a recommended method for stealth network access.
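
The decision logic in Figure 1 and the sections above can be summarized as a simple cascade. The sketch below is a hypothetical rendering of that flow; helper names such as `authenticate_8021x` and `validate_fabric_attach` stand in for the real IDE, Fabric Attach and 802.1X machinery and are not actual APIs.

```python
# Hypothetical sketch of the device access logic flow (Figure 1).

def classify_and_admit(device, ide):
    """Return the network service a device should be placed into."""
    if device.is_user_device:
        # User devices: password/token, or stronger (e.g. RSA) for sensitive services.
        return ide.authenticate_user(device)

    if device.supports_fabric_attach:
        # Auto attach: device may request a VLAN/I-SID, or present NULL and let IDE decide.
        return ide.validate_fabric_attach(device.requested_vlan_isid)

    if device.supports_8021x:
        # Supplicant <-> authenticator via EAP; authenticator <-> IDE via RADIUS.
        return ide.authenticate_8021x(device)

    # Last resort: MAC authentication. Easily spoofed, so not recommended
    # for admission into the confidential (stealth) footprint.
    return ide.mac_authenticate(device)

# If every step fails, the port stays in null access: no network service,
# but enough connectivity to retry authentication once issues are fixed.
```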


How do I design the Virtual Service Networks required?

Up until now we have been focusing on the abstract notions of data flow and footprint. At some point someone has to sit down and design how the VSNs will be implemented and what relationships, if any, exist between them. At this point, if you have done due diligence in the data discovery process that was outlined earlier, you should have:

1). A list of transmitting and receiving systems

2). How those systems are related and their respective roles

a). Edge Systems (sensors, controllers, users)

b). Application Server environments (App., DB, Web)

c). Data Storage

3). A resulting flow diagram that illustrates how data moves through the network

a). Linear data flows

b). Closed loop (feedback) data flows

4). Identification of preferred or required communication domains.

a). Which elements need to ‘see’ and communicate with one another?

b). Which elements need to be isolated and should not communicate directly?

As an example of linear data flows, see the diagram below. It illustrates a typical PCI data footprint. Notice how the data flow is primarily from the point of sale systems to the bank. While there are some minor flows of other data in the footprint, it is by and large dominated by the credit card transaction data as it moves to the data center and then to the bank, or even directly to the bank.


Figure 2. Linear PCI Data Footprint

Given that the linear footprint is monolithic, the point of sale network can be handled by one L3 IP VPN Virtual Service Network. This VSN would terminate at a standard security demarcation point with a mapping to a single dedicated port. In the data center, a single L2 Virtual Service Network could provide the required environment for the PCI server application and the uplink to the financial institution. Alternatively, some customers have utilized Stealth L2 VSNs to provide connectivity to the point of sale systems, which are in turn collapsed to the security demarcation.


Figure 3. Stealth L2 Virtual Service Network


Figure 4. L3 Virtual Service Network

A Stealth L2 VSN is nothing more than a normal L2 VSN that has no IP addresses assigned at the VLAN service termination points. As a result, the systems within it are much more difficult to discover and hence exploit. L3 VSNs, which are I-SIDs associated with VRFs, are stealth by nature. The I-SID replaces traditional VRF peering methods, creating a much simpler service construct.

To look at looped data flows, let’s use a simple two layer automation framework, as shown in the figure below.


Figure 5. Looped Data Footprint for Automation

We can see that we have three main types of element in the system: two sensors (S1 and S2), a controller or actuator, and a sensor/controller manager, which we will refer to as the SCM. We can also see that the sensor feeds information on the actual or effective state of the control system to the SCM. For the sake of clarity let’s say that it is a flood gate. The sensor (S2) can measure whether the gate is open or closed or in any intermediate position. The SCM can in turn control the state of the gate by actuating the controller. The system might even be more sophisticated, in that it manages the local gate not only on its own state but also according to upstream water level conditions. As such, there would also be a dedicated sensor element that allows the system to monitor the water level; this is sensor S1. So we see a closed loop framework, but we also see some consistent patterns in that the sensors never talk directly to the controllers. Even S2 does not talk to the controller; it measures the effective state of it. Only the SCM talks to the controller, and the sensors only talk to the SCM. As a result we begin to see a framework of data flow and which elements within the end to end system need to see and communicate with one another. This in turn provides us with insight as to how to design the supporting Virtual Service Network environment, as shown below.


Figure 6. Looped Virtual Service Network design

Note that the design is self-similar, in that it is replicated at the various points of the watercourse that it is meant to monitor and control. Each site location provides three L2 VSN environments for S1, S2 and A/C. Each of these is fed up to the SCM, which coordinates the local sensor/control feedback. Note that S1, S2 and A/C have no way to communicate directly, only through the coordination of the SCM. There may be several of these loopback cells at each site location, all feeding back into the site SCM. Also note that there is a higher level communication channel provided by the SCM L3 VSN, which allows SCM sites to communicate upstream state information to downstream flood control infrastructure.
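
One way to capture that design before provisioning is as an explicit allowed-communication map per site. The sketch below is illustrative only; the I-SID values, service names and member names are invented, and the check simply encodes the rule that every path runs through the SCM.

```python
# Illustrative model of the looped VSN design: per-site L2 VSNs for S1, S2 and
# the actuator/controller (A/C), plus a shared L3 VSN for the SCMs.

SITE_TEMPLATE = {
    "l2_vsns": {
        "S1":  {"i_sid": 20001, "members": ["sensor-upstream-level", "SCM"]},
        "S2":  {"i_sid": 20002, "members": ["sensor-gate-state", "SCM"]},
        "A/C": {"i_sid": 20003, "members": ["gate-actuator", "SCM"]},
    },
    # Sensors and the actuator never share a VSN: all paths go through the SCM.
    "l3_vsn": {"name": "SCM-interconnect", "i_sid": 30001, "members": ["SCM"]},
}

def allowed(a: str, b: str, site=SITE_TEMPLATE) -> bool:
    """True only if a and b share at least one service network."""
    services = list(site["l2_vsns"].values()) + [site["l3_vsn"]]
    return any(a in s["members"] and b in s["members"] for s in services)

assert allowed("sensor-gate-state", "SCM")
assert not allowed("sensor-gate-state", "gate-actuator")   # S2 never talks to the controller
```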

The whole system becomes a series of interrelated atomic networks that have no way to communicate directly and yet have the ability to convey a state of awareness of the overall end to end system, which can be monitored and controlled in a very predictable fashion as long as it is within the engineered limits of the system. But also note that each critical element is effectively isolated from any inbound or outbound communication other than that which is required for the system to operate. Now it becomes easy to implement intrusion detection and firewalls with a very narrow profile on what is acceptable within the given data footprint. Anything outside it is dropped, pure and simple.


Know who is who (and when they were there (and what they did))!

The prior statement applies not only to looped automation flows but also to any confidential data footprint. It is important to consider not only the validation of the systems but also the users who will access them. But it goes much further than network and systems access control. It touches on proper auditing of that access and the associated change control. This becomes a much stickier wicket and there is still no easy answer. It really comes down to a coordination of resources, both cyber and human. Be sure to think out your access control policies with respect to the confidential footprint. Be prepared to buck standard access policies or demands from users that all services need to be available everywhere. As an example, it is not acceptable to mix UC and PCI point of sale communications in one logical network. This does not mean that a sales clerk cannot have a phone, and of course we assume that a contact center worker has a phone. It means that UC communications will traverse a different logical footprint than the PCI point of sale data. The two systems might be co-resident at various locations, but they are ships in the night from a network connectivity perspective. As a customer recently commented to me, “Well, with everything that has been going on, users will just need to accept that it’s a new world.” He was right.

In order to properly lock down information domains there needs to be stricter management of user access to those domains and of exactly what users can and cannot do within them. It may even make sense to have whole alternate user IDs with alternate, stronger methods of authentication. This provides an added hurdle to a would-be attacker who might have gained a general user’s access account. Alternate user accounts also provide for easier and clearer auditing of those users’ activities within the confidential data domain. Providing a common policy and directory resource for both network and systems access controls can allow for consolidation of audits and logs. By syncing all systems to a common clock and using tools such as the ELK stack (Elasticsearch, Logstash and Kibana), entries can be easily searched against those alternate user IDs and the systems that are touched or modified. There is still some extra work to generate the appropriate reports, but having the data in an easily searchable utility is a great help.
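
With all systems synced to a common clock and logs centralized, auditing an alternate user ID becomes a query. The sketch below assumes the Elasticsearch Python client (8.x style keyword arguments) and illustrative index and field names (audit-*, user.id, @timestamp); adjust these to whatever your Logstash pipeline actually produces.

```python
# Sketch of auditing an alternate (confidential-domain) user ID across synced logs.
# Index pattern and field names are assumptions, not a fixed schema.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="audit-*",
    query={
        "bool": {
            "filter": [
                {"term": {"user.id": "jdoe-secure"}},            # the alternate user ID
                {"range": {"@timestamp": {"gte": "now-7d/d"}}},   # last seven days
            ]
        }
    },
    sort=[{"@timestamp": "asc"}],
    size=500,
)

for hit in resp["hits"]["hits"]:
    doc = hit["_source"]
    print(doc["@timestamp"], doc.get("host", {}).get("name"), doc.get("event", {}).get("action"))
```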

Putting you ‘under the microscope’

Even in the best of circumstances there are times when a user or a device will begin to exhibit suspicious or abnormal behaviors. As previously established, having an isolated information domain allows anomaly based detection to function with a very high degree of accuracy. When exceptions are found they can be flagged and highlighted. A very powerful capability of Avaya’s SDN Fx is its unique ability to leverage stealth networking services to move the offending system into a ‘forensics environment’ where it is still allowed to perform its normal functions but is monitored to assure proper behavior or determine the cause of the anomaly. In the case of malicious activity, the offending device can be placed into quarantine with the right forensics trail. Today many customers use this feature on a daily basis in a manual fashion: a security architect can take a system, place it into a forensics environment and then monitor the system for suspect activity. But the human needs to be at the console and see the alert. Recently, Avaya has been working with SDN Fx and the Breeze development workspace to create an automated framework. Working with various security systems partners, Avaya is creating an automated systems framework to protect the micro-segmented domains of interest. Micro-segmentation not only provides the isolated environment for anomaly detection, but also the ability to lock down and isolate suspected offenders.
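
The manual workflow described above (flag, move to a forensics environment, quarantine if malicious) lends itself to a simple state machine, which is roughly what an automated framework would encode. The sketch below is purely illustrative; `fabric_move` stands in for whatever SDN Fx / Breeze integration actually performs the service change and is not a real API.

```python
# Illustrative state machine for the flag -> forensics -> quarantine workflow.

from enum import Enum

class State(Enum):
    PRODUCTION = "production-vsn"
    FORENSICS = "forensics-vsn"     # still functional, but closely monitored
    QUARANTINE = "quarantine-vsn"   # isolated, with the forensics trail preserved

def handle_anomaly(device, verdict, fabric_move):
    """Move a device between service networks based on the analysis verdict."""
    if verdict == "suspicious":
        fabric_move(device, State.FORENSICS.value)
        return State.FORENSICS
    if verdict == "malicious":
        fabric_move(device, State.QUARANTINE.value)
        return State.QUARANTINE
    fabric_move(device, State.PRODUCTION.value)   # cleared: return to normal service
    return State.PRODUCTION
```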

Micro-segmentation ‘on the fly’ – No man is an island… but a network can be!

Sometimes there is the need to move confidential data quickly and in a totally secret and isolated manner. In response to this need there arose a series of secure web services known as Tor or Onion sites. These sites were initially introduced and intended for research and development groups, but over time they have been co-opted by drug cartels and terrorist organizations, and the result has become known as the ‘dark web’. The use of strong encryption in these services is now a concern among the likes of the NSA and FBI as well as many corporations and enterprises. These sites are now often blocked at security demarcations due to concerns about masked malicious activity and content. Additionally, many organizations now forbid strong encryption on laptops or other devices as concerns for its misuse have grown significantly. But clearly, there is a strong benefit to closed networks that are able to move information and provide communications with total security. There has to be some compromise that could allow for this type of service but provide it in a manner that is well mandated and governed by an organization’s IT department.

Avaya has been doing research into this area as well. Dynamic team formation can be facilitated, once again, with SDN Fx and the Breeze development workspace. Due to the programmatic nature of SDN Fx, completely isolated Stealth network environments can be established in a very quick and dynamic fashion. The Breeze development platform is used to create a self-provisioning portal where users can securely create a dynamic stealth network with the required network services. These services would include required utilities such as DHCP, but also optional services such as secure file services, Scopia video conferencing, and internal security resources to ensure proper behavior within the dynamic segment. A secure invitation is sent out to the invitees with a URL attachment to join the dynamic portal with authenticated access. During the course of the session, the members are able to work in a totally secure and isolated environment where confidential information and data can be exchanged, discussed and modified with total assurance. From the outside, the network does not exist. It cannot be discovered and cannot be intruded into. Once users are finished with the resource they simply log out of the portal and are automatically placed back into their original networks. Additionally, the dynamic Virtual Service Network can be encrypted at the network edge, either on a device like Avaya’s new Open Network Adapter or by a partner such as Senetas, who is able to provide secure encryption at the I-SID level. With this type of solution, the security of Tor and Onion sites can be provided, but in a well-managed fashion that does not require strong encryption on the laptops. Below is an illustration of the demonstration that was publicly held at the recent Avaya Technology Forums across the globe.


Figure 7. I-SID level encryption demonstrated by Senetas

In summary

Many security analysts, including those out of the likes of the NSA, are saying that micro-segmentation is a key element of a proper cyber-security practice. It is not a hard point to understand. Micro-segmentation can limit east-west movement of malicious individuals and content. It can also provide isolated environments that are an inherently strong complement to traditional security technologies. The issue that most folks have with micro-segmentation is not the technology itself but deciding what to protect and how to design the network to do so. Avaya’s SDN Fx Fabric Connect can drastically ease the deployment of a micro-segmented network design. Virtual Service Networks are inherently simple service constructs that lend themselves well to software defined functions. It cannot assist in deciding what needs to be protected, however. Hopefully, this article has provided insight into methods that any organization can adopt to do the proper data discovery and arrive at the scope of the confidential data footprint. From there the design of the Virtual Service Networks to support it is extremely straightforward.

As we move forward into the new world of the Internet of Things and Smart infrastructures micro-segmentation will be the name of the game. Without it, your systems are simply sitting ducks once the security demarcation has been compromised or worse yet the malice comes from within.







What’s the Big Deal about Big Data?

July 28, 2014


It goes without saying that knowledge is power. It gives one the power to make informed decisions and avoid miscalculations and mistakes. In recent years the definition of knowledge has changed slightly. This change is the result of increases in the ease and speed of computation, as well as the sheer volume of data that these computations can be exercised against. Hence, it is no secret that the rise of computers and the Internet has contributed significantly to enhancing this capability.
The term that is often bandied about is “Big Data”. This term has gained a certain mystique that is comparable to cloud computing. Everyone knows that it is important. Unless you have been living in a cave, you most certainly have at least read about it. After all, if such big names as IBM, EMC and Oracle are making a focus of it, then it must have some sort of importance to the industry and market as a whole. When pressed for a definition of what it is, however, many folks will often struggle. Note that the issue is not that it deals with the computation of large amounts of data, as its name implies, but more so that many folks struggle to understand what it would be used for.
This article is intended to clarify the definitions of Big Data and Data Analytics/Data Science and what they mean. It will also talk about why they are important and will become more important (almost paramount) in the very near future. Also discussed will be the impact that Big Data will have on the typical IT department and what it means to traditional data center design and implementation. In order to do this we will start with the aspect of knowledge itself and the different characterizations of it that have evolved over time.

I. The two main types of ‘scientific’ knowledge

To avoid getting into an in depth discussion of epistemology, we will limit this section of the article to just the areas of ‘scientific’ knowledge or even more specifically, ‘knowledge of the calculable’. This is not to discount other forms of knowledge. There is much to be offered by spiritual and aesthetic knowledge as well as many other classifications including some that would be deemed as scientific, such as biology*. But here we are concerned with knowledge that is computable or knowledge that can be gained by computation.

* This is rapidly changing however. Many recent findings show that many biological phenomena have mathematical foundations. Bodily systems and living populations have been shown to exhibit strong correlations to non-linear power law relationships. In a practical use example, mathematical calculations are often used to estimate the impact of an epidemic on a given population.

Evolving for centuries but coming to fruition with Galileo in the late 16th and early 17th centuries, it was discovered that nature could be described and even predicted in mathematical terms. The dropping of balls of different sizes and masses from the Tower of Pisa is a familiar myth to anyone with even a slight background in the history of science. I say myth, because it is very doubtful that this ever literally took place. Instead, Galileo used inclined planes and ‘perfect’ spheres of various densities to establish that the acceleration due to gravity is constant regardless of size or mass. Lacking an accurate timekeeping device, he would sing a song to keep track of the experiments. Being an accomplished musician, he had a keen sense of timing, and the inclined planes provided him the extended time such a method required. He correctly realized that it was resistance, or friction, that caused the deltas that we see in the everyday world. Everyone knows that when someone drops a cannonball and a feather off a roof, the cannonball will strike the earth first; it is not common sense that in a perfect vacuum both the feather and the cannonball will fall at the exact same rate. It actually takes a video to prove it to the mind, and one can be found readily on the Internet. The really important thing is that Galileo calculated this from his work with spheres and inclined planes, and that the actual experiment was not carried out until many years after his death, as the ability to generate a near-perfect vacuum did not exist in his time. I find this very interesting, as it says two things about calculable knowledge. First, it allows one to explain why things occur as they do. Second, and perhaps more importantly, it allows one to predict the results once one knows the mathematical pattern of behavior. Galileo realized this. Even though he was not able to create a perfect vacuum, by meticulous calculation of the various values involved (with rather archaic mathematics, as much of the notation we now take for granted did not yet exist) he was able to arrive at this fact. Needless to say, this goes against all common sense and experience. So much so that this, as well as his work in the fledgling science of astronomy, almost landed him on the hot seat (or stake) with the Church. As history attests, however, he stuck to his guns, and even after the Inquisitional Council had him recant his theories on the heliocentric nature of the solar system, he whispered of the earth… “Yet it still moves”.
If we fast forward to the time of Sir Isaac Newton, this insight was made crystalline by Newton’s laws of motion, which described the movement of ‘everything’ from the falling of an apple (no myth, this actually did spark his insight, though it did not hit him on the head) to the movement of the planets, with a few simple mathematical formulas. Published as the ‘Philosophiae Naturalis Principia Mathematica’, or simply the ‘Principia’, in 1687, this was the foundation of modern physics as we know it. The concept that the world was mathematical, or at least could be described in mathematical terms, was now something that was not only validated but demonstrable. This set of events led to the eventual ‘positivist’ concept of the world that reached its epitome with the following statement made by Pierre Laplace in 1814.
“Consider an intelligence which, at any instant, could have knowledge of all forces controlling nature together with the momentary conditions of all the entities of which nature consists. If this intelligence were powerful enough to submit all of this data to analysis, it would be able to embrace in a single formula the movements of the largest bodies in the universe and those of lighter atoms; for it, nothing would be uncertain; the future and the past would be equally present to its eyes.”

Wow. Now THAT’s big data! Sounds great! What the heck happened?

Enter Randomness, Entropy & Chaos

In roughly the same time frame as Laplace, many engineers were using these ‘laws’ to attempt to optimize new inventions like the steam engine. One such researcher was a French scientist by the name of Nicolas Léonard Sadi Carnot. The research that he focused on was the movement of heat within the engine, with the aim of conserving as much of the energy as possible for work. In the process he came to realize that there was a cycle within the engine that could be described mathematically and even monitored and controlled. He also realized that some heat is always lost. It simply gets radiated out and away from the system and is unusable for the work of the engine. As anyone who has stood next to a working engine of any type will attest, they tend to get hot. This cycle bears his name as the Carnot cycle. This innovative view led to the foundation of a new branch of physics (with the follow-on help of Ludwig Boltzmann) known as thermodynamics: the realization that all change in the world (and the universe as a whole) is the movement of heat, more specifically from hot to cold. Without going into detail on the three major laws of thermodynamics, the main point for this discussion is that as change occurs it is irreversible. Interestingly, more recently developed information theory validates this, as it shows that order can actually be interpreted as ‘information’ and that over time this information is lost to entropy, in that there is a loss of order. Entropy is as such a measurement of disorder within a system. This brings us to the major inflection point of our subject. As change occurs, it cannot be run in reverse like a tape to arrive at the same inherent values. This is problematic, as the laws of Newton are not reversible in practice, though they may be on a piece of paper. As a matter of fact, many such representations up to modern times, such as the Feynman diagrams that illustrate the details of quantum reactions, are in fact reversible. What gives?
The real crux of this quick discussion is the realization that reversibility is largely a mathematical expression that starts to fall apart as the number of components in the overall system gets larger. A very simple example is one with two billiard balls on a pool table. It is fairly straightforward to use the Newtonian laws to reverse the equation, and we can also do so in practice. But now let us take a single cue ball and strike a large number of other balls. Reversing the calculation is not nearly so straightforward. The number of variables to be considered begins to go beyond our ability to calculate, much less control. They most certainly are not reversible in the everyday sense. In the same sense, I can flip a deck of playing cards in the air and bet you with ultimate confidence that the cards will not come down in the same order (or even the same area!) in which they were thrown. Splattered eggs do not fall upwards to reassemble on the kitchen counter. And much to our chagrin, our cars do not repair themselves after we have had a fender bender. This is entropy: the second law of thermodynamics states that some energy within a system is always lost to friction and heat. This dissipation can be minimized but never eliminated. As a result, the less entropy an engine generates the more efficient it is in its function. Hmmmm, what told us that? A lot of data, that’s what, and back then things were done with paper and pencil! A great and timely discovery for its time, as it helped move us into the industrial age. The point of all of this, however, is that in some (actually most) instances, information on history is important in understanding the behavior of a system.

The strange attraction of Chaos

We need to fast forward again. Now we are in the early 1960s with a meteorologist by the name of Edward Lorenz. He was interested in the enhanced computational abilities that new computing technology could offer toward the goal of predicting the weather. Never mind that it took five days’ worth of calculation to arrive at the forecast for the following day. At least the check was self-evident, as it had already occurred four days earlier!
As the story goes, he was crunching some data one evening and one of the machines ran out of paper tape. He quickly refilled the machine and restarted it from where the calculations left off, manually, by typing the values back in. He then went off and grabbed a cup of coffee to let the machine churn away. When he returned he noticed that the computations were way off from the values that the sister machines were producing. In alarm he looked over his work to find that the only real difference was the decimal offset of the initial values (the interface only allowed a three place offset while the actual calculation was running with a six place offset). As it turns out, the rounded values he typed in manually created a different result for the same calculation. This brought about the realization that many if not most systems are sensitive, at times extremely so, to what is now termed their ‘initial conditions’.
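
Lorenz’s accident is easy to reproduce with any non-linear iteration. The sketch below uses the well-known logistic map as a stand-in (not Lorenz’s actual weather model) and shows two starting values that differ only in the later decimal places drifting completely apart within a few dozen steps.

```python
# Sensitive dependence on initial conditions, illustrated with the logistic map
# (a stand-in for Lorenz's weather equations, not his actual model).

def logistic(x, r=4.0):
    return r * x * (1.0 - x)

a, b = 0.506127, 0.5061       # the same value, truncated to fewer decimal places
for step in range(60):
    a, b = logistic(a), logistic(b)
    if step % 10 == 9:
        print(f"step {step + 1:2d}: full={a:.6f}  rounded={b:.6f}  diff={abs(a - b):.6f}")
# Within a few dozen iterations the two trajectories bear no resemblance to
# one another, which is exactly the behavior Lorenz stumbled onto.
```
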
There is something more, however. Lorenz discovered that if some systems are looked at long enough, and with the proper focus of granularity, a quasi-regular or quasi-periodic pattern becomes discernible that allows for a general qualitative description of the system and its behavior, without the ability to quantitatively say what the state of any particular part of the system may be at a given point in time. These are termed mathematical ‘attractors’ within a system: a certain set of power-law based relationships toward which a system, if left unperturbed, is drawn and at which it will be maintained. These attractors are quite common; they are essentially required for all dissipative systems. In essence, an attractor is a behavior that can be described mathematically and that by its nature keeps a system as a system, with just enough energy coming in to offset the entropy that must inevitably go out. The whole thing is fueled by the flow of energy (heat) through it. By the way, both you and I are examples of dissipative systems, and yes, we are based on a lot of information. But here is something to consider: stock markets are dissipative systems too. The only difference is that energy is replaced by money.

The problem with Infinity

The question is how sensitive do we have to be, and what level of focus will reveal a pattern? How many decimal places can you leave off and still have faith in the calculations that result? This may sound like mere semantics, but the calculable offset in Lorenz’s work created results that were wildly different. (Otherwise he might very well have dismissed it as noise*)

* Actually in the electronics and communications area this is exactly what the phenomenon was termed as for decades. Additionally, it was termed as ‘undesirable’ and engineers sought to remove or reduce it so it was never researched further as to its nature. Recently efforts to leverage these characteristics are being investigated.

Clearly the accuracy of a given answer is dependent on how accurately the starting conditions are measured. Again, one might say that perhaps this is the case for a minority of cases, and that in most cases any difference will be minor. Alas, this is not true. Most systems are like this; the term is ‘non-linear’. Small degrees of inaccuracy in the initial values of the calculations in non-linear systems can result in vastly different end results. One of the reasons for this is that, with the seemingly unassociated concept of infinity, we touch on a very sticky subject. What is an infinitely accurate initial condition? As an example, I can take a meter and divide it by 100 to arrive at centimeters, and then take a centimeter and divide it further to arrive at millimeters, and so forth… This process could seemingly go on forever! Actually, this is not the case, but the answer is not appeasing to our cause. We can continue to divide until we arrive at the Planck length, the smallest recognizable unit of difference before the very existence of space and time becomes meaningless: in essence, a foam of quantum probability from which existence as we know it emerges.
The practical question must be: when I make a measurement, how accurate do I need to be? Well, if I am cutting a two-by-four for the construction of some macro-level structure such as a house or shed, I only need to be accurate to the 2nd, maybe 3rd, decimal place. On the other hand, if I am cutting a piece of scaffolding fabric to fit surgically into a certain locale within an organ as a substrate for regenerative growth, the required precision increases by orders of magnitude, possibly out to 6 or 8 decimal places. So the question to ask is how do we know how accurate we have to be? Here comes the pattern part! We know this by the history of the system we are dealing with. In the case of a house, we have plenty of history (a strong pattern – we have built a lot of houses) from which to deduce that we need only be accurate to a certain degree and the house will successfully stand. In the case of micro-surgery we may have less history (a weaker pattern – we have not done so many of these new medical procedures), but enough to know that a couple of decimal places will just not cut it. Going further, we even have things like the weather, where we have lots and lots of historic data but the exactitude and density of the information still limits us to only a few days of relatively accurate predictive power. In other words, quite a bit of our knowledge is dependent on the granularity and focus with which it is analyzed. Are you starting to see a thread? Wink, wink.
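To make the sensitivity argument concrete, here is a minimal sketch (not from the original text) that iterates the logistic map, a classic non-linear system chosen here purely for illustration, from two starting values that differ only past the third decimal place. The trajectories soon bear no resemblance to each other, mirroring the truncated-offset story above.

```python
# Minimal sketch: sensitivity to initial conditions in a non-linear system.
# The logistic map x -> r * x * (1 - x) is an illustrative choice, not from the article.

def logistic_trajectory(x0, r=4.0, steps=40):
    """Iterate the logistic map from x0 and return the full trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

full = logistic_trajectory(0.123456)   # a "six place" starting value
rounded = logistic_trajectory(0.123)   # the same value truncated to three places

for step in (0, 5, 10, 20, 30, 40):
    print(f"step {step:2d}: full={full[step]:.6f}  rounded={rounded[step]:.6f}")
# After a few dozen iterations the two runs diverge completely, even though the
# starting values agreed to three decimal places.
```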

Historical and Ahistorical knowledge

It all comes down to the fact that calculable knowledge depends on our having some idea of the history and conditions of a given system. Without these we cannot calculate. But how do we arrive at these initial values? By experiment, of course. We all recall the days back in school, with the tedious hours of experimentation in exercises where we knew full well the result. But think of the first time that result was realized by the likes of, say, Galileo. What a great moment it must have been! An experiment, by definition, cannot be a ‘one-time thing’. One has to run an experiment multiple times with ‘exactly’ the same conditions, or vary the conditions slightly in a controlled fashion, depending on what one is trying to prove. This brings about a strong concept of history. The experimental operations have been run, and we know that such a system behaves in such a way due to historical and replicable examples. Now we plug those variables into the mathematics and let it run. We predict from those calculations and then validate with further experiments. Basic science works on these principles, so we should say that all calculable knowledge is historic in nature. It could be argued that certain immutable ‘mathematical truths’ make some knowledge ahistorical; in other words, like Newton’s laws* and like the Feynman diagrams, some knowledge just doesn’t care about the nature or direction of time’s arrow. Be that as it may, any of these would still require historical knowledge in order to interpret their meaning or even to find that they exist!

* Newton’s laws are actually approximations of reality. In normal everyday circumstances the linear laws work quite well. When speed or acceleration is brought to extremes, however, the laws fail to yield a correct representation. Einstein’s theories of relativity provide a more accurate way to represent the non-linear reality under these extreme conditions (which actually exist all the time, but in normal environments the delta from the linear case is so small as to be negligible). The main difference: in Newton’s laws space and time are absolute. The clock ticks the same regardless of motion or location, hence linear. In Einstein’s theories space and time are mutable and dynamic. The clock ticks differently for different motions or even locations. Specifically, time slows with speed as the local space contracts, hence non-linear.

As an example, you can toss me a ball from about ten feet away. Depending on the angle and the force of the throw, I can properly calculate where the ball will be at a certain point in time. I have the whole history of the system from start to finish. I may use an ahistorical piece of knowledge (i.e. the ball is in the air and moving towards me), but without knowledge of the starting conditions for this particular throw I am left with little data and will likely not catch the ball. In retrospect, it’s amazing that our brains can make this ‘calculation’ all at once. Not explicitly, of course, but implicitly. We know whether we have to back up or run forward to catch the ball. We are not doing the actual calculations in our heads (at least I’m not). But if I were to run out onto the field and see the ball you threw already in mid-air, with no knowledge of the starting conditions, I would essentially be dealing with point zero in knowledge of a pre-existing system. Sounds precarious, and it is, because this is the world we live in. But wait! Remember that I have a history in my head of how balls in the air behave. I can reference this library and get a chunk of history in very small sample periods (the slow-motion effect we often recall), and yes, perhaps I just might catch that ball – provided that the skill of the thrower is commensurate with the skill of those I have knowledge of. Ironically, the more variability there is in my experience with throwers of different skill levels, the higher the probability of my catching the ball in such an instance. And it’s all about catching the ball! But it also says something important about calculable knowledge.

Why does this balloon stay round? The law of large numbers

Thankfully, we live in a world full of history. But ironically, too much history can be a bad thing. More properly put, too specific a history about a component within a system can be a bad thing. This was made apparent by Ludwig Boltzmann in his studies of gases and their inherent properties. While it is not only impractical but impossible to measure the exact mass and velocity of each and every constituent particle at each and every instant, it is still possible to determine their overall behavior. (He was making the proposition based on the assumption of the existence of as-yet-unproven molecules and atoms.) As an example, if we have a box filled with air on one side and no air (a vacuum) on the other, we can be certain that if we lift the divider between the two halves, the particles of air will spread or ‘dissipate’ into the other side of the box. Eventually, the gas in the now expanded box will have diffused to every corner. At this point any changes will be random; there is no ‘direction’ in which the particles have to go. This is the realization of equilibrium. As we pointed out earlier, this is simply entropy reaching its ultimate goal within the limits of the system. Now let us take this box and make it a balloon. If we blow into it, the balloon will inflate and there will be an equal distribution of whatever is used to fill it. Note that the balloon is now a ‘system’. After it cools to a uniform state the system will reach equilibrium. But the balloon still stays inflated. Even though there is no notable heat movement within the balloon, it remains inflated by the heat contained within the equilibrium. After all, we did not say that there was no heat; we just said that there was no heat movement, or rather that it has slowed drastically. In actuality, it is the movement of the molecules and this residual energy (i.e. the balloon at room temperature) that provides the pressure to keep the balloon inflated.*

* Interesting experiment… blow up a balloon and then place it in the freezer for a short while.

Boltzmann, as a result of this realization, was able to manipulate the temperature of a gas to control its pressure in a fixed container, and vice versa. This showed that an increase in heat actually caused more movement within the constituent particles of the gas. He found that while it was futile to try to calculate what occurs to a single particle, it was possible to represent the behavior of the whole mass of particles in the system by the use of what we now call statistical analysis. An example is shown in figure 1. What it illustrates is that as the gas heats up, the familiar bell curve flattens and widens, spreading out the probability that a given particle will be found at a certain speed and energy level.

Figure 1

Figure 1. Flattening Bell curves to temperature coefficients
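As a rough illustration of that flattening, the sketch below evaluates the Maxwell-Boltzmann speed distribution at two temperatures. The choice of nitrogen and the specific temperatures are assumptions made only for this example, not figures from the article.

```python
import math

# Maxwell-Boltzmann speed distribution f(v) for an ideal gas.
# The gas (nitrogen) and the two temperatures are illustrative assumptions.
K_B = 1.380649e-23          # Boltzmann constant, J/K
M_N2 = 4.652e-26            # approximate mass of an N2 molecule, kg

def mb_density(v, temp_k, mass=M_N2):
    """Probability density of molecular speed v (m/s) at temperature temp_k (K)."""
    a = mass / (2.0 * K_B * temp_k)
    return 4.0 * math.pi * (a / math.pi) ** 1.5 * v * v * math.exp(-a * v * v)

for temp in (300.0, 900.0):                    # a cool gas vs. a hot gas
    peak_v = max(range(1, 2000), key=lambda v: mb_density(v, temp))
    print(f"T={temp:5.0f} K  peak speed ~{peak_v} m/s  "
          f"density at peak {mb_density(peak_v, temp):.6f}")
# The hotter gas shows a lower, broader peak: the same "flattening bell curve"
# behavior sketched in Figure 1.
```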

This was a grand insight, and it has enabled a whole new branch of knowledge which, for better or worse, has helped shape our modern world. Note that I am not gushing over the virtues of statistics, but when properly used it has strong merits, and it has enabled us to see things to which we would otherwise be blind. And after all, this is what knowledge is all about, right? But wait, I have more to say about statistics. It’s not all good. As it turns out, even when used properly, it can have blind spots.

Those pesky Black Swans…

There is a neat book written on the subject by a gentleman by the name of Nassim Nicholas Taleb*. In it he artfully speaks to the improbable but possible: those events that occur every once in a while and to which statistical analysis is often blind. These events are termed ‘Black Swans’. He goes on to show that these events are somewhat invisible to normal statistical analysis in that they are improbable events on the ‘outside’ of the bell curve (termed ‘outliers’). He also goes on to indicate what he thinks is the cause. We tend to get myopic about the trends and almost convince ourselves of their dependability. We also do not like to think of ourselves as wrong or somehow flawed in our assumptions. He points out that in today’s world there is almost too much information, and you can find stats or facts just about anywhere to fit and justify your belief in that dependability. He is totally correct. Statistics is vulnerable to this. Yet I need to qualify that just a bit. It’s not statistics that is at fault. The fault lies with those using it as a tool.

* The Black Swan – Random House

Further, Taleb provides some insight into things that might serve as flags or ‘telltales’ for Black Swans. As an example, he notes that prior to all drastic market declines the markets exhibited a spiky, intermittent behavior that, while still within the Gaussian norm, carried an associated ‘noise’ factor. Note that parallel phenomena exist within electronics, communications and, yes, you guessed it, the weather! This ‘noise’ tends to indicate instability, where the system is about to change in a major topological fashion to another phase. These are handy things to know. Note how they deal with the overall ‘pattern’ of behavior, not the statistical mean or even the median.

Why is this at all important?

At this point you might be asking yourself: where am I going with all of this? Well, it’s all about Big Data! As we pointed out, all knowledge is historical, even if gained by ahistorical (law) insight. Properly understanding a given system means that one needs to understand not only the statistical trends, but also the higher-level patterns of behavior that might foretell outliers and black swans. All of this requires huge amounts of data, of potentially wide variety as well. Think of a simple example of modeling for a highway expansion. You go through the standard calculation and then decide you also want to take the local seasonal weather patterns into consideration. The computation and data storage requirements have just increased exponentially. This is what the challenge of Big Data is all about. It is not intended for handling the ‘simple’ questions; it is intent on pushing out the bounds of what is deemed tractable or calculable in the sense of knowledge. It’s not that the mathematics did not exist in the past. It’s just that now the capability is within ‘everyday’ computational reach. Next let’s consider the use cases for Big Data and perhaps touch on a few actual implementations that you could actually run in your data center.


II. Big Data – What’s it good for? Absolutely everything! Well, almost…

If you will recall, we spoke about dissipative systems. As it turns out, almost everything is dissipative in nature: the weather, the economy, the stock market, international political dynamics, our bodies, one could even say our own minds. Clearly, there is something to consider in all of that. The way humans behave is a particularly quirky thing. We are also, as a result, the primary driver of and input into many of the other systems, such as economics, politics, the stock market and yes, even the weather. Further understanding in these areas could prove, and actually has proven, to be profound.
These are important things to know, and we will talk a little later about these lofty goals. But in reality Big Data can have far more modest goals and interests. A good real-world example is retail sales. It gets back to the age-old adage… “Know your customer.” But in today’s cyber-commerce environment that’s often easier said than done. Fortunately, there are companies working in this area. One of the real founders of this space is Google. Google is an information company at its heart. When one thinks about the sheer mass of information that it possesses, it is simply mind-boggling. Yet Google needs to leverage and somehow make sense of that data, and at the same time it has practical limits on computational power and the associated costs. Out of these competing and contradictory requirements came a parallel compute infrastructure that leverages off-the-shelf commodity systems. It was initially introduced to the public in a series of white papers describing the Google File System (GFS) and ‘sister’ technologies such as MapReduce, which provides for key/value mappings, and Bigtable, which represents structured data within the environment. This work has since been embraced by the open source community as Apache Hadoop, whose storage layer is the Hadoop Distributed File System, or HDFS. The figure below shows the evolution of these efforts into the open source community.

Figure 2

Figure 2. Hadoop outgrowth and evolution into the open source space

The benefits of these developments are important, as they provide the springboard for the use of big data and data analytics in the typical enterprise IT environment. Since this inception an entire market sector has sprung up, with major vendors such as EMC and IBM but also startups such as Cloudera and MapR. This article will not go into the details of these different vendor architectures, but suffice it to say that each has its spin and secret sauce that differentiates its approach. You can feel free to look into these different vendors and research others. For the purposes of this article we are concerned more with the architectural principles of Hadoop and what they mean to a data center environment. In data analytics a lot of data has to be read very fast. The longer the read time, the longer the overall analytics process. HDFS leverages parallel processing at a very low level to provide a highly optimized read-time environment.

Figure 3

Figure 3. A comparison of sequential and parallel reads

In the above we show the same 1 terabyte data file being read by a conventional serial read process versus a Hadoop HDFS cluster, which improves the read time by roughly a factor of ten. Note that the same system type is being used in both instances; in the HDFS scenario there are just a lot more of them. Importantly, the actual analytic programs run in parallel as well. Note also that this is just an example. The typical HDFS block size is 64 or 128 MB. This means that relatively large amounts of data can be processed extremely fast with a somewhat modest infrastructure investment. As an additional note, HDFS also provides redundancy and resiliency of data through replication of the distributed data blocks within the cluster.
The main point is that HDFS leverages a distributed data footprint rather than a singular SAN environment. Very often HDFS farms are composed entirely of direct-attached storage systems that are tightly coupled via the data center network.
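A back-of-the-envelope sketch of the read-time argument above; the per-disk throughput and node count used here are assumptions for illustration, not measured figures.

```python
# Rough read-time comparison: one serial reader vs. an HDFS-style parallel read.
# Throughput and node count are illustrative assumptions.

FILE_SIZE_GB = 1024            # a 1 TB file
DISK_MB_PER_SEC = 100          # assumed sustained read rate of a single spindle

def read_time_seconds(file_gb, readers, mb_per_sec=DISK_MB_PER_SEC):
    """Time to read file_gb gigabytes spread evenly across `readers` disks."""
    total_mb = file_gb * 1024
    return total_mb / (readers * mb_per_sec)

serial = read_time_seconds(FILE_SIZE_GB, readers=1)
parallel = read_time_seconds(FILE_SIZE_GB, readers=10)   # ten datanodes reading at once

print(f"serial read : {serial/60:6.1f} minutes")
print(f"parallel x10: {parallel/60:6.1f} minutes")
# With ten readers the wall-clock time drops by roughly a factor of ten,
# which is the effect Figure 3 illustrates.
```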

How the cute little yellow elephant operates…

Hadoop is a strange name, and a cute little yellow elephant as its icon is even more puzzling. As it turns out, one of the key developers’ young sons had a yellow stuffed elephant that he had named Hadoop. The father decided it would make a neat internal project name. The name stuck, and the rest is history. True story, strange as it may seem.
Hadoop is not a peer-to-peer distribution framework. It is hierarchical, with certain master and slave roles within its architecture. The components of HDFS are fairly straightforward and are shown in simplified form in the diagram below.

Figure 4

Figure 4. Hadoop HDFS System Components

The overall HDFS cluster is managed by an entity known as the namenode. You can think of it as the library card index for the file system. More properly, it generates and manages the metadata for the HDFS cluster. As a file gets broken into blocks and placed into HDFS, it is the namenode that indicates where, and the namenode that tracks and replicates if required. The metadata always provides a consistent map of where specific data resides within the distributed file system. This is used not only for writing into or extracting out of the cluster, but also for data analytics, which requires reading the data for its execution. It is important to note that in first-generation Hadoop the namenode was a single point of failure. The secondary namenode in generation 1 Hadoop is actually a housekeeper process that extracts the namenode’s run-time metadata and copies it to disk in what is known as a namenode ‘checkpoint’. Recent versions of Hadoop now offer redundancy for the namenode; Cloudera, for instance, provides high availability for the namenode service.
There is a second master node known as the jobtracker. This service tracks the various jobs that run over the HDFS environment. Both of these are master-role nodes, which is what makes Hadoop hierarchical rather than peer-to-peer.
In the slave role are the datanodes. These are the nodes that actually hold the data residing within the HDFS cluster; in other words, the blocks of data mapped by the namenode reside on these systems’ disks. Most often datanodes use direct-attached storage and only leverage SAN to a very limited extent. The tasktracker is a process that runs on the datanodes and is managed by, and reports back to, the jobtracker for the various executions that occur within the Hadoop HDFS cluster.
And lastly, one of these nodes, referred to as the ‘edge node’, will have an ‘external’ interface that exposes the HDFS environment so that PCs running the Hadoop HDFS client can be provided access.

Figure 5

Figure 5. HDFS Data Distribution & Replication

HDFS is actually fairly efficient in that it incorporates replication into the write process. As shown above, when a file is ingested into the cluster it is broken up into a series of blocks. The namenode utilizes a distribution algorithm to map where the actual data blocks will reside within the cluster. An HDFS cluster has a default replication factor of three, meaning that each individual block will be replicated three times and then placed algorithmically. The namenode in turn develops a metadata map of all blocks resident within the distributed file system. This metadata is in turn a key requirement for the read function, which is itself a requirement for analytics.
If a datanode were to fail within the cluster, HDFS will ‘respawn’ the lost data to meet the distribution and replication requirements. All of this means east/west data traffic, but it also means consistent distribution and replication, which is critical for parallel processing.
HDFS is also rack aware. By this we mean that the namenode can be told that certain datanodes share a rack, and that this should consequently be taken into consideration during the block distribution or replication process. This awareness is not automatic; it must be supplied by a batch or Python script, as sketched after the figure below. Once it is in place, however, it allows the placement algorithm to put the first data block on a certain rack and then place the two replica blocks together on a separate rack. As shown in the figure below, data blocks A and B are distributed evenly across the cluster racks.

Figure 6

Figure 6. HDFS ‘Rack Awareness’
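Since rack awareness has to be supplied by a script, here is a minimal sketch of the kind of topology-mapping script Hadoop can be pointed at; the host-to-rack table is entirely hypothetical, and the exact configuration property that references the script (commonly documented as net.topology.script.file.name) should be checked against your distribution.

```python
#!/usr/bin/env python
# Minimal sketch of a Hadoop rack-awareness topology script.
# Hadoop invokes the script with one or more datanode hostnames/IPs as arguments
# and expects one rack path per argument on stdout.
# The host-to-rack mapping below is entirely hypothetical.

import sys

RACK_MAP = {
    "datanode1": "/dc1/rack1",
    "datanode2": "/dc1/rack1",
    "datanode7": "/dc1/rack2",
    "datanode9": "/dc1/rack3",
}

DEFAULT_RACK = "/dc1/default-rack"

def main():
    for host in sys.argv[1:]:
        print(RACK_MAP.get(host, DEFAULT_RACK))

if __name__ == "__main__":
    main()
```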

Note that while the default replication factor for HDFS is three, it can be increased or decreased at the directory or even the file level. As the replication factor is adjusted for a certain data set, the namenode ensures that data is replicated, spawned or deleted according to the adjusted value.
HDFS uses pipelined writes to move data blocks into the cluster. In figure 7, an HDFS client executes a write for file.txt; the user might, for example, use the copyFromLocal command. The request is sent to the namenode, which responds with metadata telling the client where to write the data blocks. Datanode 1 is the first in the pipeline, so it receives the request and sends a ready request to nodes 7 and 9. Nodes 7 and 9 respond, and the write process begins by placing the data block on datanode 1, from where it is pipelined to datanodes 7 and 9. The write process is not complete until all datanodes respond with a write success. Note that most data center topologies utilize a spine and leaf design, meaning that most of the rack-to-rack data distribution must flow up and through the data center core nodes. In Avaya’s view, this is highly inefficient and can lead to significant bottlenecks that limit the parallelization capabilities of Hadoop.

Figure 7

Figure 7. HDFS pipelined writes
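For reference, the two shell operations mentioned above (copyFromLocal and adjusting the replication factor) look roughly like this when driven from Python. The paths and the replication value are illustrative placeholders, and this is only a sketch of the standard hdfs dfs commands rather than anything specific to this article.

```python
import subprocess

# Minimal sketch: driving the HDFS shell from Python for the operations
# described above. Paths and the replication value are illustrative.

def hdfs(*args):
    """Run an 'hdfs dfs' sub-command and raise if it fails."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

# Ingest a local file; the namenode hands back block placements and the
# datanodes pipeline the replicas among themselves.
hdfs("-copyFromLocal", "file.txt", "/data/file.txt")

# Raise the replication factor for this one file from the default of 3 to 5;
# the namenode will schedule the extra replicas.
hdfs("-setrep", "5", "/data/file.txt")
```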

Additionally, recent recommendations are to move to 40 Gbps interfaces for this purpose. These interfaces most certainly are NOT cheap. With the leaf and spine approach this means rack-to-rack growth requires a large capex outlay at each expansion. Suddenly, the prospect of Big Data and data science for the common man starts to look like a myth! The network costs become the key investment as the cluster grows, and with big data, clusters always grow. We at Avaya have been focusing on this east/west capacity issue within the data center top-of-rack environment.
Reads within the HDFS environment happen in a similar fashion. When the Hadoop client requests to read a given file, the namenode responds with the appropriate metadata so that the client can in turn request the separate data blocks from the HDFS cluster. It is important to note that the metadata for a given block is an ordered list. In the diagram below the namenode responds with metadata for data block A as being on datanodes 1, 7 and 9. The client will request the block from the first datanode in the list, and only after a failed response will it attempt to read from the other datanodes.

Figure 8

Figure 8. HDFS ordered reads

Another important note is that the read requests for data blocks B and C occur in parallel. Only after all data blocks have been confirmed and acknowledged is a read request deemed complete. Finally, similar to the write process, any rack-to-rack east/west flows need to traverse the core switch in a typical spine and leaf architecture. It is important to note, though, that most analytic processes do not ‘read’ in this way. Instead, ‘jobs’ are sent in and partitioned into the environment, where the read and compute processes occur on the local datanodes and are then reduced into an output from the system as a whole. This provides the true ‘magic’ of Hadoop, but it requires a relatively large east/west (rack-to-rack) capacity, and that capacity only grows as the cluster grows.
We at Avaya have anticipated this change in data center traffic patterns. As such, we have taken a much more straightforward approach. We call it Distributed Top of Rack or “D-ToR”. ToR switches are directly interconnected using very high bandwidth backplane connections. These 80G+ connections provide ultra-low-latency, direct connections to other ToRs to address the expected growth. The ToRs are also connected to the upstream core, which allows for the use of L3 and IP VPN services to ensure security and privacy.

Figure 9

Figure 9. Distributed Top of Rack benefits for HDFS

Note that the D-ToR approach is much better suited for high-capacity east/west data flows rack to rack within the data center. Growth of the cluster no longer depends on continual investment in the leaf-spine topology; new racks are simply extended into the existing fabric mesh. Going further, by using front-port capacity, direct east/west interconnects between remote data centers can be created. We refer to this as Remote Rack to Rack. One of the unseen advantages of D-ToR is the reduction of north/south traffic. Where many architects were looking at upgrading to 40G or even 100G uplinks, Avaya’s approach negates this requirement by allowing native L2 east/west server traffic to stay at the rack level. The ports required for this are already in the ToR switches. This provides relief to those strained connections. It also allows for seamless expansion of the cluster without the need for continual capital investment in high-speed interfaces.
Another key advantage of D-ToR is the flexibility it provides:
• Server-to-server connections in rack, across rows, building to building or even site to site! The architecture is far superior to other approaches in supporting advanced clustering technologies such as Hadoop HDFS.
• Traffic stays where it needs to be, reserving the north/south links for end-user traffic or for advanced L3 services. Only traffic that classifies as such need traverse the north/south paths.
• The end result is a vast reduction in the traffic on those pipes as well as a significant performance increase for east/west data flows, at far lesser cost.

Figure 10

Figure 10. Distributed Top of Rack modes of operation

Avaya’s Distributed Top of Rack can operate in two different ways:
• Stack mode can dual-connect up to eight D-ToR switches. The interconnect is 640 Gbps without losing any front ports! Additionally, dual D-ToR switches can be used to scale up to sixteen, giving a maximum east/west profile of 10 Tbps.
• Fabric mode creates a “one hop” mesh which can scale up to hundreds of D-ToR switches! The port count tops out at more than ten thousand 10 Gig ports, with a maximum east/west capacity of hundreds of terabits.

Figure 11

Figure 11. A Geo-distributed Top of Rack environment

Avaya’s D-ToR solution can scale in either mode. Whether the needs are small, large or unknown, D-ToR and Fabric Connect provide unmatched scale, flexibility and, perhaps most importantly, the capability to solve the challenges, even the unknown ones, that most of us face. As the HDFS farm grows, the seamless expansion capability of Avaya’s D-ToR environment can accommodate it without major architectural design changes.
Another key benefit is that Avaya has solved the complex PCI and HIPAA compliance issues without having to physically segment networks or add layers and layers of firewalls. The same can be said for any sensitive data environment that might be using Hadoop, such as patient medical records, banking and financial information, smart power grid or private personal data. Avaya’s Stealth networking technology (referred to in the previous “Dark Horse” article) can keep such networks invisible and self-enclosed. As a result, any attack or scanning surfaces exposed by the data analytics network are removed. The reason for this is that Fabric Connect as a technology is not dependent upon IP as a protocol to establish an end-to-end service path. This removes one of the primary scaffolds for all espionage and attack methods. As a result, the Fabric Connect environment is ‘dark’ to the IP protocol; IP scanning and other topological scanning techniques will yield little or no information.

Using MapReduce to extract meaningful data

Now that we have the data effectively stored and retrievable, we will obviously want to run queries against it and hopefully receive meaningful answers. MapReduce is the original methodology documented in the Google white papers. Note that it is also a utility within HDFS and is used to chunk and create metadata for the stored information within the HDFS environment. Data can also be analyzed with MapReduce to extract meaningful secondary data such as hit counts and trends, which can serve as the historical foundation for predictive analytics.

Figure 12

Figure 12. A Map Reduce job

Figure 12 shows a MapReduce job being sent into the HDFS environment. The HDFS cluster runs the MapReduce program against the data set and provides a response back to the client. Recall that HDFS leverages parallel read/write paths; MapReduce builds on this foundation. As a result, east/west capacity and latency are important considerations in the overall solution.
• Avaya’s D-TOR solution provides easy and consistent scaling of the rack to rack environment as the Hadoop farm grows.

The components of MapReduce are relatively simple.

First there is the Map function, which provides the metadata context within the cluster: an independent record transformation that is a representation of the actual data, including deletions and replications within the system. For analytics, the function is performed against key/value (K,V) pairs. The best way to describe it is with an example. Let’s say we have a word and we want to see how often it appears in a document or a given set of documents. Let’s say that we are looking for the word ‘cow’. This becomes the ‘key’. Every time the MapReduce function ‘reads’ the word cow, it ticks a ‘value’ of 1. As the function proceeds through the read job, these ticks are appended into a list of key/value pairs such as (cow, 31), meaning that there are 31 instances of the word ‘cow’ in the document or set of documents. For this type of job, the Reduce function is the method that aggregates the results from the Map phase and provides a list of key/value pairs that constitute the answer to the query.
Finally, there is the framework function, which is responsible for scheduling and re-running tasks. It also provides utility functions such as splitting the input, which becomes more apparent in the figure below; this refers to the chunking functionality that we spoke of earlier as data is written into HDFS. Typically, these queries are constructed into a larger framework. The figure shows a simple example of a query framework.

Figure 13

Figure 13. A simple Map Reduce word count histogram

Above we see a simple word count histogram, which is the exact process we talked about previously. The upper arrow shows data flow across the MapReduce process chain. As data is ingested into the HDFS cluster it is chunked into blocks, as previously covered. The Map function reads against the individual blocks of data. For purposes of optimization there are copy, sort and merge functions that aggregate the resulting lists of key/value pairs. This is referred to as the shuffle phase, and it is accomplished by leveraging east/west capacity within the HDFS cluster. From this, the Reduce function reduces the received key/value outputs to a single statement (e.g. cow,31).
In the example above we show a construct to count three words: cow, barn and field. The details for two of the key/value queries are shown; the third is simply an extension of what is shown. From this we can infer that among these records cow appears with field more often than with barn. This is obviously a very simple example with no real practical purpose unless you are analyzing dairy farmers’ diaries, but it illustrates the potential of the clustering approach in facilitating data farms that are well suited to analytics, which lean very heavily on read performance. A minimal sketch of this word count in code follows.
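The sketch below expresses the same word count in the Hadoop Streaming style, where the mapper and reducer are plain scripts reading standard input. The filter to the three words above is an illustrative assumption; in a production cluster the same logic would typically be written against the Java MapReduce API or a higher-level tool, and with Streaming this script would be supplied as both the mapper and the reducer.

```python
#!/usr/bin/env python
# Minimal Hadoop-Streaming-style word count, matching the cow/barn/field example.
# mapper() emits (word, 1) pairs; reducer() sums them per word after the shuffle
# phase has grouped and sorted the keys. The run mode is chosen by the first argument.

import sys
from itertools import groupby

TARGET_WORDS = {"cow", "barn", "field"}   # illustrative filter from the example

def mapper(lines):
    for line in lines:
        for word in line.strip().lower().split():
            if word in TARGET_WORDS:
                print(f"{word}\t1")

def reducer(lines):
    # Streaming delivers the mapper output sorted by key, so consecutive
    # records for the same word can simply be summed.
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    {"map": mapper, "reduce": reducer}[sys.argv[1]](sys.stdin)
```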
In another, more practical example, let’s say that we want to implement an analytics function for customer relationship management. We would want to know how often key words such as ‘refund’ or ‘dissatisfied’, or even terms like ‘crap’ and ‘garbage’, come up in queries of emails, letters or even transcripts of voice calls. Such information is obviously valuable and can give insight into customer satisfaction levels.
As one might guess, things could very quickly get unwieldy when dealing with large numbers of atomic key/value queries. YARN, which stands for ‘Yet Another Resource Negotiator’, allows for the building of complex tasks that are represented and managed by application masters. The application master starts and recycles tasks and also requests resources from the YARN resource manager. As a result, a cycling, self-managing job can be run. Weave is an additional developing overlay that provides more extensive job management functions.

Figure 14

Figure 14. Using Hadoop and Mahout to analyze for credit fraud

The figure above illustrates a practical use of the technology. Here we are monitoring incoming credit card transactions for flagging to analysts. Transaction data is first mapped into flagged key/value pairs; indeed there may be dozens of key/value pairs that are part of this initial operation. This provides consistent input into the rest of the flow. LDA scoring, based on Latent Dirichlet Allocation, allows for a comparative function against the normative set and can also provide a predictive role. This step scores the generated key/value pairs, and at this point LDA provides a percentile of anomaly for a transaction. From there, further logic can impact a given merchant score.
All of this is based on yet another higher-level construct known as Mahout. Mahout provides an orchestration and API library set that can execute a wide variety of operations, such as LDA. Examples are Matrix Factorization, K-Means and Fuzzy K-Means, Logistic Regression, Naïve Bayes and Random Forest, all of which are in essence packaged algorithmic functions that can be performed against the resident data for analytical and/or predictive purposes. Further, these can be cycled, as in the example above, which would operate on each fresh batch presented to it.
Below is a quick definition of each set of functions for reference:

Matrix Factorization –
As its name implies, this function involves factorizing matrices, which is to say finding two or more matrices that, when multiplied together, reproduce (or closely approximate) the original matrix. This can be used to discover latent features between entities. Factoring into more than two matrices requires the use of tensor mathematics, which is more complicated. A good example of its use is in movie popularity and ratings matching, such as is done by Netflix. Film recommendations can be made fairly accurately by identifying these latent features: a subscriber’s ratings, their interests in genres and the ratings of those with similar interests can yield an accurate set of recommended films that the subscriber is likely to enjoy.
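A minimal sketch of the latent-feature idea using a plain NumPy SVD on a tiny, made-up ratings matrix. Mahout itself is a Java library; this Python stand-in is only to illustrate the concept, and the data and rank are invented.

```python
import numpy as np

# Tiny, made-up user x movie ratings matrix (0 = not rated).
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

# Factor the matrix into two low-rank pieces via truncated SVD: user factors
# and item factors over k latent features.
k = 2
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
user_factors = U[:, :k] * s[:k]        # users described by k latent features
item_factors = Vt[:k, :]               # items described by the same features

# Multiplying the factors back together fills in the unrated cells with
# predicted scores -- the basis of a simple recommender.
predicted = user_factors @ item_factors
print(np.round(predicted, 2))
```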

K-Means –
K-Means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. K-Means clustering aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean, which serves as the prototype of the cluster. This results in a partitioning of the data space into what are termed Voronoi cells. These cells are based on common attributes or features that have been identified. This is useful for learning common aspects or attributes of a given population so that it can be subdivided into various sub-populations. From there, things like logistic regression can be run on the sub-populations.
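A minimal K-Means sketch using scikit-learn on synthetic points, just to show the partitioning behavior described above. The data and cluster count are arbitrary assumptions; Mahout’s own K-Means job would run inside the Hadoop cluster rather than in a script like this.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points drawn around three arbitrary centers.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(50, 2))
    for center in ([0, 0], [5, 5], [0, 5])
])

# Partition the points into k=3 clusters; each point is assigned to the
# cluster with the nearest mean, carving the space into Voronoi-like cells.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print("cluster centers:\n", np.round(model.cluster_centers_, 2))
print("first ten labels:", model.labels_[:10])
```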

Fuzzy K-Means –
K-Means clustering is what is termed ‘hard clustering’: data is divided into distinct clusters, where each data element belongs to exactly one cluster. In fuzzy clustering, also referred to as soft clustering, data elements can belong to more than one cluster, and associated with each element is a set of membership levels. These indicate the strength of the association between that data element and a particular cluster. Fuzzy clustering is the process of assigning these membership levels and then using them to assign data elements to one or more clusters. A particular data element can then be rated as to its strongest memberships within the partitions that the algorithm develops.

Logistic Regression –
In statistics, logistic regression is a type of probabilistic statistical classification model. It measures the relationship between a categorical dependent variable and one or more independent variables, which are usually (but not necessarily) continuous, by using probability scores as the predicted values of the dependent variable. Logistic regression is hence used to analyze probabilistic relationships between different variables within a particular set of data.

Naïve Bayes –
In machine learning environments, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong (‘naive’) assumptions of independence between the features; in other words, each feature is assumed to contribute independently of the others. Naive Bayes is a popular baseline method for categorizing text, the problem of judging documents as belonging to one category or another (such as spam or legitimate, classified terms, etc.), with word frequency as a large part of the features considered. This is very similar to the usage and context information provided by Latent Dirichlet Allocation.
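A minimal naive Bayes text-classification sketch in scikit-learn, echoing the spam/legitimate framing above. The tiny training corpus is invented for illustration, and again this is a Python stand-in for the concept rather than Mahout’s own implementation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented corpus: word counts are the features, and naive Bayes assumes
# the words occur independently of one another given the class label.
train_texts = [
    "win a free refund now",               # spam
    "limited offer act now",               # spam
    "meeting agenda for tuesday",          # legitimate
    "please review the attached report",   # legitimate
]
train_labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(train_texts)

classifier = MultinomialNB().fit(features, train_labels)

test = vectorizer.transform(["free refund offer", "tuesday report review"])
print(classifier.predict(test))            # expected: ['spam' 'ham']
```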

Random Forest –
Random Forest is another method for learning and classification over large sets of data, from which further regression techniques can be used. Random forests are in essence constructs of decision trees that are induced in a process known as training. Data is then run through the forest and various decisions are made to learn and classify the data. When building out large forests, subsets of decision trees come into play: weights can be given to each subset, and from those weights further decisions can be made.
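And a matching Random Forest sketch, again with scikit-learn on an invented transaction-like data set, to show training on labeled records and scoring new ones. The features, thresholds and labels are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Invented feature rows: [transaction_amount, hour_of_day, merchant_risk_score]
# labeled 1 for "flag for review" and 0 for "normal".
rng = np.random.default_rng(1)
normal = np.column_stack([rng.normal(40, 15, 200),
                          rng.integers(8, 22, 200),
                          rng.normal(0.2, 0.1, 200)])
flagged = np.column_stack([rng.normal(400, 150, 20),
                           rng.integers(0, 6, 20),
                           rng.normal(0.7, 0.1, 20)])

X = np.vstack([normal, flagged])
y = np.array([0] * 200 + [1] * 20)

# Train a forest of decision trees; each tree sees a bootstrap sample of the
# data and a random subset of features, and the ensemble votes on new records.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

new_records = np.array([[35.0, 14, 0.15], [520.0, 3, 0.8]])
print(forest.predict(new_records))          # likely: [0 1]
print(np.round(forest.predict_proba(new_records), 2))
```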

The end result of all of these methods is a very powerful environment that is capable of machine learning type phenomena. The best part of it is that it is accomplished with off the shelf technologies. No super computer required. Just a solid distributed storage/compute framework and superior east/west traffic capacity in the top of rack environment. Big Data and Analytics can open our eyes to relationships between phenomena that we would otherwise be blind to. It can even provide us insight into causal relationships. But here we need to tread a careful course. Just because two features are related in some way does not necessarily mean that one causes the other.

A word of caution –

While all of this is extremely powerful, the last comment above should raise a flag. Even with lots of data and all of these fancy mathematical tools at your disposal, you can still make some very bad decisions if your assumptions about the meaning of the data are somehow flawed. In other words, good data plus good math with bad assumptions will still yield bad decisions. We also need to remember Mr. Taleb and his black swans. Just because a system has behaved in the past within a certain pattern or range does not mean that it will continue to do so ad infinitum. Examples of such systems range from stock exchanges to planetary orbits to our very own bodies! In essence, most systems exhibit this behavior. Does that mean that all of the powerful tools referred to above are rendered invalid and impotent? Absolutely not. But we must remember that knowledge without context is somewhat useless, and knowledge with incorrect context is worse than ignorance. Why? Because we are confident about what it tells us. We like sophisticated mathematical tools that tell us, oracle-like, what the secrets of knowledge are within a given system. We have confidence in their findings because of their accuracy. But no amount of accuracy will make an incorrect assumption correct. This is where trying to prove ourselves wrong about our assumptions becomes very important.
One might wonder why there are so many methods that sometimes appear to do the same thing but from a different mathematical perspective. The reason is that these various methods are often run in parallel to yield comparative data sets across multiple replicated studies. By generating large populations of comparative sets, another level or hierarchy of trends and relationships becomes visible. Consistency across the sets will generally (but not always) indicate sound assumptions about the original data. Wild variation between sets, in turn, will usually indicate that something is flawed and needs to be revisited. Note that we are now talking about analyzing the analytical results. But this is not always done. Why? Because many times we don’t want to prove our own assumptions wrong. We want them to be right… no, let’s go further – we need them to be right.
A good example is the market crash of 2007-2009. Many folks don’t know it, but there is a little equation that holds a portion of the blame. Well, not really. As it turns out, equations are a lot like guns: they are only dangerous when someone dangerous is using them. The equation in question is the Black-Scholes equation. Some have called it one of the most beautiful equations in mathematics; it is a very elegant piece. Others knew it by another name, the Midas equation. It made folks a ton of money! That is, until…
The Black-Scholes equation was an attempt to bring rationality to the futures market. This sounds good, but it is based on the concept of creating a systematic method for establishing a value for options before they mature. This also might not be a bad thing if your assumptions about the market are correct. But if there are things that you don’t know (and there always are), then those blind spots can affect your assumptions in an adverse way. As an example, if you are trading on the futures of a given commodity and something happens in the market to affect demand that you did not consider, or whose impact you weighed incorrectly, then guess what… That’s right, you lose money!
In the last market crash that commodity was real estate. As one looks into the detailed history of the crash, we can see multiple flawed assumptions that built upon one another. Then, to compound the problem, the market began to create obscurity through the use of blocks or bundles of mortgages that had absolutely no window into the risk factors associated with those assets. The banks were buying blind, but they were of the mind that foreclosures would be a minority and that a foreclosed home could always be sold for the loan value or perhaps greater. To the banks it seemed that they couldn’t lose. We all know what happened. Even though the mathematics was elegant and accurate, the conclusions and the advice given as a result were drastically flawed and cost the market billions. The lesson: Big Data can lead us astray. It reminds us of the flawed premise of Laplace’s rather arrogant comment back in 1814. There is always something we don’t know about a given system, such as a scope of history that we do not know, or levels of detail that are unknown to us or perhaps even beyond our measurement. This does not disable data analytics, but it puts a limit on its tractability in dealing with real-world systems. In the end Big Data does not replace good judgment, but it can complement it.

So how do I build it and how do I use it?

Hadoop is actually fairly easy to install and set up, and the major vendors in this space have gone much further in making it easy and manageable as a system. But there are a few general principles that should be followed. First, be sure to size your Hadoop cluster and maintain that sizing ratio as the cluster grows. The basic formula is 4 x D, where D is the data footprint to be analyzed. Now one might say, ‘What? I have to multiply my actual storage requirements by a factor of four!?’ But do not forget about the MapReduce flow. The shuffle phase requires datanodes that act as transitory nodes for the job flow, and this extra space needs to be available. So while it might be tempting to float this number, it’s best not to. A quick back-of-the-envelope sketch of this rule appears below, followed by a few other design recommendations to consider.
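As a rough illustration only; the data footprint and per-node disk figures below are invented placeholders, not sizing guidance.

```python
# Back-of-the-envelope HDFS sizing per the 4 x D rule of thumb above.
# All input figures are illustrative placeholders.

data_footprint_tb = 50          # D: raw data to be analyzed
sizing_factor = 4               # stored blocks plus transitory shuffle space, per the text
usable_disk_per_node_tb = 24    # disk available to HDFS on each datanode

required_capacity_tb = data_footprint_tb * sizing_factor
datanodes_needed = -(-required_capacity_tb // usable_disk_per_node_tb)   # ceiling division

print(f"required HDFS capacity: {required_capacity_tb} TB")
print(f"datanodes needed (at {usable_disk_per_node_tb} TB usable each): {int(datanodes_needed)}")
```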

Figure 15

Figure 15. Hadoop design recommendations

Another issue to consider is the sizing of the individual datanodes within the HDFS cluster. This is actually a soft set of recommendations that depends greatly on the type of analytics. In other words, are you looking to gauge customer satisfaction, or to model climate change or the stock market? These differ from one another by many degrees of complexity, so it is wise to think about your end goals for the technology. Below is a rough sizing chart that provides some higher-level guidance.

Figure 16

Figure 16. Hadoop HDFS Sizing Guidelines

Beyond this, it is wise to refer to the specific vendor’s design guidelines and requirements, particularly in the areas of high availability for master node services.
Another question that might be asked is, “How do I begin?” In other words, you have installed the cluster and are ready for business, but what do you do next? Actually, this is very specific to usage and expectations, but we can at least boil it down to a general cycle of data ingestion, analytics and corresponding actions. This is really very similar to well-known systems management theory. A diagram of such a cycle is shown below.

Figure 17

Figure 17. The typical data analytics cycle

Aside from the workflow detail, it cannot be stressed enough: “Know your data.” If you do not know it, then make sure that you are working very closely with someone who does. The reason for this is simple. If you do not understand the overall meaning of the data sets that you are analyzing, then you are unlikely to be able to identify the initial key values that you need to focus on. Very often data analytics is done on a team basis, with individuals from various backgrounds within the organization, and the data analytics staff work in concert with this disparate group to identify the key questions that need to be asked as well as the key data values that will help lead toward the construction of an answer to the query. Remember that comparative sets allow for the validation of both the assumptions made about the data model and the techniques being used to extract and analyze the data sets in question. While it is tempting to jump to conclusions on initial findings, it is always wise to do further studies to validate those findings, particularly if a key strategic decision will result from the analysis.

In summary

We have looked at the history of analytics from its founding fathers to its current state. Throughout, many things have remained consistent. This is comforting. Math is math. Four plus four in Galileo’s time yields the same answer as it does today. But we must remember that math is not the real world; it is merely our symbolic representation of it. This was shown by the various discoveries on the aspects of randomness, chaos and infinitudes. We have gone on to show that the proper manipulation of large sets of data, placed against a historical context, can yield insights that might not otherwise be apparent. Recent trends are to establish methods to visualize the data and the resulting analytics through graphic displays. Companies such as Tableau provide the ability to generate detailed charts and graphs that give a visual view of the results of the various analytic functions noted above. Now a long table or spreadsheet of numbers becomes a visible object that can be manipulated and conjectured against. Patterns and trends can be much more easily picked out and isolated for further analysis. These and other trends are accelerating in the industry and becoming more and more available to the common user or enterprise.
We also talked about the high east/west traffic profiles that are required to support Hadoop distributed data farms and the work that Avaya is doing to facilitate this in the data center top-of-rack environment. We talked about the relatively high costs of leaf-spine architectures and Avaya’s approach to the top-of-rack environment as the data farm expands. Lastly, we spoke to the need for security in data analytics, particularly in the analysis of credit card or patient record data. Avaya’s Stealth Networking Services can effectively provide a cloak of invisibility over the analytics environment. This creates a Stealth Analytics environment in which the analysis of sensitive data can occur with minimal risk.
We also looked at some of the nuts and bolts of analytics and how, once data is teased out, it may be analyzed. We spoke to various methods and procedures, which are often worked in concert to yield comparative data sets. These comparative data sets can then be used to check assumptions made about the data and hence the analytic results; they help us measure the validity of the analytics that have been run or, more importantly, of the assumptions we have made. In this vein we wrapped up with a word of warning about the use of big data and data analytics. It is not a panacea, nor is it a crystal ball, but it can provide us with vast insights into the meaning of the data that we have at our fingertips. With these insights, if the foundational assumptions are sound, we can make decisions that are better informed. It can also enable us to process and leverage the ever-growing data that we have at our disposal at the pace required for it to be of any value at all! Yet in all of this we are only at the beginning of the trail. As computing power increases and our algorithmic knowledge of systems grows, the technology of data science will reap larger and larger rewards. But it is likely never to provide the foundation for Laplace’s dream.


‘Dark Horse’ Networking – Private Networks for the control of Data

September 14, 2013

Next Generation Virtualization Demands for Critical Infrastructure and Public Services



In recent decades communication technologies have realized significant advancement. These technologies now touch almost every part of our lives, sometimes in ways that we do not even realize. As this evolution has and continues to occur, many systems that have previously been treated as discrete are now networked. Examples of these systems are power grids, metro transit systems, water authorities and many other public services.

While this evolution has brought very large benefits to both those managing and those using the services, there is the rising spectre of security concerns and a precedent of documented attacks on these systems. This has brought about strong concerns about this convergence and what it portends for the future. This paper will begin by discussing these infrastructure environments, which, while varied, have surprisingly common theories of operation and actually use the same set or class of protocols. Next we will take a look at the security issues and some of the reasons why they exist. We will provide some insight into some of the attacks that have occurred and what impact they have had. Then we will discuss the traditional methods for mitigation.

Another class of public services is more focused on the consumer space but can also be used to provide services to ‘critical’ devices. This mix and mash of ‘cloud’ within these areas is causing a rise in concern among security and risk analysts. The problem is that the trend is well under way. It is probably best to start by examining the challenges of a typical metro transit service. Obviously the primary need is to control the trains and subways. These systems need to be isolated, or at the very least very secure. The transit authority also needs to provide fare services, employee communications and, of course, public internet access for passengers. We will discuss these different needs and the protocols involved in providing these services. Interestingly, we will see some common paradigms of reasoning as we do this review, and these will in turn reveal many of the underlying causes of vulnerability. We will also see that as these different requirements converge onto common infrastructures, conflicts arise that are often resolved by building completely separate network infrastructures. This leads to increasing cost and complexity, as well as increasing risk of the two systems being linked at some point in a way that would be difficult to determine. It is here where the backdoor of vulnerability can occur. Finally, we will look at new and innovative ways to address these challenges and how they can take our infrastructure security to a new level without abandoning the advancement that remote communications has offered. The fact is, sometimes you do NOT want certain systems and/or protocols to ‘see’ one another, or at the very least there is the need to have very firm control over where and how they can see one another and inter-communicate. So, this is a big subject and it straddles many different facets. Strap yourself in; it will be an interesting ride!

Supervisory Control and Data Acquisition (SCADA)

Most process automation systems are based on a closed loop control theory. A simple example of a closed loop control theory is a gadget I rigged up as a youth. It consisted of a relay that would open when someone opened the door to my room. The drop in voltage would trigger another relay to close causing a mechanical lever to push a button on a camera. As a result I would get a snapshot of anyone coming into my room. It worked fairly well once I worked out the kinks (they were all on the mechanical side by the way). With multiple siblings it came in handy. This is a very simple example of a closed loop control system. The system is actuated by the action of the door (data acquisition) and the end result is the taking of a photograph (control). While this system is arguably very primitive it still demonstrates the concept well and we will see that the paradigm does not really change much as we move from 1970’s adolescent bedroom security to modern metro transit systems.

In the automation and control arena there are a series of defined protocols, some standards-based and some proprietary in nature. These protocols are collectively referred to as SCADA, which is short for Supervisory Control and Data Acquisition. Examples on the vendor-originated side are Modbus, BACnet and LonWorks, while industry standard examples are IEC 61131 and IEC 60870-5-101 (IEC101). Using the established simple example of a closed loop control, we will take the concept further by looking at a water storage and distribution system. The figure below shows a simple schematic of such a system, and it demonstrates the concepts of SCADA effectively. We will then use that basis to extend it further to other uses.

Figure 1

Figure 1. A simple SCADA system for water storage and distribution

The figure above illustrates a closed loop system. Actually, it is comprised of two closed loops that exchange state information between them. The central element of the system is the water tank (T). Its level is measured by sensor L1 (which could be as simple as a mechanical float attached to a potentiometer). As long as the level of the tank is within a certain range it will keep the LEVEL trace ON. This trace is provided to a device called a Programmable Logic Controller (PLC) or Remote Terminal Unit (RTU); in the diagram it is provided to PLC2. As a result, PLC2 sends a signal to a valve servo (V1) to keep it in the OPEN state. If the level were to fall below a defined value in the tank, then the PLC would turn the valve off. There may be additional ‘blow off’ valves that the PLC might invoke if the level of the tank grew too high, but this would be a precautionary emergency action. In normal working conditions this would be handled by the other closed loop. In that loop there is a flow meter (F1) that provides feedback to PLC1. As long as PLC1 is receiving a positive flow signal from the sensor it will keep the pump (P1) running and hence feeding water into the system. If the rate on F1 falls below a certain value, then it is determined that the tank is nearing full and PLC1 will tell the pump to shut down. As an additional precaution there may be an alternate feed from sensor L1 that will flag the pump to shut down only if the tank level reaches full. This is known as a second loop failsafe.
As a result, we have a closed loop, self-monitoring system that in theory should run on its own without any human intervention. Such systems do. But they are usually monitored through Human Machine Interfaces (HMI). In many instances these will literally show the schematic of the system with a series of colors (for example yellow for off, orange and red for warning and alarm, green for running). In this way, an operator has visibility into the ‘state’ of the working system. HMIs can also offer human control of the system. As an example, an operator might shut off the pump and override the valve closed to drain the system for maintenance. So in that example the closed loop would be extended to include a human who could provide an ‘ad hoc’ input to the system.
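As a minimal sketch of the two-loop logic just described (the thresholds and tag names are invented for illustration, and a real PLC/RTU would express this in ladder logic or IEC 61131 structured text rather than Python), the control rules might look like this:

```python
# Minimal sketch of the two closed loops in Figure 1.
# Thresholds and tag names are invented for illustration only.

LEVEL_LOW = 20.0      # % tank level below which the outlet valve closes
LEVEL_FULL = 95.0     # % tank level at which the second-loop failsafe trips
FLOW_MIN = 5.0        # flow rate below which the tank is considered near full

def plc2_valve_control(tank_level_pct):
    """Loop 2: keep valve V1 open while the tank holds enough water."""
    return "OPEN" if tank_level_pct >= LEVEL_LOW else "CLOSED"

def plc1_pump_control(flow_rate, tank_level_pct):
    """Loop 1: run pump P1 while water is flowing, with a level failsafe from L1."""
    if tank_level_pct >= LEVEL_FULL:      # second-loop failsafe fed from sensor L1
        return "OFF"
    return "ON" if flow_rate >= FLOW_MIN else "OFF"

# One pass of the control scan with sample sensor readings.
level, flow = 63.0, 12.5
print("V1:", plc2_valve_control(level), " P1:", plc1_pump_control(flow, level))
```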

The utility of these protocols is obvious. They control everything from water supplies to electrical power grid components. They are networked, and need to be, due to the very large geographic areas they are often required to cover. This is as opposed to my bedroom security system (it was never really about security: it was just a kick to get photos of folks who were unaware), which was a ‘discrete’ system. In such a system the elements are hardwired and physically isolated. It is hard to get into such a room and circumvent the system; one would literally have to climb in through the window. This offers a good analogy for what SCADA-like systems are experiencing. But one also has to realize that discrete systems are very limited. As an example, it would be a big stretch to use a discrete system to manage a municipal water supply; one could argue it would be so costly as to make no sense. So SCADA systems are a part of our lives. They can bring great benefit, but there is still the spectre of security vulnerability.

Security issues with SCADA

Given that SCADA systems are used to control facilities such as oil, power and public transportation, it is important to ensure that they are robust and have connectivity to the right control systems and staff. In other words, they must be networked. Many SCADA implementations are Layer 2 only, using Ethernet for transport. More recently, TCP/IP extensions to SCADA allow for true Internet connectivity. One would think that this is where the initial security concerns would lie, but in reality they are just a further addition to the systems' vulnerabilities. There are a number of reasons for this.

First, there was a general lack of concern for security because many of these environments were at one time fairly discrete. As an example, a PLC is usually used in local control scenarios. A Remote Terminal Unit does just what its name says: it creates a remote PLC that can be controlled over the network. While this extension of geography has obvious benefits, with it creeps in a window of unauthorized access.

Second, there was and still is a general belief that SCADA systems are obscure and not well known. Their protocol constructs are not widely published, particularly in the proprietary versions. But as is well known, ‘security by obscurity’ is a partial security concept at best, and many true security specialists would say it is a flawed premise.

Third, these systems initially had no connectivity to the Internet, but this is changing. Worse yet, it does not have to be the SCADA system itself that is exposed; all an attacker needs is access to some system that can in turn reach it. This creates a much larger problem.

Finally, because these networks are physically secure, it was assumed that some form of cyber-security was realized, but as the previous point shows this is a flawed and dangerous assumption.

Given that SCADA systems control some of our most sensitive and critical infrastructure, it should be no surprise that there have been several attacks. One example involved a SCADA-controlled sewage system where a disgruntled ex-employee gained access and reversed certain control rules. The end result was a series of sewage flooding events into local residential and park areas. Initially it was thought to be a system malfunction, but eventually the intrusion was discovered and the culprit was caught. The problem can even reach international scale: as critical systems such as power grids become networked, the security concern can grow to the level of a national security interest.

While these issues are not new, they are now well known. Security by obscurity is no longer a viable option; systems isolation is the only real answer to the problem.


The Bonjour Protocol

On the other side of the spectrum we have a service, often required at public locations, that is the antithesis of the prior discussion. This is a protocol that WANTS services to be visible. The protocol is known as Bonjour. Created by Apple™, it is an openly published implementation of zero-configuration networking (multicast DNS with DNS Service Discovery) that allows for service resolution. Again it is best to give a case-in-point example. Let’s say that you are a student at a university and you want to print a document from your iPad. You simply hit the print icon, and the Bonjour service sends an SRV query for @PRINTER to the Bonjour multicast group address (mDNS uses the well-known group on UDP port 5353). The listener on that multicast group is the Bonjour DNS resolution service, which replies to the request with a series of local printer resources for the student to use. To go further, if the student were to look for an off-site resource such as a software upgrade or application, the Bonjour service would respond and provide a URL to an Apple download site. The diagram below shows a simple Bonjour service exchange.

Figure 2

Figure 2. A Bonjour Service Exchange

Bonjour also has a way for services to ‘register’ themselves. A good example, as shown above, is the case of iMusic. As can be seen, the player system can register to the local Bonjour service as @Musicforme. Now when a user wishes to listen, they simply query the Bonjour service for @Musicforme and the system responds with the URL of the player. This paradigm has obvious merits in the consumer space, but we need to realize that the consumer space is rapidly spilling over into the IT environment. This is the trend we typically hear of as ‘Bring Your Own Device’, or BYOD. The university example is easy to see, but many corporations and public service agencies are dealing with the same pressures. Additionally, some true IT-level systems now implement the Bonjour protocol as an effective way to advertise services and/or locate and use them. As an example, some video surveillance cameras use the Bonjour service for software upgrades or discovery. Take note that Bonjour really has no conventions for security beyond the published SRV record. All of this has the security world in a maelstrom. In essence, we have disparate protocols, evolving out of completely different environments for totally different purposes, coming to nest in a shared environment that can be of a very critical nature. This has the makings of a Dan Brown novel!
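As a rough illustration of the kind of lookup described above, the sketch below browses for printer services over mDNS/DNS-SD, the protocol family that Bonjour implements. It assumes the third-party python-zeroconf package is installed, and the ‘_ipp._tcp.local.’ service type is an example stand-in for the ‘@PRINTER’ shorthand used in the text.

# Sketch: browse for printer services via mDNS/DNS-SD (what Bonjour implements).
# Assumes the third-party 'zeroconf' package (pip install zeroconf).
import time
from zeroconf import ServiceBrowser, Zeroconf

class PrinterListener:
    """Plain listener object; ServiceBrowser calls these methods as answers arrive."""
    def add_service(self, zc, type_, name):
        info = zc.get_service_info(type_, name)   # resolves the SRV/TXT/A records
        print("Found printer service:", name, info)

    def remove_service(self, zc, type_, name):
        print("Printer service went away:", name)

    def update_service(self, zc, type_, name):
        pass

if __name__ == "__main__":
    zc = Zeroconf()
    browser = ServiceBrowser(zc, "_ipp._tcp.local.", PrinterListener())
    try:
        time.sleep(5)   # listen briefly for multicast answers on
    finally:
        zc.close()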



Meanwhile, back at the train station…

Let’s now return to our Transit Authority, which runs a high speed commuter rail service as part of its offerings. As part of this service it offers business services such as Internet access and local business office services such as printing and scanning. It also has a SCADA system to monitor and control the railways. In addition it obviously has a video surveillance system and, you guessed it, those cameras use the Bonjour service for software upgrades and discovery. It also has the requirement to run Bonjour for the business services as well.

In legacy approaches the organization would need to either implement totally separate networks or build a multi-service architecture using Multi-Protocol Label Switching (MPLS). MPLS is an incredibly complex suite of protocols with well known, and high, CapEx and OpEx requirements; running an MPLS network is probably one of the most challenging financial endeavors an IT organization can take on. The figure below illustrates the complexity of the MPLS suite. Note that it also shows a comparison to Shortest Path Bridging (IEEE 802.1aq and RFC 6329), as well as the IETF drafts that extend L3 services across the Shortest Path Bridging fabric.

Figure 3

Figure 3. A comparison between MPLS and SPB

There are two major points to note. First, there is a dramatic consolidation of the dependent overlay control planes into a single integrated one provided by IS-IS. Second, as a result of that consolidation, the mutual dependence of the service layers is broken apart into mutually independent service constructs. An underlying benefit is that services become extremely simple to construct and provision. Another benefit is that these service constructs are correspondingly simpler from an elemental perspective. Rather than requiring a complex and coordinated set of service overlays, SPB/IS-IS provides a single integrated service construct known as the I-Component Service ID, or I-SID.

In previous articles we have discussed how an I-SID is used to emulate end-to-end L2 service domains as well as true L3 IP VPN environments. Additionally, we covered how I-SIDs can be used dynamically to provide solicited, on-demand services for IP multicast. In this article we will focus on their inherent traits of service separation and control, and on how these traits can be used to enhance a given security practice.

For this particular project we developed the concept of three different network types. Each network type is used to provide for certain protocol instances that require services separation and control. They are listed as follows:

1). Layer 3 Virtual Service Networks

These IP VPN services are used to create general network access for office services and Internet access.

2). Local User Subnets (within the L3 VSN)

These are local L2 broadcast domains that provide normal ‘guest’ Internet access for railway passengers. These networks can also support ‘localized’ Bonjour services for the passengers, but the service is limited to the station scope and is not allowed to be advertised or resolved outside of that local subnet boundary.

3). Layer 2 Virtual Service Networks

These L2 domains are used at a more global level. Due to SPB’s ability to extend L2 service domains across large geographies without the need for end-to-end flooding, L2 VSNs become very useful for supporting extended L2 protocol environments. Here we are using dedicated L2 VSNs to support both the SCADA and Bonjour protocols. Each protocol enjoys a private, non-IP-routed L2 environment that can be placed anywhere within the end-to-end SPB domain. As such, they provide globally separated L2 service domains simply by not assigning IP addresses to the VLANs. IP can still run over the environment, as Bonjour requires it, but that IP network will not be visible or reachable within the IS-IS link state database (LSDB) via VRF0. A simple, illustrative sketch of the overall service plan follows.
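To tie the three network types together before looking at the figure, here is a purely illustrative sketch of the service plan captured as data. The I-SID values, names and scopes are assumptions invented for the example, not values from the actual deployment.

# Illustrative service plan for the transit example; all values are assumptions.
from dataclasses import dataclass

@dataclass
class VirtualServiceNetwork:
    name: str
    kind: str        # "L3 VSN", "local subnet" or "L2 VSN"
    i_sid: int       # service identifier carried in the SPB fabric
    ip_routed: bool  # whether the service is visible to the routed IP plane
    scope: str       # where the service is allowed to extend

SERVICE_PLAN = [
    VirtualServiceNetwork("employee-services", "L3 VSN", 20010, True, "all stations + data center"),
    VirtualServiceNetwork("passenger-internet", "L3 VSN", 20020, True, "stations, default route to Internet only"),
    VirtualServiceNetwork("passenger-bonjour", "local subnet", 30001, True, "single station only"),
    VirtualServiceNetwork("scada-rail", "L2 VSN", 40001, False, "inter-station, non-IP routed"),
    VirtualServiceNetwork("camera-bonjour", "L2 VSN", 40002, False, "inter-station, non-IP routed"),
]

for vsn in SERVICE_PLAN:
    print(f"{vsn.name:20s} {vsn.kind:12s} I-SID {vsn.i_sid}  routed={vsn.ip_routed}  scope={vsn.scope}")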

Figure 4

Figure 4. Different Virtual Service Networks to provide for separation and control.

The figure above illustrates the use of these networks in a symbolic fashion. As can be seen, there are two different L3 VSNs. The blue L3 VSN is used for internal transit authority employees and services. The red L3 VSN is used for railway passenger Internet access. Note that there are two things of significance here. First, this is a one-way network for these users: they are given a default gateway to the Internet and that is it. There is no connectivity from this L3 VSN to any other network or system in the environment. Second, each local subnet also allows for local Bonjour services, so users can use their various personal device services without concern that they will reach beyond the local station or interfere with any other service at that station.

There are then two L2 VSNs used to provide inter-station connectivity for the transit authority's own use. The green L2 VSN provides the SCADA protocol environment, while the yellow L2 VSN provides for the Bonjour protocol. Note that unlike the passengers' Bonjour L2 service domains, this L2 domain can be distributed not only within the stations but between them as well. As a result, we have five different types of service domains, each one separated, scoped and controlled over a single network infrastructure. A passenger at a station who brings up a Bonjour client will only see other local resources, not any of the video surveillance cameras that also use Bonjour but do so in a totally separate L2 service domain with absolutely no connectivity to any other network or service. Note also that the station clerk has a totally separate network service environment that gives them confidential access to email, UC and other internal applications that tie back into central data center resources. In contrast, the passengers at the station are provided Internet access only, for general browsing or VPN usage. This leaves no viable vector for a would-be attacker into the transit authority's internal systems.

Now the transit authority enjoys the ability to deploy these service environments at will, anywhere they are required. Additionally, if requirements for new service domains come up (entry and exit systems, for example), they can be easily created and distributed without a major upheaval of the networks that have already been provisioned.


Seeing and Controlling are two different things…

Sometimes one service can step on another. High-bandwidth, resource-intensive services such as multicast-based video surveillance can tend to break latency-sensitive services such as SCADA. In a different example project, these two applications were in direct conflict: the IP multicast environment was unstable, causing loss of camera feeds and recordings in the video surveillance application, and the SCADA-based traffic light control systems experienced daily outages. A traditional PIM protocol overlay requires multiple state machines that run in the CPU. Additionally, these state machines are full time, meaning they need to consider each IP packet separately and forward accordingly. For multicast packets there is an additional state machine requirement where there may be various modes of behavior based on whether the node is handling a source or a receiver and whether or not the tree is currently established or being extended. These state machines are complex and they must run for every multicast group being serviced.

Figure 5

Figure 5. Legacy PIM overlay

Each PIM router needs to perform this hop-by-hop computation, and this needs to be done by the various state machines in a coordinated fashion. In most applications this is acceptable. As an example, for IP television delivery there is a relatively high probability that someone is watching the channels being multicast (if not, they are usually promptly identified and removed; ratings will determine the most viewed groups). In this model, if there is a change to group membership, it is minor and at the edge, minor in the sense that one single IP set-top box has changed the channel. The point here is that this is a minor topological change to the PIM tree and might not even impact it at all. Also, the number of sources is relatively small compared to the community of viewers (200 to 500 channels to thousands, if not tens of thousands, of subscribers).

The problem with video surveillance is that this model reverses many of these assumptions, and this causes havoc with PIM. First, the ratio of sources to receivers is reversed, and the degree of the ratio changes as well. As an example, in a typical surveillance project of 600 cameras there can be as many as 1,200 sources, with transient spikes that go higher during state transitions. Additionally, video surveillance applications typically exhibit the phenomenon of ‘sweeps’, where a given receiver that is currently viewing a large group of cameras (16 to 64) will suddenly change and request another set of groups.

At these points the amount of required state change in PIM can be significant. Further, there may be multiple instances of this occurring at the same time in the PIM domain. These instances could be humans at viewing consoles, or they could be DVR-type resources that automatically sweep through sets of camera feeds on a cyclic basis. So, as we can see, this can be a very heavy-lift application for PIM, and tests have validated this. SPB offers a far superior method for delivering IP multicast.
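To get a feel for the scale of the problem, the back-of-the-envelope sketch below estimates the join/prune churn a handful of simultaneous sweeps can generate. Every figure in it is an illustrative assumption rather than a measurement.

# Back-of-the-envelope estimate of PIM (S,G) churn during camera 'sweeps'.
# All figures are illustrative assumptions, not measurements.

cameras            = 600     # sources in the surveillance system
streams_per_camera = 2       # e.g. a main and a sub stream, roughly 1,200 (S,G) pairs
wall_size          = 64      # groups viewed per console or DVR
sweepers           = 10      # consoles/DVRs cycling through camera sets at once
hops_per_tree      = 5       # average PIM routers touched per distribution tree

sg_pairs  = cameras * streams_per_camera
per_sweep = wall_size * 2            # a prune for the old set and a join for the new set
events    = per_sweep * sweepers * hops_per_tree

print("(S,G) state entries in steady state:", sg_pairs)
print("Join/prune events per simultaneous sweep cycle:", events)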

Now let us consider the second application: the use of SCADA to control traffic lights, often referred to as Intelligent Traffic Systems or ITS. Like all closed loop applications, there is a failsafe state, which is the familiar red and yellow flashing lights that we see occasionally during storms and other impediments to the system. This ensures that a traffic light will never fail in a state of permanent green or permanent red. As soon as communication times out, the failsafe loop is engaged and maintained until communications are restored.

During normal working hours the traffic light is obviously controlled by some sort of algorithm. At certain high-volume intersections this algorithm may be very complex and based on the hour of the day. In most other instances the algorithm is rather dynamic and based on demand. This is accomplished by placing a sensing loop at the intersection (older systems were weight based, while newer systems are optical). As a vehicle pulls up to the intersection, its presence is registered and a ‘wait set’ period is engaged. This presumably allows enough time for passing traffic to move through the intersection. At rural intersections this wait set period will be ‘fair’: each direction will have an equal wait set. In urban situations, where minor roads intersect with major routes, the wait set period will strongly favor the major route, with a relatively long wait set period for the minor road. The point of all this is that these loops are expected to be fairly low latency and there is not expected to be much loss in the transmission channel. Consequently, SCADA tends towards very small packets that expect a very fast round trip with minimal or no loss. You can see where I am going here: the two applications do not play well together. They require separation and control.
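The sketch below captures the behavior just described: a demand-driven wait set biased toward the major route, plus a failsafe that drops the intersection to flashing operation when the SCADA channel times out. The timing values are illustrative assumptions.

# Sketch of a demand-driven intersection controller with a comms failsafe.
# Wait-set values and the timeout are illustrative assumptions.
import time

COMMS_TIMEOUT  = 10.0   # seconds without SCADA traffic before failsafe engages
MAJOR_WAIT_SET = 10.0   # seconds a minor-road demand must wait before being served

def controller_state(last_scada_rx, minor_demand, demand_age, now):
    """Return the intersection state for one control pass."""
    if now - last_scada_rx > COMMS_TIMEOUT:
        return "FAILSAFE_FLASH"          # red/yellow flash until comms restored
    if minor_demand and demand_age >= MAJOR_WAIT_SET:
        return "MINOR_GREEN"             # serve the waiting vehicle on the minor road
    return "MAJOR_GREEN"                 # default: favour the major route

if __name__ == "__main__":
    now = time.time()
    print(controller_state(last_scada_rx=now - 2.0, minor_demand=True, demand_age=12.0, now=now))
    print(controller_state(last_scada_rx=now - 30.0, minor_demand=False, demand_age=0.0, now=now))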

Figure 6

Figure 6. Separation of IP multicast and Scada traffic by the use of I-SIDs

As was covered in a previous article (circa June 2012) and also shown in the illustration above, SPB dynamically builds I-SIDs with values greater than 16M to establish IP multicast distribution trees. Each multicast group uses a discrete, individual I-SID to create a deterministic reverse path forwarding environment. Note also that SCADA is delivered via a discrete L2 VSN that is not enabled for IP multicast, or for any IP configuration for that matter. As a result, the SCADA elements are totally separated from any IP multicast or unicast activity. There is no way for traffic from the global IP route table or the IP VPN environment to be forwarded into the SCADA L2 VSN; there is simply no IP forwarding path available. The figure above illustrates a logical view of the two services.
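As a purely hypothetical sketch of the kind of mapping described, the allocator below hands each new multicast stream its own data I-SID above the 16M boundary. The allocation scheme is invented for illustration and is not the vendor's actual algorithm.

# Illustrative allocator: each new multicast stream gets its own data I-SID
# above the 16M boundary. This is a made-up scheme, not the actual algorithm.

DATA_ISID_BASE = 16_000_000
_allocated = {}   # (source, group) -> I-SID

def data_isid_for(source, group):
    """Return the per-stream I-SID, allocating a new one on first sight."""
    key = (source, group)
    if key not in _allocated:
        _allocated[key] = DATA_ISID_BASE + len(_allocated) + 1
    return _allocated[key]

print(data_isid_for("", ""))   # e.g. 16000001
print(data_isid_for("", ""))   # e.g. 16000002

The important point is simply that every stream rides its own service identifier, so the fabric can build each distribution tree deterministically.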

The end result of the conversion changed the environment drastically. Since then they have not lost a single camera or had any issues with SCADA control. This is a direct testament to the forwarding plane separation that occurs with SPB. As such, both applications can be supported with no concern that one will ‘step on’ the other. It also enhances security for the SCADA control system. As there is no IP configuration on the L2 VSN (note that IP can still ‘run’ within the L2 VSN, as is possible with the SCADA HMI control consoles), there is no viable path for spoofing or launching a DoS attack.

What about IP extensions for SCADA?

As was mentioned earlier in the article, there are methods to provide TCP/IP extensions for SCADA. Due to the criticality of these systems, however, they are seldom used because of the cost of securing the IP network from threat and risk. As with any normal IP network, protecting it to the required degree is difficult and costly, particularly since the intention of the protocol overlay is to provide things like mobile and remote access to the system. Doing this with traditional legacy IP networking would be a big task.

With SPB, L3 VSNs can be used to establish a separated IP forwarding environment that is then directed to appropriate secure ‘touch points’ at predefined points in the network topology. Typically this will be a data center or a secured DMZ adjunct to it. There, all remote access is facilitated through a well defined series of security elements: firewalls, IPS/IDS and VPN service points. As this is the only valid ingress into the L3 virtual service environment, it is much easier and less costly to monitor and mitigate any threats to the system, with clear forensics in the aftermath. The illustration below shows this concept. The message is that while SPB is not a security technology in and of itself, it is clearly a very strong complement to those technologies. Used properly, it can provide the first three of the ‘series of gates’ in the layered defense approach. The diagram below shows how this operates.

Figure 7

Figure 7. SPB and the ‘series of gates’ security concept

In a very early article on this blog I talked about the issues and paradigms of trust and assurance (see Aspects and characteristics of Trust and its impact on Human Dynamics and E-Commerce – June 2009). There I introduced the concept of composite identities and the fact that all identities in cyberspace are composite. This basic concept is rather obvious when it speaks to elemental constructs of device/user combinations, but it gets smeared when the concept extends to applications or services, or further to elements such as location or the systems a user is logged into. These are all elements of a composite instance of a user, and they are contained within a space/time context. As an example, I may allow user ‘A’ to access application ‘A’ from location ‘A’ with device ‘A’. But any other location, device or even time combination may warrant a totally different authentication and consequent access approach. This composite approach is very powerful, particularly when combined with the rather strong path control capabilities of SPB. The combination yields an ability to determine network placement based on user behavior patterns: those expected and within profile, but more importantly those that are unusual and outside the normal user profile. These instances require additional challenges and consequent authentications.

As noted in the figure above, the series-of-gates concept merges well with this construct. The first gate provides identification of a particular user/device combination. From this elemental composite, network access is provided according to a policy. From there the user is limited to the particular paths that provide access for a normal profile. As a user goes to invoke a certain secure application, the network responds with an additional challenge. This may be an additional password, or perhaps a secure token and biometric signature, to reassure identity for the added degree of trust. This is all normal. But in the usual environment the access is granted at the systems level, thereby increasing the ‘smear’ of the user’s identity. The critical difference in the approach I am referring to is that the whole network placement profile of the user changes. In other words, in the previous network profile the system that provides the application is not even available by any viable network path. It is by the renewal of challenge and additional tiers of authentication that such connectivity is granted. Note that I do not say access but connectivity. Certainly systems access controls would remain, but by and large they would be the last and final gate. At the user edge, whole logical topology changes occur that place the user into a dark horse IP VPN environment where secure access to the application can be obtained.
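A minimal sketch of a gate decision follows, assuming a hypothetical policy table: the same user may receive a different network placement, or be asked for step-up authentication, depending on the device, location and time components of the composite identity. The attribute names and placements are invented for the example.

# Sketch of composite-identity placement; the policy and attribute names
# are hypothetical, chosen only to illustrate the 'series of gates' idea.
from dataclasses import dataclass

@dataclass(frozen=True)
class Composite:
    user: str
    device: str
    location: str
    hour: int        # 0-23, local time

def place(identity):
    """Return (network_placement, step_up_auth_required)."""
    in_profile = (
        identity.device == "corp-laptop"
        and identity.location == "station-office"
        and 7 <= identity.hour <= 19
    )
    if not in_profile:
        # Out-of-profile composites get guest placement and a renewed challenge.
        return ("guest-l3-vsn", True)
    return ("employee-l3-vsn", False)

print(place(Composite("clerk01", "corp-laptop", "station-office", 9)))    # normal profile
print(place(Composite("clerk01", "personal-phone", "home", 23)))          # step-up required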

Wow! The noise is gone

In this whole model something significant occurs. Users are now in communities of interest where only certain traffic pattern profiles are expected. As a result, zero-day alerts from anomaly-based IPS/IDS systems become something other than white noise. They become very discrete resources with an expected monitoring profile, and any anomalies outside of that profile will flag as a true alert that should be investigated. This enables zero-day threat systems to work far more optimally, as their theory of operation is to look for patterns outside of the behaviors normally seen in the network. SPB complements this by keeping communities strictly separate when required. With a smaller, isolated community it is far easier to use such systems accurately. The diagram below illustrates the value of this virtualized security perimeter. Note how every end point is logically on the ‘outer’ network connectivity side. Even though I-SIDs traverse a common network footprint, they are ‘ships in the night’: they never see one another or have the opportunity to inter-communicate except by formal, monitored means.

Figure 8

Figure 8. An established ‘virtual’ Security Perimeter

Firewalls are also notoriously complex when they are used for community separation or multi-tenant applications. The reason is that all of the separation depends on the security policy database (SPD) and how well it covers all of the given applications and port calls. If a new application is introduced and needs to be isolated, the SPD must be modified to reflect it. If this gets missed or the settings are not correct, the application is not isolated and is no longer secure. Again, SPB and dark horse networking help by controlling users' paths and keeping communities separate. The firewall can now be white-listed, with a blanket deny-all policy after that. As new applications get installed, unless they are added to the white list they will be isolated by default within the community in which they reside. There is far less manipulation of the individual SPDs and far less risk of an attack surface developing in the security perimeter due to a missed policy statement.
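The evaluation order being described can be sketched in a few lines: an explicit white list followed by an implicit deny-all, so a newly installed application stays isolated until it is deliberately added. The rule entries are illustrative.

# Whitelist-then-deny-all evaluation, as described above. Rules are illustrative.
WHITELIST = {
    ("employee-l3-vsn", "datacenter", 443),   # HTTPS to internal applications
    ("employee-l3-vsn", "datacenter", 5061),  # SIP/TLS for UC
}

def permitted(src_vsn, dst_zone, dst_port):
    """Allow only explicitly listed flows; everything else hits the implicit deny."""
    return (src_vsn, dst_zone, dst_port) in WHITELIST

print(permitted("employee-l3-vsn", "datacenter", 443))    # True
print(permitted("passenger-l3-vsn", "datacenter", 443))   # False: not whitelisted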


Time to move…

There is another set of traits that is very attractive about SPB, and particularly about what we have done with it at Avaya in our Fabric Connect. It is something termed mutability. In the last article on E-911 evolution we touched on this a little; here I would like to go into it in more detail. IP VPN services are nothing new: MPLS has been providing such services for years. Unlike MPLS, however, SPB is very dynamic in the way it handles new services or changes to existing services. Where a typical MPLS infrastructure might require hours or even days for the provisioning process, SPB can accomplish the same service in a matter of minutes or even seconds. This does not even take into account that MPLS also requires the manual provisioning of alternate paths. With SPB, not only are service instances intelligently extended across the network by the shortest path, they are also given full redundancy and resilience by virtue of the SPB fabric. If alternate routes are available they will be used automatically during times of failure; they do not have to be manually provisioned ahead of time. The fabric has the intelligence to reroute by the shortest path automatically. At Avaya, we have tested our fabric to reliable convergence of 100ms or under, with the majority of instances falling into the 50ms range. As such, mutability becomes a trait that Avaya alone can truly claim. But in order to establish what that is, let's recognize that there are two forms.

1). Services mutability

This was covered to some degree in the previous article, but to review the salient points: it really boils down to the fact that a given L3 VSN can be extended anywhere in the SPB network in minutes. The principles pointed out in the previous article illustrate that membership in a given dark horse network can be rather dynamic and can not only be extended but also retracted as required. This comes as part and parcel of Avaya’s Fabric Connect. While MPLS-based solutions may provide equivalent services, none are as nimble, quick or accurate in prompt service deployment as Avaya’s Fabric Connect based on IEEE 802.1aq Shortest Path Bridging.

2). Nodal mutability

This is something very interesting, and if you ever have the chance for hands-on experience, please try it. It is very, very profound. Recall from previous articles that each node holds a resident ‘link state database’ generated by IS-IS that reflects its knowledge of the fabric from its own relative perspective. This knowledge covers not only the topology but also the node's own provisioned services as well as those of other nodes. This creates a situation of nodal mutability. Nodal mutability means that a technician out at the far edge of the network can accidentally swap two (or more) uplink ports and the node will still join the network successfully. Alternatively, if a node were already up and running and for some reason port adjacencies needed to change, the change could be accommodated very easily with only a small configuration adjustment. (Try it in a lab. It is very cool!) Going further on this logic, the illustration below shows that a given provisioned node could be unplugged from the network and then driven hundreds of kilometers to another location.

Figure 9

Figure 9. Nodal and Services Mutability

At that location they could plug the node back into the SPB network, and the fabric will automatically register the node and all of its provisioned services. If all of these services are dark horse, then there will be authentication challenges into the various networks the node provides as users access services. This means, in essence, that dark horse networks can be extremely dynamic. They can be mobile as well. This is useful in many applications where mobility is desired but the need to re-provision is frowned upon or simply impossible. Use cases such as emergency response, military operations or mobile broadcasting are just a few areas where this technology would be useful, but there are many others, and the number will increase as time moves forward. There is no corresponding MPLS service that can provide both nodal and services mutability. SPB is the only technology that allows for it via IS-IS, and Avaya’s Fabric Connect is the only solution that provides this for not only L2 but also L3 services, as well as for IP VPN and multicast.

Some other use cases…

Other areas where dark horse networks are useful are networks that require full privacy for PCI or HIPAA compliance. L3 Virtual Service Networks are perfect for these types of applications or solution requirements. Figure 8 could easily be an illustration of a PCI compliant environment in which all subsystems sit within a totally closed L3 VSN IP VPN environment. The only ingress and egress are through well defined virtual security perimeters that allow for full monitoring of all permitted traffic. This combination yields an environment that, when properly designed, will readily pass PCI compliance scanning and analysis. In addition, these networks are not only private, they are invisible to external would-be attackers. The attack surface is reduced to the virtual security perimeter only; as such, it is practically non-existent.

In summary

While private IP VPN environments have been around for years, they are typically clumsy and difficult to provision. This is particularly true for environments where quick, dynamic changes are required. As an example, a typical MPLS IP VPN provisioning exercise will require approximately 200 to 250 command lines, depending on the vendor and the topology. Interestingly, much of this CLI activity is not in provisioning MPLS itself but in provisioning the supporting protocols such as IGPs and BGP. Also, consider that all of this is just for the initial service path; any redundant service paths must then be manually configured. Compare this with Avaya’s Fabric Connect, which can provide the same service type with as little as a dozen commands. Additionally, there is no requirement to engineer and provision redundant service paths, as they are already provided by SPB’s intelligent fabric.

As a result, IP VPNs can be provisioned in minutes and be very dynamically moved or extended according to requirements. Again, the last article on the evolution of E-911 describes how an IP VPN morphs over the duration of a given emergency, with different agencies and individuals coming into and out of the IP VPN environment on a fairly dynamic basis according to their identity, role and group associations.

Furthermore, SPB nodes are themselves mutable. Once again, IS-IS provides this capability. An SPB node can be unplugged from the network and moved to the opposite end of the topology, which can be hundreds or even thousands of kilometers away. There it can be plugged back in, and IS-IS will communicate the nodal topology information as well as all provisioned services on the node. The SPB network will in turn extend those services out to the node, giving complete portability to the node as well as its resident services.

In addition, SPB can provide separation for non-IP data environments as well. Protocols such as SCADA can enjoy an isolated non-IP environment through the use of L2 VSNs, and they can be further isolated so that there is simply no viable path into the environment for would-be hackers.

This combination of privacy and fast mutability of both services and topology lends itself to what I term a Dark Horse Network. They are dark, so that they cannot be seen or attacked, due to the lack of surface for such an endeavor. They are swift in the way they can morph by service extension, and they are extremely mobile, providing the ability for nodes to make wholesale changes to the topology and still connect to the relevant provisioned services without any need to re-configure. Any other IP VPN technology would be very hard pressed to make such claims, if indeed it could make them at all! Avaya’s Fabric Connect based on IEEE 802.1aq sets the foundation for the true private cloud.

Feel free to visit my new YouTube channel! Learn how to set up and enable Avaya’s Fabric Connect technology in a few short, step-by-step videos.

The evolution of E-911

November 2, 2012

NG911 and the evolution of ESInet


If you live within North America and have ever been in a road accident or had a house fire, then you are one of the fortunate ones who had the convenience and assurance of 911 services. I am old enough to remember how these things were handled prior to 911. Phones (dial phones!) had dozens of stickers for police, fire and ambulance. If there were no stickers, then one had to resort to a local phone book that hopefully had an emergency services section. To think of how many lives have been saved by this simple three digit number is simply boggling. Yet to a large degree we now take this service for granted and assume it will just work as it always has, regardless of the calling point. We also seem to implicitly assume that all of the next generation capabilities and intelligence available today can just automatically be utilized within its framework. This article is intended to provide a brief history of 911 services and how they have evolved up to the current era of E911. It will also discuss the upcoming challenges of extending the service into a true multi-tenant, multi-service framework that can leverage the latest technology offerings. In short, we are talking about the advent of Next Generation 911 (NG911) emergency services infrastructure.

Conceptually, 911 is very simple. As the figure below illustrates, a person reporting an emergency calls the three digit number. The original intent was to provide the public with a single point of contact for all emergencies. Prior to 911, you would have a number for police, a number for fire and a number for medical; and to make matters worse, each jurisdiction would have its own unique numbers. That could mean a dozen numbers to remember for your town and three of your neighboring towns. It was out of this that “E”911 was born to deliver even more functionality. In addition to providing a single ubiquitous number regardless of where you were located, it provided ‘selective routing’, or automatic routing based on the originating number’s documented location in the telephone company database. It also provided some new intelligence on the wire, called Automatic Number Identification or ANI. You are probably more familiar with its street name of ‘Caller ID’.

Figure 1. Traditional 911 PSAP

This, however, was back in the days of land-line phone services, which are a shrinking minority in this age of mobile communications. Originally embodied by the advent of cellular phones, the industry has evolved to facilitate both local and wide area wireless technologies as well as PDAs, tablets and, yes, still cell phones. The problem is that the original 911 model became increasingly broken and in need of an update to handle this new mobile phenomenon. Think of it: if I am driving down Interstate 90 in Boston and I call 911, how do they know that I am in Boston and not in Ontario, New York, where my billing address is? At first there was no such capability, and some people lost their lives due to the longer response times incurred. For a while, the first thing the 911 call taker needed to do was validate and confirm the location, assuming that if the call was mobile there was no other way. Fortunately, this led to the evolution of Phase 1 cellular E911 services, which allowed for the correlation of cellular 911 calls to a particular antenna face on a tower. Each cellular carrier typically has three antennas on each tower, each providing service to a 120 degree arc of the compass. When a call is received in a particular sector, it is routed to the PSAP that has primary coverage of that sector. PSAPs can also transfer calls between themselves, so if a call is misrouted once in a while, it can easily be warm-handed off to the proper authority. There are several technologies that allow for this, and they are summarized briefly in the illustration below.

Figure 2. Methods for mobile device location

As one can readily imagine, a wireless provider can tell which cell your device is operating in when a call to 911 is made. This can be a fairly vast geography, however. The actual number varies depending on the technology, but it can typically be a radius of 10 to 20 miles. Accuracy is gained by leveraging different radio antenna sources through a method known as triangulation, where a closer proximity can be gained by using multiple signal points of reference. Lately, the additional GPS capabilities in newer Droids and iPhones lend an accuracy of meters.

Another evolution is Assisted GPS, or A-GPS. A-GPS merges GPS and network-related technologies to increase accuracy and decrease the ‘fix time’ needed to determine location. A-GPS uses network-related resources and in turn uses satellite services when signal conditions are poor due to signal weakness or interference. The typical A-GPS device will have not only GPS hardware but Internet access as well. Most modern smartphones and PDAs fit this capability mix. As a result, a mobile user’s location can be determined with a great degree of accuracy.


But that was then…

Recent (within the decade) large scale emergencies, both man-made and otherwise, have taught us a few things about events of this proportion. First, infrastructure is damaged, and along with it communications elements. At times, communications can be lost altogether for extended periods of time. Second, events of this scale require coordinated logistics between multiple organizations and their resources. When we put these two things together we see a real issue: coordinated logistics requires reliable communications! Events like NYC 9/11, Katrina and even the BP oil spill crisis have shown that no single agency can address all of the needs that require response. In short, the ability to communicate effectively is paramount to effective large scale emergency response, particularly in events of wide geographic proportions.

Nonetheless, these events serve to remind us that they can render useless much of the technology we take for granted today. Additionally, the traditional E911 network is closed in architecture and very regional in the way it is deployed. This makes wide-scale geographic coordination of information and resources very difficult. Emergencies that cross PSAP boundaries will often require additional impromptu, ad hoc communications that frequently lack context or clarity.

Now let’s add in the new abilities that technology brings to the table. Big Data analytics is my personal and professional favorite. In emergency situations information is essential, but too much information without context will tend to slow down the emergency response. Contextual, prioritized information, delivered in a timely manner, has been shown to increase both the timeliness and the accuracy of the response; a later example will clarify. The major point here is that E911, which was architected to handle the mobile emergency call, is still effective for that purpose but not for these upcoming challenges. NG911 is intended for the ‘other side’ of the equation: the agencies and services (fire, hazmat, medical response) that will require detailed and reliable communications and information to deal most effectively with the situation at hand.

All of this means that the supporting network must be capable of multi-service and multi-tenancy. We will cover these two terms in the next few paragraphs; both are part of normal service provider nomenclature. Multi-service is the ability of the network to deliver appropriate service level assurance for the proper operation of end-to-end applications. The categories most often thought of are voice, video and data, but the model can be more granular, covering data for certain application types so that some applications can be prioritized over others. Multi-tenancy is the ability to support multiple user, service or even application groups and keep the resources they use totally separate from one another. At the same time, there may be applications that do have a requirement to cross tenant boundaries, such as IP voice or email, but these will be constrained to cross a security demarcation where such rules can be enforced. Rule number one of multi-tenancy is that tenant A should never see tenant B’s traffic, or vice versa, unless otherwise provisioned to do so as described above. Also, tenant A should never be able to impinge on the resources allocated to tenant B, again unless otherwise provisioned. These are not easy bars to reach with traditional networking technology and practices. Typically, in order to do this at the scale required, we need a complex mix of technologies such as those shown in the diagram below. MPLS IP VPN services have really been the only technology up to par with these requirements. Unfortunately, this means that many state and local governments are either forced to depend on a third-party public service provider or to implement MPLS directly themselves. Those that do find that the technology is expensive, complex and requires an inordinately high staff count to properly implement and maintain.

Recently, however, another technology has been ratified by the IEEE, known as ‘Shortest Path Bridging’ or IEEE 802.1aq. This standard provides a radical evolution of the Ethernet forwarding control plane that allows for both multi-tenancy and multi-service capabilities without the complexities of legacy approaches. Previous articles have discussed both the methods and the services that allow for these capabilities, so we will not go into those areas in any depth here. To summarize, this is all achieved by introducing a link state protocol (IS-IS) to Ethernet switching, along with the concept of provisioned service paths. These innovations, when combined with a MAC encapsulation method known as MAC-in-MAC (IEEE 802.1ah) that serves as a universal forwarding label, allow for a radical change to the Ethernet switching control plane without abandoning its native dichotomy of control and data forwarding within the network element itself. This means that the switch remains an autonomous forwarding element, able to make its own decisions on how to forward data most efficiently and effectively. Yet at the same time the new stateful nature of the 802.1aq control plane allows for very deterministic control of the data forwarding environment. The end result is a vast simplification of the Ethernet control plane that yields a very stateful and deterministic environment.
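As a rough model of the encapsulation just mentioned, the sketch below lists the fields an 802.1ah (MAC-in-MAC) edge switch wraps around a customer frame: the backbone MAC addresses, the backbone VLAN ID and the 24-bit I-SID. The values shown are placeholders.

# Rough model of 802.1ah (MAC-in-MAC) encapsulation; values are placeholders.
from dataclasses import dataclass

@dataclass
class MacInMacFrame:
    b_da: str        # backbone destination MAC (the far-end SPB node)
    b_sa: str        # backbone source MAC (this SPB node)
    b_vid: int       # backbone VLAN ID carried in the B-TAG
    i_sid: int       # 24-bit service identifier carried in the I-TAG
    payload: bytes   # the customer's original Ethernet frame, untouched

frame = MacInMacFrame(
    b_da="00:bb:00:00:00:02",
    b_sa="00:bb:00:00:00:01",
    b_vid=4051,
    i_sid=20010,
    payload=b"...customer frame...",
)
print(frame.i_sid)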

The figure below shows a comparison between MPLS and SPB. Note that there is a vast simplification in the number of protocol state machines required to support a given service. This simplification not only results in ease of use but also drastically increases the reachable scale for Ethernet. This is important for ESInets, as the number of agencies and entities requiring access will increase as time moves on and NG911 technology evolves.

Figure 3. A comparison of MPLS to SPB


Ships in the Night, but I may want to jump ship if required…

As we look closer at the concepts of multi-tenancy for emergency services, we see that the requirements can be fairly dynamic. As an example, during normal operations entities may be quite separate from one another; normal day-to-day operations might not require a lot of cross communication. There may be some common services, such as email or Voice over IP, as is often the case with state and local government, but by and large each agency's applications and traffic are largely separate.

During emergencies, however, this normal pattern may not apply. Certain entities may need to be in very tight logistical coordination and as a result have to communicate in a very seamless fashion, with applications that may straddle agency boundaries. A good example is a hazardous chemical spill. In a typical scenario you will have a large number of agencies or entities involved in the response. There will obviously be the police, to cordon off the area and maintain a ‘do not cross’ line. You will also have the fire department, with particular HazMat teams matched according to experience. You may also have several area hospitals that are alerted and set up with triage teams to handle the exposed victims, as well as ambulance services to provide transport. Obviously, the teams selected should have previous experience with such events, and preferably even with the particular substance involved. The ability to match experience to requirements is a key element of a successful response, and this is where data analytics plays a key role. Another key element is enabling these teams to communicate effectively and with as much context and supporting data as possible, filtered so as not to overload response personnel with superfluous information.

The figure below illustrates some of the potential that SPB could bring to the table to address these requirements. As shown, each entity in question has its own isolated L3 IP VPN environment that provides for normal day-to-day operations. As an emergency occurs, however, a new L3 IP VPN environment can be created for the event response teams. Members of these teams will be selected and provided with enhanced credentials to access this new IP VPN environment. Note that these teams will have bi-directional communication capabilities. Both normal day-to-day services such as email and dedicated or special services for the emergency response can be provided to this team. Additionally, as they use these dedicated services they are isolated from the other VPN environments, both from a service and a resource perspective. This is important, as the applications being used during the emergency response might be high bandwidth, such as video, or insistent, such as east/west flows within the data centers to support outbound data for field application use. In either case they will most definitely be critical and require an absolute guarantee of service reliability.

Figure 4. A hazardous material spill emergency

As the figure above shows, this new L3 IP VPN environment will exist for as long as required by the emergency response teams and can even persist after the event for as long as necessary for forensics and/or audit investigations. Further, if additional entities are found to be required during the course of the event, or for investigations afterwards, it becomes very easy and straightforward to extend the L3 IP VPN to include these new members without the need for major re-architecting of the service. As shown below, investigatory units from both the police and fire departments are required after the event has transpired. At each agency, new memberships to the special L3 IP VPN environment are added, and the personnel assigned to the investigation units are provided access via centralized or distributed access controls. These virtual service networks are then added to the L3 IP VPN environment to facilitate their ability to communicate with the wider team. Note also that certain critical real-time elements such as 911 dispatch, ambulance and emergency triage are no longer required in the post-event L3 VPN, so they are effectively dropped from the membership, though they can easily be added again if required. The main point in all of this is that unlike MPLS, whose complicated and somewhat rigid provisioning practices prohibit such dynamic behavior, SPB, with its vast simplification of the protocol substrate, allows for quick re-provisioning of the network environment without the complexity. Indeed, the whole solution approach has a profound consequence: the problem is largely reduced to the practice and federation of identity management.
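The membership life cycle just described can be sketched very simply, with hypothetical agency names: members are added as the response forms, and real-time members are dropped again once the post-event phase no longer needs them.

# Sketch of dynamic membership in an event L3 VPN; agency names are hypothetical.
event_vpn = set()

def join(agency):
    event_vpn.add(agency)

def leave(agency):
    event_vpn.discard(agency)

# Response phase: real-time responders join the event VPN.
for agency in ("police", "fire-hazmat", "911-dispatch", "ambulance", "hospital-triage"):
    join(agency)
print("During event:", sorted(event_vpn))

# Post-event phase: real-time members drop off, investigation units are added.
for agency in ("911-dispatch", "ambulance", "hospital-triage"):
    leave(agency)
join("police-investigation")
join("fire-investigation")
print("Post event: ", sorted(event_vpn))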

Figure 5. Post event forensics L3 VPN


When the world is falling apart…

As we have learned from various wide-scale emergencies, both man-made and otherwise, such as NYC 9/11, Katrina and more recently Sandy, significant infrastructure damage often accompanies a disaster event. Such damage can be a critical impediment to the responding emergency teams. Often, complex logistical data is provided by response data centers that correlate and filter information out to the field teams. Failures in the response center, or in the network path between the field teams and the response center, can cause a major set of logistical complications and possibly cost additional lives.

In one of my previous articles, titled “Data Storage: The Foundation and potential Achilles Heel of Cloud Computing”, I illustrated the critical importance of the data footprint and the requirement for mobile virtual machines to have access to their data stores regardless of location. Also, many applications are composite instances that are the result of several server exchanges on the data center back end. This is further complicated by the fact that, in order to provide a truly resilient data fabric, multiple data centers are required at geographically dispersed locations. As a result, these data stores need to be replicated and updated on a very consistent basis, sometimes up to a full data journaling or copy-on-write requirement. Additionally, virtual machines need to be migrated, or at the very worst whole-scale site recovery must be initiated. As this occurs, mappings to data stores must be preserved, including all required network paths. Also, as the migration of the VM, cluster or whole data center occurs, users will require adequate communication paths to seamlessly continue using the applications they need to do their jobs. The figure below illustrates these critical relationships and the communication paths required to facilitate them.

Figure 6. Required Services and Communication Paths

Interestingly, SPB provides a very optimal solution in that the networking technology is ‘topology aware’. As such, its convergence time is extremely fast, ranging in the hundreds of milliseconds. This includes not only layer 2 services like VLANs but layer 3 services such as IP VPNs and IP multicast as well. As major outages occur within the fabric, each individual SPB node will natively make the forwarding decision based on its shortest path knowledge of the network. If a path exists, SPB will use it. As the diagram below shows, several major outages can occur at multiple points in the end-to-end topology, but if the mesh fabric is engineered correctly there will always be an alternate route available for use. As a result, whole regions of the network can fail without an overall failure of the network as a whole. Redundant links can be wireline (optical) or wireless, such as microwave. As long as they provide point-to-point communications links for the SPB nodes and allow the protocol to establish adjacencies, they are candidate technologies for transport linkage.

Figure 7. Shortest Path Resiliency

As shown above, both data centers and users have valid communication paths available despite the fact that a good portion of the network is down. This is an important trait for reliable communication infrastructures, particularly those used during emergencies. Note that through all of this the normal NG911 service runs as usual, with no disruption or outage of call services.

Give me the Bull Horn please

In emergencies it is often strongly desirable to broadcast alerts to all members of a given team or set of teams. This capability can increase the effectiveness of the field response teams and may very well save their lives. In the past this feature was delivered via LMR, or Land Mobile Radio. While such technology still has valid uses and is often gatewayed at the edge for voice communications, other packet-based technologies can deliver richer information such as video and graphics, including weather and radar maps or building blueprints. The major limitation of these newer forms of wireless communication is that they require an IP multicast infrastructure, which is difficult to scale and support. Additionally, major network outages tend to adversely affect the multicast service, often to the point of rendering it unusable. As mentioned earlier, SPB can provide convergence of multicast services on the order of hundreds of milliseconds. This is accomplished by eliminating the typical protocol overlay model of networking shown in Figure 3 and creating a collapsed route-switching substrate, which is Shortest Path Bridging. As the network is shortest path tree aware, it is also multicast distribution tree aware. My previous article discusses multicast in SPB and the major advancements in scale, performance and convergence time it provides. The diagram below shows a more symbolic representation of a major alert going out from the response center not only to the field response teams but to the NG911 PSAPs as well.

Figure 8. SPB Multicast used to provide all points alerts via multicast

With traditional networking technologies this would be a very difficult proposition, requiring the interaction of multiple virtualized PIM domains within MPLS. With SPB, its inherent multi-tenant capabilities lend themselves to easy distribution of multicast trees, each in a separate VPN environment within one network domain. Additionally, there is the benefit of sub-second convergence of the network in the event of failure or outage, which is fast enough to be totally transparent to the multicast services running over it, whether they are audio, video, graphics or data. These traits are highly desirable and lend themselves well to critical communications infrastructure. Real-time services become much more reliable when a resilient, scalable networking technology is used as the infrastructure substrate. This also complements the various layers of resiliency that can be built into other functional parts of the end-to-end solution, such as servers and storage as well as whole data centers. The end result is a very strong resiliency plan that can withstand the worst of impacts and still survive… as long as valid SPB links exist.

In Summary

Let’s face it: no one ever wants to call 911. But when an emergency occurs we are always thankful for the service it provides. Like many, I recall a time prior to the service; many more have known nothing but. As this critical civil service moves into the future and begins to leverage the new technologies that are available, it will become more and more important to pay attention to the network infrastructure that will support them. The ‘Cloud’ works by the reach of the network; the services remain up through the resiliency that the network provides. In reality, this is nothing new: service providers have been using such practices for years. What is really new is that IEEE 802.1aq Shortest Path Bridging provides an infrastructure that is no longer out of reach for most state and local governments, who are now analyzing the network requirements for true NG911 and ESInet evolution.

I would like to thank my esteemed colleague Mark Fletcher, a fellow Avaya engineer, for his input and mentoring on this article. Mark has extensive experience in E911 and is an industry recognized expert in his field.

How would you like to do IP Multicast without PIM or RP’s? Seriously, let’s use Shortest Path Bridging and make it easy!

June 8, 2012


Why do we need to do this? What’s wrong with today’s network?

Anyone who has deployed or managed a large PIM multicast environment will relate to the response to this question. PIM works on the assumption of an overlay protocol model. PIM stands for Protocol Independent Multicast, which means that it can utilize any IP routing table to establish a reverse path forwarding tree. These routes can be created with any independent unicast routing protocol such as RIP or OSPF, or even be static routes or combinations thereof. In essence, there is an overlay of the different protocols to establish a pseudo-state within the network for the forwarding of multicast data. As any network engineer who has worked with large PIM deployments will attest, they are sensitive beasts that do not lend themselves well to topology changes or expansions of the network delivery system. The key word in all of this is the term ‘state’. If it is lost, then the tree truncates and the distribution service for that length of the tree is effectively lost. Consequently, changes need to be done carefully and be well tested and planned. And this is all due to the fact that the state of IP multicast services is effectively built upon a foundation of sand.

The first major point to realize is that most of today's Ethernet switching technology still operates with the same basic theory of operation as the original IEEE 802.1d bridges. Sure, there have been enhancements such as VLAN's and tagged trunking that allow us to slice a multi-port bridge (which is what an Ethernet switch really is) up into virtual broadcast domains and extend those domains outside of the switch and between other switches. But by and large the original operational process is based on 'learning'. The concept of a learning bridge is shown in the simple illustration below. As a port on a bridge receives an Ethernet frame, it remembers the source MAC address as well as the port it came in on. If the destination MAC address is known, the bridge will forward the frame out the port on which that address was last seen. As shown in the example below, source MAC “A” is received on port 1. As the destination MAC “B” is known to be on port 2, the bridge will forward accordingly.


Figure 1. Known Forwarding

But MAC “A” also sends out a frame to destination MAC “C”. Since MAC “C” is unknown to the bridge, it will flood the frame to all ports. As a result of the flooding, MAC “C” responds and is found to be on port 3. The bridge records the information into its forwarding information base and forwards the frame accordingly from that point on. Hence, this method of bridging is known as 'flood based learning'. As one can readily see, it is a critical function for normal local area network behavior. No one argues the value or even the necessity of learning in the bridged or switched environment. The problem is that the example above was circa 1990.

 Figure 2. Unknown Flooding
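To make the 'flood and learn' behavior concrete, the sketch below models a hypothetical three-port bridge in Python. The class and port names are purely illustrative; real bridges add VLAN awareness, ageing timers and hardware tables.

class LearningBridge:
    def __init__(self, ports):
        self.ports = ports          # e.g. [1, 2, 3]
        self.fdb = {}               # forwarding database: MAC -> port

    def receive(self, in_port, src_mac, dst_mac):
        # Learn: remember which port the source MAC was last seen on.
        self.fdb[src_mac] = in_port
        # Forward: a known destination goes out one port, an unknown one is flooded.
        if dst_mac in self.fdb:
            return [self.fdb[dst_mac]]
        return [p for p in self.ports if p != in_port]

bridge = LearningBridge(ports=[1, 2, 3])
print(bridge.receive(1, "A", "C"))   # 'C' unknown -> flooded to ports 2 and 3
print(bridge.receive(3, "C", "A"))   # 'C' learned on port 3, 'A' known -> [1]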

As the figure below shows, adding in Virtual LAN's and multi-port high speed switches makes things much more complex. The reality of it is that as the networking core grows larger, the switches in the middle get busier and busier. The forwarding tables need to get larger and larger, to the point where end to end VLAN's are no longer tractable, so layer 3 boundaries via IP routing are introduced to segment the network domains. In the end, little MAC “A” is just one of the tens of thousands of addresses that traverse the core. In essence, there is no 'state' for MAC “A” (or any other MAC address for that matter).

 Figure 3. Unknown Flooding in a Routed VLAN topology

Additionally, recall that multicast is a destination address paradigm. IP multicast groups translate to destination MAC addresses at the Ethernet forwarding level. Because it is a destination address, there needs to be a resolution to a unicast source address. This is not a straightforward process. It involves the overlay of services on top of the Ethernet forwarding environment. These services provide for the resolution of the source as well as the build of a reverse path forwarding environment and the joining of that path to any pre-existing distribution tree. In essence these overlay services embed a sort of 'state' into the multicast forwarding service. These overlays are also very dependent on timers for the operating protocols and the fine tuning of those timers according to established best practice to maintain the state of the service. When this state is lost or becomes ambiguous, however, nasty things happen to the multicast service. This is the primary reason why multicast is so problematic in today's typical enterprise environment.

The protocol most often used to establish the unicast routing service is OSPFv2 or v3 (Open Shortest Path First – v2 for IPv4 and v3 for IPv6), which builds the unicast routing tables for IP. OSPF runs over Ethernet and establishes end to end forwarding paths on top of the stateless, frame based flood and learn environment below. On top of this, PIM (Protocol Independent Multicast) is run to establish the actual multicast forwarding service. Source resolution is provided by a function known as an 'RP' or Rendezvous Point. This is an established service that registers sources for multicast and provides the 'well known' point within the PIM domain to establish source resolution. As a result, in PIM sparse mode all first joins to a multicast group from a given edge router are always via the RP. Once the edge router begins to receive packets it is able to discern the actual unicast IP address of the sending source. With this information the edge PIM router, or designated router (DR), will then build a reverse path forwarding tree back to the source or to the closest topological leg of an existing distribution tree. At the L2 edge, end stations signal their interest in a given service by a protocol known as Internet Group Management Protocol, or simply IGMP. In addition, most L2 switches can be aware of this protocol and forward selectively to interested receivers without flooding to all ports in a given VLAN. This process is known as IGMP snooping. In PIM sparse mode, the version of IGMP typically used is IGMPv2, which is non-source specific (this is *,G mode, where * means that the source address is not known). Once the source is resolved by the RP the state changes to S,G – where the source is now known. All of this is shown in the diagram below.


Figure 4. Protocol Independent Multicast Overlay Model

As can be readily seen, this is a complex mix of technologies to establish a single service offering. As a result, large multicast environments tend to be touchy and require a comparatively large operational budget and staff to keep running. Large changes to network topology can wreak havoc with IP multicast environments. As a result, such changes need to be thought through and carefully planned out. Not all changes are planned, however. Network outages force topological changes that can often adversely affect the stability of the IP multicast service. The reason for this is the degree of protocol overlay and the need for correlation of the exact state of the network. As an example, a flapping unicast route could adversely affect an end to end multicast service. Additionally, this problem could be caused at the switch element level by a faulty link, port or module. Mutual dependencies in these types of solutions lend themselves to difficult troubleshooting and diagnostics. This translates to longer mean time to repair and overall higher operational expense.
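As a conceptual illustration of the sparse mode behavior described above, the Python sketch below models only the (*,G) to (S,G) state change at a designated router. It is a toy state model under assumed names, not a protocol implementation.

class PimDesignatedRouter:
    def __init__(self, rp_address):
        self.rp = rp_address
        self.mroutes = {}                       # group -> ("*" or source, group)

    def igmp_join(self, group):
        # IGMPv2 join: source unknown, so build a (*,G) entry rooted toward the RP.
        self.mroutes.setdefault(group, ("*", group))

    def first_packet(self, source, group):
        # The first data packet arrives via the RP tree and reveals the source,
        # allowing a source-specific (S,G) shortest path tree to be built.
        if group in self.mroutes:
            self.mroutes[group] = (source, group)

dr = PimDesignatedRouter(rp_address="rp.example")
dr.igmp_join("239.1.1.1")
dr.first_packet("10.0.0.5", "239.1.1.1")
print(dr.mroutes)    # {'239.1.1.1': ('10.0.0.5', '239.1.1.1')}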


 There must be a better way…

As we noted previously, IP multicast is all about state. Yet at the lowest forwarding element level the operational aspects are stateless. It seems that a valid path forward is to evolve this lowest level to become more stateful and deterministic in the manner in which traffic is handled. In essence, the control plane of Ethernet Switching needs to evolve.

Control Plane Evolution

IEEE has established a set of standards that allows for the evolution of the Ethernet switching control plane into a much more stateful and deterministic model. There are three main innovations that enable this evolution.

Link State Topology Awareness – IS-IS

Universal Forwarding Label –The B-MAC

Provisioned Service Paths – Individual Service Identifiers

This is all achieved by introducing link state protocol (IS-IS) to Ethernet switching as well as the concept of provisioned service paths. These innovations, when combined with a MAC encapsulation method known as MAC in MAC (IEEE 802.1ah) allow for a radical change to the Ethernet switching control plane without abandoning its native dichotomy of control and data forwarding within the network element itself. This means that the switch remains an autonomous forwarding element, able to make its own decisions as to how to forward data most effectively. Yet, at the same time the new stateful nature of the control plane allows for very deterministic control of the data forwarding environment. The end result is a vast simplification of the Ethernet control plane that yields a very stateful and deterministic environment. This environment can then optionally be equipped with a provisioning server infrastructure that provides an API environment between the switching network and any applications that require resources from it. As applications communicate their requirements through the API, the server instructs the network on how to provision paths and resources. Yet importantly, if the network experiences failures, the switch elements know how to behave and have no need to communicate back to the provisioning server. They will automatically find the best path to facilitate any existing sessions and will use this modified topology for any new considerations.  In this model the best of both worlds is found. There is deterministic control of network services, but the network elements remain in control of how to forward data and react to changes in network topology.

 Figure 5. Stateful topology with the use of IS-IS

This technology is known as Shortest Path Bridging, the IEEE standard 802.1aq. As its name implies, it is an Ethernet switching technology that switches by the shortest available path between two end points. The analogy here is to the IP link state routing protocols OSPFv2 for IPv4 and OSPFv3 for IPv6. In link state protocols each node advertises its state as well as any extended reachability. Through these updates, each node gains a complete perspective of the network topology. Each element then runs the Dijkstra shortest path algorithm to identify the shortest loop free path to every point within the network.

When one looks at the stateless methods of Ethernet forwarding and the continued need for antiquated protocols such as Spanning Tree, a link state approach cannot help but look like a path of promise. The problem is that OSPFv2 and OSPFv3 are 'monolithic' routing protocols, meaning that they were designed exclusively to route IP. IEEE knew this of course and found a very good link state protocol that was open and extensible. That protocol is IS-IS (Intermediate System to Intermediate System) from the OSI suite. One of the first areas of interest is that IS-IS establishes adjacencies with L2 Hello's, NOT L3 LSA's like OSPF. The second is that it uses extensible type, length, value (TLV) fields to move information such as topology, provisioned paths or even L3 network reachability between switch elements. In other words, the switches are 'topology aware'. Once we have this stateful topology of Ethernet switches, we can then determine what network path data will take for different application services.
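For readers who want to see the computation itself, the short Python sketch below runs Dijkstra over a hypothetical four-node SPB topology. Node names and equal link metrics are assumptions for illustration; every node in a link state domain runs the same calculation over the same database to derive its loop free shortest paths.

import heapq

def shortest_paths(topology, root):
    dist = {root: 0}
    heap = [(0, root)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > dist.get(node, float("inf")):
            continue
        for neighbor, link_cost in topology[node].items():
            new_cost = cost + link_cost
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return dist

topology = {
    "BEB-1": {"BCB-1": 10, "BCB-2": 10},
    "BCB-1": {"BEB-1": 10, "BEB-2": 10},
    "BCB-2": {"BEB-1": 10, "BEB-2": 10},
    "BEB-2": {"BCB-1": 10, "BCB-2": 10},
}
print(shortest_paths(topology, "BEB-1"))   # cost to every node from BEB-1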

The next step IEEE had to deal with was implementing a universal labelling scheme for the network that provides all of the information that a switch element needs to forward the data. Fortunately, there was a pre-existing standard, IEEE 802.1ah (MAC-in-MAC), that provides just this type of functionality. The standard was initially established as a provider/customer demarcation for metro Ethernet managed service offerings. It works on the concept of encapsulating the customer Ethernet frame (C-MAC) within a backbone provider frame (B-MAC) that carries it across the core and is then stripped off at the other end to yield a totally transparent end to end service. This process is shown in the illustration below.


Figure 6. The use of 802.1ah B-MAC as a universal forwarding label in conjunction with IS-IS

The benefits of this model are the immense amount of scalability and optimization that happens in the network core. Once a data frame is encapsulated, it can be transported anywhere within the SPB domain without the need to learn. This is accomplished by combining 802.1ah and IS-IS together with a further extension of virtualization, which we will cover next.

Recall that IS-IS allows for the establishment of adjacencies at the L2 Hello level and that information moves through these updates by the use of type, length, value fields, or TLV's. As we pointed out earlier, some of these TLV's are used for network reachability of those adjacencies. These adjacencies are all based on the B-MAC's of the SPB switches within the domain. Only those addresses are populated into the forwarding information databases at the establishment of adjacency and the running of the Dijkstra algorithm to establish loop-free shortest paths to every point on the network. As a result, the core Link State Database (LSDB) is very small and is only updated at new adjacencies such as new interfaces or switches. The important point is that it is NOT updated with end system MAC addresses. As a result, a core can support tens of thousands of outer C-MAC's while only requiring a hundred or so B-MAC's in the network core. The end result is that any switch in the SPB network can look at the B-MAC frame and know exactly what to do with it, without the need to flood and learn or reference some higher level fabric controller.
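The scaling argument can be sketched in a few lines. The hypothetical structures below show a core forwarding table keyed only on B-MAC's, with the customer frame carried opaquely inside the 802.1ah encapsulation. The field, switch and port names are illustrative assumptions, not a product data model.

from dataclasses import dataclass

@dataclass
class CustomerFrame:
    c_src: str
    c_dst: str
    payload: bytes

@dataclass
class BackboneFrame:            # 802.1ah MAC-in-MAC encapsulation
    b_src: str                  # B-MAC of the ingress BEB
    b_dst: str                  # B-MAC of the egress BEB
    i_sid: int                  # service identifier carried with the frame
    inner: CustomerFrame        # never inspected or learned in the core

core_fdb = {"beb-2-bmac": "port-7"}     # populated by IS-IS, not by learning

def core_forward(frame: BackboneFrame):
    # A core switch looks only at the outer B-MAC; the inner C-MAC's are opaque.
    return core_fdb.get(frame.b_dst, "replicate-or-drop-per-I-SID")

frame = BackboneFrame("beb-1-bmac", "beb-2-bmac", 3,
                      CustomerFrame("A", "B", b"hello"))
print(core_forward(frame))      # 'port-7'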

There is one last thing required, however. Remember that we still need to learn MAC's. At the edge of the SPB network we need to assume that there are normal IEEE 802.3 switches and end systems that need to be supported. So how does one end system establish connectivity across the SPB domain without flooding? This is where the concept of constrained multicast comes in. The simplest way to discuss constrained multicast is based on the concept of provisioned service paths. These provisioned paths or I-SID's (Individual Service Identifiers) are similar to VLAN's in that they contain a broadcast domain, but they operate differently as they are based on subsets of the Dijkstra forwarding trees mentioned previously. As the example below shows, when a station wishes to communicate with another end system, it simply sends out an ARP request. That ARP request is then forwarded out to all required points for the associated I-SID.


Figure 7. The ‘Constrained Multicast’ Model using 802.1ah and IS-IS

The end system on the other side receives the request and then responds, establishing a unicast session over the same shortest path. As a result, the normal Ethernet 'flood and learn' process can still be facilitated on the outside of the SPB domain without the need to flood and learn in the core. This vastly simplifies the network core, allows for deterministic forwarding behavior and provides the ability to deliver separated virtual network services. The reason for this is shown in the diagram below, with a little better detail on the B-MAC encapsulation for SPB and the legacy standards that it builds upon. As can be seen, the concept of the I-SID is a pseudo evolution of the parent Q tag in the 802.1Q-in-Q standard. The I-SID value is carried within the backbone encapsulation and consequently tells a core switch everything it needs to know, including whether or not it needs to replicate the frame for constrained multicast functionality. Note that the two most difficult problems of multicast distribution are solved: source resolution and the reverse path forwarding build.


Figure 8. IEEE 802.1ah and its relation to other ‘Q’ standards
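A rough sketch of the constrained multicast behavior is shown below: a broadcast such as the ARP request described above is replicated only to the BEB's that are members of the frame's I-SID, never flooded domain wide. The membership table and BEB names are assumptions for illustration only.

isid_members = {
    3: ["beb-1", "beb-4"],              # an L2 VSN carrying a single edge VLAN
    200: ["beb-1", "beb-2", "beb-3"],   # another provisioned service
}

def replicate(ingress_beb, i_sid):
    # Replication targets are every member BEB of the I-SID except the ingress BEB.
    return [beb for beb in isid_members.get(i_sid, []) if beb != ingress_beb]

print(replicate("beb-1", 3))     # ['beb-4'] -- the ARP goes only where the service lives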

Once these technologies were merged together into a cohesive standard framework known as IEEE 802.1aq Shortest Path Bridging (MAC-in-MAC), or SPBm, the result is a very stateful and scalable switching infrastructure that lends itself very well to the building and distribution of multicast services. In addition, SPB can offer many other types of services, ranging from full IP routing to private IP VPN services, all provisioned at the edge as a series of managed services across the network core. With these layer three services comes the need for the distribution of multicast services across the L3 boundaries. This is true L3 IP multicast routing. Interestingly, SPBm provides some very unique approaches to solving the problem. Again, let us take note that the two most important problems have already been solved.

The figure below shows an SPBm network that is providing multicast distribution between two IP subnets. One of the subnets is an L2 VSN (an I-SID that is associated with VLAN's). The other subnet is a peripheral network that is reachable by IP shortcuts via IS-IS. Note that as a stream becomes active in the network, the BEB that has the source dynamically allocates an I-SID to the multicast stream, and that information becomes known via the distribution of IS-IS TLV's. At the edge of the network the Backbone Edge Bridges (BEB's) are running IGMP snooping out to the L2 Ethernet edge. The edge SPB BEB in effect becomes the querier for the L2 edge. As receivers signal their interest in a given IP multicast group they are handled by the BEB to which they are connected, which searches the IS-IS LSDB (Link State Database) for advertisements of the multicast stream within the context of the VSN to which the receiver belongs. Once the BEB advertising the stream and the I-SID are found in the LSDB, the BEB connected to the receiver uses standard IS-IS SPB TLV's to receive traffic for the stream. The dynamically assigned I-SID values start at 16,000,001 and work upward. Provisioned services use values less than 16,000,000. In the case of the L3 traversal, the I-SID is dynamically extended to provide for the build of the L3 multicast distribution tree. 802.1aq supports up to 16,777,215 I-SID's.

Figure 9. IP Multicast with SPB/IS-IS using IP Shortcuts and L2 VSN

As the diagram above shows, for an end station to receive multicast from the source, the BEB merely uses this dynamic I-SID to extend the service to end stations that are members of the same subnet over the L2 VSN. Conversely, the receiver will use the same dynamic I-SID, built using the information provided by IS-IS, to establish the end to end reverse forwarding path. In this model, IP multicast becomes much more stateful and integrated into the switch forwarding element. This results in a far greater build out capacity for the multicast service. It also provides for a much more agile multicast environment when dealing with topology changes and network outages. Switch element failures are handled with ease because the layered mutual dependence model has been removed. If a failure occurs within the core or edge of the network, the service is able to heal seamlessly due to the fact that the information required to preserve service is already known by all of the elements involved. Because the complete SPBm domain is topology aware, each switch member knows what it has to do in order to maintain established service. As long as a path exists between the two end points, Shortest Path Bridging will use it to maintain service. This is the result of true integration of link state routing into the Ethernet forwarding control plane.
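As a rough illustration of the numbering behavior described above, the sketch below models a per-domain allocator that hands out dynamic I-SID's starting at 16,000,001 and stops at the 16,777,215 ceiling. The allocator structure and its lookup key are assumptions for illustration, not the actual switch implementation.

DYNAMIC_START = 16_000_001      # values below 16,000,000 are for provisioned services
ISID_MAX = 16_777_215           # ceiling supported by 802.1aq

class MulticastIsidAllocator:
    def __init__(self):
        self.next_isid = DYNAMIC_START
        self.streams = {}                       # (source, group, vsn) -> I-SID

    def allocate(self, source, group, vsn):
        key = (source, group, vsn)
        if key not in self.streams:
            if self.next_isid > ISID_MAX:
                raise RuntimeError("I-SID space exhausted")
            self.streams[key] = self.next_isid  # advertised to the domain via IS-IS TLV's
            self.next_isid += 1
        return self.streams[key]

alloc = MulticastIsidAllocator()
print(alloc.allocate("10.1.1.5", "239.1.1.1", "VSN-A"))   # 16000001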

What goes on behind closed doors…

In addition to providing constrained and L3 multicast, SPB also provides the ability to deliver 'ship in the night' IP VPN environments. With SPBm's native capabilities it becomes very easy to extend multicast distribution into these environments as well. Normally, multicast distribution within an IP VPN environment is notoriously complex, dealing with yet more overlays of technology. Within SPBm networks, however, the task is comparatively simple. As the diagram below illustrates, an L3 VSN (IP VPN) is nothing more than a set of VRF's that are associated with a common I-SID. Here we run IGMP on the routed interfaces that connect to the edge VLAN's. Note that IGMP snooping is not used here, as the local BEB interface will be a router. IGMP, SPB and IS-IS perform as before, and the dynamic I-SID simply uses the established Dijkstra path to provide the multicast service between the VRF's. Important to note, though, is that this service is invisible to the rest of the IP forwarding environment. It is a dark network that has no routes in and no routes out. Such networks are useful for video surveillance networks that require absolute separation from the rest of the networking environment. Note though that some services may be required from the outside world. This can be accommodated by policy based routing.


Figure 10. IP Multicast with SPB/IS-IS using L3 VPN

As the figure illustrates, the users within the L3 VSN have access to a set of subnets within the closed environment, which is useful for services that require complete secure isolation such as IP multicast based video surveillance. The end result is a very secure closed system multicast environment that would be very difficult to build with legacy technology approaches.

I can see clearly now…

Going back to figure 4 that illustrates the legacy PIM overlay approach, we see that there are several demarcations of technology that tend to obscure the end to end service path. This creates complexities in troubleshooting and overall operations and maintenance. Note that at the edge we are dealing with L2 Ethernet switching and IGMP snooping, then we hop across the DR to the world of OSPF unicast routing. Over this and at the same demarcation we have the PIM protocol. Each demarcation and layer introduces another level of obscurity where the service has to be ‘traced and mapped’ into each technology domain. As a result, intermittent multicast problems can go on for quite some time until the right forensics are gathered to resolve the root cause of the problem.

With SPB, many if not all of these demarcations and overlays are eliminated. As a result, something that is somewhat of a Holy Grail in networking occurs. This is called 'services transparency'. The end to end network path for a given service can be readily established and diagnosed without referring to protocol demarcations and 'stitch points'. As previously shown, IP multicast services are a primary beneficiary of this network evolution. The elimination of protocol overlays provides for a stateful data forwarding model at the level where it makes the most sense: at the data forwarding element itself.

Network diagnostics become vastly simplified as a result. Measuring end to end latency and connectivity becomes a very straightforward endeavor. Additionally, diagnosing the multicast service path, something that is notoriously nasty with PIM, becomes very straightforward and even predictable. Tools such as IEEE 802.1ag and ITU Y.1731 provide diagnostics on network paths, end to end and nodal latencies, and all of this can be established end to end along the service path without any technology demarcations.

In Summary

IEEE 802.1aq Shortest Path Bridging is proving itself to be much more than a next generation data center mesh protocol. As previous articles have shown, the extensive reach of the technology lends itself well to metro and regional distribution as well as true wide area deployment. Additional capabilities added to SPB, such as the ability to deliver true L3 IP multicast without the use of a multicast routing overlay such as PIM, clearly demonstrate the extensibility of the protocol as well as its extremely practical implementation uses. The convergence of the routing intelligence directly into the switch forwarding logic results in an environment which can provide for extremely fast (sub-second) stateful convergence, which is of definite benefit to the IP multicast service model. As such, IP multicast environments can benefit from enhanced state, which in turn results in increased performance and scale.

End to end services transparency provides for a clear diagnostic environment that eliminates the complexities of protocol overlay models. This drastic simplification of the protocol architecture results in the ability for direct end to end visibility of IP multicast services for the first time.

So when someone asks, “IP Multicast without PIM? No more RP’s?” you can respond with “With Shortest Path Bridging, of course!”

I would also urge you to follow the blog site of my esteemed colleague Paul Unbehagen, Chair and Author of the IEEE 802.1aq “Shortest Path Bridging” standard. You can find it at:


For more information please feel free to visit

Also please visit our VENA video on YouTube that provides further detail and insight. You can find this at:


Seamless Data Migration with Avaya’s VENA framework

November 23, 2011

There are very few technologies that come along which actually make things easier for IT staff. This is particularly true with new technology introductions. Very often, the introduction of a new technology is problematic from a systems service up time perspective. With networking technologies in particular, new introductions often involve large amounts of intermittent down time and a huge amount of human resources to properly plan the outages and migration processes to assure minimal down time. More so than any other, network core technologies tend to be the most disruptive due to their very nature and function. Technologies like MPLS are a good example: MPLS requires a full redesign of the network infrastructure as well as very detailed design within the network core itself to provide connectivity. While some argue that things like MPLS-TP help to alleviate this, it is not without cost – and the disruption remains.

IEEE 802.1aq or Shortest Path Bridging (SPB for short) is one of those very few technologies that can be introduced in a very seamless fashion with minimal disruption or down time. It can also be introduced with minimal redesign of the existing network if so desired. A good case in point is a recent project that we have been working on with a large health care provider in the northeast US. This was a long time Avaya networking customer who had an installed base of existing ERS 8600 routing switches. There was a particular portion of the topology that interconnected the customer's two data centers, which were located in separate geographic locations. This was the portion of the network topology that they chose to upgrade and where they chose to introduce Shortest Path Bridging.

The original intention was to upgrade the existing backbone switches to code that could support Shortest Path Bridging (v7.1). They would then build out a parallel routed core in the resulting new ISIS routing plane. The ISIS environment would be kept latent and secondary by setting its global route preference to a value less preferred than OSPF. Typically, this value is set at 130 (the default for ISIS is 7). Once the parallel routed core is built out as a mirror to OSPF, the systems are checked for validity and, once assured of stability, the preference of ISIS is reset back to its default value of 7. ISIS then becomes the primary routed plane and OSPF is relegated to a secondary role. After system checks and validation, the OSPF network can be kept as secondary for as long as required. Then, at a later point in time, it can be decommissioned to leave ISIS as the sole core routing protocol for the enterprise core. This is a very seamless migration that provides for zero downtime to the overall networking core.
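The cutover logic can be illustrated with a small sketch. When the same prefix is present in both routed cores, the protocol with the lower preference value wins; the ISIS values of 130 and 7 come from the description above, while the OSPF value shown is purely illustrative.

def best_route(candidates):
    # candidates: list of (protocol, preference, next_hop); lower preference wins
    return min(candidates, key=lambda route: route[1])

during_migration = [("OSPF", 20, "legacy-core"), ("ISIS", 130, "spb-core")]
after_cutover    = [("OSPF", 20, "legacy-core"), ("ISIS", 7,   "spb-core")]

print(best_route(during_migration))   # the OSPF path remains primary
print(best_route(after_cutover))      # the ISIS/SPB core becomes primary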

After a survey of the equipment, however, it became obvious that due to its age (circa 2000-2001) and slot density requirements, it would need to be completely upgraded – including the switch chassis. Rather than view this as an impediment, we quickly realized that by implementing a parallel routed core infrastructure the upgrade and migration of the critical path could be accomplished with little or no down time to the network core. This was in comparison to a gradual swap out and upgrade of the existing core, which would have meant multiple outage occurrences for each chassis swap out.

The theory was based on the diagram below, which shows the existing OSPF routed core running in parallel to a new SPB based ISIS routed core. By using a series of migration techniques which we will cover shortly, both routed cores would work in tandem with networks gradually migrated over to the new ISIS routed core in a controlled and phased approach.

Figure 1. Parallel OSPF and ISIS routed cores


The first step in the project was to account for the various VLAN's that were provisioned in the existing OSPF routed core. Part of this was also to identify which of two types each one was. The first type was VLAN's that did not traverse the routed core by the use of Q-tagged trunks. These we identified as 'peripheral VLAN's' in that the only Q-tagged trunks that they ran over were along the edge and over the SMLT 'Inter-Switch Trunks'. The second type was a VLAN that existed in multiple places in the routed core and hence traversed the routed core by the use of Q-tagged trunks. These we labeled as 'traversal VLAN's'. Figure 2 illustrates the difference between the two VLAN types. This was an important step in the investigation because, as one will see, it largely determined the migration method for a given VLAN.
As is noted in other white papers, SPB offers various provisioning options. These are listed below for the convenience of the reader.

                L2 Virtual Service Network

This is a provisioned path across SPB, known as an I-SID in IEEE terms, that inter-connects VLAN's at the SPB edge. Taken as such it can be termed a VLAN extension method somewhat analogous to Q-tagged extensions.

                L3 Virtual Service Network

This is a provisioned path across SPB, known as an I-SID in IEEE terms, that inter-connects VRF's at the SPB edge. Taken as such it can be termed an IP VPN method somewhat analogous to VRF Lite.

Inter-VSN routing

This is a method of interconnecting Virtual Service Networks by the use of external routers or other devices. A good usage example is in a data center topology where user or 'dirty' VSN's interconnect to data center or 'clean' VSN's by the use of security perimeter technologies such as firewalls and intrusion protection type devices.

IP Shortcuts

This final method does not involve the use of VSN's at all but instead works on the injection of IP routing directly into ISIS, utilizing ISIS as an actual interior gateway protocol or IGP.


For the purposes of the migration we chose to use a combination of IP shortcuts in order to implement ISIS as the replacement core routed topology and L2 VSN’s to facilitate the connectivity to support the ‘traversal VLAN’s’ which would require multiple points of presence across the routed core.

In essence the network core migration involved three major steps:

1). Build out parallel network segments that match in almost every sense topologically. The new segment will run ISIS/SPBm as its core protocol. A migration link will be set up between the two routed domains to provide for a communication path during the migration. This link will be an MLT configuration for both bandwidth capacity and resiliency.

2). Redistribute VLAN’s and IP routes into the SPBm ISIS core on a switch by switch and VLAN by VLAN basis. Both ISIS and OSPF routing domains will be utilized throughout the migration process.

3). After all network migrations are completed the OSPF network core is to be dismantled.

If properly orchestrated and implemented, we strongly felt that this could be accomplished with zero network downtime for the local core network. There would however be short outages for each switch as it is migrated over to the SPBm/ISIS core. There would also be short outages for the individual VLAN’s during the final migration steps over to the new ISIS core. These however would be minimal and could also be scheduled during opportune windows that the IT staff had on a regular basis. The rest of this document will provide a more detailed outline of the three project phases listed above.

The diagram below illustrates the various types of VLAN's and how they relate to the overall parallel routed cores. Note that with the introduction of SPB there is an additional type of VLAN (subnet) that is introduced, which is a traversal VLAN that is in the process of migrating to the new routed core but still uses OSPF as its IGP. This required a number of items to work successfully. First, we needed to interconnect the VLAN by the use of L2 VSN's (I-SID's) across the SPB ISIS routed core. This provided for connectivity, but because of its L2 nature it provides an extension back into the OSPF environment, NOT the use of ISIS in an L3 sense. Additionally, we added OSPF to ISIS and ISIS to OSPF redistribution at the migration link interface between the new and existing cores. This provided the ability for the migrating VLAN (subnet) to have routed connectivity into the new ISIS routed core via redistribution but still use OSPF as its IGP. As the resident switches and systems were migrated over, the VLAN (subnet) would eventually be redistributed directly into ISIS and effectively decommissioned from the OSPF routed core. Again, by the use of the OSPF to ISIS and ISIS to OSPF redistribution, the completely migrated network would still have connectivity over to the older OSPF routed core and vice versa. With the exception of the actual movement of switches and the decommissioning of the subnet from OSPF and redistribution into ISIS, the network downtime would be zero. More importantly, there would never be a time when the network core was not functional in a holistic sense.

Figure 2. Various migration VLAN types
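A minimal sketch of the mutual redistribution at the migration link is shown below. Prefixes and policy contents are placeholders rather than the customer's actual addressing; the point is simply that routes flow both ways while a suppression policy keeps subnets such as the IST subnet local.

def redistribute(routes, suppress):
    # Return the routes allowed to be exported into the other routing domain.
    return [prefix for prefix in routes if prefix not in suppress]

ospf_routes = ["net-A/24", "net-B/24", "ist-subnet/30"]
isis_routes = ["net-C/24", "net-D/24"]

into_isis = redistribute(ospf_routes, suppress={"ist-subnet/30"})
into_ospf = redistribute(isis_routes, suppress=set())

print(into_isis)   # ['net-A/24', 'net-B/24'] -- the IST subnet is not advertised
print(into_ospf)   # ['net-C/24', 'net-D/24']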

Taking a closer look at the ISIS side in the illustration below will provide a better feel for the actual topology in action. As noted in the diagram, we show the three VLAN types in the new SPB ISIS environment. First, the completely migrated dual homed VLAN is simply redistributed into ISIS and routed accordingly. Because it is provisioned as a Q-tagged VLAN over the edge SMLT IST, there is no use of VSN's; the peripheral VLAN is simply redistributed directly into ISIS by the use of IP shortcuts.

In the case of the traversal VLAN's, the illustration shows VLAN A, which is a completely migrated traversal VLAN that is set up with VRRP Master Backup at various points for router redundancy. The VLAN (subnet) is then redistributed directly into the ISIS routed core by the use of IP shortcuts. This provides for the multiple points of presence required in the routed core by the use of L2 VSN's and for the IP connectivity into the ISIS routed core by the use of IP shortcuts. The third VLAN type (VLAN C) is a migrating VLAN (subnet). As pointed out above, this is a VLAN that is extended over from the old routed core by the use of Q-tagging (old side) and L2 VSN's (new side). As the diagram also shows, the migrating VLAN C (subnet) will continue to use OSPF as its routing protocol until all systems are moved over to the new core. At that point in time, the subnet is decommissioned in OSPF and redistributed directly into ISIS, and will mirror VLAN A with the exception that there will be four VRRP instances and not two.

Figure 3. A closer view of the new ISIS core and various VLAN types

Normally in such a scenario, one would have to deal with prefix lists and route policies to suppress the advertisement of the networks on one side or the other, as they are co-resident in both routed cores. We were able to avoid this by simply not assigning IP addresses to the VLAN's in the new core during the migration. With no IP addresses assigned, the VLAN's would simply not be redistributed directly into ISIS, and all systems connected to the subnet would use OSPF for all IP routing until the final migration step.

Prior to the actual migration project we thought it prudent to test out the migration scenario as well as use the environment to provide knowledge transfer to the customer. As a result we set up an OSPF environment in the lab prior to actual deployment that looked like the topology below in figure 4. Note that both VLAN types (peripheral and traversal) are represented in the diagram. The switch in the lower left hand portion of the illustration provided for the OSPF routed environment in the lab test. Note that OSPF is also supported on the SPB core in the form of an ASBR function. In this example, the IP networks on the OSPF side have connectivity to the subnets on the ISIS side over the redistribution subnet, by the use of OSPF to ISIS and ISIS to OSPF redistribution. In summary, all IP subnets have routed connectivity to one another.

Figure 4. Existing provisioned SPB ISIS core.

The next step was to introduce a migration VLAN into the lab test. We did this by creating a new VLAN on the OSPF side and assigning it an IP subnet. As the figure below shows, we were able to extend that VLAN out across the migration link by the use of Q-tags on the OSPF side and L2 VSN's in the new SPB ISIS core. We then emulated system moves over to the new core. Note that during this time the migrating VLAN utilized OSPF as its IGP. As a final step of the migration, the VLAN was deleted from the OSPF environment, including any Q-tag extensions, and then assigned IP addresses and VRRP Master Backup on the ISIS side and redistributed directly into ISIS.

Figure 5. Migration VLAN case point example

The migration steps can be summarized as follows:

1). VLAN is extended over migration MLT from OSPF side

2). VLAN is assigned at required points of presence. NO IP addresses configured yet!

3). Add port members as required and create the I-SID to connect the VLAN's together

4). Migration can now proceed (systems are moved over to new core)

5). Upon completion, decommission network from legacy OSPF side (short outage)

6). Assign IP addresses at the required VLAN POP's, set up & enable VRRP Master Backup

7). Remove VLAN from migration MLT (clean up)

The following shows the CLI sequence to perform these steps. Note that ISIS redistribute direct is already set up in the environment. For clarity and reference, the redistribution method for DC3-8800-1 is shown below:





ISIS to OSPF redistribution.

ip ospf redistribute isis create

ip ospf redistribute isis enable

ip ospf redistribute direct create

Direct to OSPF redistribution. The "suppress_IST" route policy is used to not advertise the IST subnet.

ip ospf redistribute direct route-policy “suppress_IST”

ip ospf redistribute direct enable

OSPF to ISIS redistribution.

ip isis redistribute ospf create

ip isis redistribute ospf metric 1

ip isis redistribute ospf enable

ip isis redistribute direct create

ip isis redistribute direct metric 1

Direct to ISIS redistribution. The "suppress_NETS" route policy is used to not advertise the IST subnet as well as other subnets that may require suppression during migration.

ip isis redistribute direct route-policy “suppress_NETS”

ip isis redistribute direct enable




Simple accept policy to ignore advertisements coming from its IST peer (DC3-8800-2). This avoids less than optimal IP routes.



ip ospf accept adv-rtr create

ip ospf accept adv-rtr enable

ip ospf accept adv-rtr route-policy “reject”

As a result of the above, as soon as the VLAN is assigned IP addressing and VRRP Master Backup it will have routed connectivity into ISIS; no other steps are required. Also note that the network needs to be decommissioned in OSPF BEFORE being provisioned into the ISIS environment. This will involve a short outage (minutes) for the given subnet. The subnet will then have connectivity back into the OSPF side by the use of the route redistribution occurring at the migration link point, which again has already been configured as per the above.

1). Set up VID 3 on the migration MLT (new side) on both DC3-8800-1 & DC3-8800-2 (assuming this is already done on the 5510)

            config vlan 3 create byport 1 name "TEST_MIG1"

            config vlan 3 add-mlt 2

2). Set up port & I-SID configuration. Both DC3-8800-1&2

            config vlan 3 ports add <members> (i.e. 10/9-10/10)

            config vlan 3 i-sid 3

3). Set up port & I-SID configuration on each required switch, DC1-8800-1 & DC1-8800-2.

config vlan 3 create byport 1 name "TEST_MIG1"

config vlan 3 ports add <members> (i.e. 10/9-10/10)             

config vlan 3 i-sid 3

4). Migrate systems as appropriate (note – during migration the VLAN still uses OSPF due to the fact that no IP addresses are yet assigned on the ISIS side)

5). Once migration is complete, the VLAN (VID 3) is decommissioned from the 5510 (legacy OSPF environment)

6). Assign IP addresses. Enable VRRP Master Backup and VRRP on DC3-8800-1&2, DC1-8800-1&2

            config vlan 3 ip create 10.0.13.*/

            config vlan 3 ip dhcp-relay enable

            config vlan 3 ip vrrp 3 address

            config vlan 3 ip vrrp 3 backup-master enable

            config vlan 3 ip vrrp 3 enable


* is 3, 4 or 5 as required.


MIGRATION IS COMPLETE! The migrated subnet should now be visible to the 5510 across the OSPF-ISIS redistribution. Every subnet will have routed connectivity to the others.







As can be seen by the example provided here, what can be a very complex migration project is greatly simplified into a concise set of simple steps by the use of Shortest Path Bridging and VENA. The OP/EX improvements when compared to other network virtualization technologies like MPLS are dramatic. Moreover, network downtime is predictable, controllable and very short in comparison.

Avaya's VENA architecture facilitates a flexible yet powerful infrastructure that allows for this type of capability. It is also important to note that only a subset of the network services offered by VENA is used in this case in point. Very few technologies can claim such ease of introduction, let alone actually ease the migration that they themselves require in order to be effectively used.

For more information please feel free to visit

Also please visit our VENA video on YouTube that provides further detail and insight. You can find this at:

 Happy Holidays to all!

With the very best wishes for the New Year!



Next Generation Mesh Networks

June 10, 2011


The proper design of a network infrastructure should provide a number of key traits. First, the infrastructure needs to provide redundancy and resiliency without a single point of failure. Second, the infrastructure must be scalable in both geographic reach as well as bandwidth and throughput capacity.

Ideally, as one facet of the network is improved, such as resiliency, bandwidth and throughput capacity should improve as well. Certain technologies work on the premise of an active/standby method. In this manner, there is one primary active link – all other links are in a standby state that will only become active upon the primary link's failure. Examples of this kind of approach are 802.1d Spanning Tree and its descendants, Rapid and Multiple Spanning Tree, in the layer 2 domain and non-equal cost distance vector routing technologies such as RIP.

While these technologies do provide resiliency and redundancy, they do so on the assumption that half of the network infrastructure sits unused and that a state of failure needs to occur in order to leverage those resources. As a result, it becomes highly desirable to implement active/active resiliency wherever possible to allow for these resources to be used in the day to day operations of the network.


Active/Active Mesh Switch Clustering


The figure below illustrates a very simple active/active mesh fabric. As in all redundancy and resiliency methods, topological separation is a key trait. As shown in the diagram below, the two bottom switches are interconnected by a type of trunk known as an 'inter-switch trunk' or IST, which allows for the virtualization of the forwarding database across the core switches. The best and most mature iteration of this technology is Avaya's Split Multi-Link Trunking or SMLT. First invented in 2001 and now moving into its 3rd generation, this effectively creates a virtualized switch that is viewed as a single switch by the edge switches in the diagram. Because of this, the edge switches can utilize de facto or industry standard multiple link technologies such as Multi-Link Trunking (MLT) or link aggregation (LAG) respectively. Because the virtualized switch cluster appears as a single chassis, these links can be dual homed to the two different switches at the top of the diagram to deliver active/active load balanced connectivity out to the edge switches.


Fig.1 A simple Active/Active Mesh Switch Topology

Due to the fact that all links are utilized, there is far better utilization of network resources. Additionally, because of this active/active mesh design, the resiliency and failover times offered are dramatically faster than comparable active/standby methods.
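As a rough sketch of why active/active links raise usable capacity, the example below hashes conversations across both uplinks of a dual homed MLT/LAG bundle. The hash inputs shown are a common choice but are an assumption here; real platforms vary in the fields they hash.

import zlib

def pick_uplink(src_mac, dst_mac, uplinks):
    # Hash the conversation so every frame of a flow stays on one member link.
    key = f"{src_mac}-{dst_mac}".encode()
    return uplinks[zlib.crc32(key) % len(uplinks)]

uplinks = ["to-core-switch-1", "to-core-switch-2"]
print(pick_uplink("aa:aa:aa", "bb:bb:bb", uplinks))
print(pick_uplink("aa:aa:aa", "cc:cc:cc", uplinks))
# Different conversations may land on different uplinks; both links stay active.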

While the diagram above illustrates a very simple triangulated topology, active/active mesh designs can become much more sophisticated, such as box, full mesh and mesh ladder topologies. These additional topologies are shown in the diagram below. The benefit of these is that as the network topology is extended, both resiliency and capacity need not be sacrificed.

                                        box                   full mesh              ladder mesh

Fig. 2 Extended Active/Active Mesh Topologies

As can be seen in the diagram above, these topologies can be very sophisticated and provide a very high degree of resiliency while enhancing the overall capacity of the network.


Topological Considerations for Active/Active Mesh Designs –


Most network topologies consist of various regions that provide certain functions. Depending on the region, there may be different features required that are specific to that region. As an example, within the network core high capacity load sharing trunks are a requirement, whereas at the network edge features like Power over Ethernet (PoE) are required in order to supply DC voltage to power VoIP handsets or other such devices.

Typically, these regions are divided into three sections of the topology; the network Core, Distribution and Edge. Below are short descriptions of each region and the role that they play. It should be noted that the distribution region is not required in all instances and should be viewed as an option.


The Network Core –


In a typical topology model, the individual network regions are interconnected using a core layer. The core serves as the backbone for the network, as shown in Figure 3. The core needs to be fast and extremely resilient because every network region depends on it for connectivity. Hence, active/active mesh topologies such as SMLT play a very valuable role here. Even though the Core and Distribution Layers may be the same hardware, their roles are different and they should be viewed as logically different layers. Also, as noted above, the distribution layer is not always required. In the core of the network a “less is more” approach should be taken. A minimal configuration in the core reduces configuration complexity, limiting the possibility for operational error. Ideally the core should be implemented and remain in a stable state with minimal adjustments or changes.

Fig 3. Simple Two Tier Switch Core

 The following are some of the other key design issues to keep in mind:

Design the core layer as a high-speed, Layer 3 (L3) or Layer 2 (L2) switching environment utilizing only hardware-accelerated services. Active/active mesh core designs are superior to routed and other alternatives because they provide:

Faster convergence around a link or node failure.

Increased scalability because neighbor relationships and meshing are reduced.

More efficient bandwidth utilization.

Use active/active meshing as well as topological distribution to enhance the overall resiliency of the network design.

Avoid L2 loops and the complexity of L2 redundancy, such as Spanning Tree Protocol (STP) and indirect failure detection for L3 building block peers.

If topology requires, utilize L3 switching in the active/active mesh core to provide for optimal sizing of the MAC forwarding table within the network core.

The Distribution Layer –

Due to the scale and capacity of active/active mesh core designs, the distribution layer is optional. It is far more efficient to dual home the network edge directly to the network core. This approach negates any aggregation or latency considerations that come into play with the use of a distribution layer. The active/active mesh topology provides better utilization of trunk feeds, and capacity can be scaled by multiple links in a dual homed fashion.

While the ideal topology is what is termed a two tier design, it is sometimes necessary to introduce a distribution layer to address certain topology or capacity issues. Instances where a distribution layer might be entertained in a design are as follows:

  • Where the required reach is outside of available trunk distances.
  • Where the port count capacity in that portion of the network core cannot support all of the edge connections without expansion, and expansion in the core is not desired.
  • Where logical topology issues such as Virtual LAN's or port aggregation require it.

It should be noted though that all of the above instances could be addressed by the expansion of the network core. Examples of this are moving from a dual to a quad core design or, going further, moving to a mesh ladder topology as shown in figure 2.
In any instance it is more desirable to maintain a two tier rather than a three tier design if possible. The overall design of the network is far more efficient and resiliency convergence times become optimized. The diagram below shows a three tier design that utilizes an intermediate distribution or aggregation layer.

Fig. 4. Simple Three Tier Network

Note that topologies can be hybrid. As an example, most of the network can be designed around a two tier architecture with one or two regions that are interconnected by distribution layers for one or more of the reasons noted above.

The Network Edge

The access layer is the first point of entry into the network for edge devices, end stations, and IP phones (see Figure 5). The switches in the access layer are connected to two separate distribution layer switches for redundancy. If the connection between the distribution layer switches is to an active/active mesh, then there are no loops and all uplinks actively forward traffic.

A robust edge layer provides the following key features:

High availability (HA) supported by many hardware and software attributes.

Inline power for IP telephony and wireless access points, allowing customers to converge voice onto their data network and providing roaming WLAN access for users.

Foundation services.

The hardware and software attributes of the access layer that support high availability include the following:

Default gateway redundancy using dual active/active connections to redundant systems (core or distribution layer switches) that use industry standard or vendor specific load balancing or virtual gateway protocols such as VRRP or Avaya's VRRP w/ Backup Master or R/SMLT. This provides fast failover of the default gateway and IP paths. Note that with an active/active core or distribution mesh topology, link and node resiliency and convergence are handled by the L2 topology, which is much faster than any form of L3 IP routing convergence. As a result, any failover within the active/active mesh is well within the L3 routed timeout.

Operating system high-availability features, such as Link Aggregation or Multi-Link Trunking, which provide higher effective bandwidth that leverages the active/active mesh while reducing complexity.

Prioritization of mission-critical network traffic using QoS. This provides traffic classification and queuing as close to the ingress of the network as possible.

In figure 5 the diagram illustrates a build out of a hybrid two/three tier network showing active/active load sharing interconnections with all network edge components.

Fig 5.  Full Resilient Active/Active Network Topology

Also note that, as shown in figure 5, active/active connections can also be established within the Data Center via top of rack switching to facilitate load sharing, highly resilient links down to server nodes. Again, such resiliency is provided at L2 and is totally independent of the overlying IP topology or addressing.



Provisioned Virtual Network Topologies –

An evolution of active/active mesh topologies is provided by the ratification of IEEE 802.1aq “Shortest Path Bridging”, or SPBm for short (the 'm' standing for MAC-in-MAC – IEEE 802.1ah). This technology is an evolution of earlier carrier grade implementations of Ethernet bridging that were designed for metro and regional level reach and scale. The major drawback of these earlier methods was that they were based on modified spanning tree architectures that made the network complex to design and scale. IEEE 802.1aq resolves these issues with the implementation of link state adjacencies within the L2 switch domain, in the same manner as L3 link state adjacencies are established by protocols such as IS-IS and OSPF. All nodes within the SPB domain (which use ISIS to establish adjacencies) then run Dijkstra to establish the shortest path to all other nodes in the active/active mesh cloud. Reverse path forwarding checks provide the ability to prevent loops in all data forwarding instances in a manner that is very similar to that provided in L3 routing. IEEE 802.1aq provides a cornerstone technology for Avaya's Virtual Enterprise Network Architecture or VENA. The VENA framework utilizes SPBm as a foundational technology for many next generation cloud service models that are either offerable today or currently under development at Avaya.

This next generation virtualization technology will revolutionize the design, deployment and operations of the Enterprise Campus core networks along with the Enterprise Data Center. The benefits of the technology will be clearly evident in its ability to provide massive scalability while at the same time reducing the complexity of the network. This will make network virtualization a much easier paradigm to deploy within the Enterprise environment.

Shortest Path Bridging eliminates the need for multiple protocols in the core of the network by separating the connectivity services from the protocol infrastructure. By reducing the core to a single protocol, the idea of 'build it once and not have to touch it again' becomes a reality. This simplicity also aids in greatly reducing time to service for new applications and network functionality.

The design of networks has evolved throughout the years with the advent of new technologies and new design concepts. IT requirements drive this evolution and the adoption of any new technology is primarily based on the benefit it provides versus the cost of implementation.

The cost in this sense is not only cost of physical hardware and software, but also in the complexity of implementation and on-going management. New technologies that are too “costly” may never gain traction in the market even though in the end they provide a benefit.

In order to change the way networks are designed, the new technologies and design criteria must be easy to understand and easy to implement. When Ethernet evolved from a simple shared media with huge broadcast domains to a switched media with segregated broadcast domains, there was a shift in design. The ease of creating a VLAN and assigning users to that VLAN made it commonplace and a function that went without much added work or worry. In the same sense, Shortest Path Bridging allows for the implementation of network virtualization in a true core distribution sense.



The key value propositions for IEEE 802.1aq SPBm include:



  • IEEE 802.1aq standard
  • Unmatched Resiliency
  • Single robust protocol with sub-second failover
  • Optimal network bandwidth utilization
  • One protocol for all network services
  • Plug & Play deployment reduces time to service
  • Evolved from Carrier with Enterprise-friendly features
  • Separates infrastructure from connectivity services
  • No constraints on network topology
  • Easy to implement virtualization

There are some major features within SPBm that lend themselves well to a scalable and resilient enterprise design. Two major points are as follows:

1). Separation of the Core and the Edge

SPBm implements IEEE 802.1ah ‘MAC-in-MAC’, which provides a boundary separation between the data forwarding methods used in the network core and those used at the edge. It provides a clear delineation between the normal Ethernet ‘learning bridge’ environment required for local area network operations and the SPBm core cut-through switching environment, where performance and optimal path selection are the most important criteria. As a result, SPBm yields a core network surrounded by smaller edge forwarding environments whose MAC tables are effectively isolated. Within the SPBm core itself, the only MAC addresses in the forwarding tables are those of the SPBm switches. The IEEE 802.1aq SPBm core is therefore very high performance and very scalable, able to utilize multiple forwarding paths while maintaining a clear delineation between the network core and edge.
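
One simplified way to picture this boundary is as an encapsulation step performed at the BEB: the customer frame, C-MAC addresses and all, is wrapped in a backbone header carrying only B-MAC addresses and the service identifier, so the core forwards on backbone MACs and never learns customer MACs. The Python sketch below is a conceptual model only, not a wire-accurate frame builder; the field and function names are illustrative assumptions.

    from dataclasses import dataclass

    @dataclass
    class CustomerFrame:
        c_dst: str      # customer destination MAC
        c_src: str      # customer source MAC
        payload: bytes

    @dataclass
    class BackboneFrame:
        b_dst: str      # backbone destination MAC (remote BEB)
        b_src: str      # backbone source MAC (local BEB)
        i_sid: int      # 24-bit service instance identifier
        inner: CustomerFrame

    def beb_encapsulate(frame, local_beb_mac, remote_beb_mac, i_sid):
        """At the ingress BEB: wrap the customer frame in a MAC-in-MAC header."""
        return BackboneFrame(remote_beb_mac, local_beb_mac, i_sid, frame)

    def bcb_forward(bframe, fdb):
        """In the core: a BCB looks up only the backbone MAC, never the customer MACs."""
        return fdb[bframe.b_dst]   # next hop chosen purely on the B-MAC

    def beb_decapsulate(bframe):
        """At the egress BEB: strip the backbone header and hand back the customer frame."""
        return bframe.i_sid, bframe.inner

    # Example: wrap a customer frame for transport across the SPB core
    cf = CustomerFrame("aa:aa:aa:aa:aa:01", "aa:aa:aa:aa:aa:02", b"payload")
    bf = beb_encapsulate(cf, "be:b1:00:00:00:01", "be:b2:00:00:00:01", 20200)
    print(bf.i_sid, bf.inner.c_dst)

The point of the model is simply that core forwarding decisions reference only the small, stable set of backbone MACs, which is what keeps the SPBm core tables compact.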

2). Virtual Provisioning Fabric

As noted earlier, IEEE 802.1aq evolved from earlier carrier grade implementations of Provider Backbone Bridging. There are two things that are key to a provider based offering. First, no customer should ever see another customer’s traffic; there needs to be complete and total service separation. Second, there must be a robust and detailed method for Operations, Administration and Maintenance (OAM) and Connectivity Fault Management (CFM), which is addressed by IEEE 802.1ag and is used by SPBm for those purposes.

The first requirement is addressed by SPBm’s ability to create isolated data forwarding environments in a manner similar to VLANs in the traditional learning bridge fashion. In the SPBm core there is no learning function required. As such, these forwarding paths provide total separation and allow for very deterministic forwarding to associated resources across the SPBm core. These paths, termed Service Instance Identifiers or I-SIDs, allow virtual network topologies of a very wide variety of forms to be provisioned.

In addition, due to the established topology of the SPBm domain, these I-SIDs are provisioned at the edge of the SPBm cloud. There is no need to go into the core to do any provisioning to establish the end to end connectivity. This contrasts with normal VLANs, which require each and every node to be configured properly.

The figure below shows these two features and how they relate to the network edge, in this case a distribution layer.

Fig. 6  MAC-in-MAC and I-SID’s within SPBm

As an example, I-SIDs can be used to connect Data Centers together with very high performance, cut-through dedicated paths for things such as Virtual Machine migration, stretched server clusters or data storage replication. The figure below illustrates the use of an L2 I-SID in this fashion.

 Fig. 7. End to end IEEE 802.1aq L2 I-SID providing a path for V-Motion

Additionally, complete Data Center architectures can be built that provide all of the benefits of traditional security perimeter design along with the benefits of full virtualization of the network infrastructure. The figure below shows a typical Data Center design implemented by inter-connected I-SIDs in a Shortest Path Bridging network. This effectively shows that not only is SPBm an ideal core network technology, it is also an optimal data center bridging fabric.

Fig. 8. Full Data Center Security Zone


Finally, complex L3 topologies can be built on top of SPBm that can utilize traditional routing technologies and protocols, or can satisfy the network’s L3 forwarding requirements by use of the native L2 link state routing within SPBm provided by IS-IS. The illustration below shows a network topology in which all methods are utilized to provide for a global enterprise design.

Fig. 9  Full end to end Virtualized Network Topology over an IEEE802.1aq cloud

Shortest Path Bridging Services Types

Avaya’s implementation of Shortest Path Bridging provides a tremendous level of flexibility to support multiple service types simultaneously, singly or in tandem.

One of the key advantages of the SPB protocol is the fact that network virtualization provisioning is achieved by just configuring the edge of the network, thus the intrusive core provisioning that other Layer 2 virtualization technologies require is not needed when new connectivity services are added to an SPB network.

Shortest Path Bridging Layer 2 Virtual Services Network (L2 VSN)

Layer 2 Virtual Services Networks are used to transparently extend VLANs through the backbone.  An SPB L2 VSN topology is simply made up of a number of Backbone Edge Bridges (BEBs) used to terminate Layer 2 VSNs. The control plane uses IS-IS, while forwarding occurs at Layer 2. Only the BEB bridges are aware of any VSN and its associated edge MAC addresses; the backbone bridges simply forward traffic at the backbone MAC (B-MAC) level.

Figure 10. L2 Virtual Service Networks

A backbone service Instance Identifier (I-SID) used to identify the Virtual Services Network will be assigned on the BEB to each VLAN. All VLANs in the network sharing the same I-SID will be able to participate in the same VSN.
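
Conceptually, building an L2 VSN amounts to tagging each edge VLAN with the same I-SID on its BEB; every VLAN that shares the I-SID becomes part of the same virtual service. The snippet below models that membership rule in plain Python as a sketch; the BEB names, VLAN IDs and I-SID values are hypothetical.

    from collections import defaultdict

    # (BEB, local VLAN ID) -> I-SID, provisioned only at the network edge
    edge_provisioning = {
        ("BEB-A", 200): 20200,
        ("BEB-B", 200): 20200,
        ("BEB-C", 300): 20300,
    }

    def vsn_membership(provisioning):
        """Group edge VLANs by I-SID: every VLAN mapped to the same I-SID joins one L2 VSN."""
        vsns = defaultdict(set)
        for (beb, vlan), i_sid in provisioning.items():
            vsns[i_sid].add((beb, vlan))
        return dict(vsns)

    print(vsn_membership(edge_provisioning))
    # e.g. {20200: {('BEB-A', 200), ('BEB-B', 200)}, 20300: {('BEB-C', 300)}}

Note that nothing in this membership model refers to the core bridges at all, which is exactly the edge-only provisioning property described above.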


Shortest Path Bridging Inter-VSN Routing (Inter-ISID Routing)

Inter-VSN Routing allows routing between IP networks on Layer 2 VLANs with different I-SIDs. As illustrated in the diagram below, routing between VLAN 10, VLAN 100 and VLAN 200 occurs on one of the SPB core switches in the middle of the diagram. 

Figure 11. Inter-VSN routing

Although in the middle of the network, this switch provides “edge services” and has I-SIDs and VLANs provisioned on it, and therefore is designated as a BEB switch.  End users from the BEB switches as shown on the right and left of the diagram are able to forward traffic between their respective VLANs via the VRF instance configured on the switch shown.  For additional IP level redundancy, Inter-VSN Routing may also be configured on another switch and both can be configured with VRRP to eliminate single points of failure.


Shortest Path Bridging Layer 3 Virtual Services Network (L3 VSN)

An SPB L3 VSN topology is very similar to an SPB L2 VSN topology, with the exception that the backbone service Instance Identifier (I-SID) is assigned at the Virtual Router (VRF) level instead of at the VLAN level. All VRFs in the network sharing the same I-SID will be able to participate in the same VSN. Routing within a single VRF in the network occurs normally, as one would expect.  Routing between VRFs is possible by using redistribution policies and injecting routes from another protocol, e.g., BGP, even if BGP is not used within the target VRF.

Figure 12. L3 Virtual Service Networks

Layer 3 Virtual Service Networks provide a high level of flexibility in network design by allowing IP routing functionality to be distributed among multiple switches without proliferation of multiple router-to-router transit subnets.


SPB Native IP shortcuts

The services described to this point require the establishment of Virtual Service Networks and their associated I-SID identifiers.  IP Shortcuts enables additional flexibility in the SPB network to support IP routing across the SPB backbone without configuration of L2 VSNs or L3 VSNs.


Figure 13. Native IP GRT Shortcuts

IP shortcuts allow routing between VLANs in the global routing table/network routing engine (GRT). No I-SID configuration is used.

Although operating at Layer 2, IS-IS is a dynamic routing protocol.  As such, it supports route redistribution between itself and any IP route types present in the BEB switch’s routing table.  This includes local (direct) IP routes and static routes as well as IP routes learned through any dynamic routing protocol including RIP, OSPF and BGP.

IP routing is enabled on the BEB switches, and route redistribution is enabled to redistribute these routes into IS-IS.  This provides normal IP forwarding between BEB sites over the IS-IS backbone.


 BGP-Based IP VPN and IP VPN Lite over Shortest Path Bridging

Avaya’s implementation of Shortest Path Bridging has the flexibility to support not only the L2 and L3 VSN capabilities and IP routing capabilities as described above, but also supports additional IP VPN types.  BGP-Based IP VPN over SPB and IP VPN Lite over SPB are features supported in the Avaya implementation of Shortest Path Bridging. 

Figure 14. BGP IP VPN over IS-IS

BGP IP VPNs are used in situations where it is necessary to leak routes into IS-IS from a number of different VRF sources.  Additionally, using BGP IP VPN support over SPB, it is possible to provide hub and spoke configurations by manipulating the import and export Route Target (RT) values. This allows, for example, a server frame in a central site to have connectivity to all spokes, but no connectivity between the spoke sites. BGP configuration is only required on the BEB sites; the backbone switches have no knowledge of any Layer 3 VPN IP addresses or routes.
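
The hub and spoke behavior comes down to how Route Target values are imported and exported per VRF: spokes export a target that only the hub imports, and import only the target exported by the hub. The sketch below illustrates that filtering logic generically in Python; the RT values and site names are invented for illustration and do not represent any particular BGP configuration syntax.

    # Each site's VRF policy: the RTs it attaches to its routes (export)
    # and the RTs it accepts from others (import).
    policies = {
        "hub":    {"export": {"65000:1"}, "import": {"65000:2"}},
        "spoke1": {"export": {"65000:2"}, "import": {"65000:1"}},
        "spoke2": {"export": {"65000:2"}, "import": {"65000:1"}},
    }

    def routes_received(site, advertisements):
        """Keep only routes whose attached RTs intersect the site's import policy."""
        wanted = policies[site]["import"]
        return [prefix for origin, prefix, rts in advertisements
                if origin != site and rts & wanted]

    # Each advertisement: (originating site, prefix, attached route targets)
    ads = [(site, prefix, policies[site]["export"])
           for site, prefix in [("hub", "10.0.0.0/24"),
                                ("spoke1", "10.1.0.0/24"),
                                ("spoke2", "10.2.0.0/24")]]

    print(routes_received("spoke1", ads))  # only the hub prefix is accepted
    print(routes_received("hub", ads))     # both spoke prefixes are accepted

With this policy the spokes can only ever learn routes originated by the hub, which is what prevents spoke-to-spoke connectivity.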


Resilient Edge Connectivity with Switch Clustering Support

As earlier described, the boundary between the MAC-in-MAC SPB domain and 802.1Q domain is handled by the Backbone Edge Bridges (BEBs). At the BEBs, VLANs are mapped into I-SIDs based on the local service provisioning.

Figure 15. Resilient edge switch cluster

Redundant connectivity between the VLAN domain and the SPB infrastructure is achieved by operating two SPB switches in Switch Clustering (SMLT) mode. This allows dual homing of any traditional link aggregation capable device into a SPB network. 

Switch Clustering provides the ability to dual home any edge device that supports standards-based 802.1ad LACP link aggregation, Avaya’s MLT link aggregation, EtherChannel or any similar link aggregation method.  With Switch Clustering, the capability is provided to fully load balance all VLANs across the multiple links to the switch cluster pair.  If either link as depicted fails, all traffic will instantly fail over to the remaining link.  Although two links are depicted, Switch Clustering supports LAGs up to 8 ports for additional resiliency and bandwidth flexibility. 
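
Per-flow load balancing across a link aggregation group typically works by hashing fields of each frame or packet and using the result to pick one member link, so a single flow always follows the same path while the aggregate traffic spreads across all links. The sketch below shows one generic way such a selection could be made; real switches differ in the exact fields and hash used, and this example makes no claim about Avaya's algorithm.

    import zlib

    def pick_lag_member(src_mac, dst_mac, src_ip, dst_ip, active_links):
        """Hash flow identifiers to a member link; the same flow always maps to the same link."""
        key = f"{src_mac}{dst_mac}{src_ip}{dst_ip}".encode()
        return active_links[zlib.crc32(key) % len(active_links)]

    links = ["1/1", "1/2"]            # two uplinks toward the switch cluster pair
    flow = ("00:11:22:33:44:55", "66:77:88:99:aa:bb", "10.0.0.5", "10.0.1.9")
    print(pick_lag_member(*flow, links))

    # On a link failure the flow is simply rehashed over the surviving members.
    print(pick_lag_member(*flow, ["1/1"]))

Because the hash is deterministic, frame ordering within a flow is preserved while the overall load is distributed across the cluster.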


Quality of Service Support and Traffic Policing and Shaping Support

Quality of Service (QoS) is maintained in an SPB network the same way as in any IEEE 802.1Q network. Traffic ingressing an SPB domain that is either already 802.1p marked (within the C-MAC header), or is being marked by an ingress policy (remarking), has its B-MAC header p-bits set to the appropriate value.

Figure 16. QoS & Policing over SPB

The traffic in the SPB core is scheduled, prioritized and forwarded according to the 802.1p values in the outer backbone packet header. In cases where traffic is being routed at any of the SPB nodes, the IP Differentiated Services (DSCP) values are taken into account as well.
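
In practice this means edge markings are carried into (or translated onto) the p-bits of the outer backbone header, and routed hops additionally consult DSCP. A simple mapping table of the kind an ingress policy might apply is sketched below in Python; the specific DSCP-to-priority assignments are illustrative assumptions, not a statement of any product's defaults.

    # Illustrative DSCP -> 802.1p priority mapping applied at the ingress BEB.
    DSCP_TO_PBIT = {
        46: 6,   # EF (voice) -> high priority queue in the core
        34: 4,   # AF41 (interactive video)
        26: 3,   # AF31 (signalling / important data)
        0:  0,   # best effort
    }

    def mark_backbone_pbit(dscp, already_marked_pbit=None):
        """Trust an existing C-MAC p-bit if present, otherwise derive the p-bit from DSCP."""
        if already_marked_pbit is not None:
            return already_marked_pbit
        return DSCP_TO_PBIT.get(dscp, 0)

    print(mark_backbone_pbit(46))                          # 6: scheduled ahead of best effort
    print(mark_backbone_pbit(0, already_marked_pbit=5))    # 5: the edge marking is trusted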

The number of I-SIDs available in an SPBm domain is virtually limitless (roughly 16 million, given the 24-bit identifier space). Additionally, this technology can be effectively extended over many forms of transport such as dark or dim optics, CWDM or DWDM, MPLS L2 pseudo-wires, ATM and others. This means that it can effectively cover vast geographies in its native form and provide all of the virtualization benefits wherever it reaches.

Where required, however, an SPBm domain can effectively interface to a traditional routed WAN by the use of standard interior and border gateway protocols.

Provider Type Services offerings and larger regional topologies

In instances where larger geographic coverage is desired to leverage IEEE 802.1aq and its inherent provisioned core approach, the traditional mesh topology has difficulty scaling due to the costs of optical infrastructure and points of presence. In these instances ring based topologies make the most sense. IEEE 802.1aq can not only support ring topologies but can also support various interesting iterations such as dual core rings or the more esoteric 3D torus topology, which is intended to support very high core port densities.

The next section of this document will discuss the various ring topology options as well as the combination of their use. The diagram below illustrates the basic components for the dual core ring. There are two basic assumptions in the design. First, the core ring topology is populated with only Backbone Core Bridges (BCB’s). This optimizes one of the key traits of Shortest Path Bridging – separation of core and edge. The result is a design of immense scale from a services perspective. Second, all provisioned service paths are applied at the edge in the Backbone Edge Bridges (BEB’s) which provides the interface to the customer edge.

Figure 17. Basic Dual Core components

As we look below at a complete topology we can see that a very efficient design emerges, one that uses minimal node and fiber counts while effectively leveraging shortest paths across the topology. Each BEB is dual homed back into the ring fabric by SPB trunks. As such there are multiple options for dual homing the BEB node back into the ring topology.

Figure 18.  A Basic Dual Core Ring

An additional level of differentiation can be provided by the use of a dual home active/active mesh service edge. In this type of edge shown below, there are two BEB’s which are trunked together with active/active Inter-Switch Trunks. These two switches then provide a clustered edge that interoperates with any industry standard dual homing trunk method such as MLT or LAG. The end result is a very high level of mesh resiliency directly down to the customer service edge.

Figure 19. Dual Homed Active/Active Mesh Edge

The diagram below shows a dual core ring design that implements various forms of dual homed resiliency. These can range from simple dual homing of the BEB to a very highly resilient inter-area active/active edge design that can provide sub-second failover into the provider cloud. Again, this supports industry standard methods for active/active dual homing of the Ethernet service edge.

 Figure 20. Dual Core Ring with various methods of dual homed resiliency

More complex topologies can be designed when higher densities of backbone core ports are required. The topology below illustrates a 3D torus design that links together triad nodal areas to build a very highly resilient and dense core port capacity ring.

Figure 21. 3D Torus Ring

As the diagram below shows, the basic construct of the 3D torus is fairly simple and is comprised of only six core nodes. The dotted lines show optional SPB trunks to provide enhanced shortest path meshing. With these optional trunks every node is directly connected for shortest path forwarding.

Figure 22. 3D Torus Section

These sections can be linked together to build a complete torus as shown above, or used in a hybrid fashion as shown below to build up or down core port densities as required by subscriber population. The illustration below shows a hybrid ring topology that scales up or down according to population and subscriber density requirements.

Figure 23. Hybrid Ring Topology

As this section illustrates, IEEE 802.1aq is an excellent technology for regional and metropolitan area networks. It allows for scalability and reach as well as a great degree of flexibility in supported topologies. Moreover, these different degrees of scale can be accomplished in the same network without any degree of sacrifice to the overall resiliency of the whole.

Provisioned Virtual Service Networks

As mentioned earlier, IEEE 802.1aq offers several methods of service connectivity across the SPB cloud. In the context of a service offering however, the use of I-SID’s will have a different focus. Rather than a departmental or organizational focus as was used in the above example, here we are concerned with shared service offerings or services separation. As an example, in the area of voice service offerings, a service may be shared in that it is much like the PSTN only over IP. In contrast, a service might be offered for a virtual PBX service for a private company that would expect that service to be dedicated. The figure below shows how IEEE 802.1aq can easily provide the dedicated service paths for both modes of service offering. The PSTN service I-SID offering is shown in green while the private virtual PBX service I-SID is shown in red.

Figure 24.  Shared vs. Dedicated Services


In a typical deployment, an offering of services might be as follows:

Private Sector – Voice/Shared – Video/Shared – Data/Shared

Business – Voice/Private – Video/Shared – Data/Private

These are of course general and can be customized to any degree. The diagram below shows how the use of IEEE802.1aq I-SID’s allows for the support of both service models with no conflict. Note that the private sector shares a common I-SID for video services with the business sector. Also note that the business sector profile allows for the use of a dedicated virtual PBX service that is private to that business.

Figure 25.  Voice and Video I-SID’s across SPB

Figure 26.  Multiple ‘Service Separated’ data service paths across SPB

The illustration above highlights the data networking services. Note that the private sector is using a shared I-SID (shown in green), much as is done today with DOCSIS type solutions. Note also that the business is using L3 I-SIDs with VRFs to build out a separate, private and dedicated IP topology over the IEEE 802.1aq managed offering. This creates separate and discrete data forwarding environments that are true ‘ships in the night’. They have no ability to support end to end communications unless the routing topology explicitly allows it. As such, all of the traditional IT security frameworks such as firewalls and intrusion detection and prevention come into play and are used in a rather traditional fashion to protect key corporate resources. In the private residential space, endpoint anti-virus and protection is used, as is typical with ISPs today.


IP Version 6 Support

Introducing new technology is always a move into the unknown. IPv6 is no different. While the technology has been under development for some time (over ten years), there has been no great impetus for large scale adoption. This is changing now that IANA/ARIN has announced that the last contiguous blocks of IPv4 addresses have been allocated. Now it is down to non-contiguous blocks and recycling of address blocks. These efforts will not provide any significant extension to the availability of IPv4 addresses. With these events, many organizations are now actively investigating how IPv6 can be deployed into their networks.


This section is intended to provide an overview of a tested topology over shortest path bridging (IEEE 802.1aq) environments for the distribution of globally routable IPv6 addressing using L2 VSN’s and inter-VSN routing.

The high level results of the work demonstrate that an enterprise can effectively use SPB to provide for the overlay of a routed IPv6 infrastructure that is incongruent with the existing IPv4 topology. Furthermore, with IPv4 default gateways resident on the L2 VSNs, dual stack end stations can have full end to end hybrid connectivity without the use of L3 transition methods such as 6to4, ISATAP, or Teredo. This results in a clean and simple implementation that allows for the use of allocated globally routable IPv6 addresses in a native fashion.


IPv6 in General –


IPv6 is the next generation form of IP addressing. Replacing IPv4, it is intended to provide a greatly enhanced address space as well as the end to end transparency that was becoming more and more difficult to preserve with the increasing use of Network Address Translation (NAT) in IPv4. NAT was created in order to provide for the use of ‘private’ IPv4 addressing within an organization and then allow for a gateway device to interface out to the public Internet. Even this technology, however, could not forestall the unavoidable event that occurred earlier this year: contiguous blocks of IPv4 addresses have run out.
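
To give a sense of the difference in scale and notation, the short Python example below uses the standard library ipaddress module to compare the two address families; the sample addresses and prefixes are arbitrary documentation-range values, used purely for illustration.

    import ipaddress

    # RFC 5737 / RFC 3849 documentation ranges, used only as examples
    print(ipaddress.IPv4Address("192.0.2.1").max_prefixlen)    # 32-bit addresses
    print(ipaddress.IPv6Address("2001:db8::1").max_prefixlen)  # 128-bit addresses

    v4_subnet = ipaddress.ip_network("192.0.2.0/24")
    v6_subnet = ipaddress.ip_network("2001:db8::/64")
    print(v4_subnet.num_addresses)   # 256 addresses in a typical IPv4 subnet
    print(v6_subnet.num_addresses)   # 18446744073709551616 interface IDs in a single /64

    # The colon-delimited hexadecimal notation compresses runs of zeros
    print(ipaddress.IPv6Address("2001:0db8:0000:0000:0000:0000:0000:0001"))   # 2001:db8::1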

Currently, there are address recycling efforts that will provide some reprieve, but in the imminent future even this effort will be exhausted.

These events have caused a recent surge of interest in IPv6. Many enterprises that had it on the back burner are now taking a new look at this technology and the requirements that need to be met for their organizations to deploy it. For the first-time investigator this can be a daunting task. Beyond the knowledge of IPv6 itself, one needs to learn all of the methods required to co-exist in an IPv4 network environment. This is a strict requirement because no one will forklift their complete communications environment, and even if they could, there are issues with contact to the outside world that need to be addressed. The reason for this is that the IPv6 suite is NOT directly backwards compatible with IPv4. This complication has taken quite a bit of effort within the IETF to resolve. There are a number of RFCs, drafts as well as deprecated drafts that cover a wide variety of translation or transition methods. Each has its own set of complications and security or resiliency issues that need to be dealt with. At the end of the day, most IT personnel walk away with a headache and wish for the good old days of just IPv4.


During the time since IPv6 was first introduced, different schools of thought evolved as to how this co-existence between IPv4 and IPv6 could be addressed. Network Address Translation with Protocol Translation (NAT-PT) came into vogue but has since faded off into deprecation as the approach has largely proved to be intractable. Other methods have stayed and even become ‘default’. As an example, all Microsoft operating systems running IPv6 support the 6to4, ISATAP and Teredo tunneling methods.

So it has become clear. One school has won out and that school of thought is… dual stack in the end stations and tunneling across the IPv4 network to tie IPv6 islands together. These methods work, but as I pointed out earlier, they all have complications and issues that need to be dealt with.

If one looks at the evolution long enough though something else becomes apparent. If you could provide the paths between IPv6 islands by Layer 2 methods, things like 6to4, ISATAP and Teredo are no longer required. Furthermore, without these methods an enterprise is free to use formally allocated globally routable address space. The only requirement for the dual stack host is that they have clear default routes for both IPv6 and IPv4. With typical VLAN based networks however, this design while feasible does not scale and quickly becomes intractable due to the complications of tagged trunk design within the network core. With the evolution of Shortest Path Bridging (IEEE 802.1aq) this scalable layer two method is now available. The rest of this solution guide will describe the test bed environment and then discuss ramifications that this work has on larger network infrastructures.


The IPv6 over SPB Example Topology –


The figure below shows the minimal requirements for a successful hybrid IPv6 deployment over shortest path bridging. As can be seen the requirements are fairly concise and simple. You require an SPB Virtual Service Network configured which is then associated with edge VLAN’s. These VLAN’s will host dual stack end stations.

Additionally, this VSN will need to attach to IPv6 and IPv4 default gateways. Again, this would occur by the use of edge VLANs that interface to the relevant devices.


Figure 27. Required elements for a native hybrid IPv6 deployment over SPB


So as one can see the requirements are straightforward and easy to understand. We implemented the following topology in a lab to demonstrate the proposed configuration.

The diagram below illustrates this topology in a simplified form for clarity. 

 Figure 28. Native IPv6 Dual Stack over L2 VSN Test bed


In the test bed we implemented a common VSN that would support the IPv6 deployment. This was for simplicity only; more complicated IPv6 routed topologies can easily be achieved by using inter-VSN routing, and examples later in the brief illustrate this. In the lab we created VLAN ID 500 at three different key points at the edge of the SPB domain. A Virtual Service Network was created within the SPB domain (also using 500 as its identifier) that ties the different VLANs together. At one edge VLAN, a Windows 7 end station running dual stack was configured with an IPv4 address and the IPv6 address 3000::2; it had both an IPv4 default gateway and the IPv6 default gateway 3000::1. The IPv6 default gateway is also attached to VLAN 500 and is able to provide directly routable paths in and out of the VSN. Additionally, the IPv4 default gateway is attached and reachable as well. The dual stack end station enjoys end to end hybrid connectivity to both IPv6 and IPv4 environments without the use of any L3 transition method. In the topology shown in the figure below, we show that from the dual stack end station’s perspective there is complete hybrid connectivity, with available routed paths to both IPv4 and IPv6 environments. Because formally allocated global addressing is used, there is connectivity out into INET2 to native IPv6 resources.

Figure 29. Dual Stack end stations perspective on default routed paths


The ramifications on larger IPv6 deployments


One of the major drawbacks of L3 transition methods for IPv6 is that they bind the IPv6 topology to IPv4. Many find this undesirable. After all, why implement a new globally routed protocol and then lock it down to an existing legacy topology? As a result, it was realized very early on that if you could run IPv6 as ships in the night with IPv4 it would be a very good solution. The problem was that the only methods to accomplish this were the use of VLANs and tagged trunks, or routed overlays. As a result, while a topology like the test bed shown above was feasible and provable with those methods, the approach quickly suffers from complexity in larger topologies and does not lend itself well to scale.

With Shortest Path Bridging these issues are vastly simplified making this approach tractable on an enterprise scale. The reason for this is that the IPv6 deployment becomes an overlay L3 environment that rides on top of SPB. As such, there is no need to make detailed configuration changes to the network core to deploy it. This original ‘ships in the night’ vision can now be realized in real world designs.


The diagram below shows a large network topology that interconnects two data centers. The topology in blue shows the IPv6 native dual stack deployment. The topology in green shows the IPv4 legacy routed environment. Note that while there are common touch points between the two environments for legacy dual stack IPv4 use, the two IP topologies are quite independent of one another.

Figure 30. Totally Independent IP topologies



In Summary –


This document has provided a review of active/active mesh network topologies and the significant benefits that they bring to an overall network design. With networking speeds now at 10 Gb/s and beyond, it is no longer sufficient to have very high speed, expensive switch ports sitting in a totally passive state waiting for a network failure. It is also no longer sufficient to tolerate failover times in the range of seconds, or even tenths or hundredths of seconds. The amount of data loss and the performance impacts are just too serious. Active/active mesh networking addresses this by providing multiple load sharing paths across the network topology. Additionally, due to the active nature of the trunking method, SMLT can very easily provide failovers in the sub-second range. As a note, recent testing of Avaya’s third generation of SMLT reliably shows failovers in the range of 6 ms. This is practically instantaneous from the perspective of the overall network. This failover speed is unrivaled in the industry and is a testament to Avaya’s dedication to this technology space.

Additionally, newer active/active mesh technologies are being introduced, such as IEEE 802.1aq Shortest Path Bridging – a key foundational component of Avaya’s VENA framework – that promise to take active/active mesh network topologies into a new era of scale and flexibility never before realized with switched Ethernet topologies. The provisioned virtual network capability of VENA allows for one touch provisioning of the network service paths with zero touch requirements on the transport core. This innovation not only vastly simplifies administration and reduces configuration errors; it can also provide dramatic improvements in IT OPEX costs, in that changes that would normally take hours are brought down to minutes with an exponential reduction in the probability of error.

In addition, this paper has shown that this new addition to active mesh networking is totally compatible and complementary with older active/active mesh switched Ethernet topologies such as SMLT. The result of the combination is a flexible core meshing technology that allows for almost unlimited permutations of topologies and a very highly resilient dual homed edge with sub-second failover.

Another more mundane but equally important aspect of Avaya’s SPBm offering is that existing Ethernet Routing Switch 8600 deployments can be easily migrated to it. The result of this upgrade is to make the switch the equivalent of an Ethernet Routing Switch 8800, which can participate in an SPBm domain as either a Backbone Edge Bridge (BEB) or a Backbone Core Bridge (BCB), including all service modes detailed earlier in this article. This means that an existing ERS 8600 customer can implement the technology without the need for a forklift upgrade.

Even when considering networks with alternative vendors, Avaya’s SPBm VENA framework – due to its strict compliance with IEEE 802.1aq and other IEEE standards – allows for the seamless introduction of SPBm into the network as a core distribution technology with minimal disruption to the network edge. Additionally, network edges that are Spanning Tree based today because of core networking limitations can then move to implement the active/active dual homing model spoken to earlier by the use of LAG or MLT at the edge, both of which are widely supported throughout the industry.

The end result is a technology that brings immense value.  It is easy to implement in both new and existing networks, and migration can be virtually seamless.

Could it be that the days of spanning tree have finally passed?

I would like to extend both credit and thanks to my esteemed Avaya colleagues, Steve Emert and John Vant Erve for both input and use of facilities for solution validation.

IPv6 Deployment Practices and Recommendations

June 7, 2010

Communications technologies are evolving rapidly. This pace of evolution, while slowed somewhat by economic circumstances, still moves forward at a dramatic pace. This is indicative of the fact that, while the ‘bubble’ of the 1990s is past, society and business as a whole have arrived at the point where communications technologies and their evolution are a requirement for proper and timely interaction with the human environment.

This has profound impact on a number of foundations upon which the premise of these technologies rests. One of the key issues is that of the Internet Protocol, commonly referred to simply as ‘IP’. The current widely accepted version of IP is version 4. The protocol, referred to as IPv4, has served as the foundation of the current Internet since its practical inception in the public arena. As the success of the Internet attests, IPv4 has performed its job well and has provided the evolutionary scope to adapt over the twenty years that have transpired. Like all technologies, though, IPv4 is reaching the point where further evolution will become difficult and cumbersome if not impossible. As a result, IPv6 was created as a next generation evolution of the IP protocol to address these issues.

Many critics cite the length of time that IPv6 has been in development. It is, after all, a project with over a ten year history in the standards process. However, when one considers the breadth and complexity of the standards involved, a certain maturity is conveyed that the industry can now leverage. The protocol has evolved significantly since the first IPng proposals from which it grew. Many or most of the initial shortcomings and pitfalls have been addressed to the point where actual deployment is a very tractable proposition. Along this evolution several benefits have been added to the suite that directly benefit the network staff and end user populace. Some of these benefits are listed below. Note that this is not an exhaustive list.

  • Increased Addressing Space
  • Superior mobility
  • Enhanced end to end security
  • Better transparency for next generation multimedia applications & services

Recently, there has been quite a bit of renewed activity and excitement around IP version 6. The announcements by the United States Federal Government for IPv6 deployment by 2008 and the White House civilian agency mandate for 2012 have helped greatly to fuel this. Also many, if not most, of the latest projects being implemented by providers in the Asia Pacific regions are calling for mandatory IPv6 support. Clearly the protocol’s time is coming. We are seeing the two vectors of maturity and demand meeting to result in market and industry readiness.

There is a cloud on this next generation horizon however. It is known as IPv4. From a practical context all existing networks are either based on or in some way leverage IPv4 communications. Clearly, if IPv6 is to succeed, it must do so in a phased approach that allows hybrid co-existence with it. Fortunately, many in the standards community have put forth transition techniques and methodologies that allow for this co-existence.  A key issue to consider in all of this is that the benefits of IPv6 are somewhat (sometimes severely) compromised by their usage. However, like all technologies, if usage requirements and deployment considerations are considered prior to implementation the proposition is realistic and valid.

Setting the Foundation

IPv6 has several issues and dependencies in common with IPv4. However, the differences in address format and methods of acquisition require modifications that need to be considered. Much of the hype in the industry is on the aspects of support within the networking equipment. While this is of obvious importance, it is critical to realize that there are other aspects that need to be addressed to assure a successful deployment.

The first Block – DNS & DHCP Services

While IPv6 supports auto-configuration of addresses, it also allows for managed address services. DNS does not, from a technical standpoint, require DHCP, but the two are often offered in the same product suite.

When considering the new address format (128-bit, colon-delimited hexadecimal), it is clear that it is not human friendly. A Domain Name System (DNS) infrastructure is needed for successful coexistence because of the prevalent use of names (rather than addresses) to refer to network resources.  Upgrading the DNS infrastructure consists of populating the DNS servers with records to support IPv6 name-to-address and address-to-name resolutions. After the addresses are obtained using a DNS name query, the sending node must select which addresses are used for communication. This is important to consider both from the perspective of the service (which address is offered as primary) and the application (which address is used). It is obviously important to consider how a dual addressing architecture will work with naming services. Again, the appropriate due diligence needs to be done by investigating product plans, but also in limited and isolated test bed environments, to assure predictable and stable behavior with the operating systems as well as the applications that are being looked at.
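
The address selection point above can be observed directly from a dual stack host: a single name lookup may return both A and AAAA records, and the order in which the resolver returns them influences which family an application tries first. The snippet below is a small, standard-library sketch of such a lookup; the hostname is only an example, and the results will vary with the resolver and the DNS zone contents (it also assumes network access).

    import socket

    def resolve_both_families(hostname, port=443):
        """Return (family, address) pairs for a name; AAAA answers appear as AF_INET6 results."""
        results = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
        return [("IPv6" if family == socket.AF_INET6 else "IPv4", sockaddr[0])
                for family, _type, _proto, _canonname, sockaddr in results]

    # Example name only; any published dual stacked name could be substituted here.
    for family, address in resolve_both_families("www.example.com"):
        print(family, address)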

As mentioned earlier, DHCP services are often offered in tandem with DNS services in many products. In instances where IPv6 DHCP services are not supported, but DNS services are, it is important to verify that it will work with standard auto-configuration options.

The second Block – Operating Systems

Any of the operating systems being considered for use in the IPv6 deployment should be investigated for compliance and tested so that the operations staff are familiar with any new processes or procedures that IPv6 will require. Tests should also occur between the operating systems and the DNS/DHCP services using simple network utilities such as ping and FTP to assure that all of the operating elements, including the operating systems, interoperate at the lowest common denominator of the common IP applications.

It is important to test behaviors of dual stack hosts (hosts that support both IPv4 and IPv6). Much of the industry supports a dual stack approach as being the most stable and tractable approach to IPv6 deployments. Later points in this article will illustrate why this is the case.

The third Block – Applications

Applications should be considered first to establish the scope of operating systems and the extent to which IPv6 connectivity needs to be offered. Detailed analysis and testing, however, should occur last, after the validation of network services and operating systems. The reason for this is that the applications are the most specific testing instances and strongly depend on the stable and consistent operation of the other two foundation blocks. It is also important to replicate the exact intended mode of usage for the application so that the networking support staff are aware of any particular issues around configuration and/or particular feature support. On that note, it is important to consider whether there are any features that do not work in IPv6 and what impact they will have on the intended mode of usage for the application. Finally, considerations need to be made for dual stack configurations and how precedence is set for which IP address to use.

The fourth Block – Networking Equipment

Up to this point all of the validation activity referred to can be performed on a ‘link local’ basis; as a result, a typical layer two Ethernet switch would suffice. A real world deployment requires quite a bit more, however. It is at this point that the networking hardware needs to be considered. It is important to note that many pieces of equipment, particularly layer two type devices, will forward IPv6 data. If express management via IPv6 is not a requirement then these devices can be used in the transition plans, provided they are used appropriately in the network design.

Other devices such as routers, layer three switches, firewalls and layer 4 through 7 devices will require significant upgrades and modification to meet requirements and perform effectively. Due diligence should be done with the network equipment provider to assure that requirements are met and timelines align with project deployment timelines.

As noted previously in the other foundation blocks, dual stack support is highly recommended and will greatly ease transition difficulties as will be shown later. With networking equipment things are a little more complex in that in addition to meeting host system requirements for IPv6 communications of the managed element, the requirements of data forwarding, route computation and rules bases need to be considered. Again, it is important to consider any features that will not be supported in IPv6 and the impact that this will have on the deployment. The figure below illustrates an IPv6 functional stack for networking equipment.

Figure 1. IPv6 network element functional blocks

As shown above, there are many modifications that need to occur at various layers within a given device. The number of layers as well as the specific functions implemented within each layer is largely determined by the type of networking element in question. Simpler layer two devices are only required to provide dual host stack support, primarily for management purposes, while products like routers and firewalls will be much more complex. When looking at IPv6 support in equipment it makes sense to establish the role that the device performs in the network. This role based approach will best enable an accurate assessment of the real requirements and features that need to be supported, rather than industry or vendor hype.

The burden of legacy – Dual stack or translation?

The successful deployment of IPv6 will strongly depend on a solid plan for co-existence and interoperability with existing IPv4 environments. As covered earlier, the use of dual stack configurations whenever possible will greatly ease transition. Today this is an issue for any device supporting IPv6 to speak to IPv4 devices. As time moves on however, the burden will shift to the IPv4 devices to speak to IPv6 devices. As we shall see there are only a certain set of applications that require dual stack down to the end point. Most client server applications will work fine in a server only dual stack environment supporting both IPv4 and IPv6 only clients as shown in the figure below.

Figure 2. A dual stack client server implementation

As shown above both IPv4 and IPv6 client communities have access to the same application server each served by their own native protocol. In the next figure however we see that there are some additional complexities that occur with certain applications and protocols such as multimedia and SIP. In the illustration below we see that there are not only client/server dialogs but client to client dialogs as well. In this instance, at least one of the clients needs to support a dual stack configuration in order to establish the actual media exchange.

Figure 3. A peer to peer dual stack implementation

As shown above, with one end point supporting a dual stack configuration and the appropriate logic to determine protocol selection, end to end multimedia communications can occur. Note that this scenario will typically be used in lieu of IPv6-only communication until IPv6-only devices become more prevalent over time.
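
The “appropriate logic to determine protocol selection” mentioned above usually amounts to a preference-with-fallback rule on the dual stack side: try the peer’s IPv6 address first and drop back to IPv4 if that attempt fails. The sketch below shows the idea in generic Python; the peer addresses are documentation placeholders, and real multimedia stacks embed this decision in their signalling (for example, in the candidate addresses exchanged via SIP/SDP) rather than in a bare socket call.

    import socket

    def connect_preferring_ipv6(candidates, port, timeout=2.0):
        """Try the peer's IPv6 address first, then fall back to IPv4 if it fails."""
        for family in ("ipv6", "ipv4"):
            address = candidates.get(family)
            if address is None:
                continue
            try:
                sock = socket.create_connection((address, port), timeout=timeout)
                return family, sock
            except OSError:
                continue   # unreachable over this family, try the next one
        raise OSError("no usable address family for this peer")

    # Placeholder peer advertising both families (documentation addresses only)
    peer = {"ipv6": "2001:db8::10", "ipv4": "192.0.2.10"}
    # family, sock = connect_preferring_ipv6(peer, 5060)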

There are many benefits to the dual stack approach. By analyzing applications and mandating dual stack usage, a very workable transition deployment can be attained.

There are arguments that address space, one of the primary benefits of IPv6, is drastically compromised by this approach. After all, by using dual stack you do not remove any IPv4 addresses; in fact you are forced to add IPv4 addresses to accommodate an IPv6 deployment. The truth of this is directly related to the logic of the approach in deployment. By understanding the nature of the applications and giving preference to the innovative (IPv6 only) population, these arguments can be mitigated. The reason for this is that you are only adding IPv6 addresses to existing IPv4 hosts that require communication with IPv6. If this happens to be the whole IPv4 population, so be it. There are plenty of IPv6 addresses to go around! As new hosts and devices are deployed they should preferentially be IPv6 only, or dual stack if required, but NOT IPv4 only.

An alternative to the dual stack approach is the use of intermediate gateway technologies to translate between IPv6 and IPv4 environments. This approach is known as NAT-PT. The diagram below illustrates a particular architecture for NAT-PT usage that will provide for the multimedia scenario used previously.

Figure 4. Translation Application Layer Gateway

In this approach the server supports a dual stack configuration and uses native protocols for the client/server dialogs to each end point. Each end point is single stack; one is IPv4, the other is IPv6. In order to establish end to end multimedia communications, an intermediate NAT-PT gateway function provides the translation between IPv4 and IPv6. There are many issues and caveats with this approach, which can be researched in IETF records.  As a result, there has been work towards deprecating NAT-PT to an experimental status.  It should be noted that a recent draft revision has been submitted, so it is worth keeping on the radar.

Tunnel Vision

There has been quite a bit of activity around another set of transition methods known as tunneling. In a typical configuration, there are two IPv6 sites that require connectivity across an IPv4 network. The use of tunneling would involve the encapsulation of the IPv6 data frames into IPv4 transport. All IPv6 traffic between the two sites would traverse this IPv4 tunnel. It is a simple and elegant, but correspondingly limited, approach that provides co-existence, not necessarily interoperability, between IPv4 and IPv6. In order to achieve the latter we need to invoke one of the approaches (dual stack vs. NAT-PT) discussed earlier.  Tunneling by itself only provides the ability to link IPv6 sites and networks over IPv4.

This is a very important point. A point that, if taken to its logical conclusion, indicates that if the network deployment is appropriately engineered, the use of transition tunneling methods can be greatly reduced and controlled, if not eliminated. Before we take this course in logic however it is important to consider the technical aspects of tunneling and why it is something that needs to be thought out prior to using.

The high level use of tunneling is reviewed in RFC 2893 for those interested in further details. Basically there are two types of tunnels. The first is called configured tunnels. Configured tunnels are IPv6-in-IPv4 tunnels that are set up manually on a point to point basis. Because of this, configured tunnels are typically used in router to router scenarios. The second type of tunnel is automatic. Automatic tunnels use various methods to derive IPv4/IPv6 address mappings on a dynamic basis in order to support automatic tunnel setup and operation. As a result, automatic tunnels can be used not only for router to router scenarios but for host to router or even host to host tunneling as well. This allows us to build a high level summary table of the major accepted tunneling methods.

Method                Usage                                Risk

Configured            Router to router                     Low

Automatic (6to4)      Router to router / Host to router    Medium

Automatic             Host to host                         High


Without going into deep technical detail on each automatic tunneling method’s behavior, we can assume that there is some sort of promiscuous behavior that will activate the tunneling process on recognition of a particular pattern (IP protocol type 41, IPv6 in IPv4). This promiscuous behavior is what warrants the increased security risk associated with the automatic methods. RFC 3964 goes into detail on the security related issues around automatic tunneling methods. At a high level there is the potential for Denial of Service attacks on the tunnel routers as well as the ability to spoof addresses into the tunnel for an integrity breach. The document provides recommendations on risk reduction practices, but they are difficult to implement and maintain properly.
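
In the 6to4 case the pattern being recognized is straightforward: each site derives its IPv6 prefix from its public IPv4 address under 2002::/16, and the encapsulated traffic is carried as IP protocol 41. The short Python sketch below derives such a prefix and names the protocol number, purely to illustrate the mechanism; the IPv4 address shown is a documentation example.

    import ipaddress

    IPV6_IN_IPV4_PROTOCOL = 41   # IP protocol number a tunnel endpoint watches for

    def six_to_four_prefix(public_ipv4):
        """Derive the 2002::/48 site prefix that RFC 3056 maps to a public IPv4 address."""
        v4 = int(ipaddress.IPv4Address(public_ipv4))
        site_bits = (0x2002 << 112) | (v4 << 80)   # 16-bit 2002 prefix, then the 32-bit IPv4 address
        return ipaddress.IPv6Network((site_bits, 48))

    print(six_to_four_prefix("192.0.2.1"))   # 2002:c000:201::/48

Because any host that can reach the router over IPv4 can send protocol 41 packets toward it, the tunnel endpoint has no inherent way of knowing whether the inner IPv6 source is legitimate, which is the root of the risks discussed above.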

An effective workaround to these issues is to use IPsec VPN branch routing over IPv4 to establish secure, encrypted site to site connectivity and then to run the automatic tunneling method inside the IPv4 IPsec tunnel.

The figure below shows a scenario where two 6 to 4 routers have a tunnel set up to establish site to site connectivity inside an IPv4 IPSEC VPN tunnel. With this approach any IP traffic will have site to site connectivity via the VPN branch office tunnels. The IPv6 hosts would have access to one another via the 6 to 4 tunnels. Any promiscuous activity required by 6 to 4 can now be used with relative assurances of integrity and security. The drawback to this approach is that additional features or devices are required to complete the solution.

Figure 5. Using Automatic Tunneling inside IPv4 IPSec VPN

The primary reason for using transition tunnel methods is to transport IPv6 data over IPv4 networks. In essence, the approach ties together islands of IPv6 across IPv4 and allows for connectivity to the IPv6 network.  If we follow this logic, then the use of transition tunneling can be reduced or even eliminated by getting direct connectivity to the IPv6 Internet through at least one IPv6 enabled router in a given organization’s network. The figures below illustrate the difference between the two approaches. In the top example, the organization does not have direct access to the IPv6 Internet; as a result transition tunneling must be used to attain connectivity. In the lower example, the organization has a router that is directly attached to the IPv6 Internet, so there is no need to invoke transition tunneling. By using layer two technologies such as virtual LANs, IPv6 hosts can acquire connectivity to the IPv6 dual stack native router.

Figure 6. Using transition tunneling to extend IPv6 connectivity

Figure 7. Using L2 VLAN’s to extend IPv6 connectivity

Within the organization – Use what you already have

As we established, by providing direct connectivity to the IPv6 Internet the use of transition tunneling can be eliminated on the public side. Within the organization, prior to implementing transition tunneling it makes sense to review the methods that may already exist in the network to attain connectivity.

All of the issues in dealing with IPv6 transition revolve around the use of layer 3 approaches. By using layer 2 networking technologies, transparent transport can be provided. There are multiple technologies that can be used for this approach. Some of these are listed below:

  • Optical Ethernet
  • Ethernet Virtual LAN’s
  • ATM
  • Frame Relay

As listed above, there are many layer two technologies that can be used to extend IPv6 connectivity within an organization’s network.

Virtual LANs can be used to extend link local connectivity to IPv6 enabled routers in a campus environment. The data will traverse the IPv4 network without the complexities of layer 3 transition methods. For the regional and wide area, optical technologies can extend the L2 virtual LANs across significant distances and geographies, again with the goal of reaching an IPv6 enabled router. Similarly, traditional L2 WAN technologies such as ATM and frame relay can extend IPv6 local links across circuit switched topologies. As the diagram above illustrates, by placing the IPv6 dual stack routers strategically within the network and interconnecting them with L2 networking topologies, an IPv6 deployment can be implemented that co-exists with IPv4 without any transition tunnel or NAT-PT methods.

The catch, of course, is that these layer two paths cannot traverse any IPv4 only routers or layer 3 switches. As long as this topology rule is adhered to, this simplified approach is totally feasible. By incorporating dual stack routers, both IPv4 and IPv6 Virtual LAN boundaries can effectively be terminated and in turn propagated further with virtual LANs or other layer two technologies on the other side of the routed element. A further evolution of this is to use policy based virtual LANs that determine membership according to the IP version of the data received on a given edge port. As the figure below illustrates, dual stack hosts will have access to all required resources in both protocol environments.

Figure 8. Using Policy Based VLAN’s to support dual stack hosts
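
The classification a policy based VLAN performs can be reduced to inspecting the EtherType of each incoming frame: 0x0800 identifies IPv4 and 0x86DD identifies IPv6, and the port logic assigns the frame to the corresponding VLAN. The sketch below shows this decision in plain Python; the VLAN IDs are hypothetical, and the example ignores tagged frames and non-IP protocols for brevity.

    ETHERTYPE_IPV4 = 0x0800
    ETHERTYPE_IPV6 = 0x86DD

    # Hypothetical policy: per-protocol VLAN assignment on an edge port
    POLICY_VLANS = {ETHERTYPE_IPV4: 10, ETHERTYPE_IPV6: 610}

    def classify_frame(ethertype, default_vlan=1):
        """Assign an untagged frame to a VLAN based on the IP version it carries."""
        return POLICY_VLANS.get(ethertype, default_vlan)

    print(classify_frame(ETHERTYPE_IPV6))  # 610: steered toward the IPv6-enabled router
    print(classify_frame(ETHERTYPE_IPV4))  # 10: stays on the existing IPv4 path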

In essence, where dual stack capability is provided end to end, layer three transition methods can be avoided altogether. While it is unlikely that this can be made to occur in most networks, such logic can greatly reduce any layer three transition tunnel usage. By taking additional considerations regarding application network behaviors and characteristics, as noted in the beginning of this article, the use of intermediate protocol and address translation methods like NAT-PT can also be mitigated.

In conclusion

This article was written to clarify deployment issues for IPv6 with a particular focus on interoperability and co-existence with IPv4. A step by step summary of the deployment considerations can be now summarized as follows:

1). Build the foundation

There are four basic foundation blocks that need to be established prior to deployment consideration. Details on each particular foundation block are provided. In summary they are:

1). DNS/DHCP services

2). Network Operating Systems

3). Applications

4). Network Equipment

As pointed out several times, plan for dual stack support wherever possible in all of the foundation blocks. Such an approach will greatly ease the transition issues around deployment. Ongoing work in multiple routing and forwarding planes, such as OSPF Multi-Topology (OSPF-MT) and Multi-protocol BGP (MBGP), may have beneficial and simplifying merits in interconnecting dual stack routing elements, exclusively identifying them, and building forwarding overlays or route policies based on the traffic type (IPv4 vs. IPv6). While the OSPF-MT work is in preliminary draft phases, it has very strong merits in that, in combination with MBGP, it can effectively displace MPLS type approaches to accomplish the same goal. Again, no transition methods would be required within the OSPF-MT boundary as long as overlay routes exist between the dual stack routing elements.

2). Establish connectivity

Once the foundations have been provided for, the next step is to establish how connectivity will be made between different sites. Assuming that dual stack routers are available, it makes sense to closely analyze campus topologies and establish methods by which connectivity can be provided in concert with layer two networking technologies. Only once all available methods have been exhausted and it is clear that one is dealing with an IPv6 ‘island’ should one look at using one of the IPv6 transition tunneling methods, with configured tunneling being the most secure and conservative approach and appropriate for this type of site to site usage. Host to router tunneling may have valid usage in remote access VPN applications, particularly where local Internet providers do not offer IPv6 networking services. Host to host tunneling applications should be used only in initial test bed or pilot environments and, because of manageability and scaling issues, are not recommended for general practice usage.

To connect sites across a wide area network, layer two circuit switched technologies such as frame relay and ATM can extend connectivity between the dual stack enabled sites. In some next generation wide area deployments, layer two virtual LAN’s can be extended across RPR optical cores to accomplish the end to end connectivity requirements. Again, only after all other options have been exhausted should the use of IPv6 transition tunneling methods be entertained.

At this point, a dual stack native mode deployment has been achieved with only minimal use of tunneling methods. It is only at this point that the use of any NAT-PT functions should be entertained, to accommodate any applications that do not comply with the deployment. It is strongly urged that such an approach be used in a very limited form and be relatively temporary in the overall deployment. Timelines should be established to move away from the temporary usage by incorporating a dual stack native approach as soon as feasible.

3). Test, test, test

As noted at several points throughout this article, testing is critical to deployment success. The reason for this is that requirements are layered and interdependent. Consequently, it is important to validate all embodiments of an implementation. Considerations need to be made according to node type, operating system and application, as well as any variations that need to be considered for legacy components. It is like Murphy’s law: it is the implementation that you do not test that will be the one to have problems.

Storage as a Service – Clouds of Data

May 26, 2010

Storage as a Service (SaaS) – How in the world do you?

There is a very good reason why cloud storage has so much hype. It simply makes sense. It has an array of attractive use case models. It has a wide range of potential scope and purpose making it as flexible as the meaning of the bits stored. But most importantly, it has a good business model that has attracted some major names into the market sector.

If you read the blog posts and articles, most will say that Cloud Storage will never be accepted due to the lack of security and accountability. The end result is that many CISOs and CIOs have decided that it is just too difficult to prove due diligence for compliance. As a result, they have not widely embraced the cloud model. Now while the concern is real, it is not the whole truth. As a matter of fact, most folks are actually using Cloud Storage within their environment; they just don't think of it as such. This article is intended to provide some insight into the use models of SaaS as well as some of the technical and business considerations that need to be made in moving to a SaaS environment.

Types of SaaS Clouds

It is commonly accepted that there are two types of clouds; public and private. It is the position of this architect that there are in reality three major types of clouds and a wide range of manifestations of them. There are reasons for this logic and the following definitions will clarify why.

Public SaaS Clouds

Public clouds are clouds that are provided by open internet service providers. They are truly public in that they are equally available to anyone who is willing to put down a credit card number and post data to the repository. Examples of this are Google, Amazon & Storage Planet. While this is a popular model, as attested by its use, many are saying the honeymoon is fading amid issues of accountability, reports of lost data and a lack of assurances for the security and integrity of content.

Semi-Private SaaS Clouds

These are clouds that are more closed in that they usually require some sort of membership or prior business subscribership. As a result the service is typically less open to the general public. The definition of semi-private can also have a wide range of embodiments. Examples range from network service providers such as cable and telco companies; to slightly more closed educational clouds that let higher education institutions store, post and share vast quantities of content; to the most closed case, government usage, where for example a county provides a SaaS cloud service to the various agencies within its area of coverage.

Private SaaS Clouds

These are the truly private SaaS services that are totally owned and supported by a single organization. The environment is totally closed to the outside world and access is typically controlled with the same level of diligence as corporate resource access. The usual requirements are that the user has secure credentials and that the department's usage is accounted for through some type of cost center.

As indicated earlier, these can occur in a variety of embodiments and in reality there is no hard categorization between them; rather, there is a continuum of characteristics that ranges from truly private to truly public.

While placing data up into a truly public cloud would cause most CISOs and CIOs to cringe, many are finding that semi-private and private clouds are entirely acceptable in dealing with issues of integrity, security and compliance. Concern about the security and integrity of content is one thing. Another, more vexing issue is knowing exactly where your data is in the cloud. Is it in New York? California? Canada? Additionally, if the SaaS provider is doing due diligence in protecting your data then they are replicating it to a secondary site. Where is that? India? As you can see, in a totally public cloud service there is a large set of issues that prevents serious large scale use. Performance is also often a real issue, particularly for critical data or for system restores, when the disappointed systems administrator finds that it will be a day and a half before the system is back on line and operational. These are serious issues that are not easily addressed in a true public cloud environment. Semi-private and private clouds, on the other hand, can often answer these requirements and can provide fairly solid reporting about the security and location of posted content.

The important thing to realize is that it is not all or nothing. A single organization may use multiple clouds for various purposes, each with a different range of scope and usage. As an example, the figure below shows a single organization that has two private clouds, one of which is used exclusively by a single department and one of which spans the whole organization. Additionally, that same organization may have semi-private clouds that are used for B2B exchange of data in partnerships, channel relationships, etc. Then finally, the organization may have an e-Commerce site that provides a fairly open public cloud service for its customer and prospect communities.

Figure 1. Multiple tiered Clouds

If you really boil it down, you come to a series of tiered security environments that control what type of data gets posted, by whom and for what purpose. Other issues include data type and size as well as performance expectations. Again, in a Semi-private to private usage model these issues can effectively be addressed in a fashion that satisfies both user and provider. The less public the service, the more stringent the controls for access and data movement and the tighter the security boundaries with the outside world.

It is for this reason that I think truly public SaaS clouds have too much stacked against them to be taken as a serious tool for large off site data repositories. Rather, I think that organizations and enterprises will more quickly embrace semi-private and private cloud storage because they offer a more tractable environment for addressing the issues mentioned earlier.

There are also different levels of SaaS offerings. These can vary in complexity and offered value. As an example, a network drive service might be handy for storing extra data copies but might not be too handy as a tool for disaster recovery. As a result, most SaaS offerings can be broken into three major categories.

  • Low level – Simple Storage Target

–        Easy to implement

–        Low integration requirements

–        Simple network drive

  • Mid level – Enhanced Storage Target

–        VTL or D2D

–        iSCSI

–        Good secondary ‘off-site’ use model

  • High level – Hosted Disaster Recovery

–        VM failover

–        P2V Consistency Groups

–        Attractive to SMB sector

As one moves from one level to the next the need for more control and security becomes more important. As a result, the higher the level of SaaS offering the more private it needs to be in order to satisfy security and regulatory requirements.

The value of the first Point of Presence in SaaS

As traffic leaves a particular organization or enterprise it enters a private WAN or, at some point beyond a boundary, the public Internet. Often these networks are depicted as clouds. We of course realize that there is in reality a topology of networking elements that handle the various issues of data movement. These devices are often switches or routers that operate at L2 or L3, and each imposes a certain amount of latency on the traffic as it moves from one point to another. As a result, the latency profile for accessing data in a truly public SaaS becomes longer and less predictable as the variables increase. The figure below illustrates this effect. As data traverses the Internet it intermixes with other data flows at the various points of presence where these network elements route and forward data.

Figure 2. Various ‘points of presence’ for SaaS

In a semi-private or a private cloud offering, the situation is much more controlled. In the case of a network provider, they are the very first point of presence or 'hop' that their customer's traffic crosses. It only makes sense that hosting a SaaS service at that POP will offer significantly better and more controlled latency, and as a result far better throughput, than will a public cloud service somewhere out on the Internet. Also consider that the bandwidth of the connection to that first POP will be much higher than the average aggregate bandwidth that would be realized to a public storage provider on the Internet. If we move to a private cloud environment, such as one hosted by a university as a billed tuition service for its student population, very high bandwidth can be realized with no WAN technologies involved. Obviously, the end to end latency in this type of scenario will be minimal when compared to pushing the data across the Internet. This, in addition to the security and control issues mentioned above, will in the opinion of the author result in dramatic growth in semi-private and private SaaS.
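To put rough numbers on the first-hop advantage, single-stream throughput is ultimately bounded by window size divided by round-trip time, so latency added at every intermediate point of presence cuts directly into realized bandwidth. A back-of-the-envelope sketch in Python; the window size and RTT figures are illustrative assumptions, not measurements:

def tcp_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Rough single-stream ceiling: window size divided by round-trip time."""
    return (window_bytes * 8) / (rtt_ms / 1000) / 1e6

# Illustrative figures: a 64 KB window to a first-hop POP (~5 ms RTT)
# versus a distant public cloud (~80 ms RTT)
print(round(tcp_throughput_mbps(64 * 1024, 5), 1))    # ~104.9 Mb/s ceiling
print(round(tcp_throughput_mbps(64 * 1024, 80), 1))   # ~6.6 Mb/s ceiling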

Usage models for SaaS

Now that we have clarified how SaaS can be embodied, what would someone use it for? The blatant response of 'to store data, stupid' is not sufficient. Most certainly that is an answer, but it turns out that the use case models are much more varied and interesting. At this point, I think that it is fruitful to discern between two major user populations – Residential & Business, with business including education and government institutions. The reason for the division is the degree of formality in usage. In most residential use models, there are no legal compliance issues like SOX or HIPAA to deal with. There may be confidentiality and security issues, but as indicated earlier these issues are easier to address in a semi-private or private SaaS.

Business and Institution use models

Virtual Tape Library SaaS

The figure below illustrates a simple VTL SaaS topology. The basic premise is to emulate a physical tape drive across the network with connectivity provided as an iSCSI target to the initiator, which is the customer’s backup software. With the right open system VTL, the service can be as easy as a new iSCSI target that is discovered and entered into the backup server. With no modifications to existing practices or installed software, the service matches well with organizations that are tape oriented in practice and are looking for an effective means of secondary off site copies. Tapes can be re-imported back across the network to physical tape if required in the future.

Figure 3. A simple VTL SaaS

D2D SaaS

Disk to disk SaaS offerings basically provide an iSCSI target of a virtual disk volume across the network. In this type of scenario the customer's existing backup software simply points to the iSCSI target for D2D backup or replication. Again, the benefit is that because the volume is virtualized and hosted, it effectively addresses off site secondary data store requirements. In some instances, with CPE installed, it can even be used in tandem with next generation technologies like continuous data protection and data reduction methods, which moves towards the Hosted Disaster Recovery end of the spectrum. The figure below shows a D2D SaaS service offering with two customers illustrated. One is simply using the service as a virtual disk target. The other has installed CPE that is running CDP and data reduction, resulting in a drastic improvement in the overall required bandwidth.

Figure 4. A D2D SaaS

Collaborative Share SaaS

Another use model that has been around for a long time is collaborative sharing. I say this because I can remember, better than ten years ago, placing a file up on an FTP server and then pasting the URL into an email that went out to a dozen or so recipients rather than plugging up the email servers with multiple copies of large attachments. Engineers have a number of things in common regardless of discipline. First is collaboration. A close second, though, is the amount of data that they typically require in order to collaborate. This type of model is very similar to the FTP example except that it is enhanced with a collaborative portal that might even host real time web conferencing services. The storage aspect, though of primary importance to the collaboration, is now a secondary supporting service that is provided in a unified fashion out to the customer via a web portal. The figure below shows an example of this type of service. Note that in reality there is no direct link between the SaaS and the web conferencing application. Instead they are unified and merged by a front end web portal that the customer sees when using the service. On the back end a simple shared virtual network drive is provided that receives all content posted by the collaborative team. Each member may have their own view and sets of folders, for instance, and each can share them with one individual, with a group, or with everyone. This type of service makes a lot of sense for this community of users. In fact, any user community that regularly exchanges large amounts of data would find value in this type of use model.

Figure 5. A Collaborative Share Service

Disaster Recovery as a Service (DRaaS)

There are times when the user is looking for more than simple storage space. There is a problem that is endemic in small and medium business environments today. There is minimal if any resident IT staff and even less funding to support back end secondary projects like disaster recovery. As a result many companies have BC/DR plans that are woefully inadequate and often would leave them with major or even total data loss in the event of a key critical system failure. For these types of companies using an existing network provider for warm standby virtual data center usage makes a lot of sense. The solution would most probably require CPE to be installed, but after that point the solution could offer a turnkey DR plan that could be tested at regular scheduled intervals for a per event fee.

The big advantage of this approach is that the customer can avoid expanding IT staff while still addressing a key issue of primary importance: the preservation of data and system uptime.

Obviously, this type of service offering requires a provider who is taking SaaS seriously. There is a Data Center required where virtual resources are leased out and hosted to the customer as well as the IT staff required to run the overall operations. As shown by the prevalence of vendors providing this type of service, even with the overhead, it does have an attractive business model that only improves with expanded customer base.

Figure 6. DRaaS implementation

Residential Use Models

PC Backup & Extra Storage

This type of SaaS offering is similar to the virtual disk service (D2D) mentioned above. The important difference is that it is not iSCSI based. Rather, it is a NAS virtual drive that is offered to the customer through some type of web service portal. Alternatively, it could be offered as a mountable network drive via Windows Explorer™. The user would then simply drag the folders that they want stored in the cloud onto that network drive. If they use backup software, they can with a few simple modifications copy data into the cloud by pointing the backup application to the virtual NAS drive. Additionally, this type of service could support small and medium businesses that are NAS oriented from a data storage architecture perspective. In the figure below, a NAS SaaS is illustrated with a residential user who is using the service to store video and music content. Another user is a small business that is using the service for NAS based D2D backup. Both customers see the service as a mapped network drive (i.e. F: or H:). For the residential customer it is a drive that content can be saved to; for the business customer it is a NAS target for its backup application.

Figure 7. NAS SaaS

Collaborative Share

More and more, friends and family are not only sharing content but creating it as well. Additionally, most of it is pictures, music and video – all files of huge size. This results in a huge amount of data that needs to be stored but also needs to be referenceable in order to be shared with others. The widely popular YouTube™ is a good example of such a collaborative service. Another example is FaceBook™, where users can post pictures and video to their walls and share them with others as they see fit. As shown in the figure below, SaaS is an embedded feature of the service. The first user posts content into the service, thereby using the SaaS feature. The second user then receives the content in a streaming CDN fashion. The first user would post the content via the web service portal (i.e. their wall). The second user would initiate the real time session via the web service portal by clicking on the posted link and view the content via their locally installed media player. Aside from the larger industry players, there is a demand for more localized community based collaborative shares for art and book communities, student populations, or even local business communities.

Figure 8. Collaborative Share for Residential

Technologies for SaaS

The above use models assume the use of underlying technologies to move the data, reduce it and store it. These are then merged with supporting technologies such as web services, collaboration and perhaps content delivery to create a unified solution for the customer. Again, this could be as simple as a storage target where data storage is the primary function, or it could be as complex as a full collaboration portal where data storage is more ancillary. In each instance, the same basic technologies come into play. From the point of view of the customer, only the best will do; from the point of view of the provider, the goal is to provide what will meet the level of service required. This results in a dichotomy, as often happens in a business model. The end result is an equitable compromise that uses the technologies below to arrive at a solution that satisfies the interests of the user as well as those of the provider – a tenable set of values and benefits to all parties, which is the sign of a good business model.

Disk Arrays

Spinning disks have been around almost as long as modern computing itself. We all know the familiar spinning and clicking (now oh so faint!) on our laptops as the machine chunks through data on its relentless task of providing the right bits at the right time. Disk technology has come a long way as well. The MTBF ratings for even lower end drives are orders of magnitude higher than those of the original 'platter' technologies. Still though, this is the Achilles' heel. This is the place where the mechanics occur. And where mechanics occur, particularly high speed mechanics, failure is one of the realities that need to be dealt with.

I was surprised to learn just how common it is that just a bunch of disks are set up and used for cloud storage services. The reason is simple: cost. It is far more cost effective to place whole disk arrays out for leasing than it is to take that same array and sequester a portion of it for parity or mirroring. As a result, many cloud services offer best effort service, and with smaller services that pretty much works – particularly if the IT staff is diligent with backups. As the data volume grows, however, this approach will not work, as the probability of failure will outweigh the ability to pump the data back into the primary. That threshold is related to the network speed available, and since most organizations do not have infinite bandwidth available, it is a finite number.

Now one could go through the math to figure the probability of data loss and gamble, or one could invest in RAID and be serious about the offering being provided. As we shall see later on, there are technologies that assist in the economic feasibility. In my opinion, it would be the first question I asked someone who wanted to provide me a SaaS offering – first, beyond backup and replication or anything else: will my data be resident on a RAID array? If so, what type? Another question to ask: is the data replicated? If so, how many times and where?
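For those inclined to actually go through that math, a crude way to frame the gamble is to ask how likely another drive failure is during the window it would take to pump the data back in over the network. A simple exponential-failure sketch in Python; the drive count, MTBF and restore-window figures are illustrative assumptions:

import math

def p_additional_failure(n_disks: int, mtbf_hours: float, window_hours: float) -> float:
    """Rough chance that another disk in the set fails during the restore window (exponential model)."""
    aggregate_rate = n_disks / mtbf_hours
    return 1 - math.exp(-aggregate_rate * window_hours)

# Illustrative figures: 24 drives, 500,000-hour MTBF, ~60 hours to pump the data back over the wire
print(round(p_additional_failure(24, 500_000, 60), 4))   # ~0.0029 per incident; it adds up across a fleet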

Storage Virtualization

While a SaaS offering could be created with just a bunch of disk space, allocation of resources would have very rough granularity and the end result would be an environment that is drastically over provisioned. The reason for this is that as space is leased out, the resource is 'used' whether it holds data or not. Additionally, as new customers are brought on line to the service, additional disk space must be acquired and allocated in a discrete fashion. Storage virtualization overcomes this limitation by creating a virtual pool of storage resources that can consist of any number and variety of disks. There are several advantages brought about by the introduction of this type of technology. The most notable is thin provisioning, which from a service provider standpoint is something that is as old as service offerings themselves. As an example, network service providers do not build their networks to be provisioned to 100% of the potential customer capacity 100% of the time. Instead they analyze traffic patterns and engineer the network to handle the expected occurrences of peak traffic. The same might be said of a thinly provisioned environment. Instead of allocating the whole chunk of disk space at the time of the allocation, a smaller thinly provisioned chunk is set up but the larger chunk is represented back to the application. The system then monitors and audits the usage of the allocation and, according to high water thresholds, allocates more space to the user based on some sort of established policy. This has obvious benefits in a SaaS environment, as only very seldom will a customer purchase and use 100% of the space at the outset. The gamble is that the provider keeps enough storage resources within the virtual pool to accommodate any increases. Being that most providers are very familiar with this type of practice in bandwidth provisioning, it is only a small jump to apply that logic to storage.
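The mechanics of thin provisioning are straightforward to sketch: advertise the full volume, back it with a small slice, and grow the backing allocation as a high-water mark is crossed. A toy Python model under those assumptions; the class, policy and figures are invented for illustration and do not describe any particular product:

class ThinPool:
    """Toy sketch of thin provisioning with a high-water-mark growth policy."""

    def __init__(self, physical_gb: float, grow_step_gb: float = 50.0, high_water: float = 0.8):
        self.physical_gb = physical_gb    # what is actually installed behind the pool
        self.allocated_gb = 0.0           # what has been carved out to back volumes so far
        self.used_gb = 0.0                # what customers have actually written
        self.grow_step_gb = grow_step_gb
        self.high_water = high_water

    def provision_volume(self, advertised_gb: float) -> float:
        """Advertise the full volume to the host but back it with only a thin initial slice."""
        self.allocated_gb += min(self.grow_step_gb, advertised_gb)
        return advertised_gb              # the application sees the whole chunk

    def write(self, gb: float) -> None:
        self.used_gb += gb
        # Grow the backing allocation whenever usage crosses the high-water threshold
        while self.used_gb > self.allocated_gb * self.high_water:
            if self.allocated_gb + self.grow_step_gb > self.physical_gb:
                raise RuntimeError("margin call: pool exhausted, add physical storage")
            self.allocated_gb += self.grow_step_gb

pool = ThinPool(physical_gb=1000)
pool.provision_volume(500)   # customer buys 500 GB; only 50 GB is backed at first
pool.write(30)               # no growth yet; usage is still below the high-water mark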

Not all approaches to virtualization are the same, however. Some implementations are done at the disk array level. While this approach does offer pooling and thin provisioning, it only does so at the array level or within the array cluster. Additionally, the approach is closed in that it only works with that disk vendor's implementation. Alternatively, virtualization can be performed above the disk array environment. This approach more appropriately matches a SaaS environment in that the open system approach allows any array to be encompassed into the resource pool, which better leverages the SaaS provider's purchasing power. Rather than getting locked into a particular vendor's approach, the provider has the ability to commoditize the disk resources and hence achieve better pricing points.

There are also situations called 'margin calls'. These are scenarios that can occur in thinly provisioned environments where the data growth is beyond the capacity of the resource pool. In those instances, additional storage must physically be added to the system. With array based approaches, this can run into issues such as spanning beyond the capacity of the array or the cluster, and in order to accommodate the growth the provider needs to migrate the data to a new storage system. With the open system approach, the addition of storage is totally seamless and it can occur with any vendor's hardware. Additionally, implementing storage virtualization at a level above the arrays allows for very easy data migration, which is useful in handling existing data sets.

Data Reduction Methods

This is a key technology for the provider's return on investment. Remember that here storage is the commodity, and in typical Cloud Storage SaaS offerings the commodity is sold by the gigabyte. Obviously, if you can retain 100% of the customer's data while only storing ten or twenty percent of the bits, the delta is revenue back to you as return on investment. If you are then able to take that same technology and leverage it not only across all subscribers but across all content types as well, it becomes something of great value to the overall business model of Storage as a Service. The key to the technology is that the data reduction is performed at the disk level. Additionally, the size of the bit sequence is relatively small (512 bytes) rather than the typical block size. As a result, the comparative set is large (the whole SaaS data store) while the sample is small (512 bytes). The end result is that as more data is added to the system, the context of reference widens correspondingly, meaning that the probability that a particular bit sequence will match another in the repository increases.
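Conceptually, the reduction works by fingerprinting each small sequence and keeping only the sequences not already present in the repository; the wider the store, the better the odds of a match. A toy Python sketch of that idea, using SHA-256 fingerprints purely for illustration (real products use their own chunking and hashing methods):

import hashlib

def stored_fraction(data: bytes, chunk: int = 512) -> float:
    """Fraction of fixed 512-byte sequences that actually has to be kept (unique fingerprints)."""
    seen, total = set(), 0
    for i in range(0, len(data), chunk):
        seen.add(hashlib.sha256(data[i:i + chunk]).digest())
        total += 1
    return len(seen) / total if total else 1.0

# Highly repetitive content reduces dramatically; already-compressed or random content barely at all
print(stored_fraction(b"ABCD" * 128 * 1000))   # 0.001: one unique 512-byte sequence out of a thousand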

But beware, data reduction is not a panacea. Like all technologies it has its limitations and there is the simple fact that some data just does not de-duplicate well. There is also the fact that the data that is stored by the customer is in fact manipulated by an algorithm and abstracted in the repository. This means that some issues of regulatory legal compliance may come into play with some types of content. For the most part however, these issues can be dealt with and data reduction can play a very important role in SaaS architectures, particularly in the back end data store.

Replication of the data

If you are doing due diligence and implementing RAID rather than selling space on 'just a bunch of disks', then you are most probably the type that will go further and create secondary copies of the primary data footprint. If you do this, you also probably want to do it on the back end so as not to impact the service offering. You also probably want to use as little network resource as possible to keep that replicated copy up to date. Here technologies like continuous data protection (CDP) and thin replication can assist in getting the data onto the back end and performing the replication with minimal impact on network resources.
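The essence of continuous data protection is that every write is journaled with a timestamp, so a copy can be reconstructed as of any point in time and shipped to the back end without touching the primary service. A toy Python sketch of the journaling idea; the class and method names are invented for illustration and are not any vendor's implementation:

import time

class CdpJournal:
    """Toy continuous-data-protection journal: every write is logged with a timestamp."""

    def __init__(self):
        self.journal = []                          # (timestamp, offset, data) tuples

    def write(self, offset: int, data: bytes) -> None:
        self.journal.append((time.time(), offset, data))

    def volume_as_of(self, t: float, size: int) -> bytearray:
        """Replay all journaled writes up to time t to rebuild the volume at that instant."""
        volume = bytearray(size)
        for ts, offset, data in self.journal:
            if ts <= t:
                volume[offset:offset + len(data)] = data
        return volume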

Encryption

There are more and more concerns about placing content in the cloud. Typically these concerns come from business users who see it as a major compromise of security policy. Individual end users are also raising concerns around confidentiality of content. Encryption cannot solve the issue by itself, but it can go a long way towards it. It should be noted, though, that with SaaS encryption needs to be considered in two aspects. First is the encryption of data in movement, that is, protecting the data as it is posted into and pulled out of the cloud service. Second is the encryption of data at rest, which is protecting the content once it is resident in a repository. The first is addressed by methods such as SSL/TLS or IPSec. The second is addressed by encryption at the disk level or prior to disk placement.
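As a sketch of the at-rest half, content can be encrypted before it ever reaches the repository, with the customer rather than the provider holding the key; the in-movement half is then handled by the SSL/TLS or IPSec channel it travels over. A minimal Python example, assuming the third-party cryptography package (a library choice made purely for illustration, not one the article prescribes):

# pip install cryptography  -- library choice is an assumption for illustration only
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # ideally held by the customer, not the provider
cipher = Fernet(key)

payload = b"confidential report contents"
at_rest = cipher.encrypt(payload)    # encrypted before it ever reaches the repository (data at rest)
# ... `at_rest` is then posted to the SaaS target over an SSL/TLS or IPSec channel (data in movement) ...
assert cipher.decrypt(at_rest) == payload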

Access Controls

Depending on the type and intention of the service, access controls can be relatively simple (i.e. user name & password) to complex (RSA type). In private cloud environments, normal user credentials for enterprise or organization access would be the minimum requirement. Likely, there will be additional passwords or perhaps even tokenization to access the service. For semi-private clouds the requirements are likely to not be as intense but again, can be if needed. Also, there may be a wide range in the level of access requirements. As an example, for a backup service there only needs to be an iSCSI initiator/target binding and a monthly report on usage that might be accessible over the web. In other services such as collaboration, a higher level portal environment will need to be provided – hence the need for a higher level access control or log on. Needless to say, some consideration will need to be made for access to the service, even if it is for the minimal task of data separation and accounting.

The technologies listed above are not 'required'; as pointed out earlier, just a bunch of disks on the network could be considered cloud storage. Nor is the list exhaustive. But if the provider is serious about the service offering and also serious about its prospect community, it will make investments into at least some if not all of them.

Planning for the Service

There are two perspectives to cover here. The first is that of the customer. When IT organizations start thinking about using cloud services they are either attempting to reduce cost or bypass internal project barriers. Most of these will plan on using the service to answer requirements for off site storage. Secondary sites are not cheap, particularly if the site is properly equipped as a data center. If this does not already exist, it can be a prime motivator for moving secondary or even tertiary data copies into a cloud service.

There are a number of questions and concerns that should be raised prior to using such a service though. The IT staff should create a task group to assemble a list of questions, requirements and qualifications as to what they expect out of the service. Individuals from various areas of practice should be engaged in this process. Examples are Security, Systems Administration, DB Administration, IT Audit, Networking, etc.; the list can be quite extensive. But it is important to be sure to consider all facets of the IT practice in regards to the service in question. In the end a form should be created that can be filled out in dialogs with the various providers being entertained. Tests and pilots are also a good thing to arrange if they can be done. It is important to get an idea of how fast data can be pumped into the cloud. It is also very important to know how fast it can be pulled out as well. At the very least the service should be closely monitored by both storage and networking staff to be certain that the service works according to SLA (if there is one) and is not decaying in performance over time or with increases in data. In either instance communication with the SaaS provider is then in order and may involve technical support and troubleshooting or service expansion. In any event, it should be realized that a SaaS service package, just like the primary data footprint, is not a static thing; and they usually do not shrink!

Some sample questions that might be asked of a SaaS vendor are the following:

Is the data protected by RAID storage?

Is the data replicated? If so, how many times and where will copies be located?

Is the data encrypted in movement? At rest?

What is the estimated ingestion capacity rate? (i.e. how much data can be moved in an hour into the cloud)

What is the estimated restore time? (i.e. how much data can be moved off of the cloud in an hour)

(The two questions above may require an actual test; a rough sizing sketch follows this list.)

What security measures are taken at the storage sites (both cyber and physical)?
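As a rough way to sanity-check the ingestion and restore questions above before arranging a formal test, the arithmetic is simply data volume divided by sustained link speed. A small Python sketch; the data sizes, link speed and efficiency factor are illustrative assumptions only:

def transfer_hours(data_gb: float, link_mbps: float, efficiency: float = 0.7) -> float:
    """Hours to move a data set over a link at an assumed sustained efficiency."""
    return (data_gb * 8e9) / (link_mbps * 1e6 * efficiency) / 3600

# Illustrative figures: a 2 TB initial seed and a 50 GB nightly incremental over a 100 Mb/s link
print(round(transfer_hours(2000, 100), 1))   # ~63.5 hours for the initial seed
print(round(transfer_hours(50, 100), 1))     # ~1.6 hours per night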

These are only a few generic questions that can help get the process started. You will quickly find that once you start bringing other individuals from various disciplines into the process, the list can get large and may need to be optimized and pared down. Once this process is complete, it is good to set up a review committee that will meet with the various vendors and move through the investigation process.

From the perspective of the SaaS provider the issues are similar, as it is in the provider's best interest to meet the needs of the customer. There is a shift in perspective from using the service to providing it, however, and there are two ways that sizing can occur. The first instance is where a prospective SaaS provider already has an existing customer base that it is looking to provide a service to. In this case the data points are readily available. A survey needs to be created that will assemble the pertinent data points and that then needs to be filled out by the various customers of the service. Questions that might be asked are: what is your backup environment like, what is the size of the full data repository, what is the size of the daily incremental backup, can you provide an estimated growth rate, and what is your network bandwidth capacity? Once the data is assembled, it can be tallied up and sizing can occur in a rather accurate fashion.

The second method applies to a prospective provider who does not yet have a known set of data for existing customers. Here some assumptions must be made on a prospective business model. It needs to be determined what the potential target market is for the service launch. Once those numbers are reached, a range or average needs to be figured for many of the data points above to create a typical customer profile. It is important that this profile is well defined and well known. The reason is that as you add new customers onto the service you can, in the course of the service profile survey, identify a relative size for the customer (i.e. 1 standard profile or 3.5 times the standard profile). With that information, predicting service impact and scaling estimations is much easier. From there the system can be sized according to those metrics with an eye to future growth, and capacity is added as the service deployment grows.
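Once a standard profile is defined, the sizing arithmetic is simply the sum of relative customer sizes multiplied by the profile, plus headroom for growth. A minimal Python sketch with invented figures:

def size_pool_tb(relative_profiles, std_profile_tb=2.0, growth_headroom=1.3):
    """Initial usable capacity from relative customer sizes (1.0 = one standard profile)."""
    return sum(relative_profiles) * std_profile_tb * growth_headroom

# Illustrative launch: ten standard customers plus one at 3.5x the standard profile
print(size_pool_tb([1.0] * 10 + [3.5]))   # 35.1 TB of usable pool to start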

As a storage solution provider, my company will assist prospective SaaS providers in doing this initial sizing exercise. As an example, in the first case we assisted a prospect in the creation of the service requirements survey as well as helped in actually administering it. Afterwards, we worked interactively with the provider to size out the appropriate system to meet the requirements of the initial offering. Additionally, we offered scaling information as well as regular consultative services so that the offering is scaled properly.

Like all service offerings, SaaS is only as good as its design. Someone can go out and spend the highest dollar on the 'best' equipment, be somewhat slipshod in the way the system is sized and implemented, and end up with a mediocre service offering. On the other hand, one can get good cost effective equipment, size and implement it with care, and wind up with a superior offering. The message here is that the key to success in SaaS is in the planning, both for the customer as well as the provider.

Infiniband and its unique potential for Storage and Business Continuity

February 18, 2010

It's one of those technologies that many have only had cursory awareness of. It is certainly not a 'mainstream' technology in comparison to IP, Ethernet or even Fibre Channel. Those who have awareness of it know Infiniband as a high performance compute clustering technology that is typically used for very short interconnects within the Data Center. While this is true, its uses and capacity have been expanded into many areas that were once thought to be out of its realm. In addition, many of the distance limitations that have prevented its expanded use are being extended, in some instances to rather amazing distances that rival the more Internet oriented networking technologies. This article will look closely at this networking technology from both historical and evolutionary perspectives. We will also look at some of the unique solutions that are offered by its use.

Not your mother’s Infiniband

The InfiniBand (IB) specification defines the methods and architecture of the interconnect that ties together the I/O subsystems of the next generation of servers, otherwise known as compute clustering. The architecture is based on a serial, switched fabric that currently defines link bandwidths between 2.5 and 120 Gbit/sec. It effectively resolves the scalability, expandability and fault tolerance limitations of the shared bus architecture through the use of switches and routers in the construction of its fabric. In essence, it was created as a bus extension technology to supplant the aging PCI specification.

The protocol is defined as a very thin set of zero copy functions when compared to thicker protocol implementations such as TCP/IP. The figure below illustrates a comparison of the two stacks.

Figure 1. A comparison of TCP/IP and Infiniband Protocols

Note that IB is focused on providing a very specific type of interconnect over a very high reliability line of fairly short distance. In contrast, TCP/IP is intended to support almost any use case over any variety of line quality for undefined distances. In other words, TCP/IP provides robustness for the protocol to work under widely varying conditions. But with this robustness comes overhead. Infiniband instead optimizes the stack to allow for something known as RDMA or Remote Direct Memory Access. RDMA is basically the extension of the direct memory access (DMA) from the memory of one computer into that of another (via READ/WRITE) without involving the server’s operating system. This permits a very high throughput, low latency interconnect which is of particular use to massively parallel compute cluster arrangements. We will return to RDMA and its use a little later.

The figure below shows a typical IB cluster. Note that both the servers and storage are assumed to be relative peers on the network. There are differentiations in the network connections, however. HCAs (Host Channel Adapters) refer to the adapters and drivers that support host server platforms. TCAs (Target Channel Adapters) refer to the I/O subsystem components such as RAID or MAID disk subsystems.

Figure 2. An example Infiniband Network

At its most basic, the IB specification defines the interconnect as point-to-point 2.5 GHz differential pairs (signaling rate) – one transmit and one receive (full duplex) – using LVDS and 8B/10B encoding. This single-lane interconnect delivers 2.5 Gb/s of signaling and is referred to as a 1X interconnect. The specification also allows these single lanes to be bonded into aggregate interconnects to yield higher bandwidths. 4X defines an interface with 8 differential pairs (4 per direction; for fiber, 4 transmit and 4 receive), whereas 12X defines an interface with 24 differential pairs (12 per direction; for fiber, 12 transmit and 12 receive). The table below illustrates various characteristics of the channel classes, including usable data rates.

Table 1.
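The usable figures in the table above follow directly from the 8B/10B line coding, which carries 8 data bits in every 10 signaled bits. A quick Python sketch of the arithmetic for the common link widths:

def usable_gbps(lanes: int, signaling_gbps: float = 2.5) -> float:
    """Usable data rate after 8B/10B line coding (8 data bits carried per 10 signaled bits)."""
    return lanes * signaling_gbps * 8 / 10

for width in (1, 4, 12):
    print(f"{width}X: {usable_gbps(width):.0f} Gb/s usable")   # 1X: 2, 4X: 8, 12X: 24 Gb/s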

Also note that the technology is not standing still. The graph below illustrates the evolution of the IB interface over time.

Figure 3. Graph illustrating the bandwidth evolution of IB

As the topology above in figure 2 shows however, the effective distance of the technology is limited to single data centers. The table below provides some reference to the distance limitations of the various protocols used in the data center environment including IB.

Table 2.

Note that while none of the other technologies extend much further from a simplex link perspective, they do have well established methods of transport that can extend them beyond the data center and even the campus.

This lack of extensibility is changing for Infiniband however. There are products that can extend its supportable link distance to tens, if not hundreds of Kilometers, distances which rival well established WAN interconnects. New products also allow for the inter-connection of IB to the other well established data center protocols, Fibre Channel and Ethernet. These new developments are expanding its potential topology thereby providing the evolutionary framework for IB to become an effective networking tool for next generation Business Continuity and Site Resiliency solutions. In figure 4 below, if we compare the relative bandwidth capacities of IB with Ethernet and Fibre Channel we find a drastic difference in effective bandwidth both presently and in the future.

Figure 4. A relative bandwidth comparison of various Data Center protocols

Virtual I/O

With a very high bandwidth low latency connection it becomes very desirable to use the interconnect for more than one purpose. Because of the ultra-thin profile of the Infiniband stack, it can easily accommodate various protocols within virtual interfaces (VI) that serve different roles. As the figure below illustrates, a host could connect virtually to its data storage resources over iSCSI (via iSER) or native SCSI (via SRP). In addition it could run its host IP stack as a virtual interface as well. This capacity to provide a low overhead high bandwidth link that can support various virtual interfaces (VI) lends it well to interface consolidation within the data center environment. As we shall see however, in combination with the recent developments in extensibility, IB is becoming increasingly useful for a cloud site resiliency model.

Figure 5. Virtual Interfaces supporting different protocols

Infiniband for Storage Networking

One of the primary uses for Data Center interconnects is to attach server resources to data storage subsystems. Original direct storage systems were connected to server resources via internal busses (i.e. PCI) or over very short SCSI (Small Computer System Interface) connections, an arrangement known as Direct Attached Storage (DAS). The SCSI interface is at the heart of most storage networking standards and defines the behaviors of these protocols between hosts (initiators) and I/O devices (targets). An example for our purposes is a host writing data to or reading data from a storage subsystem.

Infiniband has multiple models for supporting SCSI (including iSCSI). The figure below illustrates two of the block storage protocols used, SRP and iSER.

Figure 6. Two IB block storage protocols

SRP (SCSI RDMA Protocol) is a protocol that allows remote command access to a SCSI device. The use of RDMA avoids the overhead and latency of TCP/IP, and because it allows direct RDMA write/read it is a zero copy function. SRP never made it into a formal standard; defined within ANSI T10, the latest draft is rev. 16a (6/3/02).

iSER (iSCSI Extensions for RDMA) is a protocol model defined by the IETF that maps the iSCSI protocol directly over RDMA and is part of the 'Data Mover' architecture. As such, iSCSI management infrastructures can be leveraged. While most say that SRP is easier to implement than iSER, iSER provides enhanced end to end management via iSCSI management. Both protocol models, to effectively support RDMA, possess a peculiar characteristic that results in all RDMA being directed towards the initiator. As such, a SCSI read request translates into an RDMA write from the target to the initiator, whereas a SCSI write request translates into an RDMA read from the target to the initiator. As a result, some of the functional requirements of the I/O process shift to the target, providing offload to the initiator or host. While this might seem strange, if one thinks about what RDMA is, it only makes sense to leverage the direct memory access of the host. The result is a very efficient use of Infiniband for data storage.

Another iteration of a storage networking protocol over IB is Fibre Channel (FCoIB). In this instance, the SCSI protocol is embedded into the Fibre Channel interface, which is in turn run as a virtual interface inside of IB. Hence, unlike iSER and SRP, FCoIB does not leverage RDMA but runs the Fibre Channel protocol as an additional functional overhead. FCoIB does however provide the ability to incorporate existing Fibre Channel SAN’s into an Infiniband network. The figure below illustrates a network that is supporting both iSER and FCoIB, with a Fibre Channel SAN attached by a gateway that provides interface between IB and FC environments.

Figure 7. An IB host supporting both FC & native IB interconnects

As can be seen, a legacy FC SAN can be effectively used in the overall systems network. Add to this high availability and you have a solid solution for a hybrid migration path.

If we stop and think about it, data storage is second only to compute clustering as an ideal usage model for Infiniband. Even so, the use of IB as a SAN is a much more real world usage model for the standard IT organization. Not many IT groups are doing advanced compute clustering, and those that do already know the benefits of IB.

Infiniband & Site Resiliency

Given the standard offered distances of IB, it is little wonder that it has not been often entertained for use in site resiliency. This however, is another area that is changing for Infiniband. There are now technologies available that can extend the distance limitation out to hundreds of kilometers and still provide the native IB protocol end to end. In order to understand the technology we must first understand the inner mechanics of IB.

The figure below shows a comparison between IB and a TCP/IP reliable connection. The TCP/IP connection shows a typical sawtooth profile, which is the normal result of the TCP window mechanics as the connection adapts to congestion. The window starts at a nominal size for the connection and gradually increases in size (i.e. bytes transmitted) until a congestion event is encountered. Depending on the severity of the event, the window can slide all the way back to the nominal starting size. The reason for this behavior is that TCP reliable connections were developed at a time when most long distance links were far less reliable and of lower quality.

Figure 8. A comparison of the throughput profiles of Infiniband & TCP/IP

If we take a look at the Infiniband throughput profile we find that the sawtooth pattern is replaced by a square profile: the transmission instantly goes to 100% of the offered capacity and remains there until an event occurs that halts the transfer; after a period of time, it resumes at 100% of the offered capacity. That event is termed buffer starvation, where the sending channel adapter has exhausted its available buffer credits, which are calculated from the available resources and the bandwidth of the interconnect (i.e. 1X, 4X, etc.). Note that the calculation does not include any significant concept of latency. As we covered earlier, Infiniband was originally intended for very short, highly dependable interconnects, so the variable of transmission latency is so slight that it can effectively be ignored within the data center. As a result, the relationship of buffer credits to available resources and offered channel capacity produced a very high throughput interconnect that seldom ran short of transmit buffer credits – provided things were close.

As distance is extended things become more complex. This is best realized in the familiar bucket analogy. If I sit on one end of a three foot ribbon and you sit on the other end, I have a bucket full of bananas (analogous to the data in the transmit queue) whereas you have a bucket that is empty (analogous to your receive queue). As I pass you the bananas, there is only a short distance, which allows a direct hand off. Remembering that this is RDMA, I pass you the bananas at a very fast predetermined speed (the speed of the offered channel) and you take them just as fast. After passing you the bananas, you pass me a quarter to acknowledge that the bananas have been received (analogous to the completion queue element shown in figure 1). Now imagine that there is someone standing next to me who is providing me bananas at a predetermined rate (the available processing speed of the system). Also, he will only start to fill my bucket if the following two conditions exist: 1) my bucket is empty and 2) I give him the quarter for the last bucket. Obviously the time required end to end will impact that rate. If the resulting rate is equal to the offered channel, we will never run out of bananas, and you and I will be very tired. If that rate is less than the offered channel speed, then at some point I will run out of bananas. At that point I will need to wait until my bucket is full before I begin passing them to you again. This is buffer starvation. Now in a local scenario, we see that the main tuning parameters are a) the size of our buckets (available memory resources for RDMA) and b) the rate of the individual placing bananas into my bucket (the system speed). If these parameters are tuned correctly, the connection will be of very high performance (you and I will move a heck of a lot of bananas). The further we are from that optimal set of parameters, the lower the performance profile will be, and an improperly tuned system will perform dismally.

Now let's take that ribbon and extend it to twelve feet. As we watch the following scenario unfold it becomes obvious why buffer starvation limits distance. Normally, I would toss you a banana and wait for you to catch it. Then I would toss you another one. If you missed one and had to go pick it up off of the ground (the bruised banana is a transmission or reception error), I would wait until you were ready to catch another one. This in reality is closer to TCP/IP. With RDMA, I toss you the bananas just as if you were sitting next to me. What results is a flurry of bananas in the air, all of which you catch successfully because hey – you're good. (In reality, it is because we are assuming a high quality interconnect.) After I fling the bananas, however, I need to wait to receive my quarter and for my bucket to be refilled in turn. At twelve feet, if nothing else changes, we will be forced to pause far more often as my bucket refills. If we move to twenty feet the situation gets even more skewed. We can tune certain things like the depth of our buckets or the speed of the replenishment, but these become unrealistic as we stretch the distance farther and farther. This is what in essence has kept Infiniband inside the data center.*

*Note that the analogy is not totally accurate with the technical issues but it is close enough to give you a feel of the issues at hand.

Now what would happen if I were to put some folks in between us who had reserve buckets for bananas I send to you and you were to do the same for bananas you in turn send to me? Also, unlike the individual who fills my bucket who deals with other intensive tasks such as banana origination (the upper system and application), this person is dedicated one hundred percent to the purpose of relaying bananas. Add to this the fact that this individual has enough quarters to give me for twice the size of his bucket, and yours in turn as well. If we give them nice deep buckets we can see a scenario that would unfold as follows.

I would wait until my bucket was full and then begin to hand off my bananas to the person in front of me. If this individual were three feet from me I could hand them off directly, as I did with you originally. Better than that, I could simply place the bananas in their bucket and they would give me a quarter each time I emptied mine. The process repeats until their bucket is full. They then can begin throwing the bananas to you. While we are at it, why should they toss directly to you? Let's put another individual in front of you who is also completely focused, but instead of being focused on tossing bananas, they are focused on catching them. Now if these relays' buckets are roughly 4 times the size of yours and mine, and the relayed bananas move over the six feet to your receiver at the same rate as they are handed off by me, we in theory should never run out of bananas. There would be an initial period of the channel filling and the use of credit, but after that initial period the channel could operate at optimal speed, with the initial offset in reserve buffer credits being related to the distance or latency of the interconnect. The reason for the channel fill is that the relay has to wait until their bucket is full before they can begin tossing, but importantly, after that initial fill point they will continue to toss bananas as long as there are some in the bucket. In essence, I always have an open channel for placing bananas, and I always get paid and can in turn pay the guy who fills my bucket only on the conditions mentioned earlier.

This buffering characteristic has led to a new class of devices that can provide significant extension to the distance offered by Infiniband. Some of the latest systems can provide buffer credits equivalent to one full second, which is A LOT of time at modern networking speeds. If we add these new devices and another switch to the topology shown earlier we can begin to realize some very big distances that become very attractive for real time active-active site resiliency.
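The amount of buffering these extension devices must supply is essentially the bandwidth-delay product of the stretched link: the data left 'in flight' while acknowledgements (the quarters in the analogy) make the round trip. A rough Python sketch, assuming roughly 5 microseconds of propagation per kilometer of fiber and a 4X link at the base signaling rate (about 8 Gb/s usable); the figures are illustrative only:

def inflight_megabytes(link_gbps: float, distance_km: float, fiber_us_per_km: float = 5.0) -> float:
    """Bandwidth-delay product: data in flight that buffer credits must cover on a stretched link."""
    rtt_seconds = 2 * distance_km * fiber_us_per_km * 1e-6
    return link_gbps * 1e9 * rtt_seconds / 8 / 1e6

# A 4X link (~8 Gb/s usable) stretched to 20 km and 100 km
print(round(inflight_megabytes(8, 20), 2))     # ~0.2 MB must be covered by credits
print(round(inflight_megabytes(8, 100), 2))    # ~1 MB
# A full second of credit at 8 Gb/s covers roughly 1 GB in flight, far more than these distances need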

Figure 9. An extended Infiniband Network

As a case in point, the figure above shows an Infiniband network that is extended out to support data centers that are 20 km apart. The systems at each end, using RDMA, effectively regard each other as local and, for all intents and purposes, in the same data center. This means that versions of fault tolerance and active-active high availability that would otherwise be out of the question are now quite feasible to design and operate in practice. A common virtualized pool of storage resources using iSER allows for seamless treatment of data and reduces the degree of fault dependency between the server and storage systems. Either side could experience a failure at either the server or storage system level and still be resilient. Adding further systems redundancy for both servers and storage locally on each side provides further resiliency, as well as providing for off line background manipulation of the data footprint for replication, testing, etc.

Figure 10. A Hybrid Infiniband network

In order for any interface consolidation effort to work in the data center the virtual interface solution must provide for a method of connectivity to other forms of networking technology. After all, what good is an IP stack that can only communicate within the IB cluster? A new generation of gateway products provide for this option. As shown in the figure above, gateway products exist that can tie IB to both Ethernet and Fibre Channel topologies. This allows for the ability to consolidate data center interfaces and still provide for general internet IP access as well as connectivity to traditional SAN topologies and resources such as Fibre Channel based storage arrays.

While it is clear that Infiniband is unlikely to become a mainstream networking technology, it is also clear that there are many merits to the technology that have kept it alive and provided enough motivation (i.e. market) for its evolution into a more mature architectural component. With the advent of higher speed Ethernet and FCoE, as well as the current development of lower latency profiles for DC Ethernet, the longer range future of Infiniband may be similar to that of Token Ring or FDDI. On the other hand, even with these developments, the technology may be more akin to ATM, which, while far from mainstream, is still used extensively in certain areas. If one has the convenience of waiting for these trends to sort themselves out, then moving to Infiniband in the Data Center may be premature. However, if you are one of the many IT architects faced with intense low latency performance requirements that need to be addressed today and not some time in the future, IB may be the right technology choice for you. It has been implemented by enough organizations that best practices are fairly well defined. It has matured enough to provide for extended connectivity outside of the glass house, and gateway technologies are now in place that can provide connectivity out to other, more traditional forms of networking technology. Infiniband may never set the world on fire, but it has the potential to put out fires that are currently burning in certain high performance application and data center environments.