
What’s the Big Deal about Big Data?

July 28, 2014


It goes without saying that knowledge is power. It gives one the power to make informed decisions and avoid miscalculations and mistakes. In recent years the definition of knowledge has changed slightly. This change is the result of increases in the ease and speed of computation as well as the sheer volume of data that these computations can be exercised against. It is no secret that the rise of computers and the Internet has contributed significantly to this capability.
The term that is often bandied about is “Big Data”. This term has gained a certain mystique that is comparable to cloud computing. Everyone knows that it is important. Unless you have been living in a cave, you most certainly have at least read about it. After all, if such big names as IBM, EMC and Oracle are making a focus of it then it must have some sort of importance to the industry and market as a whole. When pressed for a definition of what it is however, many folks will often struggle. Note that the issue is not that it deals with the computation of large amounts of data as its name implies, but more so that many folks struggle to understand what it would be used for.
This article is intended to clarify the definition of Big Data and Data Analytics/Data Science and what they mean. It will also talk about why they are important and will become more important (almost paramount) in the very near future. Also discussed will be the impact that Big Data will have on the typical IT departments and what it means to traditional data center design and implementation. In order to do this we will start first with the aspect of knowledge itself and the different characterizations of it that have evolved over time.

I. The two main types of ‘scientific’ knowledge

To avoid getting into an in depth discussion of epistemology, we will limit this section of the article to just the areas of ‘scientific’ knowledge or even more specifically, ‘knowledge of the calculable’. This is not to discount other forms of knowledge. There is much to be offered by spiritual and aesthetic knowledge as well as many other classifications including some that would be deemed as scientific, such as biology*. But here we are concerned with knowledge that is computable or knowledge that can be gained by computation.

* This is rapidly changing however. Many recent findings show that many biological phenomena have mathematical foundations. Bodily systems and living populations have been shown to exhibit strong correlations to non-linear power law relationships. In a practical use example, mathematical calculations are often used to estimate the impact of an epidemic on a given population.

Evolving for centuries but coming to fruition with Galileo in the 16th century, it was discovered that nature could be described and even predicted in mathematical terms. The dropping of balls of different sizes and masses from the tower of Pisa is a familiar myth to anyone with even a slight background in the history of science. I say myth, because it is very doubtful that this ever literally took place. Instead, Galileo used inclined planes and ‘perfect’ spheres of various densities to establish that the acceleration due to gravity is constant regardless of size or mass. Lacking an accurate timekeeping device, he would sing a song to keep track of the experiments. Being an accomplished musician, he had a keen sense of timing. The inclined planes provided him the extended time for such a method. He correctly realized that it was resistance or friction that caused the deltas that we see in the everyday world. Everyone knows that when someone drops a cannon ball and a feather off of a roof, the cannon ball will strike the earth first. It is not common sense that in a perfect vacuum both the feather and the cannonball will fall at the exact same rate. It actually takes a video to prove it to the mind and this can be found readily if one looks on the Internet. The really important thing about this is that Galileo calculated this from his work with spheres and inclined planes and that the actual experiment was not carried out until many years after his death, as the ability to generate a perfect vacuum did not exist at the time he lived. I find this very interesting as it says two things about calculable knowledge. First, it allows one to explain why things occur as they do. Second, and perhaps more importantly, it allows one to predict the results once one knows the mathematical pattern of behavior. Galileo realized this.
Even though he was not able to create a perfect vacuum, by the meticulous calculation of the various values involved (with rather archaic mathematics – the equal sign was not yet in common use, nor were most of the symbols that we know as familiar) he was able to arrive at this fact. Needless to say, this goes against all common sense and experience. So much so, that this, as well as his workings with the fledgling science of astronomy, almost landed him on the hot seat (or stake) with the Church. As history attests however, he stuck to his guns and even after the Inquisitional Council had him recant his theories on the heliocentric nature of the solar system, he whispered of the earth… “Yet it still moves”.
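To make the inclined plane reasoning concrete, here is a minimal sketch. It assumes an idealized solid sphere rolling without slipping, for which the textbook acceleration is 5/7 of g times the sine of the incline angle; the function name and the 5 degree incline are illustrative choices, not anything Galileo recorded. Notice that neither mass nor size appears anywhere in the calculation, and that distance grows with the square of the elapsed time.

```python
import math

G = 9.81  # m/s^2, standard gravity near the Earth's surface

def distance_rolled(t, incline_deg, rolling_factor=5.0 / 7.0):
    """Distance (m) a solid sphere rolls down an incline in t seconds, from rest.

    rolling_factor=5/7 is the textbook value for a solid sphere rolling
    without slipping. Mass and radius do not appear in the formula.
    """
    a = rolling_factor * G * math.sin(math.radians(incline_deg))
    return 0.5 * a * t * t

# Distances after 1, 2, 3 and 4 equal time intervals (the beats of the song)
distances = [distance_rolled(t, incline_deg=5.0) for t in (1, 2, 3, 4)]
ratios = [d / distances[0] for d in distances]
print(ratios)  # close to 1, 4, 9, 16: the times-squared pattern
```

Once the pattern is known, the result for any other time or incline can be predicted without running the experiment, which is exactly the second property of calculable knowledge described above.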
If we fast forward to the time of Sir Isaac Newton, this insight was made crystalline by Newton’s laws of motion which described the movement of ‘everything’ from the falling of an apple (no myth – this actually did spark his insight but it did not hit him on the head) to the movement of the planets with a few simple mathematical formulae. Published as the ‘Philosophiae Naturalis Principia Mathematica’ or simply ‘Principia’ in 1687, this was the foundation of modern physics as we know it. The concept that the world was mathematical or at least could be described in mathematical terms was now something that was not only validated but demonstrable. This set of events led to the eventual ‘positivist’ concept of the world that reached its epitome with the following statement made by Pierre-Simon Laplace in 1814.
“Consider an intelligence which, at any instant, could have knowledge of all forces controlling nature together with the momentary conditions of all the entities of which nature consists. If this intelligence were powerful enough to submit all of this data to analysis, it would be able to embrace in a single formula the movements of the largest bodies in the universe and those of lighter atoms; for it, nothing would be uncertain; the future and the past would be equally present to its eyes.”

Wow. Now THAT’s big data! Sounds great! What the heck happened?

Enter Randomness, Entropy & Chaos

In roughly the same time frame as Laplace, many engineers were using these ‘laws’ in attempts to optimize new inventions like the steam engine. One such researcher was a French scientist by the name of Nicolas Léonard Sadi Carnot. The research that he focused on was the movement of heat within the engine and how to conserve as much of the energy as possible for work. In the process he came to realize that there was a feedback cycle within the engine that could be described mathematically and even monitored and controlled. He also realized that some heat is always lost. It just gets radiated out and away from the system and is unusable for the work of the engine. As anyone that has stood next to a working engine of any type will attest, they tend to get hot. This cycle bears his name as the Carnot cycle. This innovative view led to the foundation of a new branch in physics (with the follow-on help of Ludwig Boltzmann) known as thermodynamics; the realization that all change in the world (and the universe as a whole) is the movement of heat, more specifically, hot to cold. Without going into detail on the three major laws of thermodynamics, the main point to this discussion is that as change occurs it is irreversible. Interestingly, the more recently developed information theory validates this as it shows that order can actually be interpreted as ‘information’ and that over time this information is lost to entropy in that there is a loss of order. Entropy is as such a measurement of disorder within a system. This brings us to the major inflection point on our subject. As change occurs, it cannot be run in reverse like a tape and arrive at the same inherent values. This is problematic, as the laws of Newton are not reversible in practice, though they may be on a piece of paper. As a matter of fact, many such representations up to modern times, such as the Feynman diagrams used to illustrate the details of quantum reactions, are in fact reversible. What gives?
The real crux of this quick discussion is the realization that reversibility is largely a mathematical expression that starts to fall apart as the number of components in the overall system gets larger. A very simple example is one with two billiard balls on a pool table. It is fairly straightforward to use the Newtonian laws to reverse the equation. We can also do so in practice. But now let us take a single cue ball and strike a large number of other balls. Reversing the calculation is not nearly so straightforward. The number of variables to be considered begins to go beyond our ability to calculate, much less control. They most certainly are not reversible in the everyday sense. In the same sense, I can flip a deck of playing cards in the air and bet you with ultimate confidence that the cards will not come down in the same order (or even the same area!) in which they were thrown. Splattered eggs do not fall upwards to reassemble on the kitchen counter. And much to our chagrin, our cars do not repair themselves after we have had a fender bender. This is entropy at work, the 2nd law of thermodynamics, which states that some energy within a system is always lost to friction and heat. This dissipation can be minimized but never eliminated. As a result the less entropy an engine generates the more efficient it is in its function. Hmmmm, what told us that? A lot of data, that’s what, and back then things were done with paper & pencil! A great and timely discovery as it helped move us into the industrial age. The point of all of this however is that in some (actually most) instances, information on history is important in understanding the behavior of a system.
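The playing card bet can be put into numbers. The sketch below assumes a standard 52-card deck and treats the landing order as a uniformly random shuffle, which is my simplifying assumption rather than a model of cards actually tumbling through the air. It counts the possible orderings and expresses that count in bits, the unit in which information theory measures how much order there is to lose.

```python
import math

orderings = math.factorial(52)   # distinct orderings of a 52-card deck
p_same_order = 1 / orderings     # chance a uniform shuffle restores one given order
bits = math.log2(orderings)      # information needed to pin down a single ordering

print(f"{orderings:.3e} orderings, p = {p_same_order:.3e}, {bits:.1f} bits")
```

With so many disordered arrangements for every ordered one, the bet is about as safe as bets get.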

The strange attraction of Chaos

We need to fast forward again. Now we are in the early 1960’s with a meteorologist by the name of Edward Lorenz. He was interested in the enhanced computational abilities that new computing technology could offer in the goal of predicting the weather. Never mind that it took five days’ worth of calculation to arrive at the forecast for the following day. At least the check was self-evident as it had already occurred four days ago!
As the story goes he was crunching some data one evening and one of the machines ran out of paper tape. He quickly refilled the machine and started it from where the calculations left off… manually by typing them in. He then went off and grabbed a cup of coffee to let the machine churn away. When he returned he noticed that the computations were way off the values that the sister machines were running. In alarm he looked over his work to find that the only real major difference was the decimal offset of the initial values (the interface only allowed a three place offset while the actual calculation was running with a six place offset). As it turns out the rounded values he typed in manually created a different result from the same calculation. This brought about the realization that many if not most systems are sensitive, at times extremely so, to something now termed ‘initial conditions’.
There is something more however. Lorenz discovered that if some systems are looked at long enough and with the proper focus of granularity, a quasi-regular or quasi-periodic pattern becomes discernible that allows for the general qualitative description of a system and its behavior without the ability to quantitatively say what the state of any particular part of the system may be at a given point in time. These are termed mathematical ‘attractors’ within a system: a certain set of power law based formulae that a system is, if left unperturbed, drawn to and will maintain. These attractors are quite common. They are somewhat required for all dissipative systems. In essence, it is a behavior that can be described mathematically that by its nature keeps a system as a system, with just enough energy coming in to offset the entropy that must inevitably go out. The whole thing is fueled by the flow of energy (heat) through it. By the way, both you and I are examples of dissipative systems and yes we are based on a lot of information. But here is something to consider: stock markets are dissipative systems too. The only difference is that energy is replaced by money.
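The effect of typing in three decimal places instead of six can be reproduced with a far simpler system than a weather model. The sketch below uses the logistic map, a standard textbook example of chaotic behavior. It is a stand-in chosen for brevity, not the set of equations Lorenz was running, and the two starting values are purely illustrative.

```python
def logistic_orbit(x0, steps, r=4.0):
    """Iterate x -> r*x*(1-x), a textbook chaotic map when r = 4."""
    orbit = [x0]
    for _ in range(steps):
        orbit.append(r * orbit[-1] * (1.0 - orbit[-1]))
    return orbit

full = logistic_orbit(0.506127, 100)   # starting value with six decimal places
rounded = logistic_orbit(0.506, 100)   # the same value rounded to three places
gaps = [abs(a - b) for a, b in zip(full, rounded)]

print(f"gap at step 0: {gaps[0]:.6f}")
print(f"largest gap:   {max(gaps):.3f}")
```

The two runs start about one part in ten thousand apart, yet within a hundred steps the gap grows to a sizable fraction of the whole range of possible values, at which point the rounded run says nothing useful about the full precision one.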

The problem with Infinity

The question is how sensitive do we have to be and what level of focus will reveal a pattern? How many decimal places can you leave off and still have faith in the calculations that result? This may sound like mere semantics, but the calculable offset in Lorenz’s work created results that were wildly different. (Otherwise he might very well have dismissed it as noise*)

* Actually in the electronics and communications area this is exactly what the phenomenon was termed as for decades. Additionally, it was termed as ‘undesirable’ and engineers sought to remove or reduce it so it was never researched further as to its nature. Recently efforts to leverage these characteristics are being investigated.

Clearly the accuracy of a given answer is dependent on how accurately the starting conditions are measured. Again, one might say that, OK perhaps this is the case for a minority of cases but that in most cases any difference will be minor. Again, this is alas not true. Most systems are like this. The term is ‘non-linear’. Small degrees of inaccuracy in the initial values of the calculations in non-linear systems can result in vastly different end results. One of the reasons for this is that with the seemingly unassociated concept of infinity, we touch on a very sticky subject. What is an infinite or infinitely accurate initial condition? As an example, I can take a meter and divide it by 100 to arrive at centimeters and then take a centimeter and divide it further to arrive at millimeters and so forth… This process can go on forever! Actually, this is not the case but the answer is not helpful to our cause. We can continue to divide until we arrive at the Planck length, which is the smallest meaningful unit of distance before the very notions of space and time become meaningless! In essence a foam of quantum probability from which emerges existence as we know it.
The practical question must be, when I make a measurement how accurate do I need to be? Well, if I am cutting a two by four for the construction of some macro level structure such as a house or shed, I only need to be accurate to the 2nd maybe 3rd decimal place. On the other hand, if I am talking about cutting a piece of scaffolding fabric to fit surgically into a certain locale within an organ to facilitate a substrate for regenerative growth, the orders of magnitude are very much increased. Possibly out to 6 or 8 decimal places. So the question to ask is how do we know how accurate we have to be? Here comes the pattern part! We know this by the history of the system we are dealing with! In the case of a house, we have plenty of history (a strong pattern – we have built a lot of houses) to deduce that we need only be accurate to a certain degree and the house will successfully stand. In the case of micro-surgery we may have less history (a weaker pattern – we have not done so many of these new medical procedures), but enough to know that a couple of decimal places will just not cut it. Going further we even have things like the weather where we have lots and lots of historic data but the exactitude and density of the information still limits us to only a few days of relatively accurate predictive power. In other words, quite a bit of our knowledge is dependent on the granularity and focus in which it’s analyzed. Are you starting to see a thread? Wink, wink.

Historical and Ahistorical knowledge

It all comes down to the fact that calculable knowledge is dependent on us having some idea of the history & conditions of a given system. Without these we cannot calculate. But how do we arrive at these initial values? Well, by experiment of course. We all recall the days back in school with the tedious hours of experimentation in exercises where we knew full well the result. But think of the first time that this was realized by the likes of say Galileo. What a great moment it would have been! But an experiment by definition cannot be a ‘onetime thing’. One would have to run an experiment multiple times with ‘exactly’ the same conditions or varying the conditions slightly in a controlled fashion depending on what one was trying to prove. This brings about a strong concept of history. The experimental operations have been run, and we know that such a system behaves in such a way due to historical and replicable examples. Now we plug those variables into the mathematics and let it run. We predict from those calculations and then validate with further experiments. Basic science works on these principles, so as such we should say that all calculable knowledge is historic in nature. But it could also be argued that, for certain immutable ‘mathematical truths’, some knowledge is ahistorical. In other words, like Newton’s laws* and like the Feynman diagrams, some knowledge just doesn’t care about the nature or direction of time’s arrow. Be that as it may, it could further be argued that any of these would require historical knowledge in order to interpret their meaning or even find that they exist!

* Newton’s laws are actually approximations of reality. In normal everyday circumstances the linear laws work quite well. When speed or acceleration is brought to extremes however the laws fail to yield a correct representation. Einstein’s theories of relativity provide a more accurate way to represent the non-linear reality under these extreme conditions (actually the effects exist all the time, but in normal environments the delta to the linear is so small as to be negligible). The main difference: in Newton’s laws space and time are absolute. The clock ticks the same regardless of motion or location, hence linear. In Einstein’s theory space and time are mutable and dynamic. The clock ticks differently for different motions or even locations. Specifically, time slows with speed as the local space contracts, hence non-linear.

As an example, you can toss me a ball from about ten feet away. Depending on the angle and the force of the throw I can properly calculate where the ball will be at a certain point in time. I have the whole history of the system from start to finish. I may use an ahistorical piece of knowledge (i.e. the ball is in the air and moving towards me), but without knowledge of the starting conditions for this particular throw I am left with little data and will likely not catch the ball. In retrospect though, it’s amazing that our brains can make this ‘calculation’ all at once. Not explicitly of course but implicitly. We know that we have to back up or run forward to catch the ball. We are not doing the actual calculations in our heads (at least I’m not). But if I were to run out onto the field and see the ball that you threw in mid-air with no knowledge of the starting conditions, I am essentially dealing with point zero in knowledge of a system that is pre-existing. Sounds precarious and it is. Because this is the world we live in. But wait! Remember I have a history in my head on how balls in air behave! I can reference this library and get a chunk of history in very small sample periods (the slow motion effect we often recall) and yes perhaps I just might catch that ball – provided that the skill of the thrower was commensurate with the skill of those I have knowledge of. Ironically, the more variability there is in my experience with throwers of different skill levels, the higher the probability of my catching the ball in such an instance. And it’s all about catching the ball! But it also says something important about calculable knowledge.
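For the case where the starting conditions are known, the calculation of where the ball will be is short. The sketch below assumes no air resistance and a flat field; the release speed and angle are hypothetical numbers picked only to produce a throw of roughly the distance mentioned above.

```python
import math

G = 9.81  # m/s^2

def ball_position(speed, angle_deg, t, release_height=0.0):
    """(x, y) in metres of a ball t seconds after release, ignoring air drag."""
    angle = math.radians(angle_deg)
    x = speed * math.cos(angle) * t
    y = release_height + speed * math.sin(angle) * t - 0.5 * G * t * t
    return x, y

def flight_time(speed, angle_deg):
    """Seconds until the ball returns to its release height."""
    return 2.0 * speed * math.sin(math.radians(angle_deg)) / G

t_land = flight_time(speed=6.0, angle_deg=45.0)
x_land, y_land = ball_position(6.0, 45.0, t_land)
print(f"lands {x_land:.2f} m away after {t_land:.2f} s")
```

Take away the speed and angle and the two functions have nothing to work with, which is the position of the fielder who first sees the ball in mid-air.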

Why does this balloon stay round? The law of large numbers

Thankfully, we live in a world full of history. But ironically, too much history can be a bad thing. More properly put, too specific of a history about a component within a system can be a bad thing. This was made apparent by Ludwig Boltzmann in his studies of gases and their inherent properties. While it is not only impractical but impossible to measure the exact mass and velocity of each and every constituent particle at each and every instant, it is still possible to determine their overall behavior. (He was making the proposition based on the assumption of the existence of as yet unproven molecules and atoms.) As an example, if we have a box filled with air on one side and no air (a vacuum) on the other, we can be certain that if we lift the divider between the two halves, the particles of air will spread or ‘dissipate’ into the other side of the box. Eventually, the gas in the now expanded box will have diffused to every corner. At this point any changes will be random. There is no ‘direction’ in which the particles will have to go. This is the realization of equilibrium. As we pointed out earlier this is simply entropy, reaching its ultimate goal within the limits of the system. Now let us take this box and make it a balloon. If we blow into it, the balloon will inflate and there will be equal distribution of whatever is used to fill it. Note that now the balloon is a ‘system’. After it cools to a uniform state the system will reach equilibrium. But the balloon still stays inflated. Regardless of the fact that there is no notable heat movement within the balloon, it still remains inflated by the heat contained within the equilibrium. After all we did not say that there was no heat. We just said that there was no heat movement or more so that it has been slowed drastically. In actuality, it was realized that it was the movement of the molecules and this residual energy (i.e. the balloon at room temperature) that caused the pressure to keep the balloon inflated.*

* Interesting experiment… blow up a balloon and then place it in the freezer for a short while.
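What to expect from the freezer experiment can be estimated with Charles's law, which says that at constant pressure the volume of an ideal gas is proportional to its absolute temperature. The sketch below treats the air as an ideal gas and ignores the elasticity of the balloon skin; the room and freezer temperatures are assumed values.

```python
def volume_after_cooling(volume, temp_c_start, temp_c_end):
    """Charles's law at constant pressure: V2 = V1 * T2 / T1, temperatures in kelvin."""
    t1 = temp_c_start + 273.15
    t2 = temp_c_end + 273.15
    return volume * t2 / t1

# Assumed temperatures: a 20 C room and a -18 C freezer
v2 = volume_after_cooling(volume=1.0, temp_c_start=20.0, temp_c_end=-18.0)
print(f"relative volume after cooling: {v2:.3f}")
```

Under these assumptions the balloon shrinks noticeably but stays inflated, because the air inside is colder, not devoid of heat.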

Boltzmann, as a result of this realization, was able to manipulate the temperature of a gas to control its pressure in a fixed container and vice versa. This showed that the increase in heat actually caused more movement within the constituent particles of gas. He found that while it was futile to try and calculate what occurs to a single particle, it was possible to represent the behavior of the whole mass of particles in the system by the use of what we now call statistical analysis. An example is shown in figure 1. What it illustrates is that as the gas heats up the familiar bell curve flattens and hence widens the probability that a given particle will be at a certain speed and heat level.


Figure 1. Flattening Bell curves to temperature coefficients
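The flattening of the curve with temperature can be sketched numerically. The example below assumes an ideal gas of nitrogen molecules and looks at a single velocity component, which follows a true bell (Gaussian) curve whose width is the square root of kT/m. The distribution of overall speed has a slightly different shape, so this is a simplification, and the temperatures are arbitrary.

```python
import math

K_B = 1.380649e-23          # Boltzmann constant, J/K
M_N2 = 28.0 * 1.66054e-27   # approximate mass of one nitrogen molecule, kg

def velocity_spread(temp_k, mass=M_N2):
    """Standard deviation (m/s) of one velocity component in an ideal gas."""
    return math.sqrt(K_B * temp_k / mass)

def peak_density(temp_k, mass=M_N2):
    """Height of the Gaussian bell curve at its centre."""
    return 1.0 / (velocity_spread(temp_k, mass) * math.sqrt(2.0 * math.pi))

for temp in (150.0, 300.0, 600.0):
    print(temp, round(velocity_spread(temp), 1), peak_density(temp))
```

Doubling the temperature widens the curve by a factor of the square root of two and lowers its peak by the same factor, without anything being known about any individual particle.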

This was a grand insight, and it has enabled a whole new branch of knowledge which, for better or worse, has helped shape our modern world. Note I am not gushing over the virtues of statistics, but it does, when properly used, have strong merits and it has enabled us to see things to which we would otherwise be blind. And after all, this is what knowledge is all about right? But wait, I have more to say about statistics. It’s not all good. As it turns out even if used properly, it can have blind spots.

Those pesky Black Swans…

There is a neat book written on the subject by a gentleman by the name of Nassim Nicholas Taleb*. In it he artfully speaks to the improbable but possible: those events that occur every once in a while to which statistical analysis is often blind. These events are termed ‘Black Swans’. He goes on to show these events are somewhat invisible to normal statistical analysis in that they are improbable events on the ‘outside’ of the Bell Curve (termed ‘outliers’). He also goes on to indicate what he thinks is the cause. We tend to get myopic on the trends and almost convince ourselves of their dependability. We also do not like to think of ourselves as wrong or somehow flawed in our assumptions. He points out that in today’s world of information, there is almost too much of it and that you can find stats or facts just about anywhere to fit and justify your belief in that dependability. He is totally correct. Statistics is vulnerable to this. Yet, I need to correct that just a bit. It’s not statistics that is at fault. The fault lies with those using it as a tool.

* The Black Swan – Random House

Further, Taleb provides some insight into things that might serve as flags or ‘tell tales’ to Black Swans. As an example, he notes that prior to all drastic market declines the markets behaved in a spiky, intermittent fashion that, while still in norm with the Gaussian, had an associated ‘noise’ factor. Note that parallel phenomena exist within electronics, communications and yes you guessed it, the weather! This ‘noise’ tends to indicate ‘instability’ where the system is about to change in a major topological fashion to another phase. These are handy things to know. Note how they deal with the overall ‘pattern’ of behavior. Not the statistical mean or even median.
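How blind a bell curve can be to outliers is easy to put into numbers. The sketch below compares the probability of landing more than k units from the centre under a standard normal curve with the same probability under a standard Cauchy curve. The Cauchy distribution is my choice of heavy-tailed stand-in for illustration; it is not a model taken from the book.

```python
import math

def gaussian_tail(k):
    """P(|Z| > k) for a standard normal variable."""
    return math.erfc(k / math.sqrt(2.0))

def cauchy_tail(k):
    """P(|X| > k) for a standard Cauchy variable (a heavy-tailed stand-in)."""
    return 1.0 - (2.0 / math.pi) * math.atan(k)

for k in (2, 5, 10):
    print(k, gaussian_tail(k), cauchy_tail(k))
```

At five units out the Gaussian assigns less than one chance in a million, while the heavy-tailed curve still assigns more than one in ten. If the data really follow the second curve but are analyzed with the first, the extreme events look all but impossible right up until they happen.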

Why is this at all important?

At this point you might be asking yourself, where am I going with all of this? Well, it’s all about Big Data! As we pointed out, all knowledge is historical even if gained by ahistorical (law) insight. Properly understanding a given system means that one needs to understand not only those statistical trends, but higher level patterns of behavior that might foretell outliers and black swans. All of this requires huge amounts of data of potentially wide varieties as well. Think of a simple example of modeling for a highway expansion. You go through the standard calculation and then decide that you want to take the local seasonal weather patterns into consideration. Things have exponentially increased in computation and data store requirements. This is what the challenge of Big Data is all about. It lies in the realization that it is not intended for handling the ‘simple’ questions. It is intent on pushing out the bounds of what is deemed tractable or calculable in the sense of knowledge. It’s not that the mathematics did not exist in the past. It’s just that now the capability is within ‘everyday’ computational reach. Next let’s consider the use cases for Big Data and perhaps touch on a few actual implementations that you could actually run in your data center.


II. Big Data – What’s it good for? Absolutely everything! Well, almost…

If you will recall we spoke about dissipative systems. As it turns out, almost everything is dissipative in nature. The weather, the economy, the stock market, international political dynamics, our bodies, one could even say our own minds. Clearly, there is something to consider in all of that. The way humans behave is a particularly quirky thing. They (we) are also as a result the primary driver of and input into many of the other systems such as economics, politics, the stock market and yes even the weather. Further understanding in these areas could prove, and actually has proven, to be profound.
These are important things to know and we will talk a little later as to these lofty goals. But in reality Big Data can have far more modest goals and interests. A good real world example is retail sales. It gets back to the age old adage… “Know your customer.” But in today’s cyber-commerce environment that’s often easier said than done. Fortunately, there are companies that are working in this area. One of the real founders of this is Google. Google is an information company at its heart. When one thinks about the sheer mass of information that it possesses it is simply boggling. Yet, Google strongly needed to leverage and somehow make sense of that data. At the same time however it had practical limits on computational power and the associated costs. Out of these competing and contradictory requirements came the realization of a parallel compute infrastructure that leverages off the shelf commodity systems. Initially it was introduced to the public in a series of white papers as the Google File System or GFS and other ‘sister’ papers such as MapReduce, which provides for key/value mappings, and Bigtable, which can represent structured data in the environment. This technology has since been embraced by the open source community as Apache Hadoop, whose storage layer is known as the Hadoop Distributed File System or HDFS. The figure below shows the evolution of these efforts into the open source community.


Figure 2. Hadoop outgrowth and evolution into the open source space
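The key/value idea behind MapReduce mentioned above can be illustrated in a few lines. This is a single-process sketch of the programming model only, using the customary word count example; the function names are my own and it does not use the Hadoop API. In a real cluster the map and reduce calls would run in parallel on many machines.

```python
from collections import defaultdict

def map_phase(record):
    """Emit a (key, value) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)

records = ["big data is big", "data about data"]
pairs = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

Because each map call sees only one record and each reduce call sees only one key, the work can be split across as many commodity machines as are available.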

The benefits of these developments are important as they provide the springboard for the use of big data and data analytics in the typical Enterprise IT environment. Since this inception a literal market sector has sprung up with major vendors such as EMC and IBM but also startups such as Cloudera and MapR. This article will not go into the details of these different vendor architectures but suffice it to say that each has its spin and secret sauce that differentiates its approach. You can feel free to look into these different vendors and research others. For the purposes of this article we are concerned more so with the architectural principles of Hadoop and what it means to a Data Center environment. In data analytics a lot of data has to be read very fast. The longer the read time, the longer the overall analytics process. HDFS leverages parallel processing at a very low level to provide for a highly optimized read time environment.


Figure 3. A comparison of sequential and parallel reads

In the above we show the same 1 terabyte data file being read by a conventional serial read process versus a Hadoop HDFS cluster which reduces the read time by a factor of ten. Note that the same system type is being used in both instances, but in the HDFS scenario there are just a lot more of them. Importantly, the actual analytic programming runs in parallel as well. Note also that this is just an example. The typical HDFS block size is 64 or 128MB. This means that relatively large amounts of data can be processed extremely fast with a somewhat modest infrastructure investment. As an additional note, HDFS also provides for redundancy and resiliency of data by the use of replication of the distributed data blocks within the cluster.
The main point is that HDFS leverages a distributed data footprint rather than a singular SAN environment. Very often HDFS farms are composed completely of Direct Attach Storage systems that are tightly coupled via the data center network.
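The arithmetic behind the comparison of serial and parallel reads can be sketched as follows. The per-node read rate of 100 MiB per second is an assumed figure for illustration, and the model is idealized: it ignores network overhead, replication and scheduling, so real clusters will fall short of the perfect speed-up shown.

```python
import math

def block_count(file_bytes, block_bytes):
    """Number of HDFS-style blocks needed to hold a file."""
    return math.ceil(file_bytes / block_bytes)

def read_seconds(file_bytes, bytes_per_second, readers=1):
    """Idealised read time when `readers` nodes each read an equal share."""
    return file_bytes / (bytes_per_second * readers)

TIB = 2 ** 40
MIB = 2 ** 20
DISK_RATE = 100 * MIB   # assumed per-node sequential read rate

blocks_128 = block_count(TIB, 128 * MIB)
blocks_64 = block_count(TIB, 64 * MIB)
serial = read_seconds(TIB, DISK_RATE)
parallel = read_seconds(TIB, DISK_RATE, readers=10)
print(blocks_128, blocks_64, serial, parallel)
```

A 1 TiB file splits into thousands of blocks at either block size, so there is plenty of independent work to hand out, and under these assumptions ten nodes reading at once finish in a tenth of the time.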

How the cute little yellow elephant operates…

Hadoop is a strange name, and a cute little yellow elephant as its icon is even more puzzling. As it turns out the young son of one of the key developers had a yellow stuffed elephant that he had named Hadoop. The father decided it would be a neat internal project name. The name stuck and the rest is history. True story, strange as it may seem.
Hadoop is not a peer to peer distribution framework. It is hierarchical, with certain master and slave roles within its architecture. The components of HDFS are fairly straight forward and shown in simplified form in the diagram below.


Figure 4. Hadoop HDFS System Components

The overall HDFS cluster is managed by an entity known as the namenode. You can think of it as the library card index for the file system. More properly, it generates and manages the meta-data for the HDFS cluster. As a file gets broken into blocks and placed into HDFS, it is the namenode that indicates where each block goes, and the namenode that tracks the blocks and replicates them if required. The meta-data always provides a consistent map of where specific data resides in the distributed file system. This is used not only for writing into or extracting out of the cluster, but also for data analytics, which requires a reading of the data for its execution. It is important to note that in first generation Hadoop the namenode was a single point of failure. The secondary namenode in generation 1 Hadoop is actually a housekeeping process that extracts the namenode run-time metadata and copies it to disk in what is known as a namenode ‘checkpoint’. Recent versions of Hadoop offer redundancy for the namenode. Cloudera, for instance, provides high availability for the namenode service.
There is a second master node known as the jobtracker. This service tracks the various jobs required to maintain and run over the HDFS environment. Both of these nodes are master role nodes. As such, Hadoop is not a peer-to-peer clustering technology; it is hierarchical.
In the slave role are the datanodes. These are the nodes that actually hold the data that resides within the HDFS cluster. In other words, the blocks of data that are mapped by the namenode reside on these systems’ disks. Most often datanodes use direct attached storage and only leverage SAN to a very limited extent. The tasktracker is a process that runs on each datanode; tasktrackers are managed by, and report back to, the jobtracker for the various executions that occur within the Hadoop HDFS cluster.
And lastly, one of these nodes, referred to as the ‘edge node’, will have an ‘external’ interface that exposes the HDFS environment so that PCs running the Hadoop HDFS client can be given access.

Figure 5. HDFS Data Distribution & Replication

HDFS is actually fairly efficient in that it incorporates replication into the write process. As shown above, when a file is ingested into the cluster it is broken up into a series of blocks. The namenode uses a distribution algorithm to map where the actual data blocks will reside within the cluster. An HDFS cluster has a default replication factor of three. This means that three copies of each individual block are stored and placed algorithmically. The namenode in turn develops a meta-data map of all resident blocks within the distributed file system. This meta-data is a key requirement for the read function, which is itself a requirement for analytics.
If a datanode were to fail within the cluster, HDFS will ‘respawn’ the lost data to meet the distribution and replication requirements. All of this means east/west data traffic, but it also means consistent distribution and replication, which is critical for parallel processing.
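
The chunking, placement and ‘respawn’ behavior described above can be sketched as a toy model. This is not the actual HDFS placement algorithm: the round-robin placement, the function names and the node names are all assumptions made purely for illustration.

```python
import math

def split_into_blocks(file_size, block_size):
    """Return the block ids for a file of the given size."""
    return [f"blk_{i}" for i in range(math.ceil(file_size / block_size))]

def place_blocks(blocks, datanodes, replication=3):
    """Build a namenode-style metadata map: block id -> ordered datanode list.

    Round-robin placement is a simplification for illustration only.
    """
    metadata = {}
    for i, block in enumerate(blocks):
        metadata[block] = [datanodes[(i + r) % len(datanodes)]
                           for r in range(replication)]
    return metadata

def respawn(metadata, failed, datanodes, replication=3):
    """Re-replicate blocks that lost a copy when a datanode failed."""
    survivors = [d for d in datanodes if d != failed]
    for block, holders in metadata.items():
        holders[:] = [d for d in holders if d != failed]
        for candidate in survivors:
            if len(holders) >= replication:
                break
            if candidate not in holders:
                holders.append(candidate)
    return metadata

nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]      # hypothetical datanodes
meta = place_blocks(split_into_blocks(300, 128), nodes)   # 300 MB file, 128 MB blocks
respawn(meta, "dn2", nodes)                      # dn2 fails; copies are rebuilt
for block, holders in meta.items():
    print(block, holders)
```

After the simulated failure every block is back to three copies on three different surviving nodes, which is the property the namenode works to maintain.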
HDFS is also rack aware. By this we mean that the namenode can be programmed to recognize that certain datanodes share a rack, and that this should be taken into consideration during the block distribution or replication process. This awareness is not automatic. It must be programmed by a batch or Python script. Once this is done, however, it allows the span algorithm to place the first copy of a data block on a certain rack and then place the two replicas together on a separate rack. As shown in the figure below, data blocks A and B are distributed evenly across the cluster racks.

Figure 6. HDFS ‘Rack Awareness’
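
As a sketch of what such a rack awareness script can look like: Hadoop can be pointed at an executable through the net.topology.script.file.name property, which it calls with one or more datanode addresses as arguments and from which it reads one rack path per address. The addresses and rack names below are hypothetical; a real deployment would load the table from an inventory file or derive it from the subnet.

```python
#!/usr/bin/env python3
import sys

# Hypothetical address-to-rack table, for illustration only.
RACK_BY_HOST = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
    "10.0.2.22": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"

def resolve_rack(host):
    """Map one datanode address to its rack path."""
    return RACK_BY_HOST.get(host, DEFAULT_RACK)

def main(argv):
    # Hadoop passes one or more addresses and reads one rack path per
    # address from standard output.
    print(" ".join(resolve_rack(host) for host in argv))

if __name__ == "__main__":
    main(sys.argv[1:])
```

Unknown hosts fall back to a default rack so that a missing table entry does not stop a datanode from joining the cluster.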

Note that while the default replication factor for HDFS is three, it can be increased or decreased at the directory or even the file level. When the replication factor is adjusted for a certain data set, the namenode ensures that data is replicated, spawned or deleted according to the adjusted value.
HDFS uses pipelined writes to move data blocks into the cluster. In Figure 7, an HDFS client executes a write for file.txt. As an example, the user might use the copyFromLocal command. The request is sent to the namenode. The namenode responds with meta-data telling the client where to write the data blocks. Datanode 1 is the first in the pipeline, so it receives the request and sends a ready request to datanodes 7 and 9. Datanodes 7 and 9 respond, and the write process then begins by placing the data block on datanode 1, from where it is pipelined to datanodes 7 and 9. The write process is not complete until all datanodes respond with a write success. Note that most data centers use a spine and leaf topology, meaning that most of the rack to rack data distribution must flow up and through the data center core nodes. In Avaya’s view, this is highly inefficient and can lead to significant bottlenecks that will limit the parallelization capabilities of Hadoop.

Figure 7. HDFS pipelined writes
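
The write sequence in Figure 7 can be modeled in a few lines. This is a deliberately simplified toy: the function name, the storage dictionary and the all-or-nothing readiness check are our own assumptions, and a real HDFS client streams a block in small packets and has recovery logic that is not shown here.

```python
def pipelined_write(block, pipeline, storage, failed=()):
    """Write one block through an ordered datanode pipeline.

    Returns True only when every datanode in the pipeline acknowledges.
    """
    # Setup: every node must confirm it is ready before any data moves.
    if any(node in failed for node in pipeline):
        return False
    # Data phase: the client sends to the first node only; each node
    # stores the block and forwards it to the next node in the list.
    acks = []
    for node in pipeline:
        storage.setdefault(node, []).append(block)
        acks.append(node)
    # The write succeeds once every node in the pipeline has acknowledged.
    return acks == list(pipeline)

storage = {}
pipeline = ["datanode1", "datanode7", "datanode9"]   # as in Figure 7
ok = pipelined_write("file.txt:blockA", pipeline, storage)
print(ok, storage)
```

Note that the client only ever talks to the first datanode. The copies to datanodes 7 and 9 are east/west traffic between the datanodes themselves.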

Additionally, recent recommendations are to move to 40 Gb interfaces for this purpose. These interfaces most certainly are NOT cheap. With the spine and leaf approach, this means that rack to rack growth requires a large capital outlay at each expansion. Suddenly, Big Data and Data Science for the common man is becoming a myth! Network costs start to become the key investment as the cluster grows, and with big data, clusters always grow. We at Avaya have been focusing on this east/west capacity issue within the data center top of rack environment.
Reads within the HDFS environment happen in a similar fashion. When the Hadoop client requests to read a given file, the namenode responds with the appropriate meta-data so that the client can in turn request the separate data blocks from the HDFS cluster. It is important to note that the meta-data for a given block is an ordered list. In the diagram below, the namenode responds with meta-data for data block A as being on datanodes 1, 7 & 9. The client will request the block from the first datanode in the list. Only after a failed response will it attempt to read from the other datanodes.

Figure 8. HDFS ordered reads
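
The ordered read with fallback can be sketched as follows. The fetch function is a hypothetical stand-in for the real client transport, and the set of ‘down’ nodes exists only to simulate a failed response from the first replica.

```python
def read_block(block, locations, fetch):
    """Try each datanode in the order the namenode listed it."""
    for node in locations:
        try:
            return node, fetch(node, block)
        except ConnectionError:
            continue  # fall through to the next replica in the list
    raise IOError(f"no replica of {block} could be read")

# Metadata as in Figure 8: block A lives on datanodes 1, 7 and 9.
metadata = {"A": ["datanode1", "datanode7", "datanode9"]}
down = {"datanode1"}            # pretend the first replica is unreachable

def fetch(node, block):
    """Hypothetical transport: fails for nodes marked as down."""
    if node in down:
        raise ConnectionError(node)
    return f"<contents of block {block}>"

served_by, data = read_block("A", metadata["A"], fetch)
print(served_by)
```

With datanode 1 marked as down, the block is served by datanode 7, the next entry in the ordered list.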

Another important note is that the read requests for data blocks B & C occur in parallel. It is only after all data blocks have been confirmed and acknowledged that a read request is deemed complete. Finally, as with the write process, any rack to rack east/west flows need to pass over the core switch in a typical spine and leaf architecture. But it is important to note that most analytic processes will not use this type of methodology for ‘reading’. Instead, ‘jobs’ are sent in and partitioned into the environment, where the read and compute processes occur on the local datanodes and are then reduced into an output from the system as a whole. This provides the true ‘magic’ of Hadoop, but it requires a relatively large east/west (rack to rack) capacity, and that capacity requirement only grows as the cluster grows.
We at Avaya have anticipated this change of data center traffic patterns. As such we have taken a much more straightforward approach. We call it Distributed Top of Rack or “D-ToR”. ToR switches are directly interconnected using very high bandwidth backplane connections. These 80G+ connections provide ultra-low latency, direct connections to other ToRs to address the expected growth. The ToRs are also connected to the upstream core which can allow for the use of L3 and IP VPN services to ensure security and privacy.

Figure 9. Distributed Top of Rack benefits for HDFS

Note that the D-ToR approach is much better suited to high capacity east/west data flows, rack to rack, within the data center. Growth of the cluster no longer depends on continual investment in the spine and leaf topology; new racks are simply extended into the existing fabric mesh. Going further, by using front port capacity, direct east/west interconnects between remote data centers can be created. We refer to this as Remote Rack to Rack. One of the unseen advantages of D-ToR is the reduction of north/south traffic. Where many architects were looking at upgrading to 40G or even 100G uplinks, Avaya’s approach negates this requirement by allowing native L2 east/west server traffic to stay at the rack level. The ports required for this are already in the ToR switches. This provides relief to these strained connections. It also allows for seamless expansion of the cluster without the need for continual capital investment in high speed interfaces.
Another key advantage of D-ToR is the flexibility it provides:
• Server to server connections, in rack, across rows, building to building or even site to site! The architecture is far superior to other approaches in supporting advanced clustering technologies such as Hadoop HDFS.
• Traffic stays where it needs to be, reserving the north/south links for end user traffic or for advanced L3 services. Only traffic that classifies as such need traverse the north/south paths.
• The end result is a vast reduction in the traffic on those pipes as well as a significant performance increase for east/west data flows, at far lower cost.

Figure 10. Distributed Top of Rack modes of operation

Avaya’s Distributed Top of Rack can operate in two different ways:
• Stack-Mode can dual connect up to eight D-ToR switches. The interconnect is 640 Gb without losing any front ports! Additionally, dual D-ToR switches can be used to scale up to 16, giving a maximum east/west profile of 10 Tb/s.
• Fabric-Mode creates a “one hop” mesh which can scale up to hundreds of D-ToR switches! The port count tops out at more than ten thousand 10 Gb ports, with a maximum east/west capacity of hundreds of terabits.

Figure 11. A Geo-distributed Top of Rack environment

Avaya’s D-ToR solution can scale in either mode. Whether the needs are small, large or unknown, D-ToR and Fabric Connect provide unmatched scale, flexibility and, perhaps most importantly, the capability to solve the challenges that most of us face, even the unknown ones. As the HDFS farm grows, the seamless expansion capability of Avaya’s D-ToR environment can accommodate it without major architectural design changes.
Another key benefit is that Avaya has solved the complex PCI or HIPAA compliance issues without having to physically segment networks or add layer upon layer of firewalls. The same can be said for any sensitive data environment that might be using Hadoop, such as patient medical records, banking and financial information, smart power grid or private personal data. Avaya’s Stealth networking technology (referred to in the previous “Dark Horse” article) can keep such networks invisible and self-enclosed. As a result, any attack or scanning surfaces to the data analytics network are removed. The reason for this is that Fabric Connect as a technology does not depend upon IP as a protocol to establish an end to end service path. This removes one of the primary scaffolds for all espionage and attack methods. As a result the Fabric Connect environment is ‘dark’ to the IP protocol. IP scanning and other topological scanning techniques will yield little or no information.

Using MapReduce to extract meaningful data

Now that we have the data effectively stored and retrievable we will obviously want to exercise certain queries against the data and hopefully receive meaningful answers. MapReduce is the original methodology documented in the Google white papers. Note that it is also a utility within HDFS and is used to chunk and create meta-data for the stored information within the HDFS environment. Data can also be analyzed with MapReduce to extract meaningful secondary data such as hit counts & trends which can serve as the historical foundation for predictive analytics.

Figure 12. A Map Reduce job

Figure 12 shows a MapReduce job being sent into the HDFS environment. The HDFS cluster runs the MapReduce program against the data set and provides a response back to the client. Recall that HDFS leverages parallel read/write paths. MapReduce builds on this foundation. As a result, east/west capacity and latency are important considerations in the overall solution.
• Avaya’s D-ToR solution provides easy and consistent scaling of the rack to rack environment as the Hadoop farm grows.

The components of MapReduce are relatively simple.

First there is the Map function, which provides the meta-data context within the cluster. It is an independent record transformation that produces a representation of the actual data, including deletions and replications to the system. For analytics, the function is performed against key value (K,V) pairs. The best way to describe it is with an example. Let’s say we have a word and we want to see how often it appears in a document or a given set of documents. Let’s say that we are looking for the word ‘cow’. This becomes the ‘key’. Every time the Map function ‘reads’ the word cow it ticks a ‘value’ of 1. As the function proceeds through the read job, these ticks are appended into a list of key/value pairs. The Reduce function then aggregates the results from the Map phase into a list of key value pairs that are to be construed as the answer to the query, such as cow,31, meaning that there are 31 instances of the word ‘cow’ in the document or set of documents.
Finally, there is the framework function, which is responsible for scheduling and re-running tasks. It also provides utility functions such as splitting the input, which becomes more apparent in the figure below; this is the chunking functionality that we spoke of earlier as data is written into HDFS. Typically, these queries are constructed into a larger framework. The figure shows a simple example of a query framework.

Figure 13. A simple Map Reduce word count histogram

Above we see a simple word count histogram, which is the exact process we talked about previously. The upper arrow shows data flow across the MapReduce process chain. As data is ingested into the HDFS cluster it is chunked into blocks as previously covered. The map function performs its read against the individual blocks of data. For purposes of optimization there are copy, sort and merge functions that aggregate the resulting lists of key value pairs. This is referred to as the shuffle phase, and it is accomplished by leveraging east/west capacity within the HDFS cluster. From this the reduce function reduces the received key value outputs to a single statement (e.g. cow,31).
In the example above we show a construct that counts three words: cow, barn and field. The details for two of the key value queries are shown. The third is simply an extension of what is shown. From this we can infer that among these records cow appears with field more often than with barn. This is obviously a very simple example with no real practical purpose unless you are analyzing dairy farmer diaries. But it illustrates the potential of the clustering approach in facilitating data farms that are well suited to analytics, which relies very heavily on read performance.
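
The map, shuffle and reduce phases of the word count can be imitated on a single machine in plain Python. This is only a sketch of the data flow, not Hadoop’s own API. The three text ‘blocks’ are made up for illustration, and on a real cluster each map call would run on the datanode that holds the block.

```python
from collections import defaultdict

def map_phase(block):
    """Emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]

def shuffle_phase(mapped_blocks):
    """Group the values emitted by all mappers under their key."""
    grouped = defaultdict(list)
    for pairs in mapped_blocks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped, keys):
    """Sum the ticks for each key of interest."""
    return {key: sum(grouped.get(key, [])) for key in keys}

blocks = [
    "the cow stood in the field",
    "the cow left the barn for the field",
    "a cow in a field",
]
mapped = [map_phase(b) for b in blocks]      # one map task per block
counts = reduce_phase(shuffle_phase(mapped), ["cow", "barn", "field"])
print(counts)
```

The shuffle step is where the pairs from every block are brought together under their key, and on a cluster that is exactly the step that generates rack to rack traffic.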
In another, more practical example, let’s say that we want to implement an analytics function for customer relationship management. We would want to know things like how often key words such as ‘refund’ or ‘dissatisfied’ or even terms like ‘crap’ and ‘garbage’ come up in queries of emails, letters or even transcripts of voice calls. Such information is obviously valuable and can provide insight into customer satisfaction levels.
As one might guess, things could very quickly get unwieldy when dealing with large numbers of atomic key/value queries. YARN, which stands for ‘Yet Another Resource Negotiator’, allows for the building of complex tasks that are represented and managed by application masters. The application master starts and recycles tasks and also requests resources from the YARN resource manager. As a result, a cycling, self-managing job can be run. Weave is an additional developing overlay that provides more extensive job management functions.

Figure 14. Using Hadoop and Mahout to analyze for credit fraud

The figure above illustrates a practical use of the technology. Here we are monitoring incoming credit card transactions in order to flag them to analysts. Transaction data is represented as flagged key value pairs. Indeed there may be dozens of key value pairs that are part of this initial operation. This provides consistent input into the rest of the flow. LDA scoring, based on Latent Dirichlet Allocation, allows for a comparison against the normative set. It can also play a predictive role. This step provides a scoring function on the generated key value pairs. At this point LDA assigns a percentile of anomaly to a transaction. From there, further logic can then adjust a given merchant score.
All of this is based on yet another higher level construct known as Mahout. Mahout provides an orchestration and API library set that can execute a wide variety of operations, such as LDA.
Other examples are Matrix Factorization, K-Means and Fuzzy K-Means, Logistic Regression, Naïve Bayes and Random Forest, all of which are in essence packaged algorithmic functions that can be performed against the resident data for analytical and/or predictive purposes. Further, these can be cycled, as in the example above, which would operate on each fresh batch presented to it.
Below is a quick definition of each set of functions for reference:

Matrix Factorization –
As its name implies, this function involves factorizing matrices, which is to say finding two or more matrices that, when multiplied together, yield (or closely approximate) the original matrix. The factor matrices are typically much smaller than the original. This can be used to discover latent features between entities. Factoring into more than two matrices requires the use of tensor mathematics, which is more complicated. A good example of use is in movie popularity and ratings matching, such as that done by Netflix. Film recommendations can be made fairly accurately based on identifying these latent features. A subscriber’s ratings, their interests in venues and the ratings of those with similar interests can yield an accurate set of recommended films that the subscriber is likely to enjoy.
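
As a minimal sketch of the idea, the following plain Python factors a small ratings table into two narrow matrices by stochastic gradient descent. This is not Mahout’s implementation. The ratings, the number of latent features and the learning parameters are all hypothetical values chosen for illustration.

```python
import random

def factorize(ratings, n_users, n_items, k=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Factor a sparse ratings dict {(user, item): rating} into P (users x k)
    and Q (items x k) so that dot(P[u], Q[i]) approximates each known rating."""
    rng = random.Random(seed)
    P = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for (u, i), r in ratings.items():
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                p, q = P[u][f], Q[i][f]
                P[u][f] += lr * (err * q - reg * p)
                Q[i][f] += lr * (err * p - reg * q)
    return P, Q

def predict(P, Q, user, item):
    """Estimated rating: dot product of the user and item feature rows."""
    return sum(p * q for p, q in zip(P[user], Q[item]))

# Hypothetical ratings (1-5) from three subscribers for three films;
# subscriber 2 has not yet rated film 2.
ratings = {(0, 0): 5, (0, 1): 4, (0, 2): 1,
           (1, 0): 4, (1, 1): 5, (1, 2): 1,
           (2, 0): 5, (2, 1): 5}
P, Q = factorize(ratings, n_users=3, n_items=3)
print(round(predict(P, Q, 2, 2), 2))   # estimated rating for the unseen film
```

The rows of P and Q are the latent features: they are never named or labeled, they simply emerge from fitting the known ratings, and they can then be used to estimate a rating that was never given.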

K-Means –
K-Means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into what are termed Voronoi cells. These cells are based on common attributes or features that have been identified. A typical use is learning the common attributes of a given population so that it can be subdivided into various sub-populations. From there, techniques like logistic regression can be run on the sub-populations.
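
The two alternating steps of k-means, assignment and update, are short enough to write out in full. This is a bare-bones sketch on made-up two-dimensional points with hand-picked starting centroids. Production libraries add smarter initialization and convergence checks.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[distances.index(min(distances))].append((x, y))
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]   # illustrative data
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

Each point ends up in exactly one cluster, and each centroid settles at the mean of the points assigned to it.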

Fuzzy K-Means –
K-Means clustering is what is termed ‘hard clustering’. In hard clustering, data is divided into distinct clusters, where each data element belongs to exactly one cluster. In fuzzy clustering, also referred to as soft clustering, data elements can belong to more than one cluster, and each element has an associated set of membership levels. These indicate the strength of the association between that data element and a particular cluster. Fuzzy clustering is a process of assigning these membership levels, and then using them to assign data elements to one or more clusters. A particular data element can then be rated as to its strongest memberships within the partitions that the algorithm develops.
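
The membership levels can be illustrated with the standard fuzzy c-means membership formula, in which a point’s membership in a cluster falls off with its relative distance to that cluster’s centroid. The centroids, the sample points and the fuzziness exponent m = 2 are assumptions chosen for illustration, and only the membership step is shown, not the full iterative algorithm.

```python
def fuzzy_memberships(point, centroids, m=2.0):
    """Membership of one point in each cluster (fuzzy c-means formula).

    m > 1 is the 'fuzziness' exponent; for distinct centroids the
    memberships sum to 1.
    """
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) ** 0.5 for c in centroids]
    if 0.0 in distances:                      # point sits exactly on a centroid
        return [1.0 if d == 0.0 else 0.0 for d in distances]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in distances) for d_i in distances]

centroids = [(1.0, 1.0), (8.0, 8.0)]
near = fuzzy_memberships((2.0, 2.0), centroids)   # close to the first centroid
mid = fuzzy_memberships((4.5, 4.5), centroids)    # halfway between the two
print(near)
print(mid)
```

A point near one centroid belongs almost entirely to that cluster, while a point midway between the two is shared equally. That is the difference from the hard assignment used by plain k-means.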

Logistic Regression –
In statistics, logistic regression is a type of probabilistic statistical classification model. Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables, which are usually (but not necessarily) continuous, by using probability scores as the predicted values of the dependent variable. Logistic regression is hence used to analyze probabilistic relationships between different variables within a particular set of data.
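
A single-variable logistic regression can be fitted with a few lines of gradient descent. The data below is entirely hypothetical (counts of complaint keywords against whether a customer later cancelled), and the learning rate and number of epochs are arbitrary choices for this sketch.

```python
import math

def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit P(y=1 | x) = sigmoid(w*x + b) by gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical data: number of complaint keywords in an email (x) and
# whether the customer later cancelled (y = 1) or stayed (y = 0).
xs = [0, 1, 1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = train_logistic(xs, ys)
print(sigmoid(w * 1 + b), sigmoid(w * 5 + b))   # low vs. high probability
```

The output of the model is a probability rather than a hard yes or no, which is what makes it useful for scoring and ranking.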

Naïve Bayes –
In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong (naive) assumptions of independence between the features. In other words, each feature is treated as if it contributed to the outcome independently of all the others. Naive Bayes is a popular (baseline) method for categorizing text, that is, the problem of judging documents as belonging to one category or another (such as spam or legitimate, classified terms, etc.), with word frequency as a large part of the features considered. This is very similar to the usage and context information provided by Latent Dirichlet Allocation.
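
A word-frequency naive Bayes text classifier fits in a short sketch. The four training sentences are invented for illustration, and add-one (Laplace) smoothing is our choice for handling words that a category has never seen.

```python
import math
from collections import Counter

def train_naive_bayes(documents):
    """documents: list of (text, label). Returns per-label word counts and priors."""
    word_counts, label_counts = {}, Counter()
    for text, label in documents:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Pick the label with the highest log posterior, using add-one smoothing."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(label_counts[label] / total_docs)
        for word in text.lower().split():
            # Each word contributes independently: the "naive" assumption.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

training = [
    ("win a free prize now", "spam"),
    ("free money claim now", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("please review the project agenda", "legitimate"),
]
model = train_naive_bayes(training)
print(classify("claim your free prize", *model))
print(classify("agenda for the project meeting", *model))
```

Because each word is scored independently, training is little more than counting, which is why the method maps so naturally onto the key value counting described earlier.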

Random Forest –
Random Forests is another method for learning and classification of large sets of data, from which further regression techniques can be used. Random Forests are in essence collections of decision trees that are induced in a process known as training. Data is then run through the forest, and various decisions are made to learn and classify the data. When building out large forests, the concept of decision tree subsets comes into effect. Weights can then be given to each subset, and from that further decisions can be made.
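
The forest idea can be sketched with the smallest possible trees. In this toy version each ‘tree’ is a one-question decision stump, trained on a bootstrap sample of the data and on one randomly chosen feature, and the forest classifies by majority vote. The transaction data (amount, hour of day) and its labels are hypothetical, and real random forests grow much deeper trees.

```python
import random
from collections import Counter

def train_stump(samples, f):
    """Learn a one-question tree on feature f: 'is x[f] <= threshold?'"""
    best = None
    for candidate, _ in samples:
        t = candidate[f]
        left = [label for x, label in samples if x[f] <= t]
        right = [label for x, label in samples if x[f] > t]
        left_vote = Counter(left).most_common(1)[0][0]
        right_vote = Counter(right).most_common(1)[0][0] if right else left_vote
        errors = sum(l != left_vote for l in left) + sum(l != right_vote for l in right)
        if best is None or errors < best[0]:
            best = (errors, t, left_vote, right_vote)
    return (f, best[1], best[2], best[3])

def train_forest(samples, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    n_features = len(samples[0][0])
    forest = []
    for _ in range(n_trees):
        bootstrap = [rng.choice(samples) for _ in samples]   # sample with replacement
        forest.append(train_stump(bootstrap, rng.randrange(n_features)))
    return forest

def predict(forest, features):
    """Majority vote across all trees in the forest."""
    votes = [left if features[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical transactions: (amount, hour of day) -> label.
samples = [((20, 14), "ok"), ((35, 10), "ok"), ((50, 16), "ok"), ((15, 9), "ok"),
           ((900, 3), "fraud"), ((1200, 2), "fraud"), ((750, 4), "fraud"), ((1000, 1), "fraud")]
forest = train_forest(samples)
print(predict(forest, (10, 12)), predict(forest, (2000, 0)))
```

No single stump is reliable on its own, since each sees only part of the data and one feature. The strength of the method comes from the vote across many of them.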

The end result of all of these methods is a very powerful environment that is capable of machine learning. The best part is that it is accomplished with off the shelf technologies. No supercomputer required. Just a solid distributed storage/compute framework and superior east/west traffic capacity in the top of rack environment. Big Data and analytics can open our eyes to relationships between phenomena that we would otherwise be blind to. They can even provide us with insight into causal relationships. But here we need to tread a careful course. Just because two features are related in some way does not necessarily mean that one causes the other.

A word of caution –

While all of this is extremely powerful, the last comments above should raise a flag. Even if you have lots of data and all of these fancy mathematical tools at your disposal, you can still make some very bad decisions if your assumptions about the meaning of the data are somehow flawed. In other words, good data plus good math with bad assumptions will still yield bad decisions. We also need to remember Mr. Taleb and his black swans. Just because a system has behaved within a certain pattern or range in the past does not mean that it will continue to do so ad infinitum. Examples of these types of systems range from stock exchanges to planetary orbits to our very own bodies! In essence, most systems exhibit this behavior. Does that mean that all of the powerful tools referred to above are rendered invalid and impotent? Absolutely not. But we must remember that knowledge without context is somewhat useless, and knowledge with incorrect context is worse than ignorance. Why? Because we are confident about what it tells us. We like sophisticated mathematical tools that tell us, in an oracle-like fashion, what the secrets of knowledge are within a given system. We have confidence in their findings because of their accuracy. But no amount of accuracy will make an incorrect assumption correct. This is where trying to prove ourselves wrong about our assumptions is very important. One might wonder why there are so many methods that sometimes appear to do the same thing but from a different mathematical perspective. The reason is that these various methods are often run in parallel to yield comparative data sets with multiple replicated studies. By generating large populations of comparative sets, another level or hierarchy of trends and relationships becomes visible. Consistency of the sets will generally (but not always) indicate sound assumptions about the original data. Wild variations between sets in turn will usually indicate that something is flawed and needs to be revisited. Note that we are now talking about analyzing the analytical results. But this is not always done. Why? Because many times we don’t want to prove our own assumptions wrong. We want them to be right. No, let’s go further: we need them to be right.
A good example is the recent market crash of 2006-2009. Many folks don’t know it, but there is a little equation that actually holds a portion of the blame. Well, not really. As it turns out, equations are a lot like guns. They are only dangerous when someone dangerous is using them. The equation in question is the Black-Scholes equation. Some have called it one of the most beautiful equations in mathematics. It is a very elegant piece. Others would call it beautiful because it had another name, the Midas equation. It made folks a ton of money! That is until…
The Black-Scholes equation was an attempt to bring rationality to the futures market. This sounds good, but it is based on the concept of creating a systematic method of establishing a value for options before they mature. This also might not be a bad thing if your assumptions about the market are correct. But if there are things that you don’t know (and there always are), then those blind spots can adversely affect your assumptions. As an example, if you are trading on the futures of a given commodity and something happens in the market to affect demand that you did not consider, or whose impact you weighed incorrectly, then guess what… That’s right, you lose money!
In the last market crash that commodity was real estate. As one looks into the detailed history of the crash, one can see multiple flawed assumptions that built upon one another. Then, to compound the problem, the market began to create obscurity through the use of blocks or bundles of mortgages that offered absolutely no window into the risk factors associated with those assets. While the banks were buying blind, they were of the thought that foreclosures would be a minority and that a foreclosed home could always be sold for the loan value or perhaps more. To the banks it seemed that they couldn’t lose. We all know what happened. Even though the mathematics was elegant and accurate, the conclusions and the advice that were given as a result were drastically flawed and cost the market billions. The lesson: Big Data can lead us astray. It reminds us of the flawed premise of Laplace’s rather arrogant comment back in 1814. There is always something we don’t know about a given system, such as a scope of history that we do not know or levels of detail that are unknown to us or perhaps even beyond our measurement. This does not disable data analytics, but it puts a limit on its tractability in dealing with real world systems. In the end Big Data does not replace good judgment, but it can complement it.

So how do I build it and how do I use it?

Hadoop is actually fairly easy to install and set up. The major vendors in this space have gone much further in making it easy and manageable as a system. But there are a few general principles that should be followed. First, be sure to size your Hadoop cluster and maintain that sizing ratio as the cluster grows. The basic formula is 4 x D, where D is the data footprint to be analyzed. Now one might say, ‘What? I have to multiply my actual storage requirements by a factor of four!?’ But do not forget about the MapReduce flow. The shuffle phase requires datanodes that will act as transitory nodes for the job flow. This extra space needs to be available. So while it might be tempting to float this number, it’s best not to. Below are a few other design recommendations to consider.

Figure 15. Hadoop design recommendations
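
The 4 x D sizing rule is simple enough to turn into a quick calculator. The factor of four comes from the text above. The 50 TB data footprint and the 24 TB of disk per datanode are assumed values for illustration only, not recommendations.

```python
import math

def raw_capacity_tb(data_tb, factor=4):
    """Rule of thumb from the text: provision factor x D of raw storage."""
    return data_tb * factor

def datanodes_needed(data_tb, disk_per_node_tb, factor=4):
    """Datanodes required to hold the raw capacity, rounded up."""
    return math.ceil(raw_capacity_tb(data_tb, factor) / disk_per_node_tb)

print(raw_capacity_tb(50))           # raw TB for a 50 TB data footprint
print(datanodes_needed(50, 24))      # nodes, assuming 24 TB of disk each
```

Re-running the calculation whenever the data footprint grows is an easy way to keep the sizing ratio intact as the cluster expands.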

Another issue to consider is the sizing of the individual datanodes within the HDFS cluster. This is actually a soft set of recommendations that depends greatly on the type of analytics. In other words, are you looking to gauge customer satisfaction, or to model climate change or the stock market? These obviously differ from one another by many degrees of complexity. So it is wise to think about your end goals with the technology. Below is a rough sizing chart that provides some higher level guidance.

Figure 16. Hadoop HDFS Sizing Guidelines

Beyond this, it is wise to refer to the specific vendor’s design guidelines and requirements, particularly in the areas of high availability for master node services.
Another question that might be asked is “How do I begin?” In other words, you have installed the cluster and are ready for business, but what do you do next? This is very specific to usage and expectations. But we can at least boil it down to a general cycle of data ingestion, analytics and corresponding actions. This is really very similar to well-known systems management theory. A diagram of such a cycle is shown below.

Figure 17. The typical data analytics cycle

Aside from the work flow detail, it cannot be stressed enough: “Know your data”. If you do not know it, then make sure that you are working very closely with someone who does. The reason for this is simple. If you do not understand the overall meaning of the data sets that you are analyzing, then you are unlikely to be able to identify the initial key values that you should be focusing on. For this reason data analytics is often done on a team basis with individuals from various backgrounds within the organization. The data analytics staff work in concert with this diverse group to identify the key questions that need to be asked as well as the key data values that will help lead towards an answer to the query. Remember that comparative sets allow for validation of both the assumptions that are made about the data model and the techniques that are being used to extract and analyze the data sets in question. While it is tempting to jump to conclusions on initial findings, it is always wise to do further studies to validate those findings, particularly if a key strategic decision will result from the analysis.

In summary

We have looked at the history of analytics from its founding fathers to its current state. Throughout, many things have remained consistent. This is comforting. Math is math. Four plus four in Galileo’s time gave the same answer as it does today. But we must remember that math is not the real world. It is merely our symbolic representation of it. This was shown by the various discoveries on the aspects of randomness, chaos and infinitudes. We went on in the article to show that the proper manipulation of large sets of data placed against a historical context can yield insights that might not otherwise be apparent. A recent trend is to visualize the data and the resulting analytics through graphic displays. Companies such as Tableau provide the ability to generate detailed charts and graphs that give a visual view of the results of the various analytic functions noted above. Now a long table or spreadsheet of numbers becomes a visible object that can be manipulated and conjectured against. Patterns and trends can much more easily be picked out and isolated for further analysis. These and other trends are accelerating in the industry and becoming more and more available to the common user or enterprise.
We also talked about the high east/west traffic profiles that are required to support Hadoop distributed data farms and the work that Avaya is doing to facilitate this in the data center top of rack environment. We talked about the relatively high costs of spine and leaf architectures and Avaya’s approach to the top of rack environment as the data farm expands. Lastly, we spoke to the need for security in data analytics, particularly in the analysis of credit card or patient record data. Avaya’s Stealth Networking Services can effectively provide a cloak of invisibility over the analytics environment. This creates a Stealth Analytics environment in which the analysis of sensitive data can occur with minimal risk.
We also looked at some of the nuts and bolts of analytics and how, once data is teased out, it may be analyzed. We spoke to various methods and procedures, which are often worked in concert to yield comparative data sets. These comparative data sets can then be used to check assumptions made about the data and hence the analytic results. Comparative sets can help us measure the validity of the analytics that have been run or, more importantly, of the assumptions we have made. In this vein we wrapped up with a word of warning as to the use of big data and data analytics. It is not a panacea, nor is it a crystal ball, but it can provide us with vast insights into the meaning of the data that we have at our fingertips. With these insights, if the foundational assumptions are sound, we can make decisions that are better informed. It can also enable us to process and leverage the ever growing data that we have at our disposal at the pace required for it to be of any value at all! Yet, in all of this we are only at the beginning of the trail. As computing power increases and our algorithmic knowledge of systems increases, the technology of data science will reap larger and larger rewards. But it is likely never to provide the foundation for Laplace’s dream.