By Adam Tyra, Contributing Editor
The Defense Advanced Research Projects Agency (DARPA) has led the way on defense technology development for decades. The original home of the Internet has brought us stealth technology and GPS, to name just a few. Browsing DARPA’s current research portfolio allows cybersecurity professionals to peer into the near future to see what’s next. Currently, DARPA is pursuing no fewer than 17 different projects tagged with the topic “analytics.” These include two different projects to detect anomalies in massive sets of network data, a project to improve attribution for malicious activity, a project to create new methods to defend against DDoS attacks, a project to use analytics to automate forensic analysis of removable media and devices, and a project to improve the speed of detection of attacks against critical infrastructure. In today’s military, where talent and resource challenges exceed those of the private sector, any of these projects could revolutionize the cybersecurity professional’s toolkit, and they all rely on analytics.
For some cybersecurity professionals, tomorrow is already here. This year’s DEF CON saw the execution of the first automated hacking competition sponsored by DARPA. Dubbed the “Cyber Grand Challenge,” the purpose of the contest was to field a system capable of scanning a piece of software, identifying bugs, and patching them automatically. The winning solution, a system dubbed Mayhem, successfully completed all stages of the competition without any human interaction whatsoever, a triumph of machine learning and artificial intelligence in the extremely complex field of binary reverse engineering and analysis.
If the list of solutions cooking in the DARPA labs is any indication, the same expertise and methodology that fueled Mayhem will be finding its way into the military cybersecurity arsenal soon. Tool vendors have recognized the potential of machine learning and are hard at work building platforms to automate security monitoring in order to solve manpower and resource shortages. Unfortunately, viable solutions that can replace even low-level analysts might be further than we think. In this article we’ll discuss why adversary detection is a fundamentally larger and more difficult problem than other applications of analytics (including malware analysis), and why security analytics tools probably won’t replace human analysts anytime soon.
Data science is a complex topic, requiring considerable study to achieve mastery. Readers who are familiar with the field may wish to skip to the section entitled “Difficulties” to move directly into a discussion of the issues with analytics in adversary detection. Those less familiar should carry on from here.
The word “analytics” is a source of considerable confusion in many organizations. Analytics is commonly (and incorrectly) used interchangeably with other buzzwords such as “big data,” “machine learning,” “Hadoop,” “business intelligence,” “artificial intelligence,” “data visualization,” and others. A brief (and massively simplified) discussion of several of these terms will eliminate misunderstanding later in this article.
Analytics is the “discovery and communication of meaningful patterns in data.”  On its own, it does not imply any context such as the source of data, the methods used for analysis, the methods used for communication of discovered patterns, or anything else. It is a general term usually used to describe analysis using computerized statistical modeling. Like the word from which it was derived (analysis), analytics is properly used only as a noun. That is, analytics is not something that you do. Instead, analytics are simultaneously tools that you use on a dataset and also the output of the analysis conducted by those tools. The fact that analytics is not a verb is the source of the confusion. Serious discussions about analytics must include a technical (usually mathematical) component describing what one is actually doing to a dataset when “analytics” are performed.
Machine learning is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence.  Machine learning comprises a number of techniques, and all of them rely on statistical analysis to a greater or lesser extent. The two fields (machine learning and statistics) are so closely related that the term “data science” is usually used to label the resulting overlap in concepts. While statistical analysis can be performed on any mathematical dataset to uncover interesting insights, analytics that purport to predict some future outcome from a complex dataset generally make use of machine learning concepts. To be clear, machine learning tools merely automate the generation of statistical models that could, with enough effort, be created on paper by a skilled mathematician. There’s no magic here. The benefits of machine learning tools are that they make possible the analysis and comprehension of far more data in far less time than any human could possibly achieve. Descriptions of various machine learning techniques can be found in the later section entitled “Methods of Analysis.”
Hadoop

Apache Hadoop is a free and open-source software framework that enables distributed storage and processing of large datasets. At its core, it has two components: the Hadoop Distributed File System (HDFS) and MapReduce. With HDFS, a large dataset can be split across multiple physical disks and machines while still appearing logically to be a single set. MapReduce, in turn, allows large jobs to be split across multiple processors and run in parallel.
While Hadoop technically describes only the two components described above, the term has been stretched in recent years to refer to an entire set of open-source tools that are routinely used in conjunction with Hadoop. Hadoop on its own doesn't really do much of anything. It provides a means of storing massive amounts of data (HDFS) and a means of performing work on massive amounts of data in parallel (MapReduce). It doesn't provide organization or structure to stored data without the assistance of additional components, and it doesn't provide any special facilities for data processing (searching, sorting, analyzing) without them either. A fully functional deployment combines Hadoop's core with a stack of these companion tools.
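The split-apply-combine idea behind MapReduce can be illustrated without a cluster. The following is a minimal in-process sketch of the pattern (not Hadoop itself), counting event types across two invented "partitions" of log records as if they lived on separate nodes:

```python
from collections import Counter
from functools import reduce

# In-process sketch of the MapReduce pattern: each partition is mapped
# independently (as it would be on its own node), then the partial
# results are merged in a reduce step.

partitions = [
    ["login", "login", "error"],     # records stored on "node" 1
    ["error", "login", "timeout"],   # records stored on "node" 2
]

def map_phase(records):
    """Produce a partial count for one partition."""
    return Counter(records)

def reduce_phase(a, b):
    """Merge two partial counts into one."""
    return a + b

totals = reduce(reduce_phase, map(map_phase, partitions))
print(totals["login"])  # 3
```

In a real Hadoop job the map tasks would run on the machines holding each block of data, and only the small partial counts would travel over the network to the reducers.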
Once we have potentially interesting data gathered and stored, we have to perform some kind of analysis to derive insights (value) from it. The following is a basic discussion of several advanced methods of analysis available to cybersecurity teams.
Statistical analysis relies on the use of mathematical computation to determine whether an event is interesting. In the context of cybersecurity, events are usually interesting when they are outliers. A basic example of this is frequency analysis: An event can be interesting because its absolute number of occurrences is too low or too high. A more sophisticated example might be deviation of an event from a norm or average; many current intrusion detection systems have this capability.
Consider the case of bandwidth usage on a network. Assume that most hosts send an average of 25 GB of network traffic in a given month. Recording this information over time gives us the opportunity to observe the variance and standard deviation of the data, which are essentially measures of how spread apart the data points are. Standard deviation is useful for identifying outliers because, in any dataset, the bulk of the values must lie within a few standard deviations of the mean (a consequence of Chebyshev's inequality).
Assuming that our dataset has a standard deviation of 5 GB, a host that is observed to send 75 GB of traffic in a month sits ten standard deviations above the mean, far beyond the 99th percentile of bandwidth users. This host is unambiguously an outlier and therefore a good candidate for closer inspection for malicious activity.
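The bandwidth check above reduces to a z-score computed against a baseline. This sketch uses invented baseline figures chosen to sit near the article's 25 GB average (the real values would come from recorded history):

```python
import statistics

# Hypothetical baseline: monthly bandwidth (GB) observed across typical
# hosts, clustered near the 25 GB average used in the text.
baseline = [23.0, 24.0, 24.5, 25.0, 25.0, 25.5, 26.0, 26.5, 27.0, 28.5]

mean = statistics.mean(baseline)    # 25.5 for this sample
stdev = statistics.stdev(baseline)  # a couple of GB

def z_score(gb):
    """How many standard deviations a reading sits from the baseline mean."""
    return (gb - mean) / stdev

# A host that sent 75 GB this month lands far outside the baseline spread.
print(z_score(75.0) > 3)  # True: a clear candidate for closer inspection
```

Note that the baseline is computed from known-typical hosts; folding a massive outlier into the baseline itself would inflate the standard deviation and mask the very anomaly we want to catch.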
Unsupervised learning describes machine learning approaches that attempt to identify hidden structure or relationships within a dataset. Various unsupervised learning techniques may be lumped under the heading of “data mining,” because they provide insight from a set of existing data about events that occurred in the past. Unsupervised learning approaches do not predict outcomes from a set of data. That is, data mining can uncover the insight that retail customers who buy toothbrushes tend to also buy toothpaste in the same transaction, but it does not yield a formula that could be used to predict the probability that a customer will complete a purchase based on the items in her cart. For this reason, unsupervised learning techniques cannot be used as the basis for predictive analytics. However, they might help guide security researchers’ attention to particular subsets of available data that could be used for predictive analytics using supervised learning techniques (described later).
The best-known unsupervised learning task is cluster discovery and analysis (clustering). Clusters are data points that are unusually “close” to one another (close in time, close in value, etc.) when compared to the dataset as a whole. An example of an interesting cluster would be the discovery in netflow data that numerous hosts are attempting to communicate with the same external server at the same time every day. This could represent beaconing activity and might signal malware infections on each of the involved hosts.
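A toy version of cluster discovery: a one-dimensional k-means over the hour of day at which hosts contact an external server (the timestamps here are invented). A tight cluster of contact times across many hosts, such as the group near 02:00 below, is the kind of structure that might indicate beaconing:

```python
import random

random.seed(0)  # deterministic initial centers for this sketch

def kmeans_1d(points, k=2, iters=20):
    """Minimal 1-D k-means: returns the final cluster centers, sorted."""
    centers = random.sample(sorted(set(points)), k)
    for _ in range(iters):
        # Assign each point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its assigned points.
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return sorted(centers)

# Hour-of-day of outbound connections to one external server (hypothetical):
# several hosts phone home near 02:00; a few ordinary sessions occur near 14:00.
hours = [1.9, 2.0, 2.0, 2.1, 2.05, 1.95, 13.9, 14.0, 14.1]
print(kmeans_1d(hours))  # two centers, near 2.0 and 14.0
```

Real implementations handle many dimensions, choose k automatically, and use smarter initialization, but the underlying loop of "assign, then re-center" is the same.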
Security analysts wielding unsupervised learning techniques against the right data can find the answers to questions they didn’t even know they should ask. These are the same methods intelligence services use to uncover terrorist networks through data analysis, and there is significant potential in cybersecurity for unsupervised learning to effectively automate functions such as incident root cause analysis. However, besides possessing the right data, analysts also need it in sufficient quantity to successfully identify patterns, draw conclusions, and test hypotheses. Companies offering automated malware analysis solutions are in the best position to fulfill these requirements, since they can leverage databases filled with decades of malware samples from the likes of VirusTotal, McAfee, and Symantec. Unfortunately, there is no publicly available source of data of sufficient size to develop adversary or malicious behavior identification use cases.
Supervised learning describes machine learning approaches that attempt to create a formula that can predict an outcome based on a set of “training” data that includes the outcome (e.g., the set of conditions present when a purchase occurred). Given a set of data similar to that used for training, the formula derived from the training data can be used to make predictions about unknown future outcomes.
Consider a set of data that includes information about a user’s behavior when interacting with an online retailer. Each record might represent a particular session when the user visited the site and must include a target variable. The target variable is usually a binary variable (possible values of 0 or 1) that identifies whether the outcome in which we’re interested actually occurred for that observation or not. In this example, we probably want to know whether or not the user made a purchase for each session, so this is our target. The remaining data in the record should describe how the user interacts with the site (average time connected, number of mouse clicks per session) and other information that might be relevant.
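Such a record might look like the following (the field names are invented for illustration); separating the target variable from the descriptive features is the first step in any supervised learning workflow:

```python
# One observation (session) with a binary target variable.
# All field names are hypothetical.
session = {
    "session_minutes": 12.4,  # average time connected
    "clicks": 31,             # mouse clicks during the session
    "returning_user": 1,      # 1 if the user has visited before
    "purchased": 1,           # target variable: 1 = made a purchase
}

# Separate the label from the features before training.
target = session.pop("purchased")
features = session
print(target, sorted(features))
```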
For example, imagine that we want to be able to predict whether any particular user will make a purchase when visiting the site. Some of the data that we have about users may be highly correlated with purchasing behavior, while other data will not show any correlation at all. For instance, the gender of a user may show little correlation with purchasing, while the length of time that a user stays connected to the site may be highly correlated with purchasing (i.e., users who stay connected longer are more likely to buy). The purpose of supervised learning techniques is two-fold in this example. First, they should help us identify which of our data points (called “dimensions” or features) are most useful in predicting the outcome of our target variable, a task known as feature selection. Second, they should help us derive an equation that weighs each dimension according to its importance relative to the outcome, combines the weighted dimensions, and produces a “prediction,” which is nothing more than a projected value of the target variable for a given session.
Once we have an equation, we can feed new data into it in real time, and it will “predict” whether the behavior we care about will happen. If our analysis told us that length of connection was important for predicting a sale, and we had an instance of a user whose connection length was in the 90th percentile for all connections, our equation would most likely predict that this user would buy. We could then decide that offering this user free shipping isn’t worthwhile, since it won’t positively affect our predicted outcome. We can use this same type of predictive analysis when watching a user’s behavior to decide whether that user is likely to be an adversary.
There are multiple methods available to perform this analysis, and they are known as machine learning “classifiers.” Each uses a different mathematical approach, but they all fulfill the two purposes listed above. In practice, data scientists determine which one to use based on its apparent effectiveness (tested by trial and error) at predicting target variables for which they already know the actual value. “Training” a classifier is the process of applying the mathematical technique represented by the classifier to a dataset to derive a predictive function.
In our example above, we would typically use the majority of our data (say, 70 percent) to train the classifier and hold out the remainder to test its effectiveness. The effectiveness of each classifier can be judged by the percentage of target variables successfully predicted in the test data. For a variable with two possible outcomes (purchase or no purchase), 50% effectiveness can be achieved simply by guessing randomly, so only classifiers that achieve accuracy significantly higher than 50% can be considered effective. Different classifiers may be more suitable for different datasets, so testing all of the available methods yields optimum results.
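The whole train/test procedure can be shown end to end with synthetic session data and a deliberately simple stand-in for a real classifier: a learned threshold on connection length. Everything here (the data generator, the 15-minute tipping point, the probabilities) is invented for illustration:

```python
import random

random.seed(1)

# Synthetic sessions: (connection_minutes, purchased). Longer sessions are
# more likely to end in a purchase -- the correlation described in the text.
def make_session():
    minutes = random.uniform(0, 30)
    p_buy = 0.9 if minutes > 15 else 0.1
    purchased = 1 if random.random() < p_buy else 0
    return minutes, purchased

data = [make_session() for _ in range(1000)]
train, test = data[:700], data[700:]   # 70/30 train/test split

def accuracy(threshold, rows):
    """Fraction of rows where 'minutes > threshold' matches the label."""
    return sum((m > threshold) == bool(y) for m, y in rows) / len(rows)

# "Train": pick the threshold that best separates buyers on the training set.
best = max(range(31), key=lambda t: accuracy(t, train))

# Evaluate on held-out data; it should clearly beat the 50% guessing baseline.
print(best, round(accuracy(best, test), 2))
```

A real classifier would weigh many dimensions at once instead of thresholding one, but the workflow is identical: fit on the training split, then score only on data the model has never seen.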
Remember that a classifier is nothing more than a long arithmetic equation that weighs each dimension according to its importance and adds them together to infer the value of the target variable. This is all analytics are: formulas and algorithms.
On the first day of my very first computer science course, the instructor told us that we couldn’t solve any problem with a computer unless we already knew how to solve it using a pencil and a piece of paper. Notwithstanding the speed, memory, and data storage advantages provided by automation, this remains true today. The implication for cybersecurity is that we can’t expect a machine to effectively identify adversaries unless we tell it how, and in general, cybersecurity professionals don’t know how. Attackers are inherently unpredictable and free to innovate new tactics at will. This fact effectively removes the bounds on the complexity of adversary detection as an analytics problem.
Compare our field to fields where analytics have been successful. Retailing examples were used repeatedly in the previous sections of this article for two reasons. First, they’re straightforward. Everyone understands the mental model of a purchase transaction. Second, they’re simple. Every purchase pretty much follows an identical pattern, so there are a limited number of data points that are both collectable and reasonably related to the final purchase decision.
In cybersecurity, we recognize a very limited set of identifiable behaviors that are always malicious. Observing the hash of a well-known piece of malware, an outbound connection to a known malicious IP address, or an excessive number of failed login attempts always tells us that something nefarious is underway. We also recognize a larger set of behaviors that might be malicious; these are things that must be investigated. Examples include unsolicited emails with binary attachments or the detection of a previously unknown host connected to the network. Use cases in these two sets have largely been automated successfully by existing security products.
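Rules in the first set (behavior that is always worth flagging) reduce to simple counting and matching, which is why existing products automate them so well. A sketch of the failed-login case, with invented log entries and an arbitrary threshold:

```python
from collections import Counter

# Hypothetical authentication log entries: (user, result).
events = [
    ("alice", "ok"),   ("bob", "fail"), ("bob", "fail"), ("bob", "fail"),
    ("bob", "fail"),   ("bob", "fail"), ("bob", "fail"), ("carol", "ok"),
    ("carol", "fail"),
]

THRESHOLD = 5  # alert on more than five failures per user (arbitrary)

# Count failures per user and flag anyone over the threshold.
fails = Counter(user for user, result in events if result == "fail")
alerts = [user for user, n in fails.items() if n > THRESHOLD]
print(alerts)  # ['bob']
```

No statistics or learning is involved; the hard analytics problems begin only when no such crisp rule exists.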
Analytics have the ability to identify members of the third set, malicious behaviors and actors not previously known to be malicious, that properly belong in sets one and two. Unfortunately, the third set is bounded only by the imaginations of current and future attackers, so we will never really “solve” analytics for cybersecurity. Instead, the field will be forced into the ever-tightening cycle of innovation already experienced by attackers and defenders racing to out-think each other. When it reaches that point, cybersecurity analytics will have realized its potential. However, cyber defenders will need to overcome a few challenging obstacles in conjunction with their friendly neighborhood data scientists before they can wield analytics to maximum effect. Discussions of a few of these issues follow.
You must already have a dataset that contains the malicious activity that you intend to identify.
This is the single largest problem that must be overcome in order to realize predictive analytics in cybersecurity. The desired end-state is to create a platform that is capable of analyzing data from hosts and devices across a network to determine whether some arbitrary type of malicious activity is occurring. Recall that supervised learning requires the use of a set of training data containing known values for the target variable. Thus, we need a set of data that contains many known instances of the activity that we’re trying to identify. Again, you have to know that your dataset contains the behavior that you’re looking for. You can’t automate the identification of malicious activity unless you have already figured out some manual technique to know for sure when malicious activity is occurring. If we were good at this, we wouldn’t need analytics! We could try to get around this problem by emulating malicious activity in an instrumented environment in order to collect the necessary data, but this would likely result in analytics that are good at identifying penetration testers or red teams and not necessarily real attackers.
Analytics are distinctive to the data used to create them.
Analytics that are portable between networks rely on the notion that a particular type of malicious activity looks essentially the same everywhere. This notion holds for basic correlation, but it breaks down as our analytical methods increase in complexity and become more probabilistic. The more data that we have to analyze to decide if something is happening, the less likely we are to decide properly, since individual outlier data points can derail the whole analysis.
Let’s say we were able to identify the best data to use to identify adversary activity on our network. This data is generated by the activities of the unique combination of applications, devices, and users present on the network where the data was gathered. Another network will have different network traffic patterns, different tools generating data, and a different definition of normal for all of it. Thus, we might be able to create analytics that successfully identify malicious activity on one network, but their effectiveness may drop significantly if tried on a different network.
This problem isn’t insurmountable, since the universe of machine data sources is finite, and it may be the case that common sources like Windows event logs contain data points that are highly correlated with some types of malicious activity. However, the low-hanging fruit here has already been claimed by the Security Information and Event Management market, in conjunction with endpoint monitoring tools.
Predictive analytics (usually) cannot identify previously unknown malicious activity.
Many vendors claim that their solution can identify zero-day exploits and/or zero-day malware. This might be true for exploits and malware that look like something we’ve seen before, but it simply won’t work for something that’s all new. Recall that supervised machine learning uses training data to build predictive analytics. By definition, we don’t have training data for zero-day techniques or malware that we haven’t seen before. Thus, it follows that we can’t use predictive analytics to reliably identify arbitrary new types of malicious activity.
The limited exception to this is malware detection. Of all potential cybersecurity use cases for analytics, this one is perhaps the best analogy to fields like retailing where analytics are highly effective. Malware authors have a small toolset available for malware development. They must rely on the functionality provided by a finite set of operating systems, and there are a finite number of tools available to author and compile malware. This simplifies the detection of malicious logic, since it will likely be represented by similar sets of programming idioms that are compiled into similar sets of machine code instructions. If new malware samples resemble samples that we’ve seen previously in meaningful ways, they will be identified as malicious using machine learning techniques regardless of whether they’re officially “zero-day” or not.
We still have the problem of identifying the attributes of binaries that are most useful in determining maliciousness, but the set of available attributes is much smaller. Data points that might be useful for analysis include the operating system libraries imported by a binary, the system calls used, particular sequences of assembly language instructions, etc. Detonating a binary allows us to record the behavior as well in order to identify attempts to write to the file system, strange usages of memory, attempts to access privileged resources, etc.
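Beyond the attributes listed above, one static feature commonly used in practice (though not mentioned in the text) is byte-level entropy, since packed or encrypted payloads tend to look nearly random. A self-contained sketch on invented stand-in byte strings:

```python
import math
from collections import Counter

def byte_features(blob: bytes):
    """Two simple static features of a binary: its byte histogram and its
    Shannon entropy in bits per byte (0.0 to 8.0). High entropy often
    indicates packing or encryption."""
    counts = Counter(blob)
    n = len(blob)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return counts, entropy

# Invented stand-ins: repetitive "code-like" bytes vs. uniformly random-looking
# bytes such as a packed or encrypted section might contain.
plain = bytes([0x90] * 900 + [0x00] * 100)  # only two byte values: low entropy
packed = bytes(range(256)) * 4              # uniform distribution: max entropy

print(round(byte_features(plain)[1], 2))   # 0.47
print(round(byte_features(packed)[1], 2))  # 8.0
```

Features like these, computed over thousands of labeled samples, are exactly the kind of dimensions a classifier can weigh to score an unknown binary.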
We also have the problem of needing a large library of malicious and benign binary samples to use to develop our analytics. Luckily, this problem has already been solved. Databases maintained by services like VirusTotal and various anti-virus vendors contain all the malware we could ever want, and a single Windows 7 instance contains tens of thousands of benign binary files to use for comparison.
The promise of analytics for cybersecurity is undeniable, and the marketplace is awash in new products and services from both startup companies and established industry leaders that purport to do all or some of the analyses described above. These tools will find their way into the military’s “walled gardens” soon. While a healthy skepticism is warranted for the marketing hype around many security analytics products, progress in this field is real and rapid.
The most effective solutions currently available are those that use analytics to identify particular behaviors related to specific stages of the cyber kill-chain. For example, some solutions examine the connection history between each discrete pair of hosts on the network in order to identify attacker lateral movement. Others attempt to identify data exfiltration by examining the amount of data that hosts typically transmit and looking for outliers. One startup, comparing attacker activity to a mutating virus, even claims to use unsupervised learning to detect and eradicate malicious activity similar to the way the human immune system eliminates pathogens.
For better or worse, many organizations are buying these and other products, and every analytics platform deployment increases the probability of a major breakthrough in the field. The longer that analytics platforms continue to operate, the more data they’ll gather, and the more effective they’ll become. Ultimately, the security analytics problem is one of data acquisition, storage, and processing. Soon, a vendor will reach critical mass in all three of these factors, and we’ll begin to see true predictive analytics for cybersecurity.
Adam Tyra is a cybersecurity professional with expertise in security operations, security software development, and mobile device security. He is currently employed as a cybersecurity consultant. Adam served in the U.S. Army and continues to serve part-time as an Army reservist. He is an active member of the Military Cyber Professionals Association and is a former president of the San Antonio, Texas chapter.
Graham-Rowe, Duncan. "Fifty Years of DARPA: Hits, Misses and Ones to Watch." New Scientist. May 15, 2008. https://www.newscientist.com/article/dn13907-fifty-years-of-darpa-hits-misses-and-ones-to-watch/.
"Our Research." DARPA. http://www.darpa.mil/our-research.
"Cyber Grand Challenge." DARPA. https://www.cybergrandchallenge.com/.
Barrie, Allison. "'Mayhem' Rules as DARPA's Battle of the Machines Hits Las Vegas." Fox News. 2016. http://www.foxnews.com/tech/2016/08/08/mayhem-rules-as-darpas-battle-machines-hits-vegas.html.
"Chebyshev's Inequality." Wikipedia. https://en.wikipedia.org/wiki/Chebyshev's_inequality.
Pappalardo, Joe. "NSA Data Mining: How It Works." Popular Mechanics. September 11, 2013. http://www.popularmechanics.com/military/a9465/nsa-data-mining-how-it-works-15910146/.