October 17, 2016 By Russell Couturier 4 min read

Trying to discern drug smugglers passing through customs presents exactly the same problem as trying to discern security threats passing through our networks. Machine learning has been applied to both with varying degrees of success, but ultimately the technology reaches the same limitations. Machine learning has two basic elements: feature vectors and classification exemplars — the data that is gathered and the corresponding classification examples.

In the case of drug smugglers, we might observe number of travelers, point of origin, point of destination, number of bags, length of stay and weight of the bags. We might also flag any traveler or pair of travelers with two or more bags whose combined weight is greater than 150 pounds, whose stay is less than a week and who originated from a climate conducive to poppies.

In the case of threat analytics, we observe the addresses of the endpoints, the amount of data that flows to and from each endpoint, the geolocated countries of the endpoints and the frequency of communications. Given this context, should it not then be easier to discern between malicious and benign behavior?

Reducing False Positives

The following graph is an expected results curve for machine learning:

The more feature vectors (data) we can collect, combined with continual and up-to-date classification examples, the more easily machine learning can discern between malicious and benign behavior, with the false positive rate trending to near zero. False positives, or incorrectly classified events, are the bane of all machine learning, whether it’s used to detect drug smugglers or malicious network traffic.

A traveling aid worker might carry unusual tools in his luggage. Similarly, an unusual level of file transfer traffic from an untrustworthy geolocation might simply be a large download of vacation photos. False positives consume human resources to investigate events that can be commonplace, thus diminishing the effectiveness of machine learning. These are still early days.

Infinite Exemplars and Finite Data

So how do we decrease security false positives? It’s important to keep in mind that classification exemplars are infinite and data is finite. If our traveler is denoted as malicious and subsequently proves to be legitimate, feed this as just another training exemplar of valid travelers. The same holds true for network traffic: Simply classify the photo downloads as valid traffic. The more exemplars, the better the classification, right? This is true to the extent that the feature vectors are granular enough to allow discrete exemplars, so the data points don’t apply to multiple classifications.

Feature vectors are limited, however, and in many instances exhausted. In the case of a traveler, we might add the color of his or her bags, style of clothes or frequency of travel. In the case of threat analytics, we might add file content type and file entropy. Still, at some point the data is finite. Some vacationers simply look like drug smugglers, and vice versa.

Experts have conducted an abundance of research on network flow information. Unfortunately, flow information is easily exhausted because it only provides a few dozen features and suffers greatly from false positives, per the following diagram:

Here, exemplars continue to increase, but feature vectors are limited. The key to improving machine learning is to increase feature vectors so that exemplars correctly classify events as malicious or benign. The traveler might be wearing additional attire, like sunglasses or a hat. Similarly, network traffic might include tunneling or proxy knowledge content. An unfortunate reality for network analytics is that the desired data is typically not available or simply unattainable. So where does this leave us in terms of machine learning for security?

The Future of Machine Learning

Machine learning is very effective in eliminating white noise and classifying benign traffic with a high degree of accuracy — that is, what it believes to be benign is absolutely benign. Still, the false positive rates for predicting malicious events have been pretty disappointing.

For machine learning to be effective, IT professionals must switch from an analytics-based approach to a data-centric one, as more analytics on the same data produces the same results. We need to derive better data if we are going to rely on machine learning alone for security assessments.

IBM has done extensive research using machine learning for security analytics in the areas of Domain Name System (DNS) and the insider threat. IBM is currently beta testing machine learning-based applications for DNS that detect tunneling, beaconing, fluxing, squatting and exfiltration.

Here, false positives are reduced by increasing feature vectors with supplemental WHOIS and URL/website analytics. The insider threat technology is based on user behavior analytics that classify typical and anomalous behavior. For example, logging into an application from the same endpoint with a different identifier would be an anomaly upon first occurrence.

Machine learning can reduce the white noise, but without an injection of security-relevant data, it has a long way to go before it can be considered a generational leap in analytics. The next generation of machine learning-based security analytics will undoubtedly have a heavy focus on data acquisition.

More from Intelligence & Analytics

Hive0051’s large scale malicious operations enabled by synchronized multi-channel DNS fluxing

12 min read - For the last year and a half, IBM X-Force has actively monitored the evolution of Hive0051’s malware capabilities. This Russian threat actor has accelerated its development efforts to support expanding operations since the onset of the Ukraine conflict. Recent analysis identified three key changes to capabilities: an improved multi-channel approach to DNS fluxing, obfuscated multi-stage scripts, and the use of fileless PowerShell variants of the Gamma malware. As of October 2023, IBM X-Force has also observed a significant increase in…

Email campaigns leverage updated DBatLoader to deliver RATs, stealers

11 min read - IBM X-Force has identified new capabilities in DBatLoader malware samples delivered in recent email campaigns, signaling a heightened risk of infection from commodity malware families associated with DBatLoader activity. X-Force has observed nearly two dozen email campaigns since late June leveraging the updated DBatLoader loader to deliver payloads such as Remcos, Warzone, Formbook, and AgentTesla. DBatLoader malware has been used since 2020 by cybercriminals to install commodity malware remote access Trojans (RATs) and infostealers, primarily via malicious spam (malspam). DBatLoader…

New Hive0117 phishing campaign imitates conscription summons to deliver DarkWatchman malware

8 min read - IBM X-Force uncovered a new phishing campaign likely conducted by Hive0117 delivering the fileless malware DarkWatchman, directed at individuals associated with major energy, finance, transport, and software security industries based in Russia, Kazakhstan, Latvia, and Estonia. DarkWatchman malware is capable of keylogging, collecting system information, and deploying secondary payloads. Imitating official correspondence from the Russian government in phishing emails aligns with previous Hive0117 campaigns delivering DarkWatchman malware, and shows a possible significant effort to induce a sense of urgency as…

Topic updates

Get email updates and stay ahead of the latest threats to the security landscape, thought leadership and research.
Subscribe today