The Human Error of Ignoring Machine Learning Cybersecurity

By Wenke Lee, Ph.D., Georgia Tech Institute for Information Security & Privacy

Recent surveys by universities, tech giants, and consulting firms alike point to an emerging paradox: a ravenous appetite for machine learning (ML) technology that promises organizations the next competitive edge, coupled with an unclear plan for what to do with it – especially when it goes wrong. Because much of today’s ML technology is still exploratory but progressing quickly, it’s fair to infer that how to approach, manage, and triage it inside an organization represents unknown territory for business leadership. Nonetheless, this is the ideal time for important conversations about the cybersecurity of ML to take place.

A survey in late 2016 by MIT Technology Review and Google Cloud found that more than half of its 375 respondents had already adopted ML into their strategic plans. The top four application areas are:

  • Natural language processing
  • Text classification and mining
  • Emotion and behavior analysis
  • Image recognition, classification, or tagging

A larger, global survey by Tata Consultancy Services found that 84 percent of executives believe ML and artificial intelligence are essential to business competitiveness, and half believe the technologies will be transformative. Yet none of these surveys clearly indicates how businesses plan to troubleshoot what could go wrong. So let me tell you.

What Can Go Wrong with Machine Learning

Already, attackers can launch a causative (or data poisoning) attack, which injects intentionally misleading or false training data so that an ML model becomes ineffective. Intuitively, if the ML algorithm learns from the wrong examples, it is going to learn the wrong model. Attackers can also launch an exploratory (or evasion) attack to find the blind spots of an ML model and evade detection. For example, if an attacker discovers that a detection model looks for unusually high traffic, he can send malicious traffic at a lower volume and simply take more time to complete the attack.
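To make the evasion scenario concrete, here is a toy sketch in Python; the detector, threshold, and traffic numbers are all illustrative assumptions, not drawn from any real system:

```python
# Toy volume-based detector: flags any host whose traffic rate exceeds
# a fixed threshold that the model learned from "normal" traffic.
THRESHOLD_BPS = 1_000  # illustrative cutoff, in bytes per second

def is_flagged(bytes_per_second: float) -> bool:
    """Return True if the traffic rate looks unusually high."""
    return bytes_per_second > THRESHOLD_BPS

PAYLOAD = 50_000  # bytes the attacker wants to exfiltrate

# Naive attack: send everything in one second.
print(is_flagged(PAYLOAD / 1))    # True: detected

# Evasion: spread the same payload over 100 seconds (500 bytes/second),
# staying inside the model's blind spot by taking more time instead.
print(is_flagged(PAYLOAD / 100))  # False: slips through
```

The attacker exfiltrates exactly the same payload in both cases; only the rate changes, which is all a fixed volume threshold can see.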

What if a competitor hacked into a mid-market construction company and used data poisoning to unfavorably skew the ML algorithms behind inventory or customer-relationship management, complicating merger talks with a mutual rival? Or what if a classifier were trained to look for the specific paint colors and materials used by Picasso? A forger could use the same colors and materials to produce a painting that would be authenticated as a Picasso. Imagined another way, ML gone wrong could mean your search engine results about a political candidate were hijacked, or your antivirus program was trained to ignore a Trojan horse instead of quarantining it.
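The poisoning scenarios above all follow one pattern: corrupt the training data, and the retrained model drifts. A minimal sketch of that mechanism, assuming a toy anomaly detector (mean plus three standard deviations) and made-up traffic numbers:

```python
import statistics

# Toy anomaly detector, periodically retrained on collected traffic data:
# any rate above mean + 3 * stdev of the training set is "anomalous".
def threshold(data):
    return statistics.mean(data) + 3 * statistics.stdev(data)

training = [100, 110, 95, 105, 102, 98]   # benign traffic rates (made up)
ATTACK_RATE = 400

print(ATTACK_RATE > threshold(training))  # True: detected before poisoning

# Causative attack: inject intentionally misleading training examples so
# the retrained model's notion of "normal" drifts upward.
training += [200, 250, 300, 350, 400, 450]

print(ATTACK_RATE > threshold(training))  # False: the poisoned model misses it
```

The detector's statistics are only as trustworthy as the data they were computed from, which is exactly what a causative attack exploits.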

These issues in an adversarial setting pose many interesting new ML challenges. For example, it is important for a defending CISO to understand the trade-off between how long to keep a machine learning model fixed (which can give rise to exploratory attacks) and how frequently to update it (which opens the window for causative attacks).

At the Intel Science & Technology Center for Adversary-Resilient Security Analytics (ISTC-ARSA) housed at Georgia Tech’s Institute for Information Security & Privacy (IISP), researchers are studying vulnerabilities to develop new security approaches that will improve the resilience of ML applications. Our focus includes how ML cybersecurity can be hardened for important use cases such as security analytics, search engines, facial and voice recognition, fraud detection, and more.

Since August 2016 when the ISTC-ARSA was established, Intel and Georgia Tech researchers have been meeting monthly to tackle projects that span theory to tools:

1. Adversarial machine learning to study the vulnerabilities of ML algorithms and attacks on ML-based systems.

2. Adversarial-resilience to create design improvements to ML algorithms.

3. MLSPLOIT, a custom toolkit by Georgia Tech to help computer scientists build an ML-based malware analysis framework for experiments and countermeasures.

4. Next-generation security analytics to explore alternative data sources, such as Intel PT (Processor Trace) traces, for program monitoring and malware analysis.

5. Robust security analytics to explore using Intel SGX (Software Guard Extensions) to protect a security analysis system with ML-based modules.

What Can Be Done to Improve Machine Learning

This month, 25 researchers from Intel and Georgia Tech gathered in Atlanta for a two-day retreat to share their latest findings. Highlights included:

1. Theoretical foundations of adversarial machine learning: ISTC-ARSA researchers have developed a new theoretical framework to reason how an attacker can generate a sequence of examples to feed to a ML algorithm and force it to quickly produce a model desired by the attacker. This framework provides different strategies that an attacker can use depending on the assumption of how much the attacker knows about the ML algorithm and features.

2. Techniques for adversarial-resilience machine learning: For example, recent research shows that deep neural networks (DNNs) can be highly vulnerable to adversarial samples that are crafted by adding small perturbations (imperceptible to the human eye) to normal, benign images. ISTC-ARSA researchers explored and demonstrated how systematic JPEG compression can work as an effective pre-processing step in the classification pipeline to counter adversarial attacks and dramatically reduce their effects. The researchers also proposed an ensemble-based technique that can be constructed quickly from a given well-performing DNN and empirically showed how such an ensemble that leverages JPEG compression can protect a model from multiple types of attacks without requiring knowledge about the model.

3. The MLSPLOIT project: ISTC-ARSA researchers have laid groundwork for MLSPLOIT by developing a malware analysis framework that supports safe and live exploration of malware behavior. This allows meaningful data to be collected and fed to ML algorithms to learn security analytics models.
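The JPEG-compression defense in the second highlight works because lossy compression discards the low-amplitude detail that adversarial perturbations live in. The same intuition can be sketched with a much cruder stand-in, coarse pixel quantization; this toy uses made-up data and is not the researchers' actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# A "benign image": an 8x8 grayscale patch. For this toy the clean pixels
# sit exactly on a coarse grid (multiples of 16) so the effect is exact.
clean = rng.integers(0, 16, size=(8, 8)).astype(float) * 16

# Adversarial sample: the same image plus a small perturbation,
# imperceptible to the human eye but enough to fool many classifiers.
adversarial = clean + rng.uniform(-2.0, 2.0, size=(8, 8))

def compress(img, step=16.0):
    """Stand-in for JPEG's lossy step: snap each pixel back to the coarse
    grid, discarding the low-amplitude detail the perturbation lives in."""
    return np.round(img / step) * step

print(np.array_equal(adversarial, clean))            # False: inputs differ
print(np.array_equal(compress(adversarial), clean))  # True: perturbation removed
```

Real JPEG compression does not restore images this perfectly, but the principle is the same: pre-processing that throws away fine detail also throws away much of the adversarial signal.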

We’ll also continue to study features of Intel SGX, which is intended to provide a secure enclave for program execution even when the operating system has been compromised.

Looking ahead in the coming year, ISTC-ARSA researchers will work toward applying the new theoretical framework and hardened techniques from the image domain to the security analytics domain. We also expect to release MLSPLOIT to the research community as a corpus of malware samples, ML-based malware analysis algorithms, tools for manipulating malware samples to evade detection, and procedures for improving the adversarial-resilience of ML algorithms.

Recently surveyed businesses say the most-desired outcomes of ML are deeper data insights and the prevention of human error. Therefore, we must consider the error involved in overlooking cybersecurity. We believe cybersecurity must be considered before – and at the very least in tandem with – boardroom discussions that evaluate the strengths, weaknesses, opportunities, and threats of ML use cases and their benefits.

Because ML is an emerging technology and the hunger from industry is palpable, this is the right time for cross-sector collaboration among academia, industry, and policymakers. Unlike the rapid onset of Internet of Things devices (many of which lack information security and privacy protections), I believe ML can be the scientific discipline that proceeds with caution and security at the forefront. It’s important that more voices from all sectors increase collaboration now rather than later.

To keep up with advances by the Intel Science & Technology Center for Adversary-Resilient Security Analytics at Georgia Tech, follow us here: http://istc-arsa.iisp.gatech.edu/

Wenke Lee, Ph.D., is co-director of the Institute for Information Security & Privacy (IISP) and the John P. Imlay, Jr. Chair of Computer Science in the College of Computing at Georgia Tech, where he has taught since 2001.