From the Desk
🖋 Here we make space for our Data Science Community Newsletter writers to dive deep into a topic and bring you more details, nuances, and insights than will fit in our standard DSCN issues. From their desks to yours, enjoy!
29 March 2021
From the desk of Laura Norén
The seduction of AI in the context of security
The field of cybersecurity - which has given up its training wheels and is simply called "security" - has seen an increase in the number of vendors claiming to use AI to detect and/or prevent malicious intrusions and attacks. This is incredibly seductive - wouldn't it be great if an algorithm cranking through real-time and historical audit logs, user activity, and user status information could effectively head off bad situations before they turn into big problems? The overtaxed security operations center (SOC) operators could sleep through more nights, have real weekends, and eat dinner with their loved ones, all while knowing their organizations are protected and their colleagues' productivity is unfettered by locked accounts and other productivity-sapping security strategies.
Employing AI in the context of typical security problems is difficult for five reasons:
1. AI isn’t great in radically novel contexts: Some security threats are radically novel
The most serious security threats are new, sophisticated, multi-part intrusion tactics that often unfold over long periods of time. Algorithms are bad at new, long-term, sophisticated attack vectors that span multiple systems and are designed to evade detection. The algorithms that work best - those with the lowest false positive and false negative rates and the highest true positive and true negative rates, and that do not generate negative unintended consequences - are generally developed in the context of simpler, at least somewhat repetitive patterns. For the 'street crime' of the internet - phishing, spam, malware - algorithms do a pretty good job. Phishing, spam, and malware are repetitive and generally involve email and/or links.
But for intrusions like the recent SolarWinds supply chain attack, AI was not a big help and we shouldn’t expect AI to be all that helpful against future sophisticated attacks. The attack strategy was new; it unfurled in stages that didn’t strongly resemble one another; and it spanned software updates from one company to access products from another company, within the domains of many companies. As far as I know, there is still no AI model capable of definitively declaring whether or not a given company was impacted by SolarWinds, and if so, in which ways.
AI simply isn't great at flat-out novelty, let alone novelty designed to go undetected.
But if the only criticism of using AI in the context of security were that it cannot detect sophisticated nation-state attacks, we would be in good shape. About 80-90% of security attacks are unsophisticated street crimes. If AI could tackle all of those, security professionals could focus on complex espionage and sophisticated hacks.
But it isn’t quite that simple.
2. Human burnout is the cost of shoddy AI
The second stumbling block for using AI in security has to do with human burnout. Working in security involves a great deal of adrenaline and accountability paradoxes. Over-reporting on anomalies that turn out to be nothing is annoying and slows organizations down. But choosing not to raise a red flag until there is 100% certainty that an anomaly is malicious would mean many intrusions would persist unmitigated for months or years.
In other words, false positives and false negatives are both dealbreakers in security. AI tends to work on big data, so even a small false positive rate translates into a large absolute number of alerts - often too many to bring the signal-to-noise ratio down to a human scale. The human worker tends to get crushed in the crumple zone between low error percentages and the nonetheless high absolute number of incidents deserving investigation. It is fairly common for companies that sign up with new vendors to get access to all the newfangled AI, only to end up disabling that AI after a while.
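The arithmetic behind that crumple zone is worth making concrete. A minimal sketch, with an event volume and error rate that are hypothetical and chosen only for illustration:

```python
# A "small" false positive rate still buries a SOC team in alerts
# once it is multiplied by security-scale data volumes.
# Both numbers below are hypothetical, for illustration only.
events_per_day = 10_000_000   # audit-log events a mid-size org might generate
false_positive_rate = 0.001   # a 0.1% error rate sounds excellent on paper

false_alerts_per_day = round(events_per_day * false_positive_rate)
print(f"{false_alerts_per_day:,} false alerts per day")  # 10,000 false alerts per day
```

Ten thousand daily alerts that turn out to be nothing is the crumple zone in numbers: the percentage is low, but the absolute triage burden is not.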
The applications of AI that tend to work best are those in which the input signals are of a stable type. It’s all MRIs from the same brand of MRI scanner, all of the same type of organ. Or it’s all legalese from cases filed in the same district court, with responses prepared by a finite number of judges. With more variability in inputs, there will tend to be a higher error rate. The error rate may still be small as a percentage, but when the prediction has no ‘fail safer’ option - when both false positives and false negatives quickly become unmanageable - predictive modeling can lead to personnel burnout.
There is already a shortage of talent in security. It’s unclear if AI is going to eventually alleviate this problem or exacerbate it by burning people out even faster.
3. The security context is full of scheming: Scheming is a real challenge for AI
This is closely related to the next conundrum: security systems are more reliant on scheming human inputs than would be ideal for AI. The pace of change with respect to security threat vectors is rapid and the scope of change is vast. AI does not require perfectly defined terrains - that's the beauty of it! - but it may require more predictability than is available in typical organizations, which are under threat from everyone from teenagers with a grudge (e.g. Twitter) to nefarious hackers to clever, highly motivated lone thieves to sophisticated nation-state spies. Top this off with the perpetual state of internal transformation, digital and otherwise, that organizations impose upon themselves, and it's easy to see why training predictive models to the required level of accuracy is simply harder in security, especially when both false positives and false negatives are crippling. For self-driving cars, the classes of threats are narrower. They generally come from somewhere within 300 feet of the exterior plane of the vehicle or from the driver's seat.
Let's take a closer look at the comparison between predictive models for self-driving cars and trucks and those for security threat detection. While there is some uncertainty associated with driving - there could be pedestrians, bikers, or novel road conditions like snow covering all lane markers - there is a large amount of sameness in driving. Maps don't change all that often. There is generally a road/non-road distinction, whether or not lane markings are available. (Blizzard conditions are a hard problem for self-driving cars... but blizzards are generally a hard problem for human drivers, too. We should all just stay off the roads during blizzards.) We don't suddenly decide to switch the side of the road we drive on because it's Tuesday and we're up for a new experience. Bikers and pedestrians mostly do not cross roads wherever they wish, unless you're in NYC. Drivers themselves can cause some problems, but their failure modes are predictable - they mostly fall asleep or otherwise fail to pay attention to the road - which is a problem the cars can be designed to handle. In other words, the task of driving is predictable enough that with copious amounts of model training and other physical safety features, self-driving cars are likely to be safer than human drivers within the next five years (though probably not in NYC, given the higher pedestrian-to-car ratio).
One nice optimization feature in self-driving cars is that false positives may not be crippling, though false negatives certainly are. Having a car stop for a plastic bag may be annoying to the rider, but it likely won't kill anyone. (There is the problem of the human driver behind the self-driving car rear-ending the self-driving car that has come to a screeching halt for no apparent reason, but that can be mitigated.) Having a car fail to stop for a toddler squatting in the street to pick up a penny would be a serious problem, so self-driving car makers can carefully optimize to err towards ‘stop’ rather than ‘go’ when uncertainty is high.
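That "err towards 'stop'" optimization can be sketched as a cost-asymmetric decision rule. The costs and probabilities below are invented for illustration; a real system would estimate them from data:

```python
# Cost-asymmetric decision rule: when a missed obstacle (false negative)
# is vastly costlier than a needless stop (false positive), the optimal
# policy brakes even at very low obstacle probability.
# Both cost figures are hypothetical, for illustration only.
COST_NEEDLESS_STOP = 1.0         # annoyance of braking for a plastic bag
COST_MISSED_OBSTACLE = 100_000.0 # catastrophe of not stopping for a toddler

def should_stop(p_obstacle: float) -> bool:
    """Brake whenever the expected cost of going exceeds that of stopping."""
    expected_cost_go = p_obstacle * COST_MISSED_OBSTACLE
    expected_cost_stop = (1 - p_obstacle) * COST_NEEDLESS_STOP
    return expected_cost_go > expected_cost_stop

print(should_stop(0.001))     # True: even 0.1% certainty justifies braking
print(should_stop(0.000001))  # False: one-in-a-million is below the threshold
```

The contrast with security is the point: there, neither false positives nor false negatives are cheap, so there is no analogous 'fail safer' direction to optimize towards.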
In security, there's far less predictability - the types of applications used, tasks performed, locations from which work is conducted, times at which work is conducted, sizes and types of files, strategies for intruding on an organization, and assets targeted within the org (doing it for the lulz, doing it for the bitcoin, doing it for the USDs, doing it to eavesdrop for years, doing it to steal IP once) are constantly changing. The baseline is transformation. Further complicating this, security systems are fully human-machine hybrids. False positives and false negatives cripple the humans in the loop, in a variety of very human ways. Self-driving cars can scale into fleets or self-rearranging squads as they begin to interact primarily with each other, with a smaller and smaller percentage of (less predictable) human input. Security workers and tools don't and can't work this way, so the returns to AI investment in security won't enjoy those compounding, fleet-like benefits, either.
[Now, it could certainly be true that other systems to which AI has been applied are also more intertwined with humans than is ideal for AI, but here we're only talking about AI in security.]
4. An intractable problem: Security has weak feedback loops
In security, the people who make the AI are different from the people who apply the AI. This problem is not unique to security! It's almost always true. I have driven several cars, but designed none. The crux of this difficulty is that security vendors - the people who design and develop many of the tools used in security - don't get to use them every day, at scale, and the people who DO use them every day at scale have little incentive to tell the vendors when they run into truly bad things. When they run into plain vanilla bad things, sure - those aren't so embarrassing that they're going to damage corporate value. Those regular bad things that vendors help companies spotlight in the massive data flows coursing through their corporate veins can be shared back to the vendors and may help tune certain algorithms to be more accurate in the future. Particularly when it comes to AI - which tends to become more accurate with training - the shorter the feedback loop between model usage and model tuning, the better. When the worst, most hidden and insidious threats are disconnected from the feedback loop by the need to protect the brand, the stock price, the shareholders, or the employees' privacy, as they are in security, model accuracy will suffer. This may be the right trade-off -- most would say it is -- but the point is that it is almost always impossible to prioritize model accuracy in the most difficult security investigations because the investigators and the model makers are unable to share information.
It's not just a training data problem. It's also hard to understand the relative costs of false positives and false negatives. For instance, does a SOC employee want to know about possible out-of-policy behaviors they can do nothing about and which may or may not be risky? Without being able to get feedback from a bunch of SOC employees in a bunch of different organizations about specific types of information, the tendency is to over-inform or make curation assumptions without the benefit of feedback from perspectives "on the ground". [This type of data-free context usually means the first, most simplified, loudest, most senior, or last voice heard in the room will win, regardless of validity. Ugh.]
With all these problems, why is AI even used in security at all? The seductions of AI are irresistible from a marketing perspective (and cybersecurity appears to be riddled with nervous copycats - if Company X says it is "powered by AI", Company Y will likely integrate AI into its marketing). Anything that promises a streamlined, technologically sophisticated solution is appealing to the target audience: exhausted SOC workers whose mental and physical health are being pulverized by an ongoing assault of alerts, alarms, and urgent meetings. Even though these SOC workers are highly skeptical of AI claims - and experienced security personnel are the ones who trained me - there is still enough value in using AI in the context of 'street cyber crime' and accidental lapses in cyber hygiene that it is reasonable to at least investigate new promises of AI superpowers from vendors. It's not that these workers are so downtrodden that they believe any AI promise hawked by the circus barkers at RSA. It's that they know - we know - that some security companies will figure out how to scope problems correctly and use AI in ways that are net beneficial.
This brings me to the fifth problem:
5. Data hoarding feels like a good idea, but it’s complicated
In order to overcome some of the scarcity of feedback and the scarcity of sophisticated real-world exploits, many have been tempted to simply gather more data. First, I hope it is obvious that more data -- unless it's more of the right kind of feedback -- is not likely to fundamentally move the needle in the right direction. Sure, having more successful supply chain attacks may help train more accurate models, but the goal is to keep the attacks to a minimum. Nobody wants more attacks.
Aside from data derived directly from threats, which I think we can all agree would be a net drain on companies, it also FEELS like having more data about reactions to threats should be beneficial.
But the benefits of data hoarding are not that straightforward.
Scanning all employees’ email or videotaping everyone’s movements around the office 24/7 may lead to improvements in certain models. I am, in fact, all for letting Google scan my email to reduce spam. (Thank you, Google. I haven’t seen Viagra in my inbox for years.) But the details of these surveillance practices need to be balanced against corporate and employee privacy concerns. If there’s a reasonable likelihood that scanning email will be net beneficial, that everyone whose email is being scanned is aware of the scanning, and that they have an alternative means of communication for super-sensitive communications (e.g. they can call their doctor on the phone or use a personal email account and avoid surveillance), then it may be justified. But we need to collectively move past assumptions that companies own and have a right to surveil every thought, movement, utterance, and keystroke emanating from their employees. Legally, corporations basically do own and have a right to control and surveil every aspect of their employees’ on-the-job and/or on-the-device behavior. This tension between employee surveillance and employee autonomy is not unique to security applications, but it comes up frequently in security settings and is often resolved in favor of security (over privacy). The introduction of AI and data science into the security realm has only tipped the balance of power more firmly towards surveillance, a situation that has remained largely unchecked by new privacy legislation in the US. (The EU is a little different, but not much.)
As seductive as it may be to work towards fitting AI models to security challenges, certain classes of problems are better suited to it than others -- more common problems are better application spaces than novel or sophisticated threats -- and there are significant consequences for humans when AI enters the security context.
16 March 2021
From the desk of Laura Norén
A timely explainer for academic data scientists
Federated learning is a technique in which models can be trained and updated without pooling data in a central data store or otherwise creating onward exposures of the original data. Federated learning came out of Google in 2016 and was initially more widely used in industry than in academic data science, likely because it solved scaling and consumer preference problems that were not as common in academia. However, federated learning is now showing up in academic medical research and master’s-level teaching.
Assistant Professor Corey Arnold (UCLA) and his postdoc Karthik Sarma (UCLA/Caltech) are using federated learning to train diagnostic models on MRI images from several health care providers without having to remove the MRI image data from home data repositories.
In 2020, the University of Cambridge added a module on federated learning to their MPhil course in Advanced Computer Science; many data science, computer science, and AI/ML programs do not yet include federated learning.
A brief overview of federated learning
Where traditional data analysis and modeling first gather all the data into a central location, then run computations against that data to produce a model or models, federated learning leaves the data in decentralized locations, runs the computations in all of those decentralized locations, and sends only the model parameters to the central hub. The central hub then computes one federated model based on the many model estimates and sends that federated model back out to all the members of the federation. As new data become available within the decentralized members of the federation, the process re-runs to update the model. Any updates to the primary model are always available to members of the federation, which is excellent in situations where federation members may have sparse or infrequent data generation. (Note: there are several more complex versions of federated learning, including a direct node-to-node architecture with no central server.)
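The round-trip described above can be sketched in a few lines. This is a toy, single-parameter version of federated averaging; the member names, data, and learning rate are all invented for illustration:

```python
# Toy federated averaging: members update a shared parameter locally;
# only parameters (never raw data) travel to the central hub.
# Member names, data, and the one-parameter "model" are invented.

def local_update(theta: float, data: list[float], lr: float = 0.1) -> float:
    """One local gradient-descent step on squared error: nudge the shared
    parameter toward the mean of this member's data, which never leaves
    the member's own silo."""
    grad = sum(theta - x for x in data) / len(data)
    return theta - lr * grad

silos = {  # decentralized data: each federation member keeps its records locally
    "hospital_a": [2.0, 4.0],
    "hospital_b": [6.0],
    "hospital_c": [8.0, 10.0, 12.0],
}
total_records = sum(len(d) for d in silos.values())

theta = 0.0  # the central hub's current model parameter
for _ in range(100):
    # Each member computes an update in place; only the parameter moves.
    updates = {name: local_update(theta, data) for name, data in silos.items()}
    # The hub averages the updates, weighted by each member's data volume,
    # into one federated model and sends it back out to the federation.
    theta = sum(updates[n] * len(d) for n, d in silos.items()) / total_records

print(round(theta, 1))  # 7.0, the mean of all records across the silos
```

Weighting each member's update by its data volume mirrors the federated averaging idea described in the McMahan and Ramage post cited below; in practice the "model" has many parameters and members run several local training steps per round.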
Federated learning was designed in the context of smart phone use cases where consumers prefer to keep their personal data on their phone but also want to have the latest phone-based models updated in near real-time. For instance, auto-correct helped me spell ‘Billie Eilish’ and ‘Megan Thee Stallion’ with my thick thumbs while texting about the Grammys. 🎸
What are key benefits of federated learning?
Federated learning has two key benefits:
- Federated learning splits the computational load among many devices. If you are paying for computation, this is appealing. Programmers, please note: battery etiquette asks that federated learning computations run only when devices are plugged in.
- Federated learning is more privacy protecting because the data remains within its silos. Example: I want to spell Megan Thee Stallion’s name correctly but I may not want my thoughts about Megan Thee Stallion’s lyrics in a database somewhere. 🎤
Should we be pleased about offloading compute costs to distant devices?
There is no definitive answer to this question, but there are a couple of common considerations. In cases where the entire network is within your AWS instance or Snowflake account, you’ll pay for all the computation anyway, but with federated learning you can at least assess how much each tenant’s or customer’s computations are costing your company or research grant. If you’re an app developer and some of the computations are happening on your customers’ phones, laptops, or IoT devices, the cost-control benefits are even more obvious, but the computational complexity can be limited by the device type. Phones and laptops are usually computationally sufficient and plugged in regularly, but it is challenging to run federated learning on solar-powered devices clipped to small bird species. 🦜
Does federated learning solve privacy protection?
Federated learning provides a technical strategy that allows a great deal of data to remain where the original owner can control access to it. This is good for privacy and security. However, there are a number of papers suggesting additional constraints to prevent model parameters from revealing sensitive information, even if the underlying data are kept on the local device. For instance, imagine a query language with 500 possible query terms that can be used by the 100 customers of an app. In this example, 80 percent of queries use only ten of the 500 available query terms. The other 490 terms are infrequently used. A matrix that represents each query term as an entity would be sparsely populated outside the top ten, so any customer who used a rare term frequently could generate a substantially different model coefficient for that term. This could reveal sensitive corporate information (imagine a query term closely correlated with being acquired, scheduling a large layoff, or investigating a data breach).
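To make that leakage intuition concrete, here is a hypothetical sketch: the customers, query terms, and the frequency-based "parameters" are all invented, and real leakage analyses work on actual model coefficients rather than raw frequencies.

```python
# How a rare query term can fingerprint a customer from shared model
# parameters alone. Everything below is hypothetical, for illustration.
from collections import Counter

# Per-customer query logs, kept locally; "data_breach_scope" is a rare term.
logs = {
    "customer_1": ["status", "status", "login", "status"],
    "customer_2": ["status", "login", "login"],
    "customer_3": ["status", "data_breach_scope", "data_breach_scope"],
}

# A naive per-customer "parameter" for the rare term: its usage frequency.
# These values stand in for what a federated scheme would send to the hub.
rare_weights = {
    customer: Counter(queries)["data_breach_scope"] / len(queries)
    for customer, queries in logs.items()
}
print(rare_weights)
# The shared parameters alone single out customer_3 as likely investigating
# a breach, even though its raw query logs never left its silo.
```

This is the kind of inference that the additional constraints proposed in the literature, such as adding noise to shared parameters or aggregating them securely before the hub sees them, are meant to blunt.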
All criticisms considered, federated learning is fundamentally a more privacy-preserving approach than strategies that allow less control over data for users, though there is still scope to build additional safeguards and strategies.
Could federated learning offer an end-run around privacy protections and corrode data guardianship?
From a tech ethics perspective, there are some legitimate concerns about which types of data may become available for training federated models that would not otherwise be available. For instance, medical data is generally protected by HIPAA and cannot be shared without explicit consent. If federation allows model training without data sharing, this raises important questions about whether federated learning could be used in applications that are either not net beneficial or that privilege and prioritize those who are already advantaged over those who are not. When the medical data is shared for research purposes that are likely to be net beneficial and that are shared equitably across the population, federated learning is a tool for good. But if federated learning were used by private insurance companies to, say, decline to offer plans in states where access to health care is already challenging, it’s not clear that celebrating privacy preservation is the proper conversation to have about federated learning. In other words, just because a tool or technique is privacy preserving, does not mean it is net beneficial or equitably beneficial.
If federated learning is used to avoid protections built under the heading “privacy” which were actually meant to serve broader ethical goals, there is reason to pay close attention to net benefit and equitable distribution of benefit. No technology should be presented as inherently ethical. Too often privacy protecting technologies and applications are seen as de facto ethical or net beneficial.
In the academic applications of federated learning that I have seen, the net benefits are present and prominent.
Getting started with Federated Learning
👩‍💻 To do more:
- paperswithcode has 133 papers tagged with Federated Learning. A substantial number of these are software papers outlining tools available in open source repositories.
- Specifically check out: Flower, the open source FL package that the University of Cambridge is using to teach federated learning.
📚 To read more:
- Short and basic:
Brendan McMahan and Daniel Ramage. (2017) “Federated Learning: Collaborative Machine Learning without Centralized Training Data” Google AI Blog.
- Overview of federated learning in medical imaging applications:
Kaissis, G.A., Makowski, M.R., Rückert, D. et al. Secure, privacy-preserving and federated machine learning in medical imaging. Nat Mach Intell 2, 305–311 (2020). https://doi.org/10.1038/s42256-020-0186-1
- Solid, highly cited explainer on federated learning 101:
Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. (2019) Federated Machine Learning: Concept and Applications. ACM Trans. Intell. Syst. Technol. 10, 2, Article 12 (February 2019), 19 pages. https://doi.org/10.1145/3298981