From the Desk
🖋 Here we make space for our Data Science Community Newsletter writers to dive deep into a topic and bring you more details, nuances, and insights than will fit in our standard DSCN issues. From their desks to yours, enjoy!
15 June 2021
From the desk of Laura Norén
(read more about Laura Norén)
Who IS the Data Science Community?
We call it the Data Science Community Newsletter because we believe that when all of us read the same updates, research advances, and commentary we develop a shared imagination of what counts as data science and who gets to have a seat at the data science table. The DSCN is our almost literal attempt to get data scientists on the same page. In order to stay community-focused, we source a lot of our material from readers’ Twitter accounts and the departmental home pages where readers work. Your posts about data science are a good proxy for your thoughts about data science. We like to signal boost what’s already on your mind.
What do we know about who reads the newsletter?
There are over 8,500 subscribers, mostly based in the US (75%).
Canadians, Brits, and Germans make up 3% each of readers opening the newsletter with another 2% of newsletter opens coming from France. We also see readers in India, Spain, Australia, Brazil, Chile, and Ireland.
Speaking of opening the newsletter, anywhere from 25-30% of you open any given newsletter, which is pretty good engagement for an email newsletter. We also see a lot of room for improvement in our “open” rate.
DSCN readers ❤️ higher education
Our readership is dedicated to formal education. Readers, congratulations on all of your formal education.
Almost half of you have PhDs (47%) or MDs (1%). That’s way above the US national level of about 2% of the population having either a PhD or MD. Another ~40% of you have some flavor of Masters Degree (compare that to 11% of the overall US population). Add it up and 86 percent of the responding readers hold graduate degrees. Way to write those dissertations and theses. (You are my people - there’s no doubt about that.)
Our readers also have strong intentions to continue their data science training. Considering how well-educated readers already are, it is surprising that 17% plan to get another formal university degree and 7% plan to get a certificate from a university.
A quarter plan to attend a 1-6 day workshop or other training session, about half (47%) plan to take an online course (Coursera, Udacity), and almost 70% plan to learn more about data science on StackOverflow, GitHub, or YouTube as needed.
Takeaway: With all the continuing training among a group of highly educated people, I sense the need for a professional data science organization that can identify relevant material for continuing professional education (CPE) credits, similar to what physicians, dentists, and architects have to do to retain their professional standing. There has to be a good balance between informal training on StackOverflow and YouTube and spending the time and money to get another formal university certificate or degree.
Where do DSCN readers work?
Almost half (47%) of our readers work in higher education, well above the US national rate (2%). Makes sense. We write for the academic data science audience. Readers working in tech and finance are also over-represented, and there is healthy representation from non-profits and government agencies.
The DSCN readership is radically interdisciplinary
While the Computer Science + Data Science (CS+DS) category is where the largest number of readers got their highest degree (29%), that group does not constitute a numerical majority of readers. That group doesn’t even make up one-third of our readers. This is one key differentiating fact about our readership - it is radically interdisciplinary*. The social science group is roughly tied for second place with the stats/math group at ~13%. Then there are six disciplines that are similar in size - between 4 and 8% of the total. None of this surprises us, but it often surprises outsiders who assume that data science/AI is done by people with CS degrees.
*Radically interdisciplinary groups are groups in which no single discipline makes up 50% of the membership, there are at least five disciplines representing at least 5% of the membership each, and the disciplines in the group use different methods and bodies of theory. In our readership, many groups use data science in their methodology, but there is methodological divergence in terms of how data are collected, in how data relate to theory, in whether/how data science is applied, and in which literatures are considered canonical. There can also be differences in the way data are seen to be related to human subjects with some disciplines seeing data as intrinsically and inextricably linked to human identity, thus potentially being covered by human rights law and other ethical expectations designed to cover humans. Other disciplines may rarely use data derived from humans (astronomers) or may see data as wholly distinct from the humans who generate it, thus outside the purview of laws and ethical principles designed to address humans and human rights.
The DSCN readership is possibly ready to get out again
We wanted to know when we might start seeing DSCN readers at events as we emerge from what was hopefully the worst stretch of the pandemic. We realize that pandemic recovery processes are proceeding at different paces in different places and that the pandemic is not over yet. We went back and forth on differentiating between in-person events and web-based interaction, but eventually gave up and simply asked about readers’ expectations for any kind of event attendance in the next 12 months. The poll was conducted in April/May so there was still a fair amount of confusion about the state of the pandemic, WFH policies, and what mental/physical/emotional/sartorial/misanthropic state you’d be in after what may have been the most challenging year of your professional life.
It looks like many of you are planning to get back to your top 2-4 events, though 15% of you didn’t know enough about your plans for the next 12 months to hazard a guess.
There are still more people canceling everything (9%) than going with the ‘say yes to everything’ approach (5%), but the majority appears to be ready for at least one event.
As for organizing events, over a third of our readers are pitching in to help organize at least one event. Thank you for your service to our community.
Summary - DSCN readers are radically interdisciplinary, mostly from the US, highly educated, probably working in higher education or tech, very interested in getting more education and training, and ready to attend events.
Now that you know who our readers are, you may want to know what they want more of and less of from the DSCN crew. We do love your funny, creative, occasionally cranky responses. Every last one of them.
Reaching the DSCN audience
Perhaps you want to reach our super smart, passionate, opinionated, event-attending audience. They have a lot going for them. We do have a few opportunities for sponsors. Please email DSCN editor firstname.lastname@example.org to find out more. You may also want to read the “What DSCN readers want” blog post while you’re waiting for a response. That post has even more insights into our audience and our plans for continuing to build the academic data science community.
Want to know what readers want from the DSCN?
15 June 2021
From the desk of Laura Norén
What do DSCN readers want?
We asked readers which four attributes of the DSCN they most liked and disliked. Our “comprehensive coverage” and “academic/research focus” got the broadest support and had the fewest critics. The “ethics angle” was in the top three with the closely related “writing tone” in fourth. All four are attributes the editorial team values highly. We interpret the survey results as a vote of confidence in our shared vision.
The job postings and Tweet of the Week are controversial with similar numbers of likers and haters.
As a result of the lackluster enthusiasm bordering on revulsion, we will drop the Tweet of the Week as a standard feature. We will now run exceptional tweets and/or TikToks when they are truly hilarious and on-topic.
We have a different attitude about job postings. Here’s the thing, readers. If you have a good job and aren’t responsible for hiring, the job postings may feel like clutter. We see that. But readers who want new jobs - 10% of you! - or who want to hire great people, see job postings as highly relevant. Upshot? We will not be eliminating or further reducing the job posts, but we may revamp how we present them.
What we’re getting right
Looks like we are mostly getting the cadence right. Half of you like the every other week schedule, and the remaining half of you are split between wanting DSCN more often and less often. We will keep the every other week frequency for now.
If you happen to be among the group that wants more DSCN in your life, keep reading.
New DSCN content types
OK! We may have caused a mild panic for some readers by asking about other content types. Some of you thought we might stop emailing you. LOL. You gotta request an unsubscribe if you want us to stop showing up in your inbox.
I repeat: We are one hundred percent committed to producing an email newsletter.
We are also considering adding new content types to the Data Science Community Newsletter line-up.
DSCN Podcast, anyone?
Over half of all respondents (55%) want to try a podcast and 6% are curious about the Clubhouse audio format. With audio, listeners can get DSCN content while baking, doing laundry, cleaning up after dinner, walking a dog, or creating data visualizations. We are definitely exploring our audio options. We love the idea of adding an audio content format.
To make a podcast we need: a marquee podcast sponsor
The most important caveat about the podcast offering is that it will cost money we don’t currently have. We don’t believe in asking people to work for free nor do we have the risk tolerance to rely on inconsistent funding streams like Patreon.
I would love to produce a DSCN podcast - with an even higher snark quotient than the newsletter due to vocal intonation alone - so please let the DSCN know if your organization might want to be our marquee podcast sponsor.
Many readers (41%) would also like to see us produce tweet threads of specific stories. This makes a ton of sense. We use your Twitter feeds (well, we use the Twitter feeds of those of you who have given us permission to use your feeds) to keep us aware of what you’re publishing, what you’re talking about, and what your institutions are launching, curtailing, botching, building and claiming as victories.
For those of you who compose tweet threads about your research: Keep doing it. We love them. Your fellow readers love them. Thank you.
Tweet threads are surprisingly time consuming to produce. We want to signal boost the great science communicators out there who are already doing the work it takes to write a good tweet thread. Pro-tip: please use plots, charts, and other data visualizations if you are summarizing research.
We will retweet more of the threads we find and appreciate from the ADSA account (https://twitter.com/AcademicDataSci) and my personal account (https://www.twitter.com/digitalFlaneuse). If you want to get our attention, use hashtag #DSCN. It’s nice and short and will help us find and amplify your tweet threads quickly.
A good fifth of you also want more LinkedIn posts (22%). This is somewhat surprising because the Data Science Community Newsletter has almost no LinkedIn presence. It was also an extremely polarizing type of content. Others threatened to raise hell if we start doing much with LinkedIn.
Given the less polarizing arenas that garnered greater enthusiasm, we are hitting snooze on the LinkedIn posts to give us time to gather more feedback.
I included the option to receive SMS texts throughout the week as a joke. Two percent of you may have taken me seriously. Either that or at least 2 percent of the people taking the survey were robots. If there are truly readers who want me to text you about data science news throughout the week, send your number and preferred topics to email@example.com. I will avoid SMS, but I will start a Signal group.
Disclaimer: I do not know all the hidden meanings of emojis and may inadvertently send something that’s a bit...off. Please DM me to let me know what it is that I have accidentally done.
We didn’t ask you if you wanted DSCN via TikTok.
But some of you very much DO want DSCN via TikTok, preferably with dance.
To quote my favorite VP of Eng: we shall see.
We didn’t write a response option for Slack because the Academic Data Science Alliance (ADSA) already runs a Slack instance. If you have a login to academicdatascience.slack.com, get on in there and chat. If you want access to the ADSA Slack instance, please email me and ask for an invite (firstname.lastname@example.org).
The Write-In Comments
Ah, yes. The write-in comments. Thank you to those of you who took the time to write in anything, especially if you were funny about it. For instance, the person who wants to get the newsletter as a giant scroll to unfurl at a lab meeting? I think you’ll be hearing from team DSCN about that one.
What really gets your pantaloons in a twist: Advertising
Many of you recoiled at the thought that there may be advertisers associated with the DSCN.
First, let us assure you, we aren’t offering a traditional advertising model. We have guidelines that outline the qualities sponsoring organizations must have in order for us to accept them. This will not be like the time organic vegetarian and vegan food bloggers got righteously angry when their advertising network placed ads for sausage links, cold cuts, and chicken nuggets on their blogs.
We are offering several sponsorship opportunities.
To the commenter who pointed out that advertising is a “bad look for the newsletter”: if there’s a way we can proceed that would be less “bad” looking, but still allows us to pay staff, please reach out again. Our goal is to avoid the “bad”-ness part of the look and still do good things like retaining editorial control, making the newsletter available for free, and paying our staff. We think we can pull it off.
We request that our readers give us the benefit of the doubt to find the kind of support that keeps the DSCN free. Ideally, our sponsors will be institutions, departments, and labs that promote things you want to know about.
Want to know more about the DSCN audience?
Go here to read our blog post about reader demographics for the DSCN.
29 March 2021
From the desk of Laura Norén
The seduction of AI in security
The seduction of AI in the context of security
The field of cybersecurity - which has given up its training wheels and is simply called "security" - has seen an increase in the number of vendors claiming to use AI to detect and/or prevent malicious intrusions and attacks. This is incredibly seductive - wouldn't it be great if an algorithm cranking through real-time and historical audit logs, user activity, and user status information could effectively head off bad situations before they turn into big problems? The overtaxed security center operators could sleep through more nights, have real weekends, eat dinner with their loved ones, all while knowing their organizations are protected and their colleagues' productivity is unfettered by locked accounts and other productivity-sapping security strategies.
Employing AI in the context of typical security problems is difficult for five reasons:
1. AI isn’t great in radically novel contexts: Some security threats are radically novel
The most serious security threats are new, sophisticated, multi-part intrusion tactics that often unfold over long periods of time. Algorithms are bad at detecting new, long-term, sophisticated attack vectors that span multiple systems and are designed to evade detection. The algorithms that work best - those with the lowest false positive and false negative rates, the highest true positive and true negative rates, and no negative unintended consequences - are generally developed in the context of simpler, at least somewhat repetitive patterns. For the 'street crime' of the internet - phishing, spam, malware - algorithms do a pretty good job. Phishing, spam, and malware are repetitive and generally involve email and/or links.
But for intrusions like the recent SolarWinds supply chain attack, AI was not a big help and we shouldn’t expect AI to be all that helpful against future sophisticated attacks. The attack strategy was new; it unfurled in stages that didn’t strongly resemble one another; and it spanned software updates from one company to access products from another company, within the domains of many companies. As far as I know, there is still no AI model capable of definitively declaring whether or not a given company was impacted by SolarWinds, and if so, in which ways.
AI simply isn't great at flat-out novelty, let alone novelty designed to go undetected.
But if the only criticism about using AI in the context of security is that it cannot detect sophisticated nation-state attacks, we would be in good shape. About 80-90% of security attacks are unsophisticated street crimes. If AI could tackle all of those, security professionals could focus on complex espionage and sophisticated hacks.
But it isn’t quite that simple.
2. Human burnout is the cost of shoddy AI
The second stumbling block for using AI in security has to do with human burnout. Working in security involves a great deal of adrenaline and accountability paradoxes. Over-reporting on anomalies that turn out to be nothing is annoying and slows organizations down. But choosing not to raise a red flag until there is 100% certainty that an anomaly is malicious would mean many intrusions would persist unmitigated for months or years.
In other words, false positives and false negatives are both dealbreakers in security. AI tends to work on big data, so even a small false positive rate translates into a large absolute number of alerts, and the model may not cut the noise down to a human scale. The human worker tends to get crushed in the crumple zone between error rates that are low as percentages and incident counts that are nonetheless high in absolute terms, each one deserving investigation. It is fairly common for companies that sign up with new vendors to get access to all the new-fangled AI to end up disabling that AI after a while.
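To make the percentage-versus-absolute-count point concrete, here is a minimal back-of-the-envelope sketch in Python. The event volume, the false positive rate, and the triage time are hypothetical numbers chosen for illustration, not measurements from any real security operations center, and the function names are made up for this example.

```python
def daily_false_alerts(events_per_day: int, false_positive_rate: float) -> float:
    """Expected number of benign events flagged as malicious per day."""
    return events_per_day * false_positive_rate


def analyst_hours(alerts: float, minutes_per_alert: float) -> float:
    """Hours of human triage needed to look at every alert."""
    return alerts * minutes_per_alert / 60


# Hypothetical inputs: 10 million logged events a day, a 0.1% false positive
# rate, and 5 minutes of human attention per alert.
alerts = daily_false_alerts(10_000_000, 0.001)
hours = analyst_hours(alerts, 5)
print(f"{alerts:,.0f} false alerts/day, {hours:,.0f} analyst-hours/day")
```

Under these assumed inputs, an error rate that sounds tiny still produces far more alerts than a small team can investigate in a day.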
The applications of AI that tend to work best are those in which the input signals are of a stable type. It’s all MRIs from the same brand of MRI scanner, all of the same type of organ. Or it’s all legalese from cases filed in the same district court, with responses prepared by a finite number of judges. With more variability in inputs, there will tend to be a higher error rate. The error rate may still be small as a percentage, but when the prediction has no ‘fail safer’ option - when both false positives and false negatives quickly become unmanageable - predictive modeling can lead to personnel burnout.
There is already a shortage of talent in security. It’s unclear if AI is going to eventually alleviate this problem or exacerbate it by burning people out even faster.
3. The security context is full of scheming: Scheming is a real challenge for AI
This is closely related to the next conundrum: security systems are more reliant on scheming human inputs than would be ideal for AI. The pace of change with respect to security threat vectors is rapid and the scope of change is vast. AI does not require perfectly defined terrains - that's the beauty of it! - but it may require more predictability than is available in typical organizations, under threat from everyone from teenagers with a grudge (e.g. Twitter) to nefarious hackers to highly motivated one-person clever thieves to sophisticated nation-state spies. Top this off with the perpetual state of internal transformation, digital and otherwise, that organizations impose upon themselves, and it’s easy to see why training predictive models to the level of required accuracy is simply harder in security, especially when both false positives and false negatives are crippling. For self-driving cars, the classes of threats are narrower. They generally come from somewhere within 300 feet of the exterior plane of the vehicle or from the driver’s seat.
Let's take a closer look at the comparison between predictive models for self-driving cars and trucks versus those for security threat detection. While there is some uncertainty associated with driving - there could be pedestrians, bikers, or novel road conditions like snow covering all lane markers - there is a large amount of sameness in driving. Maps don't change all that often. There is generally a road/non-road distinction, whether or not lane markings are available. (Blizzard conditions are a hard problem for self-driving cars...but blizzards are generally a hard problem for human drivers, too. We should all just stay off the roads during blizzards.) We don't suddenly decide to switch the side of the road that we drive on because it's Tuesday and we're up for a new experience. Bikers and pedestrians mostly do not cross roads wherever they wish, unless you're in NYC. Drivers themselves can cause some problems, but with the right kind of persuasion, they mostly fall asleep or otherwise fail to pay attention to the road, which is a problem the cars can be designed to handle. In other words, the task of driving is predictable enough that with copious amounts of model training and other physical safety features, self-driving cars are likely to be safer than human drivers within the next five years (though probably not in NYC, given the higher pedestrian::car ratio).
One nice optimization feature in self-driving cars is that false positives may not be crippling, though false negatives certainly are. Having a car stop for a plastic bag may be annoying to the rider, but it likely won't kill anyone. (There is the problem of the human driver behind the self-driving car rear-ending the self-driving car that has come to a screeching halt for no apparent reason, but that can be mitigated.) Having a car fail to stop for a toddler squatting in the street to pick up a penny would be a serious problem, so self-driving car makers can carefully optimize to err towards ‘stop’ rather than ‘go’ when uncertainty is high.
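The ‘err towards stop’ idea can be written as an expected-cost comparison. This is a minimal sketch, not a description of how any car maker actually does it; the function names and the cost values are assumptions chosen for illustration.

```python
def should_stop(p_obstacle: float, cost_missed_obstacle: float, cost_needless_stop: float) -> bool:
    """Stop when driving on has a higher expected cost than stopping.

    p_obstacle is the model's estimated probability that something real is in the road.
    """
    expected_cost_go = p_obstacle * cost_missed_obstacle
    expected_cost_stop = (1 - p_obstacle) * cost_needless_stop
    return expected_cost_go > expected_cost_stop


def stop_threshold(cost_missed_obstacle: float, cost_needless_stop: float) -> float:
    """Probability above which stopping is the cheaper choice."""
    return cost_needless_stop / (cost_needless_stop + cost_missed_obstacle)


# Hypothetical costs: a missed obstacle is 1000x worse than a needless stop,
# so the car should stop even when it is only slightly suspicious.
print(stop_threshold(1000, 1))
# When both kinds of error cost about the same, there is no cheap default.
print(stop_threshold(1, 1))
```

When one error is far cheaper than the other, the threshold slides toward the cheap side and the model can afford to be wrong often in that direction. When both errors are expensive, which is the situation described for security, there is no such safe side to lean on.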
In security, there's far less predictability - the type of applications used, tasks performed, locations from which work is conducted, times at which work is conducted, sizes and types of files, strategies for intruding on an organization, assets targeted within the org (doing it for the lulz, doing it for the bitcoin, doing it for the USDs, doing it to eavesdrop for years, doing it to steal IP once) are constantly changing. The baseline is transformation. Further complicating this, security systems are fully human-machine hybrids. False positives and false negatives cripple the humans in the loop, in a variety of very human ways. Self-driving cars can scale into fleets or self-rearranging squads as they begin to primarily interact with each other, with a smaller and smaller percentage of (less predictable) human input. Security workers and tools don’t and can’t work this way, so the returns on AI investment in security won’t get those compounding, fleet-like benefits either.
[Now, it could certainly be true that other systems to which AI has been applied are also more intertwined with humans than is ideal for AI, but here we're only talking about AI in security.]
4. An intractable problem: Security has weak feedback loops
In security, the people who make the AI are different from the people who apply the AI. This problem is not unique to security! It’s almost always true. I have driven several cars, but designed none. The crux of this difficulty is that security vendors - the people who design and develop many of the tools used in security - don't get to use them every day, at scale, and the people who DO use them every day, at scale, have little incentive to tell the vendors when they run into truly bad things. When they run into plain vanilla bad things, sure, those aren’t so embarrassing that they’re going to damage corporate value. Those regular bad things that vendors help companies spotlight in the massive data flows coursing through their corporate veins can be shared back to the vendors and may help tune certain algorithms to be more accurate in the future. Particularly when it comes to AI - which tends to become more accurate with training - the shorter the feedback loop between model usage and model tuning, the better. When the worst, most hidden and insidious threats are disconnected from the feedback loop by the need to protect the brand, the stock price, the shareholders, or the employees’ privacy, as they are in security, then model accuracy will suffer. This may be the right trade-off -- most would say it is -- but the point is that it is almost always impossible to prioritize model accuracy in the most difficult security investigations because the investigators and the model makers are unable to share information.
It's not just a training data problem. It's also hard to understand the relative costs of false positives and false negatives. For instance, does a SOC employee want to know about possible out-of-policy behaviors they can do nothing about and which may or may not be risky? Without being able to get feedback from a bunch of SOC employees in a bunch of different organizations about specific types of information, the tendency is to over-inform or make curation assumptions without the benefit of feedback from perspectives "on the ground". [This type of data-free context usually means the first, most simplified, loudest, most senior, or last voice heard in the room will win, regardless of validity. Ugh.]
With all these problems, why is AI even used in security at all? The seductions of AI are irresistible from a marketing perspective (and cybersecurity appears to be riddled with nervous copycats - if company X says they are "powered by AI", company Y will likely integrate AI into their marketing). Anything that promises a streamlined, technologically sophisticated solution is appealing to the target audience: exhausted SOC workers whose mental and physical health are being pulverized by an ongoing assault of alerts, alarms, and urgent meetings. Even though these SOC workers are highly skeptical of AI claims - and experienced security personnel are the ones who trained me - there is still enough value in using AI in the context of ‘street cyber crime’ and accidental lapses in cyber hygiene that it is reasonable to at least investigate new promises of AI superpowers from vendors. It's not that these workers are so downtrodden that they believe any AI promise hawked by the circus barkers at RSA. It’s that they know - we know - that some security companies will figure out how to scope problems correctly and use AI in ways that are net beneficial.
This brings me to the fifth problem:
5. Data hoarding feels like a good idea, but it’s complicated
In order to overcome some of the scarcity of feedback and the scarcity of sophisticated real-world exploits, many have been tempted to simply gather more data. First, I hope it is obvious that more data -- unless it’s more of the right kind of feedback -- is not likely to fundamentally move the needle in the right direction. Sure, having more successful supply chain attacks may help train more accurate models, but the goal is to keep the attacks to a minimum. Nobody wants more attacks.
Aside from data derived directly from threats, which I think we can all agree would be a net drain on companies, it also FEELS like having more data about reactions to threats should be beneficial.
But the benefits of data hoarding are not that straightforward.
Scanning all of employees’ email or videotaping everyone’s movements around the office 24/7 may lead to improvements in certain models. I am, in fact, all for letting Google scan my email to reduce spam. (Thank you, Google. I haven’t seen Viagra in my inbox for years.) But the details of these surveillance practices need to be balanced against corporate and employee privacy concerns. If there’s a reasonable likelihood that scanning email will be net beneficial, that everyone whose email is being scanned is aware of the scanning, and they have an alternative means of communication for super-sensitive communications (e.g. they can call their doctor on the phone or use a personal email account and avoid surveillance), then it may be justified. But we need to collectively move past assumptions that companies own and have a right to surveil every thought, movement, utterance, and keystroke emanating from their employees. Legally, corporations basically do own and have a right to control and surveil every aspect of their employees’ on-the-job and/or on-the-device behavior. This tension between employee surveillance and employee autonomy is not unique to security applications, but it comes up frequently in security settings and is often addressed in favor of security (over privacy). The introduction of AI and data science into the security realm has only tipped the balance of power more firmly towards surveillance, a situation that has remained largely unchecked by new privacy legislation in the US. (The EU is a little different, but not much.)
As seductive as it may be to work towards fitting AI models to security challenges, there are certain classes of problems that are better than others -- more common problems are better application spaces than novel or sophisticated threats -- and there are significant consequences to humans when AI enters the security context.
16 March 2021
From the desk of Laura Norén
A timely explainer for academic data scientists
Federated learning is a technique in which models can be trained and updated without pooling data in a central data store or otherwise creating onward exposures of the original data. Federated learning came out of Google in 2016 and was initially more widely used in industry than in academic data science, likely because it solved scaling and consumer preference problems that were not as common in academia. However, federated learning is now showing up in academic medical research and master’s-level teaching.
Assistant Professor Corey Arnold (UCLA) and his postdoc Karthik Sarma (UCLA/Caltech) are using federated learning to train diagnostic models on MRI images from several health care providers without having to remove the MRI image data from home data repositories.
In 2020, the University of Cambridge added a module on federated learning to their MPhil course in Advanced Computer Science; many data science, computer science, and AI/ML programs do not yet include federated learning.
A brief overview of federated learning
Where traditional data analysis and modeling first gather all the data into a central location, then run computations against that data to produce a model or models, federated learning leaves the data in decentralized locations, runs the computations in all of those decentralized locations, and sends only the model parameters to the central hub. The central hub then computes one federated model based on the many model estimates and sends that federated model back out to all the members of the federation. As new data become available within the decentralized members of the federation, the process re-runs to update the model. Any updates to the primary model are always available to members of the federation, which is excellent in situations where federation members may have sparse or infrequent data generation. (Note: there are several more complex versions of federated learning, including a direct node-to-node architecture with no central server.)
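The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production setup: the model is a one-variable linear regression, the member data are synthetic, and the hub combines member estimates with a sample-weighted average (a common choice usually called federated averaging; the description above does not say which combining rule a given system uses). All function names are made up for this example.

```python
import random


def local_update(params, data, lr=0.05, epochs=20):
    """Run a few gradient-descent steps on one member's private data.

    Only the updated (w, b) and a sample count leave this function;
    the raw data stays with the member.
    """
    w, b = params
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b), n


def federated_average(updates):
    """Hub step: average member parameters, weighted by sample count."""
    total = sum(n for _, n in updates)
    w = sum(p[0] * n for p, n in updates) / total
    b = sum(p[1] * n for p, n in updates) / total
    return (w, b)


def make_member_data(n, rng):
    """Synthetic private data drawn from y = 3x + 1 plus a little noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        data.append((x, 3 * x + 1 + rng.gauss(0, 0.1)))
    return data


rng = random.Random(0)
# Three hypothetical federation members with very different amounts of data.
members = [make_member_data(n, rng) for n in (20, 50, 200)]

global_params = (0.0, 0.0)
for _ in range(30):  # each round: send model out, train locally, average
    updates = [local_update(global_params, data) for data in members]
    global_params = federated_average(updates)

print(global_params)  # should land close to the true (3, 1)
```

Note that the member with only 20 data points ends up with the same shared model as the member with 200, which is the benefit for sparse-data members mentioned above.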
Federated learning was designed in the context of smartphone use cases where consumers prefer to keep their personal data on their phones but also want to have the latest phone-based models updated in near real time. For instance, auto-correct helped me spell ‘Billie Eilish’ and ‘Megan Thee Stallion’ with my thick thumbs while texting about the Grammys. 🎸
What are key benefits of federated learning?
Federated learning has two key benefits:
- Federated learning splits the computational load among many devices. If you are paying for computation, this is appealing. Programmers please note: battery etiquette asks that federated learning computations run only when devices are plugged in.
- Federated learning is more privacy-protecting because the data remain within their silos. Example: I want to spell Megan Thee Stallion’s name correctly but I may not want my thoughts about Megan Thee Stallion’s lyrics in a database somewhere. 🎤
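For programmers, the battery etiquette mentioned in the first benefit amounts to a filter applied before each training round. Here is a minimal sketch, assuming a hypothetical `Device` record with a `plugged_in` flag. Real deployments typically read power state from the operating system and may add further conditions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Device:
    device_id: str    # hypothetical identifier
    plugged_in: bool  # True when the device is on external power


def eligible_for_round(devices: List[Device]) -> List[Device]:
    """Keep only devices on external power, so training never drains a battery."""
    return [d for d in devices if d.plugged_in]


# Hypothetical fleet: only the plugged-in devices join this training round.
fleet = [
    Device("phone-a", True),
    Device("laptop-b", False),
    Device("phone-c", True),
]
selected = [d.device_id for d in eligible_for_round(fleet)]
print(selected)
```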
Should we be pleased about offloading compute costs to distant devices?
There is no definitive answer to this question, but there are a couple of common considerations. In cases where the entire network is within your AWS instance or Snowflake account, you’ll pay for all the computation anyway, but with federated learning you can at least assess how much each tenant’s or customer’s computations are costing your company or research grant. If you’re an app developer and some of the computations are happening on your customers’ phones, laptops, or IoT devices, the cost-control benefits are even more obvious, but the device type can limit how complex the computations can be. Phones and laptops are usually computationally sufficient and plugged in regularly, but it is challenging to run federated learning on solar-powered devices clipped to small bird species. 🦜
Does federated learning solve privacy protection?
Federated learning provides a technical strategy that allows a great deal of data to remain where the original owner can control access to it. This is good for privacy and security. However, there are a number of papers suggesting additional constraints to prevent model parameters from revealing sensitive information, even if the underlying data are kept on the local device. For instance, imagine a query language with 500 possible query terms that can be used by the 100 customers of an app. In this example, 80 percent of queries use only ten of the 500 available query terms. The other 490 terms are infrequently used. A matrix that represents each query term as an entity would be sparsely populated outside the top ten, so any customer who used a rare term frequently would produce a model coefficient for that term that differs substantially from everyone else’s, and that coefficient could be traced back to that customer’s organization. This could reveal sensitive corporate information (imagine a query term closely correlated with being acquired, scheduling a large layoff, or investigating a data breach).
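A toy simulation makes the rare-term concern concrete. This is only a sketch under loud assumptions: each customer's share of queries per term stands in for a real model coefficient, the traffic pattern is invented, and the sensitive term (317), the heavy user (customer 42), and the 30 extra queries are all hypothetical. Only the 500 terms, 100 customers, ten common terms, and 80 percent split come from the example above.

```python
NUM_TERMS = 500       # size of the query vocabulary (from the example above)
NUM_CUSTOMERS = 100   # customers in the federation (from the example above)
NUM_COMMON = 10       # terms 0-9 receive 80 percent of ordinary traffic


def local_term_weights(query_counts):
    """Stand-in for a local model: each term's share of one customer's queries."""
    total = sum(query_counts.values())
    weights = [0.0] * NUM_TERMS
    for term, count in query_counts.items():
        weights[term] = count / total
    return weights


def simulate_customer(customer_id):
    """Hypothetical traffic: 80 queries on common terms, 20 on scattered rare terms."""
    counts = {term: 8 for term in range(NUM_COMMON)}
    for k in range(20):
        rare = NUM_COMMON + (customer_id * 20 + k) % (NUM_TERMS - NUM_COMMON)
        counts[rare] = counts.get(rare, 0) + 1
    return counts


SENSITIVE_TERM = 317  # hypothetical rare term, e.g. one tied to breach investigations
HEAVY_USER = 42       # hypothetical customer who suddenly uses that term a lot

all_counts = [simulate_customer(c) for c in range(NUM_CUSTOMERS)]
all_counts[HEAVY_USER][SENSITIVE_TERM] = all_counts[HEAVY_USER].get(SENSITIVE_TERM, 0) + 30

# This is everything the hub receives: one parameter vector per customer.
received = [local_term_weights(counts) for counts in all_counts]


def most_unusual_rare_weight(params):
    """Return the (customer, term) whose rare-term weight sits furthest above the mean."""
    best, best_gap = None, 0.0
    for term in range(NUM_COMMON, NUM_TERMS):
        mean = sum(p[term] for p in params) / len(params)
        for customer, p in enumerate(params):
            gap = p[term] - mean
            if gap > best_gap:
                best, best_gap = (customer, term), gap
    return best


flagged = most_unusual_rare_weight(received)
print(flagged)
```

The hub never sees a single raw query, yet the parameters alone single out which customer is leaning on which rare term.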
All criticisms considered, federated learning is fundamentally a more privacy-preserving approach than strategies that allow less control over data for users, though there is still scope to build additional safeguards and strategies.
Could federated learning offer an end-run around privacy protections and corrode data guardianship?
From a tech ethics perspective, there are some legitimate concerns about which types of data may become available for training federated models that would not otherwise be available. For instance, medical data are generally protected by HIPAA and cannot be shared without explicit consent. If federation allows model training without data sharing, this raises important questions about whether federated learning could be used in applications that are either not net beneficial or that privilege and prioritize those who are already advantaged over those who are not. When medical data are used for research that is likely to be net beneficial and whose benefits are shared equitably across the population, federated learning is a tool for good. But if federated learning were used by private insurance companies to, say, decline to offer plans in states where access to health care is already challenging, it’s not clear that celebrating privacy preservation is the proper conversation to have about federated learning. In other words, just because a tool or technique is privacy preserving does not mean it is net beneficial or equitably beneficial.
If federated learning is used to avoid protections built under the heading “privacy” which were actually meant to serve broader ethical goals, there is reason to pay close attention to net benefit and equitable distribution of benefit. No technology should be presented as inherently ethical. Too often privacy-protecting technologies and applications are seen as de facto ethical or net beneficial.
In the academic applications of federated learning that I have seen, the net benefits are present and prominent.
Getting started with federated learning
👩‍💻 To do more:
- paperswithcode has 133 papers tagged with Federated Learning. A substantial number of these are software papers outlining tools available in open source repositories.
- Specifically check out: Flower, the open source FL package that the University of Cambridge is using to teach federated learning.
📚 To read more:
- Short and basic:
Brendan McMahan and Daniel Ramage. (2017) “Federated Learning: Collaborative Machine Learning without Centralized Training Data” Google AI Blog.
- Overview of federated learning in medical imaging applications:
Kaissis, G.A., Makowski, M.R., Rückert, D. et al. Secure, privacy-preserving and federated machine learning in medical imaging. Nat Mach Intell 2, 305–311 (2020). https://doi.org/10.1038/s42256-020-0186-1
- Solid, highly cited explainer on federated learning 101:
Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. (2019) Federated Machine Learning: Concept and Applications. ACM Trans. Intell. Syst. Technol. 10, 2, Article 12 (February 2019), 19 pages. https://doi.org/10.1145/3298981