2020 ADSA Annual Meeting Abstracts
Here you will find the abstracts for presentations, breakout sessions, demos, and lightning talks for the 2020 ADSA Annual Meeting. Abstracts are organized alphabetically by first author's last name. We recommend using your browser's find function to locate other authors or keywords.
Detecting biases in black-box AI models using inspirations from Psychology: a tutorial and applications in Python
Daniel Acuna, Syracuse University; Lizhen Liang, Syracuse University
There is no doubt that modern Artificial Intelligence (AI) has brought amazing progress to our everyday life. Recent incidents of biases in AI (e.g., racism and sexism) have raised concerns about hidden issues that widely-deployed systems might have. One reason for these problems is the opaque nature of AI agents, which makes it hard to interpret and monitor their decisions. In this tutorial, we will explain the main issues with bias in AI, current solutions, and proposals for fixing such biases. We will present our research to detect biases inspired by Psychophysics--a branch of Psychology aimed at understanding biases in humans. We will show Python code and demonstrate within Jupyter notebooks how to discover gender biases in Natural Language Processing and sentiment analysis. We will discuss other applications of our method. We will also present a small survey of other techniques and tools developed by companies and research groups aimed at detecting and fixing biases in modern AI systems.
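One way to picture a psychophysics-style probe of a black-box model is to present matched stimuli that differ in a single word and compare the responses. The sketch below is only an illustration under stated assumptions: `toy_sentiment`, its weights, and the template sentences are hypothetical stand-ins, not the presenters' method or any real sentiment model.

```python
from statistics import mean

# Hypothetical black-box sentiment model: a toy lexicon scorer with a
# deliberately planted bias so the probe has something to find.
TOY_WEIGHTS = {"great": 1.0, "terrible": -1.0, "she": -0.2, "he": 0.0}

def toy_sentiment(text):
    """Return the sum of word weights; unknown words count as 0."""
    return sum(TOY_WEIGHTS.get(word, 0.0) for word in text.lower().split())

def paired_bias(model, templates, word_a, word_b):
    """Mean score difference (word_a minus word_b) over matched sentence pairs."""
    diffs = [model(t.format(word_a)) - model(t.format(word_b)) for t in templates]
    return mean(diffs)

# Hypothetical stimuli: each template is identical except for the slotted word.
templates = ["{} is a great engineer", "{} gave a terrible talk", "{} arrived on time"]
gap = paired_bias(toy_sentiment, templates, "she", "he")
print(f"mean score gap (she - he): {gap:+.2f}")
```

Because the probe only calls `model(text)`, the same function could in principle wrap any scorer that maps a string to a number; a gap near zero would suggest the swapped word does not shift the output on these stimuli.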
Deploying JupyterHub-Ready Infrastructure with Terraform on AWS
Sebastian Alvis, University of Washington; Yuvi Panda, UC Berkeley; Scott Henderson, University of Washington
Cloud computing has the potential to be a powerful tool for open, reproducible, and scalable science. However, one of the barriers to cloud computing for scientists is the complexity of setting up infrastructure. Terraform addresses these problems by managing infrastructure as code and allowing the building process to be scripted. Using it allows us to interact with cloud providers' consoles less, make fewer mistakes, and have a living record of our infrastructure that we can give to collaborators to foster open science.
Successes and Challenges of Hosting Remote Unconferences
Anthony Arendt, University of Washington; Daniela Huppenkothen, University of Washington
Participant-driven workshops such as hackweeks have been providing unique opportunities for fostering data science education and community building. These events strive to provide inclusive spaces that support learning, with an emphasis on teamwork and collaboration. The global coronavirus pandemic has introduced new challenges: how do we continue to build communities when we can no longer meet in person? Since April 2020 we have been designing new ways to host these events remotely, and we have much to share. In this session we will report on lessons learned and gather your insights on what you have observed as we all navigate the transition to virtual exchanges. We will focus on how we can best replicate some of the social interactions, such as random exchanges at coffee breaks, that often spur new connections and innovations. We will also plan for how these events might continue to incorporate some of the advantages of remote work, such as increased accessibility, as we transition back to in-person activities.
Debiasing Knowledge Graphs
Chaitan Baru, National Science Foundation; Krzysztof Janowicz, UC Santa Barbara; Michael Cafarella, University of Michigan
Knowledge Graphs use a combination of scalable technologies, specifications, and data cultures to represent densely interconnected statements derived from structured or unstructured data sources across domains in a human- and machine-readable and reasonable way. These graphs play an increasing role in AI and machine learning research where they serve as a source for question answering models or for extracting contextual features for downstream tasks such as predictive analytics, trading, and data management. To improve the quality of these graphs, researchers have also developed methods for predicting missing links or predicting future changes/revisions.
Interdisciplinary programs and courses in data science
Amanda Beecher, Ramapo College of New Jersey; Angela Berardinelli, University of North Carolina at Charlotte; Andrea Pitts, University of North Carolina at Charlotte; Stephanie Moller, University of North Carolina at Charlotte; Scott Frees, Ramapo College of New Jersey
We invite any data science educator or professional interested in discussing the challenges and benefits of the interdisciplinary nature of data science education to participate in this session. The facilitators will discuss work at the course and programmatic levels that features multiple disciplinary components to showcase the interdisciplinarity of data science. Specifically, one talk will discuss a new first-semester undergraduate data science course for majors and non-majors that integrates at least four different disciplines into one six-credit-hour studio-style course. The other will feature the challenges and benefits of launching two interdisciplinary data science degree programs simultaneously. We will present the highlights from our course design process and initial insights on these efforts as they are in progress at the time of the meeting. After our short discussions of our efforts, we will open the floor for questions, suggestions, experiences, and discussion from all participants.
Growing Diversity in Data Science with a Hands-On Deep Learning Workshop for Students Exploring Racial Bias in Facial Recognition
Charreau Bell, Vanderbilt University; Jesse Spencer-Smith, Vanderbilt University
Improving representation of Black students and students of color in Data Science begins with building interest and motivating students early in their academic careers. We present the details of a hands-on deep learning workshop where students train a facial recognition system that displays bias due to improper training sets, with disproportionately poor performance on non-white faces. Students learn DS skills, the importance of training sets, the dangers of misuse of technology, and approaches to fairness in AI. The workshop is appropriate for students with minimal coding experience. Code and materials for the workshop will be shared.
Career Development Network Welcome and Mixer
Career Development Network, Executive Committee
The Career Development Network (CDN) is an ADSA community that supports the professional advancement and growth of developing data scientists. Originally started by alumni of the Moore-Sloan Data Science Environments, the group now includes early career members from a variety of institutions and disciplinary backgrounds. This welcome and mixer session is open to all current members of the CDN as well as those who are interested in learning more about our goals and activities. We will introduce the current executive committee, describe our four major initiatives, and describe ways to get involved in our various activities. After this brief introduction, we will have a virtual “mixer” using Remo.co to allow members to catch up and meet new colleagues in an informal setting.
Career Development Network Activity Breakouts
Career Development Network, Executive Committee
This breakout session continues the discussion started during the Career Development Network (CDN) welcome and mixer session on Thursday. This session is open to all current members of the CDN as well as newly interested individuals who wish to join the CDN for the coming year. After a brief introduction, attendees will divide into four breakout rooms centered on our four major initiative areas: skill sharing, mentoring, communication, and grants. Each breakout will be hosted by a current member of the executive committee, who will lead the group in brainstorming ideas and making plans for the network’s activities in the coming year.
ADSA Ethics Working Group Products: the Ethos Lifecycle Interactive Tool and Research Administration Guidelines
Cathryn Carson, University of California, Berkeley; Krystal S. Tsosie, Native BioData Consortium
ADSA sees ethical practice and social responsibility as fundamental to data science. As researchers and educators, we have an obligation to shape perspectives and tools that do justice to our complex, changing world. The ADSA Ethics Working Group has developed two interventions that it is presenting for review and comment. First: A re-imagining of the Data Science lifecycle (depicting the steps of a data science project) for practitioners, teachers, and students, incorporating ethical considerations at each stage of the data science workflow. Second: A suite of recommendations for research administrators seeking to bring an ethical framing to data science research. In addition to a general discussion, session participants will be asked to choose one product, review it in advance of the session, and provide feedback in a structured breakout setting.
Data Science Course in a Box
Mine Cetinkaya-Rundel, University of Edinburgh
Data Science in a Box (datasciencebox.org) is an open-source project that aims to equip educators with concrete information on content and infrastructure for designing and painlessly running a semester-long modern introductory data science course with R. In this talk we outline five guiding pedagogical principles that underlie the choice of topics and concepts introduced in the course as well as their ordering, highlight a sample of examples and assignments that demonstrate how the pedagogy is put into action, introduce `dsbox` -- the companion R package for datasets used in the course as well as interactive tutorials, and share sample student work and feedback. We will also walk through a quick start guide for faculty interested in using all or some of these resources in their teaching.
Geospatial Data for Political Redistricting Analysis
Daryl Deford, Washington State University
The problem of constructing reasonable political districts and the related problem of detecting intentional gerrymandering have received a significant amount of attention in recent years. A key problem in this area is determining the expected properties of a representative districting plan as a function of the input geographic and demographic data. A natural approach is to generate a comparison ensemble of plans using Markov chain methods, and I will present successful applications of this approach in both court cases and legislative reform efforts. In addition to the difficulties inherent in obtaining accurate descriptions of current precinct boundaries, this analysis requires making decisions about which units to connect across water boundaries and how to associate demographic data defined on incompatible units, all of which can impact the properties of the sampling methods. Throughout the talk, I will present examples of the ways these modeling decisions can impact the resulting distributions over districting plans.
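To make the idea of a comparison ensemble concrete, the following is a minimal sketch of a single-precinct "flip" random walk over two-district plans. Everything in it is an assumption chosen for illustration (a 4-by-4 grid of equal-population precincts, the helper names, the balance tolerance); it is not the sampler used in the work described, and this naive walk makes no claim about which distribution over plans it targets.

```python
import random

def grid_graph(n):
    """Adjacency dict for an n-by-n grid of unit-population precincts."""
    nodes = [(r, c) for r in range(n) for c in range(n)]
    return {
        (r, c): [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                 if 0 <= r + dr < n and 0 <= c + dc < n]
        for r, c in nodes
    }

def is_connected(graph, members):
    """Check by depth-first search that `members` induce a connected subgraph."""
    members = set(members)
    if not members:
        return False
    stack = [next(iter(members))]
    seen = set(stack)
    while stack:
        for nbr in graph[stack.pop()]:
            if nbr in members and nbr not in seen:
                seen.add(nbr)
                stack.append(nbr)
    return seen == members

def cut_edges(graph, plan):
    """Number of adjacent precinct pairs assigned to different districts."""
    return sum(plan[u] != plan[v] for u in graph for v in graph[u]) // 2

def flip_step(graph, plan, max_imbalance, rng):
    """Propose moving one boundary precinct to a neighbouring district."""
    boundary = [u for u in graph if any(plan[v] != plan[u] for v in graph[u])]
    node = rng.choice(boundary)
    new_district = rng.choice([plan[v] for v in graph[node] if plan[v] != plan[node]])
    proposal = dict(plan)
    proposal[node] = new_district
    sizes = [sum(1 for d in proposal.values() if d == k) for k in (0, 1)]
    valid = abs(sizes[0] - sizes[1]) <= max_imbalance and all(
        is_connected(graph, [u for u in graph if proposal[u] == k]) for k in (0, 1)
    )
    return proposal if valid else plan  # reject invalid proposals

rng = random.Random(0)
graph = grid_graph(4)
plan = {(r, c): 0 if c < 2 else 1 for r, c in graph}  # initial left/right split
ensemble_stats = []
for _ in range(2000):
    plan = flip_step(graph, plan, max_imbalance=2, rng=rng)
    ensemble_stats.append(cut_edges(graph, plan))
print("cut edges: min", min(ensemble_stats), "max", max(ensemble_stats))
```

The recorded statistic (cut edges) stands in for whatever property of a plan is of interest; an enacted plan would then be compared against the distribution of that statistic across the ensemble. Choices such as which precincts count as adjacent enter through `graph`, which is one place the modeling decisions mentioned above would show up.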
An Overview of and Lessons Learned from Hosting a Data Science Workshop Series for Undergraduate Students from Under-Represented Backgrounds
Orianna Demasi, UC Davis; Stacy Dorton, UC Berkeley; Sara Stoudt, UC Berkeley
Data Science is not practiced by a population that is as diverse as and thus representative of the general population. This is problematic, and many efforts will be needed to level the playing field and attract, train, employ, and celebrate a diverse data science workforce. We, a working group at BIDS, wanted to take steps within our sphere of influence to better understand the experience of students and barriers to a more diverse field of data science. Knowing that the data science world at UC Berkeley is a vast one, spread across various departments, organizations, and initiatives, we also wanted to lessen some of the barriers to entry and support students from under-represented groups who were already pursuing data science by helping them navigate the resources available to them. As a preliminary effort we organized a series of informal workshops aimed at democratizing information that may be needed or helpful for undergraduate students from under-represented backgrounds to find their place and flourish in data science. We modeled our effort off of other efforts that have successfully celebrated diversity in computing but implemented them on a smaller, more local scale. In this talk, we will discuss how student experiences guided us to decide upon this effort, the structure of the workshop series, the experience and extent of participation, and reflections on lessons learned from what worked (and what didn't). Through sharing our experience, we hope to continue a conversation about building upon this effort and improving future efforts.
Using Mask-RCNN, a deep learning approach, to track the AOI in egocentric computer vision research
Tingyan Deng, Vanderbilt University; Alexander Langerman; Benoit Dawant, Vanderbilt University
There is increasing professional, societal and legislative interest in video records in surgery, yet there is no existing solution for automating the capture of high-resolution videos for the majority of surgical procedures. As part of an ongoing VISE-affiliated collaboration between VISE faculty and a VUMC surgeon, the Surgical Analytics Lab has prototyped a novel surgeon-worn video platform that provides an unobstructed view of the surgical field; a key consideration now is to develop methods to identify where the surgical "action" (area of interest; AOI) is taking place in the frame of these videos. This is important for developing tracking mechanisms that can automatically adjust the camera to maintain vigilance in the surgical field despite the natural body movement of the wearer. The goal of my project is to use deep learning to develop an automatic AOI detection system.
The Urban Observatory: Better Cities Through Imaging
Greg Dobler, University of Delaware
With millions of interacting people and hundreds of governing agencies, urban environments are the largest, most dynamic, and most complex macroscopic systems on Earth. The interaction between the three fundamental components of that system (human, natural, and built) can be studied much like any physical system, with observation and application of physical principles to the collection and analysis of that data. I will describe how persistent, synoptic imaging of an urban skyline can be used to better understand the urban system, in analogy to the way persistent, synoptic imaging of the sky can be used to better understand the heavens. At the "Urban Observatory", a multi-city facility consisting of a network of observational platforms, we are combining techniques from the domains of astronomy, physics, computer vision, remote sensing, and machine learning to address a myriad of questions related to urban informatics. I will demonstrate the power of these techniques when data from the Urban Observatory is fused with publicly available records and in situ sensing data to provide new insights into cities as living organisms that consume energy, have environmental impact, and display characteristic patterns of life, and how that new understanding can be used to improve city functioning and quality of life for its inhabitants.
Natural language processing and literature mining for novel discoveries from scientific literature
Orion Dollar, University of Washington; Chowdhury Ashraf, University of Washington; David Beck, University of Washington
The number of scientific articles published each year is growing exponentially and we will soon reach the point where it may take a single researcher a lifetime of reading to simply catch themselves up on the available body of literature for a given topic. It is important that we realize a means to distill this information so we do not waste countless hours of research. Natural language processing (NLP) represents a promising path forward to achieve this goal. Within the last several years, high profile successes have demonstrated that NLP methods when applied to scientific literature are capable of discovering new thermoelectric materials, identifying synthesis parameters and building vast, queryable databases. Furthermore, some models have even shown they are capable of extracting latent information that is not explicitly written in text but can be inferred from the surrounding context. This suggests that summarization, classification, property prediction and discovery are all achievable outcomes with the right selection of model. In this work, we present an NLP learning framework as applied to the use case of corrosion inhibition. After collecting and processing over 1.3 M abstracts from Elsevier, we show that our model can effectively identify known corrosion inhibitors and provide us with a promising set of novel candidates for further testing. We also explore the effects of normalization and model complexity on the results and present a unified pipeline that can be implemented for any set of abstracts or desired property.
Troubles in/with Text: Finetuning NLP to Analyze Declassified Rationalizations of Rights Violations
Sarah Dreier, University of Washington; Emily Gade, Emory University
How do governments rationalize policies that violate the rights of their citizens? To answer this question, researchers might consult classified government correspondence documents that are later publicly released. Analyzing archival collections is a challenging task, requiring experts to systematically code distinct rationalization categories and then qualitatively, quantitatively, and/or computationally evaluate hypotheses. The challenges are compounded by the nature of the data, which in many cases is not only unstructured but also imperfectly digitized text. New advances in automation for text-related tasks, originating in the computing field of natural language processing (NLP), offer potential improvements for scaling qualitative analysis and dealing with "noisy" data. Here, we use a novel data source (recently declassified archives of the UK Prime Minister's correspondence during the "Troubles in Northern Ireland") and assess the effectiveness of emerging techniques from NLP in identifying and categorizing rationalizations for human rights violations in that collection. We show that NLP's pretrain-and-finetune approach to analyzing text (specifically the BERT-based family of models) improves a model's ability to classify complex, specific concepts, like "rationalizations" for human rights violations. We close by identifying the promises and existing challenges represented by NLP's pretrain-and-finetune approach for analyzing political science concepts.
Moderating with the Mob: Crowd Sourced Fact Checking
William Godel, New York University; Zeve Sanderson, New York University; Kevin Aslett, New York University
The quantity of news and news-like material on the internet now vastly exceeds the ability of fact checking organizations to address it in real time. These challenges are especially acute on social media platforms, where the volume and velocity of information diffusion far outpace content moderation efforts. In this paper, we utilize a novel dataset of over 13,000 crowd sourced evaluations of recently published articles to see if crowds of laypeople can successfully discriminate between true and fake news. Utilizing both transparent rules-based methods and machine learning, we find that crowds can be used to identify fake news, but that there are significant tradeoffs between transparency, representativeness, and accuracy, as well as significant limitations depending on the task at hand.
Strategizing COVID-19 Lockdowns Using Mobility Patterns
Leila Hedayatifar, New England Complex Systems Institute; Yaneer Bar-Yam, New England Complex Systems Institute; Olga Buchel, New England Complex Systems Institute
During the COVID-19 pandemic, countries and states/provinces have tried to keep their territories safe by isolating themselves from others by limiting non-essential travel and imposing mandatory quarantines for travelers. While large-scale quarantine has been the most successful short-term policy, it is unsustainable over long periods as it exerts enormous economic costs. Countries which have been able to partially control the spread of COVID-19 are thinking about policies to reopen businesses. However, pandemic experts strongly warn against reopening too soon. Thus, it is urgent to consider a flexible policy that limits transmission without requiring national scale quarantines. Here, we have designed a multi-level quarantine process based on the mobility patterns of individuals and the severity of COVID-19 contagion in different areas. By identifying the natural boundaries of social mobility, policy makers can impose travel restrictions that are minimally disruptive of social and economic activity. The dynamics of social fragmentation during the COVID-19 outbreak are analyzed by applying the Louvain method with modularity optimization to the weekly mobility networks. In a multi-scale community detection process, using the locations of confirmed cases, natural break points as well as high risk areas for contagion are identified. At the smaller scales, for communities with a higher number of confirmed cases, contact tracing and associated quarantine policies are increasingly important and can be informed by the community structure.
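The Louvain method searches for a partition of a network that maximizes modularity. The sketch below does not implement Louvain itself; it only evaluates the modularity objective for two candidate partitions of a toy mobility network, to show why a weakly linked pair of regions scores better as two communities. The region names and trip counts are hypothetical.

```python
def modularity(edges, community_of):
    """Weighted modularity Q of a partition of an undirected graph.

    edges: dict mapping (u, v) pairs to non-negative weights (no self-loops).
    community_of: dict mapping each node to a community label.
    """
    total = sum(edges.values())  # m, the total edge weight
    internal = {}   # weight of edges inside each community
    degree = {}     # summed weighted degree of each community
    for (u, v), w in edges.items():
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0.0) + w
        degree[cv] = degree.get(cv, 0.0) + w
        if cu == cv:
            internal[cu] = internal.get(cu, 0.0) + w
    # Q = sum over communities of (internal weight / m) - (degree / 2m)^2
    return sum(
        internal.get(c, 0.0) / total - (degree[c] / (2 * total)) ** 2
        for c in degree
    )

# Hypothetical weekly trip counts between six regions: two tightly knit
# triangles joined by a single weak link (C-D).
trips = {
    ("A", "B"): 10, ("B", "C"): 10, ("A", "C"): 10,
    ("D", "E"): 10, ("E", "F"): 10, ("D", "F"): 10,
    ("C", "D"): 1,
}
split = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
merged = {node: 0 for node in split}

q_split = modularity(trips, split)
q_merged = modularity(trips, merged)
print(f"Q(two communities) = {q_split:.3f}, Q(one community) = {q_merged:.3f}")
```

In this toy case the weak C-D link is the kind of natural break point the abstract refers to: cutting it loses little mobility while separating the two groups.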
Application of a Robust Framework for neuroImaging Based Experimental Routines (FIBER) for Integrating Domain and Data Science
Hawley Helmbrecht, University of Washington; Mengying Zhang, University of Washington; Elizabeth Nance, University of Washington
The modern neuroimaging experiment produces more data than can be analyzed using traditional techniques. Data science is a necessity to efficiently process and analyze data generated from imaging-based experiments. However, to synthesize experimentalists with an effective data science strategy, a robust method for developing well-designed, data informed experimental studies must be followed. To effectively integrate data science and experimentalists, we developed a framework for neuroimaging based experimental routines (FIBER). The objective of this project is to demonstrate our implementation of the FIBER framework from the initial stages of an experiment to the final results. We applied FIBER to investigate morphological changes in microglia in the oxygen-glucose deprived brain. We applied FIBER by: (1) developing a data awareness, (2) designing a data management plan, (3) determining an optimal experimental pipeline, (4) building out supporting data science infrastructure, (5) performing primary and supplemental imaging, and (6) producing interpretable visualizations of results. We successfully implemented this plan through open communication, facilitated discussions, data management conversations, and domain science feedback with our experimentalists. The FIBER framework provided a valuable foundation for collaboration. Due to FIBER implementation in this project, the integrated data science and experimentalist team achieved results that showed (1) distribution of representative microglia phenotypes, (2) deviance from the normal control phenotypic state, and (3) circularity of computer segmented cells in response to injury and treatment experimental groups. Our application of FIBER allowed for effective collaboration between data and domain science while providing scientific insight into cell morphology in the brain.
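The abstract does not say how circularity was computed. One common definition is 4πA/P², where A is the area and P the perimeter of a segmented outline; it equals 1 for a circle and falls toward 0 for elongated or branched shapes. The sketch below assumes that definition and assumes each cell outline is available as a simple polygon; the function name and example shapes are hypothetical.

```python
import math

def circularity(vertices):
    """4*pi*area / perimeter**2 for a simple polygon given as (x, y) vertices."""
    n = len(vertices)
    area2 = 0.0       # twice the signed area (shoelace formula)
    perimeter = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        area2 += x0 * y1 - x1 * y0
        perimeter += math.hypot(x1 - x0, y1 - y0)
    return 4 * math.pi * (abs(area2) / 2) / perimeter ** 2

# Hypothetical outlines: a near-circular shape, a square, and a thin rectangle.
near_circle = [(math.cos(2 * math.pi * k / 360), math.sin(2 * math.pi * k / 360))
               for k in range(360)]
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
thin_rectangle = [(0, 0), (10, 0), (10, 1), (0, 1)]

for name, shape in [("near-circle", near_circle), ("square", square),
                    ("thin rectangle", thin_rectangle)]:
    print(f"{name}: {circularity(shape):.3f}")
```

On pixel-based segmentations the perimeter estimate depends on how the boundary is traced, so values from a real pipeline would not be directly comparable to these idealized polygons.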
Machine Learning for Noise Correction and Classification Strategies for Astronomical Sources
Nina Hernitschek, Vanderbilt University
Nowadays large-scale astronomical surveys enable us to classify astronomical objects based on their time-domain variability. An often forgotten but very important step is reliably removing systematic noise from lightcurves while keeping their overall characteristics. I will present methods successfully applied to data from the TESS (Transiting Exoplanet Survey Satellite) spacecraft, as well as machine-learning strategies applied to classify variable sources. I will also illustrate the widely differing requirements among different astronomical surveys by comparing the algorithms used for the single-band, high-cadence TESS survey versus the multi-band, sparse-cadence PS1 (Pan-STARRS1) survey.
Diversity, Equity and Inclusion - The Path Forward in Data Science
Florence Hudson, Columbia University; Jeannette Wing, Columbia University; Carly Strasser, Chan Zuckerberg Initiative; Kate Hertweck, Fred Hutchinson Cancer Research Center; Lauren Wolfe, Fred Hutchinson Cancer Research Center
ADSA has begun a concerted effort to increase awareness and improve our individual and combined efforts and results to address racism, lack of inclusivity, and pro-active outreach to build a diverse, equitable and inclusive data science community. Join this Birds of a Feather discussion to brainstorm together what steps we can take to improve diversity, equity and inclusion (DEI) in participation in the data science community, and the ethics of data science and its application to solve scientific and societal challenges today and into the future.
Rapid Response Data Science: Frameworks For Effective Mobilization During Crises
Eric Kolaczyk, Boston University; Jing Liu, University of Michigan; Meredith Lee, University of California, Berkeley and NSF West Big Data Hub
The COVID-19 pandemic has surfaced a spectrum of data science challenges and opportunities. Despite demonstrating strengths in generating new data, models, and outputs, the community has found itself insufficiently prepared to contribute in a coordinated fashion to the broader effort, across regional, national, and global scales. The magnitude of this pandemic, and other widespread crises, require more from science than relatively isolated contributions and impromptu consortia. There is a pressing need for a convergence of agility and coordinated capacity, with academia, government, industry, and community organizations, each bringing complementary resources to bear. A networked approach means standing up shared infrastructure, preparedness training, teams that act interoperably with swift coordination, and efficient and effective communication channels between and among academics and the private and public sectors. To prepare ourselves for the next inevitable crisis, the data science community needs a reliable, relevant, and cohesive data science rapid response network. This joint session of the ADSA Annual Meeting and the Data Science Leadership Summit will act as a launching point for partnership commitments and future coordination to establish the foundation and strategic path toward a data science rapid response to crises.
Panelists: Ran Canetti, Director of the Center for Reliable Information Systems and Cyber Security, Boston University; Lara Campbell, Program Director for the NSF Convergence Accelerator; Emma Spiro, co-founder of the Center for an Informed Public at the University of Washington; Nicolette Louissaint, Executive Director and President of Healthcare Ready; Member of FEMA National Advisory Council
Challenges of Data Science in Times of Political Crisis in Developing Countries
JoseManuel Magallanes, Pontificia Universidad Catolica del Peru/University of Washington
I discuss the challenges of data science from a political advisor's perspective in the time of COVID-19. Political decision makers are in a constant hurry to make decisions, and political advisors lack strong evidence to support those decisions, but advice is needed nevertheless. Novelty is inspiring in research, but lack of knowledge of a particular problem, like COVID-19 dynamics, forces many assumptions that may bring instant palliative remedies but unintended consequences in the long run. The case is based on Peru, an example of a country with many outstanding macroeconomic characteristics, but with a State that lacks coordination within each subnational level of government and among them. In such situations, data science can be applied in theory, but a crisis requires multi-sector and multilevel data that are not available, revealing that most applications of data science rest on strong assumptions about data availability that do not hold in reality.
Locating New Social Work Responses to the COVID-19 Crisis Using NLP and Topic Analysis
Kobe Mike, Technion; Lea Levin, Tel Aviv University
The recent COVID-19 pandemic has exposed and amplified many of the acute social problems plaguing Israeli society as a whole, and specifically with regard to its most vulnerable populations. In Israel, as in many other countries, social workers are among the main entities charged with responding to such problems. Current responses are based on theories and professional experience gathered prior to the pandemic, which do not necessarily fit the unique situation that is unfolding. Our aim is to uncover the basic elements of the changing social work discourse underlying current interventions, in order to generate initial understandings regarding prevailing ways to respond to the ongoing crisis. Such understandings are crucial to social workers' ability to provide relevant assistance to their beneficiaries in a rapidly changing social, economic and health climate. They will be locally contextualized, but will be of potential international interest as well. Our project is rooted in an interdisciplinary collaboration of researchers from social and data sciences, and will be performed on closed professional groups and forums used by social workers in Israel to exchange insights, conceptualizations and dilemmas deriving from their daily practice. Our plan is to uncover the evolution of the topics and themes that characterize discussions in these platforms, using topic analysis and NLP. As data is highly sensitive, we have yet to obtain access to it. We are looking for more ideas regarding data sources and analysis algorithms.
The Effect of Binary Black holes on Primordial Gravitational Waves
Murti Nauth, Fisk / Vanderbilt
The Laser Interferometer Space Antenna (LISA) is a gravitational wave mission expected to observe supermassive black hole mergers from the present day to before reionization, providing a set of standard "sirens" with accurate distances that span the universe and are unperturbed by dust. As such, LISA observations hold great promise in constraining cosmology, such as non-gaussianity from inflation and the existence of dark radiation.
Many of these constraints, however, rely on the supposition that the supermassive black hole mergers themselves are non-biased tracers of the matter distribution, as are LISA observations of these mergers. We use large-volume cosmological gravo-magnetohydrodynamic simulations with supermassive black hole physics to measure the three-point correlation function of binary black hole mergers in the universe, and present results on the potential for biased structure measurements from the LISA data stream.
Teaching and Researching Data Science with ISLE
Rebecca Nugent, Carnegie Mellon University; Philipp Burckhardt, Carnegie Mellon University
The first wave of Data Science courses, certificates, and programs relied heavily on a foundation of already existing computing-oriented classes; less effort was spent on how people from diverse backgrounds and disciplines might approach working with data or, for example, how to optimally structure data science workflows or collaborations. To better navigate the pedagogical challenges of teaching both STEM and non-STEM populations while researching "data science as a science", we developed ISLE (Interactive Statistics Learning Environment; http://www.stat.cmu.edu/isle). ISLE is a browser-based interactive platform that, if desired, removes the computing cognitive load and lets users explore Statistics & Data Science in both structured and unstructured ways. The platform integrates data analysis, writing, group collaboration, and reproducible workflows. ISLE can be used in both remote and in-person settings and supports adaptive, interactive lectures, labs, and real-time feedback for any discipline. Currently ISLE is being used by hundreds of students in courses ranging from introductory through graduate level methodology and in executive education. On the back-end, we track and analyze every click, word, and action during data analysis, writing, and/or collaborating as well as how student engagement and performance vary with lesson/lecture structure, supporting extensive research on how to teach/do data science. In this demo, we will cover an introduction to ISLE and its capabilities with some focus on group collaboration and interactive (remote) lectures before moving to an interactive Q&A where participants will be able to explore ISLE and brainstorm their own research questions.
Data Science Support through Graduate Fellowship Programs
Jeffery Oliver, University of Arizona
College campuses are facing increasing demands for data science support, and those needs often go unmet. At the University of Arizona, the Data Science Ambassadors program was developed to help meet the growing demands for data science training across campus. The program provides accessible support within individual colleges by capitalizing on domain knowledge and data science expertise in the graduate student population. Graduate students' ability to "speak the same language" as researchers in their respective colleges reduces the communication barrier many researchers face when learning data science applications. Two years on, the program assists researchers and educators in navigating data science campus resources and provides discipline-specific data science training.
Foundations of Data Science: The TRIPODS Experience
Abel Rodriguez, UC Santa Cruz
The panel / research update will highlight the work on Foundations of Data Science carried out under the first round of NSF-funded Phase I TRIPODS institutes. In addition to discussion/presentation of some technical topics that have come to form the core "foundation", the presenters will discuss the benefits and challenges associated with this effort, as well as highlight lines of interaction with other data science areas. Additional panelists: Xiaoming Huo, Georgia Tech; Michael Jordan, UC Berkeley; and Helen Zhang, University of Arizona.
A Comparison of Practitioner/Industry Surveys and Published Data Science Curricula
Karl Schmitt; Ruth Wertz, Valparaiso University; Linda Clark, Brown University
During data science's emergence as a distinct discipline, there have been fraught discussions about what exactly constitutes data science. These conversations have been exacerbated by the lack of a single clear parent discipline. This has led several computational sciences to attempt to claim data science, and has led to the creation of documents defining data science, including recent work by the ACM Data Science Task Force and publications in the Harvard Data Science Review. The EDISON Project from the European Union offers the most complete effort at scoping data science curricula with its Data Science Body of Knowledge (DS-BoK) and Competency Framework (CF-DS). This talk presents an additional perspective by taking a critical look at how EDISON's CF-DS compares to instructor and industry experiences. Their views were collected through a broadly administered survey. Our results and analysis provide important insights for those currently working to formalize the discipline and for any college or university looking to build its own undergraduate degree.
An AI System for Predicting the Deterioration of Patients with COVID-19 in the Emergency Department
Farah Shamout, New York University; Krzysztof Geras, New York University; Carlos Fernandez-Granda, New York University
There is a pressing need to identify deterioration amongst patients with COVID-19 in order to avoid life-threatening adverse events. Chest radiographs are frequently collected from patients presenting with COVID-19 upon arrival to the emergency department, since they are considered a first-line triage tool and the disease primarily manifests as a respiratory illness. In this talk, I will discuss the AI prognosis system we developed using data collected at NYU Langone Health to predict in-hospital deterioration, defined as the occurrence of intubation, mortality, or ICU admission. In particular, our system consists of an ensemble of an interpretable deep learning model to learn from chest X-ray images and a gradient boosting model to learn from routinely collected clinical variables, e.g. vital signs and laboratory tests. The system also computes deterioration risk curves to summarise how the risk is expected to evolve over time. The results of retrospective validation on the held-out test set, the reader study, and silent deployment in the hospital infrastructure highlight the promise of our AI system in assisting front-line workers through real-time assessment of prognosis.
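As a rough illustration of the ensemble idea described above, the sketch below combines per-patient risk scores from two models by weighted averaging. The abstract does not say how the two models' outputs are combined, so the averaging rule, the function name `ensemble_risk`, the `image_weight` parameter, and the example scores are all assumptions made for illustration only.

```python
from typing import List, Sequence


def ensemble_risk(image_risk: Sequence[float],
                  clinical_risk: Sequence[float],
                  image_weight: float = 0.5) -> List[float]:
    """Combine per-patient risks from two models by weighted averaging.

    Weighted averaging is an assumed combination rule, not necessarily
    the one used by the system described in the abstract.
    """
    if len(image_risk) != len(clinical_risk):
        raise ValueError("both models must score the same patients")
    if not 0.0 <= image_weight <= 1.0:
        raise ValueError("image_weight must lie in [0, 1]")
    return [image_weight * a + (1.0 - image_weight) * b
            for a, b in zip(image_risk, clinical_risk)]


# Hypothetical predicted probabilities of deterioration for three patients:
# one list from an image-based model, one from a clinical-variables model.
image_model_risk = [0.25, 0.75, 0.5]
clinical_model_risk = [0.75, 0.25, 0.5]

combined = ensemble_risk(image_model_risk, clinical_model_risk)
print(combined)
```

With equal weights the two models contribute equally to each patient's combined score; in practice the weight would presumably be chosen on validation data.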
From Data Sovereignty to Data Science: Implications for American Indian Self-Determination
Lea Shanley, University of Wisconsin-Madison; James Rattling Leaf Sr., University of Colorado-Boulder and Principal, Rattling Leaf Consulting LLC in Black Hawk, South Dakota.
American Indian Tribes have long been concerned about ownership, control, and access to data about their lands, resources, communities, and families. The development of sophisticated technologies for data collection and analysis, such as high-resolution remote sensing, GIS and machine learning, has heightened these concerns. Tribal concerns about the potential misuse of data include, but are not limited to: infringement on individual and group privacy; misappropriation of intellectual property and its use for commercial gain; misinterpretation or discrediting of cultural practices; abrogation of treaty rights; and the impact on the Federal-Tribal Trust relationship. Protecting tribal data from potential misuse, while at the same time ensuring access for tribes and their members, will require a creative combination of technical, legal, policy, and organizational research and solutions. Several initiatives have begun to tackle these issues head on, including the development of the CARE Principles for Indigenous Data Governance, the launch of IEEE P2890: Recommended Practice for Provenance of Indigenous Peoples Data, the international growth of the Indigenous Data Sovereignty networks including the Global Indigenous Data Alliance, and the rise of Indigenous biological and data repositories (or “biobanks”) such as the Native BioData Consortium. Tribal governments also will strengthen their sovereignty by building capacity to understand and use these emerging technologies to their own advantage. Data science is the new frontier in Self-Determination. Additional panelists: Joseph Robertson, Mato Ohitika Analytics LLC; Krystal Tsosie, Vanderbilt University and co-Founder, Native BioData Consortium; Randall Akee, Associate Professor, University of California-Los Angeles
Comparing Word Recognition by Humans and Deep Neural Networks and Application of Understanding Dyslexia
Elena Sizikova, New York University
We compare word recognition by deep neural networks (DNN) and humans, asking whether the effects of increased pooling in the network can model crowding in human vision. We study efficiency (ability to recognize words in noise) and crowding (ability to withstand clutter) of the network on word recognition. To measure efficiency, we assess the network's performance in recognizing random 4-letter words in mono-space font at various contrast levels on a white noise background, and find that the network has a lower efficiency than the human observer. The letter crowding phenomenon in human vision results in a minimum threshold spacing, independent of letter size. We measure word recognition accuracy as a function of letter size and spacing, and find different crowding patterns in humans and neural networks.
Communicating with Data: how and where does it fit in the data science curriculum?
Sara Stoudt, Smith College; Deborah Nolan, UC Berkeley
Science communication is becoming increasingly valued as a way to make technical work more accessible to a broader audience. Data science is no exception. As statisticians and data scientists, it is important to write about data insights in a way that is both compelling and faithful to the data. However, formal training in this is often lacking. Our proposed breakout session will cover lessons learned from a course we developed and taught where students learned to write for a variety of genres, from technical articles to news stories and blog posts. We will then facilitate a discussion about where statistical writing fits in the data science curriculum and discuss strategies for implementing a statistical writing course versus incorporating writing into pre-existing courses.
A Comparison of Methods in Political Science Text Classification: Transfer Learning Language Models for Politics
Zhanna Terechshenko, New York University; Fridolin Linder, Siemens Mobility; Vishakh Padmakumar, New York University
Automated text classification has rapidly become an important tool for political analysis. Recent advancements in natural language processing (NLP) enabled by advances in deep learning now achieve state-of-the-art results in many standard tasks for the field. However, these methods require large amounts of both computing power and text data to learn the characteristics of the language, resources which are not always accessible to political scientists. One solution is a transfer learning approach, where knowledge learned in one area or source task is transferred to another area or a target task. A class of models that embody this approach are language models, which demonstrate extremely high levels of performance in multiple natural language understanding tasks. We investigate the feasibility of using these models and their performance in the political science domain by comparing multiple text classification methods. We find that the language models RoBERTa and XLNet, while requiring fewer resources in terms of both computing power and training text, perform on par with or outperform traditional text classification methods. Moreover, we find that the increase in accuracy is likely to be especially significant in the case of small data sets, highlighting the potential for reducing the cost of supervised methods for political scientists via the use of pretrained language models. We argue, therefore, that the use of transfer learning methods can reduce the cost of many text classification tasks for political scientists.
ACTS: Accelerating COVID-19 Testing with Screening
Daniela Ushizima, Berkeley Lab
Top infectious disease experts continue to reiterate the need for aggressive diagnostic testing for COVID-19, in terms of both the number of tests and the capacity to actually perform them; however, both have yet to be scaled to very large populations. Therefore, when it comes to COVID-19 testing, one big question is whether we could use computed tomography (CT) scans for frontline diagnosis. This presentation will discuss the design of a computational tool that explores lung scans as part of screening protocols for patients with suspected COVID-19 infection. By using machine learning combined with computer vision algorithms, we expect to rank patients and provide early warning of COVID-19 infection. We will illustrate some preliminary results using efficient mathematical models that might amplify COVID-19 testing and surveillance. Discussion will include, but is not limited to, use of pneumothorax X-ray CT as algorithmic input and encouraging results using a fully automated algorithm that detects the lungs, the first step towards identifying lung lesions. We will share lessons learned using a public dataset (https://github.com/ieee8023/covid-chestxray-dataset) and an unsupervised algorithm for lossless data reduction and artifact removal, followed by semantic segmentation.
Principles for Data-Intensive Research Workflows: Guidance for the Classroom and the Computational Laboratory
Valeri Vasquez, UC Berkeley; Ciera Martinez, UC Berkeley; Sara Stoudt, Smith College
Traditional data science education includes a review of various statistical analysis methods as well as training in computational tools, software, and programming languages. However, the development and pursuit of a research workflow -- the process that moves a scientific investigation from raw data to coherent question to insightful contribution -- is a crucial component of pragmatic data science that is often left out of classroom discussions. Too frequently, students and practitioners of data science are left to learn these essential skills on their own and on the job. Guidance on the breadth of potential products beyond traditional academic publications that can emerge from research is also lacking. In this discussion, we will consider how to effectively train new researchers to develop a research approach that allows for creativity while standardizing practices to organize data and code such that work is (1) reproducible and (2) culminates in results that make a scientific contribution. We discuss a workflow that includes three phases (Exploratory, Refinement, and Polishing), emphasizing the value of accessible communication and making analogies to software development and design processes where appropriate.
Framework for ML Bias/Fairness
Brian Wright, University of Virginia
This session is designed to walk the audience through different types of bias that are represented in the field of Machine Learning. This includes examples of how these biases are present in the real world. Technical approaches to identifying biases as they relate to classification will also be presented. All these topics will lead to recommendations on how to move forward and how to encourage a healthy distrust when it comes to machine learning approaches as a potential solution.
Teaching Introduction to Data Science courses in Python and R using OER materials
Debbie Yuster, Ramapo College
In the last several years, new curricula have been developed for Introduction to Data Science which are high quality, low prerequisite, and Open Educational Resources (free of charge with permission to alter or remix). One well known example is UC Berkeley's Foundations of Data Science, better known as 'Data 8', developed by Ani Adhikari, John DeNero, et al. The Data 8 curriculum has students work through guided data analyses in Python Jupyter notebooks, served in the cloud so students have easy and equitable access requiring only a web browser. Students can run code checks in-notebook to ensure they're on the right track, and assignments are marked through a combination of automatic and manual grading. Another open source curriculum is Data Science in a Box, developed by Mine Çetinkaya-Rundel. This curriculum uses R and RStudio, the 'tidyverse' suite of R packages, and emphasizes student collaboration and version control via GitHub. Available materials include lecture slides, videos, interactive tutorials written with the 'learnr' package, and assignments, meant to accompany the freely available books R For Data Science by Garrett Grolemund and Hadley Wickham, and OpenIntro Statistics by David Diez, Çetinkaya-Rundel, and Chris Barr. In 2020, I taught two Introduction to Data Science courses, one using the Python curriculum and the other using the R curriculum. I will highlight and compare key aspects of each course, reflect on the experiences, and give tips for instructors interested in adopting one of these curricula.