MSDSE Summit - 2018

This website was ported from its original location to the new ADSA website in late 2022. We have attempted to capture the most relevant information, though some may have been lost.


“AI and Astrophysics” (Richard Galvez) - The function of this session is to introduce certain problems lying on the interface between Astrophysics and deep learning, and request suggestions from colleagues on ideas moving forward. I will first give a short introduction to a specific problem, relevant data available, and what has been attempted thus far; followed by a brainstorming session focusing on each problem in turn.

“Beyond Fake News: A framework to analyze dissemination of motivated information through social media” (Sunandan Chakraborty) - Fake news has been an important part of our contemporary public discourse; the term has been used by a variety of actors to refer to broadcast items that mislead people under the guise of legitimacy. In India, with a large share of new social media users, information that is either blatantly false or motivated to inflame can spread to a large population within a short span of time. On many platforms, such as WhatsApp, encryption of messages makes it extremely difficult for law enforcement to intervene and stop such information from spreading. At present, most detection is carried out manually. With the increasingly fast rate of generation and even faster rate of spread of such stories, it is infeasible to rely purely on manual interventions to address this phenomenon. In this talk we will present a work in progress that targets automatic detection of such malicious messages and stops their further spread.

“Billion-Pixel Digital Pathology Analysis: Mapping Tau protein in the human brain” (Maryana Alegro) - Deposits of abnormal Tau inclusions in the brain are a well-known pathological feature of Alzheimer’s disease (AD) and are the best predictor of neuronal loss and clinical decline. Tau is a potential in-vivo imaging biomarker that could enable earlier AD diagnosis. In fact, there are several research initiatives to develop PET tracers for imaging Tau in the human brain. Validation of such studies, however, is only reliably performed using histological data, where Tau inclusions can be accurately located. Manually locating and segmenting Tau in such images is infeasible since a whole human brain slice spans several gigabytes of data and carries thousands of inclusions. We present our preliminary results on using convolutional neural networks (CNNs) to automatically locate and segment Tau in whole human brain histological slides. Our dataset is composed of histological slices stained for AT100, AT8, and MC1. Each slide was imaged in our in-house built slide scanner at 1.22 μm resolution. Each image is about 82000x37000 pixels. On average, a whole human brain dataset has 16 TB of data. Images were preprocessed and stitched using the UCSF Wynton cluster. A UNet variant was used for Tau localization and segmentation on the full-resolution images, which was performed using an NVIDIA Titan X 12 GB GPU. We then computed Tau ‘heatmaps’ based on the segmented images that, in turn, were registered to the MRI and Tau PET, allowing direct comparison between PET signal and percentage of Tau.

“Blockchain-Enabled Data Trusts” (Neil Davies) - We will explore the potential of blockchains and cryptocurrencies as socio-technical infrastructure for open science. One use case is the Island Digital Ecosystem Avatar (IDEA) Consortium’s effort to build a biosample blockchain for Genomic Observatories. Advances in ‘omics technologies have greatly increased the scientific value of biosamples and led to an explosion in data diversity and quantity. The need for a software stack that digitally integrates the value chain of biosamples from collection through analysis, storage, use and re-use has never been greater. Blockchain technology promises to provide this integration, seamlessly maintaining the chain of custody of specimens and data. As distributed databases, blockchains elegantly address the technological challenge of registering (tokenizing) digital assets with cryptographic identifiers, linking all exchanges of assets in an immutable decentralized ledger (provenance), and enabling programmable ‘smart’ contracts to enforce terms of use. Obstacles to data sharing, however, are not only technological; social barriers are perhaps even more challenging. Blockchains could be transformational here too, with their use of digital tokens as cryptocurrency providing a powerful mechanism to shape user behavior. Even without the development of a cryptocurrency, blockchains incentivize data/sample providers through automated discovery of what is learned (and earned) downstream. Similarly, users and repositories are motivated to engage in a system where provenance is transparent, affording confidence that upstream suppliers complied with regulatory requirements and ethical norms.

“Breaking the Big Data Barrier in Ecology: The promise of large-scale acoustic surveys” (Justin Kitzes) – Today, nearly all biodiversity data is collected through direct observation: organisms are seen or measured by a person, and this information is recorded in a database. Direct observation, however, cannot capture the amount of data needed to make inferences about entire communities, at global scales, for decades into the future. Despite the best efforts of professional ecologists and citizen scientists, we currently observe biodiversity on less than 0.00001% of Earth’s surface each year. Instead, indirect observation methods, where data are recorded by autonomous sensors and processed by machine learning models, will ultimately be needed to break through this data limitation. Here I describe our work developing a platform, OpenSoundscape, that supports one specific type of indirect observation: acoustic surveys of vocal taxa such as bats, birds and amphibians. OpenSoundscape combines inexpensive hardware, open source software, and hierarchical statistical models, with the goal of estimating population densities and characterizing macroecological patterns. We are currently beginning several field studies using this platform, with an initial focus on birds.

“Building a National Data Science Pedagogy Community of Practice” (Anthony Suen) - Universities across the country are developing new programs to prepare their student bodies to tackle the emerging challenges of data in science and industry using 21st-century tools and techniques. However, these programs are often developed and implemented in silos, leading to duplication of effort and differences in course quality. Furthermore, while a number of useful curriculum guidelines for degrees in data science have been proposed, opportunities for engaging in pedagogic exchanges and sharing resources remain rare. Given the challenges described above, we propose three significant initiatives to facilitate collaboration across the country (and possibly internationally). The first is an annual pedagogy workshop/conference to explore ideas, datasets, and tools for the creative and effective teaching of data science. The second is a centralized portal and associated repository to support and enhance collaboration around teaching data science. The third is to advocate for establishing, developing, and sustaining a US National Educational Cyberinfrastructure.

“Classifier-Agnostic Saliency Map Extraction” (Krzysztof Geras) - We argue for the importance of decoupling saliency map extraction from any specific classifier. We propose a practical algorithm to train a classifier-agnostic saliency mapping by simultaneously training a classifier and a saliency mapping. The proposed algorithm is motivated as finding the mapping that is not strongly coupled with any specific classifier. We qualitatively and quantitatively evaluate the proposed approach and verify that it extracts higher quality saliency maps compared to the existing approaches that are dependent on a fixed classifier. The proposed approach performs well even on images containing objects from classes unseen during training.

“Cloud-computing Tools for Satellite Radar Imagery: With applications to landslide monitoring in the Pacific Northwest” (Scott Henderson) - Satellite radar imagery (SAR) can measure sub-cm surface displacements over 100s of kilometers. These measurements are useful for geologic studies of earthquakes, volcanic activity, and landslides. The PB-scale Sentinel-1 SAR archive contains raw data, requiring Cloud-computing tools to efficiently generate useful imagery for scientific analysis. We will present an overview of the open-source software package we have developed to facilitate SAR analyses, with applications to geological hazard monitoring in the Pacific Northwest United States. The software is a freely available Python package that facilitates running legacy code on Amazon Web Services (AWS Batch) via Docker. NASA has signaled its intent to store SAR data on AWS, which would finally negate the need for downloading these large datasets! The software facilitates scaling Cloud resources (CPUs, RAM, and GPUs) based on need. Finally, Cloud-optimized image storage (Cloud-optimized GeoTIFFs and SpatioTemporal Asset Catalog metadata) is used to facilitate archiving, visualization, and distribution of processed imagery. We demonstrate how we have used the software to monitor the Rattlesnake Ridge slow-moving landslide in Union Gap, Washington State. The landslide initiated in 2017 and continues to advance at a rate of 2 mm/hr through the summer of 2018.

“Cordial Poetry” (David Mongeau) - Data science institutes tend to be populated with more engineers than humanists and artists.  Moreover, there is not a widely held view in these places that science and art are of equal importance. Humanists and artists have been open to exploring engineering methods and technology from the STEM disciplines.  For example, there is an entire community of visual artists and sculptors creating data-driven art with 3D-printers and digital displays.  Similarly, there are poets and other writers relating through open source composition and humanities hackathons.  One of them, Wendy Vardaman, has dubbed herself a word designer.  She reports gaining an understanding of the potential of blending computational methodologies with humanist questions to create art that can bring communities together.

“Crowd Size Estimation” (Djellel Difallah) - In this lightning talk, I will present our recent work on developing a statistical technique for estimating the size of a population using longitudinal survey data (Mechanical Turk) and check-in data (Foursquare). The methods have roots in ecology and biostatistics, which we adapt to human dynamics in online and urban settings. While related techniques consider the probabilities of observing individuals to be constant or time-varying, we instead account for the skewed propensity of individuals to participate in our surveys or to check in at a particular venue. Finally, I will discuss some direct applications of inferring these values in a couple of data-science-related projects, namely MTurk task completion prediction and taxi demand prediction.
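
The abstract does not name the estimator it builds on. As a rough illustration of the ecology-rooted family it alludes to, the sketch below implements the classic two-sample capture-recapture estimate in Chapman's bias-corrected form. It assumes every individual is equally likely to be observed, which is the assumption the talk sets out to relax. The function name and the counts are hypothetical.

```python
def chapman_estimate(first_sample: int, second_sample: int, seen_in_both: int) -> float:
    """Chapman's bias-corrected Lincoln-Petersen estimate of population size.

    Assumes a closed population and equal observation probability for everyone.
    """
    if seen_in_both < 0 or seen_in_both > min(first_sample, second_sample):
        raise ValueError("seen_in_both must lie between 0 and the smaller sample size")
    return (first_sample + 1) * (second_sample + 1) / (seen_in_both + 1) - 1


# Hypothetical counts: 200 respondents in survey wave 1, 150 in wave 2,
# and 30 people who appear in both waves.
estimate = chapman_estimate(200, 150, 30)
print(f"Estimated population size: {estimate:.0f}")
```

When some individuals are far more likely to respond than others, the overlap between waves is inflated and this simple estimate tends to be too low, which is one way to see why a skewed participation propensity needs separate treatment.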

“Data Science: Bridges to the Abstract” (Richard Barnes) - Data and model visualization, done well, express hidden truths and exquisite interrelationships in elegant abstractions for easy digestion. But, as any practitioner knows, the slightest misstep in a script can produce another kind of abstraction entirely: modern art. This temporary exhibit invites datartists to submit science-derived art of any sort, though we discourage anything so blasé as a readily interpretable graph. Each piece should have an accompanying artist's statement [ may provide inspiration], as well as a statement of what is actually being shown and, if it's unintended, where it all went wrong. Exhibitors should bring their own prints and email their statements to "" ahead of time. We will provide museum placards to accompany the works.

“Data Science and Genomics Research” (Diya Das, Nelle Varoquaux, Chris Kennedy, Ciera Martinez) - This session is meant for researchers using data science tools for genomics research. We intend to give brief (5 minute) overviews of our current research and discuss common issues we face in our work. Our research ranges from cell type identification in regeneration of adult tissues, drought resistance in crops, enhancer identification and function in development, and studies of childhood leukemia. We welcome participation from others, particularly from our colleagues at NYU and UW. The goal of this session is to identify shared topics of interest and to potentially catalyze cross-institute research collaborations and tools/methods sharing.

“Data Sciences for Climate, Water and Energy Research” (Zexuan Xu, Deborah Sunter) - Traditionally, physically-based process models are applied to study climate, water and energy issues, leveraging supercomputers to simulate many parts of the Earth’s system in detail across scales. Recently, data-driven predictive models and machine learning techniques have been developed for evaluating the impacts of climate change on water resources and energy security. This session welcomes presentations and discussions of cutting-edge techniques, research ideas, collaborations and funding opportunities in the climate, water and energy disciplines.

“Deploy and Customize Your Own JupyterHub in the Cloud in Less Than 30 Minutes” (Chris Holdgraf) - JupyterHub has proven to be a valuable tool for educators, researchers, and analytics teams. However, deploying your own JupyterHub can be difficult - either requiring knowledge of Kubernetes or the internals of the JupyterHub architecture to accomplish. In this lightning talk I'll cover a new distribution of JupyterHub called "The Littlest JupyterHub". It runs on a single VM that is easily created in the cloud. I'll cover how to start and install JupyterHub, as well as some basics for how to configure it. I'll end by asking anyone interested to run through a user test with one of the JupyterHub team members.

"Jupyter Books - Online, interactive, cloud-ready textbooks with Jupyter Notebooks" (Chris Holdgraf) - Jupyter Notebooks are a great way to combine conceptual ideas with code and results. However, many scientific and educational projects require *collections* of notebooks, usually in the form of a "book". Jupyter Books is a simple system to build a "book" interface to a collection of notebooks. It uses nbconvert to create Jekyll markdown files out of Jupyter Notebooks, and a custom Jekyll theme that creates a gitbook-like interface to the material. Jupyter Books can also automatically embed links to Jupyter Notebooks that can be run on cloud services such as Binder or Jupyter Hub. To use it, simply replace the demo notebooks with your own, and update the table of contents with new links to your notebooks. Then, build your Jupyter Book and push to GitHub, where your book will now be publicly available.

“From OpenStreetCab to Navcab: Deploying mobile applications in the wild as urban transport transforms in the era of big data” (Anastasios Noulas) - The urban transport industry has been going through a series of major changes in recent years, driven by the rise of tech companies that have been disrupting traditionally operating professions in transport (e.g. Uber and taxis). Such changes do not simply manifest in the ways users travel in the city; they also fundamentally change the market, economic and labour structures of the industry as a whole. In this talk we will go through the experience of developing and deploying travel apps for journey planning in cities. We will discuss the challenges of deploying OpenStreetCab, an application that helps users select the best taxi providers for their journeys, focusing on new pricing strategies like 'surge pricing' introduced by Uber as well as navigation differences between Uber drivers relying on GPS navigation and traditional cab drivers. We will then introduce a new app, Navcab, which provides an intelligent platform for the traditional taxi, helping drivers to route intelligently in the city and optimise pick-up rewards, while also enabling taxi communities to strengthen their structure as a social network through technological means.

“From the Wet Lab to the Web Lab” (Anisha Keshavan) – Advances in technology have enabled scientists to collect massive amounts of data to answer important scientific questions. But the drawback is that we are experiencing a “data deluge”, which has brought about new challenges that we must overcome in order to truly reap the benefits that Big Data promises. In this talk, I propose that web technology can help us overcome these challenges, and present examples of how this is done in the field of neuroimaging. First, how web-based data visualization can address the challenges of high data dimensionality. Second, how web-based collaborative meta analyses can address the challenge of integrating the never-ending stream of new results in the research literature. Finally, how web-based citizen science platforms can address the problem that decisions made by neuroimaging experts cannot be reliably scaled to large datasets. Web technology has completely transformed our everyday lives, but we are only just beginning to unleash its full potential to accelerate scientific discovery.

“A General-Purpose Natural Language Processing Pipeline for the Annotation of Clinical Notes with Biomedical Concepts” (Vikas Pejaver) - It is estimated that clinical notes account for as much as 80% of the meaningful information present in patients’ electronic health records (EHRs).  For the purposes of research, extracting such information is relevant to numerous applications ranging from keyword-based searches to build patient cohorts to engineering features for machine learning tasks.  However, the unstructured nature of notes makes this non-trivial, and requires the use of highly customized natural language processing (NLP) techniques.  For clinical data warehouses (CDWs) in large medical systems, this approach is not feasible due to two major reasons: (1) the patient population is too big to allow for repeated applications of NLP, and (2) the end-goals of different research studies are wide-ranging, and are often unknown in advance. To address this, we have developed a general-purpose NLP pipeline to annotate clinical notes with standardized task-independent biomedical concepts to enable a variety of downstream applications.  In this presentation, we describe and evaluate this pipeline, and discuss our ongoing pilot study to annotate a year’s worth of the approximately 75 million clinical notes present in the University of Washington (UW) Medicine CDW.
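
To make the idea of annotating notes with task-independent concepts concrete, here is a minimal sketch of dictionary-based annotation. It is not the pipeline described in the talk: the lexicon, the concept identifiers (`CONCEPT:0001` and so on) and the function name are all made up for illustration, and a production system would also need to handle negation, abbreviations and misspellings.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical lexicon: surface phrases mapped to made-up concept identifiers.
# Synonyms share an identifier so downstream tasks see one concept.
LEXICON: Dict[str, str] = {
    "heart attack": "CONCEPT:0001",
    "myocardial infarction": "CONCEPT:0001",
    "type 2 diabetes": "CONCEPT:0002",
    "metformin": "CONCEPT:0003",
}


def annotate(note: str, lexicon: Dict[str, str] = LEXICON) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, matched_text, concept_id) tuples sorted by position."""
    annotations = []
    for phrase, concept_id in lexicon.items():
        # Whole-word, case-insensitive match of the phrase.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, note, flags=re.IGNORECASE):
            annotations.append((match.start(), match.end(), match.group(0), concept_id))
    return sorted(annotations)


note = "Pt with Type 2 diabetes on metformin; prior myocardial infarction in 2015."
for start, end, text, concept_id in annotate(note):
    print(f"{start:3d}-{end:3d}  {concept_id}  {text}")
```

Because the annotations are stored once with character offsets and concept identifiers, later studies can query them without re-running NLP over the notes, which is the motivation the abstract gives for a general-purpose pass.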

“Gigantum: A web shell for reproducibility” (Dav Clark) – Gigantum was founded by faculty, program managers, and software engineers who had run into the same challenges with collaborative data science again and again. We have built an open source client that can replace diverse command line activities to seamlessly construct and share a Docker environment along with all code and data. The Gigantum Client integrates with the Jupyter messaging protocol to allow fine-grained tracking of the inputs, outputs, and context of each execution. We maintain a visual activity record in a novel git-friendly datastore, making it easy to scan for specific outputs and for the exact code, data, and environment used to generate them. We are currently integrating additional through-the-web tools like RStudio. Everything syncs with a single click and collaborators can automatically generate the same environment on their computer. Our open source Client is available today and works across Windows, Linux, and Mac systems. We look forward to constructive engagement with the community and making open source tools more welcoming and robust for collaboration between novice and expert users.

“Grounding Compositional Hypothesis Generation” (Neil Bramley) - A number of recent computational models treat human learning and hypothesis generation as involving probabilistic induction over a space of language-like, compositional concepts.  Inference in such models requires repeatedly sampling from an (infinite) distribution of possible concepts and comparing the relative likelihood of samples in light of current data or evidence.  However, we argue that most existing algorithms for such top-down sampling are inefficient and cognitively implausible accounts of human hypothesis generation.  We propose an alternative, Instance Driven Generation (IDG), that constructs bottom-up hypotheses directly out of encountered positive instances of a concept. Using a novel rule induction task, we compare these "bottom-up" and "top-down" approaches to inference.  We find that the bottom-up IDG model accounts better for behavioral patterns, and results in a computationally more tractable inference mechanism for concept learning models based on a probabilistic language of thought.

“Hack Weeks as a Model for Data Science Education and Collaboration” (Daniela Huppenkothen) - Across many scientific disciplines, methods for recording, storing and analyzing data are rapidly increasing in complexity. Skillfully using data science tools that manage this complexity requires training in new programming languages and frameworks, as well as immersion in new modes of interaction that foster data sharing, collaborative software development and exchange across disciplines. Learning these skills from traditional university curricula can be challenging because most courses are not designed to evolve on time scales that can keep pace with rapidly shifting data science methods. Here we present the concept of a hack week as a novel and effective model offering opportunities for networking and community building, education in state-of-the-art data science methods and immersion in collaborative project work. We find that hack weeks are successful at cultivating collaboration and facilitating the exchange of knowledge. Participants self-report that these events help them both in their day-to-day research as well as their careers. Based on our results, we conclude that hack weeks present an effective, easy-to-implement, fairly low-cost tool to positively impact data analysis literacy in academic disciplines, foster collaboration and cultivate best practices.

“Hackathons, Hack Weeks, Unconferences, Sprints” (Daniela Huppenkothen, Anthony Arendt, Ariel Rokem) - Over the past decade, traditional models of learning and collaboration within academic disciplines (e.g. conferences and summer schools) have been supplemented by alternative forms of knowledge transfer and cooperation. Examples include unconferences, sprints, hackathons and hack weeks. Many of these events require organization and facilitation beyond common modes of knowledge generation and transfer (presentations, unstructured discussions). Their different implementations across academic fields have led to a variety of strategies employed to make these workshops effective, focused and inclusive. In this session, we aim to bring together the current organizers of events falling into these (and other!) categories in order to discuss experiences, share best practices and generate ideas for how to improve or implement future events. We also welcome researchers who are thinking of organizing an event of this kind for the first time and would like to share their ideas and connect with the existing community.

“Human-Centered Data Science” (Cecilia Aragon, Anissa Tanweer) - In this informal breakout session, we will discuss efforts over the past five years on the three campuses in human-centered data science, data science studies, ethics of data science, and related methodologies. We will also brainstorm ideas for the future, potential collaborations, proposals, and articles. Suggestions welcome in this open discussion!

“Improving Scientific Workflows for Multidimensional Data with Cloud-Based Computational Tools” (Scott Henderson) - Scientific communities are storing an increasing number of large important datasets on the Cloud with the hope of accelerating the pace of scientific discovery.  However, Cloud-based computational tools that facilitate data analysis and visualization are in early stages of development. In this session, we will describe the Pangeo project: A coordinated effort funded by NSF and NASA to develop cutting edge tools for discovery, pre-processing, re-gridding, and visualization of multidimensional Earth Science datasets on the Cloud. A major component of Pangeo is the development of a computing environment that can run next to data - negating the need to download large files and taking advantage of scalable computing resources. To this end, Pangeo is developing a JupyterHub variant that can autoscale large computations on multidimensional data using Kubernetes combined with Python libraries such as Dask and Xarray. This breakout session uses Earth Science applications as an example, but we hope to discuss Cloud-based workflow issues of broader interest. For example, pros and cons of various storage formats, transferability to different local or public cloud platforms, costs to maintain operations, approaches to encourage a cultural shift in scientific computation towards Cloud-based workflows, and finally, a reflection on if and how these workflows are accelerating scientific discovery.

“Learning a Non-Linear Controller for Insect Flight Dynamics With a Deep Neural Network” (Callin Switzer) - Insect flight is a highly non-linear dynamical system.  As such, strategies for understanding its control have typically relied on either simulation methods (e.g., Model Predictive Control (MPC), genetic algorithms) or linearization of the dynamical system. Here we develop a new framework that combines MPC and deep learning to create an efficient method for solving the inverse problem of flight control. We used a feedforward, fully-connected neural network to answer the question, “What is the temporal pattern of forces required to follow a complex trajectory?” Combining neural networks with simulations based on dynamical systems models yields a data-driven controller where the data are derived from a non-linear physical model. We first trained a deep neural network (4 hidden layers, with hundreds of nodes) on ~8 million simulated 2D insect trajectories. Our network accurately predicted the force, force angle, abdomen angle, and tangential and angular velocities (7 outputs), when it was provided with initial conditions and a goal location (12 inputs). The coefficient of determination (r^2) for all predictions was > 0.999 on a validation dataset (1 million additional trajectories). Next, we evaluated the neural network’s ability to control a simulated insect.  We used the aforementioned predictions and compared the final conditions generated to simulations. Again, we found that network-prescribed final conditions were nearly identical to numerically solved conditions (r^2 > 0.999). Overall, this work shows that machine-learning may be an efficient approach for controlling nonlinear dynamical systems.
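
For readers unfamiliar with the metric quoted above, the following sketch computes a coefficient of determination as one minus the ratio of residual to total sum of squares. This is one common definition; the abstract does not say which variant the authors used, and the function name and sample values here are hypothetical.

```python
from typing import Sequence


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("observed and predicted must be non-empty and the same length")
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    if ss_total == 0:
        raise ValueError("observed values are constant, so r^2 is undefined")
    return 1 - ss_residual / ss_total


# Hypothetical values: simulated forces versus network predictions.
simulated = [0.10, 0.25, 0.40, 0.55, 0.70]
predicted = [0.11, 0.24, 0.41, 0.54, 0.71]
print(f"r^2 = {r_squared(simulated, predicted):.4f}")
```

A value of 1 means the predictions reproduce the simulated quantity exactly, and a value of 0 means they do no better than always predicting the mean.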

“Learning to Discover Patterns without Human Supervision” (Stella Yu) - Suppose that you have a lot of data but you don't know what you are looking for.  You may have designed the experiment and collected the data, but you are not sure what type of patterns would show up in the data.  Is there something that can be done to discover the patterns automatically without imposing any or very little human supervision?  I will describe a few of our recent works in computer vision that address such problems and invite people in different data science domains to bring their applications and needs to discussions.

“Lessons Learned from Teaching Software Engineering to Data Scientists” (Joe Hellerstein, David Beck) - At UW, we teach software engineering and statistical modeling (e.g., machine learning) as separate skills. However, our machine learning course requires significant programming to access Python ML code. And, in our course Software Engineering for Data Scientists (which teaches skills in design, testing, coding style and project management), most students do projects that involve statistical modeling. For example, recent projects involved predicting housing prices, Bitcoin prices, and river flows for kayaking. Based on the approximately 100 student group projects over four years of teaching the Software Engineering course, we see a significant gap in our data science curriculum. Although our current curriculum provides students with good background in statistical modeling and the skills for software engineering, practical data science requires additional knowledge and the integration of these skills. In particular, students lack insight into: (a) identifying and verifying modeling assumptions (e.g., independence); (b) diagnostics to determine why a model performs poorly (e.g., residual plots); and (c) how to test that their modeling procedures (typically a combination of statistical techniques) are performing as expected (e.g., using Monte Carlo techniques to generate known distributions for a classification algorithm). Some of the above gaps are likely addressed by having a capstone project, as we do at UW. However, my contention is that there are a sufficient number of skills to learn that classroom time is required to close these gaps. This could be done as a separate class. Or it could be done by a re-design of existing courses. This breakout will discuss the needs for the skills described above and how to best address these needs in a data science curriculum.
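
Point (c) can be illustrated with a small sketch: draw data from distributions whose parameters are known, run the modeling procedure, and check the result against what theory predicts. The classifier here (a midpoint threshold between two class means), the function names and the parameter values are all assumptions chosen for illustration, not material from the course.

```python
import random
from statistics import NormalDist, mean
from typing import List, Tuple


def fit_threshold(class0: List[float], class1: List[float]) -> float:
    """Toy 1-D classifier: the decision threshold is the midpoint of the class means."""
    return (mean(class0) + mean(class1)) / 2


def accuracy(threshold: float, class0: List[float], class1: List[float]) -> float:
    """Fraction of points on the correct side of the threshold."""
    correct = sum(x < threshold for x in class0) + sum(x >= threshold for x in class1)
    return correct / (len(class0) + len(class1))


def monte_carlo_check(n: int = 2000, separation: float = 4.0, seed: int = 0) -> Tuple[float, float]:
    """Return (observed accuracy, theoretical accuracy) on synthetic Gaussian classes."""
    rng = random.Random(seed)

    def draw(mu: float) -> List[float]:
        return [rng.gauss(mu, 1.0) for _ in range(n)]

    # Train and evaluate on independent draws from the same known distributions.
    threshold = fit_threshold(draw(0.0), draw(separation))
    observed = accuracy(threshold, draw(0.0), draw(separation))
    # With unit-variance classes, a midpoint threshold is right with probability Phi(separation / 2).
    expected = NormalDist().cdf(separation / 2)
    return observed, expected


observed, expected = monte_carlo_check()
print(f"observed accuracy {observed:.3f}, theoretical accuracy {expected:.3f}")
```

If the observed accuracy drifts far from the theoretical value, the fault lies in the procedure rather than the data, since the data-generating process is known by construction.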

“Mathematics of Gerrymandering” (Soledad Villar) – Gerrymandering is a long-standing issue within the US political system, and it has recently come under scrutiny by the US Supreme Court. In this talk we focus on its mathematics. I will give an overview of the arguments that have been successful and the ones that have been rejected by the Supreme Court.

“The Mechanisms of Protest Recruitment through Social Media Networks” (Andreu Casas) – The literature on protest mobilization has long suggested that social ties have a strong influence on the decision to protest. Recent literature on social media mobilization also shows that in the current digital environment these social tie effects often take place online, particularly on social media. However, previous research does not provide clear evidence for why personal ties play a mobilizing role. In this paper we lay out four main theoretical mechanisms and we test them using real-world protest attendance data: social networks mobilize because a) they provide basic logistic information that is vital to protest coordination, b) they create motivations for others to protest, c) they solve coordination problems, and d) they put pressure on others to participate. We collect data on Twitter activity during the 2018 Women's March that took place in many cities in the United States on January 20th and 21st. We then use geolocated accounts to find a set of users who attended a march and a set of users who did not. We use machine learning techniques to determine the amount of information, motivation, coordination, and pressure frames to which they were exposed through their Twitter networks. In line with current theories, we find users who protested to be more connected among themselves than those who did not. With regard to the mechanism analysis, we find that users whose friends sent a larger number of information, coordination, and to a lesser extent, pressure frames, were more likely to protest, but find the opposite effect for motivation frames.

“Mining Gold from Implicit Models to Improve Likelihood-Free Inference” (Johann Brehmer) - Many real-world phenomena are best described by complex computer simulations. But the probability density implicitly defined by these simulators is often intractable, which makes inference in these systems challenging. We present a new family of simulation-based inference techniques. They go beyond the traditional Approximate Bayesian Computation approach, which struggles in a high-dimensional setting, and extend methods that use surrogate models based on neural networks. The key idea is to extract additional information on the latent process from the simulator. This can then be used to augment the training data for the surrogate models. We demonstrate that these techniques are more sample-efficient and provide higher-fidelity inference than traditional methods. These new methods can be applied in fields as diverse as particle physics, cosmology, genetics, and epidemiology.

“ML for Sustainable Fisheries” (Falk Schuetzenmeister) - The Nature Conservancy is invested in developing sustainable tuna fisheries in the Pacific Island region by implementing cameras on longline fishing boats to detect illegal or unsustainable practices. In a pilot project, we ran a data science competition to create a classifier for fish species hauled on deck of fishing boats. Our competition provided us with a vast amount of voluntary exploration and experimentation from which we learned a lot about machine learning, image augmentation, platforms, and algorithms. However, it did not provide a deployable solution; we still needed our own data scientists and software engineers to implement these findings (image augmentation mattered most), retrain models, and build services. In my lightning talk, I would like to share my experience in working on a non-academic data science team.

“MSDSE Alumni Network Planning Meeting” (Nick Adams, Micaela Parker) - The MSDSEs are now old enough to have created dozens of alumni who have gone on to exciting work in the Academy, industry, and non-profit sectors. As pioneers in our fields, what are our experiences, challenges, and opportunities? Are there ways we can work together or share expertise and resources to broaden and deepen our impacts and facilitate the growth of data science in the academy and beyond?

“Non-Linear Regression for Manifold Learning in Molecular Dynamics” (Samson Koelle) - We present a method for explaining low-energy paths between molecular conformations by combining recent techniques in both manifold learning, which identifies such paths, and functional regression, which in our case can attribute them within a wide class of explanatory non-linear functions. Unsupervised manifold learning approaches are useful for understanding molecular dynamics simulations since they disregard small-scale information such as peripheral hydrogen vibrations that can nevertheless drastically affect the observed energy. However, understanding the role of covariates such as bond rotation in determining the energy landscape is made difficult by non-trivial data topology and geometry. In order to deal with these difficulties, we regress embedding differentials on functional covariate differentials, and use a group-lasso-inspired penalty for inducing sparsity. Differentiation of functional covariates is done automatically, while embedding differentials are estimated. The key statistical advance in this method is the use of a manifold regularization norm derived in a data-dependent way from the Riemannian metrics of the data and the embedding in both the embedding differentiation and covariate differentiation procedures. This method replaces visual inspection for determining which bonds are pivotal in a small molecule.

“OpenSpace: Moving from what to how in public dissemination” (Alexander Bock) - Ongoing advances have increased the public's access to high-quality visualization tools. The ability for the general public to use the same tools as domain experts can lead to the confluence of the use of visualization for exploration and explanation and thus enables a deeper engagement with the provided material. This talk presents OpenSpace, an open-source astrovisualization software that bridges the gap between scientific discoveries and public dissemination and thus paves the way for the next generation of science communication and data exploration. By shifting public presentations from only *what* has been discovered to also showing *how* it is being discovered, it is possible for the general public to gain a much deeper understanding of the scientific discoveries that have been undertaken. This paradigm is explored with spacecraft missions, planetary rendering, and the wider universe in general.

“Organic and Organized” (David Mongeau, Chris Holdgraf, Maryam Vareth, Sarah Stone) - Data science consciousness has been raised at the five institutions involved and many of us see value in getting more specific about future intent. What’s been learned about structures among the MSDSEs deserves focused discussion. It can inform strategic and organizational planning efforts. It can help sustain our community. This session is meant to 1) Capture lessons learned across the MSDSE-borne (or nurtured) institutes, 2) Engage in thoughtful debate about opportunities and risks with the organic and the organized with regard to what the institutes do next, and 3) Explore planning methods that we’re using at BIDS, in addition to the taxonomy of institutional approaches to data science, which is coming out of the Data Science Academic Leadership Summit. All session participants will be provided a pre-read with links to studies and opinion pieces about strategic and organizational planning in higher education and what it takes for a data science institute to be sustainable long term.

“Our Research and Our Lives” (Peter Krafft) - In this session we will share and discuss thoughts and personal reflections on various topics regarding the Big Choices we make in our careers and in our lives. What factors have led to the research problems we choose to devote ourselves to? How can we be more intentional about aligning our work with our values?  What tricks have you found for living a happy, healthy life in the context of the stressful environment of academia?

“Pedagogy and Support Systems for Non-Classroom Data Science Instruction” (Diya Das, Ciera Martinez, Orianna DeMasi) - Many people across our campuses teach workshops on data science tools and methods or mentor teams of undergraduates for data science research projects. We are creating a group at BIDS for instructors/mentors to network and receive support from one another, as much of the pedagogical support in an academic context is focused on coursework. This session is meant for instructors and mentors to discuss what they have learned about instruction in non-classroom environments, particularly with regard to teaching the tools and methods of data science. Through discussion, we hope to develop a list of ways in which people feel they could be better supported by their institution in their non-classroom instructional efforts and develop strategies for implementation within their respective MSDSEs.

“Play as Learning: WebVR is fun, computation, and math” (Dav Clark) - Last year, I discussed some of the ideas of Seymour Papert. More recently, I ran a weekend-long party where total novices learned to create WebVR scenes, along with figuring out things like coordinate transforms and 3D geometry along the way. It was fun! I'm happy to help set people up to create things and explore what's possible with relatively simple tools. Anyone can get started no matter their technical experience!

“pomegranate” (Jacob Schreiber) - pomegranate is a Python package for fast and flexible probabilistic modeling. The basic unit is the probability distribution, which can be combined into compositional models such as hidden Markov models, mixture models, and Bayesian networks. These more complicated models can themselves be used as components of larger models, such as building a mixture of Bayesian networks, or a Bayes classifier of hidden Markov models for the classification of sentences instead of fixed feature sets. This format for specifying models is augmented by a variety of sophisticated training strategies, such as multi-threaded parallelism, GPU support, semi-supervised learning, support for missing values, mini-batch learning, out-of-core learning for massive data sets, and any combination of the above. This tutorial will give a high-level overview of pomegranate and then focus on some specific applications.

“Reproducibility and Management of Data Research Teams” (Ciera Martinez) - Reproducibility is a continual topic in data-intensive research discourse. While the motivating principles and importance of computational reproducibility are becoming largely accepted, reproducibility is often discussed, taught, and practiced with a focus on the individual researcher. There is an increasing need to spread knowledge and experience of working reproducibly in a team setting, especially management practices. Management of data science teams is difficult for many reasons and we will start this discussion by identifying the challenges of team management. We will continue the discussion by sharing experiences managing computational research teams. Topics will include what works, what doesn't, and best practices for project organization and workflows. In addition, with the increasing data skills of undergraduate students, how do we best manage and properly mentor these individuals? How do we separate and manage computational tasks? What can we learn from industry? What can we learn from management of open source projects?

“ReproServer: Making reproducibility easier and less intrusive” (Vicky Steeves) - We will introduce ReproServer, an open source Web application that allows users to reproduce experiments from the comfort of their Web browser. ReproServer leverages ReproZip: users can unpack and interact with ReproZip bundles over the Web, without having to download any software, and finally share persistent links to the unpacked versions, useful for including in publications.

“A guided ethnographic reflection on a large-scale U.S.-China data science collaboration” (Marla Stuart, R. Stuart Geiger) - In this session, an ethnographer of data science will guide/coach/interview a social welfare data scientist about her experiences launching a computational lab between the U.S. and China. This lab aspires to foster collaboration between Chinese government officials, social work researchers, and data scientists to identify and solve complex social problems and promote social well-being. The collaboration was launched with much excitement and has made good progress, but has faced many unexpected challenges/diversions such as data sharing procedures, official meeting protocols, role definition, workplace expectations, revolving lab staff, reproducibility, cross-cultural communication, and conceptions of time and reality. This ethnographic interview will serve as a model of reflexivity and the audience will be invited to speak about their own experiences, where similar kinds of issues may have arisen in very different research projects. The goal is that attendees will leave having spent some time making sense of their own expectations of what research collaborations involve. During this conversation, we will reflect on many issues around doing applied data science, some of which might resonate with your own experiences.

“Teaching Data Intuition: A book” (Rebecca Barter) - Together with Senior BIDS Fellow Bin Yu, I am writing a book that teaches data science from a perspective of critical thinking and intuitive understanding. In contrast to most data science books that are focused primarily either on math or software, our book provides the reader with real (messy) data experience, and teaches analytic concepts using hands-on examples and explanatory graphics rather than math or code alone. I introduced the idea of this book at last year's summit. This year I will be giving an update and providing a glimpse at the book's content.

“To What Extent Can Biomedical and Health Data Be Made FAIR?” (Vikas Pejaver) - Modern data science is built upon the core value of making data findable, accessible, interoperable and reusable (FAIR).  However, the biomedical and health data sciences appear to be inherently incompatible with the current movement to share data openly, due to the sensitive and personal nature of such data.  As a result, this core value is seriously challenged, limiting the ability of the broader data science community to bring its expertise into this exciting area of research.  However, is this perspective more pessimistic than realistic?  The proposed session aims to address this motivating question by first orienting the broader data science community towards the technical, legal and ethical challenges in the biomedical sciences and the solutions that have been proposed to address some of these.  This will be followed by a discussion on whether a middle ground can be achieved in terms of protecting privacy while making data useful to researchers without affiliations to medical organizations.  If so, what is this middle ground?  How do we, as a community, adapt our technologies and best practices to achieve this middle ground?  Although these questions may not be substantially answered during the session, their discussion will serve as an ideal starting point.

“Unraveling Tissue Regeneration With Single-Cell RNA-Sequencing” (Diya Das) – Tissue homeostasis and regeneration are mediated by programs of adult stem cell renewal and differentiation. However, the mechanisms that regulate stem cell fates under such widely varying conditions are not fully understood. Using single cell techniques, we assessed the transcriptional changes associated with stem cell self-renewal and differentiation and followed the maturation of stem cell-derived clones using sparse lineage tracing in the regenerating mouse olfactory epithelium. Using a normalization approach implemented in scone, a clustering approach implemented in clusterExperiment and a lineage trajectory algorithm implemented in slingshot, we identify four distinct lineage trajectories arising from these activated stem cells. We are also able to determine that the transcription factor Sox2 (known to be necessary for neurogenesis) is specifically required for the formation of neuronal progenitors from activated cells. The conclusions of single-cell RNA-sequencing analysis are validated using clonal lineage tracing, demonstrating the usefulness of these algorithms for unraveling developmental processes.

“Using Artificial Neural Networks for Experimental Automation and Data Processing in Behavioral Experiments” (Callin Switzer) – Artificial neural networks, a group of highly successful and flexible machine learning methods, are commonly used on problems ranging from automatic speech recognition to image restoration. Using neural networks for conducting behavioral experiments, however, is still relatively rare.  In this presentation, I will describe several applications of neural network models in the field of animal behavior. One experiment requires annotation of time series data collected from an accelerometer.  The labeling process can be done by an expert, but with large sample sizes (over 10K), expert annotation is not feasible. I evaluated and applied several neural network architectures to this labeling. In my presentation, I will discuss the benefits and drawbacks of each method. In follow-up experiments, I plan to focus on automating experiments by using trained neural networks to classify behaviors in real time.

“Vizier DB: Streamlined data curation” (Heiko Mueller) - Vizier is a powerful new tool to streamline the data curation process. Vizier makes it easier and faster to explore and analyze raw data by combining a simple notebook interface with spreadsheet views of your data. We have implemented the Vizier Web API backend to manage and execute cleaning workflows, and the Web user interface to explore data and build the cleaning workflow. The ideas behind Vizier were presented as a poster at the MSDSE summit two years ago. We now have a prototype that I would like to demonstrate.

“Where (and How) Have Data Scientists Found Careers in Academia?” (Diya Das, David Beck, Stuart Geiger) - The BIDS Career Paths Working Group will present on outcomes of our discussions with university administration and will lead a discussion on future plans to increase support for sustainable career paths for staff researchers in data science. These plans include connecting with other data science institutes to understand how their academic structures have successfully (or not) been adapted to provide for longer-term, financially secure appointments for staff while providing a measure of academic freedom, all of which were key elements of a desirable career track for staff identified in our recently published report. From discussion with others, we hope to identify other avenues of approach to create more career trajectories for data scientists within academia.

“Workshopping Figures and Visualizations” (David Hogg) - Bring a figure from a paper you are working on, and we will discuss it and develop suggestions for improvements. Or just come and bring your best constructive self to participate. Or just come to watch and get ideas for your own work. This will be a collaborative working session with no talks or presentations!

“Zero To API” (Rob Fatland, Amanda Tan) - For some time we've been interested in answers and thoughts on: 'How quickly and easily can a single person stand up a REST interface to some data?' Sub-themes include motivation, one-click install web frameworks, (possibly lazy) data prep/formatting/cleaning, reproducibility, scale on both sides of the transaction, and confederation. The monster in the room is the question 'Is going to this kind of effort only worth doing for Big Data projects?' Note to organizers: As a lightning talk I can easily encourage interested people to chat later so that's a perfectly good way to go if you are full-up on breakouts.
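To give a sense of scale for the "how quickly" question, here is a minimal sketch of a read-only REST interface built with nothing but the Python standard library. It is not the presenters' approach; the `RECORDS` dataset, the `/records` routes, and the `route`, `Handler`, and `serve` names are all hypothetical and chosen only for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory dataset standing in for "some data".
RECORDS = {
    "1": {"site": "alpha", "temperature_c": 11.5},
    "2": {"site": "beta", "temperature_c": 9.0},
}


def route(path):
    """Map a request path to (HTTP status, JSON-serializable body)."""
    parts = [p for p in path.split("/") if p]
    if parts == ["records"]:
        return 200, list(RECORDS.values())
    if len(parts) == 2 and parts[0] == "records" and parts[1] in RECORDS:
        return 200, RECORDS[parts[1]]
    return 404, {"error": "not found"}


class Handler(BaseHTTPRequestHandler):
    """Serve GET requests by delegating to route()."""

    def do_GET(self):
        status, body = route(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Block and serve on localhost; call this to start the API."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling `serve()` and then requesting `http://127.0.0.1:8000/records/1` would return the first record as JSON. Keeping the routing in a plain function separate from the handler makes it easy to test without opening a socket; data prep, scale, and federation, the harder sub-themes listed above, are deliberately left out.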


MSDSE Program Coordinator

Micaela Parker - micaela at msdse dot org

Berkeley Summit contact

Marsha Fenner - mwfenner at berkeley dot edu

NYU Summit contact

Emily Corona - emily.mathis at nyu dot edu

UW Summit contact

Sarah Stone - exec-director at escience dot washington dot edu