Scientific computing faces challenges in handling massive datasets. Current tools for managing and analyzing these datasets fail to scale and are often hard to use and deploy. The goal of the workshop is to understand the software infrastructure needs at the intersection of big data and biocomputing. We are particularly interested in which programming abstractions, tools, and programming languages are best suited to dealing with big data, and in how these languages and tools could be supported in the context of an NSF software institute.

Funded by the NSF Software Infrastructure for Sustained Innovation (SI2) program.

Posters: small and large.

Slides for all the talks

Organization

Talks


Challenges and Pragmatic Solutions to Statistical Analysis of High-throughput Genomic Data
Martin Morgan
The R / Bioconductor project (http://bioconductor.org) provides a proving ground for computational approaches to handling high-volume genomic data. Many investigators have primary interests and talent in domains other than computer science. Their research questions raise transient analytic needs that make it difficult to justify narrowly-focused investment in sophisticated computational methods or machinery. Very diverse computational environments make many solutions idiosyncratic. This leads us toward development of reusable infrastructure to support simple and standardized models of high-throughput computation, relying on opportunistic community standards, and offering consistently-configured computational environments for scalable evaluation.
Martin Morgan is a Senior Staff Scientist at the Fred Hutchinson Cancer Research Center (FHCRC) in Seattle. He is the director of the Bioconductor project.


Challenges and novel approaches for large-scale sequence alignment and phylogenetic estimation
Tandy Warnow
Tandy Warnow is the David Bruton Jr. Centennial Professor of Computer Sciences at the University of Texas at Austin. Her research combines mathematics, computer science, and statistics to develop improved models and algorithms for reconstructing complex and large-scale evolutionary histories in both biology and historical linguistics. She received the NSF Young Investigator Award in 1994, the David and Lucile Packard Foundation Award in Science and Engineering in 1996, a Radcliffe Institute Fellowship in 2006, and a Guggenheim Foundation Fellowship for 2011. From 2009 to 2011 she was the Chair of the NIH BDMA study section, and she is currently an NSF program officer working on BIGDATA.

Fostering sensitivity analysis for genome-scale inference
V. Carey
Workflows in bioinformatics are holistic views of highly fragmented data resources and algorithmic processes. This talk discusses how to strike a balance between managerial values of data and algorithm holism, and performance values of support for arbitrary fragmentation of data and algorithm elements. The balancing act is illustrated in the investigation of sensitivity of genome-wide searches for genetic determinants of gene expression variation to values of a few basic tuning parameters.
Vince Carey is Associate Professor of Medicine (Biostatistics), Harvard Medical School. He is co-founder and core member of bioconductor.org, was inaugural editor of The R Journal, and has been contributing methods and tools for longitudinal and clustered data analysis for 20 years.

Adding large data support to R
Luke Tierney
The R language and framework is a widely used platform for the exploration and analysis of data. Originally designed for data sets of moderate size, R is now being enhanced to support computations on much larger data sets. This talk describes current efforts in this direction and discusses some of the issues and challenges involved.
Luke Tierney is the Ralph E. Wareham Professor of Mathematical Sciences and chair of the Department of Statistics and Actuarial Sciences at the University of Iowa. He is a member of the R Core developer group; his main research focus is on computing environments for data analysis and computational methods for Bayesian statistics.

Safe Programming in Dynamic Languages
Jeff Foster
Today, many important software systems, including those that operate on big data, have components that are written in dynamic languages such as Ruby, Python, and Perl. Dynamic languages are appealing because they are lightweight; provide flexible support for many different coding idioms; and encourage rapid prototyping. However, at the same time, dynamic languages lack some features, notably static type systems, that traditionally help in building large, robust software systems. This talk will discuss our experience over the last several years in developing techniques to bring some of the benefits of static typing to Ruby, and how these ideas may apply to software that manipulates big data. We will also discuss some of the lessons we learned in applying our techniques to a wide range of existing systems.
Jeff Foster is an Associate Professor in the Department of Computer Science at the University of Maryland, College Park and in the University of Maryland Institute for Advanced Computer Studies. He is also Associate Chair for Graduate Education in the Department of Computer Science. His research focuses on developing programming languages and software engineering approaches to making software easier to write and more reliable, secure, and available.


BIG: The Billion Genome Project
Jong Bhak
The BiG project is a human genome project that aims to sequence all the human beings on Earth and process all the genomes, transcriptomes, epigenomes, and microbiomes associated with the genomes. It is an open project and invites everyone on Earth. It accommodates PGP (personal genome project) protocols that can be adapted to each participating nation. The informatics pipeline (BiG project Information Tools: BIT) will be built to store, analyze, and distribute the data. Possible data processing issues for the BiG project will be discussed. This presentation will be an opening for questions, suggestions, and opinions on how to process such large-scale genomic data.
Jong Bhak has been the director of the Personal Genomics Institute since 2010. He has worked on a broad range of bioinformatics problems with interests in computer hardware, operating systems, programming, and applications. He holds a PhD in bioinformatics from the MRC Centre, Cambridge. He worked as a research fellow at the EBI and as a group leader researching geronto-genomics at MRC-DUNN. In 2003, he became an associate professor at KAIST, Korea. His group analyzed the first Korean human genome and the first publicly available female genome.

Scaling Data Analytics
Jan Vitek
Big data requires changes to the way we program. This talk will look at abstractions and language support for operating over very large data. I will present a number of projects related to the R programming language. We have built TraceR, a tracing infrastructure that allows developers to understand performance issues in their code. We are developing a new R virtual machine, called FastR, that leverages the Java platform for enhanced performance and scalability. Lastly, the ReactoR project extends FastR with support for large data.
Jan Vitek is a Professor of Computer Science and Faculty Scholar at Purdue University. He has extensive experience in programming language design and implementation. He led the development of the Ovm real-time Java virtual machine and the Thorn distributed programming language. He is currently the chair of ACM SIGPLAN.

Bringing Performance and Scalability to Dynamic Languages
Mario Wolczko
High performance is typically associated with statically compiled languages such as C and its offshoots, and Fortran, and with the managed languages that have become widespread in business computing (Java, C#). But there is no reason that dynamic languages cannot be close in performance to these, and at Oracle Labs we are building a dynamic language virtual machine framework to achieve this. This talk will describe the work in progress, and how it builds on earlier work on both static and dynamic languages.
Mario Wolczko is an Architect at Oracle Labs. His research interests include virtual machines and dynamic languages, computer architecture and memory systems design, garbage collection, and performance instrumentation and analysis.


Abstractions for parallel programming
David Padua
David Padua is the Donald Biggar Willett Professor of Engineering at the University of Illinois at Urbana-Champaign, where he has been a faculty member since 1985. He is a member of the editorial boards of the Journal of Parallel and Distributed Computing and the International Journal of Parallel Programming, and Editor in Chief of the Encyclopedia of Parallel Computing. His areas of interest include compilers, programming tools, machine organization, and parallel computing. He is a fellow of the IEEE and the ACM.

Sponsored by:

National Science Foundation (NSF)