Open, Reproducible and Exploratory Data Science

August 21, 2014 - 3:30pm to 5:00pm
Professor Brian Granger, Physics, Cal Poly State University
Center for Complex Network Research, Northeastern University
Boston, MA

Data science involves the application of scientific methodologies to data-driven computations across a wide range of fields. As Drew Conway has clarified, it sits at the intersection of hacking/programming, math/statistics and domain-specific expertise. Because data science is data- and computing-centric, it requires powerful software tools. In this talk I will describe open source software tools for data science that i) are built with open languages, architectures and standards, ii) promote reproducibility and iii) are optimized for exploratory data analysis and visualization.


In particular, I will describe the Jupyter Notebook (formerly the IPython Notebook), an open-source, web-based interactive computing environment for Python, R, Julia and other programming languages. The Notebook enables users to create documents that combine live code, narrative text, equations, images, video and other content. These notebook documents provide a complete and reproducible record of a computation, its results and accompanying material. They can be shared over email, Dropbox or GitHub, or converted to static formats such as PDF/LaTeX, HTML and Markdown. Most importantly, the Jupyter Notebook is built on top of an open architecture for interactive computing that is completely language neutral, allowing it to serve as a foundation for other data science projects and products.
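
As a rough illustration of what such a notebook document looks like on disk, the sketch below builds a small one using only the Python standard library. It assumes the JSON layout of version 4 of the notebook format; the helper names (make_notebook, markdown_cell, code_cell) are hypothetical and are not part of any Jupyter API.

```python
import json


def markdown_cell(text):
    """Return a narrative-text cell (hypothetical helper, not a Jupyter API)."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source):
    """Return an unexecuted code cell: no outputs and no execution count yet."""
    return {
        "cell_type": "code",
        "metadata": {},
        "execution_count": None,
        "outputs": [],
        "source": source,
    }


def make_notebook(cells):
    """Wrap cells in a minimal document, assuming the version 4 JSON layout."""
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},
        "cells": cells,
    }


nb = make_notebook([
    markdown_cell("# Analysis\nNarrative text with an equation: $E = mc^2$"),
    code_cell("print(1 + 1)"),
])

# A notebook file is plain JSON text, which is why it is easy to share
# over email or to keep under version control.
text = json.dumps(nb, indent=1)
```

Because the document is ordinary JSON, it can be inspected or produced by tools written in any language, which is one practical consequence of the open, language-neutral design described above.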


One of the most important aspects of data science is interacting with data. This involves iterative cycles of visualization, computation and human-computer interaction to extract understanding and make predictions. Jupyter now provides an architecture for interactive JavaScript/HTML/CSS widgets that allows users to interact with their data in a direct and simple way by automatically creating appropriate user interfaces for Python objects and functions. This allows the power of modern JavaScript libraries (d3.js, leaflet.js, backbone.js, etc.) to be leveraged in Python/Julia/R-driven computations.
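
The idea of deriving a user interface from a Python function can be sketched with the standard library alone. The example below inspects a function's keyword defaults and picks a control type for each one. It is an illustration of the concept only: the function describe_controls, the control names and the type-to-control mapping are assumptions made for this sketch, not the widget architecture's actual implementation.

```python
import inspect


def describe_controls(func):
    """Map each defaulted parameter of func to a plausible UI control name.

    Hypothetical helper: the mapping is an illustrative assumption.
    """
    controls = {}
    for name, param in inspect.signature(func).parameters.items():
        default = param.default
        if default is inspect.Parameter.empty:
            continue  # no default value, so nothing to infer a control from
        # bool must be tested before int because bool is a subclass of int.
        if isinstance(default, bool):
            controls[name] = "checkbox"
        elif isinstance(default, int):
            controls[name] = "integer slider"
        elif isinstance(default, float):
            controls[name] = "float slider"
        elif isinstance(default, str):
            controls[name] = "text box"
        elif isinstance(default, list):
            controls[name] = "dropdown"
    return controls


def plot_wave(frequency=2.0, cycles=3, show_grid=True, title="wave"):
    """Stand-in for a user's analysis function; the body is irrelevant here."""
    return frequency * cycles


controls = describe_controls(plot_wave)
for name, kind in controls.items():
    print(f"{name}: {kind}")
```

In a real notebook, each inferred control would be rendered in the browser and wired back to the function, so that moving a slider re-runs the computation with the new value.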


Throughout the talk, I will provide examples of how Jupyter/IPython is being used across a wide range of fields including science, engineering, social sciences, finance, computer science, industry, publishing and journalism. Jupyter/IPython is funded by the Alfred P. Sloan Foundation, the Simons Foundation, the National Science Foundation, Microsoft and Rackspace.



Brian Granger is an Associate Professor of Physics at Cal Poly State University in San Luis Obispo, CA. He has a background in theoretical atomic, molecular and optical physics, with a Ph.D. from the University of Colorado. His current research interests include quantum computing, symbolic computer algebra, parallel and distributed computing, and interactive computing environments for scientific computing and data science. He is a lead developer on the IPython project, a co-founder of Project Jupyter, the creator of PyZMQ and an active contributor to a number of other open source projects focused on scientific computing in Python. He is @ellisonbg on Twitter and GitHub.