Important libraries for “data science”

R

The entire tidyverse

# This takes a while.
# Go get a coffee or a beer
install.packages("tidyverse")

See here for the free version of the book. Buy it (or have your PI do so) if possible. Documentation is better when people are paid to write it!

Seriously: if you use R a lot, read this book and practice what is within it. Look at the current websites for these projects, and learn the new stuff that’s not in the book! This stuff will speed up your analyses and you will write less code that is more readable!

dplyr

Within the tidyverse, special mention goes to dplyr, the “Swiss Army knife” package for processing data.frame objects.

data.table

data.table is an alternative to dplyr.

Speed!

When you are dealing with large data sets, not using dplyr or data.table is probably costing you orders of magnitude in run time. When dplyr first came out, I had analyses go from about a day to “several minutes”.

Interactive graphics

High-performance computing

  • Rcpp lets you write R functions in C++. Rcpp is a mature project. Use this when you have a performance bottleneck.
  • extendr is to Rust as Rcpp is to C++. This project is just a few weeks old!

Python

Python is a “general purpose” programming language. Thus, we look to add-on libraries for serious numerical/scientific computing (stuff that R has built in because it is designed for stats/data analysis). Usually, these libraries are written in C or C++ to be fast, and Python then lets you use them through a nice interface.

numpy

“Numeric Python”, or numpy, gives us arrays, matrices, linear algebra operations, vectorized calculations on arrays/matrices, random number generation, and probably some other stuff.

The fundamental type is the array:

import numpy as np

a = np.array([0, 3, -4, 11])
print(a)
## [ 0  3 -4 11]
a = np.identity(4)
print(a)
## [[1. 0. 0. 0.]
##  [0. 1. 0. 0.]
##  [0. 0. 1. 0.]
##  [0. 0. 0. 1.]]
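
The other features listed above can be sketched briefly. This is a minimal illustration, not a tour of the library: the matrix M, the vector b, and the seed 42 are arbitrary values chosen for the example.

```python
import numpy as np

# Vectorized arithmetic: the operation applies to every element,
# with no explicit Python loop.
a = np.array([0, 3, -4, 11])
doubled = 2 * a
squares = a ** 2

# Linear algebra: solve the linear system M x = b.
# M and b are made-up values for illustration.
M = np.array([[2.0, 0.0], [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = np.linalg.solve(M, b)

# Random number generation: a seeded generator gives reproducible draws.
rng = np.random.default_rng(42)
draws = rng.normal(loc=0.0, scale=1.0, size=5)

print(doubled)
print(squares)
print(x)
print(draws.shape)
```

The vectorized forms are where the speed comes from: the loop over elements runs in compiled code rather than in the Python interpreter.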

A good book on numpy. (It also covers pandas and matplotlib.)

scipy

A scientific computing library built using numpy. Provides:

  • Numerical optimization
  • Regression
  • More random number features
  • Lots of other things!
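
A rough sketch of the first three items, assuming a reasonably recent scipy. The objective function f and the toy data are invented for the example; they are not from any real analysis.

```python
import numpy as np
from scipy import optimize, stats

# Numerical optimization: minimize f(x) = (x - 3)^2,
# whose minimum is at x = 3. f is a made-up objective.
def f(x):
    return (x[0] - 3.0) ** 2

result = optimize.minimize(f, x0=[0.0])

# Regression: fit a straight line to noise-free toy data
# lying exactly on y = 2x + 1.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
fit = stats.linregress(x, y)

# More random number features: draws from a Poisson distribution,
# using a seeded numpy generator for reproducibility.
rng = np.random.default_rng(0)
counts = stats.poisson.rvs(mu=4.0, size=100, random_state=rng)

print(result.x)
print(fit.slope, fit.intercept)
print(counts[:10])
```

Note that the scipy functions take and return numpy arrays, which is what "built using numpy" means in practice.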

pandas

pandas provides pandas.DataFrame, analogous to R’s data.frame. It also provides the “split/apply/combine” functionality of dplyr. Many would argue that dplyr’s interface is more readable, but that’s debatable once the analysis is sufficiently complex.
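
As a small sketch of split/apply/combine in pandas: split the rows by a grouping column, apply a summary to each group, and combine the results into a new DataFrame. The column names group and value, and the data, are hypothetical.

```python
import pandas as pd

# Toy data; the column names and values are invented for illustration.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Split by "group", apply mean and count to "value",
# then combine into one row per group.
summary = (
    df.groupby("group")
      .agg(mean_value=("value", "mean"), n=("value", "count"))
      .reset_index()
)

print(summary)
```

This is roughly what a group_by() followed by summarise() does in dplyr, so readers coming from R can map one onto the other.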

Book by the lead developer of pandas.

Machine learning tools

Interactive graphics

High-performance computing

  • pybind11 lets you write Python libraries in C++11/14/17. Think Rcpp, but for Python. Mature project. Key tool in my lab!
  • PyO3 is the Rust analog of pybind11.