# This takes a while.
# Go get a coffee or a beer
install.packages("tidyverse")
See here for the free version of the book. Buy it (or have your PI do so) if possible. Documentation is better when people are paid to write it!
Seriously: if you use R a lot, read this book and practice what is in it. Then look at the current websites for these projects and learn the new material that is not in the book! These tools will speed up your analyses, and you will write less code that is easier to read!
Within the tidyverse, extra special mention goes to dplyr, the "Swiss Army knife" package for data.frame processing.
data.table is a high-performance alternative to dplyr.
When you are dealing with large data sets, not using dplyr or data.table is probably costing you orders of magnitude of time. When dplyr first came out, I had analyses go from taking about a day to taking several minutes.
Python is a "general-purpose" programming language, so we look to add-on libraries for serious numerical/scientific computing (things that R has built in because it is designed for stats/data analysis). Usually these libraries are written in C or C++ to be fast, and Python lets you use them through a nice interface.
numpy, the "numeric Python" library, gives us arrays, matrices, linear algebra operations, vectorized calculations on arrays/matrices, random number generation, and more.
The fundamental type is the array:
import numpy as np
a = np.array([0, 3, -4, 11])
print(a)
## [ 0  3 -4 11]
a = np.identity(4)
print(a)
## [[1. 0. 0. 0.]
##  [0. 1. 0. 0.]
##  [0. 0. 1. 0.]
##  [0. 0. 0. 1.]]
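To make the vectorized-calculation and random-number-generation points concrete, here is a minimal sketch; the arrays below are made up purely for illustration:
import numpy as np

x = np.array([1.0, 4.0, 9.0, 16.0])
y = np.array([10.0, 20.0, 30.0, 40.0])

# vectorized arithmetic: element-wise, no explicit Python loop
print(x * y)
## [ 10.  80. 270. 640.]
print(np.sqrt(x))
## [1. 2. 3. 4.]

# matrix-vector multiplication with the @ operator
m = np.identity(4)
print(m @ x)
## [ 1.  4.  9. 16.]

# random number generation: three draws from a standard normal
# (values differ on every run because no seed is set)
rng = np.random.default_rng()
print(rng.normal(size=3))
None of these operations require writing an explicit Python loop, which is why numpy code tends to be both shorter and much faster than the pure-Python equivalent.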
A good book on numpy. (It also covers pandas and matplotlib.)
scipy is a scientific computing library built on top of numpy. It provides routines for numerical integration, optimization, interpolation, linear algebra, statistics, signal processing, and more.
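As a small, hedged sketch of the kind of thing this covers (the toy integral and objective function below are just for illustration; quad and minimize_scalar are standard scipy routines):
import numpy as np
from scipy import integrate, optimize

# numerical integration: integral of sin(x) from 0 to pi (exact answer is 2)
value, abs_error = integrate.quad(np.sin, 0, np.pi)
print(round(value, 6))
## 2.0

# one-dimensional minimization: (x - 3)^2 has its minimum at x = 3
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)
print(round(result.x, 6))
## 3.0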
pandas provides pandas.DataFrame, analogous to R’s data.frame. It also provides the “split/apply/combine” functionality of dplyr. Many would argue that dplyr’s interface is more readable, but that’s debatable once the analysis is sufficiently complex.
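Here is a minimal split/apply/combine sketch in pandas; the data frame and column names are made up for illustration:
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# split by "group", apply mean() within each group, combine the results
print(df.groupby("group")["value"].mean())
## group
## a    1.5
## b    4.0
## Name: value, dtype: float64
This is the pandas analogue of a dplyr group_by() followed by summarize().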
Python for Data Analysis, the book by Wes McKinney, the lead developer of pandas.