The fancy printing of help files, etc., in this document is handled by
printr:
library(printr)
## Registered S3 method overwritten by 'printr':
## method from
## knit_print.data.frame rmarkdown
Today’s exercise requires some imagination on your part. We will not give you a massive data set. Rather, we provide you the foundation to efficiently handle some very large computations.
Some of you won’t “get it”. Not now at least. But, a recent paper of mine literally would not have been possible without the methods outlined here.
- …R/dplyr or pandas methods to process it.
- …samtools or bedtools or some other domain-specific software already do.

Let's reason through a few things:
Let’s fire up the venerable mtcars data set and take a look:
data(mtcars)
help(mtcars)
## Motor Trend Car Road Tests
##
## Description:
##
## The data was extracted from the 1974 _Motor Trend_ US magazine,
## and comprises fuel consumption and 10 aspects of automobile design
## and performance for 32 automobiles (1973-74 models).
##
## Usage:
##
## mtcars
##
## Format:
##
## A data frame with 32 observations on 11 (numeric) variables.
##
## [, 1] mpg Miles/(US) gallon
## [, 2] cyl Number of cylinders
## [, 3] disp Displacement (cu.in.)
## [, 4] hp Gross horsepower
## [, 5] drat Rear axle ratio
## [, 6] wt Weight (1000 lbs)
## [, 7] qsec 1/4 mile time
## [, 8] vs Engine (0 = V-shaped, 1 = straight)
## [, 9] am Transmission (0 = automatic, 1 = manual)
## [,10] gear Number of forward gears
## [,11] carb Number of carburetors
##
## Note:
##
## Henderson and Velleman (1981) comment in a footnote to Table 1:
## 'Hocking [original transcriber]'s noncrucial coding of the Mazda's
## rotary engine as a straight six-cylinder engine and the Porsche's
## flat engine as a V engine, as well as the inclusion of the diesel
## Mercedes 240D, have been retained to enable direct comparisons to
## be made with previous analyses.'
##
## Source:
##
## Henderson and Velleman (1981), Building multiple regression models
## interactively. _Biometrics_, *37*, 391-411.
##
## Examples:
##
## require(graphics)
## pairs(mtcars, main = "mtcars data", gap = 1/4)
## coplot(mpg ~ disp | as.factor(cyl), data = mtcars,
## panel = panel.smooth, rows = 1)
## ## possibly more meaningful, e.g., for summary() or bivariate plots:
## mtcars2 <- within(mtcars, {
## vs <- factor(vs, labels = c("V", "S"))
## am <- factor(am, labels = c("automatic", "manual"))
## cyl <- ordered(cyl)
## gear <- ordered(gear)
## carb <- ordered(carb)
## })
## summary(mtcars2)
The first several rows are:
head(mtcars)
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
The data are “rectangular”, meaning that they are set up like a spreadsheet.
A typical analysis of such data involves:
- *splitting* the data into groups, e.g., all (cylinder, gear) combos.
- *applying* a function, e.g., taking the mean of mpg for each group.
- *combining* the results back into a single table.

In base R, we can do these analyses using aggregate. These
analyses use R's formula notation, response ~ factors. This kind of
formula notation is common in statistical languages, and you'll see it
in many corners of R.
Let’s get the mean mpg after grouping by cyl:
aggregate(mpg ~ cyl, data=mtcars, mean)
| cyl | mpg |
|---|---|
| 4 | 26.66364 |
| 6 | 19.74286 |
| 8 | 15.10000 |
Now, get the means for all (cyl, gear) combos:
aggregate(mpg ~ cyl + gear, data=mtcars, mean)
| cyl | gear | mpg |
|---|---|---|
| 4 | 3 | 21.500 |
| 6 | 3 | 19.750 |
| 8 | 3 | 15.050 |
| 4 | 4 | 26.925 |
| 6 | 4 | 19.750 |
| 4 | 5 | 28.200 |
| 6 | 5 | 19.700 |
| 8 | 5 | 15.400 |
Okay, cool. The code isn’t too bad, and it uses a common syntax that is
all over the R world (the “formula”). Let’s look at the dplyr
version now.
The first example is:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
results = mtcars %>%
group_by(cyl) %>%
summarise(mean_mpg = mean(mpg))
## `summarise()` ungrouping output (override with `.groups` argument)
results
| cyl | mean_mpg |
|---|---|
| 4 | 26.66364 |
| 6 | 19.74286 |
| 8 | 15.10000 |
The second analysis becomes:
results = mtcars %>%
group_by(cyl, gear) %>%
summarise(mean_mpg = mean(mpg))
## `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
as.data.frame(results)
| cyl | gear | mean_mpg |
|---|---|---|
| 4 | 3 | 21.500 |
| 4 | 4 | 26.925 |
| 4 | 5 | 28.200 |
| 6 | 3 | 19.750 |
| 6 | 4 | 19.750 |
| 6 | 5 | 19.700 |
| 8 | 3 | 15.050 |
| 8 | 5 | 15.400 |
To summarise: both approaches give the same answers.

So, why bother?

- The dplyr variant will tend to be less verbose than the base version.
- For larger data sets, the dplyr solution will be vastly faster.
- dplyr has magic super powers that help solve the problem of data sets too big to fit in memory. This is the part that requires imagination, as our examples so far are in memory. Trust me, though, the payoff is coming…

The %>% code chains together a series of function calls. To read
it, go left to right starting after the =:

- Take the mtcars data.
- Pipe it to group_by, which is our split step.
- Pipe that result to summarise, which is our apply step.
- Assign the output to results, which is our combine step.

The as.data.frame business in the last example is only needed so
that printr gives this page a nice rendering. Technically, these
operations return a tibble, which is a fancified version of a
data.frame. You can read more about tibbles in the R for Data
Science book. Most of the time, though,
you can take them for granted and pretend you have a vanilla
data.frame. (That is intentional! A tibble is designed to be as
close to a drop-in replacement as possible.)

So, if our data can be arranged as one or more "tidy" spreadsheets and our analysis will be "split/apply/combine", then the solution to our problem is a relational database.
To oversimplify a bit, these are spreadsheets/data frames stored on disk. Key examples of open-source software versions are:

- MySQL
- PostgreSQL
- SQLite (the sqlite3 program)
All of these have SQL in the name, which means “structured query
language”. This is an “English-adjacent” language for performing
split/apply/combine analysis on databases. Each program uses a slightly
different SQL “dialect”, which is quite irritating.
The first two tools (MySQL and PostgreSQL) require administrative privileges to use. The latter, sqlite3, does not, making it amenable to use in your pipelines. You give up some speed, though.
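To make the "structured query language" idea concrete, here is a minimal sketch of a split/apply/combine query, run through Python's built-in sqlite3 module against a throwaway in-memory database (the table name and values here are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway in-memory database
con.execute("CREATE TABLE cars (cyl INTEGER, mpg REAL)")
con.executemany("INSERT INTO cars VALUES (?, ?)",
                [(4, 22.8), (4, 30.4), (6, 21.0), (8, 15.2)])

# GROUP BY is the split step, AVG is the apply step,
# and the returned rows are the combined result.
rows = con.execute(
    "SELECT cyl, AVG(mpg) AS mean_mpg FROM cars GROUP BY cyl ORDER BY cyl"
).fetchall()
print(rows)  # one (cyl, mean_mpg) tuple per group
con.close()
```

The same query, modulo dialect quirks, would run on MySQL or PostgreSQL as well.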
This is the payoff: dplyr can analyze data in a relational database
using the same code that you use for in-memory data frames!!!
This capability of dplyr is due to some very impressive software
engineering! It will write the SQL syntax for you. It can do so for
several SQL dialects.
The only substantive difference is that your input to a dplyr analysis
will be a connection to a database table rather than a data frame that’s
already in memory.
You can only perform analyses if the dplyr “verbs” you use have a
direct analog in the SQL dialect of your database. This limitation
will not affect you now, but may in the future.
First, get our data into an sqlite3 database:
library(dbplyr)
##
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
##
## ident, sql
# Create a connection ("con") to a database file:
con <- DBI::dbConnect(RSQLite::SQLite(), "mtcars.sqlite3")
# Write our data frame to the database in a table called "mtcars"
DBI::dbWriteTable(con, "mtcars", mtcars)
# Disconnect from our database
DBI::dbDisconnect(con)
The dbplyr docs use the following code, which did not work for me:
copy_to(con, mtcars)
Let's check that we have a database whose file size is > 0:
ls -lhrt *.sqlite3
## -rw-r--r-- 1 molpopgen molpopgen 12K Jan 20 12:01 mtcars_from_pandas.sqlite3
## -rw-r--r-- 1 molpopgen molpopgen 8.0K Jan 20 12:04 mtcars.sqlite3
con <- DBI::dbConnect(RSQLite::SQLite(), "mtcars.sqlite3")
mtcars2 <- tbl(con, "mtcars")
g = mtcars2 %>%
group_by(cyl) %>%
summarise(mean_mpg=mean(mpg))
The object g is not (yet) a dplyr result! It is an SQL query:
g %>% show_query()
## Warning: Missing values are always removed in SQL.
## Use `mean(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
## <SQL>
## SELECT `cyl`, AVG(`mpg`) AS `mean_mpg`
## FROM `mtcars`
## GROUP BY `cyl`
We need to collect it to execute the query:
result = g %>% collect()
as.data.frame(result)
| cyl | mean_mpg |
|---|---|
| 4 | 26.66364 |
| 6 | 19.74286 |
| 8 | 15.10000 |
Well, the point is that it (mostly) looks the same! However, there is a
big difference: the query is executed by the sqlite3 database
software. Because of that, the entire database is never loaded into
memory.
In other words:

- In-memory data? Use dplyr.
- Data too big for memory? Use dplyr (to analyze stuff in a database).
- One toolkit can handle all of your needs.
Python users can play these games, too.
library(reticulate)
Grab mtcars into Python:
mtcars = r.mtcars
mtcars.head()
## mpg cyl disp hp drat ... qsec vs am gear carb
## Mazda RX4 21.0 6.0 160.0 110.0 3.90 ... 16.46 0.0 1.0 4.0 4.0
## Mazda RX4 Wag 21.0 6.0 160.0 110.0 3.90 ... 17.02 0.0 1.0 4.0 4.0
## Datsun 710 22.8 4.0 108.0 93.0 3.85 ... 18.61 1.0 1.0 4.0 1.0
## Hornet 4 Drive 21.4 6.0 258.0 110.0 3.08 ... 19.44 1.0 0.0 3.0 1.0
## Hornet Sportabout 18.7 8.0 360.0 175.0 3.15 ... 17.02 0.0 0.0 3.0 2.0
##
## [5 rows x 11 columns]
pandas’ DataFrame object has the
split/apply/combine stuff built in.
Repeat our split/apply/combine analyses:
mtcars.groupby(['cyl'])['mpg'].mean()
## cyl
## 4.0 26.663636
## 6.0 19.742857
## 8.0 15.100000
## Name: mpg, dtype: float64
And the other one:
mtcars.groupby(['cyl', 'gear'])['mpg'].mean()
## cyl gear
## 4.0 3.0 21.500
## 4.0 26.925
## 5.0 28.200
## 6.0 3.0 19.750
## 4.0 19.750
## 5.0 19.700
## 8.0 3.0 15.050
## 5.0 15.400
## Name: mpg, dtype: float64
Simple:
import sqlite3 # Built into the Python language!
con = sqlite3.connect("mtcars_from_pandas.sqlite3")
# Add our data frame to the mtcars table in the database
mtcars.to_sql("mtcars", con)
con.close()
Check that it worked:
ls -lhrt *.sqlite3
## -rw-r--r-- 1 molpopgen molpopgen 8.0K Jan 20 12:04 mtcars.sqlite3
## -rw-r--r-- 1 molpopgen molpopgen 12K Jan 20 12:04 mtcars_from_pandas.sqlite3
NOTE: The two database files differ in size! This is because R
and pandas make different default decisions about the column data
types. Thus, your database may be bigger if you create it from pandas
without changing the defaults, meaning you may be using more disk space
than you need! It is an "exercise for the interested reader" to learn
how to change those defaults.
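As a hint for that exercise, pandas' to_sql accepts a dtype argument mapping column names to SQL type strings. Here is a minimal sketch with a made-up two-column data frame, using an in-memory database for illustration:

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"cyl": [4, 6, 8], "mpg": [26.7, 19.7, 15.1]})

con = sqlite3.connect(":memory:")  # in-memory database for illustration
# The dtype argument pins down the SQL column types explicitly,
# overriding pandas' defaults:
df.to_sql("mtcars", con, index=False,
          dtype={"cyl": "INTEGER", "mpg": "REAL"})

# Inspect the schema that was actually created:
schema = {row[1]: row[2] for row in con.execute("PRAGMA table_info(mtcars)")}
print(schema)
con.close()
```

Storing integer columns as INTEGER rather than REAL is one way the R-generated file ends up smaller.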
This is where R is ahead of the curve. Sadly, to get data back into a
pandas.DataFrame, we must use “raw” SQL:
import pandas as pd
con = sqlite3.connect("mtcars_from_pandas.sqlite3")
df = pd.read_sql("select * from mtcars", con)
df.head()
## index mpg cyl disp hp ... qsec vs am gear carb
## 0 Mazda RX4 21.0 6.0 160.0 110.0 ... 16.46 0.0 1.0 4.0 4.0
## 1 Mazda RX4 Wag 21.0 6.0 160.0 110.0 ... 17.02 0.0 1.0 4.0 4.0
## 2 Datsun 710 22.8 4.0 108.0 93.0 ... 18.61 1.0 1.0 4.0 1.0
## 3 Hornet 4 Drive 21.4 6.0 258.0 110.0 ... 19.44 1.0 0.0 3.0 1.0
## 4 Hornet Sportabout 18.7 8.0 360.0 175.0 ... 17.02 0.0 0.0 3.0 2.0
##
## [5 rows x 12 columns]
To do our analyses:
df = pd.read_sql("select cyl, avg(mpg) from mtcars group by cyl", con)
df.head()
## cyl avg(mpg)
## 0 4.0 26.663636
## 1 6.0 19.742857
## 2 8.0 15.100000
df = pd.read_sql("select cyl, gear, avg(mpg) from mtcars group by cyl, gear", con)
df.head()
## cyl gear avg(mpg)
## 0 4.0 3.0 21.500
## 1 4.0 4.0 26.925
## 2 4.0 5.0 28.200
## 3 6.0 3.0 19.750
## 4 6.0 4.0 19.750
Most of the time, I'd use dplyr here instead. The SQL shown here is
deceptively simple. Trust me: it gets complex fast!
- dplyr and pandas are both capable of joining multiple data frames together. dplyr is able to join SQL tables together.
- To rerun these examples, you'll need to rm -f your database files or learn to drop tables. (That's another "exercise for the reader".)

Repeat the above code blocks! Couldn't be simpler, right? (Famous last words.)
At the very minimum, you'll need:

- dplyr
- dbplyr
- RSQLite
- reticulate

You should have pandas from last week's exercise.