SQLite

Kevin Thornton
Advanced Informatics, Week 5

What are we talking about?

  • Today, we focus on the cases where you have to make your own data files in your workflows.
  • I assume rectangular data with missing values allowed.
  • Emphasis on very large data, meaning that you cannot read them into memory:
#Fails: the file is too large to read into memory
x=read.table("mydata.gz",header=T)

What are we not talking about?

  • BAM/SAM/CRAM/BED/VCF
  • There is already a large community supporting tools that work with these formats.

Where you come in.

  • Often, you need to make a file that is in-between a BAM/SAM/CRAM/BED/VCF and your final analysis.
  • You have to decide what goes in it.
  • This file could be arbitrarily big.

Intermediate files -- what to do?

We want a file format that is:

  • Simple.
  • Usable with many different programming languages.
  • Fast.
  • Able to handle very large amounts of data.
  • Quick to inspect with command-line tools.

Standard option

  • .csv.gz (compressed file with delimited columns)
  • Really big files can be read with data.table:
library(data.table)
#fread can read from a decompression command, streaming the file
x=fread('gunzip -c myfile.csv.gz')
  • But really, really big files may still consume all RAM or be too slow to read.
  • Compressed text is also neither searchable/indexable nor easily viewable.

SQLite

  • A relational database format.
  • Data stored in binary format.
  • Data can be indexed for rapid lookup.
  • There are Python/R/C/C++/you-name-it interfaces.

SQL = Structured Query Language

  • A relatively simple means of asking for things from a database:
select columns from table where conditions are true group by other columns;
select temperature from climate_data where year > 1975 group by month;
  • See Buffalo, Ch. 13, for example queries; a short sketch of running such a query from R follows below.
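
As a preview of the R interfaces covered later in this deck, a minimal sketch of running the query above from R with DBI/RSQLite; the file name climate.sqlite3 and the climate_data table are illustrative assumptions:

library(DBI)
library(RSQLite)
#Open the (hypothetical) SQLite file containing a climate_data table
con <- dbConnect(SQLite(), "climate.sqlite3")
#dbGetQuery runs the SQL and returns the result as a data.frame
x <- dbGetQuery(con, "select temperature from climate_data where year > 1975 group by month;")
dbDisconnect(con)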

Key concept: Indexes

  • Certain columns may be treated as indexes.
  • “where” and “group by” are faster on indexed columns.
  • Rule of thumb: index columns should be integers (a short sketch of creating an index follows below).
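
Indexes are created with plain SQL; a minimal sketch via DBI, again using the hypothetical climate.sqlite3 file:

library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "climate.sqlite3")
#Index the year column so that "where year > 1975" can use the index
dbExecute(con, "create index if not exists year_idx on climate_data (year);")
dbDisconnect(con)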

Key concept: aggregation and group by

select avg(temperature) from climate_data where year > 1975 group by month;

See the SQLite documentation for the list of supported aggregate functions.

SQLite

  • It is database software that doesn't require privileged access to the system.
  • You can write to it in Python without knowing any SQL.
  • You can write to it and analyze tables in R without knowing any SQL.
  • That is rather amazing!

Why is this cool?

  • You can replace (gzipped) text files with “.db” files…
  • …and use them as if they were gzipped text files…
  • …and use the same files in Python…
  • …and have files bigger than you'd normally be able to process…
  • …even on HPC.
  • And you can create temporary databases in memory or on disk.

R packages (Install them!)

  • RSQLite: R interface to SQLite
  • DBI: generic database interface package. Supports RSQLite, MySQL, PostgreSQL.
  • dplyr: “A fast, consistent tool for working with data frame like objects, both in memory and out of memory.”

The dplyr package has a nice vignette showing off what it can do with databases.

dplyr: create a database

library(dplyr)
#Creates (or opens) the file my_db.sqlite3 on disk
my_db <- src_sqlite("my_db.sqlite3", create = T)
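
In more recent dplyr/dbplyr releases the recommended route is a plain DBI connection rather than src_sqlite(); a minimal sketch of the equivalent, assuming the dbplyr backend is installed:

library(dplyr)
library(DBI)
library(RSQLite)
#Newer style: open a DBI connection and hand it to dplyr
con <- dbConnect(SQLite(), "my_db.sqlite3")
#copy_to() and tbl() (next slides) accept this connection directly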

dplyr: copy a dataframe to a database

library(nycflights13)
flights_sqlite <- copy_to(my_db, flights, temporary = FALSE, indexes = list(
  c("year", "month", "day"), "carrier", "tailnum"))

dplyr: select from a database

#Connect to a db on disk
db <- src_sqlite("my_db.sqlite3")
#Create a 'handle' to relevant table
flights_data = tbl(db,'flights')
#Query uses STANDARD DPLYR
#and acts as if flights_data were an ordinary data frame!
query = flights_data %>% 
  select(year:day, dep_delay, arr_delay)
#Execute the query and get results
results=collect(query,n=3)

The collect thing

  • dplyr works on the principle of lazy evaluation.
  • Nothing is executed until “collect” is called.
  • dplyr creates the relevant SQL syntax for you (you can inspect it, as sketched below).
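
As a small sketch, you can inspect the SQL that dplyr built for the query object from the previous slide without executing it:

#Show the generated SQL; nothing is sent to the database until collect()
show_query(query)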

The results

print(results)
# A tibble: 3 × 5
   year month   day dep_delay arr_delay
  <int> <int> <int>     <dbl>     <dbl>
1  2013     1     1         2        11
2  2013     1     1         4        20
3  2013     1     1         2        33

Split/apply/combine

The syntax is 100% standard dplyr!

#table, col1, col2, and col3 are placeholders for your own table and columns
query = table %>%
  group_by(col1,col2) %>%
  summarise(mean_col3 = mean(col3))
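
A concrete (hedged) sketch using the flights table copied into the database earlier; carrier was one of the indexed columns, so the grouping can use the index:

#Mean arrival delay per carrier, computed inside SQLite
delay_query = flights_data %>%
  group_by(carrier) %>%
  summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE))
delays = collect(delay_query)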

Temporary databases in-memory

#":memory:" is SQLite's special name for a database held in RAM
db_in_mem <- src_sqlite(":memory:",create=T)

Why do this?

  • Avoids creating temporary files.
  • Automatically deleted when the session closes (a short usage sketch follows below).
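
A short usage sketch, assuming the db_in_mem source created above; the table name and the built-in mtcars data set are purely illustrative:

#Copy a data frame into the in-memory database and query it as usual
mtcars_mem <- copy_to(db_in_mem, mtcars, name = "mtcars")
mean_mpg <- mtcars_mem %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()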

Adding rows to a database

db_insert_into(db_in_mem$con, 'tablename', values = new_rows) #new_rows is a tibble or data.frame
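
db_insert_into() comes from older dplyr releases; a hedged alternative is to append the rows directly with DBI, where new_rows is the placeholder data.frame from the line above and db_in_mem$con is the in-memory connection:

library(DBI)
#append = TRUE adds the rows to the existing table instead of overwriting it
dbWriteTable(db_in_mem$con, "tablename", new_rows, append = TRUE)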

More info

See the vignettes/docs for RSQLite and dplyr.

Python

The module is sqlite3, part of the Python standard library:

import sqlite3

See the Python sqlite3 module documentation.

Python: create databases with Pandas

Pandas is a rough equivalent of R's data.frame plus dplyr.

import sqlite3
import pandas as pd

conn=sqlite3.connect("mydb.db")
df=pd.DataFrame()
#Do stuff that fills up df
df.to_sql('tablename',conn,if_exists='append')
conn.close()

See the pandas DataFrame.to_sql documentation.

Python: reading databases back in with Pandas

This requires the full SQL syntax:

import sqlite3
import pandas as pd
conn=sqlite3.connect("mydb.db")
x=pd.read_sql('select stuff from table where other_stuff is true group by even_more_stuff',conn)
conn.close()

See the pandas read_sql documentation.

Python: pandas/sqlite3 lets you then work in R

Useful if:

  • You find pandas operations confusing. (They can be!)
  • You love dplyr. (You should.)
  • Trying to plot in Python makes you wanna die. (Totally fair.)

Python/sqlite3 and "power user" features:

  • Custom aggregation functions; see Connection.create_aggregate in the sqlite3 module documentation.

Why does all of this matter?

  • Files with millions of rows are the new normal.
  • You need a file format that works with many different analysis platforms (read: programming languages).

Case study:

  • An 80 GB database for this week's lab.
  • Using dplyr to get some column means + ggplot2 to plot: 15 minutes, a bit under 4 GB of RAM.
  • Using RSQLite to load the entire file into memory, then dplyr to analyze: 26 minutes, over 113 GB of RAM.