Kevin Thornton
Advanced Informatics, Week 5
#Fails due to out of memory
x=read.table("mydata.gz",header=T)
We want a file format that is:
library(data.table)
x=fread('gunzip -c myfile.csv.gz')
select columns from table where conditions are true group by other columns;
select temperature from climate_data where year > 1975 group by month;
select avg(temperature) from climate_data where year > 1975 group by month;
See here for list of aggregators supported by SQL.
library(dplyr)
my_db <- src_sqlite("my_db.sqlite3", create = T)
library(nycflights13)
flights_sqlite <- copy_to(my_db, flights, temporary = FALSE, indexes = list(
c("year", "month", "day"), "carrier", "tailnum"))
#Connect to a db on disk
db <- src_sqlite("my_db.sqlite3")
#Create a 'handle' to relevant table
flights_data = tbl(db,'flights')
#Query uses STANDARD DPLYR
#and acts as if flights_data is a data.table!
query = flights_data %>%
select(year:day, dep_delay, arr_delay)
#Execute the query and get results
results=collect(query,n=3)
print(results)
# A tibble: 3 × 5
year month day dep_delay arr_delay
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 2 11
2 2013 1 1 4 20
3 2013 1 1 2 33
The syntax is 100% standard dplyr!
query = table %>%
group_by(col1,col2) %>%
summarise(mean_col3 = mean(col3))
db_in_mem <- src_sqlite(":memory:",create=T)
Why do this?
db_insert_into(db_in_mem$con,'tablename',values=tibble or data.frame)
Pandas is rough equivalent of R's data.frame + dplyr.
import sqlite3
import pandas as pd
conn=sqlite3.connect("mydb.db")
df=pd.DataFrame()
#Do stuff that fills up df
df.to_sql('tablename',conn,if_exists='append')
conn.close()
See docs.
This requires the full SQL syntax:
import pandas as pd
conn=sqlite3.connect("mydb.db")
x=pd.read_sql('select stuff from table where other_stuff is true group by even_more_stuff',conn)
conn.close()
See docs.
Useful if:
Case study: