How lazy can R be? e.g. lazily reading a portion of a file into R based on downstream operations


synopsis:

I'm trying to understand whether there are limits to R's laziness and, if so, what they are, i.e. walls that exist independent of implementation. I use a (hopefully) useful example below.

setting the scene:

I have flat files (1e9 to 1e11 rows). Take the example file file.txt:

"chr\tstart\tend\nchr1\t1\t2\nchr1\t2\t3\nchr2\t3\t4\nchr2\t1\t2\nchr3\t2\t3\nchr3\t3\t4\n" 

i.e.

readr::read_tsv("file.txt")
#> Source: local data frame [6 x 3]
#>
#>    chr start end
#> 1 chr1     1   2
#> 2 chr1     2   3
#> 3 chr2     3   4
#> 4 chr2     1   2
#> 5 chr3     2   3
#> 6 chr3     3   4

But with real files I sample and/or filter the contents (sometimes for performance, sometimes because the analysis targets a subset of the observations).

Continuing with file.txt, that may look like this the dplyr way:

library(dplyr)
df  <- readr::read_tsv("file.txt") %>% sample_n(3)
df2 <- readr::read_tsv("file.txt") %>% filter(chr == "chr2")
df3 <- readr::read_tsv("file.txt") %>% group_by(chr) %>% sample_n(1)

Or, equivalently, using data.table: dt <- fread("file.txt")[...].
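For concreteness, the data.table counterparts of the three dplyr calls above would look roughly like this (standard data.table idioms, sketched from memory):

library(data.table)
dt  <- fread("file.txt")[sample(.N, 3)]                  # random sample of 3 rows
dt2 <- fread("file.txt")[chr == "chr2"]                  # keep only chr2 rows
dt3 <- fread("file.txt")[, .SD[sample(.N, 1)], by = chr] # one random row per chr

Note that in every case fread() still reads the whole file eagerly; the subsetting happens afterwards in memory, which is exactly what I'd like to avoid.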

the question:

R is awesomely lazy; can it lazily read in only what's specified by the downstream operations? I'm asking: is there a limit to R's laziness, or is it purely a matter of implementation? And if it's a matter of implementation, does the implementer have to figure out the shortest path themselves, or does R have tools to help along the way?
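My (possibly wrong) understanding is that R's built-in laziness is promise-based: delayedAssign() or an unevaluated function argument defers a whole expression, but R never inspects the downstream operations to shrink the read. So a lazy reader would have to capture those operations itself, e.g. by pushing a filter down to the shell. A minimal sketch of what I mean, assuming a Unix awk and using the hypothetical names lazy_read()/filter_eq() (not a real package API):

lazy_read <- function(path) {
  # Defer everything: just remember the path, read nothing yet.
  structure(list(path = path), class = "lazy_tsv")
}

filter_eq <- function(x, col, value) {
  # Hypothetical predicate pushdown: find the column index from the
  # header, then let awk select matching rows before R parses anything.
  header <- strsplit(readLines(x$path, n = 1), "\t")[[1]]
  i <- match(col, header)
  cmd <- sprintf("awk -F'\\t' 'NR == 1 || $%d == \"%s\"' %s",
                 i, value, shQuote(x$path))
  read.delim(pipe(cmd))
}

df2 <- filter_eq(lazy_read("file.txt"), "chr", "chr2")

The point of the sketch is that R does none of this for me: the "laziness" is built by hand, one verb at a time. Is that the inevitable shape of the answer, or does R offer something more general?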

n.b.

I am aware that there are a number of ways of reading in a subset of the observations in a file, both on the R side and outside of R (R.utils::readTable, pipe(...), ...), before/as they are read in. I'm not asking whether this has been implemented in R; I'm asking whether it can be, and if so whether R or the implementer has to do the work.
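For example, just to illustrate the "outside of R" route (assumes a Unix shell with GNU shuf available):

# keep the header, sample 3 data rows in the shell, parse only those
df <- read.delim(pipe("{ head -n 1 file.txt; tail -n +2 file.txt | shuf -n 3; }"))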
