data.table - How lazy can R be? e.g. Lazily reading a portion of a file into R based on downstream operations
Synopsis:

I'm trying to understand whether there are, and what are, the limits to R's lazy-ness; i.e. walls that are independent of the implementation. I'll use the hopefully useful example below.
Setting the scene:

I have large flat files (1e9 to 1e11 rows), for example the file file.txt:

"chr\tstart\tend\nchr1\t1\t2\nchr1\t2\t3\nchr2\t3\t4\nchr2\t1\t2\nchr3\t2\t3\nchr3\t3\t4\n"
i.e.

readr::read_tsv("file.txt")
#> Source: local data frame [6 x 3]
#>
#>    chr start end
#> 1 chr1     1   2
#> 2 chr1     2   3
#> 3 chr2     3   4
#> 4 chr2     1   2
#> 5 chr3     2   3
#> 6 chr3     3   4
But for real files I want to sample and/or filter the contents as they come in (sometimes for performance, sometimes because the analysis targets only a subset of the observations).
Continuing with file.txt, I might do this in dplyr the following way:

library(dplyr)
df  <- readr::read_tsv("file.txt") %>% sample_n(3)
df2 <- readr::read_tsv("file.txt") %>% filter(chr == "chr2")
df3 <- readr::read_tsv("file.txt") %>% group_by(chr) %>% sample_n(1)
Or, equivalently, use data.table's dt <- fread("file.txt")[...].
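For concreteness, here is one way the three subsets above could be written in data.table syntax (a sketch; note that fread() still reads the entire file before the [...] part runs):

library(data.table)
dt  <- fread("file.txt")[sample(.N, 3)]                   # random sample of 3 rows
dt2 <- fread("file.txt")[chr == "chr2"]                   # filter on chr
dt3 <- fread("file.txt")[, .SD[sample(.N, 1)], by = chr]  # one random row per chr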
The question:

R is awesomely lazy, but can it lazily read in only what is specified by the downstream operations? I'm asking: is there a limit to R's laziness here, or is it purely a matter of implementation? And if it is the latter, does the implementer have to figure out the shortest path on their own, or does R have tools to help along the way?
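To make the "tools along the way" part concrete, here is a minimal sketch of what an implementer could build today with R's promise/substitute machinery. The lazy_read() name and interface are hypothetical, not an existing API: it captures an unevaluated filter expression and, for one recognized pattern, pushes it down into the read step via a shell pipe (Unix-alikes only):

# hypothetical helper: inspect a downstream filter before reading anything
lazy_read <- function(path, expr) {
  e <- substitute(expr)  # capture the filter unevaluated (R's lazy-evaluation hook)
  # recognize only the simple pattern chr == "<value>"; push it down to grep
  if (is.call(e) && identical(e[[1L]], as.name("==")) &&
      identical(e[[2L]], as.name("chr")) && is.character(e[[3L]])) {
    cmd <- sprintf("grep '^%s\t' %s", e[[3L]], path)  # grep subsets before R parses rows
    read.table(pipe(cmd), sep = "\t", col.names = c("chr", "start", "end"))
  } else {
    df <- read.table(path, sep = "\t", header = TRUE)  # fall back: read everything
    df[eval(e, df), ]
  }
}

df2 <- lazy_read("file.txt", chr == "chr2")  # only the chr2 rows ever reach R

So substitute() and promises do give the implementer hooks for inspecting downstream operations before any data is read; what R does not do is perform this kind of rewriting automatically across function boundaries.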
N.B.

I am aware that there are a number of ways of reading in a subset of the observations in a file, both on the R side and outside of R (R.utils::readTable, pipe(...), ...), before/as the data are read in. I'm not asking whether this has been implemented in R; I'm asking whether it can be, and if so, whether it would be R or the implementer doing it.
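For reference, the explicit (non-lazy) pre-filtering alluded to above might look like the following; the subsetting command is spelled out by hand at read time rather than inferred from downstream code:

# explicit pushdown, written out by hand at read time (Unix-alikes)
df2 <- readr::read_tsv(pipe("grep '^chr2\t' file.txt"),
                       col_names = c("chr", "start", "end"))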