#NYC_Taxi
Ok, just looked at the benchmark overview and I’m a little disappointed. Comparing performance on a dataset of ~30 GB tells me very little, given it could all fit into RAM on commodity hardware. Like, at that point the difference mostly comes down to chunking.
January 2, 2025 at 3:29 AM
open_dataset("nyc-taxi/")

nyc_taxi |>
  filter(payment_type == "Credit card") |>
  group_by(year, month) |>
  write_dataset("nyc-taxi-credit")

Input is 1.7 billion rows (70 GB), output is 500 million (15 GB). Takes 3-4 mins on my laptop 🙂

#rstats (2/3)
November 23, 2024 at 12:55 AM
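
[Editor's note: figures like the row counts above can be checked with the same lazy arrow/dplyr machinery, without pulling the data into R. A minimal sketch, assuming the partitioned "nyc-taxi-credit" output written by the pipeline in the post above:]

library(arrow)
library(dplyr)

# Open the partitioned output lazily and count rows; the query runs in
# the arrow engine and only the one-row summary is collected into R.
open_dataset("nyc-taxi-credit") |>
  summarise(rows = n()) |>
  collect()

# nrow() on an arrow Dataset also reports the row count without
# materialising the data in memory.
nrow(open_dataset("nyc-taxi-credit"))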
RT @djnavarro@fosstodon.org
My favourite trick for working with huge datasets in R: if your dataset is larger than memory and the query result is also larger than memory, you can still use dplyr/arrow pipelines. Example:

library(arrow)
library(dplyr)

nyc_taxi <- (1/3)
November 23, 2024 at 12:49 AM
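
[Editor's note: read bottom-to-top, the code fragments split across posts (1/3) and (2/3) assemble into a single pipeline. A sketch of the combined block, assuming "nyc-taxi/" is a directory of Parquet files:]

library(arrow)
library(dplyr)

# Open the 1.7-billion-row dataset lazily; nothing is read into RAM yet.
nyc_taxi <- open_dataset("nyc-taxi/")

# Filter, then write the result back out as a new dataset partitioned by
# year and month. The query streams through in batches, so neither the
# input nor the output ever has to fit in memory.
nyc_taxi |>
  filter(payment_type == "Credit card") |>
  group_by(year, month) |>
  write_dataset("nyc-taxi-credit")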