Let’s compare the compute time needed for an equivalent operation between Python and R. The operation is to:
ingest a largeish csv file (~18GB) with 1,000,000 records and 1,000 columns of random normal variates plus one ID column,
group by ID (100 records per ID),
summarize as mean for each double/float column,
filter to IDs where at least one double/float column has a mean > 0.4, and
report how many such rows were found.
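The steps above can be tried at a much smaller scale before committing to the 18GB file. What follows is a minimal sketch using only the Python standard library, not the code used to build big.csv. The function names, the column names (x0, x1, ...), the seed, and the default sizes are all assumptions made for illustration, and the draws are assumed to be standard normal.

```python
import csv
import random
from collections import defaultdict


def write_test_csv(path, n_ids=50, rows_per_id=100, n_cols=20, seed=1):
    """Write a small csv shaped like the benchmark file: an id column plus
    n_cols columns of standard normal draws, rows_per_id rows for each id."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id"] + [f"x{j}" for j in range(n_cols)])
        for i in range(n_ids):
            for _ in range(rows_per_id):
                writer.writerow([i] + [rng.gauss(0.0, 1.0) for _ in range(n_cols)])


def count_ids_over_threshold(path, threshold=0.4):
    """Group by id, average every value column, and count the ids where
    at least one column mean exceeds the threshold."""
    sums = {}
    counts = defaultdict(int)
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            key, values = row[0], [float(v) for v in row[1:]]
            if key not in sums:
                sums[key] = [0.0] * len(values)
            totals = sums[key]
            for j, v in enumerate(values):
                totals[j] += v
            counts[key] += 1
    return sum(
        1
        for key, totals in sums.items()
        if any(t / counts[key] > threshold for t in totals)
    )


if __name__ == "__main__":
    write_test_csv("small.csv")
    print(f"{count_ids_over_threshold('small.csv')} rows returned")
```

A plain-Python reference like this is far too slow for the full file, but it gives an independent answer to check the polars and R queries against on a small input.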
In Python, we will use polars with lazy evaluation. In R, we will use dplyr, dtplyr, and tidytable. The latter two packages interpret dplyr syntax and deploy the data.table equivalent for efficiency.
Python with polars
import polars as pl
import polars.selectors as cs
from datetime import datetime

start=datetime.now()

q=(
    pl.scan_csv("big.csv")
    .group_by("id")
    .agg(cs.float().mean())
    .filter(pl.any_horizontal(cs.float()>.4))
)

table=q.collect()
elapsed=datetime.now()-start
print(f"{table.height} rows returned\nelapsed time for query: {elapsed}")
301 rows returned
elapsed time for query: 0:00:31.164063
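As a quick plausibility check on the 301 rows, the expected count can be worked out from the setup. This is a rough sketch that assumes the columns are independent standard normal draws (mean 0, standard deviation 1), which the post does not state explicitly. Under that assumption the mean of 100 draws is normal with standard deviation 0.1, so a threshold of 0.4 sits four standard deviations out.

```python
from statistics import NormalDist

n_ids = 10_000      # 1,000,000 records / 100 records per id
rows_per_id = 100
n_cols = 1_000
threshold = 0.4

# Assumption: every column is standard normal, so the mean of 100 draws
# is normal with mean 0 and standard deviation 1 / sqrt(100) = 0.1
mean_dist = NormalDist(mu=0.0, sigma=1.0 / rows_per_id ** 0.5)

p_col = 1.0 - mean_dist.cdf(threshold)          # one column mean exceeds 0.4
p_id = 1.0 - (1.0 - p_col) ** n_cols            # at least one of the 1,000 does
expected_rows = n_ids * p_id
sd_rows = (n_ids * p_id * (1.0 - p_id)) ** 0.5  # binomial standard deviation

print(f"expected rows: {expected_rows:.0f} (sd {sd_rows:.0f})")
```

The expectation comes out a little above 310 with a standard deviation of roughly 17, so a result of 301 is comfortably in range.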
R
Note that data.table::fread() is used to read the file in all R examples, as we’re really just focusing on the penalties of each data manipulation approach.
Plain dplyr
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(hms)
Attaching package: 'hms'
The following object is masked from 'package:lubridate':
hms
start<-Sys.time()

rows<-data.table::fread("big.csv") %>%
    group_by(id) %>%
    summarize(across(where(is.double),mean),.groups="keep") %>%
    filter(if_any(where(is.double),~.x>.4)) %>%
    nrow()

end<-Sys.time()

print(str_glue("{rows} rows returned\nelapsed time for query: {as_hms(end-start)}"))
301 rows returned
elapsed time for query: 00:01:57.947875
dtplyr
This is stylistically the R version that is most similar to the polars approach, but it does come with some downsides in that not all dplyr functionality is supported. In this example that is most obvious in the inability to use predicate-based tidyselect helpers such as where() in summarize() and filter(), so the column names are collected up front and passed through all_of() instead.
library(dtplyr,warn.conflicts=F)

start<-Sys.time()

big<-data.table::fread("big.csv")
varnames<-setdiff(colnames(big),"id")

rows<-lazy_dt(big) %>%
    group_by(id) %>%
    summarize(across(all_of(varnames),mean),.groups="keep") %>%
    filter(if_any(all_of(varnames),~.x>.4)) %>%
    collect() %>%
    nrow()

end<-Sys.time()

print(str_glue("{rows} rows returned\nelapsed time for query: {as_hms(end-start)}"))
301 rows returned
elapsed time for query: 00:00:43.045766
tidytable
This should be computationally comparable to the dtplyr approach as both are deploying data.table behind the scenes, but this approach has the benefit of preserving the plain dplyr syntax, including the ability to use tidyselect helpers.
start<-Sys.time()

rows<-data.table::fread("big.csv") %>%
    tidytable::group_by(id) %>%
    tidytable::summarize(tidytable::across(where(is.double),mean),.groups="keep") %>%
    tidytable::filter(tidytable::if_any(where(is.double),~.x>.4)) %>%
    nrow()

end<-Sys.time()

print(str_glue("{rows} rows returned\nelapsed time for query: {as_hms(end-start)}"))
301 rows returned
elapsed time for query: 00:00:56.428901
Conclusions
After rendering this several times on both my work PC (2022 Windows laptop with a 2.5GHz i7 and 32GB RAM) and my Mac at home (2023 M2 Max with 32GB memory), the general takeaway is that the plain vanilla dplyr approach is anywhere from 3 to 8 times slower than polars. On the Mac, dtplyr and polars are very close, with tidytable coming in only a tad slower. On the Windows machine, both dtplyr and tidytable take about 2 to 3 times as long as polars.
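For the single rendered run shown in this post, the ratios can be computed directly from the printed elapsed times. This is only a small arithmetic sketch. The timings are copied from the outputs above, the dictionary layout is my own, and one run says nothing about the spread across renders or machines.

```python
from datetime import timedelta

# Elapsed times copied from the printed outputs of the rendered run
elapsed = {
    "polars": timedelta(seconds=31.164063),
    "dplyr": timedelta(minutes=1, seconds=57.947875),
    "dtplyr": timedelta(seconds=43.045766),
    "tidytable": timedelta(seconds=56.428901),
}

baseline = elapsed["polars"]
for name, t in elapsed.items():
    ratio = t / baseline  # dividing two timedeltas gives a plain float
    print(f"{name:<10} {t.total_seconds():7.1f}s  {ratio:.1f}x polars")
```

In this run dplyr takes about 3.8 times as long as polars, dtplyr about 1.4 times, and tidytable about 1.8 times, which sits inside the ranges described above.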
Citation
BibTeX citation:
@online{couzens2025,
author = {Couzens, Lance},
title = {R Vs. {Python} {Query} {Compute} {Time} {Example}},
date = {2025-03-19},
url = {https://mostlyunoriginal.github.io/posts/2025-03-19-TidyR-to-PolarsPython/},
langid = {en}
}