Let’s compare the compute time needed for an equivalent operation between Python and R. The operation is to:

ingest a largeish csv file (~18GB) with 1,000,000 records and 1,000 columns of random normal variates plus one ID column,
group by ID (100 records per ID),
summarize as mean for each double/float column,
filter to IDs where at least one double/float column has a mean > 0.4, and
report how many such IDs (rows of the summarized table) were found.
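To make the steps above concrete, here is a minimal standard-library sketch that writes a scaled-down file of the same shape and runs the same group, mean, filter, and count logic in plain Python. It is not how big.csv was produced (the post does not show that); the function names, the x0, x1, ... column names, the seed, and the small sizes are all assumptions chosen for illustration.

```python
import csv
import random
import tempfile
from pathlib import Path


def write_sample_csv(path, n_ids=50, rows_per_id=100, n_cols=5, seed=0):
    """Write an id column plus n_cols standard-normal columns (hypothetical layout)."""
    rng = random.Random(seed)
    header = ["id"] + [f"x{j}" for j in range(n_cols)]  # assumed column names
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for i in range(n_ids):
            for _ in range(rows_per_id):
                writer.writerow([i] + [rng.gauss(0.0, 1.0) for _ in range(n_cols)])


def count_ids_over_threshold(path, threshold=0.4):
    """Group by id, average every non-id column, count ids with any mean > threshold."""
    sums, counts = {}, {}
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        id_pos = header.index("id")
        for row in reader:
            key = row[id_pos]
            values = [float(v) for k, v in enumerate(row) if k != id_pos]
            if key not in sums:
                sums[key] = [0.0] * len(values)
                counts[key] = 0
            # Accumulate per-column running sums for this id.
            sums[key] = [s + v for s, v in zip(sums[key], values)]
            counts[key] += 1
    # An id counts once if any of its column means exceeds the threshold.
    return sum(
        any(s / counts[key] > threshold for s in col_sums)
        for key, col_sums in sums.items()
    )


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "small.csv"
    write_sample_csv(sample)
    n_hits = count_ids_over_threshold(sample)
    print(f"{n_hits} ids returned")
```

A plain-Python reference like this is far too slow for the full file, but on a small sample it gives a count that each of the faster approaches below should reproduce.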

In Python, we will use polars with lazy evaluation. In R, we will use dplyr, dtplyr, and tidytable. The latter two packages accept dplyr syntax and translate it into the equivalent data.table code for efficiency.

Python with polars

import polars as pl
import polars.selectors as cs
from datetime import datetime

start=datetime.now()

q=(
    pl.scan_csv("big.csv")
    .group_by("id")
    .agg(cs.float().mean())
    .filter(pl.any_horizontal(cs.float()>.4))
)

table=q.collect()

elapsed=datetime.now()-start

print(f"{table.height} rows returned\nelapsed time for query: {elapsed}")

301 rows returned
elapsed time for query: 0:00:31.164063
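As a rough sanity check on the count of 301, the expected number of qualifying IDs can be estimated from the setup described at the top. This is a sketch under two assumptions that the post does not state explicitly: the variates are standard normal (mean 0, standard deviation 1), and the columns are independent. The variable names are mine.

```python
from statistics import NormalDist

n_records = 1_000_000
rows_per_id = 100
n_cols = 1_000
threshold = 0.4

n_ids = n_records // rows_per_id

# Assumption: each column is N(0, 1), so the mean of 100 draws is N(0, 0.1).
mean_sd = 1.0 / rows_per_id ** 0.5

# Probability that a single column mean exceeds the threshold.
p_col = 1.0 - NormalDist(0.0, mean_sd).cdf(threshold)

# Probability that at least one of the independent columns exceeds it.
p_id = 1.0 - (1.0 - p_col) ** n_cols

expected_ids = n_ids * p_id
print(f"expected ids over threshold: {expected_ids:.0f}")
```

Under those assumptions the estimate lands a little above 300, so a result of 301 is in the range one would expect from random data of this shape.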

R

Note that data.table::fread() is used to read the file in all R examples, so the differences in timing among them reflect the data manipulation approach rather than the file reader.

Plain dplyr

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(hms)


Attaching package: 'hms'

The following object is masked from 'package:lubridate':

    hms

start<-Sys.time()

rows<-data.table::fread("big.csv") %>%
  group_by(id) %>%
  summarize(across(where(is.double),mean),.groups="keep") %>%
  filter(if_any(where(is.double),~.x>.4)) %>%
  nrow()

end<-Sys.time()

print(str_glue("{rows} rows returned\nelapsed time for query: {as_hms(end-start)}"))

301 rows returned
elapsed time for query: 00:01:57.947875

dtplyr

Stylistically, this is the R version most similar to the polars approach, but it comes with a downside: not all dplyr functionality is supported. In this example that is most obvious in summarize() and filter(), where the where(is.double) selection used in the plain dplyr version has to be replaced by an explicit vector of column names passed to all_of().

library(dtplyr,warn.conflicts=F)

start<-Sys.time()

big<-data.table::fread("big.csv")

varnames<-setdiff(colnames(big),"id")

rows<-lazy_dt(big) %>%
  group_by(id) %>%
  summarize(across(all_of(varnames),mean),.groups="keep") %>%
  filter(if_any(all_of(varnames),~.x>.4)) %>%
  collect() %>%
  nrow()

end<-Sys.time()

print(str_glue("{rows} rows returned\nelapsed time for query: {as_hms(end-start)}"))

301 rows returned
elapsed time for query: 00:00:43.045766

tidytable

This should be computationally comparable to the dtplyr approach as both are deploying data.table behind the scenes, but this approach has the benefit of preserving the plain dplyr syntax, including the ability to use tidyselect helpers.

start<-Sys.time()

rows<-data.table::fread("big.csv") %>%
  tidytable::group_by(id) %>%
  tidytable::summarize(tidytable::across(where(is.double),mean),.groups="keep") %>%
  tidytable::filter(tidytable::if_any(where(is.double),~.x>.4)) %>%
  nrow()

end<-Sys.time()

print(str_glue("{rows} rows returned\nelapsed time for query: {as_hms(end-start)}"))

301 rows returned
elapsed time for query: 00:00:56.428901

Conclusions

After rendering this several times on both my work PC (2022 Windows laptop with 2.5GHz i7 and 32GB RAM) and my Mac at home (2023 M2 Max with 32GB memory), the general takeaway is that the plain vanilla dplyr approach takes anywhere from 3 to 8 times as long as polars. On the Mac, dtplyr and polars are very close, with tidytable coming in only a tad slower. On the Windows machine, both dtplyr and tidytable take about 2-3 times as long as polars.
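For the single render shown in this post (the output does not say which machine it came from), the ratios can be worked out directly from the elapsed times printed above. The helper name to_seconds is my own; the timings are copied from the output of each section.

```python
from datetime import timedelta


def to_seconds(stamp):
    """Convert an H:MM:SS.ffffff string, as printed above, to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return timedelta(
        hours=int(hours), minutes=int(minutes), seconds=float(seconds)
    ).total_seconds()


# Elapsed times as printed in each section of this post.
elapsed = {
    "polars": "0:00:31.164063",
    "dplyr": "00:01:57.947875",
    "dtplyr": "00:00:43.045766",
    "tidytable": "00:00:56.428901",
}

baseline = to_seconds(elapsed["polars"])
ratios = {name: to_seconds(stamp) / baseline for name, stamp in elapsed.items()}

for name, ratio in ratios.items():
    print(f"{name:<10} {ratio:.1f}x polars")
```

In this particular run, dplyr comes out at about 3.8 times the polars time, dtplyr at about 1.4 times, and tidytable at about 1.8 times. A single run is noisy, so these sit alongside the multi-render ranges above rather than replacing them.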

Citation

BibTeX citation:

@online{couzens2025,
  author = {Couzens, Lance},
  title = {R Vs. {Python} {Query} {Compute} {Time} {Example}},
  date = {2025-03-19},
  url = {https://mostlyunoriginal.github.io/posts/2025-03-19-TidyR-to-PolarsPython/},
  langid = {en}
}

For attribution, please cite this work as:

Couzens, Lance. 2025. “R Vs. Python Query Compute Time Example.” March 19, 2025. https://mostlyunoriginal.github.io/posts/2025-03-19-TidyR-to-PolarsPython/.