<-mtcars %>% #1, 5
cylsdistinct(cyl) %>% #2
arrange(cyl) %>% #3
pull() #4
Background
Before diving into the examples below, it’s important to acknowledge some general differences between R and Python and specific differences in style and approach between Tidy- and Polars-style data manipulation. First and foremost: Python has a strong object orientation while R is essentially a functional language. The practical impact of that difference here is that Python objects are manipulated, or their attributes extracted, by way of methods, while R objects are inputs and outputs of functions. But Python uses functions too, and in fact methods are themselves functions, so this can be very confusing!
What is a method, then? In simple terms, it’s a function defined as part of the blueprint for a given type (or ‘class’) of object. A Polars DataFrame is a class of object, and there are certain functions defined in that class—these are the Polars DataFrame methods. By creating a specific DataFrame, we ‘instantiate’ the class into an object, and we can deploy a predefined set of methods to do things with or to that object.
In both R and Python we often want to do several operations in a row without distinct assignments for each intermediate step of a process. In R—and especially in the Tidy style of R programming—we can use piping with either the magrittr
or base pipes (%>%
and |>
, respectively) to achieve this. The resulting pipeline starts with an object, passes that object into a function which returns a new object which is passed into another function, and so on and so forth until the desired object is returned by the final function in the pipeline and is captured with an assignment, returned to the console, or passed as input to another pipeline or function. Consider the following example.
Here, we start with the data frame mtcars
(1), which is piped as input to the distinct()
function along with the column reference cyl
(2), which returns a data frame containing only the column cyl
and one row for each distinct value. This is piped as input to arrange()
(3) along with a column reference to cyl
, which returns a sorted data frame. This is piped into pull()
(4), which extracts a single column (the only one there: cyl
) as a vector. This final object is then assigned to the environment variable cyls
(5). Now consider the Python version which utilizes a technique called ‘method chaining’.
=( #5
cyls#1
mtcars "cyl") #2
.unique("cyl") #3
.sort("cyl") #4
.get_column( )
Here, we start with mtcars
, a Polars DataFrame (1). We then apply the unique()
method with a reference to the column cyl
(2), yielding a Polars DataFrame containing the distinct values of cyl
(note that it still contains all the other variables too!). Calling the sort()
method sorts the rows by the values of cyl
(3). The Polars DataFrame method get_column()
(4) extracts a single column and yields a Polars Series (analogous to the atomic vectors that comprise R data frame columns). The resulting Series is assigned to the variable cyls
(5).
Both of these code blocks look quite similar, and the Python version should feel familiar to anyone who employs the Tidy-style of programming in R. Now that we’ve seen method chaining in action we can introduce a twist that unlocks some additional efficiency and that may seem strange compared to the Tidy style. The Python block above utilizes what’s called ‘eager evaluation’, which means the code inside cyls=(...)
is immediately evaluated and in exactly the manner we have specified. However, Polars is actually implemented in Rust (a high performance systems programming language) and has a query optimization capability that we can exploit via something called ‘lazy evaluation’. The following ‘lazy’ alternative to the previous example gathers our instructions, performs query optimization (yielding a ‘query plan’), and ultimately executes an optimized query only when we invoke the collect()
method (a method of Polars LazyFrames which result from invoking the lazy()
method of a regular Polars DataFrame).
=(
cyls
mtcars
.lazy()"cyl")
.unique("cyl")
.sort(
.collect()"cyl")
.get_column( )
Note that Polars LazyFrames do not have a get_column()
method like DataFrames do—it can therefore only be invoked after collection. The advantages of lazy evaluation in this trivial example would not be noticeable but could be significant depending on the size of the data and the complexity of the query. Lazy evaluation also allows for efficient processing of larger-than-memory data frames. See the User guide for more detail. This approach may seem familiar to anyone who has used the dtplyr
R package which allows the user to proved dplyr
syntax which is translated into data.table
(which is written primarily in C and is much faster than dplyr
) under the hood.
Without further ado, let’s dive into some examples.
Data Manipulation
Example 1. Basic Summarize without Generalization across Variables
Here, we take on a very simple and very common task: calculating the mean of a continuous variable (mpg
) by the levels of a categorical variable (cyl
).
R Version
The Tidy approach utilizes a pipeline comprised of the mtcars
data frame and the group_by()
and summarize()
functions. Note that these functions take a data frame (or tibble) as the first argument, but prevailing style allows this to be passed implicitly (as is done here).
library(dplyr)
<-mtcars %>%
tablegroup_by(cyl) %>%
summarize(mpg.mean=mean(mpg))
print(table)
# A tibble: 3 × 2
cyl mpg.mean
<dbl> <dbl>
1 4 26.7
2 6 19.7
3 8 15.1
Python Version
The Polars approach below begins by reading the R mtcars
data frame into the Polars LazyFrame mtcars
. The LazyFrame method group_by()
is invoked followed by the agg()
method. agg()
contains an expression that is itself a method chain which yields the mean values for each group as the new variable mpg.mean
. table
is a Polars DataFrame realized as the result of evaluating an optimized query plan (via collect()
).
import polars as pl
=pl.LazyFrame(r.mtcars)
mtcars
=(
q
mtcars"cyl")
.group_by("mpg").mean().alias("mpg.mean"))
.agg(pl.col(
)
=q.collect()
table
print(table)
shape: (3, 2)
┌─────┬───────────┐
│ cyl ┆ mpg.mean │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═══════════╡
│ 8.0 ┆ 15.1 │
│ 6.0 ┆ 19.742857 │
│ 4.0 ┆ 26.663636 │
└─────┴───────────┘
Example 2. Basic Mutate with Grouping and without Generalization
Here we want to add a new variable to our data frame—the new variable is the ratio of each value of mpg
relative to the mean value for the group (defined by the levels of the variable cyl
).
R Version
In R I can create the new variable with a call to mutate()
that utilizes both group-level statistics and record-level data. This can be done in a single step with very little code.
<-mtcars %>%
tablegroup_by(cyl) %>%
mutate(rel.mpg=mpg/mean(mpg))
print(table)
# A tibble: 32 × 12
# Groups: cyl [3]
mpg cyl disp hp drat wt qsec vs am gear carb rel.mpg
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 1.06
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 1.06
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 0.855
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 1.08
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 1.24
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 0.917
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 0.947
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 0.915
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 0.855
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 0.973
# ℹ 22 more rows
Python Version
The Python version uses the with_columns()
LazyFrame method. Here, unlike in the R version, the grouping is baked into the recode expression itself by way of over()
. Aside from looking a bit different, the Polars approach is more powerful because each expression can utilize its own grouping. Note that the Polars documentation utilizes a ‘contexts’ and ‘expressions’ framework to describe what could also be referred to as methods or method chains. In this example, with_columns()
is the context in which the expression yielding the new variable rel.mpg
is nested.
=(
q
mtcars
.with_columns("mpg")/pl.col("mpg").mean().over("cyl")).alias("rel.mpg")
(pl.col(
)
)
=q.collect()
table
print(table)
shape: (32, 12)
┌──────┬─────┬───────┬───────┬───┬─────┬──────┬──────┬──────────┐
│ mpg ┆ cyl ┆ disp ┆ hp ┆ … ┆ am ┆ gear ┆ carb ┆ rel.mpg │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞══════╪═════╪═══════╪═══════╪═══╪═════╪══════╪══════╪══════════╡
│ 21.0 ┆ 6.0 ┆ 160.0 ┆ 110.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 4.0 ┆ 1.063676 │
│ 21.0 ┆ 6.0 ┆ 160.0 ┆ 110.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 4.0 ┆ 1.063676 │
│ 22.8 ┆ 4.0 ┆ 108.0 ┆ 93.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 1.0 ┆ 0.855097 │
│ 21.4 ┆ 6.0 ┆ 258.0 ┆ 110.0 ┆ … ┆ 0.0 ┆ 3.0 ┆ 1.0 ┆ 1.083936 │
│ 18.7 ┆ 8.0 ┆ 360.0 ┆ 175.0 ┆ … ┆ 0.0 ┆ 3.0 ┆ 2.0 ┆ 1.238411 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 30.4 ┆ 4.0 ┆ 95.1 ┆ 113.0 ┆ … ┆ 1.0 ┆ 5.0 ┆ 2.0 ┆ 1.14013 │
│ 15.8 ┆ 8.0 ┆ 351.0 ┆ 264.0 ┆ … ┆ 1.0 ┆ 5.0 ┆ 4.0 ┆ 1.046358 │
│ 19.7 ┆ 6.0 ┆ 145.0 ┆ 175.0 ┆ … ┆ 1.0 ┆ 5.0 ┆ 6.0 ┆ 0.997829 │
│ 15.0 ┆ 8.0 ┆ 301.0 ┆ 335.0 ┆ … ┆ 1.0 ┆ 5.0 ┆ 8.0 ┆ 0.993377 │
│ 21.4 ┆ 4.0 ┆ 121.0 ┆ 109.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 2.0 ┆ 0.802591 │
└──────┴─────┴───────┴───────┴───┴─────┴──────┴──────┴──────────┘
Example 3. Summarize Generalized by Variable Type with Across
In this example we want to generate means by group (like in example 1), but across a set of columns described by a selection criteria (i.e., not by name).
R Version
As before, we specify the grouping via group_by()
and generate the means within summarize()
. In order to create means for several variables not explicitly specified we can utilize across()
. To get means for all variables stored as doubles, we use the helper function where()
in the .cols
specification. Glue syntax in the .names
specification allows us to rename all affected columns.
%>%
mtcars group_by(cyl,gear) %>%
summarize(
across(
.cols=where(is.double)
.fns=mean
,.names="{.col}_mean"
,
) )
`summarise()` has grouped output by 'cyl'. You can override using the `.groups`
argument.
# A tibble: 8 × 11
# Groups: cyl [3]
cyl gear mpg_mean disp_mean hp_mean drat_mean wt_mean qsec_mean vs_mean
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 3 21.5 120. 97 3.7 2.46 20.0 1
2 4 4 26.9 103. 76 4.11 2.38 19.6 1
3 4 5 28.2 108. 102 4.1 1.83 16.8 0.5
4 6 3 19.8 242. 108. 2.92 3.34 19.8 1
5 6 4 19.8 164. 116. 3.91 3.09 17.7 0.5
6 6 5 19.7 145 175 3.62 2.77 15.5 0
7 8 3 15.0 358. 194. 3.12 4.10 17.1 0
8 8 5 15.4 326 300. 3.88 3.37 14.6 0
# ℹ 2 more variables: am_mean <dbl>, carb_mean <dbl>
Python Version
The Python version looks very similar to the R version and has the same basic structure as in example 1. Here, though, instead of specifying a column with pl.col()
we indicate that we want all columns stored as floats by using cs.float()
. Note that there are many selector functions available, as explained here. The name method name.suffix()
is used to rename all affected variables. See other name methods here.
import polars.selectors as cs
=(
q
mtcars"cyl","gear")
.group_by(float().mean().name.suffix("_mean"))
.agg(cs.
)
=q.collect()
table
print(table)
shape: (8, 11)
┌─────┬──────┬──────────┬────────────┬───┬───────────┬─────────┬─────────┬───────────┐
│ cyl ┆ gear ┆ mpg_mean ┆ disp_mean ┆ … ┆ qsec_mean ┆ vs_mean ┆ am_mean ┆ carb_mean │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════╪══════╪══════════╪════════════╪═══╪═══════════╪═════════╪═════════╪═══════════╡
│ 8.0 ┆ 3.0 ┆ 15.05 ┆ 357.616667 ┆ … ┆ 17.1425 ┆ 0.0 ┆ 0.0 ┆ 3.083333 │
│ 4.0 ┆ 3.0 ┆ 21.5 ┆ 120.1 ┆ … ┆ 20.01 ┆ 1.0 ┆ 0.0 ┆ 1.0 │
│ 4.0 ┆ 4.0 ┆ 26.925 ┆ 102.625 ┆ … ┆ 19.6125 ┆ 1.0 ┆ 0.75 ┆ 1.5 │
│ 6.0 ┆ 3.0 ┆ 19.75 ┆ 241.5 ┆ … ┆ 19.83 ┆ 1.0 ┆ 0.0 ┆ 1.0 │
│ 4.0 ┆ 5.0 ┆ 28.2 ┆ 107.7 ┆ … ┆ 16.8 ┆ 0.5 ┆ 1.0 ┆ 2.0 │
│ 8.0 ┆ 5.0 ┆ 15.4 ┆ 326.0 ┆ … ┆ 14.55 ┆ 0.0 ┆ 1.0 ┆ 6.0 │
│ 6.0 ┆ 4.0 ┆ 19.75 ┆ 163.8 ┆ … ┆ 17.67 ┆ 0.5 ┆ 0.5 ┆ 4.0 │
│ 6.0 ┆ 5.0 ┆ 19.7 ┆ 145.0 ┆ … ┆ 15.5 ┆ 0.0 ┆ 1.0 ┆ 6.0 │
└─────┴──────┴──────────┴────────────┴───┴───────────┴─────────┴─────────┴───────────┘
Example 4. Conditional Recode
In this example we use if/else if/else logic to create a string recode of the numeric variable mpg
.
R Version
In the Tidy R approach we deploy case_when()
inside of mutate()
to create a recode with cascading conditional logic.
%>%
mtcars mutate(
mpg.cat=case_when(
<10~"very bad"
mpg<15~"bad"
,mpg<20~"okay"
,mpg<25~"good"
,mpgTRUE~"great"
,
)%>%
) arrange(desc(mpg))
mpg cyl disp hp drat wt qsec vs am gear carb mpg.cat
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 great
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 great
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 great
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 great
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 great
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 great
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 good
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 good
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 good
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 good
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 good
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 good
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 good
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 good
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 okay
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 okay
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 okay
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 okay
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 okay
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 okay
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 okay
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 okay
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 okay
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 okay
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 okay
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 okay
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 okay
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 bad
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 bad
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 bad
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 bad
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 bad
Python Version
The Python version is quite a bit more wordy. Note that pl.lit()
is needed here to resolve ambiguity in the way column references can appear as strings in then()
—in other words, we’re indicating we want the recoded values to be the provided strings, not the values of columns represented by those strings.
=(
q
mtcars
.with_columns("mpg")<10).then(pl.lit("very bad"))
pl.when(pl.col("mpg")<15).then(pl.lit("bad"))
.when(pl.col("mpg")<20).then(pl.lit("okay"))
.when(pl.col("mpg")<25).then(pl.lit("good"))
.when(pl.col("great"))
.otherwise(pl.lit("mpg.cat")
.alias(
)"mpg",descending=True)
.sort(
)
=q.collect()
df
print(df)
shape: (32, 12)
┌──────┬─────┬───────┬───────┬───┬─────┬──────┬──────┬─────────┐
│ mpg ┆ cyl ┆ disp ┆ hp ┆ … ┆ am ┆ gear ┆ carb ┆ mpg.cat │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ str │
╞══════╪═════╪═══════╪═══════╪═══╪═════╪══════╪══════╪═════════╡
│ 33.9 ┆ 4.0 ┆ 71.1 ┆ 65.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 1.0 ┆ great │
│ 32.4 ┆ 4.0 ┆ 78.7 ┆ 66.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 1.0 ┆ great │
│ 30.4 ┆ 4.0 ┆ 75.7 ┆ 52.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 2.0 ┆ great │
│ 30.4 ┆ 4.0 ┆ 95.1 ┆ 113.0 ┆ … ┆ 1.0 ┆ 5.0 ┆ 2.0 ┆ great │
│ 27.3 ┆ 4.0 ┆ 79.0 ┆ 66.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 1.0 ┆ great │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 14.7 ┆ 8.0 ┆ 440.0 ┆ 230.0 ┆ … ┆ 0.0 ┆ 3.0 ┆ 4.0 ┆ bad │
│ 14.3 ┆ 8.0 ┆ 360.0 ┆ 245.0 ┆ … ┆ 0.0 ┆ 3.0 ┆ 4.0 ┆ bad │
│ 13.3 ┆ 8.0 ┆ 350.0 ┆ 245.0 ┆ … ┆ 0.0 ┆ 3.0 ┆ 4.0 ┆ bad │
│ 10.4 ┆ 8.0 ┆ 472.0 ┆ 205.0 ┆ … ┆ 0.0 ┆ 3.0 ┆ 4.0 ┆ bad │
│ 10.4 ┆ 8.0 ┆ 460.0 ┆ 215.0 ┆ … ┆ 0.0 ┆ 3.0 ┆ 4.0 ┆ bad │
└──────┴─────┴───────┴───────┴───┴─────┴──────┴──────┴─────────┘
Example 5. Pivots
In this example we start with mtcars
(a version with rownames mapped to the column car
), pivot to a long file and then back to wide.
R Version
Here we use very simple forms of pivot_longer()
and pivot_wider()
.
library(tidyr)
library(tibble)
<-rownames_to_column(mtcars,"car") %>%
carspivot_longer(
cols=where(is.numeric)
names_to="variable"
,values_to="value"
,
)
print(cars)
# A tibble: 352 × 3
car variable value
<chr> <chr> <dbl>
1 Mazda RX4 mpg 21
2 Mazda RX4 cyl 6
3 Mazda RX4 disp 160
4 Mazda RX4 hp 110
5 Mazda RX4 drat 3.9
6 Mazda RX4 wt 2.62
7 Mazda RX4 qsec 16.5
8 Mazda RX4 vs 0
9 Mazda RX4 am 1
10 Mazda RX4 gear 4
# ℹ 342 more rows
<-cars %>%
mtcars_w_namespivot_wider(
id_cols=car
names_from="variable"
,values_from="value"
,
)
print(mtcars_w_names)
# A tibble: 32 × 12
car mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
Python Version
With polars, going from wide to long is an unpivot and long to wide is a pivot. Note that the pivot()
is only available in eager mode, as shown below.
import polars.selectors as cs
=(
q"mtcars_w_names.csv")
pl.scan_csv(
.unpivot(=cs.numeric()
on="car"
,index="variable"
,variable_name="value"
,value_name
)
)
=q.collect()
cars
print(cars)
shape: (352, 3)
┌───────────────────┬──────────┬───────┐
│ car ┆ variable ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════════════════╪══════════╪═══════╡
│ Mazda RX4 ┆ mpg ┆ 21.0 │
│ Mazda RX4 Wag ┆ mpg ┆ 21.0 │
│ Datsun 710 ┆ mpg ┆ 22.8 │
│ Hornet 4 Drive ┆ mpg ┆ 21.4 │
│ Hornet Sportabout ┆ mpg ┆ 18.7 │
│ … ┆ … ┆ … │
│ Lotus Europa ┆ carb ┆ 2.0 │
│ Ford Pantera L ┆ carb ┆ 4.0 │
│ Ferrari Dino ┆ carb ┆ 6.0 │
│ Maserati Bora ┆ carb ┆ 8.0 │
│ Volvo 142E ┆ carb ┆ 2.0 │
└───────────────────┴──────────┴───────┘
=(
mtcars_w_names
cars
.pivot(="car"
index="variable"
,on="value"
,values=None
,aggregate_function
)
)
print(mtcars_w_names)
shape: (32, 12)
┌───────────────────┬──────┬─────┬───────┬───┬─────┬─────┬──────┬──────┐
│ car ┆ mpg ┆ cyl ┆ disp ┆ … ┆ vs ┆ am ┆ gear ┆ carb │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════════════╪══════╪═════╪═══════╪═══╪═════╪═════╪══════╪══════╡
│ Mazda RX4 ┆ 21.0 ┆ 6.0 ┆ 160.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 4.0 ┆ 4.0 │
│ Mazda RX4 Wag ┆ 21.0 ┆ 6.0 ┆ 160.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 4.0 ┆ 4.0 │
│ Datsun 710 ┆ 22.8 ┆ 4.0 ┆ 108.0 ┆ … ┆ 1.0 ┆ 1.0 ┆ 4.0 ┆ 1.0 │
│ Hornet 4 Drive ┆ 21.4 ┆ 6.0 ┆ 258.0 ┆ … ┆ 1.0 ┆ 0.0 ┆ 3.0 ┆ 1.0 │
│ Hornet Sportabout ┆ 18.7 ┆ 8.0 ┆ 360.0 ┆ … ┆ 0.0 ┆ 0.0 ┆ 3.0 ┆ 2.0 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ Lotus Europa ┆ 30.4 ┆ 4.0 ┆ 95.1 ┆ … ┆ 1.0 ┆ 1.0 ┆ 5.0 ┆ 2.0 │
│ Ford Pantera L ┆ 15.8 ┆ 8.0 ┆ 351.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 5.0 ┆ 4.0 │
│ Ferrari Dino ┆ 19.7 ┆ 6.0 ┆ 145.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 5.0 ┆ 6.0 │
│ Maserati Bora ┆ 15.0 ┆ 8.0 ┆ 301.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 5.0 ┆ 8.0 │
│ Volvo 142E ┆ 21.4 ┆ 4.0 ┆ 121.0 ┆ … ┆ 1.0 ┆ 1.0 ┆ 4.0 ┆ 2.0 │
└───────────────────┴──────┴─────┴───────┴───┴─────┴─────┴──────┴──────┘
Example 6. Joins
In this example we will show the various join types with two distinct but overlapping subsets of mtcars
: cars with 6-cylinder engines and those with horsepower less than 110.
R Version
This code is pretty self-explanatory.
<-mtcars %>%
carslrownames_to_column("car") %>%
filter(cyl==6) %>%
select(car,cyl)
<-mtcars %>%
carsrrownames_to_column("car") %>%
filter(hp<110) %>%
select(car,hp)
print(carsl)
car cyl
1 Mazda RX4 6
2 Mazda RX4 Wag 6
3 Hornet 4 Drive 6
4 Valiant 6
5 Merc 280 6
6 Merc 280C 6
7 Ferrari Dino 6
print(carsr)
car hp
1 Datsun 710 93
2 Valiant 105
3 Merc 240D 62
4 Merc 230 95
5 Fiat 128 66
6 Honda Civic 52
7 Toyota Corolla 65
8 Toyota Corona 97
9 Fiat X1-9 66
10 Porsche 914-2 91
11 Volvo 142E 109
left_join(carsl,carsr,by="car")
car cyl hp
1 Mazda RX4 6 NA
2 Mazda RX4 Wag 6 NA
3 Hornet 4 Drive 6 NA
4 Valiant 6 105
5 Merc 280 6 NA
6 Merc 280C 6 NA
7 Ferrari Dino 6 NA
right_join(carsl,carsr,by="car")
car cyl hp
1 Valiant 6 105
2 Datsun 710 NA 93
3 Merc 240D NA 62
4 Merc 230 NA 95
5 Fiat 128 NA 66
6 Honda Civic NA 52
7 Toyota Corolla NA 65
8 Toyota Corona NA 97
9 Fiat X1-9 NA 66
10 Porsche 914-2 NA 91
11 Volvo 142E NA 109
inner_join(carsl,carsr,by="car")
car cyl hp
1 Valiant 6 105
full_join(carsl,carsr,by="car")
car cyl hp
1 Mazda RX4 6 NA
2 Mazda RX4 Wag 6 NA
3 Hornet 4 Drive 6 NA
4 Valiant 6 105
5 Merc 280 6 NA
6 Merc 280C 6 NA
7 Ferrari Dino 6 NA
8 Datsun 710 NA 93
9 Merc 240D NA 62
10 Merc 230 NA 95
11 Fiat 128 NA 66
12 Honda Civic NA 52
13 Toyota Corolla NA 65
14 Toyota Corona NA 97
15 Fiat X1-9 NA 66
16 Porsche 914-2 NA 91
17 Volvo 142E NA 109
anti_join(carsl,carsr,by="car")
car cyl
1 Mazda RX4 6
2 Mazda RX4 Wag 6
3 Hornet 4 Drive 6
4 Merc 280 6
5 Merc 280C 6
6 Ferrari Dino 6
Python Version
=(
carsl"mtcars_w_names.csv")
pl.scan_csv(filter(pl.col("cyl")==6)
."car","cyl")
.select(
)
=(
carsr"mtcars_w_names.csv")
pl.scan_csv(filter(pl.col("hp")<110)
."car","hp")
.select(
)
print(carsl.collect())
shape: (7, 2)
┌────────────────┬─────┐
│ car ┆ cyl │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════════════╪═════╡
│ Mazda RX4 ┆ 6 │
│ Mazda RX4 Wag ┆ 6 │
│ Hornet 4 Drive ┆ 6 │
│ Valiant ┆ 6 │
│ Merc 280 ┆ 6 │
│ Merc 280C ┆ 6 │
│ Ferrari Dino ┆ 6 │
└────────────────┴─────┘
print(carsr.collect())
shape: (11, 2)
┌────────────────┬─────┐
│ car ┆ hp │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════════════╪═════╡
│ Datsun 710 ┆ 93 │
│ Valiant ┆ 105 │
│ Merc 240D ┆ 62 │
│ Merc 230 ┆ 95 │
│ Fiat 128 ┆ 66 │
│ … ┆ … │
│ Toyota Corolla ┆ 65 │
│ Toyota Corona ┆ 97 │
│ Fiat X1-9 ┆ 66 │
│ Porsche 914-2 ┆ 91 │
│ Volvo 142E ┆ 109 │
└────────────────┴─────┘
print(carsl.join(carsr,on="car",how="left").collect())
shape: (7, 3)
┌────────────────┬─────┬──────┐
│ car ┆ cyl ┆ hp │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞════════════════╪═════╪══════╡
│ Mazda RX4 ┆ 6 ┆ null │
│ Mazda RX4 Wag ┆ 6 ┆ null │
│ Hornet 4 Drive ┆ 6 ┆ null │
│ Valiant ┆ 6 ┆ 105 │
│ Merc 280 ┆ 6 ┆ null │
│ Merc 280C ┆ 6 ┆ null │
│ Ferrari Dino ┆ 6 ┆ null │
└────────────────┴─────┴──────┘
print(carsl.join(carsr,on="car",how="right").collect())
shape: (11, 3)
┌──────┬────────────────┬─────┐
│ cyl ┆ car ┆ hp │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞══════╪════════════════╪═════╡
│ null ┆ Datsun 710 ┆ 93 │
│ 6 ┆ Valiant ┆ 105 │
│ null ┆ Merc 240D ┆ 62 │
│ null ┆ Merc 230 ┆ 95 │
│ null ┆ Fiat 128 ┆ 66 │
│ … ┆ … ┆ … │
│ null ┆ Toyota Corolla ┆ 65 │
│ null ┆ Toyota Corona ┆ 97 │
│ null ┆ Fiat X1-9 ┆ 66 │
│ null ┆ Porsche 914-2 ┆ 91 │
│ null ┆ Volvo 142E ┆ 109 │
└──────┴────────────────┴─────┘
print(carsl.join(carsr,on="car",how="inner").collect())
shape: (1, 3)
┌─────────┬─────┬─────┐
│ car ┆ cyl ┆ hp │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════════╪═════╪═════╡
│ Valiant ┆ 6 ┆ 105 │
└─────────┴─────┴─────┘
print(carsl.join(carsr,on="car",how="full",coalesce=True).collect())
shape: (17, 3)
┌────────────────┬──────┬──────┐
│ car ┆ cyl ┆ hp │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞════════════════╪══════╪══════╡
│ Datsun 710 ┆ null ┆ 93 │
│ Valiant ┆ 6 ┆ 105 │
│ Merc 240D ┆ null ┆ 62 │
│ Merc 230 ┆ null ┆ 95 │
│ Fiat 128 ┆ null ┆ 66 │
│ … ┆ … ┆ … │
│ Merc 280 ┆ 6 ┆ null │
│ Mazda RX4 ┆ 6 ┆ null │
│ Ferrari Dino ┆ 6 ┆ null │
│ Hornet 4 Drive ┆ 6 ┆ null │
│ Merc 280C ┆ 6 ┆ null │
└────────────────┴──────┴──────┘
print(carsl.join(carsr,on="car",how="anti").collect())
shape: (6, 2)
┌────────────────┬─────┐
│ car ┆ cyl │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════════════╪═════╡
│ Mazda RX4 ┆ 6 │
│ Mazda RX4 Wag ┆ 6 │
│ Hornet 4 Drive ┆ 6 │
│ Merc 280 ┆ 6 │
│ Merc 280C ┆ 6 │
│ Ferrari Dino ┆ 6 │
└────────────────┴─────┘
Functional Programming
Example 7. Function for n & pct by Grouping Variables
Here we want a custom function to create simple, list-style frequency tables based on one or more variables provided by the user.
R Version
We use dynamic dots (...
) here to tunnel in the variables provided by the user in the function call. This is powerful and flexible, allowing for 0+ variables provided as naked symbols rather than strings (doit()
); an alternative version (doit2()
) also uses dynamic dots, but with the intention to call with variable names provided as strings—this scales up better and is more comparable to the python version.
library(rlang)
Attaching package: 'rlang'
The following objects are masked from 'package:purrr':
%@%, flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
flatten_raw, invoke, splice
library(purrr)
<-function(df,...){
doit%>%
df ungroup() %>%
mutate(N=n()) %>%
group_by(...) %>%
summarize(n=n(),pct=n()*100/mean(N),.groups="drop") %>%
mutate(cumn=cumsum(n),cumpct=cumsum(pct))
}
doit(mtcars)
# A tibble: 1 × 4
n pct cumn cumpct
<int> <dbl> <int> <dbl>
1 32 100 32 100
doit(mtcars,cyl)
# A tibble: 3 × 5
cyl n pct cumn cumpct
<dbl> <int> <dbl> <int> <dbl>
1 4 11 34.4 11 34.4
2 6 7 21.9 18 56.2
3 8 14 43.8 32 100
doit(mtcars,cyl,gear)
# A tibble: 8 × 6
cyl gear n pct cumn cumpct
<dbl> <dbl> <int> <dbl> <int> <dbl>
1 4 3 1 3.12 1 3.12
2 4 4 8 25 9 28.1
3 4 5 2 6.25 11 34.4
4 6 3 2 6.25 13 40.6
5 6 4 4 12.5 17 53.1
6 6 5 1 3.12 18 56.2
7 8 3 12 37.5 30 93.8
8 8 5 2 6.25 32 100
<-function(df,...){
doit2<-dots_list(...) %>%
varslist_c() %>%
syms()
%>%
df ungroup() %>%
mutate(N=n()) %>%
group_by(!!!vars) %>%
summarize(n=n(),pct=n()*100/mean(N),.groups="drop") %>%
mutate(cumn=cumsum(n),cumpct=cumsum(pct))
}
doit2(mtcars)
# A tibble: 1 × 4
n pct cumn cumpct
<int> <dbl> <int> <dbl>
1 32 100 32 100
doit2(mtcars,"cyl")
# A tibble: 3 × 5
cyl n pct cumn cumpct
<dbl> <int> <dbl> <int> <dbl>
1 4 11 34.4 11 34.4
2 6 7 21.9 18 56.2
3 8 14 43.8 32 100
doit2(mtcars,"cyl","gear")
# A tibble: 8 × 6
cyl gear n pct cumn cumpct
<dbl> <dbl> <int> <dbl> <int> <dbl>
1 4 3 1 3.12 1 3.12
2 4 4 8 25 9 28.1
3 4 5 2 6.25 11 34.4
4 6 3 2 6.25 13 40.6
5 6 4 4 12.5 17 53.1
6 6 5 1 3.12 18 56.2
7 8 3 12 37.5 30 93.8
8 8 5 2 6.25 32 100
Python Version
The version below gets very close! The only differences are that the python version of doit()
doesn’t work as-is if 0 variables are provided, and the variable names are passed as strings (i.e., this isn’t optional as with the tidy versions). This latter point should actually simplify some situations that are complex due to data mask ambiguities in tidy functions.
def doit(df,*argv):
=(
q
dflen().alias("N"))
.with_columns(pl.*argv)
.group_by(
.agg(len().alias("n")
pl.len()*100)/pl.col("N").mean()).alias("pct")
,((pl.
)*argv)
.sort(
.with_columns("n").cum_sum().alias("cumn")
pl.col("pct").cum_sum().alias("cumpct")
,pl.col(
)
)=q.collect()
tableprint(table)
"cyl") doit(mtcars,
shape: (3, 5)
┌─────┬─────┬────────┬──────┬────────┐
│ cyl ┆ n ┆ pct ┆ cumn ┆ cumpct │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ u32 ┆ f64 ┆ u32 ┆ f64 │
╞═════╪═════╪════════╪══════╪════════╡
│ 4.0 ┆ 11 ┆ 34.375 ┆ 11 ┆ 34.375 │
│ 6.0 ┆ 7 ┆ 21.875 ┆ 18 ┆ 56.25 │
│ 8.0 ┆ 14 ┆ 43.75 ┆ 32 ┆ 100.0 │
└─────┴─────┴────────┴──────┴────────┘
"cyl","gear") doit(mtcars,
shape: (8, 6)
┌─────┬──────┬─────┬───────┬──────┬────────┐
│ cyl ┆ gear ┆ n ┆ pct ┆ cumn ┆ cumpct │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ u32 ┆ f64 ┆ u32 ┆ f64 │
╞═════╪══════╪═════╪═══════╪══════╪════════╡
│ 4.0 ┆ 3.0 ┆ 1 ┆ 3.125 ┆ 1 ┆ 3.125 │
│ 4.0 ┆ 4.0 ┆ 8 ┆ 25.0 ┆ 9 ┆ 28.125 │
│ 4.0 ┆ 5.0 ┆ 2 ┆ 6.25 ┆ 11 ┆ 34.375 │
│ 6.0 ┆ 3.0 ┆ 2 ┆ 6.25 ┆ 13 ┆ 40.625 │
│ 6.0 ┆ 4.0 ┆ 4 ┆ 12.5 ┆ 17 ┆ 53.125 │
│ 6.0 ┆ 5.0 ┆ 1 ┆ 3.125 ┆ 18 ┆ 56.25 │
│ 8.0 ┆ 3.0 ┆ 12 ┆ 37.5 ┆ 30 ┆ 93.75 │
│ 8.0 ┆ 5.0 ┆ 2 ┆ 6.25 ┆ 32 ┆ 100.0 │
└─────┴──────┴─────┴───────┴──────┴────────┘
Example 8. Iterate a Custom Function
Here we want to apply the doit
functions over parameters.
R Version
We can use purrr::pmap()
in the R version with a list of parameters. Since we defined the R version of doit
to take naked symbols, the mapped version is kind of clunky—a cleaner alternative based on doit2
highlights this point.
<-list(
parmslist(mtcars,mtcars)
"var1"=list(mtcars$cyl,mtcars$cyl)
,"var2"=list(mtcars$gear,mtcars$am)
,
)
pmap(parms,doit)
[[1]]
# A tibble: 8 × 6
var1 var2 n pct cumn cumpct
<dbl> <dbl> <int> <dbl> <int> <dbl>
1 4 3 1 3.12 1 3.12
2 4 4 8 25 9 28.1
3 4 5 2 6.25 11 34.4
4 6 3 2 6.25 13 40.6
5 6 4 4 12.5 17 53.1
6 6 5 1 3.12 18 56.2
7 8 3 12 37.5 30 93.8
8 8 5 2 6.25 32 100
[[2]]
# A tibble: 6 × 6
var1 var2 n pct cumn cumpct
<dbl> <dbl> <int> <dbl> <int> <dbl>
1 4 0 3 9.38 3 9.38
2 4 1 8 25 11 34.4
3 6 0 4 12.5 15 46.9
4 6 1 3 9.38 18 56.2
5 8 0 12 37.5 30 93.8
6 8 1 2 6.25 32 100
<-list(
parms2list(mtcars,mtcars)
c("cyl","cyl")
,c("gear","am")
,
)
pmap(parms2,doit2)
[[1]]
# A tibble: 8 × 6
cyl gear n pct cumn cumpct
<dbl> <dbl> <int> <dbl> <int> <dbl>
1 4 3 1 3.12 1 3.12
2 4 4 8 25 9 28.1
3 4 5 2 6.25 11 34.4
4 6 3 2 6.25 13 40.6
5 6 4 4 12.5 17 53.1
6 6 5 1 3.12 18 56.2
7 8 3 12 37.5 30 93.8
8 8 5 2 6.25 32 100
[[2]]
# A tibble: 6 × 6
cyl am n pct cumn cumpct
<dbl> <dbl> <int> <dbl> <int> <dbl>
1 4 0 3 9.38 3 9.38
2 4 1 8 25 11 34.4
3 6 0 4 12.5 15 46.9
4 6 1 3 9.38 18 56.2
5 8 0 12 37.5 30 93.8
6 8 1 2 6.25 32 100
Python Version
Here we combine 3 parameter lists into a single iterator object via zip
—we can then map doit
over parms
via itertools.starmap
.
import itertools as it
=zip(
parms
[mtcars,mtcars]'cyl','cyl']
,['gear','am']
,[
)
list(it.starmap(doit,parms))
shape: (8, 6)
┌─────┬──────┬─────┬───────┬──────┬────────┐
│ cyl ┆ gear ┆ n ┆ pct ┆ cumn ┆ cumpct │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ u32 ┆ f64 ┆ u32 ┆ f64 │
╞═════╪══════╪═════╪═══════╪══════╪════════╡
│ 4.0 ┆ 3.0 ┆ 1 ┆ 3.125 ┆ 1 ┆ 3.125 │
│ 4.0 ┆ 4.0 ┆ 8 ┆ 25.0 ┆ 9 ┆ 28.125 │
│ 4.0 ┆ 5.0 ┆ 2 ┆ 6.25 ┆ 11 ┆ 34.375 │
│ 6.0 ┆ 3.0 ┆ 2 ┆ 6.25 ┆ 13 ┆ 40.625 │
│ 6.0 ┆ 4.0 ┆ 4 ┆ 12.5 ┆ 17 ┆ 53.125 │
│ 6.0 ┆ 5.0 ┆ 1 ┆ 3.125 ┆ 18 ┆ 56.25 │
│ 8.0 ┆ 3.0 ┆ 12 ┆ 37.5 ┆ 30 ┆ 93.75 │
│ 8.0 ┆ 5.0 ┆ 2 ┆ 6.25 ┆ 32 ┆ 100.0 │
└─────┴──────┴─────┴───────┴──────┴────────┘
shape: (6, 6)
┌─────┬─────┬─────┬───────┬──────┬────────┐
│ cyl ┆ am ┆ n ┆ pct ┆ cumn ┆ cumpct │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ u32 ┆ f64 ┆ u32 ┆ f64 │
╞═════╪═════╪═════╪═══════╪══════╪════════╡
│ 4.0 ┆ 0.0 ┆ 3 ┆ 9.375 ┆ 3 ┆ 9.375 │
│ 4.0 ┆ 1.0 ┆ 8 ┆ 25.0 ┆ 11 ┆ 34.375 │
│ 6.0 ┆ 0.0 ┆ 4 ┆ 12.5 ┆ 15 ┆ 46.875 │
│ 6.0 ┆ 1.0 ┆ 3 ┆ 9.375 ┆ 18 ┆ 56.25 │
│ 8.0 ┆ 0.0 ┆ 12 ┆ 37.5 ┆ 30 ┆ 93.75 │
│ 8.0 ┆ 1.0 ┆ 2 ┆ 6.25 ┆ 32 ┆ 100.0 │
└─────┴─────┴─────┴───────┴──────┴────────┘
[None, None]
Example 9. Stack Data Frames by List Binding with Map and Anonymous Function
What we’re achieving with this example—returning mtcars
—isn’t very useful, but it illustrates a common task: mapping an anonymous function over a vector to create a list of data frames which are subsequently stacked together via row binding. In other words, in this example we’re reassembling mtcars
by stacking together portions returned from each iteration of map
.
R Version
Pretty straight forward. Note that parms is a vector here.
<-distinct(mtcars,cyl) %>%
parmspull()
<-map(
list
parmsfunction (x){
,%>%
mtcars ::filter(cyl==x) %>%
dplyrarrange(desc(mpg))
}
)
<-list_rbind(list)
df
print(df)
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Python Version
This is extremely similar. Note that we’re pulling from the csv of mtcars
to utilize eager evaluation (for simplicity).
=pl.read_csv("mtcars.csv")
mtcars
=mtcars.get_column("cyl").unique()
iterator
=map(
dflambda x: (
mtcarsfilter(pl.col("cyl")==x)
."mpg",descending=True)
.sort(
)
,iterator
)
=pl.concat(list(df))
df
print(df)
shape: (32, 11)
┌──────┬─────┬───────┬─────┬───┬─────┬─────┬──────┬──────┐
│ mpg ┆ cyl ┆ disp ┆ hp ┆ … ┆ vs ┆ am ┆ gear ┆ carb │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ i64 ┆ f64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════╪═════╪═══════╪═════╪═══╪═════╪═════╪══════╪══════╡
│ 33.9 ┆ 4 ┆ 71.1 ┆ 65 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 1 │
│ 32.4 ┆ 4 ┆ 78.7 ┆ 66 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 1 │
│ 30.4 ┆ 4 ┆ 75.7 ┆ 52 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 2 │
│ 30.4 ┆ 4 ┆ 95.1 ┆ 113 ┆ … ┆ 1 ┆ 1 ┆ 5 ┆ 2 │
│ 27.3 ┆ 4 ┆ 79.0 ┆ 66 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 1 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 14.7 ┆ 8 ┆ 440.0 ┆ 230 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
│ 14.3 ┆ 8 ┆ 360.0 ┆ 245 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
│ 13.3 ┆ 8 ┆ 350.0 ┆ 245 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
│ 10.4 ┆ 8 ┆ 472.0 ┆ 205 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
│ 10.4 ┆ 8 ┆ 460.0 ┆ 215 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
└──────┴─────┴───────┴─────┴───┴─────┴─────┴──────┴──────┘
Example 10. Stack Data Frames by List Binding with pmap and Anonymous Function of Dots (...
)
This example generalizes the previous one to use a data frame with any number of columns (here we’re just using 2) to parameterize the mapping.
R Version
Dynamic dots are captured in the list parms
within the function and column values are referenced as elements of that list.
<-distinct(mtcars,cyl,gear)
parms
<-pmap(
list
parmsfunction (...){
,<-rlang::dots_list(...)
parms%>%
mtcars ::filter(cyl==parms$cyl & gear==parms$gear)
dplyr
}
)
<-list_rbind(list)
df
print(df)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Python Version
This is very similar to the R version above. The key is that each row of the data frame parms
is turned into a dictionary (i.e., has the form {'key1':_key1val_,...,'keyk':_keykval_}
). iterator
is then a list of dictionaries (iter_rows()
returns dictionaries when named=True
) which allows us to capture a single row of all parameters needed in the body of the anonymous function with the single parameter dctnry
. The parameter values can be referenced by the original parms
data frame variable name in the function body via the dictionary method get()
. Note that there is a Polars DataFrame method map_rows()
that does something similar, but there’s no way to preserve variable names for reference inside the function, so this approach seems preferrable.
=pl.read_csv("mtcars.csv")
mtcars
=(
parms
mtcars"cyl","gear")
.group_by(
.agg()
)
=list(parms.iter_rows(named=True))
iterator
=map(
dfslambda dctnry: (
mtcarsfilter(
."cyl")==dctnry.get('cyl')) &
(pl.col("gear")==dctnry.get('gear'))
(pl.col(
)
)
,iterator
)
=pl.concat(list(dfs))
df
print(df)
shape: (32, 11)
┌──────┬─────┬───────┬─────┬───┬─────┬─────┬──────┬──────┐
│ mpg ┆ cyl ┆ disp ┆ hp ┆ … ┆ vs ┆ am ┆ gear ┆ carb │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ i64 ┆ f64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════╪═════╪═══════╪═════╪═══╪═════╪═════╪══════╪══════╡
│ 18.7 ┆ 8 ┆ 360.0 ┆ 175 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 2 │
│ 14.3 ┆ 8 ┆ 360.0 ┆ 245 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
│ 16.4 ┆ 8 ┆ 275.8 ┆ 180 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 3 │
│ 17.3 ┆ 8 ┆ 275.8 ┆ 180 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 3 │
│ 15.2 ┆ 8 ┆ 275.8 ┆ 180 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 3 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 21.4 ┆ 6 ┆ 258.0 ┆ 110 ┆ … ┆ 1 ┆ 0 ┆ 3 ┆ 1 │
│ 18.1 ┆ 6 ┆ 225.0 ┆ 105 ┆ … ┆ 1 ┆ 0 ┆ 3 ┆ 1 │
│ 15.8 ┆ 8 ┆ 351.0 ┆ 264 ┆ … ┆ 0 ┆ 1 ┆ 5 ┆ 4 │
│ 15.0 ┆ 8 ┆ 301.0 ┆ 335 ┆ … ┆ 0 ┆ 1 ┆ 5 ┆ 8 │
│ 21.5 ┆ 4 ┆ 120.1 ┆ 97 ┆ … ┆ 1 ┆ 0 ┆ 3 ┆ 1 │
└──────┴─────┴───────┴─────┴───┴─────┴─────┴──────┴──────┘
Citation
@online{couzens2025,
author = {Couzens, Lance},
title = {Recreating {Some} {Tidy-Style} {R} {Operations} with {Python}
and {Polars}},
date = {2025-03-19},
url = {https://mostlyunoriginal.github.io/posts/2025-03-19-TidyR-to-PolarsPython/},
langid = {en}
}