Background

Before diving into the examples below, it’s important to acknowledge some general differences between R and Python and specific differences in style and approach between Tidy- and Polars-style data manipulation. First and foremost: Python has a strong object orientation while R is essentially a functional language. The practical impact of that difference here is that Python objects are manipulated, or their attributes extracted, by way of methods, while R objects are inputs and outputs of functions. But Python uses functions too, and in fact methods are themselves functions, so this can be very confusing!

What is a method, then? In simple terms, it’s a function defined as part of the blueprint for a given type (or ‘class’) of object. A Polars DataFrame is a class of object, and there are certain functions defined in that class—these are the Polars DataFrame methods. By creating a specific DataFrame, we ‘instantiate’ the class into an object, and we can deploy a predefined set of methods to do things with or to that object.

In both R and Python we often want to do several operations in a row without distinct assignments for each intermediate step of a process. In R—and especially in the Tidy style of R programming—we can use piping with either the magrittr or base pipes (%>% and |>, respectively) to achieve this. The resulting pipeline starts with an object, passes that object into a function which returns a new object which is passed into another function, and so on and so forth until the desired object is returned by the final function in the pipeline and is captured with an assignment, returned to the console, or passed as input to another pipeline or function. Consider the following example.

cyls<-mtcars %>% #1, 5
  distinct(cyl) %>% #2
  arrange(cyl) %>% #3
  pull() #4

Here, we start with the data frame mtcars (1), which is piped as input to the distinct() function along with the column reference cyl (2), which returns a data frame containing only the column cyl and one row for each distinct value. This is piped as input to arrange() (3) along with a column reference to cyl, which returns a sorted data frame. This is piped into pull() (4), which extracts a single column (the only one there: cyl) as a vector. This final object is then assigned to the environment variable cyls (5). Now consider the Python version which utilizes a technique called ‘method chaining’.

cyls=( #5
    mtcars #1
    .unique("cyl") #2
    .sort("cyl") #3
    .get_column("cyl") #4
)

Here, we start with mtcars, a Polars DataFrame (1). We then apply the unique() method with a reference to the column cyl (2), yielding a Polars DataFrame containing the distinct values of cyl (note that it still contains all the other variables too!). Calling the sort() method sorts the rows by the values of cyl (3). The Polars DataFrame method get_column() (4) extracts a single column and yields a Polars Series (analogous to the atomic vectors that comprise R data frame columns). The resulting Series is assigned to the variable cyls (5).

Both of these code blocks look quite similar, and the Python version should feel familiar to anyone who employs the Tidy-style of programming in R. Now that we’ve seen method chaining in action we can introduce a twist that unlocks some additional efficiency and that may seem strange compared to the Tidy style. The Python block above utilizes what’s called ‘eager evaluation’, which means the code inside cyls=(...) is immediately evaluated and in exactly the manner we have specified. However, Polars is actually implemented in Rust (a high performance systems programming language) and has a query optimization capability that we can exploit via something called ‘lazy evaluation’. The following ‘lazy’ alternative to the previous example gathers our instructions, performs query optimization (yielding a ‘query plan’), and ultimately executes an optimized query only when we invoke the collect() method (a method of Polars LazyFrames which result from invoking the lazy() method of a regular Polars DataFrame).

cyls=(
    mtcars
    .lazy()
    .unique("cyl")
    .sort("cyl")
    .collect()
    .get_column("cyl")
)

Note that Polars LazyFrames do not have a get_column() method like DataFrames do—it can therefore only be invoked after collection. The advantages of lazy evaluation in this trivial example would not be noticeable but could be significant depending on the size of the data and the complexity of the query. Lazy evaluation also allows for efficient processing of larger-than-memory data frames. See the User guide for more detail. This approach may seem familiar to anyone who has used the dtplyr R package which allows the user to proved dplyr syntax which is translated into data.table (which is written primarily in C and is much faster than dplyr) under the hood.

Without further ado, let’s dive into some examples.

Data Manipulation

Example 1. Basic Summarize without Generalization across Variables

Here, we take on a very simple and very common task: calculating the mean of a continuous variable (mpg) by the levels of a categorical variable (cyl).

R Version

The Tidy approach utilizes a pipeline comprised of the mtcars data frame and the group_by() and summarize() functions. Note that these functions take a data frame (or tibble) as the first argument, but prevailing style allows this to be passed implicitly (as is done here).

library(dplyr)

table<-mtcars %>%
    group_by(cyl) %>%
    summarize(mpg.mean=mean(mpg))

print(table)

# A tibble: 3 × 2
    cyl mpg.mean
  <dbl>    <dbl>
1     4     26.7
2     6     19.7
3     8     15.1

Python Version

The Polars approach below begins by reading the R mtcars data frame into the Polars LazyFrame mtcars. The LazyFrame method group_by() is invoked followed by the agg() method. agg() contains an expression that is itself a method chain which yields the mean values for each group as the new variable mpg.mean. table is a Polars DataFrame realized as the result of evaluating an optimized query plan (via collect()).

import polars as pl

mtcars=pl.LazyFrame(r.mtcars)

q=(
    mtcars
    .group_by("cyl")
    .agg(pl.col("mpg").mean().alias("mpg.mean"))
)

table=q.collect()

print(table)

shape: (3, 2)
┌─────┬───────────┐
│ cyl ┆ mpg.mean  │
│ --- ┆ ---       │
│ f64 ┆ f64       │
╞═════╪═══════════╡
│ 8.0 ┆ 15.1      │
│ 6.0 ┆ 19.742857 │
│ 4.0 ┆ 26.663636 │
└─────┴───────────┘

Example 2. Basic Mutate with Grouping and without Generalization

Here we want to add a new variable to our data frame—the new variable is the ratio of each value of mpg relative to the mean value for the group (defined by the levels of the variable cyl).

R Version

In R I can create the new variable with a call to mutate() that utilizes both group-level statistics and record-level data. This can be done in a single step with very little code.

table<-mtcars %>%
    group_by(cyl) %>%
    mutate(rel.mpg=mpg/mean(mpg))

print(table)

# A tibble: 32 × 12
# Groups:   cyl [3]
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb rel.mpg
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4   1.06 
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4   1.06 
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1   0.855
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1   1.08 
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2   1.24 
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1   0.917
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4   0.947
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2   0.915
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2   0.855
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4   0.973
# ℹ 22 more rows

Python Version

The Python version uses the with_columns() LazyFrame method. Here, unlike in the R version, the grouping is baked into the recode expression itself by way of over(). Aside from looking a bit different, the Polars approach is more powerful because each expression can utilize its own grouping. Note that the Polars documentation utilizes a ‘contexts’ and ‘expressions’ framework to describe what could also be referred to as methods or method chains. In this example, with_columns() is the context in which the expression yielding the new variable rel.mpg is nested.

q=(
    mtcars
    .with_columns(
        (pl.col("mpg")/pl.col("mpg").mean().over("cyl")).alias("rel.mpg")
    )
)

table=q.collect()

print(table)

shape: (32, 12)
┌──────┬─────┬───────┬───────┬───┬─────┬──────┬──────┬──────────┐
│ mpg  ┆ cyl ┆ disp  ┆ hp    ┆ … ┆ am  ┆ gear ┆ carb ┆ rel.mpg  │
│ ---  ┆ --- ┆ ---   ┆ ---   ┆   ┆ --- ┆ ---  ┆ ---  ┆ ---      │
│ f64  ┆ f64 ┆ f64   ┆ f64   ┆   ┆ f64 ┆ f64  ┆ f64  ┆ f64      │
╞══════╪═════╪═══════╪═══════╪═══╪═════╪══════╪══════╪══════════╡
│ 21.0 ┆ 6.0 ┆ 160.0 ┆ 110.0 ┆ … ┆ 1.0 ┆ 4.0  ┆ 4.0  ┆ 1.063676 │
│ 21.0 ┆ 6.0 ┆ 160.0 ┆ 110.0 ┆ … ┆ 1.0 ┆ 4.0  ┆ 4.0  ┆ 1.063676 │
│ 22.8 ┆ 4.0 ┆ 108.0 ┆ 93.0  ┆ … ┆ 1.0 ┆ 4.0  ┆ 1.0  ┆ 0.855097 │
│ 21.4 ┆ 6.0 ┆ 258.0 ┆ 110.0 ┆ … ┆ 0.0 ┆ 3.0  ┆ 1.0  ┆ 1.083936 │
│ 18.7 ┆ 8.0 ┆ 360.0 ┆ 175.0 ┆ … ┆ 0.0 ┆ 3.0  ┆ 2.0  ┆ 1.238411 │
│ …    ┆ …   ┆ …     ┆ …     ┆ … ┆ …   ┆ …    ┆ …    ┆ …        │
│ 30.4 ┆ 4.0 ┆ 95.1  ┆ 113.0 ┆ … ┆ 1.0 ┆ 5.0  ┆ 2.0  ┆ 1.14013  │
│ 15.8 ┆ 8.0 ┆ 351.0 ┆ 264.0 ┆ … ┆ 1.0 ┆ 5.0  ┆ 4.0  ┆ 1.046358 │
│ 19.7 ┆ 6.0 ┆ 145.0 ┆ 175.0 ┆ … ┆ 1.0 ┆ 5.0  ┆ 6.0  ┆ 0.997829 │
│ 15.0 ┆ 8.0 ┆ 301.0 ┆ 335.0 ┆ … ┆ 1.0 ┆ 5.0  ┆ 8.0  ┆ 0.993377 │
│ 21.4 ┆ 4.0 ┆ 121.0 ┆ 109.0 ┆ … ┆ 1.0 ┆ 4.0  ┆ 2.0  ┆ 0.802591 │
└──────┴─────┴───────┴───────┴───┴─────┴──────┴──────┴──────────┘

Example 3. Summarize Generalized by Variable Type with Across

In this example we want to generate means by group (like in example 1), but across a set of columns described by a selection criteria (i.e., not by name).

R Version

As before, we specify the grouping via group_by() and generate the means within summarize(). In order to create means for several variables not explicitly specified we can utilize across(). To get means for all variables stored as doubles, we use the helper function where() in the .cols specification. Glue syntax in the .names specification allows us to rename all affected columns.

mtcars %>%
    group_by(cyl,gear) %>%
    summarize(
        across(
            .cols=where(is.double)
            ,.fns=mean
            ,.names="{.col}_mean"
        )
    )

`summarise()` has grouped output by 'cyl'. You can override using the `.groups`
argument.

# A tibble: 8 × 11
# Groups:   cyl [3]
    cyl  gear mpg_mean disp_mean hp_mean drat_mean wt_mean qsec_mean vs_mean
  <dbl> <dbl>    <dbl>     <dbl>   <dbl>     <dbl>   <dbl>     <dbl>   <dbl>
1     4     3     21.5      120.     97       3.7     2.46      20.0     1  
2     4     4     26.9      103.     76       4.11    2.38      19.6     1  
3     4     5     28.2      108.    102       4.1     1.83      16.8     0.5
4     6     3     19.8      242.    108.      2.92    3.34      19.8     1  
5     6     4     19.8      164.    116.      3.91    3.09      17.7     0.5
6     6     5     19.7      145     175       3.62    2.77      15.5     0  
7     8     3     15.0      358.    194.      3.12    4.10      17.1     0  
8     8     5     15.4      326     300.      3.88    3.37      14.6     0  
# ℹ 2 more variables: am_mean <dbl>, carb_mean <dbl>

Python Version

The Python version looks very similar to the R version and has the same basic structure as in example 1. Here, though, instead of specifying a column with pl.col() we indicate that we want all columns stored as floats by using cs.float(). Note that there are many selector functions available, as explained here. The name method name.suffix() is used to rename all affected variables. See other name methods here.

import polars.selectors as cs

q=(
    mtcars
    .group_by("cyl","gear")
    .agg(cs.float().mean().name.suffix("_mean"))
)

table=q.collect()

print(table)

shape: (8, 11)
┌─────┬──────┬──────────┬────────────┬───┬───────────┬─────────┬─────────┬───────────┐
│ cyl ┆ gear ┆ mpg_mean ┆ disp_mean  ┆ … ┆ qsec_mean ┆ vs_mean ┆ am_mean ┆ carb_mean │
│ --- ┆ ---  ┆ ---      ┆ ---        ┆   ┆ ---       ┆ ---     ┆ ---     ┆ ---       │
│ f64 ┆ f64  ┆ f64      ┆ f64        ┆   ┆ f64       ┆ f64     ┆ f64     ┆ f64       │
╞═════╪══════╪══════════╪════════════╪═══╪═══════════╪═════════╪═════════╪═══════════╡
│ 8.0 ┆ 3.0  ┆ 15.05    ┆ 357.616667 ┆ … ┆ 17.1425   ┆ 0.0     ┆ 0.0     ┆ 3.083333  │
│ 4.0 ┆ 3.0  ┆ 21.5     ┆ 120.1      ┆ … ┆ 20.01     ┆ 1.0     ┆ 0.0     ┆ 1.0       │
│ 4.0 ┆ 4.0  ┆ 26.925   ┆ 102.625    ┆ … ┆ 19.6125   ┆ 1.0     ┆ 0.75    ┆ 1.5       │
│ 6.0 ┆ 3.0  ┆ 19.75    ┆ 241.5      ┆ … ┆ 19.83     ┆ 1.0     ┆ 0.0     ┆ 1.0       │
│ 4.0 ┆ 5.0  ┆ 28.2     ┆ 107.7      ┆ … ┆ 16.8      ┆ 0.5     ┆ 1.0     ┆ 2.0       │
│ 8.0 ┆ 5.0  ┆ 15.4     ┆ 326.0      ┆ … ┆ 14.55     ┆ 0.0     ┆ 1.0     ┆ 6.0       │
│ 6.0 ┆ 4.0  ┆ 19.75    ┆ 163.8      ┆ … ┆ 17.67     ┆ 0.5     ┆ 0.5     ┆ 4.0       │
│ 6.0 ┆ 5.0  ┆ 19.7     ┆ 145.0      ┆ … ┆ 15.5      ┆ 0.0     ┆ 1.0     ┆ 6.0       │
└─────┴──────┴──────────┴────────────┴───┴───────────┴─────────┴─────────┴───────────┘

Example 4. Conditional Recode

In this example we use if/else if/else logic to create a string recode of the numeric variable mpg.

R Version

In the Tidy R approach we deploy case_when() inside of mutate() to create a recode with cascading conditional logic.

mtcars %>%
    mutate(
        mpg.cat=case_when(
            mpg<10~"very bad"
            ,mpg<15~"bad"
            ,mpg<20~"okay"
            ,mpg<25~"good"
            ,TRUE~"great"
        )
    ) %>%
    arrange(desc(mpg))

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb mpg.cat
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1   great
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1   great
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2   great
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2   great
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1   great
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2   great
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2    good
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1    good
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2    good
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1    good
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1    good
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2    good
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4    good
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4    good
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6    okay
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4    okay
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2    okay
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2    okay
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1    okay
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4    okay
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3    okay
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3    okay
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4    okay
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2    okay
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3    okay
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2    okay
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8    okay
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4     bad
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4     bad
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4     bad
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4     bad
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4     bad

Python Version

The Python version is quite a bit more wordy. Note that pl.lit() is needed here to resolve ambiguity in the way column references can appear as strings in then()—in other words, we’re indicating we want the recoded values to be the provided strings, not the values of columns represented by those strings.

q=(
    mtcars
    .with_columns(
        pl.when(pl.col("mpg")<10).then(pl.lit("very bad"))
        .when(pl.col("mpg")<15).then(pl.lit("bad"))
        .when(pl.col("mpg")<20).then(pl.lit("okay"))
        .when(pl.col("mpg")<25).then(pl.lit("good"))
        .otherwise(pl.lit("great"))
        .alias("mpg.cat")
    )
    .sort("mpg",descending=True)
)

df=q.collect()

print(df)

shape: (32, 12)
┌──────┬─────┬───────┬───────┬───┬─────┬──────┬──────┬─────────┐
│ mpg  ┆ cyl ┆ disp  ┆ hp    ┆ … ┆ am  ┆ gear ┆ carb ┆ mpg.cat │
│ ---  ┆ --- ┆ ---   ┆ ---   ┆   ┆ --- ┆ ---  ┆ ---  ┆ ---     │
│ f64  ┆ f64 ┆ f64   ┆ f64   ┆   ┆ f64 ┆ f64  ┆ f64  ┆ str     │
╞══════╪═════╪═══════╪═══════╪═══╪═════╪══════╪══════╪═════════╡
│ 33.9 ┆ 4.0 ┆ 71.1  ┆ 65.0  ┆ … ┆ 1.0 ┆ 4.0  ┆ 1.0  ┆ great   │
│ 32.4 ┆ 4.0 ┆ 78.7  ┆ 66.0  ┆ … ┆ 1.0 ┆ 4.0  ┆ 1.0  ┆ great   │
│ 30.4 ┆ 4.0 ┆ 75.7  ┆ 52.0  ┆ … ┆ 1.0 ┆ 4.0  ┆ 2.0  ┆ great   │
│ 30.4 ┆ 4.0 ┆ 95.1  ┆ 113.0 ┆ … ┆ 1.0 ┆ 5.0  ┆ 2.0  ┆ great   │
│ 27.3 ┆ 4.0 ┆ 79.0  ┆ 66.0  ┆ … ┆ 1.0 ┆ 4.0  ┆ 1.0  ┆ great   │
│ …    ┆ …   ┆ …     ┆ …     ┆ … ┆ …   ┆ …    ┆ …    ┆ …       │
│ 14.7 ┆ 8.0 ┆ 440.0 ┆ 230.0 ┆ … ┆ 0.0 ┆ 3.0  ┆ 4.0  ┆ bad     │
│ 14.3 ┆ 8.0 ┆ 360.0 ┆ 245.0 ┆ … ┆ 0.0 ┆ 3.0  ┆ 4.0  ┆ bad     │
│ 13.3 ┆ 8.0 ┆ 350.0 ┆ 245.0 ┆ … ┆ 0.0 ┆ 3.0  ┆ 4.0  ┆ bad     │
│ 10.4 ┆ 8.0 ┆ 472.0 ┆ 205.0 ┆ … ┆ 0.0 ┆ 3.0  ┆ 4.0  ┆ bad     │
│ 10.4 ┆ 8.0 ┆ 460.0 ┆ 215.0 ┆ … ┆ 0.0 ┆ 3.0  ┆ 4.0  ┆ bad     │
└──────┴─────┴───────┴───────┴───┴─────┴──────┴──────┴─────────┘

Example 5. Pivots

In this example we start with mtcars (a version with rownames mapped to the column car), pivot to a long file and then back to wide.

R Version

Here we use very simple forms of pivot_longer() and pivot_wider().

library(tidyr)
library(tibble)

cars<-rownames_to_column(mtcars,"car") %>%
  pivot_longer(
    cols=where(is.numeric)
    ,names_to="variable"
    ,values_to="value"
  )

print(cars)

# A tibble: 352 × 3
   car       variable  value
   <chr>     <chr>     <dbl>
 1 Mazda RX4 mpg       21   
 2 Mazda RX4 cyl        6   
 3 Mazda RX4 disp     160   
 4 Mazda RX4 hp       110   
 5 Mazda RX4 drat       3.9 
 6 Mazda RX4 wt         2.62
 7 Mazda RX4 qsec      16.5 
 8 Mazda RX4 vs         0   
 9 Mazda RX4 am         1   
10 Mazda RX4 gear       4   
# ℹ 342 more rows

mtcars_w_names<-cars %>%
  pivot_wider(
    id_cols=car
    ,names_from="variable"
    ,values_from="value"
  )

print(mtcars_w_names)

# A tibble: 32 × 12
   car           mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows

Python Version

With polars, going from wide to long is an unpivot and long to wide is a pivot. Note that the pivot() is only available in eager mode, as shown below.

import polars.selectors as cs

q=(
    pl.scan_csv("mtcars_w_names.csv")
    .unpivot(
        on=cs.numeric()
        ,index="car"
        ,variable_name="variable"
        ,value_name="value"
    )
)

cars=q.collect()

print(cars)

shape: (352, 3)
┌───────────────────┬──────────┬───────┐
│ car               ┆ variable ┆ value │
│ ---               ┆ ---      ┆ ---   │
│ str               ┆ str      ┆ f64   │
╞═══════════════════╪══════════╪═══════╡
│ Mazda RX4         ┆ mpg      ┆ 21.0  │
│ Mazda RX4 Wag     ┆ mpg      ┆ 21.0  │
│ Datsun 710        ┆ mpg      ┆ 22.8  │
│ Hornet 4 Drive    ┆ mpg      ┆ 21.4  │
│ Hornet Sportabout ┆ mpg      ┆ 18.7  │
│ …                 ┆ …        ┆ …     │
│ Lotus Europa      ┆ carb     ┆ 2.0   │
│ Ford Pantera L    ┆ carb     ┆ 4.0   │
│ Ferrari Dino      ┆ carb     ┆ 6.0   │
│ Maserati Bora     ┆ carb     ┆ 8.0   │
│ Volvo 142E        ┆ carb     ┆ 2.0   │
└───────────────────┴──────────┴───────┘

mtcars_w_names=(
    cars
    .pivot(
        index="car"
        ,on="variable"
        ,values="value"
        ,aggregate_function=None
    )
)

print(mtcars_w_names)

shape: (32, 12)
┌───────────────────┬──────┬─────┬───────┬───┬─────┬─────┬──────┬──────┐
│ car               ┆ mpg  ┆ cyl ┆ disp  ┆ … ┆ vs  ┆ am  ┆ gear ┆ carb │
│ ---               ┆ ---  ┆ --- ┆ ---   ┆   ┆ --- ┆ --- ┆ ---  ┆ ---  │
│ str               ┆ f64  ┆ f64 ┆ f64   ┆   ┆ f64 ┆ f64 ┆ f64  ┆ f64  │
╞═══════════════════╪══════╪═════╪═══════╪═══╪═════╪═════╪══════╪══════╡
│ Mazda RX4         ┆ 21.0 ┆ 6.0 ┆ 160.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 4.0  ┆ 4.0  │
│ Mazda RX4 Wag     ┆ 21.0 ┆ 6.0 ┆ 160.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 4.0  ┆ 4.0  │
│ Datsun 710        ┆ 22.8 ┆ 4.0 ┆ 108.0 ┆ … ┆ 1.0 ┆ 1.0 ┆ 4.0  ┆ 1.0  │
│ Hornet 4 Drive    ┆ 21.4 ┆ 6.0 ┆ 258.0 ┆ … ┆ 1.0 ┆ 0.0 ┆ 3.0  ┆ 1.0  │
│ Hornet Sportabout ┆ 18.7 ┆ 8.0 ┆ 360.0 ┆ … ┆ 0.0 ┆ 0.0 ┆ 3.0  ┆ 2.0  │
│ …                 ┆ …    ┆ …   ┆ …     ┆ … ┆ …   ┆ …   ┆ …    ┆ …    │
│ Lotus Europa      ┆ 30.4 ┆ 4.0 ┆ 95.1  ┆ … ┆ 1.0 ┆ 1.0 ┆ 5.0  ┆ 2.0  │
│ Ford Pantera L    ┆ 15.8 ┆ 8.0 ┆ 351.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 5.0  ┆ 4.0  │
│ Ferrari Dino      ┆ 19.7 ┆ 6.0 ┆ 145.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 5.0  ┆ 6.0  │
│ Maserati Bora     ┆ 15.0 ┆ 8.0 ┆ 301.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 5.0  ┆ 8.0  │
│ Volvo 142E        ┆ 21.4 ┆ 4.0 ┆ 121.0 ┆ … ┆ 1.0 ┆ 1.0 ┆ 4.0  ┆ 2.0  │
└───────────────────┴──────┴─────┴───────┴───┴─────┴─────┴──────┴──────┘

Example 6. Joins

In this example we will show the various join types with two distinct but overlapping subsets of mtcars: cars with 6-cylinder engines and those with horsepower less than 110.

R Version

This code is pretty self-explanatory.

carsl<-mtcars %>%
    rownames_to_column("car") %>%
    filter(cyl==6) %>%
    select(car,cyl)

carsr<-mtcars %>%
    rownames_to_column("car") %>%
    filter(hp<110) %>%
    select(car,hp)

print(carsl)

             car cyl
1      Mazda RX4   6
2  Mazda RX4 Wag   6
3 Hornet 4 Drive   6
4        Valiant   6
5       Merc 280   6
6      Merc 280C   6
7   Ferrari Dino   6

print(carsr)

              car  hp
1      Datsun 710  93
2         Valiant 105
3       Merc 240D  62
4        Merc 230  95
5        Fiat 128  66
6     Honda Civic  52
7  Toyota Corolla  65
8   Toyota Corona  97
9       Fiat X1-9  66
10  Porsche 914-2  91
11     Volvo 142E 109

left_join(carsl,carsr,by="car")

             car cyl  hp
1      Mazda RX4   6  NA
2  Mazda RX4 Wag   6  NA
3 Hornet 4 Drive   6  NA
4        Valiant   6 105
5       Merc 280   6  NA
6      Merc 280C   6  NA
7   Ferrari Dino   6  NA

right_join(carsl,carsr,by="car")

              car cyl  hp
1         Valiant   6 105
2      Datsun 710  NA  93
3       Merc 240D  NA  62
4        Merc 230  NA  95
5        Fiat 128  NA  66
6     Honda Civic  NA  52
7  Toyota Corolla  NA  65
8   Toyota Corona  NA  97
9       Fiat X1-9  NA  66
10  Porsche 914-2  NA  91
11     Volvo 142E  NA 109

inner_join(carsl,carsr,by="car")

      car cyl  hp
1 Valiant   6 105

full_join(carsl,carsr,by="car")

              car cyl  hp
1       Mazda RX4   6  NA
2   Mazda RX4 Wag   6  NA
3  Hornet 4 Drive   6  NA
4         Valiant   6 105
5        Merc 280   6  NA
6       Merc 280C   6  NA
7    Ferrari Dino   6  NA
8      Datsun 710  NA  93
9       Merc 240D  NA  62
10       Merc 230  NA  95
11       Fiat 128  NA  66
12    Honda Civic  NA  52
13 Toyota Corolla  NA  65
14  Toyota Corona  NA  97
15      Fiat X1-9  NA  66
16  Porsche 914-2  NA  91
17     Volvo 142E  NA 109

anti_join(carsl,carsr,by="car")

             car cyl
1      Mazda RX4   6
2  Mazda RX4 Wag   6
3 Hornet 4 Drive   6
4       Merc 280   6
5      Merc 280C   6
6   Ferrari Dino   6

Python Version

carsl=(
    pl.scan_csv("mtcars_w_names.csv")
    .filter(pl.col("cyl")==6)
    .select("car","cyl")
)

carsr=(
    pl.scan_csv("mtcars_w_names.csv")
    .filter(pl.col("hp")<110)
    .select("car","hp")
)

print(carsl.collect())

shape: (7, 2)
┌────────────────┬─────┐
│ car            ┆ cyl │
│ ---            ┆ --- │
│ str            ┆ i64 │
╞════════════════╪═════╡
│ Mazda RX4      ┆ 6   │
│ Mazda RX4 Wag  ┆ 6   │
│ Hornet 4 Drive ┆ 6   │
│ Valiant        ┆ 6   │
│ Merc 280       ┆ 6   │
│ Merc 280C      ┆ 6   │
│ Ferrari Dino   ┆ 6   │
└────────────────┴─────┘

print(carsr.collect())

shape: (11, 2)
┌────────────────┬─────┐
│ car            ┆ hp  │
│ ---            ┆ --- │
│ str            ┆ i64 │
╞════════════════╪═════╡
│ Datsun 710     ┆ 93  │
│ Valiant        ┆ 105 │
│ Merc 240D      ┆ 62  │
│ Merc 230       ┆ 95  │
│ Fiat 128       ┆ 66  │
│ …              ┆ …   │
│ Toyota Corolla ┆ 65  │
│ Toyota Corona  ┆ 97  │
│ Fiat X1-9      ┆ 66  │
│ Porsche 914-2  ┆ 91  │
│ Volvo 142E     ┆ 109 │
└────────────────┴─────┘

print(carsl.join(carsr,on="car",how="left").collect())

shape: (7, 3)
┌────────────────┬─────┬──────┐
│ car            ┆ cyl ┆ hp   │
│ ---            ┆ --- ┆ ---  │
│ str            ┆ i64 ┆ i64  │
╞════════════════╪═════╪══════╡
│ Mazda RX4      ┆ 6   ┆ null │
│ Mazda RX4 Wag  ┆ 6   ┆ null │
│ Hornet 4 Drive ┆ 6   ┆ null │
│ Valiant        ┆ 6   ┆ 105  │
│ Merc 280       ┆ 6   ┆ null │
│ Merc 280C      ┆ 6   ┆ null │
│ Ferrari Dino   ┆ 6   ┆ null │
└────────────────┴─────┴──────┘

print(carsl.join(carsr,on="car",how="right").collect())

shape: (11, 3)
┌──────┬────────────────┬─────┐
│ cyl  ┆ car            ┆ hp  │
│ ---  ┆ ---            ┆ --- │
│ i64  ┆ str            ┆ i64 │
╞══════╪════════════════╪═════╡
│ null ┆ Datsun 710     ┆ 93  │
│ 6    ┆ Valiant        ┆ 105 │
│ null ┆ Merc 240D      ┆ 62  │
│ null ┆ Merc 230       ┆ 95  │
│ null ┆ Fiat 128       ┆ 66  │
│ …    ┆ …              ┆ …   │
│ null ┆ Toyota Corolla ┆ 65  │
│ null ┆ Toyota Corona  ┆ 97  │
│ null ┆ Fiat X1-9      ┆ 66  │
│ null ┆ Porsche 914-2  ┆ 91  │
│ null ┆ Volvo 142E     ┆ 109 │
└──────┴────────────────┴─────┘

print(carsl.join(carsr,on="car",how="inner").collect())

shape: (1, 3)
┌─────────┬─────┬─────┐
│ car     ┆ cyl ┆ hp  │
│ ---     ┆ --- ┆ --- │
│ str     ┆ i64 ┆ i64 │
╞═════════╪═════╪═════╡
│ Valiant ┆ 6   ┆ 105 │
└─────────┴─────┴─────┘

print(carsl.join(carsr,on="car",how="full",coalesce=True).collect())

shape: (17, 3)
┌────────────────┬──────┬──────┐
│ car            ┆ cyl  ┆ hp   │
│ ---            ┆ ---  ┆ ---  │
│ str            ┆ i64  ┆ i64  │
╞════════════════╪══════╪══════╡
│ Datsun 710     ┆ null ┆ 93   │
│ Valiant        ┆ 6    ┆ 105  │
│ Merc 240D      ┆ null ┆ 62   │
│ Merc 230       ┆ null ┆ 95   │
│ Fiat 128       ┆ null ┆ 66   │
│ …              ┆ …    ┆ …    │
│ Merc 280       ┆ 6    ┆ null │
│ Mazda RX4      ┆ 6    ┆ null │
│ Ferrari Dino   ┆ 6    ┆ null │
│ Hornet 4 Drive ┆ 6    ┆ null │
│ Merc 280C      ┆ 6    ┆ null │
└────────────────┴──────┴──────┘

print(carsl.join(carsr,on="car",how="anti").collect())

shape: (6, 2)
┌────────────────┬─────┐
│ car            ┆ cyl │
│ ---            ┆ --- │
│ str            ┆ i64 │
╞════════════════╪═════╡
│ Mazda RX4      ┆ 6   │
│ Mazda RX4 Wag  ┆ 6   │
│ Hornet 4 Drive ┆ 6   │
│ Merc 280       ┆ 6   │
│ Merc 280C      ┆ 6   │
│ Ferrari Dino   ┆ 6   │
└────────────────┴─────┘

Functional Programming

Example 7. Function for n & pct by Grouping Variables

Here we want a custom function to create simple, list-style frequency tables based on one or more variables provided by the user.

R Version

We use dynamic dots (...) here to tunnel in the variables provided by the user in the function call. This is powerful and flexible, allowing for 0+ variables provided as naked symbols rather than strings (doit()); an alternative version (doit2()) also uses dynamic dots, but with the intention to call with variable names provided as strings—this scales up better and is more comparable to the python version.

library(rlang)


Attaching package: 'rlang'

The following objects are masked from 'package:purrr':

    %@%, flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
    flatten_raw, invoke, splice

library(purrr)

doit<-function(df,...){
  df %>%
    ungroup() %>%
    mutate(N=n()) %>%
    group_by(...) %>%
    summarize(n=n(),pct=n()*100/mean(N),.groups="drop") %>%
    mutate(cumn=cumsum(n),cumpct=cumsum(pct))
}

doit(mtcars)

# A tibble: 1 × 4
      n   pct  cumn cumpct
  <int> <dbl> <int>  <dbl>
1    32   100    32    100

doit(mtcars,cyl)

# A tibble: 3 × 5
    cyl     n   pct  cumn cumpct
  <dbl> <int> <dbl> <int>  <dbl>
1     4    11  34.4    11   34.4
2     6     7  21.9    18   56.2
3     8    14  43.8    32  100

doit(mtcars,cyl,gear)

# A tibble: 8 × 6
    cyl  gear     n   pct  cumn cumpct
  <dbl> <dbl> <int> <dbl> <int>  <dbl>
1     4     3     1  3.12     1   3.12
2     4     4     8 25        9  28.1 
3     4     5     2  6.25    11  34.4 
4     6     3     2  6.25    13  40.6 
5     6     4     4 12.5     17  53.1 
6     6     5     1  3.12    18  56.2 
7     8     3    12 37.5     30  93.8 
8     8     5     2  6.25    32 100

doit2<-function(df,...){
    vars<-dots_list(...) %>%
        list_c() %>%
        syms()

    df %>%
        ungroup() %>%
        mutate(N=n()) %>%
        group_by(!!!vars) %>%
        summarize(n=n(),pct=n()*100/mean(N),.groups="drop") %>%
        mutate(cumn=cumsum(n),cumpct=cumsum(pct))
}

doit2(mtcars)

# A tibble: 1 × 4
      n   pct  cumn cumpct
  <int> <dbl> <int>  <dbl>
1    32   100    32    100

doit2(mtcars,"cyl")

# A tibble: 3 × 5
    cyl     n   pct  cumn cumpct
  <dbl> <int> <dbl> <int>  <dbl>
1     4    11  34.4    11   34.4
2     6     7  21.9    18   56.2
3     8    14  43.8    32  100

doit2(mtcars,"cyl","gear")

# A tibble: 8 × 6
    cyl  gear     n   pct  cumn cumpct
  <dbl> <dbl> <int> <dbl> <int>  <dbl>
1     4     3     1  3.12     1   3.12
2     4     4     8 25        9  28.1 
3     4     5     2  6.25    11  34.4 
4     6     3     2  6.25    13  40.6 
5     6     4     4 12.5     17  53.1 
6     6     5     1  3.12    18  56.2 
7     8     3    12 37.5     30  93.8 
8     8     5     2  6.25    32 100

Python Version

The version below gets very close! The only differences are that the python version of doit() doesn’t work as-is if 0 variables are provided, and the variable names are passed as strings (i.e., this isn’t optional as with the tidy versions). This latter point should actually simplify some situations that are complex due to data mask ambiguities in tidy functions.

def doit(df,*argv):
    q=(
        df
        .with_columns(pl.len().alias("N"))
        .group_by(*argv)
        .agg(
            pl.len().alias("n")
            ,((pl.len()*100)/pl.col("N").mean()).alias("pct")
        )
        .sort(*argv)
        .with_columns(
            pl.col("n").cum_sum().alias("cumn")
            ,pl.col("pct").cum_sum().alias("cumpct")
        )
    )
    table=q.collect()
    print(table)

doit(mtcars,"cyl")

shape: (3, 5)
┌─────┬─────┬────────┬──────┬────────┐
│ cyl ┆ n   ┆ pct    ┆ cumn ┆ cumpct │
│ --- ┆ --- ┆ ---    ┆ ---  ┆ ---    │
│ f64 ┆ u32 ┆ f64    ┆ u32  ┆ f64    │
╞═════╪═════╪════════╪══════╪════════╡
│ 4.0 ┆ 11  ┆ 34.375 ┆ 11   ┆ 34.375 │
│ 6.0 ┆ 7   ┆ 21.875 ┆ 18   ┆ 56.25  │
│ 8.0 ┆ 14  ┆ 43.75  ┆ 32   ┆ 100.0  │
└─────┴─────┴────────┴──────┴────────┘

doit(mtcars,"cyl","gear")

shape: (8, 6)
┌─────┬──────┬─────┬───────┬──────┬────────┐
│ cyl ┆ gear ┆ n   ┆ pct   ┆ cumn ┆ cumpct │
│ --- ┆ ---  ┆ --- ┆ ---   ┆ ---  ┆ ---    │
│ f64 ┆ f64  ┆ u32 ┆ f64   ┆ u32  ┆ f64    │
╞═════╪══════╪═════╪═══════╪══════╪════════╡
│ 4.0 ┆ 3.0  ┆ 1   ┆ 3.125 ┆ 1    ┆ 3.125  │
│ 4.0 ┆ 4.0  ┆ 8   ┆ 25.0  ┆ 9    ┆ 28.125 │
│ 4.0 ┆ 5.0  ┆ 2   ┆ 6.25  ┆ 11   ┆ 34.375 │
│ 6.0 ┆ 3.0  ┆ 2   ┆ 6.25  ┆ 13   ┆ 40.625 │
│ 6.0 ┆ 4.0  ┆ 4   ┆ 12.5  ┆ 17   ┆ 53.125 │
│ 6.0 ┆ 5.0  ┆ 1   ┆ 3.125 ┆ 18   ┆ 56.25  │
│ 8.0 ┆ 3.0  ┆ 12  ┆ 37.5  ┆ 30   ┆ 93.75  │
│ 8.0 ┆ 5.0  ┆ 2   ┆ 6.25  ┆ 32   ┆ 100.0  │
└─────┴──────┴─────┴───────┴──────┴────────┘

Example 8. Iterate a Custom Function

Here we want to apply the doit functions over parameters.

R Version

We can use purrr::pmap() in the R version with a list of parameters. Since we defined the R version of doit to take naked symbols, the mapped version is kind of clunky—a cleaner alternative based on doit2 highlights this point.

parms<-list(
    list(mtcars,mtcars)
    ,"var1"=list(mtcars$cyl,mtcars$cyl)
    ,"var2"=list(mtcars$gear,mtcars$am)
)

pmap(parms,doit)

[[1]]
# A tibble: 8 × 6
   var1  var2     n   pct  cumn cumpct
  <dbl> <dbl> <int> <dbl> <int>  <dbl>
1     4     3     1  3.12     1   3.12
2     4     4     8 25        9  28.1 
3     4     5     2  6.25    11  34.4 
4     6     3     2  6.25    13  40.6 
5     6     4     4 12.5     17  53.1 
6     6     5     1  3.12    18  56.2 
7     8     3    12 37.5     30  93.8 
8     8     5     2  6.25    32 100   

[[2]]
# A tibble: 6 × 6
   var1  var2     n   pct  cumn cumpct
  <dbl> <dbl> <int> <dbl> <int>  <dbl>
1     4     0     3  9.38     3   9.38
2     4     1     8 25       11  34.4 
3     6     0     4 12.5     15  46.9 
4     6     1     3  9.38    18  56.2 
5     8     0    12 37.5     30  93.8 
6     8     1     2  6.25    32 100

parms2<-list(
    list(mtcars,mtcars)
    ,c("cyl","cyl")
    ,c("gear","am")
)

pmap(parms2,doit2)

[[1]]
# A tibble: 8 × 6
    cyl  gear     n   pct  cumn cumpct
  <dbl> <dbl> <int> <dbl> <int>  <dbl>
1     4     3     1  3.12     1   3.12
2     4     4     8 25        9  28.1 
3     4     5     2  6.25    11  34.4 
4     6     3     2  6.25    13  40.6 
5     6     4     4 12.5     17  53.1 
6     6     5     1  3.12    18  56.2 
7     8     3    12 37.5     30  93.8 
8     8     5     2  6.25    32 100   

[[2]]
# A tibble: 6 × 6
    cyl    am     n   pct  cumn cumpct
  <dbl> <dbl> <int> <dbl> <int>  <dbl>
1     4     0     3  9.38     3   9.38
2     4     1     8 25       11  34.4 
3     6     0     4 12.5     15  46.9 
4     6     1     3  9.38    18  56.2 
5     8     0    12 37.5     30  93.8 
6     8     1     2  6.25    32 100

Python Version

Here we combine 3 parameter lists into a single iterator object via zip—we can then map doit over parms via itertools.starmap.

import itertools as it

parms=zip(
    [mtcars,mtcars]
    ,['cyl','cyl']
    ,['gear','am']
)

list(it.starmap(doit,parms))

shape: (8, 6)
┌─────┬──────┬─────┬───────┬──────┬────────┐
│ cyl ┆ gear ┆ n   ┆ pct   ┆ cumn ┆ cumpct │
│ --- ┆ ---  ┆ --- ┆ ---   ┆ ---  ┆ ---    │
│ f64 ┆ f64  ┆ u32 ┆ f64   ┆ u32  ┆ f64    │
╞═════╪══════╪═════╪═══════╪══════╪════════╡
│ 4.0 ┆ 3.0  ┆ 1   ┆ 3.125 ┆ 1    ┆ 3.125  │
│ 4.0 ┆ 4.0  ┆ 8   ┆ 25.0  ┆ 9    ┆ 28.125 │
│ 4.0 ┆ 5.0  ┆ 2   ┆ 6.25  ┆ 11   ┆ 34.375 │
│ 6.0 ┆ 3.0  ┆ 2   ┆ 6.25  ┆ 13   ┆ 40.625 │
│ 6.0 ┆ 4.0  ┆ 4   ┆ 12.5  ┆ 17   ┆ 53.125 │
│ 6.0 ┆ 5.0  ┆ 1   ┆ 3.125 ┆ 18   ┆ 56.25  │
│ 8.0 ┆ 3.0  ┆ 12  ┆ 37.5  ┆ 30   ┆ 93.75  │
│ 8.0 ┆ 5.0  ┆ 2   ┆ 6.25  ┆ 32   ┆ 100.0  │
└─────┴──────┴─────┴───────┴──────┴────────┘
shape: (6, 6)
┌─────┬─────┬─────┬───────┬──────┬────────┐
│ cyl ┆ am  ┆ n   ┆ pct   ┆ cumn ┆ cumpct │
│ --- ┆ --- ┆ --- ┆ ---   ┆ ---  ┆ ---    │
│ f64 ┆ f64 ┆ u32 ┆ f64   ┆ u32  ┆ f64    │
╞═════╪═════╪═════╪═══════╪══════╪════════╡
│ 4.0 ┆ 0.0 ┆ 3   ┆ 9.375 ┆ 3    ┆ 9.375  │
│ 4.0 ┆ 1.0 ┆ 8   ┆ 25.0  ┆ 11   ┆ 34.375 │
│ 6.0 ┆ 0.0 ┆ 4   ┆ 12.5  ┆ 15   ┆ 46.875 │
│ 6.0 ┆ 1.0 ┆ 3   ┆ 9.375 ┆ 18   ┆ 56.25  │
│ 8.0 ┆ 0.0 ┆ 12  ┆ 37.5  ┆ 30   ┆ 93.75  │
│ 8.0 ┆ 1.0 ┆ 2   ┆ 6.25  ┆ 32   ┆ 100.0  │
└─────┴─────┴─────┴───────┴──────┴────────┘
[None, None]

Example 9. Stack Data Frames by List Binding with Map and Anonymous Function

What we’re achieving with this example—returning mtcars—isn’t very useful, but it illustrates a common task: mapping an anonymous function over a vector to create a list of data frames which are subsequently stacked together via row binding. In other words, in this example we’re reassembling mtcars by stacking together portions returned from each iteration of map.

R Version

Pretty straight forward. Note that parms is a vector here.

parms<-distinct(mtcars,cyl) %>%
  pull()

list<-map(
  parms
  ,function (x){
    mtcars %>%
      dplyr::filter(cyl==x) %>%
      arrange(desc(mpg))
  }
)

df<-list_rbind(list)

print(df)

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4

Python Version

This is extremely similar. Note that we’re pulling from the csv of mtcars to utilize eager evaluation (for simplicity).

mtcars=pl.read_csv("mtcars.csv")

iterator=mtcars.get_column("cyl").unique()

df=map(
    lambda x: (
        mtcars
        .filter(pl.col("cyl")==x)
        .sort("mpg",descending=True)
    )
    ,iterator
)

df=pl.concat(list(df))

print(df)

shape: (32, 11)
┌──────┬─────┬───────┬─────┬───┬─────┬─────┬──────┬──────┐
│ mpg  ┆ cyl ┆ disp  ┆ hp  ┆ … ┆ vs  ┆ am  ┆ gear ┆ carb │
│ ---  ┆ --- ┆ ---   ┆ --- ┆   ┆ --- ┆ --- ┆ ---  ┆ ---  │
│ f64  ┆ i64 ┆ f64   ┆ i64 ┆   ┆ i64 ┆ i64 ┆ i64  ┆ i64  │
╞══════╪═════╪═══════╪═════╪═══╪═════╪═════╪══════╪══════╡
│ 33.9 ┆ 4   ┆ 71.1  ┆ 65  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 1    │
│ 32.4 ┆ 4   ┆ 78.7  ┆ 66  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 1    │
│ 30.4 ┆ 4   ┆ 75.7  ┆ 52  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 2    │
│ 30.4 ┆ 4   ┆ 95.1  ┆ 113 ┆ … ┆ 1   ┆ 1   ┆ 5    ┆ 2    │
│ 27.3 ┆ 4   ┆ 79.0  ┆ 66  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 1    │
│ …    ┆ …   ┆ …     ┆ …   ┆ … ┆ …   ┆ …   ┆ …    ┆ …    │
│ 14.7 ┆ 8   ┆ 440.0 ┆ 230 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
│ 14.3 ┆ 8   ┆ 360.0 ┆ 245 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
│ 13.3 ┆ 8   ┆ 350.0 ┆ 245 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
│ 10.4 ┆ 8   ┆ 472.0 ┆ 205 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
│ 10.4 ┆ 8   ┆ 460.0 ┆ 215 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
└──────┴─────┴───────┴─────┴───┴─────┴─────┴──────┴──────┘

Example 10. Stack Data Frames by List Binding with pmap and Anonymous Function of Dots (`...`)

This example generalizes the previous one to use a data frame with any number of columns (here we’re just using 2) to parameterize the mapping.

R Version

Dynamic dots are captured in the list parms within the function and column values are referenced as elements of that list.

parms<-distinct(mtcars,cyl,gear)

list<-pmap(
  parms
  ,function (...){
    parms<-rlang::dots_list(...)
    mtcars %>%
      dplyr::filter(cyl==parms$cyl & gear==parms$gear) 
  }
)

df<-list_rbind(list)

print(df)

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6

Python Version

This is very similar to the R version above. The key is that each row of the data frame parms is turned into a dictionary (i.e., has the form {'key1':_key1val_,...,'keyk':_keykval_}). iterator is then a list of dictionaries (iter_rows() returns dictionaries when named=True) which allows us to capture a single row of all parameters needed in the body of the anonymous function with the single parameter dctnry. The parameter values can be referenced by the original parms data frame variable name in the function body via the dictionary method get(). Note that there is a Polars DataFrame method map_rows() that does something similar, but there’s no way to preserve variable names for reference inside the function, so this approach seems preferrable.

mtcars=pl.read_csv("mtcars.csv")

parms=(
    mtcars
    .group_by("cyl","gear")
    .agg()
)

iterator=list(parms.iter_rows(named=True))

dfs=map(
    lambda dctnry: (
        mtcars
        .filter(
            (pl.col("cyl")==dctnry.get('cyl')) & 
            (pl.col("gear")==dctnry.get('gear'))
        )
    )
    ,iterator
)

df=pl.concat(list(dfs))

print(df)

shape: (32, 11)
┌──────┬─────┬───────┬─────┬───┬─────┬─────┬──────┬──────┐
│ mpg  ┆ cyl ┆ disp  ┆ hp  ┆ … ┆ vs  ┆ am  ┆ gear ┆ carb │
│ ---  ┆ --- ┆ ---   ┆ --- ┆   ┆ --- ┆ --- ┆ ---  ┆ ---  │
│ f64  ┆ i64 ┆ f64   ┆ i64 ┆   ┆ i64 ┆ i64 ┆ i64  ┆ i64  │
╞══════╪═════╪═══════╪═════╪═══╪═════╪═════╪══════╪══════╡
│ 18.7 ┆ 8   ┆ 360.0 ┆ 175 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 2    │
│ 14.3 ┆ 8   ┆ 360.0 ┆ 245 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
│ 16.4 ┆ 8   ┆ 275.8 ┆ 180 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 3    │
│ 17.3 ┆ 8   ┆ 275.8 ┆ 180 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 3    │
│ 15.2 ┆ 8   ┆ 275.8 ┆ 180 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 3    │
│ …    ┆ …   ┆ …     ┆ …   ┆ … ┆ …   ┆ …   ┆ …    ┆ …    │
│ 21.4 ┆ 6   ┆ 258.0 ┆ 110 ┆ … ┆ 1   ┆ 0   ┆ 3    ┆ 1    │
│ 18.1 ┆ 6   ┆ 225.0 ┆ 105 ┆ … ┆ 1   ┆ 0   ┆ 3    ┆ 1    │
│ 15.8 ┆ 8   ┆ 351.0 ┆ 264 ┆ … ┆ 0   ┆ 1   ┆ 5    ┆ 4    │
│ 15.0 ┆ 8   ┆ 301.0 ┆ 335 ┆ … ┆ 0   ┆ 1   ┆ 5    ┆ 8    │
│ 21.5 ┆ 4   ┆ 120.1 ┆ 97  ┆ … ┆ 1   ┆ 0   ┆ 3    ┆ 1    │
└──────┴─────┴───────┴─────┴───┴─────┴─────┴──────┴──────┘

Citation

BibTeX citation:

@online{couzens2025,
  author = {Couzens, Lance},
  title = {Recreating {Some} {Tidy-Style} {R} {Operations} with {Python}
    and {Polars}},
  date = {2025-03-19},
  url = {https://mostlyunoriginal.github.io/posts/2025-03-19-TidyR-to-PolarsPython/},
  langid = {en}
}

For attribution, please cite this work as:

Couzens, Lance. 2025. “Recreating Some Tidy-Style R Operations with Python and Polars.” March 19, 2025. https://mostlyunoriginal.github.io/posts/2025-03-19-TidyR-to-PolarsPython/.

Background

Data Manipulation

Example 1. Basic Summarize without Generalization across Variables

R Version

Python Version

Example 2. Basic Mutate with Grouping and without Generalization

R Version

Python Version

Example 3. Summarize Generalized by Variable Type with Across

R Version

Python Version

Example 4. Conditional Recode

R Version

Python Version

Example 5. Pivots

R Version

Python Version

Example 6. Joins

R Version

Python Version

Functional Programming

Example 7. Function for n & pct by Grouping Variables

R Version

Python Version

Example 8. Iterate a Custom Function

R Version

Python Version

Example 9. Stack Data Frames by List Binding with Map and Anonymous Function

R Version

Python Version

Example 10. Stack Data Frames by List Binding with pmap and Anonymous Function of Dots (...)

R Version

Python Version

Citation

Example 10. Stack Data Frames by List Binding with pmap and Anonymous Function of Dots (`...`)