= ["zero", "one", "two", "three", "two", "four", "five", "one", "six"]
schema
= [
data 0, 1, 2, 3, 2, 4, 5, 1, 6],
[0, 1, 2, 3, 2, 4, 5, 1, 6],
[0, 1, 2, 3, 2, 4, 5, 1, 6],
[ ]
The Issue
As is often the case, I introduced a new feature to an existing codebase and thereby introduced a new bug. Some days ago now I updated my cendat
library to allow for a shortcut in obtaining all variables in a group from the Census API. This is very handy and wonderful, but in some cases the basket of group variables includes NAME
and GEO_ID
, which the user separately has the option to pull. The result was that in many cases the returned JSON object contained duplication of those variables, and this could cause problems when those objects were turned into DataFrames.
Initially, I thought I would just internally undo the user’s variable-level requests for NAME
and GEO_ID
if they were doing a group-level call, but I’m just not sure that all groups will always include those variables, which would leave the user with no way to obtain them in a group call. In the end I settled on it being best to remove duplicates in the raw data before it’s processed as a Polars or Pandas DataFrame. Here’s an abstracted example of the issue and its remedy which is simple and concise using basic Python tools.
Incoming Data
Here we can see the format in which the raw data are obtained: schema
is a list of strings containing the variable names, and data
is a list of lists, each representing a row of data. As shown here, some columns may be duplicated, and we need to identify the duplicates and remove them before processing further.
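To make the problem concrete, here is a minimal sketch, not from the original post, of what happens when the duplicated columns above are handed straight to pandas:

```python
# A minimal sketch (illustrative, not from cendat itself) of why the
# duplicated columns are a problem downstream.
import pandas as pd

df = pd.DataFrame(data, columns=schema)

# pandas tolerates duplicate column labels, but selecting one now
# returns a two-column DataFrame rather than a Series, which breaks
# code expecting a single column.
print(df["two"].shape)  # (3, 2)
```

Polars is stricter and refuses duplicate column names at construction time, so either way the duplicates have to go.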
A Dictionary for Column Indexes
```python
from collections import defaultdict
from pprint import pprint

index_map = defaultdict(list)
for index, name in enumerate(schema):
    index_map[name].append(index)

pprint(f"{index_map=}")
```

```
("index_map=defaultdict(<class 'list'>, {'zero': [0], 'one': [1, 7], 'two': "
 "[2, 4], 'three': [3], 'four': [5], 'five': [6], 'six': [8]})")
```
The first step is to create the dictionary `index_map`, which we instantiate with `defaultdict` so that each new key defaults to an empty list. We then enumerate over the `schema` list to build out the dictionary's items, with keys being the unique variable names and values being the list of indexes at which each name is found. Variables that occur only once in `schema` will have index lists of length 1, while duplicated variables will have multiple entries in their index lists.
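For comparison, the same mapping can be built with a plain dict and `setdefault`; this is purely an illustrative alternative, and I find the `defaultdict` version cleaner:

```python
# Alternative construction with a plain dict — illustrative only.
index_map_alt = {}
for index, name in enumerate(schema):
    index_map_alt.setdefault(name, []).append(index)

assert index_map_alt == index_map  # same keys and index lists
```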
A Set of Indexes Marked for Removal
Next, we create the set `removals` to hold the indexes at which extra occurrences of variables appear in the data rows. Since we know these indexes will be unique, a set is a natural fit, and it gives us constant-time membership checks in the comprehensions that follow.
```python
removals = set()
for indexes in index_map.values():
    if len(indexes) > 1:
        removals.update(indexes[1:])

pprint(f"{removals=}")
```

```
'removals={4, 7}'
```
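If you prefer a single expression over the explicit loop, the same set falls out of a set comprehension; an equivalent, illustrative one-liner:

```python
# Equivalent set comprehension — slice off the first (kept) index of
# every variable and collect the rest.
removals_alt = {i for indexes in index_map.values() for i in indexes[1:]}
assert removals_alt == removals
```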
Updated Data Rows
Now we have everything we need to update the raw data before we attempt to process it. We update both the schema and the data rows via list comprehensions.
```python
new_schema = [var for i, var in enumerate(schema) if i not in removals]
new_data = [
    [datum for i, datum in enumerate(row) if i not in removals]
    for row in data
]

pprint(f"{new_schema=}")
pprint(f"{new_data=}")
```

```
"new_schema=['zero', 'one', 'two', 'three', 'four', 'five', 'six']"
'new_data=[[0, 1, 2, 3, 4, 5, 6], [0, 1, 2, 3, 4, 5, 6], [0, 1, 2, 3, 4, 5, 6]]'
```
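With the duplicates gone, DataFrame construction behaves as expected. A quick, illustrative check using pandas (the real cendat code paths may differ):

```python
# Sanity check: the deduplicated schema and rows now produce a clean
# seven-column DataFrame, and column selection returns a Series.
import pandas as pd

clean_df = pd.DataFrame(new_data, columns=new_schema)
print(clean_df.shape)         # (3, 7)
print(type(clean_df["two"]))  # <class 'pandas.core.series.Series'>
```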
I found this solution both satisfying and a nice, compact example of why I really like Python: you've got a variety of useful collection types to work with (here we used lists, dictionaries, and sets) and syntactic constructs like comprehensions to manipulate them easily and conditionally.
Citation
@online{couzens2025,
author = {Couzens, Lance},
title = {A {Quick} {Patch} / {Love} {Letter} to {Python}},
date = {2025-09-06},
url = {https://mostlyunoriginal.github.io/posts/2025-09-06-quick-patch-python-love/},
langid = {en}
}