= ["zero", "one", "two", "three", "two", "four", "five", "one", "six"]
schema
= [
data 0, 1, 2, 3, 2, 4, 5, 1, 6],
[0, 1, 2, 3, 2, 4, 5, 1, 6],
[0, 1, 2, 3, 2, 4, 5, 1, 6],
[ ]
The Issue
As is often the case, I introduced a new feature to an existing codebase and thereby introduced a new bug. Some days ago now I updated my cendat
library to allow for a shortcut in obtaining all variables in a group from the Census API. This is very handy and wonderful, but in some cases the basket of group variables includes NAME
and GEO_ID
, which the user separately has the option to pull. The result was that in many cases the returned JSON object contained duplication of those variables, and this could cause problems when those objects were turned into DataFrames.
Initially, I thought I would just internally undo the user’s variable-level requests for NAME
and GEO_ID
if they were doing a group-level call, but I’m just not sure that all groups will always include those variables, which would leave the user with no way to obtain them in a group call. In the end I settled on it being best to remove duplicates in the raw data before it’s processed as a Polars or Pandas DataFrame. Here’s an abstracted example of the issue and its remedy which is simple and concise using basic Python tools.
Incoming Data
Here we can see the format in which the raw data are obtained: schema
is a list of strings containing the variable names, and data
is a list of lists, each representing a row of data. As shown here, some columns may be duplicated, and we need to identify the duplicates and remove them before processing further.
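To make the problem concrete, here is a minimal sketch, not from the original post, of what happens when the duplicated columns above are handed straight to pandas:

```python
# A minimal sketch (illustrative, not from cendat itself) of why the
# duplicated columns are a problem downstream.
import pandas as pd

df = pd.DataFrame(data, columns=schema)

# pandas tolerates duplicate column labels, but selecting one now
# returns a two-column DataFrame rather than a Series, which breaks
# code expecting a single column.
print(df["two"].shape)  # (3, 2)
```

Polars is stricter and refuses duplicate column names at construction time, so either way the duplicates have to go.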
A Dictionary for Column Indexes
```python
from collections import defaultdict
from pprint import pprint

index_map = defaultdict(list)
for index, name in enumerate(schema):
    index_map[name].append(index)

pprint(f"{index_map=}")
```

```
("index_map=defaultdict(<class 'list'>, {'zero': [0], 'one': [1, 7], 'two': "
 "[2, 4], 'three': [3], 'four': [5], 'five': [6], 'six': [8]})")
```
The first step is to create the dictionary `index_map`, which we instantiate with `defaultdict` so that each new key defaults to an empty list. We then enumerate over the `schema` list to build out the dictionary's items, with keys being the unique variable names and values being the list of indexes at which each name is found. Variables that occur only once in `schema` will have index lists of length 1, while duplicated variables will have multiple entries in their index lists.
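For comparison, the same mapping can be built with a plain dict and `setdefault`; this is purely an illustrative alternative, and I find the `defaultdict` version cleaner:

```python
# Alternative construction with a plain dict — illustrative only.
index_map_alt = {}
for index, name in enumerate(schema):
    index_map_alt.setdefault(name, []).append(index)

assert index_map_alt == index_map  # same keys and index lists
```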
A Set of Indexes Marked for Removal
Next, we create the set `removals` to hold the indexes at which extra occurrences of variables appear in the data rows. Since we know these indexes will be unique, a set is a natural fit, and it gives us constant-time membership checks in the comprehensions that follow.
```python
removals = set()
for indexes in index_map.values():
    if len(indexes) > 1:
        removals.update(indexes[1:])

pprint(f"{removals=}")
```

```
'removals={4, 7}'
```
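If you prefer a single expression over the explicit loop, the same set falls out of a set comprehension; an equivalent, illustrative one-liner:

```python
# Equivalent set comprehension — slice off the first (kept) index of
# every variable and collect the rest.
removals_alt = {i for indexes in index_map.values() for i in indexes[1:]}
assert removals_alt == removals
```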
Updated Data Rows
Now we have everything we need to update the raw data before we attempt to process it. We update both the schema and the data rows via list comprehensions.
```python
new_schema = [var for i, var in enumerate(schema) if i not in removals]
new_data = [
    [datum for i, datum in enumerate(row) if i not in removals]
    for row in data
]

pprint(f"{new_schema=}")
pprint(f"{new_data=}")
```

```
"new_schema=['zero', 'one', 'two', 'three', 'four', 'five', 'six']"
'new_data=[[0, 1, 2, 3, 4, 5, 6], [0, 1, 2, 3, 4, 5, 6], [0, 1, 2, 3, 4, 5, 6]]'
```
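With the duplicates gone, DataFrame construction behaves as expected. A quick, illustrative check using pandas (the real cendat code paths may differ):

```python
# Sanity check: the deduplicated schema and rows now produce a clean
# seven-column DataFrame, and column selection returns a Series.
import pandas as pd

clean_df = pd.DataFrame(new_data, columns=new_schema)
print(clean_df.shape)         # (3, 7)
print(type(clean_df["two"]))  # <class 'pandas.core.series.Series'>
```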
I found this solution both satisfying and a nice, compact example of why I really like Python: you've got a variety of useful collection types to work with (here we used lists, dictionaries, and sets) and syntactic constructs like comprehensions to manipulate them easily and conditionally.
Citation
@online{couzens2025,
author = {Couzens, Lance},
title = {A {Quick} {Patch} / {Love} {Letter} to {Python}},
date = {2025-09-06},
url = {https://mostlyunoriginal.github.io/posts/2025-09-06-quick-patch-python-love/},
langid = {en}
}