Theoretical Key Collision in remove_duplicates_by_uniques #3632

@atheendre130505

Description

Description: There are two significant issues in the deduplication utility functions located in augur/tasks/util/worker_util.py:

Key Collision in remove_duplicates_by_uniques:
This function generates lookup keys by joining field values with underscores ("_".join(...)). This leads to collisions when field values contain underscores (e.g., {"a": "foo_bar", "b": "baz"} and {"a": "foo", "b": "bar_baz"} both result in the key "foo_bar_baz"). Additionally, it uses str() on values, which causes None and the string "None" to collide.

Crash in remove_duplicate_dicts:
This function uses set(tuple(x.items()) for x in data) to deduplicate. If any dictionary value is itself a dictionary (nested data), the tuple produced by tuple(x.items()) contains a dictionary, which is unhashable, so building the set raises a TypeError.
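The key collision can be reproduced in isolation, without importing Augur. The sketch below is only an illustration: `joined_key` is a hypothetical stand-in that mirrors the key construction described above (it is not the code from worker_util.py), and `tuple_key` is one possible alternative that assumes the selected field values are hashable.

```python
def joined_key(record, fields):
    # Hypothetical stand-in for the key construction described above:
    # str() each value, then join with underscores.
    return "_".join(str(record[field]) for field in fields)


def tuple_key(record, fields):
    # Tuple-based key: field boundaries and value types are preserved.
    # Assumes every selected value is hashable.
    return tuple(record[field] for field in fields)


fields = ["a", "b"]
first = {"a": "foo_bar", "b": "baz"}
second = {"a": "foo", "b": "bar_baz"}

# Both records collapse to the same joined key, "foo_bar_baz".
print(joined_key(first, fields) == joined_key(second, fields))  # True
# The tuple keys stay distinct.
print(tuple_key(first, fields) == tuple_key(second, fields))  # False

# str() also erases the difference between None and the string "None".
print(joined_key({"a": None}, ["a"]) == joined_key({"a": "None"}, ["a"]))  # True
print(tuple_key({"a": None}, ["a"]) == tuple_key({"a": "None"}, ["a"]))  # False
```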

Create a script that calls these utility functions with the following data:

Case 1: Collision

data = [{"a": "foo_bar", "b": "baz"}, {"a": "foo", "b": "bar_baz"}]
remove_duplicates_by_uniques(data, ["a", "b"]) # Returns 1 item instead of 2

Case 2: Crash

data = [{"id": 1, "metadata": {"key": "value"}}]
remove_duplicate_dicts(data) # Raises TypeError: unhashable type: 'dict'

Run the script.
Observe the incorrect deduplication and the crash.
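The crash in Case 2 can also be checked without Augur installed. This minimal sketch runs only the expression quoted in the description, set(tuple(x.items()) for x in data), on the Case 2 data; it does not call the real remove_duplicate_dicts, and the variable names are illustrative.

```python
data = [{"id": 1, "metadata": {"key": "value"}}]

raised = False
try:
    # Same expression as quoted in the description. The tuple for each record
    # contains the nested dict, which cannot be hashed into a set.
    set(tuple(x.items()) for x in data)
except TypeError as exc:
    raised = True
    print(f"TypeError: {exc}")

print(raised)  # True
```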

remove_duplicates_by_uniques should handle values with underscores without collision (e.g., by using a safer separator or a tuple-based key). remove_duplicate_dicts should handle nested dictionaries (e.g., by recursively converting them to hashable types or using a different deduplication approach).

Screenshots N/A (Utility function logic issue)
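One way the expected behavior could be met is sketched below. This is an assumption-laden illustration, not a proposed patch: `freeze`, `dedupe_dicts`, and `dedupe_by_fields` are hypothetical names, the real functions' signatures and return conventions in worker_util.py may differ, and the sketch assumes every requested field is present in every record. It keeps the first occurrence of each record and preserves input order.

```python
def freeze(value):
    # Recursively convert a value into a hashable equivalent.
    # Containers are tagged with their type so that, for example,
    # a list and a tuple with the same items do not compare equal.
    if isinstance(value, dict):
        return ("dict", frozenset((key, freeze(item)) for key, item in value.items()))
    if isinstance(value, (list, tuple)):
        return (type(value).__name__, tuple(freeze(item) for item in value))
    if isinstance(value, (set, frozenset)):
        return ("set", frozenset(freeze(item) for item in value))
    return value


def dedupe_dicts(data):
    # Hypothetical replacement for whole-record deduplication.
    seen = set()
    unique = []
    for record in data:
        key = freeze(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def dedupe_by_fields(data, fields):
    # Hypothetical replacement for field-based deduplication, using a
    # tuple key instead of an underscore-joined string.
    seen = set()
    unique = []
    for record in data:
        key = tuple(freeze(record[field]) for field in fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


collision_case = [{"a": "foo_bar", "b": "baz"}, {"a": "foo", "b": "bar_baz"}]
print(len(dedupe_by_fields(collision_case, ["a", "b"])))  # 2

nested_case = [
    {"id": 1, "metadata": {"key": "value"}},
    {"id": 1, "metadata": {"key": "value"}},
]
print(len(dedupe_dicts(nested_case)))  # 1
```

A tuple key avoids the need to pick a separator that can never appear in the data, and the recursive conversion keeps nested dictionaries comparable by content regardless of key order.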

Log files N/A (Identified during code audit and verified via standalone script)

Software versions:

Augur: 0.91.0
OS: Microsoft Windows 10 Home Single Language (10.0.19045)
Browser: N/A
