Description
There are two significant issues in the deduplication utility functions in augur/tasks/util/worker_util.py:
Key collision in remove_duplicates_by_uniques:
This function generates lookup keys by joining field values with underscores ("_".join(...)). Keys collide whenever a field value contains an underscore: {"a": "foo_bar", "b": "baz"} and {"a": "foo", "b": "bar_baz"} both produce the key "foo_bar_baz". The function also calls str() on each value, so None and the string "None" collide as well.
Crash in remove_duplicate_dicts:
This function deduplicates with set(tuple(x.items()) for x in data). If any dictionary value is itself a dictionary (nested data), tuple(x.items()) contains a dictionary, which is unhashable, so building the set raises a TypeError.
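Both failure modes can be shown in isolation, without importing Augur. The snippet below is a minimal sketch that only mirrors the expressions quoted above ("_".join(...) over str() values, and hashing tuple(x.items())); the variable names are illustrative and it does not call the real functions.

```python
# Illustrative only: mirrors the key-building expressions described above,
# not the actual code in augur/tasks/util/worker_util.py.
fields = ["a", "b"]
rec1 = {"a": "foo_bar", "b": "baz"}
rec2 = {"a": "foo", "b": "bar_baz"}

# Underscore joining: two different records map to the same key.
key1 = "_".join(str(rec1[f]) for f in fields)
key2 = "_".join(str(rec2[f]) for f in fields)
collides = key1 == key2

# str() erases the difference between None and the string "None".
none_collides = str(None) == str("None")

# A tuple is only hashable if every element is; the nested dict is not.
nested = {"id": 1, "metadata": {"key": "value"}}
try:
    hash(tuple(nested.items()))
    hash_error = None
except TypeError as exc:
    hash_error = str(exc)

print(collides, none_collides, hash_error)
```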
Create a script that calls these utility functions with the following data:
Case 1: Collision
data = [{"a": "foo_bar", "b": "baz"}, {"a": "foo", "b": "bar_baz"}]
remove_duplicates_by_uniques(data, ["a", "b"]) # Returns 1 item instead of 2
Case 2: Crash
data = [{"id": 1, "metadata": {"key": "value"}}]
remove_duplicate_dicts(data) # Raises TypeError: unhashable type: 'dict'
Run the script.
Observe the incorrect deduplication and the crash.
remove_duplicates_by_uniques should handle values containing underscores without collisions (e.g., by using a safer separator or a tuple-based key).
remove_duplicate_dicts should handle nested dictionaries (e.g., by recursively converting them to hashable types or by using a different deduplication approach).
Screenshots N/A (Utility function logic issue)
Log files N/A (Identified during code audit and verified via standalone script)
Software versions:
Augur: 0.91.0
OS: Microsoft Windows 10 Home Single Language (10.0.19045)
Browser: N/A
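One possible shape for the fix suggested under the expected behavior is sketched below: tuple-based keys for the field lookup, and a recursive conversion to hashable values for whole-dict deduplication. This is a hedged sketch, not Augur's implementation. The names freeze, dedupe_by_fields, and dedupe_dicts are hypothetical, and keeping the first occurrence while preserving input order is an assumption about the desired semantics.

```python
from typing import Any, Dict, Hashable, List

# Sentinel so that a missing field does not collide with an explicit None.
_MISSING = object()


def freeze(value: Any) -> Hashable:
    """Recursively convert dicts, lists, tuples, and sets to hashable values.

    Each container is tagged with its kind so that, for example, a dict and
    a set of pairs cannot produce the same frozen value.
    """
    if isinstance(value, dict):
        return ("dict", frozenset((k, freeze(v)) for k, v in value.items()))
    if isinstance(value, (list, tuple)):
        return ("seq", tuple(freeze(v) for v in value))
    if isinstance(value, (set, frozenset)):
        return ("set", frozenset(freeze(v) for v in value))
    return value


def dedupe_by_fields(data: List[Dict[str, Any]], fields: List[str]) -> List[Dict[str, Any]]:
    """Keep the first row for each distinct tuple of field values."""
    seen = set()
    result = []
    for row in data:
        # A tuple key keeps field boundaries and value types intact,
        # unlike joining str() values with a separator.
        key = tuple(freeze(row.get(f, _MISSING)) for f in fields)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


def dedupe_dicts(data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first occurrence of each distinct dict, nested values included."""
    seen = set()
    result = []
    for row in data:
        key = freeze(row)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result
```

With the data from Case 1, dedupe_by_fields returns both rows, and with the data from Case 2, dedupe_dicts returns the single row without raising. Note that Python treats 1, 1.0, and True as equal keys, and this sketch treats a list and a tuple with the same elements as equal, so a real fix may need stricter rules.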