Skip to content

[tasks] Fix deduplication key collisions and crash on nested data#3639

Draft
atheendre130505 wants to merge 1 commit intochaoss:mainfrom
atheendre130505:fix/augur-deduplication-v2
Draft

[tasks] Fix deduplication key collisions and crash on nested data#3639
atheendre130505 wants to merge 1 commit intochaoss:mainfrom
atheendre130505:fix/augur-deduplication-v2

Conversation

@atheendre130505
Copy link
Contributor

@atheendre130505 atheendre130505 commented Jan 31, 2026

This PR resolves two critical logic issues in the deduplication utilities within
augur/tasks/util/worker_util.py:

Key Collision: Replaced underscore-joined string keys with tuple-based keys in remove_duplicates_by_uniques
to prevent false duplicates when field values contain underscores or when distinguishing between None and "None".
TypeError Crash: Introduced a recursive _make_hashable helper function to handle nested unhashable structures (dictionaries and lists), preventing crashes in remove_duplicate_dicts and remove_duplicates_by_uniques.

This PR fixes #3632

Notes for Reviewers
The fix incorporates a new internal helper _make_hashable that ensures any nested data structure is converted to a sorted, immutable tuple before being used as a set or dictionary key. Rigorous edge case testing was performed to verify that boundary collisions (e.g., {"a": "foo_bar", "b": "baz"} vs {"a": "foo", "b": "bar_baz"}) are now handled correctly.

All original imports and the existing heading structure of the file have been preserved.
Signed commits

Yes, I signed my commits.
Signed-off-by:atheendre130505 atheendre@gmail.com

Signed-off-by: atheendre130505 <atheendreramesh@gmail.com>
@atheendre130505 atheendre130505 force-pushed the fix/augur-deduplication-v2 branch from d617405 to 4770953 Compare January 31, 2026 13:43
@shlokgilda
Copy link
Collaborator

Tuple-based keys fix the theoretical collision, _make_hashable handles nested structures, and the new remove_duplicate_dicts is actually better than the original since it preserves insertion order (the old set() approach didn't). Although, I am not sure if we even want to add these change considering the fact that this is a defensive pattern. We can discuss this further in the issue #3632.

@MoralCode MoralCode added the waiting This change is waiting for some other changes to land first label Feb 3, 2026
@MoralCode MoralCode marked this pull request as draft February 3, 2026 18:18
@MoralCode MoralCode added the discussion Seeking active feedback, usually for items under active development label Feb 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

discussion Seeking active feedback, usually for items under active development waiting This change is waiting for some other changes to land first

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Theoretical Key Collision in remove_duplicates_by_uniques

3 participants