[tasks] Fix deduplication key collisions and crash on nested data #3639
Draft
atheendre130505 wants to merge 1 commit into chaoss:main
Signed-off-by: atheendre130505 <atheendreramesh@gmail.com>
Force-pushed: d617405 to 4770953
Collaborator
Tuple-based keys fix the theoretical collision,
This PR resolves two critical logic issues in the deduplication utilities within augur/tasks/util/worker_util.py:
Key Collision: Replaced underscore-joined string keys with tuple-based keys in remove_duplicates_by_uniques. This prevents false duplicates when field values contain underscores, and keeps None distinct from the string "None".
TypeError Crash: Introduced a recursive _make_hashable helper function to handle nested unhashable structures (dictionaries and lists), preventing crashes in remove_duplicate_dicts and remove_duplicates_by_uniques.
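To make the key collision concrete, here is a minimal sketch of the two keying strategies. The helper names string_key and tuple_key are hypothetical and are not taken from worker_util.py; the underscore-joined form is only an assumed reconstruction of the old behavior, based on the description above.

```python
def string_key(record, fields):
    # Assumed old-style key: stringify each value and join with underscores.
    return "_".join(str(record.get(f)) for f in fields)


def tuple_key(record, fields):
    # Tuple key: each value keeps its own slot and its own type.
    return tuple(record.get(f) for f in fields)


fields = ["a", "b"]

# Underscores inside values blur the field boundary in the joined string.
r1 = {"a": "foo_bar", "b": "baz"}
r2 = {"a": "foo", "b": "bar_baz"}
boundary_collides = string_key(r1, fields) == string_key(r2, fields)
boundary_distinct = tuple_key(r1, fields) != tuple_key(r2, fields)

# str(None) and the literal string "None" are indistinguishable once joined.
r3 = {"a": None, "b": "x"}
r4 = {"a": "None", "b": "x"}
none_collides = string_key(r3, fields) == string_key(r4, fields)
none_distinct = tuple_key(r3, fields) != tuple_key(r4, fields)
```

In this sketch both string keys collapse to the same value, while the tuple keys stay distinct because tuple equality compares element by element.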
This PR fixes #3632
Notes for Reviewers
The fix adds a new internal helper, _make_hashable, which converts any nested data structure to a sorted, immutable tuple before it is used as a set or dictionary key. Edge cases were tested to verify that boundary collisions (e.g., {"a": "foo_bar", "b": "baz"} vs {"a": "foo", "b": "bar_baz"}) no longer produce false duplicates.
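For reviewers who want a mental model of the helper, the following is a rough sketch of how a recursive conversion of this kind might look. It is not the code in this PR: make_hashable_sketch and dedupe_dicts_sketch are hypothetical names, and the sketch assumes dictionary keys are mutually comparable (for example, all strings).

```python
def make_hashable_sketch(value):
    # Dicts become a tuple of (key, converted value) pairs, sorted by key,
    # so that insertion order does not affect the result.
    if isinstance(value, dict):
        items = ((k, make_hashable_sketch(v)) for k, v in value.items())
        return tuple(sorted(items, key=lambda item: item[0]))
    # Lists (and tuples) become tuples, preserving element order.
    if isinstance(value, (list, tuple)):
        return tuple(make_hashable_sketch(v) for v in value)
    # Anything else is assumed to be hashable already.
    return value


def dedupe_dicts_sketch(records):
    # Keep the first occurrence of each distinct record.
    seen = set()
    unique = []
    for record in records:
        key = make_hashable_sketch(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


rows = [
    {"id": 1, "tags": ["x", "y"], "meta": {"k": "v"}},
    {"meta": {"k": "v"}, "tags": ["x", "y"], "id": 1},  # same data, different key order
    {"id": 1, "tags": ["y", "x"], "meta": {"k": "v"}},  # list order differs
]
deduped = dedupe_dicts_sketch(rows)
```

Under these assumptions the second row is dropped as a duplicate of the first, while the third is kept because list order is treated as significant. One limitation of this simplified version is that a dict and a list of key/value pairs with the same contents map to the same tuple; whether that matters depends on the data being deduplicated.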
All original imports and the existing heading structure of the file have been preserved.
Signed commits
Yes, I signed my commits.
Signed-off-by: atheendre130505 <atheendre@gmail.com>