[PATCH RFC] releasenotes: add similarity check function to compare incoming notes

Jun Wu quark at fb.com
Tue Jul 4 20:37:53 EDT 2017


Excerpts from Rishabh Madan's message of 2017-07-01 11:16:12 +0200:
> > By the way, did you notice that there is mercurial/similar.py? That file
> > has a "_score" which is simple but somehow effective. It is currently
> > based on line diffs. If we can do word diffs, do you think
> > similar._score could somehow be reused to satisfy your need?
> 
> I took a look at the _score function. If I understand correctly, by word
> diffs you mean first tokenizing the strings and then matching words,
> right? If we plan to do something like this, we can go one step
> further and simply duplicate the fuzz function that we use from
> fuzzywuzzy. That function tokenizes both strings and forms three
> new strings: the first contains the words common to both strings
> (alphabetically sorted), the second contains the sorted intersection
> followed by the remaining words from string 1 (alphabetically sorted),
> and the third contains the sorted intersection followed by the remaining
> words from string 2 (alphabetically sorted). It then uses SequenceMatcher
> from difflib to score every pair of these three strings and picks the
> best score.
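
For reference, the comparison described in the quoted paragraph can be
sketched roughly as follows. This is only an illustration built from that
description, not fuzzywuzzy's actual code: the name token_set_score is
hypothetical, and splitting on whitespace after lowercasing is an assumed
tokenization that may differ from what the library does.

```python
from difflib import SequenceMatcher
from itertools import combinations


def token_set_score(s1, s2):
    """Return a 0-100 similarity score based on the word sets of s1 and s2.

    Hypothetical helper; tokenization (lowercase + whitespace split) is an
    assumption made for this sketch.
    """
    words1 = set(s1.lower().split())
    words2 = set(s2.lower().split())

    # Words shared by both strings, and the words unique to each one,
    # each joined in alphabetical order.
    common = " ".join(sorted(words1 & words2))
    rest1 = " ".join(sorted(words1 - words2))
    rest2 = " ".join(sorted(words2 - words1))

    # The three strings from the description: the intersection alone, and
    # the intersection followed by the leftover words of each input.
    candidates = [
        common,
        (common + " " + rest1).strip(),
        (common + " " + rest2).strip(),
    ]

    # Score every pair with SequenceMatcher and keep the best ratio.
    best = max(
        SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(candidates, 2)
    )
    return int(round(best * 100))
```

Under these assumptions, word order does not affect the score, and a note
whose words are a subset of another note's words gets the maximum score,
because the intersection string then equals one of the combined strings.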

I see. It sounds like similar._score is not a direct match for what you
want. Since fuzzywuzzy is working, and supporting word diffs would require
some non-trivial changes, I think it's better to just continue using
fuzzywuzzy for now.


More information about the Mercurial-devel mailing list