[PATCH RFC] releasenotes: add similarity check function to compare incoming notes

Rishabh Madan rishabhmadan96 at gmail.com
Sat Jul 1 09:16:12 UTC 2017


On Fri, Jun 30, 2017 at 6:49 AM, Jun Wu <quark at fb.com> wrote:

> Excerpts from 's message of 2017-06-27 17:37:08 +0200:
> > Sure, I'll do that. As for the external dependency issue, fuzzywuzzy
> > is actually implemented using difflib and Python Levenshtein distance,
> > so if we want we can import that part of the code or probably even
> > rewrite it. I discussed this with Kevin and he told me that we will
> > have to discuss the licensing with them if we plan to do anything
> > similar to this.
>
> By the way, did you notice that there is mercurial/similar.py? That file
> has a "_score" which is simple but somehow effective. It is currently
> based on line diffs. If we can do word diffs, do you think similar._score
> could somehow be reused to satisfy your need?
>

I took a look at the _score function. If I understand correctly, by word
diffs you mean first tokenizing the strings and then matching words,
right? If we plan to do something like this, we can go one step further
and simply duplicate the fuzz function that we use from fuzzywuzzy. That
function tokenizes both strings and builds three new strings:

  1. the words common to both strings, sorted alphabetically;
  2. the sorted intersection followed by the remaining words of string 1,
     also sorted alphabetically;
  3. the sorted intersection followed by the remaining words of string 2,
     also sorted alphabetically.

It then uses SequenceMatcher from difflib to score every possible pair of
these three strings and picks the best score.
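
For illustration, here is a minimal sketch of that approach using only
difflib. It follows the description above rather than fuzzywuzzy's actual
code, so treat it as an approximation: the names token_set_score and
_ratio are made up for this example, and the lowercasing and whitespace
tokenization are assumptions on my part.

```python
from difflib import SequenceMatcher


def _ratio(a, b):
    """Return the difflib similarity of two strings, a float in [0.0, 1.0]."""
    return SequenceMatcher(None, a, b).ratio()


def token_set_score(s1, s2):
    """Hypothetical helper: best pairwise ratio over three derived strings.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    tokens1 = set(s1.lower().split())
    tokens2 = set(s2.lower().split())

    # String 1: words common to both inputs, sorted alphabetically.
    common = " ".join(sorted(tokens1 & tokens2))
    # Words that appear in only one of the inputs, sorted alphabetically.
    rest1 = " ".join(sorted(tokens1 - tokens2))
    rest2 = " ".join(sorted(tokens2 - tokens1))

    # Strings 2 and 3: sorted intersection plus the rest of each input.
    combined1 = (common + " " + rest1).strip()
    combined2 = (common + " " + rest2).strip()

    # Score every pair of the three strings and keep the best one.
    return max(
        _ratio(common, combined1),
        _ratio(common, combined2),
        _ratio(combined1, combined2),
    )
```

One consequence of comparing against the bare intersection is that a note
whose words are a subset of another note's words gets the maximum score,
and word order does not matter. Whether that is strict enough for
detecting duplicate release notes is something we would need to decide.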