Design related issues with similarity function in releasenotes extension.

Fri Jun 16 19:29:03 EDT 2017

> On Jun 15, 2017, at 16:46, Rishabh Madan <rishabhmadan96 at gmail.com> wrote:
> 
> An important part of the release notes extension is taking the notes from incoming commit messages and either combining them with an existing releasenotes file or ignoring them. To begin with, we first look for an exact match of the incoming notes fragments in the existing file. If a match is found, we simply ignore the fragment (don't add it to the release notes). Then we compare the remaining incoming notes fragments of a particular section (sections like fix, features, perf etc.) with the notes items under the same section in the existing release notes.
> As of now, I'm using fuzzywuzzy's token set ratio method (http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) for comparison, but there might be licensing issues that we need to talk about if we plan to use it. (Suggestions for other solutions are welcome too.) In the basic implementation (link to the image) that I made, I simply threshold the match score and ignore the fragment if the score is above a certain threshold.
> 
> But the problem is that we simply can't afford to ignore some of them. For example, if the message is really short, say in the case of bug fixes, then it may cross the threshold even though it's different from what exists in the release notes. There can be other such cases too. One solution, as Greg suggested, is that we can just "union merge" both the old file and the incoming data when it can't be "merged" automatically. Then we could invoke a merge tool and ask the user to resolve conflicts. We could potentially record conflict resolutions based on the final result and store that somewhere to help guide future "merges."
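
To make the flow described above concrete, here is a rough sketch of the exact-match-then-threshold filter for a single section. It is not the extension's actual code: the function names and the threshold of 75 are made up for illustration, and difflib's SequenceMatcher from the standard library is swapped in for fuzzywuzzy's token set ratio, so the scores will differ from what fuzzywuzzy gives.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 score. Stand-in for fuzzywuzzy's token set ratio."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return int(round(100 * ratio))


def filter_section(incoming, existing, threshold=75):
    """Split one section's incoming fragments into (kept, dropped).

    The threshold value is an assumption, not taken from the extension.
    """
    kept, dropped = [], []
    for fragment in incoming:
        if fragment in existing:
            dropped.append(fragment)  # exact match: already in the notes
        elif any(similarity(fragment, old) > threshold for old in existing):
            dropped.append(fragment)  # fuzzy match above the threshold
        else:
            kept.append(fragment)
    return kept, dropped


existing = ["fix bug in log"]
incoming = ["fix bug in log", "fix bug in tag", "add new frobnicate command"]
kept, dropped = filter_section(incoming, existing)
```

With these inputs, "fix bug in tag" ends up in dropped even though it describes a different fix, which is exactly the short-message problem raised above.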

I know it's one-off for bug fixes, but maybe we could look for issueNNNN and use that as a stronger signal than fuzzywuzzy can provide?
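
Something along these lines is what I have in mind, as a sketch only. The helper names are hypothetical, and the pattern assumes issue references are written as the word "issue" immediately followed by digits:

```python
import re

# Assumed convention: "issue" immediately followed by digits, e.g. issue1234.
ISSUE_RE = re.compile(r"\bissue(\d+)\b", re.IGNORECASE)


def issue_ids(text):
    """Return the set of issue numbers (as strings) mentioned in text."""
    return set(ISSUE_RE.findall(text))


def same_issue(a, b):
    """Compare two fragments by the issues they cite.

    Returns True if they share an issue, False if both cite issues but
    none in common, and None if either cites no issue at all (in which
    case the caller would fall back to the fuzzy score).
    """
    ids_a, ids_b = issue_ids(a), issue_ids(b)
    if not ids_a or not ids_b:
        return None
    return bool(ids_a & ids_b)
```

A True or False result would override the fuzzy score; None means the issue numbers say nothing either way.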

Maybe have a minimum length in words before we'll deduplicate? I like that idea less well, but maybe it's enough...
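
As a sketch, the check could be as small as this. The function name and the cutoff of five words are placeholders, not a recommendation:

```python
def long_enough_to_dedupe(fragment, min_words=5):
    """Return True if the fragment has at least min_words words.

    Shorter fragments would skip fuzzy deduplication and always be added.
    The default of 5 is an arbitrary placeholder.
    """
    return len(fragment.split()) >= min_words
```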

> I would like to discuss these problems and it would be great if someone can suggest a better solution.