RFC: safe pattern matching for problematic encoding

Thu May 24 12:12:23 CDT 2012

On Fri, 2012-05-25 at 01:23 +0900, FUJIWARA Katsunori wrote:
> At Wed, 23 May 2012 13:56:43 -0500,
> Matt Mackall wrote:
> > 
> > On Wed, 2012-05-23 at 21:38 +0900, FUJIWARA Katsunori wrote:
> > > Hi, devels.
> > > 
> > > I'm working to achieve safe pattern matching/parsing for problematic
> > > encodings (e.g.: cp932), in which strings may contain '\\' as a part
> > > of multi-byte characters.
> > 
> > Please provide an example of where we'd want this for discussion.
> 
> We need this safety in the situations below:
> 
>   - for file/directory patterns of "hg status", "hg log" and so on:
>     (path, globbing or regex)
> 
>       in this case, backslashes in patterns are skipped, because
>       "_globre()" in "match.py" treats each one as an escape
>       character for the character that follows it.
> 
>       this causes unexpected matching results: "_globre()" doesn't
>       raise an exception, even when the specified pattern ends with
>       the backslash byte of an MBCS character.

This is not an example yet. With bytes, please.
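
A byte-level case of the sort being asked for might look like the
sketch below. It is an illustration under assumptions, not a
reproduction of match.py: naive_unescape() is a hypothetical helper
that only mimics a byte-wise scan treating 0x5c as an escape for the
next byte, and the file name is made up.

```python
# U+8868 encodes in cp932 as the two bytes 0x95 0x5c; the second byte
# has the same value as an ASCII backslash.
KANJI = "\u8868"
raw = KANJI.encode("cp932")


def naive_unescape(pattern: bytes) -> bytes:
    """Hypothetical byte-wise scanner (NOT Mercurial's _globre()).

    Every 0x5c byte is taken as an escape: the backslash is dropped and
    the following byte is kept literally. A trailing 0x5c is kept as-is
    and no error is raised.
    """
    out = bytearray()
    i = 0
    while i < len(pattern):
        byte = pattern[i]
        if byte == 0x5C and i + 1 < len(pattern):
            i += 1                   # skip the "escape" byte
            out.append(pattern[i])   # keep the byte it "escaped"
        else:
            out.append(byte)
        i += 1
    return bytes(out)


# Hypothetical file name: the kanji followed by ".txt".
pattern = (KANJI + ".txt").encode("cp932")
mangled = naive_unescape(pattern)

print(raw.hex())       # bytes of the kanji alone
print(pattern.hex())   # bytes the user typed
print(mangled.hex())   # bytes left after the byte-wise escape handling
```

In this sketch the trail byte of the kanji is consumed as an escape for
the ".", so the scanned pattern no longer contains the bytes the user
typed; when the kanji is the last character, the 0x5c is left dangling
without any error.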

Beyond Ruby's 's' switch, there seems to be very little precedent for
how to deal with ShiftJIS (where it's unclear whether a '\' character
exists at all).

When we implement UTF-8 mode, this will all be irrelevant: we'll only be
able to use UTF-8 encoded ignore files and we'll only accept UTF-8
command arguments.

-- 
Mathematics is the supreme nostalgia of our time.