RFC: safe pattern matching for problematic encoding
FUJIWARA Katsunori
foozy at lares.dti.ne.jp
Thu May 24 11:23:02 CDT 2012
At Wed, 23 May 2012 13:56:43 -0500,
Matt Mackall wrote:
>
> On Wed, 2012-05-23 at 21:38 +0900, FUJIWARA Katsunori wrote:
> > Hi, devels.
> >
> > I'm working to achieve safe pattern matching/parsing for problematic
> > encodings (e.g.: cp932), in which strings may contain '\\' as a part
> > of multi-byte characters.
>
> Please provide an example of where we'd want this for discussion.
We need this kind of safety in the following situations:
- for file/directory patterns of "hg status", "hg log" and so on:
(path, glob or regexp)
in this case, a backslash in the pattern is consumed, because
"_globre()" in "match.py" treats it as an escape character for
the next character.
this causes unexpected matching results: "_globre()" doesn't
raise an exception, even when the specified pattern ends with the
backslash byte of an MBCS character.
- for regexp patterns of "hg grep":
in this case, a backslash in the pattern is consumed, because
"re.compile()" treats it as an escape character for the next
character.
this causes unexpected matching results (when the MBCS character is
in the middle of the pattern), or a parse error (when it is at the
tail of the pattern)
- for arguments of revset/fileset predicates:
(paths, regexps, keywords and so on)
- for strings of styles/templates:
in these cases, a backslash in the string is consumed, because
it is treated as an escape character for the next character by:
- "tokenize()" in "fileset.py"
- "tokenize()" in "revset.py"
- "tokenize()" in "templater.py"
this causes unexpected matching results (when the MBCS character is
in the middle of the argument), or a parse error (when it is at the
tail of the argument)
That said, safety for the "strings of styles/templates" situation is
less critical than for the others.
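
The "hg grep" case above can be reproduced outside Mercurial. The
following is only a minimal sketch, not Mercurial code: it assumes
Python 3 and uses bytes patterns to stand in for the byte strings
that Mercurial handles, and the sample character and variable names
are chosen purely for illustration. It shows that the cp932 encoding
of "ソ" ends with the byte 0x5c, and what "re.compile()" does with a
pattern that contains it.

```python
import re

# "ソ" (U+30BD) is two bytes in cp932, and the trail byte is 0x5c ('\').
trail = "ソ".encode("cp932")
print(trail)  # b'\x83\\'

# MBCS character at the tail of the pattern: the final 0x5c is parsed
# as an escape with nothing after it, so compilation fails.
try:
    re.compile("foo".encode("cp932") + trail)
    tail_error = None
except re.error as exc:
    tail_error = exc
print(tail_error)

# MBCS character in the middle of the pattern: 0x5c escapes the
# following '.', so the pattern no longer matches its own text.
text = "ソ.".encode("cp932")  # b'\x83\\.'
mid_match = re.compile(text).search(text)
print(mid_match)  # None

# It matches a different byte sequence instead (0x83 followed by '.').
wrong_match = re.compile(text).search(b"\x83.")
print(wrong_match is not None)  # True
```

The tail case fails loudly, but the middle case fails silently, which
is why it is the more dangerous of the two.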
----------------------------------------------------------------------
[FUJIWARA Katsunori] foozy at lares.dti.ne.jp