RFC: safe pattern matching for problematic encoding

Martin Geisler martin at geisler.net
Sat May 26 09:51:30 CDT 2012


Matt Mackall <mpm at selenic.com> writes:

> On Fri, 2012-05-25 at 10:53 +0200, Martin Geisler wrote:
>> Matt Mackall <mpm at selenic.com> writes:
>> 
>> > http://mercurial.selenic.com/wiki/EncodingStrategy#The_.22makefile.22_problem
>> 
>> I honestly don't buy this argument and neither does anybody else that
>> I've discussed it with here. Before you start yelling: this is not
>> trolling, I'm seriously trying to figure out why this is important to
>> you
>
> Trolling is when you knowingly do something that's annoying and is
> likely to provoke a frustrated response. You should well know that I
> am sick to death of this topic, ergo you are a troll. Being a sincere
> troll doesn't make me love you more.

I'm sorry, but I don't want your sensitivities to dictate what I can and
cannot write about - I simply don't think you have the right to pick and
choose the topics we talk about.

>>  and if it even works the way you think it works.
>> 
>> The basic problem with the argument is that Makefiles that reference
>> non-ASCII files aren't portable to start with!
>
> You've mistaken the example for the principle.

Of course I haven't done that. You keep putting the "Makefile problem"
forward as a key issue, so it seemed like a good place to start.

> It's not about makefiles, per se, it's about the existence of large
> ecosystems of tools that are intentionally encoding agnostic. It
> affects everything from compilers to web servers to typesetters.

I was actually close to doing a similar test with LaTeX since I'm also
quite familiar with that system. But I decided to skip this since
non-ASCII filenames (and filenames with tilde, space, and other simple
characters) are known to be tricky to use on even a single system. As an
example, I made a file called "café.tex" which uses \input{flûte} to
include another file. The file content, the filenames, and the locale
were all UTF-8 and this is what I get:

  ! LaTeX Error: File `flûte.tex' not found.

Interestingly, pdfLaTeX (or probably the inputenc package) decodes the
UTF-8 encoded filenames and then looks for a Latin-1(!) encoded filename
on disk. So things work fine when I use Latin-1 encoded filenames,
regardless of the encoding of the file content or the locale settings.
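
To make the mismatch concrete, here is a minimal Python sketch of the
byte sequences involved. It only illustrates the observed behaviour and
is not a description of what pdfLaTeX or inputenc do internally; the
helper name lookup_matches is hypothetical.

```python
# Illustrative sketch only: compares the two byte encodings of the
# filename from the example above. It does not model pdfLaTeX itself.

name = "flûte.tex"

# Bytes of the filename as created on disk under a UTF-8 locale.
utf8_bytes = name.encode("utf-8")

# Bytes that the lookup appears to use (Latin-1), per the observation above.
latin1_bytes = name.encode("latin-1")


def lookup_matches(on_disk: bytes, requested: bytes) -> bool:
    """Hypothetical helper: a byte-oriented lookup needs an exact match."""
    return on_disk == requested


print(utf8_bytes)    # b'fl\xc3\xbbte.tex'
print(latin1_bytes)  # b'fl\xfbte.tex'
print(lookup_matches(utf8_bytes, latin1_bytes))    # False: file "not found"
print(lookup_matches(latin1_bytes, latin1_bytes))  # True: Latin-1 name works
```

The single character "û" is two bytes in UTF-8 and one byte in Latin-1,
so a lookup that compares raw bytes cannot match the two names.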

All this is just to say that yes, systems like LaTeX are very tricky to
use with filenames outside of a-zA-Z0-9. This is not Mercurial's fault,
it just shows that anybody who uses LaTeX for anything important won't be
using such "exotic" characters in their filenames.

>> I've demonstrated this before and I just demonstrated it again with
>> Windows 7 in a VM. Please try cloning
>>
>>   https://bitbucket.org/mg/makefile-problem
>
> Now try it with GNU Make from msys. I just did. Works a treat on both
> changesets. Also works on Linux and Mac. As it obviously will _with
> any tool that hasn't drunk the UTF-16 Kool Aid_.

I just used the first hit for "gnu make windows":

  http://gnuwin32.sourceforge.net/packages/make.htm

That is a GNU Make 3.81 for Windows. Maybe that port is too native for
your taste, but I think others will find and download it too. There are
of course many other Make variants in the wild and asking people to use
not just a GNU Make, but a *particular* variant of GNU Make sounds quite
non-portable to begin with.

> And thus you've proved my point.
>
> a) important toolchains exist that work JUST FINE across platforms
> with the existing encoding strategy

I don't believe this at all -- I don't know of an important project that
uses non-ASCII filenames and Makefiles and expects this to work across
platforms. It's a lot of and's... Do you know many such projects?

> b) changing that strategy will cause a REGRESSION and is therefore off
> the table

It is only a regression if people are actually relying on this.

> c) having standard tools like GNU make work trumps human legibility:
> software that doesn't compile but that you can still read is not
> software, it's merely literature

That's very poetic, but it doesn't hold up in reality. Programmers know
from decades of experience that naming files with anything outside of
ASCII is asking for trouble -- heck, many programmers don't like to use
spaces in filenames since it complicates things.

I claim that the vast majority of people who use non-ASCII characters in
their filenames are people who've never heard of Make.

It's a claim that I base on what I've seen around. When I migrated a
team at the University of Zürich, they immediately ran into this problem
with a spreadsheet that had an umlaut in the name. They couldn't work
with a broken filename, so they immediately renamed the file.

Have you ever seen people use Make to compile something with non-ASCII
characters in the filenames on a platform where those files didn't
appear as they should? Did they go on like that for long?

This whole discussion would make much more sense to me if you said
something along the lines of

  Yes, I worked for a company that had developers in both California and
  Mexico -- they named their files with non-ASCII characters even though
  they looked weird on Linux/Windows.

That kind of experience would make it easier for me to understand why
you believe there are people out there who rely on this "feature".

>> Matt, do you acknowledge that we break tools in other ecosystems by
>> not transcoding filenames?
>
> Yes. And it simply doesn't matter. I'm not going to trade "breaks ANSI
> C/Unix world" for "fixes Java world". And not because I designed
> Mercurial for the former, nor because it's obviously the closest thing
> to a sane, coherent strategy, but because it would break things that
> are working today. There is obviously no solution that will make
> everyone happy; betraying existing users to please potential users is
> a strategy for eventually having no users.
>
> Now you've wasted another hour+ of my time and predictably gotten
> nowhere, because there's exactly zero new information in either your
> message or my response that hasn't been repeatedly presented over the
> past seven years.

I'm simply attacking the basic premise that lies behind the existing
strategy. I tried to do this with some concrete experiments.

> That's an hour I could have used doing something useful, or at least
> spent without elevated blood pressure. In other words, you have
> successfully trolled me. Pleased with yourself, troll?

There's no need to keep mentioning your blood pressure. My mail was not
an attempt to cause stress or pain -- it was an attempt to have a serious
discussion about whether your stance about the "Makefile problem" makes
sense or not.

> If you want to do something useful, go work on the VFS layer.

I'm sorry, but I can't find any time or motivation to work on core
things at the moment.

-- 
Martin Geisler

aragost Trifork
Commercial Mercurial support
http://aragost.com/mercurial/