[PATCH 3 of 5] Add util.splitpath() and use it instead of using split() directly

Mon Jan 7 23:22:12 CST 2008

On Tue, 2008-01-08 at 12:03 +0900, Shun-ichi GOTO wrote:
> 2008/1/8, Matt Mackall <mpm at selenic.com>:
> >
> > On Sun, 2008-01-06 at 21:26 +0900, Shun-ichi Goto wrote:
> > > # HG changeset patch
> > > # User Shun-ichi GOTO <shunichi.goto at gmail.com>
> > > # Date 1199621785 -32400
> > > # Node ID e7739db328e05cf824c8d5f3bf6d694e15bb0d02
> > > # Parent  43ff7c5ed8446a721a7bad5655ddd59f8fc62e7b
> > > Add util.splitpath() and use it instead of using split() directly.
> > >
> > > This is required for workaround of 0x5c issue.
> > >
> > > +def splitpath(path):
> > > +    '''Split path by os.sep with supporting mbcs local encoding.'''
> > > +    return path.split(os.sep)
> > > +
> >
> > Confused, how does this do anything with mbcs? Also, given 2 of 2, what
> > about os.altsep?
> 
> By themselves, this function and util.endswithsep() do nothing for mbcs.
> They are intended to be wrapped by [patch 5 of 5].
> # ah, the description should say so
> 
> As for os.altsep, I could not decide whether to use it.
> I only replaced the places in the code that used os.sep and '\\'.
> 
> To take os.altsep into account,
> would it be better to write "def splitpath(path, sep=os.sep):"?
> Or to split on both os.sep and os.altsep?
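
For reference, the second option could look roughly like the sketch
below. It is only an illustration, not code from the patch series: the
altsep parameter and the normalize-then-split approach are assumptions,
and it does nothing about the mbcs problem.

```python
import os

def splitpath(path, sep=os.sep, altsep=os.altsep):
    """Split path on sep, treating altsep (if any) as an equivalent separator.

    Hypothetical variant of the patch's splitpath(); the extra
    parameters exist only so the behaviour can be shown on any platform.
    """
    if altsep:
        # Normalize the alternate separator first, then split once.
        path = path.replace(altsep, sep)
    return path.split(sep)

# Windows-style settings: both '\\' and '/' separate components.
windows_parts = splitpath("a\\b/c", sep="\\", altsep="/")

# POSIX-style settings: there is no alternate separator.
posix_parts = splitpath("a/b", sep="/", altsep=None)
```

With the Windows-style arguments both separators are honoured; with the
POSIX-style arguments the function behaves like a plain split().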

Here's a big question that we have to answer with regard to MBCS: what
happens if you check in a path containing a 0x5c byte (a backslash in
ASCII) on a shift-jis machine and I check it out in ascii-land? I
suspect the answer is: I get an extra directory level and much
confusion.
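
To make the failure concrete, here is a minimal Python sketch. It
assumes the cp932 codec (the Windows code page for Shift-JIS) as the
local encoding, and the helpers naive_split() and aware_split() are
hypothetical names used for illustration, not Mercurial functions.

```python
def naive_split(path, sep=b"\\"):
    """Split a byte path on the separator byte, ignoring the encoding."""
    return path.split(sep)

def aware_split(path, encoding, sep="\\"):
    """Decode first so multibyte characters stay whole, then split."""
    return [part.encode(encoding) for part in path.decode(encoding).split(sep)]

# U+8868 encodes to the two bytes 0x95 0x5c in cp932, so its second
# byte is indistinguishable from an ASCII backslash.
name = "\u8868.txt".encode("cp932")
path = b"dir\\" + name

naive_parts = naive_split(path)           # [b'dir', b'\x95', b'.txt']
aware_parts = aware_split(path, "cp932")  # [b'dir', b'\x95\\.txt']
```

The byte-wise split produces three components instead of two, which is
the extra directory level mentioned above.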

Unfortunately, the alternatives are also highly problematic. I don't
really want to go too far down this road until there's a coherent plan.
And that plan should handle (in some fashion):

a) latin-1 variants blindly used in ascii locales (extremely common)
b) Makefiles and such containing charset x checked out on system using
charset y and still building (works today for lots of character sets)
c) lets people with silly character sets play too.

And here's my thought: do it in an extension. Have the extension
override a bunch of the standard interfaces so that Mercurial thinks
your filesystem is actually in utf-8. That does the following:

- people with sensible charsets unaffected and (a) and (b) above just
work
- shift-jis and friends get stored in a format other people (and the
Mercurial internals!) can understand
- MBCS weirdness is well contained
- people who enable the extension will know that they're playing a
little outside of the box
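
The core of such an extension would be a pair of conversions at the
filesystem boundary. The sketch below is only an illustration under
stated assumptions: FS_ENCODING, fs_to_internal() and internal_to_fs()
are hypothetical names, cp932 stands in for the local charset, and no
existing Mercurial interface is shown.

```python
FS_ENCODING = "cp932"  # assumption: the local filesystem charset

def fs_to_internal(name, fsencoding=FS_ENCODING):
    """Convert a filename from the local charset to utf-8 bytes."""
    return name.decode(fsencoding).encode("utf-8")

def internal_to_fs(name, fsencoding=FS_ENCODING):
    """Convert a utf-8 filename back to the local charset."""
    return name.decode("utf-8").encode(fsencoding)

local = "\u8868.txt".encode(FS_ENCODING)  # contains a 0x5c trail byte
internal = fs_to_internal(local)

# In utf-8 every byte of a multibyte character is >= 0x80, so the
# stored name no longer contains a stray backslash byte.
has_backslash = b"\\" in internal
roundtrip = internal_to_fs(internal)
```

Once names are stored this way, byte-wise splitting on the separator is
safe again, and the round trip back to the local charset is lossless
for names that the local charset can represent.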

Sound feasible?

-- 
Mathematics is the supreme nostalgia of our time.