Dealing with binary files (was Re: [PATCH]Make hg diff go nice on binary files)

Matt Mackall mpm at selenic.com
Fri Aug 26 15:26:43 CDT 2005


On Wed, Jul 27, 2005 at 11:54:02PM -0400, Theodore Ts'o wrote:
> On Wed, Jul 27, 2005 at 05:29:02PM -0700, Matt Mackall wrote:
> > On Wed, Jul 27, 2005 at 07:52:32PM -0400, Kevin Smith wrote:
> > > Matt Mackall wrote:
> > > >Looked at another way, there are exactly three things we'll use a
> > > >binary flag for:
> > > >
> > > >- deciding whether we can diff/export/annotate
> > > >- deciding whether to merge
> > > >- deciding how to display something in hgweb
> > > 
> > > Probably true. However, a fourth item is at least related: newline 
> > > mangling. Either we say that text == mangled, OR we need another very 
> > > similar flag to allow users to indicate which files should be mangled 
> > > and which should not.
> > 
> > I'm pretty strongly against building any file mangling into Mercurial,
> > either for locale conversion or newline conversion.
> > 
> > I'm willing to provide a hook for commit-time and checkout-time
> > filtering, but then the user is responsible for all the pain thus
> > incurred.
> 
> Even if we do this in a separate wrapper program, that program needs
> some way to know whether or not to do newline mangling --- there may
> be files where even though it is a text file, the user may not want to
> do eoln mangling on that particular text file (for one reason or
> another).  
> 
> Also, as we discussed at the Kernel Summit BOF session, there may be
> other more general uses where it may be useful to be able to specify a
> specialized pipeline of canonicalization and decanonicalization filters
> on a per-file basis.  For example, if a particular file is an
> openoffice file or some other file which is a compressed XML stream,
> decompressing the XML stream before checking in the file will allow
> for more efficient diffs to be stored in the SCM.  Then when the file
> is checked out into the working directory, the XML stream can be
> recompressed so that openoffice can work with the file.
> 
> Another use that would be nice to implement as a filter would be a way
> of expanding RCS and/or SCCS keywords when the file is checked out,
> but to make them go away before doing the checkin so as not to
> contaminate the diffs.  BitKeeper had this functionality, and I miss
> it.  But I can understand not wanting to contaminate the core
> mercurial functionality with this feature, so doing it as some kind of
> plugin or hook makes perfect sense.
> 
> So it would be nice if there was a way of associating a pipeline of
> filters on a per-file basis.  From a user convenience point of view it
> would be really cool if it could be stored in the file's metadata, and
> could be changed as part of a changeset operation.  (Perhaps "hg
> admin" to set and get the pipeline of filters for each file.)
> 
> If this is too much complexity, a slightly more kludgy, but simpler to
> implement method would be to create a file which associates regular
> expressions with a set of filter pipelines.  
> 
> In any case I think that implementing the capability of associating a
> pipeline of filters on a per-file basis is a much cooler way of
> implementing functionality which in BitKeeper is implemented as a
> series of special-case hacks (for eoln termination, RCS keyword
> expansion, SCCS keyword expansion, etc.)  The fact that it also allows
> us to do something BitKeeper can't --- a more efficient way of dealing
> with compressed XML files from OpenOffice --- also makes this approach
> very appealing, and IMHO suggests that it may be the Right Way to add
> a lot of functionality while keeping the core SCM as simple as
> possible.

Update on this:

I've set things up so that all working directory file I/O now goes
through localrepository.wread and .wwrite. So all that remains is to
add the filtering hooks to these functions.
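
To make the idea concrete, here is a rough sketch of what filtering at
those two points could look like. It is not the actual Mercurial code:
the FilteredWorkingDir class, the rule-list format, and the filter
functions are all made up for illustration, and only the method names
wread and wwrite come from the message above. It assumes each rule
pairs a regular expression with an encode pipeline (applied when
reading from the working directory) and a decode pipeline (applied
when writing to it), with the first matching rule winning. zlib stands
in for "some compressed stream"; a real OpenOffice file would need its
own filter.

```python
import os
import re
import zlib


def crlf_to_lf(data):
    # Canonicalize: store LF line endings only.
    return data.replace(b"\r\n", b"\n")


def lf_to_crlf(data):
    # Decanonicalize: assumes the stored form contains LF endings only.
    return data.replace(b"\n", b"\r\n")


class FilteredWorkingDir:
    """Hypothetical stand-in for the working-directory I/O layer."""

    def __init__(self, root, rules):
        # rules: list of (regex, encode_pipeline, decode_pipeline).
        self.root = root
        self.rules = [(re.compile(pat), enc, dec) for pat, enc, dec in rules]

    def _pipelines(self, filename):
        for pattern, enc, dec in self.rules:
            if pattern.search(filename):
                return enc, dec
        return [], []  # no rule matched: bytes pass through untouched

    def wread(self, filename):
        """Read a working-directory file and return its canonical form."""
        with open(os.path.join(self.root, filename), "rb") as fp:
            data = fp.read()
        for filt in self._pipelines(filename)[0]:
            data = filt(data)
        return data

    def wwrite(self, filename, data):
        """Write canonical data out in working-directory form."""
        for filt in self._pipelines(filename)[1]:
            data = filt(data)
        with open(os.path.join(self.root, filename), "wb") as fp:
            fp.write(data)


# Example rule table (patterns and file extensions are illustrative).
RULES = [
    (r"\.txt$", [crlf_to_lf], [lf_to_crlf]),
    (r"\.xmlz$", [zlib.decompress], [zlib.compress]),
]
```

With a table like this, the user picks the filters per file pattern and
the core only has to run whatever pipeline matches; files that match
no rule are read and written byte-for-byte.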

-- 
Mathematics is the supreme nostalgia of our time.

