NewIdeas/FileTypes


Note: This page is primarily intended for developers of Mercurial.

(Please see the current BinaryFiles page for why tracking 'file types' is always a bad idea -mpm)

As mpm has pointed out, the notion of BinaryFiles is problematic. Many tools use the same "search for NUL" heuristic as Mercurial, but it misclassifies some files, and it does not really help us select the right merge tool.
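To make the limitation concrete, here is a minimal Python sketch of a "search for NUL" test. The function name looks_binary is an assumption for illustration and is not claimed to be Mercurial's actual code.

```python
def looks_binary(data: bytes) -> bool:
    """Hypothetical NUL heuristic: any NUL byte means 'binary'."""
    return b"\0" in data


# Human-written text stored as UTF-16 contains NUL bytes,
# so the heuristic classifies it as binary.
utf16_text = "hello\n".encode("utf-16-le")

# Content with no NUL bytes is classified as text, even when its
# bytes must not be altered (here, just an ASCII file header).
ascii_only_header = b"%PDF-1.4\n"

print(looks_binary(utf16_text))         # True
print(looks_binary(ascii_only_header))  # False
```

In both cases the answer says nothing about which merge tool is appropriate.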

As pointers, here are two earlier discussions related to this topic, both concerning handling of character sets:

Some definitions:

A binary file is simply a file whose content bytes must not be altered when it is encoded/decoded to the workspace.
A text file is one that should be subject to newline conversion when encoded/decoded to the workspace. It is also one for which, in the absence of further specification, a line-based diff/merge tool should be used.
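The two definitions can be sketched as a pair of filters. This is only an illustration: the function names, the type_name argument, and the choice of CRLF as the workspace line ending are assumptions, not part of any existing tool.

```python
def encode_for_repo(data: bytes, type_name: str) -> bytes:
    """Workspace -> repository (hypothetical filter)."""
    if type_name != "text":
        return data  # binary: bytes must not be altered
    return data.replace(b"\r\n", b"\n")  # text: normalize to LF


def decode_for_workspace(data: bytes, type_name: str,
                         eol: bytes = b"\r\n") -> bytes:
    """Repository -> workspace (hypothetical filter)."""
    if type_name != "text":
        return data  # binary: bytes must not be altered
    # Normalize first so existing CRLF pairs are not doubled.
    return data.replace(b"\r\n", b"\n").replace(b"\n", eol)
```

A text file round-trips through both filters with only its line endings changed, while binary content passes through untouched.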

The current .hgrc mechanism provides a means to assign an encode/decode tool on the basis of file name matching. This is a fine way to specify default behavior for a large number of files, but we found in OpenCM that there were always exceptions. A particularly unpleasant exceptional case is that XML content written by programs is binary content while XML content written by humans is text content.

Our plan in OpenCM was to record an (optionally specified) notion of type for each file in the manifest. A type is simply a unique name -- it has no intrinsic semantics. Other tools, including the encode/decode tools and the merge tools, can find out what the type name is and use that information to decide how to process the file. In the absence of an explicitly specified type, heuristics similar to the ones currently used in .hgrc can be applied, with the 'check for NUL' test providing an ultimate fallback to binary.

In the context of Mercurial, this would imply some minor changes. I think they can be made in a backwards-compatible way:

A mechanism needs to be added by which a user can specify the type of a given file explicitly. This should be optional.
The glob matching in .hgrc should be used to map to a type name, not to an encode/decode handler.
There should be a [types] section that specifies, for each type, what tool(s) should be used to encode, decode, and merge that type. The best way to achieve this given the current Mercurial design is probably to modify the invocation of hgmerge to provide the type names in addition to the file names that are to be merged. Interpretation of these type names is left to the hgmerge tool. However, the encode/decode logic can sensibly be determined from the type name much as it is derived from the glob patterns now.
The type names 'binary' and 'text' should be reserved as common fallback cases.
hgmerge should exhibit well-specified failure behavior when asked to merge two files of incompatible types.
The default hgmerge should issue some sensible complaint when asked to handle type(s) that it does not recognize.
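As a rough sketch of how a [types] section and the two failure cases might fit together, consider the following Python. Everything here is hypothetical: the key layout (type.action = tool), the tool names, the MergeTypeError class, and the rule that two different type names count as incompatible are all assumptions made for illustration.

```python
import configparser

# Hypothetical hgrc fragment; section layout and tool names are assumptions.
HGRC_SKETCH = """
[types]
text.encode = to-lf
text.decode = to-native-eol
text.merge = line-merge
binary.merge = pick-one-side
"""


class MergeTypeError(Exception):
    """Raised when a merge cannot proceed because of the file types."""


def load_type_tools(text: str) -> dict:
    """Parse the [types] section into {type_name: {action: tool}}."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    tools = {}
    for key, tool in parser["types"].items():
        type_name, _, action = key.rpartition(".")
        tools.setdefault(type_name, {})[action] = tool
    return tools


def select_merge_tool(tools: dict, local_type: str, other_type: str) -> str:
    """Pick a merge tool, failing loudly on mismatched or unknown types."""
    if local_type != other_type:
        raise MergeTypeError(
            f"cannot merge incompatible types: {local_type!r} vs {other_type!r}")
    tool = tools.get(local_type, {}).get("merge")
    if tool is None:
        raise MergeTypeError(f"no merge tool for type {local_type!r}")
    return tool


tools = load_type_tools(HGRC_SKETCH)
print(select_merge_tool(tools, "text", "text"))  # line-merge
```

A real hgmerge could apply a looser compatibility rule than simple name equality; the point is only that both failure modes are explicit rather than silent.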

So here is a specific proposal to serve as a starting point for discussion:

There should be a means to specify a per-file type name.
In the absence of a specified type, heuristics in .hgrc are applied to determine the type.
If that does not resolve the type, the current NUL check is used, resulting in one of the types binary or text.
Type names should be passed to hgmerge.
Selection of encode/decode strategy should be based on the type name, not the glob match.
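The resolution order in this proposal can be sketched in a few lines of Python. The function resolve_type, the explicit_types mapping (standing in for per-file types recorded in the manifest), and the patterns list (standing in for glob rules from .hgrc) are hypothetical names chosen for this example.

```python
from fnmatch import fnmatchcase


def resolve_type(path, content, explicit_types, patterns):
    """Resolve a file's type name: explicit, then glob rules, then NUL check."""
    # 1. An explicitly specified per-file type always wins.
    if path in explicit_types:
        return explicit_types[path]
    # 2. Otherwise the first matching glob pattern decides.
    for glob, type_name in patterns:
        if fnmatchcase(path, glob):
            return type_name
    # 3. Ultimate fallback: the NUL check, yielding 'binary' or 'text'.
    return "binary" if b"\0" in content else "text"


# Program-generated XML is marked binary explicitly; other XML is text.
explicit_types = {"data/out.xml": "binary"}
patterns = [("*.xml", "text"), ("*.png", "binary")]

print(resolve_type("data/out.xml", b"<a/>", explicit_types, patterns))    # binary
print(resolve_type("doc/manual.xml", b"<a/>", explicit_types, patterns))  # text
print(resolve_type("README", b"hi\n", explicit_types, patterns))          # text
```

The resulting type name, rather than the glob that produced it, is what would be handed to hgmerge and used to choose the encode/decode strategy.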