[PATCH 0 of 1] Interest for convert option --contentsfilter?

Mon Apr 18 06:06:26 CDT 2011

On Apr 18, 2011, at 12:00 PM, Martin Geisler wrote:

> Jason Harris <jason.f.harris at gmail.com> writes:
> 
>> Hi All,
>> 
>> I was wondering if there was interest in an option --contentsfilter
>> for convert.
> 
> I think there is a need for such a filter -- I believe I've seen Patrick
> explain to users how to hack this themselves in the past.
> 
>> Here is a description:
>> 
>>   The contentsfilter can potentially transform the contents of every
>>   file at every revision as the conversion happens. If the
>>   --contentsfilter scriptname option is specified, then scriptname
>>   should be the full path name of a script which takes three
>>   arguments - (i) original file name, (ii) the original changeset
>>   hash, and (iii) the original contents of the file.
> 
> You cannot really pass the original content as an argument on the
> command line... there are restrictions on command line lengths in some
> operating systems and command line arguments are separated by NUL bytes
> when passed to the programs (so you cannot pass "binary" files this
> way). The normal way would be to let the script read the data on stdin.

Yep. My patch was more a proof of concept than final code...
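
As a rough sketch of what the stdin/stdout variant might look like (Python 3; passing the file name and changeset hash as the two command line arguments is an assumption here, not something the patch does yet, and the function names are only illustrative):

```python
#!/usr/bin/env python3
import re
import sys

# Hypothetical calling convention: filter.py <filename> <changeset-hash>,
# with the file contents arriving on stdin and the result going to stdout.
SOURCE_FILE = re.compile(r".*\.(h|m|hpp|c|cpp|txt)$")


def expand_tabs(filename, contents):
    """Return (new_contents, modified) for one file's raw bytes."""
    if not SOURCE_FILE.match(filename):
        return contents, False
    new_contents = contents.replace(b"\t", b"    ")
    return new_contents, new_contents != contents


def main(argv, stdin, stdout):
    filename, _changeset = argv[1], argv[2]
    # Reading bytes keeps binary files and NUL bytes intact.
    new_contents, modified = expand_tabs(filename, stdin.read())
    stdout.write(new_contents)
    # Exit code convention from the description: 0 = modified, 1 = unchanged.
    return 0 if modified else 1


if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(sys.argv, sys.stdin.buffer, sys.stdout.buffer))
```

Working on bytes rather than text sidesteps both the command line length limit and the NUL byte problem.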

> I think --filter would be a shorter and better name for this option?

It's maybe too general, but I think it's better...

>>   The script should transform the passed in file contents into the
>>   desired final output form and write this to stdout. The result
>>   code returned should be 0 if the contents were modified and 1 if
>>   the contents did not need modification.
>> 
>> So this option could be used to for example fix all the newlines for
>> every file at every revision in a repository.
>> 
>> I needed this option this weekend, since I needed to transform all of
>> the tab characters in all the source files of a repo into 4 spaces.
>> (Don't ask why...) But anyway with the following script:
>> 
>> --------------------
>> #!/usr/bin/python
>> import sys, os, re
>> 
>> originalFilename = sys.argv[1]
>> originalHash = sys.argv[2]
>> originalContents = sys.argv[3]
>> 
>> # In this example we only transform files with endings .h, .m, .hpp, .c, .cpp, .txt
>> fileNameMatcher = re.compile(r".*\.(h|m|hpp|c|cpp|txt)$")
>> if (not fileNameMatcher.match(originalFilename)):
>>   sys.exit(1)
> 
> The filtering on filenames makes me think of our encode/decode
> filters... could we instead use that API and infrastructure? So instead
> of a new option to convert we let convert read/write data through the
> normal encode/decode filters. That already allows for calling both shell
> scripts and Python functions.

Maybe, but the point is that it's just one type of filtering.

I might have another filter, which is:

newContents = re.sub(r"(\W)BrowserView(\W)", r"\1FileView\2", originalContents)
newContents = re.sub(r"(\W)inspectorBWSplitView(\W)", r"\1inspectorSplitView\2", newContents)
newContents = re.sub(r"(\W)reasonForInvalIdityOfSelectedEntries(\W)", r"\1reasonForInvalidityOfSelectedEntries\2", newContents)

or something like that, i.e. a series of global replacements on the contents to
correct annoying mistakes.
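
A possible way to organize such a series of replacements, as a sketch only: the identifier names come from the example above, but the table, the function name and the use of `\b` word boundaries in place of the `(\W)...(\W)` groups are my own choices. `\b` also matches at the very start or end of the contents, where `(\W)` would not.

```python
import re

# Wrong identifier -> corrected identifier (taken from the example above).
RENAMES = {
    "BrowserView": "FileView",
    "inspectorBWSplitView": "inspectorSplitView",
    "reasonForInvalIdityOfSelectedEntries": "reasonForInvalidityOfSelectedEntries",
}

# A single alternation so that every file is scanned only once.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in RENAMES) + r")\b"
)


def rename_identifiers(contents):
    """Replace each whole-word occurrence of a key in RENAMES."""
    return PATTERN.sub(lambda match: RENAMES[match.group(1)], contents)
```

Adding another correction is then just one more entry in the table.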

Or I could restrict such changes to just certain functions, since the hash is
passed in. In fact, although it's not passed in now, one could imagine also
passing in the rev number and only doing a transformation for a certain range
of changesets.
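
Roughly, the gating could look like the sketch below. Everything in it is hypothetical: the hash prefix is a placeholder, the revision range is made up, and a rev argument is not passed to the script by the current patch.

```python
# Placeholder values for illustration only.
TARGET_HASH_PREFIXES = ("0123456789ab",)
TARGET_REVS = range(100, 200)  # revisions 100 through 199


def in_scope(changeset_hash, rev=None):
    """Decide whether this changeset should be transformed at all."""
    if changeset_hash.startswith(TARGET_HASH_PREFIXES):
        return True
    return rev is not None and rev in TARGET_REVS


def filter_contents(contents, transform, changeset_hash, rev=None):
    """Apply transform only to changesets that are in scope."""
    if not in_scope(changeset_hash, rev):
        return contents
    return transform(contents)
```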

Thus I don't see this as exactly the same as a normal encode / decode filter.
I guess, though, that one common use for it will be to fix EOLs everywhere and
consistently throughout the repo.

> For newline conversion people could use the functions from the eol
> extension directly.
> 
> Since people may have such encode/decode rules already we should trigger
> this with a command line flag. I suggest a new option called --encode,
> matching the --decode option we already have for 'hg cat'.

>> tabre = re.compile(r"\t")
>> substitutePattern = r"    "
>> newContents = tabre.sub(substitutePattern, originalContents)
> 
> (Don't use regular expressions for this trivial replacement, use
> originalContents.replace("\t", "    ") instead -- much simpler and faster.)

Yep. I mainly wanted to show template code, but you are right that this is of
course a perfectly valid optimization...

Thus, would this be accepted / included with the following changes?

 contentsfilter -> filter
 using stdin and stdout instead of passing the data via the command line.

I didn't mention it, but some people might already know that git has such an
option, git-filter-branch, and this would allow Mercurial to do the same thing...

Cheers,
 Jas