[PATCH 2 of 2 V4] hgweb: teach archive how to handle file patterns

Angel Ezquerra angel.ezquerra at gmail.com
Thu Feb 14 02:22:02 CST 2013

On Thu, Feb 14, 2013 at 2:30 AM, Mads Kiilerich <mads at kiilerich.com> wrote:
> Angel Ezquerra wrote, On 02/10/2013 11:56 AM:
>> # HG changeset patch
>> # User Angel Ezquerra <angel.ezquerra at gmail.com>
>> # Date 1360493525 -3600
>> # Node ID fb655ad16f6675265da9d472ded7140a223fb283
>> # Parent  be3e96a41d0f4b7a1f1dd443f5261d6eeb66626a
>> hgweb: teach archive how to handle file patterns
>> The archive web command now takes into account the "file" request entry,
>> if one
>> is provided.
>> The provided "file" is processed as a "path" pattern by default, which
>> makes it
>> easy to only archive a certain file or directory. However, it is possible
>> to
>> specify a different type of pattern, such as relglob by specifying it
>> explicitly on the query URL. Note that only "safe" patterns are allowed.
>> Safe
>> patterns are 'path', 'relpath', 'glog' and 'relglob'. Other pattern types
>> are
>> not allowed because they could be expensive to calculate.
>> With this change hgweb can to process requests such as:
>> 1. http://mercurial.selenic.com/hg/archive/tip.zip/mercurial/templates
>>      This will download all files on the mercurial/templates directory as
>> a zip
>>      file
>> 2. http://mercurial.selenic.com/hg/archive/tip.tar.gz/relglob:*.py
>>      This will download all *.py files in the repository into a tar.gz
>> file.
>> An so forth.
>> Note that this is a first step to add support for downloading directories
>> from
>> the web interface. Currently the only way to use this feature is by
>> manually
>> constructing the URL that you want to download. We will have to modify the
>> archiveentry map entry on the different templates so that it adds the
>> current
>> folder path to the archive links.
>> This revision also adds a two tests for this feature to test-archive.t.
>> The
>> first tests the selective archive feature and the second tests that the
>> server
>> rejects "unsafe" patterns.
>> diff --git a/mercurial/hgweb/webcommands.py
>> b/mercurial/hgweb/webcommands.py
>> --- a/mercurial/hgweb/webcommands.py
>> +++ b/mercurial/hgweb/webcommands.py
>> @@ -803,6 +803,17 @@
>>       if cnode == key or key == 'tip':
>>           arch_version = short(cnode)
>>       name = "%s-%s" % (reponame, arch_version)
>> +
>> +    ctx = webutil.changectx(web.repo, req)
>> +    pats = []
>> +    file = req.form.get('file', None)
>> +    defaultpat = 'path'
>> +    if file:
>> +        pats = [req.form['file'][0]]
>> +        if not scmutil.patsaresafe(pats, defaultpat):
>> +            msg = 'Archive pattern not allowed: %s' % pats[0]
>> +            raise ErrorResponse(HTTP_FORBIDDEN, msg)
>> +
>>       mimetype, artype, extension, encoding = web.archive_specs[type_]
>>       headers = [
>>           ('Content-Disposition', 'attachment; filename=%s%s' % (name,
>> extension))
>> @@ -812,9 +823,9 @@
>>       req.headers.extend(headers)
>>       req.respond(HTTP_OK, mimetype)
>>   -    ctx = webutil.changectx(web.repo, req)
>> +    matchfn = scmutil.match(ctx, pats, default=defaultpat)
>>       archival.archive(web.repo, req, cnode, artype, prefix=name,
>> -                     matchfn=scmutil.match(ctx, []),
>> +                     matchfn=matchfn,
>>                        subrepos=web.configbool("web", "archivesubrepos"))
>>       return []
>>   diff --git a/mercurial/scmutil.py b/mercurial/scmutil.py
>> --- a/mercurial/scmutil.py
>> +++ b/mercurial/scmutil.py
>> @@ -682,6 +682,15 @@
>>         return l
>>   +def patsaresafe(pats, defaultpattype):
>> +    for pat in pats:
>> +        pattype = defaultpattype
>> +        if ':' in pat:
>> +            pattype = pat.split(':')[0]
>> +        if pattype.lower() not in ('path', 'relpath', 'glog', 'relglob'):

Mads, thanks for your review. You raise some very valid concerns.
Please see my comments below.

> (btw: relpath and relglob are completely undocumented in the patterns help.)
> 'glog' seems to be a typo. That indicates that the feature doesn't have good
> test coverage and also haven't been fully tested manually.

The error is harmless in the sense that it means that glob patterns
would be rejected as well (i.e. the patsaresafe() method is actually
safer than it should :-)

I could add tests for all safe pattern types but I think that would be
overkill? Anyway I'm thinking this may not be necessary if I reduce
the scope of the pattern feature as I'll explain below.

> But both kinds of globs are in in my opinion not sufficiently safe. Consider
> for example the execution time for
>   hg locate "glob:*********************x"
> and how something like that can be used to denial of service attacks in
> hgweb.

True. The primary use cases for this feature are:

1. Download a single folder (with its subfolders)
2. Download a subset of the files, particularly the files matching a
set of filetypes, from "big" repositories.

The feature as currently implemented covers both use cases, but it
actually goes way beyond that. Maybe we can tight it up? That is,
maybe we can make sure that only those two very specific scenarios are

> I must say that I am no big fan of this feature as it is.
> * It is conceptually too complex compared to the value it adds.

I think we can simplify it while taking it a bit beyond the "download
a single folder" use case.

> * Patterns can not be made explorable in hgweb and it is undocumented and
> there is no good place to document it.

The obvious place to document them would be on the 403 error message
that we show when an "unsafe" pattern is shown.

> * URLs thus has to be constructed manually ...

True. I don't have a good solution for that. However hgweb already
lets you access URLs that you can construct but to which that there is
no way to get through the web interface (for example, explicit TAG
links, explicit RSS links, etc).

> and it is not obvious how to
> encode for instance globs with '?' in a url.

Also True. I guess that we could urldecode the selected file? In any
case I think that this is another indication that we should limit the
scope of what can be done further.

> * It doesn't have the full power of specifying multiple patterns with -X and
> -I as we are used to when using patterns.

I don't think this is necessary. What I mean is that this is not an
all or nothing proposition. The fact that we cannot do everything that
can be done with a proper mercurial client through the mercurial
command line does not mean that should not provide any of
functionality at all through hgweb!

> * It is not obvious which subset of patterns that can be used.

If we make the valid patterns much more explicit I think we can make
it quite obvious, and the 403 error can help educate the user.

What I have in mind is to completely disallow explicitly specifying
the pattern type. Instead we could just implicitly allow simple
"relglob" patterns when , which basically would behave as command line
shell globs. That means that we would only allow using a single "*" or
"**". If one of these were found we would run a relglob. If more than
one were found we would consider the pattern unsafe.

These are valid concerns. When I started implement this feature my
plan was just to add a way to download a single folder through the web
interface (to do so I still need to do another patch that changes the
templates to add the current path folder to archive links). However I
realized that it would be just as easy and almost free to also allow
to download specific types of files (what I had in mind was to
download all *.py files from a repository, for example). Now you have
convinced me that we must limit the scope of this much more.

> If something in this area is needed then I would suggest focusing on just
> making it possible to download a single directory as tar file. There is no
> need for a pattern - we only need a path after the archive, for instance
> .../archive/REV.tar.bz2/sub/dir .

This will definitely be possible and even accessible through the web
interface. I'd just like to also be able to do:




More information about the Mercurial-devel mailing list