This page is primarily intended for developers of Mercurial.

Grep Plan

Status: Project

Main proponents: JordiGH

1. Problem statement

This is the summary of a discussion held during the 3.8sprint.

The grep command has a troubled history. One of the most common thing people want to do is grep the current directory, but only files under hg control. Currently, this is inconvenient and at best requires doing something like hg files -0 | xargs -0 grep in order to accomplish this. At the same time, almost nobody understands what plain hg grep does and how it differs from hg grep --all. The documentation for the command does not explain this, except vague allusions about grepping history.

So, to begin with, this is a description of how the command currently works.

1.1. Plain hg grep

This searches all filelogs in reverse revlogs order, via walkchangerevs, and for each filelog outputs the first revision in which the target regexp is found. File contents are searched, not diffs, and all filelogs are searched, not just those in the current manifest.

The search can be narrowed to a subset of revisions, but in this case, it will only search the filelogs that are touched by the given revisions. In other words, hg grep -r . will not search all of the files in the current commit, but only the files touched by the current commit.

Narrowing to a group of files by specifying files does work, though.

This behaviour has been in place for a long time, but it's confusing and not very useful. It happens to be the easiest kind of search to implement given Mercurial's storage formats, though, so that probably explains why it's the way it is.

1.2. hg grep --all

This changes behaviour in two ways:

This one is a very useful tool for seeing when a change was introduced or removed. However, it has an annoying bug:

When specifying revisions, it changes the order in which revisions are visited, which messes up the expectations of the revfiles and matches caches:

A workaround is to never clear those two caches, but that becomes a memory leak instead.

1.3. walkchangerevs

The current implementation is deeply tied to walkchangerevs. This function addresses the following problem: revlogs are most efficiently traversed in revlog order, but we want to show changes starting from the top, not the bottom. Thus, walkchangerevs performs a windowing function. Starting from the top, it will go back and cache into memory in revlog order a few revisions (current default: 8). Then those windowed revisions are searched for the regexp of interest in reverse revlog order (or should be, see below). If the regexp isn't found, the window size is doubled up to a maximum (current default: 512), then the next top-most windowed revisions are cached into memory in revlog order, and so on, until all filelogs are searched.

(Need a diagram here explaining walkchangerevs, as this was made most clear by Pierre-Yves during a whiteboard session.)

2. Proposed changes

2.1. hg grep

The current plain hg grep is awkward and nonsensical. During 3.8sprint we agreed to BC it into oblivion. Instead, plain hg grep should search all tracked files, by default in the current working directory, very similar to the GrepfileExtension. If given a set of revisions, it will do the same but search the given revision(s).

2.2. hg grep --all

The --all parameter will be deprecated into --diff, in order to emphasise that it searches diffs, not file contents. Fixing Issue3885 is important.

3. Obstacles

Keeping performance reasonable while fixing the bugs and changing hg grep's default behaviour is challenging. We first need to understand walkchangerevs very well and see if it can be fixed, expanded, or replaced in order to handle our cases.

The utility of walkchangerevs is questionable in case the user specified a revset that gives revisions in revlog order instead of reverse revlog order. But I (JordiGH) still don't understand if we should just ignore the order of the given revset and always force its evaluation in reverse revlog order does, like what presumably hg glog does, or if we can do this without eagerly evaluating the revset instead of lazily.


GrepPlan (last edited 2018-03-19 14:18:34 by SangeetKumarMishra)