[PATCH] lfs: add a progress bar when searching for blobs to upload

Matt Harbison mharbison72 at gmail.com
Sat Sep 1 03:28:32 UTC 2018


On Fri, 31 Aug 2018 02:23:47 -0400, Martin von Zweigbergk  
<martinvonz at google.com> wrote:

> On Thu, Aug 30, 2018 at 11:16 PM Martin von Zweigbergk <
> martinvonz at google.com> wrote:
>
>>
>>
>> On Thu, Aug 30, 2018 at 11:02 PM Matt Harbison <mharbison72 at gmail.com>
>> wrote:
>>
>>>
>>> On Aug 31, 2018, at 1:19 AM, Martin von Zweigbergk  
>>> <martinvonz at google.com>
>>> wrote:
>>>
>>>
>>>
>>> On Thu, Aug 30, 2018 at 10:14 PM Martin von Zweigbergk <
>>> martinvonz at google.com> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Aug 29, 2018 at 8:18 PM Matt Harbison <mharbison72 at gmail.com>
>>>> wrote:
>>>>
>>>>> On Fri, 24 Aug 2018 18:18:32 -0400, Matt Harbison <
>>>>> mharbison72 at gmail.com>
>>>>> wrote:
>>>>>
>>>>> > # HG changeset patch
>>>>> > # User Matt Harbison <matt_harbison at yahoo.com>
>>>>> > # Date 1535147146 14400
>>>>> > #      Fri Aug 24 17:45:46 2018 -0400
>>>>> > # Node ID 76eca3ae345b261c0049d16269cdf991a31af21a
>>>>> > # Parent  c9a3f7f5c0235e3ae35135818c48ec5ea006de37
>>>>> > lfs: add a progress bar when searching for blobs to upload
>>>>> >
>>>>> > The search itself can take an extreme amount of time if there are a
>>>>> > lot of revisions involved.  I've got a local repo that took 6 minutes
>>>>> > to push 1850 commits, and 60% of that time was spent here (there are
>>>>> > ~70K files):
>>>>> >
>>>>> >      \ 58.1%  wrapper.py:     extractpointers      line 297:  pointers = extractpointers(...
>>>>> >        | 57.7%  wrapper.py:     pointersfromctx    line 352:  for p in pointersfromctx(ct...
>>>>> >        | 57.4%  wrapper.py:     pointerfromctx     line 397:  p = pointerfromctx(ctx, f, ...
>>>>> >          \ 38.7%  context.py:     __contains__     line 368:  if f not in ctx:
>>>>> >            | 38.7%  util.py:        __get__        line 82:  return key in self._manifest
>>>>> >            | 38.7%  context.py:     _manifest      line 1416:  result = self.func(obj)
>>>>> >            | 38.7%  manifest.py:    read           line 472:  return self._manifestctx.re...
>>>>> >              \ 25.6%  revlog.py:      revision     line 1562:  text = rl.revision(self._node)
>>>>> >                \ 12.8%  revlog.py:      _chunks    line 2217:  bins = self._chunks(chain, ...
>>>>> >                  | 12.0%  revlog.py:      decompress  line 2112:  ladd(decomp(buffer(data, ch...
>>>>> >                \  7.8%  revlog.py:      checkhash  line 2232:  self.checkhash(text, node, ...
>>>>> >                  |  7.8%  revlog.py:      hash     line 2315:  if node != self.hash(text, ...
>>>>> >                  |  7.8%  revlog.py:      hash     line 2242:  return hash(text, p1, p2)
>>>>> >              \ 12.0%  manifest.py:    __init__     line 1565:  self._data = manifestdict(t...
>>>>> >          \ 16.8%  context.py:     filenode         line 378:  if not _islfs(fctx.filelog(...
>>>>> >            | 15.7%  util.py:        __get__        line 706:  return self._filelog
>>>>> >            | 14.8%  context.py:     _filelog       line 1416:  result = self.func(obj)
>>>>> >            | 14.8%  localrepo.py:   file           line 629:  return self._repo.file(self...
>>>>> >            | 14.8%  filelog.py:     __init__       line 1134:  return filelog.filelog(self...
>>>>> >            | 14.5%  revlog.py:      __init__       line 24:  censorable=True)
>>>>>
>>>>> Any ideas how to trim down some of this overhead?
>>>>
>>>>
>>>> You can possibly save on some of that manifest-reading time by calling
>>>> manifestlog.readfast() like changegroup (and verify, I think) does.
>>>>
>>>>
>>>>> revset._matchfiles() has a comment about reading the changelog directly
>>>>> because of the overhead of creating changectx[1].  I think that could
>>>>> work here too, but it falls apart because of the need to access the
>>>>> filelogs too.
>>>>
>>>>
>>> I don't see changectx-creation in the profile output. What makes you
>>> think that's a significant cost here?
>>>
>>>
>>> The footnote below points to a comment that says it's expensive to create
>>> each one, which is basically what happened in this case.  Maybe the
>>> comment is stale now?  I wasn't sure why it would be that much more
>>> expensive either, but reading the changelog seems more direct.
>>>
>>
>> I saw that comment (in the link), but since changelog-reading doesn't seem
>> to appear in the profiling output, it doesn't seem like the first thing to
>> tackle at least. Since most of the time seems to be spent reading
>> manifests, I think readfast() might help. Can you see if the following
>> patch helps? (The special handling of the working copy is because
>> workingctx apparently doesn't have a manifestctx() method. We should
>> perhaps fix that.)
>>
>> diff --git a/hgext/lfs/wrapper.py b/hgext/lfs/wrapper.py
>> --- a/hgext/lfs/wrapper.py
>> +++ b/hgext/lfs/wrapper.py
>> @@ -259,7 +259,11 @@ def _prefetchfiles(repo, revs, match):
>>
>>      for rev in revs:
>>          ctx = repo[rev]
>> -        for f in ctx.walk(match):
>> +        if ctx.rev() is None:
>> +            manifest = ctx.manifest()
>> +        else:
>> +            manifest = ctx.manifestctx().readfast()
>> +        for f in manifest.walk(match):
>>              p = pointerfromctx(ctx, f)
>>              if p and p.oid() not in oids and not localstore.has(p.oid()):
>>                  p.filename = f
>>
>>
> Never mind, seems I patched the wrong method :P Hopefully you get the idea
> and you can figure out how to do it yourself in pointersfromctx()
> and pointerfromctx(). You want to avoid the "if f not in ctx" there.

OK, thanks!  I'll submit something as an RFC after the current tests  
finish, because I saw some wildly differing results and will probably need  
help figuring out how to accurately benchmark this.  (e.g., the first run  
with the change was 9 minutes, then the next run without it 5:34, and then  
I ran with the change again, and it was 5:09.  This is from an NVMe drive,  
so I can't believe caching could have that much influence.)
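For reference, here's how I understand the shape of the suggested change:
iterate the files that readfast() hands back instead of probing the full
manifest per candidate file.  This is a stand-alone sketch with stand-in
classes (fakemanifestctx, candidatefiles, and the dict-based manifests are
all hypothetical; the real manifestctx lives in mercurial/manifest.py):

```python
# Stand-in sketch: walk only the files stored in this revision's delta
# (what readfast() can return) instead of testing every candidate file
# against the full manifest with "if f not in ctx".

class fakemanifestctx(object):
    """Hypothetical stand-in for Mercurial's manifestctx."""
    def __init__(self, full, delta=None):
        self._full = full      # {filename: node} for the whole revision
        self._delta = delta    # only the entries this revision touched

    def read(self):
        # Full manifest: requires reconstructing the whole revlog entry.
        return self._full

    def readfast(self):
        # Cheap path: when a delta against a parent is stored, it already
        # lists exactly the files this commit touched.
        return self._delta if self._delta is not None else self._full

def candidatefiles(mctx):
    # Visiting readfast() keys sidesteps the full-manifest membership
    # test that dominated the profile above.
    return sorted(mctx.readfast())

full = {'a.txt': '1' * 40, 'b.bin': '2' * 40, 'c.bin': '3' * 40}
delta = {'c.bin': '3' * 40}  # only c.bin changed in this commit
print(candidatefiles(fakemanifestctx(full, delta)))  # ['c.bin']
print(candidatefiles(fakemanifestctx(full)))         # falls back to all files
```

With ~70K files per revision, visiting one delta entry instead of testing  
every file against the full manifest is where the win would come from.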

Is there any documentation for this sort of thing?  There are not a lot of  
function docs in this area.  I've got a high level understanding of  
changelog -> manifest -> filelog relationships, and internals.revlog  
provides some format detail.  But I'm pretty lost looking at the code.   
ctx.manifestctx().readfast() returns a manifest, and is annotated in  
localrepo with '00manifest.i'.  `hg debugdata` on that file and `hg  
manifest` both print out every single file in a revision.  But this  
manifest only contains added and modified files, so it's like it has been  
logically AND'd with the changelog.  That's very helpful here, but not at  
all obvious.
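In case it helps anyone else reading along, here's my current mental model of
why readfast() sometimes returns only the changed files.  This is a
simplified stand-alone paraphrase, not the real implementation (which is in
mercurial/manifest.py); readfast_model, NULLREV, and the dict manifests are
illustrative names only:

```python
# Simplified model: when the revlog stores this manifest revision as a
# delta against one of the revision's DAG parents, that delta by itself
# lists exactly the files the commit touched, so it can be returned
# without reconstructing the full manifest.

NULLREV = -1

def readfast_model(deltaparent, parentrevs, delta, full):
    """deltaparent: rev the stored delta is based on (NULLREV if full text).
    parentrevs: the revision's DAG parents.
    Returns the delta manifest when it's cheap and valid, else the full one.
    """
    if deltaparent != NULLREV and deltaparent in parentrevs:
        # The "logically AND'd with the changelog" effect: changed files only.
        return delta
    # Delta base isn't a parent (or full text stored): reconstruct fully.
    return full

full = {'a.txt': 'n1', 'b.bin': 'n2', 'c.bin': 'n3'}
delta = {'c.bin': 'n3'}
# Delta stored against parent rev 7 -> only the changed file comes back:
print(readfast_model(7, (7, NULLREV), delta, full))  # {'c.bin': 'n3'}
# Delta based on unrelated rev 3 -> the full manifest comes back:
print(readfast_model(3, (7, NULLREV), delta, full))
```

So whether the cheap path fires depends on how the revlog happened to store  
the revision, which would also explain run-to-run variation in the savings.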

>>>>> It seems like reading the changelog and accessing the filelogs directly
>>>>> here is too low level, especially with @indygreg trying to add support
>>>>> for non-filelog storage.
>>>>>
>>>>> [1]
>>>>>
>>>>> https://www.mercurial-scm.org/repo/hg/file/6f38284b23f4/mercurial/revset.py#l1113
>>>>> _______________________________________________
>>>>> Mercurial-devel mailing list
>>>>> Mercurial-devel at mercurial-scm.org
>>>>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>>>>>
