[PATCH] lfs: add a progress bar when searching for blobs to upload
Matt Harbison
mharbison72 at gmail.com
Sat Sep 1 03:28:32 UTC 2018
On Fri, 31 Aug 2018 02:23:47 -0400, Martin von Zweigbergk
<martinvonz at google.com> wrote:
> On Thu, Aug 30, 2018 at 11:16 PM Martin von Zweigbergk <
> martinvonz at google.com> wrote:
>
>>
>>
>> On Thu, Aug 30, 2018 at 11:02 PM Matt Harbison <mharbison72 at gmail.com>
>> wrote:
>>
>>>
>>> On Aug 31, 2018, at 1:19 AM, Martin von Zweigbergk
>>> <martinvonz at google.com>
>>> wrote:
>>>
>>>
>>>
>>> On Thu, Aug 30, 2018 at 10:14 PM Martin von Zweigbergk <
>>> martinvonz at google.com> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Aug 29, 2018 at 8:18 PM Matt Harbison <mharbison72 at gmail.com>
>>>> wrote:
>>>>
>>>>> On Fri, 24 Aug 2018 18:18:32 -0400, Matt Harbison <
>>>>> mharbison72 at gmail.com>
>>>>> wrote:
>>>>>
>>>>> > # HG changeset patch
>>>>> > # User Matt Harbison <matt_harbison at yahoo.com>
>>>>> > # Date 1535147146 14400
>>>>> > # Fri Aug 24 17:45:46 2018 -0400
>>>>> > # Node ID 76eca3ae345b261c0049d16269cdf991a31af21a
>>>>> > # Parent c9a3f7f5c0235e3ae35135818c48ec5ea006de37
>>>>> > lfs: add a progress bar when searching for blobs to upload
>>>>> >
>>>>> > The search itself can take an extreme amount of time if there are a
>>>>> lot
>>>>> > of
>>>>> > revisions involved. I've got a local repo that took 6 minutes to
>>>>> push
>>>>> > 1850
>>>>> > commits, and 60% of that time was spent here (there are ~70K
>>>>> files):
>>>>> >
>>>>> > \ 58.1% wrapper.py: extractpointers line 297:
>>>>> pointers =
>>>>> > extractpointers(...
>>>>> > | 57.7% wrapper.py: pointersfromctx line 352: for p
>>>>> in
>>>>> > pointersfromctx(ct...
>>>>> > | 57.4% wrapper.py: pointerfromctx line 397: p =
>>>>> > pointerfromctx(ctx, f, ...
>>>>> > \ 38.7% context.py: __contains__ line 368: if f
>>>>> not
>>>>> > in ctx:
>>>>> > | 38.7% util.py: __get__ line 82: return
>>>>> key
>>>>> > in self._manifest
>>>>> > | 38.7% context.py: _manifest line 1416:
>>>>> result
>>>>> =
>>>>> > self.func(obj)
>>>>> > | 38.7% manifest.py: read line 472:
>>>>> return
>>>>> > self._manifestctx.re...
>>>>> > \ 25.6% revlog.py: revision line 1562: text
>>>>> =
>>>>> > rl.revision(self._node)
>>>>> > \ 12.8% revlog.py: _chunks line 2217: bins
>>>>> =
>>>>> > self._chunks(chain, ...
>>>>> > | 12.0% revlog.py: decompressline 2112:
>>>>> > ladd(decomp(buffer(data, ch...
>>>>> > \ 7.8% revlog.py: checkhash line 2232:
>>>>> > self.checkhash(text, node, ...
>>>>> > | 7.8% revlog.py: hash line 2315: if
>>>>> node
>>>>> > != self.hash(text, ...
>>>>> > | 7.8% revlog.py: hash line 2242:
>>>>> return
>>>>> > hash(text, p1, p2)
>>>>> > \ 12.0% manifest.py: __init__ line 1565:
>>>>> > self._data = manifestdict(t...
>>>>> > \ 16.8% context.py: filenode line 378: if
>>>>> not
>>>>> > _islfs(fctx.filelog(...
>>>>> > | 15.7% util.py: __get__ line 706:
>>>>> return
>>>>> > self._filelog
>>>>> > | 14.8% context.py: _filelog line 1416:
>>>>> result
>>>>> =
>>>>> > self.func(obj)
>>>>> > | 14.8% localrepo.py: file line 629:
>>>>> return
>>>>> > self._repo.file(self...
>>>>> > | 14.8% filelog.py: __init__ line 1134:
>>>>> return
>>>>> > filelog.filelog(self...
>>>>> > | 14.5% revlog.py: __init__ line 24:
>>>>> > censorable=True)
>>>>>
>>>>> Any ideas how to trim down some of this overhead?
>>>>
>>>>
>>>> You can possibly save on some of that manifest-reading time by calling
>>>> manifestlog.readfast() like changegroup (and verify, I think) does.
>>>>
>>>>
>>>>> revset._matchfiles()
>>>>> has a comment about reading the changelog directly because of the
>>>>> overhead
>>>>> of creating changectx[1]. I think that could work here too, but
>>>>> falls
>>>>> apart because of the need to access the filelogs too.
>>>>
>>>>
>>> I don't see changectx-creation in the profile output. What makes you
>>> think that's a significant cost here?
>>>
>>>
>>> The footnote below is to a comment that says itâs expensive to create
>>> each one, which is basically what happened here in this case. Maybe the
>>> comment is stale now? I wasnât sure why it would be that much more
>>> expensive either, but reading the changelog seems more direct.
>>>
>>
>> I saw that comment (in the link), but since changelog-reading doesn't
>> seem
>> to appear in the profiling output, it doesn't seem like the first thing
>> to
>> tackle at least. Since most of the time seems to be spent reading
>> manifests, I think readfast() might help. Can you see if the following
>> patch helps? (The special handling of the working copy is because
>> workingctx apparently doesn't have a manifestctx() method. We should
>> perhaps fix that.)
>>
>> diff --git a/hgext/lfs/wrapper.py b/hgext/lfs/wrapper.py
>> --- a/hgext/lfs/wrapper.py
>> +++ b/hgext/lfs/wrapper.py
>> @@ -259,7 +259,11 @@ def _prefetchfiles(repo, revs, match):
>>
>> for rev in revs:
>> ctx = repo[rev]
>> - for f in ctx.walk(match):
>> + if ctx.rev() is None:
>> + manifest = ctx.manifest()
>> + else:
>> + manifest = ctx.manifestctx().readfast()
>> + for f in manifest.walk(match):
>> p = pointerfromctx(ctx, f)
>> if p and p.oid() not in oids and not
>> localstore.has(p.oid()):
>> p.filename = f
>>
>>
> Never mind, seems I patched the wrong method :P Hopefully you get the
> idea
> and you can figure out how to do it yourself in pointersfromctx()
> and pointerfromctx(). You want to avoid the "if f not in ctx" there.
OK, thanks! I'll submit something as an RFC after the current tests
finish, because I saw some wildly differing results and will probably need
help figuring out how to accurately benchmark this. (e.g., the first run
with the change was 9 minutes, then the next run without it 5:34, and then
I ran with the change again, and it was 5:09. This is from an nvme drive,
so I can't believe caching could have that much influence.)
Is there any documentation for this sort of thing? There are not a lot of
function docs in this area. I've got a high level understanding of
changelog -> manifest -> filelog relationships, and internals.revlog
provides some format detail. But I'm pretty lost looking at the code.
ctx.manifestctx().readfast() returns a manifest, and is annotated in
localrepo with '00manifest.i'. `hg debugdata` on that file and `hg
manifest` both print out every single file in a revision. But this
manifest only contains added and modified files, so it's like it has been
logically AND'd with the changelog. That's very helpful here, but not at
all obvious.
>>> It seems like
>>>>> reading the changelog and accessing the filelogs directly here is too
>>>>> low
>>>>> level, especially with @indygreg trying to add support for
>>>>> non-filelog
>>>>> storage.
>>>>>
>>>>> [1]
>>>>>
>>>>> https://www.mercurial-scm.org/repo/hg/file/6f38284b23f4/mercurial/revset.py#l1113
>>>>> _______________________________________________
>>>>> Mercurial-devel mailing list
>>>>> Mercurial-devel at mercurial-scm.org
>>>>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>>>>>
More information about the Mercurial-devel
mailing list