[PATCH 01 of 10] largefiles: centralize the logic to get outgoing largefiles

FUJIWARA Katsunori foozy at lares.dti.ne.jp
Tue Mar 11 05:35:00 CDT 2014


At Fri, 07 Mar 2014 20:53:26 +0100,
Mads Kiilerich wrote:
> 
> On 03/07/2014 03:47 PM, FUJIWARA Katsunori wrote:
> > At Thu, 06 Mar 2014 15:48:20 +0100, Mads Kiilerich wrote:
> >>>    shortname = '.hglf'
> >>>    shortnameslash = shortname + '/'
> >>> @@ -365,3 +366,31 @@
> >>>            if f[0] not in filelist:
> >>>                filelist.append(f[0])
> >>>        return filelist
> >>> +
> >>> +def getlfilestoupload(repo, missing, mapfunc=None):
> >>> +    toupload = set()
> >>> +    if not missing:
> >>> +        return toupload
> >> Does this check make any difference? Looping over an empty list is free.
> > "missing" will be bound to "None" if there are no outgoing revisions,
> > and it causes exception raising at looping.
> 
> Really? It seems like this function in both places replaces a 'for n in 
> o' loop?

More precisely, "missing" will be bound to "None" if (1) there are no
outgoing revisions, and (2) "getlfilestoupload()" is invoked with the
value returned by "hg._outgoing()".

But it seems that there is no specific reason for "hg._outgoing()" to
return "None" instead of an empty list (returning "None" seems to come
from 8888e56ac417).

So, I'll clean up these code paths in the revised series.
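(For illustration only, a toy sketch of why the guard existed; the names
here are made up and this is not the actual largefiles code. Once
"hg._outgoing()" returns an empty list rather than "None", the early
"if not missing" check becomes redundant, because looping over an empty
list simply does nothing.)

```python
def getlfilestoupload_sketch(missing):
    # hypothetical simplification: no "if not missing: return" guard
    toupload = set()
    for n in missing:        # harmless when missing == []
        toupload.add(n)
    return toupload

print(getlfilestoupload_sketch([]))    # empty set, no exception
try:
    getlfilestoupload_sketch(None)     # the case the guard protected against
except TypeError:
    print('None is not iterable')
```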


> (Btw: it would be nice with a docstring that says what 'missing' is and 
> that the function returns a set of mapfunc results.)
> 
> >>> +    if not mapfunc:
> >>> +        mapfunc = lambda f, ctx: f
> >> I do not think this mapfunc thing is pretty.
> >>
> >> In the end, it would be nice to tell in the UI which largefile we are
> >> pulling/pushing. Not only the hash, but also the largefile name (or one
> >> of them).
> >>
> >> Currently outgoing and summary will list all largefiles in outgoing
> >> changesets, no matter if they really need pushing or not. Arguably it
> >> should only list the largefiles that really are outgoing. For that we
> >> need the hashes.
> > Do you mean that "getlfilestoupload()" should also return hashes, in
> > preparation for the future improvement of outgoing/summary (to list
> > or count only the really outgoing largefiles)?
> 
> Yes, I think so. Outgoing/summary do not show all revisions - it only 
> shows the revisions that are outgoing. Currently outgoing/summary shows 
> all largefiles in these revisions, no matter if they are outgoing or not 
> (if I understand it correctly) - that seems like a bug. Instead it 
> should only show the largefiles that actually will be pushed. There 
> could be different ways of indicating that. Showing the involved 
> largefile names could be one. Showing the actual number of largefiles 
> (unique hashes) could be another, and yet another could be to show the 
> total amount of bytes.
> 
> What do you think about this?

That sounds good. How about the message below?

    largefiles: NN entities to upload for MM largefiles
    path/to/file .....  (only for "hg outgoing")
    ....

and the one below for "hg outgoing --large --debug"?

    largefiles: NN entities to upload for MM largefiles
    path/to/file (hash)
    path/to/file (hash1 hash2 hash3) # for multiple changes
    ....

"NN" is the number of unique hashes, and "MM" is the number of paths
of largefiles, to be uploaded.
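(A tiny self-contained sketch of how NN/MM would be computed; the data
structure and function name are hypothetical, not part of the series.)

```python
def summarize_upload(lfiles):
    """lfiles: mapping of largefile path -> set of hashes to upload.

    Returns (NN, MM): NN = number of unique hashes across all paths,
    MM = number of largefile paths."""
    hashes = set()
    for hs in lfiles.values():
        hashes.update(hs)
    return len(hashes), len(lfiles)

# two paths, three unique hashes (hash2 is shared)
lfiles = {
    'path/to/a.bin': {'hash1', 'hash2'},
    'path/to/b.bin': {'hash2', 'hash3'},
}
nn, mm = summarize_upload(lfiles)
print('largefiles: %d entities to upload for %d largefiles' % (nn, mm))
# prints: largefiles: 3 entities to upload for 2 largefiles
```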


To keep the patches small and simple, I plan to achieve this in the
steps below.

  (1) refactor to avoid redundant discovery

      this just centralizes the checking logic, as this series does.

  (2) check entities which may be uploaded

      get the unique hashes of largefiles which may be uploaded to the
      remote, and show detailed messages like the ones above.

      in this step, the existence of largefiles on the remote side is
      still not checked: the messages above may show greater NN/MM
      than the exact ones.

  (3) check exactly which entities should be uploaded

      in addition to (2), check the existence of each largefile on the
      remote side, to show exact numbers in the outgoing/summary
      messages.

      this involves "exists" wire access and increases the cost of
      remote access.

The patch for (3) can be dropped individually, if the wire access
turns out not to be reasonable performance-wise.
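(A rough sketch of step (3); the "remote_exists" callable loosely
mirrors the idea of a batched "exists" wire call, and all names here
are assumptions for illustration, not the actual store API.)

```python
def reallyoutgoing(candidates, remote_exists):
    """candidates: set of largefile hashes that may need uploading.

    remote_exists: callable taking an iterable of hashes and returning
    a dict hash -> bool, in one batched wire round trip."""
    present = remote_exists(candidates)
    # keep only the hashes the remote does not already have
    return set(h for h in candidates if not present.get(h, False))

# toy remote that already holds hash2
fake_remote = lambda hashes: {h: h == 'hash2' for h in hashes}
print(sorted(reallyoutgoing({'hash1', 'hash2', 'hash3'}, fake_remote)))
# prints: ['hash1', 'hash3']
```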


> >> So how about letting this function return a dict mapping from hash to a
> >> filename that uses it? For now the two users of this function can just
> >> look at either keys or values.
> > I chose this (= passing a map function, if needed) implementation
> > to avoid the performance impact below when there are many outgoing
> > largefiles:
> >
> >    - always returning hashes may slow down outgoing/summary with
> >      meaningless "ctx[f].data().strip()" calls for them
> 
> A largefile is still a big thing and expensive to shuffle around due to 
> its size (and the general design of what it is). My gut feeling is that 
> an extra call to .data() will be a small extra cost in the greater 
> picture ... especially considering that it (and a lstat call to the 
> remote location) is essential for actually giving a useful result for 
> outgoing/summary.
> 
> >    - computing "filename/context => hash" on the result of
> >      "getlfilestoupload()" requires scanning the (maybe large) list again
> 
> Sorry, I don't understand that.
> 
> My thought would be that the function should return a "hash => 
> one-of-the-filenames" mapping. I don't think there would be a need for 
> any scanning ... or how it would be a problem.
> 
> > If these are not so serious and/or you intend to improve
> > outgoing/summary in the future as I guessed above, I'll change
> > "getlfilestoupload()" as you suggested.
> 
> I don't have any particular plans for that. There is lots of bigger 
> problems with largefiles - this nice series is addressing one of them.
> 
> 
> Anyway: These were just my thoughts. If you still prefer the way you 
> proposed to do these changes then it is fine with me.
> 
> /Mads
> 
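(A quick sketch of the "hash => one-of-the-filenames" mapping Mads
suggests above; built in a single pass so no rescanning of the result
is needed. The function name and input shape are hypothetical.)

```python
def tofilemap(pairs):
    """pairs: iterable of (filename, hash) for outgoing largefiles.

    Returns a dict mapping each unique hash to one filename using it;
    callers can look at keys (hashes) or values (filenames)."""
    mapping = {}
    for fname, h in pairs:
        mapping.setdefault(h, fname)   # keep the first filename per hash
    return mapping

m = tofilemap([('a.bin', 'h1'), ('b.bin', 'h1'), ('c.bin', 'h2')])
print(sorted(m))            # unique hashes: ['h1', 'h2']
print(sorted(m.values()))   # one filename each: ['a.bin', 'c.bin']
```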

----------------------------------------------------------------------
[FUJIWARA Katsunori]                             foozy at lares.dti.ne.jp
