[PATCH 01 of 10] largefiles: centralize the logic to get outgoing largefiles

Mads Kiilerich mads at kiilerich.com
Fri Mar 7 13:53:26 CST 2014


On 03/07/2014 03:47 PM, FUJIWARA Katsunori wrote:
> At Thu, 06 Mar 2014 15:48:20 +0100, Mads Kiilerich wrote:
>>>    shortname = '.hglf'
>>>    shortnameslash = shortname + '/'
>>> @@ -365,3 +366,31 @@
>>>            if f[0] not in filelist:
>>>                filelist.append(f[0])
>>>        return filelist
>>> +
>>> +def getlfilestoupload(repo, missing, mapfunc=None):
>>> +    toupload = set()
>>> +    if not missing:
>>> +        return toupload
>> Does this check make any difference? Looping over an empty list is free.
> "missing" will be bound to "None" if there are no outgoing revisions,
> and looping over None then raises an exception.

Really? It seems like this function replaces a 'for n in o' loop in 
both places?
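
For reference, what the two cases do in plain Python:

    >>> for n in []:
    ...     pass        # looping over an empty list is free
    >>> for n in None:
    ...     pass
    Traceback (most recent call last):
      ...
    TypeError: 'NoneType' object is not iterable

So the early return only matters if 'missing' really can be None here.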

(Btw: it would be nice to have a docstring that says what 'missing' is 
and that the function returns a set of mapfunc results.)
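
Just a rough sketch of the kind of docstring I have in mind:

    def getlfilestoupload(repo, missing, mapfunc=None):
        '''Return the set of mapfunc(f, ctx) results for every largefile
        standin f in the changesets in 'missing' (a list of changeset
        nodes).  By default the standin name itself is returned.'''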

>>> +    if not mapfunc:
>>> +        mapfunc = lambda f, ctx: f
>> I don't think this mapfunc thing is pretty.
>>
>> In the end, it would be nice if the UI told which largefiles we are
>> pulling/pushing - not only the hash, but also the largefile name (or one
>> of them).
>>
>> Currently outgoing and summary will list all largefiles in outgoing
>> changesets, whether or not they really need pushing. Arguably they
>> should only list the largefiles that really are outgoing. For that we
>> need the hashes.
> Do you mean that "getlfilestoupload()" should also return hashes, so
> that outgoing/summary can later be improved to list or count only the
> largefiles that really are outgoing?

Yes, I think so. Outgoing/summary do not show all revisions - they only 
show the revisions that are outgoing. Currently outgoing/summary show 
all largefiles in these revisions, whether they are outgoing or not (if 
I understand it correctly) - that seems like a bug. Instead they should 
only show the largefiles that actually will be pushed. There could be 
different ways of indicating that: showing the involved largefile names 
could be one, showing the actual number of largefiles (unique hashes) 
could be another, and yet another could be to show the total amount of 
bytes.
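
With a hash => filename mapping available (see my sketch further down), 
summary could for instance do something like this - untested, and the 
exact wording is of course up for discussion:

    # inside the summary hook, where 'ui', 'repo', 'outgoing' and '_'
    # are available; assumes getlfilestoupload() returns a dict as
    # sketched below
    toupload = lfutil.getlfilestoupload(repo, outgoing.missing)
    # the number of unique hashes is the real upload count
    ui.status(_('largefiles: %d entities to upload\n') % len(toupload))
    for hash in sorted(toupload):
        ui.status('  %s (%s)\n' % (toupload[hash], hash))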

What do you think about this?

>> So how about letting this function return a dict mapping from hash to a
>> filename that uses it? For now the two users of this function can just
>> look at either keys or values.
> I chose this implementation (passing a map function when needed) to
> avoid the performance impacts below when there are many outgoing
> largefiles:
>
>    - always returning hashes may slow down outgoing/summary with
>      meaningless "ctx[f].data().strip()" calls for them

A largefile is still a big thing that is expensive to shuffle around, 
due to its size (and the general design of what it is). My gut feeling 
is that an extra call to .data() will be a small extra cost in the 
greater picture ... especially considering that it (and an lstat call 
to the remote location) is essential for actually giving a useful 
result for outgoing/summary.

>    - computing "filename/context => hash" from the result of
>      "getlfilestoupload()" would require scanning the (maybe large)
>      list again

Sorry, I don't understand that.

My thought would be that the function should return a "hash => 
one-of-the-filenames" mapping. I don't think there would be a need for 
any scanning ... or see how it would be a problem.
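
Concretely, something like this untested sketch, using the existing 
isstandin() helper and ignoring merge-parent handling for brevity:

    def getlfilestoupload(repo, missing):
        '''Return a dict mapping the hash of each largefile in the
        changesets in 'missing' to one of the standin filenames that
        reference it.'''
        toupload = {}
        for n in missing or []:
            ctx = repo[n]
            for f in ctx.files():
                # 'f in ctx' skips files removed in this changeset
                if isstandin(f) and f in ctx:
                    toupload[ctx[f].data().strip()] = f
        return toupload

The two current users can then just look at the keys or the values, and 
outgoing/summary gets the hashes for free.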

> If these are not so serious, and/or you intend to improve
> outgoing/summary in the future as I guessed above, I'll change
> "getlfilestoupload()" as you suggested.

I don't have any particular plans for that. There are lots of bigger 
problems with largefiles - this nice series is addressing one of them.


Anyway: these were just my thoughts. If you still prefer to do these 
changes the way you proposed, that is fine with me.

/Mads

