[PATCH 3 of 3] introduce filenamelog repository layout

Adrian Buehlmann adrian at cadifra.com
Tue Jul 22 03:12:12 CDT 2008


On 22.07.2008 02:20, Matt Mackall wrote:
> On Tue, 2008-07-22 at 01:04 +0200, Adrian Buehlmann wrote:
>> On 21.07.2008 23:16, Matt Mackall wrote:
>>> On Mon, 2008-07-21 at 21:43 +0200, Adrian Buehlmann wrote:
>>>
>>>> +def debugfilenamelog(ui, repo, **opts):
>>>> +    """dumps the filenamelog, showing the encoded filename for each entry
>>>> +
>>>> +    Writes two lines of output per filenamelog entry: The first line is the
>>>> +    filenamelog entry itself, the second line is the encoded filename.
>>>> +
>>>> +    By default, outputs the hashed filenames only. Specify -f to output
>>>> +    only the full-name (non-hashed) filenames, or -a to get both.
>>> That's confusing.
>> I was trying to output only the interesting things by default (the
>> hashed filenames). I admit, this might be overkill.
>>
>> Not sure if debugfilenamelog is worth being dissected that closely
>> though. It's just a debugging feature.
>>
>> But if I know what's wanted, I can change it.
>>
>>>> +    Specify -q to suppress the statistic printed at the end.
>>>> +    """
>>>> +    all = opts["all"]
>>>> +    fullonly = opts["full"]
>>>> +    n = 0
>>>> +    nh = 0
>>>> +    for f in repo.store.readfnlog():
>>>> +        ef = repo.encodefn(f)
>>>> +        hashed = ef.startswith('dh/')
>>>> +        if hashed:
>>>> +            nh += 1
>>>> +        if all or (hashed and not fullonly) or (fullonly and not hashed):
>>>> +            ui.write("   '%s'\n-> '%s'\n" % (f, ef))
>>>> +        n += 1
>>>> +    ui.status("(filenamelog has %i filenames, %i hashed)\n" % (n, nh))
>>> Couldn't we use repo.storefiles + repo.sjoin to do this without
>>> knowing/caring about what sort of store we were using? A similar comment
>>> applies to your verify addition.
>> repo.storefiles locks the repo again, and also returns the metadata files
>> (changelog, manifest), which neither debugfilenamelog nor verify wants
>> to see.
> 
> On the other hand, it's a debug function. If locking is an issue, we can
> migrate the lock back to its one user.
> 
>> Also, repo.storefiles stats the files to get the lengths, which neither
>> debugfilenamelog nor verify needs (we could control that with a parameter,
>> of course).
>>
>> verify already locks the repo and debugfilenamelog doesn't need to lock it.
>>
>>>> diff --git a/mercurial/filenamelog.py b/mercurial/filenamelog.py
>>>> new file mode 100644
>>>> --- /dev/null
>>>> +++ b/mercurial/filenamelog.py
>>>> @@ -0,0 +1,44 @@
>>>> +# filenamelog.py - logging all filenames of a Mercurial repository
>>>> +#
>>>> +# Copyright 2008 Matt Mackall <mpm at selenic.com>
>>>> +#
>>>> +# This software may be used and distributed according to the terms
>>>> +# of the GNU General Public License, incorporated herein by reference.
>>>> +
>>>> +from i18n import _
>>>> +import util
>>>> +
>>>> +LOGNAME = 'filenamelog'
>>>> +
>>>> +def abort(text, linenum = None):
>>>> +    lineinfo = ""
>>>> +    if (linenum != None):
>>>> +        lineinfo = ", line %i" % linenum
>>>> +    raise util.Abort("%s%s: %s" % (LOGNAME, lineinfo, text))
>>>> +
>>>> +def append(opener, entries, transaction):
>>>> +    if len(entries) == 0:
>>>> +        return
>>>> +    fp = opener(LOGNAME, mode='a+')
>>>> +    fp.seek(0, 2)
>>>> +    offset = fp.tell()
>>>> +    if transaction != None:
>>>> +        transaction.add(LOGNAME, offset)
>>>> +    for p in entries:
>>>> +        fp.write(p + '\n') # assuming that filenames don't contain '\n'
>>>> +    fp.close()
>>>> +
>>>> +def entries(opener):
>>>> +    # yields: path, line number
>>>> +    n = 1
>>>> +    try:
>>>> +        fp = opener(LOGNAME, mode='rb')
>>>> +    except IOError:
>>>> +        # skip nonexisting file
>>>> +        return
>>>> +    for line in fp:
>>>> +        if (len(line) < 2) or (line[-1] != '\n'):
>>>> +            abort(_('invalid entry'), n)
>>>> +        yield line[:-1], n
>>>> +        n += 1
>>> Looks like this wants enumerate(). Or to not care about line numbers.
>>> And possibly a class?
>> The line numbers were handy for verify being able to complain about
>> filenamelog corruption.
> 
>> You mean a filenamelog class?
> 
> Yes.
> 
>>>> @@ -910,6 +911,7 @@
>>>>  
>>>>              n = self.changelog.add(mn, changed + removed, text, trp, p1, p2,
>>>>                                     user, wctx.date(), extra)
>>>> +            self.store.addnewfiles(tr)
>>>>              self.hook('pretxncommit', throw=True, node=hex(n), parent1=xp1,
>>>>                        parent2=xp2)
>>>>              tr.close()
>>>> @@ -1981,6 +1983,8 @@
>>>>              # make changelog see real files again
>>>>              cl.finalize(trp)
>>>>  
>>>> +            self.store.addnewfiles(tr)
>>>> +
>>>>              newheads = len(self.changelog.heads())
>>>>              heads = ""
>>>>              if oldheads and newheads != oldheads:
>>>> @@ -2055,6 +2059,7 @@
>>>>              for chunk in util.filechunkiter(fp, limit=size):
>>>>                  ofp.write(chunk)
>>>>              ofp.close()
>>>> +        self.store.addnewfiles(None)
>>>>          elapsed = time.time() - start
>>>>          if elapsed <= 0:
>>>>              elapsed = 0.001
>>> That's a bit magical. Let me see if I can figure it out..
>>>
>>> Ok, that's all a little hairy, but I don't have any suggestions just
>>> yet.
>> That's pretty much the key point of the whole patch series. It results
>> from the problem that filelog is not allowed to report new store files
>> back to repo.
>>
>> That in turn results from having to maintain a filenamelog at all, which
>> results from not being allowed to save the filenames of the hashed files
>> inside the files themselves, so that streamclone could have asked the
>> filelog file for the name of the workspace file it tracks (which djc
>> called a layering violation).
> 
> Yep, I'm aware. Here are some thoughts:
> 
> One thing I didn't point out is that your opener only needs to add to
> the log when it's in write mode.
> 
> Also note that on the rare occasions when we walk the file list, we're
> typically calling stat() to get the file list. This gives us the
> opportunity to check that all the files mentioned actually exist, so we
> don't actually have to worry about journalling this file. Because we
> have the lock held, at the end of storefiles(), we can rewrite it if we
> find errors.
> So I think we can drastically simplify things here:
> 
> - opener only appends the file in write/append mode if the file doesn't
> exist

That's what already happens.

Filenames are appended to the filenamelog only if util.opener.open() reports
an nlink value of zero, which is the case only when the file has just been
created.

The same test is already done in util.opener.open() to determine whether
the mode of the file needs to be set (util.py#1483):

        if nlink == 0:
            self._fixfilemode(f)

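To illustrate the idea, here's a rough, self-contained sketch of an opener
wrapper that remembers the names of newly created store files (the names
recordingopener and newfiles are made up for illustration; the real code
reuses the nlink == 0 test inside util.opener.open() instead of a separate
existence check):

    import os

    class recordingopener(object):
        def __init__(self, base):
            self.base = base        # store directory
            self.newfiles = []      # names created since the last flush

        def __call__(self, name, mode='r'):
            path = os.path.join(self.base, name)
            existed = os.path.lexists(path)
            fp = open(path, mode)
            if not existed and mode not in ('r', 'rb'):
                # a newly created store file: remember its name so it
                # can later be appended to the filenamelog
                self.newfiles.append(name)
            return fp

repo.store.addnewfiles(tr) then only has to drain such a list and hand it
to filenamelog.append() under the current transaction.
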
> - storefiles() or filelog.files() skips reporting files that don't exist

This assumes we stat each entry to validate it before using it.

That feels rather obscure and would drastically degrade the filenamelog.

I fail to see what would justify such a drastic step.
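
Just to spell out what that would mean, roughly (readfnlog and storejoin
here are placeholders, not the real store API):

    import os

    def existingstorefiles(storejoin, readfnlog):
        """Return (files, stale): logged names whose store file exists,
        plus the entries that point at missing files and would force a
        rewrite of the log."""
        files, stale = [], []
        for name in readfnlog():
            try:
                st = os.lstat(storejoin(name))   # one stat() per entry
            except OSError:
                stale.append(name)               # missing store file
                continue
            files.append((name, st.st_size))
        return files, stale

That is, one lstat() per logged filename before an entry may be trusted.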

> - we rewrite the list if we find errors (extra files)

This assumes we hold a write lock whenever we read the filenamelog.
That might not always be true in the future; for example, we might
eventually implement a lock-free streamclone.

Also, the filenamelog is currently an append-only file, like the revlogs.

Not journalling it feels like asking for trouble (needlessly, per my
gut feeling).

I think this would be a step in the wrong direction.
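
Journalling it is cheap precisely because it is append-only: the
transaction only needs the pre-append offset, and rolling back truncates
to that offset, much as the journal already handles revlog appends. A toy
illustration (journal here is just a list, not Mercurial's transaction
class):

    def journalledappend(path, names, journal):
        fp = open(path, 'a+')
        fp.seek(0, 2)
        offset = fp.tell()
        journal.append((path, offset))   # enough to undo the append
        for name in names:
            fp.write(name + '\n')
        fp.close()

    def rollback(journal):
        for path, offset in reversed(journal):
            fp = open(path, 'r+')
            fp.truncate(offset)          # drop everything appended after
            fp.close()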

> And I think that makes it all transparent to the higher levels?
> 
>>>> +    def hasfnlog(self):
>>>> +        return False
>>>> +
>>>> +    def readfnlog(self):
>>>> +        res = {}
>>>> +        return res
>>> We may be able to kill these.
>> verify needs to know if there is a filenamelog that needs to be verified.
>> (at least that was my thinking when I created these methods.)
> 
> It doesn't, actually. It needn't know anything about the store. All it
> needs to know is that for every file the changelog mentions, there is a
> file returned by storefiles() and vice-versa. And that test can happen
> on any store we can call storefiles() on.
> 

There's a new file 'filenamelog' in the store, and we want to be able to
trust it to list the filenames of all files in the store.

We have to check that during verify. If we don't, we can't trust the
filenamelog.

If we can't trust the filenamelog, we can't trust streamclone to do the
right thing.
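
Schematically, the extra check in verify only needs the two sets of names
(collecting them is left out here; checkfnlog and its arguments are
illustrative, not the actual patch):

    def checkfnlog(ui, changedfiles, fnlogentries):
        # changedfiles: every file mentioned by any changeset
        # fnlogentries: the filenames read from the filenamelog
        used, logged = set(changedfiles), set(fnlogentries)
        for f in sorted(used - logged):
            ui.warn("file %r missing from filenamelog\n" % f)
        for f in sorted(logged - used):
            ui.warn("filenamelog entry %r not referenced by any changeset\n" % f)
        return used == logged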

