statdaemon extension

Matt Mackall mpm at selenic.com
Wed Aug 22 12:08:02 CDT 2012


On Wed, 2012-08-22 at 11:24 +0200, Nicolas Dumazet wrote:
> 2012/8/22 Martin Geisler <mg at aragost.com>:
> >
> > Nicolas wrote some questions for me on Facebook:
> >
> >> Interesting. Shouldn't be difficult to piggy-back on the inotify
> >> code to get Linux & Mac backends.
> >
> > Exactly -- that's definitely the hope! I looked at the inotify and
> > FSEvents APIs and it seems that the "tell me the state of this
> > directory" API of the statdaemon fits both of them well.
> >
> > One thing that could be better: for Windows, the daemon will listen
> > for events in the working copy -- recursively. This means that it
> > will also get events for (huge) ignored directories.
> >
> > I did it like this since I was wary of making the daemon understand the
> > .hgignore file (and in particular understand updates to it) and since it
> > seems costly to begin monitoring each directory by itself.
> >
> > As I understand it, you would use the WaitForMultipleObjects function
> > to wait on file handles in Windows. That function can wait for at most
> > MAXIMUM_WAIT_OBJECTS handles, which is normally only 64. Since the
> > extension is meant for file trees with many thousands of directories,
> > one would need to make some tree-like structure of objects to wait on.
> > That seemed a bit too cumbersome for a first prototype :)
> >
> >> Something that really isn't clear is how you fix the never-ending
> >> sync issues. In particular:
> >>
> >> 1) when your StatServer replies to fetchall, you have no guarantee
> >> that the latest scan has completed. Which means that you're
> >> potentially replying with old data
> >>
> >> 2) when your listener modifies self._cache, your StatServer might be
> >> reading it. Which means that fetchall() replies might be reflecting
> >> some incomplete / inconsistent state
> >>
> >> 2) is easy to fix (Lock!)
> >>
> >> Mitigating 1) seems harder since you seem to be scanning all the time.
> >> Any way to reply "I'm sure there were no events since I received
> >> your message, and here's the state"?
> >
> > I actually don't think it needs fixing. The way I see it, the daemon has
> > to reply with a consistent snapshot: that is, a snapshot that did exist
> > at some point in time. To be useful, the snapshot should preferably have
> > existed at some very recent point in time :)
> >
> > So if lots of events are coming in and the daemon is busy updating its
> > cache, then it can still send out a copy of its cache: this cached
> > snapshot represents the state you would see if you could do 'hg status'
> > really, really fast and thus capture the half-updated working copy.
> >
> > It might very well be that a subsequent 'hg status' will show more
> > changes -- but that's just because it was run at a later point in time.
> >
> > What is not allowed to happen is that the daemon returns a cache
> > mentioning only
> >
> >   ['bbb', 'xxx']
> >
> > when whatever process is modifying the working copy did
> >
> >   write('aaa')
> >   write('bbb')
> >   write('xxx')
> >   write('yyy')
> >
> > The snapshot with only ['bbb', 'xxx'] is inconsistent, but any snapshot
> > with a "prefix" of ['aaa', 'bbb', 'xxx', 'yyy'] is okay.
> >
> > This has some connection with the idea of "serializability" -- that the
> > final result will look *as if* the overlapping operations were executed
> > serially. Here a cache with ['aaa', 'bbb'] would be okay, since it would
> > be the result you would expect if the serial order had been:
> >
> >   write('aaa')
> >   write('bbb')
> >
> >   hg status
> >
> >   write('xxx')
> >   write('yyy')
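[The "prefix of the write sequence" property described above can be checked mechanically. A minimal sketch of my own, not part of the extension, assuming writes are observed in a single serialized order:]

```python
def is_consistent(snapshot, writes):
    """A snapshot is consistent iff it equals some prefix of the
    serialized write sequence (including the empty prefix)."""
    return any(snapshot == writes[:i] for i in range(len(writes) + 1))
```

[With writes = ['aaa', 'bbb', 'xxx', 'yyy'], the snapshot ['aaa', 'bbb'] passes, while ['bbb', 'xxx'] fails, matching the example.]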
> 
> Sure, there is some "eventual consistency": if the user calls "hg
> status" N times with N big enough, the Nth call will reflect the exact
> state of the system.
> 
> This, however, seems broken from a user perspective.
> 
> Taking again your example, and adding numbers:
> 1. write('aaa')
> 2. write('bbb')
> 3. write('xxx')
> 4. write('yyy')
> 
> If a user makes those 4 changes and issues an `hg status` after 4.,
> she will expect ['aaa', 'bbb', 'xxx', 'yyy'] as a result. Anything
> else is arguably okay from a system-design perspective, but wrong for
> the user.
> Is the current implementation guaranteeing this?
> 
> As far as I understand it, your daemon can return anything in [ [],
> ['aaa'], ['aaa', 'bbb'], ['aaa', 'bbb', 'xxx'], ['aaa', 'bbb', 'xxx',
> 'yyy'] ]. Which means that in turn, hg status can return any of those
> 5 replies.
> 
> I don't think that there is a bullet-proof way to always have complete answers.
> 1) You need to guarantee that the request handler waits for FS events
> to be propagated & handled by the FS events handler
> 2) You need to sync the two threads
> 
> And I think that you should not care about the last case:
> 3) When the request handler receives a request at t0, more events
> might be coming (t1) by the time you eventually reply (t2)
> 
> Maybe better than the current implementation:
> 
> safety_margin_msecs = ...
> lock_ = ...
> cache_ = ...
> 
> last_poll_ = 0
> 
> def fsevents_thread():
>   global last_poll_
>   while True:
>     # poll outside the lock so request handlers are not blocked
>     # for the whole timeout
>     data = poll(timeout=safety_margin_msecs)
>     with lock_:
>       last_poll_ = now()
>       cache_.update(data)
> 
> def handle_request(request):
>   # wait until 2 * safety_margin_msecs have elapsed since the last poll
>   wait(last_poll_ + 2 * safety_margin_msecs - now())
>   with lock_:
>     cache_.reply(request)
> 
> In other words, I suggest adding explicit throttling before answering
> requests, trading speed for correctness.
> Waiting is _never_ the perfect way to handle synchronization issues,
> but given that we don't have access to OS internals, that's currently
> the best thing I can think of.
> 
> On Linux you could do something like ioctl(inotify_fd, FIONREAD,
> &bufsize); to tell if there are any FS events available for processing
> before replying to the client request, but that won't work under
> Windows or Mac.
> 
> Thoughts?
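[The FIONREAD trick mentioned above can be demonstrated on any file descriptor with buffered input. The sketch below uses a plain pipe instead of an inotify fd, purely so the illustration is portable; on Linux the same ioctl on the inotify fd reports the pending event bytes.]

```python
import array
import fcntl
import os
import termios  # termios.FIONREAD is where Python exposes the constant

def pending_bytes(fd):
    """Return the number of bytes waiting to be read on fd."""
    buf = array.array('i', [0])
    fcntl.ioctl(fd, termios.FIONREAD, buf, True)  # mutates buf in place
    return buf[0]

r, w = os.pipe()
os.write(w, b'event!')
# pending_bytes(r) now reports 6 buffered bytes; a daemon could drain
# the queue before replying whenever this is non-zero.
```

[As noted, this only answers "are events pending?" on Unix-like systems; it does not help on Windows, where the throttling approach remains the fallback.]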

Some thoughts on correctness...

There are several different models of consistency that we can imagine:

1) status results are correct at _time of use_ (i.e. when we do a commit)
2) results are correct at _time of return_ (when status function exits)
3) results are consistent snapshot of some point of time during the call
4) results are consistent snapshot at the start of the call
5) results are collection of file states observable during the call

Mercurial itself uses the weakest of these, 5. The others are basically
impossible to implement, requiring either full directory tree locking or
some sort of arbitrary restartable transaction scheme + reliable
notification daemon. Also, the only one that's an actual meaningful
improvement on 5 is 1. So we basically have a model that requires the
user to serialize their changes and calls to hg for reliability.

The only model that is not acceptable is one that doesn't include all
events that occur before the call, because then even the serializing
user loses. So this should be the benchmark of correctness: it is
impossible to miss an event that occurred before the status call.

-- 
Mathematics is the supreme nostalgia of our time.
