Bug 1866 - Linux CIFS mounts may corrupt hardlinked repos on Windows shares
Summary: Linux CIFS mounts may corrupt hardlinked repos on Windows shares
Status: RESOLVED FIXED
Alias: None
Product: Mercurial
Classification: Unclassified
Component: Mercurial
Version: unspecified
Hardware: All All
Importance: normal bug
Assignee: Bugzilla
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-10-12 03:24 UTC by bjoern
Modified: 2012-05-13 04:59 UTC
7 users

See Also:
Python Version: ---


Description bjoern 2009-10-12 03:24 UTC
We're using Mercurial on Linux on a CIFS filesystem (a mounted Samba share
on a Windows XP server) which holds our repositories. This is the mount
commandline we're using:

mount -t cifs -o user=xyz,file_mode=0664,uid=1000,gid=100 //server/repo /mnt/smb

While Mercurial itself is working fine so far, I recently noticed that the
LocalBranch extension
(http://bitbucket.org/pk11/mercurial-extensions-localbranch/wiki/Home) fails
on filesystems which do not natively support links. Commits in one lbranch
end up in all lbranches.

lbranch uses Mercurial's util.copyfiles() utility method to do some of its
magic, which ultimately tries to copy by hardlinking with os.link on Linux.
This seems to fail mysteriously on CIFS, rendering lbranch useless. Patching
util.copyfiles() to use shutil.copy() exclusively fixes the problem.
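The hardlink-or-copy behavior described above can be sketched in a few lines. This is an illustration only, not Mercurial's actual `util.copyfiles()`: the name `copy_tree` and its `hardlink` flag are hypothetical. Calling it with `hardlink=False` corresponds to the reporter's workaround of copying with `shutil.copy()` exclusively.

```python
import os
import shutil


def copy_tree(src, dst, hardlink=True):
    """Hypothetical sketch: copy a directory tree, hardlinking files if possible.

    Not Mercurial's util.copyfiles(); it only illustrates the
    "try os.link, fall back to shutil.copy" idea.
    """
    if os.path.isdir(src):
        os.mkdir(dst)
        for name in os.listdir(src):
            copy_tree(os.path.join(src, name), os.path.join(dst, name), hardlink)
        return
    if hardlink:
        try:
            os.link(src, dst)  # both names now share one inode
            return
        except (OSError, AttributeError):
            # OSError: the filesystem refused the link;
            # AttributeError: this platform's os module has no link()
            pass
    shutil.copy(src, dst)  # independent copy with its own inode
```

With links enabled, a later write to either copy must first break the link, otherwise both trees change; that is the step this bug report is about.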

I'm not sure if this is a CIFS, a Python, or a Mercurial problem.

mount.cifs version: 1.12-3.2.13
Mercurial Distributed SCM (version 1.3.1)

Regards
Comment 1 Matt Mackall 2010-10-11 15:01 UTC
It sounds like a Linux-specific CIFS problem. Please test hardlink creation
manually with ln(1) and tell us if it works.
Comment 2 bjoern 2010-10-12 05:19 UTC
OK, here we go (cifs filesystem mounted with the 'serverino' flag, see "man
mount.cifs"):

/mnt/smb$ touch test01.txt
/mnt/smb$ ls -i
 3096224743894959 test01.txt
/mnt/smb$ ln test01.txt test02.txt
/mnt/smb$ ls -i
 3096224743894959 test01.txt
 3096224743894959 test02.txt

/mnt/smb$ echo "this is a test" > test01.txt 
/mnt/smb$ cat test01.txt 
this is a test
/mnt/smb$ cat test02.txt 
this is a test

So at first glance hardlinks seem to be working just fine.

Btw.: A year has passed, here's the updated version info...

mount.cifs version: 4.5
Mercurial Distributed SCM (version 1.6.2)
Comment 3 Matt Mackall 2010-10-12 09:50 UTC
Mercurial doesn't much care about inode numbers. It identifies hardlinks by
link count. What link count is reported on those files? (ls -l)

See the section on Windows shares here:
http://mercurial.selenic.com/wiki/HardlinkedClones
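The `ln`/`ls -li` check from this thread can also be done from Python, which goes through the same `link()` and `lstat()` system calls and reports `st_nlink`, the value Mercurial relies on. A minimal sketch; the helper name `hardlink_probe` is hypothetical:

```python
import os
import tempfile


def hardlink_probe(directory):
    """Hypothetical helper: create a file plus a hardlink in `directory`
    and return (same_inode, link_count) as reported by os.lstat()."""
    first = os.path.join(directory, "test01.txt")
    second = os.path.join(directory, "test02.txt")
    with open(first, "w"):
        pass  # equivalent of touch(1)
    os.link(first, second)  # equivalent of ln(1)
    try:
        st1 = os.lstat(first)
        st2 = os.lstat(second)
        return st1.st_ino == st2.st_ino, st1.st_nlink
    finally:
        os.unlink(second)
        os.unlink(first)


if __name__ == "__main__":
    # Pass a directory on the CIFS mount (e.g. /mnt/smb) to compare with ls -li.
    print(hardlink_probe(tempfile.mkdtemp()))
```

On a filesystem that reports links correctly the result is `(True, 2)`.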
Comment 4 bjoern 2010-10-13 08:01 UTC
Well, at least 'ls' seems to report the correct link count. 

/mnt/smb$ touch test01.txt
/mnt/smb$ ls -l
-rwxrwxrwx 1 bjoern bjoern       0 13. Okt 15:47 test01.txt
/mnt/smb$ ln test01.txt test02.txt
/mnt/smb$ ls -l
-rwxrwxrwx 2 bjoern bjoern       0 13. Okt 15:47 test01.txt
-rwxrwxrwx 2 bjoern bjoern       0 13. Okt 15:47 test02.txt
/mnt/smb$ rm test02.txt
/mnt/smb$ ls -l
-rwxrwxrwx 1 bjoern bjoern       0 13. Okt 15:47 test01.txt

Anyway, it's interesting what the wiki has to say about Windows shares.
Although I don't seem to be suffering from the described bug in the first
place, I'm going to give Mercurial >=1.6.3 a shot tomorrow.
It seems to me these issues are related after all since, as I said in the
original bug report, the problem with localbranch disappears when forcing
Mercurial to copy instead of creating hardlinks.
Comment 5 bjoern 2010-10-13 08:34 UTC
Well, I just tried 1.6.4, no luck there. I still need to patch
util.copyfiles() to make localbranch work.
Comment 6 Matt Mackall 2010-10-13 11:16 UTC
Very strange. Does the existing code give a traceback or just silently fail
to copy?

If you manually link files, then unmount and remount the share, does the
link count still show as 2?
Comment 7 bjoern 2010-10-14 01:24 UTC
It's not that it actually fails. In fact localbranch seems to do its
copy/hardlink magic just fine and gives me a localbranch to work on. The
problem only occurs when committing. A commit should only be visible on the
localbranch it was created on, but when I switch branches it's still there,
on a CIFS filesystem anyway.

I guess it's necessary to find out what the localbranch extension actually
does when committing. My naive understanding is that it would need to make a
deep copy of all the modified objects and commit only on this one branch.
But instead, possibly due to some hardlink/copy mess, the commit ends up
on all the other branches as well.

I'll try to have a look at what the localbranch code does if I get the time,
although I'm not at all Python-literate, let alone Mercurial-literate.

Thanks for your help so far!
Comment 8 Adrian Buehlmann 2010-10-14 01:36 UTC
I wouldn't be surprised if the workaround
http://selenic.com/repo/hg/rev/50523b4407f6
(as mentioned on http://mercurial.selenic.com/wiki/HardlinkedClones)
needs to be generalized to platforms other than Windows.

Today, the workaround is only effective on Windows.

To me it smacks of CIFS servers in general reporting wrong
hardlink counts.

Adding pmezard to nosy (hope you don't mind).
Comment 9 bjoern 2010-10-14 04:02 UTC
I did some more testing (create an lbranch from default and commit on it, on
CIFS and on Ext3), here's what I found. Ignore the fact that the inode
number on CIFS is screwed up again and have a look at the link count.
Interestingly, at some point hg decides to break the hardlink and make a deep
copy. This does not happen on CIFS (which is obviously the reason why
localbranch is not working properly)!!

Is there some piece of code in Mercurial which does in fact look at the inode
number and not at the link count?

Where and when does Mercurial decide to break the hardlink and make a deep
copy? Sometime during commit but I'm not deep enough into this stuff to
point my finger at it.


On CIFS:

/mnt/smb/hgtest$ hg lbranch test01
/mnt/smb/hgtest$ find ./ -iname "*manifest*" | xargs ls -li
8394 -rwxrwxrwx 2 bjoern bjoern 115 14. Okt 09:47 ./.hg/branches/test01/store/00manifest.i
8369 -rwxrwxrwx 2 bjoern bjoern 115 14. Okt 09:47 ./.hg/store/00manifest.i

/mnt/smb/hgtest$ vi test.txt 
/mnt/smb/hgtest$ hg commit
/mnt/smb/hgtest$ find ./ -iname "*manifest*" | xargs ls -li
8394 -rwxrwxrwx 2 bjoern bjoern 230 14. Okt 09:49 ./.hg/branches/test01/store/00manifest.i
8369 -rwxrwxrwx 2 bjoern bjoern 230 14. Okt 09:49 ./.hg/store/00manifest.i

On Ext3:

~/hgtest$ hg lbranch test01
~/hgtest$ find ./ -iname "*manifest*" | xargs ls -li
7028845 -rw-r--r-- 2 bjoern bjoern 115 14. Okt 11:52 ./.hg/branches/test01/store/00manifest.i
7028845 -rw-r--r-- 2 bjoern bjoern 115 14. Okt 11:52 ./.hg/store/00manifest.i

~/hgtest$ vi test.txt 
~/hgtest$ hg commit
~/hgtest$ find ./ -iname "*manifest*" | xargs ls -li
7028859 -rw-r--r-- 1 bjoern bjoern 230 14. Okt 11:53 ./.hg/branches/test01/store/00manifest.i
7028845 -rw-r--r-- 1 bjoern bjoern 115 14. Okt 11:52 ./.hg/store/00manifest.i
Comment 10 Adrian Buehlmann 2010-10-14 05:52 UTC
It happens here
http://hg.intevation.org/mercurial/file/52971985be14/mercurial/util.py#l872

Mercurial uses an opener object to open files. An opener object mimics
Python's open() function (http://docs.python.org/library/functions.html).

The "function call" (util.opener.__call__) does the magic. If opening the
file is for write, it checks the number of hardlinks and does a copy if
needed on the fly, before returning the file.

All accesses to files inside the repo (.hg) are done through such opener
objects.
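The copy-on-write step described here can be sketched as below. This is a simplified illustration, not the code at the linked revision: the class name `SketchOpener` is hypothetical, and it breaks the link by copying the file to a temporary name and renaming it over the original, so that the path gets a fresh inode while the other links keep the old one.

```python
import os
import shutil


def nlinks(pathname):
    """Return number of hardlinks for the given file (as in util.nlinks)."""
    return os.lstat(pathname).st_nlink


class SketchOpener(object):
    """Hypothetical, simplified opener: opens files below `base` and
    breaks hardlinks before any write so linked clones stay untouched."""

    def __init__(self, base):
        self.base = base

    def __call__(self, path, mode="r"):
        f = os.path.join(self.base, path)
        # Any mode other than plain reading may modify the file.
        if mode not in ("r", "rb") and os.path.exists(f) and nlinks(f) > 1:
            self._breaklink(f)
        return open(f, mode)

    @staticmethod
    def _breaklink(f):
        tmp = f + ".tmp"
        shutil.copy2(f, tmp)  # private copy of data and permissions
        os.replace(tmp, f)    # `f` now names the new inode
```

The whole scheme depends on `nlinks()` telling the truth: if it returns 1 for a file that is really shared, the copy is skipped and the write lands in every linked repository.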
Comment 11 bjoern 2010-10-14 07:56 UTC
Thanks for the info!

Strange things are happening:

Python 2.6.6 (r266:84292, Aug 29 2010, 12:36:23) 
[GCC 4.4.5 20100824 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> print(os.stat('/mnt/smb/hgtest/.hg/branches/test02/store/00changelog.i').st_nlink)
3

Cool ;) Python's right... reproducibly...

Now I put that very same print statement into util.nlinks(). Right in the
beginning of it. Still with the above path hardcoded. Then I made a commit
in that hgtest repository and Python spits out... drum roll... "1". Every
time util.nlinks() is being called.

HG is using the same Python version as me. Not that there's more than one
around though, just to be clear.

Can anyone think of an explanation?
What else should I try? I'm going to do some more debugging tomorrow.
Comment 12 Matt Mackall 2010-10-14 11:23 UTC
You still didn't report back on my unmount/remount test:

link a file
unmount share
remount share
check link count

Python's UNIX implementation of os.stat is not particularly magical. It
doesn't know that it's on a non-UNIX filesystem, so it just returns what
it's getting from the syscall. The problem almost certainly lies elsewhere.

We already know:

a) hardlinks can be created on CIFS
b) hardlinks are apparently never reported on CIFS when using Windows
c) ..unless you're talking to a Samba server

So all signs up to now point to Windows servers not reporting hardlinks over
CIFS even though the protocol and Windows clients support it.
We would be surprised at this point to discover that Windows servers are
reporting link counts to Linux clients but not to Windows clients.

But your report is consistent with our knowns if we assume the Linux kernel
is doing some local caching that eventually becomes stale. Which is
reasonable to assume as the kernel has an inode cache and the link count is
stored in inodes. To know what's actually being reported by the server we
have to defeat the cache and remounting will do that.
Comment 13 bjoern 2010-10-15 01:22 UTC
Sorry I didn't answer your question. I tried the umount/mount thing, but
nothing came of it really:

/mnt/smb$ touch test01.txt
/mnt/smb$ ln test01.txt test02.txt
/mnt/smb$ ls -li
3096224743898864 -rwxrwxrwx 2 bjoern bjoern 0 15. Okt 09:05 test01.txt
3096224743898864 -rwxrwxrwx 2 bjoern bjoern 0 15. Okt 09:05 test02.txt
/mnt/smb$ cd
~$ sudo umount /mnt/smb/
~$ sudo mount /mnt/smb/
~$ cd /mnt/smb/
/mnt/smb$ ls -li
3096224743898864 -rwxrwxrwx 2 bjoern bjoern 0 15. Okt 09:05 test01.txt
3096224743898864 -rwxrwxrwx 2 bjoern bjoern 0 15. Okt 09:05 test02.txt

Also the link count for yesterday's example is still correct and I had
turned off the computer when I left the office yesterday:

3940649674220877 -rwxrwxrwx 3 bjoern bjoern 3565 14. Okt 14:16 ./.hg/branches/test01/store/00manifest.i
3940649674220877 -rwxrwxrwx 3 bjoern bjoern 3565 14. Okt 14:16 ./.hg/branches/test02/store/00manifest.i
3940649674220877 -rwxrwxrwx 3 bjoern bjoern 3565 14. Okt 14:16 ./.hg/store/00manifest.i

Python's os.stat()/os.lstat() report the correct link count as well when
called directly and not from HG code. 

Maybe it makes a difference if HG has touched/opened/stat'ed other files or
directories on that filesystem prior to stat'ing any hardlink? I'm not
saying this is a Mercurial problem. Might as well be the CIFS driver is
getting confused. It just puzzles me that "standalone" os.stat() seems to be
working, while in the context of HG it doesn't.
Comment 14 Matt Mackall 2010-10-16 13:21 UTC
Ok, that's mysterious. You did say this was being served by a real Windows
system, right? And when you say "Samba", you mean you're simply using the
Linux kernel's SMB/CIFS client support?

Can you reproduce this strange hardlink behavior with a minimal repo (ie one
file)? Can you reproduce it without using lbranch (ie using just clone)?

Grepping the output of strace for the files in question might also be revealing.
Comment 15 Adrian Buehlmann 2010-10-17 04:49 UTC
Here is what bjoern said in his very first message on this bug report:

  "We're using Mercurial on Linux on a CIFS filesystem (a mounted
   Samba share on a Windows XP server) which holds our repositories."

So I gather from this they are mounting Windows shares on Linux, served by a
real Windows XP server, accessed by Linux clients.

As we found out in the discussions and tests that led to the workaround in

   http://selenic.com/repo/hg/rev/50523b4407f6

Windows computers serving network shares potentially fail to send the
correct link count to the client side computer. It seemed like they
always send a link count of one (as observed from using the API on the
client side).

I wouldn't be surprised if this would affect clients running *on Linux* as well.

As I understand matters, the fix 50523b4407f6 is ineffective for Linux
clients accessing Windows shares served by a Windows computer, because
that part of the code is only executed on clients running *on Windows
computers*.

If I'm correct, then we would have to add a similar workaround for clients
running on Linux.

50523b4407f6 uses win32file.GetDriveType to determine if the
file resides on such a windows share (and thus needs special treatment).

If my reasoning is correct, we would need an equivalent API on Linux to
detect that situation (accessing a file on a Windows share) and add the
equivalent workaround to the Linux code path too.
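Linux has no direct counterpart of `win32file.GetDriveType`, but the filesystem type of every mount is listed in `/proc/mounts`, where the kernel CIFS client shows up as type `cifs`. A minimal sketch under stated assumptions: the function names are hypothetical, mount points are matched by longest path prefix, and bind mounts as well as the octal escapes `/proc/mounts` uses for spaces (e.g. `\040`) are ignored.

```python
import os


def fstype_for_path(path, mounts_text=None):
    """Hypothetical helper: return the filesystem type of the mount that
    contains `path`, using the Linux /proc/mounts table."""
    if mounts_text is None:
        with open("/proc/mounts") as fh:
            mounts_text = fh.read()
    path = os.path.realpath(path)
    best_mnt, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()  # device, mount point, fstype, options, ...
        if len(fields) < 3:
            continue
        mnt, fstype = fields[1], fields[2]
        # A mount point contains `path` if path equals it or lies below it.
        prefix = mnt.rstrip("/") + "/"
        if (path == mnt or path.startswith(prefix)) and len(mnt) > len(best_mnt):
            best_mnt, best_type = mnt, fstype
    return best_type


def on_cifs(path):
    """Hypothetical check for 'this file lives on a CIFS mount'."""
    return fstype_for_path(path) == "cifs"
```

Note that this only identifies CIFS mounts in general; it cannot tell a Samba server from a Windows one, which is why a direct test of link-count reporting may be the more robust signal.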
Comment 16 Adrian Buehlmann 2010-10-17 06:17 UTC
An interesting detail: bjoern reported in msg14034 [1] that

 os.stat(pathname).st_nlink

returned a value != 1 in manual testing with his setup.

We use os.lstat [2] in mercurial/util.py:

  def nlinks(pathname):
      """Return number of hardlinks for the given file."""
      return os.lstat(pathname).st_nlink

(note the difference of 'os.stat' vs 'os.lstat')

[1] http://mercurial.selenic.com/bts/msg14034
[2] http://docs.python.org/library/os.html#os.lstat
Comment 17 Matt Mackall 2010-10-17 09:02 UTC
Adrian: "a mounted Samba share on a Windows XP server" is not a real thing:
there's no Samba involved on either client or server. Since we currently
have no consistent theory that explains the behavior here, clarification of
all the facts is in order.

A consistent theory would be something of the order of:

- Windows servers don't communicate hardlinks, but Linux caches them, hence
hardlinks disappear after some interval

- reporter is actually using Samba (explaining why hardlinks are reported)
but some new issue is being 

os.stat and os.lstat are very unlikely to behave differently (I've just been
over the kernel implementation and CIFS hardly comes into it - generic
pathwalking+symlink following + generic copying of the generic in-memory
inode), but it's worth a test.

Everything here suggests that our understanding of hardlinks and CIFS (ie
Samba servers support it, but Windows servers don't) from 50523b4407f6 is
flawed, so let's put that aside for now.
Comment 18 Adrian Buehlmann 2010-10-17 09:41 UTC
mpm: I feel like we are talking past each other, but well. Maybe I'm
misunderstanding something. I fail to see where I claimed there is
a samba server involved.

Note: all tests done so far on windows clients accessing windows shares
served by other windows computers showed that hardlinks can always be
created (if the filesystem on the server supports it). They cannot be
read back though.
Comment 19 bjoern 2010-10-17 10:21 UTC
It's great to have people interested in this issue, thanks!

Let me try to clear things up a bit:

We have a mixed OS environment with Linux clients and Windows servers as
well as the other way round. My client is obviously Linux (Debian squeeze
for that matter) and although I don't like it very much I occasionally have
to work on a repository published by a machine running Windows XP on NTFS
(the reasons for this are kind of obscure and beyond this discussion I
guess). I use mount.cifs to mount and access that share on my linux box.

I have tried both, os.stat as well as os.lstat, and both of them indeed
behave the exact same way, with regard to hardlinks anyway. I'm sorry if I
didn't make that clear enough.

On my Linux box, "ls" reports correct hardlink counts on CIFS.
Python's os.stat() and os.lstat() report the correct link count when invoked
directly at the Python interpreter prompt.
os.stat() and os.lstat() report the wrong link count when invoked from
util.nlinks() with a hardcoded path to an existing file. Which is pretty weird.

As far as I can remember regular clones weren't a problem. I'll try that
tomorrow along with some stracing as mpm suggested.
Comment 20 bjoern 2010-10-18 02:15 UTC
It boils down to this simple program:

import os

PATH = "/mnt/smb/hgtest/.hg/branches/test02/store/00changelog.i"

# Link count before the file is opened.
print("file: %s nlinks: %d" % (PATH, os.lstat(PATH).st_nlink))

# Open for writing. "r+" does not truncate; "w" would empty this
# hardlinked revlog in every repository that shares it.
fh = open(PATH, "r+")

# Link count while the file is held open.
print("file: %s nlinks: %d" % (PATH, os.lstat(PATH).st_nlink))

fh.close()

Output on my machine:

file: /mnt/smb/hgtest/.hg/branches/test02/store/00changelog.i nlinks: 3
file: /mnt/smb/hgtest/.hg/branches/test02/store/00changelog.i nlinks: 1

The link count of 3 for that file is correct. Once the file has been opened,
os.stat() and os.lstat() return a wrong result, so my previous suspicion was
valid after all. I'll try to reproduce this from C code as well. Then we
should know who the culprit is.
Comment 21 Adrian Buehlmann 2010-10-18 02:24 UTC
Slightly irrelevant (and not really surprising), but

$ python
Python 2.6.5 (r265:79096, Mar 19 2010, 21:48:26) [MSC v.1500 32 bit (Intel)]
on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.stat(r'Z:\a').st_nlink
0
>>> os.lstat(r'Z:\a').st_nlink
0
>>> os.stat(r'Z:\foo').st_nlink
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
WindowsError: [Error 2] The system cannot find the file specified: 'Z:\\foo'

This was on a Windows 7 Ultimate x64 accessing a network share (Z:) served
by another Windows 7 Ultimate x64.

At least this could be taken as a reason to raise an exception if nlinks()
wants to return 0 (which is not supposed to happen on Windows, since a
different implementation of nlinks() should be in effect when running on
Windows).
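Treating a zero link count as an error could look like the sketch below. It is an illustration, not Mercurial code; the extra `stat` parameter is hypothetical and only exists so the check can be exercised without a Windows share.

```python
import os


def nlinks(pathname, stat=os.lstat):
    """Return number of hardlinks for the given file.

    Sketch: a file reached through a path name has at least one link,
    so a count of 0 means the platform did not report links at all.
    Raise instead of letting callers treat the file as unshared.
    """
    count = stat(pathname).st_nlink
    if count < 1:
        raise OSError("unreliable link count %d for %s" % (count, pathname))
    return count
```

This catches the Windows symptom (0) but not the Linux CIFS one, where the stale value is a plausible-looking 1.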
Comment 22 bjoern 2010-10-18 02:35 UTC
This C program yields the same result:

#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>

#define FPATH   "/mnt/smb/hgtest/.hg/branches/test02/store/00changelog.i"

int main(void)
{
    int fh;
    struct stat st;

    /* Link count before the file is opened. */
    if (lstat(FPATH, &st) != 0) {
        perror("lstat");
        return 1;
    }
    printf("file: %s nlinks %lu\n", FPATH, (unsigned long)st.st_nlink);

    /* Open read/write; the access flags are open()'s second argument. */
    fh = open(FPATH, O_RDWR);
    if (fh < 0) {
        perror("open");
        return 1;
    }

    /* Link count while the file is held open. */
    if (lstat(FPATH, &st) != 0)
        perror("lstat");
    else
        printf("file: %s nlinks %lu\n", FPATH, (unsigned long)st.st_nlink);

    close(fh);
    return 0;
}

file: /mnt/smb/hgtest/.hg/branches/test02/store/00changelog.i nlinks 3
file: /mnt/smb/hgtest/.hg/branches/test02/store/00changelog.i nlinks 1

So obviously this is neither a Python nor a Mercurial problem. It seems to
be the kernel CIFS driver, which reports wrong link counts for open files. I
tried this on Ext3 and all is well there.

Now, what should be done about this?
Comment 23 Adrian Buehlmann 2010-10-18 02:43 UTC
In response to msg14072 by bjoern:

What happens if the file is opened for reading instead of writing? (when
checking the link count).

I've done the following:

diff --git a/mercurial/util.py b/mercurial/util.py
--- a/mercurial/util.py
+++ b/mercurial/util.py
@@ -853,6 +853,7 @@ class opener(object):
     def __call__(self, path, mode="r", text=False, atomictemp=False):
         self.auditor(path)
         f = os.path.join(self.base, path)
+        print "opener:", mode, f

         if not text and "b" not in mode:
             mode += "b" # for that other OS

and then did a commit in a testrepo (with a modified file 't.txt'):

$ hg ci -m1
opener: r C:\Users\adi\hgrepos\tests\b\.hg\requires
opener: r C:\Users\adi\hgrepos\tests\b\.hg\sharedpath
opener: r C:\Users\adi\hgrepos\tests\b\.hg\branch
opener: r C:\Users\adi\hgrepos\tests\b\.hg\store\00changelog.i
opener: r C:\Users\adi\hgrepos\tests\b\.hg\branchheads.cache
opener: w C:\Users\adi\hgrepos\tests\b\.hg\branchheads.cache
opener: r C:\Users\adi\hgrepos\tests\b\.hg\branch
opener: r C:\Users\adi\hgrepos\tests\b\.hg\dirstate
opener: r C:\Users\adi\hgrepos\tests\b\.hg\dirstate
opener: r C:\Users\adi\hgrepos\tests\b\.hg\store\00manifest.i
opener: r C:\Users\adi\hgrepos\tests\b\.hg\merge/state
opener: wb C:\Users\adi\hgrepos\tests\b\.hg\last-message.txt
opener: r C:\Users\adi\hgrepos\tests\b\.hg\dirstate
opener: w C:\Users\adi\hgrepos\tests\b\.hg\journal.dirstate
opener: w C:\Users\adi\hgrepos\tests\b\.hg\journal.branch
opener: w C:\Users\adi\hgrepos\tests\b\.hg\journal.desc
opener: r C:\Users\adi\hgrepos\tests\b\.hg\store\00changelog.i
opener: r C:\Users\adi\hgrepos\tests\b\t.txt
opener: r C:\Users\adi\hgrepos\tests\b\.hg\store\data/t.txt.i
opener: rb C:\Users\adi\hgrepos\tests\b\.hg\store\fncache
opener: a+ C:\Users\adi\hgrepos\tests\b\.hg\store\data/t.txt.i
opener: r C:\Users\adi\hgrepos\tests\b\.hg\store\00manifest.i
opener: a+ C:\Users\adi\hgrepos\tests\b\.hg\store\00manifest.i
opener: a+ C:\Users\adi\hgrepos\tests\b\.hg\store\00changelog.i
opener: a C:\Users\adi\hgrepos\tests\b\.hg\store\00changelog.i
opener: w C:\Users\adi\hgrepos\tests\b\.hg\dirstate

I could imagine the file C:\Users\adi\hgrepos\tests\b\.hg\store\data/t.txt.i
already being held open for reading while doing the nlinks() call right before

  opener: a+ C:\Users\adi\hgrepos\tests\b\.hg\store\data/t.txt.i

(which should be the one triggering the nlinks() call), but I can hardly imagine
it being held open *for writing* already.
Comment 24 bjoern 2010-10-18 02:54 UTC
Opening for reading has the same effect, but of course you're right,
Mercurial obviously only opens it for reading:

opener: r /mnt/smb/hgtest/.hg/requires
opener: r /mnt/smb/hgtest/.hg/sharedpath
opener: r /mnt/smb/hgtest/.hg/localbranch
opener: r /mnt/smb/hgtest/.hg/requires
opener: r /mnt/smb/hgtest/.hg/branch
opener: r /mnt/smb/hgtest/.hg/branches/test02/store/00changelog.i
opener: r /mnt/smb/hgtest/.hg/branchheads.cache
opener: w /mnt/smb/hgtest/.hg/branchheads.cache
file: /mnt/smb/hgtest/.hg/branches/test02/store/00changelog.i nlinks: 1
...
Comment 25 Adrian Buehlmann 2010-10-18 03:03 UTC
Nice test and reporting, bjoern! Thanks.

So this will corrupt repos if you commit to hardlinked clones in your
setup (because mercurial will not break up hardlinks if it is told a
wrong link count of 1).
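The corruption mechanism can be reproduced on any local filesystem by feeding the write path the value the buggy CIFS client returned. In this sketch (the function name and file names are hypothetical), `reported_nlinks` stands in for what `nlinks()` said; when it is 1 for a file that is really shared, the append goes straight into the inode both repositories use.

```python
import os
import shutil


def shared_append_demo(directory, reported_nlinks):
    """Hypothetical demo: append to one of two hardlinked revlog files,
    breaking the link only if `reported_nlinks` says it is shared.
    Returns what the *other* file contains afterwards."""
    original = os.path.join(directory, "00manifest.i")
    twin = os.path.join(directory, "branch-00manifest.i")
    with open(original, "w") as fh:
        fh.write("rev0\n")
    os.link(original, twin)  # as a local clone or lbranch would

    if reported_nlinks > 1:
        tmp = twin + ".tmp"
        shutil.copy2(twin, tmp)
        os.replace(tmp, twin)  # twin gets its own inode
    with open(twin, "a") as fh:
        fh.write("rev1\n")

    with open(original) as fh:
        return fh.read()
```

With the true count of 2 the original keeps only `rev0`; with the stale count of 1 the commit shows up in the original too, which matches what was observed with lbranch.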

Now pondering what to do with this...
Comment 26 Adrian Buehlmann 2010-10-18 08:09 UTC
The comment in

http://selenic.com/repo/hg/file/80a3d1121c10/mercurial/revlog.py#l129

is quite interesting (quote):

    # lazyparser is not safe to use on windows if win32 extensions not
    # available. it keeps file handle open, which make it not possible
    # to break hardlinks on local cloned repos.
Comment 27 Matt Mackall 2010-10-18 09:00 UTC
Can you tell us what kernel version you're using?
Comment 28 Matt Mackall 2010-10-18 09:29 UTC
This may be connected to:

https://bugzilla.samba.org/show_bug.cgi?id=2823

Fixed here:

http://www.kernel.org/hg/index.cgi/linux-2.6/rev/7771ab7c774d

which is present in v2.6.20-rc1 and later.
Comment 29 bjoern 2010-10-18 10:40 UTC
~$ uname -r
2.6.32-5-amd64

Not really;) Although it does indeed sound a lot like that bug.
Comment 30 Adrian Buehlmann 2010-10-18 10:59 UTC
The samba 2823 bugreport mentions that an intermediate directory
listing call has a beneficial influence (quoting):

<quote>
open ("file.LCK", O_RDWR | O_CREAT | O_EXCL , 0);
system("ls -la");
link ("file.LCK", "file.LNK");
system("ls -la");
stat ("file.LCK", &statbuf);
 
RESULT: statbuf.st_nlink == 2
</quote>

Might be interesting to hear if it has in your setup as well
(but that probably won't really help us for finding a workaround
on mercurial).
Comment 31 Matt Mackall 2010-10-18 13:05 UTC
I've managed to reproduce this behavior with:

Linux 2.6.35
Samba 3.4.8

..with:

mount -o nounix

I've sent a bug report to the linux-cifs list.
Comment 32 Adrian Buehlmann 2010-10-18 13:57 UTC
Matt's bugreport: http://article.gmane.org/gmane.linux.kernel.cifs/1312
Comment 33 bjoern 2010-10-19 00:50 UTC
In response to msg14087:

This is probably irrelevant by now but I tried it anyway.

This code (I copied the original source code from the Samba bug report) does
not work in my setup in the first place (link: Text file busy):

<quote>
open ("file.LCK", O_RDWR | O_CREAT | O_EXCL , 0);
system("ls -la");
link ("file.LCK", "file.LNK");
system("ls -la");
stat ("file.LCK", &statbuf);
</quote>

So I used a file which already had a hardlink count > 1 and skipped the link
step. Same (wrong) result, the intermediate directory listing doesn't help
in this case.
Comment 34 Adrian Buehlmann 2010-10-19 01:33 UTC
Thanks for testing, Björn.

I've started playing a bit with workaround patches for mercurial, but it
isn't easy.

For example, trying to close files that are opened for reading doesn't work
because revlog.lazyparser keeps files open throughout its lifetime. And it
wouldn't be a robust approach anyway, since forgetting to close files opened
for reading early is just too easy.

I'm currently playing a bit with a hack in util.opener which tries to do a
singular (lazy) test to detect "broken hardlink mounts" and then if detected
later unconditionally forces doing a full file copy on every write access.

I'll post to the mercurial-devel mailing list if I have something usable.
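The "singular (lazy) test" idea can be sketched as follows: on the first write, create a temporary hardlink to the file, hold it open (the situation that produced the stale count in the tests above) and check whether the link count comes back greater than one. If it does not, the opener stops trusting `nlinks()` and copies before every write. The names `check_nlink` and `CheckingOpener` are hypothetical, and this is not necessarily how the merged patch does it.

```python
import os
import shutil


def check_nlink(testfile):
    """Hypothetical check: are hardlink counts reported correctly here,
    even while a file is held open?"""
    probe = testfile + ".nlinktest"
    try:
        os.link(testfile, probe)
    except (OSError, AttributeError):
        return False  # cannot link here; be conservative
    try:
        # The buggy Linux CIFS client reported 1 for files held open.
        with open(probe, "rb"):
            return os.lstat(probe).st_nlink > 1
    finally:
        os.unlink(probe)


class CheckingOpener(object):
    """Hypothetical opener: runs check_nlink() once, lazily, on the first
    write to an existing file; if it fails, always copy before writing."""

    def __init__(self, base):
        self.base = base
        self.trust_nlink = None  # unknown until the first write

    def _must_copy(self, f):
        if self.trust_nlink is None:
            self.trust_nlink = check_nlink(f)
        return not self.trust_nlink or os.lstat(f).st_nlink > 1

    def __call__(self, path, mode="r"):
        f = os.path.join(self.base, path)
        if mode not in ("r", "rb") and os.path.exists(f) and self._must_copy(f):
            tmp = f + ".tmp"
            shutil.copy2(f, tmp)
            os.replace(tmp, f)  # f now has its own inode
        return open(f, mode)
```

Because the check is cached per opener, a filesystem that reports links correctly pays for it only once, and a broken one falls back to the safe but slower full copy.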
Comment 35 Adrian Buehlmann 2010-10-23 02:58 UTC
The problem seems to have been fixed in newer kernels

  http://article.gmane.org/gmane.linux.kernel.cifs/1329

and I've sent the patch

  http://markmail.org/message/eqvzz65r3r64ny34

for Mercurial that would solve this as well. The patch automatically
detects if nlinks() fails to report hardlinks, which covers both
Linux and Windows clients.

According to private mail by bjoern, the patch is in use at his company
and works well.

The patch would make sure that older kernels with the buggy
CIFS driver won't be able to corrupt Mercurial repositories on Windows
shares any more. It will probably take some time until people have
upgraded to the fixed kernels.

Thanks to the automatic detection mechanism in my patch, if the Linux users
upgrade to a fixed kernel, mercurial with my patch will automatically
switch to trust nlinks(), which will be faster (since the full file copy on
every write will then not be needed any more). Same holds true if Windows
clients suddenly start successfully detecting hardlinks (the current
workaround 50523b4407f6 in effect for Windows clients causes a
hardcoded full file copy on every write for repos on Windows shares).
Comment 36 Matt Mackall 2010-10-27 14:46 UTC
Still broken in 2.6.36:

http://thread.gmane.org/gmane.linux.kernel.cifs/1312/focus=1329
Comment 37 HG Bot 2010-11-08 23:00 UTC
Fixed by http://hg.intevation.org/mercurial/crew/rev/bf826c0b9537
Adrian Buehlmann <adrian@cadifra.com>
opener: check hardlink count reporting (issue1866)
Comment 39 bjoern 2010-11-09 04:09 UTC
Adrian just informed me that his fix got merged into stable. I did some
testing (on rev 158ca54a79cc) with my particular use case (the lbranch
extension) and things seem to be working fine for me.
Comment 40 Bugzilla 2012-05-12 09:03 UTC

--- Bug imported by bugzilla@serpentine.com 2012-05-12 09:03 EDT  ---

This bug was previously known as _bug_ 1866 at http://mercurial.selenic.com/bts/issue1866