Bug 6349 - bytes vs strings: unnecessary git reindexing
Summary: bytes vs strings: unnecessary git reindexing
Status: RESOLVED FIXED
Alias: None
Product: Mercurial
Classification: Unclassified
Component: git (show other bugs)
Version: default branch
Hardware: PC Linux
: wish bug
Assignee: Bugzilla
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-06-08 18:40 UTC by Hollis Blanchard
Modified: 2020-06-20 00:01 UTC (History)
1 user (show)

See Also:
Python Version: ---


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Hollis Blanchard 2020-06-08 18:40 UTC
Using qemu.git (77K commits), I see that a simple "hg bookmarks" takes a surprisingly long time (20 seconds).

The problem is that we're missing the "don't reindex" optimization, because cur_cache_heads == cache_heads comparison is false. Why?

Our sqlite3 connection has .text_factory = bytes, so cache_heads is in bytes. pygit2, on the other hand, returns Python strings, so cur_cache_heads is in strings.

I don't know which way it should be, so I don't know if this change is appropriate, but it fixes the problem for me:

--- a/hgext/git/index.py
+++ b/hgext/git/index.py
@@ -245,7 +245,7 @@ def _index_repo(gitrepo, db, progress_fa
     # TODO: we should figure out how to incrementally index history
     # (preferably by detecting rewinds!) so that we don't have to do a
     # full changelog walk every time a new commit is created.
-    cache_heads = {x[0] for x in db.execute('SELECT node FROM possible_heads')}
+    cache_heads = {x[0].decode("utf-8") for x in db.execute('SELECT node FROM possible_heads')}
     walker = None
     cur_cache_heads = {h.hex for h in possible_heads}
     if cur_cache_heads == cache_heads:
Comment 1 HG Bot 2020-06-12 13:55 UTC
Fixed by https://mercurial-scm.org/repo/hg/rev/fb2936c5f6dc
Hollis Blanchard <hollis_blanchard@mentor.com>
git: decode node IDs back into Python strings (issue6349)

db.text_factory = bytes, so the database contains only strings. The object IDs
we get from pygit2 are Python strings. b'foo' != 'foo'

This change allows the "don't reindex" optimization to work by allowing the
"cur_cache_heads == cache_heads" comparison a few lines down to succeed.

Differential Revision: https://phab.mercurial-scm.org/D8622

(please test the fix)
Comment 2 Bugzilla 2020-06-20 00:01 UTC
Bug was set to TESTING for 7 days, resolving