[PATCH 04 of 19] sparse-revlog: add a test checking revlog deltas for a churning file

Gregory Szorc gregory.szorc at gmail.com
Mon Sep 10 11:51:31 EDT 2018


On Sat, Sep 8, 2018 at 3:57 AM Boris Feld <boris.feld at octobus.net> wrote:

> # HG changeset patch
> # User Boris Feld <boris.feld at octobus.net>
> # Date 1536333449 14400
> #      Fri Sep 07 11:17:29 2018 -0400
> # Node ID 607e4fcb774047629cea21bc4052ce8c35eab5d5
> # Parent  a4f94a5caf6f652ff4959daedddf1d88c6059f1b
> # EXP-Topic sparse-snapshot
> # Available At https://bitbucket.org/octobus/mercurial-devel/
> #              hg pull https://bitbucket.org/octobus/mercurial-devel/ -r 607e4fcb7740
> sparse-revlog: add a test checking revlog deltas for a churning file
>
> The test repository contains 5000 revisions and is therefore slow to build:
> five minutes with CHG, over fifteen minutes without. It is too slow to build
> during the test. Bundling all the content produces a sizeable result, 20MB,
> too large to be committed. Instead, we commit a script that builds the
> expected bundle, and the test checks whether the bundle is available. Any run
> of the script will produce the same repository content, resulting in the same
> hashes.
>
> Using smaller repositories was tried; however, they miss most of the cases we
> are planning to improve. Having them in a 5000-revision repository is already
> nice, since we usually see these cases in repositories on the order of one
> million revisions.
>
> This test will be very useful to check various strategies for building the
> deltas stored in a sparse-revlog.
>
> In this series we will focus our attention on the following metrics:
>
> The ones that will impact the final storage performance (size, speed):
> * size of the revlog data file (".hg/store/data/*.d")
> * chain length info
>
> The ones that describe the delta patterns:
> * number of snapshot revisions (and their levels)
> * size taken by snapshot revisions (and their levels)
>
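
These metrics are all observable on a built repository, for what it's worth.
Below is a rough, untested sketch of how one could collect them for a single
revlog. The helper name and the parsing of the `hg debugrevlog` output shown
further down are illustrative assumptions on my side, not something this
patch adds:

    import os
    import subprocess

    def churn_metrics(repo_path, data_file, tracked_file):
        """Gather the metrics listed above for one revlog.

        `data_file` is the revlog data file relative to the repository
        root (the ".hg/store/data/*.d" file), `tracked_file` the tracked
        path handed to `hg debugrevlog`.
        """
        metrics = {}
        # final storage performance: size of the revlog data file
        metrics['data-file-size'] = os.path.getsize(
            os.path.join(repo_path, data_file))
        # delta patterns: chain length and snapshot data from debugrevlog
        out = subprocess.check_output(
            ['hg', '--cwd', repo_path, 'debugrevlog', tracked_file])
        for line in out.decode('ascii').splitlines():
            line = line.strip()
            if line.startswith('avg chain length'):
                metrics['avg-chain-length'] = int(line.split(':')[1])
            elif line.startswith('max chain length'):
                metrics['max-chain-length'] = int(line.split(':')[1])
            elif line.startswith('snapshot') and 'snapshot-count' not in metrics:
                # the first "snapshot" line holds the revision count,
                # e.g. "snapshot  :      101 ( 2.02%)"
                metrics['snapshot-count'] = int(
                    line.split(':')[1].split('(')[0])
        return metrics

Plain `hg debugrevlog` in the test below already exposes all of this, so the
sketch is only meant to make the list above concrete.
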
> diff --git a/.hgignore b/.hgignore
> --- a/.hgignore
> +++ b/.hgignore
> @@ -19,6 +19,7 @@ syntax: glob
>  *.zip
>  \#*\#
>  .\#*
> +tests/artifacts/cache/big-file-churn.hg
>  tests/.coverage*
>  tests/.testtimes*
>  tests/.hypothesis
> diff --git a/tests/artifacts/cache/big-file-churn.hg.md5 b/tests/artifacts/cache/big-file-churn.hg.md5
> new file mode 100644
> --- /dev/null
> +++ b/tests/artifacts/cache/big-file-churn.hg.md5
> @@ -0,0 +1,1 @@
> +fe0d0bb5979de50f4fed71bb9437764d
> diff --git a/tests/artifacts/scripts/generate-churning-bundle.py b/tests/artifacts/scripts/generate-churning-bundle.py
> new file mode 100755
> --- /dev/null
> +++ b/tests/artifacts/scripts/generate-churning-bundle.py
> @@ -0,0 +1,139 @@
> +#!/usr/bin/env python
>

This script isn't Python 3 compatible. Next time, please author new scripts
such that they are Python 3 compatible. And for things like this, you could
probably get away with making them Python 3 only.
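
For what it's worth, the incompatibilities look small: `xrange()`, plus
native `str` being fed to `hashlib.md5()` and to files opened in binary
mode. A rough, untested sketch of the two content helpers that should run on
both Python 2 and 3, reusing the constants the script already defines:

    import hashlib

    def nextcontent(previous_content):
        """utility to produce a new file content from the previous one

        Takes and returns bytes so it behaves the same on Python 2 and 3."""
        # md5() wants bytes on Python 3; hexdigest() returns a native str,
        # so re-encode to keep everything as bytes.
        return hashlib.md5(previous_content).hexdigest().encode('ascii')

    def filecontent(iteridx, oldcontent):
        """generate a new file content, yielding one line (bytes) at a time"""
        # initial call
        if iteridx is None:
            current = b''
        else:
            current = str(iteridx).encode('ascii')
        # range() exists on both; the list it builds on Python 2 is small
        for idx in range(NB_LINES):
            do_change_line = True
            if oldcontent is not None and ALWAYS_CHANGE_LINES < idx:
                do_change_line = not ((idx - iteridx) % OTHER_CHANGES)
            if do_change_line:
                to_write = current + b'\n'
                current = nextcontent(current)
            else:
                to_write = oldcontent[idx]
            yield to_write

The md5 file written at the end of run() would need the same bytes
treatment for its `digest + '\n'` line.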


> +#
> +# generate-churning-bundle - generate a bundle for a "large" branchy repository
> +#
> +# Copyright 2018 Octobus, contact at octobus.net
> +#
> +# This software may be used and distributed according to the terms of the
> +# GNU General Public License version 2 or any later version.
> +#
> +# This script generates a repository suitable for testing delta computation
> +# strategies.
> +# The repository updates a single "large" file over many revisions. One
> +# fixed part of the file always gets updated while the rest of the lines
> +# change over time. These updates happen across many topological branches,
> +# some of which get merged back.
> +#
> +# Running with `chg` in your path and `CHGHG` set is recommended for speed.
> +
> +from __future__ import absolute_import, print_function
> +
> +import hashlib
> +import os
> +import shutil
> +import subprocess
> +import sys
> +import tempfile
> +
> +BUNDLE_NAME = 'big-file-churn.hg'
> +
> +# constants for generating the repository
> +NB_CHANGESET = 5000
> +PERIOD_MERGING = 8
> +PERIOD_BRANCHING = 7
> +MOVE_BACK_MIN = 3
> +MOVE_BACK_RANGE = 5
> +
> +# constants for generating the large file we keep updating
> +#
> +# At each revision, the beginning of the file changes,
> +# and a set of other lines changes too.
> +FILENAME = 'SPARSE-REVLOG-TEST-FILE'
> +NB_LINES = 10500
> +ALWAYS_CHANGE_LINES = 500
> +OTHER_CHANGES = 300
> +
> +def nextcontent(previous_content):
> +    """utility to produce a new file content from the previous one"""
> +    return hashlib.md5(previous_content).hexdigest()
> +
> +def filecontent(iteridx, oldcontent):
> +    """generate a new file content
> +
> +    The content is generated according to the iteration index and the
> +    previous content"""
> +
> +    # initial call
> +    if iteridx is None:
> +        current = ''
> +    else:
> +        current = str(iteridx)
> +
> +    for idx in xrange(NB_LINES):
> +        do_change_line = True
> +        if oldcontent is not None and ALWAYS_CHANGE_LINES < idx:
> +            do_change_line = not ((idx - iteridx) % OTHER_CHANGES)
> +
> +        if do_change_line:
> +            to_write = current + '\n'
> +            current = nextcontent(current)
> +        else:
> +            to_write = oldcontent[idx]
> +        yield to_write
> +
> +def updatefile(filename, idx):
> +    """update <filename> to the appropriate content for iteration <idx>"""
> +    existing = None
> +    if idx is not None:
> +        with open(filename, 'rb') as old:
> +            existing = old.readlines()
> +    with open(filename, 'wb') as target:
> +        for line in filecontent(idx, existing):
> +            target.write(line)
> +
> +def hg(command, *args):
> +    """call a mercurial command with appropriate config and arguments"""
> +    env = os.environ.copy()
> +    if 'CHGHG' in env:
> +        full_cmd = ['chg']
> +    else:
> +        full_cmd = ['hg']
> +    full_cmd.append('--quiet')
> +    full_cmd.append(command)
> +    if command == 'commit':
> +        # reproducible commit metadata
> +        full_cmd.extend(['--date', '0 0', '--user', 'test'])
> +    elif command == 'merge':
> +        # avoid conflicts by picking the local variant
> +        full_cmd.extend(['--tool', ':merge-local'])
> +    full_cmd.extend(args)
> +    env['HGRCPATH'] = ''
> +    return subprocess.check_call(full_cmd, env=env)
> +
> +def run(target):
> +    tmpdir = tempfile.mkdtemp(prefix='tmp-hg-test-big-file-bundle-')
> +    try:
> +        os.chdir(tmpdir)
> +        hg('init')
> +        updatefile(FILENAME, None)
> +        hg('commit', '--addremove', '--message', 'initial commit')
> +        for idx in xrange(1, NB_CHANGESET + 1):
> +            if sys.stdout.isatty():
> +                print("generating commit #%d/%d" % (idx, NB_CHANGESET))
> +            if (idx % PERIOD_BRANCHING) == 0:
> +                move_back = MOVE_BACK_MIN + (idx % MOVE_BACK_RANGE)
> +                hg('update', ".~%d" % move_back)
> +            if (idx % PERIOD_MERGING) == 0:
> +                hg('merge', 'min(head())')
> +            updatefile(FILENAME, idx)
> +            hg('commit', '--message', 'commit #%d' % idx)
>

I bet that if we used the Mercurial API directly, this script would execute
in a few seconds. We wouldn't get that performance when using commit()
internally. But if you used addgroup() on the filelog, manifest, and
changelog to bulk-insert revisions, it would be that fast. What I'm not sure
of is whether that would engage the proper delta strategy. But you could
compute deltas beforehand and feed those deltas into addgroup(). This would
be much faster than invoking `hg` multiple times.

Anyway, I'm not going to hold up review because of that.


> +        hg('bundle', '--all', target)
> +        with open(target, 'rb') as bundle:
> +            data = bundle.read()
> +            digest = hashlib.md5(data).hexdigest()
> +        with open(target + '.md5', 'wb') as md5file:
> +            md5file.write(digest + '\n')
> +        if sys.stdout.isatty():
> +            print('bundle generated at "%s" md5: %s' % (target, digest))
> +
> +    finally:
> +        shutil.rmtree(tmpdir)
> +    return 0
> +
> +if __name__ == '__main__':
> +    orig = os.path.realpath(os.path.dirname(sys.argv[0]))
> +    target = os.path.join(orig, os.pardir, 'cache', BUNDLE_NAME)
> +    sys.exit(run(target))
> +
> diff --git a/tests/test-sparse-revlog.t b/tests/test-sparse-revlog.t
> new file mode 100644
> --- /dev/null
> +++ b/tests/test-sparse-revlog.t
> @@ -0,0 +1,121 @@
> +====================================
> +Test delta choice with sparse revlog
> +====================================
> +
> +Sparse-revlog usually shows the most gain on the manifest. However, it is
> +simpler to generate an appropriate file, so we test with a single file
> +instead. The goal is to observe intermediate snapshots being created.
> +
> +We need a large enough file. Part of the content needs to be replaced
> +repeatedly while some of it changes rarely.
> +
> +  $ bundlepath="$TESTDIR/artifacts/cache/big-file-churn.hg"
> +
> +  $ expectedhash=`cat "$bundlepath".md5`
> +  $ if [ ! -f "$bundlepath" ]; then
> +  >     echo 'skipped: missing artifact, run "'"$TESTDIR"'/artifacts/scripts/generate-churning-bundle.py"'
> +  >     exit 80
> +  > fi
> +  $ currenthash=`f -M "$bundlepath" | cut -d = -f 2`
> +  $ if [ "$currenthash" != "$expectedhash" ]; then
> +  >     echo 'skipped: outdated artifact, md5 "'"$currenthash"'" expected "'"$expectedhash"'" run "'"$TESTDIR"'/artifacts/scripts/generate-churning-bundle.py"'
> +  >     exit 80
> +  > fi
> +
> +  $ cat >> $HGRCPATH << EOF
> +  > [format]
> +  > sparse-revlog = yes
> +  > [storage]
> +  > revlog.optimize-delta-parent-choice = yes
> +  > EOF
> +  $ hg init sparse-repo
> +  $ cd sparse-repo
> +  $ hg unbundle $bundlepath
> +  adding changesets
> +  adding manifests
> +  adding file changes
> +  added 5001 changesets with 5001 changes to 1 files (+89 heads)
> +  new changesets 9706f5af64f4:d9032adc8114
> +  (run 'hg heads' to see heads, 'hg merge' to merge)
> +  $ hg up
> +  1 files updated, 0 files merged, 0 files removed, 0 files unresolved
> +  updated to "d9032adc8114: commit #5000"
> +  89 other heads for branch "default"
> +
> +  $ hg log --stat -r 0:3
> +  changeset:   0:9706f5af64f4
> +  user:        test
> +  date:        Thu Jan 01 00:00:00 1970 +0000
> +  summary:     initial commit
> +
> +   SPARSE-REVLOG-TEST-FILE |  10500 ++++++++++++++++++++++++++++++++++++++++++++++
> +   1 files changed, 10500 insertions(+), 0 deletions(-)
> +
> +  changeset:   1:724907deaa5e
> +  user:        test
> +  date:        Thu Jan 01 00:00:00 1970 +0000
> +  summary:     commit #1
> +
> +   SPARSE-REVLOG-TEST-FILE |  1068 +++++++++++++++++++++++-----------------------
> +   1 files changed, 534 insertions(+), 534 deletions(-)
> +
> +  changeset:   2:62c41bce3e5d
> +  user:        test
> +  date:        Thu Jan 01 00:00:00 1970 +0000
> +  summary:     commit #2
> +
> +   SPARSE-REVLOG-TEST-FILE |  1068 +++++++++++++++++++++++-----------------------
> +   1 files changed, 534 insertions(+), 534 deletions(-)
> +
> +  changeset:   3:348a9cbd6959
> +  user:        test
> +  date:        Thu Jan 01 00:00:00 1970 +0000
> +  summary:     commit #3
> +
> +   SPARSE-REVLOG-TEST-FILE |  1068 +++++++++++++++++++++++-----------------------
> +   1 files changed, 534 insertions(+), 534 deletions(-)
> +
> +
> +  $ f -s .hg/store/data/*.d
> +  .hg/store/data/_s_p_a_r_s_e-_r_e_v_l_o_g-_t_e_s_t-_f_i_l_e.d: size=74365490
> +  $ hg debugrevlog *
> +  format : 1
> +  flags  : generaldelta
> +
> +  revisions     :     5001
> +      merges    :      625 (12.50%)
> +      normal    :     4376 (87.50%)
> +  revisions     :     5001
> +      empty     :        0 ( 0.00%)
> +                     text  :        0 (100.00%)
> +                     delta :        0 (100.00%)
> +      snapshot  :      101 ( 2.02%)
> +        lvl-0   :            101 ( 2.02%)
> +      deltas    :     4900 (97.98%)
> +  revision size : 74365490
> +      snapshot  : 20307865 (27.31%)
> +        lvl-0   :       20307865 (27.31%)
> +      deltas    : 54057625 (72.69%)
> +
> +  chunks        :     5001
> +      0x78 (x)  :     5001 (100.00%)
> +  chunks size   : 74365490
> +      0x78 (x)  : 74365490 (100.00%)
> +
> +  avg chain length  :       23
> +  max chain length  :       45
> +  max chain reach   : 11039464
> +  compression ratio :       23
> +
> +  uncompressed data size (min/max/avg) : 346468 / 346472 / 346471
> +  full revision size (min/max/avg)     : 200927 / 201202 / 201067
> +  inter-snapshot size (min/max/avg)    : 0 / 0 / 0
> +  delta size (min/max/avg)             : 10649 / 103898 / 11032
> +
> +  deltas against prev  : 4231 (86.35%)
> +      where prev = p1  : 4172     (98.61%)
> +      where prev = p2  :    0     ( 0.00%)
> +      other            :   59     ( 1.39%)
> +  deltas against p1    :  651 (13.29%)
> +  deltas against p2    :   18 ( 0.37%)
> +  deltas against other :    0 ( 0.00%)
>