[c-hglib API] recap on hg_log() buffering

Giovanni Gherdovich g.gherdovich at gmail.com
Wed Aug 21 16:42:21 CDT 2013


Hello Iulian,
hello mercurial-dev,

this is another message in which I collect information
about a particular issue we encountered in
the command server GSoC project, since the discussion
has started to scatter across several places.

== TL;DR ==

How much data will the c-hglib hg_log() function buffer at a time?
As much as is needed to represent a single changeset, no matter how big.


== Long version ==

The c-hglib API function that will invoke `hg log`
cannot cache all the data at once: for a software project
whose history spans years, the amount of data involved
can simply be too big to fit in memory.

So how big is the atomic quantity of data that c-hglib will cache?

We started out thinking of a fixed number of bytes, say 4096.
But recently a better idea emerged and gained consensus,
and it is likely to be adopted in the final implementation:

the c-hglib equivalent of `hg log` will cache
enough data to represent a single changeset,
no matter how big that is.

The rationale is: if you cannot fit one of your
commit objects into memory, you have worse problems
to worry about than the one we're addressing here.
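
For illustration only, here is a minimal sketch of how such
per-changeset buffering could look from the caller's side.
Every name in it (hg_log_iterator, hg_log_next, hg_changeset_t and
its fields, hg_log_closeiter) is hypothetical, meant only to show the
buffering granularity, not the final c-hglib API:

    /* Hypothetical sketch: each call to hg_log_next() buffers exactly
     * one changeset, however large, and nothing more. */
    hg_handle *handle = hg_open("some/repo");
    hg_changeset_t cs;                      /* one changeset worth of data */
    hg_log_iterator *it = hg_log(handle);   /* starts `hg log` */

    while (hg_log_next(it, &cs) > 0) {      /* one changeset per call */
            printf("rev %s by %s: %s\n", cs.rev, cs.user, cs.description);
            hg_changeset_free(&cs);         /* release that one buffer */
    }

    printf("exit code is: %d\n", hg_log_closeiter(it));
    hg_close(handle);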


== Digression: Level 0, Level 1 ==

With respect to the three "levels" of the API [1],
the level 0 function hg_rawread() will still read in chunks
of 4 KB, but the level 1 function hg_log() will cache
more flexibly, depending on the size of the current changeset.
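
To make the layering concrete, a level 1 function could keep reading
4 KB chunks through the level 0 call and accumulate them until one
full changeset has arrived. The sketch below only illustrates that
idea, assuming hg_rawread() returns the number of bytes it read; the
growable buffer and the changeset_is_complete() helper are
hypothetical, not part of the planned API:

    /* Hypothetical layering: level 1 accumulates level 0 chunks until
     * exactly one changeset has been read, then hands it to the caller. */
    char chunk[4096];
    char *csbuf = NULL;                  /* grows to hold one changeset */
    size_t cslen = 0;
    int n;

    while ((n = hg_rawread(handle, chunk, sizeof(chunk))) > 0) {
            csbuf = realloc(csbuf, cslen + n);
            memcpy(csbuf + cslen, chunk, n);
            cslen += n;

            if (changeset_is_complete(csbuf, cslen)) {   /* hypothetical */
                    /* parse csbuf into a structured changeset, give it
                     * to the caller, then start over for the next one */
                    cslen = 0;
            }
    }
    free(csbuf);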


== References ==

Here I list all contributions that led to the above design decision.

mpm on June 27th[2]:
::::
:::: char buf[4096]; hg_handle *handle;
::::
:::: handle = hg_open("some/repo");
:::: hg_rawcommand(handle, "hg log -v");
:::: while(hg_rawread(handle, buf, 4096))
::::         printf("got: %s", buf);
:::: printf("exit code is: %d", hg_exitcode(handle));
:::: hg_close(handle);

mpm on July 2nd [IRC]:
::::
:::: You cannot return an exit code from hg_rawcommand. [...]
:::: Ok, let's say I do hg_rawcommand("this command will run
:::: for four days and output 500G of data")..
:::: when does this API call return?
:::: Can I convince you that the answer should not be
:::: "four days from now after trying to buffer 500G of data"?
:::: [...]
:::: The client library cannot/should not internally buffer
:::: unknown huge masses of data.

kevin on Aug 4th[3]:
::::
:::: Again, for things like `hg log`, you'll want to be able
:::: to read the output (and errors) long before
:::: you could possibly get the exit code.

ggherdov on Aug 4th[4]:
::::
:::: But when moving on to level 1,
:::: a function like hg_log() would be all-in-one,
:::: meaning it will be implemented like
::::
:::: int hg_log(...) {
::::         hg_rawcommand(...);
::::         while(hg_rawread(...)) { ... }
::::         return hg_exitcode(...);
:::: }
::::
:::: [...] which, at this point, confuses me.

kevin on Aug 5th[5]:
::::
:::: It shouldn't be that confusing;
:::: calling hg_log() shouldn't buffer data any more
:::: than calling hg_rawcommand("log", ...) should.
:::: The difference between the two calls is that hg_log()
:::: should return the results in a more structured form for
:::: programmatic manipulation [...]

martin schröder on Aug 6th[6]:
::::
:::: How can you create a list of structured data
:::: without buffering it? The only solution
:::: I can think of would be kind of like MySQL does:
::::
:::: hg_log_entry_t *le; int return_value;
::::
:::: hg_log(...);
::::
:::: while((le = hg_fetch_log_entry())) {
::::         printf("ID %s, rev %s, user %s, date %s, description %s\n",
::::                le->id, le->rev, le->user, le->date, le->description);
:::: }
::::
:::: return_value = hg_get_return();

mpm on Aug 7th [IRC]:
::::
:::: hg_log() can't cache all data.
:::: Probably wants to either
:::: do something that looks like an iterator
:::: or have callbacks.

iulians on Aug 7th [IRC]:
::::
:::: Let's make an example. Let's say we call hg_log
:::: that provides us an iterator
:::: we can call the iterator to get first rev
:::: and then we call again for the next rev
:::: like Martin Schröder suggests

mpm & durin42 on Aug 13th [IRC]:
::::
:::: <mpm>     I think it's fair to allocate one cset worth of data,
::::           however much that is. A cset description can get as
::::           large as 2G, which is unlikely to happen in practice.
:::: <iulians> so, I could say that it's ok to not worry about the size
::::           for a cset? Probably the user will have enough RAM or space...
:::: <durin42> Yeah. If a user can't inflate the changeset object in RAM,
::::           they're probably screwed already.

mg on Aug 19th [7]:
::::
:::: I think it was established that it is okay
:::: to return the data for a single changeset as a single value
:::: -- the fields in a changeset are not that big.

[1] http://markmail.org/message/wdwlrwv6lcacwmrs
[2] http://markmail.org/message/tc6hsvl7fofdjqcl
[3] http://markmail.org/message/3alrorohanhiomqb
[4] http://markmail.org/message/sfz3qn6scrxwib36
[5] http://markmail.org/message/e3stgfheanp4ihxw
[6] http://markmail.org/message/jvk3jfpbsl3bhhs2
[7] http://markmail.org/message/ckrhparroxvpibym