D2057: rust implementation of hg status

Thu Mar 8 12:34:00 EST 2018

kevincox requested changes to this revision.
kevincox added a comment.
This revision now requires changes to proceed.

  I will try to finish the review later, but it might be quicker if you incorporate some of the changes first since a lot of them are repeated many times. Overall it looks good, there are a couple of things that i would highlight to make the code easier to read.

  - Prefer more descriptive variable names.
  - If you can, avoid "pointer" arithmetic. Cursors and slices are nice and have great convenience methods.
  - Group your flow control and filtering more.
  - Try to keep your types straight. In rust using the right types helps catch errors. So be aware of `char` vs `u8` vs `String` vs `Vec<char>` vs `Vec<u8>`.

  On a higher level, all of these code appears to be treating file names as strings. This isn't really true and will disallow some valid file names. Maybe we should stick with bytes throughout. Of course this makes windows filepaths difficult because they are actually (utf16) strings.

INLINE COMMENTS

> indygreg wrote in build.rs:1
> I see this file was copied. There's nothing wrong with that. But does this mean we will need a custom build.rs for each Rust package doing Python? If that's the case, then I would prefer to isolate all our rust-cpython code to a single package, if possible. I'm guessing that could be challenging due to crossing create boundaries. I'm sure there are placed where we don't want to expose symbols outside the crate.
> 
> I'm curious how others feel about this.

If this is going to be reused I would move it into it's own crate. It seems like everything here could be boiled down to a single function call in main.

> base85.rs:21
> +    });
> +}
> +

I prefer something like this: https://play.rust-lang.org/?gist=5ca18a5b95335600e911b8f9310ea5c7&version=stable

I doubt lazy_static is too slow. Otherwise we can stay like this until const functions get implemented.

Either way note:

- I changed the type of B85CHARS to an array as opposed to an array ref.
- The loop condition is much nicer.

> base85.rs:23
> +
> +pub fn b85encode(py: Python, text: &str, pad: i32) -> PyResult<PyObject> {
> +    let text = text.as_bytes();

Would it be possible to separate the decode from the python objects. I'm thinking two helper functions.

  fn b85_required_len(text: &str) -> usize
  fn b85_encode(text: &str, pad: i32, out: &mut [u8]) -> Result<()>

> base85.rs:23
> +
> +pub fn b85encode(py: Python, text: &str, pad: i32) -> PyResult<PyObject> {
> +    let text = text.as_bytes();

`&str` can only hold valid utf8 data? Does it make more sense to have `&[u8]` here for a list of bytes?

> base85.rs:23
> +
> +pub fn b85encode(py: Python, text: &str, pad: i32) -> PyResult<PyObject> {
> +    let text = text.as_bytes();

IIUC pad is only ever checked `== 0`. Can it be made into a bool?

> base85.rs:45
> +
> +    let mut ptext = &text[..];
> +    let mut len = { ptext.len() };

`ptext` isn't very descriptive.

> base85.rs:46
> +    let mut ptext = &text[..];
> +    let mut len = { ptext.len() };
> +    let mut dst_off: usize = 0;

Why the braces here?

> base85.rs:47
> +    let mut len = { ptext.len() };
> +    let mut dst_off: usize = 0;
> +

I suspect this type annotation isn't required.

> base85.rs:47
> +    let mut len = { ptext.len() };
> +    let mut dst_off: usize = 0;
> +

It might be best to use a `std::io::Cursor` <https://doc.rust-lang.org/stable/std/io/struct.Cursor.html> and let it drack `dst_off` for your.

> base85.rs:52
> +            break;
> +        }
> +

while !ptext.is_empty()

> base85.rs:54
> +
> +        let mut acc: u32 = 0;
> +

I would prefer the name `chunk` or even `accum` is a lot mode obvious to me than `acc`.

> base85.rs:56
> +
> +        for i in [24, 16, 8, 0].iter() {
> +            let ch = ptext[0] as u32;

for i in &[24, 16, 8, 0]

> base85.rs:58
> +            let ch = ptext[0] as u32;
> +            acc |= ch << i;
> +

I would just combine these into one line as the name `ch` isn't adding much.

> base85.rs:63
> +
> +            if len == 0 {
> +                break;

Tracking len manually is a smell. Why not drop it and use `ptest.is_empty()`.

> base85.rs:91
> +pub fn b85decode(py: Python, text: &str) -> PyResult<PyObject> {
> +    let b85dec = unsafe { B85DEC };
> +

This is probably worth a comment that this is safe because D85DEC is required to be initialized before this function is called.

> base85.rs:152
> +
> +            acc *= 85;
> +

Let rust do the overflow checking.

  acc = acc.checked_mul(85)
      .ok_or_else(|| {
          PyErr::new::<exc::ValueError, _>(
              py,
              format!("bad base85 character at position {}", i))
      })?;

> changelog.rs:24
> +
> +        assert_eq!(content[NodeId::hex_len()], '\n' as u8);
> +

Passing a message as a third argument is really useful.

> changelog.rs:31
> +
> +        content.drain(..NodeId::hex_len());
> +

If you aren't using the value I would prefer `truncate(NodeId::hex_len())`

> changelog.rs:33
> +
> +        let msg = content;
> +

Just put `msg: content` in the struct construction.

> config.rs:49
> +    pub delta: DeltaPolicy,
> +}
> +

Is this used yet? It probably also needs some documentation because I don't really understand the fields (but I do have little domain knowledge).

> config.rs:78
> +
> +    pub fn register_conf(&mut self, key: &str, func: RegFn) {
> +        self.reg_conf.insert(key.to_string(), func);

If you are just going to convert to String I would recommend taking a String argument.

Also prefer `.to_owned()` over `.to_string()`.

> dirstate.rs:5
> +use std::fs::File;
> +use std::collections::HashMap as Map;
> +#[cfg(target_family = "unix")]

I recommend not renaming this. It is confusing.

> dirstate.rs:48
> +    pub path: PathBuf,
> +    pub dmap: Map<PathBuf, DirStateEntry>,
> +

This could have a better name.

> dirstate.rs:87
> +impl DirState {
> +    pub fn new(p: PathBuf) -> Self {
> +        if !p.exists() {

This should Probably return a `Result<Self>` and pass the error to the caller.

> dirstate.rs:90
> +            panic!("dirstate file is missing")
> +        }
> +

I would skip this check and rely on `p.metadata()`. Just switch `.unwrap()` to `.expect()` with a nicer message. This also handles race conditions more nicely.

> dirstate.rs:108
> +
> +    pub fn parse_dirstate(&mut self) {
> +        let mut dfile = File::open(&self.path).expect("Cannot open dirstate file");

1. Does this function need to be public? It seems internal to the constructor.
2. If it doesn't need to be I would prefer it return the Map so that you don't have a partial-constructed DirState.

> dirstate.rs:125
> +                .entry(PathBuf::from(str::from_utf8(fname.as_ref()).unwrap()))
> +                .or_insert(entry);
> +        }

Is ignoring duplicate entries desired? It might be worth a comment explaining why.

> dirstate.rs:130
> +    #[cfg(target_family = "unix")]
> +    fn _is_bad(entry: &DirEntry) -> bool {
> +        entry.file_type().is_block_device() || entry.file_type().is_fifo()

Don't use `_` prefix for privates. Rely on rust viability.

Also `is_bad` isn't very informative.

> dirstate.rs:141
> +
> +    pub fn walk_dir(&mut self, root: &Path, mtc: &matcher::Matcher) -> CurrentState {
> +        let mut grey = {

s/mtc/matcher/

> dirstate.rs:146
> +            grey
> +        };
> +

let mut grey = Set::new();
  grey.extend(self.dmap.keys().map(|s| s.as_path()));

Also I would pick a name like `undiscovered_paths` or something. `grey` is cryptic.

> dirstate.rs:152
> +
> +        for entry in walker.filter_entry(|ent| {
> +            if ent.file_type().is_dir() {

I would prefer doing the filter before the loop and storing it in a variable.

> dirstate.rs:155
> +                let mut p = ent.path().strip_prefix(root).unwrap().to_owned();
> +                p.push("");
> +                !mtc.check_path_ignored(p.to_str().unwrap())

This is probably worth a helper function.

> dirstate.rs:161
> +        }) {
> +            if let Ok(entry) = entry {
> +                let pbuf = entry.path();

Please explain why you are ignoring the error condition.

> dirstate.rs:162
> +            if let Ok(entry) = entry {
> +                let pbuf = entry.path();
> +                let relpath = pbuf.strip_prefix(root).unwrap();

I would just call this `path` or `pathbuf`.

> dirstate.rs:167
> +                    continue;
> +                }
> +

I would move this filter beside the filter in the loop.

> dirstate.rs:169
> +
> +                if !entry.file_type().is_dir() {
> +                    if self.dmap.contains_key(relpath) {

I would also put this filter above. But more importantly all `_is_bad()` does is check for file types. So it seems like the former filter is redundant with this one.

> dirstate.rs:170
> +                if !entry.file_type().is_dir() {
> +                    if self.dmap.contains_key(relpath) {
> +                        let stent = &self.dmap[relpath];

You could do the following for a slight performance win and save a line.

  if let Occupied(entry) = self.dmap.entry(relpath) {
     ...
  }

> dirstate.rs:175
> +                    } else {
> +                        if !mtc.check_path_ignored(relpath.to_str().unwrap()) {
> +                            res.unknown.insert(relpath.to_path_buf());

Use an `else if`.

> dirstate.rs:183
> +
> +        for rem in grey.drain() {
> +            if res.ignored.contains(rem) {

s/rem/path/ or remaining_path.

> dirstate.rs:184
> +        for rem in grey.drain() {
> +            if res.ignored.contains(rem) {
> +                res.ignored.remove(rem);

You can use the entry api here.

> dirstate.rs:199
> +
> +    fn check_status(res: &mut CurrentState, abspath: &Path, relpath: &Path, stent: &DirStateEntry) {
> +        let pb = relpath.to_path_buf();

Please use a better name for `sent`.

> dirstate.rs:206
> +        // other status all rely on file existence.
> +        if stent.status == ('r' as u8) {
> +            res.removed.insert(pb);

In rust we generally avoid brackets around `as` as it is very tightly binding.

> lib.rs:54
> +            "e8f05076085cd24d01ba1f5d6163fdee16e90103",
> +            dst.to_str().unwrap(),
> +        ])

You can add a later `.arg(dst)` to support non-utf8 paths instead of converting to a str here.

> local_repo.rs:50
> +                        p
> +                    ));
> +                }

I would replace the condition with.

  assert!(dot_hg_path.exists(), ".hg folder not found for the path given by -R argument: {:?}", p);

> local_repo.rs:66
> +                    }
> +                }
> +            }

while !root.join(".hg").exists() {
      root = root.parent().expect(".hg folder not found");
  }

> local_repo.rs:121
> +
> +    pub fn get_filelog(&self, fp: &Path) -> Arc<RwLock<Revlog>> {
> +        let relpath = encode_path(fp);

s/fp/path/

> local_repo.rs:127
> +            panic!(format!("path not exists: {:?}", abspath));
> +        }
> +

assert!(abspath.exists(), "path not exists: {:?}", abspath);

> local_repo.rs:129
> +
> +        let mut gd = self.revlog_db.write().unwrap();
> +

`gd` is cryptic.

> local_repo.rs:136
> +
> +        return gd.get_mut(fp).unwrap().clone();
> +    }

Why does it need to be mutable to clone?

> local_repo.rs:155
> +    #[test]
> +    fn test_hgstorage_dirstate() -> () {
> +        println!(

This test has no assetions. Consider calling it `test_create_...` or something to indicate that you are just checking for panics.

> manifest.rs:33
> +        for i in 0..(content.len()) {
> +            if content[i] == 0 {
> +                prev_i = i;

What are these magic numbers?

> manifest.rs:42
> +
> +                let ent = if i - prev_i - 1 == NodeId::hex_len() {
> +                    let id =

s/ent/entry/

> manifest.rs:49
> +                    let id = NodeId::new_from_bytes(&hex::decode(
> +                        &content[(prev_i + 1)..(prev_i + 41)],
> +                    ).unwrap());

What are these numbers?

> matcher.rs:9
> +
> +pub fn glob_to_re(pat: &str) -> String {
> +    let mut res = String::new();

s/pat/glob/

> matcher.rs:10
> +pub fn glob_to_re(pat: &str) -> String {
> +    let mut res = String::new();
> +

Might be worth calling `String::with_capacity(pat.len())` since it will be at least that long.

> matcher.rs:14
> +    let mut i = 0;
> +    let n = pat.len();
> +

Can you manage a `&[u8]` rather then pointer arithmetic for the whole string. It will make me feel better and will probably be easier to read.

> matcher.rs:108
> +
> +fn relglob(pat: String) -> String {
> +    let mut res = String::new();

If you are going to call `String.as_str()` just take a `&str`.

> matcher.rs:108
> +
> +fn relglob(pat: String) -> String {
> +    let mut res = String::new();

s/relglob/relative_glob_re/

> matcher.rs:111
> +    //res.push_str("(?:|.*/)");
> +    res.push_str(glob_to_re(pat.as_str()).as_str());
> +    //res.push_str("(?:/|$)");

If you are just doing one call just return the result.

> matcher.rs:111
> +    //res.push_str("(?:|.*/)");
> +    res.push_str(glob_to_re(pat.as_str()).as_str());
> +    //res.push_str("(?:/|$)");

You should be able to do `&string` rather then `string.as_str()` as it coerces.

> matcher.rs:131
> +
> +            let fhnd = BufReader::new(fs::File::open(fname).unwrap());
> +

Better name please.

> matcher.rs:143
> +            let mut cur_state = State::None;
> +            for ln in fhnd.lines() {
> +                let ln = match ln {

s/ln/line/

> matcher.rs:159
> +                        fname
> +                    );
> +                }

Is this a warning or error? You might want to switch to `panic!`.

> matcher.rs:160
> +                    );
> +                }
> +

I would move this into the following match because it dedupes the `starts_with` check and puts the logic closer together.

> matcher.rs:195
> +    /// rp: relative path, relative to the root of repository
> +    pub fn check_path_ignored(&self, rp: &str) -> bool {
> +        if let Some(ref m) = self.inner {

s/rp/path/

> matcher.rs:200
> +            false
> +        }
> +    }

I would do `self.inner.map(|m| m.is_match(rp)).unwrap_or(false)` but this is fine.

> mpatch.rs:13
> +    /// the line where previous chunk ends
> +    prev_cnk_ln: u32,
> +    /// the line where next chunk starts

Spell these out please.

> mpatch.rs:24
> +    frag_ofs: u32,
> +}
> +

struct Fragment {
      len: u32,
      offset: u32,
  }

> mpatch.rs:35
> +
> +fn pull(dst: &mut Vec<Fragment>, src: &mut Vec<Fragment>, l: u32) {
> +    let mut l = l;

Maybe it's just me but I think it is more common to put the source before the destination.

> mpatch.rs:35
> +
> +fn pull(dst: &mut Vec<Fragment>, src: &mut Vec<Fragment>, l: u32) {
> +    let mut l = l;

`pull` is very generic.

> mpatch.rs:39
> +    while l > 0 {
> +        assert_ne!(src.len(), 0);
> +        let f = src.pop().unwrap();

assert!(!src.is_empty())

> mpatch.rs:40
> +        assert_ne!(src.len(), 0);
> +        let f = src.pop().unwrap();
> +        if f.frag_len > l {

If you are unwrapping the `pop` there is no need for the prior check.

> mpatch.rs:40
> +        assert_ne!(src.len(), 0);
> +        let f = src.pop().unwrap();
> +        if f.frag_len > l {

s/f/fragment/

> mpatch.rs:51
> +
> +fn mov(m: &mut Cursor<Vec<u8>>, dest: u32, src: u32, count: u32) {
> +    m.seek(Start(src as u64)).unwrap();

`mov` is overly shortened and generic.

> mpatch.rs:51
> +
> +fn mov(m: &mut Cursor<Vec<u8>>, dest: u32, src: u32, count: u32) {
> +    m.seek(Start(src as u64)).unwrap();

It seems weird to take a cursor to a vec if you are just going to do an absolute seek. Can it work with `&mut [u8]`?

> mpatch.rs:54
> +    let mut buf: Vec<u8> = Vec::new();
> +    buf.resize(count as usize, 0);
> +

`vec![0; count]` works. (The arguments might be the other way around).

> mpatch.rs:68
> +        frag_ofs: p,
> +    } in list.iter().rev()
> +    {

for &Fragment{frag_len, frag_ofs} in list.iter().rev()

> mpatch.rs:86
> +        ref bins,
> +    } = ptc;
> +

Make this one line and don't bother renaming.

> mpatch.rs:120
> +    for plen in plens.iter() {
> +        if frags.len() > 128 {
> +            swap(&mut b1, &mut b2);

Please explain.

> mpatch.rs:137
> +            pull(&mut new, &mut frags, p1 - last);
> +            assert_ne!(frags.len(), 0);
> +            pull(&mut vec![], &mut frags, p2 - p1);

assert!(!frags.is_empty());

> path_encoding.rs:5
> +    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'
> +];
> +

const HEX_DIGIT: [u8; 16] = *b"0123456789abcdef";

> path_encoding.rs:7
> +
> +fn hex_encode(c: usize) -> (char, char) {
> +    (HEX_DIGIT[c >> 4], HEX_DIGIT[c & 15])

c should be a `u8`.

> path_encoding.rs:11
> +
> +fn escape(buf: &mut Vec<char>, c: &char) {
> +    let c = *c as usize;

`Vec<char>` is odd. Is there any reason not to use a `String` or `Vec<u8>`

> path_encoding.rs:11
> +
> +fn escape(buf: &mut Vec<char>, c: &char) {
> +    let c = *c as usize;

Don't pass a `char` by reference. Also it seems your function wants a `u8`.

> path_encoding.rs:22
> +fn encode_dir(p: &String) -> String {
> +    let mut ps: Vec<char> = p.chars().collect();
> +    let len = ps.len();

I don't think you need this.

> path_encoding.rs:23
> +    let mut ps: Vec<char> = p.chars().collect();
> +    let len = ps.len();
> +

This isn't necessary.

> path_encoding.rs:25
> +
> +    if len >= 2 && ps[len - 2] == '.' && (ps[len - 1] == 'i' || ps[len - 1] == 'd') {
> +        ps.extend(".hg".chars());

p.ends_with(".i") || p.ends_with(".d")

> path_encoding.rs:34
> +
> +fn encode_fn(p: &String) -> String {
> +    let mut ps: Vec<char> = Vec::new();

Take a `&str`

> path_encoding.rs:34
> +
> +fn encode_fn(p: &String) -> String {
> +    let mut ps: Vec<char> = Vec::new();

`encode_file_name`?

> path_encoding.rs:35
> +fn encode_fn(p: &String) -> String {
> +    let mut ps: Vec<char> = Vec::new();
> +

Use a String.

> path_encoding.rs:57
> +        }
> +    }
> +

fn escape(out: &mut String, b: char) {
      unimplemented!()
  }

  pub fn encode_path(path: &str) -> String {
      let mut out = String::with_capacity(path.len());

      for c in path.bytes() {
          let c = c as char;
          match c {
              'A'...'Z' => {
                  out.push('_');
                  out.push(c.to_ascii_lowercase());
              }
              '\\' | ':' | '*' | '?' | '"' | '<' | '>' | '|' => {
                  escape(&mut out, c);
              }
              // The rest of the printable range.
              ' '...'~' => {
                  out.push(c);
              }
              _ => {
                  escape(&mut out, c);
              }
          }
      }

      out
  }

https://godbolt.org/g/3WCQs3

> path_encoding.rs:62
> +
> +pub fn aux_encode(p: &String) -> String {
> +    let mut ps: Vec<char> = Vec::new();

Take a `&str`.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D2057

To: Ivzhh, #hg-reviewers, kevincox
Cc: glandium, krbullock, indygreg, durin42, kevincox, mercurial-devel