diff-match-patch-rs
A few years back, while building HandyTrain, we decided to build a collaborative content creation feature. Among other things, we needed a text-synchronization library - a WASM version for the client and a high-performance library for our Go and Rust services. After some research we landed on the fantastic diff match patch
algorithms; the diff part is an implementation of this paper (often called the Myers diff algorithm) by Eugene Myers. We soon found the Go package for our services, but none of the Rust implementations at the time were WASM ready. That’s when I decided to port the algorithm based on Google’s Diff Match Patch library.
A few months ago, I finally got the time and opportunity to clean it up a bit and release it as a Rust crate, diff-match-patch-rs.
Why optimize?
Real-time collaboration poses a unique challenge: synchronizing multiple users’ inputs instantly, regardless of network speed or user count. When building our collaborative editor, the text synchronization became a critical bottleneck. To our delight, our initial implementation of diff-match-patch-rs was already outperforming most alternatives.
Motivated by seeing others using our library, I embarked on a quest to transform it from “very fast” to “the fastest”. Here’s what worked, what didn’t, and what I learned along the way.
TL;DR
We achieve a ~40% improvement in the Diff performance of the crate by helping the compiler optimize better.
Diff: Performance Baselines
Before diving into optimizations (both essential and experimental), let’s see where we stand.
Code: The benchmark code is here.
| Lang. | Library | Diff Avg. | Bencher | Mode | Correct |
|---|---|---|---|---|---|
| rust | diff_match_patch v0.1.1 | 56.62 ms | Criterion | - | ✅ |
| rust | dmp v0.2.0 | 56.61 ms | Criterion | - | ✅ |
| rust | diff-match-patch-rs (our) | 55.15 ms | Criterion | Efficient | ✅ |
| rust | diff-match-patch-rs (our) | 55.27 ms | Criterion | Compat | ✅ |
| go | go-diff | 30.31 ms | go test | - | ✅ |
| node | diff-match-patch | 246.90 ms | tinybench | - | ❌ |
| python | diff-match-patch | 1.01 s | timeit | - | ✅ |
Note: The numbers above are from a MacBook M1 Pro machine, for a diff of a large text.
The verdict? While we’re ahead of other Rust, Node and Python implementations, the Go package is leaving us in the dust!
Our mission: squeeze more speed by enabling better compiler optimizations, keeping the core algorithm intact (we’ll tackle that another day). Ready? Let’s roll!
Bottlenecks:
Let’s look at some data! A quick flamegraph analysis reveals where our program is spending most of its time:
Cargo Flamegraph: cargo flamegraph is pretty much the de facto standard. More details here.
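For reference, the graph can be produced with something along these lines (the bench target name and the trailing Criterion `--bench` flag are assumptions based on this repo’s benchmark setup):
// hypothetical invocation: profile the `diff` benchmark and emit flamegraph.svg
cargo flamegraph --bench diff -- --bench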
The `fn bisect()`, responsible for finding the middle snake, dominates execution time. This is expected behavior for the algorithm; optimizing this part should yield the maximum benefit.
Analysing the fn bisect( .. )
Here’s the `fn bisect( .. )` as it stands now.
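The full listing lives in the repo; here is a condensed sketch of its structure - modeled on the classic Google diff-match-patch bisect rather than copied from the crate - with the loop hierarchy marked in comments and the slice indexing left as-is:

```rust
/// A simplified middle-snake search in the spirit of diff-match-patch's
/// `diff_bisect`. Illustrative sketch only, NOT the crate's actual code:
/// it works on bytes and returns the (x, y) split point where the forward
/// and reverse paths meet, or None when the inputs share nothing at all.
fn bisect_sketch(old: &[u8], new: &[u8]) -> Option<(usize, usize)> {
    // Trivial cases (empty / single-element inputs) are assumed to be
    // handled before we ever get here, as in the real diff flow.
    assert!(old.len() >= 2 && new.len() >= 2);

    let old_len = old.len() as isize;
    let new_len = new.len() as isize;
    let max_d = (old_len + new_len + 1) / 2;
    let v_offset = max_d;
    let v_len = (2 * max_d) as usize;

    // Furthest-reaching x per diagonal k, for the forward and reverse passes.
    let mut v1 = vec![-1isize; v_len];
    let mut v2 = vec![-1isize; v_len];
    v1[(v_offset + 1) as usize] = 0;
    v2[(v_offset + 1) as usize] = 0;

    let delta = old_len - new_len;
    let front = delta % 2 != 0;
    let (mut k1start, mut k1end, mut k2start, mut k2end) = (0isize, 0isize, 0isize, 0isize);

    for d in 0..max_d {
        // Loop: A
        // Loop: A.1 -- walk the forward path one step per diagonal.
        let mut k1 = -d + k1start;
        while k1 <= d - k1end {
            let k1_offset = (v_offset + k1) as usize;
            let mut x1 = if k1 == -d || (k1 != d && v1[k1_offset - 1] < v1[k1_offset + 1]) {
                v1[k1_offset + 1]
            } else {
                v1[k1_offset - 1] + 1
            };
            let mut y1 = x1 - k1;
            // Loop: A.1.1 -- follow the snake of equal elements.
            while x1 < old_len && y1 < new_len && old[x1 as usize] == new[y1 as usize] {
                x1 += 1;
                y1 += 1;
            }
            v1[k1_offset] = x1;
            if x1 > old_len {
                k1end += 2; // ran off the bottom of the graph
            } else if y1 > new_len {
                k1start += 2; // ran off the right of the graph
            } else if front {
                let k2_offset = v_offset + delta - k1;
                if k2_offset >= 0 && (k2_offset as usize) < v_len && v2[k2_offset as usize] != -1 {
                    // Overlap with the reverse path => this is the split point.
                    let x2 = old_len - v2[k2_offset as usize];
                    if x1 >= x2 {
                        return Some((x1 as usize, y1 as usize));
                    }
                }
            }
            k1 += 2;
        }
        // Loop: A.2 -- walk the reverse path one step per diagonal.
        let mut k2 = -d + k2start;
        while k2 <= d - k2end {
            let k2_offset = (v_offset + k2) as usize;
            let mut x2 = if k2 == -d || (k2 != d && v2[k2_offset - 1] < v2[k2_offset + 1]) {
                v2[k2_offset + 1]
            } else {
                v2[k2_offset - 1] + 1
            };
            let mut y2 = x2 - k2;
            // Loop: A.2.1 -- follow the snake backwards from the ends.
            while x2 < old_len
                && y2 < new_len
                && old[(old_len - x2 - 1) as usize] == new[(new_len - y2 - 1) as usize]
            {
                x2 += 1;
                y2 += 1;
            }
            v2[k2_offset] = x2;
            if x2 > old_len {
                k2end += 2;
            } else if y2 > new_len {
                k2start += 2;
            } else if !front {
                let k1_offset = v_offset + delta - k2;
                if k1_offset >= 0 && (k1_offset as usize) < v_len && v1[k1_offset as usize] != -1 {
                    let x1 = v1[k1_offset as usize];
                    let y1 = v_offset + x1 - k1_offset;
                    if x1 >= old_len - x2 {
                        return Some((x1 as usize, y1 as usize));
                    }
                }
            }
            k2 += 2;
        }
    }
    None
}
```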
Note the lines that index into the slices and the comments demarcating the loop hierarchy!
That’s a lot of nested looping; moreover, there’s a lot of direct indexing happening in these hot loops, which is generally considered a performance bottleneck because of Rust’s bound checks. Let’s look for some hints in the assembly the compiler generates for this function.
Generating Assembly or IR: We are going to be using the crate cargo-show-asm (a possible invocation is sketched after the list below). For brevity I’m not pasting the ~700 lines of asm in this blog, but these are my observations:
- The generated assembly does contain a few `panic_bounds_check`, `index_start_len_fail`, `index_end_len_fail` and `index_order_fail` calls - these are all various forms of bound checks, with presumably heavy performance penalties in a hot loop.
- I had always assumed that `x * 2^n` or `x / 2^n` would be optimized to `bit-shift` operations - that isn’t reflected in the generated assembly.
- In general, the `loop(loop(loop))` nesting has generated almost unreadably convoluted assembly (at least for me).
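For reference, a dump like the one I describe can be produced with cargo-show-asm along these lines (the function path here is a guess, not the crate’s actual module layout):
// hypothetical invocation: print the assembly for bisect, with Rust source interleaved
cargo asm --lib --rust "dmp::DiffMatchPatch::bisect"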
Optimization 1: `bit-shift` replacements for `2^n` ops
Hypothesis:
Replacing mathematical operations with bit-shift
alternatives wherever possible would yield performance gains.
Approach:
Replace the `* 2^n`, `/ 2^n` and `delta % 2 != 0` operations with their bit-shift equivalents and, while we’re at it, remove the redundant variable `max_d`.
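The change itself is tiny; a minimal sketch of the before/after (variable names follow the prose above, not necessarily the crate’s exact code):

```rust
// Illustrative only: the 2^n arithmetic in the bisect setup, before and after.
fn bisect_setup(old_len: usize, new_len: usize, delta: isize) -> (usize, usize, bool) {
    // Before:
    //   let max_d    = (old_len + new_len + 1) / 2;
    //   let v_offset = max_d;
    //   let v_len    = 2 * max_d;
    //   let front    = delta % 2 != 0;

    // After: shifts and a mask, with the redundant `max_d` binding removed.
    let v_offset = (old_len + new_len + 1) >> 1; // x / 2  ->  x >> 1
    let v_len = v_offset << 1;                   // 2 * x  ->  x << 1
    let front = (delta & 1) != 0;                // x % 2  ->  x & 1
    (v_offset, v_len, front)
}
```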
Now let’s run the tests
once again to verify that our modifications have not broken the algorithm!
cargo test --release -- --nocapture
And the tests pass, so time to re-run the benchmark!
// in `diff-match-patch-rs` repo
cargo bench --bench diff
Results:
Voilà 🎉🎉 - after multiple passes, I’m reasonably certain that our micro-optimization has yielded some benefits. The average `time/iter` has reduced to 54.78 ms (▼ 0.56%) for the `Efficient` mode and 54.83 ms (▼ 1.03%) for the `Compat` mode. Not very impressive, but it’s a start.
`bit-shift` helps: Converting `2^n` multiplication and division operations to `bit-shift` had a positive (minor) performance impact!
Optimization 2: Reduce `Vec<T>` allocations!
Hypothesis:
Reducing known vector allocation calls (currently 2 during the setup of `bisect( .. )`) would result in performance gains!
Approach:
Instead of allocating `v1` and `v2` separately, we’ll allocate one vector `v` and then use the `split_at_mut()` method to obtain the required slices.
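A minimal sketch of the idea (assumed shape, not the crate’s exact code):

```rust
// One backing allocation for both scratch arrays instead of two `vec![..]` calls.
fn scratch_buffers(v_len: usize) {
    debug_assert!(v_len > 0);

    // Before: two allocations.
    // let mut v1 = vec![-1isize; v_len];
    // let mut v2 = vec![-1isize; v_len];

    // After: a single allocation, split into two non-overlapping mutable slices.
    let mut v = vec![-1isize; v_len << 1];
    let (v1, v2) = v.split_at_mut(v_len);

    // `v1` and `v2` are `&mut [isize]` and can be indexed exactly like before.
    v1[0] = 0;
    v2[0] = 0;
}
```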
Results:
Meaningful improvements over the baseline - as it stands now, the average time/iter
has reduced to 53.7 ms (▼ 2.6%) for the Efficient
mode and 54.0 ms (▼ 2.42%) for the Compat
mode.
Reducing allocation calls in a hot loop: The first meaningful improvement came from reducing the 2 vector allocation calls to a single call followed by `split_at_mut( .. )`.
Optimization 3: Loop simplification
Hypothesis:
Our current `bisect( ... )` method contains multiple tiers of nested loops, which is preventing the compiler from fully optimizing it; isolating the loop logic would give the compiler an opportunity to reason about it.
Approach:
Target the innermost loop - the idea is to move upwards in the nesting - and see if we can squeeze out some performance gains! Let’s try and isolate the loop into its own function - as Alex Kladov puts it, bring the code closer to the compiler’s nose.
To begin with, let’s just move the leaf loop (Loop: A.1.1), the `while x1 < old_len && y1 < new_len { ... }` inside the forward graph traversal pass, to a separate function!
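Something along these lines (a hedged sketch - the crate’s real helper is `fn bisect_fwd_path_i<'a, T: DType>( .. )`, generic over its character type; here I just use bytes and a hypothetical name):

```rust
/// The leaf loop (Loop: A.1.1) pulled out into its own function: walk the
/// "snake" of equal elements and report how far x1/y1 advanced.
/// Hypothetical name and signature, for illustration only.
#[inline]
fn bisect_fwd_path_sketch(old: &[u8], new: &[u8], mut x1: usize, mut y1: usize) -> (usize, usize) {
    while x1 < old.len() && y1 < new.len() && old[x1] == new[y1] {
        x1 += 1;
        y1 += 1;
    }
    (x1, y1)
}
```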
And we modify our `fn bisect( .. )` to call it instead of running the loop inline.
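The call site in the forward pass then shrinks to a single call into the helper (again a hedged sketch, reusing the hypothetical helper above):

```rust
/// A sketch of the forward-pass step after the extraction (hypothetical shape):
/// the furthest-reaching x for diagonal k1 is picked from v1 as before, and the
/// former inline snake loop is now just a call to the helper.
fn forward_step_sketch(old: &[u8], new: &[u8], x_from_v: usize, k1: isize) -> (usize, usize) {
    let y1 = (x_from_v as isize - k1) as usize;
    bisect_fwd_path_sketch(old, new, x_from_v, y1) // was Loop: A.1.1, inlined
}
```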
We have our first MAJOR improvement from the baseline: the average `time/iter` has reduced to 50.97 ms (▼ 7.55%) for the `Efficient` mode and 51.27 ms (▼ 2.46%) for the `Compat` mode.
So our hypothesis holds true - isolating the loop truly gave the Rust compiler an opportunity to reason about the code better! Let’s apply the same technique to the `reverse` traversal leaf loop.
Results:
And the results are highly encouraging! We are looking at a cumulative gain of ~16% over our baseline - our average timings now stand at 45.97 ms (▼ 16.62%) in Efficient
mode and 46.79 ms (▼ 15.55%) for Compat
mode.
Loop simplification works! We get a massive ~16% jump over the baseline by simplifying the innermost loop!
Optimization 4: Vectorization
Let’s look at the `fn bisect_fwd_path_i<'a, T: DType>( .. )` we just created. It’s a simple loop that compares one element at a time until it hits a mismatch. Can we vectorize this?
Hypothesis:
Rust’s standard iterators are better suited to compiler optimizations! Moreover, we should be able to hint the compiler towards auto-vectorization of the code, which would give us a significant performance boost!
Approach:
Recently I came across the blog post Can you trust a compiler to optimize your code?, where Alex writes about hinting the compiler to auto-vectorize a loop by working through fixed-size chunks (16 in their case). Let’s try and apply the same to our `fn bisect_fwd_path_i< .. >( .. )`.
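A hedged sketch of what that attempt looked like in spirit (chunk size of 16 with a scalar tail for the remainder - not the crate’s exact code):

```rust
/// Compare the two sides a fixed-size chunk at a time, hoping the 16-byte
/// equality check gets auto-vectorized, then finish element by element.
/// Illustrative sketch only.
fn bisect_fwd_path_chunked(old: &[u8], new: &[u8], mut x1: usize, mut y1: usize) -> (usize, usize) {
    const CHUNK: usize = 16;
    for (a, b) in old[x1..].chunks_exact(CHUNK).zip(new[y1..].chunks_exact(CHUNK)) {
        if a != b {
            break; // the mismatch is somewhere inside this chunk
        }
        x1 += CHUNK;
        y1 += CHUNK;
    }
    // Scalar tail: walk the last (partial or mismatching) chunk one element at a time.
    while x1 < old.len() && y1 < new.len() && old[x1] == new[y1] {
        x1 += 1;
        y1 += 1;
    }
    (x1, y1)
}
```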
Results:
We observe a massive performance regression of ~20%. Our hypothesis is invalidated.
Further analysis:
In fact, even if we remove the `chunks_exact()` based attempt at vectorization and just express the `while x1 < old.len() && y1 < new.len() { .. }` loop with an iterator, we still see significantly worse performance! That’s puzzling - popular wisdom holds that Rust iterators are supposed to get optimized better!
Let’s look at the generated assembly!
Comparing the assembly generated for our simple `while x1 < old.len() && y1 < new.len() { .. }` loop against the assembly generated when we replace the `while .. {}` with an iterator (both listings omitted here for brevity), the overhead of the iterator setup in a hot loop appears to introduce a significant performance penalty!
Hypothesis Invalidated
- Iterators are not always guaranteed to yield the most optimized code.
- Couldn’t get vectorization to work! We’ll take a deeper look at this later.
Reverting back to the best state so far!
Optimization 5: Bound-checks
We are looking at a very hot loop, and in it we have been indexing into `slice[i]` very frequently! This is known to be slow because of the bound checks introduced by the Rust compiler. Bound checks are necessary for Rust’s safety guarantees - they prevent a notorious class of bugs and vulnerabilities related to buffer overflows, some of which we know all too well.
Anyway, how do we get around bound checks without resorting to `unsafe {}` Rust?
Recommended reading: I found this blog post by Sergey Davidoff about bound checks very helpful, and we are going to be using some of these techniques during our optimization attempt!
Hypothesis:
Alleviating bound checks would yield significant performance benefits!
Approach:
Whenever we index into a slice, we’ll try and give the compiler the guarantee that our slice[index]
is within bounds and thereby reduce the need to introduce bound checks.
What does a bound check look like?
Let’s look at the generated assembly for our fn bisect<'a, T: DType>( .. )
.
// code omitted ..
Lloh299:
add x2, x2, l___unnamed_43@PAGEOFF
// diff-match-patch-rs/src/dmp.rs : 758
|| (k2 != d && v2[k2_offset as usize - 1] < v2[k2_offset as usize + 1])
b LBB77_95
// code omitted ..
LBB77_95:
mov x1, x25
bl core::panicking::panic_bounds_check
b LBB77_103
LBB77_96:
mov x22, #0
LBB77_97:
// .rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/alloc/src/raw_vec.rs : 420
Err(err) => handle_error(err),
mov x0, x22
mov x1, x28
bl alloc::raw_vec::handle_error
// code omitted ..
In this excerpt of the generated assembly we can see that, at one of our indexing points, the code branches to `core::panicking::panic_bounds_check`.
Let’s try and see if we can tell the compiler that it’s not required!
Wherever we have resorted to indexing into a slice, we have added a condition to ensure our index is in bounds!
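A minimal sketch of the pattern (assumed shape, not the crate’s exact code; it mirrors one of the techniques from the blog post above):

```rust
/// Picking the furthest-reaching x for a diagonal, as in the bisect loop.
/// The up-front assert gives the compiler a proof that `k1_offset - 1`,
/// `k1_offset` and `k1_offset + 1` are all in range, so it can (usually)
/// drop the per-access bound checks and their panic branches from the hot path.
/// Illustrative only.
fn fwd_reach(v1: &mut [isize], k1_offset: usize, k1: isize, d: isize) -> isize {
    assert!(k1_offset >= 1 && k1_offset + 1 < v1.len());
    let x1 = if k1 == -d || (k1 != d && v1[k1_offset - 1] < v1[k1_offset + 1]) {
        v1[k1_offset + 1] // no panic_bounds_check needed for these accesses
    } else {
        v1[k1_offset - 1] + 1
    };
    v1[k1_offset] = x1;
    x1
}
```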
Results:
And with that we have a whopping gain of ~40% over our baseline: our `Efficient` mode diff now stands at 32.89 ms (▼ 40.45%), and the `Compat` mode sees a similar jump in performance with the average `time/iter` at 33.37 ms (▼ 39.77%).
Hypothesis Validated: Reducing bound checks in a hot loop gives us a massive jump in performance; we are now faster than most implementations by a reasonable margin!
The Go implementation is still faster!
Well, it looks like the go-diff implementation is skipping a few optimization steps - at least on the benchmark test data used here, the go-diff package generates far more diffs than the other implementations. While every other implementation we have benchmarked against generates ~2300 diffs, the Go implementation yields ~3000 diffs, which would also explain why the Patch flow of the Go package is slower by a large margin.
Benchmarks as they stand now:
| Lang. | Library | Diff Avg. | Bencher | Mode | Correct |
|---|---|---|---|---|---|
| rust | diff_match_patch v0.1.1 | 56.49 ms | Criterion | - | ✅ |
| rust | dmp v0.2.0 | 56.68 ms | Criterion | - | ✅ |
| rust | diff-match-patch-rs (our) | 32.88 ms | Criterion | Efficient | ✅ |
| rust | diff-match-patch-rs (our) | 33.37 ms | Criterion | Compat | ✅ |
| go | go-diff | 30.31 ms | go test | - | ❗ |
| node | diff-match-patch | 246.90 ms | tinybench | - | ❌ |
| python | diff-match-patch | 1.01 s | timeit | - | ✅ |
Conclusion & Takeaways:
Well, we did go from very fast to nearly the fastest implementation of diff-match-patch
. The journey doesn’t end here! I have plans for some algorithmic improvements - but that’s an experiment for some other time.
Success
- Optimization 1: Converting numerical operations to `bit-shift` when possible.
- Optimization 2: Reducing known `Vec<T>` allocations - this is nothing new, it just goes to show how much of a gain can be achieved in a loop.
- Optimization 3: Loop simplification and bringing the code closer to the compiler’s nose helped the compiler reason better and, as a result, optimize better.
- Optimization 5: The most significant performance gain comes from removing bound checks in hot loops! We all know this - but I didn’t expect the gain to be this significant!
Failure and Myth Debunked
- Iterators are not always guaranteed to generate the most optimal assembly; iterator construction in a hot loop introduced overheads for us, which was detrimental to performance.
- Couldn’t get `auto-vectorization` to kick in!