diff-match-patch-rs
A few years back, while building HandyTrain, we decided to build a collaborative content creation feature. Among other things, we needed a text-synchronization library - a WASM version for the client and a high-performance library for our Go and Rust services. After some research we landed on the fantastic diff match patch
algorithms; the diff part is an implementation of this paper (often called the Myers diff algorithm) by Eugene Myers. We soon found the Go package for our services, but none of the Rust implementations at the time were WASM ready. That’s when I decided to port the algorithm based on Google’s Diff Match Patch library.
A few months ago, I finally got the time and opportunity to clean it up a bit and release it as a Rust crate, diff-match-patch-rs.
Why optimize?
Real-time collaboration poses a unique challenge: synchronizing multiple users’ inputs instantly, regardless of network speed or user count. When building our collaborative editor, the text synchronization became a critical bottleneck. To our delight, our initial implementation of diff-match-patch-rs was already outperforming most alternatives.
Motivated by seeing others using our library, I embarked on a quest to transform it from “very fast” to “the fastest”. Here’s what worked, what didn’t, and what I learned along the way.
TL;DR
We achieve a ~40% improvement in the Diff performance of the crate by helping the compiler optimize better.
Diff: Performance Baselines
Before diving into optimizations (both essential and experimental), let’s see where we stand.
Code: The benchmark code is here.
| Lang. | Library | Diff Avg. | Bencher | Mode | Correct |
|---|---|---|---|---|---|
| rust | diff_match_patch v0.1.1 | 56.62 ms | Criterion | - | ✅ |
| rust | dmp v0.2.0 | 56.61 ms | Criterion | - | ✅ |
| rust | diff-match-patch-rs (our) | 55.15 ms | Criterion | Efficient | ✅ |
| rust | diff-match-patch-rs (our) | 55.27 ms | Criterion | Compat | ✅ |
| go | go-diff | 30.31 ms | go test | - | ✅ |
| node | diff-match-patch | 246.90 ms | tinybench | - | ❌ |
| python | diff-match-patch | 1.01 s | timeit | - | ✅ |
Note: The numbers above are from a MacBook M1 Pro machine, for a diff of a large text.
The verdict? While we’re ahead of other Rust, Node and Python implementations, the Go package is leaving us in the dust!
Our mission: squeeze more speed by enabling better compiler optimizations, keeping the core algorithm intact (we’ll tackle that another day). Ready? Let’s roll!
Bottlenecks:
Let’s look at some data! A quick flamegraph analysis reveals where our program is spending most of its time:
Cargo Flamegraph: cargo flamegraph is pretty much the de facto standard. More details here.
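For reference, the graph can be produced with something along these lines (the bench target name and the trailing Criterion `--bench` flag are assumptions based on this repo’s benchmark setup):
// hypothetical invocation: profile the `diff` benchmark and emit flamegraph.svg
cargo flamegraph --bench diff -- --bench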
The `fn bisect()`, responsible for finding the middle snake, dominates execution time. This is expected behavior for the algorithm; optimizing this part should yield the maximum benefit.
Analysing the fn bisect( .. )
Here’s the `fn bisect( .. )` as it stands now.
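The full listing lives in the repo; here is a condensed sketch of its structure - modeled on the classic Google diff-match-patch bisect rather than copied from the crate - with the loop hierarchy marked in comments and the slice indexing left as-is:

```rust
/// A simplified middle-snake search in the spirit of diff-match-patch's
/// `diff_bisect`. Illustrative sketch only, NOT the crate's actual code:
/// it works on bytes and returns the (x, y) split point where the forward
/// and reverse paths meet, or None when the inputs share nothing at all.
fn bisect_sketch(old: &[u8], new: &[u8]) -> Option<(usize, usize)> {
    // Trivial cases (empty / single-element inputs) are assumed to be
    // handled before we ever get here, as in the real diff flow.
    assert!(old.len() >= 2 && new.len() >= 2);

    let old_len = old.len() as isize;
    let new_len = new.len() as isize;
    let max_d = (old_len + new_len + 1) / 2;
    let v_offset = max_d;
    let v_len = (2 * max_d) as usize;

    // Furthest-reaching x per diagonal k, for the forward and reverse passes.
    let mut v1 = vec![-1isize; v_len];
    let mut v2 = vec![-1isize; v_len];
    v1[(v_offset + 1) as usize] = 0;
    v2[(v_offset + 1) as usize] = 0;

    let delta = old_len - new_len;
    let front = delta % 2 != 0;
    let (mut k1start, mut k1end, mut k2start, mut k2end) = (0isize, 0isize, 0isize, 0isize);

    for d in 0..max_d {
        // Loop: A
        // Loop: A.1 -- walk the forward path one step per diagonal.
        let mut k1 = -d + k1start;
        while k1 <= d - k1end {
            let k1_offset = (v_offset + k1) as usize;
            let mut x1 = if k1 == -d || (k1 != d && v1[k1_offset - 1] < v1[k1_offset + 1]) {
                v1[k1_offset + 1]
            } else {
                v1[k1_offset - 1] + 1
            };
            let mut y1 = x1 - k1;
            // Loop: A.1.1 -- follow the snake of equal elements.
            while x1 < old_len && y1 < new_len && old[x1 as usize] == new[y1 as usize] {
                x1 += 1;
                y1 += 1;
            }
            v1[k1_offset] = x1;
            if x1 > old_len {
                k1end += 2; // ran off the bottom of the graph
            } else if y1 > new_len {
                k1start += 2; // ran off the right of the graph
            } else if front {
                let k2_offset = v_offset + delta - k1;
                if k2_offset >= 0 && (k2_offset as usize) < v_len && v2[k2_offset as usize] != -1 {
                    // Overlap with the reverse path => this is the split point.
                    let x2 = old_len - v2[k2_offset as usize];
                    if x1 >= x2 {
                        return Some((x1 as usize, y1 as usize));
                    }
                }
            }
            k1 += 2;
        }
        // Loop: A.2 -- walk the reverse path one step per diagonal.
        let mut k2 = -d + k2start;
        while k2 <= d - k2end {
            let k2_offset = (v_offset + k2) as usize;
            let mut x2 = if k2 == -d || (k2 != d && v2[k2_offset - 1] < v2[k2_offset + 1]) {
                v2[k2_offset + 1]
            } else {
                v2[k2_offset - 1] + 1
            };
            let mut y2 = x2 - k2;
            // Loop: A.2.1 -- follow the snake backwards from the ends.
            while x2 < old_len
                && y2 < new_len
                && old[(old_len - x2 - 1) as usize] == new[(new_len - y2 - 1) as usize]
            {
                x2 += 1;
                y2 += 1;
            }
            v2[k2_offset] = x2;
            if x2 > old_len {
                k2end += 2;
            } else if y2 > new_len {
                k2start += 2;
            } else if !front {
                let k1_offset = v_offset + delta - k2;
                if k1_offset >= 0 && (k1_offset as usize) < v_len && v1[k1_offset as usize] != -1 {
                    let x1 = v1[k1_offset as usize];
                    let y1 = v_offset + x1 - k1_offset;
                    if x1 >= old_len - x2 {
                        return Some((x1 as usize, y1 as usize));
                    }
                }
            }
            k2 += 2;
        }
    }
    None
}
```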
Note the lines that index into the slices and the comments demarcating the loop hierarchy!
That’s a lot of nested looping; moreover, there’s a lot of direct indexing happening in these hot loops, which is generally considered a performance bottleneck because of Rust’s bound checks. Let’s look for some hints in the assembly the compiler generates for this function.
Generating Assembly or IR: We are going to be using the crate cargo-show-asm (a possible invocation is sketched after the list below). For brevity I’m not pasting the ~700 lines of asm in this blog, but these are my observations:
- The generated assembly does contain a few `panic_bounds_check`, `index_start_len_fail`, `index_end_len_fail` and `index_order_fail` calls - these are all various forms of bound checks, with presumably heavy performance penalties in a hot loop.
- I had always assumed that `x * 2^n` or `x / 2^n` would be optimized to `bit-shift` operations - that isn’t reflected in the generated assembly.
- In general, the `loop(loop(loop))` nesting has generated almost unreadably convoluted assembly (at least for me).
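For reference, a dump like the one I describe can be produced with cargo-show-asm along these lines (the function path here is a guess, not the crate’s actual module layout):
// hypothetical invocation: print the assembly for bisect, with Rust source interleaved
cargo asm --lib --rust "dmp::DiffMatchPatch::bisect"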
Optimization 1: `bit-shift` replacements for `2^n` ops
Hypothesis:
Replacing mathematical operations with bit-shift
alternatives wherever possible would yield performance gains.
Approach:
Replace the `* 2^n`, `/ 2^n` and `delta % 2 != 0` operations with their bit-shift equivalents and, while we’re at it, remove the redundant variable `max_d`.
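The change itself is tiny; a minimal sketch of the before/after (variable names follow the prose above, not necessarily the crate’s exact code):

```rust
// Illustrative only: the 2^n arithmetic in the bisect setup, before and after.
fn bisect_setup(old_len: usize, new_len: usize, delta: isize) -> (usize, usize, bool) {
    // Before:
    //   let max_d    = (old_len + new_len + 1) / 2;
    //   let v_offset = max_d;
    //   let v_len    = 2 * max_d;
    //   let front    = delta % 2 != 0;

    // After: shifts and a mask, with the redundant `max_d` binding removed.
    let v_offset = (old_len + new_len + 1) >> 1; // x / 2  ->  x >> 1
    let v_len = v_offset << 1;                   // 2 * x  ->  x << 1
    let front = (delta & 1) != 0;                // x % 2  ->  x & 1
    (v_offset, v_len, front)
}
```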
Now let’s run the tests
once again to verify that our modifications have not broken the algorithm!
cargo test --release -- --nocapture
And the tests pass, so time to re-run the benchmark!
// in `diff-match-patch-rs` repo
cargo bench --bench diff
Results:
Voilà 🎉🎉 - after multiple passes, I’m reasonably certain that our micro-optimization has yielded some benefits. The average `time/iter` has reduced to 54.78 ms (▼ 0.56%) for the `Efficient` mode and 54.83 ms (▼ 1.03%) for the `Compat` mode. Not very impressive, but it’s a start.
`bit-shift` helps: Converting `2^n` multiplication and division operations to `bit-shift` had a positive (minor) performance impact!
Optimization 2: Reduce `Vec<T>` allocations!
Hypothesis:
Reducing known vector allocation calls (currently 2 during the setup of `bisect( .. )`) would result in performance gains!
Approach:
Instead of allocating `v1` and `v2` separately, we’ll allocate one vector `v` and then use the `split_at_mut()` method to obtain the required slices.
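A minimal sketch of the idea (assumed shape, not the crate’s exact code):

```rust
// One backing allocation for both scratch arrays instead of two `vec![..]` calls.
fn scratch_buffers(v_len: usize) {
    debug_assert!(v_len > 0);

    // Before: two allocations.
    // let mut v1 = vec![-1isize; v_len];
    // let mut v2 = vec![-1isize; v_len];

    // After: a single allocation, split into two non-overlapping mutable slices.
    let mut v = vec![-1isize; v_len << 1];
    let (v1, v2) = v.split_at_mut(v_len);

    // `v1` and `v2` are `&mut [isize]` and can be indexed exactly like before.
    v1[0] = 0;
    v2[0] = 0;
}
```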
Results:
Meaningful improvements over the baseline - as it stands now, the average time/iter
has reduced to 53.7 ms (▼ 2.6%) for the Efficient
mode and 54.0 ms (▼ 2.42%) for the Compat
mode.
Reducing allocation calls in a hot loop: The first meaningful improvement came from reducing the 2 vector allocation calls to a single call followed by `split_at_mut( .. )`.
Optimization 3: Loop simplification
Hypothesis:
Our current `bisect( ... )` method contains multiple tiers of nested loops, which is preventing the compiler from fully optimizing it; isolating the loop logic would give the compiler an opportunity to reason about it.
Approach:
Target the innermost loop - the idea is to move upwards in the nesting - and see if we can squeeze out some performance gains! Let’s try and isolate the loop into its own function - as Alex Kladov puts it, bring the code closer to the compiler’s nose.
To begin with, let’s just move the leaf loop (Loop: A.1.1), the `while x1 < old_len && y1 < new_len { ... }` inside the forward graph traversal pass, to a separate function!
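Something along these lines (a hedged sketch - the crate’s real helper is `fn bisect_fwd_path_i<'a, T: DType>( .. )`, generic over its character type; here I just use bytes and a hypothetical name):

```rust
/// The leaf loop (Loop: A.1.1) pulled out into its own function: walk the
/// "snake" of equal elements and report how far x1/y1 advanced.
/// Hypothetical name and signature, for illustration only.
#[inline]
fn bisect_fwd_path_sketch(old: &[u8], new: &[u8], mut x1: usize, mut y1: usize) -> (usize, usize) {
    while x1 < old.len() && y1 < new.len() && old[x1] == new[y1] {
        x1 += 1;
        y1 += 1;
    }
    (x1, y1)
}
```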
And we modify our `fn bisect( .. )` to call it instead of running the loop inline.
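The call site in the forward pass then shrinks to a single call into the helper (again a hedged sketch, reusing the hypothetical helper above):

```rust
/// A sketch of the forward-pass step after the extraction (hypothetical shape):
/// the furthest-reaching x for diagonal k1 is picked from v1 as before, and the
/// former inline snake loop is now just a call to the helper.
fn forward_step_sketch(old: &[u8], new: &[u8], x_from_v: usize, k1: isize) -> (usize, usize) {
    let y1 = (x_from_v as isize - k1) as usize;
    bisect_fwd_path_sketch(old, new, x_from_v, y1) // was Loop: A.1.1, inlined
}
```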
We have our first MAJOR improvement from the baseline: the average `time/iter` has reduced to 50.97 ms (▼ 7.55%) for the `Efficient` mode and 51.27 ms (▼ 2.46%) for the `Compat` mode.
So our hypothesis holds true - isolating the loop truly gave the Rust compiler an opportunity to reason about the code better! Let’s apply the same technique to the `reverse` traversal leaf loop.
Results:
And the results are highly encouraging! We are looking at a cumulative gain of ~16% over our baseline - our average timings now stand at 45.97 ms (▼ 16.62%) in Efficient
mode and 46.79 ms (▼ 15.55%) for Compat
mode.
Loop simplification works! We get a massive ~16% jump over the baseline by simplifying the innermost loop!
Optimization 4: Vectorization
Let’s look at the `fn bisect_fwd_path_i<'a, T: DType>( .. )` we just created. It’s a simple loop that compares one element at a time until it hits a mismatch. Can we vectorize this?
Hypothesis:
Rust’s standard iterators are better suited to compiler optimizations! Moreover, we should be able to hint the compiler towards auto-vectorization of the code, which would give us a significant performance boost!
Approach:
Recently I came across the blog post Can you trust a compiler to optimize your code?, where Alex writes about hinting the compiler to auto-vectorize a loop by working through fixed-size chunks (16 in their case). Let’s try and apply the same to our `fn bisect_fwd_path_i< .. >( .. )`.
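A hedged sketch of what that attempt looked like in spirit (chunk size of 16 with a scalar tail for the remainder - not the crate’s exact code):

```rust
/// Compare the two sides a fixed-size chunk at a time, hoping the 16-byte
/// equality check gets auto-vectorized, then finish element by element.
/// Illustrative sketch only.
fn bisect_fwd_path_chunked(old: &[u8], new: &[u8], mut x1: usize, mut y1: usize) -> (usize, usize) {
    const CHUNK: usize = 16;
    for (a, b) in old[x1..].chunks_exact(CHUNK).zip(new[y1..].chunks_exact(CHUNK)) {
        if a != b {
            break; // the mismatch is somewhere inside this chunk
        }
        x1 += CHUNK;
        y1 += CHUNK;
    }
    // Scalar tail: walk the last (partial or mismatching) chunk one element at a time.
    while x1 < old.len() && y1 < new.len() && old[x1] == new[y1] {
        x1 += 1;
        y1 += 1;
    }
    (x1, y1)
}
```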
Results:
We observe a massive performance regression of ~20%. Our hypothesis is invalidated.
Further analysis:
In fact, even if we remove the `chunks_exact()` based attempt at vectorization and just express the `while x1 < old.len() && y1 < new.len() { .. }` loop with an iterator, we still see significantly worse performance! That’s puzzling - popular wisdom holds that Rust iterators are supposed to get optimized better!
Let’s look at the generated assembly!
Comparing the assembly generated for our simple `while x1 < old.len() && y1 < new.len() { .. }` loop against the assembly generated when we replace the `while .. {}` with an iterator (both listings omitted here for brevity), the overhead of the iterator setup in a hot loop appears to introduce a significant performance penalty!
Hypothesis Invalidated
- Iterators are not always guaranteed to yield the most optimized code.
- Couldn’t get vectorization to work! We’ll take a deeper look at this later.
Reverting back to the best state so far!
Optimization 5: Bound-checks
We are looking at a very hot loop, and in it we have been indexing into `slice[i]` very frequently! This is known to be slow because of the bound checks introduced by the Rust compiler. Bound checks are necessary for Rust’s safety guarantees - they prevent a notorious class of bugs and vulnerabilities related to buffer overflows, some of which we know all too well.
Anyway, how do we get around bound checks without resorting to `unsafe {}` Rust?
Recommended reading: I found this blog post by Sergey Davidoff about bound checks very helpful, and we are going to be using some of these techniques during our optimization attempt!
Hypothesis:
Alleviating bound checks would yield significant performance benefits!
Approach:
Whenever we index into a slice, we’ll try and give the compiler the guarantee that our slice[index]
is within bounds and thereby reduce the need to introduce bound checks.
What does a bound check look like?
Let’s look at the generated assembly for our fn bisect<'a, T: DType>( .. )
.
// code omitted ..
Lloh299:
add x2, x2, l___unnamed_43@PAGEOFF
// diff-match-patch-rs/src/dmp.rs : 758
|| (k2 != d && v2[k2_offset as usize - 1] < v2[k2_offset as usize + 1])
b LBB77_95
// code omitted ..
LBB77_95:
mov x1, x25
bl core::panicking::panic_bounds_check
b LBB77_103
LBB77_96:
mov x22, #0
LBB77_97:
// .rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/alloc/src/raw_vec.rs : 420
Err(err) => handle_error(err),
mov x0, x22
mov x1, x28
bl alloc::raw_vec::handle_error
// code omitted ..
In this excerpt of the generated assembly we can see that, at one of our indexing points, the code branches to `core::panicking::panic_bounds_check`.
Let’s try and see if we can tell the compiler that it’s not required!
Wherever we have resorted to indexing into a slice, we have added a condition to ensure our index is in bounds!
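A minimal sketch of the pattern (assumed shape, not the crate’s exact code; it mirrors one of the techniques from the blog post above):

```rust
/// Picking the furthest-reaching x for a diagonal, as in the bisect loop.
/// The up-front assert gives the compiler a proof that `k1_offset - 1`,
/// `k1_offset` and `k1_offset + 1` are all in range, so it can (usually)
/// drop the per-access bound checks and their panic branches from the hot path.
/// Illustrative only.
fn fwd_reach(v1: &mut [isize], k1_offset: usize, k1: isize, d: isize) -> isize {
    assert!(k1_offset >= 1 && k1_offset + 1 < v1.len());
    let x1 = if k1 == -d || (k1 != d && v1[k1_offset - 1] < v1[k1_offset + 1]) {
        v1[k1_offset + 1] // no panic_bounds_check needed for these accesses
    } else {
        v1[k1_offset - 1] + 1
    };
    v1[k1_offset] = x1;
    x1
}
```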
Results:
And with that we have a whopping gain of ~40% over our baseline: our `Efficient` mode diff now stands at 32.89 ms (▼ 40.45%), and the `Compat` mode sees a similar jump in performance with the average `time/iter` at 33.37 ms (▼ 39.77%).
Hypothesis Validated: Reducing bound checks in a hot loop gives us a massive jump in performance; we are now faster than most implementations by a reasonable margin!
The Go implementation is still faster!
Well, it looks like the go-diff implementation is skipping a few optimization steps - at least on the benchmark test data used here, the go-diff package generates far more diffs than the other implementations. While every other implementation we have benchmarked against generates ~2300 diffs, the Go implementation yields ~3000 diffs, which would also explain why the Patch flow of the Go package is slower by a large margin.
Benchmarks as they stand now:
| Lang. | Library | Diff Avg. | Bencher | Mode | Correct |
|---|---|---|---|---|---|
| rust | diff_match_patch v0.1.1 | 56.49 ms | Criterion | - | ✅ |
| rust | dmp v0.2.0 | 56.68 ms | Criterion | - | ✅ |
| rust | diff-match-patch-rs (our) | 32.88 ms | Criterion | Efficient | ✅ |
| rust | diff-match-patch-rs (our) | 33.37 ms | Criterion | Compat | ✅ |
| go | go-diff | 30.31 ms | go test | - | ❗ |
| node | diff-match-patch | 246.90 ms | tinybench | - | ❌ |
| python | diff-match-patch | 1.01 s | timeit | - | ✅ |
Conclusion & Takeaways:
Well, we did go from very fast to nearly the fastest implementation of diff-match-patch
. The journey doesn’t end here! I have plans for some algorithmic improvements - but that’s an experiment for some other time.
Success
- Optimization 1: Converting numerical operations to `bit-shift` when possible.
- Optimization 2: Reducing known `Vec<T>` allocations - this is nothing new, it just goes to show how much of a gain can be achieved in a loop.
- Optimization 3: Loop simplification and bringing the code closer to the compiler’s nose helped the compiler reason better and, as a result, optimize better.
- Optimization 5: The most significant performance gain comes from removing bound checks in hot loops! We all know this - but I didn’t expect the gain to be this significant!
Failure and Myth Debunked
- Iterators are not always guaranteed to generate the most optimal assembly; iterator construction in a hot loop introduced overheads for us, which was detrimental to performance.
- Couldn’t get `auto-vectorization` to kick in!