 9S    \documentclass[a4paper]{article}
\begin{document}


\title{The rsync algorithm}

\author{Andrew Tridgell \quad\quad Paul Mackerras\\
Department of Computer Science \\
Australian National University \\
Canberra, ACT 0200, Australia}

\maketitle

\begin{abstract}
  This report presents an algorithm for updating a file on one machine
  to be identical to a file on another machine.  We assume that the
  two machines are connected by a low-bandwidth, high-latency
  bi-directional communications link.  The algorithm identifies parts
  of the source file which are identical to some part of the
  destination file, and only sends those parts which cannot be matched
  in this way.  Effectively, the algorithm computes a set of
  differences without having both files on the same machine.  The
  algorithm works best when the files are similar, but will also
  function correctly and reasonably efficiently when the files are
  quite different.
\end{abstract}

\section{The problem}

Imagine you have two files, $A$ and $B$, and you wish to update $B$ to be
the same as $A$. The obvious method is to copy $A$ onto $B$.

Now imagine that the two files are on machines connected by a slow
communications link, for example a dialup IP link.  If $A$ is large,
copying $A$ onto $B$ will be slow.  To make it faster you could
compress $A$ before sending it, but that will usually only gain a
factor of 2 to 4.

Now assume that $A$ and $B$ are quite similar, perhaps both derived
from the same original file. To really speed things up you would need
to take advantage of this similarity. A common method is to send just
the differences between $A$ and $B$ down the link and then use this
list of differences to reconstruct the file.

The problem is that the normal methods for creating a set of
differences between two files rely on being able to read both files.
Thus they require that both files be available beforehand at one end
of the link.  If the files are not both available on the same machine,
these algorithms cannot be used: once you had copied one file over to
compute the differences, you would no longer need them.  This is the
problem that rsync addresses.

The rsync algorithm efficiently computes which parts of a source file
match some part of an existing destination file.  These parts need not
be sent across the link; all that is needed is a reference to the part
of the destination file.  Only parts of the source file which are not
matched in this way need to be sent verbatim.  The receiver can then
construct a copy of the source file using the references to parts of
the existing destination file and the verbatim material.
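
As a purely illustrative sketch (the file contents, the four-byte part
size and the labels $B_0$ and $B_1$ are assumptions made up for this
example), suppose the destination file is the eight bytes
\texttt{abcdefgh}, which we view as two parts $B_0 = \mathtt{abcd}$
and $B_1 = \mathtt{efgh}$, and the source file is the ten bytes
\texttt{abcdXYefgh}.  The source file can then be described as
\[
  \underbrace{\mathtt{abcd}}_{\mbox{\scriptsize reference to $B_0$}}
  \;\underbrace{\mathtt{XY}}_{\mbox{\scriptsize verbatim}}
  \;\underbrace{\mathtt{efgh}}_{\mbox{\scriptsize reference to $B_1$}}
\]
so only the two bytes \texttt{XY} need to cross the link verbatim,
together with two short references.  The receiver rebuilds all ten
bytes by copying $B_0$ and $B_1$ from its own file around the verbatim
bytes.  Note that the second match starts at offset 6 in the source
file, which is not a multiple of the part size.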

Trivially, the data sent to the receiver can be compressed using any
of a range of common compression algorithms, for further speed
improvements.

\section{The rsync algorithm}

Suppose we have two general purpose computers $\alpha$ and $\beta$.
Computer $\alpha$ has access to a file $A$ and $\beta$ has access to
file $B$, where $A$ and $B$ are ``similar''.  There is a slow
communications link between $\alpha$ and $\beta$.

The rsync algorithm consists of the following steps:

\begin{enumerate}
\item $\beta$ splits the file $B$ into a series of non-overlapping
  fixed-sized blocks of size $S$ bytes\footnote{We have found that
  values of $S$ between 500 and 1000 are quite good for most purposes.}.
  The last block may be shorter than $S$ bytes.

\item For each of these blocks $\beta$ calculates two checksums:
  a weak ``rolling'' 32-bit checksum (described below) and a strong
  128-bit MD4 checksum.

\item $\beta$ sends these checksums to $\alpha$.
  
\item $\alpha$ searches through $A$ to find all blocks of length $S$
  bytes (at any offset, not just multiples of $S$) that have the same
  weak and strong checksums as one of the blocks of $B$. This can be
  done in a single pass very quickly using a special property of the
  rolling checksum described below.
  
\item $\alpha$ sends $\beta$ a sequence of instructions for
  constructing a copy of $A$.  Each instruction is either a reference
  to a block of $B$, or literal da