differ.txi


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105

@noindent
@code{diff:edit-length} implements the algorithm:

@ifinfo
@example
S. Wu, E. Myers, U. Manber, and W. Miller,
   "An O(NP) Sequence Comparison Algorithm,"
   Information Processing Letters 35, 6 (1990), 317-323.
   @url{http://www.cs.arizona.edu/people/gene/vita.html}
@end example
@end ifinfo
@ifset html
S. Wu, <A HREF="http://www.cs.arizona.edu/people/gene/vita.html">
E. Myers,</A> U. Manber, and W. Miller,
<A HREF="http://www.cs.arizona.edu/people/gene/PAPERS/np_diff.ps">
"An O(NP) Sequence Comparison Algorithm,"</A>
Information Processing Letters 35, 6 (1990), 317-323.
@end ifset

@noindent
The values returned by @code{diff:edit-length} can be used to gauge
the degree of match between two sequences.

@noindent
Surprisingly, "An O(NP) Sequence Comparison Algorithm" does not
derive the edit sequence; only the sequence length.  Developing this
linear-space sub-quadratic-time algorithm for computing the edit
sequence required hundreds of hours of work.  I have submitted a
paper describing the algorithm to the Journal of Computational
Biology.

@noindent
If the items being sequenced are text lines, then the computed
edit-list is equivalent to the output of the @dfn{diff} utility
@cindex diff
program.  If the items being sequenced are words, then it is like the
lesser known @dfn{spiff} program.
@cindex spiff


@defun diff:longest-common-subsequence array1 array2 =? p-lim


@defunx diff:longest-common-subsequence array1 array2 =?
@var{array1} and @var{array2} are one-dimensional arrays.  The procedure @var{=?} is used
to compare sequence tokens for equality.

The non-negative integer @var{p-lim}, if provided, is maximum number of
deletions of the shorter sequence to allow.  @code{diff:longest-common-subsequence} will return @code{#f}
if more deletions would be necessary.

@code{diff:longest-common-subsequence} returns a one-dimensional array of length @code{(quotient (- (+
len1 len2) (diff:edit-length @var{array1} @var{array2})) 2)} holding the longest sequence
common to both @var{array}s.
@end defun

@defun diff:edits array1 array2 =? p-lim


@defunx diff:edits array1 array2 =?
@var{array1} and @var{array2} are one-dimensional arrays.  The procedure @var{=?} is used
to compare sequence tokens for equality.

The non-negative integer @var{p-lim}, if provided, is maximum number of
deletions of the shorter sequence to allow.  @code{diff:edits} will return @code{#f}
if more deletions would be necessary.

@code{diff:edits} returns a vector of length @code{(diff:edit-length @var{array1} @var{array2})} composed
of a shortest sequence of edits transformaing @var{array1} to @var{array2}.

Each edit is an integer:
@table @asis
@item @var{k} > 0
Inserts @code{(array-ref @var{array1} (+ -1 @var{j}))} into the sequence.
@item @var{k} < 0
Deletes @code{(array-ref @var{array2} (- -1 @var{k}))} from the sequence.
@end table
@end defun

@defun diff:edit-length array1 array2 =? p-lim


@defunx diff:edit-length array1 array2 =?
@var{array1} and @var{array2} are one-dimensional arrays.  The procedure @var{=?} is used
to compare sequence tokens for equality.

The non-negative integer @var{p-lim}, if provided, is maximum number of
deletions of the shorter sequence to allow.  @code{diff:edit-length} will return @code{#f}
if more deletions would be necessary.

@code{diff:edit-length} returns the length of the shortest sequence of edits transformaing
@var{array1} to @var{array2}.
@end defun
@example
(diff:longest-common-subsequence "fghiejcklm" "fgehijkpqrlm" eqv?)
@result{} "fghijklm"

(diff:edit-length "fghiejcklm" "fgehijkpqrlm" eqv?)
@result{} 6

(diff:edits "fghiejcklm" "fgehijkpqrlm" eqv?)
@result{} #As32(3 -5 -7 8 9 10)
       ; e  c  h p q  r
@end example