
CONCISE REPORT ON THE ALGORITHMS IN SIM			970623

INTRODUCTION

The general outline of the similarity checker is as follows:

1. the files are read in (pass 1)
2. a forward-reference table is prepared
3. the set of interesting runs is determined
4. the line numbers of the runs are determined (pass 2)
5. the contents of the runs are printed in order (pass 3)

To keep the memory requirements (relatively) small, the exact positions
of the tokens are not recorded. This necessitates pass 2. See, however,
the pertinent chapter.

READING THE FILES

Each file is tokenized using a lex-generated scanner appropriate for
the input. Each token fits in one byte, possibly using all 8 bits. The
tokens are stored in the array TokenArray[], which is extended by
reallocation if it overflows. See tokenarray.c.
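
The growth-by-reallocation scheme can be pictured with a minimal C
sketch (the function name, element type, growth factor and error
handling are inventions for the illustration; the real tokenarray.c
differs in detail):

    #include <stdlib.h>

    /* Illustration only: a byte-sized token store that grows on demand,
       in the spirit of TokenArray[] in tokenarray.c. */
    static unsigned char *TokenArray = NULL;
    static size_t tk_size = 0;          /* allocated size */
    static size_t tk_free = 0;          /* first free slot */

    void StoreToken(unsigned char tk) {
        if (tk_free == tk_size) {
            /* overflow: extend the array by reallocation */
            tk_size = tk_size ? 2 * tk_size : 1024;
            TokenArray = realloc(TokenArray, tk_size);
            if (TokenArray == NULL)
                abort();                /* out of memory */
        }
        TokenArray[tk_free++] = tk;
    }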

Also, to optimize away pass 2, an attempt is made to remember the token
positions of all beginnings of lines. The token positions at BOL are
stored in the array nl_buff[], which is also extended by reallocation,
if needed. If the attempt fails due to lack of memory, nl_buff[] is
abandoned, and pass 2 will read the files instead.

PREPARING THE FORWARD-REFERENCE TABLE

Text is compared by comparing every substring to all substrings
to the right of it; this process is in essence quadratic. However,
only substrings of length at least 'MinRunSize' are of interest,
which gives us the possibility to speed up this process by using
a hash table.

Once the entire text has been read in, a forward-reference table
forward_references[] is made (see hash.c). For every position in the
text, we construct an index which gives the next position in the text
where a run of MinRunSize tokens starts that has the same hash code.
If there is no such run, the index is 0.

To fill in this array, we use a hash table last_index[], such that
last_index[i] is the index of the latest token with hash code i, or 0 if
there is none. If at a given position p, we find that the text ahead of
us has hash code i, last_index[i] tells us which position in
forward_references[] will have to be updated to p.
See MakeForwardReferences().
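
The construction can be sketched in a few lines of C (a simplified,
self-contained rendering of the idea, not the code of hash.c; the toy
hash function, the bucket count and the 1-based positions are
assumptions of the sketch):

    #include <stddef.h>

    #define MinRunSize  8           /* minimum interesting run length */
    #define HASH_SIZE   (1 << 16)   /* number of hash buckets (assumed) */

    /* Tokens are taken to live in TokenArray[1..n_tokens]; position 0 means
       "none", as in the report.  forward_references[] has n_tokens+1 entries
       and last_index[] has HASH_SIZE entries, both zero-initialized. */
    static void make_forward_references(const unsigned char *TokenArray,
                                        size_t n_tokens,
                                        size_t *forward_references,
                                        size_t *last_index)
    {
        if (n_tokens < MinRunSize) return;

        for (size_t p = 1; p <= n_tokens - MinRunSize + 1; p++) {
            /* hash code of the MinRunSize tokens starting at p (toy hash) */
            size_t h = 0;
            for (size_t k = 0; k < MinRunSize; k++)
                h = (h * 31 + TokenArray[p + k]) % HASH_SIZE;

            if (last_index[h] != 0) {
                /* the latest position with hash code h now refers forward to p */
                forward_references[last_index[h]] = p;
            }
            last_index[h] = p;      /* p is now the latest position with h */
        }
    }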

For long text sequences (say hundreds of thousands of tokens), the
hashing is not really efficient any more since too many spurious matches
occur. Therefore, the forward reference table is scanned a second time,
eliminating from any chain all references to runs that do not start with
and end in the same token (actually this is a second hash code).
For the UNIX manuals this reduced the number of matches from 91.9% to 1.9%
(of which 0.06% was genuine).
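
The cleanup pass can be pictured as follows; here the second hash code
is reduced literally to "first and last token of the window", which is
the criterion the text describes (the real hash2() in hash.c may differ):

    #include <stddef.h>

    #define MinRunSize  8

    /* Illustrative second hash code: the first and last token of the
       MinRunSize-token window starting at position p. */
    static unsigned hash2(const unsigned char *TokenArray, size_t p) {
        return (TokenArray[p] << 8) | TokenArray[p + MinRunSize - 1];
    }

    /* Second scan over the forward-reference table: in the chain starting
       at each position p, skip all positions whose second hash code differs
       from that of p, by short-circuiting to the first one that matches. */
    static void clean_chains(const unsigned char *TokenArray,
                             size_t n_tokens,
                             size_t *forward_references)
    {
        if (n_tokens < MinRunSize) return;

        for (size_t p = 1; p <= n_tokens - MinRunSize + 1; p++) {
            size_t q = forward_references[p];
            while (q != 0 && hash2(TokenArray, q) != hash2(TokenArray, p))
                q = forward_references[q];
            forward_references[p] = q;  /* 0 if no candidate is left */
        }
    }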

DETERMINING THE SET OF INTERESTING RUNS

The overall structure of the routine Compare() (see compare.c) is:

    for all new files
        for all texts it must be compared to
            for all positions in the new file
                for all positions in the text
                    for ever increasing sizes
                        try to match and keep the best

If for a given position in the new file a good run (i.e. one of at least
minimum length) has been found, the run is registered using a call of
add_run(), the run is skipped in the new file and searching continues at
the position after it. This prevents duplicate reports of runs.
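
The skip-after-a-match behaviour can be sketched like this (the helper
names and signatures are inventions for the sketch; the actual loop in
compare.c may be organized differently):

    #include <stddef.h>

    #define MinRunSize  8

    /* Assumed helpers:
         best_match_at(i, &j)  follows the forward-reference chain from
                               position i, sets *j to the best matching
                               position and returns the match length
                               (0 if none);
         add_run(i, j, size)   registers the run, cf. add_run() in sim. */
    size_t best_match_at(size_t i, size_t *j);
    void   add_run(size_t i, size_t j, size_t size);

    /* Scanning one new file: once a good run has been found at position i,
       it is registered and then skipped, so it is not reported again. */
    static void scan_new_file(size_t first_pos, size_t last_pos)
    {
        size_t i = first_pos;
        while (i <= last_pos) {
            size_t j;
            size_t size = best_match_at(i, &j);
            if (size >= MinRunSize) {   /* a good run of at least minimum length */
                add_run(i, j, size);    /* register the run */
                i += size;              /* continue after the run */
            } else {
                i++;
            }
        }
    }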

Add_run() allocates a struct run for the run (see sim.h)
which contains two struct chunks and a quality description. It fills
in the two chunks with the pertinent info, one for the first file and
one for the second (which may be the same, if the run relates two chunks
in the same file).
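
In outline the data looks as follows (the field names are illustrative
guesses for the sketch; the authoritative definitions are in sim.h):

    #include <stddef.h>

    struct text;                    /* per-file administration, see sim.h */

    /* A position in a file, known by token position first and by line
       number only after pass 2; it is also a link in the tx_pos list. */
    struct position {
        size_t ps_tk_cnt;           /* token position in the file */
        size_t ps_nl_cnt;           /* line number, filled in by pass 2 */
        int ps_is_end;              /* starting or ending position */
        struct position *ps_next;   /* next position in the tx_pos list */
    };

    /* A chunk: a stretch of text in one file, bounded by two positions. */
    struct chunk {
        struct text *ch_text;       /* the file the chunk is in */
        struct position ch_first;   /* first token of the chunk */
        struct position ch_last;    /* last token of the chunk */
    };

    /* A run: two chunks (possibly in the same file) plus their quality. */
    struct run {
        struct chunk rn_chunk0;     /* chunk in the first file */
        struct chunk rn_chunk1;     /* chunk in the second file */
        size_t rn_quality;          /* quality, used to sort the AISO store */
    };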

The run is then entered into the arbitrary-in-sorted-out store AISO (see
aiso.spc and aiso.bdy, a genuine generic abstract data type in C!), in
which it is inserted according to its quality. Both positions
(struct position) in both chunks in the run (so four in total) are each
entered in a linked list starting at the tx_pos field in the struct text
of the appropriate file.
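
The "arbitrary in, sorted out" behaviour amounts to insertion according
to quality; a minimal linked-list stand-in is sketched below (the real
aiso.spc/aiso.bdy module is generic, not tied to struct run, and its
interface differs from this sketch):

    #include <stddef.h>
    #include <stdlib.h>

    struct run;                         /* see sim.h */

    /* Entries arrive in arbitrary order and are kept sorted on descending
       quality, so that pass 3 can retrieve them best-first. */
    struct aiso_entry {
        struct run *ae_run;
        size_t ae_quality;
        struct aiso_entry *ae_next;
    };

    static struct aiso_entry *aiso_head = NULL;

    static int aiso_insert(struct run *r, size_t quality)
    {
        struct aiso_entry *e = malloc(sizeof *e);
        if (e == NULL) return 0;
        e->ae_run = r;
        e->ae_quality = quality;

        /* find the first entry of lower quality and insert before it */
        struct aiso_entry **hook = &aiso_head;
        while (*hook != NULL && (*hook)->ae_quality >= quality)
            hook = &(*hook)->ae_next;
        e->ae_next = *hook;
        *hook = e;
        return 1;
    }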

When this is finished, the forward reference table can be deleted.

So the final results of this phase are visible both through the tx_pos
fields and through the aiso interface.

DETERMINING THE EXACT POSITION OF EACH RUN (PASS 2)

The purpose of this pass is to find for each chunk, which up to now is
known by token position only, its starting and ending line number (which
cannot be easily derived from the token position).

For each file that has a non-zero tx_pos field, i.e. that has some
interesting chunks, the positions in the tx_pos list are sorted on
ascending token position (they have been found in essentially arbitrary
order) by sort_pos() in pass2.c.

Next we scan the pos list and the file in parallel, updating the info in
a position when we meet it. A position carries an indication whether it
is a starting or an ending position, since slightly differing
calculations have to be done in each case.
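
The parallel scan can be sketched like this (the line-reading helper and
the position fields are stand-ins, and the slightly different treatment
of starting and ending positions is omitted):

    #include <stdio.h>

    /* Assumed helper: reads the next line of the file and returns the
       number of tokens on it, or (size_t)-1 at end of file. */
    size_t read_line_and_count_tokens(FILE *fp);

    struct position {               /* cf. the sketch above and sim.h */
        size_t ps_tk_cnt;           /* token position to be located */
        size_t ps_nl_cnt;           /* resulting line number */
        struct position *ps_next;   /* next position, sorted ascending */
    };

    /* Walk the sorted position list and the file in parallel; fill in the
       line number of each position as soon as the running token count
       reaches it. */
    static void locate_positions(FILE *fp, struct position *pos)
    {
        size_t line = 0, tokens_seen = 0, n;

        while (pos != NULL &&
               (n = read_line_and_count_tokens(fp)) != (size_t)-1) {
            line++;
            tokens_seen += n;
            while (pos != NULL && pos->ps_tk_cnt <= tokens_seen) {
                pos->ps_nl_cnt = line;  /* this position lies on this line */
                pos = pos->ps_next;
            }
        }
    }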

Actually, if the nl_buff[] data structure still exists, the file is not
accessed at all and the data from nl_buff[] is used instead. This is
done transparently in buff.c.

PRINTING THE CONTENTS OF THE RUNS (PASS 3)

Since each struct run has now been completely filled in, this is simple;
the hard work is calculating the page layout.

Pass3() accesses the aiso store and retrieves from it the runs in
descending order of importance. Show_run() opens both files, positions
them using the line numbers and prints the runs.
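
For one chunk the printing reduces to the following sketch (a chunk is
represented here simply by a file name and the line numbers established
in pass 2; the aiso iteration and the page layout done by Show_run() are
left out):

    #include <stdio.h>

    /* Stand-in for the part of a run that printing needs. */
    struct chunk_lines {
        const char *file_name;
        size_t first_line, last_line;
    };

    /* Print lines first_line..last_line of the file (layout omitted). */
    static void show_chunk(const struct chunk_lines *ch)
    {
        FILE *fp = fopen(ch->file_name, "r");
        if (fp == NULL) return;

        char buf[4096];
        size_t line = 0;
        while (fgets(buf, sizeof buf, fp) != NULL && line < ch->last_line) {
            line++;                     /* position the file by counting lines */
            if (line >= ch->first_line)
                fputs(buf, stdout);
        }
        fclose(fp);
    }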

================================================================

CODE EXCERPT OF THE SOFTWARE SIMILARITY TESTER SIM (980222)

sim:
    get command line options
    check the options
    init language, to precompute tables
    pass1, read the files
        # there is an array TokenArray[] that holds all input tokens
    make forward reference table
        # there is an array forward_references[], with one entry for
        # each token in the input; forward_references[i] gives the
        # token number where a token sequence starts with the same
        # hash value as the one starting at i
    compare various files to find runs
    delete forward reference table
    pass2, find newline positions of found similarities
    pass3, print the similarities

pass1, read the files:
    for each file
        divide the text into tokens
        store all tokens except newlines in TokenArray and try to
            keep a record of the newline positions

make forward reference table:
    # there are two independent hash functions, hash1() and hash2().
    # hash1(i) gives the hash value of the token sequence starting at i
    # likewise for hash2(i)
    set up the forward references using the last_index table:
        # there is an array last_index[], with one entry for each
        # possible hash value; last_index[i] gives the position in
        # forward_references[] at which i was most recently
        # encountered as a hash value
        for each file
            for all positions in file except the last MinRunSize
                set forward_references[] and update last_index[]
    use hash2() to clean out matches:
        for all tokens
            find first token in chain with same hash2 code
            short-circuit forward reference to it

compare:
    for all new files
        for all texts it must be compared to
            for all positions in the new file
                for all positions in the text
                    for ever increasing sizes
                        try to match and keep the best

try to match and keep the best:
    # using forward_references[], we find a list of positions in
    # which a matching token sequence will start;
    # scanning this list, we measure the maximum length of the
    # match and add the longest match to the run collection

pass2, find positions of found runs:
    for all files:
        sort the positions in the runs
        # we scan the pos list and the file in parallel
        for all positions inside this file
            if it matches a token position in a run
                record line number

pass3, print the similarities:
    for all runs
        # a run consists of two chunks
        open the files that hold the chunks and position them
            at the beginning of the chunk
        display the chunks