.\" This file is part of the software similarity tester SIM.
.\" Written by Dick Grune, Vrije Universiteit, Amsterdam.
.\" $Id: sim.1,v 2.6 2004/08/05 09:49:49 dick Exp $
.\"
.TH SIM 1 2001/11/13 "Vrije Universiteit"
.SH NAME
sim \- find similarities in C, Java, Pascal, Modula-2, Lisp, Miranda or text files
.SH SYNOPSIS
.B sim_c
[
.B \-[defFnpsS]
.B \-r
.I N
.B \-w
.I N
.B \-o
.I F
]
file ... [
.B /
[ file ... ] ]
.br
.B sim_c
\&...
.br
.B sim_java
\&...
.br
.B sim_pasc
\&...
.br
.B sim_m2
\&...
.br
.B sim_lisp
\&...
.br
.B sim_mira
\&...
.br
.B sim_text
\&...
.br
.SH DESCRIPTION
.I Sim_c
reads the C files
.I file ...
and looks for pieces of text that are similar; two pieces of program text
are similar if they differ only in layout, comments, identifiers and
the contents of numbers, strings and characters.
If any runs of sufficient length
are found, they are reported on standard output; the number of significant
tokens in the run is given between square brackets.
.PP
.I Sim_java
does the same for Java,
.I sim_pasc
for Pascal,
.I sim_m2
for Modula-2,
.I sim_lisp
for Lisp, and
.I sim_mira
for Miranda.
.I Sim_text
works on arbitrary text; it is occasionally useful on shell scripts.
.PP
The program can be used for finding copied pieces of code in
purportedly unrelated programs (with
.B \-s
or
.BR \-S ),
or for finding accidentally duplicated code in larger projects (with
.BR \-f ).
.PP
If a
.B /
is present between the input files, the latter are divided into a group of
"new" files (before the
.BR / )
and a group of "old" files (after it); if there is no
.BR / ,
all files are "new".
Old files are never compared to each other.
Since the similarity tester
reads the files several times, it cannot read from standard input.
.PP
The following options are available:
.TP
.B \-d
The output is in a
.IR diff (1)-like
format instead of the default two-column format.
.TP
.B \-e
Each file is compared to each file in isolation; this will find all
similarities between all texts involved, regardless of duplicates.
.TP
.B \-f
Runs are restricted to pieces with balancing parentheses, to isolate
potential functions (C, Java, Pascal, Modula-2 and Lisp only).
.TP
.B \-F
The names of functions in calls are required to match exactly
(C, Java, Pascal, Modula-2 and Lisp only).
.TP
.B \-n
Similarities found are only summarized, not displayed.
.TP
.B "\-o F"
The output is written to the file named
.IR F .
.TP
.B \-p
The output is given in similarity percentages; see below.
.TP
.B "\-r N"
The minimum run length is set to
.I N
(default is
.I N
= 24).
.TP
.B \-s
The contents of a file are not compared to the file itself (\-s = not self).
.TP
.B \-S
The contents of the new files are compared to the old files only, not
among themselves.
.TP
.B "\-w N"
The page width used is set to
.I N
columns (default is
.I N
= 80).
.PP
The
.B \-p
option results in lines of the form
.RS
.nf
\fIF\fP consists for \fIx\fP % of \fIG\fP material
.fi
.RE
meaning that \fIx\fP % of \fIF\fP's text can also be found in \fIG\fP.
Note that this relation is not symmetric; it is in fact quite possible for one
file to consist for 100 % of text from another file, while the other file
consists for only 1 % of text of the first file, if their lengths differ
enough.
Note also that the granularity of the recognized text is still governed by the
.B \-r
option or its default.
.PP
Care has been taken to keep all internal processes linear in the length of the
input, with the exception of the matching process, which is almost linear
thanks to a hash table; various other tables are used for speed-up.
If, however, there is not enough memory for the tables, they are discarded,
least important first, in which case the algorithms revert to their
quadratic behavior.
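.SH EXAMPLES
The command lines below are illustrative sketches only; the file and
directory names are hypothetical, and only options described above are used.
.PP
To look for accidentally duplicated functions within one project, with the
minimum run length raised to 30:
.PP
.RS
.nf
sim_c \-f \-r 30 *.c
.fi
.RE
.PP
To compare a group of new programs against a group of old ones, but not
among themselves, and write the result to a file:
.PP
.RS
.nf
sim_java \-S \-o report.txt new/*.java / old/*.java
.fi
.RE
.PP
To obtain similarity percentages for a set of text files:
.PP
.RS
.nf
sim_text \-p *.txt
.fi
.RE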
.SH AUTHOR
Dick Grune, Vrije Universiteit, Amsterdam.
.SH BUGS
Strong periodicity in the input text (like a table of
.I N
almost identical lines) causes problems.
.I Sim
tries to cope with this but cannot avoid giving approximately
.I log N
messages about it.
The best advice is still to remove the offending files from the input.
.PP
Since
.I sim
uses
.IR lex (1)
on some systems, it may dump core on any unusual construction that overflows
.IR lex 's
internal buffers.