123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568569570571572573574575576577578579580 |
- 2007-08-23 Dick Grune <[email protected]>
- LICENSE.txt added.
- 2006-11-27 Dick Grune <[email protected]>
- Removal of setbuff() for compatibility.
- 2005-01-17 Dick Grune <[email protected]>
- Corrections by Jerry James <[email protected]>; ANSIizing, etc.
- 2004-08-05 Dick Grune <[email protected]>
- Finished the 'percentage' option.
- 08-Nov-2001 Dick Grune
- Begun to add a 'percentage' option, which will express the
- similarity between two files in percents.
- 27-Sep-2001 Dick Grune
- Split add_run() off from compare.c into add_run.c, to accomodate
- different add_run()s, for different types of processing.
- 27-Nov-1998 Dick Grune
- Installed a Miranda version supplied by Emma Norling ([email protected])
- 23-Feb-1998 Dick Grune
- Renamed text.l to textlang.l for uniformity and to make room for
- a possible module text.[ch].
- Isolated a module for handling the token array from buff.[ch] to
- tokenarray.[ch], and renamed buff.[ch] to text.[ch].
- 23-Feb-1998 Dick Grune
- There is probably not much point in abandoning the nl_buff list
- when running out of memory for TokenArray[]: each token costs 1
- byte for the token and 4 bytes for the entry in
- forward_references[], a total of 5 bytes. There are about 3
- tokens to a line, together requiring 15 bytes, plus 1 byte in
- nl_buff yields 16 bytes. So releasing nl_buff frees only 1/16 =
- 6.7 % of memeory.
- Since the code is a bother, I removed it. Note that nl_buff is
- still abandoned when the number of tokens in a line does not fit
- in one unsigned char (but that is not very likely to happen).
-
- 21-Feb-1998 Dick Grune
- Printing got into an infinite loop when the last line of the
- input was not terminated by a newline AND contained tokens that
- were included in a matching run.
- This was due to a double bug: 1. the non-terminated line was not
- registered properly in NextTextTokenObtained() / CloseText(),
- and 2. the loop in pass 2 which sets the values of
- pos->ps_nl_cnt was terminated prematurely when the file turned
- out to be shorter than the list of pos-es indicated.
- Both bugs were corrected, the first by supplying an extra
- newline in CloseText() when one is found missing, and the second
- by rewriting the list-parallel loop in pass 2.
- 02-Feb-1998 Dick Grune
- Pascal does not differentiate between strings and characters
- (strings of one character); this difference has been removed
- from pascallang.l.
- 22-Jan-1998 Dick Grune
- Detection of non-ASCII characters added. Since the lexical
- analyser itself generates non-ASCII characters, the test must occur
- earlier. We could replace the input routine of lex by a
- checking routine, but with several lex-es going around, we want
- a more lex-independent solution. To allow each language its own
- restrictions about non-ASCII characters, the check is
- implemented in the *lang.l files.
- 28-Nov-1997 Dick Grune
- Changed the name of the C similarity tester 'sim' to 'sim_c', for
- uniformity with sim_java, etc.
- 23-Nov-1997 Dick Grune
- Java version finished; checked by Matty Huntjens and crew.
- 24-Jun-1997 Dick Grune
- Started on a Java version, by copying the C version.
- 22-Jun-1997 Dick Grune
- Modern lexical analysers, among which flex, read the entire input into
- a buffer before they issue the first token. As a result, ftell() no
- longer gives a usable indication of the position of a token in a file.
- This pulls the rug from under the nl_buff mechanism in buff.c, which
- is removed. We loose a valuable optimization this way, but there just
- seems to be no way to keep it.
- Note that this has nothing to do with the problem in MS-DOS of
- character count and fseek position not being synchronized. That
- problem has been solved on June 14, 1991 (which see) and the code has
- been running OK since.
- 18-Jun-1997 Dick Grune
- The thought has occurred to use McCreight's linear longest common
- substring algorithm rather than the existing algorithm, which has a
- small quadratic component. There are a couple of problems with this:
- 1. We need the longest >non-overlapping< common substring;
- McCreight provides just the longest. It is not at all clear
- how to modify the algorithm.
- 2. Once we have found our LCS, we want to find the
- one-but-longest; it is far from obvious how to do that in
- McCreight's algorithm.
- 3. Once we have found our LCS, we want to take one of its
- copies out of the game, to suppress duplicate messages.
- Again, it is difficult to see how to do that, without
- redoing all the calculations.
- 4. McCreight's algorithm seems to require about two binary
- tree nodes per token, say 8 bytes, which is double we
- use now.
- 17-Jun-1997 Dick Grune
- Did some experimenting with the hash function; it is still
- pretty bad: the simple-minded second sweep through
- forward_references easily removes another 80-99% of false hits.
- Next, a third sweep that does a full comparison will remove another
- large percentage.
-
- So I have left in the second sweep in all cases.
-
- There are a couple of questions here:
- 1. Can we find a better hash function, or will we forever need a
- second sweep?
- 2. Does it actually matter, or will we loose on more expensive
- hashing what we gain by having a better set of forward
- references in compare.c?
- 16-Jun-1997 Dick Grune
- Cleaned up sim.h and renamed aiso.[ch] to runs.[ch] since they
- are instantiations of the aiso module concerned with runs.
- Aiso.[spc|bdy] stays aiso.[spc|bdy], of course.
- 16-Jun-1997 Dick Grune
- Redid largest_function() in algollike.c.
- Corrected bug in CheckRun; it now always removes NonFinals from
- the end, even when it has first applied largest_function().
- 15-Jun-1997 Dick Grune
- Reorganized the layers around the input file. There were and
- still are three layers: lang, stream and buff.
- Since the lex_X variables are hoisted unchanged through the levels
- lang, stream, and buff, to be used by pass1, pass2, etc., they
- have to be placed in a module of their own.
- The token-providing module 'lang' has three interfaces:
- - lang.h, which provides access to the lowest-level token
- routines, to be used by the next level.
- - lex.h, which provides the lex variables, to be used by
- all and sundry.
- - language.h, which provides language-specific info about
- tokens, concerning their suitability as initial
- and final tokens, to be used by higher levels.
-
- This structure is not satisfactory, but it is also unreasonable
- to combine them in one interface.
- There is no single lang.c; rather it is represented by the
- various Xlang.c files generated from the Xlang.l files.
- 14-Jun-1997 Dick Grune
- Added a Makefile zip entry to parallel the shar entry.
- 13-Jun-1997 Dick Grune
- A number of simplifications, in view of better software and bigger
- machines:
- - Removed good_realloc from hash.c; I don't think there are
- any bad reallocs left.
- - Removed the option to run without forward_references.
- On a 16Mb machine this means you have at least 2M tokens;
- using a quadratic algorithm will take 4*10^6 sec. at an
- impossible rate of 1M actions/sec., which is some 50 days.
- Forget it.
- - Renamed lang() to print_stream(), and incorporated it in sim.c
- - Removed the MSDOS subdirectory mechanism in the Makefile.
- - Removed the funny and sneaky double parameter expansion in
- the call of idf_in_list().
- 12-Jun-1997 Dick Grune
- Converted to ANSI C. Removed cport.h.
- 09-Jan-1995 Dick Grune
- Decided not to do directories: they usually contain extraneous
- files and doing sim * is simple enough anyway.
- 09-Sep-1994 Dick Grune
- Added system.h to cater for the (few) differences between Unix and DOS.
- The #define int32 is also supplied there.
- 05-Sep-1994 Dick Grune
- Added many prototype declarations using cport.h.
- Added a depend entry to the Makefile.
- 31-Aug-1994 Dick Grune
- All these changes require a 32 bit integer; introduced a #define
- int32, set from the command line in the Makefile.
- 25-Aug-1994 Dick Grune
- It turned out that one of the most often called routines was .rem,
- from idf_hashed() in idf.c. Moving the % out of the loop chafed off
- another 6% and reduced the time to 18.4 sec.
- 19-Aug-1994 Dick Grune
- With very large files (e.g., concatenated /usr/man/man1/*) the fixed
- built-in hash table size of 10639 is no longer satisfactory. Hash.c
- now finds a prime about 8 times smaller than the text_size to use
- for hash table size; this achieves optimal speed-up without gobbling
- up too much memory. Reduced the time for the above file from 30.2
- sec. to 19.6 sec.
- For checking, the same test was run with all hashing off; it took
- 20h 27m 19s = 73639 sec. But it worked.
- 11-Aug-1994 Dick Grune
- For large values of MinRunSize (>1000) a large part of the time
- (>two-thirds) was spent in calculating the hash values for each
- position in the input, since the cost of this calculation was
- proportional to MinRunSize. We now sample a maximum of 24 tokens
- from the input string to calculate the hash value, and avoid
- overflow. On my workstation, this reduces the time for
- sim_text -r 1000 -n /usr/man/man1/*
- from 60 sec to 21 sec.
- 30-Jun-1992 Dick Grune,kamer R4.40,telef. 5778
- There was an amazing bug in buff.c where NextTextToken() for pass 2
- omitted to set lex_token to EOL when retrieving newline info from
- nl_buff. Worked until now!?!
- 23-Sep-1991 Dick Grune
- Cport.h introduced, CONST and *.spc only.
- 17-Sep-1991 Dick Grune
- The position-sorting routine in pass2.c has been made into a
- separate generic module.
- 14-Jun-1991 Dick Grune ([email protected]) at dick.cs.vu.nl
- Replaced the determination of the input position through counting
- input characters by calls of ftell(); this is cleaner and the other
- method will never work on MSDOS.
- 30-May-1989 Dick Grune (dick) at dick
- Replaced the old top-100 module (which had been extended to top-10000
- already anyway) by the new aiso (arbitrary-in sorted-out) module.
- This caused a considerable speed-up on the Mod2 test bed:
- %time cumsecs #call ms/call name
- 17.9 99.20 7209 13.76 _InsertTop
- 0.3 1.37 7209 0.19 _InsertAiso
- It turns out that malloc() is not a serious problem, so no special
- version for the aiso module is required.
- 23-May-1989 Dick Grune (dick) at dick
- No more uncommented comment at the end of preprocessor lines, to
- conform to ANSI C.
- 23-May-1989 Dick Grune (dick) at dick
- Added code in the X.l files to (silently) reject characters over 0200.
- This does not really help, since lex stops on null chars. Ah, well.
- 19-May-1989 Dick Grune (dick) at dick
- Made the token as handled by sim into an abstract data type, for
- aesthetic reasons. Sign extension is still a problem.
- 03-May-1989 Dick Grune (dick) at dick
- Optimized lcs() by first checking from the end if a sufficiently long
- run is present; if in fact only the first 12 tokens match, chances
- are good that you can reject the run right away by first testing
- the 20th token, then the 19th, and so on.
- 21-Apr-1989 Dick Grune (dick) at dick
- A run of sim_m2 finding 7209 similarities raised the question of
- the appropriateness of the linear sort in sort_pos(). Profiling
- showed that in this case sorting takes all of 7.5 % of the total
- time. Putting the word register in in the right places in
- sort_pos() lowered this number to 4.6%.
- 20-Apr-1989 Dick Grune (dick) at dick
- Moved the test for MayBeStartOfRun() from compare.c (where it is
- done again and again) to hash.c, where its effect is incorporated in
- the forward reference chain.
- 14-Apr-1989 Dick Grune (dick) at dick
- Replaced elem_of() by bit tables, headers[] and trailers[], to be
- prefilled from Headers[] and Trailers[] by a call of
- InitLanguage(). This saves a few percents.
- 13-Apr-1989 Dick Grune (dick) at dick
- Implemented the -e and the -S option, by putting yet another loop
- in compare.c
- 13-Apr-1989 Dick Grune (dick) at dick
- The -- option (displaying the tokens) will now handle more than one
- file.
- 20-Jan-1989 Dick Grune (dick) at dick
- After the modification of 19-Dec-88, 12% of the time went into
- updating the positions in the chunks, as they were produced by the
- matching process. This matching process identifies runs (matches)
- by token position, which has to be recalculated to lseek positions
- and line numbers. To this end the files are read again, and for
- each line all positions found were checked to see if they applied
- to this line; this was a awfully stupid algorithm, but since much
- more time was spent elsewhere, it did not really matter. With all
- the saving below, however, it had risen to second position, after
- yylook() with 35%.
- Th solution was, to sort the positions in the same order in which
- they would be met by the reading of the files. The process is then
- linear. This required some extensive hacking in pass2.c
- 06-Jan-1989 Dick Grune (dick) at dick
- The modification below did indeed save 25%. The newline information
- is now reduced to 2 shorts; 2 chars were not enough, since some
- lines are longer that 127 bytes, and a char and a short together
- take as much room as two shorts.
- 19-Dec-1988 Dick Grune (dick) at dick
- To avoid reading the files twice (which is still taking 25% of the
- time), the first pass will now collect newline information for the
- second pass in a buffer called nl_buff[]. This buffer, and the
- original token buffer now named TokenArray[], are managed by the file
- buff.c, which implements a layer between stream.h and pass?.c. This
- layer provides OpenText(), NextTextToken() and CloseText(), each
- with a parameter telling which pass it is.
- 06-Dec-1988 Dick Grune (dick) at dick
- As an introduction to removing the second pass altogether, the
- first and second scan were unified, i.e., their input is identical.
- This also means that the call sim -[12] has now been replaced by
- one call: sim --.
- 23-Sep-1988 Dick Grune (dick) at dick
- Dynamic allocation of line buffers in pass 3. This removes the
- restriction on the page width.
- 22-Sep-1988 Dick Grune (dick) at dick
- In order to give better messages on incorrect calls to sim, the
- whole option handling has been concentrated in a file option.c and
- separated from the options and their messages themselves. See sim.c
- 07-Sep-1988 Dick Grune (dick) at dick
- For long text sequences (say hundreds of thousands of tokens),
- the hashing is not really efficient any more since too many
- spurious matches occur. Therefore, the forward reference table is
- scanned a second time, eliminating from any chain all references to
- runs that do not end in the same token. For the UNIX manuals this
- reduced the number of matches from 91.9% to 1.9% (of which 0.06%
- were genuine).
- 30-Aug-1988 Dick Grune (dick) at dick
- For compatibility, NextTop has been rewritten to yield true or
- false and to accept a pointer to a run as a parameter.
- 30-Aug-1988 Dick Grune (dick) at dick
- When trying to find line-number and lseek position to beginnings
- and ends of runs found, the whole set of runs was scanned for each
- line in each file. Now only the runs belonging to that file are
- scanned; to this end another linked list has been braided through
- the data structures (tx_chunk).
- 30-Aug-1988 Dick Grune (dick) at dick
- The longest-common-substring algorithm was called much too often,
- mainly because the forward references made by hashing suffered from
- pollution. If you have say 1000 tokens and a hash range of say
- 10000, about 5 % of the hashings will be false matches, i.e. 50
- matches, which is quite a lot on a natural number of 2 to 3 matches.
- Improved by doing a second check in make_forw_ref().
- 12-Jun-1988 Dick Grune (dick) at dick
- Installed a Lisp version supplied by Gertjan Akkerman.
- 15-Jan-1988 Dick Grune (dick) at dick
- Added register declarations all over the place.
- 14-Jan-1988 Dick Grune (dick) at dick
- It is often useful to match a piece of code exactly, especially
- when function names (or, even more so, macro names) are involved.
- What one would want is having all the letters in the text array,
- but this is kind of hard, since each entry is one lexical item.
- This means that under the -F option each letter is a lex item, and
- normally each tag is a lex item; this requires two lex grammars in
- one program; no good. So, on the -F flag we hash the identifier
- into one lex item, which is hopefully characteristic enough. It
- works.
- 30-Sep-1987 Dick Grune (dick) at dick
- Some cosmetics.
- 31-Aug-1987 Dick Grune (dick) at dick
- Moved the whole thing to the SUN (while testing on a VAX and a
- MC68000)
- 16-Aug-1987 Dick Grune (dick) at dick
- The test program lang.c is no longer a main program, but rather a
- subroutine called in main() in sim.c, through the command line
- option -1 or -2.
- 23-Apr-1987 Dick Grune (dick) at tjalk
- Changed the name 'index' into 'elem_of', because of compatibility
- problems on different Unices. Added a declaration for it in
- the file algollike.c
- 10-Mar-1987 Dick Grune (dick) at tjalk
- Changed the printing of the header of a run so that:
- - long file names will no longer be truncated
- - the run length is displayed
- 27-Jan-1987 Dick Grune (dick) at tjalk
- Switched it right off again! Getting them in textual order is
- still more unpleasant, since now you cannot find the important
- ones if their are more than a few runs.
- 27-Jan-1987 Dick Grune (dick) at tjalk
- Going to experiment with leaving out the sorting; just all the
- runs, in the order we meet them. Should be as good or better.
- Comparisons of more than 100 runs are very rare anyway, so the
- fact that those over a 100 are rejected is probably no great
- help. Getting them in a funny order is a nuisance, however. Down
- with featurism. Just to be safe, present version saved as
- 870127.SV
- 26-Dec-1986 Dick Grune (dick) at tjalk
- Names of overall parameters in params.h changed to more uniformity.
- 26-Dec-1986 Dick Grune (dick) at tjalk
- Since the top package and the instantiation system have grown
- apart so much, I have integrated the old top package into sim,
- i.e., done the instantiation by hand. This removes top.g and
- top.p, and will save outsiders from wondering what is going on
- here.
- 23-Dec-1986 Dick Grune (dick) at tjalk
- Use setbuf to print unbuffered while reading the files (lex core
- dumps, other mishaps) and print buffered while printing the real
- output (for speed).
- 30-Nov-1986 Dick Grune (dick) at tjalk
- Various small changes in *lang.l:
- ; ignored conditionally (!options['f'])
- new format for tokens in struct idf
- cosmetics: macro Layout, macro UnsafeComChar, no \n
- in character denotations, more than one char
- in a char denotations in Pascal, etc.
- 30-Nov-1986 Dick Grune (dick) at tjalk
- Added a Modula-2 version.
- 29-Nov-1986 Dick Grune (dick) at tjalk
- Restricting tokens to the ASCII95 character set is really too
- severe: some languages have many more reserved words (COBOL!).
- Corrected this by adding a couple of '&0377' in strategic places.
- Added a routine for printing the 8-bit beasties: show_token().
- 15-Aug-1986 Dick Grune (dick) at tjalk
- Since the ; is superfluous in both C and Pascal, it is now ignored
- by clang.l and pascallang.l
- 15-Aug-1986 Dick Grune (dick) at tjalk
- The code in CheckRun in Xlang.l was incorrect in that it used the
- wrong criterion for throwing away trailing garbage. I've taken
- CheckRun etc. out of the Xlang.l-s and turned them into a module
- "algollike.c". Made a cleaner interface and avoided duplication of
- code.
- 02-Jul-1986 Dick Grune (dick) at tjalk
- Looking backwards in compare.c to see if we are in the middle of a
- run is an atavism. You can be and still be all right, e.g., if
- part of the run was rejected as not fitting for a function.
- Removed from compare.c.
- 10-Jun-1986 Dick Grune (dick) at tjalk
- The function hash_code() in hash.c could yield a negative value;
- corrected.
- 09-Jun-1986 Dick Grune (dick) at tjalk
- Changed the name of the file text.h to sim.h. Sim.h is more
- appropriate and text.h sounds as if it belongs to text.l, with
- which it has no connection.
- 04-Jun-1986 Dick Grune (dick) at tjalk
- After having looked at a couple of hash functions and having done
- some calculations on the number of duplicates normally encountered
- in hash functions, I conclude that our function in hash.c is quite
- good. Removed all the statistics-gathering stuff.
-
- Actually, hash_table[] is not the hash table at all; it is a
- forward reference table; likewise, the real hash table was called
- last[]. Renamed both.
-
- There is a way to keep the hash table local without putting it on
- the stack: use malloc().
- 02-Jun-1986 Dick Grune (dick) at tjalk
- Added a simple lex file for text: each word is condensed into a
- hash code which is mapped on the ASCII95 character set. This
- turns out to be quite effective.
- 01-Jun-1986 Dick Grune (dick) at tjalk
- The macros cput(tk) and c_eol() both have a return in them, so any
- code after them may not be executed -> they have to be last in an
- entry. But they weren't, in many places; I can't imagine why it
- all worked nevertheless. They have been renamed return_tk(tk) and
- return_eol() and the entries have been restructured.
- 30-May-1986 Dick Grune (dick) at tjalk
- Moved the string and character entries in clang.l and pascallang.l
- to a place behind the comment entries, to avoid strings (and
- characters) being recognized inside comments. I first thought
- this would not happen, but as Maarten pointed out, if both
- interpretations have the same length, lex will take the first
- entry. Now this will happen if the string occupies the whole line
- that would otherwise be taken as a comment. In short,
- /*
- "hallo"
- */
- would return ".
- 28-May-1986 Dick Grune (dick) at tjalk
- Added -d option, to display the output in diff(1) format (courtesy
- of Maarten van der Meulen).
- Rewrote the lexical parsing of comments (likewise courtesy Maarten
- van der Meulen).
- 20-May-1986 Dick Grune (dick) at tjalk
- Added a routine to convert identifiers to lower case in
- pascallang.l .
- 19-May-1986 Dick Grune (dick) at tjalk
- Added -a option, to quickly check antecedent of a file (courtesy
- of Maarten van der Meulen).
- 18-May-1986 Dick Grune (dick) at tjalk
- Brought everything under RCS/CVS.
- 18-Mar-1986 Dick Grune (dick) at tjalk
- Added modifications by Paul Bame (hp-lsd!paul@hp-labs) to have an
- option -w to set the page width.
- 21-Feb-1986 Dick Grune (dick) at tjalk
- Took array last[N_HASH] out of make_hash() in hash.c, due to stack
- overflow on the Gould (reported by George Walker
- [email protected])
- 16-Feb-1986 Dick Grune (dick) at tjalk
- Corrected some subtractions that caused unsigned ints to turn
- pseudo-negative. (Reported by jaap@mcvax)
- 11-Jan-1986 Dick Grune (dick) at tjalk
- Touched up for distribution.
- 10-Jan-1986 Dick Grune (dick) at tjalk
- Fill_line was not called for empty lines, which caused them to be
- printed as repetitions of the previous line.
- 24-Dec-1985 Dick Grune (dick) at tjalk
- Reduced hash table to a single array of indices; it is used only
- in one place, which makes it very easy to make it (the hash table)
- optional. General tune-up of everything. This seems to be
- another stable "final" version.
- 14-Dec-1985 Dick Grune (dick) at tjalk
- Some experiments with hash formulas:
- h = (h OP CST) + *p++ OP CST yields right wrong
- * 96 - 32 205 562
- * 96 - 2 205 560
- * 96 205 560
- * 97 205 559
- << 0 66 3128
- << 1 203 555
- << 2 205 536
- << 7 203 540
- Conclusion: it doesn't matter, unless you do it wrong.
- 01-Oct-1983 Dic8k Grune (dick) at vu44
- Oldest known files.
- # This file is part of the software similarity tester SIM.
- # Written by Dick Grune, Vrije Universiteit, Amsterdam.
- # $Id: ChangeLog,v 2.12 2007/08/27 09:57:30 dick Exp $
- #
|