|
@@ -0,0 +1,580 @@
|
|
|
+2007-08-23 Dick Grune <[email protected]>
|
|
|
+ LICENSE.txt added.
|
|
|
+
|
|
|
+2006-11-27 Dick Grune <[email protected]>
|
|
|
+ Removal of setbuff() for compatibility.
|
|
|
+
|
|
|
+2005-01-17 Dick Grune <[email protected]>
|
|
|
+ Corrections by Jerry James <[email protected]>; ANSIizing, etc.
|
|
|
+
|
|
|
+2004-08-05 Dick Grune <[email protected]>
|
|
|
+ Finished the 'percentage' option.
|
|
|
+
|
|
|
+08-Nov-2001 Dick Grune
|
|
|
+ Begun to add a 'percentage' option, which will express the
|
|
|
+ similarity between two files in percents.
|
|
|
+
|
|
|
+27-Sep-2001 Dick Grune
|
|
|
+ Split add_run() off from compare.c into add_run.c, to accomodate
|
|
|
+ different add_run()s, for different types of processing.
|
|
|
+
|
|
|
+27-Nov-1998 Dick Grune
|
|
|
+ Installed a Miranda version supplied by Emma Norling ([email protected])
|
|
|
+
|
|
|
+23-Feb-1998 Dick Grune
|
|
|
+ Renamed text.l to textlang.l for uniformity and to make room for
|
|
|
+ a possible module text.[ch].
|
|
|
+
|
|
|
+ Isolated a module for handling the token array from buff.[ch] to
|
|
|
+ tokenarray.[ch], and renamed buff.[ch] to text.[ch].
|
|
|
+
|
|
|
+23-Feb-1998 Dick Grune
|
|
|
+ There is probably not much point in abandoning the nl_buff list
|
|
|
+ when running out of memory for TokenArray[]: each token costs 1
|
|
|
+ byte for the token and 4 bytes for the entry in
|
|
|
+ forward_references[], a total of 5 bytes. There are about 3
|
|
|
+ tokens to a line, together requiring 15 bytes, plus 1 byte in
|
|
|
+ nl_buff yields 16 bytes. So releasing nl_buff frees only 1/16 =
|
|
|
+ 6.7 % of memeory.
|
|
|
+
|
|
|
+ Since the code is a bother, I removed it. Note that nl_buff is
|
|
|
+ still abandoned when the number of tokens in a line does not fit
|
|
|
+ in one unsigned char (but that is not very likely to happen).
|
|
|
+
|
|
|
+
|
|
|
+21-Feb-1998 Dick Grune
|
|
|
+ Printing got into an infinite loop when the last line of the
|
|
|
+ input was not terminated by a newline AND contained tokens that
|
|
|
+ were included in a matching run.
|
|
|
+ This was due to a double bug: 1. the non-terminated line was not
|
|
|
+ registered properly in NextTextTokenObtained() / CloseText(),
|
|
|
+ and 2. the loop in pass 2 which sets the values of
|
|
|
+ pos->ps_nl_cnt was terminated prematurely when the file turned
|
|
|
+ out to be shorter than the list of pos-es indicated.
|
|
|
+ Both bugs were corrected, the first by supplying an extra
|
|
|
+ newline in CloseText() when one is found missing, and the second
|
|
|
+ by rewriting the list-parallel loop in pass 2.
|
|
|
+
|
|
|
+02-Feb-1998 Dick Grune
|
|
|
+ Pascal does not differentiate between strings and characters
|
|
|
+ (strings of one character); this difference has been removed
|
|
|
+ from pascallang.l.
|
|
|
+
|
|
|
+22-Jan-1998 Dick Grune
|
|
|
+ Detection of non-ASCII characters added. Since the lexical
|
|
|
+ analyser itself generates non-ASCII characters, the test must occur
|
|
|
+ earlier. We could replace the input routine of lex by a
|
|
|
+ checking routine, but with several lex-es going around, we want
|
|
|
+ a more lex-independent solution. To allow each language its own
|
|
|
+ restrictions about non-ASCII characters, the check is
|
|
|
+ implemented in the *lang.l files.
|
|
|
+
|
|
|
+28-Nov-1997 Dick Grune
|
|
|
+ Changed the name of the C similarity tester 'sim' to 'sim_c', for
|
|
|
+ uniformity with sim_java, etc.
|
|
|
+
|
|
|
+23-Nov-1997 Dick Grune
|
|
|
+ Java version finished; checked by Matty Huntjens and crew.
|
|
|
+
|
|
|
+24-Jun-1997 Dick Grune
|
|
|
+ Started on a Java version, by copying the C version.
|
|
|
+
|
|
|
+22-Jun-1997 Dick Grune
|
|
|
+ Modern lexical analysers, among which flex, read the entire input into
|
|
|
+ a buffer before they issue the first token. As a result, ftell() no
|
|
|
+ longer gives a usable indication of the position of a token in a file.
|
|
|
+ This pulls the rug from under the nl_buff mechanism in buff.c, which
|
|
|
+ is removed. We loose a valuable optimization this way, but there just
|
|
|
+ seems to be no way to keep it.
|
|
|
+
|
|
|
+ Note that this has nothing to do with the problem in MS-DOS of
|
|
|
+ character count and fseek position not being synchronized. That
|
|
|
+ problem has been solved on June 14, 1991 (which see) and the code has
|
|
|
+ been running OK since.
|
|
|
+
|
|
|
+18-Jun-1997 Dick Grune
|
|
|
+ The thought has occurred to use McCreight's linear longest common
|
|
|
+ substring algorithm rather than the existing algorithm, which has a
|
|
|
+ small quadratic component. There are a couple of problems with this:
|
|
|
+ 1. We need the longest >non-overlapping< common substring;
|
|
|
+ McCreight provides just the longest. It is not at all clear
|
|
|
+ how to modify the algorithm.
|
|
|
+ 2. Once we have found our LCS, we want to find the
|
|
|
+ one-but-longest; it is far from obvious how to do that in
|
|
|
+ McCreight's algorithm.
|
|
|
+ 3. Once we have found our LCS, we want to take one of its
|
|
|
+ copies out of the game, to suppress duplicate messages.
|
|
|
+ Again, it is difficult to see how to do that, without
|
|
|
+ redoing all the calculations.
|
|
|
+ 4. McCreight's algorithm seems to require about two binary
|
|
|
+ tree nodes per token, say 8 bytes, which is double we
|
|
|
+ use now.
|
|
|
+
|
|
|
+17-Jun-1997 Dick Grune
|
|
|
+ Did some experimenting with the hash function; it is still
|
|
|
+ pretty bad: the simple-minded second sweep through
|
|
|
+ forward_references easily removes another 80-99% of false hits.
|
|
|
+ Next, a third sweep that does a full comparison will remove another
|
|
|
+ large percentage.
|
|
|
+
|
|
|
+ So I have left in the second sweep in all cases.
|
|
|
+
|
|
|
+ There are a couple of questions here:
|
|
|
+ 1. Can we find a better hash function, or will we forever need a
|
|
|
+ second sweep?
|
|
|
+ 2. Does it actually matter, or will we loose on more expensive
|
|
|
+ hashing what we gain by having a better set of forward
|
|
|
+ references in compare.c?
|
|
|
+
|
|
|
+
|
|
|
+16-Jun-1997 Dick Grune
|
|
|
+ Cleaned up sim.h and renamed aiso.[ch] to runs.[ch] since they
|
|
|
+ are instantiations of the aiso module concerned with runs.
|
|
|
+ Aiso.[spc|bdy] stays aiso.[spc|bdy], of course.
|
|
|
+
|
|
|
+16-Jun-1997 Dick Grune
|
|
|
+ Redid largest_function() in algollike.c.
|
|
|
+ Corrected bug in CheckRun; it now always removes NonFinals from
|
|
|
+ the end, even when it has first applied largest_function().
|
|
|
+
|
|
|
+15-Jun-1997 Dick Grune
|
|
|
+ Reorganized the layers around the input file. There were and
|
|
|
+ still are three layers: lang, stream and buff.
|
|
|
+
|
|
|
+ Since the lex_X variables are hoisted unchanged through the levels
|
|
|
+ lang, stream, and buff, to be used by pass1, pass2, etc., they
|
|
|
+ have to be placed in a module of their own.
|
|
|
+
|
|
|
+ The token-providing module 'lang' has three interfaces:
|
|
|
+ - lang.h, which provides access to the lowest-level token
|
|
|
+ routines, to be used by the next level.
|
|
|
+ - lex.h, which provides the lex variables, to be used by
|
|
|
+ all and sundry.
|
|
|
+ - language.h, which provides language-specific info about
|
|
|
+ tokens, concerning their suitability as initial
|
|
|
+ and final tokens, to be used by higher levels.
|
|
|
+
|
|
|
+ This structure is not satisfactory, but it is also unreasonable
|
|
|
+ to combine them in one interface.
|
|
|
+
|
|
|
+ There is no single lang.c; rather it is represented by the
|
|
|
+ various Xlang.c files generated from the Xlang.l files.
|
|
|
+
|
|
|
+14-Jun-1997 Dick Grune
|
|
|
+ Added a Makefile zip entry to parallel the shar entry.
|
|
|
+
|
|
|
+13-Jun-1997 Dick Grune
|
|
|
+ A number of simplifications, in view of better software and bigger
|
|
|
+ machines:
|
|
|
+ - Removed good_realloc from hash.c; I don't think there are
|
|
|
+ any bad reallocs left.
|
|
|
+ - Removed the option to run without forward_references.
|
|
|
+ On a 16Mb machine this means you have at least 2M tokens;
|
|
|
+ using a quadratic algorithm will take 4*10^6 sec. at an
|
|
|
+ impossible rate of 1M actions/sec., which is some 50 days.
|
|
|
+ Forget it.
|
|
|
+ - Renamed lang() to print_stream(), and incorporated it in sim.c
|
|
|
+ - Removed the MSDOS subdirectory mechanism in the Makefile.
|
|
|
+ - Removed the funny and sneaky double parameter expansion in
|
|
|
+ the call of idf_in_list().
|
|
|
+
|
|
|
+12-Jun-1997 Dick Grune
|
|
|
+ Converted to ANSI C. Removed cport.h.
|
|
|
+
|
|
|
+09-Jan-1995 Dick Grune
|
|
|
+ Decided not to do directories: they usually contain extraneous
|
|
|
+ files and doing sim * is simple enough anyway.
|
|
|
+
|
|
|
+09-Sep-1994 Dick Grune
|
|
|
+ Added system.h to cater for the (few) differences between Unix and DOS.
|
|
|
+ The #define int32 is also supplied there.
|
|
|
+
|
|
|
+05-Sep-1994 Dick Grune
|
|
|
+ Added many prototype declarations using cport.h.
|
|
|
+ Added a depend entry to the Makefile.
|
|
|
+
|
|
|
+31-Aug-1994 Dick Grune
|
|
|
+ All these changes require a 32 bit integer; introduced a #define
|
|
|
+ int32, set from the command line in the Makefile.
|
|
|
+
|
|
|
+25-Aug-1994 Dick Grune
|
|
|
+ It turned out that one of the most often called routines was .rem,
|
|
|
+ from idf_hashed() in idf.c. Moving the % out of the loop chafed off
|
|
|
+ another 6% and reduced the time to 18.4 sec.
|
|
|
+
|
|
|
+19-Aug-1994 Dick Grune
|
|
|
+ With very large files (e.g., concatenated /usr/man/man1/*) the fixed
|
|
|
+ built-in hash table size of 10639 is no longer satisfactory. Hash.c
|
|
|
+ now finds a prime about 8 times smaller than the text_size to use
|
|
|
+ for hash table size; this achieves optimal speed-up without gobbling
|
|
|
+ up too much memory. Reduced the time for the above file from 30.2
|
|
|
+ sec. to 19.6 sec.
|
|
|
+ For checking, the same test was run with all hashing off; it took
|
|
|
+ 20h 27m 19s = 73639 sec. But it worked.
|
|
|
+
|
|
|
+11-Aug-1994 Dick Grune
|
|
|
+ For large values of MinRunSize (>1000) a large part of the time
|
|
|
+ (>two-thirds) was spent in calculating the hash values for each
|
|
|
+ position in the input, since the cost of this calculation was
|
|
|
+ proportional to MinRunSize. We now sample a maximum of 24 tokens
|
|
|
+ from the input string to calculate the hash value, and avoid
|
|
|
+ overflow. On my workstation, this reduces the time for
|
|
|
+ sim_text -r 1000 -n /usr/man/man1/*
|
|
|
+ from 60 sec to 21 sec.
|
|
|
+
|
|
|
+30-Jun-1992 Dick Grune,kamer R4.40,telef. 5778
|
|
|
+ There was an amazing bug in buff.c where NextTextToken() for pass 2
|
|
|
+ omitted to set lex_token to EOL when retrieving newline info from
|
|
|
+ nl_buff. Worked until now!?!
|
|
|
+
|
|
|
+23-Sep-1991 Dick Grune
|
|
|
+ Cport.h introduced, CONST and *.spc only.
|
|
|
+
|
|
|
+17-Sep-1991 Dick Grune
|
|
|
+ The position-sorting routine in pass2.c has been made into a
|
|
|
+ separate generic module.
|
|
|
+
|
|
|
+14-Jun-1991 Dick Grune ([email protected]) at dick.cs.vu.nl
|
|
|
+ Replaced the determination of the input position through counting
|
|
|
+ input characters by calls of ftell(); this is cleaner and the other
|
|
|
+ method will never work on MSDOS.
|
|
|
+
|
|
|
+30-May-1989 Dick Grune (dick) at dick
|
|
|
+ Replaced the old top-100 module (which had been extended to top-10000
|
|
|
+ already anyway) by the new aiso (arbitrary-in sorted-out) module.
|
|
|
+ This caused a considerable speed-up on the Mod2 test bed:
|
|
|
+ %time cumsecs #call ms/call name
|
|
|
+ 17.9 99.20 7209 13.76 _InsertTop
|
|
|
+ 0.3 1.37 7209 0.19 _InsertAiso
|
|
|
+ It turns out that malloc() is not a serious problem, so no special
|
|
|
+ version for the aiso module is required.
|
|
|
+
|
|
|
+23-May-1989 Dick Grune (dick) at dick
|
|
|
+ No more uncommented comment at the end of preprocessor lines, to
|
|
|
+ conform to ANSI C.
|
|
|
+
|
|
|
+23-May-1989 Dick Grune (dick) at dick
|
|
|
+ Added code in the X.l files to (silently) reject characters over 0200.
|
|
|
+ This does not really help, since lex stops on null chars. Ah, well.
|
|
|
+
|
|
|
+19-May-1989 Dick Grune (dick) at dick
|
|
|
+ Made the token as handled by sim into an abstract data type, for
|
|
|
+ aesthetic reasons. Sign extension is still a problem.
|
|
|
+
|
|
|
+03-May-1989 Dick Grune (dick) at dick
|
|
|
+ Optimized lcs() by first checking from the end if a sufficiently long
|
|
|
+ run is present; if in fact only the first 12 tokens match, chances
|
|
|
+ are good that you can reject the run right away by first testing
|
|
|
+ the 20th token, then the 19th, and so on.
|
|
|
+
|
|
|
+21-Apr-1989 Dick Grune (dick) at dick
|
|
|
+ A run of sim_m2 finding 7209 similarities raised the question of
|
|
|
+ the appropriateness of the linear sort in sort_pos(). Profiling
|
|
|
+ showed that in this case sorting takes all of 7.5 % of the total
|
|
|
+ time. Putting the word register in in the right places in
|
|
|
+ sort_pos() lowered this number to 4.6%.
|
|
|
+
|
|
|
+20-Apr-1989 Dick Grune (dick) at dick
|
|
|
+ Moved the test for MayBeStartOfRun() from compare.c (where it is
|
|
|
+ done again and again) to hash.c, where its effect is incorporated in
|
|
|
+ the forward reference chain.
|
|
|
+
|
|
|
+14-Apr-1989 Dick Grune (dick) at dick
|
|
|
+ Replaced elem_of() by bit tables, headers[] and trailers[], to be
|
|
|
+ prefilled from Headers[] and Trailers[] by a call of
|
|
|
+ InitLanguage(). This saves a few percents.
|
|
|
+
|
|
|
+13-Apr-1989 Dick Grune (dick) at dick
|
|
|
+ Implemented the -e and the -S option, by putting yet another loop
|
|
|
+ in compare.c
|
|
|
+
|
|
|
+13-Apr-1989 Dick Grune (dick) at dick
|
|
|
+ The -- option (displaying the tokens) will now handle more than one
|
|
|
+ file.
|
|
|
+
|
|
|
+20-Jan-1989 Dick Grune (dick) at dick
|
|
|
+ After the modification of 19-Dec-88, 12% of the time went into
|
|
|
+ updating the positions in the chunks, as they were produced by the
|
|
|
+ matching process. This matching process identifies runs (matches)
|
|
|
+ by token position, which has to be recalculated to lseek positions
|
|
|
+ and line numbers. To this end the files are read again, and for
|
|
|
+ each line all positions found were checked to see if they applied
|
|
|
+ to this line; this was a awfully stupid algorithm, but since much
|
|
|
+ more time was spent elsewhere, it did not really matter. With all
|
|
|
+ the saving below, however, it had risen to second position, after
|
|
|
+ yylook() with 35%.
|
|
|
+
|
|
|
+ Th solution was, to sort the positions in the same order in which
|
|
|
+ they would be met by the reading of the files. The process is then
|
|
|
+ linear. This required some extensive hacking in pass2.c
|
|
|
+
|
|
|
+06-Jan-1989 Dick Grune (dick) at dick
|
|
|
+ The modification below did indeed save 25%. The newline information
|
|
|
+ is now reduced to 2 shorts; 2 chars were not enough, since some
|
|
|
+ lines are longer that 127 bytes, and a char and a short together
|
|
|
+ take as much room as two shorts.
|
|
|
+
|
|
|
+19-Dec-1988 Dick Grune (dick) at dick
|
|
|
+ To avoid reading the files twice (which is still taking 25% of the
|
|
|
+ time), the first pass will now collect newline information for the
|
|
|
+ second pass in a buffer called nl_buff[]. This buffer, and the
|
|
|
+ original token buffer now named TokenArray[], are managed by the file
|
|
|
+ buff.c, which implements a layer between stream.h and pass?.c. This
|
|
|
+ layer provides OpenText(), NextTextToken() and CloseText(), each
|
|
|
+ with a parameter telling which pass it is.
|
|
|
+
|
|
|
+06-Dec-1988 Dick Grune (dick) at dick
|
|
|
+ As an introduction to removing the second pass altogether, the
|
|
|
+ first and second scan were unified, i.e., their input is identical.
|
|
|
+ This also means that the call sim -[12] has now been replaced by
|
|
|
+ one call: sim --.
|
|
|
+
|
|
|
+23-Sep-1988 Dick Grune (dick) at dick
|
|
|
+ Dynamic allocation of line buffers in pass 3. This removes the
|
|
|
+ restriction on the page width.
|
|
|
+
|
|
|
+22-Sep-1988 Dick Grune (dick) at dick
|
|
|
+ In order to give better messages on incorrect calls to sim, the
|
|
|
+ whole option handling has been concentrated in a file option.c and
|
|
|
+ separated from the options and their messages themselves. See sim.c
|
|
|
+
|
|
|
+07-Sep-1988 Dick Grune (dick) at dick
|
|
|
+ For long text sequences (say hundreds of thousands of tokens),
|
|
|
+ the hashing is not really efficient any more since too many
|
|
|
+ spurious matches occur. Therefore, the forward reference table is
|
|
|
+ scanned a second time, eliminating from any chain all references to
|
|
|
+ runs that do not end in the same token. For the UNIX manuals this
|
|
|
+ reduced the number of matches from 91.9% to 1.9% (of which 0.06%
|
|
|
+ were genuine).
|
|
|
+
|
|
|
+30-Aug-1988 Dick Grune (dick) at dick
|
|
|
+ For compatibility, NextTop has been rewritten to yield true or
|
|
|
+ false and to accept a pointer to a run as a parameter.
|
|
|
+
|
|
|
+30-Aug-1988 Dick Grune (dick) at dick
|
|
|
+ When trying to find line-number and lseek position to beginnings
|
|
|
+ and ends of runs found, the whole set of runs was scanned for each
|
|
|
+ line in each file. Now only the runs belonging to that file are
|
|
|
+ scanned; to this end another linked list has been braided through
|
|
|
+ the data structures (tx_chunk).
|
|
|
+
|
|
|
+30-Aug-1988 Dick Grune (dick) at dick
|
|
|
+ The longest-common-substring algorithm was called much too often,
|
|
|
+ mainly because the forward references made by hashing suffered from
|
|
|
+ pollution. If you have say 1000 tokens and a hash range of say
|
|
|
+ 10000, about 5 % of the hashings will be false matches, i.e. 50
|
|
|
+ matches, which is quite a lot on a natural number of 2 to 3 matches.
|
|
|
+ Improved by doing a second check in make_forw_ref().
|
|
|
+
|
|
|
+12-Jun-1988 Dick Grune (dick) at dick
|
|
|
+ Installed a Lisp version supplied by Gertjan Akkerman.
|
|
|
+
|
|
|
+15-Jan-1988 Dick Grune (dick) at dick
|
|
|
+ Added register declarations all over the place.
|
|
|
+
|
|
|
+14-Jan-1988 Dick Grune (dick) at dick
|
|
|
+ It is often useful to match a piece of code exactly, especially
|
|
|
+ when function names (or, even more so, macro names) are involved.
|
|
|
+ What one would want is having all the letters in the text array,
|
|
|
+ but this is kind of hard, since each entry is one lexical item.
|
|
|
+ This means that under the -F option each letter is a lex item, and
|
|
|
+ normally each tag is a lex item; this requires two lex grammars in
|
|
|
+ one program; no good. So, on the -F flag we hash the identifier
|
|
|
+ into one lex item, which is hopefully characteristic enough. It
|
|
|
+ works.
|
|
|
+
|
|
|
+30-Sep-1987 Dick Grune (dick) at dick
|
|
|
+ Some cosmetics.
|
|
|
+
|
|
|
+31-Aug-1987 Dick Grune (dick) at dick
|
|
|
+ Moved the whole thing to the SUN (while testing on a VAX and a
|
|
|
+ MC68000)
|
|
|
+
|
|
|
+16-Aug-1987 Dick Grune (dick) at dick
|
|
|
+ The test program lang.c is no longer a main program, but rather a
|
|
|
+ subroutine called in main() in sim.c, through the command line
|
|
|
+ option -1 or -2.
|
|
|
+
|
|
|
+23-Apr-1987 Dick Grune (dick) at tjalk
|
|
|
+ Changed the name 'index' into 'elem_of', because of compatibility
|
|
|
+ problems on different Unices. Added a declaration for it in
|
|
|
+ the file algollike.c
|
|
|
+
|
|
|
+10-Mar-1987 Dick Grune (dick) at tjalk
|
|
|
+ Changed the printing of the header of a run so that:
|
|
|
+ - long file names will no longer be truncated
|
|
|
+ - the run length is displayed
|
|
|
+
|
|
|
+27-Jan-1987 Dick Grune (dick) at tjalk
|
|
|
+ Switched it right off again! Getting them in textual order is
|
|
|
+ still more unpleasant, since now you cannot find the important
|
|
|
+ ones if their are more than a few runs.
|
|
|
+
|
|
|
+27-Jan-1987 Dick Grune (dick) at tjalk
|
|
|
+ Going to experiment with leaving out the sorting; just all the
|
|
|
+ runs, in the order we meet them. Should be as good or better.
|
|
|
+ Comparisons of more than 100 runs are very rare anyway, so the
|
|
|
+ fact that those over a 100 are rejected is probably no great
|
|
|
+ help. Getting them in a funny order is a nuisance, however. Down
|
|
|
+ with featurism. Just to be safe, present version saved as
|
|
|
+ 870127.SV
|
|
|
+
|
|
|
+26-Dec-1986 Dick Grune (dick) at tjalk
|
|
|
+ Names of overall parameters in params.h changed to more uniformity.
|
|
|
+
|
|
|
+26-Dec-1986 Dick Grune (dick) at tjalk
|
|
|
+ Since the top package and the instantiation system have grown
|
|
|
+ apart so much, I have integrated the old top package into sim,
|
|
|
+ i.e., done the instantiation by hand. This removes top.g and
|
|
|
+ top.p, and will save outsiders from wondering what is going on
|
|
|
+ here.
|
|
|
+
|
|
|
+23-Dec-1986 Dick Grune (dick) at tjalk
|
|
|
+ Use setbuf to print unbuffered while reading the files (lex core
|
|
|
+ dumps, other mishaps) and print buffered while printing the real
|
|
|
+ output (for speed).
|
|
|
+
|
|
|
+30-Nov-1986 Dick Grune (dick) at tjalk
|
|
|
+ Various small changes in *lang.l:
|
|
|
+ ; ignored conditionally (!options['f'])
|
|
|
+ new format for tokens in struct idf
|
|
|
+ cosmetics: macro Layout, macro UnsafeComChar, no \n
|
|
|
+ in character denotations, more than one char
|
|
|
+ in a char denotations in Pascal, etc.
|
|
|
+
|
|
|
+30-Nov-1986 Dick Grune (dick) at tjalk
|
|
|
+ Added a Modula-2 version.
|
|
|
+
|
|
|
+29-Nov-1986 Dick Grune (dick) at tjalk
|
|
|
+ Restricting tokens to the ASCII95 character set is really too
|
|
|
+ severe: some languages have many more reserved words (COBOL!).
|
|
|
+ Corrected this by adding a couple of '&0377' in strategic places.
|
|
|
+ Added a routine for printing the 8-bit beasties: show_token().
|
|
|
+
|
|
|
+15-Aug-1986 Dick Grune (dick) at tjalk
|
|
|
+ Since the ; is superfluous in both C and Pascal, it is now ignored
|
|
|
+ by clang.l and pascallang.l
|
|
|
+
|
|
|
+15-Aug-1986 Dick Grune (dick) at tjalk
|
|
|
+ The code in CheckRun in Xlang.l was incorrect in that it used the
|
|
|
+ wrong criterion for throwing away trailing garbage. I've taken
|
|
|
+ CheckRun etc. out of the Xlang.l-s and turned them into a module
|
|
|
+ "algollike.c". Made a cleaner interface and avoided duplication of
|
|
|
+ code.
|
|
|
+
|
|
|
+02-Jul-1986 Dick Grune (dick) at tjalk
|
|
|
+ Looking backwards in compare.c to see if we are in the middle of a
|
|
|
+ run is an atavism. You can be and still be all right, e.g., if
|
|
|
+ part of the run was rejected as not fitting for a function.
|
|
|
+ Removed from compare.c.
|
|
|
+
|
|
|
+10-Jun-1986 Dick Grune (dick) at tjalk
|
|
|
+ The function hash_code() in hash.c could yield a negative value;
|
|
|
+ corrected.
|
|
|
+
|
|
|
+09-Jun-1986 Dick Grune (dick) at tjalk
|
|
|
+ Changed the name of the file text.h to sim.h. Sim.h is more
|
|
|
+ appropriate and text.h sounds as if it belongs to text.l, with
|
|
|
+ which it has no connection.
|
|
|
+
|
|
|
+04-Jun-1986 Dick Grune (dick) at tjalk
|
|
|
+ After having looked at a couple of hash functions and having done
|
|
|
+ some calculations on the number of duplicates normally encountered
|
|
|
+ in hash functions, I conclude that our function in hash.c is quite
|
|
|
+ good. Removed all the statistics-gathering stuff.
|
|
|
+
|
|
|
+ Actually, hash_table[] is not the hash table at all; it is a
|
|
|
+ forward reference table; likewise, the real hash table was called
|
|
|
+ last[]. Renamed both.
|
|
|
+
|
|
|
+ There is a way to keep the hash table local without putting it on
|
|
|
+ the stack: use malloc().
|
|
|
+
|
|
|
+02-Jun-1986 Dick Grune (dick) at tjalk
|
|
|
+ Added a simple lex file for text: each word is condensed into a
|
|
|
+ hash code which is mapped on the ASCII95 character set. This
|
|
|
+ turns out to be quite effective.
|
|
|
+
|
|
|
+01-Jun-1986 Dick Grune (dick) at tjalk
|
|
|
+ The macros cput(tk) and c_eol() both have a return in them, so any
|
|
|
+ code after them may not be executed -> they have to be last in an
|
|
|
+ entry. But they weren't, in many places; I can't imagine why it
|
|
|
+ all worked nevertheless. They have been renamed return_tk(tk) and
|
|
|
+ return_eol() and the entries have been restructured.
|
|
|
+
|
|
|
+30-May-1986 Dick Grune (dick) at tjalk
|
|
|
+ Moved the string and character entries in clang.l and pascallang.l
|
|
|
+ to a place behind the comment entries, to avoid strings (and
|
|
|
+ characters) being recognized inside comments. I first thought
|
|
|
+ this would not happen, but as Maarten pointed out, if both
|
|
|
+ interpretations have the same length, lex will take the first
|
|
|
+ entry. Now this will happen if the string occupies the whole line
|
|
|
+ that would otherwise be taken as a comment. In short,
|
|
|
+ /*
|
|
|
+ "hallo"
|
|
|
+ */
|
|
|
+ would return ".
|
|
|
+
|
|
|
+28-May-1986 Dick Grune (dick) at tjalk
|
|
|
+ Added -d option, to display the output in diff(1) format (courtesy
|
|
|
+ of Maarten van der Meulen).
|
|
|
+ Rewrote the lexical parsing of comments (likewise courtesy Maarten
|
|
|
+ van der Meulen).
|
|
|
+
|
|
|
+20-May-1986 Dick Grune (dick) at tjalk
|
|
|
+ Added a routine to convert identifiers to lower case in
|
|
|
+ pascallang.l .
|
|
|
+
|
|
|
+19-May-1986 Dick Grune (dick) at tjalk
|
|
|
+ Added -a option, to quickly check antecedent of a file (courtesy
|
|
|
+ of Maarten van der Meulen).
|
|
|
+
|
|
|
+18-May-1986 Dick Grune (dick) at tjalk
|
|
|
+ Brought everything under RCS/CVS.
|
|
|
+
|
|
|
+18-Mar-1986 Dick Grune (dick) at tjalk
|
|
|
+ Added modifications by Paul Bame (hp-lsd!paul@hp-labs) to have an
|
|
|
+ option -w to set the page width.
|
|
|
+
|
|
|
+21-Feb-1986 Dick Grune (dick) at tjalk
|
|
|
+ Took array last[N_HASH] out of make_hash() in hash.c, due to stack
|
|
|
+ overflow on the Gould (reported by George Walker
|
|
|
+ [email protected])
|
|
|
+
|
|
|
+16-Feb-1986 Dick Grune (dick) at tjalk
|
|
|
+ Corrected some subtractions that caused unsigned ints to turn
|
|
|
+ pseudo-negative. (Reported by jaap@mcvax)
|
|
|
+
|
|
|
+11-Jan-1986 Dick Grune (dick) at tjalk
|
|
|
+ Touched up for distribution.
|
|
|
+
|
|
|
+10-Jan-1986 Dick Grune (dick) at tjalk
|
|
|
+ Fill_line was not called for empty lines, which caused them to be
|
|
|
+ printed as repetitions of the previous line.
|
|
|
+
|
|
|
+24-Dec-1985 Dick Grune (dick) at tjalk
|
|
|
+ Reduced hash table to a single array of indices; it is used only
|
|
|
+ in one place, which makes it very easy to make it (the hash table)
|
|
|
+ optional. General tune-up of everything. This seems to be
|
|
|
+ another stable "final" version.
|
|
|
+
|
|
|
+14-Dec-1985 Dick Grune (dick) at tjalk
|
|
|
+ Some experiments with hash formulas:
|
|
|
+ h = (h OP CST) + *p++ OP CST yields right wrong
|
|
|
+ * 96 - 32 205 562
|
|
|
+ * 96 - 2 205 560
|
|
|
+ * 96 205 560
|
|
|
+ * 97 205 559
|
|
|
+ << 0 66 3128
|
|
|
+ << 1 203 555
|
|
|
+ << 2 205 536
|
|
|
+ << 7 203 540
|
|
|
+ Conclusion: it doesn't matter, unless you do it wrong.
|
|
|
+
|
|
|
+01-Oct-1983 Dic8k Grune (dick) at vu44
|
|
|
+ Oldest known files.
|
|
|
+
|
|
|
+# This file is part of the software similarity tester SIM.
|
|
|
+# Written by Dick Grune, Vrije Universiteit, Amsterdam.
|
|
|
+# $Id: ChangeLog,v 2.12 2007/08/27 09:57:30 dick Exp $
|
|
|
+#
|