-----------Summary
Charset Detector, as the name says, is a standalone component for automatic charset detection of a given text.
It can be useful for internationalisation support in multilingual applications such as web-script editors or Unicode editors.
The given input buffer is analysed to guess the encoding used. The result can be used as a control parameter for a charset conversion procedure.
Based on Mozilla's i18n component - https://dxr.mozilla.org/mozilla/source/extensions/universalchardet/.

-----------State
Version 0.2.9 stable.
Copyright (C) 2011-2019 Alexander Koblov
The latest version can be found at https://github.com/doublecmd/doublecmd/tree/master/components/chsdet/.

-----------Original
Based on
Charset Detector - http://chsdet.sourceforge.net
Copyright (C) 2006-2013 Nikolaj Yakowlew

-----------Requirements
Charset Detector doesn't need any external components.

-----------Output
As a result you get the guessed charset as an MS Windows code page ID and a charset name.
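
The code page ID can drive a conversion step. Below is a minimal sketch in Free Pascal, not part of Charset Detector itself: BytesToUTF8 is a hypothetical helper name, and the conversion relies on the Free Pascal RTL procedure SetCodePage. It assumes a single-byte or multi-byte ANSI code page; UTF-16 and UTF-32 results (1200, 1201, 12000, 12001) would need separate handling, and whether a given code page can be converted depends on the platform's conversion backend.

```pascal
program ConvertSketch;
{$mode objfpc}{$H+}

uses
  {$IFDEF UNIX}cwstring,{$ENDIF} // widestring manager, needed for conversions on Unix
  SysUtils;

// Hypothetical helper (an assumption, not part of Charset Detector):
// treat Raw as text in DetectedCodePage and return it converted to UTF-8.
function BytesToUTF8(const Raw: RawByteString; DetectedCodePage: Integer): RawByteString;
begin
  Result := Raw;
  // Code page 0 is the "ASCII" pseudo code page; ASCII bytes are already valid UTF-8.
  if (DetectedCodePage = 0) or (DetectedCodePage = CP_UTF8) then
  begin
    SetCodePage(Result, CP_UTF8, False); // relabel only, bytes stay untouched
    Exit;
  end;
  // Label the bytes with the detected code page without changing them...
  SetCodePage(Result, TSystemCodePage(DetectedCodePage), False);
  // ...then let the RTL convert them to UTF-8.
  SetCodePage(Result, CP_UTF8, True);
end;

var
  Koi8: RawByteString;
begin
  // The Russian word for "hello" encoded in KOI8-R (code page 20866)
  Koi8 := #$F0#$D2#$C9#$D7#$C5#$D4;
  WriteLn(BytesToUTF8(Koi8, 20866));
end.
```

What appears on screen depends on the console encoding; the point is only that the detected code page ID is passed straight to the conversion routine.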

-----------Licence
Charset Detector is an open source project distributed under the GNU LGPL.
See the GNU Lesser General Public License for more details - https://opensource.org/licenses/LGPL-2.1

-----------Supported charsets
+-----------+---------------------------+------------------------+
| Code page | Name                      | Note                   |
+-----------+---------------------------+------------------------+
| 0         | ASCII                     | Pseudo code page.      |
| 855       | IBM855                    |                        |
| 866       | IBM866                    |                        |
| 932       | Shift_JIS                 |                        |
| 950       | Big5                      |                        |
| 1200      | UTF-16LE                  |                        |
| 1201      | UTF-16BE                  |                        |
| 1251      | windows-1251              |                        |
| 1252      | windows-1252              |                        |
| 1253      | windows-1253              |                        |
| 1255      | windows-1255              |                        |
| 10007     | x-mac-cyrillic            |                        |
| 12000     | X-ISO-10646-UCS-4-2143    |                        |
| 12000     | UTF-32LE                  |                        |
| 12001     | X-ISO-10646-UCS-4-3412    |                        |
| 12001     | UTF-32BE                  |                        |
| 20866     | KOI8-R                    |                        |
| 28595     | ISO-8859-5                |                        |
| 28597     | ISO-8859-7                |                        |
| 28598     | ISO-8859-8                |                        |
| 50222     | ISO-2022-JP               |                        |
| 50225     | ISO-2022-KR               |                        |
| 50227     | ISO-2022-CN               |                        |
| 51932     | EUC-JP                    |                        |
| 51936     | x-euc-tw                  |                        |
| 51949     | EUC-KR                    |                        |
| 52936     | HZ-GB-2312                |                        |
| 54936     | GB18030                   |                        |
| 65001     | UTF-8                     |                        |
+-----------+---------------------------+------------------------+

-----------Types
Return values
  NS_OK = 0;
  NS_ERROR_OUT_OF_MEMORY = $8007000e;
Returned types
  rCharsetInfo = record
    Name: PAnsiChar;    // Charset GNU canonical name
    CodePage: Integer;  // MS Windows CodePage ID
    Language: PAnsiChar;
  end;

-----------Usage sample
Below is a small usage sample in Free Pascal.

function DetectEncoding(const S: String): rCharsetInfo;
var
  Detector: TnsUniversalDetector;
begin
  Detector:= TnsUniversalDetector.Create;
  try
    Detector.Reset;
    // Feed the whole buffer to the detector
    Detector.HandleData(PAnsiChar(S), Length(S));
    // Signal the end of input if detection has not finished yet
    if not Detector.Done then Detector.DataEnd;
    Result:= Detector.GetDetectedCharsetInfo;
  finally
    FreeAndNil(Detector); // FreeAndNil is declared in SysUtils
  end;
end;
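
To show how the result record might be consumed, here is a sketch of a small command-line program that reads a file and prints the guessed charset. The unit names nsCore and nsUniversalDetector in the uses clause are assumptions about where rCharsetInfo and TnsUniversalDetector are declared; adjust them to match the sources you build against. DetectEncoding is repeated from the sample above so that the program is complete.

```pascal
program DetectFile;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils,
  nsCore, nsUniversalDetector; // ASSUMED unit names, check your copy of the sources

// Same function as in the usage sample above
function DetectEncoding(const S: String): rCharsetInfo;
var
  Detector: TnsUniversalDetector;
begin
  Detector := TnsUniversalDetector.Create;
  try
    Detector.Reset;
    Detector.HandleData(PAnsiChar(S), Length(S));
    if not Detector.Done then Detector.DataEnd;
    Result := Detector.GetDetectedCharsetInfo;
  finally
    FreeAndNil(Detector);
  end;
end;

var
  Stream: TFileStream;
  Data: String;
  Info: rCharsetInfo;
begin
  if ParamCount < 1 then
  begin
    WriteLn('Usage: detectfile <file>');
    Halt(1);
  end;

  // Read the whole file into a byte string
  Stream := TFileStream.Create(ParamStr(1), fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Data, Stream.Size);
    if Length(Data) > 0 then
      Stream.ReadBuffer(Data[1], Length(Data));
  finally
    Stream.Free;
  end;

  // Print the two fields described in the Output section
  Info := DetectEncoding(Data);
  WriteLn('Charset:   ', Info.Name);
  WriteLn('Code page: ', Info.CodePage);
end.
```

Reading the whole file at once keeps the sketch short; for large inputs a prefix of the file may be enough for a guess, at the cost of accuracy.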