| 1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586878889909192939495 |
- -----------Summary
- Charset Detector - as the name says - is a stand alone component for automatic charset detection of a given text.
- It can be useful for internationalisation support in multilingual applications such as web-script editors or Unicode editors.
- Given input buffer will be analysed to guess used encoding. The result can be used as control parameter for charset conversation procedure.
- Based on Mozilla's i18n component - https://dxr.mozilla.org/mozilla/source/extensions/universalchardet/.
- -----------State
- Version 0.2.9 stable.
- Copyright (C) 2011-2019 Alexander Koblov
- The latest version can be found at https://github.com/doublecmd/doublecmd/tree/master/components/chsdet/.
- -----------Original
- Based on
- Charset Detector - http://chsdet.sourceforge.net
- Copyright (C) 2006-2013 Nikolaj Yakowlew
- -----------Requirements
- Charset Detector doesn't need any external components.
- -----------Output
- As result you will get guessed charset as MS Windows Code Page id and charset name.
- -----------Licence
- Charset Detector is open source project and distributed under GNU LGPL.
- See the GNU Lesser General Public License for more details - https://opensource.org/licenses/LGPL-2.1
- -----------Supported charsets
- +-----------+---------------------------+------------------------+
- | Code pade | Name | Note |
- +-----------+---------------------------+------------------------+
- | 0 | ASCII | Pseudo code page. |
- | 855 | IBM855 | |
- | 866 | IBM866 | |
- | 932 | Shift_JIS | |
- | 950 | Big5 | |
- | 1200 | UTF-16LE | |
- | 1201 | UTF-16BE | |
- | 1251 | windows-1251 | |
- | 1252 | windows-1252 | |
- | 1253 | windows-1253 | |
- | 1255 | windows-1255 | |
- | 10007 | x-mac-cyrillic | |
- | 12000 | X-ISO-10646-UCS-4-2143 | |
- | 12000 | UTF-32LE | |
- | 12001 | X-ISO-10646-UCS-4-3412 | |
- | 12001 | UTF-32BE | |
- | 20866 | KOI8-R | |
- | 28595 | ISO-8859-5 | |
- | 28595 | ISO-8859-5 | |
- | 28597 | ISO-8859-7 | |
- | 28598 | ISO-8859-8 | |
- | 50222 | ISO-2022-JP | |
- | 50225 | ISO-2022-KR | |
- | 50227 | ISO-2022-CN | |
- | 51932 | EUC-JP | |
- | 51936 | x-euc-tw | |
- | 51949 | EUC-KR | |
- | 52936 | HZ-GB-2312 | |
- | 54936 | GB18030 | |
- | 65001 | UTF-8 | |
- +-----------+---------------------------+------------------------+
- -----------Types
- Return values
- NS_OK = 0;
- NS_ERROR_OUT_OF_MEMORY = $8007000e;
- Returned types
- rCharsetInfo = record
- Name: PAnsiChar; // Charset GNU canonical name
- CodePage: Integer; // MS Windows CodePage ID
- Language: PAnsiChar;
- end;
- -----------Usage sample
- Below is a small usage sample in Free Pascal.
- function DetectEncoding(const S: String): rCharsetInfo;
- var
- Detector: TnsUniversalDetector;
- begin
- Detector:= TnsUniversalDetector.Create;
- try
- Detector.Reset;
- Detector.HandleData(PAnsiChar(S), Length(S));
- if not Detector.Done then Detector.DataEnd;
- Result:= Detector.GetDetectedCharsetInfo;
- finally
- FreeAndNil(Detector);
- end;
- end;
|