ReadMe.txt

-----------Summary
Charset Detector - as the name says - is a stand-alone component for automatic charset detection of a given text.
It can be useful for internationalisation support in multilingual applications such as web-script editors or Unicode editors.
The given input buffer is analysed to guess its encoding. The result can be used as a control parameter for a charset conversion procedure.
Based on Mozilla's i18n component - https://dxr.mozilla.org/mozilla/source/extensions/universalchardet/.

-----------State
Version 0.2.9 stable.
Copyright (C) 2011-2019 Alexander Koblov
The latest version can be found at https://github.com/doublecmd/doublecmd/tree/master/components/chsdet/.

-----------Original
Based on
Charset Detector - http://chsdet.sourceforge.net
Copyright (C) 2006-2013 Nikolaj Yakowlew

-----------Requirements
Charset Detector doesn't need any external components.

-----------Output
As a result you get the guessed charset as an MS Windows code page ID and a charset name.

-----------Licence
Charset Detector is an open source project distributed under the GNU LGPL.
See the GNU Lesser General Public License for more details - https://opensource.org/licenses/LGPL-2.1

-----------Supported charsets

+-----------+---------------------------+------------------------+
| Code page | Name                      | Note                   |
+-----------+---------------------------+------------------------+
| 0         | ASCII                     | Pseudo code page.      |
| 855       | IBM855                    |                        |
| 866       | IBM866                    |                        |
| 932       | Shift_JIS                 |                        |
| 950       | Big5                      |                        |
| 1200      | UTF-16LE                  |                        |
| 1201      | UTF-16BE                  |                        |
| 1251      | windows-1251              |                        |
| 1252      | windows-1252              |                        |
| 1253      | windows-1253              |                        |
| 1255      | windows-1255              |                        |
| 10007     | x-mac-cyrillic            |                        |
| 12000     | X-ISO-10646-UCS-4-2143    |                        |
| 12000     | UTF-32LE                  |                        |
| 12001     | X-ISO-10646-UCS-4-3412    |                        |
| 12001     | UTF-32BE                  |                        |
| 20866     | KOI8-R                    |                        |
| 28595     | ISO-8859-5                |                        |
| 28597     | ISO-8859-7                |                        |
| 28598     | ISO-8859-8                |                        |
| 50222     | ISO-2022-JP               |                        |
| 50225     | ISO-2022-KR               |                        |
| 50227     | ISO-2022-CN               |                        |
| 51932     | EUC-JP                    |                        |
| 51936     | x-euc-tw                  |                        |
| 51949     | EUC-KR                    |                        |
| 52936     | HZ-GB-2312                |                        |
| 54936     | GB18030                   |                        |
| 65001     | UTF-8                     |                        |
+-----------+---------------------------+------------------------+

-----------Types
Return values

NS_OK = 0;
NS_ERROR_OUT_OF_MEMORY = $8007000e;

Returned types

rCharsetInfo = record
  Name: PAnsiChar;      // Charset GNU canonical name
  CodePage: Integer;    // MS Windows CodePage ID
  Language: PAnsiChar;
end;
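
The record above can be turned into a readable description with a few lines of code. The following is a minimal sketch, not part of the component: DescribeCharset is a hypothetical helper, the record is redeclared locally so the program compiles on its own, and the nil check on Name is a defensive assumption about what the detector may return when it has no guess.

```pascal
program CharsetInfoDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

type
  // Redeclared here from the Types section so the sketch is self-contained.
  rCharsetInfo = record
    Name: PAnsiChar;      // Charset GNU canonical name
    CodePage: Integer;    // MS Windows CodePage ID
    Language: PAnsiChar;
  end;

// Hypothetical helper (not part of the component): formats a detection result.
function DescribeCharset(const Info: rCharsetInfo): String;
begin
  // Assumption: a nil Name means that no charset was guessed.
  if Info.Name = nil then
    Exit('unknown');
  Result := Format('%s (code page %d)', [String(Info.Name), Info.CodePage]);
  // Code page 0 is the ASCII pseudo code page from the table of supported charsets.
  if Info.CodePage = 0 then
    Result := Result + ', pseudo code page';
end;

var
  Info: rCharsetInfo;
begin
  // Filled in by hand here; in real use the detector returns this record.
  Info.Name := 'windows-1251';
  Info.CodePage := 1251;
  Info.Language := nil;
  WriteLn(DescribeCharset(Info));
end.
```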

-----------Usage sample

Below is a small usage sample in Free Pascal.

// FreeAndNil requires the SysUtils unit.
function DetectEncoding(const S: String): rCharsetInfo;
var
  Detector: TnsUniversalDetector;
begin
  Detector:= TnsUniversalDetector.Create;
  try
    Detector.Reset;
    // Feed the raw bytes of S to the detector.
    Detector.HandleData(PAnsiChar(S), Length(S));
    // Signal the end of input unless the detector is already done.
    if not Detector.Done then Detector.DataEnd;
    Result:= Detector.GetDetectedCharsetInfo;
  finally
    FreeAndNil(Detector);
  end;
end;
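
The detected code page can then drive a conversion step. The sketch below is one possible approach and not part of the component: ToUtf8 is a hypothetical helper that relies on the Free Pascal RTL procedure SetCodePage, and it assumes FPC 3.0 or later. It only suits byte-oriented code pages that the RTL's string manager supports; UTF-16 and UTF-32 results need separate handling. On Unix the cwstring unit is assumed to provide the conversion tables.

```pascal
program ConvertDemo;
{$mode objfpc}{$H+}

uses
  {$IFDEF UNIX}cwstring,{$ENDIF}  // code page conversion support on Unix
  SysUtils;

// Hypothetical helper: treats Raw as text in CodePage and converts it to UTF-8.
function ToUtf8(const Raw: RawByteString; CodePage: Integer): RawByteString;
begin
  Result := Raw;
  // Code page 0 is the ASCII pseudo code page; ASCII bytes are already valid UTF-8.
  if (CodePage = 0) or (CodePage = CP_UTF8) then
  begin
    SetCodePage(Result, CP_UTF8, False);  // relabel only, bytes stay untouched
    Exit;
  end;
  SetCodePage(Result, TSystemCodePage(CodePage), False);  // label with the detected code page
  SetCodePage(Result, CP_UTF8, True);                     // convert the bytes to UTF-8
end;

var
  Converted: RawByteString;
begin
  // Six windows-1251 bytes; 1251 stands in for the CodePage field of a detection result.
  Converted := ToUtf8(#$CF#$F0#$E8#$E2#$E5#$F2, 1251);
  WriteLn('UTF-8 length in bytes: ', Length(Converted));
end.
```

In real use the second argument would come from the CodePage field of the record returned by DetectEncoding.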