diff --git a/LICENCE b/LICENCE new file mode 100644 index 0000000..ac18870 --- /dev/null +++ b/LICENCE @@ -0,0 +1,9 @@ +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY +EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES +OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT +SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, +SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT +OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR +TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, +EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. \ No newline at end of file diff --git a/LICENCE_icu b/LICENCE_icu new file mode 100644 index 0000000..dab93c1 --- /dev/null +++ b/LICENCE_icu @@ -0,0 +1,307 @@ + + + + +ICU License - ICU 1.8.1 and later + + + +

ICU License - ICU 1.8.1 and later

+ +

COPYRIGHT AND PERMISSION NOTICE

+ +

+Copyright (c) 1995-2012 International Business Machines Corporation and others +

+

+All rights reserved. +

+

+Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), +to deal in the Software without restriction, including without limitation +the rights to use, copy, modify, merge, publish, distribute, and/or sell +copies of the Software, and to permit persons +to whom the Software is furnished to do so, provided that the above +copyright notice(s) and this permission notice appear in all copies +of the Software and that both the above copyright notice(s) and this +permission notice appear in supporting documentation. +

+

+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, +INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A +PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL +THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, +OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER +RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, +NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE +USE OR PERFORMANCE OF THIS SOFTWARE. +

+

+Except as contained in this notice, the name of a copyright holder shall not be +used in advertising or otherwise to promote the sale, use or other dealings in +this Software without prior written authorization of the copyright holder. +

+ +
+

+All trademarks and registered trademarks mentioned herein are the property of their respective owners. +

+ +
+ +

Third-Party Software Licenses

+This section contains third-party software notices and/or additional terms for licensed +third-party software components included within ICU libraries. + +

1. Unicode Data Files and Software

+ +

EXHIBIT 1
+UNICODE, INC. LICENSE AGREEMENT - DATA FILES AND SOFTWARE

+
+

Unicode Data Files include all data files under the directories +http://www.unicode.org/Public/, +http://www.unicode.org/reports/, +and + +http://www.unicode.org/cldr/data/. Unicode Data Files do not include PDF online code charts under the directory http://www.unicode.org/Public/. Software includes any source code +published in the Unicode Standard or under the directories http://www.unicode.org/Public/, +http://www.unicode.org/reports/, +and + +http://www.unicode.org/cldr/data/.

+ +

NOTICE TO USER: Carefully read the following legal agreement. BY DOWNLOADING, INSTALLING, COPYING OR OTHERWISE USING UNICODE INC.'S DATA FILES ("DATA FILES"), AND/OR SOFTWARE ("SOFTWARE"), YOU UNEQUIVOCALLY ACCEPT, AND AGREE TO BE BOUND BY, ALL OF THE TERMS AND CONDITIONS OF THIS AGREEMENT. IF YOU DO NOT AGREE, DO NOT DOWNLOAD, INSTALL, COPY, DISTRIBUTE OR USE THE DATA FILES OR SOFTWARE.

+

COPYRIGHT AND PERMISSION NOTICE

+ +

Copyright © 1991-2012 Unicode, Inc. All rights reserved. Distributed under the Terms of Use in +http://www.unicode.org/copyright.html.

+ +

Permission is hereby granted, free of charge, to any person obtaining a copy of the Unicode data files and +any associated documentation (the "Data Files") or Unicode software and any associated documentation (the "Software") to deal in the Data Files or Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sell copies of the Data Files or Software, and to permit persons to whom the Data Files or Software are furnished to do so, provided that (a) the above copyright notice(s) and this permission notice appear +with all copies of the Data Files or Software, (b) both the above copyright notice(s) and this permission notice appear in associated documentation, and (c) there is clear notice in each modified Data File or in the Software as well as in the documentation associated with the Data File(s) or Software that the data or software has been modified.

+ +

THE DATA FILES AND SOFTWARE ARE PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR +PERFORMANCE OF THE DATA FILES OR SOFTWARE.

+ +

Except as contained in this notice, the name of a copyright holder shall not be used in advertising or otherwise to promote the sale, use or other dealings in these Data Files or Software without prior written authorization of the copyright holder.

+ +
+ +

Unicode and the Unicode logo are trademarks of Unicode, Inc. in the United States and other countries. All third party trademarks referenced herein are the property of their respective owners.

+ + +
+ +

2. Chinese/Japanese Word Break Dictionary Data (cjdict.txt)

+
+ #    The Google Chrome software developed by Google is licensed under the BSD license. Other software included in this distribution is provided under other licenses, as set forth below.
+ #  
+ #  The BSD License
+ #  http://opensource.org/licenses/bsd-license.php 
+ #  Copyright (C) 2006-2008, Google Inc.
+ #  
+ #  All rights reserved.
+ #  
+ #  Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
+ #  
+ #  Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
+ #  Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
+ #  Neither the name of  Google Inc. nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
+ #   
+ #  
+ #  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ #  
+ #                                               
+ #  The word list in cjdict.txt are generated by combining three word lists listed
+ #  below with further processing for compound word breaking. The frequency is generated
+ #  with an iterative training against Google web corpora. 
+ #  
+ #  * Libtabe (Chinese)
+ #    - https://sourceforge.net/project/?group_id=1519
+ #    - Its license terms and conditions are shown below.
+ #  
+ #  * IPADIC (Japanese)
+ #    - http://chasen.aist-nara.ac.jp/chasen/distribution.html
+ #    - Its license terms and conditions are shown below.
+ #  
+ #  ---------COPYING.libtabe ---- BEGIN--------------------
+ #  
+ #  /*
+ #   * Copyrighy (c) 1999 TaBE Project.
+ #   * Copyright (c) 1999 Pai-Hsiang Hsiao.
+ #   * All rights reserved.
+ #   *
+ #   * Redistribution and use in source and binary forms, with or without
+ #   * modification, are permitted provided that the following conditions
+ #   * are met:
+ #   *
+ #   * . Redistributions of source code must retain the above copyright
+ #   *   notice, this list of conditions and the following disclaimer.
+ #   * . Redistributions in binary form must reproduce the above copyright
+ #   *   notice, this list of conditions and the following disclaimer in
+ #   *   the documentation and/or other materials provided with the
+ #   *   distribution.
+ #   * . Neither the name of the TaBE Project nor the names of its
+ #   *   contributors may be used to endorse or promote products derived
+ #   *   from this software without specific prior written permission.
+ #   *
+ #   * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ #   * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ #   * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ #   * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+ #   * REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+ #   * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ #   * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ #   * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ #   * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ #   * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ #   * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
+ #   * OF THE POSSIBILITY OF SUCH DAMAGE.
+ #   */
+ #  
+ #  /*
+ #   * Copyright (c) 1999 Computer Systems and Communication Lab,
+ #   *                    Institute of Information Science, Academia Sinica.
+ #   * All rights reserved.
+ #   *
+ #   * Redistribution and use in source and binary forms, with or without
+ #   * modification, are permitted provided that the following conditions
+ #   * are met:
+ #   *
+ #   * . Redistributions of source code must retain the above copyright
+ #   *   notice, this list of conditions and the following disclaimer.
+ #   * . Redistributions in binary form must reproduce the above copyright
+ #   *   notice, this list of conditions and the following disclaimer in
+ #   *   the documentation and/or other materials provided with the
+ #   *   distribution.
+ #   * . Neither the name of the Computer Systems and Communication Lab
+ #   *   nor the names of its contributors may be used to endorse or
+ #   *   promote products derived from this software without specific
+ #   *   prior written permission.
+ #   *
+ #   * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ #   * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ #   * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ #   * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+ #   * REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+ #   * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ #   * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ #   * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ #   * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ #   * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ #   * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
+ #   * OF THE POSSIBILITY OF SUCH DAMAGE.
+ #   */
+ #  
+ #  Copyright 1996 Chih-Hao Tsai @ Beckman Institute, University of Illinois
+ #  c-tsai4@uiuc.edu  http://casper.beckman.uiuc.edu/~c-tsai4
+ #  
+ #  ---------------COPYING.libtabe-----END------------------------------------
+ #  
+ #  
+ #  ---------------COPYING.ipadic-----BEGIN------------------------------------
+ #  
+ #  Copyright 2000, 2001, 2002, 2003 Nara Institute of Science
+ #  and Technology.  All Rights Reserved.
+ #  
+ #  Use, reproduction, and distribution of this software is permitted.
+ #  Any copy of this software, whether in its original form or modified,
+ #  must include both the above copyright notice and the following
+ #  paragraphs.
+ #  
+ #  Nara Institute of Science and Technology (NAIST),
+ #  the copyright holders, disclaims all warranties with regard to this
+ #  software, including all implied warranties of merchantability and
+ #  fitness, in no event shall NAIST be liable for
+ #  any special, indirect or consequential damages or any damages
+ #  whatsoever resulting from loss of use, data or profits, whether in an
+ #  action of contract, negligence or other tortuous action, arising out
+ #  of or in connection with the use or performance of this software.
+ #  
+ #  A large portion of the dictionary entries
+ #  originate from ICOT Free Software.  The following conditions for ICOT
+ #  Free Software applies to the current dictionary as well.
+ #  
+ #  Each User may also freely distribute the Program, whether in its
+ #  original form or modified, to any third party or parties, PROVIDED
+ #  that the provisions of Section 3 ("NO WARRANTY") will ALWAYS appear
+ #  on, or be attached to, the Program, which is distributed substantially
+ #  in the same form as set out herein and that such intended
+ #  distribution, if actually made, will neither violate or otherwise
+ #  contravene any of the laws and regulations of the countries having
+ #  jurisdiction over the User or the intended distribution itself.
+ #  
+ #  NO WARRANTY
+ #  
+ #  The program was produced on an experimental basis in the course of the
+ #  research and development conducted during the project and is provided
+ #  to users as so produced on an experimental basis.  Accordingly, the
+ #  program is provided without any warranty whatsoever, whether express,
+ #  implied, statutory or otherwise.  The term "warranty" used herein
+ #  includes, but is not limited to, any warranty of the quality,
+ #  performance, merchantability and fitness for a particular purpose of
+ #  the program and the nonexistence of any infringement or violation of
+ #  any right of any third party.
+ #  
+ #  Each user of the program will agree and understand, and be deemed to
+ #  have agreed and understood, that there is no warranty whatsoever for
+ #  the program and, accordingly, the entire risk arising from or
+ #  otherwise connected with the program is assumed by the user.
+ #  
+ #  Therefore, neither ICOT, the copyright holder, or any other
+ #  organization that participated in or was otherwise related to the
+ #  development of the program and their respective officials, directors,
+ #  officers and other employees shall be held liable for any and all
+ #  damages, including, without limitation, general, special, incidental
+ #  and consequential damages, arising out of or otherwise in connection
+ #  with the use or inability to use the program or any product, material
+ #  or result produced or otherwise obtained by using the program,
+ #  regardless of whether they have been advised of, or otherwise had
+ #  knowledge of, the possibility of such damages at any time during the
+ #  project or thereafter.  Each user will be deemed to have agreed to the
+ #  foregoing by his or her commencement of use of the program.  The term
+ #  "use" as used herein includes, but is not limited to, the use,
+ #  modification, copying and distribution of the program and the
+ #  production of secondary products from the program.
+ #  
+ #  In the case where the program, whether in its original form or
+ #  modified, was distributed or delivered to or received by a user from
+ #  any person, organization or entity other than ICOT, unless it makes or
+ #  grants independently of ICOT any specific warranty to the user in
+ #  writing, such person, organization or entity, will also be exempted
+ #  from and not be held liable to the user for any such damages as noted
+ #  above as far as the program is concerned.
+ #  
+ #  ---------------COPYING.ipadic-----END------------------------------------
+
+ +

3. Time Zone Database

+

ICU uses the public domain data and code derived from +Time Zone Database for its time zone support. The ownership of the TZ database is explained +in BCP 175: Procedure for Maintaining the Time Zone +Database section 7.

+ +

+7.  Database Ownership
+
+   The TZ database itself is not an IETF Contribution or an IETF
+   document.  Rather it is a pre-existing and regularly updated work
+   that is in the public domain, and is intended to remain in the public
+   domain.  Therefore, BCPs 78 [RFC5378] and 79 [RFC3979] do not apply
+   to the TZ Database or contributions that individuals make to it.
+   Should any claims be made and substantiated against the TZ Database,
+   the organization that is providing the IANA Considerations defined in
+   this RFC, under the memorandum of understanding with the IETF,
+   currently ICANN, may act in accordance with all competent court
+   orders.  No ownership claims will be made by ICANN or the IETF Trust
+   on the database or the code.  Any person making a contribution to the
+   database or code waives all rights to future claims in that
+   contribution or in the TZ Database.
+
+
+ + + + \ No newline at end of file diff --git a/README.md b/README.md index f03ec1c..512a5fd 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,116 @@ -icu -=== +About +========== -Cgo binding for icu4c library \ No newline at end of file +Cgo binding for icu4c C library detection and conversion functions. Guaranteed compatibility with version 50.1. + +Installation +========== + +Installation consists of several simple steps. They may be a bit different on your target system (e.g. require more permissions) so adapt them to the parameters of your system. + +### Get icu4c C library code + +* Download original icu4c archive from [icu download section](http://site.icu-project.org/download). +* Unarchive it. + +NOTE: If this link is not working or there are some problems with downloading, there is a stable version 50.1 snapshot saved in [Downloads](https://github.com/downloads/goodsign/icu/icu4c-50_1-src.tgz). + +### Build and install icu4c C library + +From the directory, where you unarchived icu4c, run: + +``` +cd source +./configure +make +sudo make install +``` + +### Install Go wrapper + +``` +go get github.com/goodsign/icu +go test github.com/goodsign/icu (must PASS) +``` + +Installation notes +========== + +* Make sure that you have your local library paths set correctly and that installation was successful. Otherwise, **go build** or **go test** may fail. + +* icu4c is installed in your local library directory (e.g. **/usr/local/lib**) and puts its libraries there. This path should be registered in your system (using ldconfig or exporting LD_LIBRARY_PATH, etc.) or the linker would fail. + +* icu4c installs its header files to local include folders (e.g. **/usr/local/include/unicode**) so there is no need to have additional .h files with this package, but the system must be properly set up to detect .h files in those directories. + +Usage +========== + +Note: check icu documentation for returned encoding identifiers. 
+ +Detector +---------- + +```go +// Create detector +detector, err := NewCharsetDetector() + +if err != nil { + //... Handle error ... +} +defer detector.Close() + +// Guess encoding +encMatches, err := detector.GuessCharset(encodedText) + +if err != nil { + //... Handle error ... +} + +// Get charset with max confidence (goes first) +maxenc := encMatches[0].Charset + +// Use maxenc. +// ... +``` + +Converter +---------- + +```go +... + +// Create converter +converter := NewCharsetConverter(DefaultMaxTextSize) + +// Convert to utf-8 +converted, err := converter.ConvertToUtf8(encodedText, maxenc) + +if nil != err { + //... Handle error ... +} +``` + +Usage notes +========== + +* Check **NewCharsetConverter** func comments for details on the max text size parameter. +* The detector and converter are often used as a pair, so the 'converter' usage example continues the 'detector' example and uses its 'maxenc' result. + +More info +---------- + +For more information on ICU, refer to the original [website](http://site.icu-project.org/), which contains links to theory and other details. + +icu4c Licence +========== + +ICU is released under a nonrestrictive open source license that is suitable for use with both commercial software and with other open source or free software. 
+ +[LICENCE file](https://github.com/goodsign/icu/blob/master/LICENCE_icu) + +Licence +========== + +The goodsign/icu binding is released under the [BSD Licence](http://opensource.org/licenses/bsd-license.php) + +[LICENCE file](https://github.com/goodsign/icu/blob/master/LICENCE) \ No newline at end of file diff --git a/c_bridge.c b/c_bridge.c new file mode 100644 index 0000000..f4e2839 --- /dev/null +++ b/c_bridge.c @@ -0,0 +1,118 @@ +#include "c_bridge.h" +#include +#include +#include +#include +#include + +// See description in c_bridge.h +const int detectCharset(void *detector, + void *input, + int input_len, + int *status, + MatchData *matchBuffer, + int matchBufferSize) { + + // Put input bytes in the detector. + ucsdet_setText((UCharsetDetector*)detector, (char*)input, input_len, status); + if (*status != U_ZERO_ERROR) { + return 0; + } + + // Prepare vars for returned count and guesses. + int matchCount; + const UCharsetMatch **bestGuesses; + + // Perform analysis and return all guesses and their count. + bestGuesses = ucsdet_detectAll((UCharsetDetector*)detector, &matchCount, status); + if (*status != U_ZERO_ERROR) { + return 0; + } + + // Fill the matchBuffer. Its size is matchBufferSize, so it is filled with + // at most matchBufferSize entries. + int i; + int retCount = matchCount > matchBufferSize ? 
matchBufferSize : matchCount; + + for (i = 0; i < retCount; i++) { + + const UCharsetMatch* bestGuess = bestGuesses[i]; + const char *bestGuessedCharset = NULL; + const char *bestGuessedLanguage = NULL; + + // Fill guessed encoding + bestGuessedCharset = ucsdet_getName(bestGuess, status); + if (*status != U_ZERO_ERROR) { + return 0; + } + + // Fill guessed language + bestGuessedLanguage = ucsdet_getLanguage(bestGuess, status); + if (*status != U_ZERO_ERROR) { + return 0; + } + + // Fill its confidence rating + int32_t conf = ucsdet_getConfidence(bestGuess, status); + if (*status != U_ZERO_ERROR) { + return 0; + } + + matchBuffer[i].confidence = conf; + matchBuffer[i].charset = bestGuessedCharset; + matchBuffer[i].language = bestGuessedLanguage; + } + + // Return the number of guesses put into matchBuffer. + return retCount; +} + +// See description in c_bridge.h +int convertToUtf16(const char *srcEncoding, + UChar *dest, + int32_t destCapacity, + const char *src, + int32_t srcLength, + int *status){ + UConverter *conv; + + conv = ucnv_open(srcEncoding, status); + if (*status != U_ZERO_ERROR) { + return 0; + } + + /* Convert from original encoding to UTF-16 */ + int len = ucnv_toUChars(conv, dest, destCapacity, src, srcLength, status); + ucnv_close(conv); /* close on every path so the converter is not leaked on error */ + + if (*status != U_ZERO_ERROR) { + return 0; + } + + return len; +} + +// See description in c_bridge.h +int convertFromUtf16(const char *destEncoding, + char *dest, + int32_t destCapacity, + const UChar *src, + int32_t srcLength, + int *status){ + UConverter *conv; + + conv = ucnv_open(destEncoding, status); + if (*status != U_ZERO_ERROR) { + return 0; + } + + /* Convert from UTF-16 to destination encoding */ + int len = ucnv_fromUChars(conv, dest, destCapacity, src, srcLength, status); + ucnv_close(conv); /* close on every path so the converter is not leaked on error */ + + if (*status != U_ZERO_ERROR) { + return 0; + } + + return len; +} \ No newline at end of file diff --git a/c_bridge.h b/c_bridge.h new file mode 100644 index 0000000..f00b008 --- /dev/null +++ b/c_bridge.h 
@@ -0,0 +1,61 @@ +#ifndef __C_BRIDGE_H__ +#define __C_BRIDGE_H__ + +// C_BRIDGE is a bridge between go and native pure c functions used to +// operate with ICU library code. + +#include +#include + +// MatchData contains information about one 'guess' of the +// encoding detector. It contains the guessed charset (ICU string identifiers, +// see ICU documentation for them), the guessed language and a confidence +// coefficient, which is a number between 0 and 100 (100 is the best). +typedef struct MatchData { + const char* charset; + const char* language; + short int confidence; +} MatchData; + +// detectCharset performs the detection (guessing) operation using a given detector (ICU internals), +// input data (bytes), input length and error status pointer (see the ICU docs about error codes). +// +// After the detection is performed, all possible matches are put into the matchBuffer. If there are +// more results than matchBufferSize, then only matchBufferSize entries are put (so no overflow can +// ever happen). +// +// The results of this function are put into the matchBuffer, so it MUST NOT be called asynchronously. +// Caller should guarantee thread safety and perform locks while working with it. +const int detectCharset(void *detector, + void *input, + int input_len, + int *status, + MatchData *matchBuffer, + int matchBufferSize); + +// convertToUtf16 performs conversion from any encoding to utf16. Utf16 is the ICU standard so +// it is easier to convert to/from it. +// +// The results of this function are put into the dest buffer, so it MUST NOT be called asynchronously. +// Caller should guarantee thread safety and perform locks while working with it. +int convertToUtf16(const char *srcEncoding, + UChar *dest, + int32_t destCapacity, + const char *src, + int32_t srcLength, + int *status); + +// convertFromUtf16 performs conversion from utf16 to any other encoding. Utf16 is the ICU standard so +// it is easier to convert to/from it. 
+// +// The results of this function are put into the dest buffer, so it MUST NOT be called asynchronously. +// Caller should guarantee thread safety and perform locks while working with it. +int convertFromUtf16(const char *destEncoding, + char *dest, + int32_t destCapacity, + const UChar *src, + int32_t srcLength, + int *status); + + +#endif //__C_BRIDGE_H__ \ No newline at end of file diff --git a/convert.go b/convert.go new file mode 100644 index 0000000..3577a53 --- /dev/null +++ b/convert.go @@ -0,0 +1,98 @@ +package icu + +// #cgo pkg-config: icu-i18n +// #include "c_bridge.h" +// #include "stdlib.h" +import "C" +import ( + "fmt" + "sync" + "unsafe" +) + +const ( + DefaultMaxTextSize = 1024 * 1024 // Default value for the max text length in conversion operations + utf8MaxCharSize = 4 + utf16MaxCharSize = 4 +) + +var ( + Utf8CString = C.CString("UTF-8") +) + +// CharsetConverter provides ICU charset conversion functionality. +type CharsetConverter struct { + utf16Buffer []byte + utf8Buffer []byte + maxTextSize int + cMutex sync.Mutex // Mutex used to guarantee thread safety for ICU calls +} + +// NewCharsetConverter creates a new charset converter. It doesn't need to be closed as +// it doesn't allocate any resources. +// +// For better performance, conversion buffers are not allocated on each operation. Instead they +// are created in memory once and then used. 'maxTextSize' sets the size of these buffers. +// ICU library would return error if any processed text is longer than this parameter. +// +// NOTE: +// +// UTF8 uses 1 to 4 bytes for each symbol. +// UTF16 uses 2 bytes to 4 bytes for each symbol. +// +// So, to guarantee successful conversion of text with size = 'maxTextSize' we need: +// maxTextSize * 8 bytes (utf8 buffer + utf16 buffer). 
+func NewCharsetConverter(maxTextSize int) (*CharsetConverter) { + conv := new(CharsetConverter) + + conv.utf16Buffer = make([]byte, utf16MaxCharSize * maxTextSize) + conv.utf8Buffer = make([]byte, utf8MaxCharSize * maxTextSize) + + return conv +} + +// ConvertToUtf8 converts input bytes encoded with srcEncoding to UTF-8. +func (conv *CharsetConverter) ConvertToUtf8(input []byte, srcEncoding string) ([]byte, error) { + // As described in c_bridge.h, conversion operations are not thread safe and + // should be called sequentially. So a mutex is used here. + conv.cMutex.Lock() + defer conv.cMutex.Unlock() + + inputLen := len(input) + if inputLen == 0 { + return nil, fmt.Errorf("Nil length of input") + } + + var status int + + encCString := C.CString(srcEncoding) + inputCString := C.CString(string(input)) + + defer C.free(unsafe.Pointer(encCString)) + defer C.free(unsafe.Pointer(inputCString)) + + convLen := C.convertToUtf16( + encCString, + (*C.UChar)(unsafe.Pointer(&conv.utf16Buffer[0])), + C.int32_t(len(conv.utf16Buffer) / 2), // capacity is counted in 2-byte UChars, not bytes + inputCString, + C.int32_t(len(input)), + (*C.int)(unsafe.Pointer(&status))) + + if status == U_ZERO_ERROR { + nConvLen := C.convertFromUtf16( + Utf8CString, + (*C.char)(unsafe.Pointer(&conv.utf8Buffer[0])), + C.int32_t(len(conv.utf8Buffer)), + (*C.UChar)(unsafe.Pointer(&conv.utf16Buffer[0])), + C.int32_t(convLen), + (*C.int)(unsafe.Pointer(&status))) + + if status == U_ZERO_ERROR { + resStr := conv.utf8Buffer[:nConvLen] + return append([]byte(nil), resStr...), nil // copy: utf8Buffer is reused by the next call + } + } + + return nil, fmt.Errorf("ICU Error code returned: %d", status) +} diff --git a/detect.go b/detect.go new file mode 100644 index 0000000..d966d04 --- /dev/null +++ b/detect.go @@ -0,0 +1,100 @@ +package icu + +// #cgo pkg-config: icu-i18n +// #include "c_bridge.h" +// #include "stdlib.h" +import "C" +import ( + "fmt" + "sync" + "unsafe" +) + +const ( + U_ZERO_ERROR = 0 // ICU common constant error code which means that no error occurred + MatchDataBufferSize = 25 // Size of the buffer for 
detection results (Max count of returned guesses per detect call) +) + +// CharsetDetector provides ICU charset detection functionality. +type CharsetDetector struct { + ptr *C.UCharsetDetector // ICU struct needed for detection + resBuffer [MatchDataBufferSize]C.MatchData + gMutex sync.Mutex // Mutex used to guarantee thread safety for ICU calls +} + +// An equivalent of MatchData C structure (see c_bridge.h) +type Match struct { + Charset string + Language string + Confidence int +} + +// Creates new charset detector. If it is successfully created, it +// must be closed as it needs to free native ICU resources. +func NewCharsetDetector() (*CharsetDetector, error) { + det := new(CharsetDetector) + + var status int + statusPtr := unsafe.Pointer(&status) + + det.ptr = C.ucsdet_open((*C.UErrorCode)(statusPtr)) + + if status != U_ZERO_ERROR { + return nil, fmt.Errorf("ICU Error code returned: %d", status) + } + + return det, nil +} + +func (det *CharsetDetector) GuessCharset(input []byte) (matches []Match, err error) { + + // As described in c_bridge.h, detection operations are not thread safe and + // should be called consequently. So a mutex is used here. + det.gMutex.Lock() + defer det.gMutex.Unlock() + + inputLen := len(input) + if inputLen == 0 { + return nil, fmt.Errorf("Input data len is 0") + } + + var status int + + // Perform detection. Guess count is the number of matches returned. 
+ // The matches themselves are put in the result buffer + guessCount := C.detectCharset( + unsafe.Pointer(det.ptr), + unsafe.Pointer(&input[0]), + C.int(inputLen), + (*C.int)(unsafe.Pointer(&status)), + (*C.MatchData)(unsafe.Pointer(&det.resBuffer[0])), + C.int(MatchDataBufferSize)) + + if status == U_ZERO_ERROR { + // Convert the returned number of entries from result buffer to a slice + // that will be returned + count := int(guessCount) + mt := make([]Match, count) + + for i := 0; i < count; i++ { + mData := det.resBuffer[i] + charset := C.GoString(mData.charset) + language := C.GoString(mData.language) + mt[i] = Match{charset, language, int(mData.confidence)} + } + + return mt, nil + } + + return nil, fmt.Errorf("ICU Error code returned: %d", status) +} + +// Close frees native C resources +func (det *CharsetDetector) Close() { + det.gMutex.Lock() + defer det.gMutex.Unlock() + + if det.ptr != nil { + C.ucsdet_close(det.ptr) + } +} diff --git a/icu_test.go b/icu_test.go new file mode 100644 index 0000000..f943e1b --- /dev/null +++ b/icu_test.go @@ -0,0 +1,84 @@ +package icu + +import ( + "io/ioutil" + "testing" + "regexp" +) + +var ( + IcuTestLineRx *regexp.Regexp = regexp.MustCompile(`\A\[(?P.+)\]\s*\[(?P.+)\].*\z`) +) + +const ( + TestConfigPath = "defaultcfg/conf.txt" +) + +func testConversion(t *testing.T, encFileName string, expFileName string) { + // Create detector + detector, err := NewCharsetDetector() + + if nil != err { + t.Fatalf("Cannot create detector: %s", err) + } + defer detector.Close() + + // Create converter + converter := NewCharsetConverter(DefaultMaxTextSize) + + // Open files + + enc, err := ioutil.ReadFile(encFileName) + + if nil != err { + t.Error(err) + return + } + + exp, err := ioutil.ReadFile(expFileName) + + if nil != err { + t.Error(err) + return + } + + // Guess encoding + encMatches, err := detector.GuessCharset(enc) + + if nil != err { + t.Error(err) + return + } + + // Get charset with max confidence + maxenc := 
encMatches[0].Charset
+
+	// Convert to UTF-8
+	converted, err := converter.ConvertToUtf8(enc, maxenc)
+
+	if err != nil {
+		t.Error(err)
+		return
+	}
+
+	t.Logf("Encoded file: '%s' Expected file: [%s] Detected charset: [%s]",
+		encFileName,
+		expFileName,
+		maxenc)
+
+	// Compare the converted result with the expected result from file.
+	if string(converted) != string(exp) {
+		t.Errorf("Encoded file: '%s' Expected file: [%s] Detected charset: [%s] Expected utf8: [%s] Got utf8: [%s]",
+			encFileName,
+			expFileName,
+			maxenc,
+			exp,
+			string(converted))
+	}
+}
+
+func TestDefault(t *testing.T) {
+	testConversion(t, "test/koi8r.txt", "test/koi8r_to_utf.txt")
+	testConversion(t, "test/windows88591.txt", "test/windows88591_to_utf.txt")
+	testConversion(t, "test/utf8.txt", "test/utf8_to_utf.txt")
+}
\ No newline at end of file
diff --git a/test/koi8r.txt b/test/koi8r.txt
new file mode 100644
index 0000000..8abef94
--- /dev/null
+++ b/test/koi8r.txt
@@ -0,0 +1 @@
+- .
\ No newline at end of file
diff --git a/test/koi8r_to_utf.txt b/test/koi8r_to_utf.txt
new file mode 100644
index 0000000..ddca33d
--- /dev/null
+++ b/test/koi8r_to_utf.txt
@@ -0,0 +1 @@
+Далеко-далеко за словесными горами в стране гласных и согласных живут рыбные тексты.
\ No newline at end of file
diff --git a/test/utf8.txt b/test/utf8.txt
new file mode 100644
index 0000000..8035bc5
--- /dev/null
+++ b/test/utf8.txt
@@ -0,0 +1,5 @@
+Package template implements data-driven templates for generating textual output.
+
+To generate HTML output, see package html/template, which has the same interface as this package but automatically secures HTML output against certain attacks.
+
+Nous vous transmettrons les informations demandées dans les meilleurs délais. Ces recherches sont effectuées à titre gracieux.
\ No newline at end of file
diff --git a/test/utf8_to_utf.txt b/test/utf8_to_utf.txt
new file mode 100644
index 0000000..8035bc5
--- /dev/null
+++ b/test/utf8_to_utf.txt
@@ -0,0 +1,5 @@
+Package template implements data-driven templates for generating textual output.
+
+To generate HTML output, see package html/template, which has the same interface as this package but automatically secures HTML output against certain attacks.
+
+Nous vous transmettrons les informations demandées dans les meilleurs délais. Ces recherches sont effectuées à titre gracieux.
\ No newline at end of file
diff --git a/test/windows88591.txt b/test/windows88591.txt
new file mode 100644
index 0000000..cf767a6
--- /dev/null
+++ b/test/windows88591.txt
@@ -0,0 +1 @@
+Nous vous transmettrons les informations demandes dans les meilleurs dlais. Ces recherches sont effectues titre gracieux.
\ No newline at end of file
diff --git a/test/windows88591_to_utf.txt b/test/windows88591_to_utf.txt
new file mode 100644
index 0000000..a22b366
--- /dev/null
+++ b/test/windows88591_to_utf.txt
@@ -0,0 +1 @@
+Nous vous transmettrons les informations demandées dans les meilleurs délais. Ces recherches sont effectuées à titre gracieux.
\ No newline at end of file