Two tools that can solve most of your encoding headaches
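During data extraction (ETL) you will often run into garbled text (mojibake). It is a real headache, especially on Windows. To fix it, we first need to understand the possible causes:
Garbling introduced at encoding time
Garbling introduced at decoding time
Garbling caused by a missing font (here the user needs to install the font; ETL work normally does not hit this unless you are processing an unusual file format such as PDF)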
Most garbled-text problems come from decoding with the wrong encoding, as the fishbone diagram in the figure shows.
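To make the "wrong decoder" case concrete, here is a minimal Python sketch. The sample string and the choice of latin-1 as the wrong codec are assumptions made only for illustration. It shows that the bytes themselves are intact and only the interpretation is wrong, which is why this kind of garbling can usually be undone.

```python
# Hypothetical sample text; any non-ASCII string behaves the same way.
text = "编码"

# The data is encoded correctly as UTF-8 (3 bytes per character here).
raw = text.encode("utf-8")

# Decoding with the wrong codec does not fail with latin-1, because
# every byte value maps to some character. The result is mojibake.
garbled = raw.decode("latin-1")

# latin-1 maps bytes to characters one-to-one, so the original bytes
# can be recovered and decoded again with the right codec.
recovered = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(recovered)
```

The round trip only works because no bytes were lost along the way. Once a lossy step is involved (replacement characters, truncation), the original text cannot be rebuilt this way.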
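Garbling introduced at encoding time falls into two cases:
Fully garbled (no fix; you have to fetch the correct data again)
Partially garbled, possibly because a double-byte or multi-byte character was truncated. For example, the database is set to ISO-8859-1 but stores Chinese text; if the column length is too short, a double-byte Chinese character can be cut in half and lose one of its bytes. It can also happen when UTF-16 encoded data is saved as UTF-8, so that some characters cannot be recognized as UTF-8 and turn into garbage.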
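The truncation case can be reproduced in a few lines of Python. This is only a sketch: the sample value and the 5-byte limit are made up, and GBK stands in for whatever double-byte encoding the data really uses.

```python
# Hypothetical sample value; each character takes 2 bytes in GBK.
text = "数据抽取"
raw = text.encode("gbk")          # 8 bytes in total

# Simulate a column that is too short: keep only 5 bytes, which cuts
# the third character in half and leaves a dangling lead byte.
truncated = raw[:5]

# Strict decoding refuses the incomplete character.
try:
    truncated.decode("gbk")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="ignore" drops the orphaned byte and keeps the intact characters.
salvaged = truncated.decode("gbk", errors="ignore")

print(strict_ok)
print(salvaged)
```

The lost half of the character is gone for good; all you can do is salvage the characters that survived whole.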
Here are two tools that deal with encodings:
file
iconv
Both are very common on Linux (and macOS). Here is the help output of file:
>file --help
Usage: file [OPTION...] [FILE...]
Determine type of FILEs.
--help display this help and exit
-v, --version output version information and exit
-m, --magic-file LIST use LIST as a colon-separated list of magic
number files
-M LIST use LIST as a colon-separated list of magic
number files in place of default
LIST use LIST as a colon-separated list of magic
number files in place of default
-z, --uncompress try to look inside compressed files
-Z, --uncompress-noreport only print the contents of compressed files
-b, --brief do not prepend filenames to output lines
-c, --checking-printout print the parsed form of the magic file, use in
conjunction with -m to debug a new magic file
before installing it
-d use default magic file
use default magic file
-e, --exclude TEST exclude TEST from the list of test to be
performed for file. Valid tests are:
apptype, ascii, cdf, compress, elf, encoding,
soft, tar, text, tokens
-f, --files-from FILE read the filenames to be examined from FILE
-F, --separator STRING use string as separator instead of `:'
-i do not further classify regular files
do not further classify regular files
-I, --mime output MIME type strings (--mime-type and
--mime-encoding)
--apple output the Apple CREATOR/TYPE
--extension output a slash-separated list of extensions
--mime-type output the MIME type
--mime-encoding output the MIME encoding
-k, --keep-going don't stop at the first match
-l, --list list magic strength
-L, --dereference follow symlinks
-h, --no-dereference don't follow symlinks (default)
-n, --no-buffer do not buffer output
-N, --no-pad do not pad output
-0, --print0 terminate filenames with ASCII NUL
-p, --preserve-date preserve access times on files
-P, --parameter set file engine parameter limits
indir 15 recursion limit for indirection
name 30 use limit for name/use magic
elf_notes 256 max ELF notes processed
elf_phnum 128 max ELF prog sections processed
elf_shnum 32768 max ELF sections processed
-r, --raw don't translate unprintable chars to \ooo
-s, --special-files treat special (block/char devices) files as
ordinary ones
-C, --compile compile file specified by -m
-D, --debug print debugging messages
Report bugs to http://bugs.gw.com/
As the help text shows, detecting the encoding is only one of its basic features; its main job is to determine a file's type. For example:
>file abc.csv
abc.csv: UTF-8 Unicode (with BOM) text
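If file is not at hand, the "with BOM" part of that answer is easy to check yourself. The sketch below is not how file works internally; it is a minimal illustration that only looks for a byte order mark, and the function name sniff_bom is hypothetical. Data without a BOM still needs a real detector such as file.

```python
import codecs

# Check the longer BOMs first: the UTF-32 LE BOM starts with the same
# two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data: bytes):
    """Return a Python codec name if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


# In-memory stand-in for the first bytes of a file such as abc.csv.
head = codecs.BOM_UTF8 + "a,b\n".encode("utf-8")
print(sniff_bom(head))
print(sniff_bom(b"plain ascii"))
```

For a real file, read the first four bytes in binary mode and pass them to the function. The returned names are Python codec names that consume the BOM while decoding.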
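Once you know the correct encoding, decoding becomes much easier: you can read the file with that encoding specified explicitly, and when necessary use iconv to convert it.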
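As an illustration of reading with an explicitly specified encoding, here is a small Python sketch. It assumes the file is UTF-8 with a BOM, as in the file output above, and it writes its own temporary sample file, so the path and contents are made up.

```python
import csv
import os
import tempfile

# Create a small sample file with a UTF-8 BOM (a stand-in for abc.csv).
path = os.path.join(tempfile.mkdtemp(), "abc.csv")
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    f.write("name,city\n张三,北京\n")

# Reading as plain "utf-8" leaves the BOM glued to the first field.
with open(path, encoding="utf-8", newline="") as f:
    naive_first_cell = next(csv.reader(f))[0]

# "utf-8-sig" strips the BOM, so the header comes back clean.
with open(path, encoding="utf-8-sig", newline="") as f:
    rows = list(csv.reader(f))

print(repr(naive_first_cell))
print(rows)
```

The same idea applies to any reader that accepts an encoding parameter: pass what file reported instead of relying on the platform default.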
>iconv --help
Usage: iconv [OPTION...] [-f ENCODING] [-t ENCODING] [INPUTFILE...]
or: iconv -l
Converts text from one encoding to another encoding.
Options controlling the input and output format:
-f ENCODING, --from-code=ENCODING
the encoding of the input
-t ENCODING, --to-code=ENCODING
the encoding of the output
Options controlling conversion problems:
-c discard unconvertible characters
--unicode-subst=FORMATSTRING
substitution for unconvertible Unicode characters
--byte-subst=FORMATSTRING substitution for unconvertible bytes
--widechar-subst=FORMATSTRING
substitution for unconvertible wide characters
Options controlling error output:
-s, --silent suppress error messages about conversion problems
Informative output:
-l, --list list the supported encodings
--help display this help and exit
--version output version information and exit
Report bugs to <bug-gnu-libiconv@gnu.org>.
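iconv converts a file from one encoding to another. One very practical option is -c, which discards characters that cannot be converted (very effective against partial garbling).
@rem Convert a UTF-8 file to GBK, dropping any characters that cannot be converted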
iconv -f UTF8 -t GBK -c utf8.csv>gbk.csv
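When iconv is not installed, roughly the same conversion can be done inside a script. The following Python sketch approximates the command above using the "ignore" error handler. The function name is hypothetical, and the exact handling of malformed input may differ from what iconv -c does.

```python
def convert_dropping_unconvertible(data: bytes, src: str = "utf-8",
                                   dst: str = "gbk") -> bytes:
    """Re-encode data from src to dst, silently dropping what does not fit."""
    # Bytes that are not valid in the source encoding are skipped.
    text = data.decode(src, errors="ignore")
    # Characters that have no representation in the target are skipped.
    return text.encode(dst, errors="ignore")


# Hypothetical sample: the emoji has no GBK representation.
sample = "编码😀ok".encode("utf-8")
converted = convert_dropping_unconvertible(sample)
print(converted.decode("gbk"))
```

As with -c, the dropped characters disappear silently, so it is worth comparing lengths or logging when you need to know how much was lost.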
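Of course, whether the data is partially or fully garbled, the best fix is to export the data again. If the problem lies in how the database stored the data in the first place, iconv is your only recourse.
References and software downloads: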
https://www.ibm.com/developerworks/cn/java/analysis-and-summary-of-common-random-code-problems/index.html
http://alitrack.com/2016/08/18/如何快速获得一个文件的类型和所使用的编码信息/
http://alitrack.com/2016/08/18/iconv批量转换字符集编码的利器/