Two tools that solve most of your encoding headaches

alitrack, 2022-10-01

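One of my customers stored their data in Sybase. When they exported it to CSV and sent it to me, I could not import it correctly no matter how I processed it. Only after careful checking and confirmation did I learn that their database used the ISO-8859-1 encoding, and that the column lengths had caused some Chinese characters to be truncated, so data was lost. It took a fair amount of effort to resolve.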

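Garbled text comes up often during data extraction (ETL), and it is a real headache, especially on Windows. To fix it, we first need to understand its possible causes:

Garbling caused by encoding

Garbling caused by decoding

Garbling caused by a missing font (here the user has to install the font; ETL normally does not run into this unless the file format is unusual, such as PDF)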

Most garbled-text problems are caused by decoding with the wrong encoding, as shown in the fishbone diagram.
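To make the decoding case concrete, here is a minimal Python sketch. It assumes a producer that writes UTF-8 and a consumer that wrongly reads the bytes as ISO-8859-1; the sample string and variable names are made up for illustration.

```python
# Bytes written by a producer that uses UTF-8.
original = "汉字"
raw = original.encode("utf-8")

# A consumer that wrongly assumes ISO-8859-1 gets garbled text, not an error,
# because every byte value maps to some ISO-8859-1 character.
wrong = raw.decode("iso-8859-1")

# Decoding the same bytes with the right encoding restores the text.
right = raw.decode("utf-8")

# In this case the damage is reversible: re-encode with the wrong codec to
# get the original bytes back, then decode them properly.
repaired = wrong.encode("iso-8859-1").decode("utf-8")

print(repr(wrong), right, repaired)
```

The bytes themselves were never damaged here, which is why this kind of garbling can usually be fixed by choosing the right decoder.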

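Garbling caused by encoding falls into two cases:

Everything is garbled (no fix; obtain the correct data again)

Part of the data is garbled, possibly because a double-byte or multi-byte character was truncated. For example, if the database is set to ISO-8859-1 but stores Chinese text and a column is too short, a double-byte Chinese character can be cut in half and lose one of its bytes. Another possibility is UTF-16 encoded data being saved as UTF-8, so that some characters cannot be recognized as UTF-8 and turn into garbage.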
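The truncation case can be reproduced in a few lines of Python. This is only a sketch: the 5-byte limit stands in for a hypothetical column that is too short, and GBK is assumed as the encoding of the stored bytes.

```python
text = "汉字编码"
raw = text.encode("gbk")      # each of these characters takes 2 bytes in GBK
truncated = raw[:5]           # hypothetical 5-byte column: the third character is cut in half

# Strict decoding fails on the dangling half character.
try:
    truncated.decode("gbk")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Ignoring undecodable bytes keeps the intact characters; the lost byte
# (and everything after the cut) cannot be recovered.
recovered = truncated.decode("gbk", errors="ignore")

print(strict_ok, recovered)
```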

Here are two tools that deal with encodings:

file

iconv

Both are very common on Linux (and macOS).

>file --help

Usage: file [OPTION...] [FILE...]
Determine type of FILEs.

      --help                 display this help and exit
  -v, --version              output version information and exit
  -m, --magic-file LIST      use LIST as a colon-separated list of magic number files
  -M LIST                    use LIST as a colon-separated list of magic number files in place of default
  -z, --uncompress           try to look inside compressed files
  -Z, --uncompress-noreport  only print the contents of compressed files
  -b, --brief                do not prepend filenames to output lines
  -c, --checking-printout    print the parsed form of the magic file, use in conjunction with -m to debug a new magic file before installing it
  -d                         use default magic file
  -e, --exclude TEST         exclude TEST from the list of test to be performed for file. Valid tests are: apptype, ascii, cdf, compress, elf, encoding, soft, tar, text, tokens
  -f, --files-from FILE      read the filenames to be examined from FILE
  -F, --separator STRING     use string as separator instead of `:'
  -i                         do not further classify regular files
  -I, --mime                 output MIME type strings (--mime-type and --mime-encoding)
      --apple                output the Apple CREATOR/TYPE
      --extension            output a slash-separated list of extensions
      --mime-type            output the MIME type
      --mime-encoding        output the MIME encoding
  -k, --keep-going           don't stop at the first match
  -l, --list                 list magic strength
  -L, --dereference          follow symlinks
  -h, --no-dereference       don't follow symlinks (default)
  -n, --no-buffer            do not buffer output
  -N, --no-pad               do not pad output
  -0, --print0               terminate filenames with ASCII NUL
  -p, --preserve-date        preserve access times on files
  -P, --parameter            set file engine parameter limits
                               indir        15 recursion limit for indirection
                               name         30 use limit for name/use magic
                               elf_notes   256 max ELF notes processed
                               elf_phnum   128 max ELF prog sections processed
                               elf_shnum 32768 max ELF sections processed
  -r, --raw                  don't translate unprintable chars to \ooo
  -s, --special-files        treat special (block/char devices) files as ordinary ones
  -C, --compile              compile file specified by -m
  -D, --debug                print debugging messages

Report bugs to http://bugs.gw.com/

As the help text shows, detecting the encoding is only one of its basic features; its main job is to determine a file's type.

>file abc.csv
abc.csv: UTF-8 Unicode (with BOM) text
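When file is not available, a rough guess can be made in Python. The sketch below is not how file works internally; it is a simple heuristic that checks for a UTF-8 BOM and then tries each candidate encoding in strict mode. The function name guess_encoding and the candidate list are assumptions made for this example.

```python
import codecs


def guess_encoding(data, candidates=("utf-8", "gbk")):
    """Return the first plausible encoding name for data, or None."""
    # A UTF-8 byte order mark is a strong hint; "utf-8-sig" strips it on decode.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Otherwise accept the first candidate that decodes without errors.
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


print(guess_encoding(codecs.BOM_UTF8 + b"a,b,c"))
print(guess_encoding("汉字".encode("utf-8")))
print(guess_encoding("汉字".encode("gbk")))
```

The order of the candidates matters: a permissive single-byte encoding such as ISO-8859-1 accepts any byte sequence, so it would only make sense as a last resort.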

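Once you know the correct encoding, decoding becomes much easier: you can specify the encoding directly when reading the file and, when necessary, use iconv to convert it.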
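As an illustration of reading with an explicitly specified encoding, the following Python sketch reads a GBK file and writes it back out as UTF-8. The helper name transcode and the file names are hypothetical, and the sample file is created in a temporary directory so the example is self-contained.

```python
import os
import tempfile


def transcode(src, dst, from_enc, to_enc):
    """Copy src to dst, decoding with from_enc and encoding with to_enc."""
    # newline="" keeps the original line endings untouched.
    with open(src, "r", encoding=from_enc, newline="") as fin, \
            open(dst, "w", encoding=to_enc, newline="") as fout:
        for line in fin:
            fout.write(line)


workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "gbk.csv")
dst = os.path.join(workdir, "utf8.csv")

# Create a small GBK-encoded sample file.
with open(src, "wb") as f:
    f.write("编号,名称\r\n1,汉字\r\n".encode("gbk"))

transcode(src, dst, "gbk", "utf-8")

with open(dst, "rb") as f:
    result = f.read()
print(result.decode("utf-8"))
```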

>iconv --help

Usage: iconv [OPTION...] [-f ENCODING] [-t ENCODING] [INPUTFILE...]
or:    iconv -l

Converts text from one encoding to another encoding.

Options controlling the input and output format:
  -f ENCODING, --from-code=ENCODING   the encoding of the input
  -t ENCODING, --to-code=ENCODING     the encoding of the output

Options controlling conversion problems:
  -c                                  discard unconvertible characters
  --unicode-subst=FORMATSTRING        substitution for unconvertible Unicode characters
  --byte-subst=FORMATSTRING           substitution for unconvertible bytes
  --widechar-subst=FORMATSTRING       substitution for unconvertible wide characters

Options controlling error output:
  -s, --silent                        suppress error messages about conversion problems

Informative output:
  -l, --list                          list the supported encodings
  --help                              display this help and exit
  --version                           output version information and exit

Report bugs to <bug-gnu-libiconv@gnu.org>.

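iconv converts a file from one encoding to another. One very useful option is -c, which discards characters that cannot be converted (very effective against partially garbled data).

@rem Convert a UTF-8 file to GBK and drop characters that cannot be converted
iconv -f UTF8 -t GBK -c utf8.csv > gbk.csv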
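For readers who prefer to do the same thing in a script, here is a rough Python analogue of the command above. It is a sketch rather than an exact reimplementation of iconv -c: the function name convert_discarding is made up, and Python's errors="ignore" handler is assumed to be an acceptable stand-in for discarding unconvertible characters.

```python
def convert_discarding(data, from_enc="utf-8", to_enc="gbk"):
    """Re-encode bytes, silently dropping anything that cannot be converted."""
    # Drop byte sequences that are not valid in the source encoding ...
    text = data.decode(from_enc, errors="ignore")
    # ... and characters that have no representation in the target encoding.
    return text.encode(to_enc, errors="ignore")


# The emoji has no GBK representation, so it is dropped.
utf8_bytes = "abc😀汉字".encode("utf-8")
gbk_bytes = convert_discarding(utf8_bytes)
print(gbk_bytes.decode("gbk"))

# A UTF-8 sequence cut off mid-character is dropped as well.
cut_bytes = "汉字".encode("utf-8")[:-1]
print(convert_discarding(cut_bytes).decode("gbk"))
```

Dropping data silently is a trade-off: the output is clean, but nothing records what was lost, so it is worth comparing row or character counts before and after the conversion.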

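Of course, whether the data is partly or completely garbled, the best fix is to export it again. If the problem lies in how the database itself stores the data, iconv is the only help left.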

References and software downloads:

https://www.ibm.com/developerworks/cn/java/analysis-and-summary-of-common-random-code-problems/index.html
http://alitrack.com/2016/08/18/如何快速获得一个文件的类型和所使用的编码信息/
http://alitrack.com/2016/08/18/iconv批量转换字符集编码的利器/
