The chardet Library: Easily Detect a File's Encoding

Original, 2017-06-01, 大邓, 大邓带你玩python

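Yesterday I was helping a Python enthusiast debug some code and got tripped up by the problem that is easiest to overlook: encoding. Afterwards I searched Baidu and found the chardet library, which can automatically detect a file's encoding. It is really nice, and it should make reading and writing files with the right encoding much easier.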

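Previously, when saving data I (大邓) would always write open(path,'w',encoding='utf-8'), and when reading data I would write open(path,'r',encoding='utf-8'). Both calls carry encoding='utf-8' so that the file encoding stays consistent. If it is not consistent, all sorts of bugs show up later when the data is used. With chardet, the value to pass as encoding= can be found easily for an existing file, so not everything has to be standardized to utf-8 beforehand.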
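As an illustration of passing a detected value to encoding=, here is a minimal sketch. The helper name read_text, the fallback encoding, and the min_confidence threshold are assumptions made for this example and are not part of chardet. Detection is only a guess, so the sketch falls back to a default when chardet cannot decide or is not confident.

```python
import chardet


def read_text(path, fallback="utf-8", min_confidence=0.5):
    """Read a text file, decoding it with the encoding chardet guesses.

    read_text, fallback and min_confidence are hypothetical names
    chosen for this sketch.
    """
    with open(path, "rb") as f:
        raw = f.read()
    guess = chardet.detect(raw)
    encoding = guess["encoding"]
    # chardet reports None when it cannot decide (for example, an empty file)
    if encoding is None or guess["confidence"] < min_confidence:
        encoding = fallback
    return raw.decode(encoding)
```

The same guess could equally be passed on as open(path, 'r', encoding=encoding). Decoding the bytes that were already read simply avoids reading the file twice.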

chardet documentation

http://chardet.readthedocs.io/en/latest/usage.html

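Detecting the encoding of a small file

The detect function takes a single argument, a non-unicode string (in Python 3, a bytes object), and returns a dictionary. The dictionary contains the detected encoding and the confidence of that detection.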

import chardet

# Open in binary mode: detect() expects bytes, not str
with open('test1.txt', 'rb') as f:
    result = chardet.detect(f.read())
print(result)

Result

{'encoding': 'utf-8', 'confidence': 0.99}

chardet reports 99% confidence that the file is utf-8 encoded.
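To make the input requirement concrete, here is a small sketch of how detect reacts to different inputs. The wrapper name guess_encoding is an assumption for this example, and the TypeError behavior assumes chardet 3.0 or later.

```python
import chardet


def guess_encoding(data):
    """Return (encoding, confidence) for a bytes object.

    guess_encoding is a hypothetical wrapper around chardet.detect.
    """
    result = chardet.detect(data)
    return result["encoding"], result["confidence"]


print(guess_encoding(b"hello world"))  # pure ASCII input
print(guess_encoding(b""))             # empty input: nothing to detect

try:
    guess_encoding("hello world")      # str is already decoded text
except TypeError as exc:
    print("rejected:", exc)
```

When the encoding comes back as None, there is nothing reliable to pass to encoding=, so the calling code needs a default of its own.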

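Detecting the encoding of a large file

Some files are very large, and reading the whole file before detecting its encoding, as above, becomes very inefficient. Instead, the data is read piece by piece and each piece is fed to a detector. Once the detector has seen enough data to make a high-confidence judgment, detector.done becomes True, and the file's encoding can then be read from the detector.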

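from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
# Iterate over the file object itself; readlines() would load the whole file first
with open('test2.txt', 'rb') as bigdata:
    for line in bigdata:
        detector.feed(line)
        # done becomes True once the detector is confident enough
        if detector.done:
            break
# close() finalizes the detection and fills in detector.result
detector.close()
print(detector.result)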

Result

{'encoding': 'utf-8', 'confidence': 0.99}
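Feeding line by line works well for ordinary text, but a file with no line breaks would still arrive as one huge "line". As a variation, the sketch below feeds fixed-size chunks instead. The function name detect_file and the 64 KB chunk size are assumptions for illustration only.

```python
from chardet.universaldetector import UniversalDetector


def detect_file(path, chunk_size=64 * 1024):
    """Feed a file to chardet in fixed-size chunks and return the result dict.

    detect_file and chunk_size are hypothetical names for this sketch.
    """
    detector = UniversalDetector()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read(chunk_size) until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            detector.feed(chunk)
            if detector.done:
                break
    detector.close()
    return detector.result
```

Calling detect_file('test2.txt') would return the same kind of dictionary as shown above, while holding at most one chunk in memory at a time.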

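Detecting the encoding of multiple large files

To detect the encoding of several files, a single UniversalDetector object can be reused. Just call detector.reset() before feeding each new file; the rest is the same as above.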

import os
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
dirpath = '/Users/suosuo/Desktop/Test'
for name in os.listdir(dirpath):
    # os.path.join picks the right separator on both macOS and Windows
    path = os.path.join(dirpath, name)
    # Clear the state left over from the previous file
    detector.reset()
    with open(path, 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:
                break
    detector.close()
    print(detector.result)

Output

{'encoding': 'utf-8', 'confidence': 0.99}
{'encoding': 'gb2312', 'confidence': 0.99}
......
{'encoding': 'utf-8', 'confidence': 0.99}
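Printed results like these do not show which file each line belongs to. One possible refinement, sketched below, collects the results into a dictionary keyed by file name and skips subdirectories. The function name detect_directory is an assumption for this example.

```python
import os
from chardet.universaldetector import UniversalDetector


def detect_directory(dirpath):
    """Map each regular file in dirpath to chardet's result for it.

    detect_directory is a hypothetical helper for this sketch.
    """
    detector = UniversalDetector()
    results = {}
    for name in sorted(os.listdir(dirpath)):
        path = os.path.join(dirpath, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories and other non-files
        detector.reset()  # clear state left over from the previous file
        with open(path, "rb") as f:
            for line in f:
                detector.feed(line)
                if detector.done:
                    break
        detector.close()
        # Store a copy so the entry is independent of the detector object
        results[name] = dict(detector.result)
    return results
```

The returned dictionary can then be used to look up the encoding= value for each file before opening it as text.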
