查看原文
其他

Python 源码阅读: String

(点击上方蓝字,快速关注我们)


来源:伯乐在线 - wklken

如有好文章投稿,请点击 → 这里了解详情


PyStringObject


源码位置 Include/stringobject.h |

Objects/stringobject.c


定义


typedef struct {

  PyObject_VAR_HEAD

  long ob_shash;

  int ob_sstate;

  char ob_sval[1];

 

  /* Invariants:

   *     ob_sval contains space for 'ob_size+1' elements.

   *     ob_sval[ob_size] == 0.

   *     ob_shash is the hash of the string or -1 if not computed yet.

   *     ob_sstate != 0 iff the string object is in stringobject.c's

   *       'interned' dictionary; in this case the two references

   *       from 'interned' to this object are *not counted* in ob_refcnt.

   */

} PyStringObject;


说明


1. PyObject_VAR_HEAD

PyStringObject是变长对象, 比定长对象多了一个ob_size字段

 

2. ob_shash

存储字符串的hash值, 如果还没计算等于-1

当string_hash被调用, 计算结果会被保存到这个字段一份, 后续不再进行计算

 

3. ob_sstate

如果是interned, !=0, 否则=0

interned后面说

 

4. char ob_sval[1]

字符指针指向一段内存, char数组指针, 指向一个ob_size+1大小数组(c中字符串最后要多一个字符``表字符串结束)


结构


构造方法


PyAPI_FUNC(PyObject *) PyString_FromString(const char *);

 

PyAPI_FUNC(PyObject *) PyString_FromStringAndSize(const char *, Py_ssize_t);


两个构造方法其实区别不大,


PyString_FromStringAndSize 参数可以为`NULL`, 无论是否为`NULL`, 都会分配`size+1`个字节空间.(不为NULL的话字符数组会进行拷贝)

 

PyString_FromString, 参数不能为`NULL`, 且必须是``结束的字符数组, 会调用c 语言string.h模块的strlen()函数计算字符串长度, 分配空间, 并将整个字符数组拷贝到ob_sval指向的内存


我们关注PyString_FromString就行


创建过程 PyString_FromString


定义


//默认未初始化, 均为NULL

static PyStringObject *characters[UCHAR_MAX + 1];

static PyStringObject *nullstring;

 

PyObject *

PyString_FromString(const char *str)

{

    register size_t size;

    register PyStringObject *op;

 

    assert(str != NULL);

 

    // 计算参数字符数组长度

    size = strlen(str);

 

    // 超长, 报错(PY_SSIZE_T_MAX平台相关,32位2GB)

    if (size > PY_SSIZE_T_MAX - PyStringObject_SIZE) {

        PyErr_SetString(PyExc_OverflowError,

            "string is too long for a Python string");

        return NULL;

    }

 

    // 长度=0, 且nullstring已定义, 返回nullstring

    if (size == 0 & (op = nullstring) != NULL) {

#ifdef COUNT_ALLOCS

        null_strings++;

#endif

        Py_INCREF(op);

        return (PyObject *)op;

    }

 

    // 字符缓冲池逻辑

    // 长度=1, 且characters[*str & UCHAR_MAX]字符已定义

    if (size == 1 && (op = characters[*str & UCHAR_MAX]) != NULL) {

#ifdef COUNT_ALLOCS

        one_strings++;

#endif

        Py_INCREF(op);

        return (PyObject *)op;

    }

 

    // 申请内存空间

    /* Inline PyObject_NewVar */

    op = (PyStringObject *)PyObject_MALLOC(PyStringObject_SIZE + size);

    if (op == NULL)

        return PyErr_NoMemory();

 

    // 初始化

    PyObject_INIT_VAR(op, &PyString_Type, size);

    op->ob_shash = -1; //未计算hash, -1

    op->ob_sstate = SSTATE_NOT_INTERNED;  //未intern, 0

    // 将字符数组拷贝到PyStringObject

    Py_MEMCPY(op->ob_sval, str, size+1);

 

    // 在nullstring和字符缓冲池对应位置未初始化时, 会走到这个逻辑

    /* share short strings */

    if (size == 0) {

        PyObject *t = (PyObject *)op;

        // 走intern, 后面说

        PyString_InternInPlace(&t);

        op = (PyStringObject *)t;

 

        // 初始化nullstring

        nullstring = op;

        Py_INCREF(op);

    } else if (size == 1) {

        PyObject *t = (PyObject *)op;

        // 走intern, 后面说

        PyString_InternInPlace(&t);

        op = (PyStringObject *)t;

 

        // 初始化字符缓冲池对应位置

        characters[*str & UCHAR_MAX] = op;

        Py_INCREF(op);

    }

 

    // 返回

    return (PyObject *) op;

}


步骤简化


1. 计算长度

2. 长度0, 空字符串, 是返回已定义好的nullstring

3. 长度1, 字符, 返回字符缓冲池里面的

4. 都不是, 分配内存, 初始化


结论


长度0/长度1, 会用到intern机制

注意, intern机制对长度>1的字符串也适用


interned机制


interned


/* This dictionary holds all interned strings.  Note that references to

strings in this dictionary are *not* counted in the string's ob_refcnt.

When the interned string reaches a refcnt of 0 the string deallocation

function will delete the reference from this dictionary.

 

Another way to look at this is that to say that the actual reference

count of a string is:  s->ob_refcnt + (s->ob_sstate?2:0)

*/

static PyObject *interned; //指针, 指向PyDictObject


interned定义


void

PyString_InternInPlace(PyObject **p)

{

    register PyStringObject *s = (PyStringObject *)(*p);

 

    PyObject *t;

 

    // 检查值使用在PyStringObject上, 派生类不适用

    if (s == NULL || !PyString_Check(s))

        Py_FatalError("PyString_InternInPlace: strings only please!");

    /* If it's a string subclass, we don't really know what putting it in the interned dict might do. */

 

    // 不是字符串类型, 返回

    if (!PyString_CheckExact(s))

        return;

    // 本身已经intern了(标志位ob_sstate), 不重复进行, 返回

    if (PyString_CHECK_INTERNED(s))

        return;

 

    // 未初始化字典, 初始化之

    if (interned == NULL) {

        // 注意这里

        interned = PyDict_New();

        if (interned == NULL) {

            PyErr_Clear(); /* Don't leave an exception */

            return;

        }

    }

 

    // 在interned字典中已存在, 修改, 返回intern独享

    t = PyDict_GetItem(interned, (PyObject *)s);

    if (t) {

        Py_INCREF(t);

        Py_DECREF(*p);

        *p = t;

        return;

    }

 

    // 在interned字典中不存在, 放进去

    if (PyDict_SetItem(interned, (PyObject *)s, (PyObject *)s)  0) {

        PyErr_Clear();

        return;

    }

 

    // 加入interned字典(key-value), 会导致refcnt+2, 去掉, 保证当外部没有引用时, refcnt=0, 可以进行回收处理, (不-2, refcnt永远>=2)

    /* The two references in interned are not counted by refcnt.

       The string deallocator will take care of this */

    Py_REFCNT(s) -= 2;

 

    // 修改字符串对象的ob_sstate标志位, SSTATE_INTERNED_MORTAL

    PyString_CHECK_INTERNED(s) = SSTATE_INTERNED_MORTAL;

}


使用的地方


// 构造方法

PyAPI_FUNC(PyObject *) PyString_FromString(const char *);

 

PyAPI_FUNC(PyObject *) PyString_FromStringAndSize(const char *, Py_ssize_t);

 

// SSTATE_INTERNED_MORTAL, 计数0会被回收

PyObject *

PyString_InternFromString(const char *cp)

{

    PyObject *s = PyString_FromString(cp);

    if (s == NULL)

        return NULL;

    PyString_InternInPlace(&s);

    return s;

}

 

// SSTATE_INTERNED_IMMORTAL, 永远不会被销毁

void

PyString_InternImmortal(PyObject **p)


示例


>>> a = ''

>>> b = ''

>>> id(a) == id(b)

True

 

>>> a = 'x'

>>> b = 'x'

>>> id(a) == id(b)

True

 

>>> a = "abc"

>>> b = "abc"

>>> id(a) == id(b)

True


python源代码自己也大量使用


dict_str = PyString_InternFromString("__dict__")

lenstr = PyString_InternFromString("__len__")

s_true = PyString_InternFromString("true")

empty_array = PyString_InternFromString("[]")


好处


一旦字符串被intern, 会被python保存到字典中, 整个python运行期间, 系统中有且仅有一个对象. 下次相同字符串再进入, 不会重复创建对象.


字符缓冲池


定义


UCHAR_MAX 平台相关

 

static PyStringObject *characters[UCHAR_MAX + 1];


在上面PyString_FromString可以看到, 字符缓冲池在使用中初始化(存在直接返回, 不存在建一个, 放interned字典中, 初始化字符缓冲池对应位置)


PyObject *t = (PyObject *)op;

// 走intern, 后面说

PyString_InternInPlace(&t);

op = (PyStringObject *)t;

 

// 初始化字符缓冲池对应位置

characters[*str & UCHAR_MAX] = op;


字符串销毁过程


static void

string_dealloc(PyObject *op)

{

    // 是否intern

    switch (PyString_CHECK_INTERNED(op)) {

        // 非, 跳出 -> 回收内存

        case SSTATE_NOT_INTERNED:

            break;

 

        // 普通, 从interned字典中删除, 跳出 -> 回收内存

        case SSTATE_INTERNED_MORTAL:

            /* revive dead object temporarily for DelItem */

            Py_REFCNT(op) = 3;

            if (PyDict_DelItem(interned, op) != 0)

                Py_FatalError(

                    "deletion of interned string failed");

            break;

        // 永不回收的对象, 报错

        case SSTATE_INTERNED_IMMORTAL:

            Py_FatalError("Immortal interned string died.");

 

        default:

            Py_FatalError("Inconsistent interned string state.");

    }

 

    // 回收内存

    Py_TYPE(op)->tp_free(op);

}


性能相关


+与 join


'a' + 'b' + 'c'

 

or

 

''.join(['a', 'b', 'c'])


可以查看string_concat方法和string_join方法的源代码


string_concat, 一次加=分配一次内存空间, 拷贝两次. N次链接, 需要N-1次内存分配.

string_join, 计算序列所有元素字符串总长度, 用PyString_FromStringAndSize((char*)NULL, sz)分配内存空间, 然后逐一拷贝. 一次内存操作.


相关阅读


看完本文有收获?请转发分享给更多人

关注「Python开发者」,提升Python技能

    您可能也对以下帖子感兴趣

    文章有问题?点此查看未经处理的缓存