Custom Analyzers in Elasticsearch

zhisheng, 2021-09-05

The previous post did not finish covering the Chinese analyzers, so this one picks up where it left off.

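14. Mmseg analyzer

Mmseg is also available as an Elasticsearch plugin.

Download: https://github.com/medcl/elasticsearch-analysis-mmseg/releases (pick the release that matches your Elasticsearch version)

Usage:

1. Create the index:

curl -XPUT http://localhost:9200/index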

2. Create the mapping:

curl -XPOST http://localhost:9200/index/fulltext/_mapping -H 'Content-Type: application/json' -d'
{
    "properties": {
        "content": {
            "type": "text",
            "term_vector": "with_positions_offsets",
            "analyzer": "mmseg_maxword",
            "search_analyzer": "mmseg_maxword"
        }
    }
}'

3. Index some documents:

curl -XPOST http://localhost:9200/index/fulltext/1 -H 'Content-Type: application/json' -d'
{"content":"美国留给伊拉克的是个烂摊子吗"}
'

curl -XPOST http://localhost:9200/index/fulltext/2 -H 'Content-Type: application/json' -d'
{"content":"公安部：各地校车将享最高路权"}
'

curl -XPOST http://localhost:9200/index/fulltext/3 -H 'Content-Type: application/json' -d'
{"content":"中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"}
'

curl -XPOST http://localhost:9200/index/fulltext/4 -H 'Content-Type: application/json' -d'
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}
'

4. Query with highlighting:

curl -XPOST http://localhost:9200/index/fulltext/_search -H 'Content-Type: application/json' -d'
{
    "query" : { "term" : { "content" : "中国" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}
'

5. Result:

{
    "took": 14,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 2,
        "hits": [
            {
                "_index": "index",
                "_type": "fulltext",
                "_id": "4",
                "_score": 2,
                "_source": {
                    "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
                },
                "highlight": {
                    "content": [
                        "<tag1>中国</tag1>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首 "
                    ]
                }
            },
            {
                "_index": "index",
                "_type": "fulltext",
                "_id": "3",
                "_score": 2,
                "_source": {
                    "content": "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"
                },
                "highlight": {
                    "content": [
                        "均每天扣1艘<tag1>中国</tag1>渔船 "
                    ]
                }
            }
        ]
    }
}
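For readers who would rather script the walkthrough than paste curl commands, the four steps above might be replayed from Python roughly as follows. This is only a sketch using the standard library: the helper names (build_request, walkthrough_requests, send) are made up for illustration, and the URLs and bodies simply mirror the curl examples. Nothing is sent until send() is called against a running node with the mmseg plugin installed.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # same host and port as the curl examples


def build_request(method, path, body=None):
    """Build (but do not send) one HTTP request for the Elasticsearch REST API."""
    data = None
    headers = {}
    if body is not None:
        data = json.dumps(body, ensure_ascii=False).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data,
                                  headers=headers, method=method)


def walkthrough_requests():
    """Return the requests for steps 1 to 4, in the order they should be sent."""
    mapping = {"properties": {"content": {
        "type": "text",
        "term_vector": "with_positions_offsets",
        "analyzer": "mmseg_maxword",
        "search_analyzer": "mmseg_maxword",
    }}}
    docs = [
        "美国留给伊拉克的是个烂摊子吗",
        "公安部：各地校车将享最高路权",
        "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船",
        "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首",
    ]
    query = {
        "query": {"term": {"content": "中国"}},
        "highlight": {
            "pre_tags": ["<tag1>", "<tag2>"],
            "post_tags": ["</tag1>", "</tag2>"],
            "fields": {"content": {}},
        },
    }
    requests = [
        build_request("PUT", "/index"),
        build_request("POST", "/index/fulltext/_mapping", mapping),
    ]
    for doc_id, content in enumerate(docs, start=1):
        requests.append(
            build_request("POST", f"/index/fulltext/{doc_id}", {"content": content}))
    requests.append(build_request("POST", "/index/fulltext/_search", query))
    return requests


def send(request):
    """Send one request and decode the JSON response (needs a running node)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling send() on each item of walkthrough_requests() in order should reproduce the steps; keeping construction separate from sending makes it easy to inspect the requests first.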

Reference:

Adding Chinese word segmentation to Elasticsearch (为elastic添加中文分词): http://blog.csdn.net/dingzfang/article/details/42776693

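15. bosonnlp (BosonNLP Chinese analyzer)

Download: https://github.com/bosondata/elasticsearch-analysis-bosonnlp

Usage:

Before starting Elasticsearch, edit elasticsearch.yml in the config folder to declare the BosonNLP Chinese analyzer and to supply your BosonNLP API_TOKEN and the URL of the BosonNLP segmentation API. That is, append the following to the end of the file: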

index:
  analysis:
    analyzer:
      bosonnlp:
          type: bosonnlp
          API_URL: http://api.bosonnlp.com/tag/analysis
          # You MUST give the API_TOKEN value, otherwise it doesn't work
          API_TOKEN: *PUT YOUR API TOKEN HERE*
          # Please uncomment if you want to specify ANY ONE of the following
          # arguments, otherwise the DEFAULT value will be used, i.e.,
          # space_mode is 0,
          # oov_level is 3,
          # t2s is 0,
          # special_char_conv is 0.
          # More details can be found in bosonnlp docs:
          # http://docs.bosonnlp.com/tag.html
          #
          #
          # space_mode: put your value here(range from 0-3)
          # oov_level: put your value here(range from 0-4)
          # t2s: put your value here(range from 0-1)
          # special_char_conv: put your value here(range from 0-1)

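Notes:

API_URL must be set to the segmentation endpoint given above, and the placeholder in API_TOKEN: *PUT YOUR API TOKEN HERE* must be replaced with your BosonNLP API token; otherwise the BosonNLP Chinese analyzer will not work. The API token is issued when you register a BosonNLP account.

If the configuration file already defines other analyzers, add the bosonnlp analyzer directly under the existing analyzer key, as shown above.

If you run several nodes and all of them need the BosonNLP plugin, the plugin must be installed and the yaml file configured as above on every node.

BosonNLP segmentation also offers four parameters (space_mode, oov_level, t2s, special_char_conv) to cover different segmentation needs. To use the defaults, change nothing; otherwise, uncomment the relevant parameter and give it a value.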
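Because an out-of-range value is easy to type into a commented-out line, it can help to check overrides before editing elasticsearch.yml. The following is a minimal sketch: the ranges and defaults are copied from the comments in the configuration above, while the function name resolve_params and the decision to reject non-integer values are assumptions made for illustration, not part of the plugin.

```python
# Allowed ranges and default values, as listed in the configuration comments.
PARAM_RANGES = {
    "space_mode": (0, 3),
    "oov_level": (0, 4),
    "t2s": (0, 1),
    "special_char_conv": (0, 1),
}
PARAM_DEFAULTS = {"space_mode": 0, "oov_level": 3, "t2s": 0, "special_char_conv": 0}


def resolve_params(overrides=None):
    """Merge overrides into the defaults, rejecting unknown or out-of-range values."""
    params = dict(PARAM_DEFAULTS)
    for name, value in (overrides or {}).items():
        if name not in PARAM_RANGES:
            raise ValueError(f"unknown parameter: {name}")
        low, high = PARAM_RANGES[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not low <= value <= high:
            raise ValueError(f"{name} must be an integer in [{low}, {high}], got {value!r}")
        params[name] = value
    return params


if __name__ == "__main__":
    # Only oov_level is overridden; the other three keep their defaults.
    for name, value in resolve_params({"oov_level": 4}).items():
        print(f"{name}: {value}")
```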

Testing:

Create the index:

curl -XPUT 'localhost:9200/test'

Check that the analyzer is configured correctly:

curl -XGET 'localhost:9200/test/_analyze?analyzer=bosonnlp&pretty' -d '这是玻森数据分词的测试'

Result:

{
  "tokens" : [ {
    "token" : "这",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "是",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "玻森",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "数据",
    "start_offset" : 4,
    "end_offset" : 6,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "分词",
    "start_offset" : 6,
    "end_offset" : 8,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "的",
    "start_offset" : 8,
    "end_offset" : 9,
    "type" : "word",
    "position" : 5
  }, {
    "token" : "测试",
    "start_offset" : 9,
    "end_offset" : 11,
    "type" : "word",
    "position" : 6
  } ]
}
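One way to read this response is that start_offset and end_offset index back into the input string, so each token should equal the slice of text it points to. The sketch below checks that for the tokens shown above. The function name check_offsets is hypothetical, the token list is an abbreviated copy of the response, and the comparison assumes every character is in the Basic Multilingual Plane so that Python string indices line up with the reported offsets.

```python
TEXT = "这是玻森数据分词的测试"

# Abbreviated copy of the response above, keeping only the fields used here.
RESPONSE = {"tokens": [
    {"token": "这", "start_offset": 0, "end_offset": 1, "position": 0},
    {"token": "是", "start_offset": 1, "end_offset": 2, "position": 1},
    {"token": "玻森", "start_offset": 2, "end_offset": 4, "position": 2},
    {"token": "数据", "start_offset": 4, "end_offset": 6, "position": 3},
    {"token": "分词", "start_offset": 6, "end_offset": 8, "position": 4},
    {"token": "的", "start_offset": 8, "end_offset": 9, "position": 5},
    {"token": "测试", "start_offset": 9, "end_offset": 11, "position": 6},
]}


def check_offsets(text, response):
    """Return the tokens in position order, verifying each against its offsets."""
    tokens = sorted(response["tokens"], key=lambda t: t["position"])
    for t in tokens:
        piece = text[t["start_offset"]:t["end_offset"]]
        if piece != t["token"]:
            raise ValueError(f"offsets of {t['token']!r} point at {piece!r}")
    return [t["token"] for t in tokens]


if __name__ == "__main__":
    print(" / ".join(check_offsets(TEXT, RESPONSE)))
```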

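Configuring a token filter

The existing BosonNLP analyzer has no built-in token filter. If you need to filter tokens, you can build a custom analyzer from the BosonNLP tokenizer and the token filters that ES provides.

Steps

Configuring the custom analyzer takes three steps:

Add the BosonNLP tokenizer
In elasticsearch.yml, add a tokenizer key under analysis and put the BosonNLP tokenizer configuration inside it: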

index:
  analysis:
    analyzer:
      ...
    tokenizer:
      bosonnlp:
          type: bosonnlp
          API_URL: http://api.bosonnlp.com/tag/analysis
          # You MUST give the API_TOKEN value, otherwise it doesn't work
          API_TOKEN: *PUT YOUR API TOKEN HERE*
          # Please uncomment if you want to specify ANY ONE of the following
          # arguments, otherwise the DEFAULT value will be used, i.e.,
          # space_mode is 0,
          # oov_level is 3,
          # t2s is 0,
          # special_char_conv is 0.
          # More details can be found in bosonnlp docs:
          # http://docs.bosonnlp.com/tag.html
          #
          #
          # space_mode: put your value here(range from 0-3)
          # oov_level: put your value here(range from 0-4)
          # t2s: put your value here(range from 0-1)
          # special_char_conv: put your value here(range from 0-1)

Add the token filter

In elasticsearch.yml, add a filter key under analysis and put the configuration of the filter you need inside it (this example uses the lowercase filter):

index:
  analysis:
    analyzer:
      ...
    tokenizer:
      ...
    filter:
      lowercase:
          type: lowercase

Add the custom analyzer

In elasticsearch.yml, add the custom analyzer's configuration under analysis, inside the analyzer key (in this example the custom analyzer is named filter_bosonnlp):

index:
  analysis:
    analyzer:
      ...
      filter_bosonnlp:
          type: custom
          tokenizer: bosonnlp
          filter: [lowercase]

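Custom analyzers

Elasticsearch ships with a number of ready-made analyzers, but its real strength here is that you can build a custom analyzer by combining character filters, a tokenizer, and token filters in a configuration that suits your particular data.

Character filters:

Character filters tidy up a string before it is tokenized. For example, if our text is HTML, it contains tags such as <p> or <div> that we do not want to index. The html_strip character filter removes all HTML tags and also converts HTML entities, for example turning &Aacute; into the corresponding Unicode character Á.

An analyzer may have zero or more character filters.

Tokenizer:

An analyzer must have exactly one tokenizer. The tokenizer breaks the string into individual terms, or tokens. The standard tokenizer, which the standard analyzer uses, splits a string into terms on word boundaries and removes most punctuation, but other tokenizers with different behavior exist.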

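Token filters:

After tokenization, the resulting token stream passes through the specified token filters, in the order in which they are specified.

Token filters can modify, add, or remove tokens. We have already mentioned the lowercase and stop token filters, but Elasticsearch offers many others. Stemming token filters reduce words to their stems. The asciifolding filter removes diacritics, turning a word like "très" into "tres". The ngram and edge_ngram token filters produce tokens suited to partial matching or autocomplete.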
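To make the idea of an ordered chain of token filters concrete, here is a small Python sketch. It does not use Elasticsearch at all. fold_diacritics and edge_ngrams are rough stand-ins for what the asciifolding and edge_ngram filters do, not their actual implementations, and the gram sizes 1 to 3 are arbitrary values chosen for the example.

```python
import unicodedata


def fold_diacritics(token):
    """Drop combining marks after NFD decomposition (rough stand-in for asciifolding)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edge_ngrams(token, min_gram=1, max_gram=3):
    """Return the prefixes of token with lengths from min_gram to max_gram."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def apply_filters(tokens, filters):
    """Pass the token stream through each filter, in the order given."""
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


if __name__ == "__main__":
    chain = [
        lambda ts: [t.lower() for t in ts],                  # lowercase
        lambda ts: [fold_diacritics(t) for t in ts],         # remove diacritics
        lambda ts: [g for t in ts for g in edge_ngrams(t)],  # prefixes for autocomplete
    ]
    print(apply_filters(["Très", "Quick"], chain))
    # ['t', 'tr', 'tre', 'q', 'qu', 'qui']
```

Order matters: each filter only sees what the previous one emitted, so here the prefixes are generated from tokens that are already lowercased and folded.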

Creating a custom analyzer

Character filters, tokenizers, and token filters are each configured in their own section under analysis:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": { ... custom character filters ... },
            "tokenizer":   { ...    custom tokenizers     ... },
            "filter":      { ...   custom token filters   ... },
            "analyzer":    { ...    custom analyzers      ... }
        }
    }
}

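As an example, the custom analyzer built in the rest of this section does the following:

1. Strip out HTML with the html_strip character filter.

2. Replace & with " and " using a custom mapping character filter:

"char_filter": {
    "&_to_and": {
        "type":       "mapping",
        "mappings": [ "&=> and "]
    }
}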

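3. Tokenize with the standard tokenizer.

4. Lowercase the terms with the lowercase token filter.

5. Remove the words in a custom stopword list with a custom stop token filter:

"filter": {
    "my_stopwords": {
        "type":        "stop",
        "stopwords": [ "the", "a" ]
    }
}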

The analyzer definition combines the predefined tokenizer and filters with the custom filters configured above:

"analyzer": {
    "my_analyzer": {
        "type":           "custom",
        "char_filter":  [ "html_strip", "&_to_and" ],
        "tokenizer":      "standard",
        "filter":       [ "lowercase", "my_stopwords" ]
    }
}

Put together, the complete create-index request looks like this:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type":       "mapping",
                    "mappings": [ "&=> and "]
                }
            },
            "filter": {
                "my_stopwords": {
                    "type":       "stop",
                    "stopwords": [ "the", "a" ]
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type":         "custom",
                    "char_filter":  [ "html_strip", "&_to_and" ],
                    "tokenizer":    "standard",
                    "filter":       [ "lowercase", "my_stopwords" ]
                }
            }
        }
    }
}
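A common slip when assembling a request like this by hand is to name a filter in the analyzer that was never defined. The sketch below builds the same analysis settings as a Python dict and lists any such dangling names before the request is sent. The function undefined_references and the BUILT_IN table are illustrative assumptions; the table covers only the built-in names this example uses, whereas Elasticsearch ships many more.

```python
import json

# Built-in components referenced by this example only (not an exhaustive list).
BUILT_IN = {
    "char_filter": {"html_strip"},
    "tokenizer": {"standard"},
    "filter": {"lowercase"},
}

analysis = {
    "char_filter": {
        "&_to_and": {"type": "mapping", "mappings": ["&=> and "]},
    },
    "filter": {
        "my_stopwords": {"type": "stop", "stopwords": ["the", "a"]},
    },
    "analyzer": {
        "my_analyzer": {
            "type": "custom",
            "char_filter": ["html_strip", "&_to_and"],
            "tokenizer": "standard",
            "filter": ["lowercase", "my_stopwords"],
        },
    },
}


def undefined_references(analysis):
    """List (analyzer, kind, name) for components neither built in nor defined."""
    missing = []
    for analyzer_name, spec in analysis.get("analyzer", {}).items():
        refs = {
            "char_filter": spec.get("char_filter", []),
            "tokenizer": [spec["tokenizer"]] if "tokenizer" in spec else [],
            "filter": spec.get("filter", []),
        }
        for kind, names in refs.items():
            known = BUILT_IN[kind] | set(analysis.get(kind, {}))
            missing.extend((analyzer_name, kind, n) for n in names if n not in known)
    return missing


if __name__ == "__main__":
    problems = undefined_references(analysis)
    if problems:
        print("undefined components:", problems)
    else:
        # Body for PUT /my_index
        print(json.dumps({"settings": {"analysis": analysis}}, indent=4))
```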

Once the index has been created, use the analyze API to test the new analyzer:

GET /my_index/_analyze
{
    "analyzer": "my_analyzer",
    "text":     "The quick & brown fox"
}

The following abbreviated result shows that the analyzer is working correctly:

{
  "tokens" : [
      { "token" :   "quick",    "position" : 2 },
      { "token" :   "and",      "position" : 3 },
      { "token" :   "brown",    "position" : 4 },
      { "token" :   "fox",      "position" : 5 }
    ]
}
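To see why the output looks the way it does, the five steps of my_analyzer can be imitated in plain Python. This is an approximation for illustration only: the regular expressions are crude stand-ins for html_strip and the standard tokenizer, and positions are counted from 1 purely so that the numbers line up with the abbreviated result above.

```python
import html
import re

STOPWORDS = {"the", "a"}  # same list as my_stopwords


def my_analyzer(text):
    """Approximate my_analyzer; returns (token, position) pairs, positions from 1."""
    # Character filters: drop tags and decode entities (html_strip),
    # then map & to " and " (the &_to_and mapping filter).
    text = html.unescape(re.sub(r"<[^>]+>", "", text))
    text = text.replace("&", " and ")
    # Tokenizer: runs of word characters, a crude stand-in for "standard".
    tokens = re.findall(r"\w+", text)
    # Token filters: lowercase, then drop stopwords. A removed stopword still
    # consumes its position, which is why "quick" ends up at position 2.
    result = []
    for position, token in enumerate(tokens, start=1):
        token = token.lower()
        if token not in STOPWORDS:
            result.append((token, position))
    return result


if __name__ == "__main__":
    print(my_analyzer("The quick & brown fox"))
    # [('quick', 2), ('and', 3), ('brown', 4), ('fox', 5)]
```

"The" is tokenized at position 1 and then removed by the stop filter, which leaves the gap visible in the positions of the remaining tokens.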

The analyzer is of little use until we tell Elasticsearch where to apply it. We can apply it to a text field with a mapping like this:

PUT /my_index/_mapping/my_type
{
    "properties": {
        "title": {
            "type":      "text",
            "analyzer":  "my_analyzer"
        }
    }
}
