干货 | Elasticsearch Nested 数组大小求解,一网打尽!
1、实战线上 Nested 问题
如何查询所有 objectList (Nested 类型)里面的 lossStatus="ENABLE" 且 objectList 的数组大小大于2的数据?
——问题来源:死磕Elasticsearch 知识星球
2、数据模型
索引导入和样例数据批量写入如下所示。
PUT appweb
{
"mappings": {
"properties": {
"name": {
"type": "text"
},
"orderTime": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
},
"objectList": {
"type": "nested",
"properties": {
"addTime": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
},
"customerPersonId": {
"type": "long"
},
"lossStatus": {
"type": "text"
}
}
}
}
}
}
POST appweb/_bulk
{"index":{"_id":1}}
{"name":"111","orderTime":"2022-02-02 02:02:02","objectList":[{"addTime":"2022-02-02 02:02:02","customerPersonId":101,"lossStatus":"ENABLE"},{"addTime":"2022-02-02 02:02:02","customerPersonId":102,"lossStatus":"ENABLE"}]}
{"index":{"_id":2}}
{"name":"222","orderTime":"2022-02-02 02:02:02","objectList":[{"addTime":"2022-02-02 02:02:02","customerPersonId":201,"lossStatus":"2222"},{"addTime":"2022-02-02 02:02:02","customerPersonId":202,"lossStatus":"2222"},{"addTime":"2022-02-02 02:02:02","customerPersonId":203,"lossStatus":"3333"}]}
{"index":{"_id":3}}
{"name":"111","orderTime":"2022-02-02 02:02:02","objectList":[{"addTime":"2022-02-02 02:02:02","customerPersonId":101,"lossStatus":"ENABLE"}]}
{"index":{"_id":4}}
{"name":"111","orderTime":"2022-02-02 02:02:02","objectList":[{"addTime":"2022-02-02 02:02:02","customerPersonId":101,"lossStatus":"ENABLE"},{"addTime":"2022-02-02 02:02:02","customerPersonId":102,"lossStatus":"ENABLE"},{"addTime":"2022-02-02 02:02:02","customerPersonId":103,"lossStatus":"ENABLE"}]}
开搞,方案逐步展开讨论。
3、问题拆解
涉及三个核心知识点:
其一:检索数据涉及 Nested 类型。 其二:检索条件1:objectList (Nested 类型)下的 lossStatus="ENABLE"。
这个在检索的时候要注意指定 path,否则会报错。
其三:检索条件2:获取 objectList 的数组大小大于 2 的数据?
问题转化为:检索条件1、检索条件2的组合实现。
3.1 检索条件 1 实现
POST appweb/_search
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "objectList",
"query": {
"match_phrase": {
"objectList.lossStatus": "ENABLE"
}
}
}
}
]
}
}
}
中规中矩的 Nested 语法,无需过多解释。唯一强调的是:path
的用法。
如果 Nested 语法不熟悉,可以参考官方文档:
https://www.elastic.co/guide/en/elasticsearch/reference/8.0/query-dsl-nested-query.html
3.2 检索条件 2 实现
本质是获取 objectList 的数组大小大于 2 的数据。再进一步缩小范围是:获取 objectList 数组的大小。
问题转化为如何获取 Nested 嵌套类型数组大小?
这里的确没有非常现成的实现,我总结了如下几种方案。
方案1:function_score 检索实现
该方案包含了:3.1 小节 检索条件 1 的实现,完整实现如下。
POST appweb/_search
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "objectList",
"query": {
"match_phrase": {
"objectList.lossStatus": "ENABLE"
}
}
}
},
{
"function_score": {
"query": {
"match_all": {}
},
"functions": [
{
"script_score": {
"script": {
"source": "params._source.containsKey('objectList') && params._source['objectList'] != null && params._source.objectList.size() > 2 ? 2 : 0"
}
}
}
],
"min_score": 1
}
}
]
}
}
}
注意在 script_score 下做了多条件判断:
params._source.containsKey('objectList')
params._source['objectList'] != null
params._source.objectList.size() > 2
官方语法参考:
https://www.elastic.co/guide/en/elasticsearch/reference/8.0/query-dsl-function-score-query.html
https://www.elastic.co/guide/en/elasticsearch/painless/8.0/painless-score-context.html
方案2:funciton_score 检索实现2
POST appweb/_search
{
"query": {
"function_score": {
"query": {
"bool": {
"must": [
{
"nested": {
"path": "objectList",
"query": {
"exists": {
"field": "objectList.customerPersonId"
}
},
"score_mode": "sum"
}
},
{
"nested": {
"path": "objectList",
"query": {
"match_phrase": {
"objectList.lossStatus": "ENABLE"
}
}
}
}
]
}
},
"functions": [
{
"script_score": {
"script": {
"source": "_score >= 3 ? 1 : 0"
}
}
}
],
"boost_mode": "replace"
}
},
"min_score": 1
}
该方式本质是曲线救国
,借助:sum 求和累加评分实现。
实现条件是:存在字段“objectList.customerPersonId”,评分就高。该方式不太容易想到,“可遇而不可求”
。
方案3:runtime_field 运行时字段实现
POST appweb/_search
{
"runtime_mappings": {
"objectList_tmp": {
"type": "keyword",
"script": """
int genre = params['_source']['objectList'].size();
emit(genre.toString());
"""
}
},
"query": {
"bool": {
"must": [
{
"nested": {
"path": "objectList",
"query": {
"match_phrase": {
"objectList.lossStatus": "ENABLE"
}
}
}
},
{
"range": {
"objectList_tmp": {
"gte": 3
}
}
}
]
}
}
}
这是我整合了聚合 + runtime_field 实现的结果,召回结果达到预期且令人满意。
最后发现聚合部分是多余的,删除之。
解读如下:
第一:新加了运行时字段——objectList_tmp,目的:获取 Nested 数组大小。 第二:结合已有 nested 检索组合 bool 检索实现即可。
综合对比看,它比下面的方案4更简洁,如果线上环境想不修改数据的前提下使用,推荐此方案。
方案4:聚合实现
GET appweb/_search
{
"size": 0,
"query": {
"nested": {
"path": "objectList",
"query": {
"match_phrase": {
"objectList.lossStatus": "ENABLE"
}
}
}
},
"aggs": {
"counts_aggs": {
"terms": {
"script": "params['_source']['objectList'].size()"
},
"aggs": {
"top_hits_aggs": {
"top_hits": {
"size": 10
}
}
}
}
}
}
对比方案 3,方案 4相对鸡肋和繁冗、复杂。
也更进一步体会:runtime_field 的妙处。
4、换个思路?轻装上阵!
什么思路?之前文章有过解读——空间换时间。
具体实现如下:
4.1 步骤1:预处理新增字段 nested_size。
PUT _ingest/pipeline/add_nested_size_pipeline
{
"processors": [
{
"script": {
"lang": "painless",
"source": "ctx.nested_size = ctx.objectList.size();"
}
}
]
}
4.2 步骤2:创建索引且导入数据。
创建索引同时指定步骤 1 的 pipeline 预处理管道。
PUT appweb_ext
{
"settings": {
"index": {
"default_pipeline": "add_nested_size_pipeline"
}
},
"mappings": {
"properties": {
"name": {
"type": "text"
},
"orderTime": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
},
"objectList": {
"type": "nested",
"properties": {
"addTime": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
},
"customerPersonId": {
"type": "long"
},
"lossStatus": {
"type": "text"
}
}
}
}
}
}
POST appweb_ext/_bulk
{"index":{"_id":1}}
{"name":"111","orderTime":"2022-02-02 02:02:02","objectList":[{"addTime":"2022-02-02 02:02:02","customerPersonId":101,"lossStatus":"ENABLE"},{"addTime":"2022-02-02 02:02:02","customerPersonId":102,"lossStatus":"ENABLE"}]}
{"index":{"_id":2}}
{"name":"222","orderTime":"2022-02-02 02:02:02","objectList":[{"addTime":"2022-02-02 02:02:02","customerPersonId":201,"lossStatus":"2222"},{"addTime":"2022-02-02 02:02:02","customerPersonId":202,"lossStatus":"2222"},{"addTime":"2022-02-02 02:02:02","customerPersonId":203,"lossStatus":"3333"}]}
{"index":{"_id":3}}
{"name":"111","orderTime":"2022-02-02 02:02:02","objectList":[{"addTime":"2022-02-02 02:02:02","customerPersonId":101,"lossStatus":"ENABLE"}]}
{"index":{"_id":4}}
{"name":"111","orderTime":"2022-02-02 02:02:02","objectList":[{"addTime":"2022-02-02 02:02:02","customerPersonId":101,"lossStatus":"ENABLE"},{"addTime":"2022-02-02 02:02:02","customerPersonId":102,"lossStatus":"ENABLE"},{"addTime":"2022-02-02 02:02:02","customerPersonId":103,"lossStatus":"ENABLE"}]}
4.3 步骤3:复杂脚本检索变成简单检索实现。
bool 组合条件,一个 nested 检索 + 一个 range query,轻松搞定!
POST appweb_ext/_search
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "objectList",
"query": {
"match_phrase": {
"objectList.lossStatus": "ENABLE"
}
}
}
},
{
"range": {
"nested_size": {
"gt": 2
}
}
}
]
}
}
}
此方案是我极力推广的方案,需要我们多结合业务实际,多在数据写入前的设计阶段、数据建模阶段做“文章”。而不是快速导入数据,后面丢给复杂的检索脚本实现。
一般项目实战阶段,很多人会说,“工期要紧,我管不了那么多”。项目后期复盘会发现,“看似快了,实则慢了”,最终感叹:“预处理的工作不要省也不能省”!
5、小结
看似简单的几个方案,我从入手到梳理完毕耗时大于 6 个小时+。主要是painless 脚本没有固定的章法可循,需要摸索和反复验证。
意外收获是方案3,基于方案 4 的创新方案,比较灵活好用。
但,我更推荐空间换时间的方案。能预处理
搞定的事情,就不要留到检索阶段实现。
欢迎留言说下您的方案和思考!
6、参考
https://stackoverflow.com/questions/64447956
https://stackoverflow.com/questions/54022283
https://stackoverflow.com/questions/57144172
https://t.zsxq.com/FAQ7mUN
https://www.ru-rocker.com/2020/11/03/filtering-nested-array-objects-in-elasticsearch-document-with-painless-scripting/
https://medium.com/@felipegirotti/elasticsearch-filter-field-array-more-than-zero-8d52d067d3a0
推荐
更短时间更快习得更多干货!
和全球近 1600+ Elastic 爱好者一起精进!