Elasticsearch by Analogy, the Outclass-Your-Peers Series: Search

大数据技术与架构 (Big Data Technology and Architecture), 2021-10-21


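I. Structure of a Search Request

1. Determining the Search Scope

2. Basic Modules of a Search Request

3. Request-Body Searches

4. Structure of the Response

II. Queries and Filters

1. match

2. term

3. query_string

III. Compound Queries

1. bool Query

2. bool Filter

IV. Other Queries and Filters

1. range Query and Filter

2. prefix Query and Filter

3. wildcard Query

4. exists Filter

5. missing Filter

6. Turning Any Query into a Filter

V. Choosing the Best Query for the Task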

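Figure 1 shows how ES executes a search request. The index in the figure has two shards, each with one replica. After documents are located and scored, only the top 10 are fetched by default. The REST API search request goes to the node the client is connected to; based on the indices being queried, that node forwards the request to one copy (primary or replica) of every relevant shard. Once enough sorting and ranking information has been collected from all shards, only the shards that hold the required documents are asked to return the actual content. This routing behavior is configurable; Figure 1 shows the default, called query_then_fetch.

Figure 1. How a search request is routed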
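
To make the two phases concrete, here is a minimal Python sketch of query_then_fetch. It is a toy model, not how ES is implemented: the SHARDS dictionary, the scores, and the function names query_phase, fetch_phase, and query_then_fetch are all made up for illustration.

```python
import heapq

# Toy data: two shards, each holding {doc_id: (score, source)}.
# The scores are invented; in ES they come from Lucene.
SHARDS = {
    0: {"a": (0.9, {"title": "doc a"}), "b": (0.2, {"title": "doc b"})},
    1: {"c": (0.7, {"title": "doc c"}), "d": (0.5, {"title": "doc d"})},
}


def query_phase(shard_id, size):
    """Each shard reports only (score, shard_id, doc_id) for its local top hits."""
    docs = SHARDS[shard_id]
    local = [(score, shard_id, doc_id) for doc_id, (score, _) in docs.items()]
    return heapq.nlargest(size, local)


def fetch_phase(winners):
    """Only the shards that own a winning document are asked for its _source."""
    return [
        {"_id": doc_id, "_score": score, "_source": SHARDS[shard_id][doc_id][1]}
        for score, shard_id, doc_id in winners
    ]


def query_then_fetch(size=10):
    candidates = []
    for shard_id in SHARDS:
        candidates.extend(query_phase(shard_id, size))
    # Global merge on the coordinating node, then fetch the winners.
    winners = heapq.nlargest(size, candidates)
    return fetch_phase(winners)


for hit in query_then_fetch(size=2):
    print(hit["_id"], hit["_score"], hit["_source"])
```

With size=2 the merge keeps documents a and c, so each of the two toy shards is asked for a single document in the fetch phase.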

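I. Structure of a Search Request

An ES search is issued either as a JSON request body or as a URL-based request.

1. Determining the Search Scope

All REST search requests use the _search endpoint and may be sent as either GET or POST. You can search the whole cluster, or narrow the scope by naming indices or types in the search URL: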

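# Search the whole cluster with no conditions
curl '172.16.1.127:9200/_search?pretty'
curl '172.16.1.127:9200/_all/_search?pretty'
curl '172.16.1.127:9200/*/_search?pretty'

# Search the get-together index with no conditions, like SQL: select * from get-together;
curl '172.16.1.127:9200/get-together/_search?pretty'

# Types are deprecated in ES6, so this behaves the same as the request above
curl '172.16.1.127:9200/get-together/_doc/_search?pretty'

# Search both the get-together and dbinfo indices
curl '172.16.1.127:9200/get-together,dbinfo/_doc/_search?pretty'

# Wildcard on index names: every index starting with get-toge, excluding get-together
# (ES6 removed the optional + prefix from index expressions, so only the - exclusion is marked)
curl '172.16.1.127:9200/get-toge*,-get-together/_search?pretty'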

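As with a database, restrict a query to as few indices as possible for better performance. Every search request must be sent to all shards of the targeted indices (comparable to a full index scan in a DB), so the more indices it targets, the more shards are involved.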

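2. Basic Modules of a Search Request

By analogy with a SQL query:

select ... from ... where ... order by ... limit ...

where <-> query
select ... <-> _source
limit <-> size + from
order by <-> sort

The basic modules of a search request are:

query: configures the query and filter DSL that restricts the search, like the where clause of a SQL query.
size: the number of documents to return, like the row count in a SQL limit clause.
from: used together with size for paging, like the offset in a SQL limit clause. As the result set grows, fetching pages far from the start becomes an expensive operation. (The deferred-join idea from SQL should also work in ES: first search for the IDs on a page, then look up the fields by ID.)
_source: controls how the _source field is returned. By default the complete _source is returned, like select * in SQL; configuring _source filters the returned fields.
sort: like the SQL order by clause. The default ordering is by document score.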
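
The modules above can be sketched as a small Python helper that assembles a request body. build_search_body and its page parameter are hypothetical names introduced here, not part of any ES client; the only ES-specific parts are the body keys query, from, size, _source, and sort described in the list.

```python
import json


def build_search_body(query=None, page=1, size=10, source=None, sort=None):
    """Assemble a search request body from the basic modules (illustrative helper)."""
    if page < 1 or size < 0:
        raise ValueError("page must be >= 1 and size must be >= 0")
    body = {
        # No query given: fall back to match_all, like SQL without a where clause.
        "query": query if query is not None else {"match_all": {}},
        # from is a zero-based offset, so page 2 with size 10 starts at 10.
        "from": (page - 1) * size,
        "size": size,
    }
    if source is not None:
        body["_source"] = list(source)
    if sort is not None:
        body["sort"] = list(sort)
    return body


body = build_search_body(
    page=2,
    size=10,
    source=["name", "organizer", "description"],
    sort=[{"created_on": "desc"}],
)
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to curl with -d, in the same way as the request-body examples in this article.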

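Here are some simple examples.
(1) Return the second page of 10 results

# from is zero-based in ES
curl '172.16.1.127:9200/get-together/_search?from=10&size=10&pretty'

(2) Sort by date ascending and return the first 10 results

curl '172.16.1.127:9200/get-together/_search?sort=date:asc&pretty'

(3) Sort by date ascending and return only the title and date fields of the first 10 results

curl '172.16.1.127:9200/get-together/_search?sort=date:asc&_source=title,date&pretty'

(4) Match every document whose title contains "elasticsearch" (compared in lowercase) and return the results in ascending date order

curl '172.16.1.127:9200/get-together/_search?sort=date:asc&q=title:elasticsearch&pretty'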

3. Request-Body Searches

The searches so far were all URL-based. For more advanced searches, a request-body search offers more flexibility and more options. ES recommends request-body searches.

(1) Return the second page of 10 results

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": { "match_all": {} },
  "from": 10,
  "size": 10
}'

(2) Return selected fields

# Return only the name and date fields
curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": { "match_all": {} },
  "_source": [ "name", "date" ]
}'

(3) Use wildcards in _source to select fields

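# Return the fields starting with location plus the date field, but not location.geolocation
# (includes/excludes are the documented key names; include/exclude are the older spellings)
curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": { "match_all": {} },
  "_source": {
    "includes": [ "location.*", "date" ],
    "excludes": [ "location.geolocation" ]
  }
}'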

(4) Sort the results

# Enable fielddata on the text field name so that it can be used for sorting
curl -XPOST "172.16.1.127:9200/get-together/_mapping/_doc?pretty" -H 'Content-Type: application/json' -d '
{
  "properties": {
    "name": {
      "type": "text",
      "fielddata": true
    }
  }
}'

# Like SQL: order by created_on asc, name desc, _score
curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": { "match_all": {} },
  "sort": [
    { "created_on": "asc" },
    { "name": "desc" },
    "_score"
  ]
}'

(5) Combine the basic modules

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": { "match_all": {} },
  "from": 0,
  "size": 10,
  "_source": [ "name", "organizer", "description" ],
  "sort": [ { "created_on": "desc" } ]
}'

This is similar to the following SQL query:

select name, organizer, description from get-together order by created_on desc limit 0, 10;

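Note that when a document has no value for one of the requested fields, by default the field name does not appear at all in the _source that ES returns. This differs from SQL, where the column would still appear with a null value.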

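4. Structure of the Response

Now look at the data structure that an ES search returns.

curl '172.16.1.127:9200/_search?q=title:elasticsearch&_source=title,date&pretty'

Response:

{
  "took" : 13,                          # milliseconds the query took
  "timed_out" : false,                  # whether the search timed out
  "_shards" : {
    "total" : 28,                       # number of shards searched
    "successful" : 28,                  # number of shards that succeeded
    "skipped" : 0,                      # number of shards skipped
    "failed" : 0                        # number of shards that failed
  },
  "hits" : {
    "total" : 7,                        # number of matching documents
    "max_score" : 1.0128567,            # highest document score
    "hits" : [                          # array of matching documents
      {
        "_index" : "get-together",      # index the document belongs to
        "_type" : "_doc",               # type the document belongs to
        "_id" : "103",                  # document ID
        "_score" : 1.0128567,           # relevance score
        "_routing" : "2",               # routing value that decides which shard holds the document
        "_source" : {                   # the requested _source fields
          "date" : "2013-04-17T19:00",
          "title" : "Introduction to Elasticsearch"
        }
      },
      {
        "_index" : "get-together",
        "_type" : "_doc",
        "_id" : "105",
        "_score" : 1.0128567,
        "_routing" : "2",
        "_source" : {
          "date" : "2013-07-17T18:30",
          "title" : "Elasticsearch and Logstash"
        }
      },
      ...
    ]
  }
}

If neither the document's _source nor the individual fields are stored, the values cannot be retrieved from ES!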
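
As a reading aid, the following Python sketch walks a response shaped like the one above. summarize_hits is a hypothetical helper name, and the sample dictionary is a trimmed copy of the response shown in this section. The branch for an object-valued total is an assumption about newer ES versions, which report the total as an object with a value key.

```python
def summarize_hits(response):
    """Return (total, rows); each row is (index, id, score, source)."""
    hits = response["hits"]
    total = hits["total"]
    # Assumption: newer versions wrap the total, e.g. {"value": 7, "relation": "eq"}.
    if isinstance(total, dict):
        total = total["value"]
    rows = [
        (h["_index"], h["_id"], h.get("_score"), h.get("_source", {}))
        for h in hits["hits"]
    ]
    return total, rows


sample = {
    "took": 13,
    "timed_out": False,
    "hits": {
        "total": 7,
        "max_score": 1.0128567,
        "hits": [
            {
                "_index": "get-together",
                "_type": "_doc",
                "_id": "103",
                "_score": 1.0128567,
                "_source": {
                    "date": "2013-04-17T19:00",
                    "title": "Introduction to Elasticsearch",
                },
            }
        ],
    },
}

total, rows = summarize_hits(sample)
for index, doc_id, score, source in rows:
    print(index, doc_id, score, source.get("title"))
```

Using get for _score and _source keeps the helper from failing on hits that lack them, for example when _source is disabled.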

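II. Queries and Filters

Queries and filters both work like the where clause of a SQL query, selecting documents by condition, but they differ in their scoring mechanism and in search performance. Unlike a query, which computes a score for the specified terms, a filter only answers "does this document match?" with a simple yes or no. Figure 2 shows the main differences between queries and filters.

Figure 2. Because no score is computed, a filter needs less processing and can be cached

Because of this difference, filters can be faster than ordinary queries, and their results can be cached.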
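
The difference can be illustrated with a toy Python model. Everything here is an assumption made for the example: the DOCS data, the function names, the use of a raw term count as the "score", and functools.lru_cache standing in for the filter cache. Real ES scoring and caching are far more involved.

```python
from functools import lru_cache

# Toy documents; tags are exact values, title is free text.
DOCS = {
    1: {"tags": ("elasticsearch", "lucene"), "title": "elasticsearch elasticsearch intro"},
    2: {"tags": ("hadoop",), "title": "hadoop and elasticsearch"},
    3: {"tags": ("elasticsearch",), "title": "logstash"},
}


@lru_cache(maxsize=None)
def tag_filter(tag):
    """Filter: a yes/no decision per document, so the id set can be cached and reused."""
    return frozenset(doc_id for doc_id, doc in DOCS.items() if tag in doc["tags"])


def title_query(word):
    """Query: also computes a score per document (here just the term count)."""
    scores = {}
    for doc_id, doc in DOCS.items():
        count = doc["title"].split().count(word)
        if count:
            scores[doc_id] = float(count)
    return scores


def search(word, tag):
    """Score with the query, then keep only documents the filter allows."""
    allowed = tag_filter(tag)
    scored = title_query(word)
    return sorted(((s, d) for d, s in scored.items() if d in allowed), reverse=True)


print(search("elasticsearch", "elasticsearch"))
print(tag_filter.cache_info())
```

A second search with the same tag reuses the cached id set, while the scoring work in title_query is repeated for every request.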

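1. match

(1) match_all
Matches all documents, like a SQL query with no where clause.

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": { "match_all": {} }
}'

In ES6, every document returned by a match_all query has a _score of 1.0.

(2) match
Matches a condition on a field, roughly like where column='xxx' in SQL. The following query finds documents with "hadoop" in the title:

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "match": { "title": "hadoop" }
  }
}'

The match query is case-insensitive here: during matching, both the indexed terms and the input text are converted to lowercase before they are compared. A match query returns a relevance _score for each document.

By default the match query uses the OR operator. For example, searching for the text "Elasticsearch Denver" makes ES search for "Elasticsearch OR Denver", which matches both "Elasticsearch Amsterdam" and "Denver Clojure". The following query returns only results that contain both "Elasticsearch" and "Denver":

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "match": {
      "name": {
        "query": "Elasticsearch Denver",
        "operator": "and"
      }
    }
  }
}'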
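
The effect of the operator can be sketched in a few lines of Python. This is only a toy: analyze is a rough stand-in for an analyzer (lowercase, split on non-alphanumeric characters), match is a hypothetical function, and no scoring is modeled.

```python
import re


def analyze(text):
    """Rough stand-in for analysis: lowercase, then split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def match(field_value, query_text, operator="or"):
    """True if the field matches the query text under the given operator."""
    doc_terms = set(analyze(field_value))
    query_terms = analyze(query_text)
    if not query_terms:
        return False
    found = [term in doc_terms for term in query_terms]
    return all(found) if operator == "and" else any(found)


names = ["Elasticsearch Amsterdam", "Denver Clojure", "Elasticsearch Denver"]
for name in names:
    print(name,
          match(name, "Elasticsearch Denver"),
          match(name, "Elasticsearch Denver", operator="and"))
```

Under "or" all three names match; under "and" only the name containing both words does.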

(3) match_phrase
The following query finds documents whose name field contains the phrase "enterprise london", allowing one word between "enterprise" and "london":

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "enterprise london",
        "slop": 1
      }
    }
  },
  "_source": [ "name", "description" ]
}'

(4) match_phrase_prefix
In the example below, match_phrase_prefix is given "Elasticsearch den". ES uses the text "den" for prefix matching, looking through the name field for values that start with "den". max_expansions sets the maximum number of terms the prefix may expand to; because the expansion can produce a very large set, the number should be limited.

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "match_phrase_prefix": {
      "name": {
        "query": "Elasticsearch den",
        "max_expansions": 1
      }
    }
  },
  "_source": [ "name" ]
}'

(5) multi_match
Matches multiple terms across multiple fields, like SQL where name like '%elasticsearch%' or name like '%hadoop%' or description like '%elasticsearch%' or description like '%hadoop%':

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "multi_match": {
      "query": "elasticsearch hadoop",
      "fields": [ "name", "description" ]
    }
  }
}'

Just as a match query has phrase and phrase_prefix variants, a multi_match query can be turned into a phrase or phrase_prefix query by setting the type key. Apart from accepting several search fields instead of a single one, multi_match can be used just like a match query.

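2. term

term queries and filters specify a document field and a term to search for. Note that the search term is not analyzed: it must exactly match a term in the document for the document to be returned.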
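
Why exact matching matters can be shown with a toy inverted index in Python. The analyze, build_index, and term_lookup functions and the sample documents are all hypothetical; the sketch assumes the field was analyzed (lowercased) at index time while the term lookup is used as typed.

```python
def analyze(text):
    """Toy index-time analysis: lowercase and split on whitespace."""
    return text.lower().split()


def build_index(docs, field):
    """Map each analyzed term to the set of document ids containing it."""
    index = {}
    for doc_id, doc in docs.items():
        for term in analyze(doc[field]):
            index.setdefault(term, set()).add(doc_id)
    return index


def term_lookup(index, term):
    """term-style lookup: the search term is NOT analyzed, it must match as is."""
    return sorted(index.get(term, set()))


docs = {
    1: {"name": "Elasticsearch Denver"},
    2: {"name": "Denver Clojure"},
}
index = build_index(docs, "name")
print(term_lookup(index, "elasticsearch"))  # stored form of the term
print(term_lookup(index, "Elasticsearch"))  # capitalized form is not in the index
```

The second lookup finds nothing because the index only holds the lowercased term, which is the situation the note above warns about.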

(1) term query

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "term": { "tags": "elasticsearch" }
  },
  "_source": [ "name", "tags" ]
}'

(2) term filter
Like the term query, a term filter restricts the results to documents that contain a specific term, but without computing a score.

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "filter": {
        "term": { "tags": "elasticsearch" }
      }
    }
  }
}'

(3) terms query
Like the term query, the terms query searches a document field, but for several terms. For example, the following query finds documents tagged "jvm" or "hadoop".

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "terms": { "tags": [ "jvm", "hadoop" ] }
  },
  "_source": [ "name", "tags" ]
}'

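To require that each matching document contains a minimum number of the terms, specify the minimum_should_match parameter. It counts matching should clauses of a bool query, so each term goes into its own should clause:

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "should": [
        { "term": { "tags": "jvm" } },
        { "term": { "tags": "hadoop" } },
        { "term": { "tags": "lucene" } }
      ],
      "minimum_should_match": 2
    }
  }
}'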
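
When the list of terms comes from user input, the request body can be generated instead of written by hand. The Python helper below is a sketch under one assumption: it expresses "at least n of these terms" as a bool query with one term clause per value in should plus minimum_should_match. The function name at_least_n_terms is hypothetical.

```python
import json


def at_least_n_terms(field, terms, n):
    """Build a body matching documents that contain at least n of the given terms."""
    terms = list(terms)
    if not 1 <= n <= len(terms):
        raise ValueError("n must be between 1 and the number of terms")
    return {
        "query": {
            "bool": {
                # One should clause per term; minimum_should_match counts them.
                "should": [{"term": {field: term}} for term in terms],
                "minimum_should_match": n,
            }
        }
    }


print(json.dumps(at_least_n_terms("tags", ["jvm", "hadoop", "lucene"], 2), indent=2))
```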

3. query_string

The following queries search for documents containing "nosql". The two are equivalent: the first is sent through the URL, the second through the request body:

curl -XGET '172.16.1.127:9200/get-together/_search?q=nosql&pretty'

curl -XPOST '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "query_string": { "query": "nosql" }
  }
}'

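By default, a query_string query traditionally searched the _all field, which is a combination of all fields. In ES6 the _all field is deprecated and disabled for newly created indices, in which case query_string searches across the fields of the index unless told otherwise. Use default_field to set the field explicitly: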

curl -XPOST '172.16.1.127:9200/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "query_string": {
      "default_field": "description",
      "query": "nosql"
    }
  }
}'

A query can also run against several fields, in which case use fields:

curl -XPOST '172.16.1.127:9200/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "query_string": {
      "fields": ["description", "tags"],
      "query": "nosql"
    }
  }
}'

The following query finds all documents whose name contains "nosql" but excludes those whose description contains "mongodb":

curl -XPOST '172.16.1.127:9200/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "query_string": {
      "query": "name:nosql AND -description:mongodb"
    }
  }
}'

The following command finds all documents tagged search or lucene that were created between 1999 and 2001:

curl -XPOST '172.16.1.127:9200/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "query_string": {
      "query": "(tags:search OR tags:lucene) AND (created_on:[1999-01-01 TO 2001-01-01])"
    }
  }
}'

Recommended alternatives to the query_string query include the term, terms, match, and multi_match queries.

III. Compound Queries

1. bool Query

A bool query combines any number of queries in a single query. Its clauses state which parts must match (must), should match (should), or must not match (must_not) the data in the ES index.

The example below requires the attendees field to contain "david", says it should also contain "clint" or "andy", and requires date to be on or after 2013-06-30. minimum_should_match is the minimum number of should clauses a document has to satisfy to be returned. Its default value has a hidden subtlety: if a must clause is specified, the default is 0; if no must clause is specified, the default is 1.

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "must": [
        { "term": { "attendees": "david" } }
      ],
      "should": [
        { "term": { "attendees": "clint" } },
        { "term": { "attendees": "andy" } }
      ],
      "must_not": [
        { "range": { "date": { "lt": "2013-06-30T00:00" } } }
      ],
      "minimum_should_match": 1
    }
  }
}'

The query can be rewritten as follows. It is logically equivalent to the previous one but uses only the must option of the bool query, so it is shorter.

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "must": [
        { "term": { "attendees": "david" } },
        { "range": { "date": { "gte": "2013-06-30T00:00" } } },
        { "terms": { "attendees": [ "clint", "andy" ] } }
      ]
    }
  }
}'
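
The matching logic of must, should, must_not, and the minimum_should_match default described above can be written down as a small Python predicate. This is a toy model of matching only (no scoring): bool_matches and the has helper are hypothetical names, and each clause is reduced to a plain function from document to True or False.

```python
def bool_matches(doc, must=(), should=(), must_not=(), minimum_should_match=None):
    """Decide whether doc matches a bool-style combination of predicate clauses."""
    if minimum_should_match is None:
        # Default described above: 1 when there are should clauses but no must
        # clause, otherwise 0 (should clauses then only influence scoring).
        minimum_should_match = 1 if (should and not must) else 0
    if not all(clause(doc) for clause in must):
        return False
    if any(clause(doc) for clause in must_not):
        return False
    matched = sum(1 for clause in should if clause(doc))
    return matched >= minimum_should_match


def has(name):
    """Clause factory: the attendees field contains the given name."""
    return lambda doc: name in doc["attendees"]


def too_early(doc):
    # ISO-style timestamps in the same format compare correctly as strings.
    return doc["date"] < "2013-06-30T00:00"


doc = {"attendees": ["david", "clint"], "date": "2013-07-01T00:00"}
print(bool_matches(doc, must=[has("david")], should=[has("clint"), has("andy")],
                   must_not=[too_early], minimum_should_match=1))
```

Dropping minimum_should_match in the call above makes the should clauses optional, because a must clause is present.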

2. bool Filter

A bool filter behaves essentially like a bool query, except that what it combines are filters. When a bool query is used as a filter in ES6, as below, and minimum_should_match is not set, at least one should clause must match (a default of 1).

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            { "term": { "attendees": "david" } }
          ],
          "should": [
            { "term": { "attendees": "clint" } },
            { "term": { "attendees": "andy" } }
          ],
          "must_not": [
            { "range": { "date": { "lt": "2013-06-30T00:00" } } }
          ]
        }
      }
    }
  }
}'

IV. Other Queries and Filters

1. range Query and Filter

(1) Query

# where created_on > 2012-06-01 and created_on < 2012-09-01
curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "range": {
      "created_on": { "gt": "2012-06-01", "lt": "2012-09-01" }
    }
  }
}'

(2) Filter

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "created_on": { "gt": "2012-06-01", "lt": "2012-09-01" }
        }
      }
    }
  }
}'

The range query also supports string ranges. To find documents whose name falls between "c" and "e", use the following search:

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "range": {
      "name": { "gt": "c", "lt": "e" }
    }
  }
}'

When using a range query, consider carefully whether a filter is the better choice. Whether a document falls inside the range is a binary decision ("yes, the document is in the range" or "no, the document is not in the range"), so a range condition does not have to be a query. For better performance it should be a filter. If you are unsure whether to use a query or a filter, use a filter. In 99% of use cases a range filter is the right choice.
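
Following that advice, a small Python helper can wrap any range condition in filter context so that callers do not have to remember the nesting. range_filter is a hypothetical name; the only ES-specific parts are the gt, gte, lt, and lte keys and the bool/filter/range structure used in the examples above.

```python
import json

ALLOWED_BOUNDS = {"gt", "gte", "lt", "lte"}


def range_filter(field, **bounds):
    """Build a request body that applies a range condition as a filter."""
    unknown = set(bounds) - ALLOWED_BOUNDS
    if unknown or not bounds:
        raise ValueError("use one or more of gt, gte, lt, lte")
    return {"query": {"bool": {"filter": {"range": {field: dict(bounds)}}}}}


print(json.dumps(range_filter("created_on", gt="2012-06-01", lt="2012-09-01"), indent=2))
```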

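2. prefix Query and Filter

prefix queries and filters search for terms that start with a given prefix. The prefix is not analyzed before the search. For example, to find all documents in the index whose title starts with "liber", use the following query:

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "prefix": { "title": "liber" }
  }
}'

The same can be done with a filter:

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "filter": {
        "prefix": { "title": "liber" }
      }
    }
  }
}'

Because the prefix is not analyzed, prefix queries and filters are case-sensitive.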
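
The case sensitivity is easy to see on a toy term list. The Python sketch below assumes the indexed terms were lowercased at index time and are kept in sorted order; prefix_terms and the TERMS list are made up for illustration and do not reflect the actual on-disk term dictionary.

```python
from bisect import bisect_left


def prefix_terms(sorted_terms, prefix):
    """Return all terms starting with prefix from a sorted list of terms."""
    start = bisect_left(sorted_terms, prefix)  # first term >= prefix
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(term)
    return matches


# Terms as they might look after lowercasing at index time.
TERMS = sorted(["liberty", "liberal", "library", "lucene"])

print(prefix_terms(TERMS, "liber"))
print(prefix_terms(TERMS, "Liber"))
```

The capitalized prefix finds nothing because no stored term starts with an uppercase letter, so the prefix should be lowercased by the caller before it is sent.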

3. wildcard Query

# Create the index and add two documents
curl -XPOST '172.16.1.127:9200/wildcard-test/_doc/1?pretty' -H 'Content-Type: application/json' -d '
{
  "title": "The Best Bacon Ever"
}'

curl -XPOST '172.16.1.127:9200/wildcard-test/_doc/2?pretty' -H 'Content-Type: application/json' -d '
{
  "title": "How to raise a barn"
}'

# "ba*n" matches both bacon and barn
curl '172.16.1.127:9200/wildcard-test/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "wildcard": {
      "title": { "value": "ba*n" }
    }
  }
}'

# "ba?n" matches only barn, not bacon
curl '172.16.1.127:9200/wildcard-test/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "wildcard": {
      "title": { "value": "ba?n" }
    }
  }
}'

Keep in mind that a wildcard query is not as lightweight as other queries such as match. The earlier a wildcard (* or ?) appears in the search term, the more work ES has to do to find matches.
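
The semantics of the two wildcards can be tried out locally with Python's fnmatch module, where * matches any run of characters and ? matches exactly one. This is only an analogy: fnmatch additionally treats square brackets as character classes, which the wildcard query is not assumed to do, and wildcard_terms is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def wildcard_terms(terms, pattern):
    """Return the terms that match a pattern containing * and ? wildcards."""
    return [term for term in terms if fnmatchcase(term, pattern)]


# Lowercased terms from the two sample titles.
terms = ["the", "best", "bacon", "ever", "how", "to", "raise", "a", "barn"]

print(wildcard_terms(terms, "ba*n"))
print(wildcard_terms(terms, "ba?n"))
```

The helper compares the pattern against every term, which also hints at why such queries get expensive on a large term dictionary.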

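4. exists Filter

The exists filter restricts results to documents that have a value in a specific field:

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "filter": {
        "exists": { "field": "location_event.geolocation" }
      }
    }
  }
}'

5. missing Filter

The missing filter finds documents that have no value in a field, or that hold the default specified in the mapping (also called the null value, that is, null_value in the mapping). The standalone missing filter was removed in ES 5.0, so the same search is written as an exists clause inside must_not. To find documents that lack the reviews field, use the following filter:

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "must_not": {
        "exists": { "field": "reviews" }
      }
    }
  }
}'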
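
The two filters are complements of each other, which a toy Python check makes explicit. has_value is a hypothetical function that follows a dotted path through a nested dictionary; treating None and an empty list as "no value" is a simplifying assumption of this sketch.

```python
def has_value(doc, path):
    """True when the dotted path leads to a value that is not None or an empty list."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return False
        current = current[key]
    return current is not None and current != []


docs = [
    {"name": "Denver Clojure", "reviews": 4,
     "location_event": {"geolocation": "39.73,-104.98"}},
    {"name": "Elasticsearch Denver", "reviews": None},
    {"name": "Big Data"},
]

exists = [d["name"] for d in docs if has_value(d, "reviews")]
missing = [d["name"] for d in docs if not has_value(d, "reviews")]
print(exists)
print(missing)
```

Every document lands in exactly one of the two lists, mirroring the exists filter and its must_not counterpart.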

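6. Turning Any Query into a Filter

ES lets any query be turned into a filter. Older versions used a dedicated query filter for this; in ES6 the query is simply placed inside the filter clause of a bool query. For example, a query_string query that matches the name "Elasticsearch" can be turned into a filter with the following search:

curl '172.16.1.127:9200/get-together/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "filter": {
        "query_string": { "query": "name:\"Elasticsearch\"" }
      }
    }
  }
}'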

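V. Choosing the Best Query for the Task

Table 1 is a guide to which query to use in common ES use cases.

Use case	Query type to use
You want to take user input from a Google-like interface and search documents based on that input	If you want to support +/- or searching in specific fields, use the simple_query_string query
You want to treat the input as a phrase and find documents containing that phrase, possibly with some gaps (slop) between its words	To find phrases similar to what the user searched for, use the match_phrase query with a certain amount of slop
You want to search for a single keyword in a not_analyzed field and know exactly how the term should appear	Use the term query, because the search term is not analyzed
You want to combine many different search requests or different types of search into one search that handles them all	Use the bool query to combine any number of subqueries into a single query
You want to search for specific words across several fields of a document	Use the multi_match query, which behaves like the match query but searches multiple fields
You want a single search to return every document	Use the match_all query to return all documents in one search
You want to search a field for values within a certain range	Use the range query to find documents whose values fall within a range
You want to search a field for values that start with a specific string	Use the prefix query to find terms that start with the given string
You want to offer autocompletion of a single keyword based on what the user has typed so far	Use the prefix query: send what the user has typed so far and get back the matches that start with that text
You want to find all documents that have no value in a specific field	Use the missing filter (written as exists inside must_not, as in section IV.5) to find documents that lack the field

Table 1. Which query types to use in common use cases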

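This article was compiled by 大数据技术与架构 under exclusive authorization from the original author. Reposting without the original author's permission will be pursued as infringement. Editor: 冷眼丶 | WeChat public account: import_bigdata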
