Elasticsearch Basics

Elasticsearch is a distributed, scalable, real-time search and analytics engine.

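Features

Full-text search
Real-time statistics on structured data
Data analysis
Complex language processing
Geolocation and relationships between objects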

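Use cases

Wikipedia uses Elasticsearch to provide full-text search with highlighted snippets
Stack Overflow combines geolocation queries with full-text search and uses the more-like-this API to find related questions and answers
GitHub uses Elasticsearch to query 130 billion lines of code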

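Characteristics

A distributed real-time document store in which every field can be indexed and searched
A distributed search engine with real-time analytics
Scales to hundreds of server nodes and supports petabytes of structured or unstructured data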

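Operations

Index structure

The path /megacorp/employee/1 contains three pieces of information:

megacorp: the index name, similar to a database
employee: the type name, similar to a table
1: the ID of a particular employee, similar to a row ID in a table

Note that mapping types were deprecated in Elasticsearch 7.x and removed in 8.x, so this three-part path applies to older versions.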
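The three-part path can be illustrated with a small helper. This is only a sketch for reasoning about the path layout; the names DocumentPath and parse_document_path are hypothetical and not part of any Elasticsearch client.

```python
from typing import NamedTuple


class DocumentPath(NamedTuple):
    index: str     # e.g. "megacorp", comparable to a database
    doc_type: str  # e.g. "employee", comparable to a table
    doc_id: str    # e.g. "1", comparable to a row ID


def parse_document_path(path: str) -> DocumentPath:
    """Split '/index/type/id' into its three components."""
    parts = path.strip("/").split("/")
    if len(parts) != 3:
        raise ValueError(f"expected /index/type/id, got {path!r}")
    return DocumentPath(*parts)


print(parse_document_path("/megacorp/employee/1"))
```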

Indexing a document

curl -X PUT "localhost:9200/megacorp/employee/1" -H 'Content-Type: application/json' -d'
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}
'
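
The same request could be prepared from Python using only the standard library. The sketch below builds the PUT request without sending it; the helper name build_index_request and the base URL http://localhost:9200 are assumptions for illustration.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that indexes one document."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


employee = {
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": ["sports", "music"],
}

request = build_index_request(
    "http://localhost:9200", "megacorp", "employee", 1, employee
)
print(request.get_method(), request.full_url)

# To send it against a running node, something like this should work:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```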

Retrieving a document

curl  'localhost:9200/megacorp/employee/1?pretty'
{"_index":"megacorp","_type":"employee","_id":"1","_version":1,"found":true,"_source":
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}
}
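
A response of this shape wraps the stored document in metadata. As a minimal sketch, the hypothetical helper extract_source below pulls the original document out of the _source field and returns None when found is false; the response text is a compact copy of the example above.

```python
import json

raw_response = """
{"_index":"megacorp","_type":"employee","_id":"1","_version":1,"found":true,
 "_source":{"first_name":"John","last_name":"Smith","age":25,
            "about":"I love to go rock climbing",
            "interests":["sports","music"]}}
"""


def extract_source(response_text):
    """Return the stored document, or None when it was not found."""
    response = json.loads(response_text)
    if not response.get("found", False):
        return None
    return response["_source"]


document = extract_source(raw_response)
print(document["first_name"], document["last_name"])
```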

Search

curl -X GET "10.96.83.188:9200/megacorp/employee/_search"
{"took":71,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":3,"max_score":1.0,"hits":[{"_index":"megacorp","_type":"employee","_id":"2","_score":1.0,"_source":
{
    "first_name" :  "Jane",
    "last_name" :   "Smith",
    "age" :         32,
    "about" :       "I like to collect rock albums",
    "interests":  [ "music" ]
}
},{"_index":"megacorp","_type":"employee","_id":"1","_score":1.0,"_source":
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}
},{"_index":"megacorp","_type":"employee","_id":"3","_score":1.0,"_source":
{
    "first_name" :  "Douglas",
    "last_name" :   "Fir",
    "age" :         35,
    "about":        "I like to build cabinets",
    "interests":  [ "forestry" ]
}
}]}}
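
The search response nests the matching documents under hits.hits. The sketch below walks an abridged copy of the response above; summarize_hits is a hypothetical helper, and it assumes hits.total is a plain integer, as it is in this output.

```python
response = {
    "took": 71,
    "timed_out": False,
    "hits": {
        "total": 3,
        "max_score": 1.0,
        "hits": [
            {"_id": "2", "_score": 1.0,
             "_source": {"first_name": "Jane", "last_name": "Smith"}},
            {"_id": "1", "_score": 1.0,
             "_source": {"first_name": "John", "last_name": "Smith"}},
            {"_id": "3", "_score": 1.0,
             "_source": {"first_name": "Douglas", "last_name": "Fir"}},
        ],
    },
}


def summarize_hits(search_response):
    """Return the total hit count and the full name of each returned hit."""
    hits = search_response["hits"]
    names = [
        f'{hit["_source"]["first_name"]} {hit["_source"]["last_name"]}'
        for hit in hits["hits"]
    ]
    return hits["total"], names


total, names = summarize_hits(response)
print(total, names)
```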

Simple search with a query string

curl -X GET "10.96.83.188:9200/megacorp/employee/_search?q=last_name:Smith"
{"took":31,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":0.2876821,"hits":[{"_index":"megacorp","_type":"employee","_id":"2","_score":0.2876821,"_source":
{
    "first_name" :  "Jane",
    "last_name" :   "Smith",
    "age" :         32,
    "about" :       "I like to collect rock albums",
    "interests":  [ "music" ]
}
},{"_index":"megacorp","_type":"employee","_id":"1","_score":0.2876821,"_source":
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}
}]}}
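
When the query string contains spaces or other special characters, it has to be URL-encoded. A minimal sketch of building such a URL with the standard library follows; build_search_url and the base URL are assumptions for illustration.

```python
from urllib.parse import urlencode


def build_search_url(base_url, index, doc_type, query_string):
    """Build a _search URL that carries the query in the q parameter."""
    params = urlencode({"q": query_string})  # percent-encodes ':' and spaces
    return f"{base_url}/{index}/{doc_type}/_search?{params}"


url = build_search_url(
    "http://localhost:9200", "megacorp", "employee", "last_name:Smith"
)
print(url)
```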

Search with the query DSL (query)

curl -X GET "10.96.83.188:9200/megacorp/employee/_search" -H 'Content-Type: application/json' -d'
{
    "query" : {
        "match" : {
            "last_name" : "Smith"
        }
    }
}
'

{"took":9,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":0.2876821,"hits":[{"_index":"megacorp","_type":"employee","_id":"2","_score":0.2876821,"_source":
{
    "first_name" :  "Jane",
    "last_name" :   "Smith",
    "age" :         32,
    "about" :       "I like to collect rock albums",
    "interests":  [ "music" ]
}
},{"_index":"megacorp","_type":"employee","_id":"1","_score":0.2876821,"_source":
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}
}]}}
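
Because the query DSL is plain JSON, the request body can be assembled as an ordinary dictionary and serialized. The helper match_query below is hypothetical and only shows the shape of the body used above.

```python
import json


def match_query(field, text):
    """Build the request body for a match query on a single field."""
    return {"query": {"match": {field: text}}}


body = match_query("last_name", "Smith")
print(json.dumps(body, indent=4))
```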

More complex searches with a filter

curl -X GET "10.96.83.188:9200/megacorp/employee/_search" -H 'Content-Type: application/json' -d'
{
    "query" : {
        "bool": {
            "must": {
                "match" : {
                    "last_name" : "smith"
                }
            },
            "filter": {
                "range" : {
                    "age" : { "gt" : 30 }
                }
            }
        }
    }
}
'
{"took":15,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.2876821,"hits":[{"_index":"megacorp","_type":"employee","_id":"2","_score":0.2876821,"_source":
{
    "first_name" :  "Jane",
    "last_name" :   "Smith",
    "age" :         32,
    "about" :       "I like to collect rock albums",
    "interests":  [ "music" ]
}
}]}}
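
To make the combined logic concrete, the sketch below applies the same two conditions to an in-memory list of the three employees. It only imitates the semantics of the must and filter clauses (a case-insensitive name comparison stands in for the analyzed match); it is not how Elasticsearch evaluates the query, and the function name is hypothetical.

```python
EMPLOYEES = [
    {"first_name": "John", "last_name": "Smith", "age": 25},
    {"first_name": "Jane", "last_name": "Smith", "age": 32},
    {"first_name": "Douglas", "last_name": "Fir", "age": 35},
]


def search_by_last_name_older_than(employees, last_name, min_age_exclusive):
    """Keep employees whose last name matches and whose age is greater than the bound."""
    wanted = last_name.lower()
    results = []
    for employee in employees:
        # Stands in for the "must" match clause on last_name.
        matches_name = employee["last_name"].lower() == wanted
        # Stands in for the range filter {"age": {"gt": ...}}.
        passes_filter = employee["age"] > min_age_exclusive
        if matches_name and passes_filter:
            results.append(employee)
    return results


found = search_by_last_name_older_than(EMPLOYEES, "smith", 30)
print([e["first_name"] for e in found])
```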

Full-text search

curl -X GET "10.96.83.188:9200/megacorp/employee/_search" -H 'Content-Type: application/json' -d'
{
    "query" : {
        "match" : {
            "about" : "rock climbing"
        }
    }
}
'
{"took":8,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":0.53484553,"hits":[{"_index":"megacorp","_type":"employee","_id":"1","_score":0.53484553,"_source":
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}
},{"_index":"megacorp","_type":"employee","_id":"2","_score":0.26742277,"_source":
{
    "first_name" :  "Jane",
    "last_name" :   "Smith",
    "age" :         32,
    "about" :       "I like to collect rock albums",
    "interests":  [ "music" ]
}
}]}}

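By default, Elasticsearch sorts results by relevance score, that is, by how well each document matches the query.
The top-scoring result is obvious: John Smith's about field clearly says "rock climbing".
Jane Smith is also returned because her about field mentions "rock". Since it contains only "rock" and not "climbing", her relevance score is lower than John's.

This concept is completely different from traditional relational databases, where a record either matches or does not.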
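
The idea of graded matching can be sketched with a toy scorer that counts how many query terms appear in a field. This is a deliberate simplification and not the similarity formula Elasticsearch uses; tokenize, overlap_score, and rank are hypothetical names.

```python
import re

ABOUT = {
    "John Smith": "I love to go rock climbing",
    "Jane Smith": "I like to collect rock albums",
    "Douglas Fir": "I like to build cabinets",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def overlap_score(query, field_text):
    """Count the distinct query terms that occur in the field."""
    return len(set(tokenize(query)) & set(tokenize(field_text)))


def rank(query, docs):
    """Return (name, score) pairs with score > 0, best match first."""
    scored = [(name, overlap_score(query, text)) for name, text in docs.items()]
    matches = [pair for pair in scored if pair[1] > 0]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


print(rank("rock climbing", ABOUT))
```

A document with a partial match still appears in the result, only lower in the ranking, whereas a document with no matching term is left out.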

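Phrase search

Sometimes you want to match an exact sequence of words, that is, a phrase.
For example, match only employee records that contain both "rock" and "climbing", with the two words adjacent to each other as the phrase "rock climbing".
To do this, use a slight variation of the match query called match_phrase:

curl -X GET "10.96.83.188:9200/megacorp/employee/_search" -H 'Content-Type: application/json' -d'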
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    }
}
'
{"took":13,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.53484553,"hits":[{"_index":"megacorp","_type":"employee","_id":"1","_score":0.53484553,"_source":
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}
}]}}
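
The difference from a plain match can be illustrated with a toy adjacency check: the phrase matches only when its tokens occur consecutively and in order. This sketch is an assumption-laden simplification (simple word tokenization, no positional index), and contains_phrase is a hypothetical name.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def contains_phrase(field_text, phrase):
    """Return True when the phrase tokens appear consecutively in the field."""
    tokens = tokenize(field_text)
    wanted = tokenize(phrase)
    size = len(wanted)
    if size == 0:
        return False
    return any(
        tokens[start:start + size] == wanted
        for start in range(len(tokens) - size + 1)
    )


print(contains_phrase("I love to go rock climbing", "rock climbing"))
print(contains_phrase("I like to collect rock albums", "rock climbing"))
```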

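Highlighted search

The results are the same as before, but each hit now also includes a section named highlight. It contains the text snippet from the about field that matched, with the matched words wrapped in HTML <em> tags.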

curl -X GET "10.96.83.188:9200/megacorp/employee/_search" -H 'Content-Type: application/json' -d'
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    },
    "highlight": {
        "fields" : {
            "about" : {}
        }
    }
}
'
{"took":36,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.53484553,"hits":[{"_index":"megacorp","_type":"employee","_id":"1","_score":0.53484553,"_source":
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}
,"highlight":{"about":["I love to go <em>rock</em> <em>climbing</em>"]}}]}}
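
The shape of the highlight fragment can be reproduced with a small toy highlighter that wraps every query term found in the field text. It is only a sketch of the output format, not the highlighter Elasticsearch uses; the function name and the pre and post parameters are hypothetical.

```python
import re


def highlight(field_text, query, pre="<em>", post="</em>"):
    """Wrap each word of field_text that matches a query term in tags."""
    terms = {term.lower() for term in re.findall(r"\w+", query)}

    def wrap(match):
        word = match.group(0)
        return f"{pre}{word}{post}" if word.lower() in terms else word

    return re.sub(r"\w+", wrap, field_text)


print(highlight("I love to go rock climbing", "rock climbing"))
```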

Analytics

curl -X GET "10.96.83.188:9200/megacorp/employee/_search" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "all_interests": {
      "terms": { "field": "interests" }
    }
  }
}
'


{
   ...
   "hits": { ... },
   "aggregations": {
      "all_interests": {
         "buckets": [
            {
               "key":       "music",
               "doc_count": 2
            },
            {
               "key":       "forestry",
               "doc_count": 1
            },
            {
               "key":       "sports",
               "doc_count": 1
            }
         ]
      }
   }
}
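
The buckets above are essentially a count of documents per distinct value. A minimal in-memory sketch of that counting follows, using the three example employees; terms_buckets is a hypothetical helper, and ties are broken alphabetically here simply to reproduce the order shown above.

```python
from collections import Counter

EMPLOYEES = [
    {"first_name": "John", "interests": ["sports", "music"]},
    {"first_name": "Jane", "interests": ["music"]},
    {"first_name": "Douglas", "interests": ["forestry"]},
]


def terms_buckets(docs, field):
    """Count how many documents contain each distinct value of a field."""
    counts = Counter()
    for doc in docs:
        # set() ensures a document is counted once per value.
        counts.update(set(doc.get(field, [])))
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered]


for bucket in terms_buckets(EMPLOYEES, "interests"):
    print(bucket)
```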