Monday, October 24, 2016

Elasticsearch, for Search

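Elasticsearch is an open-source search engine built on Apache Lucene(TM). In both the open-source and proprietary worlds, Lucene can be regarded as the most advanced, best-performing, and most fully featured search engine library to date.
But Lucene is only a library. To use it, you have to work in Java and integrate it directly into your application. Worse, Lucene is very complex, and you need a solid understanding of information retrieval to see how it works.
Elasticsearch is also written in Java and uses Lucene at its core for all indexing and searching, but its goal is to hide Lucene's complexity behind a simple RESTful API and make full-text search easy.
Without a search engine, the simple search that MySQL offers falls short in both performance and result quality. True to the programmer's urge to tinker, I decided to give my blog Elasticsearch wings.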

Installing the JDK (OpenJDK 8)

sudo apt-get update
sudo apt-get install openjdk-8-jdk

Installing Elasticsearch

  • Download
    wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-2.3.4.deb
    At the time of writing, the latest release of IK, the Chinese analysis plugin for Elasticsearch, is 1.9.4, which is compatible with Elasticsearch 2.3.4.
  • Install
    sudo dpkg -i elasticsearch-2.3.4.deb
  • Start on boot
    sudo update-rc.d elasticsearch defaults 95 10
    sudo service elasticsearch start
  • Test
    curl http://localhost:9200
If you see a response like the following, Elasticsearch has been installed successfully.
{
    "name" : "Peter Petruski",
    "cluster_name" : "elasticsearch",
    "version" : {
        "number" : "2.3.4",
        "build_hash" : "...(hidden)",
        "build_timestamp" : "2016-06-30T11:24:31Z",
        "build_snapshot" : false,
        "lucene_version" : "5.5.0"
    },
    "tagline" : "You Know, for Search"
}
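Because IK 1.9.4 is tied to Elasticsearch 2.3.4, it can be worth checking the version number in that response before going any further. Here is a minimal sketch, assuming the response has the shape shown above. es_version is a hypothetical helper name, and the sample JSON is trimmed from the output above.

```shell
#!/bin/sh
# Pull the "number" field out of the JSON that GET / returns (read from stdin).
# es_version is a hypothetical helper, not part of Elasticsearch.
es_version() {
    sed -n 's/.*"number"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Sample response trimmed from the output shown above.
sample='{
    "name" : "Peter Petruski",
    "version" : {
        "number" : "2.3.4",
        "lucene_version" : "5.5.0"
    }
}'

found=$(printf '%s\n' "$sample" | es_version)
if [ "$found" = "2.3.4" ]; then
    echo "Elasticsearch $found: matches the version IK 1.9.4 targets"
else
    echo "Elasticsearch $found: check IK compatibility before installing"
fi
```

Against a running node, the same helper can read from curl instead of the sample: curl -s http://localhost:9200 | es_version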
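By default, the Elasticsearch RESTful service is reachable only from the local machine, so a service running inside a virtual machine cannot be reached from the host. To make debugging easier, edit /etc/elasticsearch/elasticsearch.yml (the location used by the deb package) and add the following two lines:
network.bind_host: 0.0.0.0
network.publish_host: _non_loopback:ipv4_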
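YAML needs a space after the colon, so adding the lines with a small script avoids typos and duplicate keys. This is a sketch only: set_yaml_key is a hypothetical helper, the setting values are the ones given above, and it writes to a temporary file so it can be tried without touching the real configuration.

```shell
#!/bin/sh
# Append "key: value" to a YAML file unless the key is already set.
# set_yaml_key is a hypothetical helper; pass the path of the file to edit.
set_yaml_key() {
    file=$1 key=$2 value=$3
    if grep -q "^${key}:" "$file"; then
        echo "skip ${key} (already set)"
    else
        printf '%s: %s\n' "$key" "$value" >> "$file"
        echo "added ${key}"
    fi
}

# Try it on a scratch file rather than the real elasticsearch.yml.
conf=$(mktemp)
set_yaml_key "$conf" network.bind_host 0.0.0.0
set_yaml_key "$conf" network.publish_host _non_loopback:ipv4_
set_yaml_key "$conf" network.bind_host 0.0.0.0   # second call is a no-op
cat "$conf"
```

After changing the real file, restart the service (sudo service elasticsearch restart) so the new settings take effect.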

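Installing the IK Chinese Analysis Plugin

The analyzer that ships with Elasticsearch simply splits Chinese text into individual characters rather than segmenting it against a dictionary, so search results are often not what you want. I recommend elasticsearch-analysis-ik, which supports custom dictionaries.
  • Download
    wget https://github.com/medcl/elasticsearch-analysis-ik/archive/v1.9.4.tar.gz
  • Extract
    tar -xvf v1.9.4.tar.gz
  • Build the Java project with Maven
    cd elasticsearch-analysis-ik-1.9.4
    mvn package
  • Create an ik directory under plugins and unzip the packaged IK plugin into it
    sudo mkdir /usr/share/elasticsearch/plugins/ik
    sudo unzip target/releases/elasticsearch-analysis-ik-1.9.4.zip -d /usr/share/elasticsearch/plugins/ik/
    The configuration files for elasticsearch-analysis-ik live in ~/{es_root}/plugins/ik/config/ik/. Most of them are word lists that you can edit in any text editor; remember to save them as UTF-8.
Now restart the Elasticsearch service. If you see a message like the one below, the IK Analysis plugin is installed: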
plugins [analysis-ik]
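Looking for that message by eye is easy to get wrong, so a grep can do it instead. This is a sketch: has_ik_plugin is a hypothetical helper, and the sample line is modelled on the message quoted above rather than copied from a real log.

```shell
#!/bin/sh
# Succeed if any line on stdin lists analysis-ik inside a "plugins [...]" entry.
# has_ik_plugin is a hypothetical helper name.
has_ik_plugin() {
    grep -q 'plugins \[[^]]*analysis-ik[^]]*]'
}

# Sample line modelled on the message quoted above.
if printf '%s\n' 'plugins [analysis-ik]' | has_ik_plugin; then
    echo "IK plugin loaded"
else
    echo "IK plugin not found"
fi
```

On a real install the helper would read the Elasticsearch log, for example has_ik_plugin < /var/log/elasticsearch/elasticsearch.log. That path is an assumption based on the deb package defaults.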

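Using Elasticsearch

Before using it, here is a quick overview of what ES offers.
Elasticsearch is often compared with a traditional relational database such as MySQL:
MySQL        Elasticsearch
Database     Index
Table        Type
Row          Document
Column       Field
Schema       Mapping
Index        Everything is indexed by default
SQL          Query DSL
Elasticsearch is more than full-text search:
  • A distributed, real-time document store in which every field is indexed and searchable
  • A distributed, real-time analytics search engine
  • Scales out to hundreds of servers and handles petabytes of structured or unstructured data
Distributed, real-time, every field indexed, petabyte scale: it sounds impressive even if it is not all clear yet. Don't worry, that is normal when you have only just met; it becomes familiar with use. Back on track: for a thorough introduction to ES, see Elasticsearch: The Definitive Guide. What follows is a step-by-step integration of ES into a project.
1. Use a package
You can use the official package directly. Since I did not want to spend time reinventing the wheel, I used a third-party package that wraps it further. There are several on GitHub; I picked Elasticquent, partly because the name pairs well with Laravel's Eloquent (lol). Elasticquent provides a clean, convenient trait that you add directly to your model, for example Article: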
...
use Elasticquent\ElasticquentTrait;

class Article extends Model
{
    use ElasticquentTrait;
    ...
}
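After that you can use Elasticsearch gracefully. For installation and usage details, see the Elasticquent documentation.
2. Configure the mapping
I found an article devoted to mappings that explains them in plain terms.
It points out that a mapping not only tells ES what type of value a field holds, it also tells ES how to index the data and whether the data can be searched.
Got it! In other words, if we do not configure a mapping, ES has no way of knowing that we want it to index with the IK analyzer.
This turned out to be a pitfall. Much of the material online targets older versions of ES and IK, so the index and mapping settings usually go into the ES yml configuration file. When I followed that approach I did not get the expected segmentation: ES still bluntly split the Chinese text character by character. After two days of repeated failures, I finally tried configuring the mapping in the model, as described in the Elasticquent documentation, and everything fell into place: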
curl -XGET 'http://localhost:9200/_analyze?analyzer=ik&pretty=true&text=%e4%bd%a0%e5%a5%bd%e9%ba%a6%e8%82%af%e5%85%88%e7%94%9f'
{
    "tokens" : [ {
        "token" : "你好",
        "start_offset" : 0,
        "end_offset" : 2,
        "type" : "CN_WORD",
        "position" : 0
    },
    {
        "token" : "麦",
        "start_offset" : 2,
        "end_offset" : 3,
        "type" : "CN_WORD",
        "position" : 1
    }, {
        "token" : "肯",
        "start_offset" : 3,
        "end_offset" : 4,
        "type" : "CN_WORD",
        "position" : 2
    }, {
        "token" : "先生",
        "start_offset" : 4,
        "end_offset" : 6,
        "type" : "CN_WORD",
        "position" : 3
    }]
}
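The text parameter in that curl command is just the UTF-8 bytes of the test sentence, percent-encoded. To try other sentences, the URL can be built rather than typed by hand. This is a sketch: urlencode and analyze_url are hypothetical helpers, and for simplicity every byte is encoded, ASCII included, which is still a valid encoding.

```shell
#!/bin/sh
# Percent-encode every byte of the argument (lowercase hex, as in the URL above).
# urlencode and analyze_url are hypothetical helper names.
urlencode() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n' | sed 's/../%&/g'
}

# Build the _analyze URL for a host, an analyzer name and the text to analyze.
analyze_url() {
    printf 'http://%s/_analyze?analyzer=%s&pretty=true&text=%s\n' \
        "$1" "$2" "$(urlencode "$3")"
}

analyze_url localhost:9200 ik '你好麦肯先生'
```

For this sentence the script prints the same URL as the curl command above, so curl -XGET "$(analyze_url localhost:9200 ik '你好麦肯先生')" repeats the request.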
Here is my mapping configuration:
protected $mappingProperties = array(
    'title' => array(
        'type' => 'string',
        'analyzer' => 'ik_max_word'
    ),
    'content' => array(
        'type' => 'string',
        'analyzer' => 'ik_max_word'
    )
);
As you can see, this tells ES that my title and content fields are of type string and asks it to analyze them with IK (the ik_max_word analyzer).
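On the Elasticsearch side, a PHP array like this corresponds roughly to a JSON mapping body. The sketch below assumes ES 2.x mapping syntax. mapping_body is a hypothetical helper, and whether Elasticquent sends exactly this body is an assumption.

```shell
#!/bin/sh
# Print an ES 2.x-style mapping body that gives every listed field
# type "string" and the given analyzer. mapping_body is a hypothetical helper.
mapping_body() {
    analyzer=$1; shift
    props=
    for field in "$@"; do
        # ${props:+,} adds a comma only when props already holds a field.
        props="${props}${props:+,}\"${field}\":{\"type\":\"string\",\"analyzer\":\"${analyzer}\"}"
    done
    printf '{"properties":{%s}}\n' "$props"
}

mapping_body ik_max_word title content
```

On an ES 2.x node the output could be sent with curl -XPUT to /{index}/_mapping/{type}. The index and type names depend on your Elasticquent settings.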
3. Create the index
Use the createIndex method provided by Elasticquent to create the index. To index all existing documents, use addAllToIndex. Simple and pleasant.
Article::createIndex($shards = null, $replicas = null);
Article::addAllToIndex();
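4. Create, read, update, delete
In your controller's CRUD methods, add the corresponding Elasticquent index methods one by one. Once that is done, changes to your documents are kept in sync with the ES index. For the actual code, see the trait in the Elasticquent open-source project; it is not reproduced here.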

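Closing Thoughts

I had looked into Sphinx and used a preconfigured xunsearch before, but this was the first time I set up a full-text search engine from scratch. I hit plenty of pitfalls along the way. They were frustrating, but I am grateful for them too, since getting past each one is satisfying. I wrote this post partly as a memento and a starting point, and partly in the hope that it helps someone else, because I only got ES working after reading many articles on the subject. My thanks to those who shared before me.
Elasticsearch is powerful, with many plugins and settings, and this post still has plenty of room for improvement. I will keep updating it. If anything here is wrong or imprecise, please leave a comment.
PS: Finally, here is the Dockerfile from my project for anyone interested.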

from : https://laravel-china.org/topics/2765

Elasticquent

Elasticsearch for Eloquent Laravel Models
Elasticquent makes working with Elasticsearch and Eloquent models easier by mapping them to Elasticsearch types. You can use the default settings or define how Elasticsearch should index and search your Eloquent models right in the model.
Elasticquent uses the official Elasticsearch PHP API. To get started, you should have a basic knowledge of how Elasticsearch works (indexes, types, mappings, etc).

Elasticsearch Requirements

You must be running at least Elasticsearch 1.0. Elasticsearch 0.9 and below will not work and are not supported.

Reporting Issues

If you do find an issue, please feel free to report it with GitHub's bug tracker for this project.
Alternatively, fork the project and make a pull request :)

Overview

Elasticquent allows you to take an Eloquent model and easily index and search its contents in Elasticsearch.
    $books = Book::where('id', '<', 200)->get();
    $books->addToIndex();
When you search, instead of getting a plain array of search results, you get an Eloquent collection with some special Elasticsearch functionality.
    $books = Book::search('Moby Dick');
    echo $books->totalHits();
Plus, you can still use all the Eloquent collection functionality:
    $books = $books->filter(function ($book) {
        return $book->hasISBN();
    });
Check out the rest of the documentation for how to get started using Elasticsearch and Elasticquent!

How Elasticquent Works

When using a database, Eloquent models are populated from data read from a database table. With Elasticquent, models are populated by data indexed in Elasticsearch. The whole idea behind using Elasticsearch for search is that it's fast and light, so your model functionality will be dictated by what data has been indexed for your document.

Setup

Before you start using Elasticquent, make sure you've installed Elasticsearch.
To get started, add Elasticquent to your composer.json file:
"elasticquent/elasticquent": "dev-master"
Once you've run a composer update, you need to register the Laravel service provider in your config/app.php:
'providers' => [
    ...
    Elasticquent\ElasticquentServiceProvider::class,
],
We also provide a facade for the elasticsearch-php client (which connects using our settings). Add the following to your config/app.php if you need it.
'aliases' => [
    ...
    'Es' => Elasticquent\ElasticquentElasticsearchFacade::class,
],
Then add the Elasticquent trait to any Eloquent model that you want to be able to index in Elasticsearch:
use Elasticquent\ElasticquentTrait;

class Book extends Eloquent
{
    use ElasticquentTrait;
}
Now your Eloquent model has some extra methods that make it easier to index your model's data using Elasticsearch.

Elasticsearch Configuration

By default, Elasticquent will connect to localhost:9200 and use default as the index name. You can change this and the other settings in the configuration file. You can add the elasticquent.php config file at /app/config/elasticquent.php for Laravel 4, or use the following Artisan command to publish the configuration file into your config directory for Laravel 5:
$ php artisan vendor:publish --provider="Elasticquent\ElasticquentServiceProvider"


<?php

return array(

    /*
    |--------------------------------------------------------------------------
    | Custom Elasticsearch Client Configuration
    |--------------------------------------------------------------------------
    |
    | This array will be passed to the Elasticsearch client.
    | See configuration options here:
    |
    | http://www.elasticsearch.org/guide/en/elasticsearch/client/php-api/current/_configuration.html
    */

    'config' => [
        'hosts'     => ['localhost:9200'],
        'retries'   => 1,
    ],

    /*
    |--------------------------------------------------------------------------
    | Default Index Name
    |--------------------------------------------------------------------------
    |
    | This is the index name that Elasticquent will use for all
    | Elasticquent models.
    */

    'default_index' => 'my_custom_index_name',

);

Indexes and Mapping

While you can definitely build your indexes and mapping through the Elasticsearch API, you can also use some helper methods to build indexes and types right from your models.
If you want a simple way to create indexes, Elasticquent models have a function for that:
Book::createIndex($shards = null, $replicas = null);
For a custom analyzer, you can set an indexSettings property in your model and define the analyzers from there:
    /**
     * The elasticsearch settings.
     *
     * @var array
     */
    protected $indexSettings = [
        'analysis' => [
            'char_filter' => [
                'replace' => [
                    'type' => 'mapping',
                    'mappings' => [
                        '&=> and '
                    ],
                ],
            ],
            'filter' => [
                'word_delimiter' => [
                    'type' => 'word_delimiter',
                    'split_on_numerics' => false,
                    'split_on_case_change' => true,
                    'generate_word_parts' => true,
                    'generate_number_parts' => true,
                    'catenate_all' => true,
                    'preserve_original' => true,
                    'catenate_numbers' => true,
                ]
            ],
            'analyzer' => [
                'default' => [
                    'type' => 'custom',
                    'char_filter' => [
                        'html_strip',
                        'replace',
                    ],
                    'tokenizer' => 'whitespace',
                    'filter' => [
                        'lowercase',
                        'word_delimiter',
                    ],
                ],
            ],
        ],
    ];
For mapping, you can set a mappingProperties property in your model and use some mapping functions from there:
protected $mappingProperties = array(
    'title' => array(
        'type' => 'string',
        'analyzer' => 'standard'
    )
);
If you'd like to set up a model's type mapping based on your mapping properties, you can use:
    Book::putMapping($ignoreConflicts = true);
To delete a mapping:
    Book::deleteMapping();
To rebuild (delete and re-add, useful when you make important changes to your mapping) a mapping:
    Book::rebuildMapping();
You can also get the type mapping and check if it exists.
    Book::mappingExists();
    Book::getMapping();

Setting a Custom Index Name

By default, Elasticquent will look for the default_index key within your configuration file (config/elasticquent.php). To set the default value for an index being used, you can edit this file and set the default_index key:
return array(

   // Other configuration keys ...

   /*
    |--------------------------------------------------------------------------
    | Default Index Name
    |--------------------------------------------------------------------------
    |
    | This is the index name that Elasticquent will use for all
    | Elasticquent models.
    */

   'default_index' => 'my_custom_index_name',
);
If you'd like to have a more dynamic index, you can also override the default configuration with a getIndexName method inside your Eloquent model:
function getIndexName()
{
    return 'custom_index_name';
}
Note: If no index was specified, Elasticquent will use a hardcoded string with the value of default.

Setting a Custom Type Name

By default, Elasticquent will use the table name of your models as the type name for indexing. If you'd like to override it, you can with the getTypeName function.
function getTypeName()
{
    return 'custom_type_name';
}
To check if the type for the Elasticquent model exists yet, use typeExists:
    $typeExists = Book::typeExists();

Indexing Documents

To index all the entries in an Eloquent model, use addAllToIndex:
    Book::addAllToIndex();
You can also index a collection of models:
    $books = Book::where('id', '<', 200)->get();
    $books->addToIndex();
You can index individual entries as well:
    $book = Book::find($id);
    $book->addToIndex();
You can also reindex an entire model:
    Book::reindex();

Searching

There are three ways to search in Elasticquent. All three methods return a search collection.

Simple term search

The first method is a simple term search that searches all fields.
    $books = Book::search('Moby Dick');

Query Based Search

The second is a query based search for more complex searching needs:
    public static function searchByQuery($query = null, $aggregations = null, $sourceFields = null, $limit = null, $offset = null, $sort = null)
Example:
    $books = Book::searchByQuery(array('match' => array('title' => 'Moby Dick')));
Here's the list of available parameters:
  • query - Your ElasticSearch Query
  • aggregations - The Aggregations you wish to return. See Aggregations for details.
  • sourceFields - Limits returned set to the selected fields only
  • limit - Number of records to return
  • offset - Sets the record offset (use for paging results)
  • sort - Your sort query
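The limit and offset parameters line up with size and from in an Elasticsearch search body, which helps when comparing an Elasticquent call with a raw request. This is a sketch, assuming that correspondence. search_body and page_offset are hypothetical helpers, and the text is not JSON-escaped, so it only suits simple strings.

```shell
#!/bin/sh
# Print a match-query body; limit and offset become size and from.
# search_body and page_offset are hypothetical helper names.
search_body() {
    field=$1 text=$2 limit=$3 offset=$4
    printf '{"query":{"match":{"%s":"%s"}},"size":%d,"from":%d}\n' \
        "$field" "$text" "$limit" "$offset"
}

# Offset for a 1-based page number: (page - 1) * per_page.
page_offset() {
    echo $(( ($1 - 1) * $2 ))
}

# Third page of results, ten per page.
search_body title 'Moby Dick' 10 "$(page_offset 3 10)"
```

The last line prints a body with "size":10 and "from":20, the same paging you would ask for by passing limit 10 and offset 20 to searchByQuery.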

Raw queries

The final method is a raw query that will be sent to Elasticsearch. This method will provide you with the most flexibility when searching for records inside Elasticsearch:
    $books = Book::complexSearch(array(
        'body' => array(
            'query' => array(
                'match' => array(
                    'title' => 'Moby Dick'
                )
            )
        )
    ));
This is equivalent to:
    $books = Book::searchByQuery(array('match' => array('title' => 'Moby Dick')));

Search Collections

When you search on an Elasticquent model, you get a search collection with some special functions.
You can get total hits:
    $books->totalHits();
Access the shards array:
    $books->shards();
Access the max score:
    $books->maxScore();
Access the timed out boolean property:
    $books->timedOut();
And access the took property:
    $books->took();
And access search aggregations - See Aggregations for details:
    $books->getAggregations();

Search Collection Documents

Items in a search result collection will have some extra data that comes from Elasticsearch. You can always check and see if a model is a document or not by using the isDocument function:
    $book->isDocument();
You can check the document score that Elasticsearch assigned to this document with:
    $book->documentScore();

Chunking results from Elasticquent

Similar to Illuminate\Support\Collection, the chunk method breaks the Elasticquent collection into multiple, smaller collections of a given size:
    $all_books = Book::searchByQuery(array('match' => array('title' => 'Moby Dick')));
    $books = $all_books->chunk(10);

Using the Search Collection Outside of Elasticquent

If you're dealing with raw search data from outside of Elasticquent, you can use the Elasticquent search results collection to turn that data into a collection.
$client = new \Elasticsearch\Client();

$params = array(
    'index' => 'default',
    'type'  => 'books'
);

$params['body']['query']['match']['title'] = 'Moby Dick';

$collection = Book::hydrateElasticsearchResult($client->search($params));

More Options

Document IDs

Elasticquent will use whatever is set as the primaryKey for your Eloquent models as the id for your Elasticsearch documents.

Document Data

By default, Elasticquent will use the entire attribute array for your Elasticsearch documents. However, if you want to customize how your search documents are structured, you can set a getIndexDocumentData function that returns your own custom document array.
function getIndexDocumentData()
{
    return array(
        'id'      => $this->id,
        'title'   => $this->title,
        'custom'  => 'variable'
    );
}
Be careful with this, as Elasticquent reads the document source into the Eloquent model attributes when creating a search result collection, so make sure you are indexing enough data for the model functionality you want to use.

Using Elasticquent With Custom Collections

If you are using a custom collection with your Eloquent models, you just need to add the ElasticquentCollectionTrait to your collection so you can use addToIndex.
class MyCollection extends \Illuminate\Database\Eloquent\Collection
{
    use ElasticquentCollectionTrait;
}

Roadmap

Elasticquent currently needs:
  • Tests that mock ES API calls.
  • Support for routes

from : https://github.com/elasticquent/Elasticquent
