Daydream's Elasticsearch Series Notes (Part One): Getting Started with ES Quickly

Give me a daydream 2021-01-14 13:16:07


One 、 Reading guide

Hi all! Let's learn something interesting together: NoSQL! You are welcome to subscribe to Daydream's Elasticsearch series. Four articles are planned for it, and all of them are published first on the official account.

Click to read the original text and follow me there to catch updates as soon as they come out.

Notice! Daydream does not guarantee that you will master ES through these four articles. However, I will explain some of ES's concepts and tricks in plain language. At the very least this should make Elasticsearch feel less unfamiliar, so that when you one day need ES in your own business, you can get started quickly because you read Daydream's ES notes ahead of time.

To write this article I also bought a 2C4G server from Huawei Cloud. You are welcome to follow Daydream. Let's learn practical, interesting technology together!


1.1、 Getting to know ES

Relational database:

A database like MySQL is a traditional relational database. It has one very intuitive characteristic: the columns of each table must be decided when the table is created. For example, you create a user table and define 3 columns: id, username, and password. If your entity class later gains an extra age field, that entity can no longer be saved into the user table. (Of course, you can use DDL to add or drop columns so that the attributes of the entity class match the columns of the table one to one.)

Non-relational database:

Non-relational databases are what we usually call NoSQL. Common ones include MongoDB, Redis, and Elasticsearch.

Leaving performance aside and looking only at usage: a NoSQL database of this kind lets you store a JSON object, and it does not care how many fields that object has. Taking the example above, as long as you hand it an object, it will store it for you with or without age.
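To make the contrast concrete, here is a minimal Python sketch. It is not MySQL or ES code: the `FixedTable` and `DocumentStore` classes are hypothetical toys made up for this illustration, and they only mimic the "fixed columns" versus "any JSON object" behaviour described above.

```python
class FixedTable:
    """Toy stand-in for a relational table: columns are fixed at creation."""

    def __init__(self, columns):
        self.columns = set(columns)
        self.rows = []

    def insert(self, row):
        extra = set(row) - self.columns
        if extra:
            # A real database would need DDL (ALTER TABLE) before accepting this.
            raise ValueError(f"unknown columns: {sorted(extra)}")
        self.rows.append(row)


class DocumentStore:
    """Toy stand-in for a document store: any JSON-like object is accepted."""

    def __init__(self):
        self.docs = []

    def insert(self, doc):
        self.docs.append(doc)


user_table = FixedTable(["id", "username", "password"])
user_docs = DocumentStore()

with_age = {"id": 1, "username": "tom", "password": "x", "age": 18}
user_docs.insert(with_age)        # accepted, extra field and all
try:
    user_table.insert(with_age)   # rejected: "age" is not a column
    rejected = False
except ValueError:
    rejected = True
```

The document store accepts the object with the extra age field, while the fixed table refuses it until its columns are changed.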

We will cover more about ES below. First, here are its common usage scenarios and features:

Site search:

If your company wants to build its own on-site search, ES is a great fit. As a non-relational database, ES lets you store JSON objects of all sorts of loosely defined shapes, and it also gives you a full-text search and analytics engine. It lets you store, search, and analyze large amounts of data quickly and in near real time (about 1 s). In one word: fast!

Log collection system:

Elasticsearch is the core technology of the company Elastic, whose technology stack also includes Logstash, Filebeat, Kibana, and more. The log management systems commonly used inside companies can be built with ELK + Filebeat: Filebeat collects the logs and pushes them to Logstash for processing, Logstash stores the data in ES, and Kibana finally displays the logs.

Scalability:

Elasticsearch is distributed by nature. It can run as a single instance on a low-powered server, and it can also form a cluster of hundreds of nodes. It manages the nodes in the cluster by itself: we are free to add and remove nodes, and the cluster spreads the data evenly across them on its own.


1.2、 Installing and starting ES, Kibana, and the IK analyzer

  1. Installation is easy, so the detailed process is not written out in this article.
  2. The installation and startup tutorial, together with the ES, Kibana, and IK analyzer installation packages, is shared via Baidu Netdisk. Reply "es" in the official account backend to get it.

Two 、 Core concepts

Because this is the first, basic article and should be friendly to beginners, we need to go through some basic concepts first. You can skim this part; it is not hard to understand.

2.1、Near Realtime (NRT)

ES claims to offer near real-time search: it takes only about 1 second from the moment data is written into ES until it becomes searchable, so search and analysis built on ES can reach second-level latency.


2.2、Cluster

Cluster: a cluster is a collection of one or more nodes that together hold the data you put in, and users can search across all the nodes. Each cluster has a unique name as its identifier; the default name is elasticsearch. This name is very important, because a node needs it when it wants to join the cluster.

Make sure you do not use the same cluster name in different environments, so that nodes do not join the wrong cluster. Consider a naming style such as logging-stage, logging-dev, and logging-pro.


2.3、Node

A single server is one node. Like a cluster, it has a default name, which is a random string generated from a UUID. Users can of course assign custom names, but it is best not to reuse a name. The name matters for management, because you need to determine which server in the network corresponds to which node in the cluster.

A node also has a default setting: by default, every node joins a cluster named elasticsearch when it starts. That means that if a user starts more than one node in the same network, they will find each other and form a cluster.

A single cluster can contain as many nodes as you like. If no other node is running on your network and you start a new one, that node forms a cluster by itself.


2.4、Index

An index is a collection of documents with similar characteristics. For example, you might create one index for customer data, another for a product catalog, and another for order data.

An index is identified by its name (which must be all lowercase). This name is used whenever you index, search, update, or delete documents in the index.

In theory, you can create any number of indexes in a cluster.


2.5、Type

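A type can be seen as a logical category inside an index, used for a finer-grained division: for example, a type for user data, a type for review data, and a type for blog data. (Types belong to older versions of ES: an index created in 6.x can hold only one type, and later major versions remove types altogether.)

When designing, try to put documents that share many of the same fields under the same type.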


2.6、Document

A document is one piece of data stored in ES, like one row in MySQL. It can be a user record, a product record, and so on.

2.7、 A loose summary:

Why is this only a loose summary? Because the three correspondences below merely look similar on the surface. A type in ES is really just a logical division; when data is stored, it is still stored together (read on, this is covered below). In MySQL, by contrast, the columns of two different tables have nothing to do with each other.

Elasticsearch    Relational database
Document         Row
Type             Table
Index            Database

2.8、Shards & Replicas

2.8.1、 Introducing the problem:

If a single index had to store 1 TB of data by itself, responses would slow down. To solve this, ES provides a way to subdivide an index: the index is split into pieces, each piece is called a shard, and the whole large data set is then distributed across different servers.


2.8.2、 What is a shard?

Shards come in two kinds: primary shards and replica shards. As the names suggest, one is the primary and the other is its backup; replicas are responsible for fault tolerance and for serving part of the read requests.

A shard can be understood as the smallest unit of work in ES. The data in all the shards added together is the whole of the data stored in ES. You can think of a shard as a Lucene instance, fully capable of building indexes and handling requests.

Below is how a cluster made up of two nodes and 6 shards is divided:

[Figure: shard distribution across the two nodes]

As the figure shows, whether the Java application accesses node1 or node2, it can get the data.

2.8.3、 Default number of shards

A newly created index has 5 primary shards by default. Note that this number cannot be changed afterwards. If every primary shard had one replica shard, a single ES instance would hold 10 shards after startup. In reality, however, a replica shard cannot be placed on the same server as its own primary shard, so a single ES instance has only 5 usable shards after startup by default.

2.8.4、 How to scale out a cluster

First, be clear about one thing: once an index has been created, its number of primary shards cannot be changed.

Horizontal scaling therefore means increasing the number of replicas, because the number of replica shards can be changed later. For example, if we change the replica count to 2, every primary shard has two replica shards: 5 + 5 * 2 = 15 shards in total, so the index can be spread over as many as 15 nodes.

If you want every shard to have as many system resources as possible, add servers until each shard has a server to itself.
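The shard arithmetic can be written down as a tiny Python helper. This is only a sketch of the calculation in the text; `total_shards` is a name invented here, not an ES API.

```python
def total_shards(primaries: int, replicas_per_primary: int) -> int:
    """Total shard copies of one index: each primary plus its replicas."""
    return primaries * (1 + replicas_per_primary)


print(total_shards(5, 1))  # 10
print(total_shards(5, 2))  # 15
```

With 5 primaries, going from 1 replica to 2 raises the total from 10 to 15 shards, which is where the upper bound of 15 servers (one shard per server) comes from.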


2.8.5、 An example:

[Figure: introductory diagram of shards and replicas]

The figure shows two nodes, one above the other. Each node holds its own primary shards and the replica shards of the other node's primaries. Why stress "its own" and "the other's"? Because ES requires that a replica shard never be placed on the same server as its own primary shard, while different primary shards may live on the same server.

When a primary shard goes down, its replica shard on another server is unaffected, so ES can keep answering users' read requests. With this sharding mechanism, and because all shards have equal standing, if a single shard can handle 2000 requests per second, horizontal scaling multiplies the throughput of the system. This is what makes ES naturally distributed and highly available.

Also: every document lives in exactly one primary shard and in that primary shard's replica shards. The same document never exists in more than one primary shard.
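Why a document ends up in exactly one primary shard can be illustrated with a routing sketch in Python. Elasticsearch chooses the shard by hashing a routing value (the document id by default) and taking it modulo the number of primary shards. The function name `pick_primary_shard` is hypothetical, and `zlib.crc32` is only a stand-in for the hash function ES actually uses, so the shard numbers below will not match a real cluster.

```python
import zlib


def pick_primary_shard(routing: str, number_of_primary_shards: int) -> int:
    """Map a routing value (by default the document id) to one shard number."""
    # Stand-in hash for this sketch only; ES uses its own hash function.
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


for doc_id in ["1", "2", "3"]:
    print(doc_id, "->", pick_primary_shard(doc_id, 5))
```

Because the result depends on the number of primary shards, changing that number would send existing ids to different shards, which is one way to see why it is fixed once the index has been created.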


Three 、 Introductory exploration:

In the next sections you will see me use a lot of GET / POST and similar requests, along with the queries they carry. You may wonder why I pile up all of these instead of writing some code.

In fact, these commands are to ES what SQL is to MySQL. To put it another way, the code you write ends up executing, under the hood, exactly the commands I am about to describe. So do not be put off by the trouble: the following points are ones you cannot simply skip.

3.1、 The health of the cluster

GET /_cat/health?v

The results are as follows :

epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1572595632 16:07:12 elasticsearch yellow 1 1 5 5 0 0 5 0 - 50.0%

Reading the output above: the cluster name is the default elasticsearch, and the current status of the cluster is yellow. The columns that follow give the shard information of the cluster, and the last one, active_shards_percent, shows that only half of the cluster's shards are currently active.

Status:

There are three states: red, green, and yellow.

  • green: all primary and replica shards of the cluster are available.
  • yellow: all data in ES is accessible, but not all replica shards are available. (I have started just one node with default settings, and ES does not allow a primary shard and its replica shard on the same node, so my node holds only the 5 primary shards and the status is yellow.)
  • red: at least one primary shard is unavailable, so part of the data is not accessible.
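If you want to read this output from a program rather than by eye, the two-line text can be turned into a dictionary. The following Python sketch uses only string handling on the sample output shown above; `parse_cat_health` is a helper name invented for this example.

```python
def parse_cat_health(text: str) -> dict:
    """Turn the header line and value line of GET /_cat/health?v into a dict."""
    header_line, value_line = text.strip().splitlines()[:2]
    return dict(zip(header_line.split(), value_line.split()))


sample = """epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1572595632 16:07:12 elasticsearch yellow 1 1 5 5 0 0 5 0 - 50.0%"""

health = parse_cat_health(sample)
print(health["status"], health["pri"], health["unassign"], health["active_shards_percent"])
```

For the sample above, the status is yellow with 5 primary shards and 5 unassigned shards, which matches the explanation: the 5 replicas have nowhere to go on a single node.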

3.2、 Index information of cluster

GET /_cat/indices?v

result :

health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open ai_answer_question cl_oJNRPRV-bdBBBLLL05g 5 1 203459 0 172.3mb 172.3mb

The health is yellow, which means some replica shards are unavailable. The index has 5 primary shards, each with one replica shard, and holds a little over 200,000 documents in total with none deleted. The documents take up 172.3 MB.


3.3、 Creating an index

PUT /customer?pretty

ES uses a RESTful API, and creating a resource uses PUT, which is very user friendly.


3.4、 Adding or modifying

If ES has no document under the given id, the document is added; if a document with id=1 already exists, it is replaced in full.

  • Format: PUT /index/type/id

On a full replacement the original document is not physically deleted. It is marked as deleted, and documents marked as deleted are no longer returned by searches. ES removes them for good later, as more and more data accumulates.

PUT /customer/_doc/1?pretty
{
"name": "John Doe"
}

Response:

{
"_index": "customer",
"_type": "_doc",
"_id": "1",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 0,
"_primary_term": 1
}

To force creation, add _create or ?op_type=create:

PUT /customer/_doc/1?op_type=create
PUT /customer/_doc/1/_create

  • Partial update (Partial Update)

Without an id, POST adds a new document.

POST /customer/_doc?pretty
{
"name": "Jane Doe"
}

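With an id, POST writes the document under that id. To change only some fields of an existing doc, use the _update endpoint covered in the "Update the document" section below.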

POST /customer/_doc/1?pretty
{
"name": "Jane Doe"
}

Compared with the PUT above, POST has one more property: as long as you do not specify an id, ES uses a random string as the id and inserts the doc, whether or not an identical doc already exists.

A partial update first gets the document, merges the fields you passed into the document's JSON, marks the old doc as deleted, and then creates the new document. Compared with doing a full replacement from the client, this saves two network requests.
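The difference between a full replacement and a partial update can be modelled in a few lines of Python. This is a toy in-memory model, not ES client code: the `store` dictionary and the functions `put_full` and `partial_update` are made up for this sketch and only imitate the behaviour described above.

```python
store = {}  # toy in-memory index: id -> {"_version": int, "_source": dict}


def put_full(doc_id, source):
    """Like PUT /index/type/id: create, or replace the whole document."""
    version = store[doc_id]["_version"] + 1 if doc_id in store else 1
    store[doc_id] = {"_version": version, "_source": dict(source)}


def partial_update(doc_id, fields):
    """Like a partial update: merge the given fields into the old source."""
    if doc_id not in store:
        raise KeyError(f"document {doc_id} does not exist")
    merged = {**store[doc_id]["_source"], **fields}
    put_full(doc_id, merged)  # the merged result is written as a new version


put_full("1", {"name": "John Doe", "age": 30})
partial_update("1", {"name": "Jane Doe"})  # age survives the update
put_full("1", {"name": "changwu"})         # age is gone: full replacement
```

In the model, the partial update keeps fields that were not mentioned, while the full replacement keeps only what was sent. Both produce a new version of the document.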


3.5、 Retrieval

Format: GET /index/type/id

GET /customer/_doc/1?pretty

Response:

{
"_index": "customer",
"_type": "_doc",
"_id": "1",
"_version": 1,
"found": true,
"_source": {
"name": "John Doe"
}
}

3.6、 Delete

Delete a single document.

In most cases the original document is not removed immediately. It is marked as deleted, and documents marked as deleted are no longer returned by searches. ES removes them for good later, as more and more data accumulates.

DELETE /customer/_doc/1

Response:

{
"_index": "customer",
"_type": "_doc",
"_id": "1",
"_version": 2,
"result": "deleted",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 1,
"_primary_term": 1
}

Delete an index:

DELETE /index1
DELETE /index1,index2
DELETE /index*
DELETE /_all

You can set the following option to true in elasticsearch.yml to forbid DELETE /_all:

action.destructive_requires_name: true

Response:

{
"acknowledged": true
}

3.7、 Update the document

As mentioned above, POST can insert a document without an id being specified. POST combined with _update performs an update.

POST /customer/_doc/1/_update?pretty
{
"doc": { "name": "changwu" }
}

An update with POST + _update still needs an id. The difference from PUT is that an update with POST reports an error when the id does not exist, whereas PUT treats the request as an insert.

Also: for this update, ES first marks the original doc as deleted and then inserts the new doc.


Four 、document api

4.1、search

  • Search all data in all indexes
/_search
  • Search all data in the specified index
/index/_search
  • More patterns
/index1,index2/_search
/*1,*2/_search
/index1,index2/type1,type2/_search
/_all/type1,type2/_search
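The patterns above follow one rule: several index names (or several type names) are joined with commas inside a single path segment. A small Python sketch of that rule follows; `search_path` is a helper name invented here, not part of any ES client.

```python
def search_path(indexes=(), types=()):
    """Build a _search path; several indexes or types are joined with commas."""
    parts = []
    if indexes:
        parts.append(",".join(indexes))
    elif types:
        parts.append("_all")  # types need an index segment in front of them
    if types:
        parts.append(",".join(types))
    parts.append("_search")
    return "/" + "/".join(parts)


print(search_path())
print(search_path(["index1", "index2"]))
print(search_path(["index1", "index2"], ["type1", "type2"]))
print(search_path(types=["type1", "type2"]))
```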

4.2、_mget API: batch query

mget is the batch query API that ES provides. We only need to specify the index, type, and id, and ES returns the matching records to us in one batch.

  • Specify _index, _type, and _id inside docs
GET /_mget
{
"docs" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "1"
},
{
"_index" : "test",
"_type" : "_doc",
"_id" : "2"
}
]
}
  • Specify the index in the URL
GET /test/_mget
{
"docs" : [
{
"_type" : "_doc",
"_id" : "1"
},
{
"_type" : "_doc",
"_id" : "2"
}
]
}
  • Specify the index and type in the URL
GET /test/type/_mget
{
"docs" : [
{
"_id" : "1"
},
{
"_id" : "2"
}
]
}
  • Specify the index and type in the URL, and list the ids with ids
GET /test/type/_mget
{
"ids" : ["1", "2"]
}
  • Specify different source filtering rules for different docs
GET /_mget
{
"docs" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_source" : false
},
{
"_index" : "test",
"_type" : "_doc",
"_id" : "2",
"_source" : ["field3", "field4"]
},
{
"_index" : "test",
"_type" : "_doc",
"_id" : "3",
"_source" : {
"include": ["user"],
"exclude": ["user.location"]
}
}
]
}

4.3、_bulk API: batch create, delete, and update


4.3.1、 Basic grammar

{"action":{"metadata"}}\n
{"data"}\n

Which kinds of operations can be performed?

  • delete: delete a document.

  • create: force creation, like _create.

  • index: an ordinary PUT, which either creates a document or fully replaces one.

  • update: partial update.

The syntax above is not the pretty-printed JSON people are used to reading, but this one-object-per-line form of JSON has the advantage of being more efficient.

This is how ES would have to handle an ordinary JSON array:

  • Parse the JSON array into a JSONArray object, which means two identical copies of the data sit in memory: one as JSON text and one as the JSONArray object.

With the single-line JSON above, ES can cut the request at the line breaks directly and never needs to copy the whole data set in memory.
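The line-oriented format is easy to assemble by program. Here is a minimal Python sketch using only the standard json module: each operation becomes one action line, followed by a data line when the action has one. The function name `build_bulk_body` and the tuple layout are assumptions made for this example, and sending the body to ES is left out.

```python
import json


def build_bulk_body(operations):
    """operations: list of (action, metadata, data_or_None) tuples."""
    lines = []
    for action, metadata, data in operations:
        lines.append(json.dumps({action: metadata}))
        if data is not None:  # delete has no data line
            lines.append(json.dumps(data))
    # Every line, including the last one, must end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body([
    ("delete", {"_index": "test", "_type": "_doc", "_id": "2"}, None),
    ("create", {"_index": "test", "_type": "_doc", "_id": "3"}, {"field1": "value3"}),
    ("index", {"_index": "test", "_type": "_doc", "_id": "1"}, {"field1": "value1"}),
])
print(body, end="")
```

The three operations produce five lines: one for the delete and two each for create and index.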


4.3.2、delete

delete needs only a single line of JSON:

{ "delete" : { "_index" : "test", "_type" : "_doc", "_id" : "2" } }

4.3.3、create

create takes two lines of JSON. The first line gives the index, type, and id of the document to create.

The second line gives the data of the doc.

{ "create" : { "_index" : "test", "_type" : "_doc", "_id" : "3" } }
{ "field1" : "value3" }

4.3.4、index

Equivalent to PUT: it either creates a document or fully replaces one, and it also takes two lines of JSON.

The first line gives the index, type, and id of the document to create or replace.

The second line is the actual data.

{ "index" : { "_index" : "test", "_type" : "_doc", "_id" : "1" } }
{ "field1" : "value1" }

4.3.5、update

update means partial update.

It accepts a retry_on_conflict parameter; the value 3 below means the update may be retried up to 3 times.

POST _bulk
{ "update" : {"_id" : "1", "_type" : "_doc", "_index" : "index1", "retry_on_conflict" : 3} }
{ "doc" : {"field" : "value"} }
{ "update" : { "_id" : "0", "_type" : "_doc", "_index" : "index1", "retry_on_conflict" : 3} }
{ "script" : { "source": "ctx._source.counter += params.param1", "lang" : "painless", "params" : {"param1" : 1}}, "upsert" : {"counter" : 1}}
{ "update" : {"_id" : "2", "_type" : "_doc", "_index" : "index1", "retry_on_conflict" : 3} }
{ "doc" : {"field" : "value"}, "doc_as_upsert" : true }
{ "update" : {"_id" : "3", "_type" : "_doc", "_index" : "index1", "_source" : true} }
{ "doc" : {"field" : "value"} }
{ "update" : {"_id" : "4", "_type" : "_doc", "_index" : "index1"} }
{ "doc" : {"field" : "value"}, "_source": true}

4.4、 Scroll queries

If you query tens of thousands of records in one go, such a large amount of data is bound to hurt ES performance. In that case you can use a scroll query (scroll), which fetches the data batch by batch until everything has been read.

Here is an example. Each scroll request must carry one required scroll parameter, a time window; each batch only has to be fetched within that window.

GET /index/type/_search?scroll=1m
{
"query":{
"match_all":{}
},
"sort":["_doc"],
"size":3
}

Response:

{
"_scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAACNFlJmWHZLTkFhU0plbzlHX01LU2VzUXcAAAAAAAAAkRZSZlh2S05BYVNKZW85R19NS1Nlc1F3AAAAAAAAAI8WUmZYdktOQWFTSmVvOUdfTUtTZXNRdwAAAAAAAACQFlJmWHZLTkFhU0plbzlHX01LU2VzUXcAAAAAAAAAjhZSZlh2S05BYVNKZW85R19NS1Nlc1F3",
"took": 9,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": null,
"hits": [
{
"_index": "my_index",
"_type": "_doc",
"_id": "2",
"_score": null,
"_source": {
"title": "This is another document",
"body": "This document has a body"
},
"sort": [
0
]
},
{
"_index": "my_index",
"_type": "_doc",
"_id": "1",
"_score": null,
"_source": {
"title": "This is a document"
},
"sort": [
0
]
}
]
}
}

To query the next batch, send another scroll request that carries the _scroll_id returned by the previous one:

GET /_search/scroll
{
"scroll":"1m",
"scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAACNFlJmWHZLTkFhU0plbzlHX01LU2VzUXcAAAAAAAAAkRZSZlh2S05BYVNKZW85R19NS1Nlc1F3AAAAAAAAAI8WUmZYdktOQWFTSmVvOUdfTUtTZXNRdwAAAAAAAACQFlJmWHZLTkFhU0plbzlHX01LU2VzUXcAAAAAAAAAjhZSZlh2S05BYVNKZW85R19NS1Nlc1F3"
}

Scroll queries perform best when the results are sorted by _doc.
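The batch-by-batch loop can be sketched in Python as follows. To keep the example runnable without a cluster, the two HTTP requests are passed in as callables: `first_page` and `next_page` are hypothetical stand-ins for the initial search and the follow-up scroll request, and the fake responses below only imitate the shape of the response shown above.

```python
def scroll_all(first_page, next_page):
    """Yield every hit, batch by batch, until a batch comes back empty.

    first_page() and next_page(scroll_id) are hypothetical callables that
    perform the two HTTP requests and return the decoded JSON response.
    """
    response = first_page()
    while response["hits"]["hits"]:
        yield from response["hits"]["hits"]
        response = next_page(response["_scroll_id"])


# Fake responses standing in for a real cluster: three hits in batches of 2.
pages = {
    "s1": {"_scroll_id": "s2", "hits": {"hits": [{"_id": "3"}]}},
    "s2": {"_scroll_id": "s3", "hits": {"hits": []}},
}
first = {"_scroll_id": "s1", "hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}}

ids = [hit["_id"] for hit in scroll_all(lambda: first, lambda sid: pages[sid])]
print(ids)  # ['1', '2', '3']
```

The loop always uses the scroll id from the most recent response and stops as soon as a batch contains no hits.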


Five 、 Contents of the next article:

One 、 The _search API
1.1、query string search
1.2、query DSL: 20 query examples
1.3、 Other auxiliary APIs
1.4、 Aggregation analysis
1.4.1、filter aggregation
1.4.2、 Nested aggregations: breadth-first
1.4.3、global aggregation
1.4.4、Cardinality aggregation
1.4.5、 Controlling ascending and descending order of aggregations
1.4.6、Percentiles aggregation
Two 、 Optimizing relevance scores and query techniques
2.1、 Optimization technique 1
2.2、 Optimization technique 2
2.3、 Optimization technique 3
2.4、 Optimization technique 4
2.5、 Optimization technique 5
2.6、 Optimization technique 6
2.7、 Optimization technique 7
Three 、 Contents of the next article



References:

https://www.elastic.co/guide/en/elasticsearch/reference/6.0

Copyright statement
Author of this article: [Give me a daydream]. Please include the original link when reprinting. Thank you.