What did I want to do when I wrote this?
I looked into how to think about calculating the number of shards and nodes for an Elasticsearch cluster, so I figured I'd leave a memo.
Calculating the number of shards and nodes in an Elasticsearch cluster
The starting point is probably this blog entry:
How many shards should I have in my Elasticsearch cluster? | Elastic Blog
It describes how to think about the size of a single shard.
TIP: Small shards result in small segments, which increases overhead. Aim to keep the average shard size between at least a few GB and a few tens of GB. For use-cases with time-based data, it is common to see shards between 20GB and 40GB in size.
So 20 to 40GB is described as typical, but looking at the AWS blog entry below, 30GB is also given as a guideline.
Get Started with Amazon Elasticsearch Service: How Many Shards Do I Need? | Amazon Web Services Blog
Here, let's use 30GB as the target.
Next, as for how many shards a single Elasticsearch node can hold, that is apparently derived from the heap size, as follows.
TIP: The number of shards you can hold on a node will be proportional to the amount of heap you have available, but there is no fixed limit enforced by Elasticsearch. A good rule-of-thumb is to ensure you keep the number of shards per node below 20 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600 shards, but the further below this limit you can keep it the better.
So, 20 shards per 1GB of heap.
That means that once the index size, the replica count, and the Elasticsearch heap size are decided, the shard count and the number of Elasticsearch nodes can be calculated.
For example, consider the following conditions:
- Only one kind of index is kept in Elasticsearch
- A single index is 150GB in size
- The replica count is 1
- The Elasticsearch heap size is 15GB
- An index of the same kind is created per day and retained for up to 3 months (90 days)
- Picture Logstash or Beats creating daily indices
- Assume each day produces a 150GB index
Doing the math:
- Shards per index: 150GB / 30GB = 5 shards (round up by one shard if there is a remainder)
- Shards in the cluster: 5 shards × 2 (primary + replica) × 90 (days retained, which becomes the number of indices) = 900 shards
- Required Elasticsearch nodes: 900 shards / (15 × 20) (heap size × 20 shards per 1GB of heap) = 3 nodes (round up by one node if there is a remainder)
Something like that. And if you hold multiple kinds of indices, you add them into the shard-count part of the calculation.
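To double-check the arithmetic, the whole estimate fits in a few lines of Python (a sketch of mine, not from any official tool; the function name and defaults are just for illustration):

import math

# A sketch of the estimate above: 30GB per shard, 20 shards per GB of
# heap, primaries plus replicas, and one index per retention day.
def estimate(index_size_gb, replicas, heap_gb, retention_days,
             shard_target_gb=30, shards_per_heap_gb=20):
    shards_per_index = math.ceil(index_size_gb / shard_target_gb)      # 150 / 30 = 5
    total_shards = shards_per_index * (1 + replicas) * retention_days  # 5 x 2 x 90 = 900
    nodes = math.ceil(total_shards / (heap_gb * shards_per_heap_gb))   # 900 / 300 = 3
    return shards_per_index, total_shards, nodes

print(estimate(150, 1, 15, 90))  # => (5, 900, 3)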
It feels easy to drop the replicas from the calculation, but replica shards are included, because replica shards are also used for searches. After a primary shard is updated, its replica shards are updated to match, and both serve searches.
In Elasticsearch, each query is executed in a single thread per shard. Multiple shards can however be processed in parallel, as can multiple queries and aggregations against the same shard.
Looking at an index's actual data size
The next question is: how do you actually measure the size of an index? I'd say you load data and use the cat APIs.
cat indices API | Elasticsearch Reference [7.5] | Elastic
cat shards API | Elasticsearch Reference [7.5] | Elastic
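For reference, the cat APIs also accept query parameters such as h (column selection), bytes, and format, so once data is loaded you can pull out just the size columns, something like this (with your own index name):

$ curl "localhost:9200/_cat/indices/[index name]?v&h=index,pri.store.size,store.size&bytes=mb"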
Without a real example this is all a bit abstract, though, and I didn't really have material of a suitable size lying around (I want to wrap up quickly, so GB-scale is too much, but I'd like at least MBs)...
After thinking it over, I decided to use this blog's data.
Note: here I'll ignore the guidelines written so far, such as 30GB per shard, and instead look at what changes when the shards are split and what effect the replica count has.
Exporting this blog's data
Since the target is this blog, I exported it and got a file named "kazuhira-r.hatenablog.com.export.txt". It's about 32MB in size, which is just about right.
$ ll -h kazuhira-r.hatenablog.com.export.txt
-rw-rw-r-- 1 xxxxx xxxxx 32M 1月  2 15:48 kazuhira-r.hatenablog.com.export.txt
It also contains entries that are still in draft status, though...
$ head -n 20 kazuhira-r.hatenablog.com.export.txt
AUTHOR: Kazuhira
TITLE: Elasticsearchクラスタのシャード数を計算する
BASENAME: 2020/01/02/011901
STATUS: Draft
ALLOW COMMENTS: 1
CONVERT BREAKS: 0
DATE: 01/02/2020 00:58:10
CATEGORY: Elasticsearch
-----
BODY:
<p>Elasticsearchのシャード数を算出する時の考え方を、メモしておこうかなと。</p>
<p>基本となるのは、以下のブログエントリでしょう。</p>
<p><a href="https://www.elastic.co/jp/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster">How many shards should I have in my Elasticsearch cluster? | Elastic Blog</a></p>
<p>こちらに、シャードひとつあたりのサイズについての考え方が書かれています。</p>
<blockquote><p>ヒント: 小さなシャードは小さなセグメントとなり、結果としてオーバーヘッドが増えます。そのため、平均シャードサイズは最小で数GB、最大で数十GBに保つようにしましょう。時間ベースのデータを使用するケースでは、シャードサイズを20GBから40GBにするのが一般的です。</p></blockquote>
The format is Movable Type.
From this file, the following get registered into the index:
- Title (TITLE)
- Categories (CATEGORY)
- Post date and time (DATE)
- Content (just the text extracted from the HTML in BODY)
- Status (STATUS)
For the id, BASENAME with the "/" characters removed is used. Comments are all skipped.
I wrote a little program in Python for this. It uses BeautifulSoup4 to pull the text out of the HTML, and the Python Elasticsearch client to register the data into Elasticsearch.
$ pip3 install beautifulsoup4 elasticsearch
Versions:
$ pip3 freeze
beautifulsoup4==4.8.2
elasticsearch==7.1.0
pkg-resources==0.0.0
soupsieve==1.9.5
urllib3==1.25.7

$ python3 -V
Python 3.6.9
Then I threw together a script. The file name is "mv_export_to_es.py", and it takes the path of the exported Movable Type file as an argument. The script itself isn't the main topic of this entry, so it's included at the end.
In this script, the index the data is registered into is named "blog", and documents are registered in bulk, 100 at a time.
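The heart of the loading loop is just this pattern (a minimal sketch; parse_mv_export is a hypothetical stand-in for the Movable Type parsing done in the full script at the end):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://192.168.33.11:9200"])

docs = []

# parse_mv_export is a hypothetical generator standing in for the
# Movable Type parsing done in the full script at the end.
for entry in parse_mv_export("kazuhira-r.hatenablog.com.export.txt"):
    docs.append({"_index": "blog", "_id": entry["id"], "_source": entry["source"]})

    if len(docs) >= 100:
        helpers.bulk(es, docs)  # flush a batch of 100
        docs = []

helpers.bulk(es, docs)  # flush the remainder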
The Elasticsearch cluster being connected to is assumed to consist of 192.168.33.11 through .13.
Here's the environment info.
$ curl localhost:9200/_cat/nodes?v
ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.33.12            9          96   0    0.00    0.01     0.00 dilm      -      node-2
192.168.33.13            7          96   0    0.02    0.02     0.02 dilm      -      node-3
192.168.33.11            8          95   0    0.00    0.00     0.02 dilm      *      node-1

$ curl localhost:9200
{
  "name" : "node-1",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "0PtgLGF_Q2-IWYMVUtcV4Q",
  "version" : {
    "number" : "7.5.1",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "3ae9ac9a93c95bd0cdc054951cf95d88e1e18d96",
    "build_date" : "2019-12-16T22:57:37.835892Z",
    "build_snapshot" : false,
    "lucene_version" : "8.3.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

$ java --version
openjdk 11.0.5 2019-10-15
OpenJDK Runtime Environment (build 11.0.5+10-post-Ubuntu-0ubuntu1.118.04)
OpenJDK 64-Bit Server VM (build 11.0.5+10-post-Ubuntu-0ubuntu1.118.04, mixed mode, sharing)
And so, without thinking about anything in particular, let's register the data.
$ python3 mv_export_to_es.py kazuhira-r.hatenablog.com.export.txt
1,243 documents went in.
$ curl localhost:9200/blog/_count?pretty
{
  "count" : 1243,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
Let's look at the index information.
$ curl localhost:9200/_cat/indices?v
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   blog  jTUukj5PQ-eZoMo3GSVO8w   1   1       1243            0     23.6mb         11.8mb
So the data size is 23.6MB. But looking closely, pri.store.size shows 11.8MB.
Meanwhile, let's look at the shard sizes.
$ curl localhost:9200/_cat/shards?v
index shard prirep state   docs  store ip            node
blog  0     p      STARTED 1243 11.8mb 192.168.33.13 node-3
blog  0     r      STARTED 1243 11.7mb 192.168.33.11 node-1
There are two shards, each around 11.7 to 11.8MB.
The index was created with default settings, so it has 1 shard and 1 replica.
$ curl localhost:9200/blog/_settings?pretty
{
  "blog" : {
    "settings" : {
      "index" : {
        "creation_date" : "1577954953969",
        "number_of_shards" : "1",
        "number_of_replicas" : "1",
        "uuid" : "jTUukj5PQ-eZoMo3GSVO8w",
        "version" : {
          "created" : "7050199"
        },
        "provided_name" : "blog"
      }
    }
  }
}
So the data size shown for the index (store.size) is the primary shard and replica shard values added together.
As the basis for the calculation, then, it should be enough to look at the total for the primary shards (pri.store.size) to get the data size you'll need.
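If you want that number programmatically, the cat indices API can return JSON; here's a minimal sketch (my own, using the same Python client as the loading script) that reads pri.store.size and projects the total for a given replica count:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://192.168.33.11:9200"])

# format=json returns a list of dicts keyed by the cat column names;
# bytes=b makes the size columns plain byte counts.
info = es.cat.indices(index="blog", format="json", bytes="b")[0]
pri_mb = int(info["pri.store.size"]) / (1024 * 1024)  # primaries only: the calculation base

replicas = 1  # what-if: project the disk needed for a given replica count
print(f"primaries: {pri_mb:.1f}mb, with {replicas} replica(s): {pri_mb * (1 + replicas):.1f}mb")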
As another pattern, let's try changing the shard count and replica count.
First, delete the current index.
$ curl -XDELETE localhost:9200/blog
{"acknowledged":true}
Create the index with 9 shards and 2 replicas.
$ curl -XPUT "localhost:9200/blog" -H 'Content-Type: application/json' -d '{
  "settings" : {
    "index" : {
      "number_of_shards" : 9,
      "number_of_replicas" : 2
    }
  }
}'
{"acknowledged":true,"shards_acknowledged":true,"index":"blog"}
Register the data again.
$ python3 mv_export_to_es.py kazuhira-r.hatenablog.com.export.txt
Check.
$ curl localhost:9200/_cat/indices?v
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   blog  4iAlQWUURxypKkNqmilbsg   9   2       1243            0     41.1mb           14mb
The primary shards' total is a bit larger than it was with a single shard, and as for the total size, since the replica count is now 2, store.size is about three times pri.store.size.
The shard state looks like this.
$ curl localhost:9200/_cat/shards?v
index shard prirep state   docs store ip            node
blog  3     p      STARTED  143 1.6mb 192.168.33.13 node-3
blog  3     r      STARTED  143 1.5mb 192.168.33.11 node-1
blog  3     r      STARTED  143 1.6mb 192.168.33.12 node-2
blog  4     r      STARTED  124 1.4mb 192.168.33.13 node-3
blog  4     p      STARTED  124 1.4mb 192.168.33.11 node-1
blog  4     r      STARTED  124 1.4mb 192.168.33.12 node-2
blog  6     p      STARTED  137 1.4mb 192.168.33.13 node-3
blog  6     r      STARTED  137 1.3mb 192.168.33.11 node-1
blog  6     r      STARTED  137 1.3mb 192.168.33.12 node-2
blog  7     r      STARTED  163 1.7mb 192.168.33.13 node-3
blog  7     p      STARTED  163 1.9mb 192.168.33.11 node-1
blog  7     r      STARTED  163 1.7mb 192.168.33.12 node-2
blog  8     r      STARTED  126 1.3mb 192.168.33.13 node-3
blog  8     r      STARTED  126 1.3mb 192.168.33.11 node-1
blog  8     p      STARTED  126 1.5mb 192.168.33.12 node-2
blog  2     r      STARTED  129 1.3mb 192.168.33.13 node-3
blog  2     r      STARTED  129 1.2mb 192.168.33.11 node-1
blog  2     p      STARTED  129 1.3mb 192.168.33.12 node-2
blog  1     r      STARTED  141 1.6mb 192.168.33.13 node-3
blog  1     p      STARTED  141 1.6mb 192.168.33.11 node-1
blog  1     r      STARTED  141 1.4mb 192.168.33.12 node-2
blog  5     r      STARTED  129 1.5mb 192.168.33.13 node-3
blog  5     r      STARTED  129 1.5mb 192.168.33.11 node-1
blog  5     p      STARTED  129 1.5mb 192.168.33.12 node-2
blog  0     p      STARTED  151 1.6mb 192.168.33.13 node-3
blog  0     r      STARTED  151 1.6mb 192.168.33.11 node-1
blog  0     r      STARTED  151 1.5mb 192.168.33.12 node-2
So when actually doing the calculation, the approach would be something like: load the data into a single shard to find the baseline data size, then actually split the shards and check how much extra that adds.
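That check can also be scripted; a minimal sketch (my own; the baseline value is the 11.8mb measured above with a single shard) that sums the primary shard sizes via the cat shards API:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://192.168.33.11:9200"])

# Sum only the primary shards ("p") from the cat shards API.
shards = es.cat.shards(index="blog", format="json", bytes="b")
pri_total_mb = sum(int(s["store"]) for s in shards if s["prirep"] == "p") / (1024 * 1024)

baseline_mb = 11.8  # pri.store.size measured earlier with a single shard
print(f"primary total: {pri_total_mb:.1f}mb "
      f"(1-shard baseline: {baseline_mb}mb, extra: {pri_total_mb - baseline_mb:.1f}mb)")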
By the way, when the results take a while to show up in the cat APIs, refreshing the index with the Refresh API should help.
$ curl -XPOST localhost:9200/[index name]/_refresh
$ curl -XPOST localhost:9200/_refresh
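Since the loading script already uses the Python client, the same refresh can also be issued from Python (a minimal sketch):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://192.168.33.11:9200"])

es.indices.refresh(index="blog")  # refresh a single index
es.indices.refresh()              # or every index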
Bonus
Finally, here's the script created this time, to wrap up.

mv_export_to_es.py
from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch
from elasticsearch import helpers
import re
import sys

elasticsearch_hosts = ["192.168.33.11", "192.168.33.12", "192.168.33.13"]
elasticsearch_urls = [f"http://{host}:9200" for host in elasticsearch_hosts]

es = Elasticsearch(elasticsearch_urls)

index_name = "blog"
bulk_size = 100

mv_export_file = sys.argv[1:][0]

body_separator = "-----"
entry_separator = "--------"

with open(mv_export_file, "rt", encoding="utf-8") as file:
    docs = []

    while True:
        line = file.readline()  # AUTHOR

        if line == "":  # EOF
            break

        title = re.match(r"TITLE:\s+(.+)", file.readline().strip()).group(1)  # TITLE
        basename = re.match(r"BASENAME:\s+(.+)", file.readline()).group(1).replace("/", "")  # BASENAME
        status = re.match(r"STATUS:\s(.+)", file.readline().strip()).group(1)  # STATUS
        file.readline()  # ALLOW COMMENTS
        file.readline()  # CONVERT BREAKS
        m = re.match(r"DATE:\s+(\d\d)/(\d\d)/(\d\d\d\d) (\d\d:\d\d:\d\d)", file.readline())  # DATE
        datetime = f"{m.group(3)}-{m.group(1)}-{m.group(2)} {m.group(4)}"

        categories = []

        while True:
            line = file.readline()

            if line.startswith("CATEGORY:"):
                category = re.match(r"CATEGORY:\s+(.+)", line).group(1)
                categories.append(category)
            else:  # BODY separator (-----)
                break

        file.readline()  # BODY

        body = ""

        while True:
            line = file.readline()

            if line.strip() == body_separator:
                maybe_contents = line
                line = file.readline()

                if line.strip() == entry_separator:
                    break
                elif line.strip() == "COMMENT:":
                    # skip comments
                    while True:
                        line = file.readline()

                        if line.strip() == entry_separator:
                            break

                    break
                else:
                    body += maybe_contents
                    body += line
            else:
                body += line

        soup = BeautifulSoup(body, "html.parser")
        body_text = soup.get_text()

        doc = {
            "_index": index_name,
            "_id": basename,
            "_source": {
                "title": title,
                "datetime": datetime,
                "categories": categories,
                "body": body_text,
                "status": status
            }
        }

        docs.append(doc)

        if len(docs) >= bulk_size:
            helpers.bulk(es, docs)
            docs = []

    helpers.bulk(es, docs)  # flush the last, partial batch