CLOVER🍀

That was when it all began.

Applying the fixed mecab-ipadic-neologd dictionary to Lucene Kuromoji

The other day, I wrote this entry.

Applying the mecab-ipadic-neologd dictionary to Lucene Kuromoji
http://d.hatena.ne.jp/Kazuhira/20150315/1426391366

For mecab-ipadic-neologd itself, see here.

Released mecab-ipadic-neologd, a neologism dictionary for MeCab
http://diary.overlasting.net/2015-03-13-1.html

In that entry I tried applying mecab-ipadic-neologd to Lucene's Kuromoji, but ran into two problems.

The first was that Kuromoji cannot import words in the mecab-ipadic-neologd seed dictionary whose base form exceeds 15 characters. The second was that the combinations of context ID and part of speech in the same seed dictionary had drifted from those in the original IPA dictionary, so those entries could not be imported either.

I think the first is a Kuromoji matter (the original MeCab works even when a base form exceeds 15 characters), while I rather suspected the second was a problem in mecab-ipadic-neologd.

Then it seems the author, @overlast, saw that entry and fixed it!

Thank you very much!

(Added 3/17)
After that, an option to specify the maximum base form length to include in the dictionary was even added!

Sorry for the trouble... and thank you!

So, let me pull myself together and summarize the procedure once more. This time, a bit more compactly...
Note: some descriptions are carried over from the previous entry; please bear with that.

(Addendum)
In the end, I turned this into a bash script. If you just want the result right away, go here.

I wrote a bash script that applies the mecab-ipadic-neologd dictionary to Lucene Kuromoji and builds it
http://d.hatena.ne.jp/Kazuhira/20150317/1426606053

Installing MeCab and the IPA dictionary

While reading the README, gather the required software.

mecab-ipadic-NEologd : Neologism dictionary for MeCab
https://github.com/neologd/mecab-ipadic-neologd/blob/master/README.ja.md

Apparently what you need is a C++ compiler, iconv, MeCab, mecab-ipadic, and xz.

I already gathered most of these last time, so here I start from installing MeCab and the IPA dictionary.

Installing MeCab. This time too I am not installing MeCab system-wide, so I specify the install location. Here it is written as "$MECAB_HOME".

$ wget https://mecab.googlecode.com/files/mecab-0.996.tar.gz
$ tar -zxvf mecab-0.996.tar.gz
$ cd mecab-0.996
$ ./configure --prefix=$MECAB_HOME
$ make
$ sudo make install

Add the installed MeCab to the PATH.

$ export PATH=$MECAB_HOME/bin:$PATH

Installing the IPA dictionary.

$ wget https://mecab.googlecode.com/files/mecab-ipadic-2.7.0-20070801.tar.gz
$ tar -zxvf mecab-ipadic-2.7.0-20070801.tar.gz
$ cd mecab-ipadic-2.7.0-20070801
$ ./configure --with-charset=utf-8
$ make
$ sudo make install

With this, the requirements for installing mecab-ipadic-neologd are in place.

Installing mecab-ipadic-neologd

Install mecab-ipadic-neologd following the steps described here.

mecab-ipadic-NEologd : Neologism dictionary for MeCab
https://github.com/neologd/mecab-ipadic-neologd/blob/master/README.ja.md

$ git clone https://github.com/neologd/mecab-ipadic-neologd.git
$ cd mecab-ipadic-neologd
$ git pull
$ ./bin/install-mecab-ipadic-neologd -n --max_baseform_length 15

Here I specify the option that @overlast kindly added, "-n --max_baseform_length [maximum base form length]", so that base forms are at most 15 characters long.

When you run it, a message like this appears partway through, and it seems to delete the entries whose base form exceeds 15 characters.

[make-mecab-ipadic-neologd] : Delete the entries whose length of base form is longer than 15 from seed file
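
To illustrate what this kind of filtering amounts to, here is a minimal sketch, not the installer's actual implementation. It assumes the IPA dictionary CSV layout, where the base form is the 11th comma-separated column, and it assumes no field contains a quoted comma. The function name filter_by_baseform_length is made up for this example. Note that awk's length() counts characters only in a multibyte-aware awk running under a UTF-8 locale; otherwise it counts bytes.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of mecab-ipadic-neologd): print only the CSV
# rows whose base form (assumed to be column 11) is at most MAX characters.
# $1 = MAX, $2 = path to a seed CSV
filter_by_baseform_length() {
  local max="$1" csv="$2"
  # Naive comma split: rows with quoted commas would need a real CSV parser.
  awk -F, -v max="$max" 'length($11) <= max' "$csv"
}

# Usage (file names are placeholders):
#   filter_by_baseform_length 15 seed.csv > seed.filtered.csv
```

In practice the "--max_baseform_length" option already does this during the install, so a helper like this would only be useful for inspecting a CSV by hand.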

From here on, we use the contents of the "build" directory that is generated in the current directory during the build.

$ ls -l build
total 11928
drwxrwxr-x 2 xxxxx xxxxx     4096 Mar 17 21:53 mecab-ipadic-2.7.0-20070801-neologd-20150317
-rw-rw-r-- 1 xxxxx xxxxx 12208105 Mar 17 21:52 mecab-ipadic-2.7.0-20070801.tar.gz

The date part after "neologd" ("20150317" this time) appears to advance as the dictionary is updated, so substitute it as appropriate.
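
Since the date suffix changes, one option is to look the directory name up instead of typing it. The following is only a sketch: the function name latest_neologd_dir is hypothetical, and it assumes the directory naming pattern shown in the listing above. Glob results are sorted, so with YYYYMMDD suffixes the last match is the newest.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of the newest
# mecab-ipadic-2.7.0-20070801-neologd-YYYYMMDD directory under a build dir.
# $1 = path to the "build" directory
latest_neologd_dir() {
  local build_dir="$1" d latest=""
  for d in "$build_dir"/mecab-ipadic-2.7.0-20070801-neologd-*/; do
    [ -d "$d" ] || continue   # skip the unexpanded pattern when nothing matches
    latest="$d"               # matches are sorted, so the last one is the newest
  done
  [ -n "$latest" ] && basename "$latest"
}

# Usage (path is a placeholder):
#   NEOLOGD_DIR=$(latest_neologd_dir ./build)
```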

Incidentally, for this kind of usage, the script to run at the end might as well be "libexec/make-mecab-ipadic-neologd.sh" rather than "bin/install-mecab-ipadic-neologd".

From here on, MeCab is no longer used.

Building Lucene

Next, build Lucene.

Export from Subversion and build Lucene itself. The Lucene version used this time is 5.0.0, so the tag is "lucene_solr_5_0_0".

$ svn export http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_5_0_0
$ cd lucene_solr_5_0_0/lucene
$ ant ivy-bootstrap
$ ant compile

"ant ivy-bootstrap" is unnecessary if Ivy is already installed for Ant.

The directory Lucene was exported to (/path/to/lucene_solr_5_0_0) is written as $LUCENE_SRC_HOME below.

Move to the directory where Kuromoji is located.

$ cd analysis/kuromoji

Copy the intermediate artifacts from the mecab-ipadic-neologd build (build/mecab-ipadic-2.7.0-20070801-neologd-20150317) to the output location of Kuromoji's build artifacts.

$ cp -Rp [directory where mecab-ipadic-neologd was built]/build/mecab-ipadic-2.7.0-20070801-neologd-20150317 $LUCENE_SRC_HOME/lucene/build/analysis/kuromoji

So that the copied mecab-ipadic-neologd intermediate artifacts are used instead of the default IPA dictionary, edit ipadic.version in build.xml (this value also serves as the directory name).

  <!-- <property name="ipadic.version" value="mecab-ipadic-2.7.0-20070801" /> -->
  <property name="ipadic.version" value="mecab-ipadic-2.7.0-20070801-neologd-20150317" />

As noted earlier, adjust the date following neologd to match the version you installed.

The dictionary used this time (or rather, its CSV files) is written in UTF-8, so change the encoding from the default EUC-JP.

  <!-- <property name="dict.encoding" value="euc-jp"/> -->
  <property name="dict.encoding" value="utf-8"/>
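
If you want to double-check the encoding before changing dict.encoding, a rough check like the following may help. It is a sketch under assumptions: the function name all_utf8 is made up, and it relies on iconv exiting with a non-zero status when the input is not valid UTF-8. A pure-ASCII file passes either way, so this only catches files that actually contain non-UTF-8 multibyte text such as EUC-JP.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if every given file decodes cleanly as UTF-8.
all_utf8() {
  local f
  for f in "$@"; do
    # Round-trip through iconv; invalid byte sequences make it fail.
    if ! iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1; then
      echo "not valid UTF-8: $f" >&2
      return 1
    fi
  done
}

# Usage (path is a placeholder):
#   all_utf8 build/mecab-ipadic-2.7.0-20070801-neologd-20150317/*.csv
```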

The build-dict target no longer needs to download the dictionary, so detach the download-dict target from depends.

  <!-- <target name="build-dict" depends="compile-tools, download-dict"> -->
  <target name="build-dict" depends="compile-tools">

When the dictionary builder tool reads these CSV files, the default heap size (1G) is not enough, so increase it. I set it to 2G, which left plenty of headroom this time.

      <!-- TODO: optimize the dictionary construction a bit so that you don't need 1G -->
      <!-- <java fork="true" failonerror="true" maxmemory="1g" classname="org.apache.lucene.analysis.ja.util.DictionaryBuilder"> -->
      <java fork="true" failonerror="true" maxmemory="2g" classname="org.apache.lucene.analysis.ja.util.DictionaryBuilder">

Note: I later remembered that Ant settings can also be given via system properties, but things like maxmemory did not look changeable that way, so I stuck with editing build.xml.
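
The four build.xml edits above could also be applied with sed instead of by hand. This is a minimal sketch, assuming the lines in build.xml look exactly like the ones quoted in this entry; the function name patch_kuromoji_build_xml is hypothetical. Unlike the manual edits it rewrites the lines in place rather than keeping the originals as comments, and it leaves a build.xml.bak backup. The dictionary name must not contain "|" or "&", since those are special in the sed expression.

```shell
#!/usr/bin/env bash
# Hypothetical helper: apply the build.xml edits described in this entry.
# $1 = path to Kuromoji's build.xml
# $2 = dictionary directory name to set as ipadic.version
patch_kuromoji_build_xml() {
  local build_xml="$1" dict_name="$2"
  sed -i.bak \
    -e "s|\(name=\"ipadic.version\" value=\"\)[^\"]*|\1${dict_name}|" \
    -e 's|\(name="dict.encoding" value="\)[^"]*|\1utf-8|' \
    -e 's|depends="compile-tools, download-dict"|depends="compile-tools"|' \
    -e 's|maxmemory="1g"|maxmemory="2g"|' \
    "$build_xml"
}

# Usage (values are placeholders):
#   patch_kuromoji_build_xml build.xml mecab-ipadic-2.7.0-20070801-neologd-20150317
```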

Now, build the dictionary.

$ ant regenerate

If you did not pass the "-n --max_baseform_length 15" option when installing mecab-ipadic-neologd, the dictionary build fails here after a short wait.

     [java] building tokeninfo dict...
     [java]   parse...
     [java]   sort...
     [java]   encode...
     [java] Exception in thread "main" java.lang.AssertionError
     [java] 	at org.apache.lucene.analysis.ja.util.BinaryDictionaryWriter.put(BinaryDictionaryWriter.java:129)
     [java] 	at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:143)
     [java] 	at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:78)
     [java] 	at org.apache.lucene.analysis.ja.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
     [java] 	at org.apache.lucene.analysis.ja.util.DictionaryBuilder.main(DictionaryBuilder.java:82)

As I wrote at the beginning, Kuromoji does not seem to allow base forms longer than 15 characters.

      assert baseForm.length() < 16;

https://github.com/apache/lucene-solr/blob/lucene_solr_5_0_0/lucene/analysis/kuromoji/src/tools/java/org/apache/lucene/analysis/ja/util/BinaryDictionaryWriter.java#L129

(Reference: the MeCab dictionary format)
How to add words
http://mecab.googlecode.com/svn/trunk/mecab/doc/dic.html

If it does not go well, review this area.

On success, the following is displayed.

regenerate:

BUILD SUCCESSFUL
Total time: 1 minute 31 seconds

Also, last time the build stumbled here because the context ID and part-of-speech combinations were out of sync. That has been fixed nicely!

Then build Kuromoji itself.

$ ant jar-core
jar-core:
      [jar] Building jar: $LUCENE_SRC_HOME/lucene/build/analysis/kuromoji/lucene-analyzers-kuromoji-5.0.0-SNAPSHOT.jar

BUILD SUCCESSFUL
Total time: 5 seconds

Checking the behavior with Lucene Kuromoji

Now let's check the behavior using Kuromoji. The program is written in Scala.

build.sbt

name := "lucene-kuromoji-mecab-neologd"

version := "0.0.1-SNAPSHOT"

scalaVersion := "2.11.5"

organization := "org.littlewings"

updateOptions := updateOptions.value.withCachedResolution(true)

scalacOptions ++= Seq("-Xlint", "-unchecked", "-deprecation", "-feature")

libraryDependencies ++= Seq(
  "org.apache.lucene" % "lucene-core" % "5.0.0",
  "org.apache.lucene" % "lucene-analyzers-common" % "5.0.0",
  "org.apache.lucene" % "lucene-analyzers-kuromoji" % "5.0.0"
)

Code for the check.
src/main/scala/org/littlewings/lucene/kuromoji/KuromojiWithNeologd.scala

package org.littlewings.lucene.kuromoji

import org.apache.lucene.analysis.ja.JapaneseAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute

object KuromojiWithNeologd {
  def main(args: Array[String]): Unit = {
    val texts = Array(
      "すもももももももものうち",
      "きゃりーぱみゅぱみゅは、2012年に「つけまつける」でデビュー!",
      "日本経済新聞でモバゲーの記事を読んだ",
      "くりぃむしちゅーは、上田晋也と有田哲平の2人からなる日本のお笑いコンビ",
      "艦隊これくしょんは、角川ゲームスが開発し、DMM.comが配信しているブラウザゲーム"
    )

    val analyzer = new JapaneseAnalyzer

    for (text <- texts) {
      val tokenStream = analyzer.tokenStream("", text)

      val charTermAttr = tokenStream.addAttribute(classOf[CharTermAttribute])

      tokenStream.reset()

      val tokens =
        Iterator
          .continually(tokenStream.incrementToken())
          .takeWhile(identity)
          .map(_ => charTermAttr.toString)

      println(s"InputText = $text")
      println(s"  Tokenized = ${tokens.mkString("[", ", ", "]")}")

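      // Per the TokenStream contract, call end() after the last incrementToken() and before close()
      tokenStream.end()
      tokenStream.close()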
    }
  }
}

The sentences to analyze were pasted in somewhat arbitrarily from Wikipedia and elsewhere. Kuromoji is in its default SEARCH mode.

Let's run this program.

> run
[info] Running org.littlewings.lucene.kuromoji.KuromojiWithNeologd 
InputText = すもももももももものうち
  Tokenized = [すもも, もも, もも]
InputText = きゃりーぱみゅぱみゅは、2012年に「つけまつける」でデビュー!
  Tokenized = [く, ー, ぱみゅぱみゅは, 2012, 年, つけ, つける, デビュ]
InputText = 日本経済新聞でモバゲーの記事を読んだ
  Tokenized = [日本, 日本経済新聞, 経済, 新聞, モバゲ, 記事, 読む]
InputText = くりぃむしちゅーは、上田晋也と有田哲平の2人からなる日本のお笑いコンビ
  Tokenized = [くり, ぃむしちゅ, ー, 上田, 晋, 也, 有田, 哲, 平, 2, 人, 日本, お笑い, コンビ]
InputText = 艦隊これくしょんは、角川ゲームスが開発し、DMM.comが配信しているブラウザゲーム
  Tokenized = [艦隊, くし, ょんは, 角川, ゲームス, 開発, dmm, com, 配信, ブラウザゲーム]
[success] Total time: 0 s, completed 2015/03/17 0:14:21

The part after "Tokenized =" is the result of morphological analysis, and it turned out fairly messy. It shows that there are many unknown words.

Now let's use the Kuromoji JAR file that was just built with the dictionary.
First, exit sbt.

> exit

Create a lib directory.

$ mkdir lib

Drop the built JAR file into it.

$ cp $LUCENE_SRC_HOME/lucene/build/analysis/kuromoji/lucene-analyzers-kuromoji-5.0.0-SNAPSHOT.jar lib/

Remove Kuromoji from the sbt dependency definition.

libraryDependencies ++= Seq(
  "org.apache.lucene" % "lucene-core" % "5.0.0",
  "org.apache.lucene" % "lucene-analyzers-common" % "5.0.0"
  // "org.apache.lucene" % "lucene-analyzers-kuromoji" % "5.0.0"
)

Then start sbt again and run.

> run
[info] Running org.littlewings.lucene.kuromoji.KuromojiWithNeologd 
InputText = すもももももももものうち
  Tokenized = [すもももももも, すもももももももものうち, もも]
InputText = きゃりーぱみゅぱみゅは、2012年に「つけまつける」でデビュー!
  Tokenized = [きゃりーぱみゅぱみゅ, 2012年, つけまつける, デビュ]
InputText = 日本経済新聞でモバゲーの記事を読んだ
  Tokenized = [日本, 日本経済新聞, 経済, 新聞, mobage, 記事, 読む]
InputText = くりぃむしちゅーは、上田晋也と有田哲平の2人からなる日本のお笑いコンビ
  Tokenized = [くりぃむしちゅー, 上田, 晋也, 有田, 有田哲平, 哲平, 2, 人, 日本, お笑いコンビ]
InputText = 艦隊これくしょんは、角川ゲームスが開発し、DMM.comが配信しているブラウザゲーム
  Tokenized = [艦隊これくしょん, 角川ゲームス, 開発, dmm.com, 配信, ブラウザゲーム]
[success] Total time: 1 s, completed 2015/03/17 0:15:09

The results changed quite a bit. Names such as "きゃりーぱみゅぱみゅ" are now recognized rather well.

By the way, the result for "すもももももももものうち" came out strangely.

InputText = すもももももももものうち
  Tokenized = [すもももももも, すもももももももものうち, もも]

Just in case, check the seed CSV...

$ view [directory where mecab-ipadic-neologd was built]/build/mecab-ipadic-2.7.0-20070801-neologd-20150317/mecab-user-dict-seed.20150317.csv

It matched these entries...

すももももも,1288,1288,5072,名詞,固有名詞,一般,*,*,*,すももももも,スモモモモモ,スモモモモモ
すもももももも,1288,1288,4587,名詞,固有名詞,一般,*,*,*,すもももももも,スモモモモモモ,スモモモモモモ

And "すもももももももものうち" itself exists as a single noun...

すもももももももものうち,1288,1288,4143,名詞,固有名詞,一般,*,*,*,すもももももももものうち,スモモモモモモモモノウチ,スモモモモモモモモノウチ

Anyway, the goal of importing the mecab-ipadic-neologd dictionary into Kuromoji and running it has been achieved, so let's call it good!

In closing

As for mecab-ipadic-neologd, it seems to have been attracting attention since its release; I see its name in various places, and entries along the lines of "I tried it with XX" are appearing.

In the midst of that, I tripped over it with Lucene Kuromoji the other day... Even so, I was very surprised to receive comments and fixes from @overlast. Last time I forced the build through rather aggressively, but thanks to the fixes I think it has become much easier to import into Lucene Kuromoji as well.

Thank you very much!

The source code and scripts created this time are placed here.
https://github.com/kazuhira-r/lucene-examples/tree/master/lucene-kuromoji-mecab-neologd

Compared with last time, I tweaked the sentences to analyze a little.