Applying a user-defined dictionary to Lucene's Kuromoji (JapaneseAnalyzer)

A little while ago I checked how morphological analysis with Kuromoji behaves. This time I'd like to try out a user-defined dictionary.

This time I'll go with an Analyzer rather than a Tokenizer.

I used the following site as a partial reference.
http://www.mwsoft.jp/programming/lucene/kuromoji.html

build.sbt

name := "lucene-kuromoji-userdict"

version := "0.0.1-SNAPSHOT"

scalaVersion := "2.10.2"

organization := "littlewings"

libraryDependencies ++= Seq(
  "org.apache.lucene" % "lucene-core" % "4.3.0",
  "org.apache.lucene" % "lucene-analyzers-common" % "4.3.0",
  "org.apache.lucene" % "lucene-analyzers-kuromoji" % "4.3.0"
)

A user-defined dictionary is fairly easy to make. Start with the dictionary definition itself; for example, create a file like this.
src/main/resources/my-userdict.txt

# Everything after # is ignored as a comment

# word,segmentation after analysis (space-separated when the word is split),reading,part of speech
かずひら,かずひら,カズヒラ,カスタム名詞
はてなダイアリー,はてなダイアリー,ハテナダイアリー,カスタム名詞

# Part of a dictionary that ships with Kuromoji for testing
日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞

This file sits directly at the root of the classpath.
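
To make the four-column format concrete, here is a minimal sketch in plain Scala that parses one dictionary line. It is not how Lucene reads the file; `UserDictEntry` and `UserDictFormat` are hypothetical names, and the two sanity checks (one reading per segment, segments joining back to the original word) are my own assumptions about what a well-formed entry looks like.

```scala
// Hypothetical model of one user dictionary row (not a Lucene class)
case class UserDictEntry(surface: String,
                         segments: List[String],
                         readings: List[String],
                         pos: String)

object UserDictFormat {
  // Parses "word,segmentation,reading,part of speech".
  // Returns None for comments, blank lines and rows that fail the assumed checks.
  def parse(line: String): Option[UserDictEntry] = {
    // Everything after '#' is treated as a comment
    val body = line.takeWhile(_ != '#').trim
    if (body.isEmpty) None
    else body.split(",", -1) match {
      case Array(surface, segs, reads, pos) =>
        val segments = segs.split(" ").toList
        val readings = reads.split(" ").toList
        // Assumed checks: one reading per segment, and the segments re-join to the word
        if (segments.size == readings.size && segments.mkString == surface)
          Some(UserDictEntry(surface, segments, readings, pos))
        else None
      case _ => None
    }
  }
}
```

For example, `UserDictFormat.parse("日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞")` yields an entry with three segments and three readings, while a comment line yields `None`.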

To read this file and create an instance of UserDictionary, the class that represents a user-defined dictionary, do something like this.

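val reader =
  // Read the dictionary as UTF-8 explicitly (assuming the file is saved as UTF-8)
  // instead of relying on the platform default encoding
  new InputStreamReader(this.getClass.getResourceAsStream("my-userdict.txt"), "UTF-8")
try {
  // UserDictionary wraps the reader in a BufferedReader internally
  new UserDictionary(reader)
} finally {
  reader.close()
}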

Wrapping it in a BufferedReader on the outside would probably be fine too, though...

Use this to create a JapaneseAnalyzer instance.

However, the constructor that accepts a UserDictionary requires you to specify the following, in addition to the Lucene version:

the user-defined dictionary
the Kuromoji mode
the stop words
the stop tags

I don't expect to change the stop words and stop tags, so I'll use the default values. The default stop words and stop tags can be obtained from static methods of JapaneseAnalyzer.

So the code looks like this.

val analyzer = new JapaneseAnalyzer(Version.LUCENE_43,
                                    /* pass the UserDictionary instance here */,
                                    /* pass the Kuromoji Mode here */,
                                    JapaneseAnalyzer.getDefaultStopSet,
                                    JapaneseAnalyzer.getDefaultStopTags)

If you also want to customize the stop words and stop tags, reading the JapaneseAnalyzer source is a good idea...
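
As a rough sketch of what such a customization could look like, the following copies the default sets and adds to the copies, then passes them to the same five-argument constructor. `CustomStopSets` and its methods are hypothetical names, the extra stop word and stop tag are only illustrative, and I am assuming `CharArraySet.copy` from `org.apache.lucene.analysis.util` in Lucene 4.3. The defaults are shared static instances, so it seems safer to copy them than to modify them.

```scala
import java.util.{HashSet => JHashSet, Set => JSet}

import org.apache.lucene.analysis.ja.{JapaneseAnalyzer, JapaneseTokenizer}
import org.apache.lucene.analysis.ja.dict.UserDictionary
import org.apache.lucene.analysis.util.CharArraySet
import org.apache.lucene.util.Version

// Hypothetical helper object; not part of the original code
object CustomStopSets {
  // Default stop words plus the given extras, as a fresh copy
  def stopWordsWith(extra: String*): CharArraySet = {
    val words = CharArraySet.copy(Version.LUCENE_43, JapaneseAnalyzer.getDefaultStopSet)
    extra.foreach(w => words.add(w))
    words
  }

  // Default stop tags plus the given extras, as a fresh copy
  def stopTagsWith(extra: String*): JSet[String] = {
    val tags = new JHashSet[String](JapaneseAnalyzer.getDefaultStopTags)
    extra.foreach(t => tags.add(t))
    tags
  }

  // Illustrative only: drops the word "使用" and every token tagged "カスタム名詞"
  def createAnalyzer(userDict: UserDictionary): JapaneseAnalyzer =
    new JapaneseAnalyzer(Version.LUCENE_43,
                         userDict,
                         JapaneseTokenizer.Mode.SEARCH,
                         stopWordsWith("使用"),
                         stopTagsWith("カスタム名詞"))
}
```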

Incidentally, the default stop words are:

の
に
は
を
た
が
で
て
と
し
れ
さ
ある
いる
も
する
から
な
こと
として
い
や
れる
など
なっ
ない
この
ため
その
あっ
よう
また
もの
という
あり
まで
られ
なる
へ
か
だ
これ
によって
により
おり
より
による
ず
なり
られる
において
ば
なかっ
なく
しかし
について
せ
だっ
その後
できる
それ
う
ので
なお
のみ
でき
き
つ
における
および
いう
さらに
でも
ら
たり
その他
に関する
たち
ます
ん
なら
に対して
特に
せる
及び
これら
とき
では
にて
ほか
ながら
うち
そして
とともに
ただし
かつて
それぞれ
または
お
ほど
ものの
に対する
ほとんど
と共に
といった
です
とも
ところ
ここ

And the default stop tags are:

接続詞
助詞
助詞-格助詞
助詞-格助詞-一般
助詞-格助詞-引用
助詞-格助詞-連語
助詞-接続助詞
助詞-係助詞
助詞-副助詞
助詞-間投助詞
助詞-並立助詞
助詞-終助詞
助詞-副助詞／並立助詞／終助詞
助詞-連体化
助詞-副詞化
助詞-特殊
助動詞
記号
記号-一般
記号-読点
記号-句点
記号-空白
記号-括弧開
記号-括弧閉
その他-間投
フィラー
非言語音

Those are the defaults.

Now, let's try it out.

I prepared the following Scala code.
src/main/scala/KuromojiUserDict.scala

import java.io.{InputStreamReader, StringReader}

import org.apache.lucene.analysis.{Analyzer, Tokenizer, TokenStream}
import org.apache.lucene.analysis.ja.{JapaneseAnalyzer, JapaneseTokenizer}
import org.apache.lucene.analysis.ja.dict.UserDictionary
import org.apache.lucene.analysis.ja.tokenattributes.{BaseFormAttribute, PartOfSpeechAttribute, ReadingAttribute, InflectionAttribute}
import org.apache.lucene.analysis.tokenattributes.{CharTermAttribute, OffsetAttribute, PositionIncrementAttribute, TypeAttribute}
import org.apache.lucene.util.Version

object KuromojiUserDict {
  def main(args: Array[String]): Unit = {
    val texts = List(
      /* put the texts to analyze here */
    )

    for (text <- texts) {
      withJapaneseAnalyzer(text, JapaneseTokenizer.Mode.SEARCH)(displayTokens)
    }
  }

  def createUserDictionary(): UserDictionary = {
    val reader =
      // Read the dictionary as UTF-8 explicitly (assuming the file is saved as UTF-8)
      new InputStreamReader(this.getClass.getResourceAsStream("my-userdict.txt"), "UTF-8")
    try {
      // UserDictionary wraps the reader in a BufferedReader internally
      new UserDictionary(reader)
    } finally {
      reader.close()
    }
  }

  def withJapaneseAnalyzer(text: String, mode: JapaneseTokenizer.Mode)(body: (String, TokenStream) => Unit): Unit = {
    val analyzer = new JapaneseAnalyzer(Version.LUCENE_43,
                                        createUserDictionary(),
                                        mode,
                                        JapaneseAnalyzer.getDefaultStopSet,
                                        JapaneseAnalyzer.getDefaultStopTags)
    println(s"Mode => $mode Start")

    val reader = new StringReader(text)
    val tokenStream = analyzer.tokenStream("", reader)

    try {
      body(text, tokenStream)
    } finally {
      tokenStream.close()
    }

    println(s"Mode => $mode End")
    println()
  }

  def displayTokens(text: String, tokenStream: TokenStream): Unit = {
    val charTermAttr = tokenStream.addAttribute(classOf[CharTermAttribute])
    val offsetAttr = tokenStream.addAttribute(classOf[OffsetAttribute])
    val positionIncrementAttr = tokenStream.addAttribute(classOf[PositionIncrementAttribute])
    val typeAttr = tokenStream.addAttribute(classOf[TypeAttribute])  // with JapaneseAnalyzer, the type cannot be read unless this attribute is added

    // Kuromoji Additional Attributes
    val baseFormAttr = tokenStream.addAttribute(classOf[BaseFormAttribute])
    val partOfSpeechAttr = tokenStream.addAttribute(classOf[PartOfSpeechAttribute])
    val readingAttr = tokenStream.addAttribute(classOf[ReadingAttribute])
    val inflectionAttr = tokenStream.addAttribute(classOf[InflectionAttribute])

    println("<<==========================================")
    println(s"input text => $text")
    println("============================================")

    tokenStream.reset()

    while (tokenStream.incrementToken()) {
      val startOffset = offsetAttr.startOffset
      val endOffset = offsetAttr.endOffset
      val token = charTermAttr.toString
      val posInc = positionIncrementAttr.getPositionIncrement
      val tpe = typeAttr.`type`

      // Kuromoji Additional Attributes
      val baseForm = baseFormAttr.getBaseForm
      val partOfSpeech = partOfSpeechAttr.getPartOfSpeech
      val reading = readingAttr.getReading
      val pronunciation = readingAttr.getPronunciation
      val inflectionForm = inflectionAttr.getInflectionForm
      val inflectionType = inflectionAttr.getInflectionType

      println(s"token: $token, startOffset: $startOffset, endOffset: $endOffset, posInc: $posInc, type: $tpe")

      if (partOfSpeech != null) {
        println(s"baseForm: $baseForm, partOfSpeech: $partOfSpeech, reading: $reading, pronunciation: $pronunciation, inflectionForm: $inflectionForm, inflectionType: $inflectionType")
      }
    }

    tokenStream.end()

    println("==========================================>>")
  }
}

As the text to analyze, I supply the following:

    val texts = List(
      "かずひらは、はてなダイアリーを使用しています。",
      "東京メトロ丸ノ内線は、今日も混んでいます。",
      "関西国際空港は、日本の空港です。"
    )

The user-defined dictionary is defined as follows.

# Everything after # is ignored as a comment

# word,segmentation after analysis (space-separated when the word is split),reading,part of speech
かずひら,かずひら,カズヒラ,カスタム名詞
はてなダイアリー,はてなダイアリー,ハテナダイアリー,カスタム名詞
東京メトロ丸ノ内線,東京メトロ 丸ノ内線,トウキョウメトロ マルノウチセン,カスタム名詞

# Part of a dictionary that ships with Kuromoji for testing
日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,テスト名詞

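Now let's run it with this setup. To make the dictionary's effect easy to see, I'll show each result next to the result of a run where the words are not registered in the user-defined dictionary.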
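
One way to produce the "without" side of that comparison is to build the analyzer with no user dictionary at all. The sketch below assumes that passing `null` as the UserDictionary is acceptable, which is my understanding of what the one-argument JapaneseAnalyzer constructor does internally; emptying the dictionary file would be another route. `CompareDictionaries` and its methods are hypothetical names.

```scala
import java.io.StringReader

import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.analysis.ja.{JapaneseAnalyzer, JapaneseTokenizer}
import org.apache.lucene.analysis.ja.dict.UserDictionary
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.util.Version

// Hypothetical helper object; the names are not from the original code
object CompareDictionaries {
  // None => no user dictionary (null goes to the constructor), Some(dict) => use it
  def createAnalyzer(userDict: Option[UserDictionary]): JapaneseAnalyzer =
    new JapaneseAnalyzer(Version.LUCENE_43,
                         userDict.orNull,
                         JapaneseTokenizer.Mode.SEARCH,
                         JapaneseAnalyzer.getDefaultStopSet,
                         JapaneseAnalyzer.getDefaultStopTags)

  // Collects only the term text of each token, in order
  def terms(analyzer: Analyzer, text: String): List[String] = {
    val tokenStream = analyzer.tokenStream("", new StringReader(text))
    val termAttr = tokenStream.addAttribute(classOf[CharTermAttribute])
    val result = List.newBuilder[String]
    try {
      tokenStream.reset()
      while (tokenStream.incrementToken()) result += termAttr.toString
      tokenStream.end()
    } finally {
      tokenStream.close()
    }
    result.result()
  }
}
```

Calling `terms` once with `createAnalyzer(None)` and once with `createAnalyzer(Some(dict))` on the same text gives the two token lists to compare.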

> run
[info] Running KuromojiUserDict

The results for each are as follows.

## With user dictionary entries
Mode => SEARCH Start
<<==========================================
input text => かずひらは、はてなダイアリーを使用しています。
============================================
token: かずひら, startOffset: 0, endOffset: 4, posInc: 1, type: word
baseForm: null, partOfSpeech: カスタム名詞, reading: カズヒラ, pronunciation: null, inflectionForm: null, inflectionType: null
token: はてなダイアリー, startOffset: 6, endOffset: 14, posInc: 2, type: word
baseForm: null, partOfSpeech: カスタム名詞, reading: ハテナダイアリー, pronunciation: null, inflectionForm: null, inflectionType: null
token: 使用, startOffset: 15, endOffset: 17, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-サ変接続, reading: シヨウ, pronunciation: シヨー, inflectionForm: null, inflectionType: null
==========================================>>
Mode => SEARCH End

## Without user dictionary entries
<<==========================================
input text => かずひらは、はてなダイアリーを使用しています。
============================================
token: く, startOffset: 0, endOffset: 1, posInc: 1, type: word
baseForm: く, partOfSpeech: 動詞-非自立, reading: カ, pronunciation: カ, inflectionForm: 未然形, inflectionType: 五段・カ行促音便
token: ひる, startOffset: 2, endOffset: 4, posInc: 2, type: word
baseForm: ひる, partOfSpeech: 動詞-自立, reading: ヒラ, pronunciation: ヒラ, inflectionForm: 未然形, inflectionType: 五段・ラ行
token: はてな, startOffset: 6, endOffset: 9, posInc: 2, type: word
baseForm: null, partOfSpeech: 感動詞, reading: ハテナ, pronunciation: ハテナ, inflectionForm: null, inflectionType: null
token: ダイアリ, startOffset: 9, endOffset: 14, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: ダイアリー, pronunciation: ダイアリー, inflectionForm: null, inflectionType: null
token: 使用, startOffset: 15, endOffset: 17, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-サ変接続, reading: シヨウ, pronunciation: シヨー, inflectionForm: null, inflectionType: null
==========================================>>
Mode => SEARCH End

Without the dictionary entry, "かずひら" gets mangled pretty badly...

## With user dictionary entries
Mode => SEARCH Start
<<==========================================
input text => 東京メトロ丸ノ内線は、今日も混んでいます。
============================================
token: 東京メトロ, startOffset: 0, endOffset: 5, posInc: 1, type: word
baseForm: null, partOfSpeech: カスタム名詞, reading: トウキョウメトロ, pronunciation: null, inflectionForm: null, inflectionType: null
token: 丸ノ内線, startOffset: 5, endOffset: 9, posInc: 1, type: word
baseForm: null, partOfSpeech: カスタム名詞, reading: マルノウチセン, pronunciation: null, inflectionForm: null, inflectionType: null
token: 今日, startOffset: 11, endOffset: 13, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-副詞可能, reading: キョウ, pronunciation: キョー, inflectionForm: null, inflectionType: null
token: 混む, startOffset: 14, endOffset: 16, posInc: 2, type: word
baseForm: 混む, partOfSpeech: 動詞-自立, reading: コン, pronunciation: コン, inflectionForm: 連用タ接続, inflectionType: 五段・マ行
==========================================>>
Mode => SEARCH End

## Without user dictionary entries
Mode => SEARCH Start
<<==========================================
input text => 東京メトロ丸ノ内線は、今日も混んでいます。
============================================
token: 東京, startOffset: 0, endOffset: 2, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-地域-一般, reading: トウキョウ, pronunciation: トーキョー, inflectionForm: null, inflectionType: null
token: メトロ, startOffset: 2, endOffset: 5, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: メトロ, pronunciation: メトロ, inflectionForm: null, inflectionType: null
token: 丸ノ内線, startOffset: 5, endOffset: 9, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-一般, reading: マルノウチセン, pronunciation: マルノウチセン, inflectionForm: null, inflectionType: null
token: 今日, startOffset: 11, endOffset: 13, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-副詞可能, reading: キョウ, pronunciation: キョー, inflectionForm: null, inflectionType: null
token: 混む, startOffset: 14, endOffset: 16, posInc: 2, type: word
baseForm: 混む, partOfSpeech: 動詞-自立, reading: コン, pronunciation: コン, inflectionForm: 連用タ接続, inflectionType: 五段・マ行
==========================================>>
Mode => SEARCH End

"東京メトロ" gets split into "東京" and "メトロ" unless it is written in the dictionary.

## With user dictionary entries
Mode => SEARCH Start
<<==========================================
input text => 関西国際空港は、日本の空港です。
============================================
token: 関西, startOffset: 0, endOffset: 2, posInc: 1, type: word
baseForm: null, partOfSpeech: テスト名詞, reading: カンサイ, pronunciation: null, inflectionForm: null, inflectionType: null
token: 国際, startOffset: 2, endOffset: 4, posInc: 1, type: word
baseForm: null, partOfSpeech: テスト名詞, reading: コクサイ, pronunciation: null, inflectionForm: null, inflectionType: null
token: 空港, startOffset: 4, endOffset: 6, posInc: 1, type: word
baseForm: null, partOfSpeech: テスト名詞, reading: クウコウ, pronunciation: null, inflectionForm: null, inflectionType: null
token: 日本, startOffset: 8, endOffset: 10, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-地域-国, reading: ニッポン, pronunciation: ニッポン, inflectionForm: null, inflectionType: null
token: 空港, startOffset: 11, endOffset: 13, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: クウコウ, pronunciation: クーコー, inflectionForm: null, inflectionType: null
==========================================>>
Mode => SEARCH End

## Without user dictionary entries
Mode => SEARCH Start
<<==========================================
input text => 関西国際空港は、日本の空港です。
============================================
token: 関西, startOffset: 0, endOffset: 2, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-地域-一般, reading: カンサイ, pronunciation: カンサイ, inflectionForm: null, inflectionType: null
token: 関西国際空港, startOffset: 0, endOffset: 6, posInc: 0, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: カンサイコクサイクウコウ, pronunciation: カンサイコクサイクーコー, inflectionForm: null, inflectionType: null
token: 国際, startOffset: 2, endOffset: 4, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: コクサイ, pronunciation: コクサイ, inflectionForm: null, inflectionType: null
token: 空港, startOffset: 4, endOffset: 6, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: クウコウ, pronunciation: クーコー, inflectionForm: null, inflectionType: null
token: 日本, startOffset: 8, endOffset: 10, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-地域-国, reading: ニッポン, pronunciation: ニッポン, inflectionForm: null, inflectionType: null
token: 空港, startOffset: 11, endOffset: 13, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: クウコウ, pronunciation: クーコー, inflectionForm: null, inflectionType: null
==========================================>>
Mode => SEARCH End

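In this case, the sample user dictionary only supplies the reading "クウコウ" (and no pronunciation), while the built-in entry for "空港" also carries the pronunciation "クーコー". Probably because of that, the two tokens look like the same word at a glance, and yet the output is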

token: 空港, startOffset: 4, endOffset: 6, posInc: 1, type: word
baseForm: null, partOfSpeech: テスト名詞, reading: クウコウ, pronunciation: null, inflectionForm: null, inflectionType: null
token: 空港, startOffset: 11, endOffset: 13, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: クウコウ, pronunciation: クーコー, inflectionForm: null, inflectionType: null

so the two are reported with different attributes (part of speech and pronunciation).

Also, the user-defined dictionary apparently takes priority over the default dictionary, so if you register an absurd entry such as

空港,空 港,ソラ ミナト,テスト名詞

then the earlier sentence "関西国際空港は、日本の空港です。" is split as follows.

Mode => SEARCH Start
<<==========================================
input text => 関西国際空港は、日本の空港です。
============================================
token: 関西, startOffset: 0, endOffset: 2, posInc: 1, type: word
baseForm: null, partOfSpeech: テスト名詞, reading: カンサイ, pronunciation: null, inflectionForm: null, inflectionType: null
token: 国際, startOffset: 2, endOffset: 4, posInc: 1, type: word
baseForm: null, partOfSpeech: テスト名詞, reading: コクサイ, pronunciation: null, inflectionForm: null, inflectionType: null
token: 空港, startOffset: 4, endOffset: 6, posInc: 1, type: word
baseForm: null, partOfSpeech: テスト名詞, reading: クウコウ, pronunciation: null, inflectionForm: null, inflectionType: null
token: 日本, startOffset: 8, endOffset: 10, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-地域-国, reading: ニッポン, pronunciation: ニッポン, inflectionForm: null, inflectionType: null
token: 空, startOffset: 11, endOffset: 12, posInc: 2, type: word
baseForm: null, partOfSpeech: テスト名詞, reading: ソラ, pronunciation: null, inflectionForm: null, inflectionType: null
token: 港, startOffset: 12, endOffset: 13, posInc: 1, type: word
baseForm: null, partOfSpeech: テスト名詞, reading: ミナト, pronunciation: null, inflectionForm: null, inflectionType: null
==========================================>>
Mode => SEARCH End

Even here, though, the "空港" that was split out of "関西国際空港" still keeps the reading "クウコウ".

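In addition, if you want a word from the user-defined dictionary to be dropped from the analysis results, I think giving it a part of speech that is registered as a stop tag should work, though I may be wrong about that.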

For example, if you change the part of speech of "かずひら" to 記号 (symbol),

# Everything after # is ignored as a comment

# word,segmentation after analysis (space-separated when the word is split),reading,part of speech
かずひら,かずひら,カズヒラ,記号
はてなダイアリー,はてなダイアリー,ハテナダイアリー,カスタム名詞

it disappears from the analysis results.

Mode => SEARCH Start
<<==========================================
input text => かずひらは、はてなダイアリーを使用しています。
============================================
token: はてなダイアリー, startOffset: 6, endOffset: 14, posInc: 3, type: word
baseForm: null, partOfSpeech: カスタム名詞, reading: ハテナダイアリー, pronunciation: null, inflectionForm: null, inflectionType: null
token: 使用, startOffset: 15, endOffset: 17, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-サ変接続, reading: シヨウ, pronunciation: シヨー, inflectionForm: null, inflectionType: null
==========================================>>
Mode => SEARCH End
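
The posInc: 3 on "はてなダイアリー" in this output can be reproduced with a small model of stop-tag filtering, written in plain Scala without Lucene. This is only a sketch of the idea: `Morpheme`, `Emitted` and `StopTagSketch` are hypothetical names, the filter is assumed to drop a token when its part-of-speech string is exactly one of the stop tags, and the tag "助詞-係助詞" for "は" is my assumption rather than something shown in the output.

```scala
// Hypothetical model types (not Lucene classes)
case class Morpheme(surface: String, pos: String)
case class Emitted(surface: String, posInc: Int)

object StopTagSketch {
  // Drops tokens whose part of speech is a stop tag, and adds the number of
  // dropped tokens to the position increment of the next surviving token.
  def filter(tokens: List[Morpheme], stopTags: Set[String]): List[Emitted] = {
    val out = List.newBuilder[Emitted]
    var skipped = 0
    for (token <- tokens) {
      if (stopTags.contains(token.pos)) {
        skipped += 1
      } else {
        out += Emitted(token.surface, skipped + 1)
        skipped = 0
      }
    }
    out.result()
  }
}

object StopTagSketchExample extends App {
  // First three tokens of the sentence; the comma is assumed to be discarded earlier
  val tokens = List(
    Morpheme("かずひら", "記号"),
    Morpheme("は", "助詞-係助詞"),
    Morpheme("はてなダイアリー", "カスタム名詞")
  )
  val stopTags = Set("記号", "助詞-係助詞")

  // Both leading tokens are dropped, so the survivor gets posInc 1 + 2 = 3
  println(StopTagSketch.filter(tokens, stopTags))
}
```

With "かずひら" tagged as カスタム名詞 instead, only "は" is dropped and "はてなダイアリー" gets posInc 2, which matches the earlier run.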

That's all for now.
