Playing with Lucene Analyzers

A while ago I tried out Infinispan's implementation of Lucene's Directory, but I don't actually know Lucene very well, so I decided to take this as a chance to study it a little.

I also use Solr at work, if only indirectly.

Apache Lucene
http://lucene.apache.org/core/index.html

So, I'll start with Analyzer. The program is written in Scala.

First, build.sbt.

name := "lucene-analyzers"

version := "0.0.1-SNAPSHOT"

scalaVersion := "2.10.1"

organization := "littlewings"

scalacOptions += "-deprecation"

libraryDependencies ++= Seq(
  "org.apache.lucene" % "lucene-core" % "4.3.0",
  "org.apache.lucene" % "lucene-analyzers-common" % "4.3.0",
  "org.apache.lucene" % "lucene-analyzers-kuromoji" % "4.3.0"
)

The Lucene version used here is 4.3.0, and the Kuromoji module is included among the analyzers.

Next, the sample code.
src/main/scala/LuceneAnalyzers.scala

import java.io.StringReader

import org.apache.lucene.analysis.{Analyzer, TokenStream}
import org.apache.lucene.analysis.cjk.CJKAnalyzer
import org.apache.lucene.analysis.core.WhitespaceAnalyzer
import org.apache.lucene.analysis.core.KeywordAnalyzer
import org.apache.lucene.analysis.ja.{JapaneseAnalyzer, JapaneseTokenizer}
import org.apache.lucene.analysis.ja.tokenattributes.{BaseFormAttribute, PartOfSpeechAttribute, ReadingAttribute, InflectionAttribute}
import org.apache.lucene.analysis.ja.dict.UserDictionary
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.analysis.tokenattributes.{CharTermAttribute, OffsetAttribute, PositionIncrementAttribute, TypeAttribute}
import org.apache.lucene.util.Version

object LuceneAnalyzers {
  def main(args: Array[String]): Unit = {
    val luceneVersion = Version.LUCENE_43

    val texts =
      List(
        "すもももももももものうち。",
        "メガネは顔の一部です。",
        "日本経済新聞でモバゲーの記事を読んだ。",
        "Java, Scala, Groovy, Clojure",
        "ＬＵＣＥＮＥ、ＳＯＬＲ、Lucene, Solr",
        "ｱｲｳｴｵカキクケコさしすせそABCＸＹＺ123４５６",
        "Lucene is a full-featured text search engine library written in Java."
      )

    usingTokenStream(/* instantiate an Analyzer here and pass it in */, texts: _*)(displayTokens)
  }

  def usingTokenStream(analyzer: Analyzer, texts: String*)(body: (String, TokenStream) => Unit): Unit = {
    println(s"Analyzer => ${analyzer.getClass.getName} Start")
    for (text <- texts) {
      val reader = new StringReader(text)
      val tokenStream = analyzer.tokenStream("", reader)

      try {
        body(text, tokenStream)
      } finally {
        tokenStream.close()
      }
    }
    println(s"Analyzer => ${analyzer.getClass.getName} End")
    println()
  }

  def displayTokens(text: String, tokenStream: TokenStream): Unit = {
    val charTermAttr = tokenStream.addAttribute(classOf[CharTermAttribute])
    val offsetAttr = tokenStream.addAttribute(classOf[OffsetAttribute])
    val positionIncrementAttr = tokenStream.addAttribute(classOf[PositionIncrementAttribute])
    val typeAttr = tokenStream.addAttribute(classOf[TypeAttribute])  // with JapaneseAnalyzer, the type cannot be read unless this attribute is added

    // Kuromoji Additional Attributes
    val baseFormAttr = tokenStream.addAttribute(classOf[BaseFormAttribute])
    val partOfSpeechAttr = tokenStream.addAttribute(classOf[PartOfSpeechAttribute])
    val readingAttr = tokenStream.addAttribute(classOf[ReadingAttribute])
    val inflectionAttr = tokenStream.addAttribute(classOf[InflectionAttribute])

    println("<<==========================================")
    println(s"input text => $text")
    println("============================================")

    tokenStream.reset()

    while (tokenStream.incrementToken()) {
      val startOffset = offsetAttr.startOffset
      val endOffset = offsetAttr.endOffset
      val token = charTermAttr.toString
      val posInc = positionIncrementAttr.getPositionIncrement
      val tpe = typeAttr.`type`

      // Kuromoji Additional Attributes
      val baseForm = baseFormAttr.getBaseForm
      val partOfSpeech = partOfSpeechAttr.getPartOfSpeech
      val reading = readingAttr.getReading
      val pronunciation = readingAttr.getPronunciation
      val inflectionForm = inflectionAttr.getInflectionForm
      val inflectionType = inflectionAttr.getInflectionType

      println(s"token: $token, startOffset: $startOffset, endOffset: $endOffset, posInc: $posInc, type: $tpe")

      if (partOfSpeech != null) {
        println(s"baseForm: $baseForm, partOfSpeech: $partOfSpeech, reading: $reading, pronunciation: $pronunciation, inflectionForm: $inflectionForm, inflectionType: $inflectionType")
      }
    }

    tokenStream.end()

    println("==========================================>>")
  }
}

You get a TokenStream from an Analyzer by passing a Reader to its tokenStream method.

      val tokenStream = analyzer.tokenStream("", reader)

      try {
        body(text, tokenStream)
      } finally {
        tokenStream.close()
      }

Once you are done with a TokenStream, you are expected to close it, which is why the call sits in a finally block.

With a TokenStream, you call reset first and then advance through the tokens one by one with incrementToken.

    tokenStream.reset()

    while (tokenStream.incrementToken()) {
      // process each token
    }

    tokenStream.end()

When the loop finishes, call TokenStream#end. In this sample, the information available from the various Attributes is printed to the console.
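
The reset / incrementToken / end / close sequence above can also be wrapped in a small helper that returns the terms as a List, which is handy for quick comparisons. This is only a sketch under the same Lucene 4.3.0 dependencies as the build.sbt above; the names TokenCollector and tokens are made up for this example.

```scala
import java.io.StringReader
import scala.collection.mutable.ListBuffer

import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.analysis.core.WhitespaceAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.util.Version

// Hypothetical helper (not part of Lucene): collects the terms an Analyzer produces.
object TokenCollector {
  def tokens(analyzer: Analyzer, text: String): List[String] = {
    val tokenStream = analyzer.tokenStream("", new StringReader(text))
    val termAttr = tokenStream.addAttribute(classOf[CharTermAttribute])
    val terms = ListBuffer.empty[String]

    try {
      tokenStream.reset()                    // must be called before the first incrementToken
      while (tokenStream.incrementToken()) { // returns false when the input is exhausted
        terms += termAttr.toString           // copy the term; the attribute instance is reused
      }
      tokenStream.end()                      // signals the end of the stream
    } finally {
      tokenStream.close()                    // always release the stream
    }

    terms.toList
  }

  def main(args: Array[String]): Unit = {
    val analyzer = new WhitespaceAnalyzer(Version.LUCENE_43)
    println(tokens(analyzer, "Java, Scala, Groovy, Clojure"))
  }
}
```

The term is copied with toString inside the loop because the same attribute object is updated on every incrementToken call.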

The input is the following list of strings:

    val texts =
      List(
        "すもももももももものうち。",
        "メガネは顔の一部です。",
        "日本経済新聞でモバゲーの記事を読んだ。",
        "Java, Scala, Groovy, Clojure",
        "ＬＵＣＥＮＥ、ＳＯＬＲ、Lucene, Solr",
        "ｱｲｳｴｵカキクケコさしすせそABCＸＹＺ123４５６",
        "Lucene is a full-featured text search engine library written in Java."
      )

As the comment in the code says, instantiate whichever Analyzer you want to try and pass it in.

    usingTokenStream(/* instantiate an Analyzer here and pass it in */, texts: _*)(displayTokens)

StandardAnalyzer

As the name says, this is the standard Analyzer. It runs StandardTokenizer followed by StandardFilter, LowerCaseFilter, and StopFilter.

StandardAnalyzer
http://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/standard/StandardAnalyzer.html

    usingTokenStream(new StandardAnalyzer(luceneVersion), texts: _*)(displayTokens)

Running it with this sample gives the following result.

Analyzer => org.apache.lucene.analysis.standard.StandardAnalyzer Start
<<==========================================
input text => すもももももももものうち。
============================================
token: す, startOffset: 0, endOffset: 1, posInc: 1, type: <HIRAGANA>
token: も, startOffset: 1, endOffset: 2, posInc: 1, type: <HIRAGANA>
token: も, startOffset: 2, endOffset: 3, posInc: 1, type: <HIRAGANA>
token: も, startOffset: 3, endOffset: 4, posInc: 1, type: <HIRAGANA>
token: も, startOffset: 4, endOffset: 5, posInc: 1, type: <HIRAGANA>
token: も, startOffset: 5, endOffset: 6, posInc: 1, type: <HIRAGANA>
token: も, startOffset: 6, endOffset: 7, posInc: 1, type: <HIRAGANA>
token: も, startOffset: 7, endOffset: 8, posInc: 1, type: <HIRAGANA>
token: も, startOffset: 8, endOffset: 9, posInc: 1, type: <HIRAGANA>
token: の, startOffset: 9, endOffset: 10, posInc: 1, type: <HIRAGANA>
token: う, startOffset: 10, endOffset: 11, posInc: 1, type: <HIRAGANA>
token: ち, startOffset: 11, endOffset: 12, posInc: 1, type: <HIRAGANA>
==========================================>>
<<==========================================
input text => メガネは顔の一部です。
============================================
token: メガネ, startOffset: 0, endOffset: 3, posInc: 1, type: <KATAKANA>
token: は, startOffset: 3, endOffset: 4, posInc: 1, type: <HIRAGANA>
token: 顔, startOffset: 4, endOffset: 5, posInc: 1, type: <IDEOGRAPHIC>
token: の, startOffset: 5, endOffset: 6, posInc: 1, type: <HIRAGANA>
token: 一, startOffset: 6, endOffset: 7, posInc: 1, type: <IDEOGRAPHIC>
token: 部, startOffset: 7, endOffset: 8, posInc: 1, type: <IDEOGRAPHIC>
token: で, startOffset: 8, endOffset: 9, posInc: 1, type: <HIRAGANA>
token: す, startOffset: 9, endOffset: 10, posInc: 1, type: <HIRAGANA>
==========================================>>
<<==========================================
input text => 日本経済新聞でモバゲーの記事を読んだ。
============================================
token: 日, startOffset: 0, endOffset: 1, posInc: 1, type: <IDEOGRAPHIC>
token: 本, startOffset: 1, endOffset: 2, posInc: 1, type: <IDEOGRAPHIC>
token: 経, startOffset: 2, endOffset: 3, posInc: 1, type: <IDEOGRAPHIC>
token: 済, startOffset: 3, endOffset: 4, posInc: 1, type: <IDEOGRAPHIC>
token: 新, startOffset: 4, endOffset: 5, posInc: 1, type: <IDEOGRAPHIC>
token: 聞, startOffset: 5, endOffset: 6, posInc: 1, type: <IDEOGRAPHIC>
token: で, startOffset: 6, endOffset: 7, posInc: 1, type: <HIRAGANA>
token: モバゲー, startOffset: 7, endOffset: 11, posInc: 1, type: <KATAKANA>
token: の, startOffset: 11, endOffset: 12, posInc: 1, type: <HIRAGANA>
token: 記, startOffset: 12, endOffset: 13, posInc: 1, type: <IDEOGRAPHIC>
token: 事, startOffset: 13, endOffset: 14, posInc: 1, type: <IDEOGRAPHIC>
token: を, startOffset: 14, endOffset: 15, posInc: 1, type: <HIRAGANA>
token: 読, startOffset: 15, endOffset: 16, posInc: 1, type: <IDEOGRAPHIC>
token: ん, startOffset: 16, endOffset: 17, posInc: 1, type: <HIRAGANA>
token: だ, startOffset: 17, endOffset: 18, posInc: 1, type: <HIRAGANA>
==========================================>>
<<==========================================
input text => Java, Scala, Groovy, Clojure
============================================
token: java, startOffset: 0, endOffset: 4, posInc: 1, type: <ALPHANUM>
token: scala, startOffset: 6, endOffset: 11, posInc: 1, type: <ALPHANUM>
token: groovy, startOffset: 13, endOffset: 19, posInc: 1, type: <ALPHANUM>
token: clojure, startOffset: 21, endOffset: 28, posInc: 1, type: <ALPHANUM>
==========================================>>
<<==========================================
input text => ＬＵＣＥＮＥ、ＳＯＬＲ、Lucene, Solr
============================================
token: ｌｕｃｅｎｅ, startOffset: 0, endOffset: 6, posInc: 1, type: <ALPHANUM>
token: ｓｏｌｒ, startOffset: 7, endOffset: 11, posInc: 1, type: <ALPHANUM>
token: lucene, startOffset: 12, endOffset: 18, posInc: 1, type: <ALPHANUM>
token: solr, startOffset: 20, endOffset: 24, posInc: 1, type: <ALPHANUM>
==========================================>>
<<==========================================
input text => ｱｲｳｴｵカキクケコさしすせそABCＸＹＺ123４５６
============================================
token: ｱｲｳｴｵカキクケコ, startOffset: 0, endOffset: 10, posInc: 1, type: <KATAKANA>
token: さ, startOffset: 10, endOffset: 11, posInc: 1, type: <HIRAGANA>
token: し, startOffset: 11, endOffset: 12, posInc: 1, type: <HIRAGANA>
token: す, startOffset: 12, endOffset: 13, posInc: 1, type: <HIRAGANA>
token: せ, startOffset: 13, endOffset: 14, posInc: 1, type: <HIRAGANA>
token: そ, startOffset: 14, endOffset: 15, posInc: 1, type: <HIRAGANA>
token: abcｘｙｚ123４５６, startOffset: 15, endOffset: 27, posInc: 1, type: <ALPHANUM>
==========================================>>
<<==========================================
input text => Lucene is a full-featured text search engine library written in Java.
============================================
token: lucene, startOffset: 0, endOffset: 6, posInc: 1, type: <ALPHANUM>
token: full, startOffset: 12, endOffset: 16, posInc: 3, type: <ALPHANUM>
token: featured, startOffset: 17, endOffset: 25, posInc: 1, type: <ALPHANUM>
token: text, startOffset: 26, endOffset: 30, posInc: 1, type: <ALPHANUM>
token: search, startOffset: 31, endOffset: 37, posInc: 1, type: <ALPHANUM>
token: engine, startOffset: 38, endOffset: 44, posInc: 1, type: <ALPHANUM>
token: library, startOffset: 45, endOffset: 52, posInc: 1, type: <ALPHANUM>
token: written, startOffset: 53, endOffset: 60, posInc: 1, type: <ALPHANUM>
token: java, startOffset: 64, endOffset: 68, posInc: 2, type: <ALPHANUM>
==========================================>>
Analyzer => org.apache.lucene.analysis.standard.StandardAnalyzer End

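For CJK text, hiragana and kanji are emitted one character at a time (uni-gram), while runs of katakana such as メガネ stay together. English words are all lowercased, and the stop words (is, a, in) are dropped, which is why some tokens have a posInc greater than 1. It also seems to know about hiragana as a token type; apparently that has been the case since Lucene 3.4.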
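
The posInc values of 3 for full and 2 for java in the output above come from the removed stop words. The following plain-Scala sketch shows the bookkeeping involved. It does not use Lucene, and the stop word set is a hypothetical three-word list picked for this example, not Lucene's default.

```scala
object PositionIncrementSketch {
  // Hypothetical stop word set for illustration only (not Lucene's default list).
  val stopWords = Set("is", "a", "in")

  // Returns (token, positionIncrement) pairs. Stop words are dropped, but each
  // dropped word adds 1 to the increment of the next surviving token.
  def withIncrements(tokens: Seq[String]): List[(String, Int)] = {
    val result = List.newBuilder[(String, Int)]
    var skipped = 0
    for (token <- tokens) {
      if (stopWords.contains(token)) {
        skipped += 1
      } else {
        result += ((token, skipped + 1))
        skipped = 0
      }
    }
    result.result()
  }

  def main(args: Array[String]): Unit = {
    val tokens = Seq("lucene", "is", "a", "full", "featured", "written", "in", "java")
    withIncrements(tokens).foreach { case (token, inc) =>
      println(s"token: $token, posInc: $inc")
    }
  }
}
```

Keeping the gap in the position increment is what lets phrase queries stay aware that words were removed between two tokens.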

WhitespaceAnalyzer

An Analyzer that splits the input into words on whitespace such as spaces and tabs.

WhitespaceAnalyzer
http://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/core/WhitespaceAnalyzer.html

    usingTokenStream(new WhitespaceAnalyzer(luceneVersion), texts: _*)(displayTokens)

Result:

Analyzer => org.apache.lucene.analysis.core.WhitespaceAnalyzer Start
<<==========================================
input text => すもももももももものうち。
============================================
token: すもももももももものうち。, startOffset: 0, endOffset: 13, posInc: 1, type: word
==========================================>>
<<==========================================
input text => メガネは顔の一部です。
============================================
token: メガネは顔の一部です。, startOffset: 0, endOffset: 11, posInc: 1, type: word
==========================================>>
<<==========================================
input text => 日本経済新聞でモバゲーの記事を読んだ。
============================================
token: 日本経済新聞でモバゲーの記事を読んだ。, startOffset: 0, endOffset: 19, posInc: 1, type: word
==========================================>>
<<==========================================
input text => Java, Scala, Groovy, Clojure
============================================
token: Java,, startOffset: 0, endOffset: 5, posInc: 1, type: word
token: Scala,, startOffset: 6, endOffset: 12, posInc: 1, type: word
token: Groovy,, startOffset: 13, endOffset: 20, posInc: 1, type: word
token: Clojure, startOffset: 21, endOffset: 28, posInc: 1, type: word
==========================================>>
<<==========================================
input text => ＬＵＣＥＮＥ、ＳＯＬＲ、Lucene, Solr
============================================
token: ＬＵＣＥＮＥ、ＳＯＬＲ、Lucene,, startOffset: 0, endOffset: 19, posInc: 1, type: word
token: Solr, startOffset: 20, endOffset: 24, posInc: 1, type: word
==========================================>>
<<==========================================
input text => ｱｲｳｴｵカキクケコさしすせそABCＸＹＺ123４５６
============================================
token: ｱｲｳｴｵカキクケコさしすせそABCＸＹＺ123４５６, startOffset: 0, endOffset: 27, posInc: 1, type: word
==========================================>>
<<==========================================
input text => Lucene is a full-featured text search engine library written in Java.
============================================
token: Lucene, startOffset: 0, endOffset: 6, posInc: 1, type: word
token: is, startOffset: 7, endOffset: 9, posInc: 1, type: word
token: a, startOffset: 10, endOffset: 11, posInc: 1, type: word
token: full-featured, startOffset: 12, endOffset: 25, posInc: 1, type: word
token: text, startOffset: 26, endOffset: 30, posInc: 1, type: word
token: search, startOffset: 31, endOffset: 37, posInc: 1, type: word
token: engine, startOffset: 38, endOffset: 44, posInc: 1, type: word
token: library, startOffset: 45, endOffset: 52, posInc: 1, type: word
token: written, startOffset: 53, endOffset: 60, posInc: 1, type: word
token: in, startOffset: 61, endOffset: 63, posInc: 1, type: word
token: Java., startOffset: 64, endOffset: 69, posInc: 1, type: word
==========================================>>
Analyzer => org.apache.lucene.analysis.core.WhitespaceAnalyzer End

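It behaves plausibly for English, but Japanese sentences come out as a single token and punctuation stays attached to the words, so I suspect it is rarely chosen explicitly.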
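
To make the behavior concrete, here is a rough plain-Scala approximation of whitespace splitting that also reports start and end offsets. It is a sketch, not Lucene's implementation: it uses the regular expression \S+, whose notion of whitespace may differ from the one Lucene applies for some Unicode characters.

```scala
object WhitespaceSplitSketch {
  // One or more non-whitespace characters form a token.
  private val NonWhitespace = """\S+""".r

  // Returns (token, startOffset, endOffset) triples.
  def split(text: String): List[(String, Int, Int)] =
    NonWhitespace.findAllMatchIn(text).map(m => (m.matched, m.start, m.end)).toList

  def main(args: Array[String]): Unit = {
    split("Java, Scala, Groovy, Clojure").foreach { case (token, start, end) =>
      println(s"token: $token, startOffset: $start, endOffset: $end")
    }
  }
}
```

As in the WhitespaceAnalyzer output above, the commas stay glued to the preceding word because nothing but whitespace counts as a boundary.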

KeywordAnalyzer

An Analyzer that treats the entire input as a single token. It is presumably meant for values such as IDs that you specifically do not want split into words.

KeywordAnalyzer
http://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/core/KeywordAnalyzer.html

    usingTokenStream(new KeywordAnalyzer, texts: _*)(displayTokens)

Result:

Analyzer => org.apache.lucene.analysis.core.KeywordAnalyzer Start
<<==========================================
input text => すもももももももものうち。
============================================
token: すもももももももものうち。, startOffset: 0, endOffset: 13, posInc: 1, type: word
==========================================>>
<<==========================================
input text => メガネは顔の一部です。
============================================
token: メガネは顔の一部です。, startOffset: 0, endOffset: 11, posInc: 1, type: word
==========================================>>
<<==========================================
input text => 日本経済新聞でモバゲーの記事を読んだ。
============================================
token: 日本経済新聞でモバゲーの記事を読んだ。, startOffset: 0, endOffset: 19, posInc: 1, type: word
==========================================>>
<<==========================================
input text => Java, Scala, Groovy, Clojure
============================================
token: Java, Scala, Groovy, Clojure, startOffset: 0, endOffset: 28, posInc: 1, type: word
==========================================>>
<<==========================================
input text => ＬＵＣＥＮＥ、ＳＯＬＲ、Lucene, Solr
============================================
token: ＬＵＣＥＮＥ、ＳＯＬＲ、Lucene, Solr, startOffset: 0, endOffset: 24, posInc: 1, type: word
==========================================>>
<<==========================================
input text => ｱｲｳｴｵカキクケコさしすせそABCＸＹＺ123４５６
============================================
token: ｱｲｳｴｵカキクケコさしすせそABCＸＹＺ123４５６, startOffset: 0, endOffset: 27, posInc: 1, type: word
==========================================>>
<<==========================================
input text => Lucene is a full-featured text search engine library written in Java.
============================================
token: Lucene is a full-featured text search engine library written in Java., startOffset: 0, endOffset: 69, posInc: 1, type: word
==========================================>>
Analyzer => org.apache.lucene.analysis.core.KeywordAnalyzer End

CJKAnalyzer

A bi-gram Analyzer. CJK text is split into overlapping two-character tokens, while English words are recognized as such and tokenized as whole words.

CJKAnalyzer
http://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKAnalyzer.html
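
The overlapping two-character split can be illustrated in a few lines of plain Scala. This is only a sketch of the idea, not how Lucene implements it: it works on UTF-16 chars (so it ignores surrogate pairs) and assumes the caller passes in a single run of CJK characters with punctuation already removed.

```scala
object BigramSketch {
  // Returns (bigram, startOffset, endOffset) triples for one run of CJK characters.
  // A run of length 1 yields that single character as-is.
  def bigrams(run: String): List[(String, Int, Int)] =
    run.sliding(2).zipWithIndex.map { case (gram, i) => (gram, i, i + gram.length) }.toList

  def main(args: Array[String]): Unit = {
    bigrams("メガネは顔の一部です").foreach { case (gram, start, end) =>
      println(s"token: $gram, startOffset: $start, endOffset: $end")
    }
  }
}
```

Each character except the first and last ends up in two tokens, which is why a bi-gram index can match any substring of two or more characters at the cost of a larger index.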

These days it appears to be built from StandardTokenizer combined with CJKWidthFilter, LowerCaseFilter, CJKBigramFilter, and StopFilter.

    usingTokenStream(new CJKAnalyzer(luceneVersion), texts: _*)(displayTokens)

Result:

Analyzer => org.apache.lucene.analysis.cjk.CJKAnalyzer Start
<<==========================================
input text => すもももももももものうち。
============================================
token: すも, startOffset: 0, endOffset: 2, posInc: 1, type: <DOUBLE>
token: もも, startOffset: 1, endOffset: 3, posInc: 1, type: <DOUBLE>
token: もも, startOffset: 2, endOffset: 4, posInc: 1, type: <DOUBLE>
token: もも, startOffset: 3, endOffset: 5, posInc: 1, type: <DOUBLE>
token: もも, startOffset: 4, endOffset: 6, posInc: 1, type: <DOUBLE>
token: もも, startOffset: 5, endOffset: 7, posInc: 1, type: <DOUBLE>
token: もも, startOffset: 6, endOffset: 8, posInc: 1, type: <DOUBLE>
token: もも, startOffset: 7, endOffset: 9, posInc: 1, type: <DOUBLE>
token: もの, startOffset: 8, endOffset: 10, posInc: 1, type: <DOUBLE>
token: のう, startOffset: 9, endOffset: 11, posInc: 1, type: <DOUBLE>
token: うち, startOffset: 10, endOffset: 12, posInc: 1, type: <DOUBLE>
==========================================>>
<<==========================================
input text => メガネは顔の一部です。
============================================
token: メガ, startOffset: 0, endOffset: 2, posInc: 1, type: <DOUBLE>
token: ガネ, startOffset: 1, endOffset: 3, posInc: 1, type: <DOUBLE>
token: ネは, startOffset: 2, endOffset: 4, posInc: 1, type: <DOUBLE>
token: は顔, startOffset: 3, endOffset: 5, posInc: 1, type: <DOUBLE>
token: 顔の, startOffset: 4, endOffset: 6, posInc: 1, type: <DOUBLE>
token: の一, startOffset: 5, endOffset: 7, posInc: 1, type: <DOUBLE>
token: 一部, startOffset: 6, endOffset: 8, posInc: 1, type: <DOUBLE>
token: 部で, startOffset: 7, endOffset: 9, posInc: 1, type: <DOUBLE>
token: です, startOffset: 8, endOffset: 10, posInc: 1, type: <DOUBLE>
==========================================>>
<<==========================================
input text => 日本経済新聞でモバゲーの記事を読んだ。
============================================
token: 日本, startOffset: 0, endOffset: 2, posInc: 1, type: <DOUBLE>
token: 本経, startOffset: 1, endOffset: 3, posInc: 1, type: <DOUBLE>
token: 経済, startOffset: 2, endOffset: 4, posInc: 1, type: <DOUBLE>
token: 済新, startOffset: 3, endOffset: 5, posInc: 1, type: <DOUBLE>
token: 新聞, startOffset: 4, endOffset: 6, posInc: 1, type: <DOUBLE>
token: 聞で, startOffset: 5, endOffset: 7, posInc: 1, type: <DOUBLE>
token: でモ, startOffset: 6, endOffset: 8, posInc: 1, type: <DOUBLE>
token: モバ, startOffset: 7, endOffset: 9, posInc: 1, type: <DOUBLE>
token: バゲ, startOffset: 8, endOffset: 10, posInc: 1, type: <DOUBLE>
token: ゲー, startOffset: 9, endOffset: 11, posInc: 1, type: <DOUBLE>
token: ーの, startOffset: 10, endOffset: 12, posInc: 1, type: <DOUBLE>
token: の記, startOffset: 11, endOffset: 13, posInc: 1, type: <DOUBLE>
token: 記事, startOffset: 12, endOffset: 14, posInc: 1, type: <DOUBLE>
token: 事を, startOffset: 13, endOffset: 15, posInc: 1, type: <DOUBLE>
token: を読, startOffset: 14, endOffset: 16, posInc: 1, type: <DOUBLE>
token: 読ん, startOffset: 15, endOffset: 17, posInc: 1, type: <DOUBLE>
token: んだ, startOffset: 16, endOffset: 18, posInc: 1, type: <DOUBLE>
==========================================>>
<<==========================================
input text => Java, Scala, Groovy, Clojure
============================================
token: java, startOffset: 0, endOffset: 4, posInc: 1, type: <ALPHANUM>
token: scala, startOffset: 6, endOffset: 11, posInc: 1, type: <ALPHANUM>
token: groovy, startOffset: 13, endOffset: 19, posInc: 1, type: <ALPHANUM>
token: clojure, startOffset: 21, endOffset: 28, posInc: 1, type: <ALPHANUM>
==========================================>>
<<==========================================
input text => ＬＵＣＥＮＥ、ＳＯＬＲ、Lucene, Solr
============================================
token: lucene, startOffset: 0, endOffset: 6, posInc: 1, type: <ALPHANUM>
token: solr, startOffset: 7, endOffset: 11, posInc: 1, type: <ALPHANUM>
token: lucene, startOffset: 12, endOffset: 18, posInc: 1, type: <ALPHANUM>
token: solr, startOffset: 20, endOffset: 24, posInc: 1, type: <ALPHANUM>
==========================================>>
<<==========================================
input text => ｱｲｳｴｵカキクケコさしすせそABCＸＹＺ123４５６
============================================
token: アイ, startOffset: 0, endOffset: 2, posInc: 1, type: <DOUBLE>
token: イウ, startOffset: 1, endOffset: 3, posInc: 1, type: <DOUBLE>
token: ウエ, startOffset: 2, endOffset: 4, posInc: 1, type: <DOUBLE>
token: エオ, startOffset: 3, endOffset: 5, posInc: 1, type: <DOUBLE>
token: オカ, startOffset: 4, endOffset: 6, posInc: 1, type: <DOUBLE>
token: カキ, startOffset: 5, endOffset: 7, posInc: 1, type: <DOUBLE>
token: キク, startOffset: 6, endOffset: 8, posInc: 1, type: <DOUBLE>
token: クケ, startOffset: 7, endOffset: 9, posInc: 1, type: <DOUBLE>
token: ケコ, startOffset: 8, endOffset: 10, posInc: 1, type: <DOUBLE>
token: コさ, startOffset: 9, endOffset: 11, posInc: 1, type: <DOUBLE>
token: さし, startOffset: 10, endOffset: 12, posInc: 1, type: <DOUBLE>
token: しす, startOffset: 11, endOffset: 13, posInc: 1, type: <DOUBLE>
token: すせ, startOffset: 12, endOffset: 14, posInc: 1, type: <DOUBLE>
token: せそ, startOffset: 13, endOffset: 15, posInc: 1, type: <DOUBLE>
token: abcxyz123456, startOffset: 15, endOffset: 27, posInc: 1, type: <ALPHANUM>
==========================================>>
<<==========================================
input text => Lucene is a full-featured text search engine library written in Java.
============================================
token: lucene, startOffset: 0, endOffset: 6, posInc: 1, type: <ALPHANUM>
token: full, startOffset: 12, endOffset: 16, posInc: 3, type: <ALPHANUM>
token: featured, startOffset: 17, endOffset: 25, posInc: 1, type: <ALPHANUM>
token: text, startOffset: 26, endOffset: 30, posInc: 1, type: <ALPHANUM>
token: search, startOffset: 31, endOffset: 37, posInc: 1, type: <ALPHANUM>
token: engine, startOffset: 38, endOffset: 44, posInc: 1, type: <ALPHANUM>
token: library, startOffset: 45, endOffset: 52, posInc: 1, type: <ALPHANUM>
token: written, startOffset: 53, endOffset: 60, posInc: 1, type: <ALPHANUM>
token: java, startOffset: 64, endOffset: 68, posInc: 2, type: <ALPHANUM>
==========================================>>
Analyzer => org.apache.lucene.analysis.cjk.CJKAnalyzer End
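
One visible difference from StandardAnalyzer is that full-width letters and half-width katakana are folded here: ＬＵＣＥＮＥ comes out as lucene and ｱｲ as アイ, whereas StandardAnalyzer left ｌｕｃｅｎｅ in full width. The sketch below reproduces that effect for the sample strings using the JDK's java.text.Normalizer with NFKC. That choice is an assumption made for illustration; NFKC covers more than width folding, so it should not be read as what CJKWidthFilter does internally.

```scala
import java.text.Normalizer
import java.util.Locale

object WidthFoldingSketch {
  // Approximates width folding with NFKC, then lowercases.
  // NFKC is a stand-in chosen for this example, not Lucene's own rule set.
  def fold(text: String): String =
    Normalizer.normalize(text, Normalizer.Form.NFKC).toLowerCase(Locale.ROOT)

  def main(args: Array[String]): Unit = {
    println(fold("ＬＵＣＥＮＥ"))
    println(fold("ｱｲｳｴｵ"))
    println(fold("ABCＸＹＺ123４５６"))
  }
}
```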

JapaneseAnalyzer

This is apparently Kuromoji, an open-source Japanese morphological analyzer, incorporated into Lucene. It is said to be available since Lucene 3.6 and 4.0.

Kuromoji
http://www.atilika.org/

JapaneseAnalyzer
http://lucene.apache.org/core/4_3_0/analyzers-kuromoji/org/apache/lucene/analysis/ja/JapaneseAnalyzer.html

References:
http://www.mwsoft.jp/programming/lucene/kuromoji.html
http://www.rondhuit.com/solr%E3%81%AE%E6%97%A5%E6%9C%AC%E8%AA%9E%E5%AF%BE%E5%BF%9C.html

    usingTokenStream(new JapaneseAnalyzer(luceneVersion), texts: _*)(displayTokens)

Result:

Analyzer => org.apache.lucene.analysis.ja.JapaneseAnalyzer Start
<<==========================================
input text => すもももももももものうち。
============================================
token: すもも, startOffset: 0, endOffset: 3, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: スモモ, pronunciation: スモモ, inflectionForm: null, inflectionType: null
token: もも, startOffset: 4, endOffset: 6, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: モモ, pronunciation: モモ, inflectionForm: null, inflectionType: null
token: もも, startOffset: 7, endOffset: 9, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: モモ, pronunciation: モモ, inflectionForm: null, inflectionType: null
==========================================>>
<<==========================================
input text => メガネは顔の一部です。
============================================
token: メガネ, startOffset: 0, endOffset: 3, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: メガネ, pronunciation: メガネ, inflectionForm: null, inflectionType: null
token: 顔, startOffset: 4, endOffset: 5, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: カオ, pronunciation: カオ, inflectionForm: null, inflectionType: null
token: 一部, startOffset: 6, endOffset: 8, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-副詞可能, reading: イチブ, pronunciation: イチブ, inflectionForm: null, inflectionType: null
==========================================>>
<<==========================================
input text => 日本経済新聞でモバゲーの記事を読んだ。
============================================
token: 日本, startOffset: 0, endOffset: 2, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-地域-国, reading: ニッポン, pronunciation: ニッポン, inflectionForm: null, inflectionType: null
token: 日本経済新聞, startOffset: 0, endOffset: 6, posInc: 0, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: ニホンケイザイシンブン, pronunciation: ニホンケイザイシンブン, inflectionForm: null, inflectionType: null
token: 経済, startOffset: 2, endOffset: 4, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: ケイザイ, pronunciation: ケイザイ, inflectionForm: null, inflectionType: null
token: 新聞, startOffset: 4, endOffset: 6, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: シンブン, pronunciation: シンブン, inflectionForm: null, inflectionType: null
token: モバゲ, startOffset: 7, endOffset: 11, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: 記事, startOffset: 12, endOffset: 14, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: キジ, pronunciation: キジ, inflectionForm: null, inflectionType: null
token: 読む, startOffset: 15, endOffset: 17, posInc: 2, type: word
baseForm: 読む, partOfSpeech: 動詞-自立, reading: ヨン, pronunciation: ヨン, inflectionForm: 連用タ接続, inflectionType: 五段・マ行
==========================================>>
<<==========================================
input text => Java, Scala, Groovy, Clojure
============================================
token: java, startOffset: 0, endOffset: 4, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: scala, startOffset: 6, endOffset: 11, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: groovy, startOffset: 13, endOffset: 19, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: clojure, startOffset: 21, endOffset: 28, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
==========================================>>
<<==========================================
input text => ＬＵＣＥＮＥ、ＳＯＬＲ、Lucene, Solr
============================================
token: lucene, startOffset: 0, endOffset: 6, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: solr, startOffset: 7, endOffset: 11, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: lucene, startOffset: 12, endOffset: 18, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: solr, startOffset: 20, endOffset: 24, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
==========================================>>
<<==========================================
input text => ｱｲｳｴｵカキクケコさしすせそABCＸＹＺ123４５６
============================================
token: アイウエオカキクケコ, startOffset: 0, endOffset: 10, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-一般, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: しす, startOffset: 11, endOffset: 13, posInc: 2, type: word
baseForm: null, partOfSpeech: 動詞-自立, reading: シス, pronunciation: シス, inflectionForm: 基本形, inflectionType: 五段・サ行
token: そ, startOffset: 14, endOffset: 15, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-接尾-助動詞語幹, reading: ソ, pronunciation: ソ, inflectionForm: null, inflectionType: null
token: abcxyz, startOffset: 15, endOffset: 21, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: 123456, startOffset: 21, endOffset: 27, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-数, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
==========================================>>
<<==========================================
input text => Lucene is a full-featured text search engine library written in Java.
============================================
token: lucene, startOffset: 0, endOffset: 6, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: is, startOffset: 7, endOffset: 9, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: a, startOffset: 10, endOffset: 11, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: full, startOffset: 12, endOffset: 16, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: featured, startOffset: 17, endOffset: 25, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: text, startOffset: 26, endOffset: 30, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: search, startOffset: 31, endOffset: 37, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: engine, startOffset: 38, endOffset: 44, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: library, startOffset: 45, endOffset: 52, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: written, startOffset: 53, endOffset: 60, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: in, startOffset: 61, endOffset: 63, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: java, startOffset: 64, endOffset: 68, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
==========================================>>
Analyzer => org.apache.lucene.analysis.ja.JapaneseAnalyzer End

With JapaneseAnalyzer, a few more Attributes appear to be attached to each token, so I tried printing those as well.

Kuromoji also has a setting called Mode, which apparently takes one of three values:

SEARCH (default)
NORMAL
EXTENDED

JapaneseTokenizer.Mode
http://lucene.apache.org/core/4_3_0/analyzers-kuromoji/org/apache/lucene/analysis/ja/JapaneseTokenizer.Mode.html

The code below passes the same arguments that the constructor taking only the Lucene Version uses. After that, it is just a matter of changing the Mode.

    val userDictionary: UserDictionary = null
    val mode = JapaneseTokenizer.Mode.SEARCH
    //val mode = JapaneseTokenizer.Mode.NORMAL
    //val mode = JapaneseTokenizer.Mode.EXTENDED
    val stopwords = JapaneseAnalyzer.getDefaultStopSet
    val stoptags = JapaneseAnalyzer.getDefaultStopTags

    usingTokenStream(new JapaneseAnalyzer(luceneVersion,
                                          userDictionary,
                                          mode,
                                          stopwords,
                                          stoptags),
                     texts: _*)(displayTokens)
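Since only the Mode changes between runs, the three modes could also be compared in a single run. The following is a minimal sketch and not part of the original program: `JapaneseModes` and `tokensFor` are names made up for illustration, and it collects only the term text instead of going through displayTokens. It assumes Lucene 4.3.0 and the same dependencies as the build.sbt above.

```scala
import java.io.StringReader

import org.apache.lucene.analysis.ja.{JapaneseAnalyzer, JapaneseTokenizer}
import org.apache.lucene.analysis.ja.dict.UserDictionary
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.util.Version

import scala.collection.mutable.ListBuffer

// Hypothetical helper object, introduced only for this sketch.
object JapaneseModes {
  // Returns the term text of every token JapaneseAnalyzer emits for `text` in the given mode.
  def tokensFor(mode: JapaneseTokenizer.Mode, text: String): List[String] = {
    val userDictionary: UserDictionary = null // no user dictionary, as in the code above
    val analyzer = new JapaneseAnalyzer(Version.LUCENE_43,
                                        userDictionary,
                                        mode,
                                        JapaneseAnalyzer.getDefaultStopSet,
                                        JapaneseAnalyzer.getDefaultStopTags)
    val tokenStream = analyzer.tokenStream("", new StringReader(text))
    try {
      val term = tokenStream.addAttribute(classOf[CharTermAttribute])
      val tokens = ListBuffer.empty[String]
      tokenStream.reset()
      while (tokenStream.incrementToken()) {
        tokens += term.toString
      }
      tokenStream.end()
      tokens.toList
    } finally {
      tokenStream.close()
      analyzer.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val text = "日本経済新聞でモバゲーの記事を読んだ。"
    // Mode is a Java enum, so values() lists SEARCH, NORMAL and EXTENDED.
    for (mode <- JapaneseTokenizer.Mode.values) {
      println(s"$mode => ${tokensFor(mode, text).mkString(" | ")}")
    }
  }
}
```

Printing one line per mode makes it easier to spot which inputs are actually affected by the Mode than scrolling through the full dumps.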

Now, let's switch to NORMAL.

Analyzer => org.apache.lucene.analysis.ja.JapaneseAnalyzer Start
<<==========================================
input text => すもももももももものうち。
============================================
token: すもも, startOffset: 0, endOffset: 3, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: スモモ, pronunciation: スモモ, inflectionForm: null, inflectionType: null
token: もも, startOffset: 4, endOffset: 6, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: モモ, pronunciation: モモ, inflectionForm: null, inflectionType: null
token: もも, startOffset: 7, endOffset: 9, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: モモ, pronunciation: モモ, inflectionForm: null, inflectionType: null
==========================================>>
<<==========================================
input text => メガネは顔の一部です。
============================================
token: メガネ, startOffset: 0, endOffset: 3, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: メガネ, pronunciation: メガネ, inflectionForm: null, inflectionType: null
token: 顔, startOffset: 4, endOffset: 5, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: カオ, pronunciation: カオ, inflectionForm: null, inflectionType: null
token: 一部, startOffset: 6, endOffset: 8, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-副詞可能, reading: イチブ, pronunciation: イチブ, inflectionForm: null, inflectionType: null
==========================================>>
<<==========================================
input text => 日本経済新聞でモバゲーの記事を読んだ。
============================================
token: 日本経済新聞, startOffset: 0, endOffset: 6, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: ニホンケイザイシンブン, pronunciation: ニホンケイザイシンブン, inflectionForm: null, inflectionType: null
token: モバゲ, startOffset: 7, endOffset: 11, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: 記事, startOffset: 12, endOffset: 14, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: キジ, pronunciation: キジ, inflectionForm: null, inflectionType: null
token: 読む, startOffset: 15, endOffset: 17, posInc: 2, type: word
baseForm: 読む, partOfSpeech: 動詞-自立, reading: ヨン, pronunciation: ヨン, inflectionForm: 連用タ接続, inflectionType: 五段・マ行
==========================================>>
<<==========================================
input text => Java, Scala, Groovy, Clojure
============================================
token: java, startOffset: 0, endOffset: 4, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: scala, startOffset: 6, endOffset: 11, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: groovy, startOffset: 13, endOffset: 19, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: clojure, startOffset: 21, endOffset: 28, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
==========================================>>
<<==========================================
input text => ＬＵＣＥＮＥ、ＳＯＬＲ、Lucene, Solr
============================================
token: lucene, startOffset: 0, endOffset: 6, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: solr, startOffset: 7, endOffset: 11, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: lucene, startOffset: 12, endOffset: 18, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: solr, startOffset: 20, endOffset: 24, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
==========================================>>
<<==========================================
input text => ｱｲｳｴｵカキクケコさしすせそABCＸＹＺ123４５６
============================================
token: アイウエオカキクケコ, startOffset: 0, endOffset: 10, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-一般, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: しす, startOffset: 11, endOffset: 13, posInc: 2, type: word
baseForm: null, partOfSpeech: 動詞-自立, reading: シス, pronunciation: シス, inflectionForm: 基本形, inflectionType: 五段・サ行
token: そ, startOffset: 14, endOffset: 15, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-接尾-助動詞語幹, reading: ソ, pronunciation: ソ, inflectionForm: null, inflectionType: null
token: abcxyz, startOffset: 15, endOffset: 21, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: 123456, startOffset: 21, endOffset: 27, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-数, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
==========================================>>
<<==========================================
input text => Lucene is a full-featured text search engine library written in Java.
============================================
token: lucene, startOffset: 0, endOffset: 6, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: is, startOffset: 7, endOffset: 9, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: a, startOffset: 10, endOffset: 11, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: full, startOffset: 12, endOffset: 16, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: featured, startOffset: 17, endOffset: 25, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: text, startOffset: 26, endOffset: 30, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: search, startOffset: 31, endOffset: 37, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: engine, startOffset: 38, endOffset: 44, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: library, startOffset: 45, endOffset: 52, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: written, startOffset: 53, endOffset: 60, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: in, startOffset: 61, endOffset: 63, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
token: java, startOffset: 64, endOffset: 68, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-組織, reading: null, pronunciation: null, inflectionForm: null, inflectionType: null
==========================================>>
Analyzer => org.apache.lucene.analysis.ja.JapaneseAnalyzer End

The only result that changed seems to be the one for this input...?

input text => 日本経済新聞でモバゲーの記事を読んだ。

Finally, EXTENDED.

Analyzer => org.apache.lucene.analysis.ja.JapaneseAnalyzer Start
<<==========================================
input text => すもももももももものうち。
============================================
token: すもも, startOffset: 0, endOffset: 3, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: スモモ, pronunciation: スモモ, inflectionForm: null, inflectionType: null
token: もも, startOffset: 4, endOffset: 6, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: モモ, pronunciation: モモ, inflectionForm: null, inflectionType: null
token: もも, startOffset: 7, endOffset: 9, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: モモ, pronunciation: モモ, inflectionForm: null, inflectionType: null
==========================================>>
<<==========================================
input text => メガネは顔の一部です。
============================================
token: メガネ, startOffset: 0, endOffset: 3, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: メガネ, pronunciation: メガネ, inflectionForm: null, inflectionType: null
token: 顔, startOffset: 4, endOffset: 5, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: カオ, pronunciation: カオ, inflectionForm: null, inflectionType: null
token: 一部, startOffset: 6, endOffset: 8, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-副詞可能, reading: イチブ, pronunciation: イチブ, inflectionForm: null, inflectionType: null
==========================================>>
<<==========================================
input text => 日本経済新聞でモバゲーの記事を読んだ。
============================================
token: 日本, startOffset: 0, endOffset: 2, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-固有名詞-地域-国, reading: ニッポン, pronunciation: ニッポン, inflectionForm: null, inflectionType: null
token: 経済, startOffset: 2, endOffset: 4, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: ケイザイ, pronunciation: ケイザイ, inflectionForm: null, inflectionType: null
token: 新聞, startOffset: 4, endOffset: 6, posInc: 1, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: シンブン, pronunciation: シンブン, inflectionForm: null, inflectionType: null
token: 記事, startOffset: 12, endOffset: 14, posInc: 7, type: word
baseForm: null, partOfSpeech: 名詞-一般, reading: キジ, pronunciation: キジ, inflectionForm: null, inflectionType: null
token: 読む, startOffset: 15, endOffset: 17, posInc: 2, type: word
baseForm: 読む, partOfSpeech: 動詞-自立, reading: ヨン, pronunciation: ヨン, inflectionForm: 連用タ接続, inflectionType: 五段・マ行
==========================================>>
<<==========================================
input text => Java, Scala, Groovy, Clojure
============================================
==========================================>>
<<==========================================
input text => ＬＵＣＥＮＥ、ＳＯＬＲ、Lucene, Solr
============================================
==========================================>>
<<==========================================
input text => ｱｲｳｴｵカキクケコさしすせそABCＸＹＺ123４５６
============================================
token: しす, startOffset: 11, endOffset: 13, posInc: 12, type: word
baseForm: null, partOfSpeech: 動詞-自立, reading: シス, pronunciation: シス, inflectionForm: 基本形, inflectionType: 五段・サ行
token: そ, startOffset: 14, endOffset: 15, posInc: 2, type: word
baseForm: null, partOfSpeech: 名詞-接尾-助動詞語幹, reading: ソ, pronunciation: ソ, inflectionForm: null, inflectionType: null
==========================================>>
<<==========================================
input text => Lucene is a full-featured text search engine library written in Java.
============================================
==========================================>>
Analyzer => org.apache.lucene.analysis.ja.JapaneseAnalyzer End

Huh...? This is not the result I expected. Or rather, isn't the number of tokens coming back strangely small??

Maybe I'm using it the wrong way...
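One way to narrow this down would be to run JapaneseTokenizer on its own in EXTENDED mode, without the filters that JapaneseAnalyzer adds, and look at what comes out. The following is only a diagnostic sketch: `RawExtendedTokens` and `rawTokens` are made-up names, passing `true` for discardPunctuation is an assumption about what JapaneseAnalyzer does internally, and whether the filters are really responsible for the missing tokens is not confirmed here.

```scala
import java.io.StringReader

import org.apache.lucene.analysis.ja.JapaneseTokenizer
import org.apache.lucene.analysis.ja.tokenattributes.PartOfSpeechAttribute
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute

import scala.collection.mutable.ListBuffer

// Hypothetical helper object, introduced only for this sketch.
object RawExtendedTokens {
  // Returns (term, partOfSpeech) pairs straight from the tokenizer, with no filters applied.
  def rawTokens(text: String): List[(String, String)] = {
    val tokenizer = new JapaneseTokenizer(new StringReader(text),
                                          null, // no user dictionary
                                          true, // discardPunctuation (assumed to match JapaneseAnalyzer)
                                          JapaneseTokenizer.Mode.EXTENDED)
    try {
      val term = tokenizer.addAttribute(classOf[CharTermAttribute])
      val pos = tokenizer.addAttribute(classOf[PartOfSpeechAttribute])
      val tokens = ListBuffer.empty[(String, String)]
      tokenizer.reset()
      while (tokenizer.incrementToken()) {
        tokens += ((term.toString, pos.getPartOfSpeech))
      }
      tokenizer.end()
      tokens.toList
    } finally {
      tokenizer.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Two of the inputs that lost tokens in the EXTENDED run above.
    val texts = List("Java, Scala, Groovy, Clojure", "日本経済新聞でモバゲーの記事を読んだ。")
    for (text <- texts) {
      println(s"input text => $text")
      rawTokens(text).foreach { case (t, p) => println(s"token: $t, partOfSpeech: $p") }
    }
  }
}
```

If the missing tokens show up here, the filters inside JapaneseAnalyzer (stop words, stop tags and so on) would be the next place to look. If they do not, the tokenizer's EXTENDED mode itself is worth checking. I have not verified which is the case.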
