Indexing, Searching, and Deleting Documents in a Lucene Index, Following the Tutorial

I wrote something similar a little while ago, but this time the goal is to cover the basics of Lucene.

Introduction to Lucene's APIs
http://lucene.apache.org/core/4_3_0/core/overview-summary.html#overview_description

Using this as a reference, I'll write code for the following operations on a Lucene index:

Indexing
Searching
Deleting

Addendum:
Updating documents is covered in a separate entry.

Updating documents with Lucene
http://d.hatena.ne.jp/Kazuhira/20140411/1397240194

Setup.
build.sbt

name := "lucene-first-indexing"

version := "0.0.1-SNAPSHOT"

scalaVersion := "2.10.1"

organization := "littlewings"

scalacOptions += "-deprecation"

libraryDependencies ++= Seq(
  "org.apache.lucene" % "lucene-core" % "4.3.0",
  "org.apache.lucene" % "lucene-analyzers-common" % "4.3.0",
  "org.apache.lucene" % "lucene-analyzers-kuromoji" % "4.3.0",
  "org.apache.lucene" % "lucene-queryparser" % "4.3.0"
)

The import statements used.

import scala.util.{Failure, Success, Try}

import java.io.File
import java.text.SimpleDateFormat

import org.apache.lucene.analysis.ja.JapaneseAnalyzer
import org.apache.lucene.document.CompressionTools
import org.apache.lucene.document.Document
import org.apache.lucene.document.DateTools
import org.apache.lucene.document.Field
import org.apache.lucene.document.IntField
import org.apache.lucene.document.StringField
import org.apache.lucene.document.TextField
import org.apache.lucene.index.{DirectoryReader, IndexWriter, IndexWriterConfig}
import org.apache.lucene.queryparser.classic.QueryParser
import org.apache.lucene.search.IndexSearcher
import org.apache.lucene.store.FSDirectory
import org.apache.lucene.util.Version

This is a very simple sample, written by following the tutorial and modifying it slightly.

Indexing

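This is almost exactly what the tutorial does. The one difference is that the Directory is an FSDirectory, which stores the index on disk, rather than the in-memory RAMDirectory the tutorial uses.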

object LuceneIndexCreater {
  def main(args: Array[String]): Unit = {
    val luceneVersion = Version.LUCENE_43
    val analyzer = new JapaneseAnalyzer(luceneVersion)

    val directory = FSDirectory.open(new File("index-dir"))
    try {
      val config = new IndexWriterConfig(luceneVersion, analyzer)
      val writer = new IndexWriter(directory, config)

      try {
        writer.deleteAll()

        createDocs.foreach(writer.addDocument)
      } finally {
        writer.close()
      }
    } finally {
      directory.close()
    }
  }

  def createDocs: List[Document] =
    List(
      createBookDoc("Apache Lucene 入門 Java・オープンソース・全文検索システムの構築",
                    3200,
                    "2006/05/17",
                    "Luceneは全文検索システムを構築するためのJavaのライブラリです。Luceneを使えば,一味違う高機能なWebアプリケーションを作ることができます。"),
      createBookDoc("Apache Solr入門 オープンソース全文検索エンジン",
                    3780,
                    "2010/02/20",
                    "Apache Solrとは,オープンソースの検索エンジンです.Apache LuceneというJavaの全文検索システムをベースに豊富な拡張性をもたせ,多くの開発者が利用できるように作られました.検索というと,Googleのシステムを使っている企業Webページが多いですが,自前の検索エンジンがあると顧客にとって本当に必要な情報を提供できます"),
      createBookDoc("JBoss徹底活用ガイド",
                    2919,
                    "2008/02/19",
                    "企業向けJava+Web開発の決定版」企業向けのJavaの定番オープンソース製品になったJBoss。本書は JBoss ユーザ会の有志による共同執筆。現場で培われたJBossの活用ノウハウを余すことなく本書に反映。"),
      createBookDoc("Seasar 2 徹底入門 SAStruts/S2JDBC 対応",
                    3990,
                    "2010/04/20",
                    "Seasar2を使いこなすためのバイブルが登場!!")
    )

  def createBookDoc(title: String, price: Int, publishDate: String, description: String): Document = {
    val doc = new Document
    doc.add(new TextField("title",
                          title,
                          Field.Store.YES))
    doc.add(new StringField("price",
                            price.toString,
                            Field.Store.YES))
    doc.add(new StringField("publish-date",
                            DateTools.dateToString(new SimpleDateFormat("yyyy/MM/dd").parse(publishDate),
                                                   DateTools.Resolution.DAY),
                            Field.Store.YES))
    doc.add(new TextField("description",
                          description,
                          Field.Store.YES))
    doc
  }
}

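As usual, the sample data is books.
At first I made the price field an IntField, but then queries against it found nothing, and I was stuck on that for a while.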
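
As far as I can tell, the cause is that IntField indexes the value in Lucene's numeric encoding, while the classic QueryParser turns an input such as price:[3500 TO 4000] into a text-based range query, so the two never match. The following is a minimal sketch of one way to search an IntField anyway, by building the query with NumericRangeQuery.newIntRange instead of the parser. The names NumericPriceSearch and countInRange and the use of an in-memory RAMDirectory are my own assumptions for illustration and are not part of the sample above.

```scala
import org.apache.lucene.analysis.ja.JapaneseAnalyzer
import org.apache.lucene.document.{Document, Field, IntField}
import org.apache.lucene.index.{DirectoryReader, IndexWriter, IndexWriterConfig}
import org.apache.lucene.search.{IndexSearcher, NumericRangeQuery}
import org.apache.lucene.store.RAMDirectory
import org.apache.lucene.util.Version

// Hypothetical helper, not part of the original sample
object NumericPriceSearch {
  // Indexes the four book prices as IntFields and counts the documents in [min, max]
  def countInRange(min: Int, max: Int): Int = {
    val luceneVersion = Version.LUCENE_43
    val directory = new RAMDirectory // in-memory index, used only for this sketch
    try {
      val config = new IndexWriterConfig(luceneVersion, new JapaneseAnalyzer(luceneVersion))
      val writer = new IndexWriter(directory, config)
      try {
        for (price <- List(3200, 3780, 2919, 3990)) {
          val doc = new Document
          doc.add(new IntField("price", price, Field.Store.YES))
          writer.addDocument(doc)
        }
      } finally {
        writer.close()
      }

      val reader = DirectoryReader.open(directory)
      try {
        // Built programmatically; both ends of the range are inclusive
        val query = NumericRangeQuery.newIntRange("price", Int.box(min), Int.box(max), true, true)
        new IndexSearcher(reader).search(query, 10).totalHits
      } finally {
        reader.close()
      }
    } finally {
      directory.close()
    }
  }
}
```

With StringField, as in the sample, a range is compared as text, so it behaves like a numeric range only as long as every price has the same number of digits.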

When the indexer starts, it calls

        writer.deleteAll()

so every document already in the index is deleted before the books are added again.
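
To illustrate what that call changes, here is a minimal sketch that runs a tiny indexing step twice against the same directory, with or without deleteAll(). It relies on IndexWriter's default open mode, which appends to an existing index. The names DeleteAllSketch and docsAfterTwoRuns and the in-memory RAMDirectory are assumptions made for this sketch.

```scala
import org.apache.lucene.analysis.ja.JapaneseAnalyzer
import org.apache.lucene.document.{Document, Field, StringField}
import org.apache.lucene.index.{DirectoryReader, IndexWriter, IndexWriterConfig}
import org.apache.lucene.store.RAMDirectory
import org.apache.lucene.util.Version

// Hypothetical helper, not part of the original sample
object DeleteAllSketch {
  // Adds one document in each of two separate writer sessions and
  // returns how many documents the index holds afterwards
  def docsAfterTwoRuns(clearFirst: Boolean): Int = {
    val luceneVersion = Version.LUCENE_43
    val directory = new RAMDirectory // in-memory index, used only for this sketch
    try {
      for (_ <- 1 to 2) {
        // A fresh config per writer; the default open mode appends to an existing index
        val config = new IndexWriterConfig(luceneVersion, new JapaneseAnalyzer(luceneVersion))
        val writer = new IndexWriter(directory, config)
        try {
          if (clearFirst) writer.deleteAll()
          val doc = new Document
          doc.add(new StringField("price", "3200", Field.Store.YES))
          writer.addDocument(doc)
        } finally {
          writer.close()
        }
      }

      val reader = DirectoryReader.open(directory)
      try reader.numDocs() finally reader.close()
    } finally {
      directory.close()
    }
  }
}
```

Without the call, the second run leaves a duplicate of the same document in the index; with it, the index ends up holding only what the latest run added.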

Running the indexer:

> run-main LuceneIndexCreater
[info] Running LuceneIndexCreater 
[success] Total time: 1 s, completed 2013/06/09 1:30:02

Searching

This is also nearly the same as the tutorial, except that the query is read from the console.

object LuceneIndexSearcher {
  def main(args: Array[String]): Unit = {
    val luceneVersion = Version.LUCENE_43
    val analyzer = new JapaneseAnalyzer(luceneVersion)

    val directory = FSDirectory.open(new File("index-dir"))
    try {
      val reader = DirectoryReader.open(directory)
      try {
        val searcher = new IndexSearcher(reader)

        // Read queries until "exit"; readLine returns null at end of input, so stop there too
        Iterator
          .continually(readLine("""Enter Search Query (if "exit" type, exit)> """))
          .takeWhile(line => line != null && line != "exit")
          .filter(_ != "")
          .foreach { line =>
            // Terms without an explicit field are searched in "title"
            val parser = new QueryParser(luceneVersion, "title", analyzer)
            Try {
              val query = parser.parse(line)

              println(s"Query => $query")

              // Fetch up to 1000 hits, with no filter
              val hits = searcher.search(query, null, 1000).scoreDocs

              println(s"hits => ${hits.length}")

              for (h <- hits) {
                val hitDoc = searcher.doc(h.doc)
                println(s"Hit Document => $hitDoc")
              }
            } match {
              case Success(_) =>
              case Failure(e) => println(e)
            }
          }
      } finally {
        // Close the reader even if the loop ends abnormally
        reader.close()
      }
    } finally {
      directory.close()
    }
  }
}
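
The second constructor argument of QueryParser ("title" here) is the field used for terms that do not name a field themselves. The following minimal sketch prints how the parser interprets a few inputs, without touching an index. QueryParserSketch and parseToString are hypothetical names introduced for this illustration.

```scala
import org.apache.lucene.analysis.ja.JapaneseAnalyzer
import org.apache.lucene.queryparser.classic.QueryParser
import org.apache.lucene.util.Version

// Hypothetical helper, not part of the original sample
object QueryParserSketch {
  // Parses the input with "title" as the default field and returns the parsed query as text
  def parseToString(input: String): String = {
    val luceneVersion = Version.LUCENE_43
    val parser = new QueryParser(luceneVersion, "title", new JapaneseAnalyzer(luceneVersion))
    parser.parse(input).toString
  }

  def main(args: Array[String]): Unit = {
    List("Java", "description:オープンソース", "price:[3500 TO 4000]").foreach { input =>
      println(s"$input => ${parseToString(input)}")
    }
  }
}
```

The "Query =>" lines in the console session that follows show the same thing: Java is searched in title after lowercasing, and the analyzer splits オープンソース into two terms.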

Running the searcher:

> run-main LuceneIndexSearcher
[info] Running LuceneIndexSearcher

Enter Search Query (if "exit" type, exit)> *:*
Query => *:*
hits => 4
Hit Document => Document<stored,indexed,tokenized<title:Apache Lucene 入門 Java・オープンソース・全文検索システムの構築> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<price:3200> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<publish-date:20060516> stored,indexed,tokenized<description:Luceneは全文検索システムを構築するためのJavaのライブラリです。Luceneを使えば,一味違う高機能なWebアプリケーションを作ることができます。>>
Hit Document => Document<stored,indexed,tokenized<title:Apache Solr入門 オープンソース全文検索エンジン> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<price:3780> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<publish-date:20100219> stored,indexed,tokenized<description:Apache Solrとは,オープンソースの検索エンジンです.Apache LuceneというJavaの全文検索システムをベースに豊富な拡張性をもたせ,多くの開発者が利用できるように作られました.検索というと,Googleのシステムを使っている企業Webページが多いですが,自前の検索エンジンがあると顧客にとって本当に必要な情報を提供できます>>
Hit Document => Document<stored,indexed,tokenized<title:JBoss徹底活用ガイド> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<price:2919> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<publish-date:20080218> stored,indexed,tokenized<description:企業向けJava+Web開発の決定版」企業向けのJavaの定番オープンソース製品になったJBoss。本書は JBoss ユーザ会の有志による共同執筆。現場で培われたJBossの活用ノウハウを余すことなく本書に反映。>>
Hit Document => Document<stored,indexed,tokenized<title:Seasar 2 徹底入門 SAStruts/S2JDBC 対応> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<price:3990> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<publish-date:20100419> stored,indexed,tokenized<description:Seasar2を使いこなすためのバイブルが登場!!>>

Enter Search Query (if "exit" type, exit)> Java
Query => title:java
hits => 1
Hit Document => Document<stored,indexed,tokenized<title:Apache Lucene 入門 Java・オープンソース・全文検索システムの構築> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<price:3200> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<publish-date:20060516> stored,indexed,tokenized<description:Luceneは全文検索システムを構築するためのJavaのライブラリです。Luceneを使えば,一味違う高機能なWebアプリケーションを作ることができます。>>

Enter Search Query (if "exit" type, exit)> description:オープンソース
Query => description:オープン description:ソース
hits => 2
Hit Document => Document<stored,indexed,tokenized<title:Apache Solr入門 オープンソース全文検索エンジン> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<price:3780> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<publish-date:20100219> stored,indexed,tokenized<description:Apache Solrとは,オープンソースの検索エンジンです.Apache LuceneというJavaの全文検索システムをベースに豊富な拡張性をもたせ,多くの開発者が利用できるように作られました.検索というと,Googleのシステムを使っている企業Webページが多いですが,自前の検索エンジンがあると顧客にとって本当に必要な情報を提供できます>>
Hit Document => Document<stored,indexed,tokenized<title:JBoss徹底活用ガイド> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<price:2919> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<publish-date:20080218> stored,indexed,tokenized<description:企業向けJava+Web開発の決定版」企業向けのJavaの定番オープンソース製品になったJBoss。本書は JBoss ユーザ会の有志による共同執筆。現場で培われたJBossの活用ノウハウを余すことなく本書に反映。>>

Enter Search Query (if "exit" type, exit)> price:[3500 TO 4000]
Query => price:[3500 TO 4000]
hits => 2
Hit Document => Document<stored,indexed,tokenized<title:Apache Solr入門 オープンソース全文検索エンジン> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<price:3780> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<publish-date:20100219> stored,indexed,tokenized<description:Apache Solrとは,オープンソースの検索エンジンです.Apache LuceneというJavaの全文検索システムをベースに豊富な拡張性をもたせ,多くの開発者が利用できるように作られました.検索というと,Googleのシステムを使っている企業Webページが多いですが,自前の検索エンジンがあると顧客にとって本当に必要な情報を提供できます>>
Hit Document => Document<stored,indexed,tokenized<title:Seasar 2 徹底入門 SAStruts/S2JDBC 対応> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<price:3990> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<publish-date:20100419> stored,indexed,tokenized<description:Seasar2を使いこなすためのバイブルが登場!!>>

Enter Search Query (if "exit" type, exit)> publish-date:[20090101 TO 20111231]
Query => publish-date:[20090101 TO 20111231]
hits => 2
Hit Document => Document<stored,indexed,tokenized<title:Apache Solr入門 オープンソース全文検索エンジン> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<price:3780> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<publish-date:20100219> stored,indexed,tokenized<description:Apache Solrとは,オープンソースの検索エンジンです.Apache LuceneというJavaの全文検索システムをベースに豊富な拡張性をもたせ,多くの開発者が利用できるように作られました.検索というと,Googleのシステムを使っている企業Webページが多いですが,自前の検索エンジンがあると顧客にとって本当に必要な情報を提供できます>>
Hit Document => Document<stored,indexed,tokenized<title:Seasar 2 徹底入門 SAStruts/S2JDBC 対応> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<price:3990> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<publish-date:20100419> stored,indexed,tokenized<description:Seasar2を使いこなすためのバイブルが登場!!>>

Typing "exit" ends the session.

Enter Search Query (if "exit" type, exit)> exit
[success] Total time: 161 s, completed 2013/06/09 1:33:07
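
One detail in the search output above: every publish-date is one day earlier than the date passed to createBookDoc (2006/05/17 is stored as 20060516). A likely explanation is that SimpleDateFormat parses the string in the JVM's default time zone, while DateTools renders the value in GMT. The sketch below reproduces the shift using only the JDK, with a GMT SimpleDateFormat standing in for DateTools.dateToString at DAY resolution. It assumes the sample was run in Japan (UTC+9); PublishDateShift and indexedValue are hypothetical names.

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

// Hypothetical helper, not part of the original sample
object PublishDateShift {
  // Parses yyyy/MM/dd in parseZone, then renders yyyyMMdd in GMT
  // (a stand-in for DateTools.dateToString with Resolution.DAY)
  def indexedValue(publishDate: String, parseZone: TimeZone): String = {
    val input = new SimpleDateFormat("yyyy/MM/dd")
    input.setTimeZone(parseZone)
    val output = new SimpleDateFormat("yyyyMMdd")
    output.setTimeZone(TimeZone.getTimeZone("GMT"))
    output.format(input.parse(publishDate))
  }

  def main(args: Array[String]): Unit = {
    // Midnight in Tokyo is 15:00 of the previous day in GMT
    println(indexedValue("2006/05/17", TimeZone.getTimeZone("Asia/Tokyo"))) // prints 20060516
    // Parsing in GMT keeps the calendar day
    println(indexedValue("2006/05/17", TimeZone.getTimeZone("GMT")))        // prints 20060517
  }
}
```

If the stored day should match the input, setting the time zone of the parsing SimpleDateFormat to GMT is one option. The range queries above still return the expected books because the shift is only one day.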

Deleting

This part is not in the tutorial. It uses IndexWriter#deleteDocuments.

object LuceneIndexDeleter {
  def main(args: Array[String]): Unit = {
    val luceneVersion = Version.LUCENE_43
    val analyzer = new JapaneseAnalyzer(luceneVersion)

    val directory = FSDirectory.open(new File("index-dir"))
    try {
      val config = new IndexWriterConfig(luceneVersion, analyzer)
      val writer = new IndexWriter(directory, config)

      try {
        Iterator
          .continually(readLine("""Enter Delete Query (if "exit" type, exit)> """))
          .takeWhile(line => line != null && line != "exit")
          .filter(_ != "")
          .foreach { line =>
            val parser = new QueryParser(luceneVersion, "title", analyzer)

            Try {
              val query = parser.parse(line)

              println(s"Query => $query")

              writer.deleteDocuments(query)

              println("Document Deleted")
            } match {
              case Success(_) =>
              case Failure(e) => println(e)
            }
          }
      } finally {
        writer.close()
      }
    } finally {
      directory.close()
    }
  }
}

IndexWriter#deleteDocuments has overloads that take a Query and overloads that take a Term. This sample uses the Query version.
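
For comparison, here is a minimal sketch of the Term version. It deletes the documents whose price field holds exactly the given value, which works because StringField indexes the whole value as a single term. TermDeleteSketch, remainingAfterDelete, and the in-memory RAMDirectory are assumptions made for this sketch.

```scala
import org.apache.lucene.analysis.ja.JapaneseAnalyzer
import org.apache.lucene.document.{Document, Field, StringField}
import org.apache.lucene.index.{DirectoryReader, IndexWriter, IndexWriterConfig, Term}
import org.apache.lucene.store.RAMDirectory
import org.apache.lucene.util.Version

// Hypothetical helper, not part of the original sample
object TermDeleteSketch {
  // Indexes two documents, deletes those whose price equals the argument,
  // and returns the number of documents left
  def remainingAfterDelete(price: String): Int = {
    val luceneVersion = Version.LUCENE_43
    val directory = new RAMDirectory // in-memory index, used only for this sketch
    try {
      val config = new IndexWriterConfig(luceneVersion, new JapaneseAnalyzer(luceneVersion))
      val writer = new IndexWriter(directory, config)
      try {
        for (p <- List("3200", "3780")) {
          val doc = new Document
          doc.add(new StringField("price", p, Field.Store.YES))
          writer.addDocument(doc)
        }
        // Exact match on a single term; no query parsing or analysis involved
        writer.deleteDocuments(new Term("price", price))
      } finally {
        writer.close()
      }

      val reader = DirectoryReader.open(directory)
      try reader.numDocs() finally reader.close()
    } finally {
      directory.close()
    }
  }
}
```

The Term version fits deletion by an exact key, while the Query version used in the sample also handles ranges and anything else the query parser accepts.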

Running the deleter:

> run-main LuceneIndexDeleter
[info] Running LuceneIndexDeleter

Enter Delete Query (if "exit" type, exit)> publish-date:[20090101 TO 20111231]
Query => publish-date:[20090101 TO 20111231]
Document Deleted

Exit the deleter once, so that the IndexWriter gets closed...

Enter Delete Query (if "exit" type, exit)> exit
[success] Total time: 34 s, completed 2013/06/09 1:34:09

Searching again with the searcher:

> run-main LuceneIndexSearcher
[info] Running LuceneIndexSearcher 
Enter Search Query (if "exit" type, exit)> *:*
Query => *:*
hits => 2
Hit Document => Document<stored,indexed,tokenized<title:Apache Lucene 入門 Java・オープンソース・全文検索システムの構築> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<price:3200> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<publish-date:20060516> stored,indexed,tokenized<description:Luceneは全文検索システムを構築するためのJavaのライブラリです。Luceneを使えば,一味違う高機能なWebアプリケーションを作ることができます。>>
Hit Document => Document<stored,indexed,tokenized<title:JBoss徹底活用ガイド> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<price:2919> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<publish-date:20080218> stored,indexed,tokenized<description:企業向けJava+Web開発の決定版」企業向けのJavaの定番オープンソース製品になったJBoss。本書は JBoss ユーザ会の有志による共同執筆。現場で培われたJBossの活用ノウハウを余すことなく本書に反映。>>

The number of documents has dropped from four to two: the two books published between 2009 and 2011 are gone.
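
The reason for leaving the deleter before searching is that a reader opened on the directory only sees what the writer has committed, and closing the IndexWriter commits its pending changes. The following minimal sketch counts documents just before and just after an explicit commit(). DeleteVisibilitySketch, countsAroundCommit, and the in-memory RAMDirectory are assumptions made for this sketch.

```scala
import org.apache.lucene.analysis.ja.JapaneseAnalyzer
import org.apache.lucene.document.{Document, Field, StringField}
import org.apache.lucene.index.{DirectoryReader, IndexWriter, IndexWriterConfig, Term}
import org.apache.lucene.search.TermQuery
import org.apache.lucene.store.{Directory, RAMDirectory}
import org.apache.lucene.util.Version

// Hypothetical helper, not part of the original sample
object DeleteVisibilitySketch {
  // Opens a fresh reader on the directory and returns its live document count
  private def numDocs(directory: Directory): Int = {
    val reader = DirectoryReader.open(directory)
    try reader.numDocs() finally reader.close()
  }

  // Returns (count seen before the delete is committed, count seen after)
  def countsAroundCommit(): (Int, Int) = {
    val luceneVersion = Version.LUCENE_43
    val directory = new RAMDirectory // in-memory index, used only for this sketch
    try {
      val config = new IndexWriterConfig(luceneVersion, new JapaneseAnalyzer(luceneVersion))
      val writer = new IndexWriter(directory, config)
      try {
        for (price <- List("3200", "3780")) {
          val doc = new Document
          doc.add(new StringField("price", price, Field.Store.YES))
          writer.addDocument(doc)
        }
        writer.commit() // make the two documents visible to readers

        writer.deleteDocuments(new TermQuery(new Term("price", "3200")))
        val beforeCommit = numDocs(directory) // the delete is still pending
        writer.commit()
        val afterCommit = numDocs(directory)
        (beforeCommit, afterCommit)
      } finally {
        writer.close()
      }
    } finally {
      directory.close()
    }
  }
}
```

So calling writer.commit() after each deleteDocuments would be another way to make the deletion visible to a newly opened reader without leaving the deleter.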

These are fairly rough notes, but there they are.
