Updating Documents in Lucene

I recently read a book on Elasticsearch.

高速スケーラブル検索エンジン　ElasticSearch Server (アスキー書籍)

Authors: Rafał Kuć, Marek Rogoziński
Publisher: 角川アスキー総合研究所
Released: 2014/03/25
Format: Kindle edition

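While reading it, I got curious about document updates. Apparently, as long as the source (JSON) used at registration time is kept, a document can be updated by specifying its unique key.

Come to think of it, Solr worked along similar lines. A true in-place update isn't possible, so you had to send every field, whether it changed or not.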

[改訂新版] Apache Solr入門 ~オープンソース全文検索エンジン (Software Design plus)

Authors: 大谷純, 阿部慎一朗, 大須賀稔, 北野太郎, 鈴木教嗣, 平賀一昭, 株式会社リクルートテクノロジーズ, 株式会社ロンウイット
Publisher: 技術評論社
Released: 2013/11/29
Format: Large-format book

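Come to think of it, when I wrote my own introductory example of adding and deleting documents a while back, I didn't cover updates either.

Writing index registration, search, and deletion, using the tutorial as a reference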
http://d.hatena.ne.jp/Kazuhira/20130608/1370709395

This is a good opportunity, so let's give it a try.

At a glance, IndexWriter#updateDocument looks like the thing to use. I also referred to the following site.

Updating Document Fields in Lucene
http://hrycan.com/2009/11/26/updating-document-fields-in-lucene/

So, let's get started.

Preparation

First, the dependency definitions.
build.sbt

name := "lucene-document-update"

version := "0.0.1-SNAPSHOT"

scalaVersion := "2.10.4"

organization := "org.littlewings"

scalacOptions ++= Seq("-Xlint", "-deprecation", "-unchecked")

val luceneVersion = "4.7.1"

libraryDependencies ++= Seq(
  "org.apache.lucene" % "lucene-analyzers-kuromoji" % luceneVersion
)

A skeleton of the class that holds the main method.
src/main/scala/org/littlewings/lucene/update/LuceneUpdateDocument.scala

package org.littlewings.lucene.update

import scala.collection.JavaConverters._

import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.analysis.ja.JapaneseAnalyzer
import org.apache.lucene.document.{Document, Field, StringField, TextField}
import org.apache.lucene.index.{DirectoryReader, IndexableField, IndexWriter, IndexWriterConfig, Term}
import org.apache.lucene.search.{IndexSearcher, MatchAllDocsQuery, Sort, SortField, TermQuery, TopFieldCollector}
import org.apache.lucene.store.{Directory, RAMDirectory}
import org.apache.lucene.util.Version

object LuceneUpdateDocument {
  def main(args: Array[String]): Unit = {
  }

  implicit class CloseableWrapper[A <: AutoCloseable](val underlying: A) extends AnyVal {
    def foreach(fun: A => Unit): Unit =
      try {
        fun(underlying)
      } finally {
        underlying.close()
      }
  }
}

To make an AutoCloseable usable in a for expression (which really means giving it a foreach method), I defined an implicit class that is also a value class.
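To see what a for expression over an AutoCloseable actually does with this wrapper, here is a small standalone sketch. The wrapper is the same as in the skeleton; the Resource class and the object name are made up just for this demo.

```scala
object CloseableForDemo {
  // Same idea as the wrapper in the skeleton: any AutoCloseable gains a
  // foreach method, so it can appear on the right-hand side of a for expression.
  implicit class CloseableWrapper[A <: AutoCloseable](val underlying: A) extends AnyVal {
    def foreach(fun: A => Unit): Unit =
      try fun(underlying) finally underlying.close()
  }

  // Hypothetical resource that only records whether close() was called.
  class Resource(val name: String) extends AutoCloseable {
    var closed = false
    override def close(): Unit = closed = true
  }

  def main(args: Array[String]): Unit = {
    val resource = new Resource("demo")

    // The compiler rewrites this to resource.foreach(r => ...),
    // which triggers the implicit conversion to CloseableWrapper.
    for (r <- resource) {
      println(s"inside the block, closed = ${r.closed}")
    }

    // close() has run by now, even if the block had thrown.
    println(s"after the block, closed = ${resource.closed}")
  }
}
```

Because the wrapper only defines foreach, it supports for expressions without yield; a for ... yield would also need map or flatMap.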

From here on, let's write the code piece by piece.

Registering documents and searching for all of them

First, a method that creates the Analyzer.

  private def createAnalyzer(version: Version): Analyzer =
    new JapaneseAnalyzer(version)

Next, a method for building a document.

  private def createDocument(entry: Map[String, Any]): Document = {
    val document = new Document
    document.add(new StringField("isbn", entry("isbn").toString, Field.Store.YES))
    document.add(new TextField("title", entry("title").toString, Field.Store.YES))
    document.add(new StringField("price", entry("price").toString, Field.Store.YES))
    document.add(new TextField("summary", entry("summary").toString, Field.Store.YES))
    document
  }

Once again, the subject matter is books.

Using these, we register documents in the index.

  private def registerDocuments(directory: Directory, version: Version, analyzer: Analyzer): Unit =
    for (writer <- new IndexWriter(directory,
                                   new IndexWriterConfig(version, analyzer))) {
      Array(
        createDocument {
          Map("isbn" -> "978-4774127804",
              "title" -> "Apache Lucene 入門 〜Java・オープンソース・全文検索システムの構築",
              "price" -> 3360,
              "summary" -> "Luceneは全文検索システムを構築するためのJavaのライブラリです。")
        },
        createDocument {
          Map("isbn" -> "978-4774161631",
              "title" -> "[改訂新版] Apache Solr入門 オープンソース全文検索エンジン",
              "price" -> 3780,
              "summary" -> "最新版Apaceh Solr Ver.4.5.1に対応するため大幅な書き直しと原稿の追加を行い、現在の開発環境に合わせて完全にアップデートしました。Apache Solrは多様なプログラミング言語に対応した全文検索エンジンです。")
        },
        createDocument {
          Map("isbn" -> "978-4797352009",
              "title" -> "集合知イン・アクション",
              "price" -> 3990,
              "summary" -> "レコメンデーションエンジンをつくるには?ブログやSNSのテキスト分析、ユーザー嗜好の予測モデル、レコメンデーションエンジン……Web 2.0の鍵「集合知」をJavaで実装しよう!")
        }
      )
      .foreach(writer.addDocument)

      writer.commit()
    }

That registers three books.

And here is a method that searches for all registered documents and prints them.

  private def printAllDocuments(directory: Directory, version: Version): Unit =
    for (reader <- DirectoryReader.open(directory)) {
      val searcher = new IndexSearcher(reader)
      val allQuery = new MatchAllDocsQuery
      val limit = 1000

      val collector =
        TopFieldCollector
          .create(new Sort(new SortField("price", SortField.Type.INT, true)),
                  limit,
                  true,
                  false,
                  false,
                  false)

      searcher.search(allQuery, collector)

      val topDocs = collector.topDocs
      val hits = topDocs.scoreDocs

      hits.foreach { h =>
        val hitDoc = searcher.doc(h.doc)
        println(s"Doc, id[${h.doc}]:" + System.lineSeparator +
                hitDoc
                  .getFields
                  .asScala
                  .map(f => s"${f.name}:${f.stringValue}")
                  .mkString("  ", System.lineSeparator + "  ", ""))
      }
    }

The results are sorted by price, highest first.

Using these, we implement the main method.

  def main(args: Array[String]): Unit = {
    val version = Version.LUCENE_CURRENT
    val analyzer = createAnalyzer(version)

    for (directory <- new RAMDirectory) {
      registerDocuments(directory, version, analyzer)

      printAllDocuments(directory, version)
    }
  }

Running it produces the following output.

Doc, id[2]:
  isbn:978-4797352009
  title:集合知イン・アクション
  price:3990
  summary:レコメンデーションエンジンをつくるには?ブログやSNSのテキスト分析、ユーザー嗜好の予測モデル、レコメンデーションエンジン……Web 2.0の鍵「集合知」をJavaで実装しよう!
Doc, id[1]:
  isbn:978-4774161631
  title:[改訂新版] Apache Solr入門 オープンソース全文検索エンジン
  price:3780
  summary:最新版Apaceh Solr Ver.4.5.1に対応するため大幅な書き直しと原稿の追加を行い、現在の開発環境に合わせて完全にアップデートしました。Apache Solrは多様なプログラミング言語に対応した全文検索エンジンです。
Doc, id[0]:
  isbn:978-4774127804
  title:Apache Lucene 入門 〜Java・オープンソース・全文検索システムの構築
  price:3360
  summary:Luceneは全文検索システムを構築するためのJavaのライブラリです。

Updating registered documents

Now let's write a method that updates an already registered document.

  private def updateDocument(directory: Directory,
                             version: Version,
                             analyzer: Analyzer,
                             term: Term,
                             fields: IndexableField*): Unit =
    for {
      writer <- new IndexWriter(directory,
                                new IndexWriterConfig(version, analyzer))
      reader <- DirectoryReader.open(directory)
    } {
      val searcher = new IndexSearcher(reader)
      val query = new TermQuery(term)

      val hits = searcher.search(query, null, 1).scoreDocs
      
      if (hits.size > 0) {
        val hitDoc = searcher.doc(hits(0).doc)

        fields.foreach { f =>
          hitDoc.removeField(f.name)
          hitDoc.add(f)
        }

        writer.updateDocument(term, hitDoc)

        writer.commit()
      }
    }

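Apparently, IndexWriter#updateDocument atomically deletes the document(s) matching the given Term and then adds the new one.

Since the argument is a Term, I suspect this is why Elasticsearch and Solr require a unique key.

So, on the reasoning that the document should be looked up with that same Term, I search for it with a TermQuery.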

      val searcher = new IndexSearcher(reader)
      val query = new TermQuery(term)

      val hits = searcher.search(query, null, 1).scoreDocs

The query above is written on the assumption that the Term identifies exactly one document.
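As an aside, here is a minimal sketch of why that uniqueness assumption matters. As far as I understand the Javadoc, updateDocument deletes every document containing the term before adding the new one, so a non-unique term collapses several documents into one. The publisher field and the object name are made up for this illustration, and it assumes the same Lucene 4.7.1 and kuromoji dependency as build.sbt.

```scala
import org.apache.lucene.analysis.ja.JapaneseAnalyzer
import org.apache.lucene.document.{Document, Field, StringField}
import org.apache.lucene.index.{DirectoryReader, IndexWriter, IndexWriterConfig, Term}
import org.apache.lucene.store.RAMDirectory
import org.apache.lucene.util.Version

object NonUniqueTermUpdate {
  // Build a tiny document; "publisher" is a hypothetical, non-unique field.
  private def book(isbn: String, publisher: String): Document = {
    val doc = new Document
    doc.add(new StringField("isbn", isbn, Field.Store.YES))
    doc.add(new StringField("publisher", publisher, Field.Store.YES))
    doc
  }

  // Number of live (non-deleted) documents visible as of the last commit.
  private def liveDocs(directory: RAMDirectory): Int = {
    val reader = DirectoryReader.open(directory)
    try reader.numDocs finally reader.close()
  }

  // Returns the live document count before and after the update.
  def run(): (Int, Int) = {
    val version = Version.LUCENE_47
    val directory = new RAMDirectory
    val writer = new IndexWriter(directory,
                                 new IndexWriterConfig(version, new JapaneseAnalyzer(version)))
    try {
      writer.addDocument(book("978-4774127804", "gihyo"))
      writer.addDocument(book("978-4774161631", "gihyo"))
      writer.commit()
      val before = liveDocs(directory)

      // The term matches both documents, so both are deleted before the add.
      writer.updateDocument(new Term("publisher", "gihyo"),
                            book("978-4774127804", "gihyo"))
      writer.commit()

      (before, liveDocs(directory))
    } finally writer.close()
  }

  def main(args: Array[String]): Unit = {
    val (before, after) = run()
    println(s"live documents before: $before, after: $after")
  }
}
```

With a key such as isbn that really is unique, the same call replaces exactly one document, which is the situation the method in this post relies on.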

If there is a hit, we apply the update to it.

      if (hits.size > 0) {
        val hitDoc = searcher.doc(hits(0).doc)

        fields.foreach { f =>
          hitDoc.removeField(f.name)
          hitDoc.add(f)
        }

        writer.updateDocument(term, hitDoc)

        writer.commit()
      }

Here we take the first hit and replace its fields with the new values (the arguments of this method): remove the old field first, then add the new one.

        fields.foreach { f =>
          hitDoc.removeField(f.name)
          hitDoc.add(f)
        }

In other words, we are rewriting the values of the previously registered document ourselves.

Finally, the update itself.

        writer.updateDocument(term, hitDoc)

        writer.commit()

Followed by a commit.

IndexWriter#updateDocument is declared to take an Iterable of classes implementing the IndexableField interface (the various Field types). Since the Document class is itself defined as an Iterable of IndexableField, I passed the Document directly here.

public void updateDocument(Term term,
                  Iterable<? extends IndexableField> doc)

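As I'll come back to later, if you pass a List of Fields or the like instead, you have to list every Field you want the updated Document to contain, so simply passing the Document seems like the reasonable choice.

Putting it all together, the caller looks like this.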

  def main(args: Array[String]): Unit = {
    val version = Version.LUCENE_CURRENT
    val analyzer = createAnalyzer(version)

    for (directory <- new RAMDirectory) {
      registerDocuments(directory, version, analyzer)

      printAllDocuments(directory, version)

      println("==================================================")

      updateDocument(directory,
                     version,
                     analyzer,
                     new Term("isbn", "978-4774127804"),
                     new StringField("price", "5000", Field.Store.YES))

      updateDocument(directory,
                     version,
                     analyzer,
                     new Term("isbn", "978-4797352009"),
                     new TextField("title", "【集合知イン・アクション】", Field.Store.YES),
                     new StringField("price", "2000", Field.Store.YES))

      printAllDocuments(directory, version)
    }
  }

Here I change the decoration of one title and change some prices.

Let's run it.

before

Doc, id[2]:
  isbn:978-4797352009
  title:集合知イン・アクション
  price:3990
  summary:レコメンデーションエンジンをつくるには?ブログやSNSのテキスト分析、ユーザー嗜好の予測モデル、レコメンデーションエンジン……Web 2.0の鍵「集合知」をJavaで実装しよう!
Doc, id[1]:
  isbn:978-4774161631
  title:[改訂新版] Apache Solr入門 オープンソース全文検索エンジン
  price:3780
  summary:最新版Apaceh Solr Ver.4.5.1に対応するため大幅な書き直しと原稿の追加を行い、現在の開発環境に合わせて完全にアップデートしました。Apache Solrは多様なプログラミング言語に対応した全文検索エンジンです。
Doc, id[0]:
  isbn:978-4774127804
  title:Apache Lucene 入門 〜Java・オープンソース・全文検索システムの構築
  price:3360
  summary:Luceneは全文検索システムを構築するためのJavaのライブラリです。

after

Doc, id[3]:
  isbn:978-4774127804
  title:Apache Lucene 入門 〜Java・オープンソース・全文検索システムの構築
  summary:Luceneは全文検索システムを構築するためのJavaのライブラリです。
  price:5000
Doc, id[1]:
  isbn:978-4774161631
  title:[改訂新版] Apache Solr入門 オープンソース全文検索エンジン
  price:3780
  summary:最新版Apaceh Solr Ver.4.5.1に対応するため大幅な書き直しと原稿の追加を行い、現在の開発環境に合わせて完全にアップデートしました。Apache Solrは多様なプログラミング言語に対応した全文検索エンジンです。
Doc, id[4]:
  isbn:978-4797352009
  summary:レコメンデーションエンジンをつくるには?ブログやSNSのテキスト分析、ユーザー嗜好の予測モデル、レコメンデーションエンジン……Web 2.0の鍵「集合知」をJavaで実装しよう!
  title:【集合知イン・アクション】
  price:2000

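Looking closely, the document IDs have been reassigned, and the field positions have shifted as well.

The document IDs presumably changed because each document was deleted and then added again. The field order changed because the updated fields are appended at the end of the document.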

Note that this time, code like the following was used to replace the fields:

        fields.foreach { f =>
          hitDoc.removeField(f.name)
          hitDoc.add(f)
        }

This was enough to confirm that the fields were updated, but it actually relies on every field in the document being Store.YES.

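If the document had a Store.NO field, that field would not be present in the Document loaded from the index, so it would never be carried over into the re-added document. In other words, the document would lose that field.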
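Here is a minimal sketch of that field loss, assuming the same Lucene 4.7.1 setup as in build.sbt. The category field and the object name are hypothetical. It registers one document with a Store.NO field, updates the price with the same load, modify, and updateDocument pattern used in this post, and counts hits on the unstored field before and after.

```scala
import org.apache.lucene.analysis.ja.JapaneseAnalyzer
import org.apache.lucene.document.{Document, Field, StringField}
import org.apache.lucene.index.{DirectoryReader, IndexWriter, IndexWriterConfig, Term}
import org.apache.lucene.search.{IndexSearcher, TermQuery}
import org.apache.lucene.store.RAMDirectory
import org.apache.lucene.util.Version

object StoreNoPitfall {
  // Count documents matching a single term as of the last commit.
  private def countHits(directory: RAMDirectory, term: Term): Int = {
    val reader = DirectoryReader.open(directory)
    try new IndexSearcher(reader).search(new TermQuery(term), 10).totalHits
    finally reader.close()
  }

  // Returns the hit count on the unstored field before and after the update.
  def run(): (Int, Int) = {
    val version = Version.LUCENE_47
    val directory = new RAMDirectory
    val isbn = new Term("isbn", "978-4774127804")
    val category = new Term("category", "search") // hypothetical Store.NO field

    val writer = new IndexWriter(directory,
                                 new IndexWriterConfig(version, new JapaneseAnalyzer(version)))
    try {
      val doc = new Document
      doc.add(new StringField("isbn", isbn.text, Field.Store.YES))
      doc.add(new StringField("price", "3360", Field.Store.YES))
      doc.add(new StringField("category", category.text, Field.Store.NO))
      writer.addDocument(doc)
      writer.commit()
      val before = countHits(directory, category)

      val reader = DirectoryReader.open(directory)
      try {
        val searcher = new IndexSearcher(reader)
        // The loaded Document contains only the stored fields (isbn, price).
        val hitDoc = searcher.doc(searcher.search(new TermQuery(isbn), 1).scoreDocs(0).doc)
        hitDoc.removeField("price")
        hitDoc.add(new StringField("price", "5000", Field.Store.YES))
        writer.updateDocument(isbn, hitDoc)
        writer.commit()
      } finally reader.close()

      (before, countHits(directory, category))
    } finally writer.close()
  }

  def main(args: Array[String]): Unit = {
    val (before, after) = run()
    println(s"category hits before update: $before, after update: $after")
  }
}
```

As far as I can tell from the Javadoc of IndexReader#document, only the stored content of a field comes back; metadata such as whether the field was tokenized is not preserved either, so re-indexing a loaded Document may not reproduce the original exactly even when every field is stored.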

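Simply passing the Term that identifies the document together with only the fields to be updated won't work either, I think. In that case, the updated fields are all that remain in the document.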
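To illustrate, here is a minimal sketch under the same Lucene 4.7.1 assumptions, with a made-up object name. It passes only the changed price field to updateDocument and then lists the field names of whatever documents are left in the index.

```scala
import scala.collection.JavaConverters._

import org.apache.lucene.analysis.ja.JapaneseAnalyzer
import org.apache.lucene.document.{Document, Field, StringField}
import org.apache.lucene.index.{DirectoryReader, IndexWriter, IndexWriterConfig, Term}
import org.apache.lucene.search.{IndexSearcher, MatchAllDocsQuery}
import org.apache.lucene.store.RAMDirectory
import org.apache.lucene.util.Version

object FieldsOnlyUpdate {
  // Returns the stored field names of all documents left after the update.
  def run(): List[String] = {
    val version = Version.LUCENE_47
    val directory = new RAMDirectory
    val writer = new IndexWriter(directory,
                                 new IndexWriterConfig(version, new JapaneseAnalyzer(version)))
    try {
      val doc = new Document
      doc.add(new StringField("isbn", "978-4774127804", Field.Store.YES))
      doc.add(new StringField("price", "3360", Field.Store.YES))
      writer.addDocument(doc)
      writer.commit()

      // Pass only the changed field: the old document is deleted as a whole,
      // and the replacement consists of nothing but this one field.
      val onlyPrice = Seq(new StringField("price", "5000", Field.Store.YES)).asJava
      writer.updateDocument(new Term("isbn", "978-4774127804"), onlyPrice)
      writer.commit()
    } finally writer.close()

    val reader = DirectoryReader.open(directory)
    try {
      val searcher = new IndexSearcher(reader)
      searcher
        .search(new MatchAllDocsQuery, 10)
        .scoreDocs
        .toList
        .flatMap(h => searcher.doc(h.doc).getFields.asScala.map(_.name))
    } finally reader.close()
  }

  def main(args: Array[String]): Unit =
    println(s"fields after update: ${run().mkString(", ")}")
}
```

Since the isbn field is gone from the replacement, the document can no longer be found by the very key that was used to update it.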

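So, if it isn't too much trouble, the better approach is probably to do what Solr does and include every field, changed or not, in the update request. I feel like I understand the reasoning a bit better now.

I've uploaded the source code for this post here.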

https://github.com/kazuhira-r/lucene-examples/tree/master/lucene-document-update
