Trying "Semantic Search for Beginners" from the Qdrant tutorials

What did I want to do in this post?

The other day, I installed Qdrant.

Installing and trying the vector database Qdrant on Ubuntu Linux 22.04 LTS - CLOVER🍀

I wasn't sure where to go from there, but the end of the Quickstart recommends reading the tutorials and examples,
so I'm going to work through the tutorials for a while.

To move onto some more complex examples of vector search, read our Tutorials and create your own app with the help of our Examples.

Quickstart - Qdrant

First up is "Semantic Search for Beginners".

Semantic Search 101 - Qdrant

Environment

This is the environment for this post. Qdrant is assumed to be running at 172.17.0.2.

$ ./qdrant --version
qdrant 1.7.4

The Qdrant Web UI is version 0.1.21.

The Python environment that uses Qdrant Client.

$ python3 --version
Python 3.10.12


$ pip3 --version
pip 22.0.2 from /usr/lib/python3/dist-packages/pip (python 3.10)

Trying the Qdrant tutorial "Semantic Search for Beginners"

Now let's work through the Qdrant tutorial "Semantic Search for Beginners", following the documentation.

Semantic Search 101 - Qdrant

That said, copying it exactly isn't very interesting, so I'll change things where my own interests lead.

Install the libraries.

$ pip3 install qdrant-client sentence-transformers

The list of installed libraries.

$ pip3 list
Package                  Version
------------------------ ----------
annotated-types          0.6.0
anyio                    4.2.0
certifi                  2024.2.2
charset-normalizer       3.3.2
click                    8.1.7
exceptiongroup           1.2.0
filelock                 3.13.1
fsspec                   2023.12.2
grpcio                   1.60.1
grpcio-tools             1.60.1
h11                      0.14.0
h2                       4.1.0
hpack                    4.0.0
httpcore                 1.0.2
httpx                    0.26.0
huggingface-hub          0.20.3
hyperframe               6.0.1
idna                     3.6
Jinja2                   3.1.3
joblib                   1.3.2
MarkupSafe               2.1.5
mpmath                   1.3.0
networkx                 3.2.1
nltk                     3.8.1
numpy                    1.26.3
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.19.3
nvidia-nvjitlink-cu12    12.3.101
nvidia-nvtx-cu12         12.1.105
packaging                23.2
pillow                   10.2.0
pip                      22.0.2
portalocker              2.8.2
protobuf                 4.25.2
pydantic                 2.6.0
pydantic_core            2.16.1
PyYAML                   6.0.1
qdrant-client            1.7.2
regex                    2023.12.25
requests                 2.31.0
safetensors              0.4.2
scikit-learn             1.4.0
scipy                    1.12.0
sentence-transformers    2.3.1
sentencepiece            0.1.99
setuptools               59.6.0
sniffio                  1.3.0
sympy                    1.12
threadpoolctl            3.2.0
tokenizers               0.15.1
torch                    2.2.0
tqdm                     4.66.1
transformers             4.37.2
triton                   2.2.0
typing_extensions        4.9.0
urllib3                  2.2.0

Sentence Transformers is used to vectorize the text.

The tutorial picks all-MiniLM-L6-v2, which it describes as the fastest model.

sentence-transformers/all-MiniLM-L6-v2 · Hugging Face

After trying that one, I'll also try intfloat/multilingual-e5-base.

intfloat/multilingual-e5-base · Hugging Face

Here is the program I wrote.

search.py

from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer

## Model used for text embedding
encoder = SentenceTransformer("all-MiniLM-L6-v2")
#encoder = SentenceTransformer("intfloat/multilingual-e5-base")

print(f"vector dimention = {encoder.get_sentence_embedding_dimension()}")
print()

## Dataset
documents = [
    {
        "name": "The Time Machine",
        "description": "A man travels through time and witnesses the evolution of humanity.",
        "author": "H.G. Wells",
        "year": 1895,
    },
    {
        "name": "Ender's Game",
        "description": "A young boy is trained to become a military leader in a war against an alien race.",
        "author": "Orson Scott Card",
        "year": 1985,
    },
    {
        "name": "Brave New World",
        "description": "A dystopian society where people are genetically engineered and conditioned to conform to a strict social hierarchy.",
        "author": "Aldous Huxley",
        "year": 1932,
    },
    {
        "name": "The Hitchhiker's Guide to the Galaxy",
        "description": "A comedic science fiction series following the misadventures of an unwitting human and his alien friend.",
        "author": "Douglas Adams",
        "year": 1979,
    },
    {
        "name": "Dune",
        "description": "A desert planet is the site of political intrigue and power struggles.",
        "author": "Frank Herbert",
        "year": 1965,
    },
    {
        "name": "Foundation",
        "description": "A mathematician develops a science to predict the future of humanity and works to save civilization from collapse.",
        "author": "Isaac Asimov",
        "year": 1951,
    },
    {
        "name": "Snow Crash",
        "description": "A futuristic world where the internet has evolved into a virtual reality metaverse.",
        "author": "Neal Stephenson",
        "year": 1992,
    },
    {
        "name": "Neuromancer",
        "description": "A hacker is hired to pull off a near-impossible hack and gets pulled into a web of intrigue.",
        "author": "William Gibson",
        "year": 1984,
    },
    {
        "name": "The War of the Worlds",
        "description": "A Martian invasion of Earth throws humanity into chaos.",
        "author": "H.G. Wells",
        "year": 1898,
    },
    {
        "name": "The Hunger Games",
        "description": "A dystopian society where teenagers are forced to fight to the death in a televised spectacle.",
        "author": "Suzanne Collins",
        "year": 2008,
    },
    {
        "name": "The Andromeda Strain",
        "description": "A deadly virus from outer space threatens to wipe out humanity.",
        "author": "Michael Crichton",
        "year": 1969,
    },
    {
        "name": "The Left Hand of Darkness",
        "description": "A human ambassador is sent to a planet where the inhabitants are genderless and can change gender at will.",
        "author": "Ursula K. Le Guin",
        "year": 1969,
    },
    {
        "name": "The Three-Body Problem",
        "description": "Humans encounter an alien civilization that lives in a dying system.",
        "author": "Liu Cixin",
        "year": 2008,
    },
]


## Connect to Qdrant
qdrant = QdrantClient(host="172.17.0.2", port=6333)

## Recreate the collection (on every run)
qdrant.recreate_collection(
    collection_name="my_books",
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(),  ## vector dimension is taken from Sentence Transformers
        distance=models.Distance.COSINE  ## cosine similarity as the distance metric
    )
)

## Register the points
qdrant.upload_points(
    collection_name="my_books",
    points=[
        models.PointStruct(
            id=idx, vector=encoder.encode(doc["description"]).tolist(), payload=doc
            ## "passage: " is the prefix for intfloat/multilingual-e5-base
            #id=idx, vector=encoder.encode("passage: " + doc["description"]).tolist(), payload=doc
        )
        for idx, doc in enumerate(documents)
    ]
)

## Search
hits = qdrant.search(
    collection_name="my_books",
    query_vector=encoder.encode("alien invasion").tolist(),
    ## "query: " is the prefix for intfloat/multilingual-e5-base
    #query_vector=encoder.encode("query: alien invasion").tolist(),
    limit=3
)

print("search results:")
for hit in hits:
    print(hit.payload, "score:", hit.score)

print()

## Narrow down with a filter
hits = qdrant.search(
    collection_name="my_books",
    query_vector=encoder.encode("alien invasion").tolist(),
    ## "query: " is the prefix for intfloat/multilingual-e5-base
    #query_vector=encoder.encode("query: alien invasion").tolist(),
    query_filter=models.Filter(
        must=[models.FieldCondition(key="year", range=models.Range(gte=2000))]
    ),
    limit=1
)

print("search & filter results:")
for hit in hits:
    print(hit.payload, "score:", hit.score)

print()

First, specify the model that Sentence Transformers will use.

## Model used for text embedding
encoder = SentenceTransformer("all-MiniLM-L6-v2")

Here I also print the vector dimension of this model.

print(f"vector dimention = {encoder.get_sentence_embedding_dimension()}")
print()

With the all-MiniLM-L6-v2 model, the dimension is 384.

vector dimention = 384

Next comes the dataset, which contains each book's name, author, publication year, and a short description.

## Dataset
documents = [
    {
        "name": "The Time Machine",
        "description": "A man travels through time and witnesses the evolution of humanity.",
        "author": "H.G. Wells",
        "year": 1895,
    },

    ~ omitted ~

    {
        "name": "The Three-Body Problem",
        "description": "Humans encounter an alien civilization that lives in a dying system.",
        "author": "Liu Cixin",
        "year": 2008,
    },
]

Define the connection from the Qdrant client to the Qdrant server. In the tutorial, this was an in-memory instance.

## Connect to Qdrant
qdrant = QdrantClient(host="172.17.0.2", port=6333)

Create the collection that the points will be registered in.

## Recreate the collection (on every run)
qdrant.recreate_collection(
    collection_name="my_books",
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(),  ## vector dimension is taken from Sentence Transformers
        distance=models.Distance.COSINE  ## cosine similarity as the distance metric
    )
)

The collection is named "my_books". The distance metric is cosine similarity.
The vector dimension is also required, and it is taken from Sentence Transformers. For all-MiniLM-L6-v2 it was 384.
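
As a reminder of what the cosine metric measures, here is a minimal sketch in plain Python. The function name and the toy 3-dimensional vectors are my own, made up for illustration (the real embeddings here have 384 dimensions), and this is not how Qdrant computes it internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical): one pair pointing the same way, one pair at right angles.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Vectors pointing in the same direction score close to 1 regardless of their length, and unrelated (orthogonal) vectors score 0, which is why a higher score means a closer match in the search results.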

Register the points.

## Register the points
qdrant.upload_points(
    collection_name="my_books",
    points=[
        models.PointStruct(
            id=idx, vector=encoder.encode(doc["description"]).tolist(), payload=doc
        )
        for idx, doc in enumerate(documents)
    ]
)

The original tutorial uses a method called QdrantClient#upload_records, but with the current Qdrant Client it produces
the deprecation warning below, so I switched to QdrantClient#upload_points as the warning suggests.

/path/to/search.py:105: DeprecationWarning: `upload_records` is deprecated, use `upload_points` instead
  qdrant.upload_records(

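This name also matches Qdrant's own terminology better.

Let's search with "alien invasion" as the query. The top 3 results by score are fetched, and the payload and score of each are printed.

## Search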
hits = qdrant.search(
    collection_name="my_books",
    query_vector=encoder.encode("alien invasion").tolist(),
    limit=3
)

print("search results:")
for hit in hits:
    print(hit.payload, "score:", hit.score)

print()

The results look like this.

search results:
{'author': 'H.G. Wells', 'description': 'A Martian invasion of Earth throws humanity into chaos.', 'name': 'The War of the Worlds', 'year': 1898} score: 0.5700934
{'author': 'Douglas Adams', 'description': 'A comedic science fiction series following the misadventures of an unwitting human and his alien friend.', 'name': "The Hitchhiker's Guide to the Galaxy", 'year': 1979} score: 0.5040469
{'author': 'Liu Cixin', 'description': 'Humans encounter an alien civilization that lives in a dying system.', 'name': 'The Three-Body Problem', 'year': 2008} score: 0.45902938

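The top hit for "alien invasion" is "The War of the Worlds" by H.G. Wells, and judging by the description it probably deserves
first place... maybe (the gap between 1st and 2nd place is hard to judge).

Next, narrow the results with a filter. The search is restricted to books from 2000 onward, and only the top 1 is fetched.

## Narrow down with a filter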
hits = qdrant.search(
    collection_name="my_books",
    query_vector=encoder.encode("alien invasion").tolist(),
    query_filter=models.Filter(
        must=[models.FieldCondition(key="year", range=models.Range(gte=2000))]
    ),
    limit=1
)

print("search & filter results:")
for hit in hits:
    print(hit.payload, "score:", hit.score)

print()

The result.

search & filter results:
{'author': 'Liu Cixin', 'description': 'Humans encounter an alien civilization that lives in a dying system.', 'name': 'The Three-Body Problem', 'year': 2008} score: 0.45902938

Of the three earlier results, only the one from 2008 remains.
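
To picture what "filter, then take the top N" means, here is a brute-force sketch in plain Python. The `search` function, its `min_year` parameter, and the 2-dimensional vectors are all hypothetical (only the book names and years come from the dataset), and Qdrant itself does not work by scanning a list like this. It is just a mental model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def search(points, query_vector, limit, min_year=None):
    """Keep points whose payload passes the year condition, then rank by score."""
    candidates = [
        p for p in points
        if min_year is None or p["payload"]["year"] >= min_year
    ]
    scored = [(cosine(p["vector"], query_vector), p["payload"]) for p in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


# Made-up 2-dimensional vectors; names and years are taken from the dataset.
points = [
    {"vector": [0.9, 0.1], "payload": {"name": "The War of the Worlds", "year": 1898}},
    {"vector": [0.7, 0.3], "payload": {"name": "The Hitchhiker's Guide to the Galaxy", "year": 1979}},
    {"vector": [0.6, 0.4], "payload": {"name": "The Three-Body Problem", "year": 2008}},
    {"vector": [0.1, 0.9], "payload": {"name": "The Hunger Games", "year": 2008}},
]
query = [1.0, 0.0]

top3 = search(points, query, limit=3)
filtered = search(points, query, limit=1, min_year=2000)
print([payload["name"] for _, payload in top3])
print([payload["name"] for _, payload in filtered])
```

The score of each surviving point is unchanged by the filter. The filter only removes candidates, which matches the result above where "The Three-Body Problem" keeps the same score of 0.45902938.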

That is the end of the tutorial's content.

Switching the Sentence Transformers model to intfloat/multilingual-e5-base

Next, I'll switch the model used by Sentence Transformers to intfloat/multilingual-e5-base.

intfloat/multilingual-e5-base · Hugging Face

The commented-out lines in the source code above are the ones for intfloat/multilingual-e5-base, so I uncomment them
and swap them in for the originals.

Using the model.

## Model used for text embedding
encoder = SentenceTransformer("intfloat/multilingual-e5-base")

Registering the points. With intfloat/multilingual-e5-base, the prefix "passage: " is needed when vectorizing the text to be registered.

## Register the points
qdrant.upload_points(
    collection_name="my_books",
    points=[
        models.PointStruct(
            ## "passage: " is the prefix for intfloat/multilingual-e5-base
            id=idx, vector=encoder.encode("passage: " + doc["description"]).tolist(), payload=doc
        )
        for idx, doc in enumerate(documents)
    ]
)

Search and filter. At search time, the prefix "query: " is needed.

## Search
hits = qdrant.search(
    collection_name="my_books",
    ## "query: " is the prefix for intfloat/multilingual-e5-base
    query_vector=encoder.encode("query: alien invasion").tolist(),
    limit=3
)

print("search results:")
for hit in hits:
    print(hit.payload, "score:", hit.score)

print()

## Narrow down with a filter
hits = qdrant.search(
    collection_name="my_books",
    ## "query: " is the prefix for intfloat/multilingual-e5-base
    query_vector=encoder.encode("query: alien invasion").tolist(),
    query_filter=models.Filter(
        must=[models.FieldCondition(key="year", range=models.Range(gte=2000))]
    ),
    limit=1
)

print("search & filter results:")
for hit in hits:
    print(hit.payload, "score:", hit.score)

print()
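
Since the two prefixes are easy to mix up, one option is to wrap them in small helpers. This is only a sketch: `as_passage` and `as_query` are hypothetical names of my own, and the prefix strings are simply the ones used in the script above.

```python
# Hypothetical helpers; the prefix strings are the ones used in search.py.
PASSAGE_PREFIX = "passage: "
QUERY_PREFIX = "query: "


def as_passage(text):
    """Text that gets stored (indexed) takes the passage prefix."""
    return PASSAGE_PREFIX + text


def as_query(text):
    """Text used as the search query takes the query prefix."""
    return QUERY_PREFIX + text


print(as_passage("A Martian invasion of Earth throws humanity into chaos."))
print(as_query("alien invasion"))
```

With these, the calls in the script would become encoder.encode(as_passage(doc["description"])) when registering and encoder.encode(as_query("alien invasion")) when searching.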

Let's run it.

$ python3 search.py

This is the result.

vector dimention = 768

search results:
{'author': 'Douglas Adams', 'description': 'A comedic science fiction series following the misadventures of an unwitting human and his alien friend.', 'name': "The Hitchhiker's Guide to the Galaxy", 'year': 1979} score: 0.8414118
{'author': 'Liu Cixin', 'description': 'Humans encounter an alien civilization that lives in a dying system.', 'name': 'The Three-Body Problem', 'year': 2008} score: 0.8392728
{'author': 'Michael Crichton', 'description': 'A deadly virus from outer space threatens to wipe out humanity.', 'name': 'The Andromeda Strain', 'year': 1969} score: 0.8389114

search & filter results:
{'author': 'Liu Cixin', 'description': 'Humans encounter an alien civilization that lives in a dying system.', 'name': 'The Three-Body Problem', 'year': 2008} score: 0.8392728

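The vector dimension was 768. Because the collection is recreated on every run with QdrantClient#recreate_collection,
it is recreated with this vector dimension.

As for the search results, "The War of the Worlds", which was 1st with all-MiniLM-L6-v2, has dropped out of the top 3...
"The Hitchhiker's Guide to the Galaxy", which was 2nd, is now 1st instead.

The result with the filter is the same book as before.

Incidentally, when I raised the number of results to 10 as shown below, "The War of the Worlds" was in 4th place.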

## Search
hits = qdrant.search(
    collection_name="my_books",
    ## "query: " is the prefix for intfloat/multilingual-e5-base
    query_vector=encoder.encode("query: alien invasion").tolist(),
    limit=10
)

It feels like the results didn't change all that much, but I really can't say...
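
One way to put a number on that feeling is to look at how far apart the top-3 scores are in each run. The sketch below uses only the scores printed above. The `spread` helper is my own, and the 4th-place score is not included because it isn't shown in this post.

```python
# Scores copied from the two runs above (top 3 for "alien invasion").
mini_lm_scores = [0.5700934, 0.5040469, 0.45902938]
e5_scores = [0.8414118, 0.8392728, 0.8389114]


def spread(scores):
    """Difference between the best and the worst score in a result list."""
    return max(scores) - min(scores)


mini_lm_spread = spread(mini_lm_scores)
e5_spread = spread(e5_scores)
print(f"all-MiniLM-L6-v2 spread: {mini_lm_spread:.4f}")
print(f"multilingual-e5-base spread: {e5_spread:.4f}")
```

The top 3 are about 0.11 apart with all-MiniLM-L6-v2 but only about 0.0025 apart with intfloat/multilingual-e5-base, so with the latter the ordering rests on very small differences. Scores from different models probably shouldn't be compared as absolute values, only within one run.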

Anyway, that's my run through the tutorial.