
Trying "Basic RAG" from the Qdrant Examples

What is this post about?

So far I've been working through the Qdrant tutorials; this time I'd like to take a look at the Examples.

Examples - Qdrant

That said, the only Example I'll look at is "Basic RAG". I'm also thinking of making this the last post where I focus on Qdrant itself.

What this Example aims at

This Example shows how to put together RAG using Qdrant + FastEmbed and OpenAI.

By the way, when you move from the entry listed on the "Examples" page to the actual page, the title changes quite a bit... is that just how it is?

On the "Examples" page it's called "Basic RAG", but the actual page has this title:

Retrieval Augmented Generation (RAG) with OpenAI and Qdrant

Now, although the page is about RAG, surprisingly there is no explanation of RAG itself on it.

Instead, it tells you to look at this page:

Patterns for Building LLM-based Systems & Products / Retrieval-Augmented Generation: To add knowledge

RAG stands for "Retrieval-Augmented Generation", and it's introduced as a way to improve a model's output by fetching data via search outside the model and adding it to the model's input.

In other words, a rough way to think about it is: before handing the input to the LLM, search for related information and add it, in order to get a better answer.
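
As a rough sketch of that flow (my own toy illustration, not code taken from the Example), it boils down to something like this:

# A toy sketch of the RAG flow (my illustration); retrieve() and generate()
# are stand-ins for a real vector search and a real LLM call.
def retrieve(question: str, limit: int = 3) -> list[str]:
    # In a real setup this would be a vector similarity search (e.g. against Qdrant)
    knowledge = ["Qdrant is a vector database.", "FastAPI is a web framework."]
    return knowledge[:limit]

def generate(prompt: str) -> str:
    # In a real setup this would be a chat completion request to an LLM
    return f"(answer generated from a prompt of {len(prompt)} characters)"

def rag_answer(question: str) -> str:
    context = "\n".join(retrieve(question))                           # 1. retrieve external knowledge
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"  # 2. add it to the prompt as context
    return generate(prompt)                                           # 3. generate with the augmented prompt

print(rag_answer("What tools should I use for vector search?"))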

The advantages of RAG are as follows:

  • Using externally retrieved context reduces hallucinations
  • It costs less than fine-tuning the LLM (keeping a search index up to date is cheaper)
    • It also makes it easier to get at up-to-date data
  • Updating or removing biased or toxic documents is also easier (than, say, fine-tuning the LLM)

There's even a paper.

[2005.11401] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

The data used is all contained within the Example page itself, so I'll simply follow along here.

Also, instead of using OpenAI, I'll substitute llama-cpp-python.

Environment

Here's the environment for this post. Qdrant is assumed to be running at 172.17.0.2.

$ ./qdrant --version
qdrant 1.9.0

The Qdrant Web UI is version 0.1.25.

The Python environment that will use the Qdrant client.

$ python3 --version
Python 3.10.12


$ pip3 --version
pip 22.0.2 from /usr/lib/python3/dist-packages/pip (python 3.10)

llama-cpp-python+Llama 3

First, I'll start a server to stand in for OpenAI. This time I'll use llama-cpp-python.

First, install it.

$ pip3 install llama-cpp-python[server]

Versions.

$ pip3 list
Package           Version
----------------- -------
annotated-types   0.6.0
anyio             4.3.0
click             8.1.7
diskcache         5.6.3
exceptiongroup    1.2.1
fastapi           0.110.2
h11               0.14.0
idna              3.7
Jinja2            3.1.3
llama_cpp_python  0.2.65
MarkupSafe        2.1.5
numpy             1.26.4
pip               22.0.2
pydantic          2.7.1
pydantic_core     2.18.2
pydantic-settings 2.2.1
python-dotenv     1.0.1
PyYAML            6.0.1
setuptools        59.6.0
sniffio           1.3.1
sse-starlette     2.1.0
starlette         0.37.2
starlette-context 0.3.6
typing_extensions 4.11.0
uvicorn           0.29.0

For the model, I'll use Llama 3.

QuantFactory/Meta-Llama-3-8B-Instruct-GGUF · Hugging Face

Download the quantized model.

$ curl -L https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf?download=true -o Meta-Llama-3-8B-Instruct.Q4_K_M.gguf

Start it up.

$ python3 -m llama_cpp.server --model Meta-Llama-3-8B-Instruct.Q4_K_M.gguf --chat_format llama-3

With that, the OpenAI-substitute server is ready.
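
As a quick sanity check (my own addition, not part of the Example), the OpenAI-compatible API also exposes /v1/models, which can be queried with nothing but the standard library:

# Minimal sanity check against the llama-cpp-python server (my addition)
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/v1/models") as res:
    print(json.dumps(json.load(res), indent=2))  # should list the loaded GGUF model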

Trying "Basic RAG" from the Qdrant Examples

Now, let's try the "Basic RAG" Example from Qdrant.

Retrieval Augmented Generation (RAG) with OpenAI and Qdrant

First, install the libraries.

$ pip3 install qdrant-client fastembed openai

The installed libraries and their versions.

$ pip3 list
Package            Version
------------------ --------
annotated-types    0.6.0
anyio              4.3.0
certifi            2024.2.2
charset-normalizer 3.3.2
coloredlogs        15.0.1
distro             1.9.0
exceptiongroup     1.2.1
fastembed          0.2.6
filelock           3.13.4
flatbuffers        24.3.25
fsspec             2024.3.1
grpcio             1.62.2
grpcio-tools       1.62.2
h11                0.14.0
h2                 4.1.0
hpack              4.0.0
httpcore           1.0.5
httpx              0.27.0
huggingface-hub    0.20.3
humanfriendly      10.0
hyperframe         6.0.1
idna               3.7
loguru             0.7.2
mpmath             1.3.0
numpy              1.26.4
onnx               1.16.0
onnxruntime        1.17.3
openai             1.23.6
packaging          24.0
pip                22.0.2
portalocker        2.8.2
protobuf           4.25.3
pydantic           2.7.1
pydantic_core      2.18.2
PyYAML             6.0.1
qdrant-client      1.9.0
requests           2.31.0
setuptools         59.6.0
sniffio            1.3.1
sympy              1.12
tokenizers         0.15.2
tqdm               4.66.2
typing_extensions  4.11.0
urllib3            2.2.1

FastEmbed is a library for generating text embeddings. When used together with Qdrant, it automatically creates the text embeddings as part of the Qdrant operations.
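
Although the Example leaves FastEmbed entirely to the Qdrant client, a standalone sketch of what it does might look like the following (my own illustration; it assumes the default embedding model that fastembed downloads on first use):

# Standalone FastEmbed usage (my illustration, not part of the Example)
from fastembed import TextEmbedding

model = TextEmbedding()  # default model (BAAI/bge-small-en-v1.5 at the time of writing)
embeddings = list(model.embed(["Qdrant is a vector database."]))
print(len(embeddings), len(embeddings[0]))  # should print 1 and 384 for the default model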

Following the documentation, here's the program I put together.

rag.py

from openai import OpenAI
from qdrant_client import QdrantClient

qclient = QdrantClient("http://172.17.0.2:6333", prefer_grpc=True)

qclient.delete_collection(collection_name="knowledge-base")
print(f"get_collection = {qclient.get_collections()}")

qclient.add(
    collection_name="knowledge-base",
    documents=[
        "Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!",
        "Docker helps developers build, share, and run applications anywhere — without tedious environment configuration or management.",
        "PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing.",
        "MySQL is an open-source relational database management system (RDBMS). A relational database organizes data into one or more data tables in which data may be related to each other; these relations help structure the data. SQL is a language that programmers use to create, modify and extract data from the relational database, as well as control user access to the database.",
        "NGINX is a free, open-source, high-performance HTTP server and reverse proxy, as well as an IMAP/POP3 proxy server. NGINX is known for its high performance, stability, rich feature set, simple configuration, and low resource consumption.",
        "FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints.",
        "SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-similarity to find sentences with a similar meaning. This can be useful for semantic textual similar, semantic search, or paraphrase mining.",
        "The cron command-line utility is a job scheduler on Unix-like operating systems. Users who set up and maintain software environments use cron to schedule jobs (commands or shell scripts), also known as cron jobs, to run periodically at fixed times, dates, or intervals.",
    ]
)

print(f"get_collection = {qclient.get_collections()}")

openai_client = OpenAI(base_url = "http://localhost:8000/v1", api_key = "dummy-api-key")

prompt = """
What tools should I need to use to build a web service using vector embeddings for search?
"""

completion = openai_client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": prompt},
    ]
)

print()

print(f"message = {completion.choices[0].message.content}")

query_results = qclient.query(
    collection_name="knowledge-base",
    query_text=prompt,
    limit=3,
)

print()

print(f"query results = {query_results}")

context = "\n".join(r.document for r in query_results)

print()

print(f"context = {context}")

metaprompt = f"""
You are a software architect.
Answer the following question using the provided context.
If you can't find the answer, do not pretend you know it, but answer "I don't know".

Question: {prompt.strip()}

Context:
{context.strip()}

Answer:
"""

print()

print(f"meta prompt = {metaprompt}")

completion = openai_client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": metaprompt},
    ]
)

print()

print(f"message = {completion.choices[0].message.content}")

Run it.

$ python3 rag.py

I'll go through the source code in order, showing the output from running it.

First, create a client for operating on Qdrant.

qclient = QdrantClient("http://172.17.0.2:6333", prefer_grpc=True)

qclient.delete_collection(collection_name="knowledge-base")
print(f"get_collection = {qclient.get_collections()}")

The collection is named "knowledge-base", and it's deleted at the start.

So at this point the print shows that there are no collections.

get_collection = collections=[]

Register the data. The collection is created automatically, and the text embeddings are also generated automatically.

qclient.add(
    collection_name="knowledge-base",
    documents=[
        "Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!",
        "Docker helps developers build, share, and run applications anywhere — without tedious environment configuration or management.",
        "PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing.",
        "MySQL is an open-source relational database management system (RDBMS). A relational database organizes data into one or more data tables in which data may be related to each other; these relations help structure the data. SQL is a language that programmers use to create, modify and extract data from the relational database, as well as control user access to the database.",
        "NGINX is a free, open-source, high-performance HTTP server and reverse proxy, as well as an IMAP/POP3 proxy server. NGINX is known for its high performance, stability, rich feature set, simple configuration, and low resource consumption.",
        "FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints.",
        "SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-similarity to find sentences with a similar meaning. This can be useful for semantic textual similar, semantic search, or paraphrase mining.",
        "The cron command-line utility is a job scheduler on Unix-like operating systems. Users who set up and maintain software environments use cron to schedule jobs (commands or shell scripts), also known as cron jobs, to run periodically at fixed times, dates, or intervals.",
    ]
)

print(f"get_collection = {qclient.get_collections()}")

The collection now appears in the result of QdrantClient#get_collections.

get_collection = collections=[CollectionDescription(name='knowledge-base')]

Configure the OpenAI library to use the server started with llama-cpp-python.

openai_client = OpenAI(base_url = "http://localhost:8000/v1", api_key = "dummy-api-key")

Let's ask: "What tools should I need to use to build a web service using vector embeddings for search?"

prompt = """
What tools should I need to use to build a web service using vector embeddings for search?
"""

completion = openai_client.chat.completions.create(
    model="gpt-3.5-turbo", 
    messages=[
        {"role": "user", "content": prompt},
    ]
)
 
print()
 
print(f"message = {completion.choices[0].message.content}")

A lot of information comes back: programming languages, libraries and frameworks for vector embeddings, search engines, and so on.

message = To build a web service that uses vector embeddings for search, you'll likely need the following tools:

**Programming languages and frameworks:**

1. **Python**: A popular choice for natural language processing (NLP) tasks, including building vector embedding-based search services.
2. **Java**: Alternatively, you can use Java as your programming language of choice, especially if you're already familiar with it or prefer its ecosystem.

**Vector embedding libraries and frameworks:**

1. **Gensim**: A Python library for topic modeling and document similarity analysis using word embeddings (e.g., Word2Vec, Doc2Vec).
2. **TensorFlow** or **PyTorch**: Deep learning frameworks that can be used to train your own custom vector embedding models.
3. **Hugging Face Transformers**: A library providing pre-trained language models and a simple interface for building text-to-vector embeddings.

**Search and indexing tools:**

1. **Elasticsearch**: A popular search engine that supports vector-based queries and indexing.
2. **Apache Solr**: Another powerful search engine that can be used for vector-based search and indexing.
3. **Inverted indexes**: You can also build your own inverted indexes using libraries like Python's `numpy` or `scipy`.

**Other dependencies:**

1. **Numpy**: A library for efficient numerical computation, essential for many NLP tasks.
2. **Pandas**: A library for data manipulation and analysis, useful for preprocessing and handling large datasets.
3. **Scikit-learn**: A machine learning library that provides algorithms for classification, regression, clustering, and more.

**Optional tools:**

1. **Distributed computing frameworks**: If you plan to scale your service to handle large volumes of data or high query loads, consider using distributed computing frameworks like Apache Spark, Hadoop, or Dask.
2. **Cloud services**: Consider deploying your service on cloud platforms like AWS, Google Cloud, or Microsoft Azure, which provide scalable infrastructure and managed services for search and indexing.

**Development environments:**

1. **Jupyter Notebook**: A web-based interactive environment for exploring data, prototyping, and developing your vector embedding-based search service.
2. **Integrated Development Environments (IDEs)**: Choose an IDE like PyCharm, Visual Studio Code, or IntelliJ IDEA to write, debug, and optimize your code.

Keep in mind that the specific tools you choose will depend on your project requirements, performance needs, and personal preferences.

The point of this exercise seems to be to get a more focused answer.

Now, let's run the same question as a search against Qdrant.

query_results = qclient.query(
    collection_name="knowledge-base",
    query_text=prompt,
    limit=3,
)

print()

print(f"query results = {query_results}")

As a result, we get information about Qdrant, FastAPI, and PyTorch. Incidentally, in the original document it's SentenceTransformers, not PyTorch, that comes back...

query results = [QueryResponse(id='55bc59a8-5ab2-47b8-bb9a-449f84373e33', embedding=None, sparse_embedding=None, metadata={'document': 'Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!'}, document='Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!', score=0.8290700316429138), QueryResponse(id='d5554610-fd63-4304-a705-378701215c11', embedding=None, sparse_embedding=None, metadata={'document': 'FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints.'}, document='FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints.', score=0.8190128803253174), QueryResponse(id='385c9690-821d-46a5-8ffb-54e0286098aa', embedding=None, sparse_embedding=None, metadata={'document': 'PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing.'}, document='PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing.', score=0.8056522607803345)]

These are joined together to form the context.

context = "\n".join(r.document for r in query_results)

print()

print(f"context = {context}")

The result.

context = Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!
FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints.
PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing.

Now combine this context with the earlier prompt.

metaprompt = f"""
You are a software architect. 
Answer the following question using the provided context. 
If you can't find the answer, do not pretend you know it, but answer "I don't know".

Question: {prompt.strip()}

Context: 
{context.strip()}

Answer:
"""

print()

print(f"meta prompt = {metaprompt}")

The resulting prompt looks like this.

meta prompt =
You are a software architect.
Answer the following question using the provided context.
If you can't find the answer, do not pretend you know it, but answer "I don't know".

Question: What tools should I need to use to build a web service using vector embeddings for search?

Context:
Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!
FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints.
PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing.

Answer:

In addition to the earlier question, we tell the model that it is "a software architect" and add the search results from Qdrant.

Then we ask with this prompt.

completion = openai_client.chat.completions.create(
    model="gpt-3.5-turbo", 
    messages=[
        {"role": "user", "content": metaprompt},
    ]
)
 
print()
 
print(f"message = {completion.choices[0].message.content}")

The result.

message = Based on the context provided, I would recommend using the following tools to build a web service using vector embeddings for search:

1. Qdrant: As mentioned in the context, Qdrant is a vector database & vector similarity search engine that can be used as an API service.
2. FastAPI: With its high-performance capabilities and ease of use, I recommend using FastAPI as the web framework to build the API.
3. PyTorch: Since you want to use vector embeddings for search, PyTorch can be used for training neural network encoders that generate these embeddings.

Additionally, you may also need:

* A Python IDE or code editor (e.g., PyCharm, VSCode) to write and debug your code.
* A library like scikit-learn or TensorFlow for preprocessing and processing the data.
* A database to store the vector embeddings. Qdrant provides its own database, but you can also use other databases like MySQL or MongoDB.

Please note that this is just a suggested approach and may vary depending on the specific requirements of your project.

The answer is now much more to the point.

Beyond that, there's just some extra information tacked on.

Writing and running this myself gave me a feel for the basic shape of RAG.

Incidentally, this program takes close to 10 minutes to finish in my environment...

Conclusion

I tried the "Basic RAG" Example from Qdrant.

I had a vague idea of what RAG was, but this was my first time running it myself, so it was a good opportunity.
To do it more seriously you'd probably use LangChain and so on, but I figured starting from the basics was the way to go.

Also, this is the last post tracing through Qdrant's tutorials and Examples. It's been a good learning experience.

Trying Meta's Llama 3 with llama-cpp-python and LocalAI, which provide OpenAI API-compatible servers

What is this post about?

Meta has released Llama 3.

Meta、無料で商用可の新LLM「Llama 3」、ほぼすべてのクラウドでアクセス可能に - ITmedia NEWS

Since it looks like Llama 3 can be run with llama-cpp-python and LocalAI, both of which provide OpenAI API-compatible servers, I decided to give it a try.

Llama 3

Llama 3 is an LLM published by Meta.

Meta Llama 3

Introducing Meta Llama 3: The most capable openly available LLM to date

It comes in two parameter sizes, 8B and 70B, each with a base model and an instruction-tuned model.

I'd like to use this model with llama-cpp-python and LocalAI.

First, llama.cpp already supports it.

Added llama-3 chat template by DifferentialityDevelopment · Pull Request #6751 · ggerganov/llama.cpp · GitHub

llama-cpp-python supports it as well.

Add Llama-3 chat format by andreabak · Pull Request #1371 · abetlen/llama-cpp-python · GitHub

For LocalAI, it looks like using a template should do the trick.

How to run llama3? · mudler LocalAI · Discussion #2076 · GitHub

So, let's try it.

The original models are these, but

this time I'll use this GGUF-format, already-quantized model.

QuantFactory/Meta-Llama-3-8B-Instruct-GGUF · Hugging Face

Environment

Here's the environment for this post.

$ python3 --version
Python 3.10.12


$ pip3 --version
pip 22.0.2 from /usr/lib/python3/dist-packages/pip (python 3.10)

Downloading the model

Download the model from here.

QuantFactory/Meta-Llama-3-8B-Instruct-GGUF · Hugging Face

The model is around 5GB.

$ curl -L https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf?download=true -o Meta-Llama-3-8B-Instruct.Q4_K_M.gguf


$ ll -h Meta-Llama-3-8B-Instruct.Q4_K_M.gguf
-rw-rw-r-- 1 xxxxx xxxxx 4.6G  4月 25 00:15 Meta-Llama-3-8B-Instruct.Q4_K_M.gguf

Trying it with llama-cpp-python

First, let's try llama-cpp-python.

Install it.

$ pip3 install llama-cpp-python[server]

Versions, including dependencies.

$ pip3 list
Package           Version
----------------- -------
annotated-types   0.6.0
anyio             4.3.0
click             8.1.7
diskcache         5.6.3
exceptiongroup    1.2.1
fastapi           0.110.2
h11               0.14.0
idna              3.7
Jinja2            3.1.3
llama_cpp_python  0.2.64
MarkupSafe        2.1.5
numpy             1.26.4
pip               22.0.2
pydantic          2.7.1
pydantic_core     2.18.2
pydantic-settings 2.2.1
python-dotenv     1.0.1
PyYAML            6.0.1
setuptools        59.6.0
sniffio           1.3.1
sse-starlette     2.1.0
starlette         0.37.2
starlette-context 0.3.6
typing_extensions 4.11.0
uvicorn           0.29.0

Start it up. The --chat_format llama-3 option is required.

$ python3 -m llama_cpp.server --model Meta-Llama-3-8B-Instruct.Q4_K_M.gguf --chat_format llama-3

Let's try it out. I'll ask it to introduce itself.

$ time curl -s -XPOST -H 'Content-Type: application/json' localhost:8000/v1/chat/completions -d \
    '{"messages": [{"role": "user", "content": "Could you introduce yourself?"}]}' | jq
{
  "id": "chatcmpl-ff1221b5-5555-4a32-9c1b-c3c2818efc02",
  "object": "chat.completion",
  "created": 1713972674,
  "model": "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "I'd be happy to introduce myself.\n\nI am LLaMA, an AI assistant developed by Meta AI that can understand and respond to human input in a conversational manner. I'm not a human, but rather a computer program designed to simulate conversation and answer questions to the best of my ability based on the knowledge and data I've been trained on.\n\nI'm constantly learning and improving my responses based on user interactions, so please bear with me if I don't always get it right at first. My goal is to assist and provide helpful information to those who interact with me, while also making our conversation as engaging and natural as possible.\n\nWhat would you like to talk about or ask?",
        "role": "assistant"
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 138,
    "total_tokens": 154
  }
}

real    1m9.754s
user    0m0.051s
sys     0m0.003s

I also tried it in Japanese, but on top of the execution time getting considerably longer, the reply did not come back in Japanese...

$ time curl -s -XPOST -H 'Content-Type: application/json' localhost:8000/v1/chat/completions -d \
    '{"messages": [{"role": "user", "content": "あなたの自己紹介をしてください"}]}' | jq
{
  "id": "chatcmpl-cd219a91-a85c-4ceb-ae8c-4b49a28cd881",
  "object": "chat.completion",
  "created": 1713972772,
  "model": "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "Nice to meet you! 😊\n\nMy name is LLaMA, and I'm a large language model AI trained by Meta AI that can understand and respond to human input in a conversational manner. My primary function is to assist users with information queries, provide helpful responses, and even engage in creative conversations.\n\nHere are some interesting facts about me:\n\n1. **Language skills**: I'm fluent in multiple languages, including English, Japanese, Spanish, French, German, Italian, Chinese, and many more! 🌎\n2. **Knowledge base**: My training data consists of a massive corpus of text from the internet, which allows me to provide accurate answers to a wide range of questions.\n3. **Conversational abilities**: I can understand natural language processing (NLP) and respond accordingly, making it feel like you're having a conversation with a human! 💬\n4. **Creative capabilities**: I can generate text, poetry, stories, dialogues, and even entire scripts!\n5. **Continuous learning**: My training is ongoing, so I'm always improving my understanding of language and updating my knowledge base.\n\nI'm here to help answer your questions, provide information, or simply chat about any topic you're interested in! What would you like to talk about? 🤔",
        "role": "assistant"
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 19,
    "completion_tokens": 262,
    "total_tokens": 281
  }
}

real    2m42.950s
user    0m0.042s
sys     0m0.018s

It does seem to understand the meaning, though.

I'll stick to using Llama 3 in English.

By the way, --chat_format llama-3 refers to what's used here:

https://github.com/abetlen/llama-cpp-python/blob/v0.2.64/llama_cpp/llama_chat_format.py#L929-L946

As for these tokens,

    _roles = dict(
        system="<|start_header_id|>system<|end_header_id|>\n\n",
        user="<|start_header_id|>user<|end_header_id|>\n\n",
        assistant="<|start_header_id|>assistant<|end_header_id|>\n\n",
    )
    _begin_token = "<|begin_of_text|>"
    _sep = "<|eot_id|>"

they're documented here:

Meta Llama 3 | Model Cards and Prompt formats
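
Putting those tokens together, a rendered single-turn prompt should look roughly like this (my reconstruction from the tokens above and that page, not output captured from llama-cpp-python):

# Rough reconstruction of a rendered single-turn Llama 3 prompt (my illustration)
prompt = (
    "<|begin_of_text|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "Could you introduce yourself?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)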

Trying it with LocalAI

Next, let's try LocalAI.

Download it.

$ curl -LO https://github.com/mudler/LocalAI/releases/download/v2.12.4/local-ai-avx2-Linux-x86_64
$ chmod a+x local-ai-avx2-Linux-x86_64
$ ./local-ai-avx2-Linux-x86_64 --version
LocalAI version v2.12.4 (0004ec8be3ca150ce6d8b79f2991bfe3a9dc65ad)

Place the quantized Llama 3 model in the models directory.

$ tree models
models
└── Meta-Llama-3-8B-Instruct.Q4_K_M.gguf

0 directories, 1 file

Prepare a configuration file.

local-ai-config.yaml

- name: llama-3-8b-instruct
  backend: llama-cpp
  mmap: true
  context_size: 8192
  f16: true
  stopwords:
    - <|im_end|>
    - <dummy32000>
    - "<|eot_id|>"
  parameters:
    model: Meta-Llama-3-8B-Instruct.Q4_K_M.gguf
  template:
    chat_message: |
      <|start_header_id|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "tool"}}tool{{else if eq .RoleName "user"}}user{{end}}<|end_header_id|>

      {{ if .FunctionCall -}}
      Function call:
      {{ else if eq .RoleName "tool" -}}
      Function response:
      {{ end -}}
      {{ if .Content -}}
      {{.Content -}}
      {{ else if .FunctionCall -}}
      {{ toJson .FunctionCall -}}
      {{ end -}}
      <|eot_id|>
    function: |
      <|start_header_id|>system<|end_header_id|>

      You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. Here are the available tools:
      <tools>
      {{range .Functions}}
      {'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
      {{end}}
      </tools>
      Use the following pydantic model json schema for each tool call you will make:
        {'title': 'FunctionCall', 'type': 'object', 'properties': {'arguments': {'title': 'Arguments', 'type': 'object'}, 'name': {'title': 'Name', 'type': 'string'}}, 'required': ['arguments', 'name']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
        Function call:
    chat: |
      <|begin_of_text|>{{.Input }}
      <|start_header_id|>assistant<|end_header_id|>
    completion: |
      {{.Input}}
    usage: |
      curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
          "model": "llama3-8b-instruct",
          "messages": [{"role": "user", "content": "How are you doing?", "temperature": 0.1}]
      }'

The main content is a template for Llama 3, put together with reference to the following:

How to run llama3? · mudler LocalAI · Discussion #2076 · GitHub

models(llama3): add llama3 to embedded models by mudler · Pull Request #2074 · mudler/LocalAI · GitHub

https://github.com/mudler/LocalAI/blob/48d0aa2f6da0b1c039fa062e61facf5e6191420e/embedded/models/llama3-instruct.yaml

Start it up.

$ ./local-ai-avx2-Linux-x86_64 --config-file local-ai-config.yaml --models-path models --threads 4

Check it.

$ time curl -s -XPOST -H 'Content-Type: application/json' localhost:8080/v1/chat/completions -d \
    '{"model": "llama-3-8b-instruct", "messages": [{"role": "user", "content": "Could you introduce yourself?"}]}' | jq
{
  "created": 1714056935,
  "object": "chat.completion",
  "id": "059aff63-0d6a-4d29-ba9c-02b4a467f03a",
  "model": "llama-3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "I'd be happy to introduce myself!\n\nI'm LLaMA, an AI assistant developed by Meta AI that can understand and respond to human input in a conversational manner. I'm not a human, but I'm designed to simulate conversation and answer questions to the best of my ability. I can provide information on a wide range of topics, and I'm constantly learning and improving my responses.\n\nI don't have personal experiences or emotions like humans do, but I'm here to help you with any questions or topics you'd like to discuss. I'm happy to chat and provide information to the best of my ability."
      }
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 0
  }
}

real    0m46.588s
user    0m0.043s
sys     0m0.006s

The initial model load took about 3 minutes, though...

11:55PM INF Trying to load the model 'Meta-Llama-3-8B-Instruct.Q4_K_M.gguf' with all the available backends: llama-cpp, llama-ggml, gpt4all, bert-embeddings, rwkv, whisper, stablediffusion, tinydream, piper
11:55PM INF [llama-cpp] Attempting to load
11:55PM INF Loading model 'Meta-Llama-3-8B-Instruct.Q4_K_M.gguf' with backend llama-cpp
11:58PM INF [llama-cpp] Loads OK

I'll skip the Japanese check here. I did try it, but it was slow and, as before, the reply came back in English...

That's about it.

Conclusion

I tried Meta's LLM, Llama 3, with llama-cpp-python and LocalAI.

The smallest model, at 8B, is a bit larger than Llama 2's, but it worked pretty much without fuss, which was more or less what I expected, and that's good.