
Trying Meta's Llama 3 with llama-cpp-python and LocalAI, both of which provide OpenAI API-compatible servers

What is this article about?

Meta has released Llama 3.

Meta's new free, commercially usable LLM "Llama 3" now accessible on almost every cloud - ITmedia NEWS

Since it looked like Llama 3 could run on llama-cpp-python and LocalAI, both of which provide OpenAI API-compatible servers, I decided to give it a try.

Llama 3

Llama 3 is an LLM published by Meta.

Meta Llama 3

Introducing Meta Llama 3: The most capable openly available LLM to date

It comes in two parameter sizes, 8B and 70B, each with a base model and an instruction-tuned model.

I'd like to use this model with llama-cpp-python and LocalAI.

First, llama.cpp already supports it:

Added llama-3 chat template by DifferentialityDevelopment · Pull Request #6751 · ggerganov/llama.cpp · GitHub

llama-cpp-python supports it as well:

Add Llama-3 chat format by andreabak · Pull Request #1371 · abetlen/llama-cpp-python · GitHub
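Incidentally, with that support merged, the chat format can also be used in-process, without going through the server. Here's a minimal sketch, assuming the quantized GGUF model downloaded later in this article is in the current directory:

from llama_cpp import Llama

# Load the quantized model with the Llama 3 chat format
llm = Llama(
    model_path="Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
    chat_format="llama-3",
    n_ctx=8192,
)

# Request a chat completion; the response follows the OpenAI API shape
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Could you introduce yourself?"}]
)
print(result["choices"][0]["message"]["content"])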

As for LocalAI, it looks like using a template will do the trick:

How to run llama3? · mudler LocalAI · Discussion #2076 · GitHub

So, let's try it out.

The original models are also published, but this time I'll use this GGUF-format, quantized model:

QuantFactory/Meta-Llama-3-8B-Instruct-GGUF · Hugging Face

Environment

Here is the environment for this article:

$ python3 --version
Python 3.10.12


$ pip3 --version
pip 22.0.2 from /usr/lib/python3/dist-packages/pip (python 3.10)

Downloading the model

Download the model from here:

QuantFactory/Meta-Llama-3-8B-Instruct-GGUF · Hugging Face

The model is about 5GB.

$ curl -L https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf?download=true -o Meta-Llama-3-8B-Instruct.Q4_K_M.gguf


$ ll -h Meta-Llama-3-8B-Instruct.Q4_K_M.gguf
-rw-rw-r-- 1 xxxxx xxxxx 4.6G  4月 25 00:15 Meta-Llama-3-8B-Instruct.Q4_K_M.gguf
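Alternatively, the same file can be fetched with the huggingface_hub library. A sketch, assuming pip3 install huggingface_hub:

from huggingface_hub import hf_hub_download

# Downloads into the local Hugging Face cache and returns the path
path = hf_hub_download(
    repo_id="QuantFactory/Meta-Llama-3-8B-Instruct-GGUF",
    filename="Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
)
print(path)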

Trying it with llama-cpp-python

First, let's try llama-cpp-python.

Install:

$ pip3 install llama-cpp-python[server]

Versions, including dependencies:

$ pip3 list
Package           Version
----------------- -------
annotated-types   0.6.0
anyio             4.3.0
click             8.1.7
diskcache         5.6.3
exceptiongroup    1.2.1
fastapi           0.110.2
h11               0.14.0
idna              3.7
Jinja2            3.1.3
llama_cpp_python  0.2.64
MarkupSafe        2.1.5
numpy             1.26.4
pip               22.0.2
pydantic          2.7.1
pydantic_core     2.18.2
pydantic-settings 2.2.1
python-dotenv     1.0.1
PyYAML            6.0.1
setuptools        59.6.0
sniffio           1.3.1
sse-starlette     2.1.0
starlette         0.37.2
starlette-context 0.3.6
typing_extensions 4.11.0
uvicorn           0.29.0

Start it up. The --chat_format llama-3 option is required.

$ python3 -m llama_cpp.server --model Meta-Llama-3-8B-Instruct.Q4_K_M.gguf --chat_format llama-3
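Since llama-cpp-python exposes an OpenAI API-compatible server, it can also be called with the official openai Python client rather than curl. A minimal sketch, assuming pip3 install openai and the server listening on localhost:8000 (the API key can be any placeholder string, since the server doesn't check it by default):

from openai import OpenAI

# Point the client at the local llama-cpp-python server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
    messages=[{"role": "user", "content": "Could you introduce yourself?"}],
)
print(response.choices[0].message.content)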

Let's try it with curl, asking the model to introduce itself.

$ time curl -s -XPOST -H 'Content-Type: application/json' localhost:8000/v1/chat/completions -d \
    '{"messages": [{"role": "user", "content": "Could you introduce yourself?"}]}' | jq
{
  "id": "chatcmpl-ff1221b5-5555-4a32-9c1b-c3c2818efc02",
  "object": "chat.completion",
  "created": 1713972674,
  "model": "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "I'd be happy to introduce myself.\n\nI am LLaMA, an AI assistant developed by Meta AI that can understand and respond to human input in a conversational manner. I'm not a human, but rather a computer program designed to simulate conversation and answer questions to the best of my ability based on the knowledge and data I've been trained on.\n\nI'm constantly learning and improving my responses based on user interactions, so please bear with me if I don't always get it right at first. My goal is to assist and provide helpful information to those who interact with me, while also making our conversation as engaging and natural as possible.\n\nWhat would you like to talk about or ask?",
        "role": "assistant"
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 138,
    "total_tokens": 154
  }
}

real    1m9.754s
user    0m0.051s
sys     0m0.003s

I tried Japanese as well, but in addition to the execution time growing considerably, the answer didn't come back in Japanese...

$ time curl -s -XPOST -H 'Content-Type: application/json' localhost:8000/v1/chat/completions -d \
    '{"messages": [{"role": "user", "content": "あなたの自己紹介をしてください"}]}' | jq
{
  "id": "chatcmpl-cd219a91-a85c-4ceb-ae8c-4b49a28cd881",
  "object": "chat.completion",
  "created": 1713972772,
  "model": "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "Nice to meet you! 😊\n\nMy name is LLaMA, and I'm a large language model AI trained by Meta AI that can understand and respond to human input in a conversational manner. My primary function is to assist users with information queries, provide helpful responses, and even engage in creative conversations.\n\nHere are some interesting facts about me:\n\n1. **Language skills**: I'm fluent in multiple languages, including English, Japanese, Spanish, French, German, Italian, Chinese, and many more! 🌎\n2. **Knowledge base**: My training data consists of a massive corpus of text from the internet, which allows me to provide accurate answers to a wide range of questions.\n3. **Conversational abilities**: I can understand natural language processing (NLP) and respond accordingly, making it feel like you're having a conversation with a human! 💬\n4. **Creative capabilities**: I can generate text, poetry, stories, dialogues, and even entire scripts!\n5. **Continuous learning**: My training is ongoing, so I'm always improving my understanding of language and updating my knowledge base.\n\nI'm here to help answer your questions, provide information, or simply chat about any topic you're interested in! What would you like to talk about? 🤔",
        "role": "assistant"
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 19,
    "completion_tokens": 262,
    "total_tokens": 281
  }
}

real    2m42.950s
user    0m0.042s
sys     0m0.018s

It does seem to get the meaning across, though.

Let's use Llama 3 in English.

By the way, --chat_format llama-3 is implemented by the following code:

https://github.com/abetlen/llama-cpp-python/blob/v0.2.64/llama_cpp/llama_chat_format.py#L929-L946

It uses these tokens:

    _roles = dict(
        system="<|start_header_id|>system<|end_header_id|>\n\n",
        user="<|start_header_id|>user<|end_header_id|>\n\n",
        assistant="<|start_header_id|>assistant<|end_header_id|>\n\n",
    )
    _begin_token = "<|begin_of_text|>"
    _sep = "<|eot_id|>"

They are documented here:

Meta Llama 3 | Model Cards and Prompt formats
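To make the format concrete, here's an illustrative sketch of how a messages list would be rendered into a Llama 3 prompt using these tokens. This is just for illustration; llama-cpp-python does this internally:

# Illustrative only: render a messages list into the Llama 3 prompt format
def render_llama3_prompt(messages):
    prompt = "<|begin_of_text|>"
    for message in messages:
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += message["content"] + "<|eot_id|>"
    # Leave the assistant header open so the model generates the reply
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

print(render_llama3_prompt([{"role": "user", "content": "Could you introduce yourself?"}]))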

Trying it with LocalAI

Next, let's try LocalAI.

Download:

$ curl -LO https://github.com/mudler/LocalAI/releases/download/v2.12.4/local-ai-avx2-Linux-x86_64
$ chmod a+x local-ai-avx2-Linux-x86_64
$ ./local-ai-avx2-Linux-x86_64 --version
LocalAI version v2.12.4 (0004ec8be3ca150ce6d8b79f2991bfe3a9dc65ad)

Place the quantized Llama 3 model in the models directory.

$ tree models
models
└── Meta-Llama-3-8B-Instruct.Q4_K_M.gguf

0 directories, 1 file

Prepare a configuration file.

local-ai-config.yaml

- name: llama-3-8b-instruct
  backend: llama-cpp
  mmap: true
  context_size: 8192
  f16: true
  stopwords:
    - <|im_end|>
    - <dummy32000>
    - "<|eot_id|>"
  parameters:
    model: Meta-Llama-3-8B-Instruct.Q4_K_M.gguf
  template:
    chat_message: |
      <|start_header_id|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "tool"}}tool{{else if eq .RoleName "user"}}user{{end}}<|end_header_id|>

      {{ if .FunctionCall -}}
      Function call:
      {{ else if eq .RoleName "tool" -}}
      Function response:
      {{ end -}}
      {{ if .Content -}}
      {{.Content -}}
      {{ else if .FunctionCall -}}
      {{ toJson .FunctionCall -}}
      {{ end -}}
      <|eot_id|>
    function: |
      <|start_header_id|>system<|end_header_id|>

      You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. Here are the available tools:
      <tools>
      {{range .Functions}}
      {'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
      {{end}}
      </tools>
      Use the following pydantic model json schema for each tool call you will make:
        {'title': 'FunctionCall', 'type': 'object', 'properties': {'arguments': {'title': 'Arguments', 'type': 'object'}, 'name': {'title': 'Name', 'type': 'string'}}, 'required': ['arguments', 'name']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
        Function call:
    chat: |
      <|begin_of_text|>{{.Input }}
      <|start_header_id|>assistant<|end_header_id|>
    completion: |
      {{.Input}}
    usage: |
      curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
          "model": "llama3-8b-instruct",
          "messages": [{"role": "user", "content": "How are you doing?", "temperature": 0.1}]
      }'

The heart of this file is the prompt template for Llama 3; note that <|eot_id|> is listed in the stopwords so that generation stops at Llama 3's end-of-turn token. I put it together with reference to the following:

How to run llama3? · mudler LocalAI · Discussion #2076 · GitHub

models(llama3): add llama3 to embedded models by mudler · Pull Request #2074 · mudler/LocalAI · GitHub

https://github.com/mudler/LocalAI/blob/48d0aa2f6da0b1c039fa062e61facf5e6191420e/embedded/models/llama3-instruct.yaml

Start it up.

$ ./local-ai-avx2-Linux-x86_64 --config-file local-ai-config.yaml --models-path models --threads 4

Check:

$ time curl -s -XPOST -H 'Content-Type: application/json' localhost:8080/v1/chat/completions -d \
    '{"model": "llama-3-8b-instruct", "messages": [{"role": "user", "content": "Could you introduce yourself?"}]}' | jq
{
  "created": 1714056935,
  "object": "chat.completion",
  "id": "059aff63-0d6a-4d29-ba9c-02b4a467f03a",
  "model": "llama-3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "I'd be happy to introduce myself!\n\nI'm LLaMA, an AI assistant developed by Meta AI that can understand and respond to human input in a conversational manner. I'm not a human, but I'm designed to simulate conversation and answer questions to the best of my ability. I can provide information on a wide range of topics, and I'm constantly learning and improving my responses.\n\nI don't have personal experiences or emotions like humans do, but I'm here to help you with any questions or topics you'd like to discuss. I'm happy to chat and provide information to the best of my ability."
      }
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 0
  }
}

real    0m46.588s
user    0m0.043s
sys     0m0.006s

The initial model load took about three minutes, though...

11:55PM INF Trying to load the model 'Meta-Llama-3-8B-Instruct.Q4_K_M.gguf' with all the available backends: llama-cpp, llama-ggml, gpt4all, bert-embeddings, rwkv, whisper, stablediffusion, tinydream, piper
11:55PM INF [llama-cpp] Attempting to load
11:55PM INF Loading model 'Meta-Llama-3-8B-Instruct.Q4_K_M.gguf' with backend llama-cpp
11:58PM INF [llama-cpp] Loads OK
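LocalAI speaks the same OpenAI-compatible API, so the openai Python client works here too by pointing base_url at port 8080. A sketch of a streaming request, under the same assumptions as before:

from openai import OpenAI

# Point the client at the local LocalAI server
client = OpenAI(base_url="http://localhost:8080/v1", api_key="dummy")

# Stream tokens as they are generated
stream = client.chat.completions.create(
    model="llama-3-8b-instruct",
    messages=[{"role": "user", "content": "Could you introduce yourself?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)
print()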

I'll skip the Japanese check here. I did try it, but it was slow and, as before, the answer came back in English...

That's about it.

Wrapping up

I tried Meta's LLM, Llama 3, with llama-cpp-python and LocalAI.

The smallest model, at 8B, is a bit larger than Llama 2's, but that was more or less expected; it was nice that it worked fairly easily.