
Trying Meta's "Llama 3" with llama-cpp-python and LocalAI, which provide OpenAI API-compatible servers


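Meta has released Llama 3.

Meta's new LLM "Llama 3", free and available for commercial use, to be accessible on nearly every cloud - ITmedia NEWS

It looked like Llama 3 could be run with llama-cpp-python and LocalAI, both of which provide an OpenAI API-compatible server, so I decided to give it a try.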

Llama 3

Llama 3 is an LLM published by Meta.

Meta Llama 3

Introducing Meta Llama 3: The most capable openly available LLM to date

It comes in two parameter sizes, 8B and 70B, each available as a base model and as an instruction-tuned model.



Added llama-3 chat template by DifferentialityDevelopment · Pull Request #6751 · ggerganov/llama.cpp · GitHub


Add Llama-3 chat format by andreabak · Pull Request #1371 · abetlen/llama-cpp-python · GitHub


How to run llama3? · mudler LocalAI · Discussion #2076 · GitHub




QuantFactory/Meta-Llama-3-8B-Instruct-GGUF · Hugging Face



$ python3 --version
Python 3.10.12

$ pip3 --version
pip 22.0.2 from /usr/lib/python3/dist-packages/pip (python 3.10)



QuantFactory/Meta-Llama-3-8B-Instruct-GGUF · Hugging Face


$ curl -L -o Meta-Llama-3-8B-Instruct.Q4_K_M.gguf

$ ll -h Meta-Llama-3-8B-Instruct.Q4_K_M.gguf
-rw-rw-r-- 1 xxxxx xxxxx 4.6G  4月 25 00:15 Meta-Llama-3-8B-Instruct.Q4_K_M.gguf




$ pip3 install llama-cpp-python[server]


$ pip3 list
Package           Version
----------------- -------
annotated-types   0.6.0
anyio             4.3.0
click             8.1.7
diskcache         5.6.3
exceptiongroup    1.2.1
fastapi           0.110.2
h11               0.14.0
idna              3.7
Jinja2            3.1.3
llama_cpp_python  0.2.64
MarkupSafe        2.1.5
numpy             1.26.4
pip               22.0.2
pydantic          2.7.1
pydantic_core     2.18.2
pydantic-settings 2.2.1
python-dotenv     1.0.1
PyYAML            6.0.1
setuptools        59.6.0
sniffio           1.3.1
sse-starlette     2.1.0
starlette         0.37.2
starlette-context 0.3.6
typing_extensions 4.11.0
uvicorn           0.29.0

Start the server. The --chat_format llama-3 option is required.

$ python3 -m llama_cpp.server --model Meta-Llama-3-8B-Instruct.Q4_K_M.gguf --chat_format llama-3


$ time curl -s -XPOST -H 'Content-Type: application/json' localhost:8000/v1/chat/completions -d \
    '{"messages": [{"role": "user", "content": "Could you introduce yourself?"}]}' | jq
{
  "id": "chatcmpl-ff1221b5-5555-4a32-9c1b-c3c2818efc02",
  "object": "chat.completion",
  "created": 1713972674,
  "model": "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "I'd be happy to introduce myself.\n\nI am LLaMA, an AI assistant developed by Meta AI that can understand and respond to human input in a conversational manner. I'm not a human, but rather a computer program designed to simulate conversation and answer questions to the best of my ability based on the knowledge and data I've been trained on.\n\nI'm constantly learning and improving my responses based on user interactions, so please bear with me if I don't always get it right at first. My goal is to assist and provide helpful information to those who interact with me, while also making our conversation as engaging and natural as possible.\n\nWhat would you like to talk about or ask?",
        "role": "assistant"
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 138,
    "total_tokens": 154
  }
}

real    1m9.754s
user    0m0.051s
sys     0m0.003s
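
The same request can also be sent from Python. The following is only a minimal sketch using the standard library; the helper names `build_chat_request` and `chat` are made up for illustration, and the base URL assumes the server is listening on localhost:8000 as in the curl command above. The optional `model` argument is there because a server hosting several models (such as LocalAI) picks the model by name.

```python
import json
import urllib.request


def build_chat_request(prompt, model=None):
    """Build the JSON body for POST /v1/chat/completions (hypothetical helper)."""
    body = {"messages": [{"role": "user", "content": prompt}]}
    if model is not None:
        # Only added when a model name is given explicitly.
        body["model"] = model
    return body


def chat(base_url, prompt, model=None, timeout=600):
    """Send a single user message and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # CPU inference can take minutes, hence the generous default timeout.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]
```

With the server running, `chat("http://localhost:8000", "Could you introduce yourself?")` should return the same kind of text as the `content` field in the JSON above.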


$ time curl -s -XPOST -H 'Content-Type: application/json' localhost:8000/v1/chat/completions -d \
    '{"messages": [{"role": "user", "content": "あなたの自己紹介をしてください"}]}' | jq
{
  "id": "chatcmpl-cd219a91-a85c-4ceb-ae8c-4b49a28cd881",
  "object": "chat.completion",
  "created": 1713972772,
  "model": "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "Nice to meet you! 😊\n\nMy name is LLaMA, and I'm a large language model AI trained by Meta AI that can understand and respond to human input in a conversational manner. My primary function is to assist users with information queries, provide helpful responses, and even engage in creative conversations.\n\nHere are some interesting facts about me:\n\n1. **Language skills**: I'm fluent in multiple languages, including English, Japanese, Spanish, French, German, Italian, Chinese, and many more! 🌎\n2. **Knowledge base**: My training data consists of a massive corpus of text from the internet, which allows me to provide accurate answers to a wide range of questions.\n3. **Conversational abilities**: I can understand natural language processing (NLP) and respond accordingly, making it feel like you're having a conversation with a human! 💬\n4. **Creative capabilities**: I can generate text, poetry, stories, dialogues, and even entire scripts!\n5. **Continuous learning**: My training is ongoing, so I'm always improving my understanding of language and updating my knowledge base.\n\nI'm here to help answer your questions, provide information, or simply chat about any topic you're interested in! What would you like to talk about? 🤔",
        "role": "assistant"
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 19,
    "completion_tokens": 262,
    "total_tokens": 281
  }
}

real    2m42.950s
user    0m0.042s
sys     0m0.018s


Even when asked in Japanese, it answered in English, so let's use Llama 3 in English.
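
For a rough sense of speed, the `completion_tokens` values and the `real` times from the two runs above can be turned into tokens per second. This is only a back-of-the-envelope sketch: the wall-clock time also includes prompt processing and HTTP overhead, so it understates the pure generation speed, and the helper names are made up for illustration.

```python
import re


def parse_real_time(text):
    """Convert the `real` value printed by `time` (e.g. "1m9.754s") to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text)
    if match is None:
        raise ValueError(f"unexpected time format: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def tokens_per_second(completion_tokens, real_time):
    """Generated tokens divided by wall-clock seconds."""
    return completion_tokens / parse_real_time(real_time)


# (completion_tokens, real) pairs taken from the two runs above.
runs = {
    "English prompt": (138, "1m9.754s"),
    "Japanese prompt": (262, "2m42.950s"),
}
for label, (tokens, real_time) in runs.items():
    print(f"{label}: {tokens_per_second(tokens, real_time):.2f} tokens/s")
# English prompt: 1.98 tokens/s
# Japanese prompt: 1.61 tokens/s
```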

By the way, --chat_format llama-3 refers to the chat format defined in the following code.


    _roles = dict(
    _begin_token = "<|begin_of_text|>"
    _sep = "<|eot_id|>"


Meta Llama 3 | Model Cards and Prompt formats
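
To make the format concrete, here is a small sketch that assembles a Llama 3 Instruct prompt from a list of messages, following the layout described on Meta's prompt format page. It is an illustration, not llama-cpp-python's actual implementation; `role_header` and `format_llama3_prompt` are hypothetical names.

```python
BEGIN_TOKEN = "<|begin_of_text|>"
SEP = "<|eot_id|>"


def role_header(role):
    """Header that precedes every message: the role wrapped in header tokens."""
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n"


def format_llama3_prompt(messages):
    """Concatenate messages into a single Llama 3 Instruct prompt string."""
    prompt = BEGIN_TOKEN
    for message in messages:
        # Each turn is: header, content, end-of-turn token.
        prompt += role_header(message["role"]) + message["content"] + SEP
    # Finish with an open assistant header so the model writes the reply.
    prompt += role_header("assistant")
    return prompt


print(format_llama3_prompt(
    [{"role": "user", "content": "Could you introduce yourself?"}]
))
```

Generation is expected to stop when the model emits `<|eot_id|>`, which is why that token also serves as the stop sequence.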




$ curl -LO
$ chmod a+x local-ai-avx2-Linux-x86_64
$ ./local-ai-avx2-Linux-x86_64 --version
LocalAI version v2.12.4 (0004ec8be3ca150ce6d8b79f2991bfe3a9dc65ad)

Place the quantized Llama 3 model in the models directory.

$ tree models
models
└── Meta-Llama-3-8B-Instruct.Q4_K_M.gguf

0 directories, 1 file



- name: llama-3-8b-instruct
  backend: llama-cpp
  mmap: true
  context_size: 8192
  f16: true
  stopwords:
    - <|im_end|>
    - <dummy32000>
    - "<|eot_id|>"
  parameters:
    model: Meta-Llama-3-8B-Instruct.Q4_K_M.gguf
  template:
    chat_message: |
      <|start_header_id|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "tool"}}tool{{else if eq .RoleName "user"}}user{{end}}<|end_header_id|>

      {{ if .FunctionCall -}}
      Function call:
      {{ else if eq .RoleName "tool" -}}
      Function response:
      {{ end -}}
      {{ if .Content -}}
      {{.Content -}}
      {{ else if .FunctionCall -}}
      {{ toJson .FunctionCall -}}
      {{ end -}}
      <|eot_id|>
    function: |
      <|start_header_id|>system<|end_header_id|>

      You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. Here are the available tools:
      <tools>
      {{range .Functions}}
      {'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
      {{end}}
      </tools>
      Use the following pydantic model json schema for each tool call you will make:
      {'title': 'FunctionCall', 'type': 'object', 'properties': {'arguments': {'title': 'Arguments', 'type': 'object'}, 'name': {'title': 'Name', 'type': 'string'}}, 'required': ['arguments', 'name']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
      Function call:
    chat: |
      <|begin_of_text|>{{.Input }}
      <|start_header_id|>assistant<|end_header_id|>
    completion: |
      {{.Input}}
  usage: |
    curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
        "model": "llama-3-8b-instruct",
        "messages": [{"role": "user", "content": "How are you doing?"}],
        "temperature": 0.1
    }'

The configuration mainly consists of the templates for Llama 3, and I wrote it with reference to the following.

How to run llama3? · mudler LocalAI · Discussion #2076 · GitHub

models(llama3): add llama3 to embedded models by mudler · Pull Request #2074 · mudler/LocalAI · GitHub


$ ./local-ai-avx2-Linux-x86_64 --config-file local-ai-config.yaml --models-path models --threads 4


$ time curl -s -XPOST -H 'Content-Type: application/json' localhost:8080/v1/chat/completions -d \
    '{"model": "llama-3-8b-instruct", "messages": [{"role": "user", "content": "Could you introduce yourself?"}]}' | jq
{
  "created": 1714056935,
  "object": "chat.completion",
  "id": "059aff63-0d6a-4d29-ba9c-02b4a467f03a",
  "model": "llama-3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "I'd be happy to introduce myself!\n\nI'm LLaMA, an AI assistant developed by Meta AI that can understand and respond to human input in a conversational manner. I'm not a human, but I'm designed to simulate conversation and answer questions to the best of my ability. I can provide information on a wide range of topics, and I'm constantly learning and improving my responses.\n\nI don't have personal experiences or emotions like humans do, but I'm here to help you with any questions or topics you'd like to discuss. I'm happy to chat and provide information to the best of my ability."
      }
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 0
  }
}

real    0m46.588s
user    0m0.043s
sys     0m0.006s


11:55PM INF Trying to load the model 'Meta-Llama-3-8B-Instruct.Q4_K_M.gguf' with all the available backends: llama-cpp, llama-ggml, gpt4all, bert-embeddings, rwkv, whisper, stablediffusion, tinydream, piper
11:55PM INF [llama-cpp] Attempting to load
11:55PM INF Loading model 'Meta-Llama-3-8B-Instruct.Q4_K_M.gguf' with backend llama-cpp
11:58PM INF [llama-cpp] Loads OK




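I tried Meta's LLM, Llama 3, with llama-cpp-python and LocalAI.

The smallest model is 8B, slightly larger than Llama 2's smallest (7B), but, more or less as I expected, it was quite easy to get running, which was nice.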