
Trying LocalAI, an OpenAI API-compatible server that runs locally

What this post is about

Previously, I tried llama-cpp-python as an OpenAI API-compatible server that can run locally.

llama-cpp-pythonで、OpenAI API互換のサヌバヌを詊す - CLOVER🍀

I learned that LocalAI is another tool that can do the same kind of thing, so I thought I'd give it a try.

LocalAI

The LocalAI website is here.

LocalAI :: LocalAI documentation

The GitHub repository is here.

GitHub - mudler/LocalAI: :robot: The free, Open Source OpenAI alternative. Self-hosted, community-driven and local-first. Drop-in replacement for OpenAI running on consumer-grade hardware. No GPU required. Runs ggml, gguf, GPTQ, onnx, TF compatible models: llama, llama2, rwkv, whisper, vicuna, koala, cerebras, falcon, dolly, starcoder, and many others

As stated right at the top, it is built as an OpenAI alternative that runs locally, providing a REST API compatible with the OpenAI API.

LocalAI is the free, Open Source OpenAI alternative. LocalAI act as a drop-in replacement REST API that’s compatible with OpenAI API specifications for local inferencing. It allows you to run LLMs, generate images, audio (and not only) locally or on-prem with consumer grade hardware, supporting multiple model families that are compatible with the ggml format. Does not require GPU. It is maintained by mudler.

LocalAI :: LocalAI documentation

Its features appear to be the following.

  • A drop-in replacement for the OpenAI API that runs locally
  • No GPU required
    • If a GPU is available, GPU acceleration can be used
  • Models are loaded on first use and then kept in memory for faster inference
  • It does not shell out (spawn subprocesses); instead it uses bindings for faster inference and better performance

The available features are listed here.

LocalAI / Features

As for how it works, there is an explanation here.

LocalAI / How does it work?

It seems to work like this:

  • LocalAI itself is a wrapper implemented in Go, so that OpenAI SDK clients can talk to it
  • It integrates with various backends via gRPC

In other words, LocalAI is implemented as a layer that behaves like the OpenAI API on top of backends and models that already exist.
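In practical terms, being a "drop-in replacement" means an OpenAI-style request only needs its endpoint swapped. A rough sketch of the idea (the model names here are placeholders; the actual setup comes later in this post):

# Request against the real OpenAI API (API key required)
$ curl https://api.openai.com/v1/chat/completions \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -H 'Content-Type: application/json' \
    -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello"}]}'

# The same request shape against a local LocalAI instance (no API key by default)
$ curl http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "llama-2-7b-chat-gguf", "messages": [{"role": "user", "content": "Hello"}]}'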

A table of the available backends and models can be found here.

Model compatibility :: LocalAI documentation

llama.cpp is among them.

🦙 llama.cpp :: LocalAI documentation

If anything, I suspect llama.cpp is the primary backend.

In fact, the documentation tells you to consult llama.cpp for hardware requirements.

Depending on the model you are attempting to run might need more RAM or CPU resources. Check out also here for gguf based backends. rwkv is less expensive on resources.

Model Compatibility / Hardware requirements

llama.cpp / Usage / Memory/Disk Requirements

Seen this way it feels quite close to llama-cpp-python, but the differences are that backends and models other than llama.cpp can be used, and that LocalAI aims to be an OpenAI API replacement from the start, which did not seem to be llama-cpp-python's main focus.

There also seem to be various examples.

LocalAI/examples at v2.3.1 · mudler/LocalAI · GitHub

To check the versions of the various backends, looking at the Makefile seems to be the way to go.

https://github.com/mudler/LocalAI/blob/v2.3.1/Makefile#L6-L37

This time, just as when I wrote this entry,

llama-cpp-pythonで、OpenAI API互換のサヌバヌを詊す - CLOVER🍀

I'd like to try calling this API:

OpenAI / API reference / ENDPOINTS / Chat / Create chat completion

Environment

Here is the environment for this post: Ubuntu Linux 22.04 LTS.

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy


$ uname -srvmpio
Linux 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Installing LocalAI

How to install LocalAI is described here.

Getting started :: LocalAI documentation

It describes using a container image, downloading a binary, and building from source; this time I'll go with downloading the binary.

The binary comes in three variants: avx, avx2, and avx512. My CPU supports AVX2 (see the flags below), so I'll use the avx2 binary.

$ grep -E avx /proc/cpuinfo
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear flush_l1d

(remaining output omitted)
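As an aside, picking the variant from the CPU flags could be scripted roughly like this. This is just a sketch that prints which variant to grab; avx512f is the base AVX-512 flag.

# Rough sketch: decide which LocalAI binary variant matches this CPU
if grep -qw avx512f /proc/cpuinfo; then
  echo "use the avx512 binary"
elif grep -qw avx2 /proc/cpuinfo; then
  echo "use the avx2 binary"
elif grep -qw avx /proc/cpuinfo; then
  echo "use the avx binary"
else
  echo "no AVX support; building from source is probably the only option"
fi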

Download LocalAI.

$ curl -LO https://github.com/mudler/LocalAI/releases/download/v2.3.1/local-ai-avx2-Linux-x86_64

It's a fairly large file 

$ ll -h local-ai-avx2-Linux-x86_64
-rw-rw-r-- 1 xxxxx xxxxx 317M  1月  1 18:14 local-ai-avx2-Linux-x86_64

Grant it execute permission.

$ chmod a+x local-ai-avx2-Linux-x86_64

First, check the version.

$ ./local-ai-avx2-Linux-x86_64 --version
LocalAI version v2.3.1 (a95bb0521d3f3183c9bba468c1417f4d000bdfb3)

The help output.

$ ./local-ai-avx2-Linux-x86_64 --help
NAME:
   LocalAI - OpenAI compatible API for running LLaMA/GPT models locally on CPU with consumer grade hardware.

USAGE:
   local-ai [options]

VERSION:
   v2.3.1 (a95bb0521d3f3183c9bba468c1417f4d000bdfb3)

DESCRIPTION:

   LocalAI is a drop-in replacement OpenAI API which runs inference locally.

   Some of the models compatible are:
   - Vicuna
   - Koala
   - GPT4ALL
   - GPT4ALL-J
   - Cerebras
   - Alpaca
   - StableLM (ggml quantized)

   For a list of compatible model, check out: https://localai.io/model-compatibility/index.html


COMMANDS:
   models      List or install models
   tts         Convert text to speech
   transcript  Convert audio to text
   help, h     Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --f16                                                              (default: false) [$F16]
   --autoload-galleries                                               (default: false) [$AUTOLOAD_GALLERIES]
   --debug                                                            (default: false) [$DEBUG]
   --single-active-backend                                            Allow only one backend to be running. (default: false) [$SINGLE_ACTIVE_BACKEND]
   --parallel-requests                                                Enable backends to handle multiple requests in parallel. This is for backends that supports multiple requests in parallel, like llama.cpp or vllm (default: false) [$PARALLEL_REQUESTS]
   --cors                                                             (default: false) [$CORS]
   --cors-allow-origins value                                          [$CORS_ALLOW_ORIGINS]
   --threads value                                                    Number of threads used for parallel computation. Usage of the number of physical cores in the system is suggested. (default: 4) [$THREADS]
   --models-path value                                                Path containing models used for inferencing (default: "/home/kazuhira/study/llm/clover/localai/models") [$MODELS_PATH]
   --galleries value                                                  JSON list of galleries [$GALLERIES]
   --preload-models value                                             A List of models to apply in JSON at start [$PRELOAD_MODELS]
   --preload-models-config value                                      A List of models to apply at startup. Path to a YAML config file [$PRELOAD_MODELS_CONFIG]
   --config-file value                                                Config file [$CONFIG_FILE]
   --address value                                                    Bind address for the API server. (default: ":8080") [$ADDRESS]
   --image-path value                                                 Image directory (default: "/tmp/generated/images") [$IMAGE_PATH]
   --audio-path value                                                 audio directory (default: "/tmp/generated/audio") [$AUDIO_PATH]
   --backend-assets-path value                                        Path used to extract libraries that are required by some of the backends in runtime. (default: "/tmp/localai/backend_data") [$BACKEND_ASSETS_PATH]
   --external-grpc-backends value [ --external-grpc-backends value ]  A list of external grpc backends [$EXTERNAL_GRPC_BACKENDS]
   --context-size value                                               Default context size of the model (default: 512) [$CONTEXT_SIZE]
   --upload-limit value                                               Default upload-limit. MB (default: 15) [$UPLOAD_LIMIT]
   --api-keys value [ --api-keys value ]                              List of API Keys to enable API authentication. When this is set, all the requests must be authenticated with one of these API keys. [$API_KEY]
   --enable-watchdog-idle                                             Enable watchdog for stopping idle backends. This will stop the backends if are in idle state for too long. (default: false) [$WATCHDOG_IDLE]
   --enable-watchdog-busy                                             Enable watchdog for stopping busy backends that exceed a defined threshold. (default: false) [$WATCHDOG_BUSY]
   --watchdog-busy-timeout value                                      Watchdog timeout. This will restart the backend if it crashes. (default: "5m") [$WATCHDOG_BUSY_TIMEOUT]
   --watchdog-idle-timeout value                                      Watchdog idle timeout. This will restart the backend if it crashes. (default: "15m") [$WATCHDOG_IDLE_TIMEOUT]
   --preload-backend-only                                             If set, the api is NOT launched, and only the preloaded models / backends are started. This is intended for multi-node setups. (default: false) [$PRELOAD_BACKEND_ONLY]
   --help, -h                                                         show help
   --version, -v                                                      print the version

COPYRIGHT:
   Ettore Di Giacinto

Generating text with a model

Now, let's have LocalAI generate text using a model.

For the model, I'll use llama-2-7b-chat.Q4_K_M.gguf from here.

TheBloke/Llama-2-7B-Chat-GGUF · Hugging Face

It's a model of roughly 4 GB.

Looking at Getting Started, it seems the way to go is to place the model as models/[model name].

Getting started :: LocalAI documentation

Create the models directory.

$ mkdir models

Download the model under the name llama-2-7b-chat-gguf.

$ curl -L https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf -o models/llama-2-7b-chat-gguf

It ends up like this.

$ tree models -h
[4.0K]  models
└── [3.8G]  llama-2-7b-chat-gguf

0 directories, 1 file

Let's start LocalAI.

$ ./local-ai-avx2-Linux-x86_64 --models-path models --context-size 700 --threads 4

The meanings of the options are documented in the link below; in short (an environment-variable form is also sketched after the link):

  • --models-path
    Path to the directory containing the models used for inference (default: ./models)
  • --context-size
    Default context size of the model (default: 512)
  • --threads
    Number of threads used for parallel computation (default: 4)

Getting Started / CLI parameters
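Judging from the help output above, the same settings can also be supplied as environment variables ($MODELS_PATH, $CONTEXT_SIZE, $THREADS). I haven't tried this form myself, but it would presumably look like this:

$ MODELS_PATH=models CONTEXT_SIZE=700 THREADS=4 ./local-ai-avx2-Linux-x86_64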

The startup log.

7:05PM DBG no galleries to load
7:05PM INF Starting LocalAI using 4 threads, with models path: models
7:05PM INF LocalAI version: v2.3.1 (a95bb0521d3f3183c9bba468c1417f4d000bdfb3)
7:05PM INF Preloading models from models

 ┌───────────────────────────────────────────────────┐
 │                   Fiber v2.50.0                   │
 │               http://127.0.0.1:8080               │
 │       (bound on host 0.0.0.0 and port 8080)       │
 │                                                   │
 │ Handlers ............ 73  Processes ........... 1 │
 │ Prefork ....... Disabled  PID ............. 22558 │
 └───────────────────────────────────────────────────┘

It looks like it preloaded the model from the models directory.
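LocalAI also exposes the OpenAI-compatible model listing endpoint, so you can check that the model is visible before generating anything. I haven't captured the output here, but it should include llama-2-7b-chat-gguf:

$ curl -s localhost:8080/v1/models | jq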

Now let's generate some text, specifying llama-2-7b-chat-gguf as the model.

$ time curl -s -XPOST -H 'Content-Type: application/json' localhost:8080/v1/chat/completions -d \
    '{"model": "llama-2-7b-chat-gguf", "messages": [{"role": "user", "content": "Could you introduce yourself?"}]}' | jq

It took a little over a minute for the result to come back.

{
  "created": 1704103556,
  "object": "chat.completion",
  "id": "89ed376e-0d0f-41cb-a711-c007c880fc3d",
  "model": "llama-2-7b-chat-gguf",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "\n\nI'm a 32-year-old woman from the United States. I'm a writer and editor, and I've been working in the industry for about 10 years now. I've written for a variety of publications, including newspapers, magazines, and online sites. I'm also a mom to two young children, and I enjoy spending time with them and watching them grow. In my free time, I like to read, watch movies, and go for walks. I'm excited to be here and to share my thoughts and experiences with you."
      }
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 0
  }
}

real    1m4.061s
user    0m0.031s
sys     0m0.011s

I'm using the same model as when I tried llama-cpp-python, yet the self-introduction claims to be a 32-year-old woman living in the United States.

It bothers me a little that the usage values are all zero, though 

  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 0
  }

At this point, LocalAI outputs a log like this on its side.

7:06PM INF Loading model 'llama-2-7b-chat-gguf' with backend llama-cpp
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:38605: connect: connection refused"

Incidentally, if you specify a completely unrelated model name...

$ time curl -s -XPOST -H 'Content-Type: application/json' localhost:8080/v1/chat/completions -d \
    '{"model": "hoge", "messages": [{"role": "user", "content": "What is your name?"}]}' | jq

...it fails with an error, because no backend can find a usable model.

{
  "error": {
    "code": 500,
    "message": "could not load model - all backends returned error: 18 errors occurred:\n\t* could not load model: rpc error: code = Canceled desc = \n\t* could not load model: rpc error: code = Unknown desc = failed loading model\n\t* could not load model: rpc error: code = Canceled desc = \n\t* could not load model: rpc error: code = Unknown desc = failed loading model\n\t* could not load model: rpc error: code = Unknown desc = failed loading model\n\t* could not load model: rpc error: code = Unknown desc = failed loading model\n\t* could not load model: rpc error: code = Unknown desc = failed loading model\n\t* could not load model: rpc error: code = Unknown desc = failed loading model\n\t* could not load model: rpc error: code = Unknown desc = failed loading model\n\t* could not load model: rpc error: code = Unknown desc = failed loading model\n\t* could not load model: rpc error: code = Unknown desc = failed loading model\n\t* could not load model: rpc error: code = Unknown desc = failed loading model\n\t* could not load model: rpc error: code = Unknown desc = failed loading model\n\t* could not load model: rpc error: code = Unavailable desc = error reading from server: EOF\n\t* could not load model: rpc error: code = Unknown desc = stat models/hoge: no such file or directory\n\t* grpc process not found: /tmp/localai/backend_data/backend-assets/grpc/stablediffusion. some backends(stablediffusion, tts) require LocalAI compiled with GO_TAGS\n\t* grpc process not found: /tmp/localai/backend_data/backend-assets/grpc/tinydream. some backends(stablediffusion, tts) require LocalAI compiled with GO_TAGS\n\t* grpc process not found: /tmp/localai/backend_data/backend-assets/grpc/piper. some backends(stablediffusion, tts) require LocalAI compiled with GO_TAGS\n\n",
    "type": ""
  }
}

real    0m30.788s
user    0m0.035s
sys     0m0.004s

The LocalAI log for this request.

7:13PM INF Loading model 'hoge' with backend llama-cpp
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:46755: connect: connection refused"
7:13PM INF Loading model 'hoge' with backend llama-ggml
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:32975: connect: connection refused"
7:13PM INF Loading model 'hoge' with backend llama
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:35765: connect: connection refused"
7:13PM INF Loading model 'hoge' with backend gpt4all
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:33913: connect: connection refused"
7:13PM INF Loading model 'hoge' with backend gptneox
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:37145: connect: connection refused"
7:13PM INF Loading model 'hoge' with backend bert-embeddings
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:32845: connect: connection refused"
7:13PM INF Loading model 'hoge' with backend falcon-ggml
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:33267: connect: connection refused"
7:13PM INF Loading model 'hoge' with backend gptj
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:38685: connect: connection refused"
7:14PM INF Loading model 'hoge' with backend gpt2
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:37601: connect: connection refused"
7:14PM INF Loading model 'hoge' with backend dolly
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:41031: connect: connection refused"
7:14PM INF Loading model 'hoge' with backend mpt
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:37935: connect: connection refused"
7:14PM INF Loading model 'hoge' with backend replit
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:42047: connect: connection refused"
7:14PM INF Loading model 'hoge' with backend starcoder
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:38465: connect: connection refused"
7:14PM INF Loading model 'hoge' with backend rwkv
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:40645: connect: connection refused"
7:14PM INF Loading model 'hoge' with backend whisper
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:35603: connect: connection refused"
7:14PM INF Loading model 'hoge' with backend stablediffusion
7:14PM INF Loading model 'hoge' with backend tinydream
7:14PM INF Loading model 'hoge' with backend piper

So the behavior seems to be that, for the specified model, it tries the available backends one by one.

Specifying settings in a configuration file

Finally, let's try configuring LocalAI with a configuration file.

Roughly following these:

llama.cpp / YAML configuration

Advanced / Advanced configuration with YAML files

It's a little confusing, but the format differs depending on whether you create a configuration file named [model name].yaml or pass a configuration file with --config-file.

As for the model, I recreated the models directory and this time placed the model under its original file name.

$ mkdir models
$ curl -L https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf -o models/llama-2-7b-chat.Q4_K_M.gguf

First, let's name the configuration file after the model. I chose gpt-3.5-turbo as the model name. This file needs to be placed inside the models directory.

models/gpt-3.5-turbo.yaml

name: gpt-3.5-turbo
backend: llama
context_size: 700
parameters:
  model: llama-2-7b-chat.Q4_K_M.gguf

It seems backend specifies the backend to use, and parameters / model specifies the corresponding model file.

The models directory now looks like this.

$ tree models -h
[4.0K]  models
├── [ 102]  gpt-3.5-turbo.yaml
└── [3.8G]  llama-2-7b-chat.Q4_K_M.gguf

0 directories, 2 files

Start it.

$ ./local-ai-avx2-Linux-x86_64 --models-path models --threads 4

Check that it works, specifying gpt-3.5-turbo as the model.

$ time curl -s -XPOST -H 'Content-Type: application/json' localhost:8080/v1/chat/completions -d \
    '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Could you introduce yourself?"}]}' | jq

On the LocalAI side, it recognizes the model specified in the configuration file.

8:28PM INF Loading model 'llama-2-7b-chat.Q4_K_M.gguf' with backend llama
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:35403: connect: connection refused"

The result came back.

{
  "created": 1704108534,
  "object": "chat.completion",
  "id": "ccd1fc52-f9ae-4c70-9242-bc0486972b40",
  "model": "gpt-3.5-turbo",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "\n\nI'm a 32-year-old woman from the United States. I'm a writer and editor, and I've been working in the industry for about 10 years now. I've written for a variety of publications, including newspapers, magazines, and online sites. I'm also a mom to two young children, and I enjoy spending time with them and watching them grow. In my free time, I like to read, watch movies, and go for walks. I'm excited to be here and to share my thoughts and experiences with you."
      }
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 0
  }
}

real    1m17.497s
user    0m0.044s
sys     0m0.000s
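Since this is the OpenAI Chat Completions API, standard request parameters such as temperature and max_tokens should also be accepted, though I haven't checked how faithfully this binary honors them. A sketch of such a request:

$ curl -s -XPOST -H 'Content-Type: application/json' localhost:8080/v1/chat/completions -d \
    '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Could you introduce yourself?"}], "temperature": 0.2, "max_tokens": 128}' | jq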

When passing a LocalAI configuration file with the --config-file option, it looks like this.

localai-config.yaml

- name: gpt-3.5-turbo
  backend: llama
  context_size: 700
  parameters:
    model: llama-2-7b-chat.Q4_K_M.gguf

It becomes a list, one entry per model. I didn't realize the --config-file format was like this until I looked at the page below, so I got stuck for a while with YAML parse errors 

Advanced / Advanced configuration with YAML files

In this case, the models directory only needs to contain the model file.

$ tree models -h
[4.0K]  models
└── [3.8G]  llama-2-7b-chat.Q4_K_M.gguf

0 directories, 1 file

Start it.

$ ./local-ai-avx2-Linux-x86_64 --config-file localai-config.yaml --models-path models --threads 4

The result of the check is the same as with the per-model configuration file, so I'll omit it.

That's about it for this time.

In closing

I tried LocalAI, an OpenAI API-compatible server that runs locally.

If you just want to run llama.cpp as an OpenAI API-compatible server, I feel llama-cpp-python is easier to reason about, since what llama.cpp itself can do is the same either way.

On the other hand, considering cases where you want to use other backends, and the surrounding knowledge you pick up along the way, it seems worth having LocalAI in your toolbox.

I plan to choose between it and llama-cpp-python on a case-by-case basis.