Accessing a chat model on a llama-cpp-python OpenAI API-compatible server from the OpenAI Python API library

What is this post for?

Previously, I used llama-cpp-python to set up an OpenAI API-compatible server.

Trying an OpenAI API-compatible server with llama-cpp-python - CLOVER🍀

That time I checked it by accessing it with curl; this time I would like to use OpenAI's Python API library.

OpenAI Python API library

The GitHub repository for the OpenAI Python API library is here.

GitHub - openai/openai-python: The official Python library for the OpenAI API

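The current version at the time of writing appears to be 1.3.7.

The documentation is here.

Welcome to the OpenAI developer platform

Let's take a quick look at what it is.

The OpenAI Python library is meant to give easy access to OpenAI's REST API. It includes type definitions for requests and
responses, and it uses a library called httpx as its HTTP backend.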

The OpenAI Python library provides convenient access to the OpenAI REST API from any Python 3.7+ application. The library includes type definitions for all request params and response fields, and offers both synchronous and asynchronous clients powered by httpx.

OpenAI Python API library

The type definitions are based on this OpenAPI definition of the OpenAI API.

GitHub - openai/openai-openapi: OpenAPI specification for the OpenAI API

Basic usage is summarized here.

OpenAI Python API library / Usage

It covers asynchronous usage, streaming, pagination, file uploads, error handling, retries, timeouts, and more.

Incidentally, there is also a Node.js (TypeScript/JavaScript) library.

GitHub - openai/openai-node: The official Node.js / Typescript library for the OpenAI API

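For other languages, community-maintained libraries are listed.

Guides / Libraries

Let's look at the documentation as well.

It seems best to start here and get the terminology down.

Introduction

The key concepts are as follows.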

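Text generation models
- Models trained to understand natural language and formal language
- Called generative pre-trained transformers (GPT for short)
- Can produce text output in response to an input
- The input is called a "prompt"
- For details, see Text generation models and Prompt engineering
Assistants
- Refers to entities that can perform tasks on behalf of users
- In the case of the OpenAI API, they are powered by LLMs such as GPT-4
- They normally operate based on the instructions embedded in the model's context window
- For details, see Assistants API
Embeddings
- A vector representation of data, intended to preserve the meaning of its content
- Chunks of similar data tend to have embeddings that are close together
- For details, see Embeddings
Tokens
- The unit of processing for text generation models and embeddings; the pieces a string is broken into
- One word does not necessarily become one token
- Can be checked with the Tokenizer
- For text generation models, the prompt and the output together must not exceed the model's maximum context length
- For embeddings (which do not output tokens), the input must be shorter than the model's maximum context length
- The maximum context length of each text generation model and embedding model can be checked in Models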

Introduction / Key concepts
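The context-length rule for tokens comes down to simple addition. The following is only a small sketch of that arithmetic: the function name, the token counts, and the 4096-token limit are all made up for illustration, so check Models (or your model's own metadata) for the real limit.

```python
def fits_in_context(prompt_tokens: int, output_tokens: int, context_length: int) -> bool:
    """Return True if the prompt plus the output fits in the context window."""
    return prompt_tokens + output_tokens <= context_length


# Hypothetical limit, assumed only for this example.
CONTEXT_LENGTH = 4096

print(fits_in_context(50, 300, CONTEXT_LENGTH))    # 50 + 300 = 350 tokens
print(fits_in_context(4000, 300, CONTEXT_LENGTH))  # 4000 + 300 = 4300 tokens
```

The first call fits and the second does not, which is the situation where a prompt has to be shortened or the output length capped.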

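This time, I'll start by following the Quickstart.

Developer quickstart

The target is not the OpenAI API itself but the OpenAI API-compatible server provided by llama-cpp-python.
The API used is the chat model among the text generation models.

Environment

The environment this time is as follows.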

$ python3 -V
Python 3.10.12

llama-cpp-python:

$ pip3 freeze | grep llama_cpp_python
llama_cpp_python==0.2.20

This is the model I'll use.

TheBloke/Llama-2-7B-Chat-GGUF · Hugging Face

Start the server.

$ python3 -m llama_cpp.server --model llama-2-7b-chat.Q4_K_M.gguf

Installing the OpenAI Python API library and trying it out

First, install the OpenAI Python API library.

The setup follows this guide.

Developer quickstart / Step 1: Setup Python

$ python3 -m venv venv
$ . venv/bin/activate
$ pip3 install openai

The version of the OpenAI Python API library that was installed:

$ pip3 freeze | grep openai
openai==1.3.7

Following this, I wrote a program.

Developer quickstart / Step 3: Sending your first API request

quickstart.py

import time
from openai import OpenAI

start_time = time.perf_counter()

client = OpenAI(base_url = "http://localhost:8000/v1", api_key = "dummy-api-key")

completion = client.chat.completions.create(
    model = "gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."},
        {"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."} 
    ]
)

elapsed_time = time.perf_counter() - start_time

print(completion)

print()

print(f"id = {completion.id}")
print(f"model = {completion.model}")

print("choices:")
for choice in completion.choices:
    print(f"  {choice}")

print(f"usage = {completion.usage}")

print()

print(f"elapsed time = {elapsed_time:.3f} sec")

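The OpenAI constructor normally seems to need no arguments (the API key is set through an environment variable), but since
base_url also has to be specified this time, I pass both as arguments. Set base_url to the URL used to access llama-cpp-python.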

client = OpenAI(base_url = "http://localhost:8000/v1", api_key = "dummy-api-key")
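If you would rather not hard-code these values, one option is to read them from the environment. This is only a sketch: LLAMA_BASE_URL and LLAMA_API_KEY are names invented for this example (the openai library does not read them itself), and the defaults are simply the values used in this post.

```python
import os


def client_settings(env=None):
    """Build keyword arguments for OpenAI(...) from environment variables.

    LLAMA_BASE_URL and LLAMA_API_KEY are hypothetical names made up for this
    sketch; they are not read by the openai library itself.
    """
    env = os.environ if env is None else env
    return {
        "base_url": env.get("LLAMA_BASE_URL", "http://localhost:8000/v1"),
        # The server in this post accepted a dummy key, so a placeholder is the default.
        "api_key": env.get("LLAMA_API_KEY", "dummy-api-key"),
    }


def make_client(env=None):
    """Create the client; openai is imported here so client_settings has no dependencies."""
    from openai import OpenAI

    return OpenAI(**client_settings(env))


print(client_settings({}))
```

With this in place, `client = make_client()` could replace the line above, and pointing the script at another server would only need a different LLAMA_BASE_URL.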

This part is the same as the Quickstart.

completion = client.chat.completions.create(
    model = "gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."},
        {"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."} 
    ]
)

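This part writes the response to standard output, printing both the whole response and its individual elements in a bit more detail.
Also, since the call takes quite a while, I made it print the processing time.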

elapsed_time = time.perf_counter() - start_time

print(completion)

print()

print(f"id = {completion.id}")
print(f"model = {completion.model}")

print("choices:")
for choice in completion.choices:
    print(f"  {choice}")

print(f"usage = {completion.usage}")

print()

print(f"elapsed time = {elapsed_time:.3f} sec")
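When only the generated text is wanted, it can be taken from the first choice. The helper below is a minimal sketch; reply_text is a name made up here, and the stand-in object only imitates the choices / message / content attributes that the script above accesses, so the snippet can be tried without a running server.

```python
from types import SimpleNamespace


def reply_text(completion):
    """Return the text of the first choice, without surrounding whitespace."""
    content = completion.choices[0].message.content
    # content can be None (for example when no text is returned), so fall back to "".
    return (content or "").strip()


# Stand-in with the same attribute layout as a chat completion response.
fake = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="  Hello!  "))]
)
print(reply_text(fake))  # prints: Hello!
```

In the script above this would be called as `reply_text(completion)`.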

Run it.

$ python3 quickstart.py

Result:

ChatCompletion(id='chatcmpl-376a3b53-d3cf-48f0-a27f-1082be8ca755', choices=[Choice(finish_reason='stop', index=0, message=ChatCompletionMessage(content="  Recursion, oh gentle programmer's delight,\nA loop within a loop, a function's recursive might.\nIt's like a tree, you see, with branches so bright,\nEach one calling itself, until the morning light.\n\nIn recursion's embrace, a program unfolds,\nWith each iteration, a new tale to unfold.\nA solution's sought, with logic so neat,\nAnd in its heart, the loop can't be beat.\n\nIt starts with base, a seed so small and fine,\nAnd grows with each call, like vines entwining.\nThe function's name is echoed through the land,\nAs it loops and loops, a programmer's hand.\n\nWith each recursion, more complexity gained,\nThe program unfolds its secrets unrestrained.\nA dance of code, a symphony divine,\nRecursion weaves its magic, a programming shrine.\n\nSo here's to recursion, a tool so grand,\nThat makes our programs grow, in this digital land.\nWith it, we craft, with it we play,\nAnd bring our ideas to life each day.\n\nNow go forth, dear programmer, and code with cheer,\nFor recursion is the key that sets your spirit free.", role='assistant', function_call=None, tool_calls=None))], created=1701510436, model='gpt-3.5-turbo', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=279, prompt_tokens=53, total_tokens=332))

id = chatcmpl-376a3b53-d3cf-48f0-a27f-1082be8ca755
model = gpt-3.5-turbo
choices:
  Choice(finish_reason='stop', index=0, message=ChatCompletionMessage(content="  Recursion, oh gentle programmer's delight,\nA loop within a loop, a function's recursive might.\nIt's like a tree, you see, with branches so bright,\nEach one calling itself, until the morning light.\n\nIn recursion's embrace, a program unfolds,\nWith each iteration, a new tale to unfold.\nA solution's sought, with logic so neat,\nAnd in its heart, the loop can't be beat.\n\nIt starts with base, a seed so small and fine,\nAnd grows with each call, like vines entwining.\nThe function's name is echoed through the land,\nAs it loops and loops, a programmer's hand.\n\nWith each recursion, more complexity gained,\nThe program unfolds its secrets unrestrained.\nA dance of code, a symphony divine,\nRecursion weaves its magic, a programming shrine.\n\nSo here's to recursion, a tool so grand,\nThat makes our programs grow, in this digital land.\nWith it, we craft, with it we play,\nAnd bring our ideas to life each day.\n\nNow go forth, dear programmer, and code with cheer,\nFor recursion is the key that sets your spirit free.", role='assistant', function_call=None, tool_calls=None))
usage = CompletionUsage(completion_tokens=279, prompt_tokens=53, total_tokens=332)

elapsed time = 67.885 sec

It took over a minute...

That is rather long, so let's change the question to the same one as in the previous entry.

Trying an OpenAI API-compatible server with llama-cpp-python - CLOVER🍀

quickstart2.py

import time
from openai import OpenAI

start_time = time.perf_counter()

client = OpenAI(base_url = "http://localhost:8000/v1", api_key = "dummy-api-key")

completion = client.chat.completions.create(
    model = "gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Could you introduce yourself?"} 
    ]
)

elapsed_time = time.perf_counter() - start_time

print(completion)

print()

print(f"id = {completion.id}")
print(f"model = {completion.model}")

print("choices:")
for choice in completion.choices:
    print(f"  {choice}")

print(f"usage = {completion.usage}")

print()

print(f"elapsed time = {elapsed_time:.3f} sec")

This is the part that changed.

completion = client.chat.completions.create(
    model = "gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Could you introduce yourself?"} 
    ]
)

Run it.

$ python3 quickstart2.py
ChatCompletion(id='chatcmpl-037f969d-8a31-4324-995a-1a878e901bd1', choices=[Choice(finish_reason='stop', index=0, message=ChatCompletionMessage(content="  Hello! I'm LLaMA, an AI assistant developed by Meta AI that can understand and respond to human input in a conversational manner. My primary function is to assist users with their inquiries and provide information on a wide range of topics. I'm here to help you with any questions or tasks you may have, so feel free to ask me anything! ", role='assistant', function_call=None, tool_calls=None))], created=1701510862, model='gpt-3.5-turbo', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=80, prompt_tokens=16, total_tokens=96))

id = chatcmpl-037f969d-8a31-4324-995a-1a878e901bd1
model = gpt-3.5-turbo
choices:
  Choice(finish_reason='stop', index=0, message=ChatCompletionMessage(content="  Hello! I'm LLaMA, an AI assistant developed by Meta AI that can understand and respond to human input in a conversational manner. My primary function is to assist users with their inquiries and provide information on a wide range of topics. I'm here to help you with any questions or tasks you may have, so feel free to ask me anything! ", role='assistant', function_call=None, tool_calls=None))
usage = CompletionUsage(completion_tokens=80, prompt_tokens=16, total_tokens=96)

elapsed time = 20.823 sec

That is much shorter.
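To compare the two runs more fairly, the usage figures can be turned into a rough generation speed. The numbers below are copied from the two outputs above; the function name is made up, and because the elapsed time also includes prompt processing and HTTP overhead, the result is only an approximation.

```python
def tokens_per_second(completion_tokens: int, elapsed_sec: float) -> float:
    """Rough generation speed; elapsed time also covers prompt processing and HTTP overhead."""
    return completion_tokens / elapsed_sec


# (completion_tokens, elapsed seconds) copied from the two runs above.
runs = {
    "quickstart.py": (279, 67.885),
    "quickstart2.py": (80, 20.823),
}
for name, (tokens, seconds) in runs.items():
    print(f"{name}: {tokens_per_second(tokens, seconds):.2f} tokens/sec")
# quickstart.py: 4.11 tokens/sec
# quickstart2.py: 3.84 tokens/sec
```

Both runs come out at roughly 4 tokens per second, so the second run finished sooner mainly because it produced fewer tokens.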

A closer look at the text generation model API

role

I jumped straight into using it like this, but that alone doesn't really explain what it means.

completion = client.chat.completions.create(
    model = "gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."},
        {"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."} 
    ]
)

Let's dig a little deeper here.

Text generation models

What I'm using this time is the Chat Completions API.

Text generation models / Chat Completions API

A chat model receives messages as a list and returns a message generated by the model as its output.

Chat models take a list of messages as input and return a model-generated message as output.

The API reference is here.

OpenAI / API reference / ENDPOINTS / Chat

The API actually being called is this one.

OpenAI / API reference / ENDPOINTS / Chat / Create chat completion

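model specifies which model to use. I specified gpt-3.5-turbo this time, but with llama-cpp-python as started above the parameter has no effect: the server answers with the model file given at startup.
The list of models is given in Models.

messages takes a list of dictionaries, each containing role and content.

role indicates whose message it is, and the following values can be specified.

system … configures the assistant's behavior, such as its personality and instructions on how it should act
user … a request or comment for the assistant to respond to
assistant … a message from the assistant (normally OpenAI, here llama-cpp-python)

There also appears to be function, which is used for calling tools.

Why is there an assistant role? During a chat, OpenAI (here, llama-cpp-python) does not remember the conversation;
instead, it understands what has been said so far because the whole conversation is sent with every request.
So, to continue a chat, the messages the assistant returned are sent back under the assistant role.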
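A continued conversation can therefore be sketched as a list that grows by one user message and one assistant message per exchange. The helper names add_turn and chat_once are made up for this example, and chat_once assumes a client created as in the scripts above; it is defined here but not called, so the snippet runs without a server.

```python
def add_turn(history, role, content):
    """Return a new message list with one more message appended."""
    return history + [{"role": role, "content": content}]


def chat_once(client, history, user_text, model="gpt-3.5-turbo"):
    """Send the whole history plus a new user message; return (reply, new_history)."""
    messages = add_turn(history, "user", user_text)
    completion = client.chat.completions.create(model=model, messages=messages)
    reply = completion.choices[0].message.content
    # Store the reply under the assistant role so the next request carries it.
    return reply, add_turn(messages, "assistant", reply)


# The history after one exchange, built by hand to show its shape.
history = add_turn([], "user", "Could you introduce yourself?")
history = add_turn(history, "assistant", "Hello! I'm LLaMA.")
history = add_turn(history, "user", "What did I just ask you?")
for message in history:
    print(message)
```

Because every request resends the full list, the prompt grows with each turn, which matters for both the context length and the response time on a local server.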

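Since I'm running llama-cpp-python locally, using the Quickstart sample as-is puts two messages in the request, and I suspect
that is what makes it noticeably slower... That said, the usage figures show that the first run also generated far more output (279 completion tokens versus 80), which probably accounts for much of the difference.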

Returning the response as JSON

As shown below, if you set the model to gpt-3.5-turbo-1106 (in the case of OpenAI) and specify response_format = { "type": "json_object" },
the content of choices[].message.content in the response apparently becomes JSON.

completion = client.chat.completions.create(
    model = "gpt-3.5-turbo-1106",
    response_format = { "type": "json_object" },
    messages=[
        {"role": "user", "content": "Could you introduce yourself?"}
    ]
)

Text generation models / JSON mode

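However, with llama-cpp-python, the request stopped returning a response once I added this setting... OpenAI's JSON mode documentation says the messages must also explicitly instruct the model to produce JSON, which this prompt does not do, so that may be a factor.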
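Whether or not JSON mode is available, the content field arrives as a string, so the caller still has to decode it. Below is a defensive sketch using only the standard library; parse_json_reply is a made-up name and the sample strings are invented for illustration.

```python
import json


def parse_json_reply(content):
    """Return the decoded JSON object, or None if content is not a JSON object."""
    try:
        value = json.loads(content)
    except (TypeError, ValueError):
        # TypeError covers None; ValueError covers malformed JSON.
        return None
    return value if isinstance(value, dict) else None


print(parse_json_reply('{"name": "LLaMA"}'))
print(parse_json_reply("Hello! I'm LLaMA."))
```

For a request that may never return, passing a timeout to the client (the OpenAI constructor accepts a timeout argument) is one way to make the script fail instead of waiting indefinitely.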

Returning multiple messages

As shown below, specifying a value of 2 or more for n is supposed to make multiple messages come back (the default is 1).

completion = client.chat.completions.create(
    model = "gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Could you introduce yourself?"}
    ],
    n = 2
)

Returning more messages uses correspondingly more tokens, so some care is needed.
However, with llama-cpp-python, only one message came back even when I specified a value of 2 or more...
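If several candidates are needed from a server that returns only one choice per request, one possible workaround is to keep sending requests until enough replies have been collected. This is a sketch, not something verified against llama-cpp-python: collect_replies is a made-up helper, and OneChoiceStub is a stand-in client that imitates the one-choice behaviour seen above so the snippet runs without a server.

```python
from types import SimpleNamespace


def collect_replies(client, messages, n, model="gpt-3.5-turbo"):
    """Gather up to n replies, issuing extra requests if a response has fewer choices."""
    replies = []
    while len(replies) < n:
        completion = client.chat.completions.create(
            model=model, messages=messages, n=n - len(replies)
        )
        if not completion.choices:
            break  # avoid looping forever if the server returns nothing
        replies.extend(choice.message.content for choice in completion.choices)
    return replies[:n]


class OneChoiceStub:
    """Stand-in client that always returns exactly one choice, ignoring n."""

    def __init__(self):
        self.calls = 0
        self.chat = SimpleNamespace(completions=self)

    def create(self, **kwargs):
        self.calls += 1
        message = SimpleNamespace(content=f"reply {self.calls}")
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


stub = OneChoiceStub()
print(collect_replies(stub, [{"role": "user", "content": "Hi"}], n=2))
```

Each extra request costs a full generation, so on a slow local server the total time grows roughly in proportion to n.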

Making responses reproducible

The Chat Completions API is non-deterministic by default, and results can differ from request to request.
To suppress this as much as possible, set temperature to 0.

completion = client.chat.completions.create(
    model = "gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Could you introduce yourself?"} 
    ],
    temperature = 0
)

Text generation models / Reproducible outputs

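temperature accepts values from 0 to 2 (decimals allowed); the lower the value, the more deterministic the result, and the higher the value, the more random.

This parameter worked with llama-cpp-python as well.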
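The effect of temperature can be illustrated with a small standalone calculation. The sketch below shows the general idea of temperature scaling (dividing the model's scores by the temperature before turning them into probabilities); it is not a description of how OpenAI or llama-cpp-python implement sampling internally, and the scores are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; temperature 0 is treated as 'always pick the best'."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, probability moves away from the top candidate toward the others, which is why higher values give more varied output.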