What prompted this post?
When I wrote the entry below, I summarized the main concepts in the OpenAI API.
OpenAI Python APIライブラリーからllama-cpp-pythonで立てたOpenAI API互換のサーバーへアクセスしてみる - CLOVER🍀
Among them, the way tokens are counted caught my interest; since tiktoken apparently lets you tokenize strings and count tokens yourself, I thought I'd give it a try.
Tokens
Let's start by reviewing what a token is.
A token is the following concept:
- The unit of processing for text generation models and embeddings; it is what strings are broken down into
- One word does not necessarily become one token
- Can be checked with the Tokenizer
- For text generation models, the prompt plus the output must not exceed the model's maximum context length
- For embeddings (which do not output tokens), the input must be shorter than the model's maximum context length
- The maximum context length of each text generation and embedding model can be checked on the Models page
In other words, tokens are the upper bound on input and output for the APIs backed by each model.
Tokens also factor into OpenAI API usage fees, so they're worth paying attention to on that front as well.
You can check how a string gets tokenized with the Tokenizer.
The documentation notes that the exact tokenization process differs by model; that said, looking at the Tokenizer page, GPT-3.5 and GPT-4 appear to use the same one.
It's important to note that the exact tokenization process varies between models. Newer models like GPT-3.5 and GPT-4 use a different tokenizer than our legacy GPT-3 and Codex models, and will produce different tokens for the same input text.
For common English text, one token generally corresponds to about 4 characters, or roughly ¾ of a word, so 100 tokens come out to about 75 words.
A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).
It goes on to say that if you want a program that tokenizes text, you can use tiktoken.
If you need a programmatic interface for tokenizing text, check out our tiktoken package for Python. For JavaScript, the community-supported @dqbd/tiktoken package works with most GPT models.
Let's try a few inputs in the Tokenizer.
With Japanese, the split positions look a little odd.
The Tokenizer can also show the token ids.
At this point I didn't yet know what the ids meant, though.
As a somewhat longer text, let's paste in the entire first section of that page, and then tokenize a Japanese version of it produced with Google Translate.
Japanese seems to produce more tokens; compared with the character counts, the relationship between the two languages is reversed.
tiktoken
Next, on to tiktoken. The GitHub repository is here:
GitHub - openai/tiktoken: tiktoken is a fast BPE tokeniser for use with OpenAI's models.
The current version is 0.5.2.
tiktoken is described as a fast BPE tokenizer for use with OpenAI's models.
tiktoken is a fast BPE tokeniser for use with OpenAI's models.
It is implemented in Python, but what you can install via pip is the open-source version.
The open source version of tiktoken can be installed from PyPI:
I don't know what exists besides the open-source version, though.
Sample code is here:
https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
That page also includes links to tokenization libraries other than tiktoken.
What a BPE tokenizer actually is, is described here:
tiktoken / What is BPE anyway?
BPE stands for Byte pair encoding; it converts text into tokens and has the following properties:
- It's reversible and lossless, so tokens can be converted back into the original text
- It works on arbitrary text, even text that was not in the tokenizer's training data
- It compresses the text: the token sequence is shorter than the bytes corresponding to the original text (on average, each token is about 4 bytes)
- It lets the model see common subwords. For instance, "ing" is a common subword in English, so BPE will often split "encoding" into tokens like "encod" and "ing" (rather than, say, "enc" and "oding"), which helps the model understand grammar
Encodings and models
Which model corresponds to which encoding when converting (encoding) text into tokens with tiktoken is documented on the sample page:
https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
| Encoding name | Models |
|---|---|
| cl100k_base | gpt-4, gpt-3.5-turbo, text-embedding-ada-002 |
| p50k_base | Codex models, text-davinci-002, text-davinci-003 |
| r50k_base (or gpt2) | GPT-3 models like davinci |
The mapping itself appears to be defined here:
https://github.com/openai/tiktoken/blob/0.5.2/tiktoken/model.py#L7-L64
That's enough reading; let's try using it.
Environment
This time's environment:
$ python3 -V
Python 3.10.12

$ pip3 -V
pip 22.0.2 from /usr/lib/python3/dist-packages/pip (python 3.10)
Trying tiktoken
First, install tiktoken.
$ pip3 install tiktoken
Its version:
$ pip3 freeze | grep tiktoken
tiktoken==0.5.2
Let's write a simple sample. We'll use the same input strings we tried in the Tokenizer.
token_sample.py
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

result = encoding.encode("Hello World.")
print(f"input text = Hello World., tokenize result = {result}, token length = {len(result)}")

result = encoding.encode("こんにちは、世界。")
print(f"input text = こんにちは、世界。, tokenize result = {result}, token length = {len(result)}")
Specify the model to get an encoding (an Encoding instance).
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
Then pass a string to the Encoding#encode method to tokenize it.
result = encoding.encode("Hello World.")
Let's run it.
$ python3 token_sample.py
input text = Hello World., tokenize result = [9906, 4435, 13], token length = 3
input text = こんにちは、世界。, tokenize result = [90115, 5486, 3574, 244, 98220, 1811], token length = 6
The tokenization results match the token ids we saw in the Tokenizer.
So a token, it turns out, refers to this id (an integer).
Also, since the length of the list of tokens matches the count the Tokenizer reported, counting the tokens of a text is just a matter of looking at this length.
According to the documentation, tokenized output can be turned back into the original string. For that, we use Encoding#decode.
Let's modify the earlier program a little.
token_sample.py
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

result = encoding.encode("Hello World.")
print(f"input text = Hello World., tokenize result = {result}, token length = {len(result)}")

decoded = encoding.decode(result)
print(f"decoded = {decoded}")

result = encoding.encode("こんにちは、世界。")
print(f"input text = こんにちは、世界。, tokenize result = {result}, token length = {len(result)}")

decoded = encoding.decode(result)
print(f"decoded = {decoded}")
Let's check.
$ python3 token_sample.py
input text = Hello World., tokenize result = [9906, 4435, 13], token length = 3
decoded = Hello World.
input text = こんにちは、世界。, tokenize result = [90115, 5486, 3574, 244, 98220, 1811], token length = 6
decoded = こんにちは、世界。
Sure enough, we could restore the original strings from the tokenized results.
To decode a single token, it seems better to use Encoding#decode_single_token_bytes. Encoding#decode can be lossy for tokens that don't fall on UTF-8 character boundaries.
Incidentally, the name of the encoding we've been using is this:
print(f"encoding name = {encoding.name}")
encoding name = cl100k_base
You can also use an encoding by specifying its name directly. In that case, use tiktoken#get_encoding.
token_sample2.py
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

result = encoding.encode("Hello World.")
print(f"input text = Hello World., tokenize result = {result}, token length = {len(result)}")

decoded = encoding.decode(result)
print(f"decoded = {decoded}")

result = encoding.encode("こんにちは、世界。")
print(f"input text = こんにちは、世界。, tokenize result = {result}, token length = {len(result)}")

decoded = encoding.decode(result)
print(f"decoded = {decoded}")

print(f"encoding name = {encoding.name}")
The result:
$ python3 token_sample2.py
input text = Hello World., tokenize result = [9906, 4435, 13], token length = 3
decoded = Hello World.
input text = こんにちは、世界。, tokenize result = [90115, 5486, 3574, 244, 98220, 1811], token length = 6
decoded = こんにちは、世界。
encoding name = cl100k_base
Finally, let's use a quick test to check the differences between encodings.
Install pytest.
$ pip3 install pytest
Its version:
$ pip3 freeze | grep pytest
pytest==7.4.3
The tests look like this. See the comments for what each test checks (or rather, confirms).
test_tiktoken.py
import tiktoken

# Confirm that GPT-3.5-turbo and GPT-4 use the same encoding
def test_gpt35_gpt4_encoding_equals():
    assert tiktoken.encoding_for_model("gpt-3.5-turbo") == tiktoken.encoding_for_model("gpt-4")
    assert tiktoken.encoding_for_model("gpt-4").name == "cl100k_base"

# Confirm that GPT-3.5-turbo and GPT-4 produce the same encoding results
def test_gpt35_gpt4_encoding_result_equals():
    encoding_for_gpt35turbo = tiktoken.encoding_for_model("gpt-3.5-turbo")
    encoding_for_gpt4 = tiktoken.encoding_for_model("gpt-4")

    for text in ["Hello World.", "こんにちは、世界。"]:
        assert encoding_for_gpt35turbo.encode(text) == encoding_for_gpt4.encode(text)

# Confirm the differences between encodings
def test_each_encodings():
    cl100k_base_encoding = tiktoken.get_encoding("cl100k_base")
    assert cl100k_base_encoding.encode("Hello World.") == [9906, 4435, 13]
    assert cl100k_base_encoding.encode("こんにちは、世界。") == [90115, 5486, 3574, 244, 98220, 1811]

    p50k_base_encoding = tiktoken.get_encoding("p50k_base")
    assert p50k_base_encoding.encode("Hello World.") == [15496, 2159, 13]
    assert p50k_base_encoding.encode("こんにちは、世界。") == [46036, 22174, 28618, 2515, 94, 31676, 23513, 10310, 244, 45911, 234, 16764]

    r50k_base_encoding = tiktoken.get_encoding("r50k_base")
    assert r50k_base_encoding.encode("Hello World.") == [15496, 2159, 13]
    assert r50k_base_encoding.encode("こんにちは、世界。") == [46036, 22174, 28618, 2515, 94, 31676, 23513, 10310, 244, 45911, 234, 16764]
Out of curiosity I compared the encodings; cl100k_base behaves differently from the rest.
For the inputs tried here, there was no difference between p50k_base and r50k_base.
That said, the one you'd normally use is probably cl100k_base anyway.
That's about it for this time.
Wrapping up
I tried using tiktoken to convert text into tokens.
Reading the documentation, phrases like "turn text into tokens" felt rather abstract and hard to pin down, but actually running it as a program made things click, more or less.