Trying Out Text Generation with Transformers

What is this post about?

Transformers looked like it could do text generation, so I'd like to try that as my first foothold into Transformers.

Text generation with Transformers

Text generation is listed among the things Transformers can do.

It was a little hard to tell where it falls within the Task Guides, but it appears to be this one.

The following models are listed as being able to generate text.

Some of the models that can generate text include GPT2, XLNet, OpenAI GPT, CTRL, TransformerXL, XLM, Bart, T5, GIT, Whisper.

This time, let's use GPT2.

OpenAI GPT2

For pretrained models, I'll try the following two, which appear to support Japanese.

abeja/gpt2-large-japanese · Hugging Face

rinna/japanese-gpt2-medium · Hugging Face

Environment

Here is the environment used this time.

$ python3 --version
Python 3.10.12


$ pip3 --version
pip 22.0.2 from /usr/lib/python3/dist-packages/pip (python 3.10)

This is a CPU-only environment.

Trying abeja/gpt2-large-japanese

First, I'd like to try abeja/gpt2-large-japanese.

Create and activate a virtual environment.

$ python3 -m venv venv
$ . venv/bin/activate

Following the Transformers installation page and the abeja/gpt2-large-japanese page, install the libraries.

$ pip3 install transformers[torch,sentencepiece]

Installation

abeja/gpt2-large-japanese · Hugging Face

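The size turned out to be rather shocking... Is this just how it is? Judging from the package list below, much of it is probably PyTorch and the NVIDIA CUDA libraries it pulls in, even though this environment is CPU-only.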

$ du -sh venv
4.7G    venv
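
To see where the space actually goes, a small script like the following could total the size of each entry directly under site-packages. This is only a sketch using the standard library: the `venv/lib/python3.10/site-packages` path is an assumption based on the environment above, and `dir_size` and `largest_entries` are names I made up for illustration.

```python
from pathlib import Path


def dir_size(path: Path) -> int:
    """Return the total size in bytes of all regular files under path."""
    return sum(
        f.stat().st_size
        for f in path.rglob("*")
        if f.is_file() and not f.is_symlink()
    )


def largest_entries(root: Path, top: int = 10) -> list[tuple[str, int]]:
    """Return the `top` largest direct children of root as (name, bytes), biggest first."""
    sizes = []
    for child in root.iterdir():
        size = dir_size(child) if child.is_dir() else child.stat().st_size
        sizes.append((child.name, size))
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]


if __name__ == "__main__":
    # Assumed location for a venv created with Python 3.10, as in this post.
    site_packages = Path("venv/lib/python3.10/site-packages")
    if site_packages.is_dir():
        for name, size in largest_entries(site_packages):
            print(f"{size / 1024 ** 2:10.1f} MiB  {name}")
```

Run from the directory that contains `venv`, this would print the ten largest packages, which should make it clear which dependencies dominate.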

Installed libraries:

$ pip3 list
Package                  Version
------------------------ ----------
accelerate               0.25.0
certifi                  2023.11.17
charset-normalizer       3.3.2
filelock                 3.13.1
fsspec                   2023.12.2
huggingface-hub          0.20.1
idna                     3.6
Jinja2                   3.1.2
MarkupSafe               2.1.3
mpmath                   1.3.0
networkx                 3.2.1
numpy                    1.26.2
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.18.1
nvidia-nvjitlink-cu12    12.3.101
nvidia-nvtx-cu12         12.1.105
packaging                23.2
pip                      22.0.2
protobuf                 4.25.1
psutil                   5.9.7
PyYAML                   6.0.1
regex                    2023.12.25
requests                 2.31.0
safetensors              0.4.1
sentencepiece            0.1.99
setuptools               59.6.0
sympy                    1.12
tokenizers               0.15.0
torch                    2.1.2
tqdm                     4.66.1
transformers             4.36.2
triton                   2.1.0
typing_extensions        4.9.0
urllib3                  2.1.0

First, let's write it using pipeline.

Pipelines for inference

With the following source code, let's have it generate a continuation of 「日本の首都は」 ("The capital of Japan is").

text_generation_pipeline.py

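import time
from transformers import pipeline

# Start timing before the model is loaded, so the result includes load time.
start_time = time.perf_counter()

# Build a text-generation pipeline; the model is downloaded on first use.
generator = pipeline("text-generation", model="abeja/gpt2-large-japanese")

# Prompt: "The capital of Japan is"
outputs = generator("日本の首都は")

elapsed_time = time.perf_counter() - start_time

print(outputs)

print()

print(f"elapsed time = {elapsed_time:.3f} sec")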

Here is the result. On the first run the model is downloaded, so it takes a while. This model was about 3GB.

$ python3 text_generation_pipeline.py
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
[{'generated_text': '日本の首都は東京であり、これを中心とした地域である。 この区分を、地方行政における「広域地方行政単位」に適用する。 広域地方行政単位は、複数の地方公共団体で構成される。 現在の日本の広域地方行政単位は、旧町村の'}]

elapsed time = 43.904 sec
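
As the output shows, the pipeline returns a list of dicts, each with a `generated_text` key whose value includes the prompt. A small helper like the one below could pull out only the newly generated part. The name `continuation` is my own, and the sample data is a shortened copy of the output above.

```python
def continuation(outputs: list[dict], prompt: str) -> list[str]:
    """Return each generated text with the leading prompt removed."""
    texts = []
    for item in outputs:
        text = item["generated_text"]
        # The generated text starts with the prompt here, so drop it if present.
        texts.append(text[len(prompt):] if text.startswith(prompt) else text)
    return texts


# Shortened sample copied from the first run above.
outputs = [{"generated_text": "日本の首都は東京であり、これを中心とした地域である。"}]
print(continuation(outputs, "日本の首都は"))
```

As far as I know, the text-generation pipeline also accepts `return_full_text=False`, which should have a similar effect without any post-processing.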

Tokyo came up this time, but the result changes on every run. It actually answered with a city outside Japan more often than with Tokyo.

$ python3 text_generation_pipeline.py
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
[{'generated_text': '日本の首都はローマ(正確には、アッシジの旧市街とパレルモの間)にあるのだが、この旧市街はローマ帝国や東方拡大とともに徐々に衰退したのである。それでも、ローマの旧市街は、 ローマの時代から、ヨーロッパでは非常に'}]

elapsed time = 47.745 sec
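
This run-to-run variation is what you would expect if the next token is sampled from a probability distribution rather than always being the most likely one. The following toy sketch uses made-up probabilities and no real model; it only illustrates that sampling gives varying answers, while fixing the random seed makes the draws repeatable.

```python
import random

# Made-up next-token probabilities for the prompt; NOT taken from any real model.
NEXT_TOKEN_PROBS = {"東京": 0.4, "ロンドン": 0.25, "ローマ": 0.2, "ニューヨーク": 0.15}


def sample_token(rng: random.Random) -> str:
    """Draw one token at random, weighted by its probability."""
    tokens = list(NEXT_TOKEN_PROBS)
    weights = list(NEXT_TOKEN_PROBS.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    unseeded = random.Random()  # seeded from the OS, so results differ per run
    print("unseeded:", [sample_token(unseeded) for _ in range(5)])

    seeded = random.Random(0)   # fixed seed, so the same sequence every run
    print("seeded:  ", [sample_token(seeded) for _ in range(5)])
```

In Transformers itself, calling `transformers.set_seed(...)` before generating should play the role of the fixed seed here.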

Next, let's use Auto Classes.

text_generation_autoclasses.py

import time
from transformers import AutoModelForCausalLM, AutoTokenizer

# Start timing before loading, so the result includes load time.
start_time = time.perf_counter()

tokenizer = AutoTokenizer.from_pretrained("abeja/gpt2-large-japanese")
model = AutoModelForCausalLM.from_pretrained("abeja/gpt2-large-japanese")

# Tokenize the prompt ("The capital of Japan is") into PyTorch tensors.
inputs = tokenizer("日本の首都は", return_tensors="pt")

# Generate up to 15 new tokens after the prompt.
outputs = model.generate(
    **inputs,
    max_new_tokens=15,
    pad_token_id=tokenizer.pad_token_id
)

# Convert the generated token IDs back into text.
generated_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)

elapsed_time = time.perf_counter() - start_time

print(generated_texts)

print()

print(f"elapsed time = {elapsed_time:.3f} sec")

This uses AutoTokenizer and AutoModelForCausalLM.

Auto Classes / AutoTokenizer

Auto Classes / Natural Language Processing / AutoModelForCausalLM

The source code was written with reference to this page.

Text generation strategies

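Result. It stubbornly continues "the capital of Japan is" with London every time. I think the generation parameters differ from those used with pipeline: `generate()` uses greedy decoding unless sampling is enabled with `do_sample=True`, whereas the pipeline runs above were evidently sampling.
Note: I wondered whether changing temperature would help, but as far as I know temperature only takes effect when sampling is enabled. I didn't try it this time.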

$ python3 text_generation_autoclasses.py
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
['日本の首都はロンドンです。 ロンドンは、世界でも有数の大都市で、世界中']

elapsed time = 88.547 sec
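
On the temperature idea: as far as I understand it, temperature rescales the model's scores before they are turned into probabilities, so it changes how sampling behaves but not which token is the most likely one. The toy sketch below uses made-up scores (not from any model) to show that the top choice stays the same at every temperature, which suggests that temperature alone would not change a greedy result.

```python
import math

# Made-up scores (logits) for candidate next tokens; NOT taken from any real model.
LOGITS = {"ロンドン": 2.0, "東京": 1.5, "ローマ": 0.5}


def softmax_with_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Divide each logit by the temperature, then apply softmax to get probabilities."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(LOGITS, t)
    best = max(probs, key=probs.get)
    summary = ", ".join(f"{tok}={p:.2f}" for tok, p in probs.items())
    print(f"temperature={t}: most likely={best} ({summary})")
```

A lower temperature concentrates probability on the top token and a higher one spreads it out, but the ranking itself does not change.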

The model used here is a GPT2 model trained by ABEJA on Japanese CC-100, Japanese Wikipedia, and Japanese OSCAR.

abeja/gpt2-large-japanese · Hugging Face

The tokenizer uses SentencePiece (which is why it has to be installed).

GitHub - google/sentencepiece: Unsupervised text tokenizer for Neural Network-based text generation.

rinna/japanese-gpt2-medium

Next, let's try rinna/japanese-gpt2-medium. I created a separate environment for it, but rinna/japanese-gpt2-medium also appears to use SentencePiece, just like abeja/gpt2-large-japanese, so in the end the libraries to install were the same...

$ pip3 install transformers[torch,sentencepiece]

I'll omit the dependency version information.

Here is the source code written with pipeline. Apart from the model name, it is the same as the one used for abeja/gpt2-large-japanese.

text_generation_pipeline.py

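import time
from transformers import pipeline

# Start timing before the model is loaded, so the result includes load time.
start_time = time.perf_counter()

# Build a text-generation pipeline; the model is downloaded on first use.
generator = pipeline("text-generation", model="rinna/japanese-gpt2-medium")

# Prompt: "The capital of Japan is"
outputs = generator("日本の首都は")

elapsed_time = time.perf_counter() - start_time

print(outputs)

print()

print(f"elapsed time = {elapsed_time:.3f} sec")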

Run it. On the first run the model is downloaded, so it takes a while.

$ python3 text_generation_pipeline.py
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
[{'generated_text': '日本の首都は東京であり、日本の首都の中で最も人口が多い大都市です。人口は約280万人で、神奈川県の中でも有数の人口密集地です。東京に在住する人は、人口が最も多い都市です。そして、横浜市や川崎市、相模原市には、'}]

elapsed time = 5.614 sec

This one also lies with confidence... but it returns faster than abeja/gpt2-large-japanese. I suspect that is partly because the model is about 1.3GB, less than half the size of abeja/gpt2-large-japanese.

$ python3 text_generation_pipeline.py
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
[{'generated_text': '日本の首都はニューヨークのマンハッタンに位置する。 人口は約304万人。 この都市の起源は1655年にニューヨーク市のマンハッタン島にあったセントポールが、1657年にマンハッタン島の対岸に移動し、マンハッタン島が正式にアメリカ合衆国の首都となった。 17世紀、アメリカ合衆国で'}]

elapsed time = 5.762 sec
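
Note that the elapsed times in these scripts cover both loading the model and generating text, since the timer starts before `pipeline(...)` is called. To compare generation speed alone, the two phases could be timed separately. Below is a minimal sketch with a hypothetical `timed` context manager; the `time.sleep` calls are stand-ins for the real loading and generation steps.

```python
import time
from contextlib import contextmanager

# Collected durations in seconds, keyed by label.
timings: dict[str, float] = {}


@contextmanager
def timed(label: str):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


with timed("load"):
    time.sleep(0.02)  # stand-in for: generator = pipeline("text-generation", model=...)

with timed("generate"):
    time.sleep(0.01)  # stand-in for: outputs = generator("日本の首都は")

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} sec")
```

Splitting the measurement this way would show how much of each total is model loading and how much is generation.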

Now with Auto Classes. Here I configured AutoTokenizer as described on the rinna/japanese-gpt2-medium page.

text_generation_autoclasses.py

import time
from transformers import AutoModelForCausalLM, AutoTokenizer

# Start timing before loading, so the result includes load time.
start_time = time.perf_counter()

# Tokenizer settings as described on the rinna/japanese-gpt2-medium model page.
tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt2-medium", use_fast=False)
tokenizer.do_lower_case = True
model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt2-medium")

# Tokenize the prompt ("The capital of Japan is") into PyTorch tensors.
inputs = tokenizer("日本の首都は", return_tensors="pt")

# Generate up to 15 new tokens after the prompt.
outputs = model.generate(
    **inputs,
    max_new_tokens=15,
    pad_token_id=tokenizer.pad_token_id
)

# Convert the generated token IDs back into text.
generated_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)

elapsed_time = time.perf_counter() - start_time

print(generated_texts)

print()

print(f"elapsed time = {elapsed_time:.3f} sec")

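Result. This one also shows no variation from run to run, so it feels like I'm not specifying enough parameters... That would be consistent with `generate()` using greedy decoding unless sampling is enabled.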

$ python3 text_generation_autoclasses.py
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
['日本の首都はニューヨーク です。 ニューヨークは、 ニューヨークシティ とも呼ばれます。']

elapsed time = 3.194 sec
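
If the goal is to get varied output from `generate()` the way the pipeline does, my understanding is that sampling has to be switched on explicitly with `do_sample=True`, after which `temperature` and `top_k` shape the sampling. The helper below is a sketch under that assumption. `generate_sampled` is a name I made up, the parameter values are arbitrary, and I have not run it against these models.

```python
def generate_sampled(model, tokenizer, prompt: str, max_new_tokens: int = 15,
                     temperature: float = 0.7, top_k: int = 50) -> list[str]:
    """Generate a continuation of `prompt` using sampling instead of greedy decoding.

    `model` and `tokenizer` are expected to be the objects returned by
    AutoModelForCausalLM.from_pretrained(...) and AutoTokenizer.from_pretrained(...).
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,           # sample from the distribution instead of taking the top token
        temperature=temperature,  # arbitrary value chosen for illustration
        top_k=top_k,              # arbitrary value chosen for illustration
        pad_token_id=tokenizer.pad_token_id,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

With the `tokenizer` and `model` from the script above, this would be called as `generate_sampled(model, tokenizer, "日本の首都は")`.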

This model appears to be pretrained on Japanese CC-100 and Japanese Wikipedia.

rinna/japanese-gpt2-medium · Hugging Face

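Like abeja/gpt2-large-japanese, it uses SentencePiece for its tokenizer.

Conclusion

I tried out text generation with Transformers.

Working through the Transformers documentation took quite a bit of effort, but I managed to get things running for now...

It uses more resources than I expected (disk space included), which makes it a bit hard to handle in my environment. It is good study material, though, so I'd like to do more with it... I'm not sure yet how to go about that.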

CLOVER🍀
