Basic usage of the Keras Tokenizer

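In natural language processing, when you use a seq2seq model for tasks such as translation, or any other RNN-based model,
preprocessing often involves converting each text into a sequence of integers (tokenization).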

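In such cases the Tokenizer provided among the Keras utilities is handy, so this post introduces its basic usage.
It covers the following four topics. (Finer options and other ways of using it are left for future updates.)
– Creating an instance
– Converting text into integer sequences
– The settings you get when creating an instance with default parameters
– Converting integer sequences back into text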

See the official Keras documentation for the details.

We need some sample data, so we load a small part of the 20 Newsgroups dataset.


from sklearn.datasets import fetch_20newsgroups

# Load the data. A small amount is enough, so restrict it to a single category.
remove = ('headers', 'footers', 'quotes')
categories = ["sci.med"]
twenty_news = fetch_20newsgroups(
    subset='train',
    remove=remove,
    categories=categories,
)
text_data = twenty_news.data

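To use the Tokenizer, you first create an instance
and then fit it on the text data.
(Words that were not seen during fitting cannot be tokenized; by default they are simply dropped from the output.)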


from tensorflow.keras.preprocessing.text import Tokenizer
# Create a Tokenizer instance
keras_tokenizer = Tokenizer()
# Fit on the list of strings
keras_tokenizer.fit_on_texts(text_data)

# The learned words and their indices
print(keras_tokenizer.word_index)
"""
{'the': 1, 'of': 2, 'to': 3, 'and': 4, 'a': 5, 'in': 6, 'is': 7,
 'i': 8, 'that': 9, 'it': 10, 'for': 11, 'this': 12, 'are': 13, ...,
--- rest omitted ---
"""

To tokenize text data, pass a list of texts to texts_to_sequences.
Be careful: if you pass a single string, it is treated as a sequence of characters and processed one character at a time.


# Convert the text data into integer sequences
sequence_data = keras_tokenizer.texts_to_sequences(text_data)
# The result for the first text
print(sequence_data[0])
"""
[780, 3, 1800, 4784, 4785, 3063, 1800, 2596, 10, 41, 130, 24,
15, 4, 148, 388, 2597, 11, 60, 110, 20, 38, 515, 108, 586, 704,
353, 21, 46, 31, 7, 467, 3, 268, 6, 5, 4786, 965, 2223, 43, 2598,
2, 1, 515, 24, 15, 13, 747, 11, 5, 705, 662, 586, 37, 423, 587, 7092,
77, 13, 1490, 3, 130, 16, 5, 2224, 2, 12, 415, 3064, 12, 7, 5, 516,
6, 40, 79, 47, 18, 610, 3732, 1801, 26, 2225, 706, 918, 3065, 2,
1, 1801, 21, 32, 61, 1638, 31, 329, 7, 9, 1, 1802, 966, 1491,
18, 3, 126, 3066, 4, 50, 1352, 3067]
"""

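That gives us the tokenization we wanted.
This time we passed no arguments when creating the instance, so everything is at its default setting,
but let's check the main settings anyway.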


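# By default, text is converted to lowercase.
print(keras_tokenizer.lower)
# True

# By default, tokenization is per word (split on the `split` string below), not per character.
print(keras_tokenizer.char_level)
# False

# By default, split is a single space, so text is split on spaces.
print(keras_tokenizer.split == " ")
# True

# Certain symbols are filtered out; when one appears inside a word, the word is split there.
# (The default filters also include tab and newline, which are not visible in the output.)
print(keras_tokenizer.filters)
# !"#$%&()*+,-./:;<=>?@[\]^_`{|}~

# For example, in dog&cat the & is removed, and dog and cat are tokenized separately.
print(keras_tokenizer.texts_to_sequences(["dog&cat"]))
# [[7316, 2043]]

# The result is the same as when the & is a space.
print(keras_tokenizer.texts_to_sequences(["dog cat"]))
# [[7316, 2043]]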
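How lower, filters and split work together can be sketched in pure Python. This is a rough approximation written for illustration, not the Keras source: `split_words` and `DEFAULT_FILTERS` are hypothetical names, and the filter string is assumed to match the default shown above plus tab and newline.

```python
# Assumed to mirror the default filters (symbols plus tab and newline).
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Lowercase, replace filtered characters with `split`, then split."""
    if lower:
        text = text.lower()
    # Map every filtered character to the split string.
    table = str.maketrans({ch: split for ch in filters})
    text = text.translate(table)
    # Drop the empty strings left behind by consecutive separators.
    return [word for word in text.split(split) if word]


print(split_words("Dog&Cat"))
# ['dog', 'cat']
print(split_words("dog cat"))
# ['dog', 'cat']
```

Because the filtered character is replaced by the split string rather than just deleted, `dog&cat` becomes two words instead of `dogcat`.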

Finally, here is how to turn token sequences back into text.
Use sequences_to_texts.


# Convert the integer sequences back into text.
text_data_2 = keras_tokenizer.sequences_to_texts(sequence_data)

print("Original text")
print(text_data[0])
print("\nRestored text")
print(text_data_2[0])

"""
Original text
[reply to keith@actrix.gen.nz (Keith Stewart)]


It would help if you (and anyone else asking for medical information on
some subject) could ask specific questions, as no one is likely to type
in a textbook chapter covering all aspects of the subject.  If you are
looking for a comprehensive review, ask your local hospital librarian.
Most are happy to help with a request of this sort.

Briefly, this is a condition in which patients who have significant
residual weakness from childhood polio notice progression of the
weakness as they get older.  One theory is that the remaining motor
neurons have to work harder and so die sooner.

Restored text
reply to keith actrix gen nz keith stewart it would help if you and anyone
else asking for medical information on some subject could ask specific
questions as no one is likely to type in a textbook chapter covering all
aspects of the subject if you are looking for a comprehensive review ask
your local hospital librarian most are happy to help with a request of this
sort briefly this is a condition in which patients who have significant
residual weakness from childhood polio notice progression of
the weakness as they get older one theory is that the remaining
motor neurons have to work harder and so die sooner
"""

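You can see that, besides the line breaks, symbols such as parentheses and commas have disappeared and uppercase letters have been converted to lowercase.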
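The reason the restored text cannot match the original is that restoring is just a reverse lookup from index to word, joined with spaces. The following is a minimal sketch of that idea with a made-up toy vocabulary (`toy_word_index` and `sequence_to_text` are hypothetical names, not the Keras implementation). A fitted Tokenizer holds the corresponding reverse mapping in its `index_word` attribute.

```python
# A made-up vocabulary for illustration.
toy_word_index = {"the": 1, "sat": 2, "cat": 3}

# Invert the mapping: index -> word.
index_to_word = {index: word for word, index in toy_word_index.items()}


def sequence_to_text(sequence, index_to_word):
    """Join the words for each known index with single spaces."""
    return " ".join(index_to_word[i] for i in sequence if i in index_to_word)


print(sequence_to_text([1, 3, 2], index_to_word))
# the cat sat
```

Anything that was removed before the lookup table was built (case, punctuation, line breaks) is simply not stored anywhere, so it cannot come back.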
