Bert -

A summary of ways to compute distributed representations for Japanese

Posted on Wed Mar 2 2022 | 2 minutes | 585 words |

Word-level distributed representations

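Word2vec
- One of the original distributed-representation methods in natural language processing
- Worth knowing at least the basic principle
- gensim is the commonly used implementation
Taking fastText all the way to a document classification task
- As the "fast" in the name suggests, the model published by Facebook runs quickly
- Convenient: it supports both distributed representations and class classification
- In particular, because of how the model splits text into subword units, it is considered robust to unknown words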
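The robustness to unknown words noted above comes from fastText representing a word as a bag of character n-grams, so an unseen word still shares pieces with words seen in training. The following is a minimal sketch of that splitting step only, not the library's implementation. The function name `char_ngrams` is hypothetical, and the 3 to 6 n-gram range and the `<` / `>` boundary markers follow the original fastText paper.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# Pretend "tokenizer" was seen in training and "tokenizers" was not.
known = set(char_ngrams("tokenizer"))
unseen = set(char_ngrams("tokenizers"))
shared = known & unseen

print(f"{len(shared)} of {len(unseen)} n-grams of the unseen word are already known")
```

Because most n-grams of the unseen word overlap with the known one, a vector can still be assembled for it from the n-gram vectors.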
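A note on how to use the BERT model pretrained on Japanese Wikipedia, now publicly available
- Reportedly also used in Google's search engine
- A model that significantly changed natural language processing research
- The related Transformer technique has been reused not only in natural language processing but also in the image processing community
- Various Japanese versions of BERT are published on huggingface
T5 with Japanese support
- The samples published by the author of this Japanese model are easy to follow
- The same author has also published an SBERT model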

Sentence-level distributed representations

tf-idf
- The first option to try
- Assigns scores computed from word occurrence frequencies
- gensim is the commonly used implementation
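As a rough illustration of the frequency-based scoring described above, here is a small pure-Python sketch of one textbook tf-idf variant (length-normalized term frequency times `log(N / df)`). It is not what gensim computes by default, since libraries typically apply their own smoothing and normalization. The function name `tf_idf` and the toy documents are made up for the example, and the Japanese tokens are segmented by hand; real text would first need a morphological analyzer.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hand-segmented toy documents (assumption: tokenization is already done).
docs = [
    ["猫", "が", "鳴く"],
    ["犬", "が", "吠える"],
    ["鳥", "が", "飛ぶ"],
]
vectors = tf_idf(docs)
print(vectors[0])
```

The particle "が" appears in every document, so its weight drops to zero, while words unique to one document keep a positive weight.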
BM25
- Computes a score from word occurrence frequencies
- My impression is that it is often used for the rough first-stage retrieval in QA models
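To make the "rough first-stage retrieval" use concrete, the sketch below scores a few toy documents against a query with Okapi BM25. It is a minimal illustration under assumptions: the function name `bm25_scores` and the documents are invented, `k1=1.5` and `b=0.75` are commonly quoted defaults rather than values from the post, and the idf uses the non-negative `log(1 + ...)` variant.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        score = 0.0
        for term in query:
            if term not in counts:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            tf = counts[term]
            # Term frequency saturates via k1; b controls length normalization.
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    ["bert", "sentence", "embedding"],
    ["bm25", "keyword", "search", "ranking"],
    ["word2vec", "word", "embedding"],
]
scores = bm25_scores(["keyword", "search"], docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

In a QA pipeline, the top-scoring documents from a step like this would then be passed to a slower, more accurate model.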
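doc2vec
- The document version of word2vec
- gensim is the commonly used implementation
Universal Sentence Encoder
- Quite handy
- Reasonably good performance and easy to use
SBERT
- May be painful without a GPU
- In my experience, it performs better than USE above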

natural language processing T5 BERT Sentence Transformers SBERT word2vec fasttext

How to train a Japanese model with Sentence transformer to get a distributed representation of a sentence

Posted on Wed Feb 3 2021 | 3 minutes | 508 words |

BERT is a model that can be powerfully applied to natural language processing tasks.

However, it does not do a good job of capturing sentence-wise features.

Some claim that sentence features appear in the [CLS] token, but [this paper](https://arxiv.org/abs/1908.10084) claims that it does not contain that much useful information for the task.

Sentence BERT is a model that extends BERT to be able to obtain features per sentence.
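One common way to get a per-sentence feature from per-token outputs, and the default pooling in the Sentence-BERT paper, is to average the token vectors while skipping padding. The sketch below shows only that pooling step on invented 2-dimensional vectors; the name `mean_pool` is hypothetical, and no real model is loaded.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, ignoring padded positions."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    n_real = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            n_real += 1
            for i, value in enumerate(vector):
                total[i] += value
    if n_real == 0:
        raise ValueError("attention_mask marks no real tokens")
    return [value / n_real for value in total]


# Toy "token embeddings" for [CLS], two word pieces, and one [PAD] position.
token_vectors = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

cls_vector = token_vectors[0]  # the [CLS]-only alternative
sentence_vector = mean_pool(token_vectors, attention_mask)
print(cls_vector, sentence_vector)
```

The pooled vector reflects every real token, whereas the [CLS]-only choice depends on a single position.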

The following are the steps to create Sentence BERT in Japanese.

[Read More]

technical natural language processing BERT distributed representation technology sentence transformer

A note on how to use the BERT model pretrained on Japanese Wikipedia, now publicly available

Posted on Wed Jun 17 2020 | 1 minutes | 472 words |

huggingface has released a Japanese model for BERT.

The Japanese model is included in transformers.

However, I stumbled over a few things before I could get it to actually work in a Mac environment, so I’ll leave a note.

Preliminaries: Installing mecab

The morphological analysis engine, mecab, is required to use BERT’s Japanese model.

The tokenizer will probably ask for mecab.

This time, we will use Homebrew to install mecab and ipadic.

[Read More]

technical natural language processing bert technology python distributed representation