I tried a document classification problem with fastText


A summary of what I did when applying fastText to a document classification problem.

  • Facebook Research has published fastText, a library for document classification.
  • fastText is easy to install in a Python environment.
  • It runs fast.

Preliminaries

I decided to tackle the task of document classification, and initially tried the following toolkit.

NeuralClassifier: An Open-source Neural Hierarchical Multi-label Text Classification Toolkit

However, it was not very accurate.

My boss then pointed me to the following paper.

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

Features of the Fasttext library

fastText provides a library that solves the document classification problem in an end-to-end manner.

As a result, the document vectors are optimized for the classification task itself.

Training is very fast, taking only a few seconds, which makes it a good starting point.

The accuracy is also reasonable, and [hyperparameter tuning](https://fasttext.cc/docs/en/autotune.html) is available.

Basic usage is simple: specify the parameters you want to fix (e.g., the number of dimensions of the distributed representation) and run the following.

model = fasttext.train_supervised(input='cooking.train', autotuneValidationFile='cooking.valid')

where ‘cooking.train’ and ‘cooking.valid’ are text files in the specified format.
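As a minimal sketch of fixing one parameter while autotune searches the rest, the snippet below collects the arguments in a dictionary before passing them to `train_supervised`. The helper names `autotune_kwargs` and `train_with_autotune`, the value `dim=50`, and the 300-second budget are assumptions chosen for illustration; `autotuneDuration` is the fastText option that sets the search time in seconds.

```python
def autotune_kwargs(train_path, valid_path, dim=50, duration_s=300):
    """Build keyword arguments for fasttext.train_supervised with autotune.

    `dim` is set by hand so that it stays fixed; the remaining
    hyperparameters are left for autotune to search.
    """
    return {
        "input": train_path,
        "autotuneValidationFile": valid_path,
        "autotuneDuration": duration_s,  # search budget in seconds
        "dim": dim,                      # fixed by hand
    }


def train_with_autotune(train_path, valid_path, **overrides):
    # Imported here so the helper above can be used without fastText installed.
    import fasttext

    kwargs = autotune_kwargs(train_path, valid_path, **overrides)
    return fasttext.train_supervised(**kwargs)


if __name__ == "__main__":
    print(autotune_kwargs("cooking.train", "cooking.valid"))
```

Calling `train_with_autotune("cooking.train", "cooking.valid", dim=100)` would then train with a 100-dimensional representation, provided both files exist in the fastText format.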

How to use the fastText library

You can install the fastText library with pip:

```
pip install fasttext
```

The above command is all you need to set up a fastText environment for Python as of June 2020. It is very easy.

## How to use it for document classification problems


We need to create a 'train.txt' file as the training data.

The file is plain text with one example per line: a label prefixed with `__label__`, followed by the tokenized document.

train.txt

__label__1 Love is heavy
__label__2 I love you
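If the labeled documents already live in Python, a file in this format could be produced roughly as follows. This is only a sketch: the helper names `to_fasttext_line` and `write_fasttext_file` are made up for illustration, and it assumes each document is already tokenized into a list of strings that contain no newlines.

```python
def to_fasttext_line(label, tokens):
    """Format one example as '__label__<label> tok1 tok2 ...'."""
    return "__label__{} {}".format(label, " ".join(tokens))


def write_fasttext_file(path, examples):
    """Write (label, tokens) pairs to `path`, one example per line."""
    with open(path, "w", encoding="utf-8") as f:
        for label, tokens in examples:
            f.write(to_fasttext_line(label, tokens) + "\n")


examples = [
    (1, ["Love", "is", "heavy"]),
    (2, ["I", "love", "you"]),
]
lines = [to_fasttext_line(label, tokens) for label, tokens in examples]
print(lines)  # ['__label__1 Love is heavy', '__label__2 I love you']
```

Calling `write_fasttext_file("train.txt", examples)` would write the two lines shown above to disk.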


The test data and any validation data are created in the same format as the training data.
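One simple way to obtain these files is to shuffle the labeled lines and cut them into three parts. The sketch below assumes that approach; the function name `split_examples`, the ratios, and the seed are illustrative choices, not something fastText prescribes.

```python
import random


def split_examples(lines, valid_ratio=0.1, test_ratio=0.1, seed=0):
    """Shuffle lines and split them into train / validation / test lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    n_valid = int(len(shuffled) * valid_ratio)
    n_test = int(len(shuffled) * test_ratio)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test


# Ten dummy lines in the fastText format, alternating between two labels.
lines = ["__label__{} document {}".format(i % 2 + 1, i) for i in range(10)]
train, valid, test = split_examples(lines, valid_ratio=0.2, test_ratio=0.2)
print(len(train), len(valid), len(test))  # 6 2 2
```

Each list can then be written to its own text file, one line per example.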

To train the model, run the following code.

```py
import fasttext

model = fasttext.train_supervised('train.txt')
```

The training time depends on the amount of training data, but a CPU is sufficient. With the data I had at hand (about 1000 examples), training finished in a few seconds.

Predictions from the trained model can be obtained as follows.

```py
model.predict("Do you believe in love?")
```

This returns a tuple containing the predicted labels and an array of their predicted probabilities.
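To turn that return value into a plain class name, the `__label__` prefix has to be stripped. The sketch below shows one way to do it; the helper names are hypothetical, the sample tuple is made-up data with the same shape as the output of `model.predict` for a single string, and it assumes the labels are ordered best-first.

```python
def strip_label_prefix(label, prefix="__label__"):
    """Return the class name without fastText's label prefix."""
    return label[len(prefix):] if label.startswith(prefix) else label


def top_prediction(prediction):
    """Take the (labels, probabilities) pair for one document and
    return the first class name together with its probability."""
    labels, probs = prediction
    return strip_label_prefix(labels[0]), float(probs[0])


# Made-up output shaped like model.predict("...") for one string.
sample = (("__label__2",), [0.87])
print(top_prediction(sample))  # ('2', 0.87)
```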

This solves the document classification problem using fastText.

Evaluation

You can use scikit-learn to compute confusion matrices and compare accuracy.

The following examples are taken from the official scikit-learn documentation.

```py
from sklearn.metrics import confusion_matrix

y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])
# array([[2, 0, 0],
#        [0, 0, 1],
#        [1, 0, 2]])
```

```py
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))
#               precision    recall  f1-score   support
#
#      class 0       0.50      1.00      0.67         1
#      class 1       0.00      0.00      0.00         1
#      class 2       1.00      0.67      0.80         3
#
#     accuracy                           0.60         5
#    macro avg       0.50      0.56      0.49         5
# weighted avg       0.70      0.60      0.61         5
```
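To feed fastText predictions into these scikit-learn functions, the true and predicted labels first need to be collected into two lists. The following is a rough sketch under a few assumptions: each test line carries exactly one label, the helper names `parse_line` and `collect_predictions` are invented for illustration, and `dummy_predict` stands in for `model.predict` so that the example runs without a trained model.

```python
PREFIX = "__label__"


def parse_line(line):
    """Split '__label__X some text' into ('X', 'some text')."""
    label, _, text = line.strip().partition(" ")
    return label[len(PREFIX):], text


def collect_predictions(lines, predict_fn):
    """Return (y_true, y_pred) lists for scikit-learn metrics.

    `predict_fn` takes a text and returns (labels, probabilities),
    the same shape that a fastText model's predict method returns.
    """
    y_true, y_pred = [], []
    for line in lines:
        label, text = parse_line(line)
        labels, _ = predict_fn(text)
        y_true.append(label)
        y_pred.append(labels[0][len(PREFIX):])
    return y_true, y_pred


def dummy_predict(text):
    # Stand-in for model.predict: answers label 2 if "you" appears in the text.
    return (("__label__2",), [1.0]) if "you" in text else (("__label__1",), [1.0])


test_lines = [
    "__label__1 Love is heavy",
    "__label__2 I love you",
    "__label__1 Do you believe in love?",
]
y_true, y_pred = collect_predictions(test_lines, dummy_predict)
print(y_true, y_pred)  # ['1', '2', '1'] ['1', '2', '2']
```

With a trained model, `model.predict` can be passed as `predict_fn`, and the resulting lists handed to `confusion_matrix` or `classification_report`.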

Summary

Using fastText and scikit-learn, we can easily tackle document classification problems in a Python environment.

Since Python is also useful for creating datasets, it is a good option if you want to try document classification for the first time.

For practical applications, please refer to the following link.

[What are some applications of the document classification problem?](https://www.subcul-science.com/post/20200618blog-post_54/)

Learn how distributed representations work

We will learn about distributed representations, which are used to learn the meanings of words. Why not try to understand how distributed representations work by actually running the program?

Once you have an idea of how distributed representations are learned, you will better understand what BERT-style models learn and how they learn it.

For more information, please refer to the following link.

First Introduction to Natural Language Processing with Google Colaboratory and Python

This document will help readers understand how distributed representations work in natural language processing and develop new natural language processing services.

References


See also