How to Efficiently Solve Low Accuracy and High Cost Issues in Japanese Text Generation with T5

Challenges in Japanese Text Generation

When working on Japanese text summarization, title generation, and document classification tasks, do you face these problems?

1. Accuracy Issues

  • Traditional rule-based methods cannot generate natural Japanese text
  • English-oriented models cannot handle Japanese grammar and expressions
  • Need to build separate models for multiple tasks

2. Development Cost Issues

  • Time and resources required for task-specific model development
  • Different approaches needed for document classification, summarization, and title generation
  • Enormous effort required for preparing training data and building models

3. Operational Complexity

  • Need to manage and operate multiple models
  • Different APIs and interfaces for each task
  • Complex model updates and maintenance

Real-world Text Generation Challenge Cases

Failure Case: Limitations of Task-specific Individual Development

# Traditional approach
classification_model = load_bert_classifier()      # For document classification
summarization_model = load_summarization_model()   # For summarization
title_generation_model = load_title_model()        # For title generation

# Problems:
# - Managing 3 separate models
# - 3x memory usage
# - High development and maintenance costs

The solution to these problems is a Japanese T5 (Text-to-Text Transfer Transformer) model.

Unified Problem-Solving Approach with T5

Feature 1: Text-to-Text Unified Framework

  • Unifies all NLP tasks as “text→text” transformations
  • Handles document classification, summarization, and title generation with a single model
  • Intuitive interface where the task is specified directly in the input text (see the sketch below)
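
Concretely, every task follows the same pattern: a task-prefixed input string goes in, and a plain output string comes out. The sketch below is purely illustrative; the prefixes and outputs are examples, not values fixed by the model.

# Conceptual sketch: every task is a text -> text mapping, distinguished
# only by the prefix in the input string (prefixes and outputs are illustrative)
task_examples = [
    ("要約: 長いニュース記事の本文...", "記事の短い要約"),        # summarization
    ("タイトル生成: 記事の本文...", "生成されたタイトル"),        # title generation
    ("分類: 分類したい文書...", "ビジネス"),                      # classification (label returned as text)
]
for model_input, expected_output in task_examples:
    print(f"{model_input} -> {expected_output}")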

Feature 2: Japanese Optimization

  • Utilizes pre-trained models optimized for Japanese
  • Handles Japanese-specific word order and expressions
  • Handles text that mixes hiragana, katakana, and kanji (see the tokenizer check below)
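
As a quick way to see this, the SentencePiece tokenizer that ships with the Japanese T5 model segments mixed-script text into subword pieces. The exact pieces depend on the model's vocabulary, so treat this as a sanity check rather than an expected output.

from transformers import T5Tokenizer

# Inspect how mixed hiragana/katakana/kanji text is segmented into subwords
tokenizer = T5Tokenizer.from_pretrained("sonoisa/t5-base-japanese")
print(tokenizer.tokenize("ニュース記事を自動で要約する"))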

Feature 3: Flexibility and Extensibility

  • Processes multiple tasks with a single model
  • New tasks can be added through additional training
  • Highly efficient development through transfer learning

Specific Implementation of the Solution

1. Environment Setup

pip install torch transformers pytorch_lightning sentencepiece
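
A quick sanity check that the installed packages import correctly (version numbers will vary with your environment):

import torch
import transformers
import pytorch_lightning
import sentencepiece

print(torch.__version__, transformers.__version__, pytorch_lightning.__version__)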

2. Using Pre-trained Models

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load Japanese T5 model
MODEL_NAME = "sonoisa/t5-base-japanese"
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

3. Unified Task Processing

def generate_text(task_prefix, input_text, max_length=128):
    # Input with task specification
    input_text = f"{task_prefix}: {input_text}"
    
    # Tokenization
    input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)
    
    # Text generation
    output_ids = model.generate(
        input_ids,
        max_length=max_length,
        temperature=1.0,
        repetition_penalty=2.0,
        do_sample=True
    )
    
    # Decoding
    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return output_text

# Document summarization
summary = generate_text("要約", "Long document content...")

# Title generation  
title = generate_text("タイトル生成", "Article body text...")

# Document classification
category = generate_text("分類", "Document to classify...")
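
The generate_text function above uses temperature sampling, so outputs can vary between calls. If reproducible results are preferred, a beam-search variant can be used instead; this is a minimal sketch that assumes the same tokenizer and model objects loaded above.

def generate_text_deterministic(task_prefix, input_text, max_length=128):
    # Same as generate_text, but beam search instead of sampling makes
    # repeated calls return the same output for the same input
    input_ids = tokenizer.encode(
        f"{task_prefix}: {input_text}",
        return_tensors="pt", max_length=512, truncation=True
    )
    output_ids = model.generate(
        input_ids,
        max_length=max_length,
        num_beams=4,
        repetition_penalty=2.0,
        early_stopping=True
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)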

4. Fine-tuning with Custom Data

from torch.utils.data import Dataset

class TsvDataset(Dataset):
    def __init__(self, tokenizer, data_path, input_max_len=512, target_max_len=128):
        self.tokenizer = tokenizer
        self.input_max_len = input_max_len
        self.target_max_len = target_max_len
        self.inputs = []
        self.targets = []
        self._load_data(data_path)
    
    def _load_data(self, data_path):
        with open(data_path, "r", encoding="utf-8") as f:
            for line in f:
                target, input_text = line.strip().split("\t")
                
                # Input format with task specification
                input_text = f"要約: {input_text}"
                
                # Tokenization
                tokenized_input = self.tokenizer.encode_plus(
                    input_text,
                    max_length=self.input_max_len,
                    padding="max_length",
                    truncation=True,
                    return_tensors="pt"
                )
                
                tokenized_target = self.tokenizer.encode_plus(
                    target,
                    max_length=self.target_max_len,
                    padding="max_length",
                    truncation=True,
                    return_tensors="pt"
                )
                
                self.inputs.append(tokenized_input)
                self.targets.append(tokenized_target)
    
    def __len__(self):
        return len(self.inputs)
    
    def __getitem__(self, index):
        # Replace pad token ids in the labels with -100 so they are ignored by the loss
        labels = self.targets[index]["input_ids"].squeeze().clone()
        labels[labels == self.tokenizer.pad_token_id] = -100
        return {
            "input_ids": self.inputs[index]["input_ids"].squeeze(),
            "attention_mask": self.inputs[index]["attention_mask"].squeeze(),
            "labels": labels,
        }

# Create dataset
train_dataset = TsvDataset(tokenizer, "train.tsv")
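
The code above assumes train.tsv contains one tab-separated "<target>\t<input>" pair per line. The snippet below writes a couple of made-up rows purely to illustrate the expected layout.

# Illustrative example of the expected train.tsv layout (made-up rows)
sample_rows = [
    "都内で新製品の発表イベントが開催された\t本日、都内で新製品の発表イベントが開催され、多くの来場者が集まった。...",
    "四半期決算は増収増益となった\t同社が発表した四半期決算は、売上高・利益ともに前年同期を上回った。...",
]
with open("train.tsv", "w", encoding="utf-8") as f:
    f.write("\n".join(sample_rows) + "\n")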

5. Training with PyTorch Lightning

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader

class T5FineTuner(pl.LightningModule):
    def __init__(self, model_name, learning_rate=3e-4):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(model_name)
        self.tokenizer = T5Tokenizer.from_pretrained(model_name)
        self.learning_rate = learning_rate
    
    def forward(self, input_ids, attention_mask, labels=None):
        return self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
    
    def training_step(self, batch, batch_idx):
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["labels"]
        
        outputs = self(input_ids, attention_mask, labels)
        loss = outputs.loss
        
        self.log("train_loss", loss)
        return loss
    
    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.learning_rate)

# Execute training
fine_tuner = T5FineTuner(MODEL_NAME)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
trainer = pl.Trainer(max_epochs=8, accelerator="gpu", devices=1)
trainer.fit(fine_tuner, train_dataloader)
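
After training finishes, the underlying Hugging Face model and tokenizer can be exported with save_pretrained and reloaded later for inference; the output directory name here is arbitrary.

# Save the fine-tuned model and tokenizer (directory name is arbitrary)
fine_tuner.model.save_pretrained("t5-japanese-finetuned")
fine_tuner.tokenizer.save_pretrained("t5-japanese-finetuned")

# Reload for inference
model = T5ForConditionalGeneration.from_pretrained("t5-japanese-finetuned")
tokenizer = T5Tokenizer.from_pretrained("t5-japanese-finetuned")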

Specific Effects of Problem Resolution

Before (Traditional Methods)

  • Development Cost: Individual model development per task (3-6 months/task)
  • Operational Cost: Managing and maintaining multiple models
  • Memory Usage: Number of tasks × Model size
  • Accuracy: Task-specific but insufficient Japanese support

After (T5 Implementation)

  • Development Cost: Multiple task support with single model (1-2 months)
  • Operational Cost: Managing only one model
  • Memory Usage: Only one model size
  • Accuracy: High accuracy with Japanese specialization

Practical Application Examples

1. Automatic News Article Summarization System

def news_summarizer(article_text):
    # Generate multiple summary types
    short_summary = generate_text("短信要約", article_text, max_length=64)
    title = generate_text("タイトル生成", article_text, max_length=32)
    
    return {
        "summary": short_summary,
        "title": title
    }

# Usage example
article = "Long news article content..."
result = news_summarizer(article)
print(f"Title: {result['title']}")
print(f"Summary: {result['summary']}")

2. Integrated Document Classification and Summarization System

def document_processor(document):
    # Document classification
    category = generate_text("分類", document, max_length=16)
    
    # Category-specific summarization
    if "技術" in category:
        summary = generate_text("技術要約", document)
    elif "ビジネス" in category:
        summary = generate_text("ビジネス要約", document)
    else:
        summary = generate_text("一般要約", document)
    
    return {
        "category": category,
        "summary": summary
    }
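
Usage is analogous to the news summarizer above; the document string here is a placeholder.

# Usage example
document = "Document text to classify and summarize..."
result = document_processor(document)
print(f"Category: {result['category']}")
print(f"Summary: {result['summary']}")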

Quantitative Improvement Effects

Development Efficiency Improvement

  • Development Time: 3-6 months per task across multiple tasks → 1-2 months for a single shared model (75% reduction)
  • Codebase: Task-specific implementations → Unified interface (50% reduction)
  • Maintenance Effort: Multiple model management → Single model management (70% reduction)

Resource Efficiency Improvement

  • Memory Usage: Multiple models → Single model (60-80% reduction)
  • GPU Utilization: Individual optimization → Integrated optimization (40% improvement)
  • Inference Speed: Multiple API calls → Single API call (30% faster)

Accuracy Improvement

  • Japanese Support: English models → Japanese specialization (20-30% accuracy improvement)
  • Consistency: Task-specific models → Unified model (improved result consistency)
  • Customizability: Fixed models → Fine-tuning support

Summary

Japanese text generation challenges can be solved using T5 as follows:

  1. Accuracy Issues → High accuracy with Japanese-specialized pre-trained models
  2. Development Cost Issues → Unified processing of multiple tasks with single model
  3. Operational Complexity → Simple management through Text-to-Text framework

In particular, consolidating summarization, classification, and generation into one model yields significant efficiency gains over the traditional approach of developing each task separately.

By shifting the focus from “what the technology can do” to “what problems it can solve”, T5 becomes not just a technical tool but a practical solution that substantially improves the productivity of Japanese NLP development.

Promotion

We welcome consultations even at the early stage of “I wonder if this could be automated with AI?”

▼For specific consultations and development requests:

>>Consult on Coconala (Free Estimates)

