
Yoji

24 July 2021

ryancahildebrandt/yoji

Wisdom in 4 Characters Or Less

Training a neural network to generate 四字熟語 (as best it can!)

This project contains 0% LLM-generated content

Purpose

A project to generate 四字熟語 (yoji-jukugo, four-character Japanese idioms) using a sequential TensorFlow model.
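
As a rough illustration of how idioms could be prepared for a character-level sequential model, the sketch below builds a kanji vocabulary and next-character training pairs. The helper names (`build_vocab`, `next_char_pairs`) and the three sample idioms are assumptions made for this example; they are not taken from the project's code, and the project's actual preprocessing may differ.

```python
def build_vocab(idioms):
    """Map each distinct kanji to an integer id, in sorted order."""
    chars = sorted({ch for idiom in idioms for ch in idiom})
    return {ch: i for i, ch in enumerate(chars)}


def next_char_pairs(idiom, vocab):
    """Return (prefix_ids, next_id) pairs for a single idiom.

    Each pair asks the model to predict the next kanji from the
    kanji that precede it.
    """
    ids = [vocab[ch] for ch in idiom]
    return [(ids[:i], ids[i]) for i in range(1, len(ids))]


# Hypothetical sample data: three common yoji-jukugo.
sample_idioms = ["一石二鳥", "十人十色", "自業自得"]
vocab = build_vocab(sample_idioms)
pairs = next_char_pairs(sample_idioms[0], vocab)
```

A four-character idiom yields three such pairs, so even a modest list of idioms gives a sequential model a usable number of training examples.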

Dataset
#

The dataset for this project was scraped or pulled from the following sources:

Yojijukugo for idioms and meanings/readings
Jamdict for kanji readings, meanings, and other information
Kanji Database for kanji classification, grade level, and misc characteristics

Outputs

The main report, compiled with Datapane and also available in HTML format
The full yoji_df dataframe describing the idioms, their constituent kanji, and all additional characteristics from the data linked above
A list of generated idioms, without definitions or readings
The same list, expanded out to a dataframe including readings and meanings of constituent characters and bigrams

Update!

After I shared the initial project with some coworkers, @DC and @JZ suggested that I retrain the model on the bigrams within each idiom, since this more closely matches how yoji-jukugo are semantically divided and understood. I’ve updated the report linked above with some additional thoughts on the new model and its results!
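
To make the bigram idea concrete, here is a minimal sketch of splitting each four-character idiom into its two halves and collecting the resulting bigram vocabulary. The function name `split_bigrams` and the sample idioms are assumptions for illustration only; the retrained model's actual data pipeline is described in the report rather than reproduced here.

```python
def split_bigrams(idiom):
    """Split a four-character idiom into its first and second bigram."""
    if len(idiom) != 4:
        raise ValueError(f"expected 4 characters, got {len(idiom)}: {idiom!r}")
    return idiom[:2], idiom[2:]


# Hypothetical sample data: three common yoji-jukugo.
sample_idioms = ["一石二鳥", "十人十色", "自業自得"]

# Each idiom becomes a (first_bigram, second_bigram) pair.
bigram_pairs = [split_bigrams(idiom) for idiom in sample_idioms]

# The set of distinct bigrams is the vocabulary a bigram-level model
# would draw from, instead of individual kanji.
bigram_vocab = sorted({bigram for pair in bigram_pairs for bigram in pair})
```

Under this framing, generating an idiom means choosing two bigrams rather than four independent kanji, which is one way to keep each half a plausible two-character unit.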

Author

Ryan Hildebrandt

Data Scientist, etc.