
Aozora Annotator




Aozora Bunko Text Annotator


This project contains 0% LLM-generated content

Purpose

This project aims to cut down on the many trips back and forth between your favorite Japanese dictionary and the text you’re reading. The included script parses a text from the Aozora Bunko literature corpus and looks up helpful information for the terms in that text via the Jotoba API.
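To make the lookup step more concrete, here is a minimal Ruby sketch of querying Jotoba for a single term. It is not taken from azb.rb. The endpoint URL, the request fields (`query`, `language`, `no_english`), and the response fields (`words`, `reading`, `senses`, `glosses`) are assumptions based on the public Jotoba API and should be checked against its documentation. The helper names are hypothetical.

```ruby
require "json"
require "net/http"
require "uri"

# Assumed endpoint; verify against the Jotoba API documentation.
JOTOBA_WORDS_URL = "https://jotoba.de/api/search/words"

# Build the JSON request body for a single term (field names are assumptions).
def build_word_query(term, language: "English")
  JSON.generate({ query: term, language: language, no_english: false })
end

# Reduce a parsed response to the reading and glosses of the first hit.
# Returns nil when the response contains no words.
def summarize_first_word(response)
  word = response.fetch("words", []).first
  return nil if word.nil?

  reading = word.fetch("reading", {})
  glosses = word.fetch("senses", []).flat_map { |sense| sense.fetch("glosses", []) }
  { kanji: reading["kanji"], kana: reading["kana"], glosses: glosses }
end

# Send the query over HTTPS and return the summarized first result.
def lookup_term(term)
  uri = URI(JOTOBA_WORDS_URL)
  reply = Net::HTTP.post(uri, build_word_query(term), "Content-Type" => "application/json")
  summarize_first_word(JSON.parse(reply.body))
end
```

Calling `lookup_term("物語")` would perform one network request. Keeping the parsing in `summarize_first_word` separate from the HTTP call makes it possible to check the parsing against a saved response without touching the network.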


Usage

Once you’ve cloned this repo and installed the necessary Ruby gems (listed in the Gemfile), make sure you have a copy of the Aozora Bunko database file in the ./data directory. Then you can start annotating!

The easiest way to use this tool is via the command line. From the repo directory:

# Show all CLI options and arguments
ruby azb.rb -h

# Search the database and return all texts whose metadata contains "源氏物語"
ruby azb.rb -s 源氏物語

# Pull information for text 165444, perform lookups, generate annotations,
# and render HTML and plaintext documents to the outputs directory
ruby azb.rb -i 165444

# Run the full pipeline as described above, this time with options
ruby azb.rb -i 165444 -c -k -f 225%

The API lookups aren’t always perfect, so if you plan to use the output as a teaching aid or as instructional material, you can fine-tune the results by editing the JSON file produced by the initial lookup.
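Hand-editing the lookup file works, but corrections can also be scripted. The sketch below assumes a hypothetical layout in which the JSON file maps each term to an object with `reading` and `meaning` keys. The file the tool actually writes may be structured differently, so adapt the keys after inspecting it. Both function names are hypothetical.

```ruby
require "json"

# Return a copy of the lookup table with the given fields replaced for one term.
# The term => { "reading", "meaning" } layout is hypothetical.
def override_lookup(lookups, term, fields)
  lookups.merge(term => lookups.fetch(term, {}).merge(fields))
end

# Load a lookup file, apply one correction, and write it back in place.
def fix_lookup_file(path, term, fields)
  lookups = JSON.parse(File.read(path, encoding: "UTF-8"))
  File.write(path, JSON.pretty_generate(override_lookup(lookups, term, fields)))
end
```

For example, `fix_lookup_file(path, "生", { "reading" => "なま" })` would replace only the reading for 生 and leave its other fields untouched, where `path` points at the lookup file for the text you are annotating.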


Dataset
#

The dataset used for the current project was pulled from the following:

  • Aozora Bunko Corpus for Japanese full texts
  • Jotoba and the Jotoba API for looking up terms. Jotoba brings together information from a range of free sources, including JMdict, Tofugu, and Tatoeba, and publishes the full list of its sources.

Outputs

  • Annotation format breakdown

    • Alternating
      • One term with its annotations immediately between it and the next term
      • term (annotation) term (annotation)
    • Layered
      • One sentence with all its annotations on the following line
      • sentence
      • (sentence annotations)
      • sentence
      • (sentence annotations)
    • Parallel
      • Full text with readings rendered above and meanings below, similar to the commonly used furigana annotation style
      • (sentence readings)
      • sentence
      • (sentence meanings)
    • Side by side
      • One sentence with all its annotations displayed on the right of the page
      • sentence || (sentence annotations)
      • sentence || (sentence annotations)
  • Example outputs, generated from 三十三の死 by しづ素木:
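As a rough illustration of the four annotation formats listed above, here is a small Ruby sketch that lays out the same two-token phrase in each style. It is neither the project's rendering code nor output generated from 三十三の死. The `Token` struct, its fields, and the "(reading: meaning)" annotation text are hypothetical choices made for the example.

```ruby
# One annotated token; the field names are hypothetical.
Token = Struct.new(:term, :reading, :meaning)

# Format a single token's annotation as "(reading: meaning)".
def note(token)
  "(#{token.reading}: #{token.meaning})"
end

# Alternating: each term is followed directly by its annotation.
def alternating(sentence)
  sentence.map { |t| "#{t.term} #{note(t)}" }.join(" ")
end

# Layered: the sentence, then all of its annotations on the next line.
def layered(sentence)
  [sentence.map(&:term).join, sentence.map { |t| note(t) }.join(" ")].join("\n")
end

# Parallel: readings above the sentence, meanings below it.
def parallel(sentence)
  [
    "(#{sentence.map(&:reading).join(' ')})",
    sentence.map(&:term).join,
    "(#{sentence.map(&:meaning).join('; ')})"
  ].join("\n")
end

# Side by side: the sentence on the left, its annotations on the right.
def side_by_side(sentence)
  "#{sentence.map(&:term).join} || #{sentence.map { |t| note(t) }.join(' ')}"
end

sentence = [Token.new("赤い", "あかい", "red"), Token.new("花", "はな", "flower")]
puts alternating(sentence)
puts layered(sentence)
puts parallel(sentence)
puts side_by_side(sentence)
```

Because each format is a pure function of the same token list, switching layouts only changes the final rendering step and leaves the lookups alone.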

Author
Ryan Hildebrandt
Data Scientist, etc.