
Particles


Report · Python · Japanese · NLP


ryancahildebrandt/particles


The Numbers on は, が, & Co.



Obligatory Word Cloud



This project contains 0% LLM-generated content

Purpose

A project to look at the contexts in which different Japanese particles appear most frequently, and to see how these contexts compare with the conventional wisdom and rules about how the particles are used.


Introduction

Particles are one of the trickiest things for Japanese learners to pick up, and this project approaches the question of when and where to use some of the more common particles by looking at a little data! I took a couple of corpora of Japanese text, annotated them with linguistic features, and narrowed the dataset to the particles and the words they’re related to in their respective sentences. From there, I compiled the dependency relation and part of speech for each token as well as its syntactic head, and compared particles that commonly get mixed up by Japanese learners. Alongside each comparison, I gathered some common rules of thumb used to help people distinguish which particles are appropriate in which contexts, for reference.
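As a rough illustration of that extraction step, the sketch below pulls particle tokens out of Ginza-annotated text and records their dependency relation, part of speech, and syntactic head. The particle list, the UD tag filter, and the column names here are illustrative assumptions, not the project’s actual schema.

```python
# Hedged sketch of the extraction described above, assuming the "ja_ginza"
# model and the UD-style labels it emits; the particle list and column names
# are illustrative, not the project's actual schema.
import spacy
import pandas as pd

nlp = spacy.load("ja_ginza")

PARTICLES = {"は", "が", "を", "に", "で", "へ", "と", "も", "から", "まで"}

def particle_rows(texts):
    """Yield one row per particle token with its dependency, POS, and head info."""
    for doc in nlp.pipe(texts):
        for tok in doc:
            # Case/topic particles typically surface as ADP in UD Japanese;
            # sentence-final and conjunctive particles tend to be PART/SCONJ.
            if tok.text in PARTICLES and tok.pos_ in {"ADP", "PART", "SCONJ"}:
                yield {
                    "particle": tok.text,
                    "dep": tok.dep_,        # dependency relation, e.g. "case"
                    "pos": tok.pos_,
                    "head": tok.head.text,  # the token the particle attaches to
                    "head_pos": tok.head.pos_,
                    "head_dep": tok.head.dep_,
                }

df = pd.DataFrame(particle_rows(["私は猫が好きです。", "学校へ行きます。"]))
print(df)
```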


Dataset

The corpora used for the current project can be found here, here, and here. They’ve been processed via the Ginza library, which is based on SudachiPy and spaCy. These corpora represent a mix of transcribed speech, translated example sentences, and blog articles.
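For reference, here is a minimal sketch of what that processing pass might look like with Ginza; the file name is a placeholder, and the batch size is an arbitrary choice rather than anything the project specifies.

```python
# Minimal annotation pass with Ginza (loaded through spaCy); "corpus.txt" is a
# placeholder for whichever corpus file is being processed.
import spacy

nlp = spacy.load("ja_ginza")  # Ginza wraps SudachiPy tokenization in a spaCy pipeline

with open("corpus.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

for doc in nlp.pipe(sentences, batch_size=100):
    for tok in doc:
        # pos_/dep_ are coarse UD labels; tag_ carries Ginza's fine-grained
        # Japanese POS (e.g. 助詞-係助詞 for は).
        print(tok.text, tok.pos_, tok.tag_, tok.dep_, tok.head.text)
```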


Outputs

  • The main report, compiled with datapane and also available in HTML format
  • The png for the wordcloud used at the top of the page
  • Interactive sankey plot for the particles and their attributes (see the plotting sketch after this list)
  • Another sankey, this time for the syntactic heads
  • A comparison of the top 10 most commonly used particles
  • Another comparison chart, this time for syntactic heads
  • The notebook for the NLP analyses (NOTE: this takes a very long time to run; I’d avoid it if possible, as the remainder of the code runs just fine without having to run this every time)
  • The notebook for the analyses and viz generated after the NLP
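As a hedged illustration of how the particle-to-attribute sankey could be assembled, the snippet below uses plotly, which datapane can embed; the toy counts, column names, and the choice of plotly itself are assumptions rather than the project’s actual code.

```python
# Hedged sketch of a particle-to-attribute sankey built with plotly; the toy
# data and the use of plotly are assumptions, not the project's actual stack.
import pandas as pd
import plotly.graph_objects as go

# Assumes a dataframe like the one from the extraction sketch above,
# with "particle" and "head_pos" columns.
df = pd.DataFrame({
    "particle": ["は", "は", "が", "が", "を"],
    "head_pos": ["VERB", "ADJ", "VERB", "ADJ", "VERB"],
})

counts = df.groupby(["particle", "head_pos"]).size().reset_index(name="n")
labels = list(counts["particle"].unique()) + list(counts["head_pos"].unique())
index = {label: i for i, label in enumerate(labels)}

fig = go.Figure(go.Sankey(
    node=dict(label=labels),
    link=dict(
        source=[index[p] for p in counts["particle"]],  # left nodes: particles
        target=[index[h] for h in counts["head_pos"]],  # right nodes: head POS
        value=counts["n"].tolist(),                     # link width = frequency
    ),
))
fig.write_html("particle_sankey.html")
```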

Ryan Hildebrandt
Data Scientist, etc.