DEV Community

bebechien for Google AI

Posted on • Originally published at bebechien.github.io

Turning Gemma 4 into an Old Korean Translator

Bridging the gap for modern native speakers

There's something uniquely beautiful about old books. The smell of weathered paper, the texture of the pages, and the stories that have survived generations. But if you've ever tried opening a piece of Classical Korean literature, like the Joseon Dynasty novel HongGildongJeon (홍길동전), you'll quickly realize that time leaves its own mark on language.

Between the lack of word spacing and obsolete letters like the dot vowel Arae-a (ㆍ) or the soft Yeorin-hieut (ㆆ), reading it feels less like browsing a novel and more like solving a beautiful, ancient puzzle. Even for native speakers, the linguistic gap is massive.

That's why I decided to create this tutorial: a digital bridge between the past and the present. Using Gemma 4 E2B (IT), I set out to build a humble translator that turns Classical Korean into smooth, modern Korean.

The Recipe for Training

To keep things manageable, I ran this on a single NVIDIA T4 GPU (16GB) using Google Colab.

1. Setting Up the Kitchen

First, we pull in our favorite open-source tools: Hugging Face's transformers, trl for the training loop, and peft so we can use LoRA (Low-Rank Adaptation) to fine-tune our model without needing a massive server cluster.

2. Gathering the Ingredients

For our data, I used a public domain version of HongGildongJeon, paired with a beautiful modern translation by 직지프로 (licensed under Creative Commons).

To make Gemma feel at home, I structured the data into a conversation, guiding the model with a clear system prompt:

[
  {"role": "system", "content": "Translate Classical Korean into Modern Korean."},
  {"role": "user", "content": "됴션국셰둉ᄃᆡ왕즉위십오연의홍희문밧긔ᄒᆞᆫᄌᆡ상이잇스되"},
  {"role": "assistant", "content": "조선국 세종대왕 즉위 십오년에 홍회문 밖에 한 재상이 있으되,"}
]

(Translation note: This line introduces us to a prime minister living just outside the Honghoemun Gate during the 15th year of King Sejong's reign!)
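
Here's a minimal sketch of how aligned sentence pairs could be wrapped into that three-turn format. The helper names (`to_chat_example`, `build_dataset`) are hypothetical, and the `"messages"` key is an assumption based on the conversational dataset layout that trl's SFT tooling commonly accepts; the original notebook may organize this differently.

```python
import json

SYSTEM_PROMPT = "Translate Classical Korean into Modern Korean."


def to_chat_example(classical: str, modern: str) -> list[dict[str, str]]:
    """Wrap one (classical, modern) pair in the system/user/assistant format."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": classical.strip()},
        {"role": "assistant", "content": modern.strip()},
    ]


def build_dataset(pairs):
    """Convert aligned pairs into chat examples, skipping pairs with an empty side."""
    return [
        {"messages": to_chat_example(src, tgt)}
        for src, tgt in pairs
        if src.strip() and tgt.strip()
    ]


if __name__ == "__main__":
    # One aligned pair taken from the example above.
    pairs = [
        (
            "됴션국셰둉ᄃᆡ왕즉위십오연의홍희문밧긔ᄒᆞᆫᄌᆡ상이잇스되",
            "조선국 세종대왕 즉위 십오년에 홍회문 밖에 한 재상이 있으되,",
        )
    ]
    print(json.dumps(build_dataset(pairs), ensure_ascii=False, indent=2))
```

Dropping pairs with an empty side is a small safeguard: a misaligned line would otherwise teach the model to answer with nothing.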

The "Before" Picture

Before giving Gemma any specific training, I ran a quick baseline test. Base models are smart, but archaic grammar is a highly specific domain. Without tuning, Gemma tried its best but ended up giving long, overly literal explanations:

  • Original Classical Text: ᄇᆡᆨ씨듯고ᄂᆡ심의탄복왈그근본을ᄀᆞᆷ초지아니ᄒᆞ니장부로다ᄒᆞ고ᄌᆡ삼위로ᄒᆞ더라
  • Human Translation: 백씨 듣고 내심에 탄복 왈, "그 근본을 감추지 아니하니 장부로다!" 하고, 재삼 위로하더라.
  • Gemma's Initial Guess: "Like the color, the heart's praise said, 'The foundation cannot be deeply felt...'"
  • Initial Similarity Score: 4.85% 💔

(Translation note: This line actually means - Upon hearing this, Mr. Baek was deeply impressed and said, "He does not hide his true nature; he is a true man!" and comforted him again and again.)

The base model was clearly lost in time. It needed a map.
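
The post doesn't spell out how the similarity score is computed, so treat the following as one plausible way to get a character-level score, not the exact metric used here. It relies on `difflib.SequenceMatcher` from the Python standard library; the function name `similarity_percent` is made up for illustration.

```python
from difflib import SequenceMatcher


def similarity_percent(reference: str, candidate: str) -> float:
    """Character-level similarity of two strings as a percentage in [0, 100]."""
    # autojunk=False turns off the popularity heuristic that can skew
    # results on longer strings with many repeated characters.
    matcher = SequenceMatcher(None, reference, candidate, autojunk=False)
    return round(matcher.ratio() * 100, 2)


if __name__ == "__main__":
    print(similarity_percent("abcd", "abcd"))  # identical strings
    print(similarity_percent("abcd", "abce"))  # one character differs
    print(similarity_percent("abc", "xyz"))    # nothing in common
```

A score like this rewards matching characters in order, so it is sensitive to spacing and punctuation. That makes it a useful quick signal, but it should be paired with human judgment.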

Teaching Gemma with Care

To train the model efficiently, I used a Parameter-Efficient Fine-Tuning (PEFT) setup with LoRA.

from peft import LoraConfig

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=16,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
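
To get a feel for what `r=16` and `lora_alpha=16` mean, here's a small back-of-the-envelope sketch. For one linear layer, LoRA trains two low-rank factors instead of the full weight matrix, and with PEFT's default (non-rsLoRA) setting the update is scaled by `lora_alpha / r`. The 2048 x 2048 layer shape below is purely hypothetical and is not taken from Gemma's actual architecture.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank factors: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update B @ A before it is added to W."""
    return lora_alpha / r


if __name__ == "__main__":
    d_in, d_out = 2048, 2048   # hypothetical layer shape, for illustration only
    r, lora_alpha = 16, 16     # values from the LoraConfig above

    full = d_in * d_out
    added = lora_trainable_params(d_in, d_out, r)
    print(f"full layer params: {full}")
    print(f"LoRA params:       {added} ({100 * added / full:.4f}% of the layer)")
    print(f"update scaling:    {lora_scaling(lora_alpha, r)}")
```

With these numbers the adapter trains well under 2% of that layer's weights, which is why the whole job fits on a single T4.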

The Secret Sauce: collate_fn

When fine-tuning a chat model to behave like a specific tool, you don't want it to waste energy learning how to re-write your prompt. By using a custom data collator, I masked the system and user inputs (setting their labels to -100), forcing Gemma's loss calculation to focus strictly on generating the correct modern assistant response.
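
The masking idea can be sketched in a few lines. This is a simplified illustration on plain Python lists, not the actual collate_fn from the tutorial: `mask_prompt_labels` and `response_start` are hypothetical names, and a real collator would also tokenize, pad, and return tensors.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def mask_prompt_labels(input_ids, response_start, pad_token_id=None):
    """Copy input_ids into labels, hiding the prompt tokens and any padding.

    response_start is the index of the first assistant-response token;
    every position before it gets IGNORE_INDEX so it adds nothing to the loss.
    """
    labels = list(input_ids)
    for i, token in enumerate(input_ids):
        is_prompt = i < response_start
        is_padding = pad_token_id is not None and token == pad_token_id
        if is_prompt or is_padding:
            labels[i] = IGNORE_INDEX
    return labels


if __name__ == "__main__":
    # Toy token ids: four prompt tokens, three response tokens, two pads (id 0).
    input_ids = [2, 10, 11, 12, 20, 21, 1, 0, 0]
    print(mask_prompt_labels(input_ids, response_start=4, pad_token_id=0))
    # [-100, -100, -100, -100, 20, 21, 1, -100, -100]
```

Only the three response tokens keep their labels, so the gradient comes entirely from the modern Korean output.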

After setting our hyper-parameters to gently cruise through 5 epochs with a learning rate of 2e-5, I hit train.

The Warm "After" Glow

After a bit of patience and letting the trainer do its magic, the results were incredibly rewarding. The character-by-character similarity score jumped all the way up to a brilliant 79.93%!

Look at how it handles the text now:

  • Original Classical Text: ᄇᆡᆨ씨듯고ᄂᆡ심의탄복왈그근본을ᄀᆞᆷ초지아니ᄒᆞ니장부로다ᄒᆞ고ᄌᆡ삼위로ᄒᆞ더라
  • Human Translation: 백씨 듣고 내심에 탄복 왈, "그 근본을 감추지 아니하니 장부로다!" 하고, 재삼 위로하더라.
  • Gemma's Fine-Tuned Translation: 백씨듯 고내심에 탄복 왈, "그 근본을 감초지 아니하니 장부로다." 하고 제삼 위로 하더라.
  • New Similarity Score: 85.71% ✨

Closing Thoughts

Technology often pushes us relentlessly into the future, but my favorite tech projects are the ones that allow us to look backward with greater clarity. By spending a little time fine-tuning a lightweight model like Gemma 4, we can build tools that preserve cultural history, making ancient wisdom and classic stories accessible to anyone with a laptop.

Next time you find a piece of history that feels just a bit too out of reach, remember that a small dataset and a fine-tuning session might be all you need to bring it into the light.

Here's a structured workflow to follow when fine-tuning for your own domain:

  1. Define a clear goal
  2. Prepare a high-quality dataset and evaluation plan
  3. Verify the model is learning
  4. Evaluate with metrics and human judgment
  5. Deploy and iterate

👉 Check out this tutorial in Gemma Cookbook
👉 Star the repository to support us

Top comments (2)

Sornodeep-99

It feels amazing what can be achieved simply by trying these days. Amazing work.

Rondo • Edited

Wow, I never expected to see my native language here! (홍길동전 is surely one of the most well-known classic novels in Korea.)
It's really incredible how well Gemma can translate classic Korean into modern Korean. The gap between before and after fine-tuning is unbelievable.
Maybe I can try translating classic English into modern English with Gemma as well, the same way this article does. (For example: Thou art -> You are -> 당신은/너는)