There's something uniquely beautiful about old books. The smell of weathered paper, the texture of the pages, and the stories that have survived generations. But if you've ever tried opening a piece of Classical Korean literature, like the Joseon Dynasty novel HongGildongJeon (홍길동전), you'll quickly realize that time leaves its own mark on language.
Between the lack of word spacing and obsolete letters like the dot vowel Arae-a (ㆍ) or the soft Yeorin-hieut (ㆆ), reading it feels less like browsing a novel and more like solving a beautiful, ancient puzzle. Even for native speakers, the linguistic gap is massive.
That's why I decided to create this tutorial: a digital bridge between the past and the present. Using Gemma 4 E2B (IT), I set out to build a humble translator that turns Classical Korean into smooth, modern Korean.
The Recipe for Training
To keep things manageable, I ran this on a single NVIDIA T4 GPU (16GB) using Google Colab.
1. Setting Up the Kitchen
First, we pull in our favorite open-source tools: Hugging Face's transformers, trl for the training loop, and peft so we can use LoRA (Low-Rank Adaptation) to fine-tune our model without needing a massive server cluster.
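Before cooking, it can help to confirm the three libraries are actually importable in your Colab session. The snippet below is a small convenience sketch of my own, not part of the original notebook; the helper name `missing_packages` is hypothetical and it only uses the standard library.

```python
from importlib.util import find_spec

# The three libraries named above.
REQUIRED = ["transformers", "trl", "peft"]


def missing_packages(names):
    """Return the package names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        # Suggest an install command rather than installing automatically.
        print("Install first: pip install " + " ".join(missing))
    else:
        print("All three libraries are available.")
```

If anything is reported missing, a regular `pip install` in a notebook cell should be enough.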
2. Gathering the Ingredients
For our data, I used a public domain version of HongGildongJeon, paired with a beautiful modern translation by ์ง์งํ๋ก (licensed under Creative Commons).
To make Gemma feel at home, I structured the data into a conversation, guiding the model with a clear system prompt:
[
{"role": "system", "content": "Translate Classical Korean into Modern Korean."},
{"role": "user", "content": "๋ด์
๊ตญ์
ฐ๋แแก์์ฆ์์ญ์ค์ฐ์ํํฌ๋ฌธ๋ฐง๊ธแแแซแแก์์ด์์ค๋"},
{"role": "assistant", "content": "์กฐ์ ๊ตญ ์ธ์ข
๋์ ์ฆ์ ์ญ์ค๋
์ ํํ๋ฌธ ๋ฐ์ ํ ์ฌ์์ด ์์ผ๋,"}
]
(Translation note: This line introduces us to a prime minister living just outside the Honghoemun Gate during the 15th year of King Sejong's reign!)
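To show how sentence pairs could be turned into records shaped like the one above, here is a minimal sketch. The helper names (`to_conversation`, `build_dataset`) and the `"messages"` key are my assumptions for illustration, and the strings in the usage example are placeholders rather than real lines from the novel.

```python
SYSTEM_PROMPT = "Translate Classical Korean into Modern Korean."


def to_conversation(classical, modern):
    """Wrap one (classical, modern) pair in the three-turn chat layout."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": classical.strip()},
        {"role": "assistant", "content": modern.strip()},
    ]


def build_dataset(pairs):
    """Convert aligned sentence pairs into chat records, skipping empty rows."""
    return [
        {"messages": to_conversation(classical, modern)}
        for classical, modern in pairs
        if classical.strip() and modern.strip()
    ]


if __name__ == "__main__":
    # Placeholder text only; substitute your own aligned sentences.
    demo = build_dataset([("<classical sentence>", "<modern sentence>")])
    print(demo[0]["messages"][1]["role"])
```

Keeping the system prompt identical across every record means the model sees one consistent instruction during training and at inference time.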
The "Before" Picture
Before giving Gemma any specific training, I ran a quick baseline test. Off-the-shelf models are smart, but archaic grammar is a highly specific domain. Without tuning, Gemma tried its best but produced long, overly literal explanations:
- Original Classical Text: แแกแจ์จ๋ฏ๊ณ แแก์ฌ์ํ๋ณต์๊ทธ๊ทผ๋ณธ์แแแท์ด์ง์๋แแ๋์ฅ๋ถ๋ก๋คแแ๊ณ แแก์ผ์๋กแแ๋๋ผ
- Human Translation: ๋ฐฑ์จ ๋ฃ๊ณ ๋ด์ฌ์ ํ๋ณต ์, "๊ทธ ๊ทผ๋ณธ์ ๊ฐ์ถ์ง ์๋ํ๋ ์ฅ๋ถ๋ก๋ค!" ํ๊ณ , ์ฌ์ผ ์๋กํ๋๋ผ.
- Gemma's Initial Guess: "Like the color, the heart's praise said, 'The foundation cannot be deeply felt...'"
- Initial Similarity Score: 4.85%
(Translation note: This line actually means: upon hearing this, Mr. Baek was deeply impressed and said, "He does not hide his true nature; he is a true man!" and comforted him again and again.)
The base model was clearly lost in time. It needed a map.
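The article does not spell out how the similarity score is computed, so the following is only one plausible character-level measure, built on Python's `difflib`. Treat it as an assumption: it will not necessarily reproduce the exact percentages quoted here.

```python
from difflib import SequenceMatcher


def similarity_percent(reference, candidate):
    """Character-level similarity between two strings, as a percentage.

    Uses SequenceMatcher.ratio(), i.e. 2 * matches / total characters.
    autojunk is disabled so long sentences are compared without heuristics.
    """
    matcher = SequenceMatcher(None, reference, candidate, autojunk=False)
    return 100.0 * matcher.ratio()


if __name__ == "__main__":
    # Three of four characters match in each string: 2 * 3 / 8 = 0.75
    print(similarity_percent("abcd", "abcf"))
```

A character-level score suits this task because the model output and the human translation are both modern Korean and differ mostly in a few syllables and spaces.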
Teaching Gemma with Care
To train the model efficiently, I used a Parameter-Efficient Fine-Tuning (PEFT) setup with LoRA.
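from peft import LoraConfig

# LoRA adapter settings; only the small adapter weights are trained.
peft_config = LoraConfig(
    lora_alpha=16,                # scaling numerator; effective scale is lora_alpha / r
    lora_dropout=0.05,            # dropout applied on the LoRA path
    r=16,                         # rank of the low-rank update matrices
    bias="none",                  # leave bias terms frozen
    target_modules="all-linear",  # adapt the model's linear layers
    task_type="CAUSAL_LM",
)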
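To get a feel for why LoRA is so light, here is a small back-of-the-envelope sketch. For a linear layer with a d_out x d_in weight, LoRA trains two matrices of shapes r x d_in and d_out x r. The 2048 x 2048 layer below is a hypothetical size chosen for the arithmetic, not a real Gemma dimension.

```python
def lora_trainable_params(d_in, d_out, r):
    """Parameters in the two low-rank matrices A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha, r):
    """Default scale applied to the low-rank update: lora_alpha / r."""
    return lora_alpha / r


if __name__ == "__main__":
    d_in = d_out = 2048  # hypothetical layer size, for illustration only
    full = d_in * d_out
    lora = lora_trainable_params(d_in, d_out, r=16)
    print(f"full: {full}, lora: {lora}, share: {100 * lora / full:.4f}%")
    print("scale:", lora_scaling(lora_alpha=16, r=16))
```

With r=16 and lora_alpha=16, as in the config above, the scale works out to 1.0, so the adapter's output is added to the frozen layer's output without extra rescaling.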
The Secret Sauce: collate_fn
When fine-tuning a chat model to behave like a specific tool, you don't want it to waste energy learning how to re-write your prompt. By using a custom data collator, I masked the system and user inputs (setting their labels to -100), forcing Gemma's loss calculation to focus strictly on generating the correct modern assistant response.
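The original collator is not shown here, so the following is a minimal pure-Python sketch of the masking idea only. It assumes each example already carries its token ids plus a hypothetical `prompt_length` field counting the system and user tokens; a real collator would return tensors and get that boundary from the tokenizer's chat template.

```python
IGNORE_INDEX = -100  # label value PyTorch's cross-entropy loss skips by default


def mask_prompt_labels(input_ids, prompt_length):
    """Copy input_ids into labels, hiding the first prompt_length tokens."""
    labels = list(input_ids)
    n = min(prompt_length, len(labels))
    labels[:n] = [IGNORE_INDEX] * n
    return labels


def collate(examples, pad_token_id=0):
    """Right-pad a batch and mask prompt and padding positions in the labels."""
    max_len = max(len(e["input_ids"]) for e in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for e in examples:
        ids = list(e["input_ids"])
        pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * pad)
        # Padding is masked too, so it never contributes to the loss.
        labels = mask_prompt_labels(ids, e["prompt_length"])
        batch["labels"].append(labels + [IGNORE_INDEX] * pad)
    return batch


if __name__ == "__main__":
    demo = collate([
        {"input_ids": [5, 6, 7, 8], "prompt_length": 2},
        {"input_ids": [9, 10], "prompt_length": 1},
    ])
    print(demo["labels"])
```

Only the positions that keep their real token id in `labels` are scored, which is what keeps the loss focused on the assistant's translation.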
After setting the hyperparameters to gently cruise through 5 epochs at a learning rate of 2e-5, I hit train.
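For reference, those settings could be collected as below. Only the epoch count and the learning rate come from this article; the output path, batch size, gradient accumulation, and logging interval are hypothetical placeholders, and the step estimate is approximate.

```python
import math

# Only num_train_epochs and learning_rate come from the article;
# every other value is a hypothetical placeholder.
training_kwargs = {
    "output_dir": "gemma-classical-korean",  # hypothetical path
    "num_train_epochs": 5,
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "logging_steps": 10,
}


def estimated_optimizer_steps(num_examples, kwargs):
    """Rough count of optimizer updates for a dataset of num_examples rows."""
    effective_batch = (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
    )
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * kwargs["num_train_epochs"]


if __name__ == "__main__":
    # 100 is an illustrative dataset size, not the real one.
    print(estimated_optimizer_steps(100, training_kwargs))
```

The keys follow Hugging Face's `TrainingArguments` field names, so a dictionary like this should be usable as keyword arguments for a trl training config. A small batch with gradient accumulation is a common way to fit a 16GB T4.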
The Warm "After" Glow
After a bit of patience and letting the trainer do its magic, the results were incredibly rewarding. The character-by-character similarity score jumped all the way up to a brilliant 79.93%!
Look at how it handles the text now:
- Original Classical Text: แแกแจ์จ๋ฏ๊ณ แแก์ฌ์ํ๋ณต์๊ทธ๊ทผ๋ณธ์แแแท์ด์ง์๋แแ๋์ฅ๋ถ๋ก๋คแแ๊ณ แแก์ผ์๋กแแ๋๋ผ
- Human Translation: ๋ฐฑ์จ ๋ฃ๊ณ ๋ด์ฌ์ ํ๋ณต ์, "๊ทธ ๊ทผ๋ณธ์ ๊ฐ์ถ์ง ์๋ํ๋ ์ฅ๋ถ๋ก๋ค!" ํ๊ณ , ์ฌ์ผ ์๋กํ๋๋ผ.
- Gemma's Fine-Tuned Translation: ๋ฐฑ์จ๋ฏ ๊ณ ๋ด์ฌ์ ํ๋ณต ์, "๊ทธ ๊ทผ๋ณธ์ ๊ฐ์ด์ง ์๋ํ๋ ์ฅ๋ถ๋ก๋ค." ํ๊ณ ์ ์ผ ์๋ก ํ๋๋ผ.
- New Similarity Score: 85.71% ✨
Closing Thoughts
Technology often pushes us relentlessly into the future, but my favorite tech projects are the ones that allow us to look backward with greater clarity. By spending a little time fine-tuning a lightweight model like Gemma 4, we can build tools that preserve cultural history, making ancient wisdom and classic stories accessible to anyone with a laptop.
Next time you find a piece of history that feels just a bit too out of reach, remember that a small dataset and a fine-tuning session might be all you need to bring it into the light.
Here's a structured workflow you can follow when fine-tuning for your own domain:
- Define a clear goal
- Prepare a high-quality dataset and evaluation plan
- Verify the model is learning
- Evaluate with metrics and human judgment
- Deploy and iterate
Check out this tutorial in Gemma Cookbook
Star the repository to support us