AI / NLP — 2026
“Teaching machines to read between the lines of Malay verse.”
Services
Tech
Documentation
Ensemble models
MalayBERT macro F1
Pantun theme classes
Consensus on clear cases
A Malay pantun theme classifier that runs three very different models (a fine-tuned MalayBERT transformer, a classic TF-IDF + SVM, and a TextCNN) side by side, then resolves their predictions by majority vote to label the theme of any four-line pantun.
The Challenge
A Malay pantun rarely states its meaning outright. The literal "pembayang" (foreshadow) and the figurative "maksud" (intent) often pull in different directions, so keyword matching breaks down: a pantun mentioning "kasih" is not necessarily about love. Compounding this, the labelled dataset is small and heavily imbalanced (some theme classes have only a few dozen samples), which punishes data-hungry models.
Our Solution
We built an ensemble that plays each model to its strength. MalayBERT (mesolitica/bert-base-standard-bahasa-cased) reads whole-context meaning and figurative intent; the TF-IDF + SVM nails explicit keyword signals like "Tuhan" or "Budi"; TextCNN captures local n-gram patterns. A majority-vote layer surfaces a single consensus theme with per-model confidence, plus a pantun anatomy breakdown (A-B-A-B rhyme, pembayang vs. maksud) and related pantun suggestions.
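The majority-vote layer can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the `Prediction` type, the `majority_vote` function, the theme labels, and the confidence values are all hypothetical. The tie-break rule (fall back to the most confident model when all three disagree) is also an assumption, since the write-up does not say how ties are handled.

```python
from collections import Counter
from typing import NamedTuple


class Prediction(NamedTuple):
    """One model's output for a single pantun (hypothetical structure)."""
    model: str         # e.g. "MalayBERT", "TF-IDF + SVM", "TextCNN"
    theme: str         # predicted theme label
    confidence: float  # the model's own probability for that label, 0..1


def majority_vote(predictions: list[Prediction]) -> dict:
    """Resolve per-model predictions into one consensus theme."""
    if not predictions:
        raise ValueError("need at least one prediction")

    counts = Counter(p.theme for p in predictions)
    top_votes = max(counts.values())
    tied = [theme for theme, n in counts.items() if n == top_votes]

    if len(tied) == 1:
        consensus = tied[0]
    else:
        # Assumption: when no theme wins outright, trust the single most
        # confident model among the tied themes.
        best = max(
            (p for p in predictions if p.theme in tied),
            key=lambda p: p.confidence,
        )
        consensus = best.theme

    return {
        "theme": consensus,
        "votes": counts[consensus],
        "unanimous": counts[consensus] == len(predictions),
        # Keep every model's view so the label stays explainable.
        "per_model": {p.model: (p.theme, p.confidence) for p in predictions},
    }


if __name__ == "__main__":
    # Hypothetical labels and scores, for illustration only.
    example = [
        Prediction("MalayBERT", "budi", 0.62),
        Prediction("TF-IDF + SVM", "kasih", 0.80),
        Prediction("TextCNN", "budi", 0.41),
    ]
    print(majority_vote(example))
```

Returning the per-model predictions alongside the consensus is what keeps the result transparent: a reader can see which model dissented and how sure each one was.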
The Outcome
MalayBERT led at ~60% macro F1, with the SVM close behind (~55%) and TextCNN trailing (~47%), as expected given how scarce the labelled data is. The transparent three-model view turns a black-box label into an explainable, teachable read of each verse, useful for students of classical Malay literature.
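Macro F1 is a natural metric for an imbalanced dataset like this one because it averages the per-class F1 scores with equal weight, so a rare theme counts as much as a common one. The sketch below computes it from scratch in plain Python. The `macro_f1` helper and the toy labels are hypothetical and are not drawn from the project's data.

```python
def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy, made-up example: 8 samples of a common theme, 2 of a rare one,
    # and a lazy classifier that always predicts the common theme.
    y_true = ["kasih"] * 8 + ["agama"] * 2
    y_pred = ["kasih"] * 10

    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print(f"accuracy = {accuracy:.2f}")                    # accuracy = 0.80
    print(f"macro F1 = {macro_f1(y_true, y_pred):.2f}")    # macro F1 = 0.44
```

In this toy case accuracy looks respectable at 0.80, while macro F1 drops to about 0.44 because the rare class is missed entirely, which is the kind of failure an imbalanced pantun dataset would otherwise hide.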
Full Tech Stack