Cantonese is a Chinese dialect used mainly in Hong Kong, Macau, Guangdong Province and some other regions. Although Cantonese is very similar to Mandarin, translating it accurately still poses many challenges, and many native Mandarin speakers struggle to read Cantonese texts. Because Cantonese is rarely used in formal writing, it is a low-resource language: compared with more widely used, better-resourced languages, its data is too sparse to support building neural machine translation models in the usual way. Ku Su Wa and Chong Iok Hei, outstanding 2020/2021 undergraduates in the Department of Computer and Information Science, recognized this problem and chose it as the topic of their graduation project. Under the guidance of Professor Derek Wong, they used “Candarin”, a bilingual-dictionary neural machine translation system that needs no parallel data, together with back-translation and dual learning. With these approaches, they successfully tackled the low-resource problem for Cantonese. To make it convenient for the general public, they also designed an easy-to-use website and mobile version. Their graduation project was selected as one of the ‘Best Excellent Projects’.

Reflecting on four years at UM, Chong Iok Hei says, ‘Besides teaching me the latest technical knowledge and information, UM has also helped me build relationships with others. At UM, I have met many like-minded friends and great teachers.’ Ku Su Wa adds, ‘In this harmonious learning environment, I have met many new friends and professors. They have given me a lot of help and encouragement. Whenever I encounter difficulties, they sincerely help me, and that makes me feel very warm. I have also developed self-management skills and learned how to arrange and manage my time appropriately.’ In the future, Chong Iok Hei hopes to work in an educational institution so that children in Macau can acquire computer knowledge from an early age. Ku Su Wa says that the graduation project deepened his interest in natural language processing, and that he will continue to conduct in-depth research in the field.

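Cantonese is a Chinese dialect currently used mainly in Hong Kong, Macau, Guangdong Province and some other regions. Although Cantonese and Mandarin are very similar, translating Cantonese accurately still poses many challenges, and many native Mandarin speakers run into obstacles when reading Cantonese texts. Compared with more widespread, better-resourced languages, Cantonese is rarely used in written form in formal texts, so it lacks rich data, and data-driven machine learning systems cannot readily be developed for it. Ku Su Wa and Chong Iok Hei, outstanding undergraduate graduates of the Department of Computer and Information Science in the 2020/2021 academic year, noticed this market need and chose it as the topic of their graduation project. Under the guidance of Professor Derek Wong, the two applied methods including “Candarin”, a bilingual-dictionary neural machine translation model that uses no parallel data, back-translation and dual learning. They successfully created a translator between Cantonese and Mandarin, solving the problem of Cantonese translation. To make it convenient for the public, they also designed easy-to-use web and mobile versions. Their graduation project was selected as one of the ‘Best Excellent Projects’.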


Their project received the ‘Best Final Year Project of CIS’ award


The Candarin webpage and app