
Domestic AI models tackle unique demands of Cantonese

Cultural elements, colloquialisms require local solutions, experts say

By OASIS HU in Hong Kong | China Daily | Updated: 2024-10-21 07:42
LU PING/CHINA DAILY

"Is there a wife in a wife cake?" Heyson He Lixi, an influencer promoting Cantonese, asked.

"There is no actual wife in a wife cake, as it is just a traditional Hong Kong pastry," answered Sensechat, the world's first Cantonese large language model, a ChatGPT-like product, in fluent Cantonese.

He raised a more challenging question, "What is sorrowful rice?"

"Sorrowful rice, a well-known delicacy in Hong Kong, derives its name from the martial arts masterpiece The Return of the Condor Heroes by Hong Kong writer Jin Yong. In the novel, the character Yang Guo experiences deep sorrow after losing his beloved, Xiaolongnyu," Sensechat responded.

He was impressed — it was the first artificial intelligence tool he had encountered that exhibited a profound understanding of Cantonese culture.

Born in 1998, He is fond of Cantonese and has been creating videos promoting the language since college. Over the years, he has amassed a fan base that surpasses 700,000 across various online platforms, including YouTube and popular Chinese mainland social media platforms like Bilibili, Xiaohongshu, and Douyin. Due to the recent AI craze, He started producing videos evaluating the tools' ability to understand Cantonese.

The outcomes are underwhelming. Many AI large language models (LLMs) and AI software developed on the mainland fail to recognize Cantonese phonetically. Some Western-developed AI software can listen to Cantonese, but cannot speak it accurately. ChatGPT, for instance, often blends Cantonese with Mandarin. Suno, an AI large language model tool that specializes in generating songs, can pronounce Cantonese to a degree, but its primary focus remains music creation.

In July, the Sensetime Group, an AI developer based in Hong Kong, introduced Sensechat, a Cantonese version of its proprietary LLM, and announced that it would be available for free to Hong Kong users indefinitely.

Upon a friend's recommendation, He downloaded Sensechat.

"I felt 85 percent satisfied with Sensechat," he said. "The application still requires to be further refined, but it is one of the few that can truly understand Cantonese."

The application emphasizes one of the unique traits of Cantonese — its colloquial nature.

Spoken Cantonese makes extensive use of modal particles, which are often placed at the end of sentences to indicate mood. Most AI tools miss these particles, but Sensechat captures them effectively.

In terms of written text, Sensechat can understand and reflect the nuances between the two forms of written Cantonese: a standardized form used in formal situations, similar to Mandarin, and a phonetic style for everyday use. This characteristic, He said, is often overlooked by other large language models.

He recorded his interactions with Sensechat and shared them online, garnering over 150,000 views. "Cantonese speakers truly need such a tool," He said.

Data size matters

Training an LLM typically involves three stages, said Cao Jiannong, the chair professor in the Department of Computing at Hong Kong Polytechnic University.

The first stage requires pre-training using extensive data, followed by fine-tuning with high-quality data. In the third stage, humans are needed to align the output of the LLM with local culture, ethics, morals, laws, and other rules to restrict the risk of generating inaccurate, biased, or unlawful content.

Developing a Cantonese LLM faces difficulties in all three stages, Cao said.

While Hong Kong's internet infrastructure is relatively well-developed, there is a scarcity of Cantonese content available online. A major factor contributing to this scarcity is that while Cantonese is widely spoken in daily life, the written form of Cantonese is Chinese.

Moreover, English has long served as the official language in Hong Kong. Consequently, a significant portion of the city's online information, including official archived documents in areas such as law, finance, politics, and medicine, is predominantly available in English, Cao said.

LLMs rely heavily on abundant data for their training, said Francis Fong Po-kiu, honorary president of the Hong Kong Information Technology Federation, a local IT-related business association. Without data, there is simply no way to develop a language model, he said.

Literature scarcity

Cantonese web resources suffer not only from a shortage in quantity, but also a lack of quality, said Cao.

When it comes to written material, Hong Kong has not prioritized literature, resulting in a scarcity of quality Cantonese literary works, said Keith Li King-wah, chairman of Hong Kong Wireless Technology Industry Association.

Most available Cantonese texts come from online forums and social media, and often contain low-quality and even offensive language, potentially leading AI models to produce crude content, Li said.

Collecting speech data presents another problem.

Despite access to Cantonese videos online, such as movies and TV dramas, they cannot be used due to background noise, said Albert Lam Yun-sang, the chief technology officer and chief scientist at Fano Labs, a Hong Kong-based startup focusing on speech and language technologies.

Besides insufficient data, Cantonese's intricate linguistic characteristics are another obstacle in training an AI model.

The Economist magazine analyzed language learning time, and found that mastering Cantonese requires 88 weeks of study, placing it alongside Mandarin, Arabic, Japanese, and Korean in the top five most difficult languages to learn.

Lu Lewei, director of the Sensetime Research Institute, said that Cantonese is highly colloquial with numerous inflections. It has nine tones and even a slight variation in pronunciation can alter a word's meaning.

The language also features a blend of Chinese and English and a mix of old and modern terms.

In language modeling, the simplicity of a language offers advantages. The more complex the language is, the harder it is for an AI model to learn it, Lam said.

Furthermore, underlying Cantonese is the local culture, which can be challenging for those tasked with aligning the output of large language models, Cao said.

Urgent need

Despite the difficulties involved in creating Cantonese AI models, demand for them is undeniable, said Fong from the Hong Kong Information Technology Federation.

The global Cantonese-speaking population is nearly 120 million, and 85.2 million of those are native Cantonese speakers.

In Hong Kong, 6.3 million residents, or 88.2 percent of the city's population, use Cantonese as their spoken language. In other cities within the Guangdong-Hong Kong-Macao Greater Bay Area, Cantonese is the predominant dialect, with 67 million residents in Guangdong province conversing in it.

In the future, AI will be akin to today's computers and fundamentally a tool for the general public. Without Cantonese AI tools, Cantonese-only speakers may encounter significant inconvenience and marginalization in both the offline and online world, Cao said.

For a city, lack of AI expertise could result in decreased productivity in sectors such as education, healthcare, finance, and law. These limitations could impede the whole city's development, Cao added.

Fong said AI models from other countries or regions may struggle to grasp Cantonese culture accurately. This could lead to cultural or political misinterpretations, resulting in the spreading of incorrect messages.

Dependence on outside AI models could make privacy and security vulnerable, Fong said.

Government officials, for instance, might face national security risks and local companies might leak data if they inadvertently disclose sensitive information to the models developed in foreign jurisdictions, he added.

Fong urged the Hong Kong Special Administrative Region government and local organizations to develop Cantonese LLMs.

In July, Sun Dong, Hong Kong's Secretary for Innovation, Technology, and Industry, announced that the SAR government is cooperating with local universities to develop a Hong Kong-based large language model.

A document co-pilot application for civil servants is now being used on a trial basis.

The model has already been implemented in Sun's department and the system will eventually become available to all Hong Kong residents, the secretary said.

The bureau said plans are underway to expand the pilot application to three other government bureaus, but it gave no indication when Hong Kong residents would gain access to it.

Fong said if it could be launched successfully, the government LLM would have many benefits.

It would be a positive step in resolving the issue of some Western AI models limiting their usage in Hong Kong. Also, implementing a localized AI model could safeguard privacy and provide more convenience to residents, Fong said.

Cao said it's unclear what specific features the government's AI model could offer and how it would distinguish itself from other similar products.

"I don't think the government has done enough research on what they want to do," Cao said.

Local startups

Local technology companies, meanwhile, are actively meeting the needs of the Cantonese-speaking market.

One startup, Votee AI, developed an open-source Cantonese LLM this year.

After years of operating in the local market, Votee AI has gathered substantial amounts of open-source Cantonese data along with primary data.

Taking a community-centered approach, they have also collaborated with local Cantonese linguists and AI researchers, including the team behind the online Cantonese dictionary "words.hk", to capture the nuances of Hong Kong speech.

Sensetime has also accumulated a vast reservoir of internal open-source data.

The company has synthesized data using advanced technologies and purchased supplementary data from external channels.

To combat the shortage of high-quality Cantonese data, Sensetime also collected Cantonese audio data from hundreds of its local employees.

Sensechat's clients include customer service providers, financial institutions, legal firms, healthcare companies, and others.

For Hong Kong residents, the company promises to provide the service for free indefinitely on both the web version and the mobile application.

A local tech industry insider, who chose to remain anonymous, said Sensechat should open-source its technology so that more residents and organizations can access it freely, to the benefit of the city.

After trying the Sensechat platform, he said its understanding of some Hong Kong slang could be more precise. Nonetheless, "it should be recognized that Sensechat filled a void in the local market," he said.

Cultural roots

In addition to developing local AI models, existing mainstream language models should be encouraged to improve their Cantonese functions, said Li from the Hong Kong Wireless Technology Industry Association.

However, mainstream AI language models are primarily developed by commercial entities in the West. Without market demand, they may not be willing to enhance their products' Cantonese capabilities.

Li believes the Hong Kong SAR government and local organizations should take the lead in collecting Cantonese data, digitizing cultural content, and sharing these resources openly to enrich the Cantonese body of information.

Cantonese speakers can also actively use the language to engage with mainstream AI language models.

These actions can demonstrate to AI model developers that there is a market demand for Cantonese, while interaction with these models can also enhance their understanding of Cantonese culture.

The key to encouraging more people to use Cantonese lies in making Cantonese culture appealing, Li said.

Language is not just a communication tool; it encapsulates the cultural essence and identity of its speakers, he said.

The marginalized status of Cantonese in the digital sphere is a reflection of the decline of the cultural significance of the region.

In the 1970s and 1980s, Hong Kong, although just a city, was so culturally influential that Cantonese was a popular language around the world, Li said.

"At that time, the whole world watched Hong Kong movies and TVB (television shows), knew Jackie Chan and Bruce Lee, and sang Cantonese songs. However, in the present day, even many students in Hong Kong cannot speak Cantonese," he said.

"The focus of government policies should not only be on technology, but also on culture."

He, the influencer, said he learned Cantonese from his grandparents when he was a child, which later made him more proficient in the language than other school students. The confidence this gave him motivated him to become a Cantonese blogger.

However, as He aged, Cantonese became so marginalized that even voice-operated devices and software in his home failed to understand Cantonese commands.

While He could communicate with these devices in Mandarin and English, his grandparents, who only speak Cantonese, struggled to keep pace.

He hopes that Cantonese LLMs will one day help his elderly grandparents manage their daily lives through voice-controlled apps capable of understanding Cantonese.
