Update: A Reworking of Chinese Language Classification

If you want to know where I have been the past few days, I have been working on this piece. I work on it for hours every day. So far, I have put in over 500 hours on this piece. That’s over three months of full-time work. My haters say I don’t work. The Hell I don’t work! I’d like to see them try to do this sort of work. This piece has been ridiculed by some linguist idiots on the Net. I worked a bit with linguists outside the Web. In fact, one of the top Sinologists (they have a Wikipedia entry) has been mentoring me on this project for some time now. I will not reveal this person’s name.

The number of languages has increased vastly from 365 to 526. Actually there are probably more than that. I have reason to believe that there may be 1,000-2,000 separate Chinese languages using the 90% intelligibility barrier (<90% = separate language, >90% = dialect). Just to contrast, Wikipedia says there are 14 Chinese languages (a grotesque underestimate) and the Chinese government insanely lies that there is only one Chinese language. Despite their superior IQs, a huge number of Chinese people fall for this idiot lie that is obviously based on political BS and not science. The Net is littered with otherwise intelligent Chinese people arguing strenuously that there is only one Chinese language. Just goes to show that you can have a high IQ and still be an idiot if your thought processes are too biased, which is the basic problem with all human thinking anyway.

Just to toot my own horn a bit and in response to my detractors, this is the most elaborate and extensive overview of the Chinese languages written in English in terms of pure classification that I have ever seen. There may well be works of this caliber or even beyond that are written in Chinese. In fact, one of the problems with this work is that so much of the original research is in Chinese. My Chinese is not very good, so it’s hard for me to read that stuff. This is not a finished product at all. This work will undergo revisions for quite some time if I keep working on it. It may not be done when I die. It’s a Herculean project.

I had to paste this in from a Word document, which is why the formatting looks so strange. But the post has received a huge update, in particular the Hokkien, Teochew, Wu and Cantonese sections. I am up to 526 languages. I would like any speakers of any Chinese language to look this over for me and add any corrections, explanations, elaborations, etc. I am especially interested in any mutual intelligibility data you might have as I have a bit of a mutual intelligibility fetish.

Warning: Very long. Runs to 87 pages in a Word document.

A Reworking of Chinese Language Classification

by Robert Lindsay

The Chinese languages have undergone a lot of reclassification lately (Mair 1991), from one Chinese language a couple of decades ago up to 14 Chinese languages today according to the latest Ethnologue.

However, Jerry Norman, one of the world’s top experts on Chinese, has stated that based on mutual intelligibility, there are 350-400 separate languages within Chinese (Mair 1991). According to Gong Xun, a Sichuan Mandarin speaker in Deyang, China, by my criteria of distinguishing between language and dialect, there would be 300-400 separate languages in Fujian alone.

So far, 2,500 dialects of the Chinese language have been identified, and a number of them are separate languages.

Based on the criteria of mutual intelligibility, I have expanded the 14 Chinese languages into 526 separate languages.

There are different ways of calculating mutual intelligibility. Mutual intelligibility is hard to determine. I am not interested in typological studies of varieties involving either lexicon, phonology or tones, unless this can be quantified in terms of mutual intelligibility in a scientific way (Cheng 1991). For the most part, what I am interested in is, “Can they understand each other?”

I decided to put the dividing line at 90%, with >90% being dialect and <90% being a separate language. This is based on what appears to be Ethnologue’s criteria for establishing the line between a dialect and a language.
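The 90% rule can be written down as a small function. This is only a sketch of the rule as stated above, not a tool I actually used; the function name `classify` is made up for illustration, and the handling of a score of exactly 90% is my assumption, since the rule only covers scores above and below that line.

```python
def classify(intelligibility_pct: float) -> str:
    """Apply the 90% rule to one pairwise intelligibility score (0-100)."""
    if not 0 <= intelligibility_pct <= 100:
        raise ValueError("intelligibility must be between 0 and 100")
    if intelligibility_pct > 90:
        return "dialect"
    if intelligibility_pct < 90:
        return "separate language"
    # Assumption: the rule says nothing about exactly 90%,
    # so that case is flagged instead of being guessed.
    return "borderline"


print(classify(95))  # dialect
print(classify(65))  # separate language
```

A pair scoring 65%, like some of the pairs reported in Cheng 1991, comes out as a separate language under this rule.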

In the cases below where I had mutual intelligibility data available, a number of Chinese languages had no more than 65% intelligibility between them (Cheng 1991).

The best way to see this study is as a pilot study. The purpose of the classification below is more to stimulate academic interest and sprout new thinking and theory. It is not intended to be the be-all and end-all statement on the subject; in fact, it is quite the opposite. Pilot studies, which is what this is, are by their nature never fully accurate or precise.

Reasonable, fair-minded, and professional comments, additions, criticisms, elaborations, presentations of evidence, etc. are highly encouraged.

I assume this paper will be controversial. Keep in mind that this work is extremely tentative and should not be taken as the last word on the subject by a long shot.

Interested scholars, observers or speakers of Chinese languages are encouraged to contribute any knowledge that they may have to add to, confirm or criticize this data below. So far as I know, this is the first real attempt to split Chinese beyond the 14 languages elucidated by Ethnologue.

There are many problems with the data below. In many cases, “separate language” just means that the variety is not intelligible with Putonghua. Unfortunately, I currently lack excellent mutual intelligibility data within the major language groups such as Gan, Xiang, Wu, and the branches of Mandarin. There is probably quite a bit of lumping still to be done below. Where varieties are mutually intelligible below, I have tried to lump them into one language with various dialects.
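The lumping step can be sketched in code. This is a rough illustration under assumptions, not the procedure I followed: the function name `lump` is hypothetical, and it treats intelligibility as transitive (if A understands B and B understands C, all three are lumped), which real dialect chains often violate. The sample data uses the Henan varieties discussed further down, where Luoyang, Kaifeng, Changyuan, and Zhengzhou are reported as mutually intelligible and Xinyang is not understood by Luoyang speakers.

```python
from collections import defaultdict


def lump(varieties, intelligible_pairs):
    """Group varieties into languages using reported mutual intelligibility.

    Each resulting group is one language; its members are the dialects.
    """
    parent = {v: v for v in varieties}

    def find(v):
        # Follow parent links to the group representative, compressing the path.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in intelligible_pairs:
        parent[find(a)] = find(b)  # merge the two groups

    groups = defaultdict(list)
    for v in varieties:
        groups[find(v)].append(v)
    return sorted(sorted(g) for g in groups.values())


henan = ["Luoyang", "Kaifeng", "Changyuan", "Zhengzhou", "Xinyang"]
pairs = [("Luoyang", "Kaifeng"), ("Kaifeng", "Changyuan"), ("Changyuan", "Zhengzhou")]
print(lump(henan, pairs))
# [['Changyuan', 'Kaifeng', 'Luoyang', 'Zhengzhou'], ['Xinyang']]
```

Any variety with no reported intelligible partner stays on its own, which is how a variety ends up listed as a separate language when the data is thin.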

In many cases, we seem to be dealing with dialect chains. This is particularly the case with the Mandarin languages, incorrectly referred to as the Mandarin dialects.

For instance, in Henan each major city can understand the next city over fairly well, but at the second or third city over, you run into serious comprehension difficulties. But even there, the languages are fairly close, with intelligibility at ~70%, and after three weeks of close contact, they can communicate fairly well. In many cases, it is a matter of working out the tone changes, for tone changes are very common even among the Mandarin lects.
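To see why dialect chains are awkward for a fixed cutoff, here is a toy example. The four cities and all of the scores are hypothetical numbers chosen to mimic the pattern just described (neighbors understand each other well, cities two or three steps apart drop to the 70s); none of them are measured data. The sketch lists every case where X and Y pass the 90% line, Y and Z pass it, but X and Z do not.

```python
THRESHOLD = 90

# Hypothetical pairwise intelligibility scores for four cities along one road.
scores = {
    ("A", "B"): 93, ("B", "C"): 92, ("C", "D"): 94,  # next city over
    ("A", "C"): 78, ("B", "D"): 76,                  # second city over
    ("A", "D"): 70,                                  # third city over
}


def same_language(x, y):
    """True if the pair scores above the threshold (order of x, y is ignored)."""
    if x == y:
        return True
    score = scores[(x, y)] if (x, y) in scores else scores[(y, x)]
    return score > THRESHOLD


def broken_triples(cities):
    """Return (x, y, z) where x~y and y~z are dialects but x and z are not."""
    found = []
    for y in cities:
        for x in cities:
            for z in cities:
                if x < z and len({x, y, z}) == 3:
                    if same_language(x, y) and same_language(y, z) and not same_language(x, z):
                        found.append((x, y, z))
    return found


print(broken_triples(["A", "B", "C", "D"]))
# [('A', 'B', 'C'), ('B', 'C', 'D')]
```

In a chain like this, there is no clean place to draw the language boundary, which is part of why so much lumping and splitting remains uncertain.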


Putonghua is Standard Mandarin, based on the Beijing Mandarin dialect as of 1949, but it has since diverged wildly, and many Putonghua speakers today cannot understand Beijing Mandarin. Putonghua is being promoted as the national language of China.

In addition to Putonghua, there are 1,500 other dialects of Mandarin spoken in China. In general, other Mandarin dialects are not intelligible to Putonghua speakers (Campbell 2009). However, the Northeastern Mandarin dialects and the dialects around Beijing are more intelligible with Putonghua than the Mandarin dialects in the rest of the country.

The implication is that there may be over 1,500 Mandarin languages in China. However, many of these Mandarin dialects are intelligible with at least some other Mandarin lects. Hence, despite the lack of intelligibility with Putonghua, there is a lot of potential lumping within Mandarin.

The degree to which Mandarin dialects are intelligible to each other is very much an open question and in general is poorly investigated.

We should also note here that even Putonghua, the language that was meant to tie the nation together, seems to be evolving into regional languages.

Guangdong Putonghua is not fully intelligible to speakers of the Putonghuas of Northern China and hence is probably a separate language.

Shanghai Putonghua is often not intelligible with Putonghua from other regions. It has heavy interference from Shanghaihua, which seriously affects the Putonghua accent. Even after four years of exposure, Standard Putonghua speakers often have problems with it.

Anhui Putonghua has poor intelligibility with Standard Putonghua due to its phonology. Therefore, it is a separate language.

In addition, Jianghuai Putonghua and Zhengcao Putonghua are not intelligible with Putonghua from other areas (Campbell 2009). These varieties of Mandarin cause a particular interference with Putonghua Mandarin that results in a severe dialectal disturbance in their Putonghua.

These Putonghuas are spoken in the regions native to the Jianghuai and Zhengcao branches of Mandarin. Jianghuai Mandarin is spoken in Anhui, Jiangsu, Hubei and to a much lesser extent Zhejiang Provinces. Zhengcao Mandarin is spoken in Anhui, Henan, Shandong, and Jiangsu, with one dialect spoken in Hebei.

Tibetan Mandarin has heavy Tibetan admixture.

There are also varieties of Putonghua that are spoken in Singapore and Taiwan. Claims that Taiwan Mandarin is fully intelligible with Putonghua are incorrect. Taiwanese Mandarin is about 80-85% intelligible with Putonghua. Based on that intelligibility figure, Taiwanese Mandarin is a separate language.

Singapore Mandarin has fewer differences with Putonghua than Taiwanese Mandarin and hence is a dialect of Putonghua.

Malay Mandarin is said to be quite different but nevertheless mutually intelligible with Putonghua. However, Malay Mandarin speakers say they have to make speech adjustments with Chinese speakers; otherwise their speech is poorly intelligible. This implies that Malay Mandarin is indeed a separate language.

Yunnan Putonghua is intelligible with Putonghua from other regions (Campbell 2009).

Mandarin has 873 million speakers. There are an incredible 1,526 varieties of Mandarin.

Beijing Jilu Mandarin has low intelligibility with other branches of Mandarin: 72% intelligible with Southwest Mandarin, 64% intelligible with Zhongyuan Mandarin and 55% intelligible with Jiaoliao Mandarin (Cheng 1997).

Putonghua was based on Beijing Dialect. However, many Putonghua speakers claim that Beijinghua is not inherently intelligible with Putonghua. Complaints about unintelligible taxi drivers in Beijing are legendary. At the very least, competing views of the intelligibility of Beijinghua and Putonghua deserve investigation.

On the other hand, Beijinghua is intelligible with Hebei Mandarin and Nanjing City Mandarin, yet Putonghua is not intelligible with Hebei.

The Beijinger variety of Beijing’s hutongs and taxi drivers is legendary for being hard to understand.

The truth is that Putonghua was never entirely based on Beijinghua. It was in terms of pronunciation but not in terms of vocabulary. Putonghua got only 35% of its vocabulary from Beijinghua. Most of its vocabulary came from Japanese Kanji words. They used a form of Mandarin that was based on Chinese scholars who went to study in Japan at the end of the Qing Era. So Putonghua, like Standard Italian which is based on Florentine Italian of Dante circa 1400, is in a sense frozen in time.

The two lects may also have taken separate trajectories. This has also occurred in Italian, where, though Standard Italian was based on Florentine Tuscan, Standard Italian and Tuscan Italian have taken separate trajectories since. If you see old Tuscan men on TV in Italy, a speaker of Standard Italian from Southern Italy would need subtitles to understand them, but one from Northern Italy would not.

Others say that Putonghua was based on the language of the Beijing suburbs, not the city itself.

For whatever reason, Beijinghua often seems to have less than 90% intelligibility with Putonghua, though the question needs further research. Beijinghua, in its pure and least mutually intelligible form, seems to be spoken mostly in the innermost hutongs and among taxi drivers and other low-income and working class people. The variety of people with more education and money is probably a lot more comprehensible.

I would describe the real, pure, Putonghua as “CCTV speech”, the variety you hear on Chinese state television. Evidence that Beijinghua lacks full intelligibility with Putonghua is here, here, here, here, here, here, here and here.

The question of whether or not Beijinghua is a separate language from Putonghua is sure to be highly controversial. Perhaps intelligibility testing could settle the question.

Jinan (New Jinan) Jilu Mandarin is not intelligible with Putonghua.

Cangzhou Jilu Mandarin, spoken in southeastern Hebei, is a separate language. It is only partly intelligible with Putonghua. Renqiu Jilu Mandarin, Huanghua, Hejian Jilu Mandarin, Cangxian Jilu Mandarin, Qingxian Jilu Mandarin, Xianxian Jilu Mandarin, Dongguang Jilu Mandarin, Haixing Jilu Mandarin, Yanshan Jilu Mandarin, Suning Jilu Mandarin, Nanpi Jilu Mandarin, Wuqiao Jilu Mandarin, and Mengcun Jilu Mandarin, all spoken in Cangzhou Prefecture, are all dialects of Cangzhou Jilu Mandarin.

Cangzhou Jilu Mandarin shares some similarities with Tianjin Jilu Mandarin and Baoding Jilu Mandarin, but it is probably not fully intelligible with either.

Tianjin Mandarin’s tones are quite different from Putonghua’s, its tone sandhi is much more complicated, and it is more closely related to varieties 150-500 miles away, since originally Tianjin Mandarin speakers came from Anhui (Lee 2002). Nevertheless, Tianjin Mandarin is a dialect of Beijing Mandarin.

Baoding Jilu Mandarin appears to be a separate language because there are people from the city who cannot speak it at all.

Beijing is in a group called the Beijing Group of Jilu Mandarin. It contains 43 separate varieties and may contain more than one language.

Jinan is a member of the Liaotai Group of Jilu Mandarin, which has 37 lects.

The Baoding Group of Jilu Mandarin has 52 lects.

Cangzhou, Renqiu, Huanghua, Hejian, Cangxian, Qingxian, Xianxian, Dongguang, Haixing, Yanshan, Suning, Nanpi, Wuqiao, and Mengcun are members of the Huangle subgroup of Baotang, which has 25 lects.

Tianjin forms its own subgroup within Baotang.

Jilu Mandarin itself consists of 154 lects.

Northeastern (Dongbei) Mandarin is generally intelligible with Putonghua.

Shenyang Northeastern Mandarin is the main dialect in this group, and it is intelligible with Harbin Northeastern Mandarin, Liaoning Northeastern Mandarin, Changchun Northeastern Mandarin, and Heilongjiang Northeastern Mandarin. Harbin Northeastern Mandarin is also intelligible with Tianjin Jilu Mandarin and Beijing Jilu Mandarin. Nanjing City Northeastern Mandarin, Hebei Northeastern Mandarin, and much of the rest of NE Mandarin are all mutually intelligible.

Shenyang is a member of the Jishen Group of Northeastern Mandarin, which has 44 lects.

Within Jishen, Shenyang is a member of the Tongxi Group, which has 24 lects.

Harbin is a member of the Hafu Group of Northeastern Mandarin, which has 64 lects.

Within Hafu, Harbin Mandarin is a member of the Zhaofu Group, which has 18 lects.

Dongbei Mandarin has 108 lects.

Zhongyuan Mandarin is a large split in Mandarin. It is not fully intelligible with Putonghua.

Nanjing Zhongyuan Mandarin (evidence) is also a separate language – now mostly spoken in the suburbs, as city speech is not a separate language anymore. The city language is intelligible with the general Northeastern China Mandarin spoken in Beijing and Hebei.

So we shall call Nanjing Suburbs Zhongyuan Mandarin a separate language.

Luoyang Zhongyuan Mandarin, Kaifeng Zhongyuan Mandarin, Changyuan Zhongyuan Mandarin, and Zhengzhou Zhongyuan Mandarin, all in Henan Province, are not intelligible with Putonghua. However, all four are mutually intelligible, so they are dialects of a single language, Henan Zhongyuan Mandarin.

Xinyang Zhongyuan Mandarin, also spoken in Henan, is a separate language and cannot be understood by Luoyang Zhongyuan Mandarin speakers.

Nanyang Zhongyuan Mandarin has high but not complete intelligibility with Luoyang Zhongyuan Mandarin. Intelligibility between Nanyang Zhongyuan Mandarin and Luoyang Zhongyuan Mandarin is probably ~70%. Nanyang Zhongyuan Mandarin has 15 million speakers.

Gushi Zhongyuan Mandarin is not intelligible with Putonghua. In addition, Gushi Zhongyuan Mandarin is different from Nanyang Zhongyuan Mandarin and is probably not intelligible with it.

Intelligibility between Xinyang Zhongyuan Mandarin and Gushi Zhongyuan Mandarin is not known.

In general, intelligibility between many varieties in Henan is not full, but after a few weeks or so of close contact, they can start to understand each other. Mutual intelligibility between Xinyang Zhongyuan Mandarin, Gushi Zhongyuan Mandarin, and Nanyang Zhongyuan Mandarin may be ~70%.

In Shaanxi, Yanan Zhongyuan Mandarin, Xi’an Zhongyuan Mandarin, Huxian Zhongyuan Mandarin, Zhouzhi Zhongyuan Mandarin, and Hanzhong Zhongyuan Mandarin are not intelligible with Putonghua, but they may well be intelligible with each other. Xi’an Zhongyuan Mandarin, for instance, is about 65% intelligible with other Mandarin groups. It is closest to Jinan Jilu Mandarin, with which it has 75% intelligibility (Cheng 1997). Let us call this language Shaanxi Zhongyuan Mandarin.

Xining Zhongyuan Mandarin, spoken in Xinghai, seems to be very different from other Shaanxi Zhongyuan Mandarin varieties and is probably a separate language altogether.

In Gansu Province, Gansu Zhongyuan Mandarin appears to be a separate language. Tongwei Zhongyuan Mandarin appears to be a dialect of Gansu Zhongyuan Mandarin.

However, within Gansu Zhongyuan Mandarin, there are divergent lects, such as Sale Zhongyuan Mandarin, which are unintelligible with other Gansu Mandarin lects.

Bozhou Zhongyuan Mandarin (evidence), Yingshang Zhongyuan Mandarin (evidence), and Fuyang Zhongyuan Mandarin (evidence), spoken in Anhui, are at least unintelligible with Putonghua. Fuyang Zhongyuan Mandarin is very different. The unnamed variety spoken 300 km. south of Jinan around Mengcheng in rural Anhui is said to be completely unintelligible with Putonghua, Tianjin Jilu Mandarin, and Beijinghua. For the time being, we will refer to this as one language, Anhui Zhongyuan Mandarin. Intelligibility between varieties of Anhui Zhongyuan Mandarin is not known.

The Mandarin spoken in Qinghai, Qinghai Zhongyuan Mandarin, is very different from that spoken in Gansu.

Xian, Huxian, and Zhouzhi are members of the Guanzhong Group of Zhongyuan Mandarin, which has 45 lects.

Yanan, Hanzhong, and Xining are members of the Qinlong Group of Zhongyuan Mandarin, which has 67 lects.

Luoyang is a member of the Luoxu Group of Zhongyuan Mandarin, which has 28 lects.

Kaifeng, Nanyang, Zhengzhou, Changyuan, and Bozhou are members of the Zhengcao Group of Zhongyuan Mandarin, which has 93 lects.

Xinyang and Gushi are in the Xinbeng subgroup of Zhongyuan Mandarin, which has 20 lects.

Tongwei and Sale are part of the Longzhong Group of Zhongyuan Mandarin, which has 25 lects.

Yingshang is a member of the Cailu Group of Zhongyuan Mandarin, which has 30 lects.

Zhongyuan Mandarin has a shocking 338 lects.

Zhongyuan Mandarin has 130 million speakers (Olson 1998).

Southwestern Mandarin is a huge and diverse group within Mandarin that contains a multitude of varieties and is not fully intelligible with Putonghua.

Yichang Southwestern Mandarin, Nanping Southwestern Mandarin (spoken near Mt. Wuyi; evidence), Longcheng Southwestern Mandarin (evidence), Luocheng Southwestern Mandarin (evidence), Lingui Southwestern Mandarin (evidence), Jiuzhaigou Southwestern Mandarin (evidence), Xindu Southwestern Mandarin, Wenshan Southwestern Mandarin (evidence), Mianzhu Southwestern Mandarin (evidence here and here), and Yangshuo Southwestern Mandarin are all unintelligible with Putonghua.

Guilin Southwestern Mandarin is not intelligible with general Southwestern Mandarin speech either.

Wenshan at least is not intelligible with other Southwestern varieties (Johnson 2010).

Guiliu Southwestern Mandarin is at least not comprehensible with Putonghua or Chengdu Southwestern Mandarin.

Chengyu Southwestern Mandarin is not comprehensible with Putonghua or Guiliu Southwestern Mandarin.

Chengdu Southwestern Mandarin is part of a broadly intelligible Sichuan Southwestern Mandarin koine that is spoken in many of the larger cities in Yunnan.

It includes Ziyang Southwestern Mandarin, Kunming Southwestern Mandarin, Bazhong Southwestern Mandarin, Baojing Southwestern Mandarin, Dazhou Southwestern Mandarin, Neijiang Southwestern Mandarin, Yibin Southwestern Mandarin, Luzhou Southwestern Mandarin, Mianyang Southwestern Mandarin, Deyang Southwestern Mandarin, and Guiyang Southwestern Mandarin (Xun 2009).

Speakers of Chengdu Southwestern Mandarin say that Zigong Southwestern Mandarin and Meishan Southwestern Mandarin are not intelligible to them. Chengduhua is still very widely spoken in Chengdu by people of all ages.

Ziyang Southwestern Mandarin is intelligible with the koine but has a heavy accent.

Leshan Southwestern Mandarin is a separate language. It is unintelligible with the koine, but it can be learned in a few weeks of exposure (Xun 2009).

Intelligibility between Leshan Southwestern Mandarin and Sichuan Southwestern Mandarin may be ~70%.

Hankou Southwestern Mandarin is a separate language, with 80% intelligibility between it and Chengdu Southwestern Mandarin (Cheng 1997).

Chongqing Southwestern Mandarin is a separate language. Chongqing Southwestern Mandarin speakers cannot understand Chengdu or Luzhou speakers.

The many small Southwestern Mandarin varieties around Mt. Emei are not intelligible with Sichuan Southwestern Mandarin, appear to be very different and may be one or more separate languages.

Wuhan Southwestern Mandarin is not intelligible to speakers of Southwestern Mandarin from other provinces; for instance, it is only 80% intelligible with Chengdu Southwestern Mandarin. Once you go an hour in any direction from Wuhan, Wuhan Southwestern Mandarin is no longer intelligible.

Dali Southwestern Mandarin is spoken in the city of Dali near Kunming. The variety is still widely spoken.

Dahua Southwestern Mandarin, spoken in and around Dahua village on the Puduhe River near Dongchuan in Yunnan Province, is apparently a separate language.

Another language spoken in Yunnan, Lanping Southwestern Mandarin, is also not intelligible with Putonghua.

Chuanlan Southwestern Mandarin is a little-known language spoken by the Tunbao people of Guangxi Province.

Yingshan Southwestern Mandarin is a separate language based on a 200 word Swadesh test (Ben Hamed 2005).

Menghai Southwestern Mandarin (evidence) may well be a completely separate language.

Shaoshan Southwestern Mandarin, spoken in Hunan Province, is a separate language.

Another language spoken in Hunan in Zhangjiajie County is called Zhangjiajie Maoxi Southwestern Mandarin. The Maoxi are a tribal group there that speak a strange variety of Southwestern Mandarin.

Taoyuan Southwestern Mandarin in Hunan is not fully intelligible with other Southwest Mandarin lects, or at least not with Sichuan Southwestern Mandarin.

Gaoping Southwestern Mandarin and Baixi Southwestern Mandarin in Hunan speak mutually intelligible varieties, even though Gaoping is in Longhui County and Baixi is in Xinhua County. Although they are very far from each other, the two towns can communicate with each other in their own varieties without problems. This is because an extended family left Gaoping 150 years ago and moved to Baixi, marrying the two languages. It would be best to call this language Gaoping Southwestern Mandarin.

Xinfeng Southwestern Mandarin is traditionally categorized as Southwestern Mandarin. It is a Southwestern Mandarin dialect island spoken in Xinfeng County in Ganzhou City, Jiangxi, surrounded by Gannan Hakka lects. Over time, it has seen so much Hakka influence that it may now be characterized as a mixed dialect. Given the massive Hakka influence, Xinfeng Southwestern Mandarin is no doubt a separate language.

Gong’an Southwestern Mandarin is a very unusual Southwestern Mandarin variety spoken in Gong’an City in Hubei. Hunan is to the south. It is nearly a mixed language, having features of both Southwestern Mandarin and Xiang. As such, no doubt it is a separate language.

Guilin, Luocheng, Yangshuo, Liuzhou, and Lingui are members of the Guiliu Group of Southwestern Mandarin, which has 57 lects.

Leshan and Longchang are members of the Guanchi Group of Southwestern Mandarin, which has 85 lects.

Within Guanchi, Longchang is a member of the Renfu Group, which has 13 lects.

Yichang, Chengdu, Chongqing, and Yingshan are members of the Chengyu Group of Southwestern Mandarin, which has 113 lects.

Menghai, Kunming, Wenshan, and Guiyang are members of the Kungui Group of Southwestern Mandarin. The Kungui Group itself has an incredible 95 lects.

Lanping is in the Dianxi Group of Southwestern Mandarin, which has 36 lects.

Within Dianxi, it is a member of the Baolu subgroup, which has 21 lects.

Taoyuan is a member of the Changhe Group of Southwestern Mandarin, which has 14 lects.

Wuhan is a member of the Wutian Group of Southwestern Mandarin, which has nine lects.

Dali is a member of the Dianxi Group of Southwestern Mandarin, which has 36 members.

Within Dianxi, Dali is a member of the Yaoli Group, which has 15 members.

Nanping, Chuanlan, Shaoshan, Jiuzhaigou, Zhangjiajie Maoxi, and Dahua are unclassified.

Southwestern Mandarin itself has a stunning 519 lects. There are 240 million speakers of Southwestern Mandarin (Olson 1998).

Jianghuai Mandarin is a separate branch of Mandarin that is very different from the rest of Mandarin. It is not fully intelligible with Putonghua. Some say that this is not even part of Mandarin, as it is better seen as in between Mandarin and Wu.

Jianghuai Mandarin, especially the variety spoken around Taizhou, is not intelligible at all with Anhui Zhongyuan Mandarin or Sichuan Southwestern Mandarin. Jianghuai Mandarin speakers cannot even tell that the Anhui Zhongyuan Mandarin or Sichuan Southwestern Mandarin speakers are speaking Mandarin because the language is so foreign.

Yangzhou Jianghuai Mandarin is considered to be a separate language by a 200 word Swadesh test (Ben Hamed 2005). Yangzhou Jianghuai Mandarin has about 52% intelligibility with the other branches of Mandarin (Cheng 1997). Phonetically, it resembles Wu.

Lianyungang Jianghuai Mandarin is a separate language, as are Yancheng Jianghuai Mandarin and Huaian Jianghuai Mandarin.

Nantong Jianghuai Mandarin, a very strange variety of Mandarin on the border of Wu and Mandarin that shares many features with Wu languages, is a separate language.

Nantong’s sister language, Tongdong Jianghuai Mandarin, is also a separate language. Jinsha Jianghuai Mandarin is a dialect of Nantong Jianghuai Mandarin.

Rugao Jianghuai Mandarin, next to Nantong, is also a separate language.

Hefei Jianghuai Mandarin is considered to be a separate language by a 200 word Swadesh list (Ben Hamed 2005). It is not understood outside of the city.

In 1933, there were three different languages spoken in Tongcheng, Anhui: Tongcheng Wenli Jianghuai Mandarin, East Tongcheng Jianghuai Mandarin, and West Tongcheng Jianghuai Mandarin. Tongcheng Wenli Mandarin was the classical-based language spoken by the educated elite of the city. Whether these three languages still exist is not known, but surely some of the speakers in 1933 are still alive.

Chuzhou Jianghuai Mandarin, spoken in Anhui, is not intelligible with Putonghua, although it is said to be close to Nanjing Jianghuai Mandarin.

Dangtu Jianghuai Mandarin, also spoken in Anhui, is not intelligible with Putonghua.

Dongtai Jianghuai Mandarin is a separate language (evidence). Dafeng Jianghuai Mandarin, Taizhou Jianghuai Mandarin, Xinghua Jianghuai Mandarin and Haian Jianghuai Mandarin are said to be similar to Dongtai Jianghuai Mandarin, so for the time being, we will list them as dialects of Dongtai Jianghuai Mandarin.

Rudong Jianghuai Mandarin is at least not intelligible with Putonghua.

Jiujiang Jianghuai Mandarin, spoken in Jiangxi Province, is a separate language, as is Xingzi, located close by.

Intelligibility between Rudong Jianghuai Mandarin, Dafeng Jianghuai Mandarin, Taizhou Jianghuai Mandarin, Xinghua Jianghuai Mandarin, Haian Jianghuai Mandarin and Dongtai Jianghuai Mandarin is not known; however, they may be closely related.

Jianghuai Mandarin is composed of an incredible 120 varieties. It has 65 million speakers (Olson 1998).

Yangzhou, Lianyungang, Yancheng, Huaian, Nanjing, Hefei, Anqing, the Tongchengs, Chuzhou, and Dangtu are in the Hongchao Group of Jianghuai Mandarin, which has 82 lects.

Dongtai, Dafeng, Taizhou, Haian, Xinghua, Jinsha, Nantong, Tongdong, Rudong, and Rugao are in the Tairu Group of Jianghuai Mandarin. Tairu has 11 different lects.

Jiujiang and Xingzi are members of the Huangxiao Group of Jianghuai Mandarin, which has 20 lects.

Lanyin Mandarin in the far northwest is also a separate language (Campbell 2004). Though Lanyin Mandarin is said to be intelligible with Putonghua, that does not appear to be the case. Minqin Lanyin Mandarin (evidence) and Lanzhou Lanyin Mandarin (evidence) in Gansu are not fully intelligible with Putonghua, nor is Yinchuan Lanyin Mandarin (evidence) in Ningxia.

Intelligibility within Lanyin Mandarin is not known, but Jiuquan Lanyin Mandarin at least appears to be a completely separate language inside Lanyin Mandarin.

Jiuquan is a member of the Hexi Group of Lanyin Mandarin, which has 18 lects.

Yinchuan is a member of the Yinwu Group of Lanyin Mandarin, which has 12 lects.

Lanzhou is a member of the Jincheng Group of Lanyin Mandarin, which has four lects.

Lanyin Mandarin is composed of 57 separate lects. It has 9 million speakers (Olson 1998).

The Jiaoliao Mandarin spoken in Shandong as Shandong Jiaoliao Mandarin contains varieties such as Qingdao Jiaoliao Mandarin and Weihai Jiaoliao Mandarin which are not fully intelligible with Putonghua. Yantai Jiaoliao Mandarin is a dialect of Weihai Jiaoliao Mandarin. Qingdao Jiaoliao Mandarin, Weihai Jiaoliao Mandarin, Yantai Jiaoliao Mandarin and Yangzheng Jiaoliao Mandarin are all mutually intelligible. Dalian Jiaoliao Mandarin is quite different from Putonghua.

Weihai, Dalian and 21 other varieties are members of the Denglian Group of Jiaoliao Mandarin, which has 23 lects.

Jiaoliao Mandarin is composed of 45 lects. Jiaoliao is not fully intelligible with Putonghua. Intelligibility inside of Jiaoliao Mandarin is not known, but there may be multiple languages inside of it because some Shandong Peninsula varieties sound very strange even to speakers used to hearing Shandong Jiaoliao Mandarin.

Wutun, or Wutunhua, is an unclassified language, a Mandarin-Mongolian-Tibetan creole mixed language spoken by 2,000 Tu or Monguar people in Eastern Qinghai Province. The Monguars speak Bonan, a Mongolic language with heavy Tibetan and Mandarin influence. Although the government regards them as Monguar Mongolians, the group self-identifies as Tibetan.

The source of the Mandarin is not known, but it is thought that the group came from outside the region, either Jilu Mandarin speakers from Tianjin in the northeast or from a group of Southwest Mandarin-speaking Hui Muslims in Sichuan Province who converted to Lamaist Buddhism for unknown reasons. They have been in their present location since at least 1585.

This is best seen as a Mandarin language that came under heavy influence of Bonan and to a lesser extent Tibetan, after which it was changed into an agglutinative language under the influence of these two other languages. The lexicon is 60% Mandarin with the tones lost, 25% Tibetan and 10% Bonan.

Karamay is an unclassified Mandarin language spoken in Xinjiang.

The Mandarin spoken around Tiantai in Zhejiang is not intelligible with Putonghua and may be a separate language. It is also unclassified.


Although it is related to Mandarin, Jin is a completely separate language, with only 57% intelligibility with other forms of Mandarin (Cheng 1997). The differences between Jin and Mandarin are somewhat greater than the differences within Mandarin itself.

Besides the Main Jin branch, Baotou Jin is apparently a separate language, as is possibly Taiyuan Jin (evidence).

Within Hohhot Jin, there are two separate languages.

One is Hohhot Xincheng Jin, a combination of Hebei Jin, Northeastern Mandarin and the Manchu language.

The other is Jiucheng Hohhot Jin, spoken by the Muslim Hui minority in the city. It is related to other forms of Jin in Shanxi Province.

Yuci Jin is a separate language from Taiyuan on a 200-word Swadesh test (Ben Hamed 2005).

Fenyang Jin, the language used in Chinese director Jia Zhangke’s movie Xiao Shan Going Home, is not intelligible with Putonghua.

Jingbian Jin, in Shaanxi, is a separate language.

Yulin Jin is also a separate language.

Hohhot is a member of the Zhanghu Group of Jin, which has 29 lects.

Baotou and Yulin are members of the Dabao Group of Jin, which has 29 lects.

Taiyuan and Yuci are members of the Bingzhou Group of Jin, which has 16 lects.

Fenyang is a member of the Luliang Group of Jin, which has 17 lects.

Jingbian is a member of the Wutai Group of Jin, which has 30 lects.

Jin is composed of 171 lects, and some of them are separate languages. Jin has 48 million speakers (Olson 1998).


Gan is a macrolanguage spoken mostly in Jiangxi Province. The mountainous and rugged terrain of Jiangxi means that Gan is very diverse, with many mutually unintelligible varieties within it. Whether Gan is as diverse as Xiang or Hui is not known.

Outside of Gan Proper, Leping Gan is very different. It is not at all intelligible with Nanchang Gan and hence is a separate language.

Nanchang Gan and Anyi Gan are apparently separate languages within Gan based on a 200-word Swadesh test (Ben Hamed 2005). Nanchang Gan has a great deal of dialectal diversity, with several dialects covering different cities and the rural areas. Intelligibility between these dialects is not known. Nanchang Gan is still spoken very heavily in Nanchang.

Boyang Gan is spoken in another part of Jiangxi and is apparently a separate language from Nanchang Gan.

The nine major dialectal splits in Gan are apparently not mutually intelligible. As such, they must surely be separate languages, so Yichun Gan, Ji’an Gan, Fuzhou Gan, Yingtan Gan, Leiyang Gan, Huaining Gan, Daye Gan, Wanzai Gan, and Dongkou Gan are all separate languages. There is diversity even within these groups. For instance, Ji’an is divided into Nanxiang Ji’an in the south and Baixiang Ji’an in the north. The two are not intelligible with each other.

In the Yingyi Group, Chaling Dongxian Gan in Hunan near the Jiangxi border is a variety with mixed Gan and Xiang features. The best analysis is that this is a Gan variety. Due to the heavy Xiang mixture, it is no doubt a separate Gan language.

Linchuan Gan, spoken in East-Central Jiangxi, is a very interesting Gan that differs from all others. This seems to be the remains of the old language that was brought into Jiangxi by the ancestors of the Hakka, and it indicates a possible close relationship between Gan and Hakka.

Gao’an Gan, Ducheng Gan, Yongxiu Gan, and Nancheng Gan are quite different from the rest of Gan, so they may well be separate languages.

Hukou Gan, Wuning Gan and Fengxin Gan are major splits in Northern Gan and are all probably separate languages.

Hancheng Gan is a major split in Southern Gan and as such is probably a separate language.

Nanchang and Anyi are in the Changdu Group of Gan, which has 15 different lects.

Yingtan is a member of the Yingyi Group, which has 12 lects.

Jiangyu and Huarong are members of the Datong Group of Gan, which has 13 lects.

Yichun is a member of the Yiliu Group of Gan, which has 11 lects.

Wanzai is a member of the Yiping Group of Gan, of which it is the only member.

Leiyang is a member of the Leizi Group of Gan, which has five lects.

Wanan is a member of the Jilian Group of Gan, of which it is the only member.

Ji’an is a member of the Jicha Group of Gan, which has 15 lects.

Huaining is a member of the Huaiyue Group of Gan, which has nine lects.

Fuzhou is a member of the Fuguang Group of Gan, which has 15 lects.

Dongkou is a member of the Dongsui Group of Gan, which has five lects.

Gan has 97 separate varieties in it. There are 30 million speakers of the Gan languages (Olson 1998).
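To keep the group counts straight, here is a minimal Python sketch that adds up the sizes of the Gan groups named above. The dictionary and function names are my own, chosen for illustration. The groups as listed add up to 102 rather than 97, and the sources do not say why; overlapping membership or a different cutoff are possible explanations, but that is only a guess.

```python
# Group sizes exactly as given in the paragraphs above.
# "Only member" groups (Yiping, Jilian) are counted as one lect each.
GAN_GROUP_SIZES = {
    "Changdu": 15,
    "Yingyi": 12,
    "Datong": 13,
    "Yiliu": 11,
    "Yiping": 1,
    "Leizi": 5,
    "Jilian": 1,
    "Jicha": 15,
    "Huaiyue": 9,
    "Fuguang": 15,
    "Dongsui": 5,
}

# Total number of Gan varieties as stated in the text.
STATED_GAN_TOTAL = 97


def tally_lects(group_sizes):
    """Return the sum of the lect counts over all listed groups."""
    return sum(group_sizes.values())


if __name__ == "__main__":
    listed = tally_lects(GAN_GROUP_SIZES)
    print(f"Sum of listed groups: {listed}")
    print(f"Stated total: {STATED_GAN_TOTAL}")
    print(f"Difference: {listed - STATED_GAN_TOTAL}")
```

The same tally can be run for any other branch by swapping in that branch's group sizes.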


Northern, Central and Eastern Min

Northern Min or Min Bei

Within the Min group, Northern Min (Min Bei), a macrolanguage, has already been identified as a separate language. There are 50 million speakers of all of the Min languages (Olson 1998). Northern Min has only 0-20% intelligibility with Min Nan.

Northern Min or Min Bei is said to be a single language. It has nine separate lects, including Shibei Northern Min in Pucheng County; Chong’an Northern Min, Wufu Northern Min, and Xingtian Northern Min in Wuyishan City; Zhenghe Northern Min and Zhenqian Northern Min in Zhenghe County; Jianyang Northern Min in Jianyang County, and Jian’ou Northern Min in Jian’ou County.

The dialects are said to be mutually intelligible, but Jianyang and Jian’ou have only about 75% intelligibility. Northern Min has 10 million speakers.

Central Min or Min Zhong

Central Min or Min Zhong is a separate language not intelligible with Northern or Eastern Min. It has three lects, Shaxian Central Min, Sanming Central Min, and Yongan Central Min, but we don’t know if there are separate languages among them. The tones of the three varieties are quite different. Further, there are many dialects in the interior of Sanming Prefecture, so there may be more than one language there. Central Min has 3.5 million speakers.

Eastern Min or Min Dong

The standard dialect of Min Dong, Eastern Min, Fukchiu or Fooshuw, is Fuzhou Eastern Min.

Eastern Min has only 0-20% intelligibility with Min Nan.

Within Eastern Min, Chengguan Eastern Min, Yangzhong Eastern Min, and Zhongxian Eastern Min are separate languages, all spoken in Youxi County. Zhongxian Eastern Min is spoken in the south of the county, Chengguan is spoken in the middle of the county, and Yangzhong is spoken in the north of the county. The three varieties have markedly poor intelligibility between them (Zheng 2008).

Beyond that, Eastern Min is reported to have several other mutually unintelligible languages inside of it. One of them is Fuqing Eastern Min. Fuzhou speakers can understand Fuqing speakers better than the other way around. Fuzhou and Fuqing are about 65% intelligible in practice, and it is about the same with the rest of the Houguan Group (Ngù 2009).

Ningde Eastern Min, Fuding Eastern Min and Nanping Eastern Min are other languages in this family (evidence). There are many dialects in the Eastern Min-speaking areas of Nanping, and there may be more than one language here. Of these three, Ningde Eastern Min is definitely a separate language. According to George Ngù, a passionate proponent of Fuzhou Eastern Min, “Fuzhou is not intelligible even within its many varieties.”

It’s not clear if that applies to all of Eastern Min, but it appears that it does. Therefore, Changle Eastern Min, Gutian Eastern Min, Lianjiang Eastern Min, Luoyuan Eastern Min, Minhou Eastern Min, Minqing Eastern Min, Pingnan Eastern Min, Pingtan Eastern Min, Yongtai Eastern Min, Fu’an Eastern Min, Shouning Eastern Min, Xiapu Eastern Min, Zherong Eastern Min, and Zhouning Eastern Min are all separate languages.

Tong’an Eastern Min should probably also be included.

Matsu Eastern Min is spoken on Matsu Island off the coast of China. It is similar to but probably not intelligible with Changle Eastern Min. Matsu may well be a separate language like all the rest of Houguan.

There are two other varieties lumped in with Eastern Min. Man, Mango or Taishun Manjiang Eastern Min is spoken in the central part of Taishun County in Southern Zhejiang in the far southern end of the Wu-speaking area, and Manhua is spoken in the eastern part of Cangnan County. Both of these names mean “barbarian speech.”

Both are probably mixtures of Southern Wu (Wenzhou etc.), Eastern Min, Northern Min, and maybe even pre-Sinitic languages. Manhua and Manjiang are not intelligible with Fuzhou Eastern Min. However, Manjiang has affinity with Shouning Eastern Min in phonology, vocabulary, and grammar. Whether or not it is intelligible with Shouning Eastern Min is not known.

Min Nan speakers who have looked at Manjiang data say that it doesn’t even look like a Sinitic language. It is best seen as an Eastern Min language with a very strong substratum of a Tai-Kadai or Austroasiatic language.

Manhua is best dealt with as a form of Wu. I discuss it further below under Wu.

Malaysian Eastern Min is spoken in Sibu, Sarawak and in Singapore. Its speakers descend from Fuqing and Fuzhou speakers who came in the 1800’s, and it is spoken in two lects based on those two cities. Malaysian Fuqing Eastern Min and Malaysian Fuzhou Eastern Min only have 12% intelligibility, much less than the 65% of the parent languages in China. The two Malaysian lects are obviously not the same language, but intelligibility of the two lects with the parent languages in China is not known.

Fuding, Fu’an, Shouning, Xiapu, Zherong, and Zhouning are in the Funing Group of Eastern Min, which has six lects.

Fuzhou, Fuqing, Chengguan, Yangzhong, Zhongxian, Ningde, Changle, Gutian, Lianjiang, Luoyuan, Minhou, Minqing, Pingnan, Pingtan, Yongtai, Matsu, Tong’an, and Nanping are in the Houguan Group of Eastern Min, which has 18 lects.

Taishun Manjiang is in an Eastern Min division of its own.

Eastern Min contains 24 separate lects, all of which are separate languages.

Southern Min or Min Nan


Within Min Nan or Southern Min, a macrolanguage, there are a number of separate languages. There is a proposal to split Xiamen, Qiongwen and Teochew into three separate languages before SIL. In fact, all three of those are macrolanguages also.

Amoy, Xiamen or Taiwanese Hokkien, Zhangzhou Hokkien, and Quanzhou Hokkien are part of a larger Southern Min group called Hokkien.

Amoy Hokkien and Taiwanese Hokkien are the same language, as Taiwanese is an Amoy dialect. A good name for the entire language of Amoy-Taiwanese Hokkien is Xiamen Hokkien.

Amoy, the variety spoken in Amoy city in China, is identical to certain Taiwanese dialects. It is more or less intelligible with Taiwanese, as the differences between the two are minor, akin to British and American English. There have only been 120 years of separation between Amoy and Taiwanese. Most of the differences are in modern and local vocabulary.

Amoy and Quanzhou Hokkien are no longer intelligible with each other due to lack of a standard and the dialectal variations in each. Also, Amoy has developed more modern meanings for certain words, while Quanzhou retains more of the older meanings for the same terms.

Amoy, like Taiwanese, is a mixture of Quanzhou and Zhangzhou Hokkien.

Jinmen or Kinmen Hokkien is a dialect of Amoy spoken on Jinmen Island only two miles off the coast of Amoy. It has good intelligibility with Taiwanese.

A better name for Xiamen according to the Chinese literature is Quanzhang Hokkien (Campbell 2009). This would actually be a macrolanguage. Quanzhang is a combination of Quanzhou and Zhangzhou, two of the most important varieties in the language. Xiamen has only 51% intelligibility with Teochew.

Xiamen is still widely spoken in Taiwan as Taiwanese Hokkien. However, it is in trouble as fewer young people speak it anymore. 20 years ago in Đàoviên, Taiwan, it was common to hear young women in their late teens and twenties speaking Hokkien, but now it is uncommon (Kirinputra 2014).

Within Taiwanese Hokkien, the situation regarding Taipei Hokkien in the past was interesting. The dialects of the city were a mix of Zhangzhou and Quanzhou.

The dialect of the center of the city, Taipei City Hokkien, was mixed between the two, with a slight Quanzhou lean to it.

The dialect spoken in Sulim, Sulim (Shilin) Hokkien, heavily favored Zhangzhou. Other districts spoke a Tong’an-type dialect, which is just Quanzhou mixed with Amoy.

All these conditions are more common with the older generation. The young generation either speaks the mixed Zhangzhou-leaning “Southern” style of Taiwanese Hokkien favored in the media or does not speak any Hokkien at all.

The Yilan Hokkien dialect on Taiwan is so different that it alone has posed serious problems for the task of standardizing Taiwanese, yet it is intelligible with Standard Taiwanese Hokkien. Yilan is a city in Taiwan.

Lugang Hokkien is also very different but is intelligible with Standard Taiwanese (Campbell 2009).

Elsewhere on Taiwan, there are some communication problems for Tainan Hokkien speakers hearing Taipei, but it appears that they are still intelligible with each other (Campbell 2009). Tainan is a city in Taiwan. A similar dialect is spoken in Gaoxiong as Gaoxiong Hokkien. Tainan and Gaoxiong are the prestige dialects of Taiwanese Hokkien that Standard Taiwanese is based on.

Taichung Hokkien is another dialect of Taiwanese spoken in the city of that name.

Tong’an Hokkien is said to be a dialect of Amoy, but the truth is that it is in between Amoy and Quanzhou. Tong’an Hokkien is spoken in the city of that name. A Tong’an variety is also spoken in Malaysia and Indonesia.

There are dialects within Quanzhou, including Anxi Hokkien, Shishi Hokkien, Yongding Hokkien, Dehua Hokkien, Hui’an Hokkien, Jinjiang Hokkien, Nan’an Hokkien, and Hong Kong Tanka Hokkien.

All Quanzhou dialects are apparently mutually intelligible.

There is a group of Hokkien speakers among the Tanka fisherpeople located to the north of the Four Counties area. They speak a language that resembles Anxi Hokkien. We will call this Hong Kong Tanka Hokkien for now. They communicate well with speakers from the Hokkien homeland, so it looks like their language has not changed much. Most of them arrived in Hong Kong in the 1930’s and 1940’s.

There are differences within Zhangzhou Hokkien.

Longhai Hokkien, Haikang Hokkien, Zhangpu Hokkien, Zhao’an Hokkien, Yunxiao Hokkien, Dongshan Hokkien and Yinchuan Hokkien are all dialects of Zhangzhou Hokkien, spoken in the vicinity of the city.

Longhai Hokkien is very similar to the standard variety, while Zhangpu Hokkien is somewhat different.

Zhao’an Hokkien, Yunxiao Hokkien, and Dongshan Hokkien are all spoken in Southern Zhangzhou. They have been so strongly affected by Teochew that there is controversy over whether they are Teochew or Hokkien. Yunxiao and Dongshan have changed n → ng and t → k as in Teochew. Zhao’an resembles Teochew more than the others, as it has an ir vowel. Intelligibility data for these diverse Zhangzhou varieties is not available.

With the possible exception of the three varieties mentioned above, all Zhangzhou varieties are mutually intelligible.

Zhangzhou and Quanzhou are not fully intelligible with each other in China. Taiwanese speakers can no longer understand the pure Quanzhou spoken in the Chinese city of that name, and some Quanzhou speakers say they cannot understand Taiwanese either. Nevertheless, Taiwanese has 80% intelligibility of Quanzhou and Zhangzhou. After all, Taiwanese itself is just a mixture between Zhangzhou and Quanzhou.

Zhangzhou and Quanzhou have marginal intelligibility with Teochew.

Zhangping Hokkien, though close to Xiamen, is a separate language according to a 200-word Swadesh test (Ben Hamed 2005).

Pinghe Hokkien is said to be a separate language.

Diaspora, Nusantaran or Overseas Hokkien, that is all Hokkien spoken outside of China in the area for a few hundred miles up and down the coast in either direction from Amoy in China, could be seen as being composed of two main groups. It is a language in trouble as young people everywhere in the diaspora switch to Mandarin, and many children are not learning Hokkien. Technically, Taiwanese is included in Overseas Hokkien, but since it is merely a dialect of Amoy, we put it under Amoy instead.

50 years ago, we could learn interesting things about Overseas Hokkien forms spoken in Jakarta, Yangon, Bandung, Phuket, Trang, Cebu, and possibly Palembang and Surabaya. Now Hokkien may be extinct in Jakarta, Yangon, Palembang and Surabaya and is in trouble in Phuket, Bandung and Cebu (Kirinputra 2014).

The first group, called Eastern Hokkien, is in the north and encompasses Taiwan (Kirinputra 2014).

The second group, which we shall call Malayland Hokkien for lack of a better term, is spoken in Malaysia and in Indonesia in Sumatra and Kalimantan. Malayland is heavily laced with Teochew.

However, the Hokkien spoken in the Philippines is classed as Malayland Hokkien because it is intelligible with Southern Malayland Hokkien even though it is in the east.

Malayland is split into two languages, Southern Malayland Hokkien and Northern Malayland Hokkien. The first language, Northern Malayland Hokkien, was formerly spoken in Northern Malaysia from Taiping along the coast all the way to Phuket, Thailand, but is now spoken for the most part only up to Penang and over to Terangganu in Malaysia and in Medan and other places in Northern Sumatra in Indonesia.

The language is also referred to as Penang Hokkien or Medan Hokkien, after the very similar dialects spoken in those cities. Terangganu Hokkien is different. On Penang Island, two dialects are spoken, Baba Hokkien, which is heavily-creolized, and Sin Khek Hokkien, a more pure variety. There are also differences between Penang Island Hokkien and Butterworth Hokkien spoken in Butterworth just across the strait.

Hokkien is still very widely spoken in Penang, and it is possible to go through your entire day speaking nothing but Hokkien.

Northern Malayland is still spoken up into Thailand towards Phuket and in the Burmese Panhandle all the way to Rangoon. In Myanmar, the speakers are mostly elderly, and the language is dying out. Burmese Hokkien looks very much like Penang because many speakers came from Penang to Rangoon. Northern Malayland is still spoken in Surat Thani on the east side of the peninsula in Thailand by a few older speakers. On the Phuket side of the peninsula facing the Indian Ocean, it has been decimated.

All varieties of Northern Malayland are apparently mutually intelligible.

Speakers of Northern Malayland have a hard time understanding the Southern Malayland spoken in Klang and Malacca. Southern Malayland speakers in general say they cannot understand Penang.

Northern Malayland Hokkien is more of a Zhangzhou variety in terms of its accent. It is also heavily creolized, with a lot of Malay and Thai embedded deeply in the language. The differences between the two Malayland Hokkien languages are as great as between Hokkien and Teochew. Intelligibility between the two may be as low as 50%.

In Kuala Lumpur and Selangor, Southern and Northern Malayland mix, and it is difficult to say which language is being spoken here. However, the variety spoken in Selangor, Selangor Hokkien, is best described as Southern Malayland, as they cannot understand Penang well. Hokkien is still very widely spoken in Selangor.

The second language, Southern Malayland Hokkien, encompasses Southern Malaysia from Johor up to Kelantan, where it is spoken in the cities of Selangor, Kelang, Malacca, Muar, Tangkak, Segamat, Batu Pahat, Pontian, Singapore, Riau, the Riau Islands, and Johor Bahru. Kelang Hokkien and Johor Hokkien are recognized as specific dialects, and Hokkien is still very widely spoken in both cities.

It is also widely spoken in Singapore and Brunei. In Indonesia, it is spoken in the state of Riau as Riau Hokkien, which is very close to Singapore Hokkien, and the city of Bagansiapiapi on Sumatra. It is also spoken in Bangkok, Thailand and in Saigon, Vietnam, where it is dying out (Kirinputra 2014).

Southern Malayland is less creolized than Northern Malayland, if it is creolized at all. Southern Malayland is more of a Xiamen Hokkien variety, while Northern is a type of Zhangzhou.

Kelantan, Kelantanese or Kelantan Peranakan Hokkien is spoken in the Malay state of Kelantan. It is wildly creolized with Malay and is probably not intelligible with any other form of Hokkien.

The variety of Hokkien spoken in Kuching, Sarawak, Kuching Hokkien, is also very different and is said to resemble Kelantan Hokkien. Nevertheless the Hokkien dialect situation in Kelantan is poorly understood, and there are said to be two different types of Hokkien spoken in this area, Kelantan Hokkien A and Kelantan Hokkien B (Kirinputra 2014). Kelantanese is still widely spoken.

The version of Southern Malayland Hokkien spoken in Singapore is called Singapore Hokkien and is based on Amoy, and possibly even more on Jinmen, but speakers also came from Tong’an, Zhangzhou, Quanzhou, Anxi, and Hui’an. It is similar to Taiwanese, but Singaporean speakers can no longer understand Taiwanese well, though they have partial understanding of it. For instance, they have only 30-40% intelligibility with Yilan Taiwanese Hokkien.

Southern Malayland lies between Northern Malayland and Taiwanese Hokkien on the continuum.

A Singapore speaker, if immersed in Taiwan, could pick up Taiwanese fairly quickly, within three months.

Singapore has been isolated from Taiwanese for quite some time, so it has retained older features that are losing ground in mainland Hokkien varieties. Word-final unvoiced stops p, t and k are starting to be lost in Zhangzhou on the mainland and replaced with a glottal stop, whereas in Singapore, they are still preserved.

Many Malay, Cantonese and Teochew words have gone into Singapore, which hinders understanding with Taiwanese speakers. Mutual intelligibility between Singapore and Hokkien is ~55%. Similarly, Singapore is no longer intelligible with Amoy.

Singapore speakers, even the older ones, now mix a lot of Mandarin, English and Malay in with their speech. They have been isolated from the main Hokkien-speaking communities in Amoy and Taiwan for so long that they have lost many of the subtler aspects of the language spoken in these areas.

Singapore has withered into a weakened and corrupted version of the more pure Hokkien spoken in Taiwan and Fujian. Further, the language has changed a lot since the Singaporean speakers left the region, and Singaporean Hokkien speakers have not kept up with the continuously evolving Hokkien language spoken in the Hokkien homeland.

Singaporean has also become so heavily admixed with Teochew that it is more properly seen as Hokkien-Teochew than Hokkien Proper.

Singapore has good intelligibility with Philippines Hokkien.

All varieties of Southern Malayland Hokkien spoken in Malaysia and Indonesia are fully intelligible with Singapore Hokkien.

A very pure dialect of Southern Malayland is spoken in the Indonesian city of Bagansiapiapi as Bagansiapiapi or Bagan Hokkien. It has avoided the Mandarinization of Hokkien that is occurring elsewhere. It also lacks influence from Cantonese and Teochew and has fewer loans from Austronesian and English compared to neighboring Southern Malayland or Philippines Hokkien speakers (Kirinputra 2014).

Much of the good intelligibility between Bagan and Taiwanese seems to be due to bilingual learning. They speak like the Hokkien speakers of Tong’an, China. There are only a few thousand speakers remaining, and the language seems to be on its way out.

Another very pure version is the moribund Southern Malayland dialect still spoken by a few people in Saigon, Saigon Hokkien (Kirinputra 2014).

The Southern Malayland dialect spoken in Bangkok is called Bangkok Hokkien and contains Malay loans.

This seems to imply a large trading community involving Saigon, Bangkok and Malayland which exchanged words via different speech forms (Kirinputra 2014).

Intelligibility of Bangkok and Saigon with the rest of Southern Malayland is not known, but it is assumed to be full.

The version of Southern Malayland spoken in the Philippines is called Banlam-ue, Banlamhue, Binamhue, Lanlang-ue, Minnanhua or Philippines Hokkien by speakers. Although its tones are quite different from Indonesian Southern Malayland Hokkien, the two varieties are fully intelligible. Hence Philippines Hokkien is a dialect of Southern Malayland.

Philippines is not readily intelligible with Standard Hokkien. Speakers came to the Philippines long ago, so their Hokkien contains many old words that have fallen out of other Hokkien varieties. It derives from the Jinjiang and Shishi dialects on the outskirts of Quanzhou. Lanlang-ue means “our language.” Minnanhua is the name of this language in Mandarin (Kirinputra 2014).

At present, it is not intelligible with Quanzhou or Xiamen. That is, Philippines speakers claim that they can only understand about 70% of Taiwanese television.

Despite intelligibility issues, Philippines and Taiwanese have a very similar lexicon. The lexicons of both are similar to Amoy speech. Apparently the Amoy-Luzon-Taiwan trade route produced a convergence in the lexicons of these varieties (Kirinputra 2014). Philippines is full of Tagalog words. Philippines, like Northern Malayland, resembles Zhangzhou from the late 1800’s.

Philippines is spoken in Manila, Cebu, Zamboanga, Sulu, and Jolo. The standard is based on the variety spoken in Manila. Zamboanga Hokkien differs from Manila Hokkien in that it has more Spanish and Chavacano borrowings and fewer Tagalog words. The dialect on Sulu Island, Sulu Hokkien, is different from the rest of Philippines, sounding more like Amoy and Taiwanese with a trace of Singapore. Cebu Hokkien, spoken on Cebu, resembles Jolo Hokkien, which is spoken on the far southern island of Jolo.

Cebu and Jolo Islands were part of an important route for smuggling goods into the Philippines for centuries. Most of the smugglers were Hokkien Chinese. Philippines is still widely spoken on Sulu, in Zamboanga and in the Binondo region of Manila. Cebu is in trouble with a declining number of speakers. The situation with Jolo is not known.

Southern Malayland, Riau, Klang, Johor, Singapore, Saigon, Bangkok, Bagansiapiapi, Northern Malayland, Penang, Medan, Baba, Sin Khek, Terangganu, Myanmar, Kelantan, Kelantan A, Kelantan B, Kuching, Philippines, Manila, Zamboanga, Sulu, Jolo, Cebu, Yilan, Amoy, Tong’an, Jinmen, Taiwanese, Tainan, Taipei City, Sulim, Taichung, Lugang, Gaoxiong, Quanzhou, Shishi, Jinjiang, Longhai, Hui’an, Anxi, Nan’an, Dehua, Zhangzhou, Zhangpu, Yinchuan, Dongshan, Yunxiao, Zhao’an, Zhangping, and Pinghe are all part of Hokkien, which has 54 lects, eight of which are separate languages.

There are 30 million speakers of Hokkien.

Southern Min: Chaoshan Min or Teochew

Chaoshan Min or Teochew is a macrolanguage spoken in a nine-county region of Guangdong. It is also widely spoken in Thailand, where most Overseas Chinese speak Teochew. The Mandarin name for the language is Chaozhou, but Teochew speakers do not accept that appellation and prefer Teochew instead.

Dialects of Teochew include Chaozhou Teochew, Jieyang or Kek’iôⁿ Teochew, Puning Teochew, Chenghai Teochew, Shantou Teochew, Chaoyang Teochew, Raoping Teochew, Jindengzhan Teochew, Nanao Teochew, Huidong Teochew, Huilai Teochew, Jiexi Teochew, Dabu Teochew, and Fengshun Teochew.

Standard Teochew is based on Chaozhou Teochew or what was formerly the Fucheng language.

Chaoyang Teochew is a highly divergent Teochew lect. The other Teochew varieties cannot understand Chaoyang.

Shantou Teochew, Raoping Teochew and Jieyang Teochew are spoken outside of the Chaoyang-speaking area which hugs the coastline southwest of the Shantou area (Kirinputra 2014), which may explain why they have a hard time understanding Chaoyang.

Shantou is more intelligible with Hokkien than other types of Teochew, but intelligibility is still only 54%. However, Hokkien is utterly unintelligible with Jieyang (Kirinputra 2014). This implies that Shantou and Jieyang are quite different. The implication is that Jieyang Teochew is a separate language.

Shantou speakers cannot understand Chaozhou, as Shantou is quite a bit different from the other Teochew lects. They also seem to have a hard time understanding other Teochew lects, as they say that Teochew changes every hour or so as you travel and becomes difficult to understand. Shantou Teochew is a separate Teochew language.

Sources report that Teochew varieties can vary greatly in the pronunciation of even single words, and the tones can be quite different too.

Intelligibility data for Raoping, Huilai Teochew, and Jindengzhan Teochew with the rest of Teochew is not known.

Teochew was formed by a group of Hokkien Min speakers who broke off from Zhangzhou Hokkien about 600-1,100 years ago. They moved down to Northeastern Guangdong, and after hundreds of years, a heavy dose of some sort of unknown substrate languages went into the language, possibly including a Cantonese-type variety, producing modern Teochew (Kirinputra 2014).

Teochew has only 51% intelligibility with Xiamen (Cheng 1997).
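The intelligibility figures quoted so far can be run against the 90% cutoff used throughout this piece (below 90% = separate language, above 90% = dialect). The following is only a rough Python sketch of that rule. The function and variable names are made up for illustration, and treating a score of exactly 90% as "dialect" is my own assumption, since the rule does not say which side that case falls on.

```python
# Cutoff used in this piece: below 90% mutual intelligibility means
# separate languages, above 90% means dialects of one language.
SEPARATE_LANGUAGE_THRESHOLD = 90.0


def classify_pair(intelligibility_percent):
    """Label a pair of lects from its intelligibility percentage.

    Assumption: exactly 90% is treated as "dialects of one language".
    """
    if not 0.0 <= intelligibility_percent <= 100.0:
        raise ValueError("intelligibility must be between 0 and 100")
    if intelligibility_percent < SEPARATE_LANGUAGE_THRESHOLD:
        return "separate languages"
    return "dialects of one language"


# Percentages quoted earlier in this section.
REPORTED = {
    ("Jin", "Mandarin"): 57,
    ("Jianyang Northern Min", "Jian'ou Northern Min"): 75,
    ("Fuzhou Eastern Min", "Fuqing Eastern Min"): 65,
    ("Malaysian Fuqing Eastern Min", "Malaysian Fuzhou Eastern Min"): 12,
    ("Xiamen Hokkien", "Teochew"): 51,
}


def classify_all(reported):
    """Apply the cutoff to every reported pair."""
    return {pair: classify_pair(score) for pair, score in reported.items()}


if __name__ == "__main__":
    for (lect_a, lect_b), verdict in classify_all(REPORTED).items():
        print(f"{lect_a} / {lect_b}: {verdict}")
```

Every pair listed falls well under the cutoff, which is why each is treated here as a pair of separate languages rather than dialects.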

Overseas Teochew is a significant branch of Teochew that is spoken outside of the Teochew area in China in Vietnam, Cambodia, Thailand, Malaysia, Indonesia, and the Philippines. Overseas Teochew is an extremely variable macrolanguage consisting of a number of different languages.

Malayland Teochew is spoken in Malaysia, Singapore and Indonesia. Malayland Teochew, instead of being a language, is a macrolanguage composed of several languages.

The Teochew variant spoken in Malaysia, Malay Teochew, is composed of many highly variant lects. A different Teochew variety is spoken in each subregion, and varieties sometimes differ dramatically in pronunciation and tones. Whether or not they are mutually intelligible is not known.

Malay Teochew is spoken in four different places in Malaysia where there are substantial Teochew populations: two places at the southern tip of the peninsula, and Kedah and North Perak on the far northwestern coast. Malay is not intelligible with other SE Asian Teochew varieties. Malay has converged more with Hokkien than other types of Teochew.

It seems logical to split at least North Perak Teochew and Kedah Teochew along with Southern Malay Teochew A and Southern Malay Teochew B for the time being.

Singapore Teochew is different from Malay, and both have undergone separate divergent influences, so each one should be regarded as a separate language. However, Singapore Teochew is similar to Shantou because most Singaporean speakers came from there. Singaporean is regarded by Teochew speakers on the mainland as a heavily corrupted and impure variety of Teochew. Singaporean is not intelligible with any of the Teochew spoken in China anymore, not even the Shantou that it came from.

It has come under such heavy influence from Singaporean Hokkien that it is now better regarded as Singaporean Teochew-Hokkien than a pure Teochew tongue. Many of the original Teochew terms have been replaced with Hokkien words. It is also now heavily admixed with Malay, and a lot of the characteristics of Mainland Teochew have been lost.

There are variations even within Singaporean Teochew. Speakers of some of the coarser, more rural dialects can only understand 50% of the purer varieties. This is derived from the early days when only some of the immigrants from Shantou were educated and most were uneducated peasants. The peasants did not speak the same higher, more refined Shantou that the educated people did.

In time, the differences became more dramatic. As these varieties still exist, we can call them High Singaporean Teochew and Low Singaporean Teochew, two separate languages. Low Thia Khiang, the leader of Singapore’s Workers’ Party, speaks High Singaporean Teochew and is poorly understood by speakers of Low Singaporean Teochew.

The variety spoken in Medan, Indonesia on Sumatra, Medan Teochew, is particularly interesting. It has heavy Malay, Hokkien and Cantonese influence and cannot be understood by other Teochew speakers (Kirinputra 2014). The town of Brahang 12 miles from Medan speaks Teochew.

Teochew is also spoken in other places in Indonesia such as Riau, Dabo Singkep, Tanjung Pinang, Batam Island, and Pontianak.

The Teochew spoken in Indochina – in particular in Vietnam and Cambodia (Indochinese Teochew) is a macrolanguage. Some Indochinese Teochew speakers who have returned to their family villages on the mainland say they could only understand 70% of the speech there.

Cambodian Teochew speakers say that Cambodian Teochew, Vietnamese Teochew, and Thai Teochew are all separate languages, and they cannot understand each other (Tek 2016).

Thailand Teochew or Diojiu-we is spoken in Thailand. The Chinese lingua franca in Thailand is not Mandarin but Teochew. There are 5 million Chinese Thais with roots in the Teochew region, and 3 million of them speak Diojiuwe.

Teochew is spoken in the Philippines, but there is little information available about Philippines Teochew.

Chaoyang, Shantou, Raoping, Jieyang, Huilai, Jindengzhan, Thai, Cambodian, Vietnamese, Medan, Singapore, Malay, Kedah, North Perak, Southern Malay A and B, Borneo, and Philippines are part of Teochew, which has 17 lects, 12 of which are separate languages.

Teochew has 10 million speakers.

Southern Min: Hailufeng, Zhenan, Hainanese, Leizhou, Shaojiang, Puxian, Zhongshan, Coastal, She and Datian Min

Hailufeng Min

Hailok’hong, Hailufeng or Haklau Min is a separate language in Southern Min that represents a later move of Zhangzhou speakers 400-500 years ago towards Northeastern Guangdong by the same group that formed Teochew. Since then there has been convergence with Teochew (Kirinputra 2014). It also has substantial Hakka influence. Hailok’hong (Haklau) Min is spoken down the coast between the Teochew zone and the Hong Kong area.

Hailufeng Min is usually better known as Hailok’hong or Haklou Min. It has at least three dialects, Haifeng Hailufeng Min, Lufeng Hailufeng Min, and Shanwei Hailufeng Min, and has limited intelligibility of Teochew proper.

The city of Haifeng has mostly Hailufeng speakers. Lufeng is spoken in the western half of Lufeng. Shanwei is the name of the prefectural city that encompasses Lufeng and Haifeng Counties. Shanwei Min is spoken more in the urban area of Shanwei.

Intelligibility among the three main Hailufeng Min varieties is full.

There is a group of Hailufeng speakers living in Hong Kong as part of the Tanka fisherpeople community. They originally came from the Shanwei area, which is just to the north, and they live in the northern part of Hong Kong, north of the Hokkien-speaking Tankas. We will call their lect Hong Kong Tanka Hailufeng Min for now. Intelligibility data for this lect is not available.

Many insist that Hailufeng is a Teochew language because this area was redistricted into the Teochew area administratively in the 20th Century. Chinese people are jealously loyal to their home districts and see all languages spoken in their district in geographical and not linguistic terms. So to admit that Hailufeng is not Teochew would be a sort of treason to the homeland if you will (Kirinputra 2014). The area where the language is spoken along the coast of Guangdong is actually to the south of the Teochew area.

Hailufeng is said to be halfway between Teochew and Zhangzhou. Hailok’hong or Haklou etymologically is Haihong + Lok’hong, which is the same thing as Haifeng + Lufeng, so it is a combination of Haifeng and Lufeng. Haklau is also cognate with Hokkien Holo and Cantonese Hoklo, referring either to Taiwanese Hokkien or Teochew. In an overall sense, it meant Hokkien + Teochew, which is a good description of the language (Kirinputra 2014). Hailufeng is still confused a lot with Hokkien in many casual descriptions.

Many Hailufeng speakers can now understand Teochew, but that is due to bilingual learning (Kirinputra 2014).

Lufeng is said to have over 90% intelligibility with Xiamen Hokkien, but if it is really halfway between Teochew and Hokkien, it should have 75% intelligibility instead. Intelligibility testing may be needed. There are 3 million speakers of Hailufeng Min.

Zhenan Min

Zhenan Min, spoken in pockets in Yixing, Anji, and Linan in Southern Jiangsu and Wenzhou in Changxing in Southern Zhejiang Province around Pingyang and Cangnan and in the Zhoushan Islands, is a separate language. Speakers are found in Anhui Guangde, Ningguo, Langxi, the eastern part of Wuhu, Jiangxi Shangrao, Yushan Island, and Guangfeng County, in addition to Pucheng on the northern border of Fujian. It is spoken along the coast far to the north of the general Min-speaking area.

Zhenan Min has 574,000-848,000 speakers. Zhenan Min is influenced by Eastern and Northern Min and has limited intelligibility with other Min languages. In the area around Wenzhou, it has come under heavy Wenzhou Wu and Manhua Wu influence. Zhenan Min is still confused with Hokkien in casual descriptions.

Intelligibility among Zhenan Min varieties is not known. Zhenan Min is the result of a migration of Hokkien speakers from Hui’an, Jinjiang, Quanzhou, Nan’an, Xiamen, and Jinmen to the area in the middle of the Ming Dynasty about 800 years ago, due to pirate attacks and civil wars in the region they fled from. Once they arrived, high waves prevented them from returning, so they decided to make their new homes here in the north.

Jujiang Zhenan Min is spoken in Taishan County near the Manhua-speaking area.

Baizhang Zhenan Min is spoken as a dialect island in the south of Taishan County. It has come under severe influence from Luoyang Wu and Manhua. It is presently near extinction. Baizhang appears to be a dialect of Jujiang.

Ruoshan Zhenan Min has heavy Wu influence.

Taishun Zhenan Min has 14,000 speakers.

Dongtou Zhenan Min has 52,000 speakers.

Pingyang Zhenan Min has 243,000 speakers.

Cangnan Zhenan Min has 484,000 speakers.
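As a quick check on the four county-level figures just listed, they can be totaled with a few lines of Python. This is only a sketch: the dictionary keys are labels chosen here for illustration, and these four lects do not cover every Zhenan Min-speaking area, so the sum is a partial figure rather than a speaker count for the whole language.

```python
# Speaker counts for the four Zhenan Min lects listed above.
# The dictionary name and keys are illustrative labels, not an official dataset.
zhenan_speakers = {
    "Taishun": 14_000,
    "Dongtou": 52_000,
    "Pingyang": 243_000,
    "Cangnan": 484_000,
}


def total_speakers(counts: dict) -> int:
    """Add up the speaker counts for every lect in the mapping."""
    return sum(counts.values())


def largest_lect(counts: dict) -> str:
    """Return the name of the lect with the most speakers."""
    return max(counts, key=counts.get)


zhenan_total = total_speakers(zhenan_speakers)
print(f"{zhenan_total:,} speakers across {len(zhenan_speakers)} lects")
print(f"Largest lect: {largest_lect(zhenan_speakers)}")
```

The four lects add up to 793,000 speakers, with Cangnan accounting for well over half of them.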

In Yixing County, half the population speaks Zhenan Min.

Peng River, Fenwenxiang, Lake, Changxing, Liyang, Sanyang, Shiyang, Pengxi, Jujiang, Baizhang, Pingyang Aojiang, Yushan Island, Jingning, Yixing, Anji, Anhui Guangde, Taishun, Ningguo, Langxi, Northern Rui’an, Ni Island, Wuhu, Wenling Shitang, Dongtou, Ruoshan, Jiangxi Shangrao, Shengshi Island, Guangfeng, Linan, South Cangnan, Dongtou, Yuhuan, Longhai, Lengkeng, Zhangpu, Anxi, Hui’an, Kengkou, Lengkugang, and Tong’an are all part of Zhenan Min, which has at least 41 lects.

Qiongwen Min (Hainanese and Leizhou Min)

Qiongwen Min is spoken on Hainan Island and to the north on the mainland. It has two divisions, Hainanese Min and Leizhou Min.

Hainanese Min has 8 million speakers, 5 million on Hainan and 3 million more overseas. Of all the Min Nan lects, it has the lowest intelligibility with the rest of Southern Min.

Qiongwen itself has 16 separate lects, all spoken on Hainan. Whether any of them are separate languages is not known. It is split into various lects, which in turn are split into various sublects.

The Funcheng Group of Hainanese Min is divided into nine lects, Chengmai Hainanese Min, Dingan Hainanese Min, Haikou Hainanese Min, Changliu Hainanese Min, Lingao Hainanese Min, Qiongzhong Hainanese Min, Qionghoi Hainanese Min, Bun-Sio Hainanese Min, and Tunchang Hainanese Min.

Intelligibility data is not available for Haikou Hainanese Min and Qionghoi Hainanese Min, but most of the vocabulary is not the same in these two lects.

Haikou Hainanese Min is spoken in Haikou City and a few miles away in Qiongshan County. There are no significant differences between the language of Haikou City districts and the suburbs.

Changliu city, six miles to the west, speaks Changliu Hainanese Min, a very closely related variety which appears to be intelligible with Haikou.

In between, residents speak both Changliu and Haikou.

Changliu is closely related to Lingao Hainanese Min spoken in Lingao County, and the two are mutually intelligible.

Chengmai Hainanese Min is spoken near Haikou.

A grammar written around 1900 on the Bun-Sio dialect of Hainanese Min stated that a number of the more distant Hainanese Min varieties were “perfectly unintelligible” to Bun-Sio Hainanese Min speakers (De Souza 1903).

Bun-Sio is spoken in an area called the Bun-Sio District, also known as the Wenchang District, on Hainan. This region encompasses the far northeastern end of the island. There are also Hainanese Min speakers in Malaysia and Vietnam. These speakers speak a version of Bun-Sio which looks a lot like the type described 100 years ago.

From a glance at this grammar, Bun-Sio or Wenchang Hainanese Min has more of a Tai-Kadai substrate than Southern Min in general. There is also a trace of Cantonese and more of a Mandarin influence than in the rest of Hokkien and Teochew. All in all, it is probably acceptable to split off Bun-Sio as a separate language.

Hainanese tones also vary from region to region, once again implying more than one language. The Hainanese Min tone system does not seem to be well described.

Leizhou Min is made up of two main groups: Leizhou Min and Zhanjiang Min. Leizhou Min is a separate language, and it has a close relationship with Hainanese. Nevertheless, Leizhou consists of seven different lects. Haikang is a dialect of Leizhou.

At least some of the other six Leizhou varieties are very different in phonology and lexicon. Intelligibility data is not known, but they may be mutually intelligible. Leizhou, with four million speakers, has low intelligibility with other Min varieties and has only 85% intelligibility with Hainanese, similar to Spanish and Portuguese.

Zhanjiang Min is apparently not intelligible with Leizhou. It is spoken in Zhanjiang City in the far southwest of Guangdong. It seems to be a separate language.

Shaojiang Min or Min Gan

Shaojiang Min or Min Gan is a completely separate high-level division of Southern Min. It is spoken by about 984,000 people in Nanping County in the far northwest of Fujian, bordering the Northern Min and Wu-speaking area to the east. It has four languages inside of it – Shaowu Shaojiang Min, Guangze Shaojiang Min, Jiangle Shaojiang Min, and Shunchang Shaojiang Min – that have limited mutual intelligibility. There are subdialects within these larger lects.

The substratum of Shaojiang is not for the most part Min, Gan or Hakka. Instead, it is the ancient Baiyue language, although there are lesser Hakka and Gan influences. Others say that this is not Southern Min at all. Instead it is a division of Northern Min where Central Min is also included. This would make sense due to its location and the fact that Shaojiang split away from Northern Min several hundred years ago. These are Northern Min speakers who came under heavy influence of Hakka, Gan, and Baiyue.

Shaowu, Guangze, Jiangle, and Shunchang are all part of Shaojiang, which has four lects, all of which are separate languages.

Puxian Min

Puxian Min or Hinghua has already been identified as a separate language. It is spoken on the southeast coast of Fujian. Puxian Min is thought to have a close relationship with Hokkien. It was probably a Proto-Hokkien variety that broke away and came under serious Eastern Min influence and hence became a separate language.

It has limited intelligibility of other Min languages – for instance, Puxian Min has 60% intelligibility of Xiamen Hokkien Min, but the mutual intelligibility is lopsided, as Xiamen intelligibility with Puxian Min is lower at 30% (Terng 2016). Hence Puxian-Xiamen intelligibility is only 45% (Terng 2016).
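The 45% figure is simply the average of the two one-way scores. A minimal Python sketch of that arithmetic, together with the 90% rule used throughout this piece, might look like the following. The function names are made up for illustration, and treating a score of exactly 90% as "dialect" is an assumption, since the rule as stated only covers scores above and below 90%.

```python
def mean_intelligibility(a_of_b: float, b_of_a: float) -> float:
    """Average two one-way intelligibility scores, given as percentages."""
    return (a_of_b + b_of_a) / 2


def classify(score: float, threshold: float = 90.0) -> str:
    """Apply the 90% rule: below the threshold is a separate language.

    Assumption: a score exactly at the threshold counts as a dialect.
    """
    return "dialect" if score >= threshold else "separate language"


# One-way scores reported above (Terng 2016).
puxian_of_xiamen = 60.0  # how much Xiamen Hokkien Puxian speakers understand
xiamen_of_puxian = 30.0  # how much Puxian Min Xiamen speakers understand

combined = mean_intelligibility(puxian_of_xiamen, xiamen_of_puxian)
print(combined, classify(combined))  # 45.0 separate language
```

Averaging hides the lopsidedness, so it is worth keeping the two one-way scores alongside the combined figure.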

The name is derived from the names of two different cities in China where this language is spoken – “Pu” for Putian and “Xian” for Xianyou.

Puxian Min has seven dialects. There is full intelligibility between all of the dialects, although there are some minor pronunciation and vocabulary differences (Terng 2016). The two main divisions of Puxian Min are Putian Puxian Min and Xianyou Puxian Min, hence the name Puxian Min, a blend of the names of the two main varieties. Both are dialects of the main Puxian Min language.

There are at least four subdialects spoken in Putian County, all subdialects of Putian Puxian Min. They are Jiangyou Putian Puxian Min, Chengli Putian Puxian Min, and two spoken in Putian City called North Putian City Puxian Min and South Putian City Puxian Min. There are other Putian Puxian varieties spoken in the county to the north and south of Putian City other than Chengli and Jiangyou, but their names are not known. We will call them North Putian County Puxian Min and South Putian County Puxian Min.

There are three dialects spoken in Xianyou County, one in Xianyou City called Xianyou City Puxian Min or Central Xianyou Puxian Min, another in the north of the county called North Xianyou County Puxian Min, and a third in the south of the county called South Xianyou County Puxian Min. All are subdialects of a single dialect of Puxian Min, Xianyou Puxian Min. All three subdialects are fully intelligible with each other, with only some minor differences in pronunciation and some different vocabulary (Terng 2016).

For instance, North Xianyou kou, “to throw,” is lacking in Xianyou City.

South Xianyou has [i] and [e] for Xianyou City [y] and [ɵ], and North Xianyou has [θ] for Xianyou City [ɬ] (Terng 2016).

Xianyou city trades a lot with the north and south of the county, so there is a lot of contact between the subdialects. The city gets rice and rice-derived goods from the south and fish and shellfish from the south.

There is also a lot of intermarriage between speakers of the three subdialects. Most speakers of one of the Xianyou dialects have relatives who speak another of the dialects. The only research on Xianyou Puxian Min has focused on the dialect of the city – Central Xianyou – with the other two dialects being poorly known (Terng 2016).

Intelligibility between Xianyou and Putian Puxian Min is good at 90%-100%. There are some vocabulary differences.

For instance, “white”: Xianyou City pann, Chengli Putian (城里, Putian City) pa; “officer”: Xianyou City kuann, Chengli Putian kua, are two pairs that cause some confusion. In these cases, Chengli Putian has lost nasalization that Xianyou City has retained. As we shall see below, loss of final nasalization is not just seen in Chengli Putian but in all of Putian. Nevertheless, Xianyou City intelligibility of Chengli Putian is full at 100% (Terng 2016).

There is some different vocabulary there too, and in some cases of common words, the differences are striking.

For instance, “children”: Xianyou kann en, Putian ta a; “wet”: Xianyou iunn, Putian tang. Once again we see that Xianyou has retained the older nasalization, whereas it appears that all of Putian, not just Chengli, has lost it (Terng 2016).

There are also rhyme differences between Putian and Xianyou. Xianyou has retained more rhymes at 50 rhymes, whereas Chengli Putian has 40, and Jiangyou Putian has 36 rhymes (Terng 2016).

So in addition to loss of nasalization, there may have been rhyme reduction in Putian also. It appears that Xianyou may be the older form of the Puxian Min language and that Putian broke away from it more recently.

Jiangyou Putian’s 36 rhymes versus Xianyou’s 50 rhymes lead to some difficulties in communication. However, Xianyou retains full intelligibility of Jiangyou at 90% (Terng 2016).

However, there is a form of Puxian Min spoken in Singapore, Hinghua Puxian Min, which lacks full intelligibility with Puxian Min in China. Hinghwa Puxian Min speakers are a minority in Singapore, and their language has mixed a lot with Singapore Hokkien, Malay, English, and other languages spoken in Singapore, resulting in a separate language.

South Putian City, North Putian City, Chengli, Jiangyou, North Putian County, and South Putian County are part of Putian Puxian Min.

Xianyou City or Central, South Xianyou, and North Xianyou are part of Xianyou Puxian Min.

Xianyou City, South Xianyou, North Xianyou, South Putian City, North Putian City, Chengli, Jiangyou, North Putian County, South Putian County, and Hinghwa are all part of Puxian Min, which has 10 lects, two of which are separate languages.

Zhongshan Min

In Guangdong Province in the Pearl River Delta near Hong Kong, there is a large, divergent split in Min Nan called Zhongshan Min.

Zhongshan Min, a macrolanguage, has 130,000-150,000 speakers and has limited intelligibility with other Min lects. It is located to the south of Hailufeng Min just north of the Cantonese zone along the Southern Guangdong Coast.

This group is possibly a Northern or Eastern Min group stranded far down in Guangdong. They are sometimes referred to in old literature as “Northeastern Min”. That’s not really a category. It often means Northern Min, but sometimes it means Eastern Min. These languages have all borrowed extensively from Siyi Cantonese spoken in the Pearl River Delta.

Looking at the whole picture, it appears that various immigrants speaking Puxian Min, Northern Min, and Southern Min all settled around Zhongshan. These various Min elements, along with a hefty dose of Cantonese, have gone into the creation of Zhongshan Min.

Two Zhongshan lects, Namlong or Zhangjiabian Zhongshan Min (also spoken in Zhongshan), and Sanxiang Zhongshan Min, are separate languages. Each one is a dialect island surrounded by Cantonese speakers, and all three populations are unconnected.

Namlong is spoken 10 miles southeast of Zhongshan in Cuiheng. It is also spoken in Namlong and Zhangjiabian.

Sanxiang is spoken to the south of Zhongshan in the hilly rural areas.

The third is called Longdu Min and is also a separate language. It is spoken in the southwest corner of Zhongshan City in Shaxi and Dayong.

In Chinese, Longdu, Namlong and Sanxiang are referred to as All-Lung Min, South Gourd Min, and Three Rural Min respectively. Sources give Longdu and Namlong 100,000 speakers and Sanxiang 30,000 speakers. 14% of the population of Zhongshan speaks Zhongshan Min. Namlong now has mostly elderly speakers.

Sanxiang, Namlong, and Longdu are apparently not mutually intelligible, although Namlong is close to Longdu.

Sanxiang is more divergent. Further, there are more dialects within these three languages, and dialectal divergence is considerable.

Sanxiang Min has at least two dialects, Phao Zhongshan Min and Tiopou Zhongshan Min. Phao is fairly uniform across a number of villages, but Tiopou is quite different. Nevertheless, there is near-full intelligibility between Phao and Tiopou (Bodman 1988).

For now, it is best to list Sanxiang, Namlong, and Longdu as separate languages, with possible dialects Phao, Tiopou, Namlong A, Namlong B, Longdu A, and Longdu B, among them.

Longyan Min or Coastal Min

Longyan Min or Coastal Min (Branner 2008) is a separate language. It is spoken in Longyan City’s Xinluo District and Zhangping City deep inside Fujian to the west of the Hokkien-speaking area. There is an overseas group of Coastal Min speakers in Malaysia in Penang around Parit Buntar. Although the language has been dying out in Malaysia for some time now, the language is still quite alive in Parit Buntar.

The language has anywhere from 300,000 (Branner 2008) to 740,000 speakers and has limited intelligibility with other Min languages. It has heavy Hakka influence due to the large number of Hakka speakers in the surrounding areas. Some put Coastal Min in a Southern Min Nan division of its own, others put it in Hokkien, and others put it outside of all other major Min varieties in its own Min category. The best analysis seems to be that it belongs in its own Southern Min division.

Koongfu Coastal Min and Shizhong Coastal Min are dialects of Coastal Min, but on examination, they are quite different. Koongfu is spoken in Kanshi Township in Yongding County. Shizhong is spoken in Southern Longyan County. Considering the rather extreme divergence of Coastal Min varieties in Wan’an, Koongfu Coastal Min and Shizhong Coastal Min are separate languages.

Another Coastal Min group is best called Wan’an Coastal Min. This is actually a macrolanguage comprising a number of separate languages in Wan’an County of Fujian.

Wan’an and Longyan are not mutually intelligible (Branner 2008).

Wan’an is a small township in northwestern Longyan County in Western Fujian which consists of very rugged, hard-to-access mountains with scattered, very isolated villages made up of poor farmers. Some of these villages were visited for the first time by a Westerner only in the last 20 years (Branner 2000).

To give you an idea of how remote the area is, to walk between two villages in Wan’an would take six difficult and confusing hours down ancient cobblestone paths through dark forests. But to take a bus between the two towns that are six hours walking distance away would take three days (Branner 2000)!

There are 13 varieties of Wan’an Min spoken in Western Fujian.

Among them are Wenheng Longgang Wan’an Coastal Min, Xi Wan’an Coastal Min, Xiangxi Wan’an Coastal Min, Shikou Wan’an Coastal Min, Wuzhai Longyan Wan’an Coastal Min, Songyang Longyan Wan’an Coastal Min, Baisha Youshui Longyan Wan’an Coastal Min, Tutan Longyan Wan’an Coastal Min, Shiahtsuen Buhyun Liliing Wan’an Coastal Min, Shanghang Buhyun Liliing Wan’an Coastal Min, Shanghang Gutian Laifang Wan’an Coastal Min, Shanghang Guanzhuang Shangzhuo Wan’an Coastal Min, and Shanghang Baisha Pengxin Wan’an Coastal Min. All are spoken in Wan’an township except Shiahtsuen Buhyun Liliing, which is spoken in Laiyuan Township in Southeastern Liancheng County (Branner 2000).

With many of these lects, they don’t understand each other at first, but after they talk to each other for a while, they start to figure out the other variety (Branner 2008). Owing to difficult intelligibility from village to village, the best analysis seems to be that all of the above are separate languages. Intelligibility among the Wan’an languages is ~70%.

Coastal Min seems to have about 85% intelligibility with Taiwanese Min. The intelligibility of Coastal Min with Penang Northern Malayland Hokkien is very poor.

She Min

A very strange variety called She Min is spoken by the She people in Zhejiang, Fujian and Guangdong. The She language was originally Hmong-Mien, which then added a Cantonese layer, then a Hakka layer, next a Min layer, and in Zhejiang, a Wu layer. It is best described as a Hmong-Mien language that has been Sinicized. There are probably 200,000 speakers of this language.

Zhejiang She Min is no doubt a separate language due to the distance between it and the other two principal varieties in addition to the Wu layer.

Fujian She Min is also a separate language.

In Eastern Guangdong, the She speak Chaosan or Teochew She Min. They live in the Phoenix Mountains in Chao’an County in Chaozhou Prefecture. The language has had heavy contact with Teochew. This is probably a separate language, unintelligible with other She languages and Teochew.

There is also an original She language that is non-Sinitic (Hmong-Mien) and is spoken by only about 1,000 people in Guangdong.

Datian Min

Datian Min in Fujian is also a separate language. Datian Min is in its own group in Min Nan.


Hakka is an extremely diverse group of languages spoken in Southern China. There may be up to 1,000 lects in Hakka. The dialect situation with Hakka is quite confused and somewhat contradictory. Some speakers report adequate intelligibility between lects, while others report difficulty. There are also reports of great diversity and difficult intelligibility even from village to village in Western Fujian, Gannan County in Jiangxi and Northern Guangdong. Intelligibility testing could clear up some of the confusion.

Hakka Proper (Meixian or Moiyen, formerly Jiaying) is spoken in Mei County in Northeastern Guangdong.

Hakka is very different from all other forms of Chinese. Although Southern Min and Hakka are said to be close, Taiwanese Hokkien speakers can understand only 1% of even Taiwanese Hakka.

Meixian Hakka is the central Hakka version used as Standard Hakka. It is at least understood by 75% of Hakka speakers, so it is often used for communicating with Hakkas who speak other Hakka languages. Meixian was chosen as the standard because the region where it is spoken is one of the major strongholds of Hakka language and culture. In addition, it has preserved most of the original Hakka phonology and has less influence from Cantonese and Hokkien.

Nevertheless, Changting Hakka preserves more of the original Hakka than Meixian does.

Xingning Hakka, Zhenping Hakka, and Wuhua Hakka are all dialects of Meixian.

Wuhua Hakka or related varieties include the varieties of Wuhua County, Jiexi Hakka, Northern Bao’an Hakka, and Eastern Dongguan Hakka in Northern Guangdong; Shaoguan Hakka in Sichuan, and Tonggu Hakka in Jiangxi.

Tonggu speakers came from Wuhua a while back. Intelligibility data for these varieties is not available, but Tonggu Hakka is in its own separate group of Hakka, so it must be a separate language.

Meixian was formerly known as Jiaying Hakka. The Hakka varieties of Meixian, Pingyuan Hakka, Dabu Hakka, Xingning, Wuhua, and Jiaoling Hakka used to be included in Jiaying.

Dapu or Dabu Hakka, while close to Meixian, is a separate language. It is spoken in Dapu County, Guangdong. Dapu was the basis for Taichung Dongshi Hakka spoken in Taiwan. Actually, Dongshi Hakka was derived directly from Chisan Hakka spoken by the founder of the Hakka community in the county. However, Dongshi is now very different from Chisan. Intelligibility data for Chisan is not available.

Fengshun Hakka is a dialect of Dapu. Fengshun has five different varieties. Fengshun is also spoken in Bangkok as Bangkok Fengshun Hakka. Although it has been affected by Teochew influence in Bangkok, Bangkok Fengshun is still relatively pure.

Hopo Hakka is not intelligible with Dabu, Hailu or Meixian. Hopo Hakka has deep influence from Teochew because it is located right next to the Teochew area.

Chaoyang Hakka, Jieyang Hakka, Raoping Hakka, and Huilai Hakka are all dialects of Hopo.

Longchuan Hakka in Northeastern Guangdong is a separate language, with poor intelligibility with other Hakka lects.

Longchuan has six different lects, Huangbu Hakka, Sidu Hakka, Chetian Hakka, Huiyang Hakka, Huicheng Hakka, and Tuocheng Hakka.

Longchuan has heavy Cantonese and Teochew influence. It is mostly spoken in Huicheng District and Boluo County.

Sidu and Tuocheng are close and are probably dialects of Longchuan. Sidu has 18,000 speakers.

Intelligibility data on Huangbu Hakka, Huiyang Hakka, and Chetian Hakka is not available. Huiyang is close to Hong Kong Hakka. However, diversity is great within Longchuan, and dialects differ from village to village, with difficult intelligibility between villages.

Boluo Hakka and Heyuan Hakka are separate languages, not mutually intelligible.

Longchuan, Boluo and Heyuan are quite distant from other Hakka.

Huizhou Hakka is in its own group of Hakka, so it must be a separate language. Huizhou is heavily spoken in Huizhou City. Huizhou is not intelligible with Moiyen, Taipu, Hopo, or Taiwanese.

Banshan Hakka is spoken in the Chengkang District of Tangnan town in close proximity to Jindengzhan village, where Teochew is spoken, and Changlin village in Tangnan town in Fengshun, Guangdong where Hakka called Changlin Hakka is spoken. Banshan is a dialect island surrounded by Teochew. Banshan may have significant Teochew influence. Banshan is quite probably a separate language.

Liannan Hakka and Wengyuan Hakka are both spoken in Northwest Guangdong. They are members of the Yuebai Group of Hakka, which is highly divergent.

In Northern Guangdong, there may be many different Hakka languages, since dialects tend to differ from village to village, and in many cases, communication is difficult between villages.

The Yuemin Group of Hakka from Southern Fujian and Southeastern Guangdong is a separate language.

Heyuan Hakka is spoken in Central Guangdong.

Jiexi Hakka is spoken in Southeastern Guangdong.

Dongguan Qingxi Hakka is spoken in South-Central Guangdong.

Haifeng Hakka, Lufeng Hakka, and Luhe Hakka, located near each other in Haifeng, Lufeng, and Luhe Counties in Shanwei City of Guangdong, appear to be dialects of a separate language called Hailufeng Hakka. It is spoken most heavily in Luhe County, where most people speak Hakka. This is a Hakka with heavy influence from Hailufeng Min.

Sanxiang Hakka, spoken in Zhongshan Prefecture, is different from all other Hakka. In all probability, it is a separate language.

Hong Kong Hakka is not intelligible with the Hakka spoken on Taiwan or with Dabu, and it has no intelligibility of Meixian. Hong Kong Hakka is spoken in the New Territories in Sai Kung Peninsula, Shatin, Taipo, Shataukok, Tsuen Wan, Sai Kung Yam Tin Chi, Island Bridge, Ho Sheung Heung, Yen Kong, Ebara, and Eastern Yuen Long. It is close to Huiyang and Bao’an. Its speakers came to the area from overpopulated Eastern Guangdong around 1650. By 1700, they had built more than 400 Hakka villages in the Hong Kong area. They may have come from the Huiyang area.

Intelligibility data for Hong Kong Hakka, Huiyang and Bao’an is not available.

Despite the fact that Hong Kong Hakka lects seem similar to Hakka lects spoken in Eastern and Northeastern Guangdong, many Hong Kong Hakka trace their origins to Guangxi.

Hong Kong Hakka has three principal dialects, Dongguan Hakka, Taipu Hakka, and Wakia Hakka. The language is similar to the Hakka spoken around Huiyang in Eastern Guangdong. Its speakers moved from that area to Hong Kong at the beginning of the Qing Dynasty, so they came to Hong Kong 375 years ago.

Dongguan Hakka is spoken near Hong Kong.

Taipu or Taipo Hakka is spoken in the village of the same name in Hong Kong.

Wakia Hakka is also spoken in Hong Kong.

Intelligibility between the Hong Kong varieties is not known.

A variety of Hong Kong Hakka is spoken in a part of Hong Kong called Shataukok. It is known as Shataukok, Satdiugok, Sathewkok, or Satdiukok Hakka. It is different from the rest of Hong Kong Hakka, and evidence indicates that Shataukok Hakka may indeed be a separate language.

Shataukok has a number of dialects within it, and they are different, but they may be more or less mutually intelligible. However, the mutual intelligibility is difficult to characterize, as it is said that speakers of other dialects can “get the gist” of what the other speakers are saying. “Getting the gist” of a variety usually implies less than 90% intelligibility.

Another variety of Hong Kong Hakka is spoken in Shuijian Village in the southern part of Yuen Long. This lect is completely different from the rest of Hong Kong Hakka. Its speakers moved to Hong Kong from Western Fujian 150 years ago. It is said to be similar to Boluo Hakka in Northeastern Guangdong, but this has not been proven.

The best name for this is Shuijian Hakka, and it is best seen as a separate language, completely apart from the rest of Hong Kong Hakka. This language is now spoken only by older people who are ashamed of their language and generally refuse to speak it with outsiders.

Located near Hong Kong, Shenzhen/Bao’an Hakka is a separate language. However, it is close to Hong Kong Hakka.

The Gannan Hakka Group spoken in Southern Jiangxi is extremely diverse compared to the Hakka of Guangdong and Fujian. Gannan Hakka varieties differ even from village to village.

With Gannan, we may be dealing with a situation of many different languages, as with Wu, Hui, Tuhua, and Xiang. In fact, it is quite possible that with Jiangxi Hakka, every Hakka variety is a separate language.

There are two separate groups there, Bendi Hakka and Keji Hakka. Bendi varieties are some of the most divergent Hakka varieties of all, while Keji varieties are more traditional, having moved out of the core Jiaying area within the last 300 years.

Xingguo Hakka is a separate language spoken in Xingguo County in Ganzhou Prefecture.

Ningdu Hakka is in all probability a separate language.

Ruijin Hakka, spoken in Southeastern Jiangxi, is very different and may well be a separate language. It looks a lot like Gan.

Xinfeng Tieshikou Hakka is in all probability a separate language, spoken in Xinfeng County by 90% of the population.

Many extremely diverse forms of Hakka are spoken in Fujian. Sources say that each Hakka village in Western Fujian speaks its own variety, and that the varieties are far enough apart to make communication from village to village very difficult.

The wildly diverse Tingzhou Hakka Group is spoken in Western Fujian. Even within this group, there are separate languages, including Tingzhou Hakka, Yongding Hakka, Liancheng Hakka, Changting Hakka, Xinquan Hakka, Qingliu Hakka, Mingxi Hakka, Taishun Hakka, Ninghua Hakka, Basel Mission Hakka, Sanhang Hakka, and probably Gucheng Hakka.

Hakka is also spoken in far Southern Zhejiang in Taishun County.

Taishun Hakka is spoken there, but it has only 1,600 elderly speakers. It has 2,600 speakers.

Taishun She Hakka is spoken by the She minority in that county.

In recent years, both have come under the heavy influence of Luoyang Wu, Zhenan Min and Manhua.

Zhaoan Xiuzhuan Hakka, spoken in Southern Fujian, is a separate language.

Luoyuan She Hakka is spoken in Western Fujian. It is an extremely diverse form of Hakka that differs from all other Hakka. It must surely be a separate language.

Therefore, in addition to the above, we will add Wuping Hakka, Longyan Hakka, Zhaoan Hakka, Yunxiao Hakka, Shangsixiang Hakka, Fuding Hakka, Fuan Hakka, Gucheng Hakka and Nanjing Qujiang Hakka.

Within Longyan Hakka, in one county, Liancheng County, there is a huge variety of dialects, including Xinquan Linguo Liancheng Hakka, Xinquan Lelian Liancheng Hakka, Pengkou Wangcheng Liancheng Hakka, Miaoqian Zhixi Liancheng Hakka, Gechuan Zhuyu Liancheng Hakka, Miaoqian Jiangshe Liancheng Hakka, Sibao Shangjian Zhenbian Liancheng Hakka, Juxi Gaoding Liancheng Hakka, Liancheng Tangqian Dikeng Liancheng Hakka, Wenheng Hengming Liancheng Hakka, Xinquan Dongnancun Liancheng Hakka, Quxi Puxi Dongxiduan Liancheng Hakka, Quxi Qiaotou Liancheng Hakka, Xuanhe Shengxing Liancheng Hakka, and Liwu Nanban Zhangwu Liancheng Hakka (Branner 2008).

Whether these are dialects or separate languages is difficult to determine. Usually speakers cannot understand each other at first, but after a while, they figure out how to communicate with each other (Branner 2008). Communication between these villages is difficult enough that a local Mandarin dialect is used for inter-village communication (Branner 2008). This suggests that it is valid to split all of the above off into separate languages.

Hakka is also spoken in the south of Guangxi. There are 3.6 million Hakka speakers in Guangxi.

Dayu Hakka is spoken in Southern Guangxi.

Mengshan Xihe Hakka is spoken in Eastern Guangxi.

Each one is probably a separate language.

Mashan Old Naxing Hakka is spoken in Mashan Old Naxing village in Guangxi. It is located far from other Hakka and has come under the influence of other Sinitic and non-Sinitic languages such that it is now very different. It is surely a separate language.

Binyang Hakka is also spoken in Guangxi. They are Meixian speakers who came to Guangxi 400 years ago. The language is now very different from Meixian. It is quite probably a separate language.

Hakka speakers immigrated to Sichuan a long time ago.

Chengdu Hakka is spoken in Chengdu, Sichuan. It is quite different from other forms of Hakka and has poor intelligibility with other forms. At the moment, Hakka is the main means of communication in the Jinjiang, Jinniu, Chenghua, Longquanyi, Xindu, and Qingbaijiang Districts in Chengdu.

Longcheng Hakka is spoken in Longcheng by Hakka who immigrated there a long time ago. It has since come under heavy influence from Longcheng Southwestern Mandarin.

Six Hakka varieties – Longchang Hakka, Longtanshi Hakka, Yilong Hakka, Panlong Hakka, Xindu Hakka, and Huanglianguan Hakka – are the main Hakka dialect islands in Sichuan. Although they have commonalities, they are all also quite different. Quite probably all of them are separate languages.

Longtanshi Hakka speakers came from Mei County in Guangdong long ago, but now Meixian and Longtanshi are very different. It resembles Wuhua and Xingning more and has since come under heavy influence from Chengdu Southwestern Mandarin.

Yilong Hakka speakers came to Sichuan 200 years ago.

Hakka varieties are also spoken in Sansheng, Tianhui, Shiling, Xihe, Shibantan, Taixing and Longwang in Sichuan. Intelligibility data is not available for Sansheng Hakka, Tianhui Hakka, Shiling Hakka, Xihe Hakka, Shibantan Hakka, Taixing Hakka, and Longwang Hakka. All have come under heavy influence from Southwestern Mandarin.

A distinct variety of Hakka is spoken by 2,300 Hakkas in Hainan. Hainanese Hakka is distinct and unintelligible with Mainland Hakka.

On Taiwan, Sixian (Four Counties) Taiwanese Hakka, Dongshi or Dapu Taiwanese Hakka and Hailu Taiwanese Hakka are not mutually intelligible, nor is the mixed Gaoxiong Taiwanese Hakka variety created in order that these three varieties could communicate with each other.

The present koine is called Sihai Taiwanese Hakka and is a combination of Sixian Taiwanese Hakka and Hailu Taiwanese Hakka, the two most widely spoken lects. Dongshi Taiwanese Hakka comes from Dapu County, Guangdong. Hailu Hakka comes from Huizhou prefecture.

Sixian itself is currently the most widely spoken Hakka variety in Taiwan. The name comes from the four Guangdong counties of Meixian, Jiaoling, Xingning, and Pingyuan. But the Sixian speakers who came to Taiwan generally came from Jiaoling, so Sixian currently resembles Jiaoling Hakka more than Meixian. Sixian is divided into two main dialects, Miaoli Taiwanese Hakka and Liudui Taiwanese Hakka. The differences between the two appear to be great, and they may well be separate languages.

Xingning Taiwanese Hakka is also still spoken in a few places. It is probably a dialect of Sixian.

Changle Taiwanese Hakka, now almost extinct, is almost certainly a Sixian dialect. Changle speakers came from Wuhua County in Guangdong.

Zhao’an Taiwanese Hakka is very different and must be a separate language. Zhao’an comes from the Zhao’an, Pinghe, Nanjing, and Hua’an Counties of Zhangzhou prefecture in Fujian. Raoping Taiwanese Hakka in all probability is also a separate language. Raoping speakers came from Chaozhou Prefecture, specifically the Raoping and Huilai Counties in Guangdong.

Tingzhou Taiwanese Hakka is extremely different and is surely a separate language. Tingzhou comes from the Changting, Ninghua, Qingliu, Guihua, and Liancheng Counties of Tingzhou prefecture. Tingzhou and Zhao’an are the two most divergent Hakka varieties on Taiwan. Tingzhou is hardly spoken anymore and may be extinct on Taiwan.

Fengshun Taiwanese Hakka is also spoken in Taiwan, but it may be a dialect of Dapu. Fengshun came from Fengshun and Jieyang Counties in Guangdong. Fengshun still has a few speakers left on Taiwan.

Two other lects, Yongding Taiwanese Hakka and are said to be extinct on Taiwan, though each still has a few speakers. Yongding is surely a separate language, but Yongding speakers came from Yongding, Shanghang and Wuping Counties of Tingzhou prefecture of Fujian near Zhao’an.

Western Fujian Taiwanese Hakka, Zhangzhou Taiwanese Hakka, and Sixhai Taiwanese Hakka were all formerly spoken on Taiwan but have all gone extinct. No doubt all three were separate languages.

In general, speakers of other kinds of Hakka find Taiwanese Hakka to be hard to understand, possibly due to Southern Min influence. Hakka speakers make up only 5% of the population of Taiwan. Almost all are proficient in Mandarin or Hokkien, and there are few monolinguals left.

The Hakka spoken in Kuching, Sarawak, in Malaysia is known as Ho Po Hak Hakka. It is similar to Hopo Hakka, spoken in Hopo, near Meizhou.

Although Ho Po Hak speakers make up 70% of the Sarawak Hakka population, there are also speakers of Dapu, Fengshun, Huizhou, Bao’an, Dongguan, Lufeng, Wuhua, Meixian and Yongding on Sarawak. These speakers probably cannot be classed as Ho Po Hak. Intelligibility between these forms of Sarawak Hakka, Ho Po Hak and the Hakkas they are derived from is not known. Ho Po Hak is very different from the Hakka spoken in Sabah, Malaysia.

Hakka speakers make up the majority (57%) of the Chinese in Sabah where Sabah Hakka is spoken. Many arrived in the 1860’s fleeing the massacres perpetrated by the Manchus following the failed Taiping Rebellion. This group settled in Sandakan.

Others were brought from Longchuan County, Guangdong to Kudat in 1882 as laborers by the North Borneo Chartered Company. Sabah Hakka is identical to Huiyang/Fuiyong Hakka spoken in the Huiyang District of the city of Huizhou, near Shenzhen in Guangdong. Huizhou Hakka has heavy Cantonese influence. Most people in Huizhou are Hakka speakers. The main Hakka centers in Sabah are the cities of Sandakan, Kudat, Kota Kinabalu, and Tawau.

Dapu is still spoken in Malaysia and Singapore. Kuala Lumpur Dapu Hakka is very different from the Dapu spoken in China. It is now heavily creolized with Malay. It is quite probably a separate language. It is heavily spoken in the Serdang and Ampang regions of the capital.

There are also some Hakka speakers around Ipoh. It is not known what type of Hakka they speak.

In the 1800’s, there were Hakkas speaking Jiaying Hakka (Jiaying was the old name for Meixian), Yongding, Fengshun, and Jengcheng Hakka from Guangdong in Singapore, Penang, Malacca and Tel Anson on the Malay Peninsula. Whether they are still present is not known. Meixian speakers were known from Singapore as recently as 1950. A type of Huiyang is still spoken in Penang as Penang Hakka.

Bangka Island Indonesian Hakka, spoken on Bangka Island in Indonesia, has diverged so radically with its tones that it is now a separate language. That is, speakers of other Indonesian Hakka varieties say that they cannot understand Bangka Island speakers. It’s a Hakka creole more than anything else.

In Indonesia, two other major Hakka varieties are spoken, Kun Dian Indonesian Hakka, spoken in Borneo, and Belitung (Ngion Voi) Indonesian Hakka, spoken mostly on Sumatra and Borneo.

Kun Dian is the largest Hakka group in Indonesia. Most live at Pontianak and Singkawang, where they speak two different mutually intelligible lects, but they have spread all over Indonesia. Kun Dian is also spoken in Jakarta, Medan and Surabaya. Kun Dian has 80% intelligibility of Sabah (Longchuan) and Hong Kong. Kun Dian is also similar to Hopo.

Belitung is spoken mostly on Sumatra and Borneo and is characterized by a soft way of speaking. Belitung speakers mostly derived from Meixian speakers.

Belitung and Bangka Island speakers say they cannot understand Kun Dian, but Kun Dian speakers say they can understand the other two for the most part.

Most old people in Belitung and Singkawang are Hakka monolinguals who cannot speak Bahasa Indonesia at all. These elderly speakers have to bring interpreters with them when they go to the doctor.

A type of Meixian is spoken in East Timor as East Timor Hakka.

Although some Indonesian Hakka speakers speak a very pure Hakka similar to the Huizhou spoken on the mainland, these are mostly the oldest generation. The younger generations speak a language that is very heavily adulterated with Indonesian languages.

Wuhua, Meixian, and Dabu are members of the Xinghua subgroup of the Yuetai Group of Hakka, which has five lects. Xinghua Hakka has 3.4 million speakers (Olson 1998).

Bao’an, Lufeng, Haifeng, and Hailufeng are in the Xinhui subgroup of Yuetai Hakka, which has nine lects. Xinhui Hakka has 2.4 million speakers (Olson 1998).

The Yuetai Group of Hakka has 23 lects.

Gaoxiong, Xinzhu, Dongshi, Jiaying, and Miaoli are members of the Jiaying Group of Hakka, which has seven lects.

Tingzhou, Yongding, Liancheng, Changting, Xinquan, Basel Mission, Wuping, Ninghua, Qingliu, and Mingxi are all part of the diverse Tingzhou Group of Hakka. All told, Tingzhou Hakka has 10 lects, most of which are separate languages.

Longchuan, Boluo, and Heyuan are members of the Yuezhong Group of Hakka, which has five lects.

Huizhou is in its own subgroup of Hakka.

Xingguo and Ningdu are in the Ninglong or Gannan Group of Hakka, which has 13 lects. There may be as many as 13 different languages in this group.

Dayu is a member of the Yugui Group of Hakka, which has 43 lects.

Ho Po Hak, Bangka Island, Nanjing Qujiang, Jiexi, Hong Kong, Mengshan Xihe, Zhaoan Xiuzhuan, Fuan, Fuding, and Haifeng are unclassified.

There are 12 major Hakka varieties and 210 Hakka varieties altogether. Others claim that there are over 1,000 Hakka varieties spoken in China. There are 30 million speakers of the various Hakka languages.


Xiang is already recognized as a separate language.

Shuangfeng Xiang and Changsha Xiang are separate languages, having only 47% intelligibility (Cheng 1997).
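The 90% criterion used throughout this piece (below 90% intelligibility = separate language, above 90% = dialect) can be sketched in a few lines of Python. This is only an illustration: the function name `classify` is made up for the example, treating exactly 90% as "dialects" is an assumption (the criterion as stated does not say what happens at exactly 90%), and the three figures are the ones quoted elsewhere in this piece.

```python
THRESHOLD = 90  # percent; below this, two lects count as separate languages


def classify(intelligibility_pct, threshold=THRESHOLD):
    """Classify one pair of lects from a single intelligibility figure.

    Assumption: exactly 90% is treated as 'dialects'.
    """
    if intelligibility_pct >= threshold:
        return "dialects"
    return "separate languages"


# Intelligibility figures quoted in this piece (percent).
pairs = {
    ("Shuangfeng Xiang", "Changsha Xiang"): 47,
    ("Suzhou Wu", "Wenzhou Wu"): 43,
    ("Kun Dian Hakka", "Sabah Hakka"): 80,
}

for (lect_a, lect_b), pct in pairs.items():
    print(f"{lect_a} / {lect_b}: {pct}% -> {classify(pct)}")
```

All three pairs fall below the threshold, so each comes out as a pair of separate languages. Note that a single figure per pair hides asymmetric cases, where one group understands the other better than the reverse.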

In fact, Changsha itself is divided into multiple languages in the city itself. We do not know how many there are, but we know that they exist. For the moment, we shall just add one variety to Changsha, and divide it into Changsha City Xiang A and Changsha City Xiang B, but there may be more. Furthermore, there are significant differences within the Changsha spoken in Changsha City and in the surrounding countryside.

Shuangfeng is also very different within itself, as the vocabulary changes every 10 miles or so. Intelligibility data is lacking.

Lingshuijiang Xiang, also spoken in Hunan by 300,000 people, may well be a separate language.

Shuangfeng and Lingshuijiang are both part of the Luoshao group of Xiang. Shuihui Xiang and Suantang Xiang are also part of this group; however, Shuihui is so different that it is recommended to split it from Luoshao into its own group with Suantang Xiang. Suantang itself is very different. It has Southwest Mandarin and Xiang elements along with Hmong and Dong influences.

Suantang is so different that it is controversial whether it is Southwestern Mandarin or Xiang, but the best analysis seems to be that it is a Xiang variety. Clearly Shuihui Xiang and Suantang Xiang are separate languages.

Mao Zedong spoke Xiangtan Xiang, a notoriously difficult Xiang language in Hunan, about which it was said, “No one can understand it.” Xiangtan itself is internally diverse, with differences between the dialects of the city and rural areas, but intelligibility data is lacking.

Shaoshan Xiang and Lianyuan Xiang are both spoken near Xiangtan, and both are surely separate languages. There are a number of dialects within each of these languages.

Ningxiang Xiang is said to be very different from Changsha. Given the dramatic divergence present even as background in Xiang, this must mean that Ningxiang is at least not intelligible with Changsha.

Ningxiang County is split into two separate dialects, North Ningxiang Xiang and South Ningxiang Xiang. The differences between the two are great. Upper Ningxiang Xiang looks more like a Lianyuan dialect, and Lower Ningxiang Xiang looks more like a Changsha dialect.

Beyond that, Ningxiang is split into four major divisions – Chengguan Xiang, Shuangjiangkou Xiang, Huaminglou Xiang, and Liushahe Xiang. Surely each is a separate language.

Baishi Xiang, spoken near Xiangtan, is very different.

Liling Xiang is also spoken around Xiangtan and must be a separate language.

Hengyang Xiang is apparently a separate language, as is Jishou Xiang. There is significant dialectal diversity in Hengyang Xiang, but intelligibility data is lacking.

Shaodong Xiang is spoken in Shaodong County which borders Hengyang. There are transitional dialects between the two languages on the border of the two counties.

Liuyang Xiang is a separate Xiang language, actually a macrolanguage, spoken in Liuyang county-level city in Changsha prefecture east of Changsha City near the Jiangxi border in Hunan. Liuyang is split into five divisions – North Liuyang Xiang, South Liuyang Xiang, West Liuyang Xiang, East Liuyang Xiang, and Liuyang City Xiang.

South Liuyang Xiang and East Liuyang Xiang are separate languages, mutually unintelligible with the others. Liuyang City Xiang has recently arisen as a sort of a Liuyang koine that is understandable to speakers of all Liuyang lects. None of the three Liuyang languages is intelligible with Changsha. On closer observation, none of the Liuyang varieties are intelligible with each other. Therefore, North Liuyang Xiang and West Liuyang Xiang are separate languages also.

Even within this classification, each of the five Liuyang Xiang varieties has multiple dialects. Each village is said to have its own variety in Liuyang Xiang.

Hengshan Xiang is a macrolanguage with vast dialectal divergence, divided by Mount Hengshan.

There are two Hengshan varieties on either side of the mountain – Qianshan Xiang in the southeast and Houshan Xiang in the northwest – that are very different and must be separate languages.

Jiashanqiang Xiang is a transitional area in the center containing features of both languages. There are 354 villages in the Hengshan Mountain area.

Huayuan Xiang appears to be a separate language.

In the city of Yiyang, Henan Province, three Chinese varieties are spoken. One is a Yiyang Changyi Xiang variety, another is a Yiyang Luoshao Xiang variety, and a third is Luoyang Southwest Mandarin, a dialect of Henan Mandarin, described above. All appear to be separate languages.

We will call the two Xiang varieties Yiyang Changyi Xiang and Yiyang Luoshao Xiang.

Huangxu Xiang, a Xiang dialect island in the Southwestern Mandarin-speaking city of Deyang in Sichuan, is very different from the rest of Xiang and must surely be a separate language.

Quanzhou Xiang in Guangxi is another Xiang dialect island. It has extreme differences with Hunan dialects like Shuangfeng.

According to good sources, there is a tremendous amount of diversity in Western Hunan, most of it probably involving Xiang lects, and most or all of these varieties are not mutually intelligible. But until we get more data, we cannot carve any languages out of this mess.

Shuangfeng, Shuihui, Suantang and Lingshuijiang are members of the Luoshao Group of Xiang, which has 21 lects.

Changsha City A, Changsha City B, Changsha Rural, Hengyang, Shaodong, Xiangtan, Shaoshan, Baishi, Liling, Lianyuan, Qianshan, Houshan, Jiashanqiang, Ningxiang, Chengguan, Shuangjiangkou, Huaminglou, Liushahe, North Liuyang, South Liuyang, East Liuyang, West Liuyang, and Liuyang City are members of the Changyi Group of Xiang, which has 32 lects.

Jishou and Huayuan are members of the Jixu Group of Xiang, which has eight lects.

Xiang is composed of 74 lects. Many or possibly all of them are separate languages. The various languages of Xiang have 50 million speakers (Olson 1998).


Wu is a major group of diverse Chinese languages that is often divided into Northern Wu and Southern Wu. Southern Wu has 18 million speakers. My opinion is that in general, the Wu varieties are mostly separate languages; however, some are merely dialects of other Wu lects.

A good general rule for Zhejiang Wu varieties is that you can sort of understand the variety of the next city over, but the language of two cities away is incomprehensible. For instance, in the Taizhou Prefecture region, there are between four and five mutually unintelligible Wu varieties across a 12-mile area. In Zhejiang, the mountains go all the way down to the sea, so there are few flat areas where language can spread out and become mutually comprehensible.

Huzhou Wu, Jiaxing Wu, and Kunshan Wu are separate languages.

Although the Suzhou City administrative area is large, the Suzhou Wu language is spoken only in the city proper and its suburbs. Suzhou City dwellers say that people in the suburbs have a rural or “hard” accent, while the speech of Suzhou City is called “soft.” Suzhou is presently divided into two sets of speakers, one over 50 and another under 50. Differences between age groups in Suzhou were noted as early as the 1930’s. Suzhou Wu is still very widely spoken in the area.

Suzhou is 70% similar to Shanghaihua. That is not enough for full intelligibility. Shanghainese speakers find Suzhou to be incomprehensible. The differences between Suzhou and Shanghainese are much greater than those between suburban Shanghai languages. A Shanghainese speaker would need a few months in Suzhou to learn Suzhou. This is about the same as the difference between Castilian and Catalan or Castilian and Asturian.

Suzhou is more complex phonologically and tone-wise than Shanghainese, so it is harder to learn. Even native Suzhou speakers have problems with the tones sometimes. Further, tone sandhi in Suzhou is quite complex.

Zhangjiagang Wu may be intelligible with Suzhou, but data is lacking. Suzhou is only 43% intelligible with Wenzhou (Cheng 1997). None of these varieties is intelligible with Shanghainese.

Wuxi Wu is spoken in the city of Wuxi. Wuxi is spoken in two areas, referred to as East and West Mountain. East Mountain refers to the city of Dongshan, and West Mountain refers to the city of Wuxi. Wuxi is not intelligible with Changzhou or Suzhou. Wuxi is only 20% similar to Shanghainese. Wuxi speakers can understand Shanghainese, but that is no doubt due to bilingual learning. Shanghainese speakers do not understand Wuxi well.

Changzhou Wu is not intelligible with Shanghainese, Wuxi or Suzhou. Changzhou and Wuxi have high but not full intelligibility. Changzhou and Wuxi are part of a dialect chain in which eastern Changzhou speakers can communicate with eastern Wuxi speakers, but as one moves further west into Wuxi or east into Changzhou, intelligibility drops off. It is best then to split Wuxi and Changzhou into separate languages.
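The dialect chain idea can be made concrete with a toy model. The sketch below is purely illustrative: the five points `A` through `E` and the 4-points-per-step drop are invented numbers, not measurements from Changzhou or Wuxi. It only shows how neighbors along a chain can clear the 90% bar while the two ends do not.

```python
THRESHOLD = 90        # percent, the criterion used in this piece
DROP_PER_STEP = 4     # hypothetical loss of intelligibility per step along the chain

# Hypothetical points along a dialect chain, in geographic order.
chain = ["A", "B", "C", "D", "E"]


def intelligibility(lect_a, lect_b):
    """Toy model: intelligibility falls off linearly with distance in the chain."""
    steps = abs(chain.index(lect_a) - chain.index(lect_b))
    return max(0, 100 - DROP_PER_STEP * steps)


def mutually_intelligible(lect_a, lect_b):
    return intelligibility(lect_a, lect_b) >= THRESHOLD


for other in chain[1:]:
    pct = intelligibility("A", other)
    print(f"A / {other}: {pct}% -> intelligible: {mutually_intelligible('A', other)}")
```

In this toy chain every adjacent pair passes the threshold, yet `A` and `E` do not, so intelligibility is not transitive. That is why a chain like this cannot be treated as a single language, and a split has to be drawn somewhere along it.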

Changzhou itself has considerable dialectal divergence, though apparently all dialects are mutually intelligible.

Changzhou is the most orthodox Taihu language. It has eight tones, and compared to Suzhou, it has many more sounds and a lot more traditional vocabulary.

Changzhou has 3 million speakers.

Ningbo Wu is close to Shanghainese, and Ningbo speakers can learn Shanghainese in ~two months. This is because many Ningbo speakers moved to Shanghai in the past 100 years and Ningbo became a prestige language in Shanghai in the first part of the 20th Century, so Shanghainese has a lot of Ningbo influence in it.

Speakers of many of the local Wu varieties around Shanghai say that they can understand Shanghainese well, but not the other way around.

The reason for this is complex. About 100 years ago, Suzhou became a very prestigious language in Shanghai and was widely spoken there. However, in the past century, many immigrants came to Shanghai from other parts of China. In particular, many speakers of Ningbo came to Shanghai. Ningbo is quite a bit different from either Shanghainese or Suzhou.

With speakers of Ningbo, Suzhou and Shanghainese all present in the city in large numbers, a koine needed to develop. Shanghainese was chosen as the koine, and because speakers of three different languages were communicating, Shanghainese was dramatically simplified phonologically so that everyone could understand it better.

Hence, Shanghainese has evolved into a highly simplified form of Taihu. This is why many speakers of nearby Wu languages say that they can understand Shanghainese but not the other way around.

Several varieties are spoken in the suburbs of Shanghai. Reports vary, but Shanghai residents generally report that these varieties are not mutually intelligible with Shanghainese (Gilliland 2006).

Some of these languages are Baoshan Wu, Fengxian Wu, Nanhui Wu, Jiading Wu, Jinshan Wu, Pudong or Chuanshan Wu, Songjiang Wu, and Qingpu Wu.

Pudong Wu, the older form of the Shanghai language, is still spoken in the Pudong District of the city, but it is dying out. There is a question of whether or not it is mutually intelligible with Shanghainese, but Shanghainese speakers seem to feel it is not mutually intelligible (Gilliland 2006).

The Shanghai suburban varieties above are probably not fully mutually intelligible. For instance, Fengxian is not fully intelligible with Jiading. Intelligibility between the two may be ~70%, but it only takes a few weeks’ exposure for a Fengxian speaker to learn Jiading Wu.

Qidong Wu, spoken in the city of Qidong, is a separate language. Qidong is said to be very close to Chongming Wu, so for the time being, we will list Chongming as a dialect of Qidong. Chongming, spoken on Chongming Island in the suburbs of Shanghai, is not intelligible with Shanghainese.

These varieties spoken in the suburbs of Shanghai are closer to the Old Shanghainese, which is quite a bit different from the New Shanghainese spoken in the city center nowadays.

Changyinsha Wu is very similar to Chongming and Qidong, so it is probably a dialect of Qidong also. Another name for Qidong is Qihai, which refers to the speech of Qidong, Haimen and Tongzhou. For the time being, we will list Changyinsha and Chongming as dialects of Qidong. Chongming, and hence Qidong, are not intelligible with Shanghainese.

Nanjing Wu is a separate language. It is close to Shanghainese Wu but is not fully intelligible with it.

However, there are two varieties spoken in Haimen, and they are not mutually intelligible. Haimen Wu A and Haimen Wu B are then two separate languages.

Wuhu Wu is a separate language, unintelligible with Shanghaihua.

Hangzhou Wu is reportedly much different from the varieties of Shanghainese, Ningbo, etc. to the northeast and is not intelligible with Shanghainese, nor with Suzhou. Hangzhou has 1.2 million speakers. Nevertheless, Hangzhou appears to be dying out in Hangzhou City, as only older people seem to speak the language anymore. Hangzhou is 40% similar to Shanghainese.

Yixing Wu, near Changzhou, is not intelligible with Shanghainese.

Tongxiang Wu also appears to be a separate language, as do Yuyao Wu and Zhoushan Wu.

Lvsi, Qisi or Tongdong Wu, spoken in the nearby town of Qisi, is a separate language from Qidong.

Jiangyin Wu is spoken in Jiangyin city. It is related to Changzhou and has high intelligibility with Changzhou and Wuxi. It has some definite differences with Suzhou. Nevertheless it appears to be a separate language because it cannot be understood outside the city. Many older people still speak only Jiangyin.

Jinxiang Wu also has its own Wu variety with Mandarin influences. This is a Taihu (Northern Wu) outlier spoken far to the south of the Taihu region.

Wenzhou Wu or Oujiang Wu is a macrolanguage, as it is made up of at least 14 separate languages. It is not understood outside of Wenzhou and it is not even intelligible within itself.

The standard version is spoken in Lucheng District by 1 million people and can be referred to as Lucheng Wu. Ouhai Wu, Yongjia Wu and Ruian Wu are said to be dialects of Wenzhou Wu, but Ouhai, spoken in the Ouhai District, is not intelligible with Ruian. Ruian is spoken by 1 million people in the city of Ru’ian, and is related to Pingyang Wu spoken in Pingcheng County.

Yongjia, spoken in Yongjia County, is separate too, since if you go five miles in any direction in Wenzhou, there’s a new dialect, and it’s hard to understand people.

Northern Yueqing Wu is a separate language within Wenzhou. Its speakers are separated from the rest of Yueqing city by Yangdang Mountain. Wenzhou is 43% intelligible with Suzhou. Indeed, Wenzhou, instead of being a single language, is a family of partially mutually unintelligible lects. See more evidence for that here.

Wujiang Wu is a separate language within Wenzhou that has come under serious influence of Luoyang Wu.

Wenxi Wu is a separate language within Wenzhou. It is spoken in one town in Qingtian County.

Wencheng Wu, spoken in Wencheng County, is a separate language within Wenzhou.

Chu River Wu is a closely related separate language from Wencheng spoken in Luoyang County in Zhejiang.

Since there are 11 different cities and counties in Wenzhou, and the language changes every five miles or so, it would be logical to assume that there are 11 separate languages within Wenzhou. However, closer analysis reveals at least 14 languages within Wenzhou.

So we should then split off at least one Wenzhou language for each major division. This gives us Cangnan Wu spoken in that county and Longwan Wu and Dongtu Wu spoken in those two districts. Although aberrant Wu varieties probably not a part of Wenzhou are spoken in Taishun and Cangnan, varieties of Wenzhou are also spoken there, so it makes sense to split those two off.

In addition, in Taishun County, there is an aberrant Wu variety spoken in the town of Luoyang influenced by both Manjiang Eastern Min and Oujiang Wu. We can call this Luoyang Wu. This is best seen as the southern extension of Yesou Wu. Liqu Wu is another Luoyang variety spoken in the area.

There is another Wu variety similar to Manjiang Eastern Min spoken in the town of Hedi in Qingyuan County in Lishui. We will call this Hedi Wu. In all probability, it is a separate language.

Manhua Wu, a macrolanguage, is quite different. It is spoken around Cangnan and Wuzhou City in Northern Zhejiang on the southern coast of Wuzhou City in about five townships. The word man literally means “barbarians.”

There is a controversy over whether or not Manhua is Macro-Min or Macro-Wu. It is probably Macro-Wu based on phonology, and it also shares some similar Min-like traits with other Wu varieties such as those in the Chuqu group. Some think it originated in a Southern Min variety that came under the influence of a non-Sinitic language. Word order is completely different from Chinese word order. However, the word order is changing under the influence of Mandarin, and many younger people are using a more Mandarin word order.

Some theories hold that it has Proto-Vietnamese, Austronesian, and She influences. The major components seem to be Old Cantonese, Old Chinese, and Mandarin. Some also suggest Northern Min, Eastern Min, Southern Min and especially Wu influences. It has 200,000-400,000 speakers.

Within Manhua Wu, there is a northern group spoken in the town of Yishan and a southern group spoken in the towns of Qianku, Qianku Manhua Wu, and Jinxiang, Jinxiang Manhua Wu. Qianku Manhua Wu is the standard for Manhua Wu. Although the internal differences in Manhua Wu are not great, Jinxiang Manhua Wu and Qianku Manhua Wu are not mutually intelligible. It is also very heavily spoken in the city of Lengkang.

All of the above are in the Taihu Group of Wu.

Taizhou Wu is a major split in Wu. It is centered around the city of Taizhou in Eastern Zhejiang and is composed of many known separate lects, all of which are separate languages, including Huangyan Wu, Jiaojiang Wu, Linhai Wu, Sanmen Wu, Tiantai Wu, Wenling Wu, Xianju Wu, Luquiao Wu, Ninghai Wu, Xiaoshan Wu, and Yuhuan Wu.

All in all, there are said to be 4-5 mutually unintelligible Wu varieties spoken in Taizhou City’s metropolitan area alone. Therefore, we will list Taizhou Wu A, Taizhou Wu B, Taizhou Wu C, Taizhou Wu D, and Taizhou Wu E. This is a region that is only 12 miles across.

Jiaojiang Wu and Huangyan Wu cannot understand Linhai Wu. The area has split into so many mutually unintelligible languages mostly due to terrain.

For instance, Taizhou and Huangyan are only a 10 minute bus ride away from each other, but the highway was only built recently, and there is a huge mountain in between both cities. Taizhou and Jiaojiang are only another 10 minute bus ride apart, but there is a huge river separating them and it could be crossed only by boat until a ferry was built in the 1990’s.

Linhai is only 20 minutes away from Taizhou now that a new expressway was recently built that involved blasting through a few mountains that previously had separated the cities.

There are two groups of Southern Wu which are both highly divergent and have very low mutual intelligibility internally. These groups are Wuzhou Wu and Chuqu Wu.

Wuzhou Wu is another major split in Wu.

Wuzhou Wu consists of at least 30 languages: Jinhua Wu, Jinhua Xiaohuang Wu, Tangxi Wu, Lanxi Wu, Pujiang Wu, Yiwu Wu A-R, Dongyang Wu, Pan’an Wu, Yongkang Wu, Wuyi Wu, Quzhou Wu, Longyou Wu and Jinyun Wu.

It is also highly divergent, much more so than even Taihu Wu. A single subgroup of Wuzhou Wu, Yiwu Wu, contains 18 separate languages, all mutually unintelligible. We will call them Yiwu Wu A, Yiwu Wu B, Yiwu Wu C, Yiwu Wu D, Yiwu Wu E, Yiwu Wu F, Yiwu Wu G, Yiwu Wu H, Yiwu Wu I, Yiwu Wu J, Yiwu Wu K, Yiwu Wu L, Yiwu Wu M, Yiwu Wu N, Yiwu Wu O, Yiwu Wu P, Yiwu Wu Q and Yiwu Wu R for the time being.

Lanxi Wu has 660,000 speakers (Rickard 2006).

Chuqu Wu is split into two subgroups, Chuzhou Wu and Longqu Wu. It contains at least 22 languages. Some members of this group extend south beyond Zhejiang into Northeastern Jiangxi and Northern Fujian. We are going to cautiously classify almost all of Chuqu Wu as separate languages, since it is much more divergent and much less mutually intelligible than Taihu Wu, and Taihu Wu itself has low internal intelligibility.

Chuzhou Wu consists of Qingyuan Wu, Jingning Wu, Jinyun Wu, Lishui Wu, and Taishun Wu, all separate languages.

Longqu Wu consists of Pucheng Wu, Shangrao City Wu, Shangrao County Wu, Guangfeng Wu, Yushan Wu, Kaihua Wu, Changshan Wu, Jiangshan Wu, Suichang Wu, Songyang Wu, Xuanping Wu, Qingtian Wu, Yunhe Wu, Longyou, Quzhou and Longquan Wu, all separate languages.

Pucheng Wu has two dialects, Nampo Wu and North Dabei Wu. Intelligibility data is not known. Pucheng Wu is so diverse that some say it is a language isolate and is not even a part of Wu (Norman 1988).

Taihu Wu contains seven subgroups.

Jiaxing, Shanghainese, Baoshan, Fengxian, Nanhui, Jiading, Jinsha, Qingpu, Pudong, Suzhou, Wuxi, Songjiang, Tongxiang, Qidong, Chongming, Changyinsha, Lvsi, Yunhe, Kunshan, and 11 others are all in the Hujia Group of Taihu Wu. Hujia Wu contains 32 lects, most of which are separate languages.

Changzhou, Yixing, Jiangyin, the Haimens, and seven others are in the Piling Group of Taihu Wu, which has 12 lects. Piling Wu has 8 million speakers.

Wenzhou, Ouhai, Yongjia, Ruian, Wencheng, and seven others are in the Oujiang Group of Taihu Wu, which contains 14 separate languages.

Hangzhou has its own group, the Hangzhou Group of Taihu Wu.

Shaoxing, Fuyang, Xiaoshan, Linan, Yuyao, Zhuji, and six others are in the Linshao Group of Taihu Wu, which contains 12 lects.

Fenghua, Zhoushan, and nine others are in the Yongjiang Group of Taihu Wu. Yongjiang Wu contains 11 lects and has 4 million speakers (Olson 1998).

Changxing and four others are in the Tiaoxi Group of Taihu Wu, which has five lects.

Taihu Wu is composed of 85 separate lects, most of which are separate languages. Taihu Wu has 47 million speakers.

The Taizhous, Huangyan, Jiaojiang, Sanmen, Tiantai, Wenling, Xianju, Leping, and Yuhuan are members of the Taizhou Group of Wu, which has 13 lects, all separate languages.

The Yiwus, Dongyang, Jinhua, Jinhua Xiaohuang, Lanxi, Tangxi, Wuyi, Pan’an, Pujiang, and Yongkang are all members of the Wuzhou Group of Wu, which contains 27 lects, almost all of which are separate languages. Wuzhou Wu has 4 million speakers (Olson 1998).

Chuqu Wu has two subgroups, Chuzhou Wu and Longqu Wu.

Lishui, Qingyuan, Jingning, Jinyun, Taishun, and four others are in the Chuzhou group of Chuqu Wu, which contains nine languages. Chuzhou Wu has 1.5 million speakers.

Pucheng, Shangrao County, Shangrao City, Jiangshan, Songyang, Guangfeng, Longquan, Kaihua, Changshan, Suichang, Longyou, Yushan, Quzhou, and one other are members of the Longqu Group of Chuqu Wu, which has 14 languages and 5 million speakers (Olson 1998).

Chuqu Wu contains 24 separate lects, almost all separate languages.

Nanjing Wu is unclassified.

There are at least 216 varieties within Wu. Some say that there are hundreds of mutually unintelligible languages inside of Wu alone.

The various Wu varieties have 85 million speakers (Olson 1998).


Hui or Huizhou is a major group of many different languages with wide internal variation. There is a possibility that all Hui varieties are separate languages. Hui is spoken in the historical area of Huizhou, located mostly in Southern Anhui but also partly in Zhejiang and Jiangxi. The area is very mountainous, leading to strong differentiation among the lects. Every county in the area has its own Hui version unintelligible to outsiders.

Xidi Hui, spoken in a village at the foot of Huangshan Mountain in Anhui, is a separate language. Xidi is unintelligible even to villages a few miles away.

Tunxi Hui, Wuyuan Hui and Xiuning Hui are separate languages. The first is spoken in Anhui, but Wuyuan and Xiuning are spoken in Jiangxi Province.

Within the Jingzhan Group of Hui, Jingde Hui, Ningguo Hui, Chilingkou Hui (spoken in Chiling, Qimen County), Meixi Xiang Hui, and Shitai Hui are separate languages.

Within Qimen County itself, there are six different Hui lects with low intelligibility between them. It is quite possible that we are talking about six different languages here. One of them appears to be Chilingkou above. The others we will just call Qimen Hui A, Qimen Hui B, Qimen Hui C, Qimen Hui D and Qimen Hui E.

All except Meixi Xiang Hui are spoken in Anhui Province. Meixi Xiang Hui is spoken in Meixi, Jiangxi.

Jixi Hui and Hongmen Hui are separate languages.

Within the Shexian Group of Hui, there are two different languages that we will only call Shexian Hui A and Shexian Hui B for now. Jixi and the Shexian languages are spoken in Anhui.

Dexing Hui and Dongzhi Hui are separate languages, the first spoken in Jiangxi and the second in Anhui.

In the Yangzhou Group of Hui, Jiande Hui and Chunan Hui are separate languages. Chunan is spoken in Jiangxi. There are two other varieties in the group, Suian Hui and Shouchang Hui. Suian and Chunan are very diverse and are in all probability separate languages. Shouchang is also extremely diverse, and Jiande has some differences with Shouchang.

The Yangzhou languages are interesting because there is controversy whether they are Wu or Hui languages. Careful examination reveals that they cannot be subsumed under Southern Wu due to their great divergence from it, despite having some similarities with Wu. Some authors feel that they are Hui-Wu merged lects, and their similarity with both is given as a reason for merging Wu and Hui into a supergroup.

While it is best to classify them as Hui, they are very different from most Hui lects. All are spoken in western Zhejiang.

Jiande, Chunan, Suian and Shouchang are members of the Yangzhou Group of Hui. Yangzhou Hui has four lects, all separate languages.

Huangshan, Tunxi, Wuyuan, Xiuning, and two others are members of the Xiuyi Group of Hui, which has six lects.

Meixi Xiang, the Qimens, Chilingkou, Jingde, Ningguo, Shitai, and two others are members of the Jingzhan Group of Hui. Jingzhan Hui has 12 lects.

Jixi, Huizhou, Hongmen, the Shexians, and She are members of the Jishe Group of Hui. Jishe Hui has six lects, all separate languages.

Dexing, Dongzhi, Fuliang, and two others are members of the Qide Group of Hui. Qide Hui has five lects.

Xidi is unclassified.

There are 37 different Hui lects, at least 24 of which are separate languages. The various Hui languages have 3.2 million speakers.


Cantonese is a major language group spoken in the south of China. Cantonese speakers are said to be a mix between the Yue people and the Han. They take great pride in their speech, which is closer to ancient Chinese than Mandarin is.

Some Cantonese activists denounce Mandarin as a pidgin language spoken by Manchu and Mongol invaders glommed onto the Chinese of the people they conquered.

Various methods are used to determine intelligibility between lects. They vary in efficacy, as the following shows.

Attempts to determine intelligibility through the use of complex lexical, tonal, grammatical and phonological formulae produce results that are excessively high in terms of percentage of intelligibility.

A better method is presented in Szeto 2000, in which sentences in other varieties, say Varieties B and C, are played to speakers of Variety A, and speakers of Variety A are asked to give the basic meaning of the Variety B and C sentences played to them. A sentence is recorded as correct if the basic meaning was ascertained.
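The scoring behind this kind of test is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, not code or data from Szeto 2000: the function names and the 16-sentence response list are made up, and the 90% cutoff is the criterion used elsewhere in this piece (treating a score of exactly 90% as a separate language is an assumption here).

```python
# Sketch of sentence-level intelligibility scoring (illustrative only;
# function names and sample data are hypothetical, not from Szeto 2000).

# Criterion used in this piece: <90% = separate language, >90% = dialect.
# Assumption: a score of exactly 90% is treated as a separate language.
SEPARATE_LANGUAGE_THRESHOLD = 90.0


def intelligibility_percent(judgments):
    """Percent of played sentences whose basic meaning the listener got right."""
    if not judgments:
        raise ValueError("need at least one judged sentence")
    correct = sum(1 for understood in judgments if understood)
    return 100.0 * correct / len(judgments)


def classify(percent):
    """Apply the 90% cutoff to a measured intelligibility percentage."""
    if percent > SEPARATE_LANGUAGE_THRESHOLD:
        return "dialect"
    return "separate language"


# Hypothetical responses: 16 Variety B sentences played to a Variety A speaker,
# True where the basic meaning was ascertained.
variety_b_judgments = [True] * 5 + [False] * 11

score = intelligibility_percent(variety_b_judgments)
print(f"Variety B: {score:.2f}% -> {classify(score)}")
```

In a real study the judgments would be pooled over many listeners and sentences, but the percentage is computed the same way.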

By this better method, Standard Cantonese has only 31.3% intelligibility of Siyi, 7.2% of Hakka, 2.7% of Teochew and 2.5% of Xiamen (Szeto 2000). This paper also highlights the very important role morphological and syntactic differences play in intelligibility, even apart from phonology and other factors.

In contrast, the more complex method, which uses lexical, tonal, grammatical and phonological formulae and does not rely on actual informants, gives false positives. By this method, Cantonese has 54.7% intelligibility of Hakka, 47.45% of Teochew, and 43.5% of Hokkien. Compared with the informant-based figures above, this method overestimates the intelligibility of Hakka by 7.6X, of Teochew by about 17.6X and of Hokkien (Xiamen) by 17.4X.
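The overestimation factor is just the formula-based percentage divided by the informant-based percentage. A minimal Python sketch using the figures quoted above follows. One assumption is labeled in the code: the informant test reports Xiamen while the formula method reports Hokkien, and the two are treated here as the same comparison.

```python
# Overestimation factor = formula-based score / informant-based score.
# Figures are the ones quoted in the text (Szeto 2000 for the informant test).
# Assumption: "Xiamen" (informant test) and "Hokkien" (formula method) are
# treated as the same comparison.

informant_based = {"Hakka": 7.2, "Teochew": 2.7, "Hokkien (Xiamen)": 2.5}
formula_based = {"Hakka": 54.7, "Teochew": 47.45, "Hokkien (Xiamen)": 43.5}

overestimation = {
    lect: formula_based[lect] / informant_based[lect]
    for lect in informant_based
}

for lect, factor in overestimation.items():
    print(f"{lect}: formula method overestimates by {factor:.1f}X")
```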

Standard Cantonese is traditionally said to have nine tones, but phonemically there are only six tones, since the last three are just three of the first six with a voiceless stop consonant on the end.

These are often called entering tones in traditional Chinese scholarship. Entering tones disappeared from most Mandarin varieties about 800 years ago due to the influence of invading Mongols speaking Turkic languages but are still present in Cantonese, Hakka and Min.

The original entering tones of Middle Chinese have merged into other tones or into Mandarin’s four tones. Traditional Chinese tones or contour tones end in a vowel or a nasal. However, in Standard Cantonese, the entering tone has retained its original short and sharp character from Middle Chinese, so in a sense, it has a different sound quality.

One of the most well-known divisions in Cantonese is Yuehai. Yuehai contains four divisions: Guangfu, Sanyi, Zhongshan, and Guangbao.

The other major divisions of Cantonese are Goulou and Yongxun, found in the watershed of the Pearl River, and Siyi, Gaoyang, Wuhua and Qinlian.

The Guangfu division of Yuehai consists of Guangzhou Cantonese, Xiguan Guangzhou Cantonese, Sabah, Hong Kong Cantonese, Macao Cantonese, Wenzhou Cantonese, Wuzhou Cantonese, Huizhou Cantonese, Nishimura Cantonese, Dongshan Cantonese and Xiguan Cantonese.

Standard or Guangzhou Cantonese is based on the Guangzhou dialect spoken in the city of that name.

A very pure form of Cantonese is spoken in Sabah in Malaysia as Sabah Cantonese. It resembles Standard Cantonese so much that the speaker community is called Little Hong Kong.

Hong Kong Cantonese is spoken in Hong Kong. There are a few differences with Guangzhou but not enough to impair communication.

Macao Cantonese is spoken in Macao.

Xiguan Cantonese is spoken in the suburban areas of Guangzhou. It has a few differences with Guangzhou but presumably not enough to impair communication. It is spoken mostly by older people now, as young people speak Xiguan Guangzhou Cantonese, which is more properly part of Guangzhou. The dialect is dying out.

Dialects spoken in Guangzhou City include Nishimura Cantonese, Dongshan Cantonese, and some others. Dongshan is spoken in the downtown area. Nishimura is spoken by a few old people in the Nishimura zone, but it is going extinct.

Wenzhou Cantonese is very close to Guangzhou.

Huizhou Cantonese is a Cantonese variety spoken in Huizhou City to the east of Guangzhou, to the northeast of Dongguan and to the west of Shanwei. This is part of the Pearl River Delta. Huizhou has very heavy Hakka influence such that it is probably a separate language.

Vietnamese Cantonese is quite different from Standard Cantonese, but it is said to be nevertheless intelligible with it. However, other Standard Cantonese speakers say they cannot understand Vietnamese Cantonese very well.

Malayland Cantonese is also quite different from Standard Cantonese. Cantonese speakers who talk to Malayland speakers say that Malayland sounds like a foreign language. Therefore, Malayland appears to be a separate language. Malayland is mostly spoken in Kuala Lumpur and Ipoh, less so in Singapore. There are dialects inside of Malayland such as Kuala Lumpur Cantonese and Ipoh Cantonese.

Cantonese is the most commonly spoken Chinese language around Kuala Lumpur. Although Singapore South Malayland Hokkien is the most widely popular non-Mandarin Chinese language in Singapore, Cantonese is the most commonly spoken language in Chinatown.

The Sanyi Group of Cantonese consists of Shunde Cantonese, Panyu Cantonese, Nanhai Cantonese, Xiquiao Cantonese, Foshan Cantonese, Shiwan Cantonese, Shatin Cantonese, and Jiujiang Cantonese.

Around Foshan, Xiquiao Cantonese, Jiujiang Cantonese, Shiwan Cantonese, and Nanhai Cantonese are all spoken.

Foshan and Nanhai are close to Standard Cantonese and may be intelligible with it. Nanhai and Shunde Cantonese are mutually intelligible. Foshan, Xiquiao, and Jiujiang are quite similar to Shunde.

Panyu Cantonese is definitely a separate language (Chan 1981). Panyu Cantonese is spoken in Xiaolan and Huangpu in the Zhongshan area.

Shunde Cantonese is almost the same language as Panyu, so if Panyu is a separate language, then Shunde is also. Shunde and Panyu may well be a single language, and if Nanhai is intelligible with Shunde, then Nanhai is also a part of this language. Shunde is spoken in Daliang, Longjiang, Ronggui and Beijiao.

There is at least one separate language inside of Sanyi centered around Shunde, Panyu, and Nanhai, which together are known as the Three Counties Area.

The Zhongshan Group of Cantonese, spoken in Guangdong and composed of Shiqi Cantonese and Sanjiao Cantonese, is a separate language. Speakers of Standard Cantonese cannot necessarily understand Shiqi, but Shiqi people can understand Standard Cantonese. Shiqi is spoken in the urban part of Zhongshan City. Whether Shiqi and Sanjiao Cantonese are mutually intelligible is not known. It is best to call this language Shiqi Cantonese for now.

The Guangbao Group of Cantonese is spoken east of the Pearl River Delta in Shenzhen, Dongguan and Hong Kong. Within Guangbao are three major divisions, Dongguan Cantonese, Bao’an Cantonese, and Dapeng Cantonese.

Dongguan Cantonese is not intelligible with Standard Cantonese. It is spoken in Dongguan City. A lot of young people are forgetting how to speak it under the influence of Standard Cantonese.

Dongguan is divided into Guangcheng Cantonese, Houjie Cantonese, and Humen Cantonese. Guangcheng is spoken in the Guangcheng subdistrict. Humen is spoken in Humen Township on the east side of the Pearl River. Houjie is spoken in Houjie Township to the north of Humen.

Bao’an Cantonese is divided into Danija Cantonese, Weitou Cantonese, Gashiau Cantonese and Nantou Cantonese.

Danija Cantonese is the Cantonese variety spoken by the Tanka fisherpeople who live on boats off the coast of Guangdong, Guangxi, and Zhejiang. The Tanka People also live in Fujian and Hainan. In Fujian, they speak Fuzhou Northern Min. In Hainan, they speak some form of Hainanese Min.

Another group of Tankas in Hong Kong in Aberdeen and Taio to the north of the Hokkien-speaking area are former Hakka and Hokkien speakers who speak Weitou Cantonese, a Cantonese variety close to Standard and Dongguan but closer to Dongguan. It is not intelligible with Hong Kong Hakka.

Weitou is spoken mostly by older people in Hong Kong’s New Territories in walled villages in Yuen Long, Kam Tin, Songgang, Pinghu, Ping Shan, Shantin, Sheung Shui, Tai Tau Leng, Yan Gang, Fanling, Fanling Po Tsuen, Lam Tsuen, Taipo, and Tam Chung Tsuen, in the Bao’an District, in Shenzhen in Shangsha, Xiasha, Huanggang, Xinzhou, Fukuda, Gangxia, and Akao, in the Longgang District, in parts of Nantou, and in the Nanshan District.

Nantou Cantonese is spoken in the Namtam area of Nantou by 5,000 people. Intelligibility with the rest of Bao’an is not known.

In Hong Kong, Gashiau Cantonese is spoken by a group of fisherpeople related to the Tanka. This language is related to Danija/Weitou but is not intelligible with it.

Dapeng Cantonese is spoken on the Dapeng Peninsula in the city of Dapeng, in Hong Kong and Shenzhen, in Tung Ping Chau on the Ping Islands in Hong Kong, and in Tai Kok. It has been very heavily influenced by Hakka. It is so different that it must be a separate language. It may be related to or the same thing as the Junhua or Military Language, a mixed language now classified as Mandarin. If so, it is not Cantonese at all, and instead it is a Mandarin lect. In Hong Kong, Tung Ping Chau Dapeng is highly endangered.

The Siyi or Sze Yup Group of Cantonese is a huge group of Cantonese lects spoken in the Pearl River Delta. Siyi Cantonese is the language of the Four Counties: Enping, Kaiping, Taishan and Xinhui. Researchers have found 664 different Cantonese dialects in the Pearl River Delta area alone. 194 of them were quite similar, but another 442 of them were quite different. Since it is mostly Siyi varieties that are spoken in this area, this implies that there may be up to 664 different lects in Siyi alone.

Siyi has very low intelligibility with Standard Cantonese, 10-20%.

150 years ago, there were fewer but still significant differences between Siyi and Sanyi (Standard Cantonese), but Siyi was disparaged as a “hill dialect” of poor farmers, while Sanyi was elevated as the prestige variety of the cultured and cosmopolitan. This is why Sanyi became the Standard Cantonese variety. The Siyi incorporated this negative view into their self-image even to the point where they held overseas meetings in Sanyi.

Taishanese, Hoisonese, Hoisan Cantonese, or Toison Cantonese is spoken north of Macao in Taishan County where there are 20 townships, and there is a different lect in every township. Taishanese is the Standard Siyi dialect. As late as the early 1990’s, children in this area were still being taught in the local Taishanese lect. Taishanese is still widely spoken in Chinatowns in the US such as in San Francisco (especially Stockton Street) and in New York.

The varieties in Taishan County can be quite different. For certain, there are at least three distinct languages within Taishanese besides the standard variety, Taishan Cantonese A, Taishan Cantonese B and Taishan Cantonese C, and these three have a hard time understanding each other.

There are clearly at least 17 dialects within Taishan Proper alone. Each town has its own dialect, and in fact, each village has its own dialect. The main town dialects are Taicheng Cantonese, Dajiang Cantonese, Shuibu Cantonese, Sijiu Cantonese, Baisha Cantonese, Sanhe Cantonese, Chonglou Cantonese, Doushan Cantonese, Duhu Cantonese, Chixi Cantonese, Duanfen Cantonese, Guanghai Cantonese, Haiyan Cantonese, Wencun Cantonese, Shenjing Cantonese, Beidou Cantonese, and Chuandao Cantonese.

Baisha is spoken in Bei Hou.

Speakers of Enping Cantonese, spoken in Enping County, cannot understand some other Siyi lects. Therefore, Enping is a separate language.

Kaiping or Chikan Cantonese, spoken in Kaiping County, is not fully intelligible with Enping until speakers get used to each other’s sounds. Kaiping is so different from Taishanese that it is hard to imagine how they can communicate well, though there is partial intelligibility. There are many different dialects inside of Kaiping alone, and pronunciation varies almost from neighborhood to neighborhood. One dialect is called Gee Cantonese. However, they seem to be mostly mutually intelligible.

In Xinhui, there is a dialect called Hetang Cantonese that is very divergent and has many strange features not found in other Siyi lects. Doubtless it is less than fully intelligible with other Siyi lects.

Xinhui Cantonese is somewhat different from Taishanese but appears to be intelligible with it.

Heshan Cantonese is intelligible with Xinhui and Taishanese.

Siqian Cantonese, Doumen Cantonese and Jiangmen Cantonese are three other Siyi varieties. Intelligibility data for these three lects is not known.

The Yongxun Group of Cantonese consists of Nanning Cantonese, Yongning Cantonese, Guiping Cantonese, Chongzuo Cantonese, Ningmin Cantonese, Hengxian Cantonese, and Baise Cantonese.

Baise Cantonese must be a separate language. It is spoken in the Yongjiang District in Baise City. It is very different, having been influenced heavily by Zhuang speakers.

Conghua or Congzhou Cantonese is spoken in three different dialects in Central Guangdong. Intelligibility data is lacking.

Curiously, Nanning Cantonese is said to be intelligible with Standard Cantonese.

The Goulou Group of Cantonese is separate from all of the rest of Cantonese and is linked with Ping and Tuhua. It is made up of Yulin Cantonese, Baobai Cantonese, Lizhou Cantonese, Guangning Cantonese, Huaiji Cantonese, Fengkai Cantonese, Deqing Cantonese, Shanglin Cantonese, Binyang Cantonese, Yangshan Cantonese, Ertang Cantonese, Shuishan Cantonese, Yunan Cantonese, and Tengxian Cantonese.

Ertang Cantonese, Shuishan Cantonese and Yunan Cantonese are all spoken in Guilin City in Guangxi Province. They are under Ping influence. Ertang and Shuishan arrived in Guangxi 100 years ago from the Yangshan region of Guangdong.

Yulin Cantonese is a representative variety in Goulou Cantonese and is the existing form of Chinese that is closest to Old Chinese.

Baobai Cantonese is spoken in Baobai south of Yulin. Yulin and Baobai are mutually intelligible, but they are not intelligible with the rest of Goulou Cantonese.

Lizhou Cantonese has poor intelligibility with Standard Cantonese. It is spoken apart from the main group, so it may be a separate language.

Wuzhou Cantonese is a very divergent Cantonese variety spoken in Wuzhou City in Eastern Guangxi that is very hard even for other Cantonese speakers to understand.

The Gaoyang Group of Cantonese is a division of Cantonese that is composed of Gaozhou Cantonese, Yangjiang Cantonese, Lianjiang Cantonese and Maoming Cantonese.

Maoming Cantonese is an extremely divergent Cantonese variety that must be a separate language. Intelligibility of Maoming Cantonese with Yangjiang Cantonese, Lianjiang Cantonese and Gaozhou Cantonese is not known.

The Wuhua Group of Cantonese consists of Huazhou Cantonese, Zhanjiang Cantonese, Maihua Cantonese and Wuchuan Cantonese.

Huazhou Cantonese, spoken next door to Maoming, also cannot be understood by Standard Cantonese speakers.

Zhanjiang Cantonese is utterly unintelligible with Standard Cantonese. They speak Zhanjiang Min in this area, and the Cantonese has heavy Min influence, hence it is probably a separate language.

Maihua Cantonese is a Cantonese variety spoken on Hainan. This is the only Cantonese variety spoken on Hainan, so for that reason alone, it may be a separate language.

The Qinlian Group of Cantonese is a division of Cantonese spoken in the Guangxi coastal areas around Qinzhou, Lianzhou, Lingshan, Beihai and Fangchenggang.

The group is divided into urban varieties which share a high degree of mutual intelligibility with each other and even with other urban varieties in the Yongxun and Gaoyang Groups but have poor intelligibility with the rural varieties.

The higher mutual intelligibility among urban varieties, even across group boundaries, may be due to the speech of the cities being closer to each other than to the rural varieties within their own groups. This may have to do with histories of intense trade between cities, which brought their varieties closer together.

The urban varieties are Qinzhou Cantonese, Fangcheng Cantonese, Dongxing Cantonese, and Lingcheng Cantonese. They would seem to constitute a language called Urban Qinlian Cantonese.

The rural varieties are split into three major groups: Lianzhou Cantonese, Lingshan Cantonese, and Xiaojiang Cantonese.

Lianzhou Cantonese varieties have a Ping base with some Min and Hakka blended in. They are spoken in Hepu, the southern part of Pubei, and the coastal areas of Qinzhou. Lianzhou is so different from even the rest of the rural varieties that it is a separate language.

Hepu Cantonese is a Lianzhou Cantonese lect.

Lingshan Cantonese varieties are spoken in the countryside of Qinzhou, Lingshan and Pubei.

Xiaojiang Cantonese varieties are spoken in Pubei.

The rural varieties have poor intelligibility with the urban lects. A separate language called Rural Qinlian Cantonese seems reasonable.

Beihai Cantonese is very widely spoken in the area around Nanning as the major language. Beihai itself has five separate dialects within it, Beihai Cantonese A, Beihai Cantonese B, Beihai Cantonese C, Beihai Cantonese D and Beihai Cantonese E.

Jimmi Cantonese is an unclassified Cantonese language spoken in Jilong and Tiechong in Huidong and Erbu and Chishi in Haifeng. The popular notion is that this is a blend of Cantonese, Hakka and Min. Hailufeng Min is widely spoken in the area, and Haifeng Hakka is also spoken. Jimmi varieties appear to be mostly Cantonese with some Hakka and an even smaller trace of Min. Surely Jimmi must be a separate language.

Namlong Cantonese is an unclassified Cantonese language from the Pearl River area. It is also a separate language, or at least it was in 1949. Whether it still exists is not certain, but native speakers must still be alive.

Dongguan, Shunde, Foshan, Zhongshan, Nanhai, Panyu, Xiquiao, Shiwan, Shatin, Jiujiang, Guangzhou, Vietnamese, Malayland, Macao, Hong Kong, Nishimura, Dongshan, Xiguan, Bao’an, Tanka, Shiqi, and Sanjiao are members of the Yuehai Group of Cantonese, which has 727 lects.

Yuehai itself is split into Guangfu, Zhongshan, Guangbao and Sanyi subgroups.

Guangzhou, Vietnamese, Malayland, Macao, Hong Kong, Nishimura, Dongshan, Wuzhou, Xiguan, and Tanka are members of the Guangfu Group of Yuehai, which has 10 lects.

Guangfu has 13 million speakers (Olson 1998).

Shunde, Panyu, Nanhai, Xiquiao, Foshan, Shiwan, Shatin, Jiujiang and one other are members of the Sanyi Group of Yuehai, which has eight lects.

Dongguan, Bao’an, and Dapeng are members of the Guangbao Group of Yuehai, which has three lects.

Shiqi and Sanjiao are members of the Zhongshan Group of Yuehai, which contains two lects.

Taicheng, Dajiang, Shuibu, Sijiu, Baisha, Sanhe, Chonglou, Doushan, Duhu, Chixi, Duanfen, Guanghai, Haiyan, Wencun, Shenjing, Beidou, Chuandao, Heshan, Jiangmen, Siqian, Doumen, Guzhen, Xinhui, Enping, Gee, and Kaiping are members of the Siyi Group of Cantonese, which has at least 693 lects. There are 3.6 million speakers of Siyi Cantonese.

Nanning, Yongning, Guiping, Chongzuo, Ningmin, Hengxian, Baise, and five others are members of the Yongxun Group of Cantonese, which has 12 lects.

Yongxun Cantonese has five million speakers (Olson 1998).

Zhanjiang, Gaozhou, Maoming and nine others are members of the Gaoyang Group of Cantonese, which has 12 lects.

Gaoyang Cantonese has 5.4 million speakers (Olson 1998).

Huazhou, Zhanjiang, Maihua, and Wuchuan are members of the Wuhua Group of Cantonese, which has four lects.

Yulin, Baobai, Guangning, Wuzhou, Huaiji, Fengkai, Deqing, Yunan, Shanglin, Binyang, Yangshan, Ertang, Shuishan, and Tengxian are members of the Goulou Group of Cantonese, which has at least 14 lects.

Qinzhou, Fangcheng, Dongxing, Lingcheng, Beihai, Lianzhou, Lingshan, Xiaojiang, Conghua, Nanning, and Hepu are members of the Qinlian Group of Cantonese, which has 11 lects.

Namlong is unclassified.

There are 780 lects of Cantonese, and Cantonese has 64 million speakers.


Ping, now recognized as a major split from Cantonese, is composed of Guinan Ping, Guibei Ping, and Benihua Ping. Guinan and Guibei are definitely separate languages, and Benihua appears to be one also. There is high but apparently not full intelligibility between Guinan and Guibei.

Ping has been heavily influenced by the language of the Dong people. Cantonese has almost no intelligibility of Ping.

Guinan Ping is spoken in Northern Guangxi around the city of Guilin near the Southern Mandarin-speaking area.

Guibei Ping is spoken in Southern Guangxi around the city of Nanning. It is close to Cantonese, especially Nanning Cantonese spoken in the same area. Guibei has some loans from Zhuang.

Benihua is a Ping language that has been heavily influenced by the Gong language, and as such, no doubt it is a separate language.

Guinan Ping has 22 lects.

Yongjiang Pinghua, Guandao Pinghua and Rongjiang Pinghua are members of Guibei Ping, which has 11 lects.

There is one Ping variety that is unclassified.

Ping has 34 lects. Ping has 2 million speakers.


Tuhua is a separate branch of Chinese spoken in Northern Guangdong, Western, Southeastern, and Northeastern Hunan Province and parts of Southern Guangxi. It has 132 separate lects. Tuhua is not really a language group but a wastebasket group for various varieties derisively referred to as tuhua – or “farmer’s language.”

Initial examination suggests a number of things.

First of all, the Tuhua lects, especially those of Southern Hunan, are very diverse, possibly as diverse as Wu, Xiang and Hui. Many or all of them may well be separate languages. If Tuhua is really as diverse as Wu, Xiang and Hui, then quite probably there is a different Tuhua language spoken in every county. Further, they are poorly studied and dialectally very diverse. There are many dialects inside the known Tuhua lects, and these dialects are often very different. So there appear to be languages inside even the known Tuhua lects.

Further, there appear to be links between the Tuhua varieties of Southeastern Hunan and northern Guangdong and the Ping language of Northern Guangxi, as they border each other. They all appear to be related and to have descended from a common ancestor.

Tuhua may have originally begun as a Sinicized form of the Yao language, and many of its speakers are still Yao people. One theory is that Tuhua is simply an extension of Ping. Another theory is that Tuhua started out as Middle Gan and then mixed with Cantonese, Hakka and Southwestern Mandarin.

Additionally, many Tuhua varieties are starting to splinter recently, as influences from Hakka, Cantonese and Southwest Mandarin begin to affect the younger speakers such that the language of the youngest speakers is quite a bit different from the language of the older speakers.

The best known of the Tuhua varieties is Shaozhou, referred to here as Shaozhou or Shaoguan Tuhua. Sometimes this name is used to describe all Tuhua varieties. It is spoken on the border of Hunan, Guangdong and Guangxi. Most of the speakers are in Northern Guangdong, but there are also some speakers in Southeastern Hunan.

Shaozhou is very different from other Chinese lects. Shaozhou consists of many different varieties which are often strikingly different from the others. Some say that Shaozhou is a branch of Min Nan, while others say it is related to Hakka.

Shaozhou is composed of eight lects, all of which appear to be separate languages. Of these, Shibei Shaozhou Tuhua and Xiangyang Shaozhou Tuhua, spoken in adjacent towns, are separate languages. Shibei has heavy Hakka influence, and Xiangyang is turning more Cantonese. Xiangyang has only been in contact with Cantonese for a few decades, while Shibei has been in contact with Hakka for centuries.

Guitou Shaozhou Tuhua and Dacun Shaozhou Tuhua are also separate languages.

Zhoutian Shaozhou Tuhua and Shitang Shaozhou Tuhua are spoken in Renhua County. These may both be separate languages.

Really all of the Shaozhou varieties seem to be separate languages, so Nanxiong Shaozhou Tuhua is also. Nanxiong apparently shares a common ancestor with Hakka.

Longgui Shaozhou Tuhua, spoken in Qujiang County in Guangdong, is a separate language. Longgui has 2,000 speakers.

Besides Shaozhou, another major split in Tuhua is Lianzhou Tuhua. It is spoken in Lianzhou County and in Liannan Autonomous Yao County in Qingyuan City in Northern Guangdong. Lianzhou is composed of Xi’an Lianzhou Tuhua, Fengyang Lianzhou Tuhua, Xingzi Lianzhou Tuhua, and Bao’an Lianzhou Tuhua. Each is spoken in a distinct township or townships, so no doubt each is a separate language.

In Lechang Prefecture in Northern Guangdong bordering Hunan, there are five separate languages, Lechang Tuhua 1, Lechang Tuhua 2, Lechang Tuhua 3, Lechang Tuhua 4 and Lechang Tuhua 5, which are not fully intelligible with each other.

Xianghua is a branch of Tuhua that contains six varieties of its own. Xianghua Tuhua is a completely separate and highly diverse language that is spoken in Western Hunan.

Also in Hunan, in northeastern Quiyang County, another Tuhua variety is spoken – Quiyang Tuhua. This must certainly be a separate language. There is a great deal of dialectal diversity within Quiyang Tuhua. Yantang Quiyang Tuhua and Yangshi Quiyang Tuhua are two of these dialects.

Xintian Tuhua, spoken in Linwu County in Southern Hunan, is a major split in Tuhua, so it is surely a separate language.

Linwu Dachong Xintian Tuhua is a form of Xintian.

Jiahe Tuhua is a completely separate language, unintelligible with other lects. Furthermore, there are huge dialectal differences within Jiahe Tuhua that may or may not constitute separate languages.

In Yongzhou County in Southeastern Hunan, Yongzhou or Xiangnan Tuhua is spoken.

It is clearly a separate language. It has at least 18 different dialects: Xintian Southern Rural Yongzhou Tuhua, Xintian Northern Rural Yongzhou Tuhua, Ningyuan Zhangjia Yongzhou Tuhua, Ningyuan Yongzhou Pinghua Tuhua, Lanshan Shangdong Yongzhou Tuhua, Lanshan Tushi Yongzhou Tuhua, Lanshang Taiping Yongzhou Tuhua, Shuangpai Lijiaping Yongzhou Tuhua, Gangyu Yongzhou Tuhua, Xiangyu Yongzhou Tuhua, Guiyang Liuhe Yongzhou Tuhua, Jianghua Sumitang Qidouhua Yongzhou Tuhua, Jianghua Baimangying Yongzhou Tuhua, Jiangyong Songbai Yongzhou Tuhua, Jiangyong Chengguan Yongzhou Tuhua, Jiangyong Taochuan Yongzhou Tuhua, Daoxian Xianglinpu Yongzhou Tuhua, Dong’an Gaofeng Yongzhou Tuhua, Dong’an Xuaqiao Yongzhou Tuhua, Dong’an Shiqishi Yongzhou Tuhua, Lengshuitan Xiaojiangqiao Yongzhou Tuhua, and Lengshuitan Lanjiaoshan Yongzhou Tuhua.

There are four main types represented here:

The first type is a Dong’an-Lengshuitan type comprising Dong’an Xuaqiao, Dong’an Gaofeng, Dong’an Shiqishi, Lengshuitan Xiaojiangqiao, Lengshuitan Lanjiaoshan, and Sumitang Qidouhua.

Of these, Dong’an Gaofeng Yongzhou and Dong’an Xuaqiao Yongzhou are spoken in separate districts, so they are in all probability separate languages. Dong’an Shiqishi Yongzhou Tuhua has Xiang and Wu influences.

The Lengshuitan varieties appear to represent at least one language. Lengshuitan Lanjiaoshan has at least one dialect, Lengshuitan Shamuqiao Lanjiaoshan Yongzhou Tuhua. It has a close relationship to Dong’an Xuaqiao Yongzhou Tuhua.

The second type is a Jiangyong-Daoxian type comprising nine lects. At least seven of them are clearly separate languages.

Daoxian Xianglinpu Yongzhou Tuhua must be a separate language, as it is named after a county.

Daoxian Xiaojia Yongzhou Tuhua must be a separate language also, as it is a major split in this group.

There are many different Yongzhou Tuhua lects in Jiangyong County, many of which are separate languages. These include Jiangyong Yunshan Yongzhou Tuhua, Jiangyong Xiaopu Yongzhou Tuhua, Jiangyong Xiacengpu Yongzhou Tuhua and Jiangyong Huilongxu Yongzhou Tuhua, all of which must surely be separate languages.

There are many dialects even within the town of Yunshan where Jiangyong Yunshan is spoken. Jiangyong Yunshan is transitional between Jiangyong Chengguan and Jiangyong Xiacengpu.

Jiangyong Xiacengpu has 21 different dialects.

Jiangyong Huilongxu is the language that was the basis for the famous nüshu, “women’s script”, a secret language of women (Liming 2004), originating from the Shangjiangxu (Xiao River) region of Northeastern Jiangyong County in Hunan, of which much has been written lately.

Jiangyong Chengguan Yongzhou Tuhua, Jiangyong Taochuan Yongzhou Tuhua, Jiangyong Cushjiang Yongzhou Tuhua, and Jiangyong Huilongxu Tuhua also appear to be separate languages.

Jiangyong Cushjiang has nine dialects.

Jiangyong Taochuan has 34 dialects, but there is a lot of uniformity between them.

Jiangyong Huilongxu has two dialects.

Jianghua Sumitang Qidouhua Yongzhou Tuhua has a reasonably close relationship to Jiangyong Songbai Yongzhou Tuhua and Jiangyong Chengguan, and all three are thought to have derived from the same base. Although it is spoken in the same county as Jianghua Baimangying, it appears to be completely different, so it must be a separate language.

Jianghua Baimangying Yongzhou Tuhua also appears to be quite different, so it is probably a separate language also.

As the other eleven main lects in this group are separate languages,

Intelligibility between varieties is not known, but dialectal divergence within Tuhua varieties is typically great, and some or all of the above may be separate languages. There are clearly at least 18 different languages here, and there may be up to 31 different languages.

The third type is a Xintian Southern Rural Yongzhou Tuhua type.

The fourth type is a Ningyuan Yongzhou Pinghua type.

There is also a group of unclassified types comprising Xintian Northern Rural Yongzhou Tuhua, Ningyuan Zhangjia Yongzhou Tuhua, Lanshan Shangdong Yongzhou Tuhua, Lanshang Taiping Yongzhou Tuhua, Guiyang Liuhe Yongzhou Tuhua, Jianghua Baimangying Yongzhou Tuhua, and Shuangpai Lijiaping Yongzhou Tuhua.

Of these, Lanshang Tushi Yongzhou Tuhua may well be a separate language. Guiyang Yongzhou Liuhe Tuhua is probably part of a separate language also, as Guiyang is a county in Southeastern Hunan. Gangyu Yongzhou Tuhua, Xiangyu Yongzhou Tuhua, Lanshang Taiping Yongzhou Tuhua, and Shuangpai Lijiaping Yongzhou Tuhua appear to represent the names of separate counties, so no doubt each one is a separate language.

Xintian Northern Rural Yongzhou Tuhua is apparently completely different from Xintian Southern Rural Yongzhou Tuhua, so it is probably a separate language also.

Another Tuhua variety spoken in Yongzhou in the southern part of the region, Huasheng Southern Yongzhou Tuhua, may have as many as 75 different dialects inside of it. This is undoubtedly a separate language.

The Tuhuas of Southern Hunan appear to be Gan/Xiang mixed languages.

Luojin Chongshan Tuhua is spoken in Yongfu in Southern Guangxi. It has a close relationship to Guibei Pinghua. It is clearly a separate language.


Danzhou is a separate group of unclassified Chinese languages. Danzhou is spoken in the northwest of Hainan, and Hainanese speakers cannot understand it. It is either related to the language spoken by the Lingao people or is the same language.

Yet the Danzhou people speak nine different lects, including varieties described as Hakka, others described as Cantonese, and others described as Mandarin, so obviously there are at least three separate languages inside Danzhou. Let us call these Danzhou Cantonese, Danzhou Hakka and Danzhou Mandarin.

Lingling or Linghua is an unclassified language spoken in Longsheng County, Guangxi. Linghua is a separate language. It is spoken by 20,000 ethnic Hmong in Taiping, Pingdeng Township in Longsheng. It is spoken only by residents inside the city as a sort of secret language. Southwestern Mandarin is used with outsiders. The language is a mixture of Hmong and Southwestern Mandarin.

Junhua or Military Language is spoken in Taoyuan County and Liudui in Pingtung County in Taiwan; Lufeng County and Huizhou City in Guangdong; Sanya, Changjiang, Danzhou, Zonghe, and Lingao in Hainan; Guangxi; around Hakka speakers in Wuping County in Zhongshan, Fujian; and other places.

On a Mandarin base, Junhua adds Hakka, Cantonese and Taiwanese. It is considered to be an Old Mandarin language and is normally placed in Southwest Mandarin in a group called the Junhua Group, which contains four lects. But others say that different Military Language varieties are either Hakka or Gan. Wherever these varieties are spoken, they are not understood by people nearby.

Junhua seems to derive from a lingua franca spoken by soldiers in the Ming Dynasty Army and was widely learned and understood by all soldiers at the time. It bears a strong resemblance to Ming Era Chinese.

Military Language is not the same language in the various areas where it is spoken.

Huping Junhua, spoken by 16,000 people in Zhongshan, is not understood by the surrounding peoples and is not considered part of Hakka. The language began in the area in the 1390’s when the Ming Dynasty sent its army to Zhongshan to put down a rebellion. Soldiers came from all over China and remained in the area after the fighting, creating a new language out of all of their languages mixed together along with local lects. Actually this is thought to be more of a Gan language with Hakka influences.

Taiwanese Junhua in Taiwan is not the same language as the Military Language elsewhere. This language also has heavy Hakka influences, but it also has Min Nan, Mandarin and even Japanese influences. Some say this is a Hakka language.

Uncertain Affiliation/Possibly Not Sinitic

Maojiahua is a language spoken by 20,000 Hmong in the southwest of Hunan, in the northeast of Guangxi and in some areas of Hubei. Ethnologue originally listed this language as a form of Chinese, but it is now listed as Eastern Xiangxi Hmong. Another argument is that this is a Chinese language with heavy Hmong influence. As the matter is not yet settled and Ethnologue lists it as Hmong, we will not list it as Chinese.

Waxiang is an unclassified Chinese variety spoken by the Waxiang ethnic group in Luxi, Guzhang and Yongshun counties in Xiangxi Tujia and Miao Autonomous Prefecture, Zhangjiajie prefecture-level city in Dayong and Chenxi, Xupu and Yuanling Counties in Huaihua prefecture-level city in Northwestern Hunan. It is nothing like the Southwestern Mandarin, Xiang, Tujia and Xo Miao Hmong languages that surround it, and none of them can understand it. There are 362,000 speakers of Waxiang.

It shows some lexical influence from the Bai language, suggesting a Bai substratum. This is either an unclassified Chinese language or a separate minority tongue, maybe related to Hmong. Others view it as a Xiang-Hmong mixed language.


Ben Hamed, Mahé. 2005. Neighbour-nets Portray the Chinese Dialect Continuum and the Linguistic Legacy of China’s Demic History. Proc. R. Soc. B 272:1015–1022.
Bodman, Nicholas C. 1988. Two Divergent Southern Min Dialects of the Sanxiang District, Zhongshan, Guangdong. BIHP 59 (2): 401-423.
Branner, David. 2000. Problems in Comparative Chinese Dialectology. The Classification of Min and Hakka. Berlin: Walter de Gruyter.
Branner, David. 2008. Personal communication.
Campbell, Hilary. 2004. Chinese Grammar – Synchronic and Diachronic Perspectives. Oxford, UK: Oxford University Press.
Campbell, James Michael. Putonghua and Taiwanese Min Nan speaker. Taipei, Taiwan. 2009. Personal communication.
曹志耘 (Cao, Zhiyun). 2002. 南部吴语语音研究 (Southern Wu Phonology Research). Beijing: Commercial Press (In Chinese).
Chan, Marjorie K.M., Lee, Douglas W. 1981. Chinatown Chinese: A Linguistic and Historical Re-evaluation. Amerasia Journal, Volume 8, Number 1.
Cheng, Chin-Chuan. 1997. Measuring Relationships among Dialects: DOC and Related Resources. Computational Linguistics & Chinese Language Processing 2.1: 41-72.
Cheng, Chin-Chuan. 1998. Extra-Linguistic Data for Understanding Dialect Mutual Intelligibility. Taipei, Taiwan: Paper delivered at the 1998 Annual Conference of the Pacific Neighborhood Consortium.
De Souza, S. C. 1903. A Manual of the Hainan Colloquial Bunsio Dialect. Singapore.
Gilliland, Joshua. 2006. Language Attitudes and Ideologies in Shanghai, China. MA Thesis. Columbus, OH: Ohio State University.
Hirata, Shoji. 1998. Aspect: A General System and its Manifestation in Mandarin Chinese. Taipei: Student Book Company.
Johnson, Eric. 2010. SIL Electronic Survey Reports 2010-027: A Sociolinguistic Introduction to the Central Taic languages of Wenshan Prefecture, China. Dallas, Texas: SIL.
Kirinputra, Láñitri. Hokkien speaker. November 2014. Personal communication.
Lee, Kent A. 2002. Chinese Tone Sandhi and Prosody. MA Thesis. Urbana, IL: University of Illinois at Urbana-Champaign.
Lien, Chinfa. August 17-19, 1998. Denasalization, Vocalic Nasalization and Related Issues in Southern Min: A Dialectal and Comparative Perspective. International Symposium on Linguistic Change and the Chinese Dialects Dedicated to the Memory of the Late Professor Li Fang-kuei. Seattle, Washington.
Liming, Zhao. The Women’s Script of Jiangyong: An Invention of Chinese, in Jie, Tao; Zheng, Bijun; and Mow, Shirley L., editors. 2004. Holding up Half the Sky: Chinese Women Past, Present, and Future, Chapter 4. New York: Feminist Press at the City University of New York.
Mair, Victor H. 1991. What Is a Chinese ‘Dialect/Topolect’?  Sino-Platonic Papers: 29.
McKeown, Adam. 2001. Chinese Migrant Networks and Cultural Change: Peru, Chicago, Hawaii, 1900-1936. Chicago, IL: University of Chicago Press.
Ngù, George. Eastern Min speaker. 2009. Personal communication.
Olson, James Stuart. 1998. An Ethnohistorical Dictionary of China. Westport, CT: Greenwood Publishing Group.
Rickard, Kristine. 2006. A Linguistic-phonetic Description of Lanqi Citation Tones. Proceedings of the 11th Australian International Conference on Speech Science & Technology, pp. 349-353. Edited by Paul Warren & Catherine I. Watson. University of Auckland, New Zealand. December 6-8, 2006. Auckland, NZ: Australian Speech Science & Technology Association Inc.
Szeto, Cecilia. 2000. Testing Intelligibility among Sinitic dialects. Proceedings of ALS2K, the 2000 Conference of the Australian Linguistic Society.
Tek, Rohana. 2016. Cambodian Teochew speaker. July 2016. Personal communication.
Terng, Brice. Central Xianyou Puxian Min speaker. September 2016. Personal communication.
Thurgood, Graham. 2006. Sociolinguistics and Contact-induced Language Change: Hainan Cham, Anong, and Phan Rang Cham. Tenth International Conference on Austronesian Linguistics, January 17-20, 2006, Palawan, Philippines. Linguistic Society of the Philippines and SIL International.
Xun, Gong. Sichuan Mandarin and Putonghua speaker. Personal communication. September 2009.
Zheng, Rongbin. 2008. The Zhongxian Min Dialect: A Preliminary Study of Language Contact and Stratum-Formation, pp. 517-526. Edited by Chan, Marjorie K.M., and Kang, Hana. Proceedings of the 20th North American Conference on Chinese Linguistics (NACCL-20). Volume 1. Columbus, Ohio: The Ohio State University.

Filed under Asia, Cantonese, China, Chinese language, Comparitive, Dialectology, Language Classification, Language Families, Linguistics, Mandarin, Min Nan, Regional, Sinitic, Sino-Tibetan, Sociolinguistics

Is There a Language That is (Nearly) Impossible to Learn to Speak Without Growing up with It?

Answer from Quora

I recently talked to a man who is learning Min Nan, which is a Sinitic language often called a dialect of Chinese. He told me that Min Nan speakers say that the tones are so hard that no one who doesn’t grow up speaking Min Nan ever seems to get it very well.

Cantonese is a similar language that is very difficult. It is much harder than Mandarin, and many native Mandarin speakers say they tried to learn Cantonese and gave up on it because it was too hard. Cantonese has nine tones.

Basque is said to be very hard to learn unless you grow up with it. There is a joke that the Devil spent seven years trying to learn Basque, and he only learned how to say Hello and Goodbye.

Navajo would also be hard. Even Navajo children struggle quite a bit learning Navajo and don’t seem to get it well until maybe age 12. When Navajo children arrive at school, they often do not speak Navajo well yet.

Korean is a surprise, but apparently it is very hard to learn well. A native Korean speaker told me that Korean is so hard that no Korean speaker ever speaks it with 100% accuracy, and everyone makes errors.

Czech is also hard. Even most Czech speakers never get Czech all the way. They have TV contests in Czechoslovakia where they try to stump native speakers with hard forms in the language. If you can last 30 minutes without making even one error, you win. I think only two men have been able to do it, but one was a non-native speaker!

Piraha, spoken in the Brazilian Amazon, is also very hard. Over the course of a few centuries, several Portuguese-speaking priests tried to learn Piraha, but they all gave up because it was too hard. And these same priests had been able to master a number of other Indian languages, but Piraha was just too much. Daniel Everett learned the language and wrote important papers on it. He is one of the only non-native speakers who have been able to learn the language.

Tsez, spoken in the Caucasus, is also murderously hard. Every verb can have hundreds of thousands of possible forms. I understand that even native speakers make regular errors when speaking Tsez.

Filed under Altaic, Applied, Balto-Slavic, Balto-Slavic-Germanic, Basque, Brazil, Cantonese, Caucasus, Chinese language, Czech, Dene-Yenisien, Indo-European, Indo-Hittite, Isolates, Korean language, Language Families, Language Learning, Linguistics, Mandarin, Min Nan, Na-Dene, Navajo, Near East, Regional, Sinitic, Sino-Tibetan, Slavic, South America

Simplification of Language with Increasing Civilization: A Result of Contact or Civilization Itself

Nice little comment here on an old post, Primitive People Have Primitive Languages and Other Nonsense? 

I would like to dedicate this post to my moronic field of study itself, Linguistics, which holds as consensus many silly beliefs that have never been proven and are either untrue or probably untrue.

One of the idiocies of my field is this belief that in some way or another, most human languages are pretty much the same. They believe that no language is inherently better or worse than any other language, which itself is quite a dubious proposition right there.

They also believe, incredibly, that no language is more complex or simple than any other language. Idiocy!

Another core belief is that each language is perfectly adapted for its speakers. This leads to their rejecting claims that some languages are unsuitable for the modern world due to lack of modern vocabulary. This common belief about many minority languages is obviously true. Drop a Papuan in Manhattan, and see what good his Torricelli tongue does him. He won’t have words for most of the things around him. He won’t even have verbs for most of the actions he sees around him. His language is nearly useless in this environment.

My field also despises notions that some languages are better suited to poetry, literature or say philosophy than others or that some languages are more or less concise or exact than others or that certain concepts or ways of thinking are better expressed in one language as opposed to another. However, this is a common belief among polyglots, and I would not be surprised if it was true.

The question we are dealing with below is based on the notion that many primitive languages are exceedingly complex and the common sense observation that as languages acquire more speakers and civilization increases, one tends to see a simplification of language.

My field out and out rejects both statements.

They will tell you that primitive languages are no more complex than more civilized tongues and that there is no truth to the statement that languages simplify with greater numbers of speakers and increased civilization. However, I have put these two rejected notions to many non-linguists, and they all felt that these statements had truth to them. Once again, my field violates common sense in the name of the abstract and abstruse “we can’t prove anything about anything” scientific nihilism so common in the intellectually degraded social sciences.

Indeed, some of the most wildly complex languages of all can be found among rather primitive peoples such as Aborigines, Papuans, Amerindians and even Africans. Most language isolates like Ket, Burushaski and Basque are pretty wild. The languages of the Caucasus are insanely complex, and that region doesn’t exactly look like Manhattan. Siberian languages are often maddeningly complex.

Even in the remoter parts of China, language becomes highly differentiated and probably more complex. I know an American who was able to learn Cantonese and Mandarin who told me that at age 35, for an American to learn Hokkien was virtually impossible. He tried various schemes, but they all failed. He finally started to get a hold of the language with a strict eight hour a day study schedule. Anything less resulted in failure. Hokkien speakers that he spoke to said you needed to grow up speaking Hokkien to be able to speak the language well at all. By the way, this is another common sense notion that linguists reject. They say there are no languages so difficult that it is very hard to pick them up unless you grew up with them.

The implication here is that Min Nan is even more complex than the difficult Mandarin or even the forbidding Cantonese, which even many Mandarin speakers give up trying to learn because it is too hard.

Min Nan comes out of Fujian Province, a land of forbiddingly high mountains where language differentiation is very high, and there is often difficult intelligibility even from village to village. In one area, fifteen years ago an American researcher decided to walk to a nearby village. It took him six very difficult hours over steep mountains. He could have taken the bus, but that was a four-day trip! A number of these areas had no vehicle roads until recently, and others were crossed by vast rivers that had no bridges across them. Transportation was via foot. Obviously civilization in these parts of China is at a more primitive level, and it’s hard to develop Hong Kong-style cities in places with such isolating and rugged terrain.

It’s more like, “Oh, those people on the other side of the ridge? We never go there, but we heard that their language is a lot different from ours. It’s too hard to go over that range so we never go to that area.”

In the post, I theorized that as civilization increases, time becomes money, and there is a need to get one’s point across quickly, whereas more primitive peoples often spend no more than 3-4 hours a day working and the rest sitting around, playing and relaxing. A former Linguistics professor told me that one theory is that primitive people, being highly intelligent humans (all humans are highly intelligent by default), are bored by their primitive lives, so they enjoy their wildly complex languages and like to relax, hang out and play language games with them to test each other on how well they know the structures. They also like to play tricky and maybe humorous language games with their complicated languages. In other words, these languages are a source of intellectual stimulation and entertainment in an intellectually impoverished area.

Of course, my field rejects this theory as laughably ridiculous, but no one has disproven it yet, and I doubt if the hypothesis has even been tested, hence it is an open question. My field even tends to reject the notion of open questions, preferring instead to say that anything not proven (or even tested for that matter) is demonstrably false. That’s completely anti-scientific, but that’s the trend nowadays across the board as scientistic thinking replaces scientific thinking.

Of course this is in line with the terrible conservative or reactionary trend in science where Science is promoted to a fundamentalist religion, and scientists decide that various things are simply proven true or proven not true. Attempts to change the consensus paradigm are regarded derisively or with out and out fury and rage, and such attempts are rejected via endless moving of goalposts with the goal of making it impossible ever to prove the hypothesis. If you want to see an example of this in Linguistics, look at the debate around Altaic. They have set it up so that no matter how much evidence we are able to gather for the theory, we will probably never be able to prove it, as barriers to proof have been set up to make the question nearly unprovable.

It’s rather senseless to set up Great Wall of China-like barriers to proof in science because at some point, you are hardly proving anything new, apparently because you don’t want to.

Fringe science is one of the most hated branches of science and many scientists refer to it as pseudoscience. Practitioners of fringe science have a very difficult time as the Scientific Establishment often persecutes them, for instance trying to get them fired from professorships. Yet this Establishment is historically illiterate because many of the most stunning findings in history were made by widely ridiculed fringe scientists.

The commenter below rejects my theory that increased civilization itself results in language simplification, as it gets more important to get your point across as quickly as possible with increasing complexity and development of society. Instead he says civilization leads to increased contact between speakers of different dialects or languages, and in such cases, language must be simplified, often dramatically, in order for any decent communication to occur. Hence increased contact, not civilization in and of itself, is the driver of simplification.

I like this theory, and I think he may be onto something.

To me the simplification of languages of more ‘civilized’ people is mostly a product of language contact rather than of civilization itself. If the need arises to communicate with foreign people all of the time, for example in trade, then the language must become more simple in order to be able to be understood by more people.

Also population size matters a lot. It has been found that the greater the number of speakers, the greater the rate of language change. For example Polynesian languages, although having been isolated centuries or even millennia ago, still have only minor differences from one another.

In the case of many speakers, not all will be able to learn all the rules of a language, so they will tend to use the most common ones. And if the language is split in many dialects, then speakers of each dialect must find a compromise in order to communicate, which might come out as simple. If we add sociolects, specific registers for some occasions, sacred registers, slang etc, something that will arise in a big and stratified civilization, then the linguistic barriers people will need to overcome become greater. So it is just normal for this system to simplify after some centuries.

We don’t need to look farther than Europe. Most languages of the western half, being spoken in countries with strong trade links to one another and with much of the world later in history, are quite analytic, but the languages of the more isolated eastern part are still like the older Indo-European languages. Basques, living in a small isolated pocket in the Iberian Peninsula, have kept a very complex language. Icelanders, also due to isolation, have kept a quite conservative Germanic language, whereas most modern Germanic languages are ridiculously simplified. No one in his right mind can argue that Icelanders are primitives.

On the other hand, Romanian, being spoken in the more isolated Balkans, has retained more of the complex morphology of Latin compared to West Romance languages. And of course advance of civilization won’t automatically simplify the language, as Turkish and Russian, both quite complicated languages compared to the average European tongue, don’t seem to give up their complexity nowadays.

On the other hand, indigenous people were living in a much more isolated setting compared to the modern world, the number of speakers was comparatively low, and there was no need to change. Also, neighboring tribes were often hostile to one another, so each tribal group sought to make itself look special. That is the reason why places with much inter-tribal warfare like New Guinea have so many languages which are so different from one another. When these languages need to communicate, we get ridiculously simple contact languages like Hiri Motu.
So language simplification is more a result of language contact rather than civilization itself.


Filed under Aborigines, Altaic, Amerindians, Anthropology, Applied, Asia, Basque, Cantonese, Caucasus, China, Chinese language, Cultural, Dialectology, Europe, Germanic, Indo-European, Isolates, Language Families, Language Learning, Linguistics, Mandarin, Min Nan, Near East, Papuans, Race/Ethnicity, Regional, Russian, Science, Siberian, Sinitic, Sino-Tibetan, Sociolinguistics, Turkic, Turkish

Massive Update of A Reworking of Chinese Language Classification

My Internet enemies (you know who you are) love to rip me to pieces over this stuff, but I suspect that is because they operate under the cover of anonymity, which combines with the general loud-mouthed jerk “troll culture” of the Internet to produce a Linguisticus Sociopathicus that is seldom found in the hallowed halls of reserved academe.

The funny thing is, if this Chinese work is so horrible, why has it earned praise from some of the world’s top Sinologists, who in fact actually assisted me with the project? Perhaps they should answer that. If I “know less about Linguistics than a Linguistics 10 student,” then why do I sit on the review board of a peer-reviewed linguistics academic journal? Why did an 80 page paper of mine that will soon be published in a book make it through two peer reviews and a dozen editors, including some of the world’s top Turkologists?

The funny thing is that I get along pretty well with other linguists outside of the Internet. We work together calmly, chat about this, that and the other, share papers and gather information from each other, all the things that academics do. I even get addressed as Dear Colleague. And then on the Internet, suddenly I’m so stupid I don’t know what a verb is. Whatever.

Anyway, a huge project of mine, A Reworking of Chinese Language Classification, has received a massive update. It underwent a ton of fixes, a lot of dead links were removed, and many matters were cleared up or explained better. Also the language count jumped by about 200 from ~360 to 573. Now some of these may not be full languages, and I may be exaggerating, but I believe that using the 90% intelligibility criterion, there are a good 2,000 separate languages within Sinitic alone.

We simply cannot carve them out because the Chinese government will go crazy, and no Sinologist wants to make the Chinese government mad. The Chinese government lies and says there is one Chinese language with 3,000+ dialects in it, including such massive lects as Cantonese, Hakka, Min, Hui, Wu, Ping, Gan and Jin. Not to mention that Mandarin itself is of course not a single language but is actually a collection of scores or more languages inside of itself.

The project involves a brief description in English of the Chinese lects, stating such things as names, where they are spoken, the number of speakers, classification, degree of endangerment, linguistic history and development, classification issues, mutual intelligibility issues, dialects within, membership in language groups, the language/dialect question, anthropological history, sociolinguistic issues historical and modern, future trends, controversies, and sometimes more arcane linguistic data.

I am not trying to brag here and I am not real familiar with the literature, but my account of Chinese dialects is the most thorough such account I have ever run across so far in English. Now there may be better publications out there, but I am not aware of them. Further, most do not seem to have tackled the dialect vs. language problem.

Almost all of the good material on this stuff is in Chinese, and I do not read Chinese, so this caused massive problems. Still, I seem to be able to deal with them OK: a lot of the research that I referenced was in Chinese, and I was able to make my way through it well enough to get the gist despite the language barrier. I have also come up with a few native speaker informants who have given me excellent information on their particular lects. For instance, I recently ran into a speaker of something called Cambodian Teochew (I had no idea such a thing existed) who told me that the four SE Asian Teochew lects, Malay Teochew, Thai Teochew, Cambodian Teochew and Vietnamese Teochew, were not mutually intelligible. That is, there are four separate languages within Overseas Teochew alone! Unbelievable.


Filed under Asia, Cantonese, China, Chinese language, Comparitive, Dialectology, Government, Language Classification, Language Families, Linguistics, Mandarin, Regional, Sinitic, Sino-Tibetan, Sociolinguistics

What Race Is This Person (Singapore)?


An interesting phenotype from Singapore.

This is the aunt of a friend of mine. The family is from Singapore. They are part of an ethnic group called the Peranakans, a Southern Chinese group that moved to Malaysia ~600 years ago for some reason, possibly due to overcrowding in Fujian or worse, the terrible wars that periodically raged through the region.

Chinese groups have been leaving from this part of Southern China for a very long time now, especially in the last 200 years. In the past couple of centuries, this part of China has become very crowded. Possibly as a result, wild and vicious wars periodically raged through the area, sometimes killing 100,000’s of people. If you study Chinese history, you will hear about these wars a lot. It is not uncommon to read that invaders conquered several large cities and exterminated the whole populations of perhaps 300,000 people, men, women and children. This is how the Chinese have often fought wars. Chinese wars are unbelievably vicious and savage.

The Peranakans moved to Malaysia, and over time, bred in with Dutch and Portuguese and to a lesser extent British Europeans. All three were colonists in the region. I believe that they were Min speakers, but their Hokkien has gotten so changed, in particular from massive borrowings from Malay, that these languages in general are no longer intelligible with Amoy or Taiwanese Hokkien Proper.

Most Peranakans now are somewhat Eurasian, Chinese crossed with Dutch, Portuguese and sometimes British. The Peranakans had their own patriarchal culture and were known as very hard workers, often at manual labor type jobs like farming, timber harvesting or working on rubber plantations. They committed little crime and had very orderly societies. The European colonists marveled at their high level of civilization. They did keep slaves, but they probably treated their slaves better than any slaves have ever been treated, and in many cases, slaves were freed.

Over time, most Peranakans also bred in with Malays. Peranakans are now a Chinese/Malay/European race, but the Asiatic tends to be prominent over the European in the stock. The mixing of cultures over 600 years in Malaysia resulted in some very interesting fine cuisine.

Many of these Chinese migrated to Singapore, where they, along with Teochew speakers (another Min group) and a large group of Cantonese Chinese, form what is known as the Singaporean Chinese, one of the wealthiest and most economically advanced ethnic groups on Earth. There is still a division of labor in Singapore, with Chinese on top, Malays on the bottom, and Southern Indian Dravidian speakers in between. Nevertheless all three groups are substantially mixed by this point. Most Chinese have Malay blood, and a lot of Malays have some Chinese in them. Malays and Indians are now intermarrying quite a bit. There is some ethnic conflict but not a lot possibly due to the wealth and everyone being so mixed.

Although this woman has a somewhat archaic phenotype (note prognathism), these archaic types are fairly common in Southern China. Many can be seen in the mountains of Yunnan Province. The archaism may be due to incomplete transition from Australoid -> Mongoloid, as the transition happened much later in Southern China than in Northern China, and prominent Australoid types were common in the far south of China only 3-4,000 YBP.

I also believe that this woman may be admixed with Caucasian. And I think the Malay admixture is quite clear. Perhaps I am mistaken, but I think I see some Vedda influence here. That would not be unusual, as Malays were Veddoids only until quite recently, and the Senoi are Veddoids to this day. The Mani Negritos are also still extant.

The transition in Malaysia went from Australoid Negritos (Mani) and Orang Asli -> Australoid Veddas (Senoi) -> Paleomongoloid Southeast Asians (modern Malays). The Malays appear to be aware of this transition, as they state that the Mani and Orang Asli are their ancestors. The bloodline of the Orang Asli goes back 72,000 YBP, so this group has been present in Malaysia since the very first Out of Africa groups, and their archaism is about on a par with the Andaman Islanders, another Australoid group which is also the remains of some of the earliest OOA groups.


Filed under Andaman Islanders, Anthropology, Asia, Asian, Asians, Cantonese, China, Chinese, Chinese (Ethnic), Chinese language, Colonialism, Cultural, Culture, Dutch, English, Europeans, History, Language Families, Linguistics, Malays, Malaysia, Mixed Race, Negritos, Physical, Political Science, Portuguese, Race/Ethnicity, Regional, SE Asia, SE Asian, SE Asians, Singapore, Sinitic, Sino-Tibetan, Sociology, War

A Look at the Cantonese Language

Method and Conclusion. See here.

Results. A ratings system was designed in terms of how difficult it would be for an English-language speaker to learn the language. In the case of English, English was judged according to how hard it would be for a non-English speaker to learn the language. Speaking, reading and writing were all considered.

Ratings: Languages are rated 1-6, easiest to hardest. 1 = easiest, 2 = moderately easy to average, 3 = average to moderately difficult, 4 = very difficult, 5 = extremely difficult, 6 = most difficult of all. Ratings are impressionistic.

Time needed. Time needed for an English language speaker to learn the language “reasonably well”: Level 1 languages = 3 months-1 year. Level 2 languages = 6 months-1 year. Level 3 languages = 1-2 years. Level 4 languages = 2 years. Level 5 languages = 3-4 years, but some may take longer. Level 6 languages = more than 4 years.
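The scale and the time estimates can be restated as a small lookup table. This is only a convenience sketch in Python: the names `RATINGS`, `TIME_NEEDED` and `describe` are made up for illustration, and the handling of half-step ratings is an assumption, since the scale itself does not say how to treat them.

```python
import math

# Labels and time estimates copied from the scale above.
RATINGS = {
    1: "easiest",
    2: "moderately easy to average",
    3: "average to moderately difficult",
    4: "very difficult",
    5: "extremely difficult",
    6: "most difficult of all",
}
TIME_NEEDED = {
    1: "3 months-1 year",
    2: "6 months-1 year",
    3: "1-2 years",
    4: "2 years",
    5: "3-4 years, but some may take longer",
    6: "more than 4 years",
}


def describe(rating):
    """Return (label, time estimate) for a rating from 1 to 6.

    Assumption: a half-step rating such as 5.5 is looked up under the
    level below it (5). The scale does not define half steps.
    """
    level = math.floor(rating)
    if level not in RATINGS:
        raise ValueError("rating must be between 1 and 6")
    return RATINGS[level], TIME_NEEDED[level]
```

For example, `describe(4)` gives the label and time estimate for a Level 4 language.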

This post will look at the Cantonese language in terms of how difficult it would be for an English speaker to learn it.


Cantonese is even harder to learn than Mandarin. Cantonese has eight tones to Mandarin’s four, and in addition, it continues to use a lot of the older traditional Chinese characters that were superseded when China moved to a simplified script in 1949. Furthermore, since non-Mandarin characters are not standardized, Cantonese cannot be written down as it is spoken.

In addition, Cantonese has verbal aspect, possibly up to 20 different varieties. Modal particles are difficult in Cantonese. Clusters of up to 3 sentence final particles are very common. 我食咗飯 and 我食咗飯架啦喎 are both grammatical for I have had a meal, but the particles add the meaning of I have already had a meal, answering a question or even to imply I have had a meal, so I don’t need to eat anymore.

Cantonese gets a 5.5 rating, nearly hardest of all.


Filed under Applied, Cantonese, Chinese language, Language Families, Language Learning, Linguistics, Sinitic, Sino-Tibetan

Is There a Language That Is Almost Impossible to Learn Without Growing Up with It?

A question was recently asked on Quora. Here is my answer.

Hello, I recently talked to a Westerner who is learning Min Nan, which is a Sinitic language often called a dialect of Chinese. He already speaks Mandarin, but he told me Min Nan is vastly harder than Mandarin. At age 35, he was studying it 2 hours a day, and at some point, he hit a wall, and he didn’t seem to be making any progress. He kept adding more study hours to the day – four hours, six hours – with little effect. Finally when he was studying it for eight hours a day, he started making some good progress. I believe he said contour tones and tone sandhi were the major roadblocks.

Min Nan speakers say that even Cantonese is easier than Min Nan, and Cantonese is deadly hard. They also say that Min Nan tones are so hard that no one who did not learn Min Nan growing up gets anywhere near native fluency.

Cantonese is a similar language that is very difficult. It is much harder than Mandarin, and many native Mandarin speakers say they tried to learn Cantonese and gave up on it because it was too hard. Cantonese has 9 tones. The general consensus among Chinese is that Cantonese is much harder to learn than Mandarin.

Basque is said to be very hard to learn unless you grow up with it. There is a joke that the Devil spent seven years trying to learn Basque, and he only learned how to say Hello and Goodbye.

Navajo would also be murderously hard. Even Navajo children struggle quite a bit learning Navajo. When they show up at school at age 5-6, they are still struggling with Navajo. There are reports that Navajo children don’t seem to get Navajo well until maybe age 12.

Korean is a surprise, but apparently it is very hard to learn well. A native Korean speaker told me that Korean is so hard that no Korean speaker ever speaks it with 100% accuracy, and everyone makes errors.

As another respondent pointed out, Japanese is also quite notorious, and most Westerners get nowhere near native fluency.

Czech is also hard. Even most Czech speakers never get Czech all the way. They have TV contests in Czechoslovakia where they try to stump native speakers with hard forms in the language. If you can last 30 minutes without making even one error, you win. I think only two men have been able to do it, but one was a non-native speaker! Czech also has a strange r sound found only in one other language on Earth. It is said that no native speaker ever gets this phoneme quite right.

Piraha is also very hard as another respondent pointed out. Only two non-natives have ever been able to speak Piraha with any fluency. When Daniel Everett went to study the language, he found a number of reports from priests who had tried to learn Piraha since the early 1800’s, and only one had succeeded. The others tried to learn but gave up because they said it was too hard.

Tsez, spoken in the Caucasus, is also murderously hard. Every verb can have tens of thousands of possible forms. Reports say that even native speakers make regular errors when speaking Tsez.


Filed under Altaic, Applied, Balto-Slavic, Balto-Slavic-Germanic, Basque, Cantonese, Chinese language, Czech, Dene-Yenisien, Indo-European, Indo-Hittite, Isolates, Korean language, Language Families, Language Learning, Linguistics, Mandarin, Min Nan, Na-Dene, Navajo, NE Caucasian, Sinitic, Sino-Tibetan, Slavic, Tsez

Some Arguments Against Using Mutual Intelligibility as a Criterion in Linguistics

KIRINPUTRA writes in response to this piece:

I think Lindsay is right in using mutual intelligibility as the criterion for determining what’s a language. I also think that intelligibility can be real tough to measure, and that something should be said for the kind of situation where mutual unintelligibility is only temporary, i.e. where a week of exposure has the speakers off and running.

As Campbell puts it, “But the question remains, does one actually have to specifically pick out and learn new phrases on their way to learning or can you pick them up in passing assuming to understand?”

So languages A and B are mutually unintelligible, but speakers become able to understand each other after a week of steady contact. Languages C and D are mutually unintelligible, and speakers still can’t understand each other after months of steady contact, unless they learn each other’s language or use a third language. Do we treat both situations the same and call them different languages? I think that’s worth thinking about.

Campbell brings up another valid point: attitudes influence intelligibility. Part of this is raw, conscious effort. Part of this is psychological and pretty much subconscious.

Another point that nobody has brought up yet is topic dependency. Mutual intelligibility usually varies depending on what the speakers are trying to talk about. A “deep” Taiwanese Hokkien speaker and a “deep” Medan (Sumatra) Hokkien speaker could probably understand each other reasonably well across a wide range of household and agricultural topics, but if it came to fixing a car or a motorbike, they’d be speaking different languages, in effect.

The task of quantifying intelligibility gets harder if we wanna pin this down. Maybe a “basket of topics” concept could be advanced, kind of like the “basket of goods and services” concept used to measure inflation.
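One way to picture the “basket of topics” idea is as a weighted average, the same way a price index weights goods and services. Here is a rough sketch in Python. The function name, the topics, the weights and the per-topic scores are all invented for illustration; none of them are measured values.

```python
def basket_intelligibility(scores, weights):
    """Weighted mean of per-topic intelligibility scores (0-100).

    scores:  topic -> percent understood on that topic
    weights: topic -> share of everyday talk given to that topic
    Both dicts are hypothetical inputs; nothing here is real data.
    """
    total_weight = sum(weights[t] for t in scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(scores[t] * weights[t] for t in scores) / total_weight


# Invented numbers: high overlap on household talk, low on repairs.
scores = {"household": 90.0, "farming": 85.0, "motorbike repair": 30.0}
weights = {"household": 0.5, "farming": 0.3, "motorbike repair": 0.2}
overall = basket_intelligibility(scores, weights)
```

The hard part, of course, is not the arithmetic but deciding which topics go in the basket and how much weight each one gets.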

There’s a video on Youtube where two Siam Thai speakers go up into central Guangxi and try to communicate w/ Zhuang speakers speaking only Siam Thai. First it doesn’t work, then it starts working. They realize that it only works when the topic is one that’s heavy on shared vocabulary.

Based on intelligibility criteria, how many languages is Hokkien (what Lindsay calls “Xiamen”)? A lot of Penang Hokkien would go over a Taiwanese Hokkien speaker’s head at first exposure, just b/c of intrinsic linguistic differences. Typically, there would also be a lack of effort on the part of the Taiwanese speaker to understand a non-Taiwanese form of Hokkien.

Even beyond this, psychologically, both sides (but esp. the Taiwanese) have a hard time acknowledging an unfamiliar form of their familiar Hokkien tongue. Due to subconscious psychological reasons and a lack of effort, they may honestly not be able to understand each other (assuming the Penang speaker is one of the few with no Taiwanese Hokkien media intake). The shared vocabulary, collocations, idioms, etc., though, are definitely enough for them to understand each other w/ just an attitude adjustment.

Yet, I don’t think the shared vocabulary and grammar are “good enough” to establish that PngHk and TWHk are dialects of the same language. How do we really know? What strikes me as being much better evidence is having witnessed TWHk and PngHk speakers communicating effectively in their respective dialects w/o having to resort to another language – even though such encounters have typically resulted in a quick switch to Mandarin as of the last 10 or 15 years or so.

Intelligibility is tricky to quantify, no doubt; but lexical and syntactic similarity have got to be even trickier to measure in any meaningful way.

I have to take exception with a couple of Campbell’s minor points. They sound suspiciously like the stuff you read in papers by some (not all) Chinese scholars.

Campbell says, “Fangyan we have determined as topolect, but as used many centuries ago could also refer to any language of a different region. Today it has a specific use and currently applies to a “county”, notwithstanding the fangyan of neighboring counties may be the exact same thing.”

I don’t know what Campbell means by “today it has a specific use”. It’s not only common for laypeople to use “fangyan” to refer to the speech of a province or any other region, it’s also pretty common for scholars to spit out collocations like “Yue (~ Cantonese) fangyan”, never mind that “Yue” is a group of languages spoken across two provinces of China and taking in at the very, very least three mutually unintelligible languages.

Campbell also says, “It reminds me of Sinoxenic borrowings of Chinese words into neighboring Korean, Japanese, and Vietnamese which all now have approximately 60% of their core lexicon borrowed from Chinese. But these languages belong to other families and developed separately…”

This is kind of begging the question. What if the North Chinese political grip on Vietnam was somehow renewed? Sure enough, Vietnamese would continue to absorb “Chinese” elements deeper and deeper into its lexicon and structures, to the point where a linguist from the “modern” linguistics tradition would say it was a Chinese language.

And indeed the evidence seems to reveal that this is exactly how Hokkien, Teochew, Hailamese, Wenzhou, Hoisan (Taishan), etc. “became” Chinese languages. The best paper I’ve seen on this was by a Chinese scholar named Pan Wuyun (潘悟云). What’s Sinoxenic? Who was neighboring what? What’s core lexicon? Who developed separate and who developed together, and where and when? These are unresolved questions, not the open-and-shut case that most linguists in the field (even many non-Chinese) seem to think it is.

Campbell is probably right in saying, “Hua is usually tacked on to a place name. The “speech” of a particular place as long as there are no others competing (for example Nanning in Guangxi has several languages).” I would add that competing languages w/i counties is the rule rather than the exception throughout tropical and coastal subtropical China.

The tendency in each area (not necessarily just one county) with competing languages is for each language to go by a two or three syllable nickname where the last syllable is usually 話 (hua in Mandarin). Cantonese (but not the Hoisan type) is usually known as 白 hua. Hokciu (a.k.a. Foochow) is known locally as 平 hua (exact same name as Tuhua). In the Leizhou area, 海 hua and 黎 hua are two distinct “Min” varieties, reportedly mutually intelligible only w/ each other or at most also w/ some type of Hailamese / Hainanese Min.

Speaking of which, a primer on Hailamese was published about a century ago in Singapore. The author (de Souza) explains in the introduction which dialect of Hailamese the book is based on, and says that dialects of Hailamese from the other side of the island are “perfectly impossible to understand”. So there may actually be more than one language w/i just Hailamese Min.

Finally, about the Chinese scholars falling down on the job. I would say that, first of all, they generally don’t think this is their job. To them, “Chinese” is basically “assumed” to be one language. U could just call that shoddy academics. Secondly, though, some Chinese scholars are doing a pretty good job, such as Pan Wuyun.

In the Anglo tradition, a guy like Pan Wuyun would come out at some point with a “come-on-and-own-up, most-of-all-y’all-is-wrong” paper. But unfortunately that kind of thing is really rare in China. And so it’s left to foreign scholars or guys like Lindsay or myself to say this, w/ the disclaimer (at least in my case) that there are many individual decent scholars in China too.

The truth is that among most linguists, mutual intelligibility is not a controversial topic. There are a few loudmouths who scream that it cannot be measured, but to most of us linguists it is a ho-hum subject, not the source of a lot of screaming and yelling. Most of the tumult comes from outside the field, amateurs or simply ignorant people who are not linguists. They usually bring up all sorts of arguments, but in the field, we do not worry much about any of these rejoinders.

Often we will do more than one study. If the results are different, we just average them together to get a mean.

Surely attitude matters, but if you test enough people, all of that levels out. You have some that really want to understand the other language and others who just give up easily. You average them all together and get a mean for the population.
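The averaging is simple arithmetic. As a sketch in Python, with hypothetical function names, and using the 90% criterion (below 90% = separate language, above 90% = dialect) as the cutoff:

```python
from statistics import mean

THRESHOLD = 90.0  # percent; the language/dialect cutoff


def study_mean(listener_scores):
    """Average the percent-understood scores of all listeners in one study."""
    return mean(listener_scores)


def classify(studies):
    """Average the study means, then compare against the cutoff.

    Assumptions: each study counts equally whatever its sample size,
    and a result of exactly 90 is treated as a separate language.
    """
    overall = mean(study_mean(s) for s in studies)
    label = "dialect" if overall > THRESHOLD else "separate language"
    return overall, label
```

Eager listeners and listeners who give up easily both go into the same list of scores, which is how attitude levels out across a large enough sample.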

There are not many languages that can be learned after only a week of contact. And if there were, we would not say they were mutually unintelligible. Even very closely related languages like Azeri and Turkish take about 3-4 weeks of close contact before they are communicating pretty well.

I have an informant in China in Hubei Province. She said about every third city over was a new Mandarin language, and you could learn the new language after about 3 weeks of close contact.

In Africa, they have a concept called 1 day languages and 2 day languages because that is how long it takes to learn them. These would not be considered languages because they are too easily learned.

As an example, I have heard Latin Americans say that when they fly into El Salvador in the morning, they don’t understand all of what the Salvadorans around them are saying, and the Salvadorans do not understand everything they are saying. However, by the end of the day, everyone is drinking and slapping each other on the back and they all understand each other.

So Salvadoran Spanish could be considered a 1 day language. Salvadoran Spanish is a dialect of the Spanish language, not a separate language.

About topic dependency: we usually test for mutual intelligibility by playing a relatively neutral recording of someone speaking in the language. I suppose you could use a video too. You cannot use two people trying to talk to each other because then you have all of this extralinguistic coaching going on that interferes with the result and makes it higher than it is.

“Due to subconscious psychological reasons and a lack of effort, they may honestly not be able to understand each other (assuming the Penang speaker is one of the few with no Taiwanese Hokkien media intake). The shared vocabulary, collocations, idioms, etc., though, are definitely enough for them to understand each other w/ just an attitude adjustment.”

This has been brought up by a well-known linguist as a complaint to me against using native speaker knowledge as a criterion for mutual intelligibility. He told me we could not rely on native speakers to tell us how much they understand of another language because, well, native speakers lie. Instead we could only rely on the knowledge of linguists.

He gave the example of two groups that understand each other very well but hate each other so much that they say they can’t understand the speech of the other people even though they can. In other words, they lie. Realistically, I have been studying mutual intelligibility for a long time now (in fact, I am a bit of an expert in it) and I have yet to come across this situation. This really is just a red herring.

“Yet, I don’t think the shared vocabulary and grammar are ‘good enough’ to establish that PngHk and TWHk are dialects of the same language. How do we really know? What strikes me as being much better evidence is having witnessed TWHk and PngHk speakers communicating effectively in their respective dialects w/o having to resort to another language – even though such encounters have typically resulted in a quick switch to Mandarin as of the last 10 or 15 years or so.”

That doesn’t really count. You might be looking at an intelligibility situation of 80-85% between those Hokkien lects. Also we do not look at two speakers negotiating a conversation because that throws in new variables.

For inherent intelligibility, we want someone listening to a recording or watching a video. Quite a few speakers of very closely related languages (and some not so closely related) can negotiate the sort of conversation described above. Yet the fact that they both revert to Mandarin instead of carrying on in different Hokkien forms implies we are dealing with two separate languages here. They abandoned their own tongues and switched to common Mandarin presumably because there are too many misunderstandings when they use their Hokkien varieties.

“Intelligibility is tricky to quantify, no doubt; but lexical and syntactic similarity have got to be even trickier to measure in any meaningful way.”

Not really, we have many measures of lexical similarity and we use them all the time. We also measure syntactic and morphological differences – variations in grammar. A lot of linguists decide that two tongues are different languages simply based on the fact that they are too far apart – structurally separate languages.
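To give a feel for what one such measure looks like, here is a minimal sketch in Python of a simple one: the percentage of meanings on a shared wordlist where the two lects use cognate words. This is only one of many possible measures, not necessarily the one used in any particular study. The function name and the cognate judgments below are invented for illustration.

```python
def lexical_similarity(list_a, list_b):
    """Percent of shared meanings whose words fall in the same cognate class.

    list_a, list_b: meaning -> cognate class label, as judged by a linguist.
    Only meanings present in both lists are compared.
    """
    shared = [m for m in list_a if m in list_b]
    if not shared:
        raise ValueError("the two wordlists have no meanings in common")
    same = sum(1 for m in shared if list_a[m] == list_b[m])
    return 100.0 * same / len(shared)


# Invented cognate judgments for two unnamed lects.
lect_a = {"water": "A", "fire": "A", "eat": "A", "dog": "A"}
lect_b = {"water": "A", "fire": "B", "eat": "A", "dog": "A"}
similarity = lexical_similarity(lect_a, lect_b)
```

Real wordlists run to a hundred items or more, and the cognate judgments themselves take expert work, but the final figure is just a percentage like this one.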



Filed under Applied, Cantonese, Chinese language, Dialectology, Language Families, Language Learning, Linguistics, Mandarin, Multilingualism, Sinitic, Sino-Tibetan, Sociolinguistics

A Look at the Chinese Language

From here.

This post will look at how hard it is to learn Chinese for an English speaker.

It’s fairly easy to learn to speak Mandarin at a basic level, though the tones can be tough. This is because the grammar is very simple – short words, no case, gender, verb inflections or tense. But with Japanese, you can keep learning, and with Chinese, you hit a wall, often because the isolating syntactic structure is so strangely different from English.

Actually, the grammar is harder than it seems. At first it seems simple, like a simplified English with no tense or articles. But the simplicity makes it difficult. No tense means there is no easy way to mark time in a sentence. Furthermore, tense is not as easy as it seems. Sure, there are no verb conjugations, but instead you must learn some particles and special word orders that are used to mark tense.

Once you start digging into Chinese, there is a complex layer under all the surface simplicity. There are serial verbs, a complex classifier system, syntax marked by something called topic-prominence, preposed relative clauses, use of verbs rather than adverbs to mark direction, and all sorts of strange stuff. Verb complements can be baffling, especially potential and directional complements. The 了 character can have seemingly countless meanings. You also need to learn quite a bit of vocabulary just to speak simple sentences.

Chinese phonology is not as easy as some say. There are too many instances of the zh, ch, sh, j, q, and x sounds in the language such that many of the words seem to sound the same. There is a distinction between aspirated and nonaspirated consonants which does not exist in English.

Chinese orthography is probably the hardest orthography of any language. The alphabet uses symbols, so it’s not even a real alphabet. There are at least 85,000 symbols and actually many more (although this is controversial), but you only need to know about 4-6,000 of them, and many Chinese don’t even know 1,000. To be highly proficient in Chinese, you need to know 10,000 characters, and probably less than 5% of Chinese know that many.

The Communists tried to simplify the system (simplified Mandarin), but they simply decreased the number of strokes needed for each symbol. The Communists’ spelling reform left much to be desired.

To make matters worse, there are different ways to write each symbol – different styles of Chinese calligraphy. For instance, Classical Chinese may be written in so called “grass-style” calligraphy or in another style altogether.

It’s a real problem when you encounter a symbol you don’t know because there is often no good way to sound out the word as the system simply is not very phonetic. The Chinese alphabet is probably only 25% phonetic, and many frequently-used characters tell you nothing about how to pronounce them. Further, you need to learn at least 300 characters before you can start to use the meager phonetics of the writing system at all.

Furthermore, word boundaries are not obvious, as one character does not necessarily equal one word. Therefore it is hard to tell where one word starts and stops and another one begins.

Similarly, a dictionary is not necessarily helpful when trying to read Chinese. You can have a Chinese sentence in front of you along with a dictionary, and the sentence still might not make sense even after looking it up in the dictionary.

Furthermore, merely learning how to look up words in the dictionary in the first place takes new Chinese learners several months and learning how to use a dictionary well is typically not possible until a year of study. Even people who have studied for several years sometimes encounter characters that they simply cannot find in the dictionary. In China, dictionary look-up contests are often held, showing that the process is not transparent at all.

A good student of Chinese often has more than one dictionary, and some have up to 20 different dictionaries. There are separate dictionaries for simplified and traditional characters and dictionaries that have both. There are entire dictionaries just for Classical Chinese particles and others for four character idioms (chéngyǔ), a type of allegorical sayings with two parts (xiēhòuyǔ), and another for proverbs (yànyǔ). There are separate dictionaries for terms that entered Chinese during the Chinese era and others for specifically Buddhist terms. There is an easier way to use a Chinese dictionary called four-part look-up, but it takes a long time to learn it and most learners never master it for whatever reason.

To solve all of these problems with the ideographic writing system, numerous romanization schemes have been invented. At last count, there were a dozen or so of them, but a number of those are rarely used. Certainly, there are 2-3 heavily used ones and that is not counting the bopomofo phonetic alphabet used in Taiwan. One of the main problems with these romanization systems is that none of them are very good and they all have serious limitations. Furthermore, the romanization system you studied as a Chinese learner tends to affect your accent in Chinese.

Writing the characters is even harder than reading them. One wrong dot or wrong line either completely changes the meaning or turns the symbol into nonsense. The writing system is often so opaque that even native speakers forget how to write the characters of even commonly used words.

Even leaving the characters aside, the stylistic and literary constraints required to write Chinese in an eloquent or formal (literary) manner would make your head swim. And just because you can read Chinese does not mean that you can read Classical Chinese (wenyanwen) prose. It’s actually written in a different language, so to learn to read Chinese properly like an educated Chinese person does, you will have to learn not one language but two.

One rejoinder is that Classical Chinese to Chinese people is similar to Greek and Latin to an English speaker, but this is a bad analogy, as Classical Chinese is widely studied in Chinese secondary schools and some of the finest Chinese prose is written in this language (see the Confucius and Mencius examples below). Further, after studying French for a few years, you should be able to read French authors who wrote 300 years ago, but after a similar period of studying Chinese, you will not be able to read Confucius or Mencius.

Hence most educated Chinese would be expected to know something about Classical Chinese, and if you wanted to learn Chinese like an educated Chinese speaker, you would have to learn this other language also.

In addition, you need to learn Classical Chinese even if you do not aspire to be an educated Chinese speaker because you encounter Classical Chinese often in modern Chinese society, for instance in paintings or character scrolls.

The tones are often quite difficult for a Westerner to pick up. If you mess up the tones, you have said a completely different word. Often foreigners who know their tones well nevertheless do not say them correctly, and hence, they say one word when they mean another.

One problem with the tone system is that when you want to change the meaning of a sentence in a subtle manner via changing intonation of a word, you are bound to change the tone of the word in Chinese. Merely by placing semantic emphasis on a single word, you may deliver a gibberish sentence. Chinese speakers have their own way of using tone as a way of generating subtle semantic meaning, but they do so in an entirely different way than speakers of non-tonal languages do.

However, compared to other tone systems around the world, the tonal system in Chinese is comparatively easy.

A major problem with Chinese is homonyms. To some extent, this is true in many tonal languages. Since Chinese uses short words and is disyllabic, there is a limited repertoire of sounds that can be used. At a certain point, all of the sounds are used up, and you are into the realm of homophones.

Tonal distinctions are one way that monosyllabic and disyllabic languages attempt to deal with the homophone problem, but it’s not good enough, since Chinese still has many homophones even with the tones, and in that case, meaning is often discerned by context, stress, rhythm and intonation.

Chinese, like French and English, is heavily idiomatic.

It’s little known, but Chinese also uses different forms to count different things, like Japanese.

There is zero common vocabulary between English and Chinese, so you need to learn a whole new set of lexical forms and have no cognates to fall back on.

In addition, nouns often show relatedness or hierarchy. For instance, in English, you can simply say my brother or my sister, but in Chinese, you cannot do this. You have to indicate whether you are speaking of an older or younger sibling.

mei mei   younger sister
jie jie   older sister
ge ge     older brother
di di     younger brother

Many agree that Chinese is the hardest to learn of all of the major languages. In a recent international survey of language professors worldwide, these teachers rated Chinese as the hardest language to learn among languages that are commonly studied.

Mandarin gets a 5 rating for extremely hard.

However, Cantonese is even harder to learn than Mandarin. Cantonese has nine tones to Mandarin’s four, and in addition, Cantonese speakers continue to use a lot of the older traditional Chinese characters that were superseded when China moved to a simplified script in the 1950s. Furthermore, since non-Mandarin characters are not standardized, Cantonese cannot be written down as it is spoken.

In addition, Cantonese has verbal aspect, possibly up to 20 different varieties. Modal particles are difficult in Cantonese. Clusters of up to three sentence-final particles are very common. 我食咗飯 and 我食咗飯架啦喎 are both grammatical for I have had a meal, but the particles add the meaning of I have already had a meal, mark the sentence as the answer to a question, or even imply I have had a meal, so I don’t need to eat anymore.

Cantonese gets a 5.5 rating, close to hardest of all.

Min Nan is also said to be harder to learn than Mandarin, as it has a more complex tone system, with five tones on three different levels. Even many Taiwanese natives don’t seem to get it right these days, as it is falling out of favor and many fewer children are being raised speaking it than before.

Min Nan gets a 5.5 rating, close to hardest of all.

A recent 15 year survey out of Fudan University utilizing both the departments of Linguistics and Anthropology looked at 579 different languages in order to try to find the most complicated language in the world. The result was that a Wu language dialect (or perhaps a separate language) in the Fengxian district of Shanghai (Fengxian Wu) was the most complex language of all, with 20 separate vowels. The nearest competitor was Norwegian with 16 vowels.

Fengxian Wu gets a 5.5 rating, close to hardest of all.

Classical Chinese is still read by many Chinese people and Chinese language learners. Unless you have a very good grasp of modern Chinese, Classical Chinese will be completely wasted on you. Classical Chinese is much harder to read than modern Chinese.

Classical Chinese covers an era extending over 3,000 years, and to attain a reading fluency in this language, you need to be familiar with all of the characters used during this period along with all of the literature of the period so you can understand all the allusions. Even with a knowledge of Classical Chinese, you need to read it in context. If you are good at Classical Chinese and someone throws you a random section of it, it will take you a good amount of time to figure it out unless you know context.

The language is much more to the point than Modern Chinese, but this is not as good as it sounds. This simplicity leaves room for ambiguity, and context plays an important role. A joke about some obscure historical or literary anecdote will be lost on you unless you know what it refers to. For reading modern Chinese, you will need at least 5,000 characters, but even then, you will still need a dictionary. With Classical Chinese, there is no upper limit on the number of characters you need to know. The sky is the limit.

Classical Chinese gets a 5.5 rating, close to hardest of all.



More On The Hardest Languages To Learn – Non-Indo-European Languages

Caution: This post is very long. It runs to 200 pages on the Net. Updated January 17, 2016.

This is a continuation of the earlier post. I split it up into two parts because it had gotten too long.

The post refers to which languages are the hardest for English speakers to learn, though to some extent, the ratings are applicable across languages. Most Chinese speakers would recognize Spanish as being an easy language, despite its alien nature. And even most Chinese, Navajo, Poles or Czechs acknowledge that their languages are hard to learn. To a certain extent, difficulty is independent of linguistic starting point. Some languages are just harder than others, and that’s all there is to it.

Method, Results and Conclusion. See here.

In this case, 73 non-IE languages were examined.

Ratings: Languages are rated 1-6, easiest to hardest. 1 = easiest, 2 = moderately easy to average, 3 = average to moderately difficult, 4 = very difficult, 5 = extremely difficult, 6 = most difficult of all.

Time needed: Time needed to learn the language “reasonably well”: Level 1 languages = 3 months-1 year. Level 2 languages = 6 months-1 year. Level 3 languages = 1-2 years. Level 4 languages = 2 years. Level 5 languages = 3-4 years, but some may take longer.
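
For anyone who wants to tabulate the ratings used in this post, the scale and the time estimates above can be written as a small lookup table. This is only a sketch in Python. The names RATING_LABELS, TIME_NEEDED and describe are made up for illustration. No time estimate is given above for Level 6, so it is left empty, and half-point ratings such as 5.5 are not defined in the scale, so the sketch rejects them rather than guessing.

```python
# Rating labels exactly as listed in the Ratings line above.
RATING_LABELS = {
    1: "easiest",
    2: "moderately easy to average",
    3: "average to moderately difficult",
    4: "very difficult",
    5: "extremely difficult",
    6: "most difficult of all",
}

# Time needed to learn the language "reasonably well".
# Level 6 has no estimate in the post, so it is None here.
TIME_NEEDED = {
    1: "3 months-1 year",
    2: "6 months-1 year",
    3: "1-2 years",
    4: "2 years",
    5: "3-4 years, but some may take longer",
    6: None,
}


def describe(rating):
    """Return a one-line description for a whole-number rating from 1 to 6."""
    if rating not in RATING_LABELS:
        raise ValueError(f"rating not defined in the scale: {rating}")
    time = TIME_NEEDED[rating] or "no estimate given"
    return f"{rating} = {RATING_LABELS[rating]} ({time})"


print(describe(5))
print(describe(6))
```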

Northeast Caucasian, Northwest Caucasian and Kartvelian

Of course the Caucasian languages like Tsez, Tabasaran, Georgian, Chechen, Ingush, Abkhaz and Circassian are some of the hardest languages on Earth to learn.

Chechen and Circassian are rated 6, hardest of all.

Northeast Caucasian

NE Caucasian languages have the uvulars and ejectives of Georgian in addition to pharyngeals, lateral fricatives, and other strangeness. They have noun classes like the Bantu languages (but usually fewer). Nevertheless, they have noun class agreement markers on verbs and adjectives. One thing NE Caucasian has is lots of cases. Some languages have 40+ cases. They are built from the ground up via two forms – one a spatial form such as in, on or around and the other a directional motion form such as to, from, through or at.
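
To see why case counts climb so fast when cases are built from two forms, here is a rough sketch in Python. It only uses the example forms named above (in, on, around and to, from, through, at). The function name build_cases and the "in+to" label format are made up for illustration, and real NE Caucasian languages have their own inventories of each form.

```python
from itertools import product

# Example forms named in the paragraph above; real languages have more of each.
SPATIAL = ["in", "on", "around"]
DIRECTIONAL = ["to", "from", "through", "at"]


def build_cases(spatial, directional):
    """Pair every spatial form with every directional form."""
    return [f"{s}+{d}" for s, d in product(spatial, directional)]


cases = build_cases(SPATIAL, DIRECTIONAL)
print(len(cases))  # 3 spatial x 4 directional = 12 combinations
print(cases[:4])   # ['in+to', 'in+from', 'in+through', 'in+at']
```

With only three spatial and four directional forms, you already get 12 combined cases, so it does not take many more of each to reach 40+.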


Tsez has 64-126 different cases, making it by far the most complex case system on Earth! It is one of the few languages on Earth that has two genitive cases – Genitive 1 (-s) and Genitive 2 (-z). Genitive 1 is used when the genitive’s head noun is in absolutive case and Genitive 2 is used when the genitive’s head noun is in any other case. It also has four noun classes. It is said that even native speakers have a hard time picking up the correct inflection to use sometimes.
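
The choice between the two genitives can be written as a one-line rule. This is a minimal sketch in Python, assuming the case of the head noun is already known and passed in as a plain string. The function name tsez_genitive_suffix is hypothetical, and it only returns the regular suffixes given above, not the special forms that pronouns take.

```python
def tsez_genitive_suffix(head_noun_case):
    """Pick the genitive suffix from the case of the genitive's head noun."""
    # Genitive 1 (-s) when the head noun is in absolutive case,
    # Genitive 2 (-z) when the head noun is in any other case.
    if head_noun_case == "absolutive":
        return "-s"
    return "-z"


print(tsez_genitive_suffix("absolutive"))  # -s
print(tsez_genitive_suffix("dative"))      # -z
```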

In Tsez, you need to know a lot of Tsez grammar to communicate at a basic level. Take the sentence:

English: I like your mother.

Tsez: Дāьр деби энийу йетих. (Dǟr debi eniyu yetix.)

In order to speak that sentence in Tsez, you need to know:

• the words themselves (word order is not as important)
• that the verb -eti- requires the subject to be in the dative/lative case and the object to be in the absolutive
• the noun class for eniyu (class II)
• the dative/lative form of di (I), which is dǟr
• the genitive 1 form of mi (you), which is debi
• the congruence prefix y- that corresponds to the noun class of the absolutive argument of the phrase, in this case mother
• the present tense ending for vowel-final verbs -x

Tsez is rated 6, hardest of all.


Archi has an extremely complex phonology and one of the most complicated grammars on Earth. The extreme fusional aspects and the verbal morphology are what make the grammar so difficult. Every verb root has 1,502,839 possible forms! It is also an ergative language, but there is irregularity in its ergative system.

Some verbs take the typical ergative/absolutive case (absolutive for the subject of an intransitive verb and ergative for the subject of a transitive verb – where the direct object would be in absolutive). In others the subject is in dative rather than the expected ergative/absolutive case. These are usually verbs of perception like love/want, hear, see, feel, and be bored. For instance, the verb:

-эти- = to love/want must have its subject in dative case instead of the expected absolutive or ergative case.

Among non-click languages, Archi has one of the largest consonant inventories, with only the extinct Ubykh having more. There are 26 vowels and between 76 and 82 consonants, depending on the analysis. Five of the six vowels can occur in five varieties: short, pharyngealized, high tone, long (with high tone), and pharyngealized with high tone.

It has many unusual phonemes, including contrasts between several voiceless velar lateral fricatives, voiceless and ejective velar lateral affricates and a voiced velar lateral fricative. The voiceless velar lateral fricative ʟ̝̊, the voiced velar lateral fricative ʟ̝, and the corresponding voiceless and ejective affricates k͡ʟ̝̊ and k͡ʟ̝̊ʼ are extremely unusual sounds, as velar fricatives are not typically laterals.

There are 15 cases: 10 regular cases and five spatial cases, plus six directional cases. The spatial cases are Inessive (in), Intrative (between), Superessive (above), Subessive (below) and Pertingent (against). The directional cases are Essive (as), Elative (out of), Lative (to/into), Allative (onto), Terminative (specifies a limit) and Translative (indicates change).

There are four noun classes:

I Male human
II Female human
III All insects, some animates, and some inanimates
IV Abstracts, some animates, and some inanimates that can only be seen via verbal agreement

Archi is rated 6, hardest of all.

Eastern Samur

Tabasaran is rated the 3rd most complex grammar in the world, with 48 different noun cases.

Tabasaran is rated 6, hardest of all.


Ingush has a very difficult phonology, an extremely complex grammar, and furthermore, is extremely irregular. Ingush also has a proximate/obviate distinction and is the only language in the region that has this feature. Ingush along with Chechen both have a closed class of verbs, an unusual feature in the world’s languages. New verbs are formed by adding a noun to the verb do:

shoot = do gun

Ingush is rated 6, hardest of all.


One problem with Georgian is the strange alphabet: ქართულია ერთ ერთი რთული ენა. It also has lots of glottal stops that are hard for many foreigners to speak; consonant clusters can be huge – up to eight consonants stuck together (CCCCCCCCVC) – and many consonant sounds are strange. In addition, there are uvulars and ejectives. Georgian is one of the hardest languages on Earth to pronounce. It regularly makes it onto craziest phonologies lists.

Its grammar is exceedingly complex. Georgian is both highly agglutinative and highly irregular, which is the worst of two worlds. Other agglutinative languages such as Turkish and Finnish at least have the benefit of being highly regular. The verbs in particular seem nearly random with no pattern to them at all. The system of argument and tense marking on the verb is exceedingly complex, with tense, aspect, mood on the verb, person and number marking for the subject, and direct and indirect objects.

Although it is an ergative language, the ergative (or active-stative case marking as it is called) oddly enough is only used in the aorist and perfect tenses where the agent in the sentence receives a different case, while the aorist also masquerades as imperative. In the present, there is standard nominative-accusative marking. A single verb can have up to 12 different parts, similar to Polish, and there are six cases and six tenses.

Georgian also features something called polypersonal agreement, a highly complex type of morphological feature that is often associated with polysynthetic languages and to a lesser extent with ergativity.

In a polypersonal language, the verb has agreement morphemes attached to it dealing with one or more of the verb’s arguments (usually up to four arguments). In a non-polypersonal language like English, the verb either shows no agreement or agrees with only one of its arguments, usually the subject. In a polypersonal language, by contrast, the verb agrees with one or more of the subject, the direct object, the indirect object, the beneficiary of the verb, etc. The polypersonal marking may be obligatory or optional.

In Georgian, the polypersonal morphemes appear as either suffixes or prefixes, depending on the verb class and the person, number, aspect and tense of the verb. The affixes also modify each other phonologically when they are next to each other. In the Georgian system, the polypersonal affixes convey subject, direct object, indirect object, genitive, locative and causative meanings.

g-mal-av-en = they hide you
= they hide it from you

mal (to hide) is the verb, and the other three forms (g-, -av- and -en) are polypersonal affixes.

In the case below,

xelebi ga-m-i-tsiv-d-a = My hands got cold.

xelebi means hands. The m marker indicates genitive or my. With intransitive verbs, Georgian often omits my before the subject and instead puts the genitive onto the verb to indicate possession.

Georgian verbs of motion focus on deixis, whether the goal of the motion is towards the speaker or the hearer. You use a particle to signify who the motion is heading towards. If it is heading towards neither of you, you use no deixis marker. You specify the path taken to reach the goal through the use of prefixes called preverbs, similar to “verbal case.” These come before the deixis marker:

up             a-
out            ga-
in             sha-
down into      cha-
across/through garda-
thither        mi-
away           c’a-
or down        da-


up towards me = amo-. The deixis marker is mo- and up is a-
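
The amo- example can be sketched as a tiny function that joins a path preverb to the deixis marker. This is only an illustration in Python. The names PREVERBS, DEIXIS and motion_prefix are made up, and only three preverbs from the table are included, because the post only shows the combination for up and other preverbs may not combine by simple joining.

```python
# Path preverbs taken from the table above, with hyphens dropped for joining.
# Only a subset is included; other preverbs may not join this simply.
PREVERBS = {"up": "a", "out": "ga", "down into": "cha"}
DEIXIS = "mo"  # the deixis marker from the amo- example


def motion_prefix(path, towards_participant):
    """Join a path preverb with the deixis marker when one is needed."""
    preverb = PREVERBS[path]
    if towards_participant:
        # Motion towards the speaker or hearer: preverb plus deixis marker.
        return f"{preverb}{DEIXIS}-"
    # Motion towards neither of you: no deixis marker.
    return f"{preverb}-"


print(motion_prefix("up", True))   # amo-
print(motion_prefix("up", False))  # a-
```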

On the plus side, Georgian has borrowed a great deal of Latinate foreign vocabulary, so that will help anyone coming from a Latinate or Latinate-heavy language background.

Georgian is rated 5, extremely difficult.

Northwest Caucasian

All NW Caucasian languages are characterized by a very small number of vowels (usually only two or three) combined with a vast consonant inventory, the largest consonant inventories on Earth. Almost any consonant can be plain, labialized or palatalized. This is apparently the result of an historical process whereby many vowels were lost and their various features became assigned to consonants. For instance, palatalized consonants may have come from Ci sequences and labialized consonants may have come from Cu sequences.

The grammars of these languages are complex. Unlike the NE Caucasian languages, they have simple noun systems, usually with only a handful of cases.

However, they have some of the most complex verbal systems on Earth. These are some of the most synthetic languages in the Old World. Often the entire syntax of the sentence is contained within the verb. All verbs are marked with ergative, absolutive and direct object morphemes in addition to various applicative affixes.

These are akin to what some might call “verbal case.” For instance, in applicative voice systems, applicatives may take forms such as comitative, locative, instrumental, benefactive and malefactive. These roles are similar to the case system in nouns – even the names are the same. So you can see why some call this “verbal case.”

NW Caucasian verbs can be marked for aspect (whether something is momentary, continuous or habitual) and mood (if something is certain, likely, desired, potential, or unreal). Other affixes can shape the verb in an adverbial sense, to express pity, excess or emphasis.

Like NE Caucasian, they are also ergative.

NW Caucasian makes it onto a lot of craziest language lists.

These are some of the strangest sounding languages on Earth. Of all of these languages, Abaza has the most consonants. Here is a video in the Abaza language.


Ubykh, a Caucasian language of Turkey, is now extinct, but there is one second language speaker, a linguist who is said to have taught himself the language. It has more consonants than any non-click language on Earth – 84 consonant sounds in all. Furthermore, the phonemic inventory allows some very strange consonant clusters.

Ubykh has many rare consonant sounds. One of them is otherwise found only in two of Ubykh’s relatives, Abkhaz and Abaza, and in two other languages, both in the Brazilian Amazon. The pharyngealized labiodental voiced fricative does not exist in any other language. Ubykh often makes it onto weirdest phonologies lists. It also got a very high score on a study of the weirdest languages on Earth.

Combine that with only two vowel sounds and a highly complex grammar, and you have one tough language.

In addition, Ubykh is both agglutinative and polysynthetic, ergative and has polypersonal agreement:

If only you had not been able to make him take it all out from under me again for them…

There are an incredible 16 morphemes in that nine syllable word.

Ubykh has only four cases on its nouns, but much case function has shifted over to the verb via preverbs and determinants. It is these preverbs and determinants that make Ubykh monstrously complex. The following are some of the directional preverbs:

  • above and touching
  • above and not touching
  • below and touching
  • below and not touching
  • at the side of
  • through a space
  • through solid matter
  • on a flat horizontal surface
  • on a non-horizontal or vertical surface
  • in a homogeneous mass
  • towards
  • in an upward direction
  • in a downward direction
  • into a tubular space
  • into an enclosed space

There are also some preverbal forms that indicate deixis:

j-  = towards the speaker

Others can indicate ideas that would take up whole phrases in English:

jtɕʷʼaa- = on the Earth, in the Earth

ʁadja ajtɕʷʼaanaaɬqʼa
They buried his body.
(Lit. They put his body in the earth.)

faa– = out of, into or with regard to a fire.

Amdʒan zatʃətʃaqʲa faastχʷən.
I take a brand out of the fire.

Morphemes may be as small as a single phoneme:

They give you to him.

w – 2nd singular absolutive
a – 3rd singular dative
n – 3rd ergative
– to give
aa – ergative plural
n – present tense

Adverbial suffixes are attached to words to form meanings that are often formed by aspects or tenses in other languages:

asfəpχa = I need to drink it.
I can drink it.
I drink it all the time.
I am drinking it all up.
I drink it too much.
I drink it again.

Nouns and verbs can transform into each other. Any noun can turn into a stative verb:


I was a child.
(Lit. I child-was. Here child-was is a verb – to be a child.)

By the same token, many verbs can become nouns via the use of a nominal affix:

qʼa = to say

what I say
(Lit. That which I say) – my speech, my words, my language, my orders, etc.

Number is marked on the verb via a verbal suffix and is only marked on the noun in the ergative case.

However, it does lack the convoluted case systems of the Caucasian languages next door and there is no grammatical gender.

Ubykh is rated 6, hardest of all.


Abkhaz is an extremely difficult language to learn. Each basic consonant has eight different positions of articulation in the mouth. Imagine how difficult that would be for an Abkhaz child with a speech impediment. Abkhaz seems to put agreement markers on just about everything in the language. Abkhaz makes it onto many craziest language lists, and it recently got a very high score on a weirdest language study.

Abkhaz is rated 6, hardest of all.


Burushaski is often thought to be a language isolate, related to no other languages; however, I think it is Dene-Caucasian. It is spoken in the Himalaya Mountains of far northern Pakistan in an area called the Hunza. Its verb conjugation is complex, it has a lot of inflections, there are complicated ways of making sentences depending on many factors, and it is an ergative language, which is hard to learn for speakers of non-ergative languages. In addition, there are very few to no cognates for the vocabulary.

Burushaski is rated 6, hardest of all.

American Indian Languages

American Indian languages are also notoriously difficult, though few try to learn them in the US anyway. In the rest of the continent, they are still learned by millions in many different nations. You almost really need to learn these as a kid. It’s going to be quite hard for an adult to get full competence in them.

One problem with these languages is the multiplicity of verb forms. For instance, the standard paradigm for the overwhelming number of regular English verbs is a maximum of five forms:


Many Amerindian languages have over 1,000 forms of each verb in the language.


Yet the Salishans (see below) always considered the neighboring language Kootenai to be too hard to learn. Kootenai also has a distinction between proximate/obviate along with direct/inverse alignment, probably from contact with Algonquian.

However, the Kootenai direct/inverse system is less complex than Algonquian’s, as it is present only in the 3rd person. Kootenai also has a very strange feature in that they have particles that look like subject pronouns, but these go outside of the full noun phrase. This is a very rare feature in the world’s languages. Kootenai scored very high on a weirdest language survey.

Kootenai is an isolate spoken in Idaho by 100 people.

Kootenai is rated 6, hardest of all.


Yuchi is a language isolate spoken in the Southern US. They were originally located in Eastern Tennessee and were part of the Creek Confederacy at one time. Yuchi is nearly extinct, with only five remaining speakers.

Yuchi has noun genders or classes based on three distinctions of position: standing, sitting or lying. All nouns are either standing, sitting or lying. Trees are standing, and rivers are lying, for instance. If it is taller than it is wide, it is standing. If it is wider than it is tall, it is lying. If it is about as wide as it is tall, it is sitting.

All nouns are one of these three genders, but you can change the gender for humorous or poetic effect. A linguist once asked a group of female speakers whether a penis was standing, sitting or lying. After lots of giggles, they said the default was sitting, but you could say it was standing or lying for poetic effect.
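
The shape rule for the three position classes can be sketched as a simple function. This is a rough illustration in Python, not a claim about how speakers actually decide. The function name yuchi_position_class and the tolerance cutoff for "about as wide as it is tall" are assumptions, and the sketch gives only the default class, since speakers can switch classes for effect.

```python
def yuchi_position_class(height, width, tolerance=0.1):
    """Guess the default position class from an object's proportions.

    `tolerance` is a made-up cutoff for "about as wide as it is tall".
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    ratio = height / width
    if ratio > 1 + tolerance:
        return "standing"  # taller than it is wide
    if ratio < 1 - tolerance:
        return "lying"     # wider than it is tall
    return "sitting"       # about as wide as it is tall


print(yuchi_position_class(20, 1))   # a tree-like shape
print(yuchi_position_class(1, 500))  # a river-like shape
```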

Also all Yuchi pronouns must make a distinction between age (older or younger than the speaker) and ethnicity (Yuchi or non-Yuchi).

Yuchi gets a 6 rating, hardest of all.


Tlingit is probably one of the hardest, if not the hardest, language in the world. Tlingit is analyzed as partly synthetic, partly agglutinative, and sometimes polysynthetic. It has not only suffixes and prefixes, but it also has infixes or affixes in the middle of words.

‘eech = to pick

All prefixes must be in proper order for the word to work.

I am usually picking, on purpose, a long object through the hole while standing on a table.

I am usually being forced to pick a long object through the hole while standing on a table.

I am usually picking the edible long object through the hole while standing on a table.

Tlingit has a pretty unusual phonology. For one thing, it is the only language on Earth with no l. This is despite the fact that it has five other laterals: dl, tl (tɬʰ), tl’ (tɬʼ), l (ɬ) and l’ (ɬʼ). The tɬʼ and ɬʼ sounds are rare in the world’s languages. ɬʼ is otherwise only found in the wild NW Caucasian languages. It also has two labialized glottal consonants, ʔʷ and hw.

Tlingit gets a 6 rating, hardest of all.


Navajo has long, short and nasal vowels, a tone system and a grammar totally unlike anything in Indo-European. A stem of only four letters or so can take enough affixes to fill a whole line of text.

Navajo is a polysynthetic language. In polysynthetic languages, very long words can denote an entire sentence, and it’s quite hard to take the word apart into its parts and figure out exactly what they mean and how they go together. The long words are created because polysynthetic languages have an amazing amount of morphological richness. They put many morphemes together to create a word out of what might be a sentence in a non-polysynthetic language.

Some Navajo dictionaries have thousands of entries of verbs only, with no nouns. Many adjectives have no direct translation into Navajo. Instead, verbs are used as adjectives. A verb has no particular form like in English – to walk. Instead, it assumes various forms depending on whether or not the action is completed, incomplete, in progress, repeated, habitual, one time only, instantaneous, or simply desired. These are called aspects. Navajo must have one of the most complex aspect systems of any language:

The Primary aspects:

Momentaneous – punctually (takes place at one point in time)
Continuative – an indefinite span of time & movement with a specified direction
Durative – over an indefinite span of time, non-locomotive uninterrupted continuum
Repetitive – a continuum of repeated acts or connected series of acts
Conclusive – like durative but in perfective terminates with static sequel
Semelfactive – a single act in a repetitive series of acts
Distributive – a distributive manipulation of objects or performance of actions
Diversative – a movement distributed among things (similar to distributive)
Reversative – results in directional change
Conative – an attempted action
Transitional – a shift from one state to another
Cursive – progression in a line through time/space (only progressive mode)

The subaspects:

Completive – an event/action simply takes place (similar to the aorist tense)
Terminative – a stopping of an action
Stative – sequentially durative and static
Inceptive – beginning of an action
Terminal – an inherently terminal action
Prolongative – an arrested beginning or ending of an action
Seriative – an interconnected series of successive separate & distinct acts
Inchoative – a focus on the beginning of a non-locomotion action
Reversionary – a return to a previous state/location
Semeliterative – a single repetition of an event/action

The tense system is almost as wild as the aspectual system.

For instance, the verb ndideesh means to pick up or to lift up. But it varies depending on what you are picking up:

ndideeshtiil = to pick up a slender stiff object (key, pole)
to pick up a slender flexible object (branch, rope)
to pick up a roundish or bulky object (bottle, rock)
to pick up a compact and heavy object (bundle, pack)
to pick up a non-compact or diffuse object (wool, hay)
to pick up something animate (child, dog)
to pick up a few small objects (a couple of berries, nuts)
to pick up a large number of small objects (a pile of berries, nuts)
to pick up something flexible and flat (blanket, piece of paper)
to pick up something I carry on my back
to pick up anything in a vessel
to pick up mushy matter (mud).

But picking up is only one way of handling the 12 different consistencies. One can also bring, take, hang up, keep, carry around or turn over objects, and so on. There are about 28 different verbs one can use for handling objects. If we multiply these verbs by the consistencies, there are over 300 different verbs used just for handling objects.
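
The multiplication behind that figure is easy to check. Here is a quick sketch in Python using the two numbers from the paragraph above. The names are made up for illustration, and since "about 28" is an approximate count, the result is a ballpark figure too.

```python
HANDLING_VERBS = 28  # "about 28" handling verbs, per the paragraph above
CONSISTENCIES = 12   # the 12 object consistencies listed for "pick up"


def handling_verb_forms(verbs=HANDLING_VERBS, consistencies=CONSISTENCIES):
    """Every handling verb needs a separate form for every consistency."""
    return verbs * consistencies


print(handling_verb_forms())  # 28 x 12 = 336, which is "over 300"
```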

In Navajo textbooks, there are conjugation tables for inflecting words, but it’s pretty hard to find a pattern there. One of the most frustrating things about Navajo is that every little morpheme you add to a word seems to change everything else around it, even in both directions.

Navajo is said to have a very difficult system for counting numerals.

There is also a noun classifier system with more than a dozen classifiers that affect inflection. This is quite a few classifiers even for a noun classifier language and is similar to African languages like Zulu. In addition, it has the strange direct/inverse system.

To add insult to injury, Navajo is an ergative language.

Navajo also has an honorifics or politeness system similar to Japanese or Korean.

Navajo also has the odd feature where the word niinaa (because) can be analyzed as a verb.

X áhóót’įįd biniinaa…
Because X happened…

Shiniinaa sits’il.
It broke into pieces because of me.

In the latter sentence, the only way we know that 1st singular was involved is because of the person marking on niinaa.

There are 25 different kinds of pronominal prefixes that can be piled onto one another before a verb base.

Navajo has a very strange feature called animacy, where nouns take certain verbs according to their rank in the hierarchy of animation which is a sort of a ranking based on how alive something is. Humans and lightning are at the top, children and large animals are next and abstractions are at the bottom.

All in all, Navajo, even compared to other polysynthetic languages, has some of the most incredibly complicated polysynthetic morphology of any language. On craziest grammar and craziest language lists, Navajo is typically listed.

It is even said that Navajo children have a hard time learning Navajo as compared to children learning other languages, but Navajo kids definitely learn the language. Similarly with Hopi below, even linguists find even the best Navajo grammars difficult or even impossible to understand.

However, Navajo is quite regular, a common feature in Amerindian languages.

Navajo is rated 6, hardest of all.


Slavey, a Na-Dene language of Canada, is hard to learn. It is similar to Navajo and Apache. Verbs take up to 15 different prefixes. All Athabascan languages have wild verbal systems. It also uses a completely different alphabet, a syllabic one designed for Canadian Indians.

Slavey is rated 6, hardest of all.


Haida is often thought to be a Na-Dene language, but proof of its status is lacking. If it is Na-Dene, it is the most distant member of the family. Haida is in the competition for the most complicated language on Earth, with 70 different suffixes.

Haida is rated 6, hardest of all.


The Salishan languages spoken in the Northwest have a long reputation for being hard to learn, in part because of long strings of consonants, in one case 11 consonants long. Salish languages are the only languages on Earth that allow words without sonorants.

Many of the vowels and consonants are not present in most of the world’s widely spoken languages. The Salish languages are, like Chukchi, polysynthetic. Some translations treat all Salish words as either verbs or phrases. Some say that Salish languages do not contain nouns, though this is controversial. The verbal system of Salish languages is absurdly complex.

All Salishan languages are rated 6, hardest of all.

Nuxálk (Bella Coola)

Nuxálk is a notoriously difficult Salishan Amerindian language spoken in British Columbia. It is famous for having some really wild words and even sentences that don’t seem to have any vowels in them at all. For instance:

xłp̓x̣ʷłtłpłłskʷc̓  (xɬpʼχʷɬtʰɬpʰɬːskʷʰt͡sʼ in IPA)
He had a bunchberry plant.

seal fat

Here are some more odd words and sentences:


Nuyamłamkis timantx tisyuttx ʔułtimnastx.
The father sang the song to his son.

Musis tiʔimmllkītx taq̓lsxʷt̓aχ.
The boy felt that rope.

However, this word is not typically used by speakers and by no means do most words consist of all consonants. The language sounds odd when spoken. It has been described as “whispering while chewing on a granola bar” (see the video sample under Montana Salish below).

These wild consonant clusters are even crazier than the ones in Ubykh and NW Caucasian. In fact, the nutty consonant clusters in Salish are causing a debate in linguistics about whether or not the syllable is even a universal phenomenon in language, as some Salish words and phrases appear to lack syllables. Some Berber dialects have raised similar questions about the syllable.

Nuxálk makes it onto lists of the craziest phonologies on Earth.

Nuxálk is rated 6, hardest of all.

Interior Salish

Montana Salish is said to be just as hard to learn as Nuxálk. Spokane (Montana Salish) has combining and independent forms with the same meaning:


Montana Salish makes it onto a lot of craziest grammars lists.

A video sample shows Steven Smallsalmon, an elder on the Flathead Indian Reservation in Montana, speaking Montana Salish. He also leads classes in the language. This is probably one of the strangest sounding languages on Earth.

Montana Salish is rated 6, hardest of all.


Straits Salish has an aspectual distinction between persistent and nonpersistent. Persistent means the activity continues after its inception as a state. The persistent morpheme is . The result is similar to English:

figure out – nonpersistent
know – persistent

look at – nonpersistent
watch – persistent

take – nonpersistent
hold – persistent

The persistent morpheme is referred to as a “parasitic morpheme” and only occurs in stems that have an underlying ə, which serves as a “host” for the morpheme.

How strange.

The Saanich dialect of Straits Salish is often listed in the rogue’s gallery of craziest grammars on Earth. The writing system is often listed as one of the worst out there. In addition, Saanich makes it onto craziest grammars lists for the parasitic morphemes and for having no distinction between nouns and verbs!

Straits Salish gets a 6 rating, hardest of all.

Halkomelem, spoken by 570 people around Vancouver, British Columbia, is widely considered to be one of the hardest languages on Earth to learn. In Halkomelem, many verbs have an orientation towards water. You can’t just say, She went home. You have to say how she was going home in relation to nearby bodies of water. So depending on where she was walking home in relation to the nearest river, you would say:

She was farther away from the water and going home.
She was coming home in the direction away from the water.
She was walking parallel to the flow of the water downstream.
She was walking parallel to the flow of the water upstream.

Halkomelem gets a 6 rating, hardest of all.


Lushootseed is said to be just as hard to learn as Nuxálk. Lushootseed is one of the few languages on Earth that has no nasals at all, except in special registers like baby talk and the archaic speech of mythological figures. It also has laryngealized glides and nasals: w̰, m̥̰, and n̥̰.

Lushootseed is rated 6, hardest of all.


All Iroquoian languages are extremely difficult, but Athabaskan is probably even harder. Siouan languages may be equal to Iroquoian in difficulty.

Compare the same phrases in Tlingit (Na-Dene, related to Athabaskan) and Cherokee (Iroquoian).


kutíkusa‘áat    It’s cold outside.
It’s cold right now.

In Tlingit, you can add or modify affixes at the beginning as prefixes, in the middle as infixes and at the end as suffixes. In the above example, you changed a part of the word within the clause itself.


doyáditlv uyvtlv It is cold outside. (Lit. Outside it is cold.)
ka uyvtlv It is cold now. (Lit. Now it is cold.)

As you can see, Cherokee is easier.


Cherokee is very hard to learn. In addition to everything else, it has a completely different alphabet. It’s polysynthetic, to make matters worse. It is possible to write a Cherokee sentence that somehow lacks a verb. There are five categories of verb classifiers. Verbs needing classifiers must use one. Each regular verb can have an incredible 21,262 inflected forms! All verbs contain a verb root, a pronominal prefix, a modal suffix and an aspect suffix. In addition, verbs inflect for singular, plural and also dual. For instance:

ᎠᎸᎢᎭ   a'lv'íha 

You have 126 different forms:
ᎬᏯᎸᎢᎭ  gvyalv'iha     I tie you up
ᏕᎬᏯᎸᎢᎭ degvyalviha  I'm tying you up
ᏥᏯᎸᎢᎭ  jiyalv'ha        I tie him up
ᎦᎸᎢᎭ                          I tie it
ᏍᏓᏯᎸᎢᎭ sdayalv'iha  I tie you (dual)
ᎢᏨᏯᎢᎭ  ijvyalv'iha    I tie you (pl)
ᎦᏥᏯᎸᎢᎭ gajiyalv'iha  I tie them (animate)
ᏕᎦᎸᎢᎭ                        I tie them up (inanimate)
ᏍᏆᎸᎢᎭ  squahlv'iha    You tie me
ᎯᏯᎸᎢᎭ  hiyalv'iha     You're tying him
ᎭᏢᎢᎭ   hatlv'iha         You tie it
ᏍᎩᎾᎸᎢᎭ skinalv'iha    You're tying me and him
ᎪᎩᎾᏢᎢᎭ goginatlv'iha  They tie me and him etc.

Let us look at another form:

to see

I see myself           gadagotia
I see you                gvgohtia
I see him/her          tsigotia
I see it                    tsigotia
I see you two          advgotia
I see you (plural)    istvgotia
I see them (live)    gatsigotia
I see them (things) detsigotia

You see me                     sgigotia
You see yourself              hadagotia
You see him/her              higo(h)tia
You see it                        higotia
You see another and me  sginigotia
You see others and me    isgigotia
You see them (living)      dehigotia
You see them (living)      gahigotia
You see them (things)     detsigotia

He/she sees me                    agigotia
He/she sees you                   tsagotia
He/she sees you                   atsigotia
He/she sees him/her            agotia
He/she sees himself/herself  adagotia
He/she sees you + me          ginigotia
He/she sees you two             sdigotia
He/she sees another + me    oginigotia
He/she sees us (them + me) otsigotia
He/she sees you (plural)       itsigotia
He/she sees them                 dagotia

You and I see him/her/it                igigotia
You and I see ourselves                 edadotia
You and I see one another             denadagotia/dosdadagotia
You and I see them (living)           genigotia
You and I see them (living or not) denigotia

You two see me                           sgninigotia
You two see him/her/it                 esdigotia
You two see yourselves                sdadagotia
You two see us (another and me) sginigotia
You two see them                        desdigotia

Another and I see you             sdvgotia
Another and I see him/her       osdigotia
Another and I see it                 osdigotia
Another and I see you-two      sdvgotia
Another and I see ourselves    dosdadagotia
Another and I see you (plural) itsvgotia
Another and I see them           dosdigotia

You (plural) see me        isgigoti
You (plural) see him/her etsigoti

They see me                    gvgigotia
They see you                   getsagotia
They see him/her             anigoti
They see you and me       geginigoti
They see you two             gesdigoti
They see another and me gegigotia/gogenigoti
They see you (plural)       getsigoti
They see them                 danagotia
They see themselves       anadagoti

I will see datsigoi
I saw      agigohvi

He/she will see dvgohi
He/she saw      ugohvi

Number is marked for inclusive vs. exclusive and there is a dual. 3rd person plural is marked for animate/inanimate. Verbs take different object forms depending on if the object is solid/alive/indefinite shape/flexible. This is similar to the Navajo system.

Cherokee also has lexical tone, with complex rules about how tones may combine with each other. Tone is not marked in the orthography. The phonology is noted for somehow not having any labial consonants.

However, Cherokee is very regular. It has only three irregular verbs. It is just that there are many complex rules.

Cherokee is rated 5.5, close to most difficult of all.

Northern Iroquoian
Five Nations-Huronian-Susquehannock

Wyandot, a dormant language that has been extinct for about 50 years, has some unbelievably complex structures. Let us look at one of them. Wyandot is the only language on Earth that allows negative sentences that somehow do not contain a negative morpheme. Wyandot makes it onto craziest grammars lists. (To be continued).

Mississippi Valley-Ohio Valley Siouan
Mississippi Valley Siouan

Lakota and other Siouan languages may well be as convoluted as Iroquoian. In Lakota, all adjectives are expressed as verbs. Something similar is seen in Nahuatl.

Ógle sápe kiŋ mak’ú.
The shirt it is black he gave it to me.
He gave me the black shirt.

In the above, it is black is a stative verb and serves as an adjective.

Ógle kiŋ sabyá mak’ú.
Shirt the blackly he gave it to me.
He gave me the black shirt. (Lit. He gave me the shirt blackly.)

Blackly is an adverb serving as an adjective above.

Lakota gets a 5.5 rating, close to hardest of all.


All Algonquian languages have distinctions between animate/inanimate nouns, in addition to having proximate/obviative and direct/inverse distinctions. However, most languages that have proximate/obviative and direct/inverse distinctions are not as difficult as Algonquian.

Proximate/obviative is a way of marking the 3rd person in discourse. It distinguishes between an important 3rd person (proximate) and a more peripheral 3rd person (obviative). Animate nouns and possessor nouns tend to be marked proximate while inanimate nouns and possessed nouns tend to be marked obviative.

Direct/inverse is a way of marking discourse in terms of saliency, topicality or animacy. A noun that ranks higher than another in saliency, topicality or animacy also ranks higher in the person hierarchy. The system is used only in transitive clauses. When the subject has a higher ranking than the object, the direct form is used. When the object has a higher ranking than the subject, the inverse form is used.
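
As a rough illustration, the direct/inverse choice can be sketched as a comparison of ranks. The hierarchy used here (2nd person > 1st person > proximate 3rd person > obviative 3rd person > inanimate) and the function name verb_direction are assumptions made for the example; actual hierarchies differ from language to language.

```python
# Toy person/animacy hierarchy; a lower number means a higher rank.
# The ordering is an illustrative assumption, not taken from any one grammar.
HIERARCHY = {
    "2": 0,
    "1": 1,
    "3.proximate": 2,
    "3.obviative": 3,
    "inanimate": 4,
}


def verb_direction(subject: str, obj: str) -> str:
    """Return 'direct' if the subject outranks the object, else 'inverse'."""
    subject_rank = HIERARCHY[subject]
    object_rank = HIERARCHY[obj]
    if subject_rank == object_rank:
        # Equal ranks are outside the scope of this toy model.
        raise ValueError("equal rank is not covered by this sketch")
    return "direct" if subject_rank < object_rank else "inverse"


print(verb_direction("1", "3.proximate"))            # direct
print(verb_direction("3.obviative", "3.proximate"))  # inverse
```

The point of the sketch is only that the verb form depends on the relative rank of the two participants, not on which one is the grammatical subject.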

Central Algonquian

Cree is very hard to learn. It is written in a variety of different ways with different alphabets and syllabic systems, complicating matters even further. The syllabic alphabet has many problems and is often listed as one of the worst scripts out there. Cree is polysynthetic and has long, short and nasal vowels and aspirated and unaspirated voiceless consonants. Words are divided into metrical feet, the rules for determining stress placement in words are quite complex, and there is lots of irregularity. Vowels often drop out (syncopate) within words.

Cree adds noun classifiers to the mix, and both nouns and verbs are marked as animate or inanimate. Verbs are also marked as transitive or intransitive, and they take different affixes depending on whether they occur in main or subordinate clauses.

Cree is rated 6, hardest of all.


Ojibwa is said to be about as hard to learn as Cree as it is very similar.

Ojibwa is rated 6, hardest of all.

Plains Algonquian

Cheyenne is well-known for being a hard Amerindian language to learn. Like many polysynthetic languages, it can have very long words.

I truly don’t know Cheyenne very well.

Cheyenne is quite regular, but it has so many complex rules that it is hard to figure them all out.

Cheyenne is rated 6, hardest of all.


Arapaho has a strange phonology. It lacks phonemic low vowels: the vowel system consists of i, ɨ~u, ɛ, and ɔ. Each vowel also has a corresponding long version. In addition, there are four diphthongs, ei, ou, oe and ie, several triphthongs, eii, oee, and ouu, as well as extended sequences of vowels such as eee with stress on either the first or the last vowel in the combination. Long vowels of various types are common:

I will turn out the lights.

It is raining.

There is a pitch accent system with normal, high and allophonic falling tones. Arapaho words also undergo some very wild sound changes.

Arapaho is rated 6, hardest of all.

Gros Ventre has a similar phonological system and similar elaborate sound changes as Arapaho.

Gros Ventre is rated 5, hardest of all.


Wichita has many strange phonological traits. It has only one nasal. Labials are rare and appear in only two roots. It also may have only three vowels, i, e, and a, with only height as a distinction. Such a restricted vertical vowel distribution is only found in NW Caucasian and the Papuan Ndu languages. There is apparently a three-way contrast in vowel length – regular, long and extra-long. This is only found in Mixe and Estonian.

There are some interesting tenses. Perfect tense means that an act has been carried out. The strange intentive tense means that one hopes or hoped to carry out an act. The habitual tense means one regularly engages in the activity, not that one is doing so at the moment.

Long consonant clusters are permitted.


while sleeping

There are many cases where a CVɁ sequence has been reduced to CɁ due to loss of the vowel, resulting in odd words such as:


Word order follows novelty or importance.

hira:wisɁiha:s kiyari:ce:hire:
Our ancestors God put us on this Earth.

weɁe hira:rɁ tiɁi na:kirih
God put our ancestors on this Earth.

In the sentence above, “our ancestors” is actually the subject, so it makes sense that it comes first.

Wichita has inclusive and exclusive 1st person plural and has singular, dual and plural. There is an evidential system where if you say you know something, you must say how you know it – whether it is personal knowledge or hearsay.

Wichita gets a 6 rating, hardest of all.

Coastal Chontal

Huamelutec or Lowland Oaxaca Chontal has the odd glottalized fricatives fʼ, sʼ, ɬʼ and xʼ as its only glottalized consonants. They alternate with plain f, s, l and x. Three of them, including ɬʼ, are extremely rare in the world’s languages, usually found in only 2-3 other languages, often in NW Caucasian. One occurs in only one other language – Tlingit. Another is slightly more common, occurring in five other languages including Tlingit. In other languages, these odd sounds derived from sequences of consonant + q: Cq -> Cʔ -> glottalized fricative.

Sentence structure is odd:

Hit the ball the man.
Hit the man the ball.
The man hit the ball.

All mean the same thing.

Huamelutec gets a 6 rating, hardest of all.


Karok is a language isolate spoken by a few dozen people in northern California. The last native speaker recently died; however, there are ~80 people who have varying levels of L2 fluency.

In Karok, you can use a suffix for different types of containment – fire, water or a solid.

throw into a fire

throw into water

throw through a solid

The suffixes are unrelated to the words for fire, water and solid.

Karok gets a 5 rating, hardest of all.


Hopi is so difficult that even grammars describing the language are almost impossible to understand. For instance, Hopi has two different words for “and” depending on whether the noun phrase containing the word “and” is nominative or accusative.

Hopi is rated 6, hardest of all.

Southern Uto-Aztecan
Core Nahua

In Nahuatl, most adjectives are simply stative verbs. Hence:

Umntu omde waya eTenochtitlan.
The man he is tall went to Tenochtitlan.
The tall man went to Tenochtitlan.

He is tall is a stative verb in the above.

Nahuatl gets a 6 rating, hardest of all.

Central Numic

Comanche is legendary for being one of the hardest Indian languages of all to learn. Reasons are unknown, but all Amerindian languages are quite difficult. I doubt if Comanche is harder than other Numic languages.

Bizarrely enough, Comanche has very strange sounds called voiceless vowels, which seems to be an oxymoron, as vowels would seem to be inherently voiced. English has something akin to voiceless vowels in the words particular and peculiar, where the unstressed first vowel acts something like a voiceless vowel.

Comanche was used for a while by the code talkers in World War 2 – not all code talkers were Navajos. Comanche was specifically chosen because it was hard to figure out. The Comanche code was never broken.

Comanche is rated 6, hardest of all.

Western Oto-Mangue

Chinantec, an Indian language of southwest Mexico, is very hard for non-Chinantecs to learn. The tone system is maddeningly complex, and the syntax and morphology are very intricate.

Chinantec is rated 6, hardest of all.

Lowland Valley

Jalapa Mazatec has distinctions between modal, creaky and breathy-voiced vowels, along with nasal versions of those three. It also has creaky consonants and voiceless nasals. It has three tones, low, mid and high. Combining the tones results in various contour tones. In addition, it has a 3-way distinction in vowel length. Whistled speech is also possible. It has a phonemic distinction between “ballistic” and “controlled” syllables which is only present in Oto-Manguean.

Ballistic (short)
you plural

Controlled (half-long)
– six

Jalapa Mazatec is rated 6, hardest of all.

Upper Amazon
Eastern Nawiki

Tariana is a very difficult language mostly because of the unbelievable amount of information it crams into its morphology and syntax. This is mostly because it is an Arawakan language that has been heavily influenced by neighboring Tucanoan languages, with the result that it has many of the grammatical categories and particles present in both families.

This stems from the widespread bilingualism in the Vaupes Basin of Colombia, where many people grow up bilingual from childhood and often become multilingual by adulthood. Learning up to five different languages is common. Code-switching is frowned upon, and anyone using a word from Language Y while speaking Language X gets laughed at. Hence the various languages tended to borrow grammatical features from each other quite easily, rather than words.

For instance, Tariana has both a noun classifier system and a gender system. Noun classifiers and gender are sometimes subsumed under the single category of “noun classifiers.” Yet Tariana has both, presumably from its relationship to two completely different language families. So in Tariana it is not unusual to get both demonstratives and verbs marked for both gender and noun classifier. Tariana borrowed such things as serialized perception verbs and the dubitative marker from Tucano.

In addition, Tariana has some very odd sounds, including aspirated nasals mh (mʰ), nh (n̺ʰ) and ñh (ɲʰ) and an aspirated w (wʰ) of all things. They seem to be actually aspirated, not just partially devoiced as many voiceless nasals and liquids are.

Tariana gets a 6 rating, hardest of all.


Bora, a Witotoan language spoken in Peru and Colombia near the border between the two countries, has a mind-boggling 350 different noun classes. The noun classifier system is actually highly productive and is often used to create new nouns. New nouns can be created very easily, and their meanings are often semantically transparent. In some noun classifier systems, classifiers can be stacked one upon the other. In these cases, typically the last one is used for agreement purposes.

Bora also is a tonal language, but it has only two tones. In addition, nearly all consonantal phonemes have phonemic aspirated and palatalized counterparts. The agreement structure in the language is also quite convoluted. The classifier system effectively replaces much derivational morphology on the noun and noun compounding processes that other languages use to expand the meanings of nominals.

Bora gets a 6 rating, hardest of all.

Eastern Tucanoan

Tuyuca is a Tucanoan language spoken by 450 people in the department of Vaupés in Colombia. An article in The Economist magazine concluded that it was the hardest language on Earth to learn.

It has a simple sound system, but it’s agglutinative, and agglutinative languages are pretty hard. For instance, hóabãsiriga means “I don’t know how to write.” It has two forms of 1st person plural, “I and you” (inclusive) and “I and the others” (exclusive). It has between 50 and 140 noun classes, including strange ones like “bark that does not cling closely to a tree,” which can be extended to mean baggy trousers or wet plywood that has begun to fall apart.

Like Yamana, a nearly extinct Amerindian language of Chile, Tuyuca marks for evidentiality, that is, how it is that you know something. For instance:

Diga ape-wi. = The boy played soccer. (I saw him playing).
Diga ape-hiyi. = The boy played soccer. (I assume he was playing soccer, though I did not see it firsthand).

Evidential marking is obligatory on all Tuyuca verbs and it forces you to think about how you know whatever it is you know.

Tuyuca definitely gets a 6 rating!

Central Tucanoan

Cubeo, a language spoken in the Vaupes of Colombia, has a small closed class of adjective roots similar to Juǀʼhoan below:


However, verbs can function as adjectives, and the adjective roots can either turn into nouns themselves or they can take the inflections of either nouns or verbs. Wild!

Similar to how the grammar of Tariana has been influenced by Tucano languages, the grammar of Tucanoan Cubeo has been influenced by neighboring Arawakan languages. The grammar has been described as either SOV or OVS. That would mean that the following:

The man the ball hit.
The ball hit the man.

mean the same thing. OVS languages are quite rare.

Morphemes belong to one of four classes:

  1. Nasal (many roots, as well as suffixes like -xã  = associative)
  2. Oral (many roots, as well as suffixes like -pe  = similarity, -du = frustrative)
  3. Unmarked (only suffixes, e.g. -re  = in/direct object)
  4. Oral/Nasal (some roots and some suffixes, e.g. /bãˈkaxa-/ (mãˈkaxa-) = to defecate, -kebã = suppose)

Just by looking at any given consonant-initial suffix, it is impossible to determine which of the first three categories it belongs to. They must be learned one by one.

Cubeo has nasal assimilation, common to many Amazonian languages. In some of these, nasalization is best analyzed at the syllable level – some syllables are nasal and others are not.

She recently went.

The underlying form dĩ-bI-ko is realized on the surface as nĩmĩko. The ĩ in dĩ-bI-ko nasalizes the d, the b, and the I on either side of it, so nasal spreading works in both directions. However, it is blocked from the third syllable because k is part of a class of non-nasalizable consonants.
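
The spreading just described can be sketched in a few lines. This is a toy model of this one example only: the segment mappings, the blocker set (only k is named in the text) and the function name spread_nasality are assumptions, and the real Cubeo inventory is larger.

```python
import unicodedata

# Assumed mappings for this one example; "\u0129" is the nasal vowel ĩ.
NASAL_COUNTERPART = {"d": "n", "b": "m", "I": "\u0129"}
BLOCKERS = {"k"}            # non-nasalizable consonants (only k is named in the text)
NASAL_SOURCES = {"\u0129"}  # underlyingly nasal vowels that trigger spreading


def spread_nasality(underlying: str) -> str:
    """Spread nasality left and right from each nasal vowel until a blocker or word edge."""
    # Normalize so that ĩ is a single code point, then drop morpheme boundaries.
    normalized = unicodedata.normalize("NFC", underlying)
    segments = [ch for ch in normalized if ch != "-"]
    surface = list(segments)
    for i, seg in enumerate(segments):
        if seg not in NASAL_SOURCES:
            continue
        for step in (-1, 1):  # leftward first, then rightward
            j = i + step
            while 0 <= j < len(segments) and segments[j] not in BLOCKERS:
                surface[j] = NASAL_COUNTERPART.get(segments[j], segments[j])
                j += step
    return "".join(surface)


print(spread_nasality("dĩ-bI-ko"))  # nĩmĩko
```

Spreading stops at k, so the final o is never reached and stays oral.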

Pretty difficult language.

Cubeo gets a 6 rating, hardest of all.


Hixkaryána is famous for being one of the very few languages on Earth to have basic OVS (Object-Verb-Subject) word order.

The sentence Toto yonoye kamara, or The man ate the jaguar, actually means The jaguar ate the man.

Toto yonoye kamara
Lit. The man ate the jaguar.
Gloss: The jaguar ate the man.

Grammatical suffixes attached to the end of the verb mark not only number but also aspect, mood and tense.

Hixkaryána gets a 6 rating, hardest of all.


Nambikwara is actually a series of closely related languages as opposed to one language, but the Southern Nambikwara language is the most well-known of the family, with 1,200 speakers in the Brazilian Amazon.

Phonology is complex. Consonants distinguish between aspirated, plain and glottalized, common in the Americas. There are strange sounds like prestopped nasals and glottalized fricatives. There are nasal vowels and three different tones. All vowels except one have nasal, creaky-voiced and nasal-creaky counterparts, for a total of 19 vowels.

The grammar is polysynthetic with a complex evidential system.

Reportedly, Nambikwara children do not pick up the language fully until age 10 or so, one of the latest recorded ages for full competence. Nambikwara is sometimes said to be the hardest language on Earth to learn, but it has some competition.

Nambikwara definitely gets a 6 rating, hardest of all!


Pirahã is a language isolate spoken in the Brazilian Amazon. Recent writings by Daniel Everett indicate that not only is this one of the hardest languages on Earth to learn, but it is also one of the weirdest languages on Earth. It is monumentally complex in nearly every way imaginable. It is commonly listed on the rogue’s gallery of craziest languages and phonologies on Earth.

It has the smallest phonemic inventory on Earth with only seven consonants, three vowels and either two or three tones. Everett recently wrote a paper about it after spending many years with the Pirahã. Previous missionaries who had spent time with the Pirahã generally failed to learn the language because it was too hard to learn. It took Everett a very long time, but he finally learned it well.

Many of Everett’s claims about Pirahã are astounding: whistled speech, no system for counting, very few Portuguese loans (they deliberately refuse to use Portuguese loans), evidence for the Sapir-Whorf linguistic relativity hypothesis, and evidence that it violates some of Noam Chomsky’s purported language universals such as embedding. It also has the t͡ʙ̥ sound – a bilabially trilled postdental affricate which is only found in two other languages, both in the Brazilian Amazon – Oro Win and Wari’.

Initially, Everett never heard the sound, but as they got to know him better, they started to make it more often. Everett believes that they were ridiculed by other groups when they made the odd sound.

Pirahã has the simplest kinship system in any language – there is only one word for both mother and father, and the Pirahã do not have any words for anyone other than direct biological relatives.

Pirahã may have only two numerals, or it may lack a numeral system altogether.

Pirahã does not distinguish between singular and plural person. This is highly unusual. The language may have borrowed its entire pronoun set from the Tupian languages Nheengatu and Tenharim, groups the Pirahã had formerly been in contact with. This may be one of the only attested cases of the borrowing of a complete pronoun set.

There are mandatory evidentiality markers that must be used in Pirahã discourse. Speakers must say how they know something, whether they saw it themselves, whether it was hearsay or whether they inferred it circumstantially.

There are various strange moods – the desiderative (desire to perform an action) and two types of frustrative – frustration in starting an action (inchoative/incompletive) and frustration in completing an action (causative/incompletive). There are others: immediate/intentive (you are going to do something now/you intend to do it in the future).

There are many verbal aspects: perfect/imperfect (completed/incomplete), telic/atelic (reaching a goal/not reaching a goal), continuative (continuing), repetitive (iterative), and beginning an action (inchoative).

Each Pirahã verb has 262,144 possible forms, or possibly many millions, depending on which analysis you use.
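
The figure of 262,144 is not derived here, but it happens to equal 2 to the 18th power, which is the count you would get if a verb had 18 suffix slots that could each independently be filled or left empty. That slot count is purely an assumption for illustration, not a claim about how the number was actually computed. A quick check of the arithmetic:

```python
from itertools import product


def count_forms(optional_slots: int) -> int:
    """Number of combinations when each slot is independently filled or empty."""
    return 2 ** optional_slots


# Cross-check the formula by brute-force enumeration for a small slot count.
assert count_forms(4) == len(list(product((False, True), repeat=4)))

# 18 slots is a hypothetical number chosen because it reproduces the figure.
print(count_forms(18))  # 262144
```

Counts like this grow very fast: each additional optional slot doubles the total, which is how different analyses can land anywhere from hundreds of thousands to many millions.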

The future tense is divided into future/somewhere and future/elsewhere. The past tense is divided into plain past and immediate past.

Pirahã has a closed class of only 90 verb roots, an incredibly small number. But these roots can be combined together to form compound verbs, a much larger category. Here is one example of three verbs strung together to form a compound verb:

xig ab op
take turn go
bring back: you take something away, you turn around, and you go back to where you got it to return it.

There are no abstract color terms in Pirahã. There are only two words for colors, one for light and one for dark. The only other languages with so restricted a color vocabulary are in Papua New Guinea. The other color terms are not really color terms, but are more descriptive – red is translated as like blood.

Pirahã can be whistled, hummed or encoded into music. Consonants and vowels can be omitted altogether and meaning conveyed instead via variations in stress, pitch and rhythm. Mothers teach the language to children by repeating musical patterns.

Pirahã may well be one of the hardest languages on Earth to learn.

Pirahã gets a 6 rating, hardest of all.


Quechua (actually a large group of languages and not a single language at all) is one of the easiest Amerindian languages to learn. Quechua is a classic example of a highly regular grammar with few exceptions. Its agglutinative system is more straightforward than even that of Turkish. The phonology is dead simple.

On the down side, there is a lot of dialectal divergence (these are actually separate languages and not dialects) and a lack of learning materials. Some say that Quechua speakers spend their whole lives learning the language.

Quechua has inconsistent orthographies. There is a fight between those who prefer a Spanish-based orthography and those who prefer a more phonemic one. Also there is an argument over whether to use the Ayacucho language or the Cuzco language as a base.

Quechua has a difficult feature known as evidential marking. This marker indicates the source of the speaker’s knowledge and how sure they are about the statement.

-mi expresses personal knowledge:

Tayta Wayllaqawaqa chufirmi.
Mr. Huayllacahua is a driver. (I know it for a fact.)

-si expresses hearsay knowledge:

Tayta Wayllaqawaqa chufirsi.
Mr. Huayllacahua is a driver (or so I’ve heard).

-chá expresses strong possibility:

Tayta Wayllaqawaqa chufirchá.
Mr. Huayllacahua is a driver (most likely).

Quechua is rated 4, very difficult.


Aymara has some of the wildest morphophonology out there. Morpheme-final vowel deletion is present in the language as a morphophonological process, and it is dependent on a set of highly complex phonological, morphological and syntactic rules (Kim 2013).

For instance, there are three types of suffixes: dominant, recessive and a third class that is neither dominant nor recessive. If a stem ends in a vowel, dominant suffixes delete the vowel but recessive suffixes allow the vowel to remain. The third class either deletes or retains the vowel on the stem depending on how many vowels are in the stem. If the root has two vowels, the vowel is retained. If it has three vowels, the vowel is deleted.
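
The three suffix classes can be sketched as a small decision function. Everything concrete here is an assumption for illustration: the stems tapa and tapaka and the suffix su are invented placeholders rather than real Aymara forms, the vowel inventory is simplified to a, i, u, and stem and root are treated as the same thing.

```python
VOWELS = set("aiu")  # assumption: three plain vowels, vowel length ignored


def attach(stem: str, suffix: str, suffix_class: str) -> str:
    """Attach a suffix, deleting the stem-final vowel when the suffix class calls for it."""
    if stem[-1] not in VOWELS:
        # The rules described only apply to vowel-final stems.
        return stem + suffix
    vowel_count = sum(ch in VOWELS for ch in stem)
    if suffix_class == "dominant":
        delete = True
    elif suffix_class == "recessive":
        delete = False
    elif suffix_class == "third":
        if vowel_count == 2:
            delete = False
        elif vowel_count == 3:
            delete = True
        else:
            raise ValueError("only two- and three-vowel roots are covered")
    else:
        raise ValueError(f"unknown suffix class: {suffix_class}")
    return (stem[:-1] if delete else stem) + suffix


print(attach("tapa", "su", "dominant"))   # tapsu
print(attach("tapa", "su", "recessive"))  # tapasu
print(attach("tapa", "su", "third"))      # tapasu
print(attach("tapaka", "su", "third"))    # tapaksu
```

The learner's burden is that the class of each suffix has to be memorized; nothing in the shape of the suffix gives it away.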

Although all of this seems quite odd, Finnish has something similar going on, if not a lot worse.

Nevertheless, Aymara is still said to be a very easy language to learn. The Guinness Book of World Records claims it is almost as easy to learn as Esperanto.

Aymara gets a 2 rating, very easy to learn.


Australian Aboriginal languages are some of the hardest languages on Earth to learn, like Amerindian or Caucasian languages. Some Australian languages have phonemic contrasts that few other languages have, such as apico-dental, lamino-dental, apico-post-alveolar, and lamino-postalveolar coronals.

Australian languages tend to be mixed ergative. Ordinary nouns are ergative-absolutive, but 1st and 2nd person pronouns are nominative-accusative. One language has a three way agent-patient-experiencer distinction in the 1st person pronoun. Australian pronouns typically have singular, plural and dual forms along with inclusive and exclusive 1st plural. In some sentences, they have what is known as double case agreement which is rare in the world’s languages:

I gave a spear to my father.
I gave a spear mine-to father’s-to.

Both elements of the phrase my father are in both dative and genitive.

However, Aboriginal languages do have the plus of being very regular.

All Australian languages are rated 6, most difficult of all.


Berik is a Tor-Orya language spoken in the Indonesian colony of Irian Jaya in New Guinea.

Verbs take many strange endings, in many cases mandatory ones, that indicate what time of day something happened, among other things.

Telbener    He drinks in the evening.

Where a verb takes an object, it will not only be marked for time of day but for the size of the object.

Kitobana    He gives three large objects to a man in the sunlight.

Verbs may also be marked for where the action takes place in reference to the speaker.

Gwerantena  To place a large object in a low place nearby.

Berik is rated 6, hardest of all.

Trans New Guinea

Amele is the world’s most complex language as far as verb forms go, with 69,000 finite and 860 infinitive forms.

Amele is rated 6, hardest of all.


Valman is a bizarre case where the word and that connects two nouns is actually a verb of all things and is marked with the first noun as subject and the second noun as object.

John (subject) and Mary (object)

John is marked as subject for some reason, and Mary is marked as object, and the and word shows subject agreement with John and object agreement with Mary.

Valman gets a 6 rating, hardest of all.


Semitic languages such as Arabic and Hebrew are notoriously difficult to learn, and Arabic (especially MSA) tops many language learners’ lists as the hardest language they have ever attempted to learn. Although Semitic verbs are notoriously complex, the verbal system does have some advantages especially as compared to IE languages like Slavic. Unlike Slavic, Semitic verbs are not inflected for mood and there is no perfect or imperfect.


Arabic has some very irregular manners of noun declension, even in the plural. For instance, the word girls changes in an unpredictable way when you say one girl, two girls and three girls, and there are two different ways to say two girls depending on context. Two girls is marked with the dual, but different dual forms can be used. All languages with duals are relatively difficult for most speakers that lack a dual in their native language. However, the dual is predictable from the singular, so one might argue that you only need to learn how to say one girl and three girls.

Further, Arabic is full of irregular plurals similar to octopus and octopi, whereas such forms are rare in English. With any given word, there might be 20 different possible ways to pluralize it, there is no way to know which of the 20 paradigms to use with that word, and there is no way to generalize a plural pattern from a singular pattern. In addition, many words have 2-3 ways of pluralizing them. Some messy Arabic plurals:

kalb -> kilaab
-> quluub
-> makaatib
-> tullaab
-> buldaan

When you say I love you to a man, you say it one way, and when you say it to a woman, you say it another way. On and on.

The Arabic writing system is exceedingly difficult and is one of the hardest to use of any on Earth. Soft vowels are omitted. You have to learn where to insert missing vowels, where to double consonants and which vowels to skip in the script. There are 28 different symbols in the alphabet and four different ways to write each symbol depending on its place in the word.

Consonants are written in different ways depending on where they appear in a word. An h is written differently at the beginning of a word than at the end of a word. However, one simple aspect of it is that the medial form is always the same as the initial form. You need to learn not only Arabic words but also the grammar to read Arabic.
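To see the positional forms concretely, here is a short Python sketch that looks up the four shapes of the letter h (heh) by their Unicode character names. It relies on the Arabic Presentation Forms-B block, where each positional shape has its own code point. The helper name positional_forms is made up for this example, and ordinary Arabic text normally stores only the base letter and lets the font pick the shape, so treat this as a way to inspect the forms rather than as how Arabic is actually typed.

```python
import unicodedata

# The four positions a letter can take within a word.
POSITIONS = ("ISOLATED", "FINAL", "INITIAL", "MEDIAL")


def positional_forms(letter_name):
    """Return the four presentation-form characters for one letter.

    letter_name is the Unicode name fragment, e.g. "HEH".
    (Hypothetical helper written for this illustration.)
    """
    return {
        pos.lower(): unicodedata.lookup(f"ARABIC LETTER {letter_name} {pos} FORM")
        for pos in POSITIONS
    }


heh = positional_forms("HEH")
for position, glyph in heh.items():
    # Print the position, the code point and the glyph itself.
    print(f"{position:9} U+{ord(glyph):04X} {glyph}")
```

Multiply that by 28 letters and you get a sense of how many shapes a beginner has to recognize before reading a single word.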

Pronouns attach themselves to roots, and there are many different verb conjugation paradigms which simply have to be memorized. For instance, if a verb has a و, a ي, or a ء in its root, you need to memorize the patterns of the derivations, and that is a good chunk of the conjugations right there. The system for measuring quantities is extremely confusing.

The grammar has many odd rules that seem senseless. Unfortunately, most rules have exceptions, and it seems that the exceptions are more common than the rules themselves. Many people, including native speakers, complain about Arabic grammar.

Arabic does have case, but the system is rather simple.

The laryngeals, uvulars and glottalized sounds are hard for many foreigners to make and nearly impossible for them to get right. The ha’ (ح), qa (ق) and غ sounds and the glottal stop in initial position give a lot of learners headaches.

Arabic is at least as idiomatic as French or English, so in order to speak it right you have to learn all of the expressionistic nuances.

One of the worst problems with Arabic is the dialects, which in many cases are separate languages altogether. If you learn Arabic, you often have to learn one of the dialects along with classical Arabic. All Arabic speakers speak both an Arabic dialect and Classical Arabic.

In some Arabic as a foreign language classes, even after 1 1/2 years, not one student could yet make a complete and proper sentence that was not memorized.

Adding weight to the commonly held belief that Arabic is hard to learn is research done in Germany in 2005 which showed that Turkish children learn their language at age 2-3, German children at age 4-5, but Arabic kids did not get Arabic until age 12.

Arabic has complex verbal agreement with the subject, masculine and feminine gender in nouns and adjectives, head-initial syntax and a serious restriction to forming compounds. If you come from a language that has a similar nature, Arabic may be easier for you than it is for so many others. Its 3 vowel system makes for easy vowels.

MSA Arabic is rated 5, extremely difficult.

Arabic dialects are often somewhat easier to learn than MSA Arabic. At least in Lebanese and Egyptian Arabic, the very difficult q’ sound has been turned into a hamza or glottal stop which is an easier sound to make. Compared to MSA Arabic, the dialectal words tend to be shorter and easier to pronounce.

To attain anywhere near native speaker competency in Egyptian Arabic, you probably need to live in Egypt for 10 years, but Arabic speakers say that few if any second language learners ever come close to native competency. There is a huge vocabulary, and most words have a wealth of possible meanings.

Egyptian Arabic is rated 4.5, very to extremely difficult.

Moroccan Arabic is said to be particularly difficult, with much vowel elision in triconsonantal stems. In addition, all dialectal Arabic is plagued by irrational writing systems.

Moroccan Arabic is rated 4.5, very to extremely difficult.

Maltese is a strange language, basically a Maghrebi Arabic language (similar to Moroccan or Tunisian Arabic) that has very heavy influence from non-Arabic tongues. It shares the problem of Gaelic that often words look one way and are pronounced another.

It has the common Semitic problem of difficult plurals. Although many plurals use common plural endings (-i, -iet, -ijiet, -at), others simply form the plural by having their last vowel dropped or adding an s (English borrowing). There’s no pattern, and you simply have to memorize which ones act which way. Maltese permits the consonant cluster spt, which is surely hard to pronounce.

On the other hand, Maltese has quite a few IE loans from Italian, Sicilian, Spanish, French and increasingly English. If you have knowledge of Romance languages, Maltese is going to be easier than most Arabic dialects.

Maltese is rated 4, very difficult.


Hebrew is hard to learn according to a number of Israelis. Part of the problem may be the abjad writing system, which often leaves out vowels which must simply be remembered. Also, other than borrowings, the vocabulary is Afroasiatic, hence mostly unknown to speakers of IE languages. There are also difficult consonants as in Arabic such as pharyngeals and uvulars.

The het or glottal h is particularly hard to make. However, most modern Israelis no longer make the het or a’ain sounds. Instead, they pronounce the het like the chaf sound and the a’ain like an alef. Almost all Ashkenazi Israeli Jews no longer use the het or a’ain sounds. But most Jews who came from Arab countries (often older people) still use these sounds, and some of their children do (Dorani 2013).

Hebrew has complex morphophonological rules. The letters p, b, t, d, k and g change to f, v, th, dh, kh and gh respectively in certain situations. In some environments, pharyngeals change the nature of the vowels around them. The prefix ve-, which means and, is pronounced differently when it precedes certain letters. Hebrew is also quite irregular.
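The consonant alternation can be sketched as a lookup table plus a condition. In the Python sketch below, the pairing is p to f, b to v, t to th, d to dh, k to kh and g to gh. The trigger used here (the consonant directly follows a vowel) is a simplifying assumption added for the example, since the text above only says "in certain situations". The function name spirantize and the transliterated sample strings are also placeholders, not a full account of Hebrew phonology.

```python
# Stop -> fricative pairs for the six alternating consonants.
SPIRANT = {"p": "f", "b": "v", "t": "th", "d": "dh", "k": "kh", "g": "gh"}

# Assumption: a plain five-vowel transliteration.
VOWELS = set("aeiou")


def spirantize(word):
    """Soften each alternating stop that directly follows a vowel.

    The post-vocalic trigger is a simplification assumed for this sketch.
    """
    out = []
    prev = ""
    for ch in word:
        if ch in SPIRANT and prev in VOWELS:
            out.append(SPIRANT[ch])
        else:
            out.append(ch)
        prev = ch
    return "".join(out)


print(spirantize("katab"))  # k stays word-initially; t and b follow vowels
print(spirantize("dabar"))  # d stays word-initially; b follows a vowel
```

Even in this toy version, the same written letter comes out two different ways depending on its neighbors, which is exactly what trips learners up.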

Hebrew has quite a few voices, including active, passive, intensive, intensive passive, etc. It also has a number of tenses such as present, past and the odd jussive.

Hebrew also has two different noun classes. There are also many suffixes and quite a few prefixes that can be attached to verbs and nouns.

Even most native Hebrew speakers do not speak Hebrew correctly by a long shot.

Quite a few say Hebrew is as hard to learn as MSA or perhaps even harder, but this is controversial.

Hebrew gets a 5 rating for extremely difficult.


Berber languages are considered to be very hard to learn. Worse, there are very few language learning resources available.

Tamazight allows doubled consonants at the beginning of a word! How can you possibly make that sound?

Tamazight gets a 6 rating, hardest of all.

In Tachelhit, words like this are possible:

You took it off.

You gave it.

In addition, there are words which contain only one or two consonants:


feed on

Tachelhit gets a 6 rating, hardest of all.


Amharic is said to be a very hard language to learn. It is quite complex, and its sentence structures seem strange even to speakers of other Semitic languages. Hebrew speakers say they have a hard time with this language.

There are a multitude of rules which almost seem ridiculous in their complexity, there are numerous conjugation patterns, objects are suffixed to the verb, the alphabet has 274 letters, and the pronunciation seems strange. However, if you already know Hebrew or Arabic, it will be a lot easier. The hardest part of all is the verbal system, as with any Semitic language. It is easier than Arabic.

Amharic gets a 4.5 rating, very hard to extremely hard.

East Cushitic

Dahalo is legendary for having some of the wildest consonant phonology on Earth. It has all four airstream mechanisms found in languages: ejectives, implosives, clicks and normal pulmonic sounds. There are both glottal and epiglottal stops and fricatives and laminal and apical stops.

There is also a strange series of nasal clicks that are both glottalized and plain. Some of these clicks are also labialized. It has both voiced and unvoiced prenasalized stops and affricates, and some of the stops are also labialized. There is a weird palatal lateral ejective. There are three different lateral fricatives, including a labialized and palatalized one, and one lateral approximant. It contrasts alveolar and palatal lateral affricates and fricatives, the only language on Earth to do this.

The Dahalo are former elephant-hunting hunter-gatherers who live in southern Kenya. It is believed that at one time they spoke a language like Sandawe or Hadza, but they switched over to Cushitic at some point. The clicks are thought to be substratum from a time when Dahalo was a Sandawe-Hadza type language.

Dahalo gets a 6 rating, hardest of all.


Somali has one of the strangest preposition systems on Earth. It actually has no real prepositions at all. Instead it has preverbal particles and possessives that serve as prepositions.

Here is how possessives serve as prepositions:

habeennimada horteeda
the night her front
before nightfall

kulaylka dartiisa
the heat his reason
because of the heat

Here we have the use of a preverbal particle serving as a preposition:

kú ríd shandádda
Into put the suitcase.
Put it into the suitcase.

Somali combines four “prepositions” with four deictic particles to form its prepositions.

There are four basic “prepositions”:


These combine with four different deictic particles:

toward the speaker
away from the speaker
toward each other
away from each other

Hence you put the “prepositions” and the deictic particles together in various ways. Both tend to go in front of and close to the verb:

Nínkíi bàan cèelka xádhig kagá sóo saaray.
…well-the rope with-from towards-me I-raised.
I pulled the man out of the well with a rope.

Way inoogá warrámi jireen.
They us-to-about news gave.
They used to give us news about it.

Prepositions are the hardest part of the Somali language for the learner.

Somali deals with verbs of motion via deixis in a way similar to Georgian. One reference point is the speaker and the other is any other entities discussed. Verbs of motion are formed using adverbs. Entities may move:

towards each other    wada
away from each other  kala
towards the speaker   so
away from the speaker si


kala durka separate
si gal     go in (away from the speaker)
so gal     come in (toward the speaker)
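The four directional adverbs work like a small lookup table in front of the verb. A minimal Python sketch of that idea follows; the particles and their meanings are the ones listed above, while the names DEIXIS and gloss are invented for the example, and it assumes the particle is always the first word of the phrase.

```python
# Directional adverbs and their meanings, as listed in the table above.
DEIXIS = {
    "wada": "towards each other",
    "kala": "away from each other",
    "so": "towards the speaker",
    "si": "away from the speaker",
}


def gloss(phrase):
    """Split 'particle verb' and describe the direction of motion.

    Assumes the deictic particle is the first word (illustrative only).
    """
    particle, verb = phrase.split(maxsplit=1)
    return f"{verb} ({DEIXIS[particle]})"


print(gloss("si gal"))
print(gloss("so gal"))
```

The same verb gal comes out as go in or come in purely because of the particle in front of it.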

Somali lacks orthographic consistency. There are four different orthographic systems in use – the Wadaad Arabic script, the Osmanya Ethiopic script, the Borama script and the Latin Somali alphabet, the current system.

All of the difficult sounds of Arabic are also present in Somali, another Afroasiatic language – the alef, the ha, the qaf and the kha. There are long and short vowels. There is a retroflex d, the same sound found in South Indian languages. Somali also has 2 tones – high and low. For some reason, Somali tends to make it onto craziest phonologies lists.

Somali pluralization makes no sense and must be memorized. There are seven different plurals, and there is no clue in the singular that tells you what form to use in the plural. See here:


áf  (language) -> afaf


hoóyo (mother) -> hoyoóyin

áabbe (father) -> aabayaal

Note the tone shifts in all three of the plurals above.

There are four cases, absolutive, nominative, genitive and vocative. Despite the presence of absolutive and nominative cases, Somali is not an ergative language. Absolutive case is the basic case of the noun, and nominative is the case given to the noun when a verb follows in the sentence. There are different articles depending on whether the noun was mentioned previously or not (similar to the articles a and the in English). The absolutive and nominative are marked not only on the noun but also on the article that precedes it.

In terms of difficulty, Somali is much harder than Persian and probably about as difficult as Arabic.

Somali gets a 5 rating, extremely hard to learn.


Malayalam, a Dravidian language of India, has been cited as the hardest language to learn by a language foundation, but the citation is obscure and hard to verify.

Malayalam words are often even hard to look up in a Malayalam dictionary.

For instance, adiyAnkaLAkkikkoNDirikkukayumANello is a word in Malayalam. It means something like I, your servant, am sitting and mixing s.t. (which is why I cannot do what you are asking of me). The part in parentheses is an example of the type of sentence where it might be used.

The above word is composed of many different morphemes, including conjunctions and other affixes, with sandhi going on with some of them so they are eroded away from their basic forms. There doesn’t seem to be any way to look that word up or to write a Malayalam dictionary that lists all the possible forms, including forms like the word above. It would probably be way too huge of a book. However, all agglutinative languages are made up of affixes, and if you know the affixes, it is not particularly hard to parse the word apart.

Malayalam is said to be very hard to pronounce correctly.

Further, few foreigners even try to learn Malayalam, so Malayalam speakers, like the French, might not listen to you and might make fun of you if your Malayalam is not native sounding.

However, Malayalam has the advantage of having many pedagogic materials available for language learning such as audio-visual material and subtitled videos.

Malayalam is rated 5, extremely difficult.


Tamil, a Dravidian language, is hard, but probably not as difficult as Malayalam is. Tamil has an incredible 247 characters in its alphabet. Nevertheless, most of those are consonant-vowel combinations, so it is almost more of a syllabary than an alphabet. Going by what would traditionally be considered alphabetic symbols, there are probably only 72 real symbols in the alphabet. Nevertheless, Tamil probably has one of the easier Indic scripts as Tamil has fewer characters than other scripts due to its lack of aspiration. Compare to Devanagari’s over 1,000 characters.
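The 247 figure is easier to swallow once you see where it comes from. The breakdown below is the conventional one (12 vowels, 18 consonants, every consonant-vowel pairing, plus one special character called aytham). It is not spelled out in the text above, so take the variable names and the split as an added illustration.

```python
# Conventional breakdown of the Tamil character count.
VOWELS = 12        # independent vowel letters
CONSONANTS = 18    # basic consonant letters
AYTHAM = 1         # the single special character

# Every consonant combines with every vowel to give a compound character.
compound = VOWELS * CONSONANTS

total = VOWELS + CONSONANTS + compound + AYTHAM
print(compound, total)
```

So the bulk of the inventory is generated by combination, which is why the script behaves more like a syllabary than an alphabet.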

But no Indic script is easy. A problem with Tamil is that all of the characters seem to look alike. It is even worse than Devanagari in that regard. However, the more rounded scripts such as Kannada, Sinhala, Telugu and Malayalam have that problem to a worse degree. Tamil has a few sharp corners in the characters that help to disambiguate them.

In addition, as with other languages, words are written one way and pronounced another. However, there are claims that the difficulty of Tamil’s diglossia is overrated.

Tamil has two different registers for written and spoken speech, but the differences are not large, so this problem is exaggerated. Both Tamil and Malayalam are spoken very fast and have extremely complicated, nearly impenetrable scripts. If Westerners try to speak a Dravidian language in south India, more often than not the Dravidian speaker will simply address them in English rather than try to accommodate them.

Tamil has the odd evidential mood, similar to Bulgarian.

However, on the plus side, the language does seem to be very logical and regular, almost like German in that regard. In addition, there are a lot of language learning materials for Tamil.

Tamil is rated 4, very difficult.


Most agree that Korean is a hard language to learn.

The alphabet, Hangul, at least is reasonable; in fact, it is quite elegant. But there are four different Romanizations – Lukoff, Yale, Horne, and McCune-Reischauer – which is preposterous. It’s best to just blow off the Romanizations and dive straight into Hangul. This way you can learn a Romanization later, and you won’t mess up your Hangul with spelling errors, as can occur if you go from Romanization to Hangul.

Hangul can be learned very quickly, but learning to read Korean books and newspapers fast is another matter altogether because you really need to know the hanja or Chinese characters that are used in addition to the Hangul. After World War 2, the Koreas decided to officially get rid of their Chinese characters, but in practice this was not successful. With the use of Chinese characters in Korean, you can be a lot more precise in terms of what you are trying to communicate.

Bizarrely, there are two different numeral sets used, but one is derived from Chinese so it should be familiar to Chinese, Japanese or Thai speakers who use similar or identical systems.

Korean has a wealth of homonyms, and this is one of the tricky aspects of the language. Any given combination of a couple of characters can have multiple meanings. Japanese has a similar problem with homonyms, but at least with Japanese you have the benefit of kanji to help you tell the homonyms apart. With Korean Hangul, you get no such advantage.

Similarly, there seem to be many ways to say the same thing in Korean. The learner will feel when people are using all of these different ways of saying the same thing that they are actually saying something different each time, but that is not the case.

One problem is that the bp, j, ch, t and d are pronounced differently than their English counterparts. The consonants, the pachim system and the morphing consonants at the end of the word that slide into the next word make Korean harder to pronounce than any major European language. Korean shares a problem with Japanese: if you mess up one vowel in a sentence, you render it incomprehensible.

The vocabulary is very difficult for an English speaker who does not have knowledge of either Japanese or Chinese. On the other hand, Japanese or Chinese will help you a lot with Korean.

Korean is agglutinative and has a subject-topic discourse structure, and the logic of these systems is difficult for English speakers to understand. In addition, there are hundreds of ways of conjugating any given verb based on tense, mood, age or seniority. Adjectives also decline and take hundreds of different suffixes.

Meanwhile, Korean has an honorific system that is even wackier than that of Japanese. A single sentence can be said in three different ways depending on the relationship between the speaker and the listener. However, the younger generation is not using the honorifics so much, and a foreigner isn’t expected to know the honorific system anyway.

Maybe 60% of the words are based on Chinese words, but unfortunately, much of this Chinese-based vocabulary intersects with Japanese versions of Chinese words in a confusing way.

Speakers of Korean can learn Japanese fairly easily. Korean seems to be a more difficult language to learn than Japanese. There are maybe twice as many particles as in Japanese, the grammar is dramatically more difficult and the verbs are quite a bit harder. The phonemic inventory in Korean is also larger and includes such oddities as double consonants.

Korean is rated by language professors as being one of the hardest languages to learn.

Korean is rated 5, extremely hard.


Japanese also uses a symbolic alphabet, but the symbols themselves are sometimes undecipherable in that even Japanese speakers will sometimes encounter written Japanese and will say that they don’t know how to pronounce it. I don’t mean that they mispronounce it; that would make sense. I mean they don’t have the slightest clue how to say the word! This problem is essentially nonexistent in a language like English.

The Japanese orthography is one of the most difficult to use of any orthography.

There are over 2,000 frequently used characters in three different symbolic alphabets that are frequently mixed together in confusing ways. Due to the large number of frequently used symbols, it’s said that even Japanese adults learn a new symbol a day well into adulthood.

The Japanese writing system is probably crazier than the Chinese writing system, and it often makes it onto lists of worst orthographies. The very idea of writing an agglutinative language in a combination of two syllabaries and an ideography seems wacky right off the bat. Japanese borrowed Chinese characters. But then they gave each character several pronunciations, and in some cases as many as 24. Next they made two syllabaries using another set of characters, then over the next millennium came up with all sorts of contradictory and often senseless rules about when to use the syllabaries and when to use the character set. Later on they added a Romanization to make things even worse.

Chinese uses 5-6,000 characters regularly, while Japanese only uses around 2,000. But in Chinese, each character has only one or maybe two pronunciations. In Japanese, there are complicated rules about when and how to combine the hiragana with the characters. These rules are so hard that many native speakers still have problems with them. There are also personal and place names (proper nouns) which are given completely arbitrary pronunciations often totally at odds with the usual pronunciation of the character.

There are some writers, typically of literature, who deliberately choose to use kanji that even Japanese people cannot read. For instance, Ryuu Murakami uses the odd symbols 擽る, 轢く and 憑ける.

The Japanese system is made up of three different systems: the katakana and hiragana (the kana) and the kanji, similar to the hanzi used in Chinese. Chinese has at least 85,000 hanzi. The number of kanji is much less than that, but kanji often have more than one meaning in contrast to hanzi.

After WW2, Japan decided to simplify its language. They both simplified and reduced the number of Chinese characters used, and they unified the written and spoken language, which previously had been different.

Speaking Japanese is not as difficult as everyone says, and many say it’s fairly easy. However, there is a problem similar to English in that one word can be pronounced in multiple ways, like read and read in English.

A common problem is that a perfectly grammatical sentence uttered by a Japanese language learner is still not acceptable to Japanese speakers because “we just don’t say it that way.” The Japanese speaker often cannot tell why the unacceptable sentence you uttered is not ok. On the other hand, this problem may be common to more languages than Japanese.

There is also a class of Japanese called “honorifics” or “keigo” that is quite hard to master. Honorifics are meant to show respect and to indicate one’s place or status in the social hierarchy. These typically affect verbs but can also affect particles and prefixes. They are usually formed by archaic or highly irregular verbs. However, there are both regular and irregular honorific forms. Furthermore, there are five different levels of honorifics. Honorifics vary depending on who you are and who you are talking to. In addition, gender comes into play.

Although it is true that Japanese young people are said to not understand the intricacies of keigo, it is still expected that they know how to speak this well. Consequently, many young Japanese will opt out of certain conversations because they feel that their keigo is not very good. Books explaining how to use keigo properly have been big sellers among young people in Japan in recent years as young people try to appear classy, refined or cultured.

In addition, Japanese born overseas (especially in the US), while often learning Japanese pretty well, typically have a very poor understanding of keigo. Instead of embarrassing themselves by not using keigo or using it wrong, these Japanese speakers often prefer to speak in English to Japanese people rather than bother with keigo-less Japanese. Overcorrection in keigo is also a problem: hypercorrection leads to someone making errors in keigo due to “trying too hard.” This looks like phony or insincere politeness and is often worse than not using keigo at all.

One wild thing about Japanese is counting forms. You actually use different numeral sets depending on what it is you are counting! There are dozens of different ways of counting things which involve the use of a complex numerical noun classifier system.
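The classifier idea can be sketched as a lookup from the kind of thing being counted to the counter word that must follow the number. The four counters in this Python sketch are common ones (本 for long thin objects, 枚 for flat objects, 匹 for small animals, 人 for people). The category labels and the function name count_phrase are made up for the example, and the sketch deliberately ignores the hard part, which is that the pronunciation of the number and the counter often changes depending on the combination.

```python
# A few common Japanese counters, keyed by an informal category label.
# The category labels are invented for this illustration.
COUNTERS = {
    "long thin object": "本",
    "flat object": "枚",
    "small animal": "匹",
    "person": "人",
}


def count_phrase(n, kind):
    """Write a number followed by the counter for that kind of thing.

    Only the written form is built; readings are not modeled.
    """
    return f"{n}{COUNTERS[kind]}"


print(count_phrase(3, "flat object"))
print(count_phrase(2, "small animal"))
```

A real learner needs a much bigger table than this, plus the sound changes, which is why counting is such a notorious stumbling block.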

Japanese grammar is often said to be simple, but that does not appear to be the case on closer examination. Particles are especially vexing. Verbs engage in all sorts of wild behavior, and adverbs often act like verbs. Nouns can act like adjectives and adverbs. Meanwhile, honorifics change the behavior of all words. There are particles like ha and ga that have many different meanings. One problem is that all noun modifiers, even phrases, must precede the nouns they are modifying.

It’s often said that Japanese has no case, but this is not true. Actually, there are seven cases in Japanese. The aforementioned ga is a clitic meaning nominative, -made is terminative case, -no is genitive and -o is accusative.

In this sentence:

The plane that was supposed to arrive at midnight, but which had been delayed by bad weather, finally arrived at 1 AM.

The entire relative clause (that was supposed to arrive at midnight, but which had been delayed by bad weather) must precede the noun plane:

Was supposed to arrive at midnight, but had been delayed by bad weather, the plane finally arrived at 1 AM.

One of the main problems with Japanese grammar is that it is going to seem so different from the sort of grammar an English speaker is likely to be used to.

Speaking Japanese is one thing, but reading and writing it is a whole new ballgame. It’s perfectly possible to know the meaning of every kanji and the meaning of every word in a sentence, but you still can’t figure out the meaning of the sentence because you can’t figure out how the sentence is stuck together in such a way as to create meaning.

The real problem is that the Japanese you learn in class is one thing, and the Japanese of the street is another. One problem is that in street Japanese, the subject is typically not stated in a sentence. Instead it is inferred through such things as honorific terms or the choice of words you used in the sentence. Probably no one goes crazier on negatives than the Japanese. Particularly in academic writing, triple and quadruple negatives are common, and can be quite confusing.

Yet there are problems with the agglutinative nature of Japanese. It’s a completely different syntactic structure than English. Often if you translate a sentence from Japanese to English it will just look like a meaningless jumble of words.

However, Japanese grammar has the advantage of being quite regular. For instance, there are only four frequently used irregular verbs.

Like Chinese, the nouns are not marked for number or gender. However, while Chinese is forgiving of errors, if you mess up one vowel in a Japanese sentence, you may end up with incomprehension.

Although many Japanese learners feel it’s fairly easy to learn, surveys of language professors continue to rate Japanese as one of the hardest languages to learn. A study by the US Navy concluded that the hardest language the corpsmen had to learn in the course of service was Japanese. However, it’s generally agreed that Japanese is easier to learn than Korean. Japanese speakers are able to learn Korean pretty easily.

Japanese is rated 5, extremely hard.

Classical Japanese is much harder to read than Modern Japanese. Though you can get by with much less kanji when reading the modern language, you will need a minimum knowledge of 3,000 kanji for reading Classical Japanese, and that’s using a dictionary. There are only about 500-1,000 frequently used characters, but there are countless other words that will come up in your reading, especially, say, special words used in the Imperial Court. Many words have more than one meaning, and unless you know this, you will be lost. 東宮(とうぐう) for instance means Eastern Palace. However, it also means Crown Prince because his residence was to the east of the Emperor’s.

The movie The Seven Samurai (set in the late 1500’s) seems to use some sort of Classical Japanese, or at least Classical vocabulary and syntax with modern pronunciation. Japanese language learners say they can’t understand a word of the archaic Japanese used in this movie.

Classical Japanese gets 5.5, nearly hardest of all.

Western Oghuz

Turkish is often considered to be hard to learn, and it’s rated one of the hardest in surveys of language teachers. However, it’s probably easier than its reputation makes it out to be. It is agglutinative, so you can have one long word where in English you might have a sentence of shorter words. One such word translates as:

Were you one of those people whom we could not turn into a Czechoslovakian?

Many words have more than one meaning. However, the agglutination is very regular in that each particle of meaning has its own morpheme and falls into an exact place in the word. See here:

göz            eye
göz-lük        glasses
göz-lük-çü     optician
göz-lük-çü-lük the business of an optician
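The way each suffix snaps onto the end of the previous word can be reproduced with simple string concatenation. The Python sketch below rebuilds the four words above from their morphemes; the suffixes are treated as fixed strings taken straight from this example, which is an assumption, since in real Turkish the vowel in a suffix is chosen by vowel harmony with the stem.

```python
from itertools import accumulate

# Morphemes and glosses taken from the example above.
MORPHEMES = ["göz", "lük", "çü", "lük"]
GLOSSES = ["eye", "glasses", "optician", "the business of an optician"]

# accumulate() concatenates step by step: göz, gözlük, gözlükçü, ...
words = list(accumulate(MORPHEMES))

for word, meaning in zip(words, GLOSSES):
    print(f"{word:12} {meaning}")
```

Each step adds exactly one morpheme in exactly one slot, which is what makes the system so regular once you know the pieces.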

Nevertheless, agglutination means that you can always create new words or add new parts to words, and for this reason even a lot of Turkish adults have problems with their language.

There is no verb to be, which is hard for many foreigners. Instead, the concept is wrapped onto the subject of the sentence as a -dim or -im suffix. Turkish is an imagery-heavy language, and if you try to translate straight from a dictionary, it often won’t make sense.

However, the suffixation in Turkish, along with the vowel harmony, are both precise. Nevertheless, many words have irregular vowel harmony. The rules for making plurals are very regular, with no exceptions (the only exceptions are in foreign loans). In Turkish, incredible as it sounds, you can make a plural out of anything, even a word like what, who or blood. However, there is some irregularity in the strengthening of adjectives, and the forms are not predictable and must be memorized.

Turkish is a language of precision in other ways. For instance, there are eight different forms of subjunctive mood that describe various degrees of uncertainty that one has about what one is talking about. This relates to the evidentiality discussed under Tuyuca above, and Turkish has an evidential form similar to Tamil and Bulgarian. On Turkish news, verbs are generally marked with miş, which means that the announcer believes it to be true though he has not seen it firsthand. The particle miş is interesting because this evidential form is coded into the tense system, which is an unusual use of evidentiality.

The Roman alphabet and almost mathematically precise grammar really help out. Turkish lacks gender and has but a single irregular verb – olmak. Nevertheless, there are many verbal forms. However, this is controversial and it depends on how you define grammatical irregularity. There is some strangeness in some of the verb paradigms, but it is argued that these oddities are rule-based. The aorist tense is said to have irregularity.

There is some irregular morphophonology, but not much. The oblique relative clauses have complex morphosyntax. Turkish has two completely different ways of making relative clauses, one of which may have been borrowed from Persian. There are many gerunds for verbs, and these have many different uses. At the end of the day, Turkish grammar is not as regular or as simple as it is made out to be.

Words are pronounced nearly the same as they are written. One suggestion that Turkish may be easier to learn than many think is research showing that Turkish children attain basic grammatical mastery of Turkish at age 2-3, as compared to 4-5 for German and 12 for Arabic. The research was conducted in Germany in 2005.

In addition, Turkish has a phonetic orthography.

However, Turkish is hard for an English speaker to learn for a variety of reasons. It is agglutinative like Japanese, and all agglutinative languages are difficult for English speakers to learn. As in Japanese, you start your Turkish sentence the way you would end your English sentence. As in the Japanese example above, the subordinate clause must precede the subject, whereas in English, the subordinate clause must follow the subject. The italicized phrase below is a subordinate clause.

In English, we say, “I hope that he will be on time.”

In Turkish, the sentence would read, “That he will be on time I hope.”

Turkish vowels are unusual to speakers of IE languages, and Turkish learners say the vowels are hard to make or even tell apart from one another.

Turkish is rated 3.5, harder than average to learn.



One test of the difficulty of any language is how much of the grammar you must know in order to express yourself on a basic level. On this basis, Finno-Ugric languages are complicated because you need to know quite a bit more grammar to communicate on a basic level in them than in, say, German.


Finnish is very hard to learn, and even long-time learners often still have problems with it. Famous polyglot Barry Farber said it was one of the hardest languages he learned. You have to know exactly which grammatical forms to use where in a sentence. In addition, Finnish has 15 cases in the singular and 16 in the plural. This is hard to learn for speakers coming from a language with little or no case.

For instance:

talo      the house
talon     house's
taloa     some of the house
taloksi   into as the house
talossa   in the house
talosta   from inside the house
taloon    into the house
talolla   on to the house
talolta   from beside the house
talolle   to the house
taloista  from the houses
taloissa  in the houses

It gets much worse than that. This web page shows that the noun kauppa (shop) can have 2,253 forms.

A simple adjective + noun type of noun phrase of two words can be inflected in up to 100 different ways.

Adjectives and nouns belong to 20 different classes. The rules governing their case declension depend on what class the substantive is in.

As with Hungarian, words can be very long. For instance:

non-commissioned officer cadet learning to be an assistant mechanic for airplane jet engines

Like Turkish, Finnish agglutination is very regular. Each bit of information has its own morpheme and has an exact place in the word.

Like Turkish, Finnish has vowel harmony, and as in Turkish, it is very regular. Unlike Turkish or Hungarian, consonant gradation forms a major part of Finnish morphology. In order to form a sentence in Finnish, you will need to learn about verb types, cases and consonant gradation, and it can take a while to get your mind around those things.

Finnish, oddly enough, always puts the stress on the first syllable. Finnish vowels will be hard to pronounce for most foreigners.

However, Finnish has the advantage of being pronounced precisely as it is written. This is also part of the problem though, because if you don’t say it just right, the meaning changes. So, as with Polish, when you mangle the language, you will only achieve incomprehension, whereas with, say, English, if a foreigner mangles the language, you can often winnow some sense out of it.

However, despite the fact that written Finnish can be easily pronounced, when learning Finnish, as in Korean, it is as if you must learn two different languages – the written language and the spoken language. A better way to put it is that there is “one language for writing and another for speaking.” You use different forms depending on whether you are conversing or putting something on paper.

Some pronunciation is difficult. The contrast between short and long vowels and consonants is particularly troublesome. Check out these minimal pairs:



A problem for the English speaker coming to Finnish would be the vocabulary, which is alien to the speaker of an IE language. Finnish language learners often find themselves looking up over half the words they encounter. Obviously, this slows down reading quite a bit!

In the grammar, the partitive case and potential mood can be difficult. Here is an example of how Finnish verb forms combine with various cases to form words:

I A-Infinitive
Base form mennä

II E-Infinitive
Active inessive    mennessä
Active instructive mennen
Passive inessive   mentäessä

III MA-Infinitive
Inessive            menemässä
Elative             menemästä
Illative            menemään
Adessive            menemällä
Abessive            menemättä
Active instructive  menemän
Passive instructive mentämän

Verbs in Finnish

Finnish verbs are very regular. The irregular verbs can almost be counted on one hand:


and a few others. In fact, on the plus side, Finnish in general is very regular.

One easy aspect of Finnish is the way you can build many forms from a base root:


to write

As in many Asian languages, there are no masculine or feminine pronouns, and there is no grammatical gender. The numeral system is quite simple compared to other languages. Finnish has a complete lack of consonant clusters. In addition, the phonology is fairly simple.

Finnish is rated 5, extremely hard to learn.


Estonian has similar difficulties to Finnish, since they are closely related. However, Estonian is more irregular than Finnish. In particular, the very regular agglutination system described in Finnish seems to have gone awry in Estonian. Estonian has 14 cases, including strange cases such as the abessive, adessive, elative and inessive. On the other hand, all of these cases can simply be analyzed as the genitive case plus a single unvarying suffix for each case. In addition, there is no gender, so the only things you have to worry about when forming cases are singular and plural.

Estonian has a strange mood form called the quotative, often translated as “reported speech.”

tema on – he/she/it is

tema olevat – it’s rumored that he/she/it is or he/she/it is said to be

This mood is often used in newspaper reporting and is also used for gossip.

Estonian has an astounding 25 diphthongs. It also has three different varieties of vowel length, which is strange in the world’s languages. There are short, long and extra-long vowels and consonants.

lina – linen – short n
the town’s – long n, written as nn
into the town – extra-long n, not written out!

There are differences in the pronunciation of the three forms above, but in rapid speech, they are hard to hear, though native speakers can make them out. Difficulties are further compounded in that extra-long sonorants (m, n, ng, l, and r) and vowels are not written out. All in all, phonemic length can be a problem in Estonian, and foreigners never seem to get it completely down.

Estonian pronunciation is not very difficult, though the õ sound can cause problems. However, Estonian has completely lost the vowel harmony system that Finnish still has, resulting in words that seem very hard to pronounce.

At least in written form, Estonian is not as complex as Finnish. Estonian can be seen as an abbreviated and modernized form of Finnish. The grammar is also like a simplified version of Finnish grammar and may be much easier to learn.

Estonian is rated 4.5, very to extremely difficult.


Skolt Sami’s Latinization is often listed as one of the worst Latinizations around. The rest of the language is quite similar to, and as difficult as, Finnish.

Skolt Sami gets a 5 rating, extremely hard to learn.


It’s widely agreed that Hungarian is one of the hardest languages on Earth to learn. Even language professors agree. The British Diplomatic Corps did a study of the languages that its diplomats commonly had to learn and concluded that Hungarian was the hardest. Hungarian grammar is maddeningly complex, and Hungarian is often listed on craziest grammar lists. For one thing, there are many different forms for a single word via word modification. This enables the speaker to make his intended meaning very precise. Looking at nouns, there are about 257 different forms per noun.

Hungarian is said to have from 24 to 35 different cases (there are charts available showing 31 cases), but the actual number may only be 18. Nearly everything in Hungarian is inflected, similar to Lithuanian or Czech. Similar to Georgian and Basque, Hungarian has polypersonal agreement, albeit to a lesser degree than those two languages. There are many irregularities in inflections, and even Hungarians have to learn how to spell all of these in school and have a hard time learning this.

The case distinctions alone can create many different words out of one base form. For the word house, we end up with 31 different words using case forms:

házba – into the house
in the house
from [within] the house
onto the house
on the house
off [from] the house
to the house
until/up to the house
at the house
[away] from the house
– Translative case, where the house is the end product of a transformation, such as They turned the cave into a house.
as the house, which could be used if you acted in your capacity as a house or disguised yourself as one. He dressed up as a house for Halloween.
for the house, specifically things done on its behalf or done to get the house. They spent a lot of time fixing things up (for the house).
– Essive-modal case. Something like “house-ly” or in the way/manner of a house. The tent served as a house (in a house-ly fashion).

And we do have some basic cases:

ház – Nominative. The house is down the street.
– Accusative. The ball hit the house.
– Dative. The man gave the house to Mary.
– Similar to instrumental, but more similar to English with. Refers to both instruments and companions.

The genitive takes 12 different declensions, depending on person and number:

házam – my house
my houses
your house
your houses
his/her/its house
his/her/its houses
our house
our houses
your (plural) house
your (plural) houses
their house
their houses
church, as in the Catholic Church. (Literally one-house)

In addition, the genitive suffixes to the possession, which is not how the genitive works in IE.


az ember háza – the man’s house (Lit. the man house-his)
a házam – my house (Lit. the house-my)
a házad – your house (Lit. the house-your)

There are also very long words such as this:

for your (you all possessive) repeated pretensions at being impossible to desecrate

Because Hungarian is an agglutinative language, that word is made up of many small parts of words, or morphemes.

The preposition is stuck onto the word in this language, and this will seem strange to speakers of languages with free prepositions.

Hungarian is full of synonyms, similar to English.

For instance, there are 78 different words that mean to move: halad, jár, megy, dülöngél, lépdel, botorkál, kódorog, sétál, andalog, rohan, csörtet, üget, lohol, fut, átvág, vágtat, tipeg, libeg, biceg, poroszkál, vágtázik, somfordál, bóklászik, szedi a lábát, kitér, elszökken, betér, botladozik, őgyeleg, slattyog, bandukol, lófrál, szalad, vánszorog, kószál, kullog, baktat, koslat, kaptat, császkál, totyog, suhan, robog, rohan, kocog, cselleng, csatangol, beslisszol, elinal, elillan, bitangol, lopakodik, sompolyog, lapul, elkotródik, settenkedik, sündörög, eltérül, elódalog, kóborol, lézeng, ődöng, csavarog, lődörög, elvándorol, tekereg, kóvályog, ténfereg, özönlik, tódul, vonul, hömpölyög, ömlik, surran, oson, lépeget, mozog and mozgolódik.

Only about five of those terms are archaic and seldom used; the rest are in current use. However, to be fair, a Hungarian native speaker might only recognize half of those words.

In addition, while most languages have names for countries that are pretty easy to figure out, in Hungarian even the names of nations are hard because they have changed the names so much. Italy becomes Olaszország, Germany becomes Németország, etc.

As in Russian and Serbo-Croatian, word order is relatively free in Hungarian. It is not completely free as some say but rather is governed by a set of rules. The problem is that as you reorder the words in a sentence, you say the same thing but the meaning changes slightly in terms of nuance. Further, there are quite a few dialects in Hungarian. Native speakers can pretty much understand them, but foreigners often have a lot of problems. Accent is very difficult in Hungarian due to the bewildering number of rules used to determine accent. In addition, there are exceptions to all of these rules. Nevertheless, Hungarian is probably more regular than Polish.

Hungarian spelling is also very strange for non-Hungarians, but at least the orthography is phonetic. Nevertheless, the orthography often makes it onto worst orthographies lists.

Hungarian phonetics is also strange. One of the problems with Hungarian phonetics is vowel harmony. Since you stick morphemes together to make a word, the vowels that you have used in the first part of the word will influence the vowels that you will use to make up the morphemes that occur later in the word. The vowel harmony gives Hungarian a “singing effect” when it is spoken. The ty, ny, sz, zs, dzs, dz, ly, cs and gy sounds are hard for many foreigners to make. The á, é, ó, ö, ő, ú, ü, ű, and í vowel sounds are not found in English.
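
Here is a rough Python sketch of how vowel harmony picks the shape of a suffix. It assumes a simplified rule: a stem containing any back vowel takes the back-vowel variant of the suffix, and any other stem takes the front-vowel variant. The function name into and the second test word kert (garden) are additions for illustration, and real Hungarian has exceptions that this sketch ignores.

```python
# Simplified assumption: any back vowel in the stem selects the back suffix.
BACK_VOWELS = set("aáoóuú")


def into(stem: str) -> str:
    """Attach the 'into' suffix: -ba after back-vowel stems, -be otherwise."""
    has_back_vowel = any(ch in BACK_VOWELS for ch in stem.lower())
    return stem + ("ba" if has_back_vowel else "be")


print(into("ház"))   # házba, the 'into the house' form
print(into("kert"))  # kertbe
```

The vowels already used in the stem decide which variant of the later morpheme is allowed, which is the effect described above.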

Verbs are marked for object (indefinite, definite and person/number), subject (person and number), tense (past, present and future), mood (indicative, conditional and imperative), and aspect (frequency, potentiality, factitiveness, and reflexiveness).

I could make others save you occasionally (on a disk).

Verbs change depending on whether the object is definite or indefinite.

Olvasok könyvet.
I read a book.
(indefinite object)

Olvasom a könyvet.
I read the book.
(definite object)
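
The choice between the two verb forms can be sketched in Python. This is a minimal illustration that only covers the verb and object from the example; the function name i_read is made up, and the sketch assumes a consonant-initial noun so that the article is always a.

```python
def i_read(obj: str, definite: bool) -> str:
    """Build 'I read a/the <obj>' using the two forms from the example."""
    if definite:
        # Definite object: the verb ends in -om and the object takes the article 'a'.
        return f"Olvasom a {obj}."
    # Indefinite object: the verb ends in -ok and there is no article.
    return f"Olvasok {obj}."


print(i_read("könyvet", definite=False))  # Olvasok könyvet.
print(i_read("könyvet", definite=True))   # Olvasom a könyvet.
```

The point is that the speaker has to decide on the definiteness of the object before the verb can even be conjugated.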

As noted in the introduction to the Finno-Ugric section, you need to know quite a bit of Hungarian grammar to be able to express yourself on a basic level. For instance, in order to say:

I like your sister.

you will need to understand the following Hungarian forms:

  1. verb conjugation and definite or indefinite forms
  2. possessive suffixes
  3. case
  4. how to combine possessive suffixes with case
  5. word order
  6. explicit pronouns
  7. articles

It’s hard to say, but Hungarian is probably harder to learn than even the hardest Slavic languages like Czech, Serbo-Croatian and Polish. At any rate, it is generally agreed that Hungarian grammar is more complicated than Slavic grammar, which is pretty impressive as Slavic grammar is quite a beast.

Hungarian is rated 5, extremely hard to learn.


It’s fairly easy to learn to speak Mandarin at a basic level, though the tones can be tough. This is because the grammar is very simple – short words, no case, gender, verb inflections or tense. But with Japanese, you can keep learning, and with Chinese, you often tend to hit a wall, often because the syntactic structure is so strangely different from English (isolating).

Actually, the grammar is harder than it seems. At first it seems simple, like a simplified English. No word is capable of declension, and there is no tense, case, or number, nor are there articles. But the simplicity makes it difficult. No tense means there is no easy way to mark time in a sentence. Furthermore, tense is not as easy as it seems. Sure, there are no verb conjugations, but instead you must learn some particles and special word orders that are used to mark tense. Mandarin has 12 different adverbs for which there is no good English translation.

Once you start digging into Chinese, there is a complex layer under all the surface simplicity. There are such things as aspect, serial verbs, a complex classifier system, syntax marked by something called topic-prominence, a strange form called the detrimental passive, preposed relative clauses, use of verbs rather than adverbs to mark direction, and all sorts of strange stuff. Verb complements can be baffling, especially potential and directional complements. The 把, 是 and 的 constructions can be very hard to understand.

The topic-prominence is interesting in that only a few major languages have topic-comment syntax, and most of those are Oriental languages with a lot of Chinese borrowing. Topicalization is not marked morphologically.

There are sentences where the entire meaning changes with the addition of a single character. Chinese sentences are SVO (Subject – Verb – Object) at their base, but that is a bit of an illusion. A sentence that causes you to discuss time duration makes you repeat the verb after the direct object – SVOVT (T = time phrase). In the case of topicalization, sentences can have the structure of OSV (Object – Subject – Verb). Relative clauses and all subordinate clauses come before the noun they modify. In other words:

English: The man who always wore red walked into the room.
Chinese: Who always wore red the man walked into the room.

The relative clause in the sentences above is marked in bold.

In Chinese, the prepositional phrase comes between the subject and the verb:

English: The man hit the ball into the yard.
Chinese: The man into the yard hit the ball.

The prepositional phrase is bolded in the sentences above.
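
The reordering in that example can be sketched in Python. This is just an illustration using English words as stand-ins for the Chinese constituents; the Clause type and the two function names are made up for the sketch.

```python
from typing import NamedTuple


class Clause(NamedTuple):
    subject: str
    verb: str
    obj: str
    prep_phrase: str


def english_order(c: Clause) -> str:
    # English: subject, verb, object, then the prepositional phrase.
    return f"{c.subject} {c.verb} {c.obj} {c.prep_phrase}."


def chinese_order(c: Clause) -> str:
    # Chinese: the prepositional phrase sits between the subject and the verb.
    return f"{c.subject} {c.prep_phrase} {c.verb} {c.obj}."


clause = Clause("The man", "hit", "the ball", "into the yard")
print(english_order(clause))  # The man hit the ball into the yard.
print(chinese_order(clause))  # The man into the yard hit the ball.
```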

In Chinese, adjectives are actually stative verbs as in Nahuatl and Lakota.

Nàgè rède cài hěnhǎochī.
The it is hot food is good to eat.
The hot food is delicious.

The particle de in rède turns food hot into food it is hot, an attributive verb. It means something like to be.

There are dozens of words called particles which shade the meaning of a sentence ever so slightly.

Chinese phonology is not as easy as some say. There are way too many instances of the zh, ch, sh, j, q, and x sounds in the language such that many of the words seem to sound the same. There is a distinction between aspirated and nonaspirated consonants. There is also the presence of odd retroflex consonants.

Chinese orthography is probably the hardest orthography of any language. The alphabet uses symbols, so it’s not even a real alphabet. There are at least 85,000 symbols and actually many more, but you only need to know about 3-5,000 of them, and many Chinese don’t even know 1,000. To be highly proficient in Chinese, you need to know 10,000 characters, and probably less than 5% of Chinese know that many.

In addition, the characters have not been changed in 3,000 years, and the alphabet is at least somewhat phonetic, so we run into a serious problem of lack of a spelling reform.

The Communists tried to simplify the system (simplified Mandarin), but instead of making the connections between the phonetic aspects of characters more sensible by decreasing their number and increasing their regularity (they did do this somewhat but not enough), they simply decreased the number of strokes needed for each symbol, typically without dealing with the phonetic aspect at all. The simplification did not work well, so now you have a mixture of two different types of written Chinese – simplified and traditional.

In addition to all of this, Chinese borrowed a lot from the Japanese symbolic alphabet a full 1,000 years after it had already been developed and had not undergone a spelling reform, adding insult to injury.

Even leaving the characters aside, the stylistic and literary constraints required to write Chinese in an eloquent or formal (literary) manner would make your head swim. And just because you can read Chinese does not mean that you can read Classical Chinese prose. It’s as if it’s written in a different language – actually, it is technically a different language similar to Middle English or Old English. However, few Middle English or Old English texts are read anymore, and Classical Chinese is still widely read.

However, the orthography is at least consistent. 90% of characters have only one reading. Once you learn the character, you generally know the meaning in any context.

Writing the characters is even harder than reading them. One wrong dot or wrong line either completely changes the meaning or turns the symbol into nonsense.

It’s a real problem when you encounter a symbol you don’t know because there is no way to sound out the word. You are really and truly lost and screwed. There is a clue at the right side of the symbol, but it is not always accurate. You need to learn quite a bit of vocabulary just to speak simple sentences.

Similarly, a dictionary is not necessarily helpful when trying to read Chinese. You can have a Chinese sentence in front of you along with a dictionary, and the sentence still might not make sense even after looking it up in the dictionary.

Some Chinese Muslims write Chinese using an Arabic script. This is often considered to be one of the worst orthographies of all.

The tones are often quite difficult for a Westerner to pick up. If you mess up the tones, you have said a completely different word. Often foreigners who know their tones well nevertheless do not say them correctly, and hence, they say one word when they mean another. However, compared to other tone systems around the world, the tonal system in Chinese is comparatively easy.

A major problem with Chinese is homonyms. To some extent, this is true in many tonal languages. Since Chinese uses short words and is disyllabic, there is a limited repertoire of sounds that can be used. At a certain point, all of the sounds are used up, and you are into the realm of homophones.

Tonal distinctions are one way that monosyllabic and disyllabic languages attempt to deal with the homophone problem, but it’s not good enough, since Chinese still has many homophones, and meaning is often discerned by context, stress, rhythm and intonation. Chinese, like French and English, is heavily idiomatic.

It’s little known, but Chinese also uses different forms (classifiers) to count different things, like Japanese.

There is zero common vocabulary between English and Chinese, so you need to learn a whole new set of lexical forms.

In addition, nouns often show relatedness or hierarchy. For instance, in English, you can simply say my brother or my sister, but in Chinese, you cannot do this. You have to indicate whether you are speaking of an older or younger sibling.

mei mei  younger sister
jie jie  older sister
ge ge    older brother
di di    younger brother
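
As a small sketch of what this means for a learner, the four terms can be put in a Python lookup table keyed on both sex and relative age. The function name sibling_term and the key labels are made up for the illustration; the terms themselves are the ones listed above.

```python
# The four sibling terms from the text, keyed by (sibling, relative age).
SIBLING_TERMS = {
    ("sister", "younger"): "mei mei",
    ("sister", "older"): "jie jie",
    ("brother", "older"): "ge ge",
    ("brother", "younger"): "di di",
}


def sibling_term(sibling: str, relative_age: str) -> str:
    """Look up the term; there is no entry for a sibling of unstated age."""
    try:
        return SIBLING_TERMS[(sibling, relative_age)]
    except KeyError:
        raise ValueError("need both 'sister'/'brother' and 'older'/'younger'") from None


print(sibling_term("sister", "older"))  # jie jie
```

There is no key for plain brother or sister, which mirrors the point that you cannot leave the relative age unspecified.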

Mandarin scored very high on a weirdest languages study.

On the positive side, Chinese grammar is fairly regular, and word derivation and compound words are sensible: the meaning can usually be determined by looking at the word. In other languages, compound words are not necessarily so obvious.

Many agree that Chinese is the hardest to learn of all of the major languages. A recent survey of language professors rated Chinese as the hardest language on Earth to learn.

Mandarin gets a 5.5 rating for nearly hardest of all.

However, Cantonese is even harder to learn than Mandarin. Cantonese has eight tones to Mandarin’s four, and in addition, they continue to use a lot of the older traditional Chinese characters that were superseded when China moved to a simplified script in 1949. Furthermore, since non-Mandarin characters are not standardized, Cantonese cannot be written down as it is spoken.

In addition, Cantonese has verbal aspect, possibly up to 20 different varieties. Modal particles are difficult in Cantonese. Clusters of up to three sentence-final particles are very common. 我食咗飯 and 我食咗飯架啦喎 are both grammatical for I have had a meal, but the particles add the meaning of I have already had a meal, answering a question or even to imply I have had a meal, so I don’t need to eat anymore.

Cantonese gets a 5.5 rating, nearly hardest of all.

Min Nan is also said to be harder to learn than Mandarin, as it has a more complex tone system, with five tones on three different levels. Even many Taiwanese natives don’t seem to get it right these days, as it is falling out of favor, and many fewer children are being raised speaking it than before.

Min Nan gets a 5.5 rating, nearly hardest of all.

A recent 15-year survey out of Fudan University utilizing both the departments of Linguistics and Anthropology looked at 579 different languages in 91 linguistic families in order to try to find the most complicated language in the world. The result was that a Wu language dialect (or perhaps a separate language) in the Fengxian district of southern Shanghai (Dônđän Wu) was the most phonologically complex language of all, with 20 separate vowels (Wang 2012). The nearest competitor was Norwegian with 16 vowels.

Dônđän Wu gets a 5.5 rating, nearly hardest of all.

Classical Chinese is still read by many Chinese people and Chinese language learners. Unless you have a very good grasp on modern Chinese, classical Chinese will be completely wasted on you. Classical Chinese is much harder to read than modern Chinese.

Classical Chinese covers an era extending over 3,000 years, and to attain a reading fluency in this language, you need to be familiar with all of the characters used during this period along with all of the literature of the period so you can understand all the allusions. Even with a knowledge of Classical Chinese, you need to read it in context. If you are good at Classical Chinese and someone throws you a random section of it, it will take you a good amount of time to figure it out unless you know context.

The language is much more to the point than Modern Chinese, but this is not as good as it sounds. This simplicity leaves room for ambiguity, and context plays an important role. A joke about some obscure historical or literary anecdote will be lost on you unless you know what it refers to. For reading modern Chinese, you will need at least 5,000 characters, but even then, you will still need a dictionary. With Classical Chinese, there is no upper limit on the number of characters you need to know. The sky is the limit.

Classical Chinese gets a 6 rating, hardest of all.


In Qiang, a language of Sichuan Province in China, not only are there rhotic vowels, which are present in only 1% of the world’s languages, but there is also rhoticity harmony, where a non-rhotic vowel in a morpheme becomes rhotic when it is followed by a morpheme with a rhotic vowel.

ʀuɑ +e˞ > ʀuɑ˞kʰ
+ w ˞> mw

Rhotic vowels are found in US English – Unstressed ɚ: standard, dinner, Lincolnshire, editor, measure, martyr.

Qiang also has a very bad romanization, so bad that the Qiang will not even use it. Voiced consonants are written by adding a vowel to the symbol for the voiceless consonant. It has long and short vowels, but these are not represented in the system.

Qiang gets a 5 rating, extremely hard to learn.

Western Tibeto-Burman
Central Bodish

Tibetan probably has one of the least rational orthographies of any language. The orthography has not changed in ~1,000 years while the language has gone through all sorts of changes. A language learner in Tibet can get by using phonetic spelling. The problem comes when you try to spell using the Classical Alphabet. For instance:

Srong rtsan Sgam po (written)
soŋtsɛn ɡampo (spoken)

bsgrubs (written)
d`up (spoken)

While the orthography is etymological and completely outdated, it is quite predictable.

Tibetan gets a 5 rating, extremely hard to learn.


Dzongkha, the official language of Bhutan, has some pretty wild phonology, in addition to having the Tibetan writing system, this time using Bhutanese forms of the Tibetan script.

It contrasts all of the following: s, , ʰs, ʰsʰ, ts, ʰts, tsʰ, z, ʱz, dz, ʱdz, ⁿsʰ, ᵐtsʰ, ⁿtsʰ, ⁿdz, ᵖts, ᵖtsʰ, ᵖtsʷʰ, and ᶲs, and in addition it has four tones, but there is no single word that is distinguished by tone only. On top of that, there are 22 different vowels.

Dzongkha gets a 5 rating, extremely hard to learn.


Vietnamese is also hard to learn because to an outsider, the tones seem hard to tell apart. Therefore, foreigners often make themselves difficult to understand by not getting the tone precisely correct. It also has “creaky-voiced” tones, which are very hard for foreigners to get a grasp on.

Vietnamese grammar is fairly simple, and reading Vietnamese is pretty easy once you figure out the tone marks. Words are short as in Chinese. However, the simple grammar is relative, as you can have 25 or more forms just for I, the 1st person singular pronoun. In addition, the Latin orthography is said to be quite bad. It was invented by missionaries a few centuries ago, and it has never made much sense.

Vietnamese gets a 5 rating, extremely hard to learn.


Khmer has a reputation for being hard to learn. I understand that it has one of the most complex honorifics systems of any language on Earth. Over a dozen different words mean to carry depending on what one is carrying. There are several different words for slave depending on who owned the slave and what the slave did. There are 28-30 different vowels, including sets of long and short vowels and long and short diphthongs. The vowel system is so complicated that there isn’t even agreement on exactly what it looks like. Khmer learners, especially speakers of IE languages, often have a hard time producing or even distinguishing these vowels.

Speaking it is not so bad, but reading and writing it is pretty difficult. For instance, you can put up to five different symbols together in one complex symbol. The orthographic script is even worse than the Thai one. There are actually rules to this mess, but no one seems to know what they are.

Khmer gets a 4.5 rating, very to extremely hard.

North Bahnaric

Sedang, a language of Vietnam, has the highest number of vowel sounds of any language on Earth, at 55 distinct vowel sounds.

Sedang gets a 5 rating, extremely hard to learn.


Hmong is widely spoken in this part of California, but it’s not easy to learn. There are eight tones, and they are not easy to figure out. It’s not obviously related to any other major language but the obscure Mien.

It has some very strange consonants called voiceless nasals. We have them in English as allophones – the m in small is voiceless, but in Hmong, they put them at the front of words – the m in the word Hmong is voiceless. These can be very hard to pronounce.

The romanization is widely criticized for being a lousy one, but the Hmong use it anyway.

Hmong gets a 5 rating, extremely hard to learn.


Tsou is a Taiwanese aborigine language spoken by about 2,000 people in Taiwan. It has the odd feature whereby the underlying glides y and w turn into or surface as non-syllabic mid vowels e̯ and o̯ in certain contexts:

jo~joskɨ -> e̯oˈe̯oskɨ = fishes

Tsou is also ergative like most Formosan languages. Tsou is the only language in the world that has no prepositions or anything that looks like a preposition. Instead it uses nouns and verbs in the place of prepositions. Tsou allows more potential consonant clusters than most other languages. About 1/2 of all possible CC clusters are allowed.

Tsou has an inclusive/exclusive distinction in the 1st person plural and a very strange visible and non-visible distinction in the 3rd person singular and plural. Both adjectives and adverbs can turn into verbs and are marked for voice in the same way that verbs are. Verbs are extensively marked for voice. Nouns are marked for a variety of odd cases, often referring to perception, (visible/invisible) person, and place deixis.

‘e –               visible and near speaker
si/ta –           visible and near hearer
ta –               visible but away from speaker
‘o/to –           invisible and far away, or newly introduced to discourse
na/no ~ ne – non-identifiable and non-referential (often when scanning a class of elements)

Tsou gets a 5 rating, extremely hard to learn.


Bahasa Indonesia is an easy language to learn. For one thing, the grammar is dead simple. There are only a handful of prefixes, only two of which might be seen as inflectional. There are also several suffixes. Verbs are not marked for tense at all. And the sound system, in common with Austronesian in general, is one of the simplest on Earth, with only two dozen phonemes. Bahasa Indonesia has few homonyms, homophones, homographs, or heteronyms. Words in general have only one meaning.

Though the orthography is not completely phonetic, it only has a small number of nonphonetic exceptions. The orthography is one of the easiest on Earth to use.

The system for converting words into either nouns or verbs is regular. To make a plural, you simply repeat a word, so instead of saying pencils, you say pencil pencil.
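The reduplication rule is simple enough to sketch in a couple of lines of Python. This is only an illustration: the function name pluralize is made up here, and joining the two copies with a hyphen is an assumption based on standard Indonesian spelling, not something stated above.

```python
def pluralize(noun: str, separator: str = "-") -> str:
    """Form a plural by full reduplication: the noun is simply said twice."""
    # The hyphen default is an assumption (standard spelling convention).
    return f"{noun}{separator}{noun}"


print(pluralize("pensil"))  # pensil-pensil
```

Pass separator=" " if you want the "pencil pencil" rendering used in the paragraph above.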

Bahasa Indonesia gets a 1.5 rating, extremely easy to learn.

Malay is only easy if you learn the standard spoken form or one of the creoles. Learning the literary language is quite a bit more difficult. However, the Jawi script, which is Malay written in Arabic script, is often considered to be perfectly awful.

Malay gets a 2 rating, moderately easy.

Greater Central Philippine
Central Philippine

However, Tagalog is much harder than Malay or Indonesian. Compared to many European languages, Tagalog syntax, morphology and semantics are often quite different. Also, Tagalog is typically spoken very fast. Unlike Malay, verbs conjugate quite a bit in Tagalog. The main idea of Tagalog grammar is something called focus. Once you figure that out, the language gets pretty easy, but until you understand that concept, you are going to have a hard time.

Everything is affixed in Tagalog.

However, articles and the creation of adjectives from nouns are very easy.


ganda – beauty (noun)
maganda – beautiful (adjective)

Tagalog gets a 4 rating, very difficult.

Central-Eastern Malayo-Polynesian
Eastern Malayo-Polynesian
Central-Eastern Oceanic
Remote Oceanic
Central Pacific
East Fijian-Polynesian

Maori and other Polynesian languages have a reputation for being quite easy to learn. The main problem for English speakers is that the sentence structure is backwards compared to English. In addition, macrons can cause problems.

One problem with Maori is dialects. They are so diverse that there are multiple words for the same thing. Swiss German has a similar issue, with up to 50 words for each common household item (nearly every major dialect has its own word for common objects). Some Maori examples:

ngongi, noni, koki, wai – water
, rarangi, hiri – to plait, to twist, to weave
, maitai – good
, , tutehu, mātika – to stand
, mou – to hold
, pou – to be exhausted
, tohorā – whale
, ngohi – fish
, kāwai – line
, kori, keukeu, koukou, neke, nuku – to move
, hara, here, horo, whano – to go, to come
, hapa, – to be wrong
, wānanga, rūnanga – to discuss
, tahunga – priest
, maikuku – finger nail
, konohi, mata, whatu, kamo, karu – eye, face

Entire Maori sentences can be written with vowels only.

E uu aau?
Are yours firm?

I uaa ai.
It rained as usual.

I ui au ‘i auau aau?’
E uaua!
It will be difficult/hard/heavy!

On the plus side, the pronunciation is simple, and there is no gender. The language is as regular as Japanese. No Polynesian language has more than 16 sounds, and they all lack tones. They all have five vowels, which can be either long or short. A consonant must be followed by a vowel, so there are no consonant clusters. All consonants are easy to pronounce.
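The "every consonant must be followed by a vowel" rule can be checked mechanically. Here is a rough Python sketch. The consonant and vowel inventory below is my assumption from standard Maori spelling (ten consonants, with ng and wh written as digraphs, and five vowels that may carry a macron); it is not taken from the paragraph above, and the function name is invented for illustration.

```python
import re
import unicodedata

# Assumed inventory (standard Maori orthography, not from the text above).
CONSONANT = r"(?:ng|wh|[hkmnprtw])"
VOWEL = r"[aeiouāēīōū]"

# A word is a run of (C)V units: an optional consonant, then a vowel.
WORD = re.compile(rf"(?:{CONSONANT}?{VOWEL})+")


def is_well_formed(word: str) -> bool:
    """True if the word has no consonant clusters and no final consonant."""
    # Normalize so that macron vowels are single precomposed characters.
    normalized = unicodedata.normalize("NFC", word.lower())
    return WORD.fullmatch(normalized) is not None


for w in ["tohorā", "ngohi", "wānanga", "uaua", "skip"]:
    print(w, is_well_formed(w))
```

Words made of vowels only, like uaua, pass the check because the consonant slot is optional.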

Maori gets a 3 rating, average difficulty.


Hawaiian is a pretty easy language to learn. It is easy to pronounce, has a simple alphabet, lacks complex morphology and has a fairly simple syntax.

Hawaiian gets a 2 rating, very easy to learn.

North and Central Vanuatu
East Santo

Sakao is a very strange language spoken by 4,000 people in Vanuatu. It is a polysynthetic Austronesian language, which is very weird. It allows extreme consonant clusters. Sakao has an incredible seven degrees of deixis. The language has an amazing four numbers: singular, dual, paucal and plural. The neighboring language Tolomako has singular, dual, trial and plural. The trial form is very odd. Sakao’s paucal derived from Tolomako’s trial:

jørðœl – they, from three to ten

jørðœl løn
the five of them
(Literally, they three, five)

All nouns are always in the singular except for kinship forms and demonstratives, which only display the plural:

ðjœɣ – my mother/aunt -> rðjœɣ – my aunts

walðyɣ – my child -> raalðyɣ – my children

It has a number of nouns that are said to be “inalienably possessed”, that is, whenever they occur, they must be possessed by some possessor. These often take highly irregular inflections:

Sakao 	  English
œsɨŋœ-ɣ   my mouth
œsɨŋœ-m   thy mouth
ɔsɨŋɔ-n   his/her/its mouth
œsœŋ-...  ...'s mouth	

uly-ɣ 	  my hair
uly-m 	  thy hair
ulœ-n 	  his/her/its hair
nøl-...   ...'s hair

Here, mouth is either œsɨŋœ-, ɔsɨŋɔ- or œsœŋ-, and hair is either uly-, ulœ- or nøl-
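One way to picture why these forms have to be memorized is as a lookup table: the stem has to be looked up per possessor before the suffix goes on. The Python sketch below just re-encodes the table above. The dictionary layout, the person labels (1sg, 2sg, 3sg) and the function name are mine, and the fourth row (the form used before a possessor noun) is left out.

```python
# Stem alternants copied from the table above, keyed by possessor.
STEMS = {
    "mouth": {"1sg": "œsɨŋœ-", "2sg": "œsɨŋœ-", "3sg": "ɔsɨŋɔ-"},
    "hair": {"1sg": "uly-", "2sg": "uly-", "3sg": "ulœ-"},
}

# Possessor suffixes, also from the table: my, thy, his/her/its.
SUFFIXES = {"1sg": "ɣ", "2sg": "m", "3sg": "n"}


def possessed(noun: str, possessor: str) -> str:
    """Look up the right stem for this possessor, then add the suffix."""
    return STEMS[noun][possessor] + SUFFIXES[possessor]


print(possessed("mouth", "3sg"))  # ɔsɨŋɔ-n
print(possessed("hair", "1sg"))   # uly-ɣ
```

Note that there is no rule in the code that predicts the stem change. That is the point: the learner has nothing to go on but the table.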

Sakao, strangely enough, may not even have syllables in the way that we normally think of them. If it does have syllables at all, they would appear to be at least a vowel optionally  surrounded by any number of consonants.

i (V)

Having sung and stopped singing thou kept silent.

Sakao has a suffix -in that makes an intransitive verb transitive and makes a transitive verb ditransitive. Ditransitive verbs can take two arguments – a direct object and an instrumental.

Mɨjilɨn amas ara. / Mɨjilɨn ara amas.
He kills the pig with the club. / He kills with the club the pig.

Sakao polysynthesis allows compound verbs, each one having its own instrument or object:

Mɔssɔnɛshɔβrɨn aða ɛðɛ.
He-shooting-fish-kept-on-walking with-a-bow the-sea.
He walked along the sea shooting the fish with a bow.

Sakao gets a 5 rating, extremely hard to learn.

Central-Eastern Oceanic
Southeast Solomonic
Malaita–San Cristobal
Northern Malaita

Kwaio is an Austronesian language spoken in the Solomon Islands. It has four different forms of number to mark pronouns – not only the usual singular and plural, but also the rarer dual and the very rare paucal. In addition, there is an inclusive/exclusive contrast in the non-singular forms.

For instance:

1 dual inclusive (you and I)
1 dual exclusive (I and someone else, not you)

1 paucal inclusive (you, I and a few others)
1 paucal exclusive (I and a few others)

1 plural inclusive (I, you and many others)
1 plural exclusive (I and many others)

Pretty wild!
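To see how fast the first person paradigm grows, you can enumerate it. This Python sketch simply crosses the four numbers with the inclusive/exclusive contrast described above (which only applies outside the singular). The labels and the function name are made up for illustration; no actual Kwaio pronoun forms are given here.

```python
from itertools import product

NUMBERS = ["singular", "dual", "paucal", "plural"]
CLUSIVITY = ["inclusive", "exclusive"]


def first_person_slots() -> list[str]:
    """List every first person slot a learner has to keep apart."""
    slots = ["1 singular"]  # no inclusive/exclusive contrast in the singular
    slots += [f"1 {n} {c}" for n, c in product(NUMBERS[1:], CLUSIVITY)]
    return slots


for slot in first_person_slots():
    print(slot)
print(len(first_person_slots()))  # 7
```

That is seven slots where English makes do with two (I and we).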

Kwaio gets a 5 rating, extremely hard to learn.

Greater Barito
East Barito

Malagasy, the official language of Madagascar, has a reputation for being even easier to learn than Indonesian or Malay.

Malagasy gets a 1 rating, easiest of all to learn.


Thai is a pretty hard language to learn. There are 75 symbols in the strange script, there are no spaces between words in the script, and vowels can come before, after, above or below consonants in any given syllable. There seem to be many different glyphs for every consonant, but the different glyphs for the same consonant will sometimes change the sound of the neighboring vowel. The orthography is as insensible as that of English, since centuries have gone by with no spelling reforms. In fact, Thai has not changed its system in 1,000 years. The wild card of having tone thrown in adds to the insanity.

Consonant pronunciations vary depending on the location of the syllable in the word – for instance, s can change to t. There are many vowels which are spoken but not written. There are many consonants that are pronounced the same – for instance, there are six different t‘s, not counting the s‘s that turn into t‘s. The Thai script is definitely one of the most difficult phonetic scripts. Nevertheless, the Thai script is easier to learn than the Japanese or Chinese character sets. In spite of all of that, the syntax is simple, like Chinese.

There are five tones, including a neutral tone. Tones are determined by a variety of complex things, including a combination of tone marks, the class of consonants, if the syllable ends in a sonorant or a stop and what the tone of the preceding syllable was. Tone marking in the orthography is quite complex.

The vowels are different than in many languages, and there are some unusual diphthongs: eua, euai, aui and uu. There is a contrast between aspirated and unaspirated consonants.

There is a system of noun classifiers for counting various things, similar to Japanese. In addition, common to many Asian languages, there is a complicated honorifics system.

On the plus side, Thai is a regular language, with few exceptions to the rules. However, the rules are quite complex. The syntax is about as complex as that of Chinese, and the grammar is dead simple.

Thai gets a 5 rating, extremely hard to learn.

Lao is very similar to Thai. In fact, it is identical to a Thai language spoken by 16 million people in northeast Thailand called Northeastern Thai. The Lao script is similar to Thai, but it has fewer letters, so there is somewhat less confusion.

Lao gets a 4.5 rating, very to extremely hard to learn.


The Kam languages of the Dong people in southwest China were rated by the Fudan University study referenced above under Wu as the 2nd most phonologically complex on Earth (Wang 2012). There are 32 stem initial consonants, including oddities like tɕʰ, pʲʰ, ɕ, kʷʰ, ŋʷ, tʃʰ, tsʰ. Note the many contrasts between aspirated and unaspirated voiceless consonants, including bilabial palatalized stops, labialized velar stops, and alveolar affricates. There are an incredible 64 different syllable finals, and 14 others that occur only in Chinese loans.

There are an astounding 15 different tones, nine in open syllables and six in checked syllables (entering tones). Main tones are high, high rising, high falling, low, low rising, low falling, mid, dipping and peaking. When they speak, it sounds as if they are singing.

Kam gets a 5 rating, extremely hard to learn.


According to the Fudan University study quoted above, Buyang is the 3rd most phonologically complex language in the world. Buyang is a cluster of 4 related languages spoken by 1,900 people in Yunnan Province, China. Buyang has a completely wild consonant inventory.

It has a full set of both voiced and voiceless plain and aspirated stops, including voiceless uvulars. The contrast between aspirated and plain voiced stops is peculiar. The stop series also has distinctions between palatalized and rounded stops throughout the series. It has a labialized voiceless palatal fricative and a voiceless dental aspirated lateral, unusual sounds. It has four different voiceless aspirated nasals. It has voiceless y and w, more odd sounds. It also has plain and labialized palatal glides.

That is one heck of a wild phonology.

Buyang gets a 5 rating, extremely hard to learn.


The West African Kwa language Ga has a reputation for being a tough nut to crack. It is spoken in Ghana by about 600,000 people. It has two tones and engages in a strange behavior called tone terracing that is common to many West African languages. There is a phonemic distinction between three vowel lengths: all vowels can be short, long or extra long. It also has many sounds that are not in any Western languages.

Ga gets a 5 rating, extremely hard to learn.

Central Bia

Anyi is a language spoken by 610,000 people in Côte d’Ivoire. It is relatively straightforward as far as African languages go. Probably the hardest part about the language is that it is tonal, and it does have two tones. The phonology does have the unusual ±ATR contrast which will seem very odd. ATR stands for advanced tongue root, so the language has a contrast between vowels with an advanced tongue root and without one. However, the grammar is pretty regular. There are few confusing phonological processes.

Anyi has a simple tense system, with only present, past and future. There is no aspect, mood or voice marking, and it lacks the noun class systems so common in many African languages. It has a plural marker, but it is often optional.

The syntax does have serial verbs, which will seem odd to Westerners. It distinguishes between relative clauses and subordinate clauses, each marked with its own particle.

Anyi gets a 4 rating, very hard to learn.

Narrow Bantu

Ndali is a Bantu language with 150,000 speakers spoken in Malawi and Tanzania. It has many strange tense forms. For instance, in the past tense:

Past tense A: He went just now.
Past tense B: He went sometime earlier today.
Past tense C: He went yesterday.
Past tense D: He went sometime before yesterday.

Future tense is marked similarly:

Future tense A: He’s going to go right away.
Future tense B: He’s going to go sometime later today.
Future tense C: He’s going to go tomorrow.
Future tense D: He’s going to go sometime after tomorrow.
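The logic of the system is easier to see written out as a decision rule. Below is a rough Python sketch that picks one of the four labels used above (A through D) from how many days away the event is. The function name, the days_away parameter and the immediate flag are all my own inventions for illustration; the actual Ndali tense markers are not shown.

```python
def ndali_tense(direction: str, days_away: int, immediate: bool = False) -> str:
    """Pick tense label A-D from the distance between now and the event.

    direction  -- "past" or "future"
    days_away  -- 0 for today, 1 for yesterday/tomorrow, 2+ for further off
    immediate  -- True for "just now" / "right away" (only matters today)
    """
    if direction not in ("past", "future"):
        raise ValueError("direction must be 'past' or 'future'")
    if days_away < 0:
        raise ValueError("days_away cannot be negative")

    if days_away == 0:
        letter = "A" if immediate else "B"
    elif days_away == 1:
        letter = "C"
    else:
        letter = "D"
    return f"{direction.capitalize()} tense {letter}"


print(ndali_tense("past", 1))                    # Past tense C
print(ndali_tense("future", 0, immediate=True))  # Future tense A
```

An English speaker makes none of these choices when saying "he went" or "he will go", which is where the difficulty comes from.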

Ndali gets a 5 rating, extremely hard to learn.


Xhosa, a language of South Africa, is quite difficult, with up to nine click sounds. Clicks only exist in one language outside of Africa – the Australian language Damin – and are extremely difficult to learn. Even native speakers mess up the clicks sometimes. Nelson Mandela said he had problems making some of the click sounds in Xhosa. The phonemics in general of Xhosa are pretty wild.

Xhosa gets a 5 rating, extremely hard to learn.

Zulu and Ndebele also have these impossible click sounds. However, outside of click sounds, the phonology of Nguni languages is straightforward. All Nguni languages are agglutinative. These languages also make plurals by changing the prefix of the noun, and the manner varies according to the noun class. If you want to look up a word in the dictionary, first of all you need to discard the prefix. For instance, in Ndebele,

imifula, but

–  amatsheyet


Ndebele gets a 5 rating, extremely hard to learn.

Zulu has pitch accent, tones and clicks. There are nine different pitch accents, four tones and three clicks, but each click can be pronounced in five different ways. However, tones are not marked in writing, so it’s hard to figure out when to use them. Zulu also has depressor consonants, which lower the tone in the vowel in the following syllable. In addition, Zulu has multiple gender – 15 different genders. And some nouns behave like verbs. It also has 12 different noun classes, but 90% of words are part of a group of only three of those classes.

Zulu gets a 5 rating, extremely hard to learn.


For unknown reasons, Swahili is generally considered to be an easy language to learn. The US military ranks it 1, among the easiest of all languages to learn. This seems to be the typical perception. Why Swahili is so easy to learn, I am not sure. It’s a trade language, and trade languages are often fairly easy to learn. There’s also a lot of controversy about whether or not Swahili can be considered a creole, but that has not been proven. For the moment, the reasons why Swahili is so easy to learn will have to remain mysterious.

On the down side, Swahili has many noun classes, but they have the benefit of being more or less logical.

Swahili gets a 2 rating, moderately easy.

Southern Africa

!Xóõ (Taa), spoken by only 4,200 Bushmen in Botswana and Namibia, is a notoriously difficult Khoisan language replete with the notoriously impossible to comprehend click sounds. Taa has anywhere from 130 to 164 consonants, the largest phonemic inventory of any language. Of this vast wealth of sounds, there are anywhere from 30-64 different click sounds. There are five basic clicks and 17 accompanying ones. Speakers develop a lump on their larynx from making the click sounds.

In addition, there are four types of vowels: plain, pharyngealized, breathy-voiced and strident. On top of that, there are four tones. Taa appears on many lists of the wildest phonologies and craziest languages period on Earth.

Taa gets a 5 rating, extremely hard to learn.


Ju|’hoan, a Khoisan language spoken by 5,000 people in Botswana, has one of the wildest phonological inventories on Earth. The voiced aspirated consonants – b͡pʰ, d͡tʰ, d͡tsʰ, d͡tʃʰ, ɡ͡kʰ, and ᶢǃʰ – are particularly odd. Some question whether these segments actually exist and say that they are instead spoken with a “breathy-voice.” However, voiced aspirated consonants do appear to be real. In addition, Ju|’hoan has a closed class of only 17 adjectives since descriptive functions are done by verbs. They are the following:

(those remaining)
other (strange)
a certain

the numbers one through four

Ju|’hoan scored very high on a study of the weirdest languages on Earth.

Ju|’hoan gets a 5 rating, extremely hard to learn.


Inuktitut is extremely hard to learn. Inuktitut is polysynthetic-agglutinative, and roots can take many suffixes, in some cases up to 700. Verbs have 63 forms of the present indicative, and conjugation involves 252 different inflections. Inuktitut has the complicated polypersonal agreement system discussed under Georgian above and Basque below. In a typical long Inuktitut text, 92% of words will occur only once. This is quite different from English and many other languages where certain words occur very frequently or at least frequently. Certain fully inflected verbs can be analyzed both as verbs and as nouns. Words can be very long.
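A figure like that 92% is the kind of thing you can measure on any text with a few lines of code. Here is a minimal Python sketch, assuming the figure means the share of distinct word forms that show up exactly once (I am not certain that is how the original count was done). The function name is invented, and splitting on whitespace is a crude stand-in for real tokenization.

```python
from collections import Counter


def share_occurring_once(text: str) -> float:
    """Share of distinct word forms that occur exactly once in the text."""
    counts = Counter(text.lower().split())
    if not counts:
        return 0.0
    once = sum(1 for n in counts.values() if n == 1)
    return once / len(counts)


# Five tokens, four distinct forms, three of them seen only once.
print(share_occurring_once("the cat saw the dog"))  # 0.75
```

In a polysynthetic language nearly every fully built word is a one-off, so this number climbs far higher than it does for long English texts, where the same small words keep coming back.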

I truly don’t know how to speak Inuktitut very well.

You may need to analyze up to 10 different bits of information in order to figure out a single word. However, the affixation is all via suffixes (there are no prefixes or infixes) and the suffixation is extremely regular.

Inuktitut is also rated by linguists as one of the hardest languages on Earth to pronounce. Inuktitut may be as hard to learn as Navajo.

Inuktitut is rated 6, hardest of all.

Kalaallisut (Western Greenlandic) is very closely related to Inuktitut. Look at this sentence:

However, they will say that he is a great entertainer, but …

That word is composed of 12 separate morphemes. A single word can conceptualize what could be an entire sentence in a non-polysynthetic language.

Kalaallisut is rated 6, hardest of all.


Chukchi is a polysynthetic, agglutinating and incorporating language and is often listed as one of the hardest languages on Earth to learn.

I have a fierce headache.

There are five morphemes in that word, and there are three lexical morphemes (nouns or adjectives) incorporated in that word: meyŋ (great), levt (head), and pəγt (ache).

Chukchi gets a 6 rating, hardest of all.


Basque, of course, is just a wild language altogether. There is an old saying that the Devil tried to learn Basque, but after seven years, he only learned how to say Hello and Goodbye. Many Basques, including some of the most ardent Basque nationalists, tried to learn Basque as adults. Some of them succeeded, but a very large number of them failed. Based on the number that failed, it does seem that Basque is harder for an adult to learn as an L2 than many other languages are. Basque grammar is maddeningly complex and it often makes it onto craziest grammars and craziest language lists.

There are 11 cases, and each one takes four different forms. The verbs are quite complex. This is because it is an ergative language, so verbs vary according to the number of subjects and the number of objects and if any third person is involved.

This is the same polypersonal agreement system that Georgian has above. Basque’s polypersonal system is a polysynthetic system consisting of two verb types – synthetic and analytical. Only a few verbs use the synthetic form.

Three of Basque’s cases – the absolutive (intransitive verb case), the ergative (transitive verb case) and the dative – can be marked via affixes to the verb. In Basque, only present simple and past simple synthetic tenses take polypersonal affixes.

The analytical forms are composed of more than one word, while the synthetic forms are all one word. The analytic verbs are built via the synthetic verbs izan (be), ukan (have) and egin (do).


d-akar-ki-o-gu = We bring it to him/her. The verb is ekarri (bring).
z-erama-zki-gu-te-n = They took them to us. The verb is eraman (take).


Ekarriko d-i-o-gu = We’ll bring it to him/her. Literally: We will have-bring it to him/her. The analytic verb is built from ukan (have).

Eraman d-ieza-zki-gu-ke-te = They can take them to us. Literally: They can be taking them to us. The analytic verb is built from izan (be).

Most of the analytic verbs require an auxiliary which carries all sorts of information that is often carried on verbs in other languages – tense, mood, sometimes gender and person for subject, object and indirect object.

Jaten naiz.
Eat I-am-doing.
I am eating.

Jaten nintekeen.
Eat I-was-able-to.
I could eat.

Eman geniezazkiake.
Give we-might-have-them-to-you-male.
We might have given them to you.

In the above, naiz, nintekeen and geniezazkiake are auxiliaries. There are actually 2,640 different forms of these auxiliaries!

A language with ergative morphosyntax in Europe is quite a strange thing, and Basque is the only one of its kind. The ergative itself is quite unusual:

Gizona etorri da.
The man has arrived.
Gizonak mutila ikusi du.
The man saw the boy.

-a = the

The noun gizon takes a different form depending on whether it is the subject of a transitive or an intransitive verb. In the first sentence it is in the absolutive case (unmarked), while in the second it is in the ergative case (marked by the morpheme -k). If you come from a non-ergative IE language, the concept of ergativity itself is difficult enough to conceptualize, much less trying to actually learn an ergative language. Consequently, any ergative language will automatically be more difficult than a non-ergative one for all speakers of IE languages.
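The marking on the subject in the two example sentences can be sketched as a tiny rule. This Python snippet only covers what the examples show: a consonant-final noun like gizon, the definite article -a, and the ergative -k. The function name is made up, and real Basque nouns ending in a vowel would need more care than this.

```python
def definite_subject(noun: str, transitive: bool) -> str:
    """Build the definite subject form of a consonant-final noun.

    Absolutive (subject of an intransitive verb): noun + -a
    Ergative   (subject of a transitive verb):    noun + -a + -k
    """
    definite = noun + "a"
    return definite + "k" if transitive else definite


print(definite_subject("gizon", transitive=False))  # gizona
print(definite_subject("gizon", transitive=True))   # gizonak
```

The rule itself is trivial. What trips up IE speakers is having to ask "is the verb transitive?" before they can even say the subject.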

Ergativity also works with pronouns. There are four basic systems:

Nor:           verb has subject only
Nor-Nork:          "    subj. + direct complement
Nor-Nori:          "    subj. + indirect comp.
Nor-Nori-Nork:     "    subj. + indir. + dir. comps.
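Read as a decision table, the four systems depend on just two yes/no questions. The Python sketch below follows the labels exactly as given in the table above. The function and parameter names are mine and only illustrate how the system name is picked, not how the auxiliary itself is conjugated.

```python
def auxiliary_system(has_direct: bool, has_indirect: bool) -> str:
    """Name the auxiliary system from the complements the verb takes."""
    parts = ["Nor"]  # every verb has at least a subject
    if has_indirect:
        parts.append("Nori")
    if has_direct:
        parts.append("Nork")
    return "-".join(parts)


print(auxiliary_system(has_direct=False, has_indirect=False))  # Nor
print(auxiliary_system(has_direct=True, has_indirect=True))    # Nor-Nori-Nork
```

Picking the system is the easy part. Each system then has its own full set of auxiliary forms, which is where the thousands of forms mentioned earlier come from.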

Some call Basque the most consistently ergative language on Earth.

If you don’t grow up speaking Basque, it’s hard to attain native speaker competence. It’s quite a bit easier to write in Basque than to speak it.

Nevertheless, Basque verbs are quite regular. There are only a few irregularities in conjugations and they have phonetic explanations. In fact, the entire language is quite regular. In addition, most words above the intermediate level are borrowings from large languages, so once you reach intermediate Basque, the rest is not that hard. In addition, pronunciation is straightforward.

Basque is rated 5.5, nearly hardest of all.


Dorani, Yakir. Hebrew speaker, Israel. August 2013. Personal communication.

Hewitt, B. G. 2005. Georgian: A Learner’s Grammar, p. 29.

Kim, Yuni. December 16, 2003. Vowel Elision and the Morphophonology of Dominance in Aymara. UC Berkeley.

Kirk, John William Carnegie. 1905. A Grammar of the Somali Language: With Examples in Prose and Verse and an Account of the Yibir and Midgan Dialects, pp. 73-74.

Rogers, Jean H. 1978. Differential Focusing in Ojibwa Conjunct Verbs: On Circumstances, Participants, and Events. International Journal of American Linguistics 44: 167-179.

Wang, Chuan-Chao et al. 2012. Comment on “Phonemic Diversity Supports a Serial Founder Effect Model of Language Expansion from Africa.” Science 335: 657.

This research takes a lot of time, and I do not get paid anything for it. If you think this website is valuable to you, please consider a contribution to support more of this valuable research.

