What important words are missing from HSK?

Learning Chinese can sometimes lack structure and feel confusing, especially if you study on your own. There are few reliable reference points, and it’s easy to understand why many turn to standardised tests, not just for assessment, but for guidance as to what to study and when.

HSK (Hànyǔ Shuǐpíng Kǎoshì) is by far the most well-known such test, and there are many textbooks, courses and learning resources specifically geared towards taking students through levels of increasing difficulty. It’s not uncommon to hear about students who say that they’re “working their way through HSK3” and similar.

While I think the idea of using a proficiency test to guide your learning and as the main source of new vocabulary is a bit backwards, I also understand why people do so, especially if they need the certificate to apply for a scholarship or a job that requires Chinese. If you care about your grades, your decisions should not be guided only by what makes sense from a language learning perspective.

Tune in to the Hacking Chinese Podcast to listen to the related episode (#2 or #266).

So, we have a large number of students who, for some reason, focus heavily on HSK study materials in general and HSK word lists in particular. This raises an interesting question: If you focus on HSK, what other things would you miss? Or, more specifically, if you learn words mostly from the HSK lists, what common words would you miss?

This article will provide an answer to that question. If you’re just interested in checking out the words, you can click here to skip to the word lists at the end of the article. If you’re learning Chinese in Taiwan and are more interested in the TOCFL test, check this follow-up article about that very topic: What important words are missing from TOCFL?

For those of you who want to know a little bit more, I’ll go through the process in more detail before we get to the actual words.

What important words are missing from HSK?

It should be clear that HSK is not meant to be a representation of the most commonly used Chinese words. This is very obvious in the lower levels, where words like “train station” and “bus” are part of HSK1, which has only 150 words in total. Those words are nowhere near the top 150 words in Chinese in general, but they are of course important for foreigners visiting and travelling in China, which is probably why they are included.

Overall, I think the lower levels of HSK match the needs of foreign students quite well. I have spent dozens of hours poring over these lists when creating the sentence pack for my beginner course, Unlocking Chinese, and in general, there aren’t that many weird decisions about which words to include.

In other words, the purpose of this article is not to complain about HSK, but rather to highlight some very common words that were left out in favour of other words. Most of them were left out for good reasons, but this doesn’t mean that you shouldn’t learn these words!

Chinesehasnospacingbetweenwordssofiguringoutwhatawordisisnoteasy

The biggest problem when discussing words in Chinese is that there is no clear definition of what a word actually is. Since there’s no spacing between words, figuring out what is a word and what isn’t is hard. 你 is a word, but is 你好 a word? Most dictionaries say no. What about 你们? Or if you think 你好 is a word, what about 你们好? What about 老师好?

I think you’ll agree that 你 is a word and that 老师好 is not a word, but where to draw the line is not obvious, especially if you have to rely on an automated method (needed to deal with databases with millions and millions of characters).

The question of wordhood in Chinese is complex, and something I can’t go into in this article, but the bottom line is that different methods of separating Chinese text into words (segmentation) will yield different results.

This means that it’s hard to compare a word frequency list to the HSK list directly, simply because they have different standards for what a word is. If you just check for things that appear in a frequency list, but not in HSK, many of the results you get will be things that are actually not words, such as 那个 and 出来.
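To make the segmentation problem concrete, here is a minimal sketch using greedy forward maximum matching, one simple segmentation technique (not necessarily the one used to build any particular frequency list). The `segment` function and both tiny lexicons are made up for illustration. The only difference between the two runs is whether the lexicon treats 那个 and 出来 as words.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest chunk found in the lexicon, else a single character."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical lexicons that disagree on what counts as a word.
strict_lexicon = {"那", "个", "人", "出", "来", "了"}
loose_lexicon = strict_lexicon | {"那个", "出来"}

sentence = "那个人出来了"
strict_words = segment(sentence, strict_lexicon)
loose_words = segment(sentence, loose_lexicon)

print(strict_words)
print(loose_words)
```

The same sentence produces two different word lists, so a frequency count built on one segmentation will contain entries that a list built on the other would never have.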

What does “common” mean, anyway?

The next problem is what frequency list to choose. How do you decide what a “common” word is? There are many frequency lists, of course, but most are based on written Chinese, which is much more formal than the language most students encounter. If we compared one of these lists with HSK to see how they differed, the result is easy to predict: characters and words used in formal, written Chinese would appear high on the frequency list, but low, if at all, in HSK. That would be neither helpful nor interesting.

Instead, I chose to look at word frequencies from the SUBTLEX-CH corpus (Cai and Brysbaert, 2010), which consists of Chinese subtitles from movies and TV series. This is still not naturally spoken Chinese, but it’s a lot closer to that than books and newspapers are. For a thorough look at resources for word, character and component frequencies in Chinese, please refer to this article:

The most common Chinese words, characters and components for language learners and teachers

At first, I thought that the fact that the corpus includes foreign movies and TV series translated into Chinese would be a big disadvantage, but the more I worked on this project, the more I realised that it is actually a potential advantage.

Many of the words that are common in Chinese subtitles but absent from the HSK lists refer to things that are not typically Chinese, such as “baseball” and “jury”. If you’re a foreigner (why else would you study HSK?), learning such words is useful, not because they have a natural place in China, but because they do in your home country, and you might want to talk about them in Chinese, especially if you aren’t living in China.

Plugging gaps in your Chinese vocabulary

Next, the goal is to identify holes in the vocabulary of a student who focuses on HSK vocabulary only, not to find any word that doesn’t exist in HSK. I normally advise students to only use word lists for plugging holes, not to expand vocabulary in general. The difference is that plugging holes is about finding words much more common than those you are currently learning, but which you have somehow missed.

For example, if you’re currently at HSK3 but somehow missed the word “train station”, that would be a hole in your vocabulary. It’s much easier than the HSK3 words you know, but you missed it somehow. However, if you don’t know the word for “elevator”, this can’t really be seen as a hole, because it’s on your level and something you can’t really say that you have “missed”.

Identifying common words missing from HSK

For each HSK level, I checked the general frequency list for words that were twice as common as the HSK level in question indicated, and listed all words missing from HSK.

For example, for HSK1-3, which contains 600 words, I checked the top 300 words in the frequency list, and noted all that did not appear in HSK1-3. This means that if you’ve completed HSK3, you might have missed these words. For HSK5, which contains a cumulative total of 2500 words, I checked the top 1250 words in the frequency list to see which were missing. This makes sure we’re talking about actual holes in your vocabulary.
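The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `missing_words` is hypothetical, and the frequency order and the six-word "HSK" set are toy data, not taken from SUBTLEX-CH or the real HSK lists.

```python
def missing_words(frequency_ranked, hsk_cumulative):
    """Return words from the top half-sized slice of the frequency list
    that do not appear in the cumulative HSK vocabulary."""
    # E.g. 600 cumulative HSK words -> check the top 300 by frequency.
    cutoff = len(hsk_cumulative) // 2
    return [w for w in frequency_ranked[:cutoff] if w not in hsk_cumulative]


# Toy data for illustration only (most frequent first).
toy_frequency = ["的", "我", "这样", "你", "是", "电话", "好"]
toy_hsk = {"的", "我", "你", "是", "好", "吃"}  # 6 words -> cutoff of 3

holes = missing_words(toy_frequency, toy_hsk)
print(holes)
```

In this toy run, 这样 is flagged because it sits inside the cutoff, while 电话 is not, because it falls outside the top half-sized slice and so does not count as a hole at this level.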

This generated a list of roughly 1000 words that were missing from all HSK levels. I then manually went through the whole list, deciding which of these were actually words students might want to learn. Here are the decisions I made when deciding what words should be included, but you can get the full list at the end of the article if you prefer:

  • Words that are also part of words that are in the HSK are included. Example: 但是 is in HSK, but 但 on its own is not. I included 但 because it’s deemed to be a word. Some cases are less obvious, such as 唱歌, which is in HSK, but 唱 and 歌 are not there separately and might not be obvious for students.
  • Combinations of words that are in HSK and form phrases are excluded. Example: 这 and 个 are in HSK, but 这个 is not. 这个 is excluded because it’s not deemed to be a word.
  • Words plus particles that are in HSK are excluded. Example: 你们 is a combination of a word and a particle, and can be assumed to be known, even if it’s not in HSK.
  • Verbs plus complements are excluded if the meaning is obvious from the parts. Example: 找到 is ignored because it’s assumed that you know what it means if you know what 找 means and how 到 works.
  • Single-character words that are in HSK only as part of longer words are excluded if the meaning is obvious. Example: It’s assumed that you know what 前 means if you know what 前面 means.
  • Duplications of words that are in HSK are excluded. Example: 看看 is not counted as a word, since 看 is in HSK.
  • Adverbs plus verbs are excluded if the meaning is obvious from the constituent parts, and those parts are in HSK. Example: 只是 is not included because its meaning is obvious from knowing 只 and 是.
  • All negated words are excluded, so 不要 or 不能 are not included, because these are normally not considered to be words. If the meaning is deemed non-obvious to students, such as 无法, it is included, though.
  • Characters that aren’t words that can be used on their own are excluded. For example, 者 is hardly ever used as a word on its own and is not included. It would only appear as part of words.
  • Phrases and expressions are not deemed to be words and are excluded. For example, 怎么样 and 没什么 are not included.
  • Logical extensions of words that are in the HSK are excluded, so even if 以前 is in HSK, but 以后 is not, 以后 is still not included.
  • All erisation (儿化音) is excluded. Example: 一点儿 is excluded if 一点 is included.

Remember, the goal here is to generate missing words in HSK that you might want to learn. Thus, it makes no sense to include 不要 in the list, because no one would regard that as a new word you actually need to learn. Similarly, if you know 饭馆儿, it doesn’t make sense to treat 饭馆 as a new word either.
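The sorting described above was done by hand, but a few of the more mechanical principles (duplication, negation with 不, and erisation) could be approximated in code to flag candidates for review. The sketch below is an assumption about how that might look, not the method actually used. The helper `is_obvious_variant` and the toy word sets are hypothetical, and a human would still need to check the results, since a rule like this cannot tell when a meaning is non-obvious.

```python
def is_obvious_variant(word, hsk_words):
    """Flag words that look like trivial variants of a known HSK word."""
    # Duplication, e.g. 看看 when 看 is in HSK.
    half = len(word) // 2
    if len(word) % 2 == 0 and half > 0:
        if word[:half] == word[half:] and word[:half] in hsk_words:
            return True
    # Negation with 不, e.g. 不要 when 要 is in HSK.
    if len(word) > 1 and word[0] == "不" and word[1:] in hsk_words:
        return True
    # Erisation, e.g. 一点儿 when 一点 is already known.
    if len(word) > 1 and word.endswith("儿") and word[:-1] in hsk_words:
        return True
    return False


# Toy data for illustration only.
toy_hsk = {"看", "要", "能", "一点"}
candidates = ["看看", "不要", "不能", "一点儿", "无法", "电话"]

kept = [w for w in candidates if not is_obvious_variant(w, toy_hsk)]
print(kept)
```

Note that 无法 survives the filter, which matches the principle that negated words with a non-obvious meaning should stay on the list.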

Types of words left out of the HSK word lists

This culling resulted in a list of roughly 650 words (meaning that I manually removed around 300 based on the principles described above). These are actual words that I think there’s a real chance you might want to learn as a student.

I identified several categories of words that were missing from HSK, presented below with some examples:

  • Many single-character words are missing – I included these only when they didn’t violate any of the principles above, and when they can actually be used on their own. I think most students will know what 饭 means, even if they have only learnt 吃饭, but I chose to include these because it’s not obvious that you can use these independently. If you’re the kind of student that only learns characters in the context of words, you should definitely learn these at least. Other such words missing from HSK: 话,山,车,美.
  • Names of places and countries are missing – These are highly relevant for students, but are not part of HSK. Most textbooks have them, but if you focus solely on HSK, you will miss important names like 美国,英国 and 日本. There are also common Western personal names, but these can be ignored.
  • Regional variants are missing – This might sound like it’s a good thing at first, but extremely common regionally preferred words are excluded entirely, mainly those being used in Mandarin spoken in the south. You should definitely learn these, even if you live in the north. Here are some examples of missing words: 这里,哪里,讲话,老公,礼拜一.
  • Profanity is missing entirely – This is not hard to understand, but if you look at the most common words in TV dramas and movies, there’s going to be a lot of swearing. None of that is in HSK, not even the mild ones. Examples: 傻瓜,笨蛋.
  • Foreign things are mostly missing – Things, places and phenomena that aren’t that common in China are not included in HSK. Examples: 棒球,女王,骑士. Most vocabulary related to religion is also missing: 教堂,上帝,圣诞节.
  • Particles in informal language are missing – While some are included, many are not: 哟,耶,哦,嗯. These are extremely common and it’s nice to know them.

Words that are significantly delayed in HSK

The above discussion is mostly about what’s left out of HSK entirely, but there are also words that have been significantly delayed. Prioritising words suitable for learners also means that other words that are very common have been pushed further down the lists. Which are they?

I have not attempted to sort these words into categories, but many of them are more formal or written expressions that are common in Chinese, but tend to be left out in learning materials, or at least delayed until written, formal language is introduced. This is true even if the frequency list I used for this project uses spoken language. I have included words that are significantly delayed in HSK as separate lists below.
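The article does not spell out an exact threshold for "significantly delayed", so the sketch below rests on an assumption: a word counts as delayed at a given level if it falls inside the same half-sized frequency cutoff used for missing words, but HSK only introduces it at a higher level. The function name `delayed_words`, the level assignments and the frequency order are all toy values chosen for illustration.

```python
def delayed_words(frequency_ranked, hsk_level_of, level, cumulative_size):
    """Return frequent words that HSK introduces only above `level`.

    Words absent from HSK altogether are treated as missing, not
    delayed, so they are skipped here.
    """
    cutoff = cumulative_size // 2
    return [
        w for w in frequency_ranked[:cutoff]
        if hsk_level_of.get(w, 0) > level
    ]


# Toy data for illustration only (most frequent first).
toy_frequency = ["的", "所有", "这样", "我", "也许"]
toy_levels = {"的": 1, "我": 1, "所有": 4, "也许": 4}

delayed = delayed_words(toy_frequency, toy_levels, level=3, cumulative_size=6)
print(delayed)
```

Here 所有 is reported as delayed, 这样 is ignored because it is not in the toy HSK data at all, and 也许 falls outside the cutoff.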

Lists of missing and delayed words in all levels of HSK

I will now share the complete lists, including the raw list of missing words before my manual sorting, for those who want to have a go themselves. Most students, though, can simply check any HSK level at or below their current level and see what words they might have missed.

You will probably find that you know most of these, but you can safely assume that those that you don’t know would be good to know, at least if movie and TV subtitles are a good guide to spoken word frequency, which is shown to be the case in the paper linked to earlier (Cai and Brysbaert, 2010).

Please note that this sorting was done manually and probably contains some inconsistencies. My goal was to include words that students at this level might want to know and that there is a fair chance that you’d miss if you only focus on HSK. I have also created a deck with all these words in Skritter for your convenience!

  • Words missing in HSK1-3 (39)
  • Words delayed in HSK1-3 (42)
  • Words missing in HSK4 (67)
  • Words delayed in HSK4 (28)
  • Words missing in HSK5 (143)
  • Words delayed in HSK5 (37)
  • Words missing in HSK6 (408)
  • All missing words by level in Skritter
  • All missing words by level (CSV)
  • All missing words by level, raw unsorted (CSV)

If you have any questions or suggestions for how to use this material, please leave a comment below!

References and further reading

Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character frequencies based on film subtitles. PLoS ONE, 5(6), e10729.

The images used for the HSK levels for this article are from Skritter and are used here with permission.

Words missing from HSK1-3 (39)

  • Learn these words in Skritter
  • View meanings in MDBG
这样
这里
这么
那么
就是
一点
电话
那样
那里
之前
伙计
上帝
女人
今晚
哪里
女孩
美国

Words delayed in HSK1-3 (42)

  • Learn these words in Skritter
  • View meanings in MDBG
所有
也许
不过
发生
一切
抱歉
感觉
肯定
以为
生活
任何
家伙
继续
亲爱
父亲
完全
宝贝
可是

Words missing from HSK4 (67)

  • Learn these words in Skritter
  • View meanings in MDBG
男人
之后
那儿
里面
怎样
如此
无法
房子
听说
人们
混蛋
看来
长官
案子
之间
变成
极了
看上去
进入

Words delayed in HSK4 (28)

  • Learn these words in Skritter
  • View meanings in MDBG
如何
女士
根本
确定
兄弟
拜托
或许
唯一
表现
绝对
整个
处理
行动
失去
作为
曾经
总统
伤害
控制

Words missing from HSK5 (143)

  • Learn these words in Skritter
  • View meanings in MDBG
能够
昨晚
父母
玩笑
小子
男孩
美元
想法
纽约
大学
加油
多久
开枪
派对
介意
谋杀
兴趣
小孩
晚安
线
法官
记住
犯罪
白痴
加入
病人
同性恋
好运
面前
剩下
杀死
杀人
警官
自杀
同样
足够
之一
家人
停止
婊子
最终
大人
女朋友
为何
要么
洛杉矶
杀手
证人
垃圾
男朋友
测试
有时候
晚餐
脑子
见鬼
服务
汽车
哥们
英国
笨蛋
玩意
法庭
拯救

Words delayed in HSK5 (37)

  • Learn these words in Skritter
  • View meanings in MDBG
而已
夫人
尸体
现场
监狱
死亡
拥有
凶手
屁股
选手
投票
撒谎
武器
发誓
意识
线索
失踪
真相
恐怖
性感
意味着
交易
恶心
舞蹈
事件
攻击
试图
淘汰
子弹
毒品
袭击
灵魂
爆炸

Words missing from HSK6 (408)

  • Learn these words in Skritter
  • View meanings in MDBG
受害者
炸弹
联邦
评委
客气
其它
美好
名单
部落
教堂
救命
疯子
乐队
高中
惊喜
女性
说谎
黑人
警方
法国
老天
酒店
之类
约翰
傻瓜
英里
地狱
自我
祈祷
舞台
开车
指纹
住手
实验室
恶魔
舞会
家族
今年
意大利
起诉
长大
尿
陪审团
吸血鬼
普通
天使
队长
日本
墨西哥
旅馆
邪恶
追踪
好莱坞
女生
圣诞节
人民
小组
分子
白人
黑暗
晚饭
全都
打赌
伦敦
特工
停车
回头
队员
搜查
神父
之中
火车
逃跑
拍摄
年轻人
大街
箱子
耶稣
红色
骗子
留言
大脑
指控
星球
录像
怪物
杀害
内心
大多数
汤姆
午餐
午饭
歌曲
英尺
有的
船长
歌手
美女
讲话
团队
罢了
德国
大部分
分享
保安
变态
所谓
华盛顿
姐妹
查理
宝宝
蓝色
鬼魂
前进
乔治
牧师
杂种
警长
受害人
伤口
魔法
独自
麦克
女孩子
强大
分开
检察官
好看
机器人
打败
巴黎
好玩
早餐
弗兰克
男友
美金
夜晚
单身
女王
失陪
药物
女友
大麻
比尔
邮件
英语
干活
鞋子
陛下
血液
大卫
保佑
在场
上尉
谎言
葬礼
城里
委员会
议员
餐馆
罗马
部队
背后
装置
三明治
棒球
注意力
签名
海滩
描述
神像
医疗
隐藏
车子
阁下
当作
犹太人
提要
启动
加州
夏天
作证
好笑
说法
谋杀案
当成
飞行
老公
市长
男生
之内
犯人
名人
海军
车祸
早晨
辈子
礼拜
监控
拍照
彼得
做爱
非洲
警报
出局
面试
迈阿密
钱包
贱人
粉丝
走运
吸毒
手段
农场
口袋
印度
嫌犯
行李
玛丽
地板
悲伤
尖叫
小丑
演唱
唱片
骑士

Words delayed in HSK6 (0)

There are, by definition, no words delayed in HSK6, because there’s no higher level to delay them to. This will probably change in 2021, so I will likely revisit the topic of missing and delayed words in HSK then!
