Chinese corpus linguistic research at Lancaster University


The Lancaster Corpus of Mandarin Chinese (LCMC)

The UCLA Written Chinese Corpus 2nd edition (UCLA2)

The ZJU Corpus of Translational Chinese (ZCTC)

The Lancaster Los Angeles Spoken Chinese Corpus (LLSCC)

Callhome Mandarin Chinese Transcripts - XML Version

Routledge Chinese Frequency Dictionary Corpus

The PDC2000 corpus of Chinese News Text

The Babel English-Chinese Parallel Corpus

Major works on Chinese corpus linguistic research at Lancaster


As a major world-class research centre for corpus linguistics, Lancaster University has developed a range of corpora of spoken and written Chinese, which are specifically designed for grammatical and lexical studies of Mandarin as well as for contrastive and translation studies of Chinese and English. This briefing paper provides an introduction to these corpus resources. For the purpose of this introduction, the corpora are grouped into four categories: a) corpora of the Brown family; b) a corpus for lexicographic research; c) spoken Chinese corpora; and d) specialised corpora.

1. Corpora of the Brown family

This section introduces three Chinese corpora of the Brown family, namely the Lancaster Corpus of Mandarin Chinese (LCMC), the UCLA Written Chinese Corpus - second edition (UCLA2), and the ZJU Corpus of Translational Chinese (ZCTC).

With the exponentially increasing power of computers, the drastically decreasing cost of data storage, and the growing availability of texts, especially those in electronic format, corpora have grown larger and larger: from the first electronic corpus in modern linguistics, the one-million-word Brown University corpus of American English (Brown), and its British counterpart LOB (the Lancaster-Oslo/Bergen Corpus of British English) in the 1960s, to the 100-million-word British National Corpus (BNC) in the 1990s, and to the ever-growing monitor corpus the Bank of English (BoE), which presently contains 500 million words. Nevertheless, the classic Brown corpus model has not been abandoned, because of its careful design and balance, a quality which is essential in linguistic research but is unfortunately missing from many of today's gigantic corpora. On the contrary, the Brown model has been adopted in developing a series of corpora that represent American and British English as used in 1901, 1931, 1961, 1991, and 2006, which have enabled synchronic and diachronic studies of the two major varieties of English. Nor is the Brown corpus model confined to the English language.

1.1. The Lancaster Corpus of Mandarin Chinese

The first Chinese corpus developed by following the Brown model is the widely used Lancaster Corpus of Mandarin Chinese (LCMC, McEnery & Xiao 2004), which is designed as a Chinese counterpart of the FLOB corpus (a recent update of LOB representing British English as used in the early 1990s), with the primary aim of contrasting English and Chinese. FLOB, following the Brown/LOB model, is composed of five hundred 2,000-word text chunks of written British English sampled from fifteen genres produced in 1991-1992, totalling one million words.

Table 1. Genres covered in FLOB

Code   Genre                              No. of samples
A      Press reportage                    44
B      Press editorials                   27
C      Press reviews                      17
D      Religion                           17
E      Skills, trades and hobbies         38
F      Popular lore                       44
G      Biographies and essays             77
H      Reports and official documents     30
J      Science (academic prose)           80
K      General fiction                    29
L      Mystery and detective fiction      24
M      Science fiction                    6
N      Adventure fiction                  29
P      Romantic fiction                   29
R      Humour                             9
       Total                              500

Table 1 shows the genres covered in FLOB. In LCMC, the FLOB design is followed strictly, with two minor variations. The first relates to the sampling frame: western and adventure fiction (text category N) is replaced with martial arts fiction. There are four reasons for this decision. Firstly, there is virtually no western fiction written in Chinese for a Mainland Chinese audience. Secondly, martial arts fiction is broadly a type of adventure fiction and as such can reasonably be viewed as category N material. Thirdly, martial arts fiction is a very popular and important fiction type in China and hence should be represented. Finally, the language used in martial arts fiction is a distinctive language type which, given the wide distribution of martial arts fiction in China, merits sampling in its own right. The language of the martial arts fiction texts is distinctive in that, even though these texts have been published recently, they are written in a form of modern Chinese styled to appear like classical Chinese, known as 'vernacular Chinese' (báihuà 'plain speech', as opposed to wényán wén 'text of written language', a traditional style of written Chinese).

The second variation on the FLOB design was caused by problems encountered in keeping to the FLOB sampling period. Because of the poor availability of electronic Chinese texts in some categories for 1991, the sampling period had to be modified slightly by including some samples from ±2 years of 1991 where not enough samples were readily available for 1991 itself. As a result, around 87% of the texts in LCMC were produced within ±1 year of 1991. It is assumed that varying the FLOB model in this way does not substantially affect the comparability of LCMC and FLOB.

The LCMC corpus was constructed using written Mandarin texts published in Mainland China only, so as to ensure some degree of textual homogeneity. Two forms of corpus annotation were undertaken on LCMC: word segmentation and part-of-speech tagging. As a written Chinese text is a running string of characters without delimiting spaces between words, the first step in Chinese language processing is to segment the running text into legitimate word tokens, a process known as 'word segmentation' or 'tokenisation' (McEnery et al. 2006: 35). The segmentation tool used to process the LCMC corpus was ICTCLAS, the Chinese Lexical Analysis System developed by the Institute of Computing Technology, Chinese Academy of Sciences. The core of the system lexicon incorporates a frequency dictionary of 80,000 words with part-of-speech information. Based on a multi-layer hidden Markov model, the software package integrates modules for word segmentation, part-of-speech tagging and unknown word recognition (cf. Zhang et al. 2002). The integrated system is reported to achieve a precision rate of 97.58% and a recall rate as high as 99.94% for word segmentation, and a precision rate of 97.16% for part-of-speech tagging, with a recall rate of over 90% for unknown words and 98% for Chinese person names (Zhang & Liu 2002).
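To illustrate what the segmentation task involves, the sketch below implements greedy forward maximum matching against a tiny, invented lexicon. This is a deliberately simple stand-in for ICTCLAS, whose actual approach is the multi-layer hidden Markov model described above.

```python
# A toy word segmenter: greedy forward maximum matching against a small,
# invented lexicon. This only illustrates the tokenisation task; it is
# NOT the HMM-based method used by ICTCLAS.

TOY_LEXICON = {"语料库", "语言学", "研究"}   # hypothetical dictionary entries
MAX_WORD_LEN = 3                             # longest lexicon entry (characters)

def forward_max_match(text: str) -> list[str]:
    """Segment `text` by repeatedly taking the longest lexicon match;
    unknown characters fall through as single-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in TOY_LEXICON:
                tokens.append(candidate)
                i += length
                break
    return tokens

print(forward_max_match("语料库语言学研究"))   # → ['语料库', '语言学', '研究']
```

Greedy matching fails on genuinely ambiguous strings, which is precisely why statistical models such as ICTCLAS's HMM are used in practice.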

Both metadata and linguistic annotation are marked up in the extensible markup language (XML). While the original texts collected for inclusion in LCMC were encoded in the local character set GB2312, the LCMC corpus itself is encoded in Unicode, using the 8-bit Unicode Transformation Format (UTF-8). The combination of XML with Unicode represents the current international standard in corpus development, especially when non-ASCII writing systems are involved.
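A minimal sketch of this XML-plus-UTF-8 markup style, using Python's standard library; the `<s>`/`<w>` element names and the `pos` attribute here are hypothetical, not LCMC's actual annotation scheme.

```python
# Build a tiny sentence element in the XML + UTF-8 style described above.
# The element and attribute names are invented for illustration.
import xml.etree.ElementTree as ET

s = ET.Element("s", n="1")                    # one sentence
for token, pos in [("语料库", "n"), ("研究", "v")]:
    w = ET.SubElement(s, "w", pos=pos)        # one <w> per segmented token
    w.text = token

xml_bytes = ET.tostring(s, encoding="utf-8")  # serialised as UTF-8 bytes
print(xml_bytes.decode("utf-8"))
```

Serialising to UTF-8 bytes rather than a local encoding is what makes the files portable across platforms and tools.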

The Lancaster Corpus of Mandarin Chinese is published by the European Language Resources Association (No. W0039) and the Oxford Text Archive (No. 2474). It is an open source corpus licensed free of charge for academic and educational purposes.

1.2. The UCLA Written Chinese Corpus

The UCLA Written Chinese Corpus – second edition (UCLA2, Tao & Xiao 2012) has been created by the Lancaster team in collaboration with the University of California at Los Angeles. It follows the LCMC corpus design and serves as an update of LCMC for the first decade of the 21st century. Since this period is of special significance because of the impact of the Internet on language, especially on Chinese, the corpus is an excellent complement to LCMC.

The text samples in UCLA2 were all collected from written modern Chinese available on the Internet during the period 2000-2012, though some texts may have been converted from paper-based publications of earlier years. Text types are matched as closely as possible to the Brown corpus model, with some variations (e.g. adventure fiction) to accommodate Chinese characteristics, while the proportions for the different text categories may vary from the English counterparts and LCMC, as shown in Table 2.

Table 2. UCLA Written Chinese Corpus (second edition)

Press: reportage
Press: editorials
Press: reviews
Skills, trades and hobbies
Popular lore
Essays and biographies
Misc. (reports and official documents)
Academic prose
General fiction
Mystery and detective stories
Science fiction
Adventure stories
Romantic fiction

The same tool was used in tokenising and part-of-speech tagging the UCLA2 corpus; and like LCMC, this corpus is also Unicode (UTF-8) and XML-compliant. The two comparable monolingual corpora facilitate diachronic studies that explore the potential impact of the Internet on the Chinese language between the early 1990s and the 2000s, when the Web developed rapidly in China.

The UCLA2 corpus is released by UCREL, Lancaster University, and is publicly available via its online search engine.

1.3. The ZJU Corpus of Translational Chinese

Another corpus that follows the LCMC design is the ZJU Corpus of Translational Chinese (ZCTC), which was created as a translational counterpart of the native Chinese corpus LCMC, with the explicit aim of studying the distinctive features of translated Chinese in relation to comparable, non-translated Chinese texts.

Both the LCMC and ZCTC corpora sample five hundred 2,000-word text chunks from fifteen written text categories published in China, each corpus amounting to one million words (see Table 1).

Since the LCMC corpus was designed as a Chinese match for the FLOB/Frown corpora of British/American English, with the specific aim of comparing and contrasting English and Chinese, it also followed the FLOB/Frown sampling period, sampling written Mandarin Chinese from within three years of 1991. While it was relatively easy to find native Chinese texts published in this sampling period, it proved much more difficult to get access to translated Chinese texts in some categories - especially in electronic format - published in this time frame. This pragmatic consideration of data collection forced us to modify the LCMC model slightly by extending the sampling period by a decade, i.e. to 2001, when we built the ZJU Corpus of Translational Chinese (Xiao 2010). This extension has been particularly useful because the popularisation of the Internet and online publication in the 1990s made it much easier to access a large amount of digitised texts. Readers should bear this modification in mind when interpreting results based on a comparison of the LCMC and ZCTC corpora.

While English is the source language of the vast majority of the text samples included in the ZCTC corpus, we have also included a small number of texts translated from other languages to mirror the reality of the world of translation in China.

Like LCMC, the ZCTC corpus is marked up in XML that is compliant with the Corpus Encoding Standard (CES). It is encoded in Unicode, using the 8-bit Unicode Transformation Format (UTF-8), which is a lossless encoding for Chinese while keeping the XML files at a minimum size. The LCMC and ZCTC corpora have been built following comparable sampling criteria and the same sampling techniques, and they have been processed using the same tools, to ensure maximum comparability.
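The point about UTF-8 being a lossless yet compact encoding for Chinese can be seen directly: common CJK characters encode to three bytes each, and decoding restores the original string exactly.

```python
# UTF-8, as used for ZCTC, encodes Chinese losslessly and compactly:
# each common CJK character takes three bytes, and decoding restores
# the original text exactly.
text = "语料库"                            # three Chinese characters
encoded = text.encode("utf-8")
print(len(encoded))                        # 9 bytes: 3 bytes per character
assert encoded.decode("utf-8") == text     # lossless round trip
```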

ZCTC is an open source corpus, which has recently been published by Shanghai Jiao Tong University Press and is searchable online via the CQP corpus hub hosted at Beijing Foreign Studies University (with test as both the username and password).

2. Corpus for lexicographic research

The three corpora introduced above are all balanced corpora that are well suited to grammatical studies and to research into the distribution of linguistic features across fine-grained usage contexts or genres. But at one million words each, these corpora are rather small by today's standards, particularly for lexicographic research. This section introduces a large balanced corpus that we created for our frequency dictionary of Mandarin Chinese (Xiao et al. 2009), namely the Routledge Chinese Frequency Dictionary Corpus.

For a dictionary that aims to provide a frequency-based core vocabulary for learners, a well-composed corpus is essential. Such a corpus must satisfy four requirements. First, it must be large enough to provide a basis for reliable quantification. Second, it must achieve a reasonably wide coverage of registers, so that learners are exposed to commonly used words in different communication contexts. Third, the language contained in the corpus must be current. Finally, in addition to the quality of the data per se, corpus processing must be sufficiently reliable; this is particularly important for a Chinese frequency dictionary because running Chinese text must first be segmented into legitimate tokens (a computational process known as segmentation or tokenisation) before the words can be annotated with word class information.

Table 3. Routledge Chinese Frequency Dictionary Corpus

Component      Word tokens
Spoken         3.4 million
News           16 million
Fiction        15 million
Non-fiction    15 million
Total          c. 50 million (c. 73 million Chinese characters)

The corpus is composed of written and spoken texts from four broad categories as shown in Table 3, totalling roughly 50 million word tokens (or 73 million Chinese characters). The Spoken component contains 3.4 million words, covering face-to-face conversations, telephone calls, cross-talks, movie and play scripts, interviews, storytelling, public lectures, radio broadcasts, and public debates, which were mostly produced in the 1990s and 2000-2006. The News component comprises 16 million words of newswire texts released in 1995 by the Xinhua News Agency and newspaper texts published by the People’s Daily in 1998 and 2000, in addition to the news categories in the Lancaster Corpus of Mandarin Chinese (LCMC) and the UCLA Written Chinese Corpus. The Fiction component amounts to 15 million words, including all fiction categories in LCMC and UCLA Chinese corpora in addition to novels and short stories sampled from various periods in the 20th century, with the majority published in the 1980s-1990s. The Non-fiction component is composed of all informative categories in LCMC and UCLA corpora, together with various non-literary texts of different genres such as official documents, academic prose, applied writing and popular lore which were sampled from different periods in the second half of the 20th century, totalling 15 million words.

Once the texts (including transcripts of spoken data) were collected, the next step was to segment the running strings of characters into word tokens. The segmentation tool we used was ICTCLAS, the same tool used to process LCMC. Once the texts had been tokenised and annotated with part-of-speech information, the corpus was converted from the local character encoding GB2312 into Unicode (UTF-8), with register information and linguistic annotation marked up in the extensible markup language (XML).

The corpus provided an up-to-date empirical basis that allowed us to define a core vocabulary for intermediate learners of Chinese as a foreign language: the 5,000 most commonly used Chinese words and the 2,000 most common Chinese characters. In comparison with the Syllabus of Graded Words and Characters for Chinese Proficiency compiled by the Hànyǔ Shuǐpíng Kǎoshì (HSK, 'the Chinese Proficiency Test') Committee, which was published in 1992 and revised in 2001, our frequency-based word lists are arguably more relevant to present-day life.

The compilation of the HSK lexical syllabus, which was also corpus-based, started in 1988, and the latest texts covered were produced in 1991. Unsurprisingly, most "new words" included in the syllabus date from the early 1980s, while some words that were common in the 1970s-1980s, e.g. 少先队 shàoxiānduì 'young pioneer', are no longer common enough to merit a place on the list. On the other hand, many well-established vocabulary items which are commonly used today as a result of technological and social development, as attested in our corpus with high frequency and dispersion rates, are not covered in the syllabus, for example (in order of frequency): 网络 wǎngluò 'network', 手机 shǒujī 'mobile phone', 媒体 méitǐ 'media', 客户 kèhù 'client', 机制 jīzhì 'mechanism', 市场经济 shìchǎng jīngjì 'market economy', 出国 chūguó 'go abroad', 品牌 pǐnpái 'brand', 消费者 xiāofèizhě 'consumer', 上网 shàngwǎng 'go online', 总裁 zǒngcái 'CEO', 董事长 dǒngshìzhǎng 'chair of the board of directors', 打工 dǎgōng 'do odd jobs', 上市 shàngshì 'put on the market; (of a company) be listed', 开发区 kāifāqū 'development zone', 股市 gǔshì 'stock market', 超市 chāoshì 'supermarket', 出租车 chūzūchē 'taxi', 屏幕 píngmù 'screen', 证券 zhèngquàn 'stock, share', 下岗 xiàgǎng 'get laid off', and 网站 wǎngzhàn 'website'. A comparison of the HSK graded vocabulary with the frequencies based on the Routledge Frequency Dictionary Corpus also suggests that the corpus on which the HSK vocabulary is based relies too heavily on the Beijing dialect, as evidenced by dialectal usage like 半拉 bànlā 'half' (Level 2) and words ending with the retroflex suffix -r, including Level 1 words such as 小孩儿 xiǎoháir 'child' and 面条儿 miàntiáor 'noodle, pasta', Level 2 words such as 聊天儿 liáotiānr 'chat' and 墨水儿 mòshuǐr 'ink', and Level 3 words such as 拐弯儿 guǎiwānr 'turn a corner, make a turn' and 药水儿 yàoshuǐr 'liquid medicine'. Words like these are normally listed in a dictionary without the retroflex -r, which is tagged in our corpus as a suffix.
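Frequency-with-dispersion ranking of the kind mentioned above can be illustrated with Juilland's D, a dispersion statistic widely used in frequency-dictionary work. The subfrequency figures below are invented, and the Routledge dictionary's exact measure may differ; this is a sketch of the general technique only.

```python
# Juilland's D = 1 - V / sqrt(n - 1), where V is the coefficient of
# variation of a word's subfrequencies across n corpus sections.
# D ranges over [0, 1]; 1 means perfectly even dispersion.
from math import sqrt
from statistics import mean, pstdev

def juilland_d(subfreqs: list[float]) -> float:
    """Dispersion of a word across corpus sections (invented data below)."""
    n = len(subfreqs)
    m = mean(subfreqs)
    if m == 0:
        return 0.0
    v = pstdev(subfreqs) / m          # coefficient of variation
    return 1 - v / sqrt(n - 1)

# A word used evenly across registers scores far higher than one with the
# same total frequency concentrated in a single register.
print(round(juilland_d([25, 24, 26, 25]), 3))   # even → 0.984
print(round(juilland_d([95, 2, 2, 1]), 3))      # concentrated → 0.067
```

Combining such a dispersion score with raw frequency is what keeps register-bound words from dominating a learner word list.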

We are delighted to find that some of the above vocabulary issues have been remedied in the latest edition of the six-level HSK test syllabus. For example, currently obsolete vocabulary items such as 少先队 shàoxiānduì 'young pioneer' and dialectal words such as 半拉 bànlā 'half' have been removed from the vocabulary list; the retroflex suffix -r has also been removed from words such as 拐弯儿 guǎiwānr 'turn a corner, make a turn', 聊天儿 liáotiānr 'chat', and 面条儿 miàntiáor 'noodle, pasta' (but not from 墨水儿 mòshuǐr 'ink'); and some of the new vocabulary items in our frequency dictionary have also been included in the new HSK lexical syllabus (e.g. 网络 wǎngluò 'network', 手机 shǒujī 'mobile phone', 媒体 méitǐ 'media', 客户 kèhù 'client', 上网 shàngwǎng 'go online', 总裁 zǒngcái 'CEO', 超市 chāoshì 'supermarket', 出租车 chūzūchē 'taxi', and 网站 wǎngzhàn 'website').

3. Spoken Chinese corpora

An important feature of the Routledge Chinese Frequency Dictionary Corpus is that it contains some spoken data (about 6%). Very few spoken Chinese corpora are available and suitable for linguistic research; the Lancaster Los Angeles Spoken Chinese Corpus (LLSCC) and the Callhome Mandarin Chinese Transcripts - XML Version are among the few that exist.

3.1. The Lancaster Los Angeles Spoken Chinese Corpus

The Lancaster Los Angeles Spoken Chinese Corpus (LLSCC) is a corpus of spoken Mandarin that has been developed at Lancaster in collaboration with the University of California at Los Angeles (UCLA). It is composed of one million words of both spontaneous (57%) and scripted (43%) speech, in 73,976 sentences and 49,670 utterance units (paragraphs). Seven genres are covered in this corpus: face-to-face conversation, telephone conversation, play/movie scripts, TV talk show transcripts, transcripts of formal debates on various topics, spontaneous oral narrative, and edited oral narrative, as indicated in Table 4 (Xiao & Tao 2006).

Table 4. Lancaster Los Angeles Spoken Chinese Corpus

Face-to-face conversations
Telephone conversations
Play & movie transcripts
TV talk show transcripts
Public debate transcripts
Oral narratives
Edited oral narratives

The corpus is encoded in Unicode and marked up in XML, a combination that represents the current standard of corpus construction. Each corpus file is composed of a corpus header and a text body. The header gives general information about the corpus file. In the body, utterance units, sentences and tokens are marked up, with each token also annotated for part of speech.

The LLSCC corpus can be used in combination with LCMC to compare written and spoken registers of Mandarin and to support research into intralingual variation. Because of copyright restrictions, the corpus is currently only available for in-house use by the Lancaster and UCLA teams.

3.2. Callhome Mandarin Chinese Transcripts - XML Version

The Callhome Mandarin Chinese Transcripts corpus was originally released by the Linguistic Data Consortium (LDC) in 1996. It comprises contiguous 5-to-10-minute segments taken from 120 unscripted telephone conversations between native speakers of Mandarin Chinese, totalling approximately 300,000 words. In our research the corpus was grammatically analysed and marked up in XML, and it was republished by the Linguistic Data Consortium as the CALLHOME Mandarin Chinese Transcripts - XML Version (catalogue number LDC2008T17, McEnery & Xiao 2008).

4. Specialised corpora

In addition to the balanced corpora of written and spoken Chinese that cover a wide range of genres and registers for lexical and grammatical studies of the Chinese language, the Lancaster team has also produced some specialised corpora, including domain-specific and genre-specific corpora of Chinese as well as bilingual parallel corpora of English and Chinese.

4.1. The PDC2000 corpus of Chinese News Text

PDC2000 is a genre-specific corpus created using one year's (2000) data provided by the People's Daily Press, Beijing. The corpus contains approximately fifteen million word tokens. It is encoded in Unicode (UTF-8) and marked up in XML. There are 366 files in the corpus, one for each day, each marked up for month and date. Each corpus file consists of a corpus header and the corpus text proper. The corpus header applies the ELDA (Evaluations and Language Resources Distribution Agency) Metadata Scheme, version 1.40. The corpus text is marked up for paragraphs, sentences and tokens. Sentences are numbered consecutively within each file, while tokens are annotated for part of speech using the Peking University tagset.

4.2. The Babel English-Chinese Parallel Corpus

The Babel English-Chinese Parallel Corpus consists of 327 English articles and their translations in Mandarin Chinese. Of these, 115 texts (121,493 English tokens plus 135,493 Chinese tokens) were collected from the World of English between October 2000 and February 2001, while the remaining 212 texts (132,140 English tokens plus 151,969 Chinese tokens) were collected from Time from September 2000 to January 2001. The corpus contains a total of 541,095 tokens (253,633 English tokens and 287,462 Chinese tokens).

The corpus is tagged with part-of-speech information and aligned at the sentence level. The English texts were tagged using the CLAWS C7 tagset, while the Chinese texts were tagged using the Peking University tagset. Sentence alignment was done automatically and then corrected by hand. The corpus is also marked up for paragraphs and sentences.
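The automatic alignment step can be sketched with a length-based dynamic programme in the spirit of Gale & Church's classic method. This toy version supports 1:1, 1:0, 0:1, 2:1 and 1:2 beads; its cost constants are invented for illustration and are not those of the Babel pipeline.

```python
# Toy length-based sentence aligner in the spirit of Gale & Church (1993).
# Constants and costs are illustrative only, not Babel's actual tool.

RATIO = 1.1              # assumed mean target/source length ratio (invented)
SKIP, MERGE = 6.0, 2.0   # penalties for 1:0/0:1 beads and for 2:1/1:2 beads

def bead_cost(src_len: int, tgt_len: int) -> float:
    """Penalise mismatch between expected and observed target length."""
    return abs(src_len * RATIO - tgt_len) / 10.0

def align(src: list[str], tgt: list[str]) -> list[tuple]:
    """Return beads as (source-index span, target-index span) pairs."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            moves = []
            if i < n and j < m:          # 1:1 bead
                moves.append((i + 1, j + 1, bead_cost(len(src[i]), len(tgt[j]))))
            if i < n:                    # 1:0 bead (left untranslated)
                moves.append((i + 1, j, SKIP))
            if j < m:                    # 0:1 bead (added in translation)
                moves.append((i, j + 1, SKIP))
            if i + 1 < n and j < m:      # 2:1 bead
                moves.append((i + 2, j + 1, MERGE + bead_cost(
                    len(src[i]) + len(src[i + 1]), len(tgt[j]))))
            if i < n and j + 1 < m:      # 1:2 bead
                moves.append((i + 1, j + 2, MERGE + bead_cost(
                    len(src[i]), len(tgt[j]) + len(tgt[j + 1]))))
            for ni, nj, c in moves:
                if cost[i][j] + c < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + c
                    back[ni][nj] = (i, j)
    beads, i, j = [], n, m               # trace the best path back
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((tuple(range(pi, i)), tuple(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]

src = ["Corpora have grown larger.", "They are marked up in XML."]
tgt = ["语料库的规模越来越大。", "它们用XML标注。"]
print(align(src, tgt))   # two 1:1 beads: [((0,), (0,)), ((1,), (1,))]
```

Real aligners refine this with probabilistic length models and lexical cues, which is why the automatic output still needed hand correction.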

The Babel parallel corpus is publicly available via its online parallel concordancer, which allows users to search either the English or the Chinese texts. The search engine returns matched whole sentences and their translations, as well as their locations in the corpus indicated by sentence numbers. Users can also specify the format of the output concordances as POS-tagged or plain text.
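The behaviour of such a parallel concordancer can be sketched over a sentence-aligned bitext; the two sentence pairs below are invented examples, not Babel data.

```python
# A minimal parallel concordance lookup: search one side of a
# sentence-aligned bitext and return the matching sentence, its
# translation, and its sentence number. The bitext is invented.

bitext = [
    ("Corpora have grown larger and larger.", "语料库的规模越来越大。"),
    ("The corpus is aligned at the sentence level.", "该语料库在句子层面对齐。"),
]

def concordance(query: str, side: int = 0) -> list[tuple[int, str, str]]:
    """Return (sentence number, match, translation) for each hit;
    side=0 searches the English text, side=1 the Chinese text."""
    hits = []
    for n, pair in enumerate(bitext, start=1):
        if query in pair[side]:
            hits.append((n, pair[side], pair[1 - side]))
    return hits

print(concordance("aligned"))         # search the English side
print(concordance("语料库", side=1))   # search the Chinese side
```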

While this briefing paper has focused on presenting the various Chinese corpus resources created at Lancaster University, it is important to note that Lancaster is not only active in corpus development but is also a world-leading research centre for corpus-based studies of Chinese language and culture, as well as contrastive and translation studies of English and Chinese (see Major works on Chinese corpus linguistic research at Lancaster).

*** *** ***

Major works on Chinese corpus linguistic research at Lancaster

2012. Ying Han Fanyi zhong de Hanyu Yiwen Yuliaoku Yanjiu. Shanghai: Shanghai Jiao Tong University Press.

2010. Corpus-based Contrastive Studies of English and Chinese. London and New York: Routledge.

2010. Using Corpora in Contrastive and Translation Studies. Newcastle: Cambridge Scholars Publishing.

2009. A Frequency Dictionary of Mandarin Chinese: Core vocabulary for learners. London and New York: Routledge.

2006. Corpus-based Language Studies: An advanced resource book. London and New York: Routledge.

2004. Aspect in Mandarin Chinese: A corpus-based study. Amsterdam and Philadelphia: John Benjamins.

Forthcoming. Corpus-based Studies of Translational Chinese in English-Chinese Translation. Berlin: Springer.

Forthcoming. Corpus-based Contrastive and Translation Studies of English and Chinese. Special issue of Corpus Linguistics and Linguistic Theory.


McEnery, T. & R. Xiao (2004) The Lancaster Corpus of Mandarin Chinese: A corpus for monolingual and contrastive language study. In M. Lino, M. Xavier, F. Ferreire, R. Costa, R. Silva (eds.) Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC) 2004, 1175-1178. Lisbon, May 24-30, 2004.

McEnery, T. & R. Xiao (2008) CALLHOME Mandarin Chinese Transcripts - XML Version. Pennsylvania: Linguistic Data Consortium.

McEnery, T., R. Xiao & Y. Tono (2006) Corpus-based Language Studies. London and New York: Routledge.

Tao, H. & R. Xiao (2012) The UCLA Written Chinese Corpus. Lancaster: UCREL, Lancaster University.

Xiao, R. (2010) How different is translated Chinese from native Chinese? International Journal of Corpus Linguistics 15(1): 5-35.

Xiao, R., P. Rayson & T. McEnery (2009) A Frequency Dictionary of Mandarin Chinese: Core vocabulary for learners. London and New York: Routledge.

Xiao, R. & H. Tao. (2006) The Lancaster Los Angeles Spoken Chinese Corpus. Lancaster: UCREL, Lancaster University.

Zhang, H. & Q. Liu (2002) Model of Chinese words rough segmentation based on N-shortest-paths method. Journal of Chinese Information Processing 16(5): 1-7.

Zhang, H., Q. Liu, H. Zhang & X. Cheng (2002) Automatic recognition of Chinese unknown words based on role tagging. In SIGHAN '02 Proceedings of the First SIGHAN Workshop on Chinese Language Processing, 71-77. Taipei: SIGHAN.