Understanding Malay Corpora: A Content Analysis of 15 Malay Corpora


  • Jowati Juhary, Language Centre, National Defence University of Malaysia, Malaysia
  • Erda Wati Bakar, Language Centre, National Defence University of Malaysia, Malaysia
  • Mardziah Shamsudin, Language Centre, National Defence University of Malaysia, Malaysia
  • Asniah Alias, Language Centre, National Defence University of Malaysia, Malaysia




Keywords: corpus, Malay corpora, multimodal corpora, concordance


Corpus research has recently become an important area of study, especially in Malaysia and for its national language, Malay. A corpus comprises texts and transcriptions of speech drawn from a variety of situations. This short paper focuses on the Malay language, the national and official language of Malaysia. Its purposes are to identify the features and types of Malay corpora and to determine the need for a military-oriented Malay corpus. As this is a short paper, the methodology is limited to content analysis of relevant documents on the development of Malay language corpora. Preliminary findings suggest that at least 15 Malay corpora exist and that some of their features overlap. Further, the researchers argue that a Malay Corpus for Military Operations is needed, since the existing corpora do not fully cater for this domain.








How to Cite

Juhary, J., Bakar, E. W., Shamsudin, M., & Alias, A. (2021). Understanding Malay Corpora: A Content Analysis of 15 Malay Corpora. Journal of Advances in Linguistics, 12, 18–26. https://doi.org/10.24297/jal.v12i.9122