Understanding Malay Corpora: A Content Analysis of 15 Malay Corpora

In recent years, corpus research has grown in importance, particularly in Malaysia and for the Malay language, the country’s official language. A corpus is made up of texts and transcriptions of talks from a range of settings. The Malay language, which is Malaysia’s native and official language, is the focus of this short paper. The objectives of this paper are to identify the features and types of Malay Corpora, as well as the need for a military-oriented Malay Corpus. The methodology consists solely of content analysis of pertinent texts on the establishment of Malay language corpora. Preliminary findings suggest that there are at least 15 Malay corpora in existence and that some of the features in these corpora overlap. Further, the researchers argue for the need for a Malay Corpus for Military Operations, since the existing corpora do not fully cater for this domain.


Introduction
The Malay language is part of the Austronesian language family, which also includes Malagasy, Tagalog, and Pilipino (O'Grady & Archibald, 2000). The language is spoken in Malaysia, Brunei, Indonesia, Singapore, and southern Thailand. The written form of the Malay language in the past was Jawi, an adapted Arabic script; however, Malaysia, Indonesia, and Brunei collaborated to "produce a more uniform form of the language utilising Roman alphabet" (Tan et al., 2009).
The Malay language has been the official and national language of Malaysia since its independence in 1957. The status and functions of the language are explicitly stated in Article 152 of the Malaysian Constitution. Despite this, allegedly because of its restrictions in syntax, morphology, pragmatics, and vocabulary, the Malay language has not been used in all domains of expertise, academia, or research. Some researchers argue that because of colonial influence, English, the country's second language, has become more prominent in a variety of fields of study. Although Malay is a dominant language in the region, Knowles and Zuraidah (2006) asserted that it is "one of the least known to current linguists in the western world" and maybe "the least regulated." When establishing a corpus, the choice of corpus is crucial not just for corpus builders but also for corpus users, because the range of questions that can be investigated is determined by the composition of the corpus. This study serves as a foundation for understanding the existing literature on the subject, as part of a larger initiative that will eventually establish and develop a Malay Corpus for Military Operations. Malay is the tenth most widely spoken language worldwide. Because there are few digital resources on the Malay language, many studies on language and corpora focus on the English language (Nasiroh Omar et al., 2017). Therefore, this paper attempts to understand and identify what constitutes Malay Corpora.
To address the objectives and research questions, this paper is divided into four main sections, including this introduction, which consists of a short discussion on the methodology adopted in this paper and its research questions. The main findings of this paper are presented in the second section, where selected reports and past research on Malay corpora are examined and discussed. The third section answers the research questions posed, and then it provides some directions for the next course of action on completing the bigger research on a Malay Corpus for Military Operations. A brief conclusion closes this paper.
As this is a short paper, content analysis is used as the primary research approach. In the first stage, the researchers searched for documents on Malay Corpora written by local and international scholars, using only Google Scholar. Next, the main findings were identified to facilitate discussion and to answer the research questions at the end of this paper.
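Two research questions will be answered: (1) What are the types and features of Malay Corpora? (2) Why is there a need to develop a Malay Corpus for Military Operations in Malaysia?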

Understanding the Debates on Malay Corpora
This section is divided into three sub-sections related to Malay Corpora. These cover the corpus as a concept, linguistics, and Malay Corpora.

What is a Corpus?
A corpus is a collection of naturally occurring language text selected to represent a state or variant of a language (Sinclair, 1991). A corpus, as defined by Bjorkenstam (2013), is a collection of natural language that contains not only texts but also transcriptions of talks or signs. While most available corpora are text only, a rising number of multimodal corpora, such as sign language corpora, are becoming available. A multimodal corpus is "a computer-based collection of language and communication-related material drawing on more than one sensory modality or more than one production modality" (Allwood, 2008), where sensory modalities include sight, hearing, touch, smell or taste, and production modalities include, for example, speech, signs, eye gaze, body posture, and gestures. A multimodal corpus is therefore a collection of video and/or audio recordings of people communicating in a variety of scenarios and contexts.
However, not every collection of audio and video can be labelled a corpus. Bjorkenstam (2013) further claimed that, first, audio-visual content must be selected and filtered, and metadata should be used to characterise the information. Second, the content should be analysed and reported consistently, with transcriptions and notes. A corpus, in theory, is a collection of language production samples chosen to be representative of a language (or sub-language) rather than a set of data collected at random. The sampling of the corpus determines how representative the corpus is for a given research subject. To generate a general corpus, language samples from men and women of all ages and from the various regions where the language is spoken must be collected and included.

Linguistics
Linguistics is the scientific study of language, in which language elements including syntax, morphology, pragmatics, semantics, and vocabulary are observed. Linguistics researchers examine all these aspects to see how the language is organised by the original speaker or the individual who makes the typical and systematic utterances (Shahidi A. Hamid, Kartini Abd. Wahab & Sa'adiah Ma'alip, 2018). The way the language is arranged can be seen through corpus data. What is meant by corpus data? Corpus data are a collection of language or text data. Language data can be oral, written, or both; they are saved on a computer and function as the language sample for linguistics research. There are three important aspects in explaining corpus data: the data are authentic, electronic, and large in volume (see Sinclair & Renouf, 1988; Francis, 1992; Kennedy, 1998; Tognini-Bonelli, 2001).
The term 'authentic' refers to data that are acquired from actual communication between people. These data are crucial in linguistic study because they show how a language behaves when it is used. Corpus data used as a sample allow researchers to gain a true picture of a language in order to create a language model or formula. Corpus data are also kept in an electronic format, although they are more often described as 'machine-readable' than as 'electronic'. This means that the machine, that is, the computer, handles most of the processing of the corpus data. The use of computers has also increased the amount of data that can be stored. For instance, the Brown Corpus was the first corpus produced in machine-readable form. It contains about one million words collected from 500 texts, each of about 2,000 words (Garside, Leech & McEnery, 1997).
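The size quoted for the Brown Corpus follows directly from its sampling design. The short sketch below only restates that arithmetic; the function name `corpus_size` is a hypothetical helper introduced here for illustration, and the figures are the approximate ones cited in the text.

```python
# Illustrative arithmetic only, using the approximate figures quoted in the text.
NUM_TEXTS = 500          # number of text samples in the Brown Corpus
WORDS_PER_TEXT = 2_000   # approximate length of each sample, in words


def corpus_size(num_texts: int, words_per_text: int) -> int:
    """Return the approximate word count of a corpus built from equal-length samples."""
    return num_texts * words_per_text


brown_total = corpus_size(NUM_TEXTS, WORDS_PER_TEXT)
print(f"{brown_total:,} words")
```

The same helper can be reused to estimate how many fixed-length samples a planned corpus of a given target size would require.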
A few aspects need to be taken into consideration when developing a corpus (Biber, 1993): a. the selection of texts; b. the sample size; and c. the size of corpus data that needs to be developed.
With these three considerations, researchers and scholars are reminded that text selections must include samples that are comprehensive in terms of the gender, locality, and scenarios of the interlocutors. Further, the sample size can be determined according to the research questions and objectives. No single study can cover all language samples. Finally, the size of the corpus data also depends on the research questions and objectives explained earlier. Developing a corpus is done in stages because language is dynamic; it keeps growing as speakers keep using the language and as the transfer or borrowing of words continues to occur.
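One way to picture the text-selection consideration is stratified sampling: candidate texts are grouped by the speaker attributes mentioned above, and the same number of texts is drawn from every group. The following is a minimal sketch under stated assumptions, not a method described in the sources cited here. The metadata fields (`gender`, `locality`, `scenario`), their values, and the function `stratified_sample` are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records, keys, per_stratum, seed=0):
    """Draw up to `per_stratum` records from every combination of the `keys` values."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    strata = defaultdict(list)
    for record in records:
        strata[tuple(record[k] for k in keys)].append(record)
    sample = []
    for _, members in sorted(strata.items()):
        rng.shuffle(members)
        sample.extend(members[:per_stratum])
    return sample


# Hypothetical candidate texts: three per combination of the three attributes.
combos = [
    (g, loc, s)
    for g in ("female", "male")
    for loc in ("north", "south")
    for s in ("briefing", "interview")
    for _ in range(3)
]
candidates = [
    {"id": i, "gender": g, "locality": loc, "scenario": s}
    for i, (g, loc, s) in enumerate(combos)
]

sample = stratified_sample(candidates, ("gender", "locality", "scenario"), per_stratum=2)
print(len(sample), "texts selected from", len(candidates), "candidates")
```

Drawing a fixed number per stratum keeps every group visible in the corpus; a proportional design would instead scale each draw to the size of the group in the population.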

Malay Corpora
This sub-section begins with a historical account of Malay Corpora in Malaysia. According to Hajar Abdul Rahim (2014), with the expansion of Malay Corpora and its role as a Malay language resource centre, Dewan Bahasa dan Pustaka (DBP) has expanded corpus-based research, not only for systematic analysis and description of Malay linguistics but also for Malay language pedagogy, in collaboration with corpus linguists from within and outside the country. She went on to claim that the DBP Corpus database has produced new descriptions of the Malay language, which have influenced the content, material creation, and syllabus design for Malay language teaching and learning in the country. According to Hajar Abdul Rahim (2014), the development of other types of corpora, such as Malay language learner and pedagogic corpora, will allow research in the Malay language to expand beyond language description, translation, and lexicography to include Malay language learning and teaching issues, which are currently underserved.
Hajar Abdul Rahim (2014) reported that up until 2008, the DBP Corpus database comprised 128 million words compiled in 10 sub-corpora representing different genres of texts. These include books, magazines, newspapers, translations, ephemerals, drama, poems, material cards, traditional texts, and school textbooks. By 2017, the number of words in the DBP Corpus database had increased to 135 million; the sub-corpora and the number of words for each sub-corpus are illustrated in Table 1. One of the outputs of the DBP Corpus database, according to some, is the release of a dictionary, which is regarded as an important success in Malaysian corpus-based Malay lexicography. Other areas of pedagogy, however, can also benefit from these corpora. Figure 1 depicts Johansson's (2009) diagrammatic description of the use of corpora in language teaching and learning, which shows how corpora can fully contribute to language pedagogy. (See Table 3.)
In concluding this section, the researchers opine that Malay Corpora are indeed rich and that DBP, as the agency responsible for ensuring effective use of the Malay language in Malaysia, has a massive task ahead. The main questions that remain are whether these corpora are sufficient and, if more corpora should be developed, what the best strategy is to ensure that missing words are avoided and that all corpora can be integrated.

Results and Discussions
This section focuses on answering the research questions posed earlier in this paper. At the same time, before closing this section, future directions of Malay Corpora will also be discussed.

Research Question 1: What are the types and features of Malay Corpora?
Based on the literature discussed earlier and on the DBP Corpus database (see Figure 2) and the SEAlang Library Malay Corpus database (see Figure 3), both of which are available online, it can be summarised that Malay Corpora have been extensively developed and will continue to be expanded due to social, economic, and political shifts in Malaysia. For example, it is estimated that SEAlang contains about 2.5 million words collected from the web using a crawler (Chung et al., 2019). Given the above analysis, at least at this stage, there are nine types of Malay Corpora documented (see Tables 2 and 3). The researchers argue that there are other Malay Corpora which might not have been properly documented, and some may be sub-types of the existing Malay Corpus types.
Apart from the DBP Corpus database, Siti Syakirah Sazali, Nurazzah Abdul Rahman and Zainab Abu Bakar (in press) argued that other corpora that are not annotated are available, such as those from the Institute of Language and Literature. These corpora cover multiple domains such as newspaper excerpts, magazines, novels and many more. Another corpus that is also not annotated is Mutiara Hadith UiTM, which makes translated Quran and Hadith documents publicly available (see Table 4). In addition, there are existing annotated corpora covering terrorism-related texts, news and biomedical articles, and Twitter excerpts; however, these are not publicly available because they were built as experiments in natural language processing. Therefore, based on Table 4, it can be deduced that some corpora are annotated and some are not. At the same time, some corpora are not publicly available, which makes it even more challenging to analyse and understand the existing Malay Corpora. Regardless of this, there are a total of 15 Malay Corpora, of which four are not available to the public and two are not annotated (see Tables 2, 3 and 4).
Further, the features of Malay Corpora include their identified contexts, which are general and educational purposes, and their applications, which range from providing meanings (dictionary) to generating concordances and sentence analysis. The researchers argue that the contexts could be further categorised into sub-contexts, such as research under educational purposes, and politics, economy and society under general purposes. This further sub-categorisation allows for more focused and directed work on, and understanding of, Malay Corpora.
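To make the concordance application concrete, the sketch below builds a simple keyword-in-context (KWIC) listing. It is an illustration under stated assumptions and not the interface of the DBP or SEAlang tools: the function `concordance` is hypothetical, the two Malay sentences are invented for the example, and tokenisation is reduced to lower-casing and splitting on non-word characters, so context windows may cross sentence boundaries.

```python
import re


def concordance(text, keyword, width=3):
    """Return (left context, keyword, right context) for every hit of `keyword`."""
    tokens = re.findall(r"\w+", text.lower())  # naive tokenisation, punctuation dropped
    target = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Invented sample text, for illustration only.
sample_text = (
    "Bahasa Melayu ialah bahasa kebangsaan. "
    "Korpus bahasa Melayu terus berkembang."
)

hits = concordance(sample_text, "bahasa", width=2)
for left, key, right in hits:
    print(f"{left:>20} [{key}] {right}")
```

A production concordancer would normally add sentence segmentation, handling of Malay affixation, and an index over the corpus so that each query does not rescan all of the text.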

Research Question 2: Why is there a need to develop a Malay Corpus for Military Operations in Malaysia?
The second research question appears easy to answer but poses a tricky scenario for linguists. Based on Figure 2, a domain for Police and Military exists in the DBP Corpus database, but not one for Military Operations per se. As argued earlier, the contexts could be extended to include more contexts or sub-contexts to reflect the dynamics of the Malay language itself. At this stage, the researchers can only argue for the significance of developing a Malay Corpus for Military Operations. There are at least two reasons for developing the corpus in the context or sub-context of defence and security. Firstly, as unknown challenges and threats in the world today increase, there is an urgent need to conduct relevant research in this area. The reports of this research must also be written in the Malay language to ensure that Malaysians are kept up to date with current defence and security issues. This is where a corpus for Military Operations becomes significant.
Secondly, with the advent of Industrial Revolution 4.0 (IR4.0), the defence and security industry needs a substantive corpus in the Malay language to enrich the understanding and strengthen the knowledge of Malaysians and of international scholars interested in the Malay language. Most reports and documents on scientific findings are written in English, which reinforces the assumption that the Malay language is a sub-language. The Malay language is rich in terms and vocabulary, and with a corpus targeting Military Operations, areas beyond research and the provision of meanings, such as the pedagogy of military training, could also use the Malay language effectively.

What Next?
Much is yet to be done in terms of strengthening

Conclusions
Based on the discussions in this paper, it is evident that there are at least 15 Malay Corpora. Of these, four are not made public for reasons unknown to the researchers, and two are not annotated. The researchers argue that there is no harm in developing a new corpus; doing so will only enrich the language, because speakers, users and researchers alike benefit from a variety of Malay Corpora. In addition, the researchers argue that features of Malay Corpora such as contexts could be further categorised into sub-contexts. The other feature, the applications of the corpora, has been beneficial and significant, especially in developing dictionaries.
To conclude, the researchers opine that there is a need to develop the Malay Corpus for Military Operations for the use of the country. Based on the discussions and analysis of the existing corpora, it is evident that there is no specific corpus that could enhance the understanding of military terms in the Malay language, especially one that focuses on military operations. It is high time that this Malay Corpus for Military Operations is developed, since Malaysia has extensive experience in international military operations, such as those in Somalia and Lebanon under the United Nations. These experiences must be documented in the Malay language, both because it is the national language and for future reference, locally and internationally.

Conflicts of Interest
There is no conflict of interest.

Funding Statement
The short paper is funded by the National Defence University of Malaysia, under the short-term research grant, UPNM/2020/GPJP/SSI/5.