Getting started: using Xara to explore the corpus

It is easy to explore the LCMC corpus using Xara. We have built two separate servers, one for the character version and one for the Pinyin version of the corpus. The default server is set for the standard character version. To select this server, start the Xara client by running Sara32.exe from the Xara folder and simply press the OK button (Figure 1). To select the server for the Pinyin version, choose “LCMC_Pinyin” from the menu and confirm by pressing OK (Figure 2):

Figure 1 Figure 2

All of the following operations are the same for the two versions, with one difference: the search string is typed in characters or in Pinyin symbols, depending on the version of the LCMC you are using. In this section, we use the default standard version for demonstration. Once a server has been selected successfully, the client shows a list of the corpus files available on that server (Figure 3).

Figure 3 Figure 4

There are several ways to explore the corpus with Xara. The simplest is to type a search word in the Quick query text box and press the Enter key. For more complex queries, we can use Query builder. Suppose we want to extract all instances of verb-final 了 -le (tagged as u) immediately followed (link type Next) by a noun (tagged as n) in sentence number 0010 of each of the 500 sample files in the 15 text categories. First, define the scope node (the left node in Query builder, which indicates the context to search in) as “0010” using the s element (Figure 4). In the query node (the right node in Query builder), select AddKey (POS) to define the first part of the query as 了 with the POS tag u, and the second part as Any with the POS tag n. Then define the link type as Next (Figure 5). The search result is shown in Figure 6.

Figure 5 Figure 6
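For readers who want to check what this query matches outside Xara, the same logic can be sketched in Python. This is only an illustration under stated assumptions, not how Xara evaluates queries: the markup (an s element with an n attribute for the sentence number, w elements with a POS attribute for words, c elements for punctuation) is assumed, and the sample sentences and the function name find_le_noun are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample text. The element and attribute names (<s n="...">,
# <w POS="...">, <c POS="...">) are assumptions about the corpus markup.
SAMPLE = """<text>
<s n="0009"><w POS="v">买</w><w POS="u">了</w><w POS="n">书</w></s>
<s n="0010"><w POS="r">他</w><w POS="v">吃</w><w POS="u">了</w><w POS="n">饭</w><c POS="ew">。</c></s>
</text>"""


def find_le_noun(xml_text, sentence_id="0010"):
    """Return (sentence number, 了, noun) for each 了/u directly followed by a noun."""
    root = ET.fromstring(xml_text)
    hits = []
    for s in root.iter("s"):
        # Scope: only the sentence with the requested number.
        if s.get("n") != sentence_id:
            continue
        tokens = list(s)  # direct children of the sentence, in order
        # "Next" link: compare each token with the one immediately after it.
        for first, second in zip(tokens, tokens[1:]):
            is_le = first.tag == "w" and first.text == "了" and first.get("POS") == "u"
            is_noun = second.tag == "w" and second.get("POS") == "n"
            if is_le and is_noun:
                hits.append((s.get("n"), first.text, second.text))
    return hits


matches = find_le_noun(SAMPLE)
print(matches)
```

The scope test corresponds to the scope node, the two token tests to the two parts of the query node, and the pairing of adjacent tokens to the link type Next.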

The upper part of the concordance window gives the query text (select Query – Query text from the main menu to display it), while the lower part displays the concordances. The status bar of the concordance window shows the name of the corpus, the partition or subcorpus (in this case Null, as we have not defined a partition), the current position of the pointer (i.e. concordance number 1), the total number of concordances (i.e. 25), the number of files in which the query is matched (i.e. 10), the name of the file in which the current concordance occurs (i.e. LCMC_A), and the file and sentence number of the current concordance (i.e. File A04 and sentence number sn0010). As we searched only in sentence number 0010 (in the 500 sample files), this should be the sentence number for all of the concordances.

Compared with many other corpus tools, one advantage of Xara is that it displays complete sentences while also centering the search query. Users can display concordances in page mode (giving more context) or line mode (i.e. KWIC, as shown in Figure 6), and in XML or plain text. Additionally, users can define their own style sheet to display selected XML elements. Xara can also compute significant collocates automatically, using a statistic that the user selects from those available.

Note: To explore the corpus using WordSmith version 4, load the concordancer with the corpus files in the folder \LCMC\character\texts or \LCMC\Pinyin\texts, and convert the encoding from UTF-8 to Unicode. Before you do so, make sure that you have made a copy of your data, as the original texts will be replaced!
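
One way to avoid overwriting the originals is to write the converted files to a separate folder. The Python sketch below does this under two assumptions that are not stated in this guide: that “Unicode” here means UTF-16, and that a plain re-encoding of each file is sufficient for WordSmith. The function name, folder names and sample file are hypothetical and serve only as a demonstration.

```python
import tempfile
from pathlib import Path


def convert_copy_to_utf16(src_dir, dst_dir):
    """Write a UTF-16 copy of every UTF-8 file in src_dir into dst_dir.

    The files in src_dir are only read, never modified.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.iterdir()):
        if path.is_file():
            text = path.read_text(encoding="utf-8")
            target = dst / path.name
            # Python's "utf-16" codec writes a byte order mark first.
            target.write_text(text, encoding="utf-16")
            written.append(target.name)
    return written


# Demonstration on a throwaway folder; replace the paths with your own.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "texts"
    source.mkdir()
    (source / "sample.xml").write_text('<s n="0010">他吃了饭</s>', encoding="utf-8")

    written = convert_copy_to_utf16(source, Path(tmp) / "texts_utf16")

    # The original is still readable as UTF-8; the copy decodes as UTF-16.
    original_after = (source / "sample.xml").read_text(encoding="utf-8")
    converted = (Path(tmp) / "texts_utf16" / "sample.xml").read_text(encoding="utf-16")
```

Because the converted files go to a new folder, the source folder can still be used with Xara afterwards.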