  How Corpus Presenter works

  Loading files for searches
  How to search through texts
  How files are displayed
  Making word lists
  Non-West European languages
  Working with databases
  Help in Corpus Presenter
  Range of text searches
  Types of returns from searches
  Removing Corpus Presenter
  What to do with returns
  Generating charts from returns


Corpus Presenter works by loading files, displaying them and allowing users to search for text strings. It arranges retrieval returns so that users can evaluate these and then copy them to disk or the Windows clipboard for later processing.

The essential element in presenting a corpus (a set of files) in a hierarchical manner, like a tree in Windows Explorer, is called a data set file and consists of a list of the files of a corpus along with information about how these are to be displayed in tree form. Each corpus supplied with Corpus Presenter has its own data set file. A data set can be created and edited with the supplied utility Corpus Presenter Make Tree.

It is not necessary to construct a data set file each time you wish to examine texts with Corpus Presenter. You can simply load texts from your hard disk and start searching straight away. This is particularly useful if you just want to comb through some texts but you do not want to go to the trouble of arranging them as a structured corpus. For more information on loading files directly, consult the section Getting Started on this website.

Corpus Presenter comes with a test corpus whose data set file, SampleShortCorpus.cpd, you can load. The full version of Corpus Presenter includes a slightly larger test corpus with the data set file Test_CP.cpd, while the supplied A Corpus of Irish English uses the data set file CIE.cpd. You can have as many data set files as you like. You can even have more than one for a single corpus and thus present the corpus in different, visually effective ways.

Structure of a CPD file

Loading files for searches


When you start working you must decide if you wish (i) to use a corpus which you already have (the supplied test corpora or one or more of your own) or (ii) to load your own files, perhaps by generating a data set file before this. Check the Flowchart of options on starting Corpus Presenter to get a visual demonstration of the startup options. Remember that the type of file you are dealing with is important. The four main types found in corpus processing are shown in the following table. Specific corpora may use their own customised extensions. Occasionally, the files of a corpus do not have any extension at all, as with the Helsinki Corpus of English Texts.

File type (extension) Full name Characteristics
.TXT Text Plain text, no formatting, yields very quick retrieval and can be transported between different types of software and computers.
.RTF Rich text format Specially encoded text which allows the specification of formatting features such as bold, italics, page layout, etc.
.HTML Hypertext markup language Text encoded according to the standard used in the internet. It can easily combine text with images, graphs and tables. Not very suitable for large texts.
.XML Extensible markup language A type of text encoding in which the formatting is determined individually. This makes it very flexible: for instance, the compilers of a corpus can use XML to encode their texts in a user-specified manner. See also: What is XML?

How files are displayed


  

Type Explanation Advantages
Tree The files are arranged as branches of a tree much as in Windows Explorer. You can look for files within a section of a corpus, e.g. in the Helsinki Corpus of English Texts you could search through just one period, say, the later Middle English period.
List The files are shown as a single list in which each item has a small check box to the left of it. You select files by ticking this box. You can search through a number of files, irrespective of where they are located in a corpus tree, by checking them. For instance, in A Corpus of Irish English you could search through the novel Castle Rackrent and the plays of Sean O’Casey (although they are in different branches of the corpus tree) by ticking these in a list display.

Use F11 to toggle display on the main level, or choose Display, then Tree or list display in the menus at the top of the screen.

Bear in mind that you cannot search through PDF (Portable Document Format) files. These are intended only to be read, not processed by users.



Non-West European languages


Most work on corpora is done on a West European language, usually English. But users may of course have corpora in Slavic languages, Greek, Hebrew, Arabic or other languages from further afield. Thanks to the Unicode system of character encoding, it is possible to present and process text in such languages. When using Corpus Presenter, one setting of the program has to be changed to ensure that files which use Unicode encoding are shown and searched through properly. In the Display menu you will find an option Quick text display and retrieval. Here you must untick the option Use fast retrieval mode. The reason is that, for the sake of speed, Corpus Presenter does *not* use Unicode encoding internally by default. This encoding is required for non-West European languages, however, and unticking the option switches it on.
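The following Python sketch illustrates why a fast, non-Unicode mode cannot cope with such texts. It does not show how Corpus Presenter works internally: the use of Latin-1 as a stand-in for a single-byte internal encoding, the function name fits_single_byte and the sample words are all assumptions made for the illustration.

```python
def fits_single_byte(text: str, encoding: str = "latin-1") -> bool:
    """Return True if every character of `text` can be stored in one byte.

    Latin-1 is used here only as an assumed example of a single-byte
    West European encoding.
    """
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical sample words in four languages.
samples = {
    "English": "corpus",
    "German": "Wörterbuch",
    "Greek": "λόγος",
    "Russian": "слово",
}

# West European words fit; Greek and Cyrillic words do not.
results = {name: fits_single_byte(word) for name, word in samples.items()}
```

Words in West European languages can be stored with one byte per character, whereas Greek or Cyrillic words cannot, which is why a Unicode-aware mode is needed for them.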


How to search through texts


Type Features
Find string in files Fast search with a simple string
Basic search A more flexible search routine which allows the use of wildcards (* and ?) as well as a list of input forms (useful for variant spellings or related morphological forms)
Advanced search This is the most sophisticated type of search, allowing for syntactic frames, input lists for Word1 and Word2 in a frame. Furthermore, you can specify if a sub-word string is to be found at the beginning or end of a word, or indeed if it represents an entire word.
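The distinction between a sub-word string at the beginning of a word, at the end of a word, or as an entire word can be sketched with regular expressions in Python. This is only an illustration of the idea, not the search routine used by Corpus Presenter; the function name build_pattern, the position labels and the sample sentence are assumptions.

```python
import re


def build_pattern(fragment: str, position: str) -> re.Pattern:
    """Compile a pattern that finds `fragment` at a given position in a word."""
    escaped = re.escape(fragment)
    if position == "start":      # fragment begins a word
        regex = rf"\b{escaped}\w*"
    elif position == "end":      # fragment ends a word
        regex = rf"\w*{escaped}\b"
    elif position == "whole":    # fragment is an entire word
        regex = rf"\b{escaped}\b"
    else:
        raise ValueError(f"unknown position: {position}")
    return re.compile(regex, re.IGNORECASE)


# Hypothetical sample text.
text = "The understanding stood under the thunder."

at_start = build_pattern("under", "start").findall(text)
at_end = build_pattern("under", "end").findall(text)
as_whole = build_pattern("under", "whole").findall(text)
```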

  

    Range for text searches


Search type Explanation Restrictions
From first file Start searching from the first file in a tree or list of files None
Branch only Start searching from the top of a branch and include all files below it. For this to work you must click on the top node of a branch in a tree on the main level. Only works with files displayed in tree form, i.e. only with a corpus data set such as SimpleSampleCorpus.cpd or Helsinki.cpd
Just current text Only the current selected text is searched. If working with a corpus in tree form, make sure you have selected a text at the end of a branch. None
From here to end All files from the currently selected one on the main level to the end (in either a file list or a corpus tree) will be searched. Those which are placed before the current file are ignored. None
Checked files If you change the display on the main level to "list", i.e. a list of all files of the corpus in the order in which they have been loaded or in which they are listed in the corpus data set, then you can check individual files (by ticking in the small box on the left). All search routines will now only examine the checked files (useful when examining a subset of files). None
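The five ranges in the table above can be modelled as selections from an ordered list of files. The Python sketch below is a simplified illustration under stated assumptions: the class CorpusFile, the function files_in_range, the mode names and the sample file names are all hypothetical and do not reflect the internal data structures of Corpus Presenter.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CorpusFile:
    name: str
    branch: str            # name of the tree branch the file belongs to
    checked: bool = False  # ticked in the list display


def files_in_range(files: List[CorpusFile], mode: str,
                   current: int = 0,
                   branch: Optional[str] = None) -> List[CorpusFile]:
    """Return the files a search would examine for a given range."""
    if mode == "from_first":
        return list(files)
    if mode == "branch_only":
        if branch is None:
            raise ValueError("a branch must be selected for 'branch_only'")
        return [f for f in files if f.branch == branch]
    if mode == "current_only":
        return [files[current]]
    if mode == "from_here":
        return list(files[current:])
    if mode == "checked":
        return [f for f in files if f.checked]
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical corpus of four files in two branches.
corpus = [
    CorpusFile("a1.txt", "Prose"),
    CorpusFile("a2.txt", "Prose", checked=True),
    CorpusFile("b1.txt", "Drama"),
    CorpusFile("b2.txt", "Drama", checked=True),
]

names = lambda selection: [f.name for f in selection]
```

For example, names(files_in_range(corpus, "from_here", current=2)) yields the last two files, while the "checked" mode yields a2.txt and b2.txt although they are in different branches.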

    Types of returns from searches


Type What these look like (click on link)
RTF text file RTF returns
Line list Line list returns
Single-line grid Single-line grid returns
Multi-line grid Multi-line grid returns

What to do with returns


Type What you can do
RTF text file The RTF returns are listed in a text window, the contents of which you can edit if you like. You can also transfer text from here to some other program, such as your own word processor, by selecting the text, putting it in the Windows clipboard via Ctrl-C and retrieving it from there via Ctrl-V or the Paste option in your word processor.
Line list The advantage of line list returns is that the finds are aligned vertically in a central column, as users will recognise from other concordance software. You can select the lines you need (tick the box on the left) and then export them to a text file or to the Windows clipboard.
Single-line grid Single-line grid returns are more flexible than line list returns and you can choose to export them in a variety of ways (see screen shot below). You can also hide the keyword from returns, something which is useful when teaching students about linguistic structures: you can get them to guess what the keyword would be. Remember that you can rearrange the returns of a single-line grid as a grid of collocations.
Multi-line grid The multi-line grid returns, as the name suggests, can return several lines around a text find. You can also specify that the entire sentence in which a find is located be returned. This can be useful when quoting the text from which a find comes in an article or book chapter which you might be writing. Selected rows of a multi-line grid can be exported: hold the Ctrl-key depressed and click on the grey border on the left-hand side of the grid to select a row.
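The vertical alignment of finds in a line list, familiar from concordance software, can be sketched in a few lines of Python. This is a minimal illustration and not the layout routine of Corpus Presenter; the function kwic_lines, its width parameter and the sample text are assumptions.

```python
def kwic_lines(text: str, keyword: str, width: int = 20) -> list:
    """Return one line per find, with the keyword starting in a fixed column."""
    tokens = text.split()
    lines = []
    for i, token in enumerate(tokens):
        # Compare without surrounding punctuation and without regard to case.
        if token.strip(".,;:!?\"'").lower() == keyword.lower():
            left = " ".join(tokens[:i])[-width:]       # context before the find
            right = " ".join(tokens[i + 1:])[:width]   # context after the find
            # Right-align the left context so that all finds line up.
            lines.append(f"{left:>{width}}  {token}  {right}")
    return lines


# Hypothetical sample text.
sample = "The sea is rough. He went down to the sea at dawn."
lines = kwic_lines(sample, "sea")
```

Printing the lines one below the other shows the keyword in the same column in every row, which is what makes such returns easy to scan.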

The following screen shows the options available for storing line list, single- and multi-line grid returns to disk.


Making word lists


You can generate lists of words from the text files of a corpus. At a maximum, you can create a word list of all words in all text files of a corpus. This would take some time for a large corpus and is unlikely to be the aim of most users, but can be done on occasions of course. Instead users are probably interested in creating a list of selected words in a corpus. For this reason, one of the first options in the input window which opens on selecting this command is Input word list. Here you specify a plain ASCII file which consists of a list of words, one on each line. Such a word list can be easily created with Corpus Presenter Text Tool. The next item to remember, and which is concerned with restricting the words used for a list, is a stop word list. Essentially, this is a list of words (again in the form of a plain ASCII file) which are to be excluded from word list generation. For instance, if you choose to make a word list of an entire text, then it is unlikely that you want to have statistics on the occurrence of such common words as a, the, on, at, etc. These and similar words can be excluded by putting them into a stop word list and then specifying it on this level of the program.

To generate a word list Corpus Presenter examines a file or files, extracts all words and places each of these on a separate row in the grid which you see in the word list window. Each word is only entered once into the grid. The number of times a word occurs in a file is recorded in the frequency column.
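The principle of a word list with frequencies, an optional input word list and a stop word list can be sketched in Python as follows. This is an illustration under stated assumptions, not the routine used by Corpus Presenter: the function make_word_list, the way words are split and lower-cased, and the sample sentence are all hypothetical.

```python
import re
from collections import Counter


def make_word_list(text, stop_words=frozenset(), input_words=None):
    """Return (word, frequency) pairs, most frequent first.

    stop_words  -- words to exclude from the list
    input_words -- if given, only these words are counted
    """
    # Assumption: a word is a run of letters, optionally joined by ' or -.
    words = re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    if input_words is not None:
        counts = Counter({w: c for w, c in counts.items() if w in input_words})
    # Each word appears once; sort by falling frequency, then alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample text.
sample = "The sea and the wind; the sea took him."
full_list = make_word_list(sample, stop_words={"the", "and"})
selected = make_word_list(sample, input_words={"sea", "wind"})
```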

When saving the results in the grid to disk you can choose to have these deposited in a plain text file or in a database (for further processing with one of the supplied database editors). You can also just select rows in the grid and then store only the contents of the selected rows (an extract of the entire grid so to speak). To select contiguous rows, hold the Shift-key depressed and mark the rows by moving with the up or down arrow key. To select non-contiguous rows, hold the Ctrl-key depressed and move from row to row with either the up or down arrow key. To select a row, press the SpaceBar (without releasing the Ctrl-key) or click on the row with the mouse.

Generating a word list is a somewhat slow process as the entire text must be combed through for each word. But it is something you can initiate and leave the computer to work away while you do something else. Note the option of generating a word list for the checked files of the current corpus tree. This option can be used to ensure that all the files you want to encompass are included in the operation.

Examples of a word list, based on input forms with the legal wild cards * and ?

Note that the question mark stands for a single unspecified character and the asterisk for several unspecified characters. The results given here can be repeated by selecting the text RIDERS.ASC (Synge’s Riders to the Sea) which is contained in the test corpus supplied with Corpus Presenter.

Input form Word(s) returned from text RIDERS.ASC
b?g big
gr??e grace, grave
g??e Give, gone
he*d head, heard, he'd
ho* holding, holds, hole, Holy, hook, hooker's, horses, hour, house, house-six, How
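The behaviour of the two wild cards can be imitated in Python by translating an input form into a regular expression. The sketch below is not the matching routine of Corpus Presenter. It assumes that the asterisk may stand for any number of characters (including none) and that matching ignores case; the function names are hypothetical and the word list is just a selection of the returns shown in the table above.

```python
import re


def wildcard_to_regex(form: str) -> re.Pattern:
    """Translate an input form with ? and * into a compiled pattern."""
    parts = []
    for ch in form:
        if ch == "?":
            parts.append(".")       # exactly one unspecified character
        elif ch == "*":
            parts.append(".*")      # assumed: zero or more characters
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def match_forms(form: str, words: list) -> list:
    """Return the words which match the input form in their entirety."""
    pattern = wildcard_to_regex(form)
    return [w for w in words if pattern.fullmatch(w)]


# A selection of words taken from the table above.
words = ["big", "grace", "grave", "Give", "gone",
         "head", "heard", "he'd", "hole", "Holy", "house"]
```

With this list, match_forms("he*d", words) returns head, heard and he'd, in line with the table.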

When you are deciding how returns are to be displayed you can choose between a plain list (which just includes the word and the frequency) or a grid. The latter is much more flexible and you can decide how many of five fields are to be included. The first one, Word, is obligatory, but the others can be determined by the user. If you choose to include the field Location, the search is liable to be slowed down if there are a lot of finds for each word, because the program then records the location in each text of all the finds. This option can be very useful when looking for rare forms, so use it prudently.


N.B. You can click on the column of a table to sort it alphabetically or numerically on that column. Clicking a second time will reverse the sorting order, i.e. turn ascending into descending order.

The list generated on this level can also be stored to disk as an HTML file (comparable to the option on the Basic Search and Advanced Search levels). In this case you should click on the tick box Output as HTML file in the output options window which opens when you click on the button Save as word list.


Generating charts from returns


New in Corpus Presenter from Version 10 onwards is the option of arranging returns in such a way that you can generate a chart from them. On both the Basic Search and the Advanced Search levels you now have the option of storing the returns in such a way that they can be used to generate a chart within Microsoft Excel.

The way this works is as follows: every time a search is carried out, Corpus Presenter returns not just the finds but a table in which the files with finds and the number of finds per file are shown as in the following screen shot.

The data in this table can be transferred to a grid. From this location you can now save the data in the grid to a database. Remember that a database is an internally structured file with rows and columns. In this case the database which is stored on disk contains a column called Node_label and one called Returns_1 which contain the names of the files with finds and the numbers of finds per file respectively (the names of the columns can be altered if you wish, see screen further below).
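The table of files and finds per file can be pictured with the following Python sketch. It is an illustration only: Corpus Presenter stores the data in a database file, whereas the sketch writes comma-separated text as a simple substitute which spreadsheet programs can also open. The function names, the second file name and the sample finds are assumptions; only the column names Node_label and Returns_1 come from the description above.

```python
import csv
import io
from collections import Counter


def returns_per_file(finds):
    """Count finds per file; `finds` holds (file name, text of find) pairs."""
    counts = Counter(file_name for file_name, _text in finds)
    return list(counts.items())  # files in the order of their first find


def as_csv(rows, columns=("Node_label", "Returns_1")):
    """Render the rows as comma-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical finds from two files.
finds = [
    ("RIDERS.ASC", "the sea"),
    ("RIDERS.ASC", "by the sea"),
    ("TEXT_B.TXT", "at sea"),
]
rows = returns_per_file(finds)
table = as_csv(rows)
```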

The database created in this process can be loaded into Microsoft Excel. There you see the names of files and the numbers of returns. Now all you need do is mark all rows and columns via Ctrl-A (or a selection of these if you wish). Go to the Insert menu and choose Chart. A dialogue now begins in which you can specify just what type of chart you wish and how the database data is to be displayed in this.

You can edit the returns database if you wish, adding or re-arranging information if you feel this is necessary for your purposes. If you store the database to an Excel worksheet – with the extension .XLS – then the chart just generated is retained as are any changes you made to the rows and columns of data.

For further processing in Microsoft Excel the original database is not required and can be overwritten during a future work session if you wish. By default, Corpus Presenter calls the returns database Database_to_MS_Excel.dbf. You can choose any name you like when saving this to disk.

On the Advanced Search level returns are also arranged by file and number of returns per file as shown in the following screen. Here the titles of columns can also be specified by the user. Bear in mind that a title has a maximum of 10 characters for the resulting database. However, after importation to Microsoft Excel you can edit the column titles and have longer titles if you wish.

You can store up to six sets of returns in a grid and save these to a database for later chart generation in Microsoft Excel. This way you can see if there is a meaningful relationship between sets of returns across groups of files and demonstrate this visually in a chart.

No matter how well you specify the parameters for a search, it is very unlikely that the returns are going to be entirely accurate across several files. So it is imperative that the user check the returns before basing linguistic statements on these. In particular, spurious finds should be removed from a set of returns before these are stored in a database for later chart generation.

To remove finds, close the statistics window and edit the full list of returns. Just delete the rows with spurious finds. Then recalculate the statistics and the more accurate figures are transferred to the statistics window which contains the grid with values for files and number of returns per file.

Recalculation can be done on both the Basic Search and the Advanced Search levels.

The options for storing and saving data from returns on disk have been expanded as of Version 10 of Corpus Presenter. These include saving data in database form and reloading this in a later work session.

Note. Corpus Presenter recognises different versions of the Microsoft Office suite of programs. The way it proceeds is as follows: it checks to see if Office 2007 is installed on your computer and, if not, it looks for Office 2003. If this is not found, it tries for Office 2000. The first version which is located, working in reverse chronological order, is the one which is used. This applies to all programs of this suite. Word and Excel are the most commonly accessed programs, but Corpus Presenter also recognises PowerPoint and Publisher files.
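The order of this check can be summarised in a short Python sketch. It only mirrors the sequence described in the note and makes no claim about how Corpus Presenter actually detects an installation; the function pick_office_version and the way installed versions are passed in as a set of names are assumptions.

```python
# Versions in reverse chronological order, as described in the note.
OFFICE_VERSIONS = ["Office 2007", "Office 2003", "Office 2000"]


def pick_office_version(installed):
    """Return the most recent known version found in `installed`, or None."""
    for version in OFFICE_VERSIONS:
        if version in installed:
            return version
    return None


# Hypothetical machine with two versions installed.
chosen = pick_office_version({"Office 2000", "Office 2003"})
```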