
  What is XML?


XML stands for eXtensible Markup Language. In recent years this format has gained popularity as a means of encoding texts so that they can be moved between computer platforms and operating systems. Essentially, XML files work as follows: all markup is enclosed in angle brackets, starting with < and finishing with >. Each such bracketed item is called a ‘tag’. To apply a tag to a piece of text in a file, you start with an opening tag and end with a closing tag, which is the same as the opening tag but has a slash before the tag label (see the example below). Similar tags are used in HTML files (those normally used on the web), but there the set of tags is fixed. The great advantage of XML is that user-defined tags are allowed, as in the following text snippet:

<references>
      <author>
            Murphy, Paddy
      </author>
      <year>
            2005
      </year>
      <title>
            My view of linguistics
      </title>
      <publisher>
            Secret Publications Ltd.
      </publisher>
</references>
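
The rule that every opening tag needs a matching closing tag can be checked mechanically. The following is a minimal Python sketch, assuming the standard library module xml.etree.ElementTree; the function names read_fields and is_well_formed are made up for this illustration and have nothing to do with Corpus Presenter itself.

```python
import xml.etree.ElementTree as ET

# The bibliographic snippet from the text, with its original indentation.
SNIPPET = """<references>
      <author>
            Murphy, Paddy
      </author>
      <year>
            2005
      </year>
      <title>
            My view of linguistics
      </title>
      <publisher>
            Secret Publications Ltd.
      </publisher>
</references>"""


def read_fields(xml_text):
    """Return a dict mapping each child tag label to its whitespace-trimmed text."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


def is_well_formed(xml_text):
    """Return True if the text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


fields = read_fields(SNIPPET)
print(fields["author"])

# A closing tag whose label differs from the opening tag is rejected.
print(is_well_formed("<author>Murphy, Paddy</autor>"))
```

Because the tag labels are user-defined, the parser does not care what they are called; it only requires that each closing tag carries the same label as its opening tag.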

A text like the bibliographic entry above could be viewed as plain text, basically as it appears there. But with XML one can also have a stylesheet, a file with the extension .xsl, which specifies how the tags found in the XML file are to be interpreted by a program with XML capability. For instance, an XSL file might specify that any text enclosed by the tags <author> and </author> be displayed (and probably printed) in bold, and that any text enclosed by <title> and </title> be shown in italics.
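
To make the idea of a stylesheet more concrete, here is a rough Python sketch. It is not XSL and not what any particular XML program does internally; it only imitates the principle that a separate set of rules decides how each tag is displayed. The table STYLE_RULES and the function render are hypothetical names chosen for this illustration, and the output uses the HTML elements b (bold) and i (italics).

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical rules playing the role of a stylesheet: tag label -> HTML element.
STYLE_RULES = {"author": "b", "title": "i"}


def render(xml_text, rules=STYLE_RULES):
    """Render each child element on its own line, styled according to the rules."""
    root = ET.fromstring(xml_text)
    lines = []
    for child in root:
        text = escape((child.text or "").strip())
        html_tag = rules.get(child.tag)
        if html_tag is None:
            lines.append(text)  # no rule for this tag: leave the text unstyled
        else:
            lines.append(f"<{html_tag}>{text}</{html_tag}>")
    return "\n".join(lines)


entry = (
    "<references><author>Murphy, Paddy</author><year>2005</year>"
    "<title>My view of linguistics</title></references>"
)
print(render(entry))
```

Changing the rules changes the display without touching the XML file, which is the point of keeping the stylesheet separate from the text.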
      When preparing text corpora, XML is increasingly the preferred file format. As you can see from what has just been said, the compilers of a corpus could include much customised information in the form of user-defined tags and then further specify how these tags are to be processed by appropriate software.
      Bear in mind that user-defined tags can only be interpreted if information is provided about how this is to be done (via an XSL file). However, for text retrieval tasks with general software like Corpus Presenter, user-defined tags have no meaning and are removed at the outset or ignored during processing, as explained below.

Corpus Presenter has basically two ways of dealing with XML files:

1) You can convert XML files to text on the directory listing level by entering a directory containing XML files and then clicking on the button Convert files. The window which appears will list the XML files (assuming that the option XML to ASCII is chosen in the lower right-hand corner). After you have converted the XML files to ASCII texts you can load the latter and carry out any retrieval tasks you wish with Corpus Presenter. Note that the conversion process entails removing all XML tags from the input files. With all tags removed, the results are ASCII files, i.e. they consist of just text without any formatting information (in this case XML tags).

2) You can load any XML files (without prior conversion) and then select the option Search, Comment codes on the main program level. Click on the Load button and select the file XML_Tag_Delimiters.lst. Using this file as input for comment codes will ensure that XML tags are treated as comments, and so ignored, during retrieval operations. Do not forget to turn comment checking ‘on’ by ticking the appropriate check box at the bottom of the Comment codes window.
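
The difference between the two approaches can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: it does not reproduce Corpus Presenter's own conversion routine or the contents of XML_Tag_Delimiters.lst, and it simply treats everything from < to the next > as a tag. The names xml_to_text and find_outside_tags are hypothetical.

```python
import re

# Assumption: a tag is anything from "<" up to the next ">".
TAG_PATTERN = re.compile(r"<[^>]*>")


def xml_to_text(xml_text):
    """Approach 1: remove every tag and tidy the remaining whitespace."""
    without_tags = TAG_PATTERN.sub(" ", xml_text)
    return " ".join(without_tags.split())


def find_outside_tags(xml_text, word):
    """Approach 2: keep the file as it is, but skip hits that lie inside a tag.

    Returns the character offsets of 'word' found outside any tag.
    """
    tag_spans = [m.span() for m in TAG_PATTERN.finditer(xml_text)]
    hits = []
    for m in re.finditer(re.escape(word), xml_text):
        inside_tag = any(start <= m.start() < end for start, end in tag_spans)
        if not inside_tag:
            hits.append(m.start())
    return hits


sample = "<title>My title page</title>"
print(xml_to_text(sample))
print(find_outside_tags(sample, "title"))
```

In the sample, the word title occurs three times, but only the occurrence in the running text counts as a hit; the two occurrences inside the tags are skipped, much as tags treated as comments are ignored during retrieval.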