Since the early 1990s a large number of corpora have become available which consist of texts covering periods in the history of English. The first major corpus in this area was the Helsinki Corpus of English Texts (1991, 1993), which includes extracts from various works ranging from Old English to the late modern period. At the University of Helsinki various additional corpora have been compiled since, each focussing on a selection of texts, either of a particular genre, e.g. personal correspondence or medical texts, or from a particular region, e.g. Scottish texts. Other universities soon followed suit and by the end of the decade quite an impressive range of corpora was available.
A selection of corpora is listed below to convey an impression of the variety and coverage of those currently available. This is an expanding field and with each passing year new corpora appear, some of which are put in the public domain by their compilers.
|Name||Compiling institution / individuals|
|ARCHER, a corpus of British and American English from 1650-1990||Douglas Biber and associates at Northern Arizona University in collaboration with colleagues at the University of Freiburg, Germany|
|Australian Corpus of English||Department of Linguistics, Macquarie University, NSW, Australia|
|Bank of English||University of Birmingham, sponsored by the publisher HarperCollins|
|British National Corpus||Consortium under the aegis of Oxford University Press|
|The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English||A parsed section of the original Helsinki corpus prepared by a number of linguists|
|Brown Corpus of Standard American English||W. Nelson Francis and Henry Kučera, Brown University, Providence, Rhode Island|
|Corpus of 19th Century English||Merja Kytö and associates, Uppsala University, Sweden|
|Corpus of Dialogues||Merja Kytö, Uppsala University, Sweden and Jonathan Culpeper, Lancaster University, England|
|Corpus of Early English Correspondence||Terttu Nevalainen and Helena Raumolin-Brunberg, University of Helsinki, Finland|
|A Corpus of Irish English||Raymond Hickey, Essen University, Germany (packaged with Corpus Presenter, Software for Language Analysis, Amsterdam: John Benjamins, 2003)|
|Corpus of London Teenage Language (COLT)||Anna-Brita Stenström and associates, Department of English, University of Bergen|
|Corpus of Middle English Prose and Verse||University of Michigan, Michigan|
|Freiburg-Brown Corpus of American English (FROWN)||Christian Mair and associates, University of Freiburg, Germany|
|Freiburg-LOB Corpus of British English (FLOB)||Christian Mair and associates, University of Freiburg, Germany|
|The Helsinki Corpus of Older Scots||Anneli Meurman-Solin, Department of English, University of Helsinki, Finland|
|Innsbruck Corpus Archive of Middle English Texts (ICAMET)||Manfred Markus, University of Innsbruck, Austria|
|International Corpus of English (ICE), collection of corpora from various anglophone countries, now (2005) partially completed||Co-ordinated by the Department of English, University College London, England|
|Kolhapur Corpus of Indian English||Shivaji University, Kolhapur|
|Lampeter Corpus of Early Modern English Tracts||Josef Schmied, Technical University Chemnitz, Germany|
|Lancaster-Oslo-Bergen Corpus of British English||Collaborative effort of the universities in the three cities named in title|
|London-Lund Corpus of Spoken English||Departments of English at University College London, England and Lund University, Sweden|
|Middle English Medical Texts||Irma Taavitsainen, Päivi Pahta and Martti Mäkinen, Department of English, University of Helsinki, Finland. Retrieval software by Raymond Hickey. Published by John Benjamins, 2005.|
|Northern Ireland Transcribed Corpus of Speech (NITCS)||John Kirk, Department of English, Queen’s University, Belfast, Northern Ireland|
|Penn-Helsinki Parsed Corpus of Middle English||University of Pennsylvania, Philadelphia, Pennsylvania|
|Old Bailey Court Depositions||Department of History, University of Sheffield|
|Santa Barbara Corpus of Spoken American English||University of California, Santa Barbara|
|Zurich English Newspaper Corpus||Udo Fries and associates, Department of English, University of Zurich|
If you want to see how corpus software works, you can download a free version of my software package, Corpus Presenter. This was published as a book and CD entitled Corpus Presenter, Software for Language Analysis (Amsterdam: John Benjamins) in 2003. The version with the book is 7.0; version 9.0 (November 2005) is available from my homepage at Essen University. For more details, go to the page Corpus Presenter on this website (click on the relevant branch of the tree on the left of the start screen).
The version which you can download via the link below contains all the functions of the full program except the third and most sophisticated level of text retrieval. Even with the lite version, however, you can carry out refined searches across sets of texts and use wild-cards, sets of input forms, etc. All search results can be stored to disk or copied to the Windows clipboard.
To allow you to get moving quickly, I have enclosed a small test corpus – called SmallSampleCorpus.cpd – which contains extracts from Beowulf, Chaucer's Canterbury Tales, some items by Shakespeare (two plays and the sonnets) as well as a number of Irish pieces. You can start doing searches with this corpus straight away. If you have your own files (in plain text, RTF, HTML or XML format) you can search through these equally well. Just select your files from the initial file listing and load them directly.
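For readers who would like to see the principle behind a wild-card search across a set of texts, here is a minimal sketch in Python. It is not how Corpus Presenter itself works internally. The function names (`wildcard_to_regex`, `search_texts`, `load_plain_text_files`) are hypothetical, and the sketch assumes that `*` stands for any run of word characters, that `?` stands for exactly one, and that the files are plain text with a `.txt` extension.

```python
import re
from pathlib import Path


def wildcard_to_regex(pattern):
    """Turn a word pattern containing * and ? into a whole-word regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")   # any run of word characters (assumed meaning)
        elif ch == "?":
            parts.append(r"\w")    # exactly one word character (assumed meaning)
        else:
            parts.append(re.escape(ch))
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


def search_texts(texts, patterns, width=30):
    """Search every text for every pattern.

    texts maps a text name to its content; patterns is a set of input forms.
    Returns a list of (text name, matched word, surrounding context) tuples.
    """
    regexes = [wildcard_to_regex(p) for p in patterns]
    hits = []
    for name, text in texts.items():
        for regex in regexes:
            for match in regex.finditer(text):
                start = max(0, match.start() - width)
                end = min(len(text), match.end() + width)
                # Collapse line breaks and repeated spaces in the context window.
                context = " ".join(text[start:end].split())
                hits.append((name, match.group(0), context))
    return hits


def load_plain_text_files(folder):
    """Read all .txt files in a folder into a name -> content dictionary."""
    return {
        path.name: path.read_text(encoding="utf-8", errors="replace")
        for path in sorted(Path(folder).glob("*.txt"))
    }


if __name__ == "__main__":
    sample = {
        "chaucer.txt": "Whan that Aprille with his shoures soote",
        "sonnet18.txt": "Shall I compare thee to a summer's day? Thou art more lovely",
    }
    for name, word, context in search_texts(sample, ["sh*", "th??"]):
        print(f"{name}: {word} [{context}]")
```

To try this on your own files, replace the `sample` dictionary with a call such as `load_plain_text_files("my_texts")`, where `my_texts` is a folder name of your choosing.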
Corpus Presenter Lite (date: 18 December 2005; size: 7.6 MB)
The file you download is a ZIP file. You must unzip this file (to any directory you like) and then start the program setup.exe, which will be among the files extracted from the ZIP file. The setup procedure will suggest installing Corpus Presenter Lite to the directory C:\Program Files\Corpus Presenter Lite, which you should let it do. Once the installation is complete (and you have re-started your computer) there will be an entry Corpus Presenter Lite in the list under Start - Programs on the desktop of your computer.
Corpus Presenter Lite works best with Windows XP (and presumably later versions when these appear). Using versions of the operating system older than Windows 2000 is not advised. For legal reasons, I must stress that you use the program at your own risk. It can be removed easily from your computer via the Add/Remove Software module in the Control Panel of Windows XP.
Aarts, Jan and Willem Meijs (eds.) 1990. Theory and practice in corpus linguistics. Amsterdam: Rodopi.
Aijmer, Karin and Bengt Altenberg (eds.) 1991. English corpus linguistics: Studies in honour of Jan Svartvik. London: Longman.
Aijmer, Karin and Bengt Altenberg (eds.) 2004. Advances in Corpus Linguistics. Papers from the 23rd International Conference on English Language Research on Computerized Corpora (ICAME 23). Amsterdam: Rodopi.
Altenberg, Bengt 1991. ‘A bibliography of publications relating to English computer corpora’, In: Stig Johansson and Anna-Brita Stenström (eds), English computer corpora. Selected papers and research guide. Berlin: Mouton de Gruyter, pp. 355-96.
Altenberg, Bengt and Sylviane Granger (eds) 2001. Lexis in contrast. Corpus-based approaches. Amsterdam: John Benjamins.
Biber, Douglas, Susan Conrad and Randi Reppen 1998. Corpus linguistics. Investigating language structure and use. Cambridge: University Press.
Bridge, Derek and Stephen Harlow 1997. An introduction to computational linguistics. Oxford: Blackwell.
Briscoe, Ted and Brian Boguraev 1989. Computational lexicography for natural language processing. London: Longman.
Butler, Christopher 1985. Computers in linguistics. Oxford: Blackwell.
Butler, Christopher 1985. Statistics in linguistics. Oxford: Blackwell.
Connor, Ulla and Thomas A. Upton (eds) 2004. Applied Corpus Linguistics. A Multidimensional Perspective. Amsterdam: Rodopi.
Conrad, Susan and Douglas Biber 2001. Variation in English - Multi-dimensional Studies. Harlow, England; New York: Longman.
Culpeper, Jonathan, and Merja Kytö. 1997. ‘Towards a Corpus of Dialogues, 1550–1750’, Language in Time and Space. Studies in Honour of Wolfgang Viereck on the Occasion of His 60th Birthday, eds. Heinrich Ramisch and Kenneth Wynne. Stuttgart: Franz Steiner Verlag, 60–71.
Fries, Udo, Gunnel Tottie and Peter Schneider (eds) 1994. Creating and using English language corpora. Amsterdam: Rodopi.
Fries, Udo, Viviane Müller and Peter Schneider (eds) 1997. From Ælfric to the New York Times. Amsterdam: Rodopi.
Garside, Roger, Geoffrey Leech and Geoffrey Sampson (eds) 1987. The computational analysis of English. London: Longman.
Granger, Sylviane and Stephanie Petch-Tyson (eds) 2003. Extending the scope of corpus-based research. New applications, new challenges. Amsterdam: Rodopi.
Greenbaum, Sidney 1996. Comparing English world-wide. The international corpus of English. Oxford: University Press.
Häcker, Martina 1998. Syntax and semantics of adverbial clauses in present-day Scots. A corpus-based study. Berlin: Mouton de Gruyter.
Hampe, Beate 2001. Superlative verbs. A corpus-based study of semantic redundancy in English verb-particle constructions. Tübingen: Narr.
Hasselgård, Hilde and Signe Oksefjell (eds) 1999. Out of corpora. Studies in honour of Stig Johansson. Amsterdam: Rodopi.
Hauenschild, Christa and Susanne Heizmann (eds) 1997. Machine translation and translation theory. Berlin: Mouton de Gruyter.
Hickey, Raymond 1993a. ‘Applications of software in the compilation of corpora’, In: Merja Kytö, Matti Rissanen and Susan Wright (eds), Corpora across the centuries. Amsterdam: Rodopi, pp. 165-86.
Hickey, Raymond 1993b. ‘A corpus of Irish English’, In: Merja Kytö, Matti Rissanen and Susan Wright (eds), Corpora across the centuries. Amsterdam: Rodopi, pp. 23-31.
Hickey, Raymond 1997a. ‘The computer analysis of medieval Irish English’, In: Hickey, Kytö, Lancashire and Rissanen (eds), pp. 167-83.
Hickey, Raymond 2000. ‘Processing corpora with Corpus Presenter’, ICAME Journal 24, 65-84.
Hickey, Raymond 2003. Corpus Presenter. Software for language analysis. Includes A Corpus of Irish English. Amsterdam: John Benjamins.
Hickey, Raymond, Merja Kytö, Ian Lancashire and Matti Rissanen (eds) 1997. Tracing the trail of time. Proceedings of the conference on diachronic corpora, Toronto, May 1995. Amsterdam: Rodopi.
Hockey, Susan M. 1980. A guide to computer applications in the humanities. London: Duckworth.
Hunston, Susan and Gill Francis 2000. Pattern grammar. A corpus-driven approach to the lexical grammar of English. Amsterdam: John Benjamins.
Johansson, Stig and Anna-Brita Stenström (eds) 1991. English computer corpora. Selected papers and research guide. Berlin: Mouton de Gruyter.
Jung, Udo O. H. (ed.) 1991. Computers in applied linguistics and language teaching. Frankfurt/Bern: Lang.
Kennedy, Graeme 1998. An introduction to corpus linguistics. London: Longman.
Kirk, John (ed.) 2000. Corpora galore. Analyses and techniques in describing English. Amsterdam: Rodopi.
Krug, Manfred 2000. Emerging English Modals: A Corpus-Based Study of Grammaticalization [Topics in English Linguistics]. Berlin/New York: Mouton de Gruyter.
Kytö, Merja 1993. Manual to the diachronic part of the Helsinki corpus of English texts. 2nd edition. Helsinki: Department of English.
Kytö, Merja. 1999. ‘Collocational and Idiomatic Aspects of Verbs in Early Modern English: A Corpus-based Study of MAKE, HAVE, GIVE, TAKE, and DO’, Collocational and Idiomatic Aspects of Composite Predicates in the History of English, eds. Laurel J. Brinton and Minoji Akimoto. Amsterdam/Philadelphia: Benjamins, 167–206.
Kytö, Merja, and Suzanne Romaine. 1997. ‘Competing Forms of Adjective Comparison in Modern English: What Could Be More Quicker and Easier and More Effective?’, To Explain the Present. Studies in the Changing English Language in Honour of Matti Rissanen (Mémoires de la Société Néophilologique 52), eds. Terttu Nevalainen and Leena Kahlas-Tarkka. Helsinki: Société Néophilologique, 329–52.
Kytö, Merja and Matti Rissanen 1988. ‘The Helsinki Corpus of English Texts: Classifying and coding the diachronic part’. In Kytö, Ihalainen and Rissanen (eds), pp. 169-80.
Kytö, Merja and Matti Rissanen 1992. ‘A language in transition: The Helsinki Corpus of English texts’, ICAME Journal 16: 7-27.
Kytö, Merja, and Matti Rissanen. 1997. ‘Language Analysis and Diachronic Corpora’, Tracing the Trail of Time. Proceedings from the Second Diachronic Corpora Workshop, New College, University of Toronto, Toronto, May 1995, eds. Raymond Hickey, Merja Kytö, Ian Lancashire, and Matti Rissanen. Amsterdam and Atlanta, GA: Rodopi, 9–22.
Kytö, Merja, Ossi Ihalainen, and Matti Rissanen (eds.) 1988. Corpus linguistics hard and soft. Amsterdam: Rodopi.
Kytö, Merja, Juhani Rudanko, and Erik Smitterberg. 2000. ‘Building a Bridge between the Present and the Past: A Corpus of 19th-century English’, ICAME Journal 24: 85–97.
Kytö, Merja (ed.). forthcoming. New Vistas into Victorian English: Studies in 19th-century Morpho-syntax. Publisher??
Kytö, Merja, Matti Rissanen and Susan Wright (eds) 1994. Corpora across the centuries. Amsterdam: Rodopi.
Lawler, John and Helen Aristar Dry (eds) 1998. Using computers in linguistics. A practical guide. London: Routledge.
Leech, Geoffrey and Christopher N. Candlin 1986. Computers in English language teaching and research. London: Longman.
Leech, Geoffrey, Greg Myers and Jenny Thomas (eds) 1995. Spoken English on computers. Transcription, mark-up, application. London: Longman.
Leech, Geoffrey, Paul Rayson and Andrew Wilson 2001. Word Frequencies in Written and Spoken English: based on the British National Corpus. London, New York: Longman.
Leitner, Gerhard (ed.) 1992. New directions in English language corpora. Methodology, results, software developments. Berlin: Mouton de Gruyter.
Lindquist, Hans and Christian Mair (eds) 2004. Corpus Approaches to Grammaticalization in English. Amsterdam: John Benjamins.
Ljung, Magnus (ed.) 1997. Corpus-based studies in English. Amsterdam: Rodopi.
Mair, Christian and Marianne Hundt (eds) 2000. Corpus Linguistics and Linguistic Theory. (Proceedings of ICAME 20). Amsterdam, Atlanta, GA: Rodopi.
Mason, Oliver 2000. Programming for corpus linguistics. Edinburgh: University Press.
McEnery, Tom and Andrew Wilson 2001. Corpus linguistics. An introduction. 2nd edition. Edinburgh: University Press.
Meurman-Solin, Anneli 1997. ‘Text profiles in the study of language variation and change’, in Hickey et al., pp. 199-214.
Meyer, Charles F. 2002. English corpus linguistics. An introduction. Cambridge: University Press.
Miall, David S. (ed.) 1990. Humanities and the computer. New directions. Oxford: Clarendon Press.
Moon, Rosamund 1998. Fixed Expressions and Idioms in English. A Corpus-Based Approach. Oxford: Clarendon Press.
Nevalainen, Terttu 1997. ‘Ongoing work on the Corpus of Early English Correspondence’, in Hickey et al., pp. 81-90.
Nevalainen, Terttu and Helena Raumolin-Brunberg (eds) 1996. Sociolinguistics and Language History. Studies based on the Corpus of Early English Correspondence. Amsterdam: Rodopi.
Nelson, Gerald, Sean Wallis and Bas Aarts ????. Exploring natural language. Working with the British Component of the International Corpus of English. Amsterdam: John Benjamins.
Oakes, M.P. 1998. Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press.
Ooi, Vincent B. Y. 1998. Computer corpus lexicography. Edinburgh: University Press.
Partington, Alan 1998. Patterns and meanings. Using corpora for English language research and teaching. Amsterdam: John Benjamins.
Percy, Carol, Charles F. Meyer and Ian Lancashire (eds) 1996. Synchronic corpus linguistics. Amsterdam: Rodopi.
Pérez-Guerra, Javier 1999. Historical English syntax. A statistical corpus-based study on the organisation of Early Modern English sentences. München: Lincom.
Peters, Pam, Peter Collins and Adam Smith (eds) 2002. New Frontiers of Corpus Research. Papers from the Twenty First International Conference on English Language Research on Computerized Corpora, Sydney 2000. Amsterdam: Rodopi.
Raumolin-Brunberg, Helena 1997. ‘Incorporating sociolinguistic information into a diachronic corpus of English’, in Hickey et al., pp. 105-18.
Renouf, Antoinette (ed) 1998. Explorations in corpus linguistics. Amsterdam: Rodopi.
Renouf, Antoinette and Andrew Kehoe (eds) 2006. The Changing Face of Corpus Linguistics. Amsterdam: Rodopi.
Reppen, Randi, Susan M. Fitzmaurice and Douglas Biber (eds) 2002. Using Corpora to Explore Linguistic Variation. Amsterdam: John Benjamins.
Rissanen, Matti, Merja Kytö and Kirsi Heikkonen (eds) 1997. English in transition. Corpus-based studies in linguistic variation and genre styles. Berlin: Mouton de Gruyter.
Sampson, Geoffrey 1995. English for the computer. The SUSANNE corpus and analytic scheme. Oxford: University Press.
Schmid, Hans-Jörg 2000. English abstract nouns as conceptual shells. From corpus to cognition. Berlin: Mouton de Gruyter.
Scott, Mike and Geoff Thompson (eds) 2000. Patterns of text. In honour of Michael Hoey. Amsterdam: John Benjamins.
Smitterberg, Erik 2005. The progressive in 19th-century English. A process of integration. Amsterdam: Rodopi.
Stenström, Anna-Brita, Gisle Andersen and Ingrid Kristine Hasund (eds) 2002. Trends in Teenage Talk. Corpus compilation, analysis and findings. Amsterdam: John Benjamins.
Stubbs, Michael 1996. Text and corpus analysis. Computer assisted studies of language and culture. Oxford: Blackwell.
Stubbs, Michael 2000. Words and phrases. Corpus studies of lexical semantics. Oxford: Blackwell.
Svartvik, Jan and Randolph Quirk (eds) 1980. A corpus of English conversation. Lund: Gleerup.
Thomas, Jenny and Mick Short (eds) 1996. Using corpora for language research. London: Longman.
Tognini-Bonelli, Elena 2001. Corpus Linguistics at Work. Amsterdam: John Benjamins.
Trotta, Joe 2000. Wh-Clauses in English. Aspects of Theory and Description. Amsterdam: Rodopi.
Wichmann, Anne, Steven Fligelstone, Tony McEnery and Gerry Knowles (eds) 1997. Teaching and language corpora. London: Longman.
Zampolli, Antonio, Nicoletta Calzolari and Martha Palmer (eds) 1994. Current issues in computational linguistics. In Honour of Don Walker. Dordrecht: Kluwer.