Corpus tools for analysing scientific texts
The Web as corpus
What all corpus tools have in common is the computer concordancing of a set of texts. A concordance is simply ‘a collection of the occurrences of a word-form, each in its own textual environment’.20 Concordances, therefore, provide students with instant access to authentic examples, indirectly boosting exposure to specialised vocabulary in context (e.g., biological diversity, hydro-electric power plant, pluripotent stem cells, urban community), and can be used to find frequent patterns of discourse use, either internal to a genre or part-genre or to a specialised rhetorical move (for example, the Introduction section of a research article as in John Swales’ model).21 In Ken Hyland’s words, analyses of concordances ‘provide information about users’ preferred meanings by displaying repeated co-occurrence of words, allowing us to see the characteristic associations and connections that they have, and how they take on specific meanings for particular individuals and in given communities’.22 Although strictly not a corpus put together for linguistic analysis, the World Wide Web can be taken as a large repository of texts one can query for word forms to answer vocabulary and grammar questions, retrieving search results (or hits) in a form which is very similar to a concordance, i.e., an index of word forms in context.
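To make the notion of a concordance concrete, the following is a minimal sketch of a keyword-in-context (KWIC) display in Python. The function name `kwic`, the context width and the sample sentences are illustrative assumptions and do not reproduce any of the tools discussed in this chapter.

```python
import re

def kwic(texts, keyword, width=30):
    """Return one keyword-in-context line per occurrence of `keyword`.

    Each line shows up to `width` characters of left and right context,
    with the keyword aligned in a central column.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for text in texts:
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

# Invented sample sentences, for illustration only.
sample = [
    "Pluripotent stem cells can differentiate into many cell types.",
    "The loss of biological diversity affects every urban community.",
    "Stem cells were cultured at low temperatures.",
]

for line in kwic(sample, "cells"):
    print(line)
```

Aligning the word form in a central column is what lets the reader scan its left and right collocates at a glance, which is the property that search-engine result snippets only approximate.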
A first very general tool based on this idea is an application of the Web as Corpus paradigm.23 This consists in exploiting commercial search engines, using the filters of the advanced search options together with Boolean operators to narrow searches to academic websites or genre-specific portals. So, for example, restricting the search to .edu and .ac.uk roughly corresponds to querying the web as a repository of American and British academic discourse. This, together with the wildcard *, which stands for an unspecified word in a certain position within a phrase, allows one to check for common collocates of some target term or expression, e.g., cell, temperature, client brief, design project, or to test frequent phraseology in specific sub-domains of the Web, such as causes a shift in *, have a * impact on, a * increase in * at, the latter searches being aimed at retrieving adjectives commonly found to occur with the nouns impact and increase and nouns likely to follow the prepositions on and in.24 If we look at the top search results for the phrase "a * increase in * at" (the double quotes stand for ‘exact phrase match’) restricted to the subsection of the web corresponding to British academic sites only, we see some common adjectives that collocate with increase (listed here in alphabetical order), e.g., clear, dramatic, larger, greater, massive, pronounced, rapid, significant, slight, small, steady, steep, 30%, three-fold, etc., many of which express some kind of measurement; to the right of the noun, after the preposition in, we find scientific entities that tell us of the scope of the increase, e.g., blood flow, blood pressure, density, temperature, thermal diffusivity, diagnoses, noise, etc., pointing to different disciplinary specialisms.
Similarly, depending on the nouns following at, the preposition might acquire a temporal (e.g., at the end of the last ice age), spatial (at the ankle, at this site), or conditional meaning (at low concentrations, at low temperatures) specifying the experimental conditions.
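The wildcard searches described above can be sketched in code. The example below assumes a search engine that accepts double quotes for exact phrase match, `*` as a wildcard and a `site:` domain filter combined with `OR`; the exact syntax varies between engines. The functions `build_query`, `wildcard_to_regex` and `extract_slots` are hypothetical helpers, and the snippets are invented sentences built from the collocates listed above, not real search results. No network request is made: the sketch only composes the query string and then recovers the words filling each `*` slot from text already retrieved.

```python
import re
from collections import Counter

def build_query(phrase, domains=()):
    """Compose an exact-phrase query, optionally restricted to domains."""
    query = f'"{phrase}"'
    if domains:
        query += " " + " OR ".join(f"site:{d}" for d in domains)
    return query

# Assumption: each * stands for one to three words and sits between
# literal words of the phrase.
SLOT = r"\s+([\w%-]+(?:\s+[\w%-]+){0,2}?)\s+"

def wildcard_to_regex(phrase):
    """Turn a phrase with * slots into a regex with one group per slot."""
    literals = [re.escape(part.strip()) for part in phrase.split("*")]
    return re.compile(r"\b" + SLOT.join(literals) + r"\b", re.IGNORECASE)

def extract_slots(phrase, snippets):
    """Return a tuple of slot fillers for every match in the snippets."""
    pattern = wildcard_to_regex(phrase)
    fillers = []
    for snippet in snippets:
        for match in pattern.finditer(snippet):
            fillers.append(match.groups())
    return fillers

# Invented snippets, for illustration only.
snippets = [
    "We observed a significant increase in blood pressure at the ankle.",
    "There was a 30% increase in temperature at this site.",
    "The data show a steady increase in noise at low concentrations.",
]

print(build_query("a * increase in * at", ["ac.uk"]))
fillers = extract_slots("a * increase in * at", snippets)
adjectives = Counter(adjective for adjective, _ in fillers)
print(adjectives.most_common())
print([entity for _, entity in fillers])
```

Counting the fillers of the first slot gives the adjectives that collocate with increase, while the second slot lists the entities whose increase is being reported.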
One could likewise test extended collocations and multiword expressions that are frequent phrases of academic discourse (as can be seen, as a result of, it should be noted that, etc.).25 For example, a search for as can be seen *, narrowed first to British and then to American academic websites, returned the expression most of the time followed by a comma when occurring in sentence-initial position, and otherwise followed by the prepositions in or from, with an apparent preference for the latter in American English. These kinds of searches, narrowed further to known portals for the distribution of academic and content-specific publications (e.g., PubMed for medical publications, archdaily.com for architectural projects), might be used to confirm or disconfirm the phraseology of the landscape of what can be loosely called scientific English.26
In like manner, preferred grammar patterns could be tested, as occurring in specific sections of research genres in a given field, such as the abstract of a research article in Health and Life Sciences. Taking Swales’ rhetorical models of the abstract and the research article, the Introduction-Materials & Methods-Results-Discussion (IMRD) model and the Create a Research Space (CARS) model for Introductions,27 one can test usages of the passive (we have analyzed vs. results have been analyzed) or verb tense preferences. One can then see how these associate with specific sections of the research article, e.g., Methods, Results and Discussion, or with moves and steps within a section.
Verb tenses corresponding to specific moves may be exemplified through the query "study * conducted", searched for in the PubMed portal of medical journals (www.ncbi.nlm.nih.gov/pubmed). Because medical abstracts accompanying the articles are usually structured, i.e., organised into paragraphs with headings corresponding to the various sections,28 one can easily associate the tense with the corresponding rhetorical move. The search for the string "study * conducted" returned the simple past in the Design and Methods section (e.g., this / a cross sectional / a time and motion / two types of study / ... was / were conducted), the present simple (is conducted) to introduce the aim of the research in the Introduction section, and the present perfect also in the Introduction when mapping the territory, reviewing the literature, and justifying the research question (e.g., However, no randomized controlled study has been conducted on... / No large study has been conducted...). Further, one can click on each individual search result to access the full text and get a fuller view of the original context.
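Once hits have been collected, the association between tense and section can be tabulated automatically. The sketch below assumes the hits are already available as (section heading, sentence) pairs; the function names are hypothetical, the sample fragments are adapted from the examples quoted above, and the three patterns cover only the verb conducted.

```python
import re
from collections import defaultdict

# One pattern per tense of the passive "conducted".
TENSE_PATTERNS = [
    ("present perfect", re.compile(r"\b(?:has|have)\s+been\s+conducted\b", re.I)),
    ("simple past", re.compile(r"\b(?:was|were)\s+conducted\b", re.I)),
    ("present simple", re.compile(r"\b(?:is|are)\s+conducted\b", re.I)),
]

def classify_tense(sentence):
    """Return the tense label of the first matching pattern, else 'other'."""
    for label, pattern in TENSE_PATTERNS:
        if pattern.search(sentence):
            return label
    return "other"

def tenses_by_section(hits):
    """Count tense labels per section heading."""
    table = defaultdict(lambda: defaultdict(int))
    for section, sentence in hits:
        table[section][classify_tense(sentence)] += 1
    return {section: dict(counts) for section, counts in table.items()}

# Fragments adapted from the examples in the text, for illustration only.
hits = [
    ("Design and Methods", "A cross sectional study was conducted"),
    ("Design and Methods", "Two types of study were conducted"),
    ("Introduction", "This study is conducted"),
    ("Introduction", "However, no randomized controlled study has been conducted on..."),
    ("Introduction", "No large study has been conducted..."),
]

for section, counts in tenses_by_section(hits).items():
    print(section, counts)
```

A table of this kind only summarises surface forms; deciding which rhetorical move a sentence realises still requires reading it in context.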
Overall, the point is that the use of academic language chunks is primarily genre-driven; searching for them in a collection of medical journal articles such as those in PubMed, all prefaced by an abstract, is therefore a useful aid to writing and self-editing processes in the disciplines. With advanced searches like the ones exemplified, the scientist-writer manages to control the dynamic nature of the Web, while exploiting it as a sizeable repository of attested language, one that is likely to be much bigger than any other more specialised language resource, notably corpora collected by linguists for the purposes of linguistic analyses. However, it is also useful to look briefly at some of these more targeted resources, as they can be queried for free by users through Web-based interfaces (concordancing software).