Background

Testing vocabulary


By Boo Hever, MA

Contents

English Vocabulary Size among Students in Swedish Schools

Recent Investigations on Vocabulary Size and Frequencies

Database of Lemmatised Word List (34,000 lemmas)

Principles of Lemmatisation

Presentation of Tests for Estimating Vocabulary Size

 

English Vocabulary Size among Students in Swedish Schools
During the seventies I carried out several investigations on the acquisition of English vocabulary in Swedish schools at various levels. At Göteborg University, we were interested in finding out the students' rate of acquisition at various levels and also the relation between a student's general proficiency and the size of his or her English vocabulary.

At the time of this investigation I had access to two corpora on which I could base my work, viz. the Brown Corpus and the corpus used by Thorndike & Lorge. The Brown Corpus was much more modern (publ. 1961) but rather small (c. 1,000,000 running words). The corpus used by Thorndike & Lorge was from the 1930s but much bigger (c. 18 million running words). It was decided that size was much more important than date of creation.

In "The Teacher's Word Book of 30,000 Words" by Thorndike & Lorge, there is a lemmatised list giving occurrences per corpus size. The total number of words within each of four frequency ranges was known, and the test showed what percentage of a sample of words from each range the students knew. From these figures the size of the students' English vocabulary was calculated. The test consisted of 80 words arranged in order of declining frequency.
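The calculation can be illustrated with a minimal sketch. The band sizes and test scores below are invented for illustration; the actual figures from the Thorndike & Lorge list are not given here, and the split of the 80 test words into 20 per range is an assumption.

```python
# Illustrative sketch only: band sizes and test scores are hypothetical,
# not figures from the Thorndike & Lorge list or from the actual tests.

def estimate_vocabulary_size(bands):
    """Sum, over all frequency bands, the share of test words known
    multiplied by the number of words in that band."""
    total = 0.0
    for band_size, words_tested, words_known in bands:
        total += band_size * (words_known / words_tested)
    return round(total)

# (words in band, test words drawn from band, test words known)
hypothetical_bands = [
    (1000, 20, 18),    # most frequent band
    (2000, 20, 12),
    (5000, 20, 6),
    (10000, 20, 2),    # least frequent band
]

estimate = estimate_vocabulary_size(hypothetical_bands)
print(estimate)  # 4600
```

With these invented figures the student is credited with 900 + 1,200 + 1,500 + 1,000 = 4,600 words.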

Some tests were carried out using multiple choice alternatives in Swedish, but most tests used translation into Swedish.

Several thousand students in ordinary Swedish state schools took part in these tests. From 1975 the tests were also carried out at several Centres of Adult Education.

Conclusion:

In the above investigation it was estimated that the average student's vocabulary was as follows:

after 4 years of English: 1,500 words
after 6 years of English: 3,500 words
after 9 years of English: 7,000 words

It is interesting to note that the acquisition of vocabulary is faster among students with higher marks. 
See diagram

There proved to be a 95% correlation between the marks predicted by the vocabulary tests and a student's final marks.

Recent Investigations on Vocabulary Size and Frequencies
"A Study of TEFL Vocabulary" by Prof. Magnus Ljung at Stockholm University compares a corpus of texts intended for upper secondary schools in Sweden with a corpus of modern English (the B'ham Corpus) compiled at the University of Birmingham. The investigation found that words denoting concrete objects and physical actions predominate in the TEFL corpus, while words denoting abstractions and mental processes are under-represented.

Magnus Ljung says in his concluding remarks that "there is reason to be critical of the TEFL texts on at least two major counts, i.e. the low general level of lexical sophistication and the absence of a clear increase in vocabulary difficulty as we move from the early to the later school years. The words which are missing or under-represented in the TEFL texts are not, on the whole, particularly rare or abstruse. In most cases, they are precisely those words which it is necessary to know in order to read British or American (quality) newspapers and magazines, or to understand news broadcasts and discussions of current events on radio and TV."

 

Database of Lemmatised Word List (34,000 lemmas)
In 1987 I started to work on the creation of a lemmatised word list of the B'ham Corpus. At the time this was one of the most recent and extensive corpora available. It was compiled by staff at the University of Birmingham during the 1980s and consists of nearly 20 million words. The project was set up as a joint venture between the University of Birmingham and Collins publishers.

The B'ham Corpus consists of the Main Corpus (7.3 mill. words) and the Reserve Corpus (13 mill. words).

One aim in creating the Main Corpus was a desire to gather as broad a spread of the language as possible. The following corpus components were agreed upon:

Book authorship: 75% male, 25% female
English language variety: 70% British, 20% American, 5% other
Language mode: 75% written, 25% spoken

 

Principles of Lemmatisation
The B'ham word list has about 250,000 different word types. Every inflected form of a word is listed separately with a frequency number denoting how many times that particular form occurs in the whole corpus (18.6 mill. words). When estimating vocabulary size, it is the lexical form of the word that is of interest, which means that the frequencies of the various word types had to be added up and entered under a lexical form of the word (lemma). For practical reasons and to save time, I decided to base the lemmatised word list on the word types that appeared 3 times or more in the corpus. Setting this limit reduced the number of word types to c. 85,000.

The lemmatisation of record may serve as an example of how the lemmatisation was done:

record 1336
record's 4
recorded 527
recording 257
recordings 91
records 625
record 2840 n v

(n and v refer to parts of speech and were added to all lemmas in accordance with the categorisation in CED, Collins English Dictionary.)
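The adding-up step in the "record" example can be sketched as follows. The lemmatisation itself was done by hand, so the lookup table `type_to_lemma` is a hypothetical stand-in for that manual work, and the function name is an illustration only.

```python
from collections import defaultdict

MIN_FREQUENCY = 3  # word types occurring fewer than 3 times are left out

# Word-type frequencies taken from the "record" example in the text.
word_type_counts = {
    "record": 1336,
    "record's": 4,
    "recorded": 527,
    "recording": 257,
    "recordings": 91,
    "records": 625,
}

# Hypothetical lookup table standing in for the manual lemmatisation:
# every word type is assigned by hand to its lemma.
type_to_lemma = {word_type: "record" for word_type in word_type_counts}

def lemmatise(counts, mapping, min_frequency=MIN_FREQUENCY):
    """Add up the frequencies of word types under their lemma."""
    lemma_counts = defaultdict(int)
    for word_type, frequency in counts.items():
        if frequency >= min_frequency:
            lemma_counts[mapping[word_type]] += frequency
    return dict(lemma_counts)

lemma_counts = lemmatise(word_type_counts, type_to_lemma)
print(lemma_counts)  # {'record': 2840}
```

The six word types add up to 2,840, the frequency recorded for the lemma in the example.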

As a reference dictionary, the Collins English Dictionary (CED), 2nd edition, 1986, was used. It contains 170,000 references, which in almost all cases cover the word types in the B'ham word list with a frequency (F) of 3 or more. Collins Cobuild is too small (70,000 references). If there was doubt about the existence of a word and it was not found in CED, the "word" was deleted. At the same time as the manual lemmatisation was performed, word categories and parts of speech were added after the frequency number.

When all 85,000 word types had been processed, the number of lemmas in the database turned out to be about 34,000.

 

Presentation of Tests for Estimating Vocabulary Size
My vocabulary tests are based on synonymy and association. As a compromise between reliability and the time needed to complete the test, I have decided to use 120 test words in each test, with 20 words from each of six main frequency ranges. The first range contains the most common words and the last range contains the least common words. Furthermore, each frequency range is subdivided into ten smaller ranges in order to spread the test words over as wide a range as possible. There are always five alternatives. It should be pointed out that in order to estimate a person's approximate vocabulary size, the test must be designed so that the test person knows most of the words within the first test range and only a small percentage of the words in the last test range.

Example

Test word   Alternative 1   Alternative 2   Alternative 3   Alternative 4   Alternative 5
surgery     anger           celebrate       defeat          hospital        wave
novelty     booklet         Christmas       diamond         new             poem
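The test design described in this section (six frequency ranges, ten sub-ranges per range, 20 test words per range) can be sketched roughly as below. The text does not say how words are chosen within a sub-range, so the random draw and the equal-sized sub-ranges are assumptions, and the lemma names are placeholders rather than entries from the database.

```python
import random

RANGES = 6             # main frequency ranges
SUBRANGES = 10         # smaller ranges within each main range
WORDS_PER_RANGE = 20   # test words per main range (6 x 20 = 120)

def pick_test_words(lemmas_by_range, seed=0):
    """Draw test words evenly from the sub-ranges of every main range.

    lemmas_by_range: one list of lemmas per main range, each list
    sorted by declining frequency. Random selection and equal-sized
    sub-ranges are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    per_subrange = WORDS_PER_RANGE // SUBRANGES  # 2 words per sub-range
    selected = []
    for lemmas in lemmas_by_range:
        step = len(lemmas) // SUBRANGES
        for i in range(SUBRANGES):
            start = i * step
            # The last sub-range also takes any leftover lemmas.
            end = len(lemmas) if i == SUBRANGES - 1 else start + step
            selected.extend(rng.sample(lemmas[start:end], per_subrange))
    return selected

# Placeholder data: 100 invented lemma names per main range.
placeholder_ranges = [
    [f"range{r + 1}_lemma{n:03d}" for n in range(100)]
    for r in range(RANGES)
]

test_words = pick_test_words(placeholder_ranges)
print(len(test_words))  # 120
```

Drawing two words from each of the ten sub-ranges gives the 20 words per range, spread across the whole range instead of clustering at one end.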

Boo Hever