Background
English Vocabulary Size among Students in Swedish Schools
During the seventies I carried out several investigations on the
acquisition of English vocabulary in Swedish schools at various levels. At
Göteborg University, we were interested in finding out
the students' rate of acquisition at various levels and also the relation between a
student's general proficiency and the size of his or her English vocabulary.
At the time of this investigation I had access to two corpora
on which I could base my work, viz. the Brown Corpus and the corpus used by
Thorndike & Lorge. The Brown Corpus was much more modern (publ. 1961) but rather small
(c. 1,000,000 running words). The corpus used by Thorndike & Lorge was from the 1930s
but much bigger (c. 18 million running words). It was decided that size was much more
important than date of creation.
In "The Teacher's Word Book of 30,000 Words" by
Thorndike & Lorge, there is a lemmatised list giving occurrences per Corpus size.
The total number of words within each of four frequency ranges was known. By finding out
what percentage of the test words drawn from each range the students knew, I could
calculate the size of the students' English vocabulary. The test
consisted of 80 words arranged in order of declining frequency.
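The calculation described above can be sketched as follows. This is only an illustration, not the original procedure: the range boundaries, the range sizes, the student's scores and the even split of 20 test words per range are all hypothetical, since the text gives only the total of 80 test words and the number of ranges.

```python
from dataclasses import dataclass


@dataclass
class FrequencyRange:
    label: str
    total_words: int  # lemmas in this range (hypothetical figure)
    tested: int       # test words drawn from this range (assumed 20 each)
    known: int        # test words the student got right (hypothetical)


def estimate_vocabulary(ranges):
    """Scale the share of known test words in each range up to the range size."""
    return round(sum(r.total_words * r.known / r.tested for r in ranges))


# Hypothetical example: four ranges, 20 test words each, 80 in total.
ranges = [
    FrequencyRange("1-1000", 1000, 20, 18),
    FrequencyRange("1001-3000", 2000, 20, 12),
    FrequencyRange("3001-10000", 7000, 20, 5),
    FrequencyRange("10001-30000", 20000, 20, 1),
]

estimate = estimate_vocabulary(ranges)
print(estimate)  # 900 + 1200 + 1750 + 1000 = 4850
```

With these made-up figures, a student who knows 90% of the commonest range but only 5% of the rarest would be credited with about 4,850 words.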
Some tests were carried out using multiple choice
alternatives in Swedish, but most tests used translation into Swedish.
Several thousand students in ordinary Swedish state schools
took part in these tests. From 1975 the tests were also carried out at several
Centres of Adult Education.
Conclusion:
In the above investigation it was estimated that the average
student's vocabulary was as follows:
| after 4 years of English | 1500 words |
| after 6 years of English | 3500 words |
| after 9 years of English | 7000 words |
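As a rough illustration of what these figures imply for the rate of acquisition, the short sketch below computes the average number of words gained per year between the measurement points. The starting point of zero words before the first year of English is an assumption added here, not a figure from the investigation.

```python
# (years of English, estimated vocabulary); the (0, 0) starting point is assumed.
milestones = [(0, 0), (4, 1500), (6, 3500), (9, 7000)]


def words_per_year(milestones):
    """Average yearly gain between each pair of consecutive measurement points."""
    rates = []
    for (y0, v0), (y1, v1) in zip(milestones, milestones[1:]):
        rates.append(((y0, y1), (v1 - v0) / (y1 - y0)))
    return rates


rates = words_per_year(milestones)
for (y0, y1), rate in rates:
    print(f"years {y0}-{y1}: about {rate:.0f} words per year")
```

On these averages the yearly gain rises from about 375 words in the first four years to about 1,000 and then about 1,167 words in the later periods.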
It is interesting to note that the acquisition of vocabulary
is faster among students with higher marks. See diagram.
There proved to be a 95% correlation between the marks
predicted by the vocabulary tests and a student's final marks.
Recent investigations on Vocabulary Size and Frequencies
"A Study of TEFL Vocabulary" by Prof. Magnus Ljung at
Stockholm University,
compares a corpus of texts intended for Swedish upper secondary schools with
a corpus of modern English (the B'ham Corpus) compiled at the University of Birmingham. The
investigation showed that words denoting concrete objects and physical actions predominate
in the TEFL corpus, while words denoting abstractions and mental processes
are under-represented.
Magnus Ljung says in his concluding remarks that
"there
is reason to be critical of the TEFL texts on at least two major counts, i.e. the low
general level of lexical sophistication and the absence of a clear increase in vocabulary
difficulty as we move from the early to the later school years. The words which are
missing or under-represented in the TEFL texts are not, on the whole, particularly rare or
abstruse. In most cases, they are precisely those words which it is necessary to know in
order to read British or American (quality) newspapers and magazines, or
to understand news broadcasts and discussions of current events on radio and TV."
Database of lemmatised word list (34,000 lemmas)
In 1987 I started to work on the creation of a lemmatised word
list of the B'ham Corpus. At the time this was one of the most recent and extensive
corpora available. It was compiled by the staff at the University of Birmingham. It was
created during the 1980s and consists of nearly 20 million words. The project was set up
as a joint venture between the University of Birmingham and Collins publishers.
The B'ham Corpus consists of the Main Corpus (7.3 mill.
words) and the Reserve Corpus (13 mill. words).
One aim in creating the Main Corpus was a desire to gather as
broad a spread of the language as possible. The following corpus components were agreed
upon:
| book authorship | 75% male | 25% female | |
| Engl. language variety | 70% British | 20% American | 5% other |
| language mode | 75% writing | 25% spoken | |
Principles of lemmatisation
The B'ham word list has about 250,000 different word types; that is,
every inflected form of a word is listed separately with a frequency number denoting how many
times per Corpus size (18.6 mill. words) that particular form occurs. When
estimating vocabulary size, it is the lexical form of the word that is of interest, which
means that the frequencies of the various word types had to be added up and included under
a lexical form of a word (lemma). For practical reasons and to save time, I decided to
base the lemmatised word list on the word types that appeared 3 times or more per Corpus
size. By setting this limit, the number of word types was reduced to c. 85,000.
The lemmatisation of the word "record" may serve as an example of
how the lemmatisation was done:
| record | 1336 |
| record's | 4 |
| recorded | 527 |
| recording | 257 |
| recordings | 91 |
| records | 625 |
| record | 2840 n v |
(n and v refer to parts of speech and were
added to all lemmas in accordance with the categorisation in CED, the Collins English
Dictionary.)
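The summing step in this example can be sketched in a few lines of code. The frequencies are those given for "record" above. The lookup table `lemma_of` and the function `lemmatise` are hypothetical names used for illustration only; in the actual work the assignment of word types to lemmas was done by hand.

```python
from collections import defaultdict

MIN_FREQUENCY = 3  # word types occurring fewer than 3 times were left out

# Word-type frequencies for "record", as listed above.
type_frequencies = {
    "record": 1336,
    "record's": 4,
    "recorded": 527,
    "recording": 257,
    "recordings": 91,
    "records": 625,
}

# Hypothetical lookup from word type to lemma (done manually in the original work).
lemma_of = {form: "record" for form in type_frequencies}


def lemmatise(type_frequencies, lemma_of, min_frequency=MIN_FREQUENCY):
    """Add up the frequencies of all word types that belong to the same lemma."""
    totals = defaultdict(int)
    for form, freq in type_frequencies.items():
        if freq < min_frequency:
            continue  # below the cut-off, not included in the lemmatised list
        totals[lemma_of[form]] += freq
    return dict(totals)


lemma_totals = lemmatise(type_frequencies, lemma_of)
print(lemma_totals)  # {'record': 2840}
```

The six word-type frequencies add up to the lemma frequency of 2840 given in the last row of the example.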
As a reference dictionary, the Collins English Dictionary (CED),
2nd edition, 1986, was used. It contains 170,000 references, which in almost all cases
cover the word types in the B'ham word list with a frequency of 3 or more (F ≥ 3). Collins Cobuild is too
small (70,000 references). If there was doubt about the existence of a word and it was not
found in CED, the "word" was deleted. While the manual
lemmatisation was being performed, word categories and parts of speech were added after the
frequency number.
When all the 85,000 word types had been gone through, the
number of lemmas in the database turned out to be about 34,000.
Presentation of tests for estimating vocabulary size
My vocabulary tests are based on synonymy and association. As a
compromise between reliability and time to complete the test, I have decided to use 120
test words in each test with 20 words from six main frequency ranges. The first range
contains the most common words and the last range contains the least common words.
Furthermore, each frequency range is subdivided into ten smaller ranges in order to spread
the test words over as wide a range as possible. There are always five alternatives. It
should be pointed out that, in order to estimate a person's approximate vocabulary size,
the test must be designed so that the test taker knows most of the words within the first
test range and only a small percentage of the words in the last test range.
Example
| Test word | alternative 1 | alternative 2 | alternative 3 | alternative 4 | alternative 5 |
| surgery | anger | celebrate | defeat | hospital | wave |
| novelty | booklet | Christmas | diamond | new | poem |
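The way test words are spread over the frequency ranges, as described in the test design above, can be sketched as follows. This is a minimal illustration under stated assumptions: the rank boundaries of the six main ranges are hypothetical, the lemma list is a placeholder, and drawing two words per sub-range is simply what 20 words over ten sub-ranges implies if they are spread evenly. How the words were actually picked within a sub-range is not stated in the text; random sampling is used here only as a stand-in.

```python
import random


def pick_test_words(ranked_lemmas, boundaries, subranges=10, per_subrange=2, seed=0):
    """Draw test words evenly from every sub-range of every main frequency range.

    ranked_lemmas: lemmas sorted from most to least frequent.
    boundaries: (start, end) rank pairs for the main ranges, end exclusive.
    """
    rng = random.Random(seed)
    chosen = []
    for start, end in boundaries:
        step = (end - start) // subranges
        for i in range(subranges):
            lo = start + i * step
            # The last sub-range absorbs any remainder from the integer division.
            hi = end if i == subranges - 1 else lo + step
            chosen.extend(rng.sample(ranked_lemmas[lo:hi], per_subrange))
    return chosen


# Placeholder entries standing in for the 34,000 ranked lemmas of the database.
ranked = [f"lemma{rank}" for rank in range(1, 34001)]

# Hypothetical rank boundaries for the six main frequency ranges.
boundaries = [
    (0, 1000), (1000, 3000), (3000, 6000),
    (6000, 10000), (10000, 20000), (20000, 34000),
]

test_words = pick_test_words(ranked, boundaries)
print(len(test_words))  # 6 ranges x 10 sub-ranges x 2 words = 120
```

Each chosen word would then be paired with five alternatives, as in the example items above.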
Boo Hever