Monograph Series
MS-25
January 2004

Representing Language Use in the University: Analysis of the TOEFL 2000 Spoken and Written Academic Language Corpus

Douglas Biber
Susan M. Conrad
Randi Reppen
Pat Byrd
Marie Helt
Victoria Clark
Viviana Cortes
Eniko Csomay
Alfredo Urzua

Representing Language Use in the University: Analysis of the TOEFL 2000 Spoken and Written Academic Language Corpus

Douglas Biber, Northern Arizona University
Susan M. Conrad, Portland State University
Randi Reppen, Northern Arizona University
H. Patricia Byrd, Georgia State University
Marie Helt, Sacramento State University
Victoria Clark, Northern Arizona University
Viviana Cortes, Iowa State University
Eniko Csomay, San Diego State University
Alfredo Urzua, Old Dominion University

Educational Testing Service
Princeton, New Jersey
RM-04-03

Educational Testing Service is an Equal Opportunity/Affirmative Action Employer.

Copyright 2004 by Educational Testing Service. All rights reserved.

No part of this report may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher. Violators will be prosecuted in accordance with both U.S. and international copyright laws.

EDUCATIONAL TESTING SERVICE, ETS, the ETS logos, Graduate Record Examinations, GRE, and TOEFL are registered trademarks of Educational Testing Service. The Test of English as a Foreign Language is a trademark of Educational Testing Service.

Foreword

The TOEFL Monograph Series features commissioned papers and reports for TOEFL 2000 and other Test of English as a Foreign Language™ (TOEFL) test development efforts. As part of the foundation for the TOEFL 2000 project, a number of papers and reports were commissioned from experts within the fields of measurement and language teaching and testing. The resulting critical reviews and expert opinions have helped to inform TOEFL program development efforts with respect to test construct, test user needs, and test delivery. Opinions expressed in these papers are those of the authors and do not necessarily reflect the views or intentions of the TOEFL program.

These monographs are also of general scholarly interest, and the TOEFL program is pleased to make them available to colleagues in the fields of language teaching and testing and international student admissions in higher education.

The TOEFL 2000 project is a broad effort under which language testing at Educational Testing Service (ETS) will evolve into the 21st century. As a first step, the TOEFL program revised the Test of Spoken English™ (TSE) and introduced a computer-based version of the TOEFL test. The revised TSE test, introduced in July 1995, is based on an underlying construct of communicative language ability and represents a process approach to test validation.
The computer-based TOEFL test, introduced in 1998, takes advantage of new forms of assessment and improved services made possible by computer-based testing, while also moving the program toward its longer range goals, which include:

• the development of a conceptual framework that takes into account models of communicative competence
• a research agenda that informs and supports this emerging framework
• a better understanding of the kinds of information test users need and want from the TOEFL test
• a better understanding of the technological capabilities for delivery of TOEFL tests into the next century

Monographs 16 through 20 were the working papers that laid out the TOEFL 2000 conceptual frameworks with their accompanying research agendas. The initial framework document, Monograph 16, described the process by which the project was to move from identifying the test domain to building an empirically based interpretation of test scores. The subsequent framework documents, Monographs 17-20, extended the conceptual frameworks to the domains of reading, writing, listening, and speaking (both as independent and interdependent domains). These conceptual frameworks guided the research and prototyping studies described in subsequent monographs that resulted in the final test model.

As TOEFL 2000 projects are completed, monographs and research reports will continue to be released and public review of project work invited.

TOEFL Program Office
Educational Testing Service

Abstract

To date, there have been few large-scale empirical investigations of academic registers, and virtually no such investigations of spoken academic registers. Given this lack of basic knowledge, it has been nearly impossible to evaluate the representativeness of English as a Second Language/English as a Foreign Language (ESL/EFL) materials and assessment instruments. Specifically in the context of the TOEFL 2000 effort, we have lacked the tools to determine whether the texts used on listening and reading exams accurately represent the linguistic characteristics of spoken and written academic registers.

The TOEFL 2000 Spoken and Written Academic Language (T2K-SWAL) Corpus was constructed and analyzed to help fill this gap. This report describes the design and analysis of the corpus. Two major stages of analysis were completed: First, linguistic analyses of the text categories in the T2K-SWAL Corpus were completed to identify the salient patterns of language use in each academic register (across registers, disciplines, and levels). Then, based on those findings, diagnostic tools were developed to indicate whether the language used in T2K Listening and Reading Comprehension tasks is representative of real-life language use.

Key words: academic registers, corpus linguistics, discourse analysis, ESP, multidimensional analysis, register studies

Acknowledgments

The T2K-SWAL Project was a collaborative effort involving many researchers and assistants at every level. In addition to the coauthors, there were several research assistants who made important contributions at Northern Arizona University, Iowa State University, Georgia State University, and California State University, Sacramento: Chandrika Balasubramanian, Paula Garcia, Louise Gobron, Quynh Nguyen, Kristen Precht, Barrie Roberts, and Jenia Walter. The project also depended on the assistance of many transcribers at Northern Arizona University, students and faculty who helped with the recordings at all four universities, and other student assistants; while there are simply too many people to list, the project would not have been possible without the willing assistance of these individuals. Finally, we were greatly helped by the design suggestions and pilot testing carried out by Susan Nissan and Mary Schedl at ETS.

Table of Contents

1. Statement of the Problem
2. Background to the T2K-SWAL Project
   2.1. Overview of the T2K-SWAL Project
3. The TOEFL 2000 Spoken and Written Academic Language Corpus
   3.1. Collection of Texts for the T2K-SWAL Corpus
   3.2. Transcription, Scanning, and Editing of Texts in the T2K-SWAL Corpus
   3.3. Grammatical Tagging and Tag-editing
4. Analytical Procedures
   4.1. Procedures for Grammatical, Lexicogrammatical, and Semantic Class Analyses
   4.2. Procedures for Vocabulary Analyses
   4.3. Procedures for Lexical Bundle Analyses
   4.4. Multidimensional Analysis Using the Biber (1988) Framework
   4.5. Multidimensional Analysis Based on a New Factor Analysis of the T2K-SWAL Corpus
   4.6. Procedures for Analysis of Explicit Definitions
5. Linguistic Analyses
   5.1. Multidimensional Patterns of Variation Among University Registers (Based on Biber 1988 Dimensions)
   5.2. Multidimensional Patterns of Variation Among University Registers, Based on the New T2K-SWAL Factor Analysis
   5.3. Analysis of Explicit Definitions
6. Diagnostic Tools and Resources
   6.1. LXMDCompare
   6.2. VocabProfile
7. Implications for the TOEFL 2000 Project
References
Appendixes
   A. List of Tags Assigned by the Biber Tagger
   B. List of Words in the T2K-SWAL Corpus, Grouped Into Distribution Classes
   C. List of Lexical Bundles in the T2K-SWAL Corpus, Grouped According to Structural Type and Distributional Classes
   D. Mean Scores for Linguistic Features Across Modes, Registers, Academic Disciplines, and Levels (Mean Rate of Occurrence per 1,000 Words)

List of Tables

Table 1. Composition of the T2K-SWAL Corpus
Table 2. Breakdown of Classroom Teaching by Interactiveness
Table 3. Breakdown of Service Encounters by Type and Location
Table 4. Breakdown of Class Sessions by Discipline and Level
Table 5. Breakdown of Textbooks by Discipline and Level
Table 6. Breakdown of Class Sessions by Subdiscipline
Table 7. Breakdown of Textbooks by Subdiscipline
Table 8. Breakdown of Texts Within Institutional Writing
Table 9. Breakdown of Spoken Texts by University
Table 10. Sample of Tagged Text From a University Textbook
Table 11. Sample of Tagged Text From Classroom Teaching
Table 12. List of Grammatical, Lexicogrammatical, Vocabulary, and Lexical Bundle Features Analyzed in the T2K-SWAL Project
Table 13. Words Included in the Semantic Classes for Nouns, Verbs, and Adjectives
Table 14. Lexicogrammatical Features Used for Stance Analyses
Table 15. Distributional Variables for Nouns, Verbs, Adjectives, and Adverbs
Table 16. Summary of the Factor Analysis From Biber (1988)
Table 17. Statistical Details for the T2K-SWAL Factor Analysis
Table 18. Summary of the Factorial Structure of the T2K-SWAL MD Analysis
Table 19. Descriptive Statistics for Linguistic Features (Entire T2K-SWAL Corpus)
Table 20. Descriptive Statistics for Classroom Teaching and Textbooks by Discipline
Table 21. Descriptive Statistics for Classroom Teaching and Textbooks by Level
Table 22. ANOVA Results for Classroom Teaching Across Disciplines
Table 23. ANOVA Results for Textbooks Across Disciplines
Table 24. ANOVA Results for Classroom Teaching Across Levels
Table 25. ANOVA Results for Textbooks Across Levels
Table 26. Analysis of Verbs Potentially Signaling Explicit Definitions in Spoken Texts
Table 27. Analysis of Verbs Potentially Signaling Explicit Definitions in Written Texts
Table 28. Comparison of Linguistic Features in a Classroom Text to All Class-Session Texts
Table 29. Comparison of the Multidimensional Profile of a Classroom Text to All Class-Session Texts
Table 30. Output Produced by VocabProfile, Showing the Breakdown of Words in a Target Text: Number of Words From Each Frequency Level With Skewed and Not-Skewed Distributions
Table 31. Output Produced by VocabProfile, Showing the Breakdown of Words for Each Grammatical Class in a Target Text: Number of Nouns/Verbs/Adjectives/Adverbs From Each Frequency Level With Skewed and Not-Skewed Distributions
Table 32. Output Produced by VocabProfile, Showing the Proportional Breakdown of Words in a Target Text: Percentage of Words From Each Frequency Level With Skewed and Not-Skewed Distributions

List of Figures

Figure 1. Screenshot From the Dictionary of All Word Forms in the T2K-SWAL Corpus (Baseline.db)
Figure 2. Screenshot From the Dictionary of All Lemmas in the T2K-SWAL Corpus (Lemma.db)
Figure 3. Screenshot From the Dictionary of All Lemmas in the T2K-SWAL Corpus (Lemma.db), Showing Some of the Additional Fields
Figure 4. Mean Scores of Registers Along Dimension 1: Involved Versus Informational Production
Figure 5. Mean Scores of Registers Along Dimension 2: Narrative Versus Nonnarrative Concerns
Figure 6. Mean Scores of Registers Along Dimension 3: Situation-dependent Versus Elaborated Reference
Figure 7. Mean Scores of Registers Along Dimension 4: Overt Expression of Persuasion
Figure 8. Mean Scores of Registers Along Dimension 5: Nonimpersonal Versus Impersonal Style
Figure 9. Mean Scores of Registers Along T2K-SWAL Dimension 1: Oral Versus Literate Discourse
Figure 10. Mean Scores of Registers Along T2K-SWAL Dimension 2: Procedural Versus Content-focused Discourse
Figure 11. Mean Scores of Registers Along T2K-SWAL Dimension 3: Narrative Orientation
Figure 12. Mean Scores of Registers Along T2K-SWAL Dimension 4: Academic Stance
Figure 13. Mean Scores of Disciplines Along T2K-SWAL Dimension 2: Procedural Versus Content-focused Discourse
Figure 14. Mean Scores of Disciplines Along T2K-SWAL Dimension 3: Narrative Orientation

1. Statement of the Problem

The development of materials for language instruction and assessment requires repeated judgments about language use, to decide on the words and structures that should be represented in these materials. These decisions have usually been based on gut-level impressions and anecdotal evidence of how speakers and writers use language: impressions that often operate below the level of consciousness but are regarded as accepted truths. Unfortunately, such intuitions about language use are often wrong. As a result, teaching and assessment materials often fail to provide an accurate reflection of the language actually used by speakers and writers in natural situations.

For example, English as a Second Language (ESL) teachers and textbook authors share the widespread belief that progressive verbs are the basic verb form used in conversation (e.g., Cathy is eating pizza). This belief is reflected in the sequence of topics found in most ESL grammar books, where the progressive is presented as one of the fundamental building blocks of English grammar: Most ESL textbooks introduce the progressive in the very first chapters, and many books introduce the progressive before covering the simple present (Biber & Reppen, 2002). Given the nature of this coverage, it would be entirely natural for learners to use progressive verbs as their first choice, at least in conversation.

Empirical analyses of representative corpora can provide a much more solid foundation for descriptions of language use, and the results of these analyses are often surprising to Teaching English as a Second Language (TESL) professionals, running counter to strongly held intuitions about use. For example, corpus analysis of progressive aspect verbs shows that they are not the norm in conversation. In fact, simple aspect verb phrases are more than 20 times as common as progressives in conversation (Biber, Johansson, Leech, Conrad, & Finegan, 1999, p. 461; Biber & Conrad, 2001).
It is not at all uncommon to hear teachers commenting on the overuse of the progressive by students. Such overuse is not surprising, however, because ESL instructional materials and teaching practices suggest that progressives are more important (and far more common) than they actually are. We could cite numerous other examples of this type. As language professionals, we tend to have strong intuitions about use, but recent empirical analyses of large corpora show that these intuitions are often wrong.

For the assessment of university-level English language skills, the focus of the present study, the first issue is to fully understand the linguistic challenges faced by students in university contexts. There are obviously special demands presented by academic reading and writing, especially in relation to textbooks, research papers, and student essays and term papers. There are also special demands associated with academic listening, required for success in classroom teaching c