Journal News | SSCI Journal Language Testing, 2023, Issues 1–2
Volume 40, Issue 1-2, January 2023
LANGUAGE TESTING (SSCI Q1; 2022 IF: 4.1; rank 11/194) published 32 papers across Issues 1 and 2 of 2023. Issue 1 carries 19 items: 2 editorial pieces, 13 research articles, 3 book reviews, and 1 corrigendum. The research articles address second language assessment, automated essay scoring systems, classroom-based reading assessment, and the relationship between English proficiency and academic performance, among other topics. Issue 2 carries 13 items: 9 research articles and 4 book reviews, covering rater effects, adult L2 acquisition, and L2 language testing. You are welcome to share and forward this post!
Issue 1: Table of Contents
Editorial
■Forty years of Language Testing, and the changing paths of publishing, by Paula M. Winke, Pages 3–7.
Viewpoints on the 40th Year of Language Testing
■The vexing problem of validity and the future of second language assessment, by Vahid Aryadoust, Pages 8–14.
■Future challenges and opportunities in language testing and assessment: Basic questions and principles at the forefront, by Tineke Brunfaut, Pages 15–23.
■Reflections on the past and future of language testing and assessment: An emerging scholar’s perspective, by J. Dylan Burton, Pages 24–30.
■Administration, labor, and love, by April Ginther, Pages 31–39.
■Towards a new sophistication in vocabulary assessment, by John Read, Pages 40–46.
■Reframing the discourse and rhetoric of language testing and assessment for the public square, by Lynda Taylor, Pages 47–53.
■Test design and validity evidence of interactive speaking assessment in the era of emerging technologies, by Soo Jung Youn, Pages 54–60.
Articles
■Application of an Automated Essay Scoring engine to English writing assessment using Many-Facet Rasch Measurement, by Kinnie Kin Yee Chan, Trevor Bond, Zi Yan, Pages 61–85.
■The use of generalizability theory in investigating the score dependability of classroom-based L2 reading assessment, by Ray J. T. Liao, Pages 86–106.
■Psychometric approaches to analyzing C-tests, by David Alpizar, Tongyun Li, John M. Norris, Lixiong Gu, Pages 107–132.
■Revisiting English language proficiency and its impact on the academic performance of domestic university students in Singapore, by Wenjin Vikki Bo, Mingchen Fu, Wei Ying Lim, Pages 133–152.
■How do raters learn to rate? Many-facet Rasch modeling of rater performance over the course of a rater certification program, by Xun Yan, Ping-Lin Chuang, Pages 153–179.
Book Reviews
■Book Review: Challenges in Language Testing Around the World: Insights for Language Test Users, by Atta Gebril, Pages 180–183.
■Book Review: Multilingual Testing and Assessment, by Beverly A. Baker, Pages 184–188.
■Book Review: Assessing Academic English for Higher Education Admissions, by Diane Schmitt, Pages 189–192.
Introduction to the Virtual Special Issue
■Test-taker insights for language assessment policies and practices, by Yan Jin, Pages 193–203.
Post-Script
■Epilogue—Note from an outgoing editor, by Luke Harding, Pages 204–205.
Corrigendum
■Corrigendum, Page 206.
Issue 1: Abstracts
The vexing problem of validity and the future of second language assessment
Vahid Aryadoust, National Institute of Education, Nanyang Technological University, Singapore
Abstract Construct validity and building validity arguments are some of the main challenges facing the language assessment community. The notion of construct validity and validity arguments arose from research in psychological assessment and developed into the gold standard of validation/validity research in language assessment. At a theoretical level, construct validity and validity arguments conflate the scientific reasoning in assessment and policy matters of ethics. Thus, a test validator is expected to simultaneously serve the role of conducting scientific research and examining the consequential basis of assessments. I contend that validity investigations should be decoupled from the ethical and social aspects of assessment. In addition, the near-exclusive focus of empirical construct validity research on cognitive processing has not resulted in sufficient accuracy and replicability in predicting test takers’ performance in real language use domains. Accordingly, I underscore the significance of prediction in validation, in contrast to explanation, and propose that the question to ask might not so much be about what a test measures as what type of methods and tools can better generate language use profiles. Finally, I suggest that interdisciplinary alliances with cognitive and computational neuroscience and artificial intelligence (AI) fields should be forged to meet the demands of language assessment in the 21st century.
Key words Artificial intelligence (AI), authenticity, interdisciplinary research, language assessment, neuroscience, validity, validity arguments
Future challenges and opportunities in language testing and assessment: Basic questions and principles at the forefront
Tineke Brunfaut, Department of Linguistics and English Language, Lancaster University, Lancaster LA1 4YL, UK
Abstract In this invited Viewpoint on the occasion of the 40th anniversary of the journal Language Testing, I argue that at the core of future challenges and opportunities for the field—both in scholarly and operational respects—remain basic questions and principles in language testing and assessment. Despite the high levels of sophistication of issues looked into, and methodological and operational solutions found, outstanding concerns still amount to: what are we testing, how are we testing, and why are we testing? Guided by these questions, I call for more thorough and adequate language use domain definitions (and a suitable broadening of research and testing methodologies to determine these), more comprehensive operationalizations of these domain definitions (especially in the context of technology in language testing), and deeper considerations of test purposes/uses and of their connections with domain definitions. To achieve this, I maintain that the field needs to continue investing in the topics of validation, ethics, and language assessment literacy, and engaging with broader fields of enquiry such as (applied) linguistics. I also encourage a more synthetic look at the existing knowledge base in order to build on this, and further diversification of voices in language testing and assessment research and practice.
Key words Construct, domain inference, ethics, language assessment literacy, target language use domain, technology in language testing and assessment, test purposes, validation
Reflections on the past and future of language testing and assessment: An emerging scholar’s perspective
J. Dylan Burton, Michigan State University, USA
Abstract In its 40th year, Language Testing journal has served as the flagship journal for scholars, researchers, and practitioners in the field of language testing and assessment. This viewpoint piece, written from the perspective of an emerging scholar, discusses two possible future trends based on evidence going back to the very first issue of this journal. First, this paper outlines past efforts to describe and define the construct of second language communication, noting that much work has yet to be done for a more complete description in terms of interactional competence and nonverbal behavior. The second trend highlights the growing movement in applied linguistics toward research transparency through Open Science practices, including replication studies, the sharing of data and materials, and preregistration. This paper outlines work to date in Language Testing that encourages open practices and emphasizes the importance of these practices in assessment research.
Key words Interactional competence, multimodality, nonverbal behavior, Open Science, replication
Administration, labor, and love
April Ginther, Purdue University, USA
Abstract Great opportunities for language testing practitioners are enabled through language program administration. Local language tests lend themselves to multiple purposes—for placement and diagnosis, as a means of tracking progress, and as a contribution to program evaluation and revision. Administrative choices, especially those involving a test, are strategic and can be used to transform a program’s identity and effectiveness over time.
Key words Administration, diagnosis, local, placement, testing
Towards a new sophistication in vocabulary assessment
John Read, University of Auckland, New Zealand
Abstract Published work on vocabulary assessment has grown substantially in the last 10 years, but it is still somewhat outside the mainstream of the field. There has been a recent call for those developing vocabulary tests to apply professional standards to their work, especially in validating their instruments for specified purposes before releasing them for widespread use. A great deal of work on vocabulary assessment can be seen in terms of the somewhat problematic distinction between breadth and depth of vocabulary knowledge. Breadth refers to assessing vocabulary size, based on a large sample of words from a frequency list. New research is raising questions about the suitability of word frequency norms derived from large corpora, the choice of the word family as the unit of analysis, the selection of appropriate test formats, and the role of guessing in test-taker performance. Depth of knowledge goes beyond the basic form-meaning link to consider other aspects of word knowledge. The concept of word association has played a dominant role in the design of such tests, but there is a need to create test formats to assess knowledge of word parts as well as a range of multi-word items apart from collocation.
Key words Depth of vocabulary knowledge, vocabulary assessment, vocabulary size, vocabulary test validation, word frequency
Reframing the discourse and rhetoric of language testing and assessment for the public square
Lynda Taylor, University of Bedfordshire, UK
Abstract As applied linguists and language testers, we are in the business of “doing language”. For many of us, language learning is a lifelong passion, and we invest similar enthusiasm in our language assessment research and testing practices. Language is also the vehicle through which we communicate that enthusiasm to others, sharing our knowledge and experience with colleagues so we can all grow in understanding and expertise. We are actually quite good at communicating within our own community. But when it comes to interacting with people beyond our own field, are we such effective communicators? Wider society—politicians, journalists, policymakers, social commentators, teachers, and parents—all seem to find assessment matters hard to grasp. And I am not sure we as language testers do much to help them. So I find myself wondering why that is? Is it that our language is too specialised, or overly technical? Do we choose unhelpful words or images when we talk about testing? Worse still, do we sometimes come across as rather arrogant or patronising, perhaps even irrelevant to non-specialists’ needs and concerns? If so, could we perhaps consider reframing our discourse and rhetoric in future to improve our communicative effectiveness, and how might we do that?
Key words Assessment ethics, corpus linguistics, critical discourse analysis, discourse in the public square, language assessment literacy, public understanding of assessment, stakeholder communication, validity frameworks
Test design and validity evidence of interactive speaking assessment in the era of emerging technologies
Soo Jung Youn, Daegu National University of Education, South Korea
Abstract As access to smartphones and emerging technologies has become ubiquitous in our daily lives and in language learning, technology-mediated social interaction has become common in teaching and assessing L2 speaking. The changing ecology of L2 spoken interaction provides language educators and testers with opportunities for renewed test design and the gathering of context-sensitive validity evidence of interactive speaking assessment. First, I review the current research on interactive speaking assessment focusing on commonly used test formats and types of validity evidence. Second, I discuss recent research that reports the use of artificial intelligence and technologies in teaching and assessing speaking in order to understand how and what evidence of interactive speaking is elicited. Based on the discussion, I argue that it is critical to identify what features of interactive speaking are elicited depending on the types of technology-mediated interaction for valid assessment decisions in relation to intended uses. I further discuss opportunities and challenges for future research on test design and eliciting validity evidence of interactive speaking using technology-mediated interaction.
Key words Intelligent personal assistants, interactive speaking, spoken dialog system, technology-mediated interaction, validity evidence
Application of an Automated Essay Scoring engine to English writing assessment using Many-Facet Rasch Measurement
Kinnie Kin Yee Chan, Hong Kong Metropolitan University, Hong Kong
Trevor Bond, James Cook University, Australia
Zi Yan, The Education University of Hong Kong, Hong Kong
Abstract We investigated the relationship between the scores assigned by an Automated Essay Scoring (AES) system, the Intelligent Essay Assessor (IEA), and grades allocated by trained, professional human raters to English essay writing by instigating two procedures novel to written-language assessment: the logistic transformation of AES raw scores into hierarchically ordered grades, and the co-calibration of all essay scoring data in a single Rasch measurement framework. A total of 3453 essays were written by 589 US students (in Grades 4, 6, 8, 10, and 12), in response to 18 National Assessment of Educational Progress (NAEP) writing prompts at three grade levels (4, 8, & 12). We randomly assigned one of two versions of the assessment, A or B, to each student. Each version comprised a narrative (N), an informative (I), and a persuasive (P) prompt. Nineteen experienced assessors graded the essays holistically using NAEP scoring guidelines, using a rotating plan in which each essay was rated by four raters. Each essay was additionally scored using the IEA. We estimated the effects of rater, prompt, student, and rubric by using a Many-Facet Rasch Measurement (MFRM) model. Last, within a single Rasch measurement scale, we co-calibrated the students' grades from human raters and their grades from the IEA to compare them. The scores from the AES engine maintained equivalence with the human ratings and were more consistent than those from the human raters.
Key words Automated Essay Scoring (AES) system, English essay assessment, FACETS, human raters, Intelligent Essay Assessor (IEA), Many-Facet Rasch Measurement (MFRM)
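For readers less familiar with the Many-Facet Rasch Measurement model named above, the sketch below gives its standard rating-scale form. The facet labels (student, prompt, rater) follow the abstract, but the notation is the generic Linacre formulation rather than the authors' exact specification.

```latex
% Rating-scale form of the many-facet Rasch model, sketched with the
% facets named in the abstract: student n, prompt i, rater j, category k.
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right)
  = \theta_n - \delta_i - \alpha_j - \tau_k
% \theta_n : writing ability of student n
% \delta_i : difficulty of prompt i
% \alpha_j : severity of rater j (a human rater or the IEA engine)
% \tau_k   : threshold between adjacent categories k-1 and k of the rubric
```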
The use of generalizability theory in investigating the score dependability of classroom-based L2 reading assessment
Ray J. T. Liao, The University of Iowa, USA
Abstract Among the variety of selected response formats used in L2 reading assessment, multiple-choice (MC) is the most commonly adopted, primarily due to its efficiency and objectiveness. Given the impact of assessment results on teaching and learning, it is necessary to investigate the degree to which the MC format reliably measures learners’ L2 reading comprehension in the classroom context. While researchers have claimed that the longer the reading test (i.e., more test items and passages), the higher its overall reliability, few studies have investigated the optimal number of items and passages required for reliable classroom-based L2 reading assessment.
To address this research gap, I adopted generalizability (G) theory to investigate the score reliability of the MC format in classroom-based L2 reading tests. A total of 108 ESL students at an American college completed an English reading test that included four passages, each of which was accompanied by five MC comprehension questions. The results showed that the score reliability of the L2 reading test was critically influenced by the number of items and passages, inasmuch as a different combination of the number of passages and items altered the degree of reliability. Implications for practitioners and educational researchers are discussed.
Key words Academic reading, generalizability theory, L2 reading assessment, question format, score reliability
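As background to the generalizability analysis described above, the sketch below writes out the variance decomposition and the relative generalizability coefficient for a persons-crossed-with-items-nested-in-passages design, which matches the four-passage, five-item structure in the abstract; the exact design the author estimated is an assumption.

```latex
% p x (i:h) design: persons p crossed with items i nested in passages h
\sigma^2(X_{pih}) = \sigma^2_p + \sigma^2_h + \sigma^2_{i:h}
                  + \sigma^2_{ph} + \sigma^2_{pi:h,e}

% Relative generalizability coefficient for n_h passages with n_i items each
E\rho^2 = \frac{\sigma^2_p}
               {\sigma^2_p + \sigma^2_{ph}/n_h + \sigma^2_{pi:h,e}/(n_h\, n_i)}
```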
Psychometric approaches to analyzing C-tests
David Alpizar, Washington State University, USA
Tongyun Li, Educational Testing Service, USA
John M. Norris, Educational Testing Service, USA
Lixiong Gu, Educational Testing Service, USA
Abstract The C-test is a type of gap-filling test designed to efficiently measure second language proficiency. The typical C-test consists of several short paragraphs with the second half of every second word deleted. The words with deleted parts are considered as items nested within the corresponding paragraph. Given this testlet structure, it is commonly taken for granted that the C-test design may violate the local independence assumption. However, this assumption has not been fully investigated in the C-test research to date, including the evaluation of alternative psychometric models (i.e., unidimensional and multidimensional) to calibrate and score the C-test. This study addressed each of these issues using a large data set of responses to an English-language C-test. First, we examined the local item independence assumption via multidimensional item response theory (IRT) models, Yen’s Q3, and Jackknife Slope Index. Second, we evaluated several IRT models to determine optimal approaches to scoring the C-test. The results support an interpretation of unidimensionality for the C-test items within a paragraph, with only minor evidence of local item dependence. Furthermore, the two-parameter logistic (2PL) IRT model was found to be the most appropriate model for calibrating and scoring the C-test. Implications for designing, scoring, and analyzing C-tests are discussed.
Key words C-test, item response theory, local item dependence, testlet
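The two quantities at the centre of this abstract can be written compactly; the following is a generic sketch of the two-parameter logistic model and Yen's Q3 residual correlation, not the authors' exact specification.

```latex
% Two-parameter logistic (2PL) IRT model for C-test gap i and examinee j
P(X_{ij}=1 \mid \theta_j) = \frac{1}{1 + \exp\bigl(-a_i(\theta_j - b_i)\bigr)}
% a_i : discrimination of item i, b_i : difficulty of item i

% Yen's Q3: correlation (across examinees j) of the residuals of an item
% pair (i, i'), used to flag local item dependence within a paragraph
d_{ij} = X_{ij} - P(X_{ij}=1 \mid \hat{\theta}_j), \qquad
Q_{3,ii'} = \operatorname{corr}_j\bigl(d_{ij},\, d_{i'j}\bigr)
```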
Revisiting English language proficiency and its impact on the academic performance of domestic university students in Singapore
Wenjin Vikki Bo, Singapore University of Social Sciences, Singapore
Mingchen Fu, Nanjing Normal University, China
Wei Ying Lim, Singapore University of Social Sciences, Singapore
Abstract The role of international students’ English language proficiency has been extensively researched to understand its impact on academic achievement in English-medium universities, mainly because of students’ non-English-speaking backgrounds. However, the relationship between language proficiency and academic achievement among English-speaking-background students remains under-researched, especially in multilingual societies, such as Singapore. The present study explored the relationship among university students’ previous academic experience, English language proficiency, and their current academic performance within a sample of 514 Singaporean students (252 females and 262 males). Findings showed that students’ proficiency scores significantly predicted their current grade point average (GPA) with their prior academic performance being controlled. Moreover, proficiency scores significantly strengthened the association between students’ prior academic performance and their current GPA. Finally, academic discipline showed a marginally significant moderating effect in the relationship between proficiency scores and current GPA. Implications and limitations of the study are discussed.
Key words Academic performance, English proficiency, English-speaking-background students, grade point average, higher education
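The moderation result reported above (proficiency strengthening the link between prior performance and current GPA) corresponds to an interaction term in a regression of the general form below; the covariates and coding are assumptions for illustration, not the authors' fitted model.

```latex
% Illustrative moderated regression for student s
\mathrm{GPA}_s = \beta_0 + \beta_1\,\mathrm{Prior}_s + \beta_2\,\mathrm{Prof}_s
               + \beta_3\,(\mathrm{Prior}_s \times \mathrm{Prof}_s) + \varepsilon_s
% \beta_2 > 0 : proficiency predicts GPA with prior performance controlled
% \beta_3 > 0 : proficiency strengthens the prior-performance/GPA association
```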
How do raters learn to rate? Many-facet Rasch modeling of rater performance over the course of a rater certification program
Xun Yan, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana–Champaign, USA
Ping-Lin Chuang, University of Illinois at Urbana–Champaign, USA
Abstract This study employed a mixed-methods approach to examine how rater performance develops during a semester-long rater certification program for an English as a Second Language (ESL) writing placement test at a large US university. From 2016 to 2018, we tracked three groups of novice raters (n = 30) across four rounds in the certification program. Using many-facet Rasch modeling, rater performance was examined in terms of rater agreement, rater consistency, and rater severity. These measurement estimates of rating quality were subjected to multivariate analysis to examine whether and how rater performance changes across rounds. Rater comments on the essays were qualitatively analyzed to obtain a deeper understanding of how raters learn to use the scale over time. The quantitative results showed a non-linear, three-staged developmental pattern of rater performance for all three groups of raters. Findings of this study suggest that rater development resembles a learning curve similar to how one acquires a language and other skills. We argue that understanding the developmental pattern of rater behavior is crucial not only to understanding the effectiveness of rater training, but also to the investigation of rater cognition and development. We will also discuss the practical implications of this study in relation to the effort and expectations needed for rater training for writing assessments.
Key words Longitudinal development, many-facets Rasch measurement, rater cognition, rater reliability, u-shaped learning curve
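Rater consistency in many-facet Rasch analyses of this kind is usually indexed with infit and outfit mean-square statistics; the formulas below are the standard Rasch fit statistics, given as background rather than as the authors' exact operationalization of rating quality.

```latex
% Standardized residual for rater j on rating occasion n
z_{nj} = \frac{x_{nj} - E_{nj}}{\sqrt{\operatorname{Var}(x_{nj})}}

% Outfit (unweighted) and infit (information-weighted) mean squares
\mathrm{Outfit}_j = \frac{1}{N}\sum_{n} z_{nj}^{2}, \qquad
\mathrm{Infit}_j  = \frac{\sum_{n} \operatorname{Var}(x_{nj})\, z_{nj}^{2}}
                        {\sum_{n} \operatorname{Var}(x_{nj})}
```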
Epilogue—Note from an outgoing editor
Luke Harding, Lancaster University, UK
Abstract In this brief epilogue, outgoing editor Luke Harding reflects on his time as editor and considers the future of Language Testing.
Key words Editing, epilogue, journal, language testing, publishing
Issue 2: Table of Contents
Articles
■A sequential approach to detecting differential rater functioning in sparse rater-mediated assessment networks, by Stefanie A. Wind, Pages 209–226.
■Who succeeds and who fails? Exploring the role of background variables in explaining the outcomes of L2 language tests, by Ann-Kristin Helland Gujord, Pages 227–248.
■Comparing holistic and analytic marking methods in assessing speech act production in L2 Chinese, by Shuai Li, Ting Wen, Xian Li, Yali Feng, Chuan Lin, Pages 249–275.
■A meta-analysis on the predictive validity of English language proficiency assessments for college admissions, by Samuel Dale Ihlenfeldt, Joseph A. Rios, Pages 276–299.
■L2 English vocabulary breadth and knowledge of derivational morphology: One or two constructs? by Dmitri Leontjev, Ari Huhta, Asko Tolvanen, Pages 300–324.
■Temporal fluency and floor/ceiling scoring of intermediate and advanced speech on the ACTFL Spanish Oral Proficiency Interview–computer, by Troy L. Cox, Ian V. Brown, Gregory L. Thompson, Pages 325–351.
■Challenges in rating signed production: A mixed-methods study of a Swiss German Sign Language form-recall vocabulary test, by Aaron Olaf Batty, Tobias Haug, Sarah Ebling, Katja Tissi, Sandra Sidler-Miserez, Pages 352–374.
■The typology of second language listening constructs: A systematic review, by Vahid Aryadoust, Lan Luo, Pages 375–409.
■Towards more valid scoring criteria for integrated reading-writing and listening-writing summary tasks, by Sathena Chan, Lyn May, Pages 410–439.
Book Reviews
■Book Review: The Routledge Handbook of Language Testing, by John M. Norris, Pages 440–449.
■Book Review: An Introduction to the Rasch Model with Examples in R, by Zhiqing Lin, Huilin Chen, Pages 450–453.
■Book Review: Reflecting on the Common European Framework of Reference for Languages and its companion volume, by Claudia Harsch, Pages 453–457.
■Book Review: Looking Like a Language, Sounding Like a Race: Raciolinguistic Ideologies and the Learning of Latinidad, by Kamran Khan, Pages 457–460.
Issue 2: Abstracts
A sequential approach to detecting differential rater functioning in sparse rater-mediated assessment networks
Stefanie A. Wind, The University of Alabama, USA
Abstract Researchers frequently evaluate rater judgments in performance assessments for evidence of differential rater functioning (DRF), which occurs when rater severity is systematically related to construct-irrelevant student characteristics after controlling for student achievement levels. However, researchers have observed that methods for detecting DRF may be limited in sparse rating designs, where it is not possible for every rater to score every student. In these designs, there is limited information with which to detect DRF. Sparse designs can also exacerbate the impact of artificial DRF, which occurs when raters are inaccurately flagged for DRF due to statistical artifacts. In this study, a sequential method is adapted from previous research on differential item functioning (DIF) that allows researchers to detect DRF more accurately and distinguish between true and artificial DRF. Analyses of data from a rater-mediated writing assessment and a simulation study demonstrate that the sequential approach results in different conclusions about which raters exhibit DRF. Moreover, the simulation study results suggest that the sequential procedure results in improved accuracy in DRF detection across a variety of rating design conditions. Practical implications for language testing research are discussed.
Key words Many-facet Rasch model, performance assessment, rater bias, rater effects, rater-mediated assessment
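As orientation for this abstract, differential rater functioning is commonly screened in the Rasch framework as a rater-severity contrast between examinee subgroups, as sketched below; this generic contrast is background only and is not the sequential purification procedure the article itself proposes.

```latex
% Severity of rater j estimated separately within subgroups g_1 and g_2
\mathrm{DRF}_j = \hat{\lambda}_{j,g_1} - \hat{\lambda}_{j,g_2}, \qquad
z_j = \frac{\hat{\lambda}_{j,g_1} - \hat{\lambda}_{j,g_2}}
           {\sqrt{SE_{j,g_1}^{2} + SE_{j,g_2}^{2}}}
% A sequential (purification) variant re-estimates the comparison after
% removing already-flagged raters, to separate true from artificial DRF.
```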
Who succeeds and who fails? Exploring the role of background variables in explaining the outcomes of L2 language tests
Ann-Kristin Helland Gujord, University of Bergen, Norway
Abstract This study explores whether and to what extent the background information supplied by 10,155 immigrants who took an official language test in Norwegian affected their chances of passing one, two, or all three parts of the test. The background information included in the analysis was prior education, region (location of their home country), language (first language [L1] background, knowledge of English), second language (hours of second language [L2] instruction, L2 use), L1 community (years of residence, contact with L1 speakers), age, and gender. An ordered logistic regression analysis revealed that eight of the hypothesised explanatory variables significantly impacted the dependent variable (test result). Several of the significant variables relate to pre-immigration conditions, such as educational opportunities earlier in life. The findings have implications for language testing and also, to some extent, for the understanding of variation in learning outcomes.
Key words Adult L2 acquisition, L2 language tests, ordered logistic regression, variation in language outcomes, variation in test outcomes
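The ordered logistic regression named in the abstract takes the proportional-odds form sketched below; the outcome coding (number of test parts passed) and the predictor list are assumptions drawn from the abstract, not the authors' exact specification.

```latex
% Proportional-odds model for an ordinal outcome Y_i (test parts passed)
\operatorname{logit} P(Y_i \le k)
  = \alpha_k - \bigl(\beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}\bigr)
% x_{i1}, \dots, x_{ip} : prior education, region, L1 background,
%                         L2 instruction and use, years of residence, age, gender
% A single coefficient vector \beta is assumed across thresholds \alpha_k.
```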
Comparing holistic and analytic marking methods in assessing speech act production in L2 Chinese
Shuai Li, Georgia State University, USA
Ting Wen, Beijing Language and Culture University, China
Xian Li, Georgia State University, USA
Yali Feng, Georgia State University, USA
Chuan Lin, Georgia State University, USA
Abstract This study compared holistic and analytic marking methods for their effects on parameter estimation (of examinees, raters, and items) and rater cognition in assessing speech act production in L2 Chinese. Seventy American learners of Chinese completed an oral Discourse Completion Test assessing requests and refusals. Four first-language (L1) Chinese raters evaluated the examinees’ oral productions using two four-point rating scales. The holistic scale simultaneously included the following five dimensions: communicative function, prosody, fluency, appropriateness, and grammaticality; the analytic scale included sub-scales to examine each of the five dimensions. The raters scored the dataset twice with the two marking methods, respectively, and with counterbalanced order. They also verbalized their scoring rationale after performing each rating. Results revealed that both marking methods led to high reliability and produced scores with high correlation; however, analytic marking possessed better assessment quality in terms of higher reliability and measurement precision, higher percentages of Rasch model fit for examinees and items, and more balanced reference to rating criteria among raters during the scoring process.
Key words Analytic marking, holistic marking, L2 Chinese, marking methods, pragmatics, speech acts
A meta-analysis on the predictive validity of English language proficiency assessments for college admissions
Samuel Dale Ihlenfeldt, The University of Minnesota, USA
Joseph A. Rios, The University of Minnesota, USA
Abstract For institutions where English is the primary language of instruction, English assessments for admissions such as the Test of English as a Foreign Language (TOEFL) and International English Language Testing System (IELTS) give admissions decision-makers a sense of a student’s skills in academic English. Despite this explicit purpose, these exams have also been used for the practice of predicting academic success. In this study, we meta-analytically synthesized 132 effect sizes from 32 studies containing validity evidence of academic English assessments to determine whether different assessments (a) predicted academic success (as measured by grade point average [GPA]) and (b) did so comparably. Overall, assessments had a weak positive correlation with academic achievement (r = .231, p < .001). Additionally, no significant differences were found in the predictive power of the IELTS and TOEFL exams. No moderators were significant, indicating that these findings held true across school type, school level, and publication type. Although significant, the overall correlation was low; thus, practitioners are cautioned from using standardized English-language proficiency test scores in isolation in lieu of a holistic application review during the admissions process.
Key words Academic success, IELTS, meta-analysis, predictive validity, TOEFL iBT
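For reference, correlation-based meta-analyses of this kind typically pool study effects on Fisher's z scale under a random-effects model, as sketched below; whether the authors used exactly this estimator is not stated in the abstract.

```latex
% Fisher's z transformation of each study correlation r_i (sample size n_i)
z_i = \tfrac{1}{2}\ln\frac{1+r_i}{1-r_i}, \qquad v_i = \frac{1}{n_i - 3}

% Random-effects pooled estimate and back-transformation to the r metric
\bar{z} = \frac{\sum_i w_i z_i}{\sum_i w_i}, \quad w_i = \frac{1}{v_i + \hat{\tau}^2},
\qquad \bar{r} = \frac{e^{2\bar{z}} - 1}{e^{2\bar{z}} + 1}
```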
L2 English vocabulary breadth and knowledge of derivational morphology: One or two constructs?
Dmitri Leontjev, University of Jyvaskyla, Finland
Ari Huhta, University of Jyvaskyla, Finland
Asko Tolvanen, University of Jyvaskyla, Finland
Abstract Derivational morphology (DM) and how it can be assessed have been investigated relatively rarely in language learning and testing research. The goal of this study is to add to the understanding of the nature of DM knowledge, exploring whether and how it is separable from vocabulary breadth. Eight L2 (second or foreign language) English DM knowledge measures and three measures of the size of the English vocabulary were administered to 120 learners. We conducted two confirmatory factor analyses, one with one underlying factor and the other treating vocabulary breadth and DM as separate. As neither model had a satisfactory fit without introducing a residual covariance to the two-factor model, we conducted an exploratory factor analysis, which suggested two separate DM factors in addition to vocabulary breadth. Regardless, the analysis demonstrated that the DM knowledge was separate from learners’ vocabulary breadth. However, learners’ vocabulary breadth factor still explained a substantial amount of variance in learners’ performance on DM measures. We discuss theoretical implications and implications for L2 assessment.
Key words Constructs, derivational morphology, English as a foreign language, factor analysis, vocabulary
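The one-factor versus two-factor comparison described above amounts to comparing confirmatory factor models of the following general form; the notation is generic and the loading pattern is an assumption based on the measures listed in the abstract.

```latex
% One-factor model: all eleven measures m load on a single construct F
x_{im} = \lambda_m F_i + \varepsilon_{im}

% Two-factor model: breadth measures load on V, DM measures on D,
% each measure on exactly one factor, with the factor correlation free
x_{im} = \lambda_m^{V} V_i + \lambda_m^{D} D_i + \varepsilon_{im},
\qquad \operatorname{corr}(V_i, D_i) = \phi
% Competing models are then compared on fit indices such as CFI and RMSEA.
```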
Temporal fluency and floor/ceiling scoring of intermediate and advanced speech on the ACTFL Spanish Oral Proficiency Interview–computer
Troy L. Cox, Brigham Young University, USA
Ian V. Brown, University of Kentucky, USA
Gregory L. Thompson, Brigham Young University, USA
Abstract The rating of proficiency tests that use the Interagency Language Roundtable (ILR) and American Council on the Teaching of Foreign Languages (ACTFL) guidelines claims that each major level is based on hierarchical linguistic functions that require mastery of multidimensional traits in such a way that each level subsumes the levels beneath it. These characteristics are part of what is commonly referred to as floor and ceiling scoring. In this binary approach to scoring that differentiates between sustained performance and linguistic breakdown, raters evaluate many features including vocabulary use, grammatical accuracy, pronunciation, and pragmatics, yet there has been very little empirical validation of the practice of floor/ceiling scoring. This study examined the relationship between temporal oral fluency, prompt type, and proficiency level based on a data set comprised of 147 Oral Proficiency Interview–computer (OPIc) exam responses whose ratings ranged from Intermediate Low to Advanced High [AH]. As speakers progressed in proficiency, they were more fluent. In terms of floor and ceiling scoring, the prompts that elicited speech a level above the sustained level generally resulted in speech that was slower and had more breakdown than the floor-level prompts, though the differences were slight and not statistically significant. Thus, temporal fluency features alone are insufficient in floor/ceiling scoring but are likely a contributing feature.
Key words Proficiency scales, prompt difficulty, rating, Spanish, speaking assessments, temporal fluency
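The temporal fluency features referred to in this abstract are typically operationalized with measures like those below; these are standard definitions from the utterance-fluency literature, not necessarily the exact set the authors computed.

```latex
% Common temporal fluency measures for a timed speech sample
\text{speech rate} = \frac{\text{syllables produced}}{\text{total response time (s)}},
\qquad
\text{articulation rate} = \frac{\text{syllables produced}}{\text{phonation time (s)}}

\text{mean length of run} = \frac{\text{syllables produced}}
                                 {\text{number of pause-bounded runs}}
```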
Challenges in rating signed production: A mixed-methods study of a Swiss German Sign Language form-recall vocabulary test
Aaron Olaf Batty, Keio University, Japan
Tobias Haug, University of Teacher Education in Special Needs (HfH), Switzerland
Sarah Ebling, University of Zurich, Switzerland
Katja Tissi, University of Teacher Education in Special Needs (HfH), Switzerland
Sandra Sidler-Miserez, University of Teacher Education in Special Needs (HfH), Switzerland
Abstract Sign languages present particular challenges to language assessors in relation to variation in signs, weakly defined citation forms, and a general lack of standard-setting work even in long-established measures of productive sign proficiency. The present article addresses and explores these issues via a mixed-methods study of a human-rated form-recall sign vocabulary test of 98 signs for beginning adult learners of Swiss German Sign Language (DSGS), using post-test qualitative rater interviews to inform interpretation of the results of quantitative analysis of the test ratings using many-facets Rasch measurement. Significant differences between two expert raters were observed on three signs. The follow-up interview revealed disagreement on the criterion of correctness, despite the raters’ involvement in the development of the base lexicon of signs. The findings highlight the challenges of using human ratings to assess the production not only of sign language vocabulary, but of minority languages generally, and underscore the need for greater effort expended on the standardization of sign language assessment.
Key words Many-facets Rasch measurement, rater behavior, sign-language assessment, Swiss German Sign Language, vocabulary assessment
The typology of second language listening constructs: A systematic review
Vahid Aryadoust, National Institute of Education, Nanyang Technological University, Singapore
Lan Luo, Guangxi University of Foreign Languages, China
Abstract This study reviewed conceptualizations and operationalizations of second language (L2) listening constructs. A total of 157 peer-reviewed papers published in 19 journals in applied linguistics were coded for (1) publication year, author, source title, location, language, and reliability and (2) listening subskills, cognitive processes, attributes, and listening functions potentially measured or investigated. Only 39 publications (24.84%) provided theoretical definitions for listening constructs, 38 of which were general or had a narrow construct coverage. Listening functions such as discriminative, empathetic, and analytical listening were largely unattended to in construct conceptualization in the studies. In addition, we identified 24 subskills, 27 cognitive processes, and 54 listening attributes (total = 105) operationalized in the studies. We developed a multilayered framework to categorize these features. The subskills and cognitive processes were categorized into five principal groups each (10 groups total), while the attributes were divided into three main groups. This multicomponential framework will be useful in construct delineation and operationalization in L2 listening assessment and teaching. Finally, limitations of the extant research and future directions for research and development in L2 listening assessment are discussed.
Key words Construct definition, construct operationalization, listening assessment, listening comprehension, multimodality, process-based listening, second language, second language listening, subskill
Towards more valid scoring criteria for integrated reading-writing and listening-writing summary tasks
Sathena Chan, University of Bedfordshire, UK
Lyn May, Queensland University of Technology, Australia
Abstract Despite the increased use of integrated tasks in high-stakes academic writing assessment, research on rating criteria which reflect the unique construct of integrated summary writing skills is comparatively rare. Using a mixed-method approach of expert judgement, text analysis, and statistical analysis, this study examines writing features that discriminate summaries produced by 150 candidates at five levels of proficiency on integrated reading-writing (R-W) and listening-writing (L-W) tasks. The expert judgement revealed a wide range of features which discriminated R-W and L-W responses. When responses at five proficiency levels were coded by these features, significant differences were obtained in seven features, including relevance of ideas, paraphrasing skills, accuracy of source information, academic style, language control, coherence and cohesion, and task fulfilment across proficiency levels on the R-W task. The same features did not yield significant differences in L-W responses across proficiency levels. The findings have important implications for clarifying the construct of integrated summary writing in different modalities, indicating the possibility of expanding integrated rating categories with some potential for translating the identified criteria into automated rating systems. The results on the L-W indicate the need for developing descriptors which can more effectively discriminate L-W responses.
Key words Integrated tasks, listening-writing, rating scale, reading writing, scoring validity, summary
About the Journal
Language Testing is an international peer reviewed journal that publishes original research on foreign, second, additional, and bi-/multi-/trans-lingual (henceforth collectively called L2) language testing, assessment, and evaluation. Since 1984 it has featured high impact L2 testing papers covering theoretical issues, empirical studies, and reviews. The journal's scope encompasses the testing, assessment, and evaluation of spoken and signed languages being learned as L2s by children and adults, and the use of tests as research and evaluation tools that are used to provide information on the language knowledge and language performance abilities of L2 learners. Many articles also contribute to methodological innovation and the practical improvement of L2 testing internationally. In addition, the journal publishes submissions that deal with L2 testing policy issues, including the use of tests for making high-stakes decisions about L2 learners in fields as diverse as education, employment, and international mobility.
The journal welcomes the submission of papers that deal with ethical and philosophical issues in L2 testing, as well as issues centering on L2 test design, validation, and technical matters. Also of concern is research into the washback and impact of L2 language test use, the consequences of testing on L2 learner groups, and ground-breaking uses of assessments for L2 learning. Additionally, the journal wishes to publish replication studies that help to embed and extend knowledge of generalisable findings in the field. Language Testing is committed to encouraging interdisciplinary research, and is keen to receive submissions which draw on current theory and methodology from different areas within second language acquisition, applied linguistics, educational measurement, psycholinguistics, general education, psychology, cognitive science, language policy, and other relevant subdisciplines that interface with language testing and assessment. Authors are encouraged to adhere to Open Science Initiatives.
Official website:
https://journals.sagepub.com/home/LTJ
Source: the LANGUAGE TESTING official website
Editor of this post: 东东咚
Reviewed by: 心得小蔓
For reprints and collaboration, please contact "心得君"
WeChat: xindejun_yyxxd