The following topics provide a brief overview of research concerning the EIKEN tests.

Overview of the EIKEN framework

  The EIKEN tests are a seven-level set of tests made and administered by the Society for Testing English Proficiency (STEP), a non-profit foundation established in Japan in 1963. As shown in Table 1, the seven levels of EIKEN are designated as “grades,” and range from Grade 5 (beginner) to Grade 1 (advanced), with two bridging levels (Grades Pre-1 and Pre-2). Each grade is administered on a pass/fail basis. Examinees for all grades are required to take a first stage test consisting of reading and listening sections, with an additional writing section for Grades 2, Pre-1, and 1. For Grades 3 through 1, examinees who have passed the first stage test then take a second stage test which consists of a face-to-face speaking test. For Grades 4 and 5, examinees are able to take a computer-based speaking test regardless of the results of the first stage test. The EIKEN tests, then, are actually a framework of separate tests designed to cover different parts of a broad spectrum of language ability. The tests are administered at sites across Japan and are taken by approximately 2.5 million test takers a year. The advanced levels are used for high stakes decisions including admissions to English-medium universities, while the Ministry of Education, Culture, Sports, Science and Technology (MEXT) has listed lower grades as benchmarks of recommended English ability for junior high school and high school graduates (MEXT, 2003).

Table 1: Overview of EIKEN tests and their uses
    EIKEN Grade    Level       Recognition / Uses
    Grade 1        Advanced    International admissions to graduate and undergraduate programs
    Grade Pre-1                International admissions to graduate and undergraduate programs
    Grade 2                    MEXT benchmarks for high school graduates
    Grade Pre-2                MEXT benchmarks for high school graduates
    Grade 3                    MEXT benchmark for junior high school graduates
    Grade 4
    Grade 5

  This framework of separate tests targeting different levels allows for flexibility in that each grade can be tailored to the needs of the typical learners who take it and the typical uses to which scores will be put. This same flexibility, however, presents particular problems for validation and the constant efforts to review each of the tests and make adjustments where necessary to meet the changing needs of test takers and the changing uses of the tests in Japanese society. Given the wide variation in test uses across grades, each grade needs to be considered individually. Nonetheless, it is still necessary to describe the tests within the same conceptual framework, as they are designed to help Japanese EFL learners move through a defined spectrum of difficulty levels in measured and manageable steps.

  To meet these sometimes competing aims, we have found it useful to refer to Bachman and Palmer’s (1996) framework of test usefulness to help make transparent, principled decisions regarding the optimal balance of various features for each grade. Bachman and Palmer’s framework lists six categories: construct validity, reliability, interactiveness, authenticity, impact, and practicality (Bachman & Palmer, 1996; Stoynoff & Chapelle, 2005). The optimal balance of test characteristics for each of the EIKEN grades, and the degree of importance attached to each of the six characteristics in the usefulness framework, will be quite different because of the very different ages and profiles of typical test takers and the very different uses and interpretations which are made of test scores at each grade. For example, the lower grades have very large numbers of test takers (Table 2) whose principal exposure to the target language is in formal educational contexts. At these levels, a higher priority is placed on practicality in order to make the tests as accessible as possible by delivering them cheaply and efficiently across all areas of Japan. For the higher grades, however, such as Grade 1 and Grade Pre-1, a higher priority is placed on maximizing validity and reliability, often at the expense of practicality.

  Bachman and Palmer’s framework allows for the differential management of characteristics according to the needs of each level, while still enabling, and requiring, that the degree to which each category is realized for each grade contributes to an overall evaluative judgment of “useful” for that grade. Certain compromises will always need to be made in designing and operating a testing program, according to the availability of resources for test development, the specific needs of test takers, and so on. The Bachman and Palmer framework allows for this, but also provides a means for specifying and evaluating whether such compromises are indeed justified in order to maintain usefulness for the uses and interpretations that test users will make of that grade.

  • Table 2: Numbers of examinees in 2008
    EIKEN Grade Number of examinees
    Grade 1 22,055
    Grade Pre-1 71,533
    Grade 2 312,034
    Grade Pre-2 503,638
    Grade 3 661,798
    Grade 4 464,819
    Grade 5 306,745

Advantages and constraints of a levels-based framework

  A levels-based framework such as EIKEN has both advantages and particular constraints. On the one hand, it allows for each level, or grade in the case of EIKEN, to maximize those features most relevant to the needs of test takers at that grade, as described above in the discussion of test usefulness. However, the more limited range of ability levels which each grade targets also brings constraints. Jones (2004), in discussing the design of the Cambridge ESOL Main Suite exams, gives a good overview of the properties of one such levels-based framework which also consists of several separate tests targeted at different levels, and describes how such frameworks differ from single-test designs that try to span a much wider range of ability levels within a single test with a wider range of items.

  In fact, one way of conceptualizing the different EIKEN grades is that each separate grade is designed to focus on a narrower range of ability which might correspond loosely to one or two “bands” or a particular range of scores on a single test designed to cover a broader spectrum of ability levels. In doing so, tests such as those in the EIKEN framework are able to offer more items concentrated within a specific range of ability, and are also able to tailor items in terms of topic and content, etc., to be more relevant to the typical test takers for that grade. In a manner of speaking, it is similar to taking a magnifying glass to look more closely at one part of a larger whole (in this case, one part of the difficulty spectrum, which actually extends from beginner through advanced). At the same time, there are technical implications for psychometric concepts such as measuring reliability because of the more limited range of variance available in test scores from tests targeted at narrower ranges of ability (Jones, 2000).
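
  The effect of a narrower ability range on reliability can be illustrated with the classical test theory definition of reliability as the ratio of true-score variance to observed-score variance. The following is a minimal sketch using invented numbers, not EIKEN data; the function name and the standard deviations are assumptions chosen purely for illustration.

```python
def reliability(true_sd: float, error_sd: float) -> float:
    """Classical test theory reliability: true variance / observed variance."""
    true_var = true_sd ** 2
    error_var = error_sd ** 2
    return true_var / (true_var + error_var)

# Hypothetical values: identical measurement error, different spread of ability.
wide_range = reliability(true_sd=10.0, error_sd=4.0)   # one test spanning many levels
narrow_range = reliability(true_sd=5.0, error_sd=4.0)  # one test targeting a single level

print(f"wide-range population:   {wide_range:.2f}")
print(f"narrow-range population: {narrow_range:.2f}")
```

  With these assumed values the coefficient falls from 0.86 to 0.61 even though the measurement error is unchanged, which is why reliability estimates from levels-based tests need to be read in light of the range of ability they target.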

Interpreting claims of test comparability: important considerations, concerns, and caveats

  It is important to use caution when making any kind of comparison between two tests which have been produced for different purposes, to different test specifications, and which contain different content. As such, test users are urged to be very careful about interpreting the kinds of comparison charts which are presented on this website. The following article will detail some of the risks inherent in making such comparisons. It will also outline a framework for the principled application of relevant theory in order to help test developers provide information that can help test users make informed decisions to meet their real-world needs.

  The first important point to remember is that different tests, for example the EIKEN Grade 1 test and the TOEFL® iBT, are not usually considered to be equivalent. Within the field of educational measurement, equivalence is generally used to refer to the technical properties which ensure comparability between two alternative forms of the same test (Taylor, 2004). For example, when test takers take a version of test A administered in January, and then another version of test A on another testing date, they do not typically expect both versions to contain exactly the same questions and content. In fact, such a case would usually be considered a threat to the security of the test, and to have a potentially negative impact on the validity of the interpretations of the test scores. But test takers do expect that appropriate steps have been taken to ensure the two different forms cover the same range of content which the test is designed to cover, and that appropriate procedures have been employed to ensure the results or test scores are in fact comparable.

  There are many technical procedures associated with maintaining the comparability of different forms of the same test, and these days such procedures commonly include the use of IRT analysis and equating methodology to ensure that test scores from different forms can be placed on a common scale and are indeed comparable. However, while the technical properties associated with maintaining comparability of separate forms of the same test are often quite clearly elaborated (though nonetheless difficult to achieve), the issue of what constitutes comparability or an appropriate level of comparison for scores or results obtained from different tests is not so clear. Professional standards and codes of conduct issue guidelines for test developers regarding both types of comparisons, but these too are not always specific. For example, Section C of the International Language Testing Association’s (ILTA) Code of Practice contains the following standard: “If a decision must be made on candidates who did not all take the same test or the same form of a test, care must be taken to ensure that the different measures used are in fact comparable.” The responsibility is clear, but the procedures are not, although several more detailed standards in the same Code relate to the comparison of different forms of the same test. When tests have different numbers of tasks and test items, and include different content because they are built from different test specifications designed for populations, uses, and educational contexts which may overlap but are not identical, the question of how useful score comparisons are is a legitimate one. Test users should be aware that there are many experts in the field who question the legitimacy of such comparisons.
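
  As a rough illustration of what placing scores from two forms of the same test on a common scale involves, the sketch below applies linear equating, one of the simplest equating methods, in which a score on form X is mapped to the form Y score that has the same standardized position. This is not the IRT-based approach mentioned above, nor a description of how any particular test is equated; the score lists and the function name are invented for illustration, and the method assumes that the two groups of test takers are equivalent in ability.

```python
from statistics import mean, pstdev

def linear_equate(x: float, form_x: list[float], form_y: list[float]) -> float:
    """Map score x on form X to the form Y scale by matching z-scores."""
    mu_x, sd_x = mean(form_x), pstdev(form_x)
    mu_y, sd_y = mean(form_y), pstdev(form_y)
    return mu_y + (sd_y / sd_x) * (x - mu_x)

# Invented score distributions for two forms (assumed equivalent groups).
form_x_scores = [40, 50, 60, 70, 80]
form_y_scores = [46, 54, 62, 70, 78]

equated = linear_equate(50, form_x_scores, form_y_scores)
print(f"A score of 50 on form X corresponds to about {equated:.1f} on form Y")
```

  In this invented data set form X is harder at the low end, so a score of 50 on form X corresponds to roughly 54 on form Y. The adjustment is only meaningful because both forms are built to the same specifications; applying the same arithmetic to two different tests would produce numbers, but not equivalence.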

  Test developers, however, are faced with the dilemma that many test users have legitimate needs to consider the results from different assessments when making the decisions which test results are designed to inform. Admissions officers and employers do receive applications from applicants from different national (and thus educational) backgrounds. Although admissions officers or employers may be more familiar with one type of assessment tool or test, requiring all applicants to take only that one test would be exclusionary and, in terms of consequences or impact, potentially extremely detrimental to many test takers who may already have suitable qualifications which are recognized locally but not internationally. Learners, too, who wish simply to make informed decisions about their own learning achievements and goals arguably can benefit from making careful comparisons of their ability as measured by some test which they have access to and other tests designed to measure similar abilities but which are used in different (perhaps wider or international) contexts.

  Of course, studies to assess the degree of correlation between two different tests designed to measure the same or similar abilities also form a long-standing part of test validation, traditionally falling under the category of criterion-related validity studies. In fact, Kane (1992) has noted that prior to the widespread adoption of the unitary concept of validity which now underpins validity theory, such correlation studies were perhaps the most common form of empirical test validation research.

  At the Eiken Foundation of Japan, we take the position that test users—those who require and make decisions based on test scores as well as learners themselves—do make comparisons between the variety of language tests available, and as such, rather than ignoring these legitimate concerns because of the difficulties involved, we prefer to provide information to help test users make informed decisions. At the same time, we would like to stress that such information is limited in its scope and application and needs to be approached with care. A good overview of many of the concerns and issues involved with making test comparisons can be found in Taylor (2004). Concerning the use of common frameworks such as the CEFR to facilitate comparison, she cautions:

There is no doubt that comparative frameworks can serve a useful function for a wide variety of test stakeholders... But we need to understand that they have their limitations, too: they may mask significant differentiating features, they tend to encourage oversimplification and misinterpretation, and there is always a danger that they are adopted as prescriptive rather than informative tools. (Taylor, 2004)

  Despite the caveats and potential pitfalls inherent in the activity of comparing different tests, it should be noted that measurement specialists have suggested methodological frameworks for linking, that is comparing, different assessments, and that these frameworks account for different levels of robustness in terms of the claims that can be made. Mislevy (1992) and Linn (1993) have both outlined five-level systems ranging from equating (the most robust and strict in terms of requirements) to social moderation (with Mislevy's projection corresponding to Linn's prediction). In Mislevy's framework, projection, statistical moderation, and social moderation all allow for comparison in which tests "don't measure the same thing." Of course there still has to be some sound, substantive, and explicitly stated reasoning for why making such a comparison is reasonable. But well documented statistical procedures exist for making comparisons within such theoretical frameworks as Mislevy's and Linn's. Social moderation, defined by North (2000) as “building up a common framework of reference through discussion and reference to work samples,” has traditionally been seen as the least robust of the procedures for linking exams. However, North (ibid) suggests that when combined with other procedures, the building up of a common framework of reference through social moderation can also be a powerful tool.
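
  Projection (Linn's prediction) can be sketched, in its simplest form, as a regression of scores on one test onto scores on another, using a group of people who took both. The following minimal example uses invented paired scores and hypothetical names (Test A, Test B, fit_projection), not data from any EIKEN study; it is meant only to show that the result of projection is a predicted score with a margin of error rather than an interchangeable score.

```python
from math import sqrt
from statistics import mean

def fit_projection(x, y):
    """Least-squares line predicting y from x, plus r and the standard error of estimate."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / sqrt(sxx * syy)
    # Spread of actual y scores around the predicted line (n - 2 degrees of freedom).
    residual_ss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    see = sqrt(residual_ss / (len(x) - 2))
    return slope, intercept, r, see

# Invented paired scores for ten people who took both hypothetical tests.
test_a = [52, 58, 61, 65, 70, 72, 77, 81, 85, 90]
test_b = [60, 63, 70, 68, 75, 80, 79, 88, 90, 97]

slope, intercept, r, see = fit_projection(test_a, test_b)
predicted = intercept + slope * 75
print(f"r = {r:.2f}")
print(f"Test A score of 75 projects to about {predicted:.0f} on Test B (standard error {see:.1f})")
```

  Note that the relationship is not symmetric: regressing Test A on Test B gives a different line from regressing Test B on Test A, and the prediction holds only for populations similar to the one sampled. These are among the reasons projection supports weaker claims than equating.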

  The comparisons included on this website have come from several perspectives. The TOEFL® score comparisons could be seen as falling in the traditional sphere of projection or prediction in Mislevy’s and Linn’s frameworks. The CEFR is a common framework for reference which includes a descriptive scale of language proficiency (with a claim to empirical scaling). It is not a single test and cannot be approached with the same procedures employed in the EIKEN-TOEFL® studies. Linking to the CEFR necessarily involves social moderation as different users build up a common interpretation of what the different descriptions of proficiency for each level in the framework mean in real terms. At the same time, statistical procedures designed for setting cut-off scores on tests, known as standard setting, have been increasingly employed in projects to link or compare tests to the CEFR in an effort to help build a more principled understanding of the relationship between different tests and the CEFR, and to ensure any claims are underpinned, as far as standard setting allows, by empirical data.

  We would suggest that those who have proposed and supported the different kinds of techniques available to help test developers investigate the relationship of their tests with other measures of the same or similar abilities stress the need for transparency and accountability. In other words, it is incumbent on test developers making such claims to make clear what procedures were used to arrive at the results and to make sure people are aware of what level of meaning they contain. Section D of the ILTA Code of Practice contains the following standard which is relevant to the issue of transparency and accountability: those preparing and administering tests will “refrain from making false or misleading claims about the test.” We have tried to spell out as clearly as possible the intended purpose and necessary limitations of the information we have provided regarding the comparison of the EIKEN tests with other measures of English proficiency. The claims on this website are deliberately intended to be limited in scope and as far as they are supported by studies designed to gather data within an appropriate theoretical framework, we can safely say they are not false. But we also need to avoid deliberately ambiguous language which would cause non-specialists to arrive at unwarranted conclusions (i.e. the tests being compared are in fact the same, or two scores from the different tests mean exactly the same thing because they have been positioned to look that way in a comparison table). In line with the professional requirement outlined in the ILTA Code of Practice, we want to reiterate that the comparisons on this website are meant to be taken as one source of information to help test users make informed decisions. Test users, however, need to be aware of the limitations of such comparisons in general, and in particular be aware of the methods and caveats involved in the specific studies mentioned here.

  Equally, it is important to reiterate that the comparison tables presented on this website do NOT suggest in any way equivalence of content between the EIKEN tests and the other measures, either in the technical sense or the everyday sense of the word. Information on the specific procedures employed in arriving at the different comparison tables is included under the headings for the relevant sections (Investigating the relationship of the EIKEN tests with the CEFR, and Comparing EIKEN and the TOEFL® Tests). We would stress, however, that no one procedure is designed to be sufficient on its own to support any claim of comparability.

  As a final point we would emphasize that for any important decision regarding a test taker’s future, it is rarely appropriate to take only one source of information into account. In fact the Code of Fair Testing Practices in Education (Joint Committee on Testing Practices, 2004) is very clear on this point, stating: “Avoid using a single test score as the sole determinant of decisions about test takers. Interpret test scores in conjunction with other information about individuals.” We would suggest that such a caveat extends to the kinds of test comparisons included on this website. Test users are invited to view these comparisons critically and should be aware that such comparisons cannot and should not be made (and certainly cannot be adequately supported) by the use of one study or one procedure. Especially for high-stakes purposes, one source of evidence is almost never sufficient. The information and research results we can provide through studies within the framework of projection, etc., should be considered useful, but limited and never sufficient on their own.


References

International Language Testing Association. (2007). International Language Testing Association Code of Practice. Retrieved April 2, 2010 from

Joint Committee on Testing Practices. (2004). Code of Fair Testing Practices in Education. Retrieved April 2, 2010 from

Kane, M. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527–535.

Linn, R. L. (1993). Linking results of distinct assessments. Applied Measurement in Education, 6(1), 83–102.

Mislevy, R. J. (1992). Linking educational assessments: Concepts, issues, methods, and prospects. Princeton, NJ: Educational Testing Service.

North, B. (2000). Linking language assessments: An example in a low stakes context. System, 28, 555–577.

Taylor, L. (2004). Issues of test comparability. Research Notes, 15, 2–5.

Investigating the relationship of the EIKEN tests with the CEFR

  To help give a clearer picture of the range of abilities at which each EIKEN grade is aimed, Table 3 gives an overview of the different grades of the EIKEN tests compared to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR).

  • Table 3: Comparison of EIKEN Grades to the CEFR
    CEFR level    EIKEN Grade
    C1            Grade 1
    B2            Grade Pre-1
    B1            Grade 2
    A2            Grade Pre-2
    A1            Grade 3
                  Grade 4
                  Grade 5

  The comparison with the CEFR is based on a long-term research project being undertaken by STEP. The initial phase of the project focused on specification, as described in the Manual for Linking Exams to the CEFR (2003). This involves investigating the relevance of the content areas covered by the tests in relation to the CEFR. This content analysis and comparison, in conjunction with a review of real-world usages to which the tests are put, such as the use of the lower levels as benchmarks for lower secondary school, etc., formed the initial basis for devising the comparison table. The relationship between the first stage written tests for Grades 1 and Pre-1 and the CEFR has been empirically investigated in a series of standard setting workshops carried out in 2007. The results of those workshops (Dunlea & Matsudaira, 2009) demonstrate that a test taker who has passed the first stage of the Grade Pre-1 test can be considered to have demonstrated a strong B2 performance (i.e. they are not just borderline proficient at B2 level, but are at a solid B2 level). A test taker who has passed the first stage of the Grade 1 test can be considered to have demonstrated a strong C1 performance (i.e. they are not just borderline proficient at C1 level, but are at a solid C1 level).

  However, it needs to be borne in mind that these results relate to the first stage of the upper two grades only. Research to confirm the relationship between the Grade 1 and Pre-1 speaking tests and the CEFR, along with the first and second stage tests for the other grades, is being undertaken and will be reported when complete. A full technical report covering both the content analysis and specifications stage as well as the results of standard setting workshops, along with information on the technical measurement properties of the tests, is planned, and of course will be required to meet the guidelines for supporting claims of comparability between tests and the CEFR as outlined in the Manual. Information describing what typical learners at each of the CEFR levels can do is available from the Council of Europe and can be downloaded from the Council of Europe website.

Comparing EIKEN and the TOEFL® tests


  Beginning in 2004, Kapi’olani Community College (KCC), a campus of the University of Hawaii, initiated a series of investigations into the possibility of using EIKEN as an alternative English-proficiency screening test for admission purposes. The investigations were conducted in three phases. Phase 1 was a pilot study and focused on establishing a score-prediction relationship from EIKEN Grades 2 and Pre-2 to TOEFL PBT scores, while Phase 2 aimed to provide score-linking information for institutions that have adopted or intend to adopt EIKEN for admissions, specifically the two highest levels: Grades 1 and Pre-1. Phase 3 was a trace study, which followed the academic progress of KCC students who were admitted to the college based on EIKEN results. Results from the KCC study were used as a reference in developing the score comparison table that appears on this website.


  A two-year study at the University of Hawaii at Manoa, under the direction of Professor J.D. Brown, investigated comparability of EIKEN and TOEFL iBT scores.

Introduction to validity research concerning the EIKEN tests

  Before discussing specific aspects of the research and validation program for the EIKEN tests, it will be useful to contextualize that research within the historical context of recent developments in testing and validity theory.

  Prior to the 1990s, test validity was generally investigated by gathering evidence from one of three distinct categories: construct validity, content validity, and criterion-related validity (Bachman, 1990; Kane, 1992; Messick, 1989; Messick, 1996). Indeed, Kane (1992) notes that there was a tendency to rely on criterion-related validity evidence by investigating the correlations between tests. From the 1990s a significant change occurred as a consensus was gradually reached around the principles promoted by Messick (1989). Validity came to be seen as a unitary concept, with all kinds of evidence contributing to an overall evaluative judgment rather than belonging to distinct categories. The unitary concept of validity has since become a central pillar of current thinking on test validity research. It is also now generally agreed that validity is not an absolute quality inherent in a test, but is rather a matter of degree which results from the meaningfulness and appropriateness of the uses and interpretations of test scores (Bachman, 2004; Bachman, 2005; Cizek et al., 2008; Stoynoff & Chapelle, 2005). For example, using the Grade 5 EIKEN test as a motivating learning goal for beginner-level learners in lower secondary school in Japan, whose first and perhaps only experience of the target language has been in the formal educational context of the classroom, may be a valid use (though this too is a claim which needs justification). But regardless of how valid that particular use for those particular test takers is, it would be very difficult to justify using the results from the Grade 5 test as a measure of the English ability needed by brain surgeons to practice in an English-speaking country. This is, of course, an extreme example, but it illustrates the point that validity can only be evaluated in terms of the appropriateness of the uses to which we intend to put a particular test, and does not reside as some absolute quality within the test itself.

  A second major result of Messick’s work was to promote the importance of investigating the impact that tests have, what Messick (1994) called the consequential aspect of validity. Messick stressed that both intended and unintended consequences of test use require consideration, but also cautioned that “consequences associated with testing are likely to be associated with numerous factors,” making those factors associated directly with test use difficult to isolate and evaluate (p. 251). Bachman and Palmer (1996) have described two broad levels of impact, the micro-level impact of tests or the impact tests have on individuals, and the macro-level impact, which is the impact on “the educational system or society.” Wall (1997) makes a distinction between impact as “any of the effects that a test may have on individuals, policies or practices, within the classroom, the school, the educational system, or society as a whole,” and washback, which she refers to as “the effects of tests on teaching and learning.” While not all researchers adopt the same terminology (Bailey, ; Cheng & Curtis, 2004), the concepts of tests having potentially broad consequences and the importance of understanding those consequences is now well established, and Bachman (2004, 2005) has stressed the need to incorporate the consequential aspects of test use into validation theory more explicitly. Nonetheless, Cizek et al. (2008) have recently noted the difficulty many test developers have in trying to practically evaluate test consequences within a validity research framework. Regarding washback, as with the broader issue of test consequences in general, there is still much discussion on the problems of how to actually measure it. Even so, while research carried out largely since the 1990s has resulted in a recognition that the relationship between tests and the behaviors and attitudes of teachers and learners is indeed complex and not easy to define in a simple cause-and-effect way, it is now generally accepted that washback does exist, and is an important consideration for language testers (Alderson, 2004).

  As the above very brief overview shows, the EIKEN tests, which were first administered in 1963, pre-date much of the work on modern validity theory. However, it is interesting to note that from the outset, the EIKEN tests had a strong emphasis on what would now be called the consequential aspects of validity. The tests explicitly aimed to enhance positive washback and contribute to the improvement of language education in Japanese society, as well as create positive attitudes towards learning English. This manifested itself in two ways: a commitment to maintaining maximum accessibility, and a strong focus on interacting with and understanding the needs of teachers and educators in order to make the content of the tests as relevant as possible to learners. The commitment to accessibility is one reason behind the decision to adopt a levels-based framework of separate tests. In that way, the various features of the different levels could be manipulated to allow the tests to be designed for as wide a group of test takers in terms of age and ability as possible, with each grade targeting a particular ability level. This commitment has also led to a strong focus on practicality and efficiency so the tests could be administered cheaply in all areas of Japan.

  The second way that the commitment to positive impact manifested itself was in a very strong focus on the traditional concept of content validity. When they were first implemented, the tests were not actually required for any particular learning program or purpose. Their utility derived from being seen as well-defined steps that could act as both motivational goals and concrete measures of English ability. The tests had to be, and be seen as, relevant to learners at each level.  To ensure content relevance, the Eiken Foundation placed most of its validation resources into the content validation area, investing in an extensive network of expert committees and outside reviewers, made up of testing experts and practicing teachers, to constantly review and revise test materials during all stages of item and test development. The commitment to positive washback also led to a major feature of the testing program: the public release of test materials. Test takers take test booklets home with them, and the answer keys are published online. Educators and learners are thus able to incorporate high quality testing materials into their teaching and learning. This also allows for a process of open content validation: the fact the content is made public allows for constant review and analysis by any interested party. Content validity has thus been constantly reviewed and verified in an open and interactive manner with all stakeholders.

  However, recent changes in the educational context of Japan mean that there is no longer as clear a consensus concerning such things as “typical” levels of expected ability for school leavers. In this environment it has become important for the foundation to reassess many of the aspects of the EIKEN testing program that were built over time through discussion and consensus with educators and learners, and to explicitly validate these aspects. The increasing need to communicate with educators and test researchers outside Japan has also increased the need to be able to clearly describe the EIKEN tests within the context of current validity theory, and to gather evidence to support these claims.

  In this regard, despite the points of general consensus in the field of validity studies, many researchers have still noted that current definitions remain abstract measurement terms with few concrete guidelines to specify how much of what is needed in establishing the validity of the interpretations and uses of test scores (Bachman, 2004; Cizek et al., 2008; Stoynoff & Chapelle, 2005). To answer these criticisms, recent developments have focused on employing a comprehensive and cohesive narrative argument structure to clearly lay out the claims for the uses and interpretations of test scores, accompanied by the necessary evidence to support such claims. The validity argument approach, which builds on the work of Kane (1992), has been discussed by many researchers (Bachman, 2004; Bachman, 2005; Chapelle et al., 2008; Messick, 1994) and was used by Chapelle et al. (2008) as the basis for developing the validity argument for the TOEFL iBT. The term “validity argument” has since gained wide currency, and in many cases probably does not maintain a strict relationship to the Toulmin argument structure on which Kane, Bachman, and Chapelle et al. have based their work. Several common points of agreement, however, can be seen consistently in the way the term is currently used in the language testing literature, even when not strictly adhering to the original Toulmin structure or the later developments that Bachman or Chapelle et al. have made: 1) the need to clearly describe the claims that are being made about appropriate uses and interpretations of the test scores; 2) the use of the clear and transparent description of claims to identify the evidence that would be necessary to provide adequate support to justify those claims; 3) combining these claims and the evidence to support them into a comprehensive narrative structure to fully describe the claims that the test provider is making, utilizing both empirical and logical evidence to support those claims.

  Eiken has begun carrying out a research program designed to gather evidence to support the construction of a comprehensive and cohesive validity argument justifying the interpretations and uses of the tests and test scores that comprise the EIKEN testing program. This is a long-term focus that aims to take account of the latest developments in validity theory. It requires that we explicitly move beyond the strong content validation focus and the reliance on a consensus-based approach which previously formed the core of the justification for the uses and interpretations of EIKEN tests and test scores. The aim is to conduct a program of research which will allow us to collect evidence relevant to a wide audience: not just educators and learners in Japan, but also international educators, researchers, and testing specialists.

  For any test this is a complicated endeavor. For the EIKEN testing program it is more so, as we often have to retrospectively establish the evidential basis for past decisions taken within a different framework of reference (that is, the consensus that was built through discussion and interaction with the educational community). At this stage, we are still in the process of collecting “pieces” of evidence, but the aim is to combine these pieces into a comprehensive, cohesive whole. In the next section we list a number of projects which we have undertaken or are undertaking. These projects have contributed to our understanding of the EIKEN tests and have also contributed greatly to our confidence in offering these tests as robust measurement tools relevant to and appropriate for the needs and requirements of each grade. The research carried out so far is helping us to design an ongoing research agenda which can continue to add to the growing body of evidence concerning the EIKEN tests. A brief description of each project is followed by a list of references showing where information on that project has been presented or published.

Demonstrating validity: list of recent projects

Recent validation projects concerning the EIKEN tests, with sources where information on the projects has been presented or published.

Some titles require the ability to display Japanese characters.

EIKEN Testing Context Analysis

  • Comprehensive review of EIKEN testing program conducted by Dr. James Dean Brown
  • Combines qualitative and quantitative methodology to review aspects of relevance, utility, value implications, and social consequences as they relate to EIKEN tests
  • Involves large-scale surveys of EIKEN test takers and other test users to build a comprehensive picture of the use and interpretation of EIKEN tests in Japanese society

English Use and Study Habits Survey

A survey of 8,000 test takers undertaken in conjunction with the Can-do Project.

EIKEN-TOEFL®PBT Comparability Study

A four-year study conducted by researchers from the University of Hawaii at Manoa and Kapiolani Community College to investigate comparability of EIKEN and TOEFL®PBT scores.


A long-term study to investigate the relationship between the EIKEN tests in Japan and the Common European Framework of Reference. First phase focusing on Grade 1 and Pre-1 First Stage Tests completed in 2008. Second phase in progress (March 2010).

Investigating differences in direct and semi-direct speaking tests

Joint research with the Department of Functional Brain Imaging, IDAC, Tohoku University, Sendai, Japan. Study used functional magnetic resonance imaging (fMRI) in order to uncover differences in the neural basis between direct and semi-direct interviews.

EIKEN-TOEFL® iBT Comparability Study

A study conducted under the direction of Dr. James Dean Brown at the University of Hawaii at Manoa to investigate comparability of EIKEN and TOEFL® iBT scores.

Lexile® Measurement of Tests: EIKEN & Test of English for Academic Purposes

A report on the Lexile® Measures of the Reading sections of EIKEN and Test of English for Academic Purposes (TEAP).



References

Alderson, C. (2004). Foreword. In L. Cheng & Y. Watanabe (Eds.), Washback in Language Testing. New Jersey: Lawrence Erlbaum Associates.

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford, England: Oxford University Press.

Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: Cambridge University Press.

Bachman, L. F. (2005). "Building and supporting a case for test use." Language Assessment Quarterly, 2, 1–34.

Bachman, L. F., & Palmer, A. (1996). Language testing in practice. Oxford: Oxford University Press.

Bailey, K. M. (1999). Washback in Language Testing. Princeton, NJ: Educational Testing Service.

Cizek, G., Rosenberg, S., & Koons, H. (2008). "Sources of Validity Evidence for Educational and Psychological Tests." Educational and Psychological Measurement, 68 (3), 397–412.

Dunlea, J., & Matsudaira, T. (2009). "Investigating the relationship between the EIKEN tests and the CEFR." In N. Figueras & J. Noijons (Eds.), Linking to the CEFR levels: Research perspectives. Arnhem: CITO and EALTA.

Kane, M. (1992). "An argument-based approach to validity." Psychological Bulletin, 112, 527–535.

Messick, S. (1989). "Validity." In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: Macmillan.

Messick, S. (1996). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. Princeton: Educational Testing Service.

Stoynoff, S., & Chapelle, C. (2005). ESOL tests and testing. Virginia: Teachers of English to Speakers of Other Languages, Inc.