- Define reliability, including the different types and how they are assessed.
- Define validity, including the different types and how they are assessed.
- Describe the kinds of evidence that would be relevant to assessing the reliability and validity of a particular measure.
Again, measurement involves assigning scores to individuals so that they represent some characteristic of the individuals. But how do researchers know that the scores actually represent the characteristic, especially when it is a construct like intelligence, self-esteem, depression, or working memory capacity? The answer is that they conduct research using the measure to confirm that the scores make sense based on their understanding of the construct being measured. This is an extremely important point. Psychologists do not simply assume that their measures work. Instead, they collect data to demonstrate that they work. If their research does not demonstrate that a measure works, they stop using it.
As an informal example, imagine that you have been dieting for a month. Your clothes seem to be fitting more loosely, and several friends have asked if you have lost weight. If at this point your bathroom scale indicated that you had lost 10 pounds, this would make sense and you would continue to use the scale. But if it indicated that you had gained 10 pounds, you would rightly conclude that it was broken and either fix it or get rid of it. In evaluating a measurement method, psychologists consider two general dimensions: reliability and validity.
**Reliability** refers to the consistency of a measure. Psychologists consider three types of consistency: over time (test-retest reliability), across items (internal consistency), and across different researchers (inter-rater reliability).
When researchers measure a construct that they assume to be consistent across time, then the scores they obtain should also be consistent across time. **Test-retest reliability** is the extent to which this is actually the case. For example, intelligence is generally thought to be consistent across time. A person who is highly intelligent today will be highly intelligent next week. This means that any good measure of intelligence should produce roughly the same scores for this individual next week as it does today. Clearly, a measure that produces highly inconsistent scores over time cannot be a very good measure of a construct that is supposed to be consistent.
Assessing test-retest reliability requires using the measure on a group of people at one time, using it again on the same group of people at a later time, and then looking at the relationship between the two sets of scores. This is typically done by graphing the data in a scatterplot and computing Pearson’s r. Figure 5.2 shows the correlation between two sets of scores of several university students on the Rosenberg Self-Esteem Scale, administered two times, a week apart. Pearson’s r for these data is +.95. In general, a test-retest correlation of +.80 or greater is considered to indicate good reliability.
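A test-retest correlation is simply Pearson’s r computed between the two administrations. A minimal Python sketch, using invented self-esteem scores rather than the Figure 5.2 data:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical Rosenberg scores for five students, tested a week apart
week1 = [22, 25, 18, 30, 27]
week2 = [21, 26, 17, 29, 28]
print(round(pearson_r(week1, week2), 2))  # 0.98, well above the +.80 benchmark
```

The same `pearson_r` calculation underlies every correlational check in this chapter; only the pairs of variables change.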
Again, high test-retest correlations make sense when the construct being measured is assumed to be consistent over time, which is the case for intelligence, self-esteem, and the Big Five personality dimensions. But other constructs are not assumed to be stable over time. The very nature of mood, for example, is that it changes. So a measure of mood that produced a low test-retest correlation over a period of a month would not be a cause for concern.
A second kind of reliability is **internal consistency**, which is the consistency of people’s responses across the items on a multiple-item measure. In general, all the items on such measures are supposed to reflect the same underlying construct, so people’s scores on those items should be correlated with each other. On the Rosenberg Self-Esteem Scale, people who agree that they are a person of worth should tend to agree that they have a number of good qualities. If people’s responses to the different items are not correlated with each other, then it would no longer make sense to claim that they are all measuring the same underlying construct. This is as true for behavioural and physiological measures as for self-report measures. For example, people might make a series of bets in a simulated game of roulette as a measure of their level of risk seeking. This measure would be internally consistent to the extent that individual participants’ bets were consistently high or low across trials.
Like test-retest reliability, internal consistency can only be assessed by collecting and analyzing data. One approach is to look at a **split-half correlation**. This involves splitting the items into two sets, such as the first and second halves of the items or the even- and odd-numbered items. Then a score is computed for each set of items, and the relationship between the two sets of scores is examined. For example, Figure 5.3 shows the split-half correlation between several university students’ scores on the even-numbered items and their scores on the odd-numbered items of the Rosenberg Self-Esteem Scale. Pearson’s r for these data is +.88. A split-half correlation of +.80 or greater is generally considered good internal consistency.
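The even/odd split can be sketched in a few lines of Python. The ten-item responses below are invented for illustration:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Rows = participants, columns = responses to 10 Likert items (hypothetical data)
responses = [
    [3, 4, 3, 4, 3, 4, 3, 4, 3, 4],
    [2, 2, 3, 2, 2, 3, 2, 2, 3, 2],
    [4, 4, 4, 3, 4, 4, 4, 3, 4, 4],
    [1, 2, 1, 2, 1, 2, 1, 1, 2, 1],
    [3, 3, 2, 3, 3, 3, 2, 3, 3, 3],
]
odd_totals = [sum(p[0::2]) for p in responses]   # items 1, 3, 5, 7, 9
even_totals = [sum(p[1::2]) for p in responses]  # items 2, 4, 6, 8, 10
print(round(pearson_r(odd_totals, even_totals), 2))  # 0.85
```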
Perhaps the most common measure of internal consistency used by researchers in psychology is a statistic called **Cronbach’s α** (the Greek letter alpha). Conceptually, α is the mean of all possible split-half correlations for a set of items. For example, there are 252 ways to split a set of 10 items into two sets of five. Cronbach’s α would be the mean of the 252 split-half correlations. Note that this is not how α is actually computed, but it is a correct way of interpreting the meaning of this statistic. Again, a value of +.80 or greater is generally taken to indicate good internal consistency.
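In practice α is computed from item and total-score variances rather than by averaging split halves, using the standard formula α = k/(k−1)·(1 − Σs²ᵢ/s²ₜ). A minimal sketch with invented data:

```python
def variance(xs):
    """Sample variance (n - 1 denominator)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(responses):
    """Cronbach's alpha; responses is one list of item scores per participant."""
    k = len(responses[0])                       # number of items
    item_vars = sum(variance(list(item)) for item in zip(*responses))
    total_var = variance([sum(p) for p in responses])
    return k / (k - 1) * (1 - item_vars / total_var)

# Four participants answering a hypothetical four-item scale
responses = [
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 3],
    [1, 2, 1, 2],
]
print(round(cronbach_alpha(responses), 2))  # 0.93
```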
Many behavioural measures involve significant judgment on the part of an observer or a rater. **Inter-rater reliability** is the extent to which different observers are consistent in their judgments. For example, if you were interested in measuring university students’ social skills, you could make video recordings of them as they interacted with another student whom they are meeting for the first time. Then you could have two or more observers watch the videos and rate each student’s level of social skills. To the extent that each participant does in fact have some level of social skills that can be detected by an attentive observer, different observers’ ratings should be highly correlated with each other. Inter-rater reliability would also have been measured in Bandura’s Bobo doll study. In this case, the observers’ ratings of how many acts of aggression a particular child committed while playing with the Bobo doll should have been highly positively correlated. Inter-rater reliability is often assessed using Cronbach’s α when the judgments are quantitative or an analogous statistic called Cohen’s κ (the Greek letter kappa) when they are categorical.
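For categorical judgments, Cohen’s κ adjusts raw agreement for the agreement that two raters would reach by chance. A minimal sketch with invented observer ratings:

```python
def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters' categorical judgments of the same cases."""
    n = len(rater1)
    categories = set(rater1) | set(rater2)
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    p_chance = sum((rater1.count(c) / n) * (rater2.count(c) / n)
                   for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# Two observers classify ten acts in a Bobo-doll-style video (hypothetical data)
obs1 = ["hit", "kick", "hit", "none", "hit", "kick", "none", "hit", "kick", "hit"]
obs2 = ["hit", "kick", "hit", "none", "kick", "kick", "none", "hit", "kick", "hit"]
print(round(cohens_kappa(obs1, obs2), 2))  # 0.84
```

Here the observers agree on 9 of 10 acts (raw agreement .90), but κ is lower (.84) because some of that agreement would be expected by chance alone.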
**Validity** is the extent to which the scores from a measure represent the variable they are intended to measure. But how do researchers make this judgment? We have already considered one factor that they take into account—reliability. When a measure has good test-retest reliability and internal consistency, researchers should be more confident that the scores represent what they are supposed to. There has to be more to it, however, because a measure can be extremely reliable but have no validity whatsoever. As an absurd example, imagine someone who believes that people’s index finger length reflects their self-esteem and therefore tries to measure self-esteem by holding a ruler up to people’s index fingers. Although this measure would have extremely good test-retest reliability, it would have absolutely no validity. The fact that one person’s index finger is a centimetre longer than another’s would indicate nothing about which one had higher self-esteem.
Discussions of validity usually divide it into several distinct “types.” But a good way to interpret these types is that they are other kinds of evidence—in addition to reliability—that should be taken into account when judging the validity of a measure. Here we consider three basic kinds: face validity, content validity, and criterion validity.
**Face validity** is the extent to which a measurement method appears “on its face” to measure the construct of interest. Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities. So a questionnaire that included these kinds of items would have good face validity. The finger-length method of measuring self-esteem, on the other hand, seems to have nothing to do with self-esteem and therefore has poor face validity. Although face validity can be assessed quantitatively—for example, by having a large sample of people rate a measure in terms of whether it appears to measure what it is intended to—it is usually assessed informally.
Face validity is at best a very weak kind of evidence that a measurement method is measuring what it is supposed to. One reason is that it is based on people’s intuitions about human behaviour, which are frequently wrong. It is also the case that many established measures in psychology work quite well despite lacking face validity. The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) measures many personality characteristics and disorders by having people decide whether each of 567 different statements applies to them, where many of the statements have no obvious relationship to the construct that they measure. For example, the items “I enjoy detective or mystery stories” and “The sight of blood doesn’t frighten me or make me sick” both measure the suppression of aggression. In this case, it is not the participants’ literal answers to these questions that are of interest, but rather whether the pattern of the participants’ responses to a series of questions matches those of individuals who tend to suppress their aggression.
**Content validity** is the extent to which a measure “covers” the construct of interest. For example, if a researcher conceptually defines test anxiety as involving both sympathetic nervous system activation (leading to nervous feelings) and negative thoughts, then their measure of test anxiety should include items about both nervous feelings and negative thoughts. Or consider that attitudes are usually defined as involving thoughts, feelings, and actions toward something. By this conceptual definition, a person has a positive attitude toward exercise to the extent that he or she thinks positive thoughts about exercising, feels good about exercising, and actually exercises. So to have good content validity, a measure of people’s attitudes toward exercise would have to reflect all three of these aspects. Like face validity, content validity is not usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement method against the conceptual definition of the construct.
**Criterion validity** is the extent to which people’s scores on a measure are correlated with other variables (known as **criteria**) that one would expect them to be correlated with. For example, people’s scores on a new measure of test anxiety should be negatively correlated with their performance on an important school exam. If it were found that people’s scores were in fact negatively correlated with their exam performance, then this would be a piece of evidence that these scores really represent people’s test anxiety. But if it were found that people scored equally well on the exam regardless of their test anxiety scores, then this would cast doubt on the validity of the measure.
A criterion can be any variable that one has reason to think should be correlated with the construct being measured, and there will usually be many of them. For example, one would expect test anxiety scores to be negatively correlated with exam performance and course grades and positively correlated with general anxiety and with blood pressure during an exam. Or imagine that a researcher develops a new measure of physical risk taking. People’s scores on this measure should be correlated with their participation in “extreme” activities such as snowboarding and rock climbing, the number of speeding tickets they have received, and even the number of broken bones they have had over the years. When the criterion is measured at the same time as the construct, criterion validity is referred to as **concurrent validity**; however, when the criterion is measured at some point in the future (after the construct has been measured), it is referred to as **predictive validity** (because scores on the measure have “predicted” a future outcome).
Criteria can also include other measures of the same construct. For example, one would expect new measures of test anxiety or physical risk taking to be positively correlated with existing measures of the same constructs. This is known as **convergent validity**.
Assessing convergent validity requires collecting data using the measure. Researchers John Cacioppo and Richard Petty did this when they created their self-report Need for Cognition Scale to measure how much people value and engage in thinking (Cacioppo & Petty, 1982). In a series of studies, they showed that people’s scores were positively correlated with their scores on a standardized academic achievement test, and that their scores were negatively correlated with their scores on a measure of dogmatism (which represents a tendency toward obedience). In the years since it was created, the Need for Cognition Scale has been used in literally hundreds of studies and has been shown to be correlated with a wide variety of other variables, including the effectiveness of an advertisement, interest in politics, and juror decisions (Petty, Briñol, Loersch, & McCaslin, 2009).
**Discriminant validity**, on the other hand, is the extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct. For example, self-esteem is a general attitude toward the self that is fairly stable over time. It is not the same as mood, which is how good or bad one happens to be feeling right now. So people’s scores on a new measure of self-esteem should not be very highly correlated with their moods. If the new measure of self-esteem were highly correlated with a measure of mood, it could be argued that the new measure is not really measuring self-esteem; it is measuring mood instead.
When they created the Need for Cognition Scale, Cacioppo and Petty also provided evidence of discriminant validity by showing that people’s scores were not correlated with certain other variables. For example, they found only a weak correlation between people’s need for cognition and a measure of their cognitive style—the extent to which they tend to think analytically by breaking ideas into smaller parts or holistically in terms of “the big picture.” They also found no correlation between people’s need for cognition and measures of their test anxiety and their tendency to respond in socially desirable ways. All these low correlations provide evidence that the measure is reflecting a conceptually distinct construct.
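The convergent/discriminant logic amounts to comparing correlations: a new scale should correlate strongly with an established measure of the same construct and only weakly with a conceptually distinct one. A sketch with invented scores (none of these numbers come from the studies above):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

new_self_esteem = [22, 25, 18, 30, 27, 20, 24]   # scores on a hypothetical new scale
rosenberg       = [21, 26, 17, 29, 28, 19, 25]   # established measure (convergent criterion)
mood_today      = [3, 5, 4, 4, 3, 5, 4]          # conceptually distinct (discriminant criterion)

r_convergent = pearson_r(new_self_esteem, rosenberg)
r_discriminant = pearson_r(new_self_esteem, mood_today)
# r_convergent should be high; r_discriminant should be near zero
print(round(r_convergent, 2), round(r_discriminant, 2))
```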
- Psychological researchers do not simply assume that their measures work. Instead, they conduct research to show that they work. If they cannot show that they work, they stop using them.
- There are two distinct criteria by which researchers evaluate their measures: reliability and validity. Reliability is consistency across time (test-retest reliability), across items (internal consistency), and across researchers (interrater reliability). Validity is the extent to which the scores actually represent the variable they are intended to.
- Validity is a judgment based on various types of evidence. The relevant evidence includes the measure’s reliability, whether it covers the construct of interest, and whether the scores it produces are correlated with other variables they are expected to be correlated with and not correlated with variables that are conceptually distinct.
- The reliability and validity of a measure are not established by any single study but by the pattern of results across multiple studies. The assessment of reliability and validity is an ongoing process.
- Practice: Ask several friends to complete the Rosenberg Self-Esteem Scale. Then assess its internal consistency by making a scatterplot to show the split-half correlation (even- vs. odd-numbered items). Compute Pearson’s r too if you know how.
- Discussion: Think back to the last college exam you took and think of the exam as a psychological measure. What construct do you think it was intended to measure? Comment on its face and content validity. What data could you collect to assess its reliability and criterion validity?
- Cacioppo, J. T., & Petty, R. E. (1982). The need for cognition. Journal of Personality and Social Psychology, 42, 116–131.
- Petty, R. E., Briñol, P., Loersch, C., & McCaslin, M. J. (2009). The need for cognition. In M. R. Leary & R. H. Hoyle (Eds.), Handbook of individual differences in social behaviour (pp. 318–329). New York, NY: Guilford Press.
Types of Validity
Face Validity is concerned with whether a selection instrument appears to measure what it was designed to measure. Whilst face validity has no technical or statistical basis, it must not be overlooked if a test is to be accepted by the respondent. In a personality questionnaire, the question of face validity often relates to the questions asked. For example, the question “Does your soul ever drift away from your body?” (from the MMPI) would probably be felt to have low face validity for occupational use.
Faith Validity is the least defensible type of validity but the most difficult to influence. It is simply a conviction, a belief or blind faith, that a selection test is valid. There is no empirical evidence and, what is more, none is wanted. This type of thinking is particularly disturbing with personality questionnaires. Clearly it is relatively easy to label a scale ‘Leadership’ or ‘Honesty’, but it is quite another matter to demonstrate whether the scale really works.
Content Validity (also known as logical validity) refers to the extent to which a measure represents all facets of the construct (e.g. personality, ability) being measured. For example, a depression scale may lack content validity if it only assesses the affective dimension of depression but fails to take into account the behavioural dimension. When constructing any test or questionnaire, the items should be representative of the domain to be measured. For example, a spelling test containing only the names of professional footballers would have poor content validity as a general-purpose spelling test.
Empirical Validity (or predictive validity) is the relationship between test scores and some criterion of performance obtained, e.g. job performance or academic performance. One type of validation study involves people currently employed. Thus we might test a group of computer programmers and correlate their test results with their supervisors’ ratings of work performance. Another type of study involves testing applicants, and returning after a period of time to determine the relationship between test performance and measures of job performance. Large scale empirical validation studies are difficult to undertake and people usually rely on published empirical studies (e.g. Ghiselli, Hunter & Schmidt, Smith).
Construct Validity is the extent to which a test measures some established construct or trait. Such constructs might be mechanical, verbal or spatial ability, emotional stability or intelligence. Correlations with other scales will provide useful information on a test’s construct validity (we would expect a scale of dominance to be more highly related to other measures of dominance than to traits of anxiety). Factor analysis is often used to investigate the construct validity of personality questionnaires.
Case Study: Construct Validity of a personality questionnaire
The “construct validity” data in the 15FQ+ manual relate the questionnaire’s global factors to the 16PF5 global scales and to Costa and McCrae’s five-factor model (NEO), via the correlations shown below.
| 15FQ+ Global Factors (left meaning v right meaning) | Correlation with 16PF5 Global Factor | Correlation with NEO Five Factor (Costa & McCrae) |
| --- | --- | --- |
| Practical v Openness | -.65 (Tough-Minded) | .66 (Openness) |
| Spontaneousness v Control | .79 (Self-Control) | .67 (Conscientiousness) |
| Introversion v Extraversion | .88 (Extraversion) | .74 (Extraversion) |
| Independence v Agreeableness | -.81 (Independence) | .61 (Agreeableness) |
| Relaxed v Anxiety | .87 (Anxiety) | .77 (Neuroticism) |
From the table above we can infer that the 15FQ+, 16PF5 and NEO questionnaires are broadly measuring the same traits. For example, we would expect someone with a sten score of 3 on Spontaneousness v Control (15FQ+) to have a below-average score for Self-Control (16PF5) and Conscientiousness (NEO).
The Central Position of Validity in Decision Making on the Use of Psychometrics
To make appropriate judgements on the use of an assessment for selection we need to understand:
- How the traits we are measuring relate to job success (n.b. Empirical validity)
- How the specific product we are using assesses the trait (n.b. Construct Validity)
- The specific demands of the job we are selecting for (n.b. Content Validity)
All of these factors can be related to validity research. Empirical validity is particularly concerned with how the traits we are measuring relate to job success. Construct validity is particularly concerned with how the specific product we are using assesses the trait. Finally, content validity addresses how the specific demands of the job we are selecting for relate to the traits we are assessing.
The “Empirical Validation” of Objective Assessments in Work Environments
In the years following the end of the Second World War, research focused on the predictive power of standardised assessments. Initially this research focused on ability tests and work samples; however, from the late 1950s assessment centres and structured interviews were also shown to have a significant level of predictive validity. “Assessment centre” is the term applied when we use a variety of standardised assessments to provide supportive evidence for each other. Typically assessment centres will include ability tests, personality questionnaires, role plays, case studies and an interview. The table below is based on a revisiting of Schmidt and Hunter’s 1998 research as presented by Michael Smith (Testing People at Work, 2005).
This graphic illustrates the relative effectiveness of the different types of method used in selection processes in terms of their ability to predict future job performance. A validity coefficient of 0 means that the method is no better than chance at selecting someone who will perform successfully in the role. Perfect prediction is implausible when predicting human behaviour.
Taylor-Russell tables can help us make more appropriate inferences about future performance in a job for which a particular type of assessment is relevant, using the validity coefficient and a margin of error.
Each cell below shows the ratio of below-average % / above-average % job performance, with rows running from perfect validity down to zero validity:

| High assessment performer (e.g. Sten 8) | Lower assessment performer (e.g. Sten 3) |
| --- | --- |
| 0% / 100% | 100% / 0% |
| 15 / 85 | 79 / 21 |
| 21 / 79 | 74 / 26 |
| 26 / 74 | 69 / 31 |
| 31 / 69 | 64 / 36 |
| 36 / 64 | 59 / 41 |
| 41 / 59 | 55 / 45 |
| 50 / 50 | 50 / 50 |
Using the table above we can predict the likelihood of job success for a high assessment performer in a relevant job. For example, if a graduate applicant for a marketing role had above-average general reasoning ability (sten 8), we could estimate that they have a 69% likelihood of above-average performance (based on a validity of .4 for the assessment). Similarly, we can predict the likelihood of job success for an underperformer on a career assessment relevant to their career interests. For example, if students considering careers in different professional areas had below-average general reasoning ability (sten 3), we could estimate that, on ability factors alone, they have a 31% likelihood of above-average performance in any particular professional career (again based on a validity of .4).

In considering the above it is essential that we can give feedback in an appropriately positive manner (ref. section 4). Below-average cognitive abilities (sten 3 or lower) mean the individual needs to make particular efforts to find an area where they have other advantages (specialist interest, motivation, complementary personality traits). A consistent pattern of below-average cognitive ability on assessments should alert us to the fact that the individual will need more time to master tasks, not that they cannot succeed. Others may find the career path easier, but given a high level of motivation, people with diverse cognitive abilities can succeed in the most cognitively challenging careers with time and effort.
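Figures of this kind can be approximated directly under the usual assumptions behind such tables: bivariate-normal test and performance scores and a 50% base rate of “above average” performance. The sketch below is ours, not from any test manual; the sten-to-z conversion assumes the conventional sten mean of 5.5 and SD of 2, and real Taylor-Russell tables also condition on the selection ratio, so values differ slightly from the table above.

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def p_above_average(validity, sten):
    """P(above-average job performance | test score at a given sten),
    assuming bivariate normality and a 50% base rate."""
    z = (sten - 5.5) / 2.0            # stens have mean 5.5 and SD 2
    cond_mean = validity * z          # expected performance given the test score
    cond_sd = sqrt(1 - validity ** 2) # spread of performance around that expectation
    return 1 - std_normal_cdf(-cond_mean / cond_sd)

print(round(p_above_average(0.4, 8), 2))  # roughly .71 for a sten-8 scorer
print(round(p_above_average(0.4, 3), 2))  # roughly .29 for a sten-3 scorer
```

With zero validity the estimate collapses to the 50/50 base rate, exactly as in the bottom row of the table.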
Ability tests are widely accepted as being positively correlated with job performance across administrative, managerial and professional roles (e.g. Schmidt et al., 1998). Appropriate technical tests are seen as having higher validity. The correlation of general reasoning ability with performance in administrative roles is widely accepted as being in the region of +.3 to +.6, while the validity of technical tests for craft roles has been found to be at least as high. While a positive correlation of .3 may sound low, even a medium correlation of this type can have a significant impact on the job. See the illustration of an expectancy table below.
| Verbal Reasoning Score | Low Management Performance | High Management Performance |
| --- | --- | --- |
| High (top 30%) | 27% | 73% |
| Low (bottom 30%) | 68% | 32% |
Looking at the “high” (top 30%) test scorers, 73% are high-performing individuals and only 27% are low performers. This trend is reversed for “low” (bottom 30%) test scorers, of whom only 32% are high performers while 68% are in the low-performing group. This indicates that high test scorers are more than twice as likely as low test scorers to be high performers (73% against 32%). In validity research studies, ability/aptitude tests have been the single best predictors of training and job success. They have been found to predict training success slightly better than job performance, but do not predict tenure. Research also supports the idea that such tests predict better than educational level, interviews and references, and comparably to bio-data, work samples and peer evaluation.
The validity of personality questionnaires has proved more controversial than that of ability tests. Some psychologists have suggested that the average validity of personality questionnaires is as low as .10, while others claim that it could be in the region of .4 (Smith, 1988; Ghiselli, 1973). Many of these differences of opinion can be related to how effectively personality scales are related to job criteria. Validity is a key issue for any academic reviewer of a personality questionnaire. Published evidence of correlations with similar constructs (construct validity) or job performance (empirical criteria) is highly respected by professional reviewers. Instruments published with little or no validity data will be criticised in professional circles.
Limitations on Carrying Out Empirical Validity Studies
One of the most important challenges to overcome in carrying out an empirical study is the quality of the criterion data available. Job performance data, for example, if it exists at all, is frequently rather subjective and can be difficult to quantify. The objectivity and reliability of the criterion data are critical to the success of a validity study: the reliability of either measure sets the upper limit on validity, so validity studies which have used unreliable criterion data will not show good results, irrespective of the relevance of the predictor. The use of quantifiable measures such as quantity of output, performance against target, etc., should be sought wherever possible. Behaviourally anchored rating scales designed specifically for the study are generally better than appraisal data, and two independent ratings of performance are generally better than one, as this enables an estimate to be made of inter-rater reliability.

Sample size is another important consideration, and validity studies based on small samples (fewer than 100) should generally be avoided. Restriction of range is another common problem, and it can affect the predictor variable, the criterion variable, or both. Where local validity studies cannot be undertaken, larger-scale studies can be used to justify the relevance of an assessment tool. The landmark work of Schmidt has already established the argument for validity generalisation: it is now accepted in US courts that well-administered tests of general reasoning ability are appropriate for assessing the analytical capabilities required in clerical, administrative and managerial roles.
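The point that reliability caps validity has a simple classical-test-theory form: an observed validity coefficient cannot exceed the square root of the product of the two reliabilities, and an observed correlation can be “disattenuated” by dividing by that quantity. A quick sketch; the example reliabilities below are invented for illustration:

```python
from math import sqrt

def validity_ceiling(rel_test, rel_criterion):
    """Maximum observable validity given the reliabilities of test and criterion."""
    return sqrt(rel_test * rel_criterion)

def disattenuate(r_observed, rel_test, rel_criterion):
    """Estimated true validity, corrected for unreliability in both measures."""
    return r_observed / sqrt(rel_test * rel_criterion)

# A reliable test (.85) scored against noisy supervisor ratings (.50):
print(round(validity_ceiling(0.85, 0.50), 2))    # 0.65 -- the best any predictor could do
print(round(disattenuate(0.30, 0.85, 0.50), 2))  # 0.46 -- the observed .30 understates the true relationship
```

This is why unreliable criterion data doom a validity study: even a perfect predictor could not correlate above the ceiling set by the criterion’s reliability.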
Content Validity, Fair Selection and Legality
There is an increasing requirement to justify the use of psychometric instruments from a legal point of view. This issue is particularly significant where assessment instruments are used in the process of selection (for a job, for a development opportunity etc). In this context fairness and legality require careful consideration of validity issues as well as quality assured administration processes.
The legal case for any assessment process is established by demonstrating that the assessment is relevant.
Content validation is the technical term for ensuring an assessment process is relevant to job content. This requires that a robust analysis of job content has been undertaken; the development of a job-relevant competency framework usually fulfils this function. A job analysis which demonstrates a clear need for analytical capability, e.g. working with numbers and working with words, can therefore justify the use of such aptitude tests.

Tests and personality questionnaires are often introduced into an assessment situation to improve the objectivity and, thus, fairness of the assessment. There is evidence to suggest that, for example, differences between ethnic groups in terms of personality are very small, while interviewers continue to exhibit bias towards their own race. Nevertheless, best practice is to use personality questionnaires in conjunction with an interview: by using a validation interview to test hypotheses from personality questionnaires we add breadth and depth to the interview.

If you are using assessments for selection purposes, best practice is to monitor selection outcomes with respect to all of the 9 grounds*. In monitoring assessment outcomes we must watch out for adverse impact. Adverse impact is not illegal where there are relevant job demands for which one group is more capable. However, where it occurs we must establish which element of the process is causing the adverse impact and satisfy ourselves that a legitimate job demand underpins it.
* In Ireland this comes under the Employment Equality Acts 1998 and 2004, which state that individuals must not be discriminated against on the grounds of their gender, marital status, family status, race, age, religion, sexual orientation, disability or membership of the Traveller community.