Kim et al.

Beck Depression Inventory (BDI)

Name/ Reference

Kim Y, Pilkonis PA, Frank E, Thase ME, Reynolds CF: Differential Functioning of the Beck Depression Inventory in Late-Life Patients: Use of Item Response Theory. Psychology and Aging (2002) 17(3):379–391.

Source contact info

Correspondence concerning this article should be addressed to Yookyung Kim, Department of Health and Community Systems, University of Pittsburgh, 360 Victoria Building, Pittsburgh, Pennsylvania 15261.

E-mail: ykk@pitt.edu.

Availability (private or public)

Public.

Conceptual framework

The purpose of the present analyses was to examine age-related measurement bias in the performance of the Beck Depression Inventory (BDI) for older patients, using item response theory (IRT) models.

Purpose of measure & application (clinical, research, survey, screening)

The revised BDI is a self-report measure for assessing the severity of depression in clinical populations and for detecting depression in non-clinical populations.

Sample characteristics

Data (total N = 831) with the 1978 version of the BDI collected at baseline prior to therapy were aggregated from three outpatient treatment protocols on depression in late life (age 60 or older, n = 218) and five outpatient treatment protocols on depression in midlife (less than 60 years old, n = 613).

“Age, and marital status, largely as a function of age (e.g., loss of spouse by death), differed significantly. There were no significant differences in the gender and racial composition of the groups. The mean BDI score was significantly lower, whereas the mean Hamilton Rating Scale for Depression score was significantly higher for the late-life sample at pretreatment baseline.”

Recruitment methods

NA

Data collection method

“For all eight protocols, an intake diagnosis of primary, nonpsychotic, nonbipolar major depressive disorder was made according to the Diagnostic and Statistical Manual of Mental Disorders (4th ed.; DSM–IV) criteria on the basis of a semistructured clinical interview conducted by a clinician and faculty psychiatrist. The diagnosis of major depressive disorder was then confirmed with a second, structured interview: either the Schedule for Affective Disorders and Schizophrenia or the Structured Clinical Interview for DSM”

Response rate

Not provided

Format & design (readability, # of items, time to complete, response categories)

Not provided

Type of measurement (nominal, ordinal, interval, ratio)

Not provided

Scoring (range, direction, rules, missing data)

Not provided

Availability of translations & source

Not provided

Psychometric Properties:

Scale construction

Not provided

Basic summary statistics

Not provided

Variability

Not provided

Test-retest reliability

Not provided

Interrater reliability

Not provided

Internal consistency

“Studies of psychometric properties have suggested that the reliability and validity of the BDI for elderly samples are reasonably good, and it has been adopted widely for use with older samples” (see ref 9 below).

 The internal consistency of the BDI as assessed by coefficient alpha was .85 for midlife and .88 for late-life patients in the current study.
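For readers unfamiliar with coefficient alpha, the following is a minimal sketch of how it is computed from an item-by-respondent score matrix. The function name and the toy data are hypothetical and are not taken from the study; the reported values (.85 and .88) come from the authors' own analysis of the full 21-item BDI.

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """Coefficient alpha for a list of respondents, each a list of item scores."""
    n_items = len(responses[0])
    # Variance of each item across respondents
    item_vars = [pvariance([row[j] for row in responses]) for j in range(n_items)]
    # Variance of the summed (total) score
    total_var = pvariance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical toy data: 5 respondents x 4 items scored 0-3 (not study data)
toy = [
    [0, 1, 0, 1],
    [1, 1, 1, 2],
    [2, 2, 1, 2],
    [3, 2, 3, 3],
    [1, 0, 1, 1],
]
alpha = cronbach_alpha(toy)
```

Alpha rises as the items covary more strongly relative to their individual variances, which is why it is read as an index of internal consistency.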

Content validity

Not provided

Construct validity

Not provided

Concurrent validity

Not provided

Predictive validity

Not provided

Sensitivity to change

Not provided

Differential Item Functioning (DIF):

Variable studied (e.g., groups)

Age: midlife (20–59 years) vs late life (60–91 years)

Sample size

Midlife (n = 613); late life (n = 218)

DIF method used

(e.g., MH, IRT, Logistic regression, MIMIC, other factor analysis)

IRT log-likelihood ratio test, using the two-parameter (2P) graded IRT model.

“In 2P models, the first parameter (designated α_i) describes how well an item can discriminate between patients low on the dimension being examined and those high on the dimension, and we assumed that BDI items would, indeed, vary in this ability and that the large size of the sample would allow us to accurately estimate such differences. The second (multiple threshold) parameter (estimated in both 1P and 2P models) can be interpreted as an inflection or cutting point between two adjacent response levels.” (pg 380)
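The roles of the discrimination and threshold parameters can be illustrated with a short sketch of a graded response model in the logistic (Samejima) form. This is an assumption about the general model family, not a reproduction of the MULTILOG parameterization used by the authors; the function name and the parameter values are hypothetical.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Graded response model: P(X = k | theta) for k = 0..len(thresholds).

    a          -- item discrimination (slope)
    thresholds -- ordered threshold parameters, one per boundary between
                  adjacent response levels
    """
    # Cumulative probabilities P(X >= k); P(X >= 0) = 1 and P(X >= K + 1) = 0
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum += [0.0]
    # Category probability is the difference of adjacent cumulative probabilities
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

# Hypothetical item with four response levels (0-3), as on the BDI
probs = grm_category_probs(theta=0.0, a=1.5, thresholds=[-1.0, 0.5, 2.0])
```

When theta equals a threshold, the probability of responding at or above that level is exactly 0.5, which matches the description of the threshold as a cutting point between two adjacent response levels. A larger `a` makes the curves steeper around each threshold.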

Test of model assumptions

“The unidimensionality assumption was examined through factor analysis of the BDI, using polychoric correlations computed with PRELIS 2 within the two age groups. The present principal-axis factor analyses relied on polychoric correlations, and they provided good evidence of a single superordinate construct of depression as assessed by the BDI in both late-life and midlife patients. The primary factor (with eigenvalues of 7.8 and 5.8 for late-life and midlife patients, respectively) accounted for 37.0% and 27.4% of the overall variance in these samples.” (pg 380)

“A two-factor model was also reviewed to test whether it provided a significantly better fit, but this solution resulted in a second factor that had only two item loadings greater than .40 in both patient groups, which is too few items to have much confidence in the corresponding subscale. Thus, it was concluded that the data were sufficiently unidimensional for further IRT analysis.” (pg 381)

“To justify the use of a 2P versus 1P graded (or other Rasch family) IRT model, a statistical comparison of the models was performed, estimated by MULTILOG 6. The 1P and 2P models can be compared by examining the “negative twice the loglikelihood” statistic for each model, because the models are nested (i.e., the 2P model estimates all the parameters of the 1P model plus additional parameters). The hypothesis was that the 2P model, although more complex, would be preferred because it assumes that the items on the BDI do vary in their general ability to discriminate between patients with more versus less depression. To determine model sufficiency, the MULTILOG program was used to estimate item and severity (theta) parameters. The procedures for assessing model fit for this study included (a) estimating item and severity parameters by using the graded IRT model, (b) obtaining an expected-score response distribution for an item by using the item parameter estimates and defined severity subgroups, and (c) comparing these predictions with observed-score response distributions. Many goodness-of-fit statistics have been proposed to test the fit of IRT models. A goodness-of-fit statistic, Q1 for polytomous items, was used for the present analyses.” (pg 380)
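The nested-model comparison described above can be sketched as follows. The difference in “negative twice the loglikelihood” between the restricted (1P) and fuller (2P) model is referred to a chi-square distribution whose degrees of freedom equal the number of additional parameters. The -2 log-likelihood values below are hypothetical, and the choice of df = 20 assumes one common slope in the 1P model versus 21 item-specific slopes in the 2P model; neither figure is reported in this summary. The chi-square tail probability is computed with a simple series so that the sketch needs only the standard library.

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-square probability via the series for the regularized
    lower incomplete gamma function (adequate for moderate x and df)."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total) and n < 10000:
        n += 1
        term *= z / (a + n)
        total += term
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def likelihood_ratio_test(neg2ll_restricted, neg2ll_full, df):
    """Return (G2, p) for two nested models given their -2 log-likelihoods."""
    g2 = neg2ll_restricted - neg2ll_full
    return g2, chi2_sf(g2, df)

# Hypothetical -2 log-likelihoods for the 1P and 2P graded models
g2, p = likelihood_ratio_test(40250.0, 40180.0, df=20)
```

The same arithmetic underlies the DIF tests themselves: a model that constrains an item's parameters to be equal across age groups is compared with one that frees them, and a significant drop in -2 log-likelihood signals DIF for that item.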

Purification

“The initial explorations for the presence of DIF between the two groups indicated that measurement of depression in late-life and midlife groups was not on a common metric. Thus, a series of 21 alternative models were compared with Model 1 to identify anchor items that were invariant across groups. One item at a time, both discrimination and threshold parameters were constrained to be equal between groups.” (pg 382)

“Each anchor item was also evaluated for whether it would be a valid indicator of the underlying latent variable (depression). Application of linear models from classical test theory such as (corrected) item–total correlations were used to examine the quality of the anchor items in the entire group.” (pg 382)
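As a rough illustration of the corrected item–total correlation mentioned above, the sketch below correlates each item with the sum of the remaining items, so that the item does not inflate its own total. The function names and the toy data are hypothetical and are not taken from the study.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def corrected_item_total(responses, item):
    """Correlation of one item with the total of all *other* items."""
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    return pearson(item_scores, rest_scores)

# Hypothetical toy data: 5 respondents x 4 items scored 0-3 (not study data)
toy = [
    [0, 1, 0, 1],
    [1, 1, 1, 2],
    [2, 2, 1, 2],
    [3, 2, 3, 3],
    [1, 0, 1, 1],
]
r_item0 = corrected_item_total(toy, item=0)
```

An anchor item with a low corrected item–total correlation would be a weak indicator of the latent variable even if it showed no DIF.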

Evidence of uniform DIF

“Uniform DIF occurs when the score of one group is higher or lower than that of another at all levels of theta (as illustrated by the loss of libido item).” (pg 383)

According to the authors, “The ICCs for midlife and late-life groups on the mood item (one of four anchor items) overlap, indicating that the item functioned in the same way in both patient groups. The ICCs for the loss of libido item, however, do not overlap, indicating that the item functioned differently in the two age groups.” (pg 383)

“Three of the items reflected uniform DIF across all levels of depression: loss of libido; weight loss; and, disappointment in self. In two of the three cases (loss of libido and disappointment in self), midlife patients endorsed consistently higher scores than late-life patients. With the third item (weight loss), late-life patients endorsed higher scores, but this was a poor item in general, with low levels of general endorsement (low elevation) and a flat slope (poor discriminating power across levels of depression).” (pg 384)

The somatic anchor items, loss of appetite and fatigability, functioned similarly in both groups, and loss of libido was more characteristic of midlife patients. (pg 384)

Evidence of non-uniform DIF

“Non-uniform DIF occurs when the score of one group is higher than that of another at some points of the severity level and the same or lower at other points.” (pg 384)
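The distinction between uniform and non-uniform DIF can be made concrete with a descriptive check on expected item scores. The sketch below assumes a logistic graded response model and compares two groups' expected-score curves over a grid of theta values. It is not the likelihood ratio procedure the authors used, and the function names and parameter values are hypothetical.

```python
import math

def expected_score(theta, a, thresholds):
    """Expected item score under a graded response model, computed as the
    sum of the cumulative probabilities P(X >= k) for k = 1..K."""
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds)

def dif_pattern(params_a, params_b, grid=None, tol=1e-6):
    """Classify the difference between two groups' expected-score curves.

    params_a, params_b -- (discrimination, thresholds) for each group
    Returns "uniform" if one group is higher everywhere on the grid,
    "non-uniform" if the sign of the difference changes, else "none".
    """
    grid = grid or [x / 10.0 for x in range(-30, 31)]
    diffs = [expected_score(t, *params_a) - expected_score(t, *params_b) for t in grid]
    has_pos = any(d > tol for d in diffs)
    has_neg = any(d < -tol for d in diffs)
    if has_pos and has_neg:
        return "non-uniform"
    if has_pos or has_neg:
        return "uniform"
    return "none"

# Same slope, shifted thresholds: one group scores higher at every theta
shifted = dif_pattern((1.5, [-1.0, 0.0, 1.0]), (1.5, [-0.5, 0.5, 1.5]))
# Different slopes, same threshold: the curves cross
crossing = dif_pattern((2.0, [0.0]), (0.8, [0.0]))
```

Because the check runs on a finite grid, "uniform" here only means that no crossover occurs within the plotted range of theta.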

“The patterns of DIF found in the self-accusation and the sleep disturbance items were different: With self-accusation, more DIF occurred at lower levels of depression, whereas sleep disturbance showed a pattern in which more DIF occurred at higher levels of depression. The finding with this latter item is consistent with other work, suggesting that sleep disturbance is an important signal of depression for older adults. The present DIF analyses interpreted the crossover in the ICCs in non-uniform DIF in a more descriptive way. The significance of the crossover point in non-uniform DIF, however, could be evaluated more rigorously by using confidence intervals or “envelopes” that account for the density of the sample at different points. Confidence envelopes provide a description of the sampling variation of item-response curves in the space of the fitted functions and more systematically assess the degree of overlap (or lack thereof) between the ICC curves for two groups.” (pg 384)

Non-uniform DIF was the most common finding (8 of 11 items), and it took two opposite patterns. In the first, midlife patients endorsed higher raw scores at low to average levels of depression, with fewer differences (or even a crossover) at high levels of depression; this pattern was observed for self-criticism, social withdrawal, irritability, guilt feelings, and sense of failure. In the second, both groups endorsed similar scores at low levels of depression, with late-life patients endorsing higher scores at more severe levels of depression; this pattern was observed for sleep disturbance, somatic preoccupation, and work inhibition.

The authors report that, “late-life patients tended to report fewer cognitive symptoms on the BDI (e.g., disappointment in self, self-criticism, guilt, and sense of failure), especially at low to average levels of depression. Conversely, they tended to report more somatic symptoms (e.g., sleep disturbance, somatic preoccupation, weight loss), especially at higher levels of depression, although this was not true with every somatic item, because loss of appetite and fatigability served as anchor items that functioned the same in both groups and loss of libido was more characteristic of midlife patients.” (pg 384)

Magnitude of DIF

“The ICCs for the self-accusation (DIF value = 1.94; percentage of the total differential test functioning = 7.7%) and sleep disturbance (DIF value = 2.27; percentage of the total differential test functioning = 9.0%) items illustrate non-uniform DIF. The self-accusation and the sleep disturbance items showed the second and fourth largest DIF among the items in the BDI.” (pg 384)

“The loss of libido item showed the largest DIF (DIF value = 3.13; percentage of the total differential test functioning = 12.5%), and the ICCs indicate that the item was consistently difficult for late-life patients to endorse regardless of the severity of their depression.” (pg 383)
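The reported DIF values and percentages can be cross-checked with simple arithmetic. The sketch below assumes that each percentage is the item's DIF value divided by the sum of DIF values over all items; that total is not given in this summary, so it is inferred here from the three items whose values are quoted above.

```python
# (DIF value, reported percentage of total differential test functioning)
reported = {
    "loss of libido": (3.13, 12.5),
    "sleep disturbance": (2.27, 9.0),
    "self-accusation": (1.94, 7.7),
}

# Implied total = DIF value / (percentage / 100), under the stated assumption
implied_totals = {
    item: value / (pct / 100.0) for item, (value, pct) in reported.items()
}
```

All three items imply a total of roughly 25, which suggests that the reported values and percentages are mutually consistent under this assumption.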

Impact of DIF

“Eleven items (Items 21, loss of libido; 16, sleep disturbance; 19, weight loss; 8, self-accusation; 7, self-dislike; 12, social withdrawal; 20, somatic preoccupation; 11, irritability; 15, work inhibition; 5, guilt feelings; and 3, sense of failure) of the 21 items in the BDI each accounted for 5% or more of the total differential test functioning. Six items (Items 6, sense of punishment; 10, crying; 14, distorted body image; 13, indecisiveness; 2, pessimism; and 9, suicidal wishes) accounted for 1%–5% of the total differential test functioning. The 4 anchor items performed the same in both groups. Thus, approximately half the items (11 of 21) on the BDI accounted for about 80% of the differential test functioning.” (pg 384)


“Using the adjusted cutoffs for late-life patients lowered the false negative rate by more than half (9.6% to 4.6% of clinically diagnosed patients whose scores on the BDI would have placed them in the nondepressed range). The percentage of severely depressed late-life patients also decreased (17.0% to 13.3%), and the percentage of moderately depressed patients increased (41.3% to 49.5%). The general point is that cutoff scores adjusted on the basis of our IRT model had a significant impact on assignment to these varying levels of severity of depression.” (pg 385)
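To put these percentages in concrete terms, the sketch below converts them to approximate patient counts. It assumes that each percentage is based on all 218 late-life patients, which the quoted passage implies but this summary does not state outright. The adjusted cutoff scores themselves are not reproduced here.

```python
N_LATE_LIFE = 218  # late-life sample size reported in the summary

def approx_count(percent, n=N_LATE_LIFE):
    """Nearest whole number of patients corresponding to a percentage of n."""
    return round(percent / 100.0 * n)

false_neg_before = approx_count(9.6)   # 21 patients
false_neg_after = approx_count(4.6)    # 10 patients
relative_reduction = (9.6 - 4.6) / 9.6  # about 0.52, i.e. "more than half"

severe_before, severe_after = approx_count(17.0), approx_count(13.3)      # 37, 29
moderate_before, moderate_after = approx_count(41.3), approx_count(49.5)  # 90, 108
```

Under this assumption, the adjusted cutoffs move about 11 clinically diagnosed patients out of the nondepressed range.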

 

Strengths:

According to the authors, the analyses provide additional perspective on the newer revisions of the BDI, such as the BDI-II (Beck, Steer, & Brown, 1996) and the BDI-PC for primary care settings (Beck, Guth, Steer, & Ball, 1997).

This is a very strong paper, with an excellent review, explication, and execution of the likelihood ratio approach to DIF detection. 

1. The differences documented in the present analyses emphasize that researchers and clinicians must be mindful of age as a factor that may contribute to variability in test scores. The possibility that some of the discrepancies in the research literature concerning the prevalence of depression in older versus younger cohorts may be the results of differential functioning of the BDI for late-life patients is highlighted. 

2. Sample size was adequate for the exercise.

3. Both uniform and non-uniform DIF were examined.

4. Purification was performed. 

5. Possible reasons for age-related differences in the performance of the BDI are discussed.

6. The use of adjusted cutoff scores for the late-life group as a way of adjusting for DIF in the BDI was discussed, and cutoff scores were provided. The obvious caveat (as discussed by the authors) is that such adjustments are sample dependent and may not cross-validate.

Possible Limitations: 

There are no obvious limitations to these analyses, except that this method does not permit inclusion of covariates.  

Key references:

1. BECK AT, GUTH D, STEER RA, BALL R: Screening for major depression disorders in medical inpatients with the Beck Depression Inventory for Primary Care. Behaviour Research & Therapy (1997) 35:785–791.

2. BECK AT, STEER RA: Manual for the Beck Depression Inventory. San Antonio, TX: Psychological Corporation (1993).

3. BECK AT, STEER RA, BALL R, RANIERI WF: Comparison of Beck Depression Inventories–IA and –II in psychiatric outpatients. Journal of Personality Assessment (1996) 67:588–597.

4. BECK AT, STEER RA, BROWN GK: Manual for the Beck Depression Inventory–II. San Antonio, TX: Psychological Corporation (1996).

5. BECK AT, STEER RA, GARBIN MG: Psychometric properties of the Beck Depression Inventory: Twenty-five years of evaluation. Clinical Psychology Review (1988) 8:77–100.

6. SANTOR DA, RAMSAY JO, ZUROFF DC: Nonparametric item analyses of the Beck Depression Inventory: Evaluating gender item bias and response option weights. Psychological Assessment (1994) 6:255–270.

7. STEWART RB, BLASHFIELD R, HALE WE, MOORE MT, MAY FE, MARKS RG: Correlates of Beck Depression Inventory scores in an ambulatory elderly population: Symptoms, diseases, laboratory values, and medications. Journal of Family Practice (1991) 32:497–502.

8. TALBOTT MM: Age bias in the Beck Depression Inventory: A proposed modification for use with older women. Clinical Gerontologist (1989) 9(2):23–35.

9. GALLAGHER D, BRECKENRIDGE J, STEINMETZ J, THOMPSON L: The Beck Depression Inventory and research diagnostic criteria: Congruence in an older population. Journal of Consulting and Clinical Psychology (1983) 51:945–946.
