Why Validity Isn’t As Important As You May Think (When it Comes to 360 Assessments) – And 6 Best Practices Instead!

360 Assessments

We’re occasionally asked about the statistical validity and reliability of the 360 Assessment process, from the forethought behind drafting the items to the analysis after receiving the data. They’re common questions when it comes to 360s – Are my results valid? What does my data mean in the grand scheme of things?

Although such statistical measurements as validity and reliability are relevant, and even required, for other types of assessments that measure specific skills and/or knowledge – say a U.S. History exam that should accurately determine your familiarity with the Civil War – they’re not necessarily relevant for 360 Assessments. This is because, by design, a 360 assessment is a more subjective measurement of the perceptions and opinions of multiple people who view the world (and the behaviors of others) through different lenses.

Of the assortment of tests that can be run to “prove” statistical validity or reliability, many don’t quite pertain to 360 assessments. Here’s why:

  • Test-Retest Reliability – This analysis must be done over a short period of time to see whether ratings remain stable. However, the whole point of a 360 assessment (and subsequent follow-up assessments) is IMPROVEMENT! We don’t want the ratings to remain the same – we want positive change in behaviors, perceptions, and performance.
  • Internal Consistency Reliability – For an assessment to demonstrate this sort of reliability, items within a competency or category must be rated consistently. However, it is actually quite likely that a rating for one area of communication, for example, will be much higher than a rating for a different area. These differences in ratings are (by design) exactly what we want to see, because they help the leader pinpoint strengths and areas in which they can improve.
  • Inter-Rater Reliability – This test of reliability provides another prime example of a situation in which what some statisticians would deem “inaccurate” or “meaningless” is actually something we treasure in the data! The differences in raters’ perspectives are not treated as statistical errors, but as valuable information that makes the feedback even more reliable and gives it deeper meaning.
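
For readers curious what an internal consistency test actually computes, here is a minimal Python sketch of Cronbach’s alpha, one common internal-consistency statistic. The ratings are invented for illustration, and the function name `cronbach_alpha` is our own, not part of any assessment tool. It simply shows that when items within a category do not move together across raters, the statistic comes out low (it can even go negative).

```python
from statistics import pvariance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a table of ratings.

    ratings: list of rows, one row per rater, one column per item.
    """
    k = len(ratings[0])                      # number of items
    items = list(zip(*ratings))              # transpose: one tuple per item
    item_var_sum = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical data: 4 raters scoring 3 communication items on a 1-5 scale.
# Raters who score one item high also score the others high.
consistent = [[4, 4, 5], [3, 3, 4], [5, 5, 5], [2, 2, 3]]

# Raters see clear differences between the three items.
mixed = [[5, 2, 4], [4, 3, 2], [5, 1, 3], [3, 4, 5]]

print("consistent items:", round(cronbach_alpha(consistent), 2))
print("mixed items:     ", round(cronbach_alpha(mixed), 2))
```

In a knowledge test, the second result would be read as a flaw in the instrument. In a 360, the same pattern may just mean the leader behaves differently across the behaviors being rated.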

So, if these three tests ultimately fail when it comes to proving validity of 360 assessments, what can we rely on instead? Here are some of our best practices:

    1. Ensure questions are psychometrically sound – Avoid posing any questions that could be ambiguous or vague. Measure one behavior in each question so that raters are not forced towards the middle of the scale. With a two-part question, the person being rated might excel in one of those areas but need development in the other. The rater is left to “ride the fence” on the rating, which makes the “average” rating less relevant.
    2. Select larger groups of raters – Validity will be stronger with 15-20 raters than with 3-5 raters.
    3. Write questions with “face validity” in mind – When the questions and competencies appear to make sense for the job level and purpose of the assessment, “face validity” has been achieved.
    4. Identify the full range of behaviors/skills believed to represent how your organization defines successful leadership.
    5. Customize scale anchors – While we recommend customizing everything for your particular organization, the design of the scale – in particular – can limit ambiguity. Rather than using anchors of “agreement” or “frequency,” consider using anchors that are clearer and linked to the employees’ demonstrated behaviors.
    6. Ensure the items are actionable – Don’t attempt to measure one’s belief about why a leader demonstrates an action; merely measure whether the action is being demonstrated.
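
To put rough numbers on practice 2, here is a small Python sketch of how the uncertainty in an average rating shrinks as the rater group grows. It uses the textbook standard error of a mean (the spread of the ratings divided by the square root of the number of raters), which assumes raters respond independently. The spread of 1.0 scale point and the group sizes are assumptions chosen for illustration, not figures from real assessments.

```python
import math


def standard_error(rating_sd, n_raters):
    """Standard error of the mean rating for a group of independent raters."""
    return rating_sd / math.sqrt(n_raters)


# Assumption for illustration: ratings spread about 1 point on the scale.
ASSUMED_SD = 1.0

for n in (4, 16):
    print(f"{n} raters -> standard error {standard_error(ASSUMED_SD, n)}")
# prints:
# 4 raters -> standard error 0.5
# 16 raters -> standard error 0.25
```

Under these assumptions, quadrupling the rater group halves the uncertainty around the average, so a single unusually harsh or generous rater sways the result far less.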