Testing Behavior Tests

Just how accurate are behavioral assessments?
By Jessica Hekman DVM, MS, October 2014

It’s an almost impossible situation. Shelters need to avoid putting an aggressive dog up for adoption, but how can they discover that dog’s true behavior? Nine-and-a-half times out of 10, they have no information about the dog’s behavior in a home environment, or in any other environment, for that matter. Too often, overworked and undertrained staff members are left to make a decision after interacting with a dog for less than an hour. A mistake in one direction can mean that a new adopter is bitten, perhaps badly. A mistake in the other can mean that a good dog doesn’t get a home or, even worse, is needlessly euthanized.

In an effort to improve the odds, many shelters use behavioral assessment protocols, tests that place a dog in a series of situations that are meant to simulate challenges he might encounter in a home: pinching his flank to mimic harassment by a child, introducing a person in a funny hat to test his tolerance for a wide range of human appearances, exposing him to another dog to see if he is aggressive to his own species.

These tests are, of course, a series of approximations of actual situations. We don’t know if these approximations, no matter how carefully designed, successfully trigger aggressive behavior in truly aggressive dogs, or if they successfully avoid triggering aggressive behavior in safe dogs. But that’s what science is for, right? Testing the world to see if our predictions are correct? And in fact, interest in shelter research has taken off over the past decade. As a consequence, shelter behavior researchers are coming to grips with a pressing question: can these tests be relied upon?

The two most widely used behavioral assessment tools in the United States today are SAFER (developed by Emily Weiss, PhD, of the ASPCA) and Assess-a-Pet (developed by Sue Sternberg of Rondout Valley Animals for Adoption). In 2012, Sara Bennett, DVM—at the time, a resident in a shelter behavior program—asked whether these two tests, applied to pet dogs with known behavioral problems, could successfully categorize safe and unsafe dogs. (Bennett et al. 2012) Her goal was to validate the two assessments, to prove that their results mean what we think they mean. In other words, if they say a dog is safe, the dog actually is safe. And, on the flip side, if they say a dog is not safe, then that dog is indeed not safe.

To do this, Bennett recruited dogs from the veterinary clinic where she worked, including dogs with known behavior problems. In order to compare SAFER and Assess-a-Pet to an assessment tool she could trust, she asked all the owners to complete a Canine Behavioral Assessment and Research Questionnaire (C-BARQ). This questionnaire, a widely used method for determining a dog’s temperament, is based on information from the person who knows the dog best: the owner. C-BARQ’s ability to predict a dog’s temperament has previously been validated. (Hsu and Serpell 2003)

Bennett asked: are SAFER and Assess-a-Pet as good as this validated questionnaire at detecting unsafe dogs? That is, are the associations between these tests’ scores and the C-BARQ scores better than chance? And if so, is the association strong enough that these tests can be trusted to consistently give accurate answers?

She found that the answer to all these questions was clearly “no.” On the one hand, Assess-a-Pet and C-BARQ agreed 73 percent of the time when they classified a dog as aggressive. Assuming that C-BARQ was correct and these were truly unsafe dogs, that’s not a bad success rate. However, the test didn’t do so well in the other direction: Assess-a-Pet incorrectly classified 41 percent of nonaggressive dogs as aggressive.

This high rate of finding aggression where it probably didn’t exist is concerning because, in a shelter environment, it could lead to euthanasia of animals who are, in reality, safe to place in a home. Technically, Assess-a-Pet was validated by this study because its agreement with the C-BARQ was better than random chance. But it didn’t do very much better than chance, so its utility in making life-or-death decisions is questionable. A test that gives you a 60/40 rather than 50/50 chance of making the right choice would seem to be of marginal value.
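
To see why a 41 percent false-positive rate matters, it can help to work through the arithmetic for a hypothetical intake group. The sketch below takes the two Assess-a-Pet figures quoted above (73 percent agreement on aggressive dogs, 41 percent of nonaggressive dogs flagged) and applies them to an assumed group of 100 dogs, 10 of whom are truly aggressive. The group size and the 10 percent share are illustrative assumptions, not numbers from the study, and the function name is hypothetical.

```python
def flagged_counts(n_dogs, share_aggressive, hit_rate, false_alarm_rate):
    """Expected numbers of dogs flagged as aggressive, rightly and wrongly."""
    aggressive = n_dogs * share_aggressive
    safe = n_dogs - aggressive
    true_flags = aggressive * hit_rate        # aggressive dogs the test catches
    false_flags = safe * false_alarm_rate     # safe dogs the test flags anyway
    share_of_flags_correct = true_flags / (true_flags + false_flags)
    return true_flags, false_flags, share_of_flags_correct


# ASSUMPTION: 100 dogs, 10 percent truly aggressive (illustrative only).
# 0.73 and 0.41 are the Assess-a-Pet figures quoted in the article.
true_flags, false_flags, share_correct = flagged_counts(100, 0.10, 0.73, 0.41)

print(f"Correctly flagged:   {true_flags:.1f}")
print(f"Incorrectly flagged: {false_flags:.1f}")
print(f"Share of flags that are correct: {share_correct:.0%}")
```

Under these assumed numbers, roughly 37 of the 44 or so dogs flagged as aggressive would in fact be safe. A different assumed share of aggressive dogs would shift the totals, so this is an illustration of the mechanism rather than an estimate for any real shelter.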

SAFER did even worse. Its agreement with the C-BARQ was so close to chance that this assessment was determined to be not valid. When the C-BARQ found a dog to be aggressive, SAFER agreed only 60 percent of the time. And when the C-BARQ found a dog to be not aggressive, SAFER agreed only 50 percent of the time; there was a 50/50 chance that a safe dog would be recognized as such.
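
One way to put "close to chance" into a single number is Youden's J statistic: add the rate of agreement on aggressive dogs to the rate of agreement on nonaggressive dogs, then subtract 1. A coin flip scores 0 and a perfect test scores 1. This is not the statistic Bennett's study reported; it is a simple stand-in applied to the percentages quoted in this article, reading each "agreed" figure as the share of C-BARQ-classified dogs the test classified the same way. The function and variable names are hypothetical.

```python
def youden_j(agree_on_aggressive, agree_on_safe):
    """Youden's J: 0 means no better than chance, 1 means perfect."""
    return agree_on_aggressive + agree_on_safe - 1


# Figures as quoted in the article. Assess-a-Pet flagged 41 percent of
# nonaggressive dogs, so it agreed on the remaining 59 percent.
tests = {
    "Assess-a-Pet": (0.73, 1 - 0.41),
    "SAFER": (0.60, 0.50),
}

scores = {name: youden_j(agg, safe) for name, (agg, safe) in tests.items()}

for name, j in scores.items():
    print(f"{name}: J = {j:.2f}")
```

On this rough measure Assess-a-Pet lands at about 0.32 and SAFER at about 0.10, which matches the article's ordering: both are far from a perfect score, and SAFER is barely above a coin flip.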

These are pretty chilling results. They could be interpreted to mean that the two most widely used behavioral assessments in the United States are not doing even a passable job of predicting aggression, and that shelters are not doing much more than flipping a coin when they use an assessment to decide whether a dog will be put on the adoption floor or, potentially, euthanized.

While this study gave us some compelling information, it isn’t the last word in whether these two tests actually work in shelters. Remember that while behavioral assessment tests are intended to be used on dogs who have been in a shelter environment for days, weeks or months, Bennett’s study tested owned animals. It may not be realistic to extrapolate these assessments’ performance when applied to shelter dogs, most of whom have been living in incredibly stressful environments for extended periods of time.

This may sound like a finicky point, but a dog’s reaction to any sort of stimulus can be exquisitely responsive to the situation he’s in. I don’t think this study provides a final answer on whether these tests work or don’t work. I do think, however, that it gives us some very important information that should be taken seriously, and that it demands follow-up studies.

How Hard Is It to Test a Test?
Testing a test is hard because duplicating real-world situations in a research environment is hard. Bennett’s study was well designed, but it was inevitably limited by some realities. It is surprisingly difficult to design a study that truly tests whether a given behavioral assessment succeeds at predicting dog behavior.

Ideally, such a study would incorporate a large number of dogs as they come into a shelter. This group would then go to the adoption floor in its entirety; dogs whom the shelter suspected of being aggressive would not be removed from the group. Once the dogs were adopted, their new owners would participate in multiple interviews over a long period of time. Such a study would allow us to really get at the question of how many dogs the assessment correctly assigned to the categories of safe and unsafe, and how many it assigned incorrectly.

Of course, actually running a study like this presents a number of problems, the biggest being ethical. If you suspect that an animal is aggressive, can you ethically place it into a household? Of course you can’t. But without doing that, how can you know whether your suspicions of aggression will be borne out? This problem—the importance of not endangering adopters—represents the core difficulty in evaluating the accuracy of behavioral assessments.

There are plenty of practical problems, too. Shelters have their hands full dealing with normal day-to-day matters; supporting large-scale studies can be asking too much of an overburdened system. And owners are hard to pin down for follow-up interviews. They don’t really like to answer survey questions, which are annoying and boring and always seem to come at inconvenient times. Then there are those who adopt dogs but no longer have them; it’s an uncomfortable situation and they can be particularly difficult to get information from, yet they can potentially offer the most important insights.

Some researchers, hoping to do better, have designed new studies from scratch. Shortly after the SAFER/Assess-a-Pet validation study was published, Kate Mornement, a practicing behaviorist studying behavioral testing as part of her PhD program, described the Behavioural Assessment for Rehoming K9’s, or B.A.R.K. (Mornement et al. 2014) Whereas SAFER and Assess-a-Pet were created before the upsurge in shelter research studies, B.A.R.K. was developed with input from nine experts on canine behavior, people familiar with the problems encountered by other assessment designs.

To determine if B.A.R.K. was more successful than the older tools in assessing behavior, 102 shelter dogs were tested. Then, two to eight months after adoption, owners were asked general questions about their new dogs: how anxious, fearful, friendly, active and compliant were they? Unfortunately, there was little correlation between their responses and the dogs’ B.A.R.K. scores. The test just didn’t do a very good job of predicting how these animals would act in a home.

As Mornement recognized, this study was deeply hampered by the selection of dogs who were tested. For safety reasons, dogs with known aggression issues were excluded from the study. As a result, B.A.R.K. was applied to a group of dogs who were very likely to be non-aggressive. So, while it’s hard to tell how this test does at specifically predicting aggression, its difficulty predicting fear and anxiety is concerning, and provides reason to doubt that any assessment can do the job well.

Pass/Fail
Ultimately, we don’t really know which factors make a test succeed or fail in predicting canine aggression. The previously discussed studies all take the results of the entire test battery as a single score; none attempts to understand the individual components of that score. However, each sub-test in the test battery is intended to get at a different part of the dog’s temperament, and it’s possible that some of these sub-tests succeed while others fail. Perhaps, for example, a test battery successfully predicts food aggression (one sub-test) but fails to predict dog-dog aggression (a different sub-test).

Recent studies have started looking at these individual sub-tests. Researchers at the ASPCA (Mohan-Gibbons et al. 2012) specifically assessed one of the most controversial sub-tests, food guarding. In this test, a fake hand is used to touch the dog’s bowl while he is eating, and then to take the food bowl away. Problematic reactions range from freezing and a hard stare to growling or biting the fake hand. In this study, 96 dogs determined by the SAFER assessment to have food-guarding issues were adopted out. Adopters were given information on how to manage and modify the dogs’ behavior.

When adopters were contacted up to three months after adoption, only six reported any aggression over food, and that aggression was transient. Even more interesting, adopters reported that they had essentially ignored the management and modification techniques recommended by the shelter. They had felt free to touch their dogs while the dogs were eating, and to take the dogs’ food away. They had not been bitten.

This was a really stunning revelation: of 96 dogs who had tested positive for food aggression, only six displayed it in their new homes. This raised more interesting questions: Is it possible that dogs are showing food aggression in the shelter due to stress? Is food-aggression testing completely useless?
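
The proportion behind that result is easy to check. The sketch below uses only the two counts given in the article (96 dogs flagged for food guarding, six reported as showing food aggression at home); the variable names are hypothetical.

```python
# Counts as reported in the article for the ASPCA food-guarding study.
flagged_in_shelter = 96
aggressive_at_home = 6

share_confirmed = aggressive_at_home / flagged_in_shelter
share_not_confirmed = 1 - share_confirmed

print(f"Flagged dogs reported as food aggressive at home: {share_confirmed:.2%}")
print(f"Flagged dogs with no reported food aggression:    {share_not_confirmed:.2%}")
```

In other words, about 6 percent of the dogs who tested positive in the shelter were reported as food aggressive by their adopters, and about 94 percent were not.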

A follow-up study performed at the Center for Shelter Dogs in Boston, Mass., dug deeper into the question. (Marder et al. 2013) It tracked dogs who did and did not test as food aggressive in the shelter, and followed them longer than the ASPCA study. The analysis in this study is really fascinating. The researchers asked the new owners if their dogs were food aggressive and, overwhelmingly, were told no. Then they asked more specific questions, such as, “Does your dog growl when you pick up his food?” Well, yes, the adopters said, but that wasn’t a big deal. This study, in other words, found that while the test may be successfully predicting food-guarding behavior, that behavior seems to very rarely escalate into true aggression, and isn’t considered a problem by the vast majority of adopters.

Asking Better Questions
Research into shelter behavioral assessments seems to have finally found its footing with these recent studies. This new approach—investigating specific behaviors and asking adopters what kinds of behavioral problems really matter to them—may aid in designing more effective tests. Or maybe they’ll just help us better interpret the behavioral assessments we already have. Either way, behavioral assessment researchers have a clear path to follow.

In the meantime, how should we interpret existing behavioral assessments? Here are two cautionary tales about extreme ends of the spectrum; they come from time I spent in two different shelters during my shelter medicine veterinary internship. In one shelter, I was handling a young mixed-breed dog who ripped open the fake hand that was used to take her food bowl away. If that had been my hand, I would have been in the emergency room. Despite my reservations about the validity of behavioral assessments, I took that particular act of aggression very seriously.

In another shelter, I observed a behavioral assessment in which a dog was repeatedly harassed with a fake hand because the shelter staff had a suspicion that he would bite. As the tester continued to provoke him long after this sub-test would normally have ended, the dog froze, then growled, then finally bit the hand, but not hard enough to damage it. Despite his restraint in the face of persistent harassment, he was labeled as aggressive by the shelter staff. In both instances, the dogs were euthanized.

Not all cases are as clear as these two, but I think there’s something to be learned from them. Shelter behavioral assessments can give us useful insights into the behavior of our charges, but they are not the final word. Even those who design behavioral assessments caution against taking these tests as black-and-white answers to the question of whether or not to put a dog up for adoption, and we must be very careful to abide by that recommendation.

Even in the chaotic world of a shelter, time must be taken to consider all of the information available about a dog. We must do so generously, giving the dog every chance to succeed, and cautiously, providing prospective adopters with all the information we can.

In the world of shelter research, we must continue to ask more, and more detailed, questions about these tests. Not just, do they succeed or fail at predicting aggression, but why they succeed or fail, how they work, what they test. We also need to determine what adopters actually want from their pets, not what we think they want.

There is a lot of work to do.

Jessica Hekman, DVM, MS, completed a specialty internship in shelter medicine and now studies the genomics of dog behavior at the University of Illinois, Urbana.