Assessing your ability to measure behaviour effectively

To ensure that the data collected are as accurate as possible and that the interpretations drawn from them are robust, it is important to have an idea of how ‘good’ the measurements were, how biased they may be and, consequently, how representative they are of what was actually observed (Martin and Bateson, 2007).

To be valuable for behavioural measurement, instruments must be both valid and reliable (Sinn et al., 2010). If the instrument measuring behaviour is a human observer, they too must be reliable and valid!


Validity indicates how accurately the research and the methods used to measure behaviour represent reality (i.e. the population observed). In other words, it indicates how far the method we are using measures what it claims to measure, as opposed to something else.

Example. We want to know the number of hours that pet dogs spend sleeping. We decide to count the hours that the animals spend lying down with their eyes closed. However, dogs may also lie down with their eyes closed without sleeping (e.g. when they rest but are awake). Therefore this is not a valid method for recording the behaviour that we want to observe.

It should be remembered that the validity of a study is affected by the methodology used (data collection, analysis and interpretation of results) (Lehner, 1996). The example above shows how the method of data collection may impair the study. Another cause of invalid results is failing to recognise a confounding factor when interpreting the results.

Example. We want to know the effect of a new psychoactive medication on fear in dogs. We set up a study in which fearful dogs are treated with the medication in addition to behavioural therapy. However, we do not consider that owners may change their behaviour if they know that their dog is being medicated, influencing the outcome of the therapy (e.g. they may feel more committed and follow the behavioural therapy more accurately). In this case both the methodology (having the owners not “blind” to the therapy) and the wrong interpretation of results (not considering the effect of the owner) may invalidate the study.

To decide whether the measure is valid, two separate features should be considered.

  • Accuracy

Measurements are accurate if they are free from systematic errors (Martin and Bateson, 2007).        

Example. Imagine that you want to evaluate rabbits’ preference between two kinds of carrier. You provide a plastic carrier and a wire carrier to a group of rabbits and measure the time that the animals spend inside each of them and the number of times they enter each carrier. However, you do not notice that the wire carrier is in more shade than the plastic carrier and that data collection is occurring on a warm day. This may lead to a systematic error, as the rabbits may spend more time in the wire carrier to gain access to shade rather than because of a true preference for the wire carrier. The error is called “systematic” because it influences all the measurements in the same way.
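The effect of a systematic error can be illustrated with a short simulation. The sketch below is illustrative only: the function name, the "true" time spent in the wire carrier and the size of the shade effect are all assumed values, not data from a real study.

```python
import random
import statistics

def simulate_minutes_in_wire_carrier(n_rabbits, true_minutes, shade_bias, seed=0):
    """Return simulated minutes spent in the wire carrier, one value per rabbit.

    true_minutes: hypothetical time spent if both carriers were equally shaded.
    shade_bias:   hypothetical extra minutes caused by the shade (systematic error).
    """
    rng = random.Random(seed)
    # Every rabbit receives the same shade_bias, plus a little individual variation.
    return [true_minutes + shade_bias + rng.gauss(0, 1.0) for _ in range(n_rabbits)]

unbiased = simulate_minutes_in_wire_carrier(1000, true_minutes=20.0, shade_bias=0.0)
biased = simulate_minutes_in_wire_carrier(1000, true_minutes=20.0, shade_bias=5.0)

# Both runs use the same seed, so the individual variation is identical and
# the difference between the two means is the systematic error itself.
shift = statistics.mean(biased) - statistics.mean(unbiased)
print(f"Mean without shade effect: {statistics.mean(unbiased):.2f} min")
print(f"Mean with shade effect:    {statistics.mean(biased):.2f} min")
print(f"Shift caused by the systematic error: {shift:.2f} min")
```

Because the same amount is added to every measurement, observing more rabbits does not reduce the shift. Only changing the set-up (for example, shading both carriers equally) would remove it.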

  • Specificity

The more specific a measurement is, the more it measures only the variable (e.g. behaviour) that you want to observe and nothing else (Martin and Bateson, 2007).   

Example. Imagine that you want to observe how the colour of feathers influences female common pet parakeets’ choice of a male. When observing a group of parakeets, you can decide to measure the time that females spend with each male and look for a correlation in your measurements. However, in this type of observation, the time spent together by each pair may be influenced not only by the colour, but possibly also by other features, e.g. chemical stimuli, vocalisations, behaviour, temperament.

Task: Watch the following video of cats grooming and measure the total time spent licking fur using A) a stopwatch, B) your mobile phone and C) the second hand on a watch. Compare the results and reflect on how the choice of measurement tool influences validity.


Reliability gives an idea of how repeatable and consistent a measure is, i.e. how free it is from random errors. A number of factors are relevant for an observation to be considered reliable.

  • Precision

Precision refers to the absence of random errors in the measurement (Martin and Bateson, 2007). Random errors do not affect all measurements in the same way but occur unpredictably in some of them, e.g. because the experimental conditions are not properly controlled and unplanned events can occur.

Example. You want to know if pet dogs like a new toy (e.g. a rubber ball). You decide to set up a preference test in a park. The problem is that you cannot control the situation in that environment, so your dogs may be distracted during the test by other dogs, people, noises, smells etc. These events would happen randomly across your sample and add random error to your results.
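Random error shows up as scatter around the true value rather than as a consistent shift. The following sketch compares a hypothetical controlled room with a park. The play time of 60 seconds and the two noise levels are assumptions chosen purely for illustration.

```python
import random
import statistics

def simulate_play_seconds(n_dogs, true_seconds, noise_sd, seed):
    """Simulate the seconds each dog spends playing with the toy in one test."""
    rng = random.Random(seed)
    # Random error: each measurement is pushed up or down by an unpredictable amount.
    return [true_seconds + rng.gauss(0, noise_sd) for _ in range(n_dogs)]

# Hypothetical settings: few distractions indoors, many in the park.
controlled_room = simulate_play_seconds(200, true_seconds=60.0, noise_sd=2.0, seed=1)
park = simulate_play_seconds(200, true_seconds=60.0, noise_sd=15.0, seed=1)

sd_room = statistics.stdev(controlled_room)
sd_park = statistics.stdev(park)
print(f"Controlled room: mean {statistics.mean(controlled_room):.1f} s, SD {sd_room:.1f} s")
print(f"Park:            mean {statistics.mean(park):.1f} s, SD {sd_park:.1f} s")
```

The larger standard deviation in the park reflects lower precision: individual measurements stray further from the true value even though no consistent shift has been added.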

  • Sensitivity

Sensitivity is the ability of the instrument that is used for measuring behaviour to register the smallest changes in the real value observed (Martin and Bateson, 2007).         

Example. Imagine that you want to study vocal behaviour of lab rats. In order to do this you need an instrument capable of measuring ultrasound, as rats’ vocalisations include sounds in this range.

  • Resolution

Resolution is the smallest change in the real value observed that can be registered by the instrument used for measuring behaviour (Martin and Bateson, 2007). The aims of each study determine the degree of resolution required for accurate conclusions to be drawn. In some studies it may be appropriate to compare hours spent in an activity, for example the time per month that cats spend hunting, whereas in others you may need to record minutes, seconds or fractions of a second for behaviours of very brief duration.

Example. Imagine that you are observing play behaviour in puppies. If you record your data using a watch, you can record the behaviour that you observe only to the nearest second; if you use a stopwatch, however, you can record changes which occur in fractions of a second.
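The consequence of instrument resolution can be shown with a few lines of code. In this sketch the bout durations are invented, and the instrument is assumed to report only the last completed step (rounding down). A real observer might round to the nearest step instead.

```python
def record_with_resolution(true_duration_ms, resolution_ms):
    """Return the duration an instrument would report, in milliseconds.

    The instrument is assumed to report whole multiples of its resolution,
    so the true duration is rounded down to the last completed step.
    """
    return (true_duration_ms // resolution_ms) * resolution_ms

# Hypothetical durations of three play bouts, in milliseconds.
play_bouts_ms = [2350, 2900, 400]

watch = [record_with_resolution(d, 1000) for d in play_bouts_ms]    # 1 s steps
stopwatch = [record_with_resolution(d, 10) for d in play_bouts_ms]  # 0.01 s steps

print(watch)      # [2000, 2000, 0]
print(stopwatch)  # [2350, 2900, 400]
```

With one-second resolution the first two bouts become indistinguishable and the shortest bout disappears altogether, which is why brief behaviours call for a finer-grained instrument.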

  • Consistency

A measurement of behaviour is consistent if, when repeating the measurement, the same scores are obtained (Martin and Bateson, 2007).

Example. An example of consistency is the use of an instrument, e.g. a scale. A scale is consistent when it gives the same value if the weight of the same dog is measured repeatedly (during one recording session in this case so that measures cannot change for external reasons, e.g. food intake). On the contrary, if the scale gives different weights for the same dog each time we try to measure it, it would not be a consistent instrument.
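One simple way to check an instrument's consistency is to look at the spread of repeated readings of the same animal. The weights below are invented for illustration, and using the range and standard deviation as summaries is an assumption of this sketch, not a prescribed method.

```python
import statistics

def spread_of_repeats(readings_kg):
    """Return (range, standard deviation) of repeated readings of the same dog."""
    return max(readings_kg) - min(readings_kg), statistics.stdev(readings_kg)

# Hypothetical repeated weighings of one dog within a single session.
consistent_scale = [24.1, 24.1, 24.2, 24.1, 24.1]
inconsistent_scale = [23.2, 25.0, 24.4, 22.9, 25.3]

range_good, sd_good = spread_of_repeats(consistent_scale)
range_poor, sd_poor = spread_of_repeats(inconsistent_scale)
print(f"Consistent scale:   range {range_good:.1f} kg, SD {sd_good:.2f} kg")
print(f"Inconsistent scale: range {range_poor:.1f} kg, SD {sd_poor:.2f} kg")
```

A consistent scale is not necessarily an accurate one: it could give the same wrong weight every time.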

When reliability is considered with respect to the people taking measurements, two categories are used: inter-rater reliability and intra-rater reliability.

Inter-rater reliability

Inter-rater reliability (i.e. between-rater reliability) is an index of the extent to which different observers obtain the same results when measuring the behaviour of a given individual in the same way (Sinn et al., 2010). This value is important when a group of observers is involved in the same experiment. For their study to be reliable, it is important that all of them describe the behaviours observed in exactly the same way. Only then can we say that all animals undergo the same experiment and that data can be pooled and compared. If different observers measured slightly different things, we could not be certain whether the differences measured were a true representation of the individuals in the study or were due to differences between observers. Having each observer use the same standardised ethogram can help overcome this problem.
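One widely used index of agreement between two observers who assign behaviours to categories is Cohen's kappa, which corrects the raw proportion of agreement for the agreement expected by chance. The text above does not name a specific index, so its use here is an assumption, as are the behaviour categories and the scores in this sketch.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two observers scoring the same sample intervals."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both observers must score the same non-empty set of intervals")
    n = len(ratings_a)
    # Proportion of intervals on which the two observers agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from how often each observer used each category.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both observers used one identical category throughout
    return (observed - expected) / (1 - expected)

# Hypothetical scores for ten sample intervals of the same video.
observer_a = ["groom", "groom", "rest", "play", "rest", "groom", "play", "rest", "rest", "groom"]
observer_b = ["groom", "rest", "rest", "play", "rest", "groom", "play", "rest", "groom", "groom"]

kappa = cohens_kappa(observer_a, observer_b)
print(f"Cohen's kappa: {kappa:.4f}")  # 0.6875
```

Here the observers agree on 8 of 10 intervals (0.80), chance agreement is 0.36, and kappa is (0.80 - 0.36) / (1 - 0.36) = 0.6875.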

Example. A group of researchers decides to assess the behaviour and health of working equines in a number of developing countries. Because of the large geographic range of the study, 42 observers are employed. To ensure that all the observers measure behaviour in the same way, they all undergo a training course on measuring equid behaviour and their measurements are compared with those of an experienced trainer. Only people who attain more than 80% agreement with the trainer are involved in the study (Burn et al., 2010). This means that all the observers measure the behaviour in a similar way (consistently). Even if their measurements are biased (and thus less valid), they are at least very likely to all be affected by the same bias and thus reliable, although of course it is imperative to try to remove bias to ensure that the study is valid.
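A selection rule of this kind can be sketched as a simple percentage-agreement check. The categories, scores and observer names below are hypothetical, and how agreement was actually calculated in the cited study is not described here. Only the "more than 80%" threshold comes from the example.

```python
def percent_agreement(observer_scores, trainer_scores):
    """Percentage of animals on which an observer gives the same score as the trainer."""
    if len(observer_scores) != len(trainer_scores) or not trainer_scores:
        raise ValueError("Observer and trainer must score the same non-empty set of animals")
    matches = sum(o == t for o, t in zip(observer_scores, trainer_scores))
    return 100.0 * matches / len(trainer_scores)

# Hypothetical scores for the general attitude of ten equids.
trainer = ["alert", "apathetic", "alert", "alert", "alert",
           "apathetic", "alert", "alert", "apathetic", "alert"]
candidates = {
    "observer_1": ["alert", "apathetic", "alert", "alert", "alert",
                   "apathetic", "alert", "alert", "alert", "alert"],      # 9 of 10 match
    "observer_2": ["alert", "alert", "alert", "apathetic", "alert",
                   "apathetic", "alert", "alert", "alert", "alert"],      # 7 of 10 match
}

THRESHOLD = 80.0  # observers must exceed this level of agreement with the trainer
accepted = [name for name, scores in candidates.items()
            if percent_agreement(scores, trainer) > THRESHOLD]
print(accepted)  # ['observer_1']
```

Note that the comparison is strict: an observer with exactly 80% agreement would not be accepted under the "more than 80%" rule.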

Intra-rater reliability

Intra-rater reliability (also known as intra-rater repeatability or test-retest reliability) is an index of the extent to which one observer describes an individual, or several individuals, in the same way when the same observation is repeated at different times (e.g. when behaviour is recorded from the same video several times) or when the observer records the behaviour of an animal in several contexts. It therefore describes the extent to which an individual’s scores generalise across testing occasions (Sinn et al., 2010).

Example. A group of researchers wants to assess the reliability of a test used to assess dogs’ behaviour in a group of shelters. In particular, their aim is to evaluate whether their results are replicated if the observations are repeated after a period of time. Forty observers and 17 rehoming centres are involved. In order to assess intra-rater reliability, the observers repeat their measurements under the same conditions 2 months after the first observation (Diesel et al., 2008). Good intra-rater repeatability would mean, for this study, that the observers assess the dogs in the same way during the first and the second observation.
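One way to put a number on this kind of repeatability is to correlate the scores an observer gave on the first occasion with those given on the second. The sketch below uses a Pearson correlation on invented scores. The choice of statistic is an assumption for illustration and is not necessarily the one used in the cited study.

```python
import math

def pearson_r(first, second):
    """Pearson correlation between two sets of scores for the same animals."""
    n = len(first)
    if n != len(second) or n < 2:
        raise ValueError("Need two score lists of equal length with at least two animals")
    mean_1 = sum(first) / n
    mean_2 = sum(second) / n
    covariation = sum((x - mean_1) * (y - mean_2) for x, y in zip(first, second))
    variation_1 = sum((x - mean_1) ** 2 for x in first)
    variation_2 = sum((y - mean_2) ** 2 for y in second)
    if variation_1 == 0 or variation_2 == 0:
        raise ValueError("Correlation is undefined when all scores in a session are identical")
    return covariation / math.sqrt(variation_1 * variation_2)

# Hypothetical scores (1 to 5) given by one observer to the same eight dogs,
# first at the initial visit and again two months later.
session_1 = [3, 5, 2, 4, 1, 5, 3, 2]
session_2 = [3, 4, 2, 4, 1, 5, 3, 3]

r = pearson_r(session_1, session_2)
print(f"Test-retest correlation: {r:.2f}")
```

A value close to 1 suggests the observer ranked the dogs in much the same way on both occasions. A correlation alone would not reveal an observer who scored every dog one point higher the second time, so it is worth inspecting the raw differences as well.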

Task: Pick any video on this website and have a number of individuals calculate frequency and duration of the behaviours exhibited. Repeat this exercise some time later and compare results for both intra-rater reliability and inter-rater reliability.