Introduction

Hypertension is an important worldwide public health problem and is a major risk factor for cardiovascular disease, the leading cause of death and disability worldwide.1, 2 Testing for hypertension is the most commonly performed screening test in medical practice and is important not least because individuals with a normal blood pressure (BP) at the age of 50 still run a 90% life-time risk of developing hypertension.3 Furthermore, trial evidence indicates that even small differences in systolic BP of 2–4 mm Hg are clinically important, thus accurate measurement is vital.4, 5, 6

Errors in measurement occur through three main sources: device errors, user errors and patient errors.7 Device errors are potentially the most fundamental of these three sources, and, in particular, accurate measurement is underpinned by the requirement for an instrument that measures pressure accurately in patients across the range of pressures.8 For many years, the standard instrument for BP measurement was a mercury sphygmomanometer, but in recent years this has been superseded in many clinical settings by automated electronic devices.

The need to validate the accuracy of such devices has led to the development of various protocols (Box 1).9, 10 These protocols stipulate the minimum testing population and either an acceptable range for the mean and s.d. of errors or an s.d. below a threshold that depends on the mean difference, but they do not specify the proportion of individuals within those populations who should have accurate readings.11 It is therefore possible for a BP monitor to meet the above validation criteria and still record BP measurements in error by >5 mm Hg in half of the individuals assessed.6 It has, therefore, been argued that validation standards should be based, in part, on the percentage of individual readings within an accurate range.12, 13 To understand the potential effect of such a change in policy, we aimed to systematically examine the proportion of accurate readings attained by automatic digital BP devices in published clinical validation studies.

Methods

Eligibility and search strategy

We included all published studies of automatic digital BP devices that (1) validated BP devices using a recognized protocol in a clinic setting or in the community; (2) used a sphygmomanometer with upper arm BP measurement; (3) included adult patients (≥18 years of age) with no severe intercurrent health problems and (4) included data on the mean and s.d. of the BP difference between a calibrated standard sphygmomanometer (mercury or random-zero) and a (semi-)automatic (oscillometric or aneroid) sphygmomanometer. There were no language restrictions.

Reasons for study exclusion were devices used in special clinical situations (for example pregnancy, at altitude, during exercise, in trauma patients, neonates, children and serious diseases), measurements not from upper arm BP monitoring devices (for example wrist and ABPM), comparison with other reference methods (for example direct central pressure and ambulatory intra-arterial pressure) and published papers reporting no mean difference or s.d. We searched Ovid versions of MEDLINE and EMBASE, and the Cochrane Library through Wiley Interscience, from January 1989 to June 2008. Medical Subject Headings and synonym terms used were sphygmomanometer, BP, hypertension, calibration and monitor device. We also searched the reference lists of identified papers and reviews.

Data abstraction

Two authors reviewed the titles and abstracts of all identified articles and selected full-text papers. Articles clearly not meeting the criteria were excluded at this stage. The remaining articles were then reviewed in detail by three authors for inclusion. We extracted data on the percentage of BP measurements within 5 mm Hg, and on the mean and s.d. of the difference between the calibrated standard sphygmomanometer and the (semi-)automatic sphygmomanometer. We also obtained details of the validation process, including study design, sample size, device characteristics and the relevant guidelines used to assess device accuracy.

One author (YW) extracted the data, which were rechecked independently by a second author (CH); disagreements were resolved by discussion with a third author (RP). The reviewers were not masked to any aspect of the studies (for example journal type, author names or institution).

Data analysis

We used Review Manager (RevMan) Version 5.0 (Copenhagen: The Nordic Cochrane Centre, The Cochrane Collaboration, 2008) for the statistical analysis and forest plots (sorted by mean effect and device type for systolic BP and diastolic BP).

We summarized data as the mean and s.d. of differences between measured and observed BP (the observer control measurement used in the protocol) and as the proportion of measurements within 5 mm Hg of observed BP, and ordered the rows of the figures by the percentage of measurements with errors <5 mm Hg. When this percentage was not reported, we replaced it with a bias-corrected modelled value, defined as 0.27+0.76 × P, where P is the percentage of errors <5 mm Hg expected under a normal distribution with the reported mean and s.d.; the bias-correction coefficients (0.27 and 0.76) were derived by linear regression on the trials that reported all outcomes. We analysed the effect of year of protocol by separating studies into four groups (before 1995, 1996–2000, 2001–2005 and after 2006) and used analysis of variance and η² to estimate the amount of variation explained by year of protocol. These calculations were performed using STATA (Intercooled STATA 10 for Windows).
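The modelled value described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the example device (mean 0, s.d. 5 mm Hg) is invented, and the sketch assumes that P enters the correction as a proportion on a 0–1 scale, which the text does not state explicitly.

```python
from statistics import NormalDist


def proportion_within(mean_diff, sd_diff, limit=5.0):
    """Normal-model share of device-minus-observer differences within +/- limit mm Hg."""
    dist = NormalDist(mu=mean_diff, sigma=sd_diff)
    return dist.cdf(limit) - dist.cdf(-limit)


def bias_corrected(p, intercept=0.27, slope=0.76):
    """Apply the reported regression correction; ASSUMES p is on a 0-1 scale."""
    return intercept + slope * p


# Hypothetical device: no systematic error, s.d. of differences 5 mm Hg.
p = proportion_within(mean_diff=0.0, sd_diff=5.0)
corrected = bias_corrected(p)
print(f"{p:.3f} {corrected:.3f}")  # 0.683 0.789
```

Under this reading the correction raises the modelled value, which would be consistent with real error distributions being more peaked around zero than a normal curve.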

We also calculated, for a selection of devices and assuming a normal distribution, the additional number of measurements required to achieve 95% of readings within 5 mm Hg, based on the assumption of a mean difference of zero and a dispersion of 2.5 mm Hg (similar to the top-rated device for each protocol).

Results

From 5912 potentially relevant records, 79 articles (10 783 participants) reporting 113 studies (including subgroups for different devices and protocols) met the eligibility criteria (Figure 1). Of these, 96 studies reported devices that were recommended by the various protocols, 13 reported devices that were not recommended and 14 validated devices in the community (including six not recommended by the protocol).

Figure 1

Flowchart of search results.

Studies were from 22 different countries: US (23); UK (20); France (15); Italy (10); Ireland and Poland (6 each); Japan (5); Canada (4); Russia, Germany and Turkey (3 each); Denmark, Greece, Spain and Switzerland (2 each); and Argentina, Brazil, China, Israel, Netherlands, Sweden and Tanzania (1 each).

Studies used five different protocols for validating devices: 35 used the ESH international protocol (ESH-IP), 9 the BHS protocol, 19 the ANSI/AAMI protocol and 22 both the BHS and ANSI/AAMI protocols. The results for the BHS, AAMI and ESH-IP protocols are presented in Figures 2, 3 and 4, respectively. A further 16 studies used different protocols: 3 used the American Heart Association BP measurement guidelines, 2 used the Health Insurance Portability and Accountability Act protocol and 11 did not state which protocol they used. All 16 were excluded from the present analysis, and on re-analysis four aneroid devices were also removed.

Figure 2

Means and s.d. of errors in systolic BP in devices passing, or failing (marked dagger), the British Hypertension Society protocol.

Figure 3

Means and s.d. of errors in systolic BP in devices passing, or failing (marked dagger), the AAMI protocol.

Figure 4

Means and s.d. of errors in systolic BP in devices passing, or failing (marked dagger), the ESH-IP protocol.

For each study, the citation, device tested and mean and s.d. of differences are shown, together with the reported proportion of measurements within 5 mm Hg of the observed value. Some studies and/or devices appear more than once, either because results were reported using different sampling schemes and/or protocols or because more than one group validated a device.

Overall, 25/31 (81%, 95% CI: 67–95%), 37/41 (90%, 95% CI: 81–99%) and 34/35 (97%, 95% CI: 92–99.5%) of devices passed the relevant protocol (BHS, AAMI and ESH-IP, respectively). No particular pattern in device accuracy was seen by date of assessment or by manufacturer (data not shown).
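The intervals quoted for the pass rates are consistent with a simple normal-approximation (Wald) interval for a proportion. The sketch below assumes that method, which the text does not name, and uses a hypothetical helper `wald_ci`.

```python
from math import sqrt
from statistics import NormalDist


def wald_ci(successes, n, level=0.95):
    """Normal-approximation (Wald) interval for a proportion, truncated to [0, 1]."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = successes / n
    half_width = z * sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Pass counts as reported in the text.
for name, passed, total in [("BHS", 25, 31), ("AAMI", 37, 41), ("ESH-IP", 34, 35)]:
    p, lo, hi = wald_ci(passed, total)
    print(f"{name}: {p:.0%} ({lo:.0%} to {hi:.0%})")
# BHS: 81% (67% to 95%)
# AAMI: 90% (81% to 99%)
# ESH-IP: 97% (92% to 100%)
```

The first two intervals and the lower bound of the third match the reported values. For 34/35 the untruncated Wald upper bound exceeds 100%, so the reported 99.5% presumably reflects a different treatment of the upper limit; close to 0% or 100%, an interval such as Wilson's generally behaves better.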

Devices passing the BHS protocol had between 60 and 86% of measured values within 5 mm Hg of the observed value (Figure 2). Devices failing the BHS protocol (marked with a †) had between 35 and 59% of test device values within 5 mm Hg. Devices passing the AAMI protocol had between 47 and 94% of measured values within 5 mm Hg of the observed value (Figure 3), and the four devices failing it had 36–58% within 5 mm Hg. Devices passing the ESH-IP protocol had 54–89% within 5 mm Hg, and for the one device reported to fail it, 52% were within 5 mm Hg (Figure 4).

Some studies appear twice because they applied the protocol using different techniques. For example, in the study by Reinders,14 when automated measurements using the BHS protocol were taken simultaneously with the reference standard, 96% were within 5 mm Hg of the reference value; when taken sequentially, 80% were within 5 mm Hg.

In devices tested in community/clinic-based studies (studies performed from a population-based perspective), between 35 and 46% of measured values were within 5 mm Hg of the observed value. We found only two devices that had been tested both under a protocol and in the community. For the Omron HEM-705CP, 86% of measurements were within 5 mm Hg of the observed value under the BHS protocol15 compared with 46% in the community-based study.16 For the IVAC model 4200, these figures were 59% under the BHS protocol and 39% in the community-based study.17 When we studied the performance of the monitors for diastolic BP, results were broadly similar.

Figure 5 shows the mean and single s.d. of errors in BP monitors in repeated studies of the same monitor using either the same or different protocols. The results for the same device could vary substantially when a different protocol was used. For example, for the Omron HEM-907 the proportion of readings within 5 mm Hg differed by 18 percentage points depending on which protocol was used: 80% (AAMI protocol)18 compared with 62% (ESH-IP protocol).19 For the same device, the proportion could also differ by 22 percentage points when the same protocol was used: 62% (El Assaad 2002)19 compared with 84% (De Greef 2007).20

Figure 5

Mean and single s.d. of errors in BP monitors in repeated studies of the same monitor.

There was no significant difference between the three protocols (P=0.20). However, the year of protocol was significantly correlated with the proportion of measurements within 5 mm Hg (r=0.257, P<0.01). Overall, 19% of the variation could be explained by changes over time (η²=2581.11/13600.77=0.19; F=6.513, P<0.001). Subgroup analysis by protocol showed the ESH-IP to be the contributor to this correlation (most of its studies were later, with 64% after 2005), whereas the BHS and AAMI (earlier studies) were not.
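The η² statistic quoted above is the between-group sum of squares divided by the total sum of squares from the analysis of variance. The sketch below shows that ratio for the reported sums of squares and for a small invented data set; the helper `eta_squared` and the toy values are hypothetical and are not study data.

```python
def eta_squared(groups):
    """Share of total variation explained by group membership (one-way ANOVA)."""
    values = [x for group in groups for x in group]
    grand_mean = sum(values) / len(values)
    ss_total = sum((x - grand_mean) ** 2 for x in values)
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2 for group in groups
    )
    return ss_between / ss_total


# Sums of squares as reported in the text.
print(round(2581.11 / 13600.77, 2))  # 0.19

# Hypothetical toy data: percentage of readings within 5 mm Hg in two period groups.
toy_groups = [[60.0, 65.0, 70.0], [70.0, 75.0, 80.0]]
print(round(eta_squared(toy_groups), 2))  # 0.6
```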

Even if a given device could be improved to have a mean error of zero, it would still be necessary to reduce the variation. For example, for a device with 74% of BP measurements within 5 mm Hg [the Welch Allyn Vital Signs (s.d.: 5)], a further six BP measurements would be required to reduce the variation so that 95% of readings fall within 5 mm Hg. To obtain the same precision for the Pharma Smart PS-2000 (s.d.: 7), a device with 66% within 5 mm Hg, a further 11 measurements are required (based on the assumption that in each individual the BP differences between the tested device and the reference method are random). For the ESH-IP protocol, when using a device with 81% within 5 mm Hg [Artsana CS410 (s.d.: 5.9)], a further eight BP measurements are required to reduce variation so that 95% of readings are within 5 mm Hg, whereas for the Omron M6 Comfort (s.d.: 7.4), with 68% within 5 mm Hg, a further 12 measurements are required. The mean difference between devices that passed the protocols was relatively stable: BHS range: −2.4 to 1.5 mm Hg, AAMI: −2.8 to 2.7 mm Hg and ESH-IP: −3.2 to 2.0 mm Hg.

Discussion

Devices that passed international validation protocols for BP accuracy vary greatly in quality. Taking ‘percentage of readings within 5 mm Hg of the observed value’ as the measure of quality, we found wide variation among devices that passed protocol testing, whether the protocol was BHS, AAMI or ESH-IP, with slight improvements in performance under the ESH-IP protocol. This reflects, in part, the number of individuals required to fulfil each protocol, which affects its ability to distinguish between machines of differing accuracy: the BHS protocol requires testing on 85 patients and, therefore, has the power to distinguish differences between devices of >15%, whereas the international protocol requires only 33 patients (fewer than the other protocols) and can only detect differences between devices of >24%.
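The text does not say how the 15% and 24% thresholds were derived. One calculation that reproduces them, offered here only as an assumption, is the half-width of a 95% confidence interval for the difference between two independent proportions, each estimated on n patients, at the worst-case proportion of 0.5. The function name below is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def detectable_difference(n_per_device, p=0.5, level=0.95):
    """Smallest difference in proportions whose confidence interval excludes zero.

    ASSUMES two devices, each tested on n_per_device patients, and a common
    underlying proportion p (0.5 gives the widest interval).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return z * sqrt(2 * p * (1 - p) / n_per_device)


for name, n in [("BHS", 85), ("ESH-IP", 33)]:
    print(f"{name}: n={n}, detectable difference about {detectable_difference(n):.0%}")
# BHS: n=85, detectable difference about 15%
# ESH-IP: n=33, detectable difference about 24%
```

This is a confidence-interval argument rather than a formal power calculation, so a difference of exactly this size would be detected only about half the time.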

Moreover, devices passing a protocol are not necessarily more accurate than devices that failed, and devices that pass may perform substantially worse when tested in a community setting. In addition, passing validation does not mean that a device provides accurate readings in all patients; inaccuracy is more likely in older and diabetic patients.21, 22

In general, we found that devices performing badly do so because of wide variation in the differences (large s.d.) rather than systematic over- or underestimation of BP (Figures 2, 3, 4). This is reflected in the relatively stable mean difference, and it means that even a small drop in precision (that is, a lower proportion of measurements within 5 mm Hg) requires a large number of additional measurements. This effect is negated by taking multiple readings over multiple clinic visits. Yet, as a BP approaches the threshold for diagnosis, this inaccuracy can substantially under- or overestimate the burden of hypertension.13 In addition, community clinic-based validations were fewer and their results much worse. This is of particular concern, as this setting is where devices are most often used for the assessment and ongoing management of hypertension.

However, it is worth noting that the change from mercury to electronic BP measurement in practice has led to no consistent change in mean BP, but has been followed by a large and significant fall in terminal digit preference, suggesting improved precision of recording with electronic devices.23

This study had a number of limitations. The reproducibility of protocol testing is currently unknown: we found no studies that repeated earlier validations. Some studies may have been missed. Although the search was comprehensive, identifying 5912 potentially relevant records and analysing 3356, studies in which machines failed the validation are less likely to be published, particularly if the manufacturer is involved in the funding. However, missing such studies would have the effect of worsening the results and hence would not alter the conclusions. All validation studies found used the standard cuff size. A significant proportion of adults will need to use the larger cuff, and yet there is no evidence as to whether this will be accurate. The analysis was limited by the lack of reporting of proportional data in some cases. Development of an international consensus on minimal standards of reporting, including representations of proportional data, would go some way to allowing comparison of devices under different validation protocols. Our estimates of the proportion within 5 mm Hg of the observed value may not be accurate, because they assume a study-specific normal distribution, but they were sufficient to inform the ordering of the figures.

The existing protocols have several limitations. First, the procedure specifications are sometimes ambiguous. For example, the AAMI protocol allows a choice of data collection methods (simultaneous vs sequential) and a choice of data-analysis methods, either of which may affect the final results. Insufficient detail in validation protocol methods has previously been cited as a problem for interpretation.24 Second, not all protocols directly address the margin of error between measurement and observed value, and those that do allow margins of error as wide as 10 or 15 mm Hg, with the result that, in some monitors that pass protocols, up to half of measurements are >5 mm Hg from the observed value. Third, we found few examples of the protocols being tested in community settings (that is, real clinics rather than artificial settings), and the results from these studies were generally poor. This is of concern because it may suggest that most protocol studies overestimate device quality, although we found too few community studies to draw firm conclusions. Finally, the British and American protocols are difficult or costly to obtain, which could discourage further evaluation of BP monitors. Further information on protocol publications is available from the BHS website (http://www.bhsoc.org/Blood_pressure_Publications.stm).

Protocols that require a specified proportion of readings to fall within a specified margin of the observed value, and that have unambiguous procedures, could encourage the manufacture of higher quality BP monitors. Monitors need to be re-evaluated in community settings (‘real clinics’) to verify their accuracy in practice. A central, open access repository of evaluation studies would be beneficial to clinics and healthcare providers when choosing digital monitors.

Current protocols for validating BP devices give no guarantee of accuracy in clinical practice. Devices may pass even rigorous protocols with as few as 60% of readings within 5 mm Hg of the observed value. The limited evidence available suggests that performance in clinical practice will be even worse than under protocol testing. BP standards are essential to provide clinicians and patients with accurate information on which to base diagnostic and treatment decisions.