The Systematic Review Assistant-Deduplication Module (SRA-DM) project was developed in 2013 at the Bond University Centre for Research in Evidence-Based Practice (CREBP). The project aimed to reduce the time taken to produce systematic reviews by maximising the efficiency of the various review stages, such as optimising search strategies, screening, retrieving full-text articles and removing duplicate citations.
The deduplication algorithm was developed using a heuristic-based approach, with the aim of increasing the retrieval of duplicate records while minimising the number of unique records erroneously designated as duplicates. The algorithm was developed iteratively, with each version tested against a benchmark dataset of 1,988 citations. Modifications were made to the algorithm to overcome errors in duplicate detection (Table 1). For example, errors often arose from variations in author names (e.g. first-name/surname sequence, use or absence of initials, missing author names and typographical errors), page numbers (e.g. full, truncated or missing), text accent marks (e.g. French, German or Spanish diacritics) and journal names (e.g. abbreviated or complete, with ‘the’ used intermittently). The performance of the SRA-DM algorithm was compared with EndNote’s default one-step auto-deduplication process. To determine the reliability of SRA-DM, we conducted a series of validation tests using the results of different literature searches (cytology screening tests, stroke and haematology) retrieved from multiple biomedical databases (Table 2).
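
To illustrate this kind of pre-comparison clean-up, the minimal Python sketch below shows how accents, punctuation, case and truncated page ranges could be normalised before records are compared. The function names and exact rules are illustrative assumptions, not the SRA-DM source code.

```python
import re
import unicodedata

def strip_accents(text):
    """Remove accent marks (e.g. French/German/Spanish diacritics)."""
    return "".join(ch for ch in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(ch))

def normalise_field(text):
    """Lower-case a field, strip accents and punctuation, and collapse
    whitespace so superficial formatting differences do not block a match."""
    text = strip_accents(text.lower())
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def expand_pages(pages):
    """Expand a truncated page range, e.g. '221-6' -> '221-226'."""
    match = re.match(r"^(\d+)-(\d+)$", pages)
    if not match:
        return pages
    start, end = match.groups()
    if len(end) < len(start):
        end = start[:len(start) - len(end)] + end
    return f"{start}-{end}"

print(normalise_field("Müller, J."))  # 'muller j'
print(expand_pages("221-6"))          # '221-226'
```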

Table 1. SRA-DM algorithm changes

First iteration: Matching criteria were based on simple field comparisons (ignoring punctuation), with a check against the year field; this field is restricted to the digits 0–9 and therefore has the lowest probability of error.

Second iteration: Short-format page numbers were converted to full format (e.g. 221–6 expanded to 221–226), and the algorithm was modified to increase sensitivity by adding matching criteria on authors OR title.

Third iteration: Matched on author AND title, extending the non-reference fields checked from ‘year’ alone to year OR volume OR edition.

Fourth iteration: Extended the matching criteria of the third iteration with an improved name-matching system. This system is context-aware of author name variations (i.e. initialisation, punctuation and rearranged author listings) and uses fuzzy logic so that such differences can be accommodated (a sketch of this kind of name matching follows the table). For example, the following names are all syntactically equivalent and will match as identical authors:
1. William Shakespeare
2. W. Shakespeare
3. W Shakespeare
4. William John Shakespeare
5. William J. Shakespeare
6. W. J. Shakespeare
7. W J Shakespeare
8. Shakespeare, William
9. Shakespeare, W
10. Shakespeare, W, A
11. Shakespeare, W, A, B, C
12. William Shakespeare 1st
13. William Shakespeare 2nd
14. William Shakespeare IV
15. William Adam Bob Charles Shakespeare XVI
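
The following minimal Python sketch shows how the name matching and the combined decision rule described in the third and fourth iterations could be approximated. The helper names and simplified rules are assumptions for illustration, not the SRA-DM implementation.

```python
import re

# Ordinal and Roman-numeral suffixes that should not count as name parts
SUFFIX_RE = re.compile(r"^(?:\d+(?:st|nd|rd|th)|[IVXLC]+)$")

def author_key(name):
    """Reduce an author name to (surname, first initial) so that common
    variants (initials, punctuation, re-ordered listings, suffixes) match."""
    name = name.replace(".", " ")
    if "," in name:
        # 'Shakespeare, William' / 'Shakespeare, W, A' style: surname first
        surname, _, given = name.partition(",")
        given_tokens = given.replace(",", " ").split()
    else:
        # 'William Shakespeare' / 'William Shakespeare IV' style: surname last
        tokens = [t for t in name.split() if not SUFFIX_RE.match(t)]
        surname, given_tokens = tokens[-1], tokens[:-1]
    initial = given_tokens[0][0].upper() if given_tokens else ""
    return surname.strip().lower(), initial

def is_duplicate(rec_a, rec_b):
    """Decision rule in the spirit of the third/fourth iterations: authors AND
    title must match, plus at least one of year, volume or edition."""
    authors_match = ({author_key(a) for a in rec_a["authors"]} ==
                     {author_key(a) for a in rec_b["authors"]})
    title_match = (re.sub(r"\W+", " ", rec_a["title"]).lower().strip() ==
                   re.sub(r"\W+", " ", rec_b["title"]).lower().strip())
    secondary_match = any(rec_a.get(f) and rec_a.get(f) == rec_b.get(f)
                          for f in ("year", "volume", "edition"))
    return authors_match and title_match and secondary_match

# All fifteen name variants listed above reduce to ('shakespeare', 'W')
assert author_key("William Shakespeare IV") == author_key("Shakespeare, W, A, B, C")
```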

Table 2. Databases searched for retrieval of citations for validation testing

Cytology screening tests dataset:
1. Cochrane Controlled Trials Register (CCTR)
2. Cochrane Database of Systematic Reviews (CDSR)
3. EMBASE
4. MEDLINE
5. National Research Register (NRR)
6. Database of Abstracts of Reviews of Effects (DARE)
7. NHS Health Technology Assessment (HTA)
8. PreMEDLINE
9. Science Citation Index
10. Social Sciences Citation Index
Haematology dataset:
1. MEDLINE
2. EMBASE
3. MEDLINE In-Process
4. Biological Abstracts
5. NHS Health Technology Assessment (HTA)
6. Cochrane Controlled Trials Register (CCTR)
7. Cochrane Database of Systematic Reviews (CDSR)
8. CINAHL
9. Science Citation Index
10. Social Sciences Citation Index
Stroke dataset:
1. MEDLINE
2. EMBASE
3. CENTRAL
4. CINAHL
5. PsycInfo