Improving Systematic Review Efficiency Through Deduplication
The Systematic Review Assistant-Deduplication Module (SRA-DM) project was developed in 2013 at the Bond University Centre for Research in Evidence-Based Practice (CREBP). The project aimed to reduce the time taken to produce systematic reviews by improving the efficiency of the various review stages, such as optimising search strategies and screening, finding full-text articles and removing duplicate citations. The deduplication algorithm was developed using a heuristic-based approach, with the aim of increasing the retrieval of duplicate records while minimising the number of unique records erroneously designated as duplicates.
The algorithm was developed iteratively, with each version tested against a benchmark dataset of 1,988 citations. Modifications were made to the algorithm to overcome errors in duplicate detection (Table 1). For example, errors often occurred because of variations in author names (e.g. first-name/surname sequence, use or absence of initials, missing author names and typographical errors), page numbers (e.g. full, truncated or missing), text accent marks (e.g. French/German/Spanish) and journal names (e.g. abbreviated or complete, and ‘the’ used intermittently). The performance of the SRA-DM algorithm was compared with EndNote’s default one-step auto-deduplication process. To determine the reliability of SRA-DM, we conducted a series of validation tests with the results of literature searches on different topics (cytology screening tests, stroke and haematology), retrieved from multiple biomedical databases (Table 2).
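As a rough illustration of this style of benchmarking (a sketch only, not the SRA-DM implementation; the pair format, field names and function names are assumptions made for illustration), the snippet below scores a candidate matching rule against a labelled benchmark, counting duplicate records retrieved and unique records erroneously flagged as duplicates:

```python
# Hypothetical sketch: scoring a pairwise deduplication rule against a labelled
# benchmark. The (record_a, record_b, is_true_duplicate) format is an assumption
# for illustration, not the SRA-DM data model.

def evaluate(match_fn, benchmark_pairs):
    """Return (sensitivity, false_positives) for a pairwise matching rule."""
    true_dups_found = 0
    total_true_dups = 0
    false_positives = 0
    for rec_a, rec_b, is_true_dup in benchmark_pairs:
        flagged = match_fn(rec_a, rec_b)
        if is_true_dup:
            total_true_dups += 1
            true_dups_found += int(flagged)
        elif flagged:
            false_positives += 1  # unique record wrongly designated a duplicate
    sensitivity = true_dups_found / total_true_dups if total_true_dups else 0.0
    return sensitivity, false_positives
```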
Table 1. SRA-DM algorithm changes: iterations and the corresponding changes to the algorithm.
First iteration: Matching criteria were based on simple field comparison (ignoring punctuation), with checks against the year field, since this field has a lower probability of error (it is restricted to the digits 0–9) and is therefore the least error-prone field.
Second iteration: Short-format page numbers were converted to full format (e.g. 221–6 to 221–226), and the algorithm was further modified to increase sensitivity by incorporating matching criteria on authors OR title.
Third iteration: Match author AND title, with the non-reference matching fields extended from ‘year’ only to year OR volume OR edition.
Fourth iteration: The fourth algorithm extended the matching criteria of the third algorithm with the addition of an improved name-matching system. This system was context-aware of author name variations (i.e. initialisation, punctuation and rearranged author listings) and used fuzzy logic so that such differences could be accommodated. For example, the following names are all syntactically equivalent and will match as identical authors (illustrative sketches of the iteration rules, including this name matching, follow the list):
1. William Shakespeare
2. W. Shakespeare
3. W Shakespeare
4. William John Shakespeare
5. William J. Shakespeare
6. W. J. Shakespeare
7. W J Shakespeare
8. Shakespeare, William
9. Shakespeare, W
10. Shakespeare, W, A
11. Shakespeare, W, A, B, C
12. William Shakespeare 1st
13. William Shakespeare 2nd
14. William Shakespeare IV
15. William Adam Bob Charles Shakespeare XVI
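The sketch below illustrates, in rough Python, how the matching rules described in Table 1 might be expressed. The helper names, record fields and parsing heuristics are assumptions made for illustration; they are not the published SRA-DM code, which uses a more elaborate fuzzy-logic name matcher.

```python
import re
import string

_PUNCT = str.maketrans("", "", string.punctuation)

def normalise(value):
    """First iteration: compare fields case-insensitively, ignoring punctuation."""
    return (value or "").translate(_PUNCT).lower().strip()

def expand_pages(pages):
    """Second iteration: convert short-format page ranges to full format,
    e.g. '221-6' -> '221-226'."""
    m = re.match(r"^(\d+)\s*[-–]\s*(\d+)$", (pages or "").strip())
    if not m:
        return pages
    start, end = m.group(1), m.group(2)
    if len(end) < len(start):
        end = start[: len(start) - len(end)] + end  # '221' + '6' -> '226'
    return f"{start}-{end}"

def third_pass_match(rec_a, rec_b):
    """Third iteration: require author AND title to agree, plus at least one
    non-reference field (year OR volume OR edition)."""
    fields_agree = (
        normalise(rec_a.get("authors")) == normalise(rec_b.get("authors"))
        and normalise(rec_a.get("title")) == normalise(rec_b.get("title"))
    )
    non_reference_agree = any(
        rec_a.get(f) and rec_a.get(f) == rec_b.get(f)
        for f in ("year", "volume", "edition")
    )
    return fields_agree and non_reference_agree
```

Continuing the same sketch, the fourth-iteration name matching can be approximated by collapsing each author string to a surname-plus-first-initial key, so that the variants listed above compare as equal:

```python
SUFFIXES = {"1st", "2nd", "3rd", "4th", "jr", "sr"}

def author_key(name):
    """Reduce an author name to (surname, first initial); a simplification of
    the fuzzy name matching, sufficient for the variants listed above."""
    if "," in name:
        surname, _, rest = name.partition(",")        # e.g. 'Shakespeare, W, A'
        given = rest.replace(",", " ").split()
    else:
        tokens = [
            t for t in name.replace(".", " ").split()
            if t.lower() not in SUFFIXES
            and not re.fullmatch(r"[IVXLCDM]+", t)    # drop numeral suffixes like 'IV'
        ]
        if not tokens:
            return (name.strip().lower(), "")
        surname, given = tokens[-1], tokens[:-1]
    initial = given[0][0].upper() if given else ""
    return (surname.strip().lower(), initial)

def authors_match(name_a, name_b):
    """Two author strings match when surname and first initial agree."""
    return author_key(name_a) == author_key(name_b)
```

In use, `authors_match('W. J. Shakespeare', 'Shakespeare, William')` returns True, mirroring the equivalences in the list above.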
Table 2. Databases searched for retrieval of citations for validation testing, by dataset.
Cytology screening tests dataset:
1. Cochrane Controlled Trials Register (CCTR)
2. Cochrane Database of Systematic Reviews (CDSR)
3. EMBASE
4. MEDLINE
5. National Research Register (NRR)
6. Database of Abstracts of Reviews of Effects (DARE)
Independent variables:
Deduplication algorithm iterations (first, second, third, and fourth)
Dependent variables:
Retrieval of duplicate records
Misidentification of unique records as duplicates
Control variables:
Benchmark dataset of 1,988 citations
EndNote's default one-step auto-deduplication process
Controls:
Positive control: comparison of SRA-DM algorithm performance against EndNote's default one-step auto-deduplication process
Negative control: not explicitly mentioned