The Systematic Review Assistant-Deduplication Module (SRA-DM) project was developed in 2013 at the Bond University Centre for Research in Evidence-Based Practice (CREBP). The project aimed to reduce the time taken to produce systematic reviews by maximising the efficiency of the various review stages, such as optimising search strategies, screening, retrieving full-text articles and removing duplicate citations.
The deduplication algorithm was developed using a heuristic-based approach, with the aim of increasing the retrieval of duplicate records while minimising the number of unique records erroneously designated as duplicates. The algorithm was developed iteratively, with each version tested against a benchmark dataset of 1,988 citations. Modifications were made to the algorithm to overcome errors in duplicate detection (Table 1). For example, errors often arose from variations in author names (e.g. first-name/surname sequence, use or absence of initials, missing author names and typographical errors), page numbers (e.g. full, truncated or missing), text accent marks (e.g. French, German or Spanish diacritics) and journal names (e.g. abbreviated or complete, with ‘the’ used intermittently). The performance of the SRA-DM algorithm was compared with EndNote’s default one-step auto-deduplication process. To determine the reliability of SRA-DM, we conducted a series of validation tests using the results of different literature searches (cytology screening tests, stroke and haematology) retrieved from multiple biomedical databases (Table 2).
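
To illustrate this kind of pre-comparison clean-up, the minimal Python sketch below shows how accents, punctuation, case and truncated page ranges could be normalised before records are compared. The function names and exact rules are illustrative assumptions, not the SRA-DM source code.

```python
import re
import unicodedata

def strip_accents(text):
    """Remove accent marks (e.g. French/German/Spanish diacritics)."""
    return "".join(ch for ch in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(ch))

def normalise_field(text):
    """Lower-case a field, strip accents and punctuation, and collapse
    whitespace so superficial formatting differences do not block a match."""
    text = strip_accents(text.lower())
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def expand_pages(pages):
    """Expand a truncated page range, e.g. '221-6' -> '221-226'."""
    match = re.match(r"^(\d+)-(\d+)$", pages)
    if not match:
        return pages
    start, end = match.groups()
    if len(end) < len(start):
        end = start[:len(start) - len(end)] + end
    return f"{start}-{end}"

print(normalise_field("Müller, J."))  # 'muller j'
print(expand_pages("221-6"))          # '221-226'
```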

Table 1. SRA-DM algorithm changes

First iteration: Matching criteria were based on simple field comparisons (ignoring punctuation), with a check against the year field; this field is restricted to the digits 0–9 and therefore has the lowest probability of error.

Second iteration: Short-format page numbers were converted to full format (e.g. 221–6 expanded to 221–226), and the algorithm was modified to increase sensitivity by adding matching criteria on authors OR title.

Third iteration: Matched on author AND title, extending the non-reference fields checked from ‘year’ alone to year OR volume OR edition.

Fourth iteration: Extended the matching criteria of the third iteration with an improved name-matching system. This system is context-aware of author name variations (i.e. initialisation, punctuation and rearranged author listings) and uses fuzzy logic so that such differences can be accommodated (a sketch of this kind of name matching follows the table). For example, the following names are all syntactically equivalent and will match as identical authors:
1. William Shakespeare
2. W. Shakespeare
3. W Shakespeare
4. William John Shakespeare
5. William J. Shakespeare
6. W. J. Shakespeare
7. W J Shakespeare
8. Shakespeare, William
9. Shakespeare, W
10. Shakespeare, W, A
11. Shakespeare, W, A, B, C
12. William Shakespeare 1st
13. William Shakespeare 2nd
14. William Shakespeare IV
15. William Adam Bob Charles Shakespeare XVI
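
The following minimal Python sketch shows how the name matching and the combined decision rule described in the third and fourth iterations could be approximated. The helper names and simplified rules are assumptions for illustration, not the SRA-DM implementation.

```python
import re

# Ordinal and Roman-numeral suffixes that should not count as name parts
SUFFIX_RE = re.compile(r"^(?:\d+(?:st|nd|rd|th)|[IVXLC]+)$")

def author_key(name):
    """Reduce an author name to (surname, first initial) so that common
    variants (initials, punctuation, re-ordered listings, suffixes) match."""
    name = name.replace(".", " ")
    if "," in name:
        # 'Shakespeare, William' / 'Shakespeare, W, A' style: surname first
        surname, _, given = name.partition(",")
        given_tokens = given.replace(",", " ").split()
    else:
        # 'William Shakespeare' / 'William Shakespeare IV' style: surname last
        tokens = [t for t in name.split() if not SUFFIX_RE.match(t)]
        surname, given_tokens = tokens[-1], tokens[:-1]
    initial = given_tokens[0][0].upper() if given_tokens else ""
    return surname.strip().lower(), initial

def is_duplicate(rec_a, rec_b):
    """Decision rule in the spirit of the third/fourth iterations: authors AND
    title must match, plus at least one of year, volume or edition."""
    authors_match = ({author_key(a) for a in rec_a["authors"]} ==
                     {author_key(a) for a in rec_b["authors"]})
    title_match = (re.sub(r"\W+", " ", rec_a["title"]).lower().strip() ==
                   re.sub(r"\W+", " ", rec_b["title"]).lower().strip())
    secondary_match = any(rec_a.get(f) and rec_a.get(f) == rec_b.get(f)
                          for f in ("year", "volume", "edition"))
    return authors_match and title_match and secondary_match

# All fifteen name variants listed above reduce to ('shakespeare', 'W')
assert author_key("William Shakespeare IV") == author_key("Shakespeare, W, A, B, C")
```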

Table 2. Databases searched for retrieval of citations for validation testing

Cytology screening tests dataset:
1. Cochrane Controlled Trials Register (CCTR)
2. Cochrane Database of Systematic Reviews (CDSR)
3. EMBASE
4. MEDLINE
5. National Research Register (NRR)
6. Database of Abstracts of Reviews of Effects (DARE)
7. NHS Health Technology Assessment (HTA)
8. PreMEDLINE
9. Science Citation Index
10. Social Sciences Citation Index
Haematology dataset:
1. MEDLINE
2. EMBASE
3. MEDLINE In-Process
4. Biological Abstracts
5. NHS Health Technology Assessment (HTA)
6. Cochrane Controlled Trials Register (CCTR)
7. Cochrane Database of Systematic Reviews (CDSR)
8. CINAHL
9. Science Citation Index
10. Social Sciences Citation Index
Stroke dataset:
1. MEDLINE
2. EMBASE
3. CENTRAL
4. CINAHL
5. PsycInfo