Smart Merge Algorithm
Smart Merge is a multi-stage record-linkage framework for duplicate detection across Web of Science and Scopus. It builds on probabilistic record-linkage principles inspired by the Fellegi–Sunter model and combines deterministic identifiers with similarity-based matching to balance precision and recall.

The five stages
- Metadata normalization: DOI standardization, Unicode normalization, title cleaning, author-name harmonization.
- Blocking: records are grouped by publication year and author, and only records within the same block are compared, which avoids the O(n²) cost of comparing every pair.
- Hierarchical matching: exact DOI and identifier matching first, then Jaro–Winkler title similarity, then rule-based rejection; each candidate pair receives a confidence score.
- Confidence routing: high scores auto-merge; mid-range pairs go to the optional borderline-review queue (with LLM adjudication if configured); low-scoring pairs are rejected and kept as separate records.
- Field-level conflict resolution: when the two databases disagree on a field, predefined source-preference rules decide which value wins.
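
The first three stages can be sketched roughly as follows. This is an illustration under stated assumptions, not the Smart Merge implementation: records are assumed to be plain dictionaries with the hypothetical keys `doi`, `title`, `first_author`, and `year`, and every function name is invented for the example. Python's standard library has no Jaro–Winkler function, so `difflib.SequenceMatcher` stands in for the title-similarity step. Treating two conflicting DOIs as an outright rejection is likewise only an assumed example of a rejection rule.

```python
import re
import unicodedata
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize_doi(doi):
    """Lowercase a DOI and strip common resolver prefixes; None if empty."""
    if not doi:
        return None
    doi = doi.strip().lower()
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi)
    return doi or None


def normalize_text(text):
    """NFKD-fold, drop accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def block_key(record):
    """Hypothetical blocking key: publication year plus first-author surname."""
    surname = normalize_text(record["first_author"].split(",")[0])
    return (record["year"], surname)


def score_pair(a, b):
    """Return a confidence in [0, 1] for one candidate pair."""
    doi_a, doi_b = normalize_doi(a.get("doi")), normalize_doi(b.get("doi"))
    if doi_a and doi_b:
        # Deterministic stage. Assumed rule: conflicting DOIs reject the pair.
        return 1.0 if doi_a == doi_b else 0.0
    # Similarity stage; difflib is a stand-in for Jaro-Winkler here.
    title_a, title_b = normalize_text(a["title"]), normalize_text(b["title"])
    return SequenceMatcher(None, title_a, title_b).ratio()


def candidate_pairs(records):
    """Yield (a, b, confidence) only for records that share a block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    for group in blocks.values():
        for a, b in combinations(group, 2):
            yield a, b, score_pair(a, b)
```

Because comparisons happen only inside a block, two records with different years or first-author surnames are never scored against each other. That is the trade-off blocking makes: fewer comparisons, at the cost of missing duplicates whose blocking fields disagree.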
Confidence routing in practice
| Confidence | Action |
| --- | --- |
| High (e.g. exact DOI match) | Merged automatically |
| Borderline | Queued for review; you (or an optional LLM assistant) decide |
| Low | Kept as separate records |
Borderline pairs stay accessible later via the Uncertain Pairs panel on the Records page — reviewing them is never a blocking step.
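
The three-way routing described above can be sketched as a small function. The cut-off values `AUTO_MERGE_MIN` and `REVIEW_MIN` and the `Router` class are hypothetical: the actual thresholds are not stated here, so treat the numbers as placeholders.

```python
from dataclasses import dataclass, field

# Hypothetical cut-offs; the real thresholds are not documented in this section.
AUTO_MERGE_MIN = 0.95
REVIEW_MIN = 0.80


@dataclass
class Router:
    merged: list = field(default_factory=list)
    uncertain: list = field(default_factory=list)  # review queue, never blocks
    separate: list = field(default_factory=list)

    def route(self, pair_id, confidence):
        """File a scored pair under exactly one of the three outcomes."""
        if confidence >= AUTO_MERGE_MIN:
            self.merged.append(pair_id)
            return "merge"
        if confidence >= REVIEW_MIN:
            self.uncertain.append(pair_id)
            return "review"
        self.separate.append(pair_id)
        return "separate"
```

Borderline pairs are only appended to a list, so routing the rest of the corpus continues without waiting for a reviewer. This mirrors the non-blocking behavior of the Uncertain Pairs panel.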
Field conflict resolution
When WoS and Scopus both provide a value, the merge keeps the preferred source per field (e.g. WoS for cited references, Scopus for affiliations). Where only one source has a value, it cross-fills the gap (WC↔SC category mapping). Every decision is recorded in the audit log and is reversible.
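
A minimal sketch of per-field preference with cross-filling and an audit trail follows. Only the two preferences named above (WoS for cited references, Scopus for affiliations) come from the text. The field names, the `DEFAULT_SOURCE` fallback, and the audit-entry layout are assumptions made for illustration, and the WC↔SC category mapping is not modeled.

```python
# Only these two preferences are stated in the text; field names are hypothetical.
PREFERRED_SOURCE = {"cited_references": "wos", "affiliations": "scopus"}
DEFAULT_SOURCE = "wos"  # assumption for fields without an explicit rule
EMPTY = (None, "", [])


def merge_fields(wos, scopus):
    """Merge two field dicts; return (merged_record, audit_entries)."""
    sources = {"wos": wos, "scopus": scopus}
    merged, audit = {}, []
    for name in sorted(set(wos) | set(scopus)):
        preferred = PREFERRED_SOURCE.get(name, DEFAULT_SOURCE)
        other = "scopus" if preferred == "wos" else "wos"
        chosen, value = preferred, sources[preferred].get(name)
        if value in EMPTY:
            # Gap in the preferred source: cross-fill from the other database.
            chosen, value = other, sources[other].get(name)
        loser = other if chosen == preferred else preferred
        merged[name] = value
        # Keeping the discarded value is what makes the decision reversible.
        audit.append({"field": name, "source": chosen,
                      "discarded": sources[loser].get(name)})
    return merged, audit
```

For example, with `wos = {"affiliations": "Univ A", "abstract": ""}` and `scopus = {"affiliations": "University A, Dept B", "abstract": "Text"}`, the merged record takes both values from Scopus: the affiliation because Scopus is preferred for that field, and the abstract because the WoS value is empty.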
Why not just match DOIs?
Roughly 5–15% of records in a typical corpus lack a DOI in at least one database. Title-similarity matching with blocking recovers those duplicates while the confidence threshold keeps false merges out.
