ORCID-First Author Disambiguation
Author-name disambiguation is the most consequential harmonization task for collaboration and co-authorship networks. Unlike traditional string-based approaches, BibexPy prioritizes ORCID identifiers as deterministic evidence and falls back to a constrained field-similarity decision only when identifier coverage is incomplete.

The five stages
- ORCID resolution — identifiers are read from the OI field; when missing, an on-demand OpenAlex lookup via the record DOI fills the gap.
- Classification by ORCID — a shared identifier deterministically merges author mentions; disjoint identifiers separate them. No guesswork.
- Field-similarity fallback — for mentions lacking ORCID evidence, similarity over subject categories and keywords (WC, SC, DE, ID) and co-occurrence patterns is computed. String and context measures only — no predictive ML.
- Constrained decision — highly similar mentions auto-merge, dissimilar ones separate, borderline cases route to the optional review queue.
- Normalization — confirmed identities become consistent author nodes for downstream science mapping.
The review queue
Borderline blocks (typically 5–10% of authors) appear in a review list with the evidence that triggered uncertainty. If an LLM provider is configured (Settings → LLM), it can propose cluster suggestions with a confidence score and rationale — but:
- Nothing is applied automatically. Data stays unchanged until you approve.
- A snapshot is taken before applying.
- Responses are cached — the same block is never sent twice.
What you get
- Cleaner co-authorship networks — one node per real person, not per spelling variant.
- An auditable trail: every merge/separation decision is logged with its evidence and is reversible.
Why ORCID-first matters
String similarity alone merges different people with similar names and splits one person across initials variants. ORCID provides ground truth where available; the fallback is deliberately conservative everywhere else.
