BibexPy — V2 Helium

ORCID-First Author Disambiguation

Author-name disambiguation is the most consequential harmonization task for collaboration and co-authorship networks. Unlike traditional string-based approaches, BibexPy prioritizes ORCID identifiers as deterministic evidence and falls back to a constrained field-similarity decision only when identifier coverage is incomplete.

Disambiguation diagram
ORCID resolution → identifier classification → field-similarity fallback → constrained decision → normalization.

The five stages

  1. ORCID resolution — identifiers are read from the OI field; when missing, an on-demand OpenAlex lookup via the record DOI fills the gap.
  2. Classification by ORCID — a shared identifier deterministically merges author mentions; disjoint identifiers separate them. No guesswork.
  3. Field-similarity fallback — for mentions lacking ORCID evidence, similarity over subject categories and keywords (WC, SC, DE, ID) and co-occurrence patterns is computed. String and context measures only — no predictive ML.
  4. Constrained decision — highly similar mentions auto-merge, dissimilar ones separate, borderline cases route to the optional review queue.
  5. Normalization — confirmed identities become consistent author nodes for downstream science mapping.

The review queue

Borderline blocks (typically 5–10% of authors) appear in a review list with the evidence that triggered uncertainty. If an LLM provider is configured (Settings → LLM), it can propose cluster suggestions with a confidence score and rationale — but:

  • Nothing is applied automatically. Data stays unchanged until you approve.
  • A snapshot is taken before applying.
  • Responses are cached — the same block is never sent twice.

What you get

  • Cleaner co-authorship networks — one node per real person, not per spelling variant.
  • An auditable trail: every merge/separation decision is logged with its evidence and is reversible.

Why ORCID-first matters

String similarity alone merges different people with similar names and splits one person across initials variants. ORCID provides ground truth where available; the fallback is deliberately conservative everywhere else.