BibexPy — V2 Helium

Address Harmonization

Institutional collaboration networks are frequently distorted by affiliation variants that refer to the same organization (for example "Wuhan Univ", "Wuhan University", and department-level address forms), while country-level analyses suffer from inconsistent naming conventions. A two-track module addresses both.

Figure: Address harmonization diagram. Track 1: organization roll-up to canonical parents. Track 2: dictionary-based country standardization.

Track 1 — Organization roll-up

  1. Each C1 address is parsed into organization candidates.
  2. Affiliation variants are clustered by Jaro–Winkler similarity.
  3. Each cluster maps to a canonical parent institution (e.g. departments roll up to their university).
  4. High-confidence matches consolidate automatically; borderline cases route to review.
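
Steps 2 to 4 can be illustrated with a minimal sketch. It is not BibexPy's actual implementation: the function names, the greedy single-pass clustering against each cluster's first member, and both thresholds (0.92 and 0.85) are assumptions chosen for illustration. Jaro–Winkler is written out in plain Python so the example has no dependencies.

```python
AUTO_THRESHOLD = 0.92    # hypothetical: at or above this, merge automatically
REVIEW_THRESHOLD = 0.85  # hypothetical: between the two, route to review


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted by a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def cluster_affiliations(names):
    """Greedy clustering: compare each name to existing cluster representatives."""
    clusters = {}  # normalized representative -> original strings in the cluster
    review = []    # (original string, closest representative, score)
    for raw in names:
        name = " ".join(raw.casefold().split())
        best_rep, best_score = None, 0.0
        for rep in clusters:
            score = jaro_winkler(name, rep)
            if score > best_score:
                best_rep, best_score = rep, score
        if best_rep is not None and best_score >= AUTO_THRESHOLD:
            clusters[best_rep].append(raw)
        elif best_rep is not None and best_score >= REVIEW_THRESHOLD:
            review.append((raw, best_rep, round(best_score, 3)))
        else:
            clusters[name] = [raw]
    return clusters, review
```

With these assumed thresholds, "Wuhan Univ" and "Wuhan University" score about 0.925 and merge, while "Wuhan Univ Technol" scores about 0.911 against "Wuhan Univ" and goes to review. Jaro–Winkler rewards shared prefixes, which is why a review band is useful for names that begin identically but may denote different institutions.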

Track 2 — Country standardization

Dictionary-based normalization resolves alternative spellings, abbreviations and historical names (USA / United States / U.S.A.; Türkiye / Turkey). Unresolved tokens go to the same review workflow rather than being guessed.
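
A dictionary lookup of this kind might look like the sketch below. The alias table, the key normalization (strip accents and periods, casefold), the choice of canonical labels, and the function name `standardize_country` are all assumptions for illustration, not BibexPy's shipped dictionary or API.

```python
import unicodedata

# Hypothetical alias table: normalized key -> canonical country label.
COUNTRY_ALIASES = {
    "usa": "United States",
    "united states": "United States",
    "turkiye": "Türkiye",
    "turkey": "Türkiye",
}


def _key(token: str) -> str:
    """Normalize a raw country token into a lookup key."""
    decomposed = unicodedata.normalize("NFKD", token)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.replace(".", "").casefold().split())


def standardize_country(token: str, review_queue: list):
    """Return the canonical label, or None after queueing the token for review."""
    canonical = COUNTRY_ALIASES.get(_key(token))
    if canonical is None:
        review_queue.append(token)  # never guess: unresolved tokens are reviewed
    return canonical
```

Because the lookup is an exact match on the normalized key, an unknown token is never mapped to a "close" country; it is returned as unresolved and queued.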

Why it matters

| Without harmonization | With harmonization |
| --- | --- |
| "Wuhan Univ" and "Wuhan University" are two institutions | One canonical node with combined output |
| Country counts split across spelling variants | One row per country |
| Institutional rankings undercount fragmented affiliations | Roll-up reflects real institutional volume |

All consolidations are logged and reversible; the canonical mapping is visible per record, so you can always trace an aggregated count back to the original address strings.
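
One way to picture a logged, reversible mapping is the sketch below. The `ConsolidationLog` class and its method names are hypothetical and only illustrate the idea; they are not BibexPy's data model. It assumes Python 3.9 or later for the built-in generic type hints.

```python
from dataclasses import dataclass, field


@dataclass
class ConsolidationLog:
    # original address string -> canonical institution
    mapping: dict[str, str] = field(default_factory=dict)
    # append-only audit trail of (action, original, canonical)
    history: list[tuple[str, str, str]] = field(default_factory=list)

    def consolidate(self, original: str, canonical: str) -> None:
        self.mapping[original] = canonical
        self.history.append(("consolidate", original, canonical))

    def revert(self, original: str) -> None:
        canonical = self.mapping.pop(original)
        self.history.append(("revert", original, canonical))

    def canonical_for(self, original: str) -> str:
        # Unmapped strings stand for themselves.
        return self.mapping.get(original, original)

    def trace(self, canonical: str) -> list[str]:
        """List the original address strings behind a canonical node."""
        return sorted(o for o, c in self.mapping.items() if c == canonical)
```

Keeping the audit trail append-only means a revert is itself recorded, so the sequence of decisions stays reconstructable.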

Run enrichment first

Enrichment can add ROR identifiers and country codes from authoritative sources — extra deterministic evidence the roll-up uses before resorting to string similarity.
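
The "deterministic evidence first" ordering could be sketched as follows. The record layout (`name` and `ror_id` keys), the function name `same_organization`, the threshold, and the rule that two differing ROR identifiers mean "no match" are assumptions for illustration. The similarity function is passed in, so any string metric can be used.

```python
from typing import Callable


def same_organization(
    a: dict,
    b: dict,
    similarity: Callable[[str, str], float],
    threshold: float = 0.92,  # hypothetical cut-off
) -> tuple[bool, str]:
    """Return (match decision, evidence used)."""
    ror_a, ror_b = a.get("ror_id"), b.get("ror_id")
    if ror_a and ror_b:
        # Both records carry an identifier: decide on it alone.
        return ror_a == ror_b, "ror_id"
    # Otherwise fall back to string similarity on the names.
    score = similarity(a["name"].casefold(), b["name"].casefold())
    return score >= threshold, "string_similarity"
```

Returning the evidence type alongside the decision lets a reviewer see whether a merge rested on an identifier or on a similarity score.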