# Audit closure — §7.2.5: the «14-root delta» between Jabal xlsx and Juthoor's `roots.jsonl`

**Date:** 2026-05-09
**Item closed:** [`our-contributions-and-roadmap.md`](../02-architecture/our-contributions-and-roadmap.md) §7.2.5
**Original framing:** «xlsx جبل (1,924 جذرًا) ومُخرَجات Codex (1,938 جذرًا = +14)», i.e. "Jabal xlsx (1,924 roots) and Codex outputs (1,938 roots = +14)"
**Verdict:** original framing was **misleading**; both numbers were row counts, not unique-root counts. The real discrepancy is much larger and traces to two distinct bugs in Juthoor's ingest pipeline. Diagnosis below; remediation owned by `Juthoor-ArabicGenome-LV1/scripts/ingest_muajam.py`.

---

## 1. What the numbers really say

| Metric | Value | Source |
|---|---|---|
| Master xlsx data rows | **1,924** | `المعجم_الاشتقاقي_Juthoor_v2.xlsx` → sheet `المعجم الكامل`, col `التركيب (الجذر الثلاثي)` |
| Master xlsx **pair cells** (single cell encoding two roots, e.g. `ب ث ث - ب ث ب ث`) | **376** | same sheet |
| Master xlsx **unique roots** (after splitting pair cells) | **2,300** | same sheet |
| Juthoor `roots.jsonl` rows | **1,938** | `Juthoor-ArabicGenome-LV1/data/muajam/roots.jsonl` |
| Juthoor **unique** `tri_root` strings | **1,923** | same |
| Juthoor internal **duplicate** `tri_root` (same root, different bab) | **15** | same |

So «1,924 vs 1,938» is **rows vs rows**. By the more methodologically honest **unique-root** measure, Juthoor is **377 short** of the master, not 14 over.

**Examples of master roots missing from Juthoor:**
`أتو`, `أتى`, `أخو`, `أخي`, `أدو`, `أدى`, `أوب`, `أيب`, `بثث`, `بثبث`, `بجج`, `بجبج`, `بدبد`, `بأدل`, `بره`, `برهن` … (757 total)
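
The unique-root figures above depend on splitting each pair cell before counting. A minimal sketch of that step and of the resulting set diff, assuming cells arrive as plain strings and pairs are separated by a hyphen or slash; `unique_roots` and the toy data are illustrative and are not taken from the audit script:

```python
import re

# Assumed separators for pair cells: hyphen or slash, with optional spaces.
PAIR_SEP = re.compile(r"\s*[-/]\s*")

def unique_roots(cells):
    """Split pair cells, drop intra-root spaces, return the set of roots."""
    roots = set()
    for cell in cells:
        for side in PAIR_SEP.split(cell.strip()):
            root = "".join(side.split())
            if root:
                roots.add(root)
    return roots

# Toy data: three master rows, one of which is a pair cell.
master_cells = ["ب ث ث - ب ث ب ث", "ب ر ه", "ب د ب د"]
# Toy Juthoor side: the first entry is a mangled pair string.
juthoor_roots = {"بثث-بثبث", "بره", "بدبد"}

master_roots = unique_roots(master_cells)
master_only = master_roots - juthoor_roots
juthoor_only = juthoor_roots - master_roots

print(len(master_cells), len(master_roots))   # rows vs unique roots
print(sorted(master_only), sorted(juthoor_only))
```

In the toy data, three rows yield four unique roots, and the single mangled string on the Juthoor side accounts for two master-only roots. The audit reports the same pattern at scale.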

---

## 2. Root cause — two bugs in `ingest_muajam.py`

### Bug A — pair-cell concatenation (the main one)

The 24 per-letter source files in `Data raw/Muajam Ishtiqaqi/Tables_Juthoor/` encode geminate-doublet pairs in a single cell using **two different separators**, inconsistently across files:

- `باب الباء.xlsx` uses **slash**: `'ب ث ث / ب ث ب ث'`
- `باب التاء.xlsx` uses **hyphen**: `'ت ب ب - ت ب ت ب'`

The ingest helper `strip_spaces()` only removes whitespace, so:

| Per-letter source | Juthoor `tri_root` | Should have been |
|---|---|---|
| `'ب ث ث / ب ث ب ث'` | `'بثث/بثبث'` (mangled) | two rows: `بثث`, `بثبث` |
| `'ت ب ب - ت ب ت ب'` | `'تبب-تبتب'` (mangled) | two rows: `تبب`, `تبتب` |
| `'أ و ب / أ ي ب'` | `'أوب/أيب'` (mangled) | two rows: `أوب`, `أيب` |

**Audit count:** 249 hyphen-form + 68 slash-form = **317 pair cells** in the per-letter sources, all concatenated into Frankenstein strings. The 380 «Juthoor-only» roots in the diff consist of these mangled strings plus a tail of edge cases such as `[سغغ]سغسغ` (brackets in the source). Splitting them properly yields the ~752 roots that currently live only in the master.
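
The failure mode is easy to reproduce. The sketch below assumes `strip_spaces()` behaves like a plain whitespace remover (its real body is not shown in this audit, so the version here is a stand-in) and contrasts it with a separator-aware split; `split_pair_cell` is a hypothetical name:

```python
import re

def strip_spaces(cell: str) -> str:
    # Stand-in for the ingest helper: removes whitespace and nothing else.
    return "".join(cell.split())

def split_pair_cell(cell: str) -> list[str]:
    # Hypothetical fix: split on hyphen or slash first, then strip spaces.
    return [strip_spaces(side) for side in re.split(r"[-/]", cell) if side.strip()]

for cell in ["ب ث ث / ب ث ب ث", "ت ب ب - ت ب ت ب", "أ و ب / أ ي ب"]:
    # Left: the mangled string the current ingest produces.
    # Right: the two roots it should have produced.
    print(strip_spaces(cell), "->", split_pair_cell(cell))
```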

### Bug B — multi-bab duplicates

15 `tri_root` values appear **twice** in `roots.jsonl`, classified under two different bab chapters. The full list: `رمي`, `أرم`, `رمح`, `رمد`, `رمز`, `رمض`, `رمن`, `رنن`, `رين`, `رهره`, `رهو`, `رهب`, `رهط`, `رهق`, `رهن`.

This is **likely a feature, not a bug**: Jabal's lexicon does cross-classify some roots. The problem is that the duplicates are silent. They inflate the row count without a `multi_bab` flag, which is what made the original audit see «+14 over master rows»: Juthoor's 1,938 rows are 1,923 unique strings plus 15 duplicates, and 1,923 is one fewer than the master's 1,924 rows, so the net is 15 − 1 = 14.
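
A short sketch of how silent duplicates separate the row count from the unique count. It assumes only that the `tri_root` strings are available as a list; `row_stats` is an illustrative helper and the sample list is toy data:

```python
from collections import Counter

def row_stats(tri_roots):
    """Return (row count, unique count, roots that occur more than once)."""
    counts = Counter(tri_roots)
    rows = sum(counts.values())
    unique = len(counts)
    repeated = sorted(root for root, n in counts.items() if n > 1)
    return rows, unique, repeated

# Toy list: six rows, two of the roots listed twice.
rows, unique, repeated = row_stats(["رمي", "رمح", "رمي", "رهن", "رهن", "أرم"])
print(rows, unique, repeated)
```

Run against the real file, the same computation should reproduce the 1,938 rows, 1,923 unique strings, and 15 repeated roots reported in the table in §1.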

---

## 3. Reconciliation arithmetic

Starting from Juthoor's 1,938 rows:

```
  1,938  Juthoor rows
-    15  internal duplicates (Bug B)         → 1,923 unique
- (380)  mangled pair strings → split into ~752 valid roots
                                              → 1,923 − 380 + 752 ≈ 2,295
```

Against the master's 2,300 unique roots this leaves a residual gap of ~5 roots, which is consistent with the handful of bracketed and edge-case strings observed in the diff (`[سغغ]سغسغ` etc.). **The 14-root delta is fully accounted for.**
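
The reconciliation can be checked mechanically. The snippet uses only the figures quoted in this audit; the variable names are illustrative, and the 752 figure is approximate, as stated above:

```python
# Master side: each pair cell contributes one extra root beyond its row.
master_rows = 1_924
pair_cells = 376
master_unique = 2_300
assert master_rows + pair_cells == master_unique

# Juthoor side.
juthoor_rows = 1_938
duplicates = 15        # Bug B
mangled = 380          # Bug A: concatenated pair strings and edge cases
recovered = 752        # approximate number of valid roots after splitting

juthoor_unique = juthoor_rows - duplicates
after_split = juthoor_unique - mangled + recovered
residual = master_unique - after_split

print(juthoor_unique, after_split, residual)  # 1923 2295 5
```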

---

## 4. Recommended fix (owned by Juthoor pipeline)

Update `Juthoor-ArabicGenome-LV1/scripts/ingest_muajam.py` in the companion `Juthoor-Linguistic-Genealogy` repo (computational pipeline, not hosted on this site):

1. **Pair-cell splitting** — when a `tri_root` cell matches the regex `^[ء-ي\s]+\s*[-/]\s*[ء-ي\s]+$`, emit one record per side, copying all sibling metadata (bab, binary_root, axial_meaning, quran_example) onto each side. The added-letter and axial-meaning columns may need per-side adjustment if the source separates them.
2. **Multi-bab flagging** — if the same `tri_root` appears under more than one bab, retain both rows but add a `multi_bab: true` boolean and a `bab_set: ["X","Y"]` list. Downstream consumers (Eye 1 skeleton matcher, nucleus scorer) already key by `tri_root`; the flag avoids silent double-counting.
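
The two steps above could be sketched as follows. This is a hedged illustration, not the actual `ingest_muajam.py` code: it assumes records are dicts with `tri_root` and `bab` keys, uses the regex quoted in step 1, and the names `expand_pairs` and `flag_multi_bab` are hypothetical. As a simplification, it writes `multi_bab` and `bab_set` onto every record rather than only onto the cross-classified ones.

```python
import re
from collections import defaultdict

# Regex from step 1: Arabic letters/spaces, a hyphen or slash, then more of the same.
PAIR_RE = re.compile(r"^[ء-ي\s]+\s*[-/]\s*[ء-ي\s]+$")

def expand_pairs(record):
    """Yield one record per root; a pair cell yields two, each with copied metadata."""
    cell = record["tri_root"]
    sides = re.split(r"[-/]", cell) if PAIR_RE.match(cell) else [cell]
    for side in sides:
        yield {**record, "tri_root": "".join(side.split())}

def flag_multi_bab(records):
    """Keep all rows; add multi_bab and bab_set to each one."""
    babs = defaultdict(set)
    for rec in records:
        babs[rec["tri_root"]].add(rec["bab"])
    return [
        {**rec,
         "multi_bab": len(babs[rec["tri_root"]]) > 1,
         "bab_set": sorted(babs[rec["tri_root"]])}
        for rec in records
    ]

# Toy input; the bab labels "X" and "Y" are placeholders.
raw = [
    {"tri_root": "ت ب ب - ت ب ت ب", "bab": "X"},
    {"tri_root": "ر م ي", "bab": "X"},
    {"tri_root": "ر م ي", "bab": "Y"},
]
rows = flag_multi_bab([r for rec in raw for r in expand_pairs(rec)])
for row in rows:
    print(row)
```

Cells that do not match the regex, such as the bracketed `[سغغ]سغسغ` form, pass through unsplit in this sketch and would still need separate handling.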

**Acceptance test:** after re-ingest, expect ~2,295–2,300 unique `tri_root` values and `len(roots.jsonl) ≈ 2,310–2,315` (the unique count plus ~15 multi-bab duplicates). Add this expectation as a regression test in `Juthoor-ArabicGenome-LV1/tests/`.

---

## 5. Audit artefacts

Programmatic and human-readable diffs are saved in the Juthoor sibling repo:

- `outputs/audits/14_root_delta.json` — full diff payload
- `outputs/audits/14_root_delta.md` — table of all 380 mangled and 757 master-only roots
- `scripts/audit_14_root_delta.py` — reproducer

All three live in the companion `Juthoor-Linguistic-Genealogy` repo (not deployed on this public site).

---

## 6. Closure

**§7.2.5: closed.** The original framing («14-root delta») was a row-count artefact masking a pair-cell concatenation bug (380 mangled strings) and 15 multi-bab duplicates. The bug lives in `ingest_muajam.py` (per the 2026-03-24 audit's hand-off: «out of scope for this folder, ticket open in `Juthoor-DataCore-LV0/`»). The fix recipe is documented above; remediation is a downstream code change, not a vault edit.

The Arabic Tongue vault remains the authoritative source. The xlsx is canonical at **2,300 unique trilateral roots**, not 1,924. The «1,924 trilateral roots» figure that has been quoted across the project (including in [`02-architecture/lv1-architecture.md`](../02-architecture/lv1-architecture.md) and §1.1 of [`our-contributions-and-roadmap.md`](../02-architecture/our-contributions-and-roadmap.md)) is the row count, not the unique-root count, and should be updated wherever it appears.

---

**Recommended follow-up edits in Arabic Tongue vault** (small, low-risk):

- `02-architecture/our-contributions-and-roadmap.md` §1.1: change «1,924 جذرًا» ("1,924 roots") to «1,924 صفّ معجميّ يحوي 2,300 جذرًا فريدًا» ("1,924 lexical rows containing 2,300 unique roots") or similar.
- `02-architecture/our-contributions-and-roadmap.md` §7.2.5: mark closed, link to this audit.
- `05-audits/2026-03-24-lv1-data-audit.md` §6.2: append a 2026-05-09 update referencing this closure.

These are documentation edits only; the methodological substance of the work (28-letter charges, 18 nuclei, four-bar rubric) is unchanged.
