# Layer 2 v2 — Binary Coverage Gap Analysis

**Question:** Why only 453 / 505 binary nuclei instead of 28² = 784?
**Answer:** Of the 279 unattested pairs, 139 are ruled out by known Arabic phonotactic constraints and the remaining 140 are genuine lexical gaps. The gaps are not arbitrary omissions from Jabal's lexicon.

> Generated by `scripts/layer_2/analyse_coverage_gap.py` — re-run anytime to refresh.

## Counting conventions

Two attested counts coexist in the corpus, and they say different things:

- **Raw attested forms (as written in Jabal):** 507. Includes binaries written with hamza-on-alif variants (أب، إذ، آن) and alif-maqsura (مى), counted as distinct strings.
- **Unique binaries in the canonical 28-letter (ا..ي) space:** 505. The variants above are folded to their alphabet letters (أ/إ/آ/ء → ا, ى → ي).

The partition below uses the **505-binary** count because it is the only one that lives in the same 28² space as the arithmetic. The 2-form gap between 507 and 505 reflects the typographic variation Jabal recorded; phonotactically, both counts describe the same set of L1·L2 pairs.

(2 entries were dropped as data anomalies — non-alphabet glyphs like brackets or stray diacritics: 'ضَ', '[س'.)
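
The folding convention above can be sketched in a few lines. This is a minimal illustration, not the code in `analyse_coverage_gap.py`: the names `ALPHABET`, `FOLD`, and `fold_binary` are assumptions introduced here, and the fold table contains only the mappings stated above.

```python
from typing import Optional

# Canonical 28-letter alphabet (ا..ي) used by this report.
ALPHABET = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"

# Fold table from the convention above: hamza forms -> ا, alif-maqsura -> ي.
FOLD = {"أ": "ا", "إ": "ا", "آ": "ا", "ء": "ا", "ى": "ي"}


def fold_binary(form: str) -> Optional[str]:
    """Return the folded two-letter binary, or None for a data anomaly."""
    folded_form = "".join(FOLD.get(ch, ch) for ch in form)
    # Anything that is not exactly two alphabet letters is treated as an anomaly
    # (for example a stray diacritic or a bracket).
    if len(folded_form) != 2 or any(ch not in ALPHABET for ch in folded_form):
        return None
    return folded_form


# Illustrative raw forms: the variants and anomalies quoted above, plus مي.
raw_forms = ["أب", "إذ", "آن", "مى", "مي", "ضَ", "[س"]
folded = {fold_binary(f) for f in raw_forms} - {None}
print(sorted(folded))
```

Here مى and مي collapse to a single entry after folding, which is why a raw count can exceed the folded count.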

## Summary

| Category | Count | % of 784 | % of 279 missing |
|----------|------:|---------:|---------------------:|
| Total possible (28 × 28) | 784 | 100% | — |
| **Attested (folded) in 28² space** | **505** | **64.4%** | — |
| Missing (unattested) | 279 | 35.6% | 100% |
| ↳ alif-initial (ا as L1 — even after fold) | 7 | — | 3% |
| ↳ identical XX (OCP) | 25 | — | 9% |
| ↳ same-articulator-class (soft OCP) | 107 | — | 38% |
| ↳ **genuine lexical gaps** | **140** | — | **50%** |

**Partition check:** 505 attested + 139 phonotactically blocked (7 + 25 + 107) + 140 lexical gaps = **784**, matching 28 × 28. ✓

**The result:** of 784 mathematical possibilities, 644 (82.1%) are either attested or phonotactically blocked. Only **140 pairs (17.9%)** are true lexical gaps: phonologically possible but unused by Arabic.
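
The arithmetic behind the summary can be re-checked directly from the table's counts. The snippet below is a small sanity check, not part of the generating script; the variable names are chosen here for illustration.

```python
TOTAL = 28 * 28  # every ordered pair of the 28 letters

attested = 505
blocked = {"alif_initial": 7, "identical_xx": 25, "same_class": 107}
lexical_gaps = 140

blocked_total = sum(blocked.values())    # pairs removed by filters 1-3
missing = blocked_total + lexical_gaps   # all unattested pairs
accounted = attested + blocked_total     # attested or phonotactically blocked


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# The partition must cover the whole 28 x 28 space exactly once.
assert attested + missing == TOTAL

print(f"blocked: {blocked_total}, missing: {missing}")
print(f"accounted: {accounted} ({pct(accounted, TOTAL)}%)")
print(f"lexical gaps: {lexical_gaps} ({pct(lexical_gaps, TOTAL)}%)")
```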

## Filter 1 — Alif-initial (7 pairs remain missing after hamza-fold)

ا (plain alif, a long-vowel letter) cannot begin an Arabic root on its own. It serves either as a vowel marker or as a seat for hamza (أ/إ/آ).

- **ا-initial attested via hamza-fold** (21): اب · ات · اث · اج · اح · اخ · اد · اذ · ار · از · اس · اش · اص · اف · اك · ال · ام · ان · اه · او · اي
- **ا-initial still missing** (7): اا · اض · اط · اظ · اع · اغ · اق

## Filter 2 — Identical XX pairs (25 of 28 missing)

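The Obligatory Contour Principle (OCP) strongly disfavors identical adjacent consonants in Arabic root onsets. Of the 28 possible XX pairs, 2 are attested and 26 are missing; one of the missing pairs, اا, is already counted under Filter 1, which leaves 25 for this filter.

- **Attested XX** (2): هه · وو (only the soft glides)
- **Missing XX** (25): بب · تت · ثث · جج · حح · خخ · دد · ذذ · رر · زز · سس · شش · صص · ضض · طط · ظظ · عع · غغ · فف · قق · كك · لل · مم · نن · يي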
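
The count of 25 can be reproduced by enumerating all 28 doubled letters, removing the two attested ones, and setting aside اا, which the Filter 1 list already includes. A short sketch (variable names are illustrative, not taken from the script):

```python
ALPHABET = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"
ATTESTED_XX = {"هه", "وو"}  # the two attested identical pairs listed above

all_xx = [ch * 2 for ch in ALPHABET]                      # 28 doubled letters
missing_xx = [p for p in all_xx if p not in ATTESTED_XX]  # includes اا

# اا is attributed to Filter 1 (alif-initial), so it is excluded here.
filter2 = [p for p in missing_xx if not p.startswith("ا")]

print(len(all_xx), len(missing_xx), len(filter2))
```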

## Filter 3 — Same-articulator-class pairs

Soft OCP: when both letters use the same articulator (e.g., two alveolars in a row), the pair is dispreferred. Of the 279 missing pairs, 107 are non-identical same-class pairs:

| Articulator class | Missing same-class pairs |
|-------------------|--------------------------:|
| alveolar | 79 |
| velar | 12 |
| labial | 6 |
| liquid | 4 |
| palatal | 3 |
| pharyngeal | 2 |
| glottal | 1 |

### Alveolar → alveolar zoom-in

The largest same-class gap. Alveolar letters: ت · ث · د · ذ · ز · س · ص · ض · ط · ظ

- Possible alveolar-alveolar pairs (excluding XX): 10 × 9 = 90
- Attested: 11 → تس · دث · دس · زت · ست · سد · سط · صت · صد · ضد · ضز
- Missing: 79
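
The zoom-in figures follow from enumerating ordered pairs of the ten alveolar letters and subtracting the attested ones. A minimal sketch using only the letters and attested pairs listed above (the variable names are assumptions):

```python
from itertools import permutations

ALVEOLAR = "تثدذزسصضطظ"
ATTESTED = {"تس", "دث", "دس", "زت", "ست", "سد", "سط", "صت", "صد", "ضد", "ضز"}

# Ordered pairs of two distinct alveolars; identical XX pairs are excluded
# because permutations never repeats an element.
possible = {a + b for a, b in permutations(ALVEOLAR, 2)}
missing = sorted(possible - ATTESTED)

print(len(possible), len(ATTESTED), len(missing))
```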

## Filter 4 — Genuine lexical gaps

After applying filters 1–3, 140 pairs remain that are **phonologically permissible but unattested in Jabal's lexicon**. These are the real test cases for the operative model:

> Can the v2 framework predict what these unused binaries *would* mean if Arabic used them?

| L1 | Missing pairs |
|----|---------------|
| ي (16) | يث · يح · يخ · يذ · ير · يز · يص · يض · يط · يظ · يع · يغ · يف · يك · يل · يه |
| ظ (12) | ظا · ظب · ظج · ظح · ظخ · ظر · ظش · ظغ · ظق · ظك · ظو · ظي |
| ه (12) | هث · هح · هخ · هذ · هس · هص · هظ · هع · هغ · هف · هق · هك |
| ج (9) | جت · جخ · جص · جض · جط · جظ · جغ · جق · جك |
| ث (8) | ثا · ثح · ثش · ثغ · ثف · ثك · ثه · ثي |
| خ (7) | خا · خث · خج · خح · خظ · خع · خه |
| ش (7) | شث · شذ · شز · شس · شص · شض · شل |
| غ (7) | غا · غت · غج · غح · غذ · غع · غه |
| ط (6) | طا · طج · طخ · طش · طق · طك |
| ت (5) | تا · تخ · تش · تغ · تك |
| ذ (5) | ذج · ذح · ذش · ذغ · ذف |
| ك (5) | كج · كح · كص · كض · كط |
| ل (5) | لث · لخ · لش · لص · لض |
| ض (4) | ضخ · ضش · ضق · ضك |
| ح (4) | حا · حخ · حغ · حه |
| ص (4) | صا · صج · صش · صق |
| ق (4) | قا · قج · قز · قظ |
| د (3) | دج · دش · دغ |
| و (3) | وخ · وظ · وغ |
| ف (3) | فث · فذ · فغ |
| ع (3) | عا · عخ · عغ |
| م (3) | مذ · مظ · مغ |
| ز (2) | زا · زش |
| ب (1) | بظ |
| س (1) | سش |
| ر (1) | رظ |

## What this means for the v2 result

**The 100% native composition rate for Layer 2 v2 is honestly bounded:** every triliteral that *exists* in Arabic gets a coherent operative reading. The rate does not claim that every theoretically possible XYZ would.

This is the same honest scope as Newtonian mechanics: it doesn't predict things that never happen — it explains what *does* happen.

The companion question, **"do unused binaries have predictable charges?"**, is the natural next test:

- Pick a sample of the lexical-gap pairs above
- For each, compute the predicted binary reading from the two letter-charges (Layer 1 logic)
- Ask: does the reading describe something Arabic *could* mean? Or does the combination self-cancel?

This would close the loop from attested-and-tested to unattested-but-predicted.
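
The procedure could be scaffolded roughly as follows. Everything here is hypothetical: `sample_pairs`, `predict_reading`, the placeholder charges, and the `concat` combiner are stand-ins, since the actual Layer 1 letter-charges and composition rule are not given in this report. Only the gap pairs themselves are copied from the Filter 4 table.

```python
import random

# A few lexical-gap pairs copied from the Filter 4 table (not the full 140).
GAP_PAIRS = ["بظ", "سش", "رظ", "زا", "زش", "دج", "دش", "دغ"]


def sample_pairs(pairs, k, seed=0):
    """Draw a reproducible sample of k pairs."""
    return random.Random(seed).sample(pairs, k)


def predict_reading(pair, charges, combine):
    """Combine the two letter-charges; both are supplied by the caller."""
    l1, l2 = pair
    return combine(charges[l1], charges[l2])


# Placeholder charges: stand-ins only, NOT actual Layer 1 values.
placeholder_charges = {ch: f"<charge of {ch}>" for pair in GAP_PAIRS for ch in pair}


def concat(c1, c2):
    """Hypothetical combiner that just joins the two charge labels."""
    return f"{c1} + {c2}"


for pair in sample_pairs(GAP_PAIRS, 3):
    print(pair, "->", predict_reading(pair, placeholder_charges, concat))
```

Swapping in the real charge table and composition rule would leave the sampling and reporting code unchanged; judging whether a reading is coherent or self-cancelling remains a manual step.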

## Articulator-class reference

- **glottal:** ء · ا · ه
- **pharyngeal:** ح · ع
- **velar:** خ · غ · ق · ك
- **palatal:** ج · ش · ي
- **alveolar:** ت · ث · د · ذ · ز · س · ص · ض · ط · ظ
- **labial:** ب · ف · م · و
- **liquid:** ر · ل · ن
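
The class lists above are enough to classify any pair in the 28² space. The sketch below is an illustration rather than the script's logic: the `classify` function and its filter precedence (alif-initial first, then identical, then same-class) are assumptions inferred from how the report's counts avoid double counting. It classifies all 784 pairs, whereas the report's figures of 7, 25, and 107 apply the same tests to the missing pairs only.

```python
from collections import Counter
from itertools import product

ALPHABET = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"

# Articulator classes from the reference list; ء is omitted because it is
# folded to ا in the 28-letter space.
CLASSES = {
    "glottal": "اه",
    "pharyngeal": "حع",
    "velar": "خغقك",
    "palatal": "جشي",
    "alveolar": "تثدذزسصضطظ",
    "labial": "بفمو",
    "liquid": "رلن",
}
CLASS_OF = {ch: name for name, letters in CLASSES.items() for ch in letters}


def classify(pair: str) -> str:
    """Assign a pair to the first filter that applies, else 'open'."""
    l1, l2 = pair
    if l1 == "ا":
        return "alif-initial"
    if l1 == l2:
        return "identical"
    if CLASS_OF[l1] == CLASS_OF[l2]:
        return "same-class"
    return "open"  # no filter applies; if unattested, a lexical-gap candidate


counts = Counter(classify(a + b) for a, b in product(ALPHABET, repeat=2))
print(dict(counts))
```

Intersecting each category with the set of unattested pairs would give the per-filter counts reported in the summary.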

---

_Re-generate this report:_ `python scripts/layer_2/analyse_coverage_gap.py`