
Building an AI-Assisted Shoghi Effendi Translation Lookup Dictionary

Chad Jones

I want to tell you about one of the most satisfying pieces of software I’ve ever built. It’s a dictionary — but not a normal one. It’s a concordance that maps every significant Arabic and Persian word in Shoghi Effendi’s translations to the English rendering he chose, organized by trilateral root, with cross-references to every passage where that root appears. And it runs instantly, with zero AI calls at query time.

The path to getting there was full of surprises — and a complete architectural restart. What started as a straightforward “batch-process every unique word” approach turned into a deep dive into Arabic morphology, Unicode edge cases, word-level alignment, and the surprising power of letting AI verify its own work.

The corpus

Shoghi Effendi (1897-1957) was the Guardian of the Baha’i Faith — a unique position making him the only authorized interpreter of the meaning of the revelation after ‘Abdu’l-Baha. Among his many contributions, he translated key Baha’i texts from Arabic and Persian into English with a distinctive literary style that’s been studied by translators ever since. His word choices were deliberate — when he rendered an Arabic word as “sovereignty” in one passage and “dominion” in another, the distinction mattered.

Our corpus contains 2,521 source/translation paragraph pairs across 11 of his major translations: the Kitab-i-Iqan, Gleanings from the Writings of Baha’u’llah, Epistle to the Son of the Wolf, Prayers and Meditations, the Hidden Words, Will and Testament of ‘Abdu’l-Baha, and several shorter works. About 126,000 content-word tokens. Not huge by modern standards, but linguistically dense.

The goal: let a researcher type an Arabic or Persian phrase and instantly see every occurrence of each word across the entire corpus, grouped by trilateral root, with the exact English rendering Shoghi Effendi used in each passage.

The first attempt: AI at runtime

Our first implementation was the obvious one. When a user searched for a phrase like نار الحبّ (“the fire of love”):

  1. Call Claude to extract terms — “What are the significant words here? Skip the particles.” Claude would return structured data: نار (nār, root ن-و-ر, “fire”) and حبّ (ḥubb, root ح-ب-ب, “love”).

  2. Search Meilisearch for each term — our search engine has all 2,521 passage pairs indexed with OpenAI embeddings for hybrid semantic+keyword search. This finds passages containing the word or morphological variants.

  3. Call Claude again with the passages — “Here are 20 passages containing this word. For each one, tell me the exact Arabic form, which English word renders it, and give me short excerpts.”

It worked. And it was beautiful when it worked. But it had problems:

  • Slow: 3-8 seconds per query. Two AI round-trips plus search.
  • Expensive: Every query burned API tokens.
  • Fragile: Required both Meilisearch and an Anthropic API key at runtime.
  • Inconsistent: AI sometimes gave slightly different analyses for the same word across queries.

Then it hit me: the corpus is fixed. These 2,521 passages aren’t changing. Why compute the same answer every time someone asks?

The insight: pre-compute everything

If we run the AI analysis once for every word in the corpus and store the results, runtime becomes pure database lookups. No AI, no search engine, no network calls. Just SQLite.

Version 1: batch word processing

The first pre-computation approach was straightforward. Extract every unique content word from the corpus (~20,000 after normalization), batch them into groups of 15, search Meilisearch for context passages, and send each batch to Claude Haiku: “Here are 15 words with example passages — give me the root, transliteration, meaning, and English renderings.”

About 1,350 API calls. $5-15 at Haiku pricing. The result: 2,537 roots with 31,432 occurrences. It worked, but it had fundamental problems:

  • Wrong root matches: أَحَبُّ from root ḥ-b-b got matched to the preposition bi- (بـ) because the AI was seeing words out of context
  • Split renderings: “fair-minded” and “fair-minded person” counted as separate renderings because the AI processed words in batches, not in their original passages
  • Missing words: complex forms like لِتَمْلِكَ failed lookup entirely because the AI had to guess alignment without seeing the actual translation context
  • Low coverage: 31,000 occurrences across 126,000 tokens meant we were missing the majority of words

The core problem: processing words in isolation loses the alignment between source and translation. The AI was doing morphological analysis, identifying English renderings, and matching passages all at once, with only keyword-search context. Too many opportunities for errors to compound.

Version 2: alignment-first architecture

The fix was to invert the process. Instead of “extract words, then find their renderings,” we do “align every word in every paragraph, then extract the concordance from the alignments.”

The key insight: word-level alignment is the foundation for everything. Once you know which Arabic word maps to which English word in every paragraph, you get two things at once: interactive side-by-side paragraph display (hover over an Arabic word, see the English rendering highlight) and a clean concordance extracted from the alignments. One AI operation, two major features.

Phase 1: Word-level alignment

The alignment approach is elegant in its simplicity. For each of the 2,521 paragraphs, we tokenize both the Arabic/Persian source and the English translation into indexed word lists. Then Claude Sonnet maps each source word index to the English word index (or index range) that translates it.
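As a sketch (the function name and the whitespace-only split are simplifications, not the project's actual tokenizer), the indexed word lists might be produced like this:

```javascript
// Illustrative tokenizer: split on whitespace, keep token order stable
// so the numeric indexes in the alignment prompt stay meaningful.
function tokenize(text) {
  return text.split(/\s+/).filter(Boolean);
}

// Both sides become indexed lists the alignment prompt references:
const srcWords = tokenize('يا ابن الروح');     // Arabic/Persian source
const tgtWords = tokenize('O SON OF SPIRIT');  // English translation
srcWords.forEach((w, i) => console.log(`[${i}] ${w}`));
tgtWords.forEach((w, i) => console.log(`[${i}] ${w}`));
```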

The prompt provides vocabulary hints from any existing concordance data: for each Arabic word, we show the root and meaning that Jafar, our concordance database, already knows. This grounds the alignment in established terminology:

You are aligning Arabic text with its English translation by Shoghi Effendi.

ARABIC WORDS (with vocabulary hints from Shoghi Effendi's corpus):
[0] يَا — root y-ʾ "O"
[2] ابْنَ — root b-n-w "son"
[3] الرُّوحِ — root r-w-ḥ "spirit"

ENGLISH WORDS:
[0] O
[1] SON
[2] OF
[3] SPIRIT

For each Arabic word index, give the English word index that translates it.
Examples: "2": 3 (single word), "2": [6, 7] (inclusive range), "0": null (no match).

The AI returns a clean index mapping: {"0": 0, "2": [1, 2], "3": 3}. Shoghi Effendi freely rearranged clause order, so the rules emphasize matching by meaning, not position. Each English index can appear in at most one mapping (no overlaps), and mappings are capped at 4 English words per Arabic word to keep them tight.
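Those two rules are cheap to enforce mechanically before accepting a mapping. A hedged sketch (`validateMapping` is illustrative, not the project's actual code):

```javascript
// Reject a mapping that reuses an English index or spans more than
// 4 English words per Arabic word. Values are an index, an inclusive
// [start, end] range, or null for "no match".
function validateMapping(mapping) {
  const used = new Set();
  for (const [srcIdx, tgt] of Object.entries(mapping)) {
    if (tgt === null) continue; // unmatched source word is allowed
    const [start, end] = Array.isArray(tgt) ? tgt : [tgt, tgt];
    if (end - start + 1 > 4) {
      throw new Error(`Arabic word ${srcIdx} maps to more than 4 English words`);
    }
    for (let i = start; i <= end; i++) {
      if (used.has(i)) throw new Error(`English index ${i} used twice`);
      used.add(i);
    }
  }
  return true;
}
```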

We convert these index pairs into character-offset ranges anchored to the original text: {src: [0, 3], tgt: [0, 1], ar: "يَا", en: "O"}. Adjacent Arabic words mapping to the same English range get merged. The result is an alignment array stored in each paragraph’s JSON file.
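The index-to-offset conversion can be sketched like this (helper names are mine, not the project's; the merge step for adjacent same-target words is omitted):

```javascript
// Locate each token in the original string in order, so repeated
// words resolve to the correct occurrence.
function charOffsets(words, text) {
  let cursor = 0;
  return words.map(w => {
    const at = text.indexOf(w, cursor);
    cursor = at + w.length;
    return [at, at + w.length];
  });
}

// Turn the AI's index mapping into character-anchored alignment entries.
function buildAlignment(srcWords, srcText, tgtWords, tgtText, mapping) {
  const srcOff = charOffsets(srcWords, srcText);
  const tgtOff = charOffsets(tgtWords, tgtText);
  const entries = [];
  for (const [i, tgt] of Object.entries(mapping)) {
    if (tgt === null) continue; // unmatched source word
    const [t0, t1] = Array.isArray(tgt) ? tgt : [tgt, tgt];
    entries.push({
      src: srcOff[+i],
      tgt: [tgtOff[t0][0], tgtOff[t1][1]],
      ar: srcWords[+i],
      en: tgtWords.slice(t0, t1 + 1).join(' '),
    });
  }
  return entries; // merging of adjacent same-target Arabic words would follow
}
```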

This alignment data serves double duty. At display time, the side-by-side paragraph viewer uses it for interactive word highlighting — tap an Arabic word and see its English rendering light up. At concordance-build time, it tells us exactly which English word Shoghi Effendi used for each Arabic word in each passage. No guessing, no keyword-search approximation.

Cost: ~$20-25 for all 2,521 paragraphs with Sonnet. Resumable — progress saved after each paragraph.

Phase 2: Linguistic enrichment with generate + verify

With alignments in hand, we walk every paragraph and ask Claude Sonnet to analyze each word. The prompt includes each word alongside its aligned English rendering, so the AI has full context:

ARABIC WORDS TO ANALYZE (with English alignment):
[0] يَا → "O"
[2] ابْنَ → "SON OF"
[3] الرُّوحِ → "SPIRIT"

For each word, the AI returns: trilateral root (dash-separated Arabic letters), uninflected lemma, literal dictionary meaning, part of speech, verb form (I-X for verbs), and a proper noun flag.

The quality mechanism: two independent AI calls per paragraph.

Call 1 (generate): Produce the full linguistic analysis.

Call 2 (verify): A fresh Sonnet call reviews Call 1’s output against the same source material. It returns only corrections — most paragraphs come back with 1-3 fixes; some come back clean.

// Call 1: Generate (aiCall wraps the Anthropic API and parses the JSON reply)
const generated = await aiCall(generatePrompt); // per-word: root, base, literal, pos, verb_form, is_name

// Call 2: Verify (an independent pass over the same source material)
const corrections = await aiCall(verifyPrompt); // returns only the fields to fix, keyed by word index
for (const [idx, fixes] of Object.entries(corrections)) {
  Object.assign(generated[idx], fixes); // merge each correction over the generated analysis
}

Phase 3: Aggregate into the concordance

The enrichment files give us per-word metadata for every word in every paragraph. Building the concordance from this is relatively straightforward post-processing — no AI needed:

  1. Filter functional words: Particles, pronouns, prepositions, and other stop words are excluded from the concordance. The AI already tagged parts of speech, so filtering is a simple check.
  2. Filter proper names: Words flagged as is_name get their own root entries with the name flag set, keeping personal names from polluting the rendering spectrum of their root (e.g., “Bahá” the name vs. بَهاء the concept of “glory”).
  3. Calculate base words: The raw English alignment (“the justice”, “His sovereignty”) gets article-stripped to its core rendering (“justice”, “sovereignty”) for cleaner spectrum analysis.
  4. Group by root: All forms of ق-ل-ب across all paragraphs become occurrences under one root entry.
  5. Build cross-root links: Roots that share English renderings get linked — if ق-ل-ب and ف-ء-د both produce “heart,” they reference each other.

This post-processing is the easy part. The hard part — figuring out which English word goes with which Arabic word — was solved by the alignment step. Everything downstream is deterministic.
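A minimal sketch of that deterministic aggregation, with illustrative names and a simplified record shape (root, enBase, isStopWord):

```javascript
// Group per-word records by root, then link any two roots that share
// an English base rendering.
function buildConcordance(records) {
  const roots = new Map(); // root → { occurrences, renderings }
  for (const r of records) {
    if (!r.root || r.isStopWord) continue; // functional words already filtered
    if (!roots.has(r.root)) {
      roots.set(r.root, { occurrences: [], renderings: new Set() });
    }
    const entry = roots.get(r.root);
    entry.occurrences.push(r);
    if (r.enBase) entry.renderings.add(r.enBase.toLowerCase());
  }
  // Cross-root links: roots whose rendering sets intersect.
  const links = new Map();
  for (const [rootA, a] of roots) {
    links.set(rootA, []);
    for (const [rootB, b] of roots) {
      if (rootA !== rootB && [...a.renderings].some(x => b.renderings.has(x))) {
        links.get(rootA).push(rootB);
      }
    }
  }
  return { roots, links };
}
```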

The result: 4,035 roots with 105,869 occurrences — 3.4x more occurrences than Version 1, because we’re indexing every word in every paragraph rather than sampling via keyword search. And the renderings are more accurate because each one comes from explicit word-level alignment, not from an AI guessing correspondences from keyword-search context.

The normalization rabbit hole

Before any AI processing, every Arabic/Persian token goes through normalization. This turned out to be one of the deepest rabbit holes in the project.

Arabic and Persian share an alphabet (sort of)

Here’s something that surprised me. Arabic and Persian both use the Arabic script, but they use different Unicode characters for some of the same letters. The letter that English speakers would think of as “y” has two Unicode representations:

  • ي (U+064A) — Arabic Ya
  • ی (U+06CC) — Persian Ya

They look almost identical. In most fonts they render the same way. But to a computer, they are completely different characters. بين (Arabic ya) and بین (Persian ya) — both meaning “between” — are different strings. They won’t match each other in a database lookup. A stop word list containing one won’t catch the other.

The same problem exists for kaf:

  • ك (U+0643) — Arabic Kaf
  • ک (U+06A9) — Persian Kaf

And it gets worse. Arabic has five forms of alef depending on whether it carries a hamza (a glottal stop marker) and where:

  • ا — Plain Alef (U+0627)
  • أ — Alef with Hamza Above (U+0623)
  • إ — Alef with Hamza Below (U+0625)
  • آ — Alef with Madda (U+0622)
  • ٱ — Alef Wasla (U+0671)

Plus taa marbuta (ة), which is often interchangeable with ha (ه): it's the feminine ending marker, but in Persian texts it's typically written as ه. And alef maqsura (ى), which looks like ya but isn't.

Our corpus, being a mix of Arabic texts and Persian texts with Arabic quotations, had all of these variants. The word “to” (إلى in formal Arabic) might appear as إلى, الی, or إلی depending on the text. That’s three different strings for the same word.

The normalizer

The solution was a function that collapses all variants to a single canonical form:

function normalize(token) {
  let t = token
    // Strip punctuation and zero-width characters
    .replace(/[.*,:;\?\!\(\)\[\]\{\}«»،؛؟۔…‌‍‎‏]/g, '');
  // Strip tashkil (vowel diacritics)
  t = t.replace(/[\u064B-\u065F\u0670]/g, '');
  // Unify character variants
  t = t.replace(/ي/g, 'ی')   // Arabic ya → Persian ya
    .replace(/ك/g, 'ک')      // Arabic kaf → Persian kaf
    .replace(/ؤ/g, 'و')      // hamza on waw → waw
    .replace(/ئ/g, 'ی')      // hamza on ya → ya
    .replace(/ٱ/g, 'ا')      // alef wasla → plain alef
    .replace(/آ/g, 'ا')      // alef madda → plain alef
    .replace(/أ/g, 'ا')      // hamza above → plain alef
    .replace(/إ/g, 'ا')      // hamza below → plain alef
    .replace(/ة/g, 'ه')      // taa marbuta → ha
    .replace(/ى/g, 'ی');     // alef maqsura → ya
  return t;
}

After normalization, كَلِمَة, کلمه, and كلمة all become کلمه. One canonical form.

The stop word trap

With normalization working, I wrote a stop word list — about 150 Arabic and Persian function words: particles (و، في، من), pronouns (هو، هي), prepositions (على، إلى), relative pronouns (الّذي، الّتي), conjunctions (بل، ثمّ), Persian auxiliaries (شد، بود، نمود), and so on.

I tested it. And discovered that بين (“between,” appearing 315 times) was sailing right through the stop word filter. So were الذین (“those who,” 270 times), التی (“which,” 236 times), and إلا (“except,” 439 times).

The bug was subtle and infuriating. My stop word list contained بين with Arabic ya (ي). But the tokens from the corpus, after going through the normalizer, came out as بین with Persian ya (ی). The normalizer was working correctly — it unified everything to Persian ya. But the stop word list itself hadn’t been normalized. I was comparing normalized tokens against un-normalized stop words.

The fix was embarrassingly simple: run the stop words through the same normalizer.

const _STOP_RAW = ['الّذي', 'إلى', 'بين', ...]; // Human-readable
const STOP_WORDS = new Set(_STOP_RAW.map(normalize)); // Machine-comparable

This is one of those lessons you don’t forget: if you normalize your data, you must normalize your reference sets through the exact same function.

The invisible characters

There was one more normalization surprise. Persian text commonly uses the Zero-Width Non-Joiner (U+200C, ZWNJ) to control how letters connect. In the word می‌خواهد (“he wants”), there’s a ZWNJ between می and خواهد that prevents the two parts from visually joining. You can’t see it. It takes up no space. But it’s there in the Unicode, and it makes the string different from میخواهد without the ZWNJ.

There’s also the Zero-Width Joiner (U+200D), Left-to-Right Mark (U+200E), and Right-to-Left Mark (U+200F). None of them produce visible output, but all of them create string mismatches.

We strip all of them.
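In code, that's a single character class, since the four code points are contiguous (shown standalone here; the real normalizer folds them into its punctuation pass):

```javascript
// U+200C ZWNJ, U+200D ZWJ, U+200E LRM, U+200F RLM
const INVISIBLE = /[\u200C-\u200F]/g;

const withZwnj = 'می\u200Cخواهد'; // "he wants", with a ZWNJ inside
const stripped = withZwnj.replace(INVISIBLE, '');
// stripped === 'میخواهد', and withZwnj !== stripped even though
// the two strings can render identically
```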

Why AI is genuinely necessary

You might wonder: can’t we just use a stemmer? Arabic has well-known stemming algorithms. Why involve AI at all?

The answer is broken plurals. Arabic has two plural systems. Regular (sound) plurals add a suffix: معلم (teacher) → معلمون (teachers). A rule-based stemmer handles these fine.

But broken plurals change the internal vowel pattern of the word: قلب (heart) → قلوب (hearts). كتاب (book) → كتب (books). رجل (man) → رجال (men). There’s no suffix to strip. The word has been restructured from the inside. Arabic has dozens of broken plural patterns, and knowing which one applies to which word requires knowing the word — it’s lexical knowledge, not algorithmic.

There’s more:

Root identification: Determining the trilateral root from a surface form requires morphological knowledge. استقامت (steadfastness) comes from root ق-و-م (to stand). You can’t derive that by stripping affixes — the root letters are buried inside a seven-letter word under layers of derivational morphology.

Cross-lingual alignment: This is the core value. Given an Arabic passage and its English translation side by side, the AI identifies which English word corresponds to which Arabic word. Shoghi Effendi freely rearranged clause order — matching by meaning, not position, requires genuine bilingual comprehension.

The two-step verification: A single AI call makes mistakes — wrong root assignments, misidentified parts of speech. But when a second call independently reviews the first, it catches errors the first pass misses. This generate-then-verify pattern is more reliable than trying to get it right in one shot.

Quality assurance

After the initial concordance build, a separate three-stage upgrade script cleans the data:

Stage 1 (deterministic): Unicode normalization, duplicate merging, orphan cleanup, cross-root link regeneration. Safe to re-run anytime.

Stage 2 (heuristic): Reassign skeleton mismatches (if a form’s consonant pattern doesn’t match its assigned root), strip English article prefixes, flag heterogeneous clusters (roots with too many unrelated renderings).

Stage 3 (AI-assisted): Claude Sonnet reviews flagged items. Returns verdicts with proposed corrections. A separate script applies verified corrections.

The database schema

Two tables, nine indexes.

CREATE TABLE roots (
  id              INTEGER PRIMARY KEY,
  root            TEXT NOT NULL UNIQUE,   -- 'ق-ل-ب'
  transliteration TEXT NOT NULL,          -- 'q-l-b'
  meaning         TEXT NOT NULL,          -- 'heart; to turn'
  slug            TEXT,                   -- 'qalb-heart' (URL-safe)
  is_name         INTEGER DEFAULT 0,     -- proper noun flag
  similar         TEXT                    -- JSON: [42, 87]
);

CREATE TABLE occurrences (
  id        INTEGER PRIMARY KEY,
  root_id   INTEGER NOT NULL REFERENCES roots(id),
  form      TEXT NOT NULL,     -- 'قلوبهم' (original)
  form_norm TEXT NOT NULL,     -- 'قلوبهم' (normalized)
  stem      TEXT NOT NULL,     -- 'قلب' (dictionary lemma)
  en        TEXT NOT NULL,     -- 'hearts' (aligned English)
  en_base   TEXT,              -- 'heart' (article-stripped)
  src       TEXT NOT NULL,     -- source context excerpt
  tr        TEXT NOT NULL,     -- translation context excerpt
  ref       TEXT NOT NULL,     -- 'HW§2'
  pair_id   TEXT NOT NULL,     -- 'the-hidden-words/2'
  is_name   INTEGER DEFAULT 0, -- proper noun occurrence
  pos       TEXT,              -- 'noun', 'verb', 'adj', etc.
  verb_form TEXT               -- 'I' through 'X' (verbs only)
);

The current database: 4,035 roots, 105,869 occurrences. About 45 MB. The similar column uses JSON instead of a join table. The data is read-only — simplicity wins.

The en_base column deserves mention. The raw en field contains the aligned text exactly as it appears: “the justice”, “His sovereignty”, “of hearts”. The en_base strips leading/trailing grammatical words to get the core rendering: “justice”, “sovereignty”, “hearts”. This enables cleaner rendering spectrum analysis — “justice” appears 45 times, not split across “the justice” (12), “of justice” (8), “His justice” (3).
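The stripping itself is simple. A sketch, where the edge-word list is illustrative rather than the project's actual list:

```javascript
// Grammatical words to peel off the edges of a raw aligned rendering.
const EDGE_WORDS = new Set(['the', 'a', 'an', 'of', 'his', 'her', 'its',
  'their', 'my', 'thy', 'thine']);

function toBase(rendering) {
  const words = rendering.toLowerCase().split(/\s+/);
  // Peel from both ends, but never empty the rendering entirely.
  while (words.length > 1 && EDGE_WORDS.has(words[0])) words.shift();
  while (words.length > 1 && EDGE_WORDS.has(words[words.length - 1])) words.pop();
  return words.join(' ');
}
```

So "the justice", "of justice", and "His justice" all collapse to "justice" for spectrum counting.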

Runtime: the payoff

At query time, the concordance module does no AI calls, no search engine queries, no network requests. Here’s the full lookup cascade when a user types نار الحبّ:

  1. Normalize: Strip tashkil, unify character variants, remove punctuation → ['نار', 'حب']

  2. Filter stop words: Neither word is a stop word, so both proceed.

  3. For each word, cascade through five lookup strategies:

    • Exact form_norm match
    • Affix-stripped form_norm variants (strip common Arabic/Persian prefixes and suffixes)
    • Exact stem match (AI-assigned dictionary lemma)
    • Affix-stripped stem variants
    • Root consonant skeleton (handles hamza normalization — last resort)
  4. Once a hit is found, look up the root and fetch all occurrences for that root — not just the matching ones. If you search قلوب (hearts), you get قلب (heart), قلوبهم (their hearts), بقلبک (with your heart) — every form of root ق-ل-ب in the corpus.

  5. Resolve cross-root links: Parse the similar JSON array, look up those roots, include them tagged as “similar.”

  6. Deduplicate: If two input words resolve to the same root, include it only once.

The entire operation takes about 1 millisecond. All of the hard work — root identification, broken plural resolution, alignment, rendering matching — was done once by AI at build time and cached forever in SQLite.
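To make the cascade concrete, here's an illustrative sketch over in-memory maps standing in for the form_norm and stem indexes. The affix lists are abbreviated, the helper names are mine, and the skeleton fallback is omitted:

```javascript
const PREFIXES = ['ال', 'و', 'ب', 'ل', 'ف'];   // common proclitics (partial list)
const SUFFIXES = ['ها', 'ات', 'ین', 'ه', 'ی']; // common endings (partial list)

// Candidate forms with one common prefix or suffix stripped.
function affixVariants(word) {
  const out = new Set();
  for (const p of PREFIXES) if (word.startsWith(p)) out.add(word.slice(p.length));
  for (const s of SUFFIXES) if (word.endsWith(s)) out.add(word.slice(0, -s.length));
  return [...out];
}

// The cascade: exact form, stripped form, exact stem, stripped stem.
function lookupRoot(word, byForm, byStem) {
  if (byForm.has(word)) return byForm.get(word);             // 1. exact form_norm
  for (const v of affixVariants(word))
    if (byForm.has(v)) return byForm.get(v);                 // 2. stripped form
  if (byStem.has(word)) return byStem.get(word);             // 3. exact stem
  for (const v of affixVariants(word))
    if (byStem.has(v)) return byStem.get(v);                 // 4. stripped stem
  return null; // 5. consonant-skeleton fallback omitted in this sketch
}
```

In production the same cascade runs as indexed SQLite queries rather than map lookups.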

For production, the same SQLite database runs on Cloudflare D1 at the edge. Same SQL, same data, same speed — globally.

The enrichment pipeline

The concordance database feeds a three-stage enrichment pipeline that annotates each paragraph with detailed linguistic analysis:

Stage 1 (no AI): Tokenize the Arabic/Persian text, look up each token in Jafar using the five-level cascade, find the English rendering by matching token positions against the alignment array. This produces enriched terms with rendering spectrums, deviation detection, cross-references, and condensation/expansion notes. Safe to re-run anytime.

Stage 2 (Sonnet): Re-align source and translation when alignment is missing or stale. Destructive — must be explicitly requested.

Stage 3 (Haiku): Generate comparative notes explaining how Shoghi Effendi’s rendering relates to the dictionary meaning of each word. Validates that SE rendering quotes actually appear in the spectrum data.
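That validation step can be sketched as a simple quote check (names and note format are illustrative):

```javascript
// Return any renderings quoted in a generated note that do NOT appear
// in the root's spectrum data; a non-empty result flags the note.
function validateNote(note, spectrum) {
  const quotes = [...note.matchAll(/"([^"]+)"/g)].map(m => m[1]);
  const known = new Set(spectrum.map(s => s.toLowerCase()));
  return quotes.filter(q => !known.has(q.toLowerCase()));
}
```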

The runtime enrichment module (study-enrichment.js) mirrors Stage 1 exactly, running at SSR time on Cloudflare Workers. Any logic change in one must be mirrored in the other — they produce identical output shapes.

What I learned

Alignment is the foundation, not morphology

Version 1 tried to build the concordance from morphological analysis of isolated words. Version 2 starts with word-level alignment — which English words translate which Arabic words — and builds morphology on top. This is the right order. Alignment gives you ground truth for what each word means in context. Morphological analysis without alignment is just dictionaries.

Two-pass AI beats single-pass

The generate-then-verify pattern caught 1-3 corrections per paragraph on average. Some of these were subtle: a verb form labeled “I” when it was really “IV,” or a proper noun not flagged as a name. The cost of the second pass is small relative to the quality improvement, and the corrections compound — a wrong root assignment in the concordance would propagate to every downstream lookup.

AI is great at one-time enrichment

The pattern that emerged — “use AI to pre-compute a static resource, then use dumb lookups at runtime” — feels incredibly powerful. The AI handles the genuinely hard parts (morphological analysis, cross-lingual alignment, broken plurals) during a one-time build. Runtime gets the intelligence for free via indexed lookups.

This inverts the usual AI-in-production pattern. Instead of calling AI on every request (slow, expensive, variable), you call it once and crystallize the results. The database becomes a snapshot of the AI’s knowledge, frozen in time, queryable forever at microsecond latency.

Normalization is the foundation

Every subsequent step — stop word filtering, database lookups, affix stripping — depends on normalization being correct and consistent. The bug where stop words weren’t normalized through the same function wasted hours and was completely invisible until I inspected the frequency counts.

If you’re working with any non-Latin script, invest heavily in normalization up front. Test it by running your full dataset through the normalizer and examining the frequency distribution. The most common words should be stop words. If you see function words in your top-30, your normalizer or your stop word list has a gap.
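The sanity check is a few lines. A sketch, assuming normalize is the function shown earlier:

```javascript
// Normalize every token in the corpus, count frequencies, and return
// the top of the distribution for eyeballing against the stop word list.
function topTokens(texts, normalize, n = 30) {
  const counts = new Map();
  for (const text of texts) {
    for (const raw of text.split(/\s+/)) {
      const t = normalize(raw);
      if (!t) continue; // punctuation-only tokens normalize to ''
      counts.set(t, (counts.get(t) || 0) + 1);
    }
  }
  // Highest-frequency first; the head of this list should be stop words.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, n);
}
```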

Checkpoint everything

The full build costs real money and takes hours. Network connections drop. Rate limits hit. Every step saves progress to a JSON file and resumes cleanly on restart. The extra code saved hours of rebuild time and dollars of API cost.

The result

A researcher can now type any Arabic or Persian phrase from the Baha’i sacred texts and instantly see:

  • Every significant word, identified by trilateral root with part of speech
  • Academic transliteration and English meaning for each root
  • Every occurrence of that root across all 11 translations — all 105,000 indexed words
  • The exact English rendering Shoghi Effendi chose in each passage
  • A “rendering spectrum” showing the full range of translations for each root
  • Cross-references to roots with similar English renderings
  • Deviation analysis flagging where SE’s rendering diverges from dictionary meaning