Technical · 12 min read

Why RAG Fails on UAE Banking Documents (and How to Fix It)

Why off-the-shelf RAG breaks on UAE banking documents — Arabic-English mix, tables, audit reports — and what production-grade architecture actually looks like.

CodexaAI Engineering · June 13, 2026

The 30-second version

Most UAE banks running a RAG pilot in 2026 have the same complaint: the retrieval layer returns plausible-looking chunks that turn out to be wrong, irrelevant, or missing the context the model needs to answer correctly. The model then hallucinates a confident-sounding answer, the compliance team escalates, and the project stalls.

The model is not the problem. The RAG layer is. Off-the-shelf RAG was designed for clean English web content. UAE banking documents are the opposite — bilingual Arabic-English, table-heavy, multi-column, scanned, signed, stamped, redacted, and subject to UAE PDPL plus sector-specific residency rules from CBUAE, DFSA and FSRA.

This guide is for the Head of Innovation, Head of Data, or AI Architect at a UAE bank, insurer, family office or sovereign-wealth-adjacent firm who already has a RAG pilot in flight and is debugging why it does not work. Seven failure modes, then the architectural fixes that actually ship.

If you want background on what RAG is at a higher level first, our companion piece RAG vs Fine-Tuning Guide covers the trade-off in plain terms.

Why this matters right now for UAE banks

Three forces converge in 2026 and make RAG the bottleneck rather than the breakthrough.

A heavier documentation burden. Federal Law No. 10 of 2025 on AML and CFT tightened the documentation expected during CBUAE inspections. Banks now have to demonstrate, with audit trails, how every decision was reached and which underlying documents supported it. AI-assisted decisions are not exempt; they are scrutinised harder.

Compressed timelines. The same compliance pressure that creates the documentation burden is what made banks rush their AI pilots into production. The pilots were scoped to look impressive in a demo, not to handle the messy reality of a 47-page corporate KYC pack with mixed Arabic-English signatures and a redacted beneficial-ownership annex.

No good off-the-shelf options. The dominant commercial RAG stacks were built for English-language SaaS knowledge bases. They handle UAE banking documents the way a generic translation app handles a poem — technically functional, practically useless.

What RAG actually is, in 60 seconds

Retrieval-Augmented Generation is a pattern, not a product. It has three moving parts:

  1. Indexing. Source documents are split into chunks. Each chunk is turned into a numerical vector by an embedding model. The vectors are stored in a vector database alongside the original text.
  2. Retrieval. A user asks a question. The question is turned into a vector by the same embedding model. The vector database returns the chunks whose vectors are closest to the question vector.
  3. Generation. The retrieved chunks are stuffed into the context window of a large language model. The model is asked to answer the question using only the retrieved chunks.

The promise is that the model only answers from your real documents, not from its training data. The reality is that all three steps can fail. In UAE banking, the failure usually shows up at step 2, even when its root cause sits in the extraction and chunking of step 1.
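
To make the three steps concrete, here is a minimal sketch of the pattern. It assumes a toy bag-of-words `embed` function in place of a real embedding model and an in-memory list in place of a vector database. The sample chunks and all function names are illustrative, not taken from any particular stack.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a word-count vector standing in for a real embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# 1. Indexing: each chunk is stored next to its vector.
chunks = [
    "The trade licence lists general trading as the registered activity.",
    "The board resolution names two authorised signatories.",
    "The shareholder table shows three owners of the company.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Retrieval: embed the question and return the closest chunks.
def retrieve(question: str, k: int = 2) -> list[str]:
    q_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# 3. Generation: the retrieved chunks become the model's only context.
def build_prompt(question: str, retrieved: list[str]) -> str:
    context = "\n".join(f"- {chunk}" for chunk in retrieved)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

question = "Who are the authorised signatories?"
top = retrieve(question)
print(build_prompt(question, top))
```

Everything that follows in this guide is about replacing one of these toy parts with something that survives real banking documents.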

The seven ways RAG fails on UAE banking documents

These are the failure modes we see most often in production pilots. Most pilots hit several of them at once.

1. Layout-blind extraction turns tables into garbage

A KYC pack typically includes a Memorandum of Association with a 4-column shareholder table, a trade licence with a 2-column activity table, and a board resolution with a list of authorised signatories. Off-the-shelf PDF extractors and basic OCR tools turn these tables into linear text, losing the column-row relationships that gave the data meaning.

When the model is then asked "Who owns 25% of Acme Trading LLC?" the retrieved chunk has the shareholder name on one line and the percentage on a separate line two paragraphs away. The model either fails to find the link or invents one.

The fix. Layout-aware extraction — vision-language models such as the larger Claude or GPT models with explicit table detection, or specialised document AI tools that produce structured Markdown / HTML preserving table structure. The output of the extraction step matters more than the embedding model that follows it.
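
As a rough illustration of what "preserving table structure" can mean downstream, the sketch below assumes the extraction step has already produced headers and rows for a shareholder table. It emits the table as Markdown and also as one self-contained chunk per row, so a name never gets separated from its percentage. The table contents and function names are hypothetical.

```python
def table_to_markdown(headers: list[str], rows: list[list[str]]) -> str:
    """Render an extracted table as Markdown so row and column links survive."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)

def table_to_row_chunks(title: str, headers: list[str], rows: list[list[str]]) -> list[str]:
    """One chunk per row; every value keeps its column label and the table title."""
    return [
        f"{title} | " + "; ".join(f"{h}: {v}" for h, v in zip(headers, row))
        for row in rows
    ]

# Hypothetical output of a layout-aware extractor for an MOA shareholder table.
headers = ["Name", "Nationality", "Shares", "Ownership"]
rows = [
    ["Person A", "UAE", "250", "25%"],
    ["Person B", "UAE", "750", "75%"],
]

markdown = table_to_markdown(headers, rows)
row_chunks = table_to_row_chunks("Acme Trading LLC shareholders", headers, rows)
print(markdown)
print(row_chunks[0])
```

Whether row-level chunks or whole-table chunks retrieve better depends on the table size and the questions asked, so both are worth testing on real packs.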

2. Mixed Arabic-English chunks confuse the embedding model

UAE banking documents routinely mix Arabic and English on the same page — Arabic header, English clause, Arabic signature line, English footnote. A naïve chunker splits this into chunks that are partly Arabic, partly English, with right-to-left and left-to-right text fragments interleaved.

A general-purpose embedding model — particularly one trained dominantly on English — produces poor-quality vectors for these mixed chunks. They land in odd parts of the vector space, and retrieval misses them entirely.

The fix. Detect language at the sentence level during chunking, then route Arabic sentences and English sentences through embedding models that handle each well. Arabic-first model families such as Falcon-H1 Arabic and Jais are good candidates for Arabic finance vocabulary. A multilingual model can work for mixed chunks if it has been benchmarked on Arabic finance content; most have not.
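
Sentence-level language detection does not need a heavy dependency to prototype. The sketch below counts Arabic-script and Latin-script letters per sentence and labels each sentence `ar`, `en` or `mixed`. The 0.8 and 0.2 thresholds and the simple punctuation-based sentence splitter are assumptions for illustration; a production system would tune both on real documents.

```python
import re

ARABIC = re.compile(r"[\u0600-\u06FF\u0750-\u077F]")  # Arabic and Arabic Supplement blocks
LATIN = re.compile(r"[A-Za-z]")

def detect_language(sentence: str) -> str:
    """Label a sentence by the share of Arabic-script letters it contains."""
    arabic = len(ARABIC.findall(sentence))
    latin = len(LATIN.findall(sentence))
    if arabic + latin == 0:
        return "unknown"
    ratio = arabic / (arabic + latin)
    if ratio >= 0.8:   # assumed threshold
        return "ar"
    if ratio <= 0.2:   # assumed threshold
        return "en"
    return "mixed"

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., !, ? or the Arabic question mark."""
    return [s.strip() for s in re.split(r"(?<=[.!?\u061F])\s+", text) if s.strip()]

def route(text: str) -> list[tuple[str, str]]:
    """Return (language, sentence) pairs, ready to send to per-language embedders."""
    return [(detect_language(s), s) for s in split_sentences(text)]

sample = (
    "The facility is governed by UAE law. "
    "يخضع هذا التسهيل لقوانين الدولة. "
    "عقد Murabaha رقم 12."
)
routed = route(sample)
for language, sentence in routed:
    print(language, "->", sentence)
```

The label can then pick the embedding model per sentence, with `mixed` sentences going to whichever multilingual model has passed the bank's own benchmark.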

3. Multi-column layouts get linearised wrong

UAE banking documents, particularly translated bilingual contracts and DIFC-format agreements, frequently use a two-column layout with the English on the left and the Arabic on the right. A column-blind extractor reads straight across the page instead of down each column, interleaving English and Arabic line by line.

The output is unreadable to both the embedding model and the LLM. Even worse, the model will try to make sense of it and produce answers that mix terms from both columns.

The fix. Column detection during extraction, with explicit column-aware reading order. Tested on the bank's actual document templates, not on a generic benchmark.
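
The difference between column-blind and column-aware reading order is easy to see with line bounding boxes. This sketch assumes the extractor returns each text line with the `x` and `y` position of its box and that the page has exactly two columns split at the midpoint. Real templates usually need proper column detection (for example, clustering on `x`), so treat the midpoint rule as a placeholder.

```python
from dataclasses import dataclass

@dataclass
class Line:
    text: str
    x: float  # left edge of the line's bounding box
    y: float  # top edge, increasing down the page

def column_blind_order(lines: list[Line]) -> list[str]:
    """Reads straight across the page: this is the failure mode."""
    return [line.text for line in sorted(lines, key=lambda l: (l.y, l.x))]

def column_aware_order(lines: list[Line], page_width: float) -> tuple[list[str], list[str]]:
    """Split at the page midpoint (assumed two-column layout), then read down each column."""
    mid = page_width / 2
    left = sorted((l for l in lines if l.x < mid), key=lambda l: l.y)
    right = sorted((l for l in lines if l.x >= mid), key=lambda l: l.y)
    return [l.text for l in left], [l.text for l in right]

# Placeholder lines standing in for an English column and an Arabic column.
lines = [
    Line("AR line 2", x=320, y=120),
    Line("EN line 1", x=50, y=100),
    Line("AR line 1", x=320, y=100),
    Line("EN line 2", x=50, y=120),
]

print(column_blind_order(lines))
english, arabic = column_aware_order(lines, page_width=600)
print(english)
print(arabic)
```

Keeping the two columns as separate streams also makes it straightforward to pair each English clause with its Arabic counterpart later.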

4. Chunk size and section structure mismatch

The standard chunking recipe — 500 to 1,000 tokens with 50 to 100 tokens of overlap — was designed for prose web content. UAE banking documents are not prose. They are structured into clauses, sub-clauses, schedules and annexes. Cutting a 1,000-token chunk through the middle of a clause loses the clause's identifying header and the conditional language that qualifies it.

When the retrieval layer returns the middle of clause 4.2(b), the model has no idea that 4.2(a) said "Notwithstanding clause 4.1, the following exceptions apply..." and confidently answers with a rule that does not apply.

The fix. Structure-aware chunking. Chunks must respect clause boundaries, carry their section headers as metadata, and where possible inherit a one-line summary of the parent section. Hierarchical RAG — retrieving from a small index of section summaries first, then drilling into the specific clauses — is often the right shape.
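
A minimal sketch of clause-aware chunking follows. It assumes clauses are numbered in the style `4`, `4.2`, `4.2(b)` at the start of a line, which will not hold for every template, and it attaches the first line of each parent clause to the chunk as a section path. The sample contract text is invented for illustration.

```python
import re

# Matches "4 ", "4.2 " or "4.2(b) " at the start of a line (assumed numbering style).
CLAUSE = re.compile(r"^(\d+(?:\.\d+)*)(\([a-z]\))?\s+(.+)$")

def ancestors(number: str, letter: str) -> list[str]:
    """Clause IDs above this one: 4.2(b) -> ['4', '4.2']."""
    parts = number.split(".")
    ids = [".".join(parts[:i]) for i in range(1, len(parts))]
    if letter:
        ids.append(number)
    return ids

def chunk_by_clause(text: str) -> list[dict]:
    first_line: dict[str, str] = {}
    chunks: list[dict] = []
    for raw in text.strip().splitlines():
        line = raw.strip()
        match = CLAUSE.match(line)
        if match:
            number, letter, body = match.group(1), match.group(2) or "", match.group(3)
            clause_id = number + letter
            first_line[clause_id] = line
            path = [first_line[a] for a in ancestors(number, letter) if a in first_line]
            chunks.append({"clause": clause_id, "section_path": path, "text": body})
        elif line and chunks:
            # Unnumbered line: treat it as a continuation of the current clause.
            chunks[-1]["text"] += " " + line
    return chunks

contract = """
4 Repayment
4.1 The Borrower shall repay the Facility in full on the Maturity Date.
4.2 Notwithstanding clause 4.1, the following exceptions apply:
4.2(a) prepayment is permitted on any Interest Payment Date;
4.2(b) prepayment requires written notice to the Bank.
"""

clause_chunks = chunk_by_clause(contract)
print(clause_chunks[-1])
```

With the section path attached, a retrieved 4.2(b) arrives together with the "Notwithstanding" wording of its parent, which is exactly the context the model was missing. One known limitation of this sketch: a continuation line that happens to start with a number would be misread as a new clause.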

5. Embedding models miss Arabic finance vocabulary

General-purpose embedding models are trained on the open web. The open web has lots of English finance content and very little Arabic banking-specific vocabulary. Terms like مصرف (bank), مرابحة (Murabaha), أمانة (trust account), or specific Sharia-compliant instrument names land near unrelated tokens in the vector space.

A question in Arabic about a Murabaha facility may not retrieve the chunks that describe that facility, because the embedding model does not know they are related.

The fix. Use an embedding model fine-tuned or trained specifically for Arabic finance or, more practically, an embedding model fine-tuned on the bank's own corpus of past documents. Even a small fine-tune on a few hundred thousand chunks of the bank's own content typically improves retrieval quality substantially.

6. Pure-vector retrieval misses exact-match queries

UAE banking queries are often very specific — "Show me clauses that reference 'KYC Update Form B'" or "Find the audit observation dated 14 March 2025". Pure-vector retrieval is bad at exact-string matching. The vector for "KYC Update Form B" is similar to vectors for "KYC update form" and "KYC update" and a dozen related phrases — and the retrieval may return any of them, missing the specific document the user asked about.

The fix. Hybrid retrieval. Combine vector retrieval with classic keyword search — BM25 or similar — and merge the results. The two methods are complementary. Vector retrieval is good at semantic similarity. BM25 is good at exact string matching. A reranker on top combines them.
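
One common way to merge the two result lists before any reranker runs is reciprocal rank fusion (RRF). The article does not prescribe a specific merge method, so take this as one reasonable option rather than the method. The sketch assumes each retriever returns an ordered list of chunk IDs; the IDs are hypothetical, and `k=60` is the constant commonly used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each appearance adds 1 / (k + rank) to a chunk's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))

# Vector search ranks the near-synonym first; keyword search only hits the exact form name.
vector_ranking = ["kyc_update_form", "kyc_update_form_b", "kyc_policy"]
keyword_ranking = ["kyc_update_form_b"]

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

Because the exact-match document appears in both lists, it rises above the semantically similar but wrong document that vector search alone had ranked first.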

7. No reranking and no source citation

Once chunks are retrieved, most pilot RAG systems stuff them straight into the model's context and ask for an answer. Two failure modes follow.

First, the top-K chunks include some irrelevant ones, and the model treats them all as equally trustworthy. Second, when the model produces an answer, there is no traceable link back to which specific chunk supported which specific claim. The compliance team cannot audit the answer because nothing ties its claims to source passages.

The fix. Add a reranking step between retrieval and generation. A cross-encoder reranker scores each retrieved chunk against the question and discards the lowest-scoring ones. Then prompt the model to cite the specific chunk that supports each claim, and render those citations in the UI. The compliance team can verify, and the audit trail is built in.
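
The shape of the rerank-then-cite step can be sketched without committing to a particular model. Below, `overlap_score` is a deliberately crude placeholder for a cross-encoder: it scores a chunk by the fraction of question terms it contains. The threshold, chunk IDs and prompt wording are all assumptions for illustration.

```python
import re

def overlap_score(question: str, chunk_text: str) -> float:
    """Placeholder relevance score. A real system would call a cross-encoder here."""
    q_terms = set(re.findall(r"\w+", question.lower()))
    c_terms = set(re.findall(r"\w+", chunk_text.lower()))
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0

def rerank(question, chunks, score_fn=overlap_score, min_score=0.3, top_n=5):
    """Score every retrieved chunk, drop the weak ones, keep the best top_n."""
    scored = [(score_fn(question, chunk["text"]), chunk) for chunk in chunks]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_n]

def build_cited_prompt(question, reranked):
    """Label each chunk with its ID so the model can cite it per claim."""
    sources = "\n".join(f"[{chunk['id']}] {chunk['text']}" for _, chunk in reranked)
    return (
        "Answer using only the sources below. After every claim, cite the "
        "supporting source ID in square brackets. If the sources do not "
        "contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

retrieved = [
    {"id": "lic-p1-c1", "text": "The trade licence was renewed in Dubai."},
    {"id": "moa-p3-c2", "text": "Person A is the beneficial owner of Acme Trading LLC and holds 25% of the shares."},
]
question = "Who is the beneficial owner of Acme Trading LLC?"

reranked = rerank(question, retrieved)
prompt = build_cited_prompt(question, reranked)
print(prompt)
```

Keeping the chunk IDs stable from ingestion through to the UI is what lets a citation resolve to a specific document and page.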

What working RAG actually looks like for UAE banking

The corrected architecture has more moving parts than the pilot, but each part earns its place.

1. Layout-aware document ingestion. Vision-language extraction that preserves tables, columns, headers and lists. Output is structured Markdown or HTML with section-level metadata.

2. Language-aware chunking. Sentence-level language detection. Structure-preserving boundaries. Each chunk carries its section path as metadata.

3. Bilingual embedding. Arabic chunks go through an Arabic-tuned embedding model. English chunks go through an English-tuned model. Mixed chunks go through a multilingual model benchmarked on the bank's content.

4. Hybrid retrieval. Vector search plus BM25, merged by a reranker. Top 5 to 10 chunks land in the model context.

5. Cited generation. The LLM is prompted to answer only from retrieved chunks and to cite the specific chunk that supports each claim. The UI shows citations as clickable links to the source document and page.

6. Audit logging. Every question, every retrieved chunk, every model output, every citation logged with timestamps, model version, prompt version and reranker scores. This is the layer CBUAE inspectors will eventually ask to see.
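
As a sketch of what one audit entry might hold, the record below covers the fields listed above and serialises to a single JSON line. The field names, version strings and sample values are assumptions; the actual schema should follow whatever the bank's audit and records teams require.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditRecord:
    question: str
    retrieved_chunk_ids: list[str]
    reranker_scores: list[float]
    answer: str
    citations: list[str]
    model_version: str
    prompt_version: str
    # UTC timestamp set when the record is created.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def to_log_line(record: AuditRecord) -> str:
    """Serialise one interaction as a JSON line; ensure_ascii=False keeps Arabic readable."""
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)

record = AuditRecord(
    question="Who is the beneficial owner above 25%?",
    retrieved_chunk_ids=["moa-p3-c2", "lic-p1-c1"],
    reranker_scores=[0.91, 0.12],
    answer="Person A holds 25% [moa-p3-c2].",
    citations=["moa-p3-c2"],
    model_version="model-v1",    # hypothetical identifier
    prompt_version="prompt-v3",  # hypothetical identifier
)
line = to_log_line(record)
print(line)
```

Writing these lines to append-only storage keeps the trail in the order events happened, which matters when an inspector reconstructs a decision.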

7. UAE-region deployment. Embedding model, vector index and LLM all inside UAE borders for documents subject to UAE PDPL or CBUAE residency rules. Sovereign cloud, private VPC or on-premises as the regulator demands.

Three patterns we deploy

Pattern A: KYC pack RAG for corporate onboarding

Documents: 30 to 100 pages per corporate customer. Includes trade licence, MOA, board resolutions, signature lists, beneficial ownership declarations, source-of-funds statements. Heavy on tables and structured fields.

Shape: Layout-aware extraction, structure-preserving chunking, hybrid retrieval, cited generation. The KYC analyst asks specific questions ("Who is the beneficial owner above 25%? What's the registered activity? Has any signatory changed since the prior KYC?") and gets cited answers in seconds rather than spending an hour reading the pack.

Typical first-year outcome: KYC review time per file drops from between 45 and 90 minutes to between 10 and 15 minutes. Audit-defensibility improves because every analyst decision is now traceable to a specific chunk.

Pattern B: AML alert investigation RAG

Documents: transaction history exports, customer due-diligence files, internal SAR templates, sanctions-list documents, prior alert dispositions. Mixed table and prose.

Shape: Same architectural pattern as KYC, with three additions. First, structured fields like transaction amounts and dates are extracted into a separate relational store and retrieved through SQL rather than vector. Second, sanctions-screening hits are joined to the alert at retrieval time. Third, prior-alert dispositions become a retrievable source so investigators see how similar cases were resolved before.
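
The first addition, answering amount and date questions from a relational store instead of from vectors, can be sketched with an in-memory SQLite database. The table layout, column names and sample rows are hypothetical; the point is that a filter like "over a threshold within a date range" is exact in SQL and unreliable in embedding space.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "customer_id TEXT, tx_date TEXT, amount_aed REAL, counterparty TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [
        ("C-001", "2025-03-01", 12000.0, "Vendor X"),
        ("C-001", "2025-03-14", 58000.0, "Vendor Y"),
        ("C-002", "2025-03-14", 99000.0, "Vendor Z"),
    ],
)

def transactions_over(conn, customer_id, threshold_aed, start, end):
    """Exact lookup of a customer's transactions at or above a threshold in a date range."""
    cursor = conn.execute(
        "SELECT tx_date, amount_aed, counterparty FROM transactions "
        "WHERE customer_id = ? AND amount_aed >= ? AND tx_date BETWEEN ? AND ? "
        "ORDER BY tx_date",
        (customer_id, threshold_aed, start, end),
    )
    return cursor.fetchall()

hits = transactions_over(conn, "C-001", 55000, "2025-03-01", "2025-03-31")
print(hits)
```

The rows returned here can be placed in the model's context next to the retrieved prose chunks, each with its own citation back to the source export.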

Typical first-year outcome: investigation time per alert drops by 40 to 60 percent. False-positive triage becomes consistent because the system surfaces the same prior dispositions to every investigator.

Pattern C: Internal audit and policy RAG

Documents: internal policy documents, audit reports, regulatory correspondence, prior inspection findings. Long-form prose with formal structure.

Shape: Hierarchical RAG works particularly well here. A small index of section summaries acts as the first retrieval stage, then specific clauses are pulled from the second-stage index. Citations link back to the policy paragraph.
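
A two-stage lookup of this kind can be sketched in a few lines. The `similarity` function below is a toy word-overlap measure standing in for embedding similarity, and the policy sections and clause texts are invented for illustration.

```python
import re

def similarity(a: str, b: str) -> float:
    """Toy stand-in for embedding similarity: Jaccard overlap of word sets."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Stage-one index: one summary per section. Stage-two index: the section's clauses.
sections = {
    "outsourcing": {
        "summary": "Rules for outsourcing to third party vendors and cloud providers",
        "clauses": [
            "3.1 Material outsourcing requires approval from the Risk Committee.",
        ],
    },
    "retention": {
        "summary": "Record retention periods for customer and transaction records",
        "clauses": [
            "5.1 Customer transaction records must be kept for the period set in Schedule 2.",
            "5.2 Destruction of records requires sign-off from Compliance.",
        ],
    },
}

def hierarchical_retrieve(question, sections, top_sections=1, top_clauses=1):
    # Stage 1: pick the best-matching sections by their summaries.
    ranked = sorted(
        sections,
        key=lambda name: similarity(question, sections[name]["summary"]),
        reverse=True,
    )[:top_sections]
    # Stage 2: rank only the clauses inside those sections.
    candidates = [(name, clause) for name in ranked for clause in sections[name]["clauses"]]
    candidates.sort(key=lambda pair: similarity(question, pair[1]), reverse=True)
    return candidates[:top_clauses]

result = hierarchical_retrieve("How long must customer transaction records be kept?", sections)
print(result)
```

Narrowing to a section first keeps clauses from unrelated policies out of the candidate set, which is most of the benefit on long, formally structured documents.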

Typical first-year outcome: internal queries that previously took hours of human research become single-question answers with cited policy support. Audit-readiness improves because every internal policy interpretation is now traceable.

What to do if your RAG is failing today

Five diagnostic questions, in order. If you answer "no" to any of them, that is where to start.

  1. Are you using layout-aware extraction rather than a basic OCR or text extractor? Most failing pilots are still on basic extraction.
  2. Does your chunking respect document structure rather than cutting fixed-size 1,000-token chunks? Structure-aware chunking alone often doubles retrieval quality on banking documents.
  3. Have you benchmarked your embedding model on your actual document corpus, particularly on Arabic content? Most teams discover their general-purpose model is weak on their specific vocabulary.
  4. Are you using hybrid retrieval (vector plus keyword search) rather than pure vector? Hybrid is often the biggest upgrade available at the retrieval stage.
  5. Are you reranking the retrieved chunks before they reach the model? Reranking removes much of the noise that drives hallucinations.

If you answer "no" to most of these, your RAG layer is at the typical UAE banking pilot maturity. The fixes are well-understood. The order matters: extraction first, then chunking, then embedding, then retrieval, then reranking, then citation. Tuning a step before the previous step is correct is wasted work.
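
For the benchmarking question in particular, a small labelled set goes a long way. The sketch below computes a hit rate at k: the share of test questions for which at least one known-relevant chunk appears in the top k results. The question IDs, chunk IDs and relevance labels are hypothetical; in practice they would come from analysts marking the right chunks for real questions, including Arabic ones.

```python
def hit_rate_at_k(results: dict[str, list[str]], relevant: dict[str, set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant chunk in the top k results."""
    if not results:
        return 0.0
    hits = sum(
        1 for query, ranked_ids in results.items()
        if set(ranked_ids[:k]) & relevant.get(query, set())
    )
    return hits / len(results)

# Ranked chunk IDs returned by the retriever for each test question (hypothetical).
results = {
    "q1": ["c3", "c1", "c7"],
    "q2": ["c9", "c4", "c2"],
}
# Chunk IDs that analysts marked as actually answering each question (hypothetical).
relevant = {
    "q1": {"c1"},
    "q2": {"c5"},
}

print(hit_rate_at_k(results, relevant, k=1))
print(hit_rate_at_k(results, relevant, k=3))
```

Running the same labelled set after each change (extraction, chunking, embedding, retrieval) shows whether a fix actually moved retrieval quality before the next step is tuned.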

So what should you do next?

Three honest options.

Option 1: Rebuild the RAG layer. If the pilot is still in the diagnostic phase, an architectural rebuild costs less than continuing to tune a broken stack. Plan for 6 to 10 weeks for a focused workflow.

Option 2: Run a targeted RAG audit. If the pilot has been in production for months and the team is invested, a 2 to 3 week audit can identify the specific failure modes and produce a sequenced fix plan. Often the audit alone pinpoints the causes of 60 to 80 percent of the lost retrieval quality.

Option 3: Talk to us. If you want a candid 30-minute review of where your RAG pilot is failing and what the fix path looks like, book a discovery call. Bring the architecture diagram and a sample of the documents you are struggling with — we will tell you which of the seven failure modes are affecting you and which fix to attempt first.

For background on the broader IDP and security context, see our enterprise LLM security architecture, our intelligent document processing solution, and the finance industry page for CBUAE-aligned engagement patterns.


This article is for educational and informational purposes only. Specific architectural choices depend on document corpus, regulatory perimeter and risk appetite. Numbers reflect typical ranges from production deployments; your results will vary. Decisions about architecture and vendor selection should be made with qualified advisors who understand your specific regulatory environment.
