The Unicode Bidirectional Algorithm (UBA), formally defined in Unicode Standard Annex #9 (UAX #9), is a specification developed by the Unicode Consortium that determines how text containing a mixture of left-to-right and right-to-left scripts is displayed. It is a normative part of the Unicode Standard and is required for conformance wherever characters from right-to-left scripts such as Arabic or Hebrew are rendered.
| Unicode Bidirectional Algorithm | |
|---|---|
| Status | Active |
| Year started | 1999 |
| Latest version | Unicode 17.0.0 (Revision 51, 13 August 2025) |
| Organization | Unicode Consortium |
| Editors | Manish Goregaokar, Robin Leroy |
| Website | www |
Background
Most writing systems display text from left to right, but several scripts, including Arabic, Hebrew, Thaana, and Syriac, are written from right to left. When text from both directions appears in the same document, the result is known as bidirectional text (or bidi text). Without a clear specification, ambiguities arise in determining the correct display order of characters.
The Unicode Standard prescribes a logical order for storing characters in memory, regardless of their visual direction. The UBA translates this logical order into a correct visual display order.
Directional Formatting Characters
The UBA defines several categories of special control characters used to influence text direction:
Implicit Directional Marks
Lightweight, zero-width characters that behave like strong directional characters but produce no visible glyph:
| Abbreviation | Code Point | Name |
|---|---|---|
| LRM | U+200E | LEFT-TO-RIGHT MARK |
| RLM | U+200F | RIGHT-TO-LEFT MARK |
| ALM | U+061C | ARABIC LETTER MARK |
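As a quick check of the table above, the sketch below asks Python's standard `unicodedata` module which bidirectional type it reports for each mark. The dictionary name `MARKS` and the helper `mark_types` are illustrative names, not part of any specification, and the result reflects whichever Unicode version the local Python build ships.

```python
import unicodedata

# The three implicit directional marks from the table above.
MARKS = {
    "LRM": "\u200e",
    "RLM": "\u200f",
    "ALM": "\u061c",
}

def mark_types():
    """Return the bidirectional type reported for each mark."""
    return {name: unicodedata.bidirectional(ch) for name, ch in MARKS.items()}

print(mark_types())  # {'LRM': 'L', 'RLM': 'R', 'ALM': 'AL'}
```

Each mark carries the strong type of the direction it stands for, which is why it can anchor neighbouring neutral characters.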
Explicit Directional Embeddings
Signal that a piece of text is to be treated as embedded in a given direction:
| Abbreviation | Code Point | Name |
|---|---|---|
| LRE | U+202A | LEFT-TO-RIGHT EMBEDDING |
| RLE | U+202B | RIGHT-TO-LEFT EMBEDDING |
Explicit Directional Overrides
Force characters to be treated as strongly directional, overriding their implicit types:
| Abbreviation | Code Point | Name |
|---|---|---|
| LRO | U+202D | LEFT-TO-RIGHT OVERRIDE |
| RLO | U+202E | RIGHT-TO-LEFT OVERRIDE |
Explicit Directional Isolates
Introduced in Unicode 6.3, isolates prevent the enclosed text from affecting the surrounding text's ordering:
| Abbreviation | Code Point | Name |
|---|---|---|
| LRI | U+2066 | LEFT-TO-RIGHT ISOLATE |
| RLI | U+2067 | RIGHT-TO-LEFT ISOLATE |
| FSI | U+2068 | FIRST STRONG ISOLATE |
| PDI | U+2069 | POP DIRECTIONAL ISOLATE |
Terminating Characters
| Abbreviation | Code Point | Name | Terminates |
|---|---|---|---|
| PDF | U+202C | POP DIRECTIONAL FORMATTING | LRE, RLE, LRO, RLO |
| PDI | U+2069 | POP DIRECTIONAL ISOLATE | LRI, RLI, FSI |
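The pairing of initiators and terminators can be illustrated with a small stack-based checker. This is a simplified sketch, not the full explicit-level rules: it assumes the input is a single paragraph, ignores depth limits, and assumes that a PDI also closes any embeddings or overrides opened inside the isolate it terminates. The function name `unterminated` is hypothetical.

```python
EMBEDDING_INITIATORS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
ISOLATE_INITIATORS = {"\u2066", "\u2067", "\u2068"}              # LRI, RLI, FSI
PDF = "\u202c"
PDI = "\u2069"

def unterminated(text):
    """Return the initiators still open at the end of text.

    "E" marks an open embedding or override, "I" an open isolate.
    """
    stack = []
    for ch in text:
        if ch in EMBEDDING_INITIATORS:
            stack.append("E")
        elif ch in ISOLATE_INITIATORS:
            stack.append("I")
        elif ch == PDF:
            # PDF only closes an embedding or override opened in the same scope.
            if stack and stack[-1] == "E":
                stack.pop()
        elif ch == PDI:
            # PDI closes the nearest open isolate and everything opened inside it.
            if "I" in stack:
                while stack.pop() != "I":
                    pass
    return stack
```

An empty result means every initiator in the string was terminated explicitly; anything left over would stay in effect until the end of the paragraph.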
The Algorithm
The UBA processes text in four main phases:
1. Paragraph Separation
Text is split into paragraphs at paragraph separator characters (type B). Each paragraph is processed independently.
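A minimal sketch of this step, using Python's `unicodedata.bidirectional` to detect type B characters. The helper name `split_paragraphs` is illustrative. Keeping each separator with the paragraph it ends is a design choice of this sketch, and a CR LF pair is not treated specially here, so it would yield an extra one-character paragraph.

```python
import unicodedata

def split_paragraphs(text):
    """Split text after every character of bidirectional type B."""
    paragraphs = []
    current = []
    for ch in text:
        current.append(ch)
        if unicodedata.bidirectional(ch) == "B":
            # The separator stays attached to the paragraph it terminates.
            paragraphs.append("".join(current))
            current = []
    if current:
        paragraphs.append("".join(current))
    return paragraphs

print(split_paragraphs("abc\ndef\u2029ghi"))
```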
2. Initialization
Each character is assigned a bidirectional character type (e.g., L, R, AL, EN, AN) from the Unicode Character Database. An embedding level list is also initialized.
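The type lookup can be sketched with the standard `unicodedata` module, which exposes the bidirectional type recorded in the Unicode Character Database version bundled with Python. The helper name `bidi_types` is illustrative.

```python
import unicodedata

def bidi_types(text):
    """Return the bidirectional type of each character, in logical order."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Latin letter, digit, space, Hebrew alef, Arabic alef
print(bidi_types("a1 \u05d0\u0627"))  # ['L', 'EN', 'WS', 'R', 'AL']
```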
3. Resolving Embedding Levels
A series of rules resolves the embedding level of each character:
- P1–P3: Determine the paragraph embedding level (0 for LTR, 1 for RTL).
- X1–X10: Assign explicit embedding levels based on directional formatting characters.
- W1–W7: Resolve weak types (e.g., European numbers, separators).
- N0–N2: Resolve neutral and isolate formatting types, including bracket pairs.
- I1–I2: Resolve implicit embedding levels.
The maximum embedding depth is 125 levels, a value guaranteed not to change in future versions of the standard.[1]
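Of the rule groups above, the paragraph-level step is the easiest to illustrate in isolation. The sketch below assumes the usual first-strong heuristic: scan for the first character of type L, R, or AL, skip anything between an isolate initiator and its matching PDI, and default to level 0 when no strong character is found. The function name `paragraph_level` is hypothetical, and the input is assumed to be a single paragraph.

```python
import unicodedata

ISOLATE_INITIATOR_TYPES = {"LRI", "RLI", "FSI"}

def paragraph_level(text):
    """Return 0 (LTR) or 1 (RTL) based on the first strong character."""
    depth = 0  # number of isolates currently open
    for ch in text:
        bidi_type = unicodedata.bidirectional(ch)
        if bidi_type in ISOLATE_INITIATOR_TYPES:
            depth += 1
        elif bidi_type == "PDI":
            if depth:
                depth -= 1
        elif depth == 0:
            if bidi_type == "L":
                return 0
            if bidi_type in ("R", "AL"):
                return 1
    return 0  # no strong character found outside isolates

print(paragraph_level("123 \u0627"))  # 1: the digits are weak, the Arabic letter is strong
```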
4. Reordering
Rules L1–L4 reorder characters on each line for display:
- L1: Resets trailing whitespace and separators to the paragraph embedding level.
- L2: Working from the highest embedding level down to the lowest odd level, reverses each contiguous sequence of characters at that level or higher.
- L3: Reorders combining marks relative to their base characters.
- L4: Applies glyph mirroring to characters with the `Bidi_Mirrored` property when their resolved direction is right-to-left (e.g., "(" becomes ")").
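The reversal step (L2) is the core of reordering and can be sketched on its own. The function below assumes the embedding levels for one line have already been resolved and are passed in as a list; it does not perform L1, L3, or L4. The name `reorder_l2` is illustrative, and uppercase Latin letters stand in for right-to-left characters in the example.

```python
def reorder_l2(chars, levels):
    """Reverse runs from the highest level down to the lowest odd level."""
    order = list(chars)
    levels = list(levels)
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return order  # purely left-to-right line, nothing to reverse
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(levels):
            if levels[i] >= level:
                # Find the end of the contiguous run at this level or higher.
                j = i
                while j < len(levels) and levels[j] >= level:
                    j += 1
                order[i:j] = order[i:j][::-1]
                i = j
            else:
                i += 1
    return order

line = "car means CAR."
levels = [0] * 10 + [1, 1, 1] + [0]
print("".join(reorder_l2(line, levels)))  # car means RAC.
```

The original level list can be reused for every pass because reversing a run never changes which positions hold a level at or above the one being processed.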
Bidirectional Character Types
Characters are classified into the following categories:
| Category | Type | Description |
|---|---|---|
| Strong | L | Left-to-Right (e.g., Latin, Han) |
| Strong | R | Right-to-Left (e.g., Hebrew) |
| Strong | AL | Right-to-Left Arabic (e.g., Arabic, Syriac) |
| Weak | EN | European Number |
| Weak | ES | European Number Separator |
| Weak | ET | European Number Terminator |
| Weak | AN | Arabic Number |
| Weak | CS | Common Number Separator |
| Weak | NSM | Nonspacing Mark |
| Neutral | B | Paragraph Separator |
| Neutral | S | Segment Separator |
| Neutral | WS | Whitespace |
| Neutral | ON | Other Neutrals |
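The table can be turned into a lookup that groups the characters of a string by category. This is a sketch: the mapping `CATEGORY` copies only the types listed above, and the label "Other" is an assumption of this example for any type the table does not list (such as the explicit formatting characters).

```python
import unicodedata
from collections import Counter

CATEGORY = {
    "L": "Strong", "R": "Strong", "AL": "Strong",
    "EN": "Weak", "ES": "Weak", "ET": "Weak",
    "AN": "Weak", "CS": "Weak", "NSM": "Weak",
    "B": "Neutral", "S": "Neutral", "WS": "Neutral", "ON": "Neutral",
}

def category_counts(text):
    """Count how many characters of text fall into each category."""
    return Counter(
        CATEGORY.get(unicodedata.bidirectional(ch), "Other") for ch in text
    )

# 'a' is L, ',' is CS, ' ' is WS, '1' is EN
print(dict(category_counts("a, 1")))
```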
Conformance
A conforming implementation must:
- Display all visible characters in the order described by the UBA (UAX9-C1).
- Only apply higher-level protocol overrides as defined in Section 4.3 of the specification (UAX9-C2).
Higher-Level Protocols
The UBA permits six higher-level protocol overrides (HL1–HL6), including:
- HL1: Override the paragraph embedding level.
- HL3: Emulate explicit directional formatting characters via markup (e.g., the HTML `dir` attribute).
- HL4: Apply the UBA independently to segments of structured text (e.g., XML, source code).
- HL6: Apply additional glyph mirroring beyond the standard `Bidi_Mirrored` property.
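To give a feel for HL3, the sketch below rewrites isolate control characters as HTML elements carrying a `dir` attribute. The correspondence used here (LRI to `dir="ltr"`, RLI to `dir="rtl"`, FSI to `dir="auto"`) is an assumption of this example and only approximates how a given browser treats such markup. The function name `isolates_to_markup` is hypothetical.

```python
import html

# Assumed mapping from isolate initiators to opening tags.
OPENERS = {
    "\u2066": '<span dir="ltr">',   # LRI
    "\u2067": '<span dir="rtl">',   # RLI
    "\u2068": '<span dir="auto">',  # FSI
}
PDI = "\u2069"

def isolates_to_markup(text):
    """Replace isolate controls with span elements and escape everything else."""
    out = []
    depth = 0
    for ch in text:
        if ch in OPENERS:
            out.append(OPENERS[ch])
            depth += 1
        elif ch == PDI:
            if depth:  # an unmatched PDI is dropped
                out.append("</span>")
                depth -= 1
        else:
            out.append(html.escape(ch))
    out.append("</span>" * depth)  # close isolates left open at the end
    return "".join(out)

print(isolates_to_markup("a\u2067b\u2069c"))  # a<span dir="rtl">b</span>c
```

Closing any still-open spans at the end keeps the output well-formed, mirroring the way an unterminated isolate runs to the end of its paragraph.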
HTML and CSS Equivalents
Security Considerations
The misuse of bidirectional formatting characters poses significant security risks, as they can be used to make malicious code or text appear benign. This is documented in Unicode Technical Report #36 (UTR36). Directional overrides (LRO, RLO) are particularly dangerous and should be avoided where possible.
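One common mitigation is to flag explicit directional formatting characters before text is reviewed or executed. The sketch below scans a string for the embedding, override, isolate, and terminator code points listed earlier in this article. The names `BIDI_CONTROLS` and `find_bidi_controls` are illustrative, and a real tool would also need a policy for legitimate uses in right-to-left content.

```python
import unicodedata

# Explicit directional formatting characters listed in this article.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}

def find_bidi_controls(text):
    """Return (index, code point, name) for each directional control found."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]

print(find_bidi_controls("ab\u202ecd"))
# [(2, 'U+202E', 'RIGHT-TO-LEFT OVERRIDE')]
```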
History
- Unicode 1.0 (1991): Basic bidirectional support introduced.
- Unicode 6.3 (2013): Major revision introducing directional isolates (LRI, RLI, FSI, PDI) and bracket pair resolution (rule N0). These additions were made to address the overly strong effect of directional embeddings on surrounding text.
- Unicode 17.0 (2025): Current version (Revision 51).
References
1. "Unicode Standard Annex #9: Unicode Bidirectional Algorithm". Unicode Consortium. Retrieved 2025-08-13.