struct Codepoint

A Unicode codepoint, typically a single user-recognizable character; restricted to valid Unicode scalar values.

This type is restricted to store a single Unicode scalar value, typically encoding a single user-recognizable character.

All valid Unicode scalar values fall in one of two ranges: 0 to 0xD7FF, or 0xE000 to 0x10FFFF, both inclusive. This type guarantees that the stored integer value falls in one of these ranges.

Codepoints vs Scalar Values

Formally, Unicode defines a codespace of values in the range 0 to 0x10FFFF inclusive, and a Unicode codepoint is any integer falling within that range. However, for historical reasons, it became necessary to "carve out" a subset of the codespace, excluding codepoints in the range 0xD800 to 0xDFFF. The codepoints outside that excluded range are known as Unicode scalar values. The codepoints in the range 0xD800 to 0xDFFF are known as "surrogate" codepoints. The surrogate codepoints will never be assigned a semantic meaning, and can only validly appear in UTF-16 encoded text.

The difference between codepoints and scalar values is a technical distinction related to the backwards-compatible workaround chosen to enable UTF-16 to encode the full range of the Unicode codespace. For simplicity's sake, and to avoid a confusing clash with the Mojo Scalar type, this type is pragmatically named Codepoint, even though it is restricted to valid scalar values.
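
The UTF-16 workaround mentioned above can be illustrated with a minimal Python sketch. It is not part of the Mojo API; the function name to_utf16_surrogate_pair is hypothetical. It shows how a codepoint above 0xFFFF is split into two surrogate values, which is why the range 0xD800 to 0xDFFF had to be reserved.

```python
from typing import Tuple


def to_utf16_surrogate_pair(codepoint: int) -> Tuple[int, int]:
    """Split a supplementary codepoint into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration only; not a Mojo API.
    """
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only codepoints above 0xFFFF need a surrogate pair")
    offset = codepoint - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


high, low = to_utf16_surrogate_pair(0x1F600)
# Both halves fall inside the reserved surrogate range.
assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
```

Because both halves always land inside the reserved range, a lone surrogate can never be mistaken for a real character, and a Codepoint never needs to store one.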

Implemented traits

AnyType, CollectionElement, Copyable, EqualityComparable, EqualityComparableCollectionElement, ExplicitlyCopyable, Intable, Movable, Stringable, UnknownDestructibility

Methods

__init__

__init__(out self, *, unsafe_unchecked_codepoint: SIMD[uint32, 1])

Construct a Codepoint from a code point value without checking that it falls in the valid range.

Safety: The provided codepoint value MUST be a valid Unicode scalar value. Providing a value outside of the valid range could lead to undefined behavior in algorithms that depend on the validity guarantees of this type.

Args:

  • unsafe_unchecked_codepoint (SIMD[uint32, 1]): A valid Unicode scalar value code point.

__init__(out self, codepoint: SIMD[uint8, 1])

Construct a Codepoint from a single byte value.

This constructor cannot fail because every unsigned 8-bit integer (0 to 255) is a valid Unicode scalar value.

Args:

  • codepoint (SIMD[uint8, 1]): The 8-bit codepoint value to convert to a Codepoint.

__eq__

__eq__(self, other: Self) -> Bool

Return True if this character has the same codepoint value as other.

Args:

  • other (Self): The codepoint value to compare against.

Returns:

True if this character and other have the same codepoint value; False otherwise.

__ne__

__ne__(self, other: Self) -> Bool

Return True if this character has a different codepoint value from other.

Args:

  • other (Self): The codepoint value to compare against.

Returns:

True if this character and other have different codepoint values; False otherwise.

from_u32

static from_u32(codepoint: SIMD[uint32, 1]) -> Optional[Codepoint]

Construct a Codepoint from a code point value. Returns None if the provided codepoint is not in the valid range.

Args:

  • codepoint (SIMD[uint32, 1]): An integer representing a Unicode scalar value.

Returns:

A Codepoint if codepoint falls in the valid range for Unicode scalar values, otherwise None.
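
As a rough model of the range check described here, the following Python sketch returns the value unchanged when it is a valid scalar value and None otherwise. The name from_u32_model is hypothetical, and a plain int stands in for Codepoint; the actual Mojo implementation may be structured differently.

```python
from typing import Optional


def from_u32_model(value: int) -> Optional[int]:
    """Return value if it is a Unicode scalar value, otherwise None.

    Hypothetical Python model of the documented behavior; not the Mojo source.
    """
    # Valid scalar values: 0..0xD7FF and 0xE000..0x10FFFF (surrogates excluded).
    if 0 <= value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF:
        return value
    return None


assert from_u32_model(0x2029) == 0x2029   # paragraph separator is valid
assert from_u32_model(0xD800) is None     # surrogate is rejected
assert from_u32_model(0x110000) is None   # beyond the codespace
```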

ord

static ord(string: StringSlice[origin]) -> Self

Returns the Codepoint that represents the given single-character string.

Given a string containing one character, return a Codepoint representing the codepoint of that character. For example, Codepoint.ord("a") returns the codepoint 97. This is the inverse of the chr() function.

This function is similar to the ord() free function, except that it returns a Codepoint instead of an Int.

Args:

  • string (StringSlice[origin]): The input string, which must contain only a single character.

Returns:

A Codepoint representing the codepoint of the given character.
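
For comparison, Python's built-in ord() and chr() have the same inverse relationship. The sketch below is a Python analogy only; ord_single is a hypothetical wrapper, and the source does not state how Codepoint.ord behaves when given more than one character, so the ValueError here is an assumption of the sketch.

```python
def ord_single(text: str) -> int:
    """Return the codepoint of a single-character string.

    Hypothetical Python analogy; rejecting longer input is an assumption.
    """
    if len(text) != 1:
        raise ValueError("expected exactly one character")
    return ord(text)


value = ord_single("a")
assert value == 97
assert chr(value) == "a"  # chr() inverts ord()
```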

unsafe_decode_utf8_codepoint

static unsafe_decode_utf8_codepoint(_ptr: UnsafePointer[SIMD[uint8, 1]]) -> Tuple[Codepoint, Int]

Decodes a single Codepoint and number of bytes read from a given UTF-8 string pointer.

Safety: _ptr MUST point to the first byte in a known-valid UTF-8 character sequence. This function MUST NOT be used on unvalidated input.

Args:

  • _ptr (UnsafePointer[SIMD[uint8, 1]]): Pointer to UTF-8 encoded data containing at least one valid encoded codepoint.

Returns:

The decoded Codepoint, as well as the number of bytes read.
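
To make the "codepoint plus bytes read" return shape concrete, here is a minimal Python sketch of decoding the first codepoint from a byte string. Like the unsafe function above, it assumes the input is already known to be valid UTF-8 and performs no validation. The name decode_first_codepoint is hypothetical, and the Mojo implementation may differ.

```python
from typing import Tuple


def decode_first_codepoint(data: bytes) -> Tuple[int, int]:
    """Return (codepoint, bytes_read) for the first UTF-8 sequence in data.

    Assumes data starts with a valid UTF-8 sequence; does not validate.
    Hypothetical sketch, not the Mojo source.
    """
    first = data[0]
    if first < 0x80:                 # 0xxxxxxx: single-byte (ASCII)
        return first, 1
    if first >> 5 == 0b110:          # 110xxxxx: two-byte sequence
        length, value = 2, first & 0x1F
    elif first >> 4 == 0b1110:       # 1110xxxx: three-byte sequence
        length, value = 3, first & 0x0F
    else:                            # 11110xxx: four-byte sequence
        length, value = 4, first & 0x07
    # Each continuation byte (10xxxxxx) contributes six payload bits.
    for byte in data[1:length]:
        value = (value << 6) | (byte & 0x3F)
    return value, length


assert decode_first_codepoint("\u20acuro".encode("utf-8")) == (0x20AC, 3)
```

The byte count lets a caller advance through a buffer one codepoint at a time.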

__int__

__int__(self) -> Int

Returns the numeric value of this scalar value as an integer.

Returns:

The numeric value of this scalar value as an integer.

__str__

__str__(self) -> String

Formats this Codepoint as a single-character string.

Returns:

A string containing this single character.

is_ascii

is_ascii(self) -> Bool

Returns True if this Codepoint is an ASCII character.

All ASCII characters are less than or equal to codepoint value 127, and take exactly 1 byte to encode in UTF-8.

Returns:

A boolean indicating if this Codepoint is an ASCII character.

is_ascii_digit

is_ascii_digit(self) -> Bool

Determines whether the given character is a digit [0-9].

Returns:

True if the character is a digit.

is_ascii_upper

is_ascii_upper(self) -> Bool

Determines whether the given character is an uppercase character.

This currently only respects the default "C" locale, i.e. returns True iff the character specified is one of "ABCDEFGHIJKLMNOPQRSTUVWXYZ".

Returns:

True if the character is uppercase.

is_ascii_lower

is_ascii_lower(self) -> Bool

Determines whether the given character is a lowercase character.

This currently only respects the default "C" locale, i.e. returns True iff the character specified is one of "abcdefghijklmnopqrstuvwxyz".

Returns:

True if the character is lowercase.

is_ascii_printable

is_ascii_printable(self) -> Bool

Determines whether the given character is a printable character.

Returns:

True if the character is a printable character, otherwise False.
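
The ASCII predicates above amount to simple range checks on the codepoint value. The Python sketch below models them on plain integers. The function names mirror the methods for readability but are hypothetical stand-ins, and the printable range of 0x20 to 0x7E is an assumption, since the source does not define "printable".

```python
def is_ascii(cp: int) -> bool:
    return cp <= 0x7F


def is_ascii_digit(cp: int) -> bool:
    return ord("0") <= cp <= ord("9")


def is_ascii_upper(cp: int) -> bool:
    return ord("A") <= cp <= ord("Z")


def is_ascii_lower(cp: int) -> bool:
    return ord("a") <= cp <= ord("z")


def is_ascii_printable(cp: int) -> bool:
    # ASSUMPTION: printable means space (0x20) through tilde (0x7E).
    return 0x20 <= cp <= 0x7E


assert is_ascii_digit(ord("7")) and not is_ascii_digit(ord("x"))
assert is_ascii_upper(ord("Q")) and not is_ascii_upper(ord("q"))
assert not is_ascii(0xE9)  # U+00E9 is outside the ASCII range
```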

is_python_space

is_python_space(self) -> Bool

Determines whether this character is a Python whitespace character.

This corresponds to Python's universal separators: " \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029".

Examples

Check whether individual codepoints are Python whitespace:

from testing import assert_true, assert_false

# ASCII space characters
assert_true(Codepoint.ord(" ").is_python_space())
assert_true(Codepoint.ord("\t").is_python_space())

# Unicode paragraph separator:
assert_true(Codepoint.from_u32(0x2029).value().is_python_space())

# Letters are not space characters
assert_false(Codepoint.ord("a").is_python_space())

Returns:

True if this character is one of the whitespace characters listed above, otherwise False.

is_posix_space

is_posix_space(self) -> Bool

Returns True if this Codepoint is a space character according to the POSIX locale.

The POSIX locale is also known as the C locale.

This only respects the default "C" locale, i.e. returns True only if the character specified is one of " \t\n\v\f\r". For semantics similar to Python, use String.isspace().

Returns:

True iff the character is one of the whitespace characters listed above.
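
The difference between the two whitespace predicates comes down to which set of characters each one accepts. The following Python sketch models both as set membership on integer codepoints, using exactly the character lists quoted in this document. The constant and function names are hypothetical.

```python
# Character lists copied from the is_python_space and is_posix_space docs.
PYTHON_SPACE = frozenset(map(ord, " \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029"))
POSIX_SPACE = frozenset(map(ord, " \t\n\v\f\r"))


def is_python_space(cp: int) -> bool:
    return cp in PYTHON_SPACE


def is_posix_space(cp: int) -> bool:
    return cp in POSIX_SPACE


# Every POSIX space is also a Python space, but not the other way around.
assert POSIX_SPACE < PYTHON_SPACE
assert is_python_space(0x2029) and not is_posix_space(0x2029)
assert not is_python_space(ord("a"))
```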

to_u32

to_u32(self) -> SIMD[uint32, 1]

Returns the numeric value of this scalar value as an unsigned 32-bit integer.

Returns:

The numeric value of this scalar value as an unsigned 32-bit integer.

unsafe_write_utf8

unsafe_write_utf8(self, ptr: UnsafePointer[SIMD[uint8, 1]]) -> UInt

Encodes this codepoint as UTF-8 and writes the resulting bytes to ptr.

Safety: ptr MUST point to at least self.utf8_byte_length() allocated bytes or else an out-of-bounds write will occur, which is undefined behavior.

Unicode (represented as UInt32 BE) to UTF-8 conversion:

  • 1: 00000000 00000000 00000000 0aaaaaaa -> 0aaaaaaa
    • a
  • 2: 00000000 00000000 00000aaa aabbbbbb -> 110aaaaa 10bbbbbb
    • (a >> 6) | 0b11000000, b | 0b10000000
  • 3: 00000000 00000000 aaaabbbb bbcccccc -> 1110aaaa 10bbbbbb 10cccccc
    • (a >> 12) | 0b11100000, (b >> 6) | 0b10000000, c | 0b10000000
  • 4: 00000000 000aaabb bbbbcccc ccdddddd -> 11110aaa 10bbbbbb 10cccccc 10dddddd
    • (a >> 18) | 0b11110000, (b >> 12) | 0b10000000, (c >> 6) | 0b10000000, d | 0b10000000

Args:

  • ptr (UnsafePointer[SIMD[uint8, 1]]): Pointer to the location where the encoded UTF-8 bytes are written. Must validly point to enough bytes (1 to 4) to hold the encoded data.

Returns:

The number of bytes written.
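
The bit layout listed above can be sketched in Python as follows. This is a model of the encoding rules only: it returns a bytes object rather than writing through a pointer, the name encode_utf8 is hypothetical, and it assumes the input is a valid scalar value.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one scalar value as UTF-8, following the bit layout above.

    Hypothetical sketch; assumes cp is a valid Unicode scalar value.
    """
    if cp <= 0x7F:        # 1 byte: 0aaaaaaa
        return bytes([cp])
    if cp <= 0x7FF:       # 2 bytes: 110aaaaa 10bbbbbb
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 3 bytes: 1110aaaa 10bbbbbb 10cccccc
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    # 4 bytes: 11110aaa 10bbbbbb 10cccccc 10dddddd
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])


# Cross-check against Python's own encoder.
assert encode_utf8(0x20AC) == "\u20ac".encode("utf-8")
```

Note the masking with 0x3F on each continuation byte: only six payload bits fit under the 10 prefix.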

utf8_byte_length

utf8_byte_length(self) -> UInt

Returns the number of UTF-8 bytes required to encode this character.

The returned value is always between 1 and 4, inclusive.

Returns:

The number of UTF-8 bytes required to encode this character.
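
The byte length depends only on which value range the codepoint falls in. A minimal Python sketch of that mapping, with the hypothetical name utf8_byte_length_model and assuming a valid scalar value as input:

```python
def utf8_byte_length_model(cp: int) -> int:
    """Return how many bytes UTF-8 needs for the scalar value cp.

    Hypothetical sketch; assumes cp is a valid Unicode scalar value.
    """
    if cp <= 0x7F:       # ASCII
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:     # rest of the Basic Multilingual Plane
        return 3
    return 4             # supplementary planes, up to 0x10FFFF


assert utf8_byte_length_model(ord("a")) == 1
assert utf8_byte_length_model(0x1F600) == len("\U0001F600".encode("utf-8"))
```

A caller could use such a length to size a buffer before calling unsafe_write_utf8.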