Mojo struct

Char

struct Char

A single textual character.

This type stores a single Unicode scalar value, which typically encodes a single user-recognizable character.

All valid Unicode scalar values fall in one of two ranges: 0 to 0xD7FF, or 0xE000 to 0x10FFFF, inclusive. The gap between them (0xD800 to 0xDFFF) is reserved for UTF-16 surrogates. This type guarantees that the stored integer value falls in one of these ranges.
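The range guarantee can be illustrated with a small check. This is a Python sketch rather than Mojo, and `is_unicode_scalar_value` is a hypothetical helper name, not part of the `Char` API:

```python
def is_unicode_scalar_value(codepoint: int) -> bool:
    """Return True if codepoint lies in 0..0xD7FF or 0xE000..0x10FFFF."""
    # Hypothetical helper: mirrors the two inclusive ranges stated above.
    return 0 <= codepoint <= 0xD7FF or 0xE000 <= codepoint <= 0x10FFFF


print(is_unicode_scalar_value(0x61))      # True: LATIN SMALL LETTER A
print(is_unicode_scalar_value(0xD800))    # False: surrogate range
print(is_unicode_scalar_value(0x110000))  # False: beyond 0x10FFFF
```

A checked constructor such as from_u32 would presumably reject exactly the values for which this predicate is False.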

Implemented traits

AnyType, CollectionElement, Copyable, EqualityComparable, EqualityComparableCollectionElement, ExplicitlyCopyable, Intable, Movable, Stringable, UnknownDestructibility

Methods

__init__

__init__(out self, *, unsafe_unchecked_codepoint: SIMD[uint32, 1])

Construct a Char from a code point value without checking that it falls in the valid range.

Safety: The provided codepoint value MUST be a valid Unicode scalar value. Providing a value outside of the valid range could lead to undefined behavior in algorithms that depend on the validity guarantees of this type.

Args:

  • unsafe_unchecked_codepoint (SIMD[uint32, 1]): A valid Unicode scalar value code point.

__init__(out self, codepoint: SIMD[uint8, 1])

Construct a Char from a single byte value.

This constructor cannot fail because every unsigned 8-bit integer (0 to 255) is a valid Unicode scalar value.

Args:

  • codepoint (SIMD[uint8, 1]): The 8-bit codepoint value to convert to a Char.

__eq__

__eq__(self, other: Self) -> Bool

Return True if this character has the same codepoint value as other.

Args:

  • other (Self): The codepoint value to compare against.

Returns:

True if this character and other have the same codepoint value; False otherwise.

__ne__

__ne__(self, other: Self) -> Bool

Return True if this character has a different codepoint value from other.

Args:

  • other (Self): The codepoint value to compare against.

Returns:

True if this character and other have different codepoint values; False otherwise.

from_u32

static from_u32(codepoint: SIMD[uint32, 1]) -> Optional[Char]

Construct a Char from a code point value. Returns None if the provided codepoint is not in the valid range.

Args:

  • codepoint (SIMD[uint32, 1]): An integer representing a Unicode scalar value.

Returns:

A Char if codepoint falls in the valid range for Unicode scalar values, otherwise None.

ord

static ord(string: StringSlice[origin]) -> Self

Returns the Char that represents the given one-character string.

Given a string representing one character, return a Char representing the codepoint of that character. For example, Char.ord("a") returns the codepoint 97. This is the inverse of the chr() function.

This function is similar to the ord() free function, except that it returns a Char instead of an Int.

Args:

  • string (StringSlice[origin]): The input string, which must contain only a single character.

Returns:

A Char representing the codepoint of the given character.
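Python's built-in `ord()` and `chr()` have the same inverse relationship, so the behavior described above can be sketched in Python. `char_ord` is a hypothetical stand-in for `Char.ord`, and it returns a plain integer rather than a `Char`:

```python
def char_ord(string: str) -> int:
    """Return the codepoint of a one-character string."""
    # Hypothetical helper: rejects input that is not exactly one character.
    if len(string) != 1:
        raise ValueError("expected a string containing exactly one character")
    return ord(string)


codepoint = char_ord("a")
round_trip = chr(codepoint)  # chr() is the inverse of ord()
print(codepoint, round_trip)
```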

unsafe_decode_utf8_char

static unsafe_decode_utf8_char(_ptr: UnsafePointer[SIMD[uint8, 1]]) -> Tuple[Char, Int]

Decodes a single Char and number of bytes read from a given UTF-8 string pointer.

Safety: _ptr MUST point to the first byte in a known-valid UTF-8 character sequence. This function MUST NOT be used on unvalidated input.

Args:

  • _ptr (UnsafePointer[SIMD[uint8, 1]]): Pointer to UTF-8 encoded data containing at least one valid encoded codepoint.

Returns:

The decoded codepoint Char, as well as the number of bytes read.
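The decoding step can be sketched in Python as follows. This is an illustration of standard UTF-8 decoding, not the Mojo implementation; `decode_utf8_char` and its `offset` parameter are hypothetical. Like the method above, it assumes the input is already known to be valid UTF-8 and performs no validation:

```python
def decode_utf8_char(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode one scalar value starting at data[offset].

    Returns (codepoint, number_of_bytes_read). The input MUST be valid UTF-8.
    """
    first = data[offset]
    if first < 0x80:      # 0aaaaaaa: single-byte (ASCII) sequence
        return first, 1
    if first < 0xE0:      # 110aaaaa: two-byte sequence
        length, value = 2, first & 0x1F
    elif first < 0xF0:    # 1110aaaa: three-byte sequence
        length, value = 3, first & 0x0F
    else:                 # 11110aaa: four-byte sequence
        length, value = 4, first & 0x07
    # Each continuation byte (10xxxxxx) contributes its low six bits.
    for byte in data[offset + 1 : offset + length]:
        value = (value << 6) | (byte & 0x3F)
    return value, length


encoded = "a\u20ac".encode("utf-8")
print(decode_utf8_char(encoded, 0))  # (97, 1)
print(decode_utf8_char(encoded, 1))  # (8364, 3)
```

Returning the byte count lets a caller advance through a buffer one character at a time.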

__int__

__int__(self) -> Int

Returns the numeric value of this scalar value as an integer.

Returns:

The numeric value of this scalar value as an integer.

__str__

__str__(self) -> String

Formats this Char as a single-character string.

Returns:

A string containing this single character.

is_ascii

is_ascii(self) -> Bool

Returns True if this Char is an ASCII character.

All ASCII characters are less than or equal to codepoint value 127, and take exactly 1 byte to encode in UTF-8.

Returns:

A boolean indicating if this Char is an ASCII character.

is_ascii_digit

is_ascii_digit(self) -> Bool

Determines whether this character is an ASCII digit [0-9].

Returns:

True if the character is a digit.

is_ascii_upper

is_ascii_upper(self) -> Bool

Determines whether this character is an uppercase ASCII character.

This currently only respects the default "C" locale, i.e. returns True iff the character specified is one of "ABCDEFGHIJKLMNOPQRSTUVWXYZ".

Returns:

True if the character is uppercase.

is_ascii_lower

is_ascii_lower(self) -> Bool

Determines whether this character is a lowercase ASCII character.

This currently only respects the default "C" locale, i.e. returns True iff the character specified is one of "abcdefghijklmnopqrstuvwxyz".

Returns:

True if the character is lowercase.

is_ascii_printable

is_ascii_printable(self) -> Bool

Determines whether this character is a printable ASCII character.

Returns:

True if the character is a printable character, otherwise False.
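The ASCII predicates above amount to simple range checks on the codepoint value. The following Python sketch models them on plain integers; the function names mirror the methods but are otherwise hypothetical. The printable range of 0x20 to 0x7E is an assumption based on the conventional definition, since the text above does not spell it out:

```python
def is_ascii(cp: int) -> bool:
    return cp <= 0x7F  # ASCII codepoints are at most 127


def is_ascii_digit(cp: int) -> bool:
    return ord("0") <= cp <= ord("9")


def is_ascii_upper(cp: int) -> bool:
    return ord("A") <= cp <= ord("Z")


def is_ascii_lower(cp: int) -> bool:
    return ord("a") <= cp <= ord("z")


def is_ascii_printable(cp: int) -> bool:
    # Assumption: space (0x20) through tilde (0x7E), the conventional range.
    return 0x20 <= cp <= 0x7E


print(is_ascii_upper(ord("Q")), is_ascii_lower(ord("Q")))  # True False
```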

is_python_space

is_python_space(self) -> Bool

Determines whether this character is a Python whitespace character.

This corresponds to Python's universal separators: " \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029".

Examples

Check whether individual characters are whitespace:

from testing import assert_true, assert_false

# ASCII space characters
assert_true(Char.ord(" ").is_python_space())
assert_true(Char.ord("\t").is_python_space())

# Unicode paragraph separator:
assert_true(Char.from_u32(0x2029).value().is_python_space())

# Letters are not space characters
assert_false(Char.ord("a").is_python_space())

Returns:

True if this character is one of the whitespace characters listed above, otherwise False.

is_posix_space

is_posix_space(self) -> Bool

Returns True if this Char is a space character according to the POSIX locale.

The POSIX locale is also known as the C locale.

This only respects the default "C" locale, i.e. returns True only if the character specified is one of " \t\n\v\f\r". For semantics similar to Python, use String.isspace().

Returns:

True iff the character is one of the whitespace characters listed above.
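The difference between the POSIX and Python whitespace definitions can be made concrete with two sets. This is a Python sketch built only from the character lists quoted in this document; the constant and function names are hypothetical:

```python
# Character lists copied from the descriptions of is_python_space
# and is_posix_space.
PYTHON_SPACE = frozenset(" \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029")
POSIX_SPACE = frozenset(" \t\n\v\f\r")


def is_python_space(ch: str) -> bool:
    return ch in PYTHON_SPACE


def is_posix_space(ch: str) -> bool:
    return ch in POSIX_SPACE


# Characters accepted by the Python definition but not the POSIX one.
only_python = sorted(PYTHON_SPACE - POSIX_SPACE)
print([hex(ord(ch)) for ch in only_python])
# ['0x1c', '0x1d', '0x1e', '0x85', '0x2028', '0x2029']
```

Every POSIX whitespace character is also Python whitespace, so the POSIX predicate is the stricter of the two.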

to_u32

to_u32(self) -> SIMD[uint32, 1]

Returns the numeric value of this scalar value as an unsigned 32-bit integer.

Returns:

The numeric value of this scalar value as an unsigned 32-bit integer.

unsafe_write_utf8

unsafe_write_utf8(self, ptr: UnsafePointer[SIMD[uint8, 1]]) -> UInt

Encodes this character as UTF-8 and writes the resulting bytes to ptr.

Safety: ptr MUST point to at least self.utf8_byte_length() allocated bytes or else an out-of-bounds write will occur, which is undefined behavior.

Unicode scalar value (viewed as a big-endian UInt32) to UTF-8 conversion, by number of encoded bytes. In the expressions below, a, b, c, and d denote the corresponding bit groups in their original positions within the 32-bit value:

  • 1: 00000000 00000000 00000000 0aaaaaaa -> 0aaaaaaa
    • a
  • 2: 00000000 00000000 00000aaa aabbbbbb -> 110aaaaa 10bbbbbb
    • (a >> 6) | 0b11000000, b | 0b10000000
  • 3: 00000000 00000000 aaaabbbb bbcccccc -> 1110aaaa 10bbbbbb 10cccccc
    • (a >> 12) | 0b11100000, (b >> 6) | 0b10000000, c | 0b10000000
  • 4: 00000000 000aaabb bbbbcccc ccdddddd -> 11110aaa 10bbbbbb 10cccccc 10dddddd
    • (a >> 18) | 0b11110000, (b >> 12) | 0b10000000, (c >> 6) | 0b10000000, d | 0b10000000

Args:

  • ptr (UnsafePointer[SIMD[uint8, 1]]): Pointer value to write the encoded UTF-8 bytes. Must validly point to a sufficient number of bytes (1-4) to hold the encoded data.

Returns:

The number of bytes written.
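The bit layout described for this method can be sketched in Python. This is an illustration of standard UTF-8 encoding, not the Mojo implementation; `write_utf8` and its `buffer` and `offset` parameters are hypothetical stand-ins for the pointer-based API. Unlike a raw pointer write, a Python `bytearray` grows instead of overflowing, so the safety requirement is not modeled here:

```python
def write_utf8(codepoint: int, buffer: bytearray, offset: int = 0) -> int:
    """Write the UTF-8 encoding of codepoint into buffer; return bytes written."""
    if codepoint < 0x80:       # 1 byte: 0aaaaaaa
        encoded = [codepoint]
    elif codepoint < 0x800:    # 2 bytes: 110aaaaa 10bbbbbb
        encoded = [0xC0 | (codepoint >> 6),
                   0x80 | (codepoint & 0x3F)]
    elif codepoint < 0x10000:  # 3 bytes: 1110aaaa 10bbbbbb 10cccccc
        encoded = [0xE0 | (codepoint >> 12),
                   0x80 | ((codepoint >> 6) & 0x3F),
                   0x80 | (codepoint & 0x3F)]
    else:                      # 4 bytes: 11110aaa 10bbbbbb 10cccccc 10dddddd
        encoded = [0xF0 | (codepoint >> 18),
                   0x80 | ((codepoint >> 12) & 0x3F),
                   0x80 | ((codepoint >> 6) & 0x3F),
                   0x80 | (codepoint & 0x3F)]
    buffer[offset:offset + len(encoded)] = bytes(encoded)
    return len(encoded)


buf = bytearray(4)
written = write_utf8(0x20AC, buf)  # U+20AC EURO SIGN
print(written, buf[:written].hex())  # 3 e282ac
```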

utf8_byte_length

utf8_byte_length(self) -> UInt

Returns the number of UTF-8 bytes required to encode this character.

The returned value is always between 1 and 4, inclusive.

Returns:

The number of UTF-8 bytes required to encode this character.
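The byte count depends only on which codepoint range the value falls in. A minimal Python sketch, using the standard UTF-8 thresholds and operating on a plain integer (the function name mirrors the method but is otherwise hypothetical):

```python
def utf8_byte_length(codepoint: int) -> int:
    """Return how many bytes UTF-8 needs for a valid Unicode scalar value."""
    if codepoint < 0x80:     # up to 7 bits of payload
        return 1
    if codepoint < 0x800:    # up to 11 bits of payload
        return 2
    if codepoint < 0x10000:  # up to 16 bits of payload
        return 3
    return 4                 # up to 21 bits of payload


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(hex(cp), utf8_byte_length(cp))
```

A caller could use this value to size the buffer passed to unsafe_write_utf8.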