Mojo struct

Char

struct Char

A single textual character.

This type stores a single Unicode scalar value, which typically encodes a single user-recognizable character.

All valid Unicode scalar values fall in one of two ranges: 0 to 0xD7FF and 0xE000 to 0x10FFFF, inclusive. The gap between them, 0xD800 to 0xDFFF, is reserved for UTF-16 surrogates. This type guarantees that the stored integer value falls in one of the two valid ranges.
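The range rule above can be written as a small predicate. This is a minimal Python sketch of the check, not the Mojo implementation; the function name `is_unicode_scalar_value` is hypothetical.

```python
def is_unicode_scalar_value(codepoint: int) -> bool:
    """Return True if codepoint lies in 0..0xD7FF or 0xE000..0x10FFFF."""
    # The excluded gap 0xD800..0xDFFF holds the UTF-16 surrogates.
    return 0 <= codepoint <= 0xD7FF or 0xE000 <= codepoint <= 0x10FFFF
```

Under this sketch, 0x41 and 0x10FFFF are accepted, while 0xD800 and 0x110000 are rejected.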

Implemented traits

AnyType, CollectionElement, Copyable, EqualityComparable, EqualityComparableCollectionElement, ExplicitlyCopyable, Intable, Movable, Stringable, UnknownDestructibility

Methods

`init`

__init__(out self, *, unsafe_unchecked_codepoint: SIMD[uint32, 1])

Construct a Char from a code point value without checking that it falls in the valid range.

Safety: The provided codepoint value MUST be a valid Unicode scalar value. Providing a value outside of the valid range could lead to undefined behavior in algorithms that depend on the validity guarantees of this type.

Args:

unsafe_unchecked_codepoint (SIMD[uint32, 1]): A valid Unicode scalar value code point.

__init__(out self, codepoint: SIMD[uint8, 1])

Construct a Char from a single byte value.

This constructor cannot fail because every unsigned 8-bit value (0 to 255) is a valid Unicode scalar value.

Args:

codepoint (SIMD[uint8, 1]): The 8-bit codepoint value to convert to a Char.

`eq`

__eq__(self, other: Self) -> Bool

Return True if this character has the same codepoint value as other.

Args:

other (Self): The codepoint value to compare against.

Returns:

True if this character and other have the same codepoint value; False otherwise.

`ne`

__ne__(self, other: Self) -> Bool

Return True if this character has a different codepoint value from other.

Args:

other (Self): The codepoint value to compare against.

Returns:

True if this character and other have different codepoint values; False otherwise.

`from_u32`

static from_u32(codepoint: SIMD[uint32, 1]) -> Optional[Char]

Construct a Char from a code point value. Returns None if the provided codepoint is not in the valid range.

Args:

codepoint (SIMD[uint32, 1]): An integer representing a Unicode scalar value.

Returns:

A Char if codepoint falls in the valid range for Unicode scalar values, otherwise None.

`ord`

static ord(string: StringSlice[origin]) -> Self

Returns the Char that represents the given one-character string.

Given a string representing one character, return a Char representing the codepoint of that character. For example, Char.ord("a") returns the codepoint 97. This is the inverse of the chr() function.

This function is similar to the ord() free function, except that it returns a Char instead of an Int.

Args:

string (StringSlice[origin]): The input string, which must contain only a single character.

Returns:

A Char representing the codepoint of the given character.
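Python's built-in `ord()` and `chr()` have the same inverse relationship, so a short Python sketch can illustrate it. This shows Python behavior only, not the Mojo API; `Char.ord` itself returns a Char rather than an integer.

```python
# ord() maps a one-character string to its integer codepoint.
codepoint = ord("a")

# chr() maps the codepoint back to the one-character string.
round_trip = chr(codepoint)

# The relationship also holds outside ASCII (U+00E9, "e" with acute accent).
accented = ord("\u00e9")
```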

`unsafe_decode_utf8_char`

static unsafe_decode_utf8_char(_ptr: UnsafePointer[SIMD[uint8, 1]]) -> Tuple[Char, Int]

Decodes a single Char from the given UTF-8 string pointer and returns it together with the number of bytes read.

Safety: _ptr MUST point to the first byte in a known-valid UTF-8 character sequence. This function MUST NOT be used on unvalidated input.

Args:

_ptr (UnsafePointer[SIMD[uint8, 1]]): Pointer to UTF-8 encoded data containing at least one valid encoded codepoint.

Returns:

The decoded codepoint Char, as well as the number of bytes read.
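To make the return value concrete, here is a minimal Python sketch of decoding one codepoint from valid UTF-8 bytes. It is not the Mojo implementation: the name `decode_utf8_char` and the use of a `bytes` object in place of a pointer are assumptions for illustration. Like the method documented above, it performs no validation.

```python
def decode_utf8_char(data: bytes) -> tuple[int, int]:
    """Decode the first codepoint in data; return (codepoint, bytes_read).

    Assumes data starts with a valid UTF-8 sequence; nothing is validated.
    """
    first = data[0]
    if first < 0x80:        # 0xxxxxxx: single-byte (ASCII) sequence
        return first, 1
    if first < 0xE0:        # 110xxxxx: 2-byte sequence, 5 payload bits
        length, codepoint = 2, first & 0x1F
    elif first < 0xF0:      # 1110xxxx: 3-byte sequence, 4 payload bits
        length, codepoint = 3, first & 0x0F
    else:                   # 11110xxx: 4-byte sequence, 3 payload bits
        length, codepoint = 4, first & 0x07
    for byte in data[1:length]:
        # Each continuation byte (10xxxxxx) contributes 6 payload bits.
        codepoint = (codepoint << 6) | (byte & 0x3F)
    return codepoint, length
```

Bytes after the first encoded codepoint are ignored, so the returned length can be used to advance through a longer buffer.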

`int`

__int__(self) -> Int

Returns the numeric value of this scalar value as an integer.

Returns:

The numeric value of this scalar value as an integer.

`str`

__str__(self) -> String

Formats this Char as a single-character string.

Returns:

A string containing this single character.

`is_ascii`

is_ascii(self) -> Bool

Returns True if this Char is an ASCII character.

All ASCII characters have a codepoint value less than or equal to 127 and take exactly 1 byte to encode in UTF-8.

Returns:

A boolean indicating if this Char is an ASCII character.

`is_ascii_digit`

is_ascii_digit(self) -> Bool

Determines whether the given character is a digit [0-9].

Returns:

True if the character is a digit.

`is_ascii_upper`

is_ascii_upper(self) -> Bool

Determines whether the given character is an uppercase character.

This currently only respects the default "C" locale, i.e. returns True iff the character specified is one of "ABCDEFGHIJKLMNOPQRSTUVWXYZ".

Returns:

True if the character is uppercase.

`is_ascii_lower`

is_ascii_lower(self) -> Bool

Determines whether the given character is a lowercase character.

This currently only respects the default "C" locale, i.e. returns True iff the character specified is one of "abcdefghijklmnopqrstuvwxyz".

Returns:

True if the character is lowercase.

`is_ascii_printable`

is_ascii_printable(self) -> Bool

Determines whether the given character is a printable character.

Returns:

True if the character is a printable character, otherwise False.
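The ASCII predicates above reduce to simple range checks on the codepoint value. The following Python sketch mirrors them under stated assumptions: the functions take a plain integer codepoint, and "printable" is assumed to mean the range 0x20 (space) to 0x7E (tilde), which this page does not spell out.

```python
def is_ascii(cp: int) -> bool:
    # ASCII covers codepoints 0 through 127.
    return 0 <= cp <= 0x7F

def is_ascii_digit(cp: int) -> bool:
    return ord("0") <= cp <= ord("9")

def is_ascii_upper(cp: int) -> bool:
    return ord("A") <= cp <= ord("Z")

def is_ascii_lower(cp: int) -> bool:
    return ord("a") <= cp <= ord("z")

def is_ascii_printable(cp: int) -> bool:
    # Assumption: printable means space (0x20) through tilde (0x7E).
    return 0x20 <= cp <= 0x7E
```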

`is_python_space`

is_python_space(self) -> Bool

Determines whether this character is a Python whitespace character.

This corresponds to Python's universal separators: " \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029".

Examples

Check whether individual characters are whitespace:

from testing import assert_true, assert_false

# ASCII space characters
assert_true(Char.ord(" ").is_python_space())
assert_true(Char.ord("\t").is_python_space())

# Unicode paragraph separator:
assert_true(Char.from_u32(0x2029).value().is_python_space())

# Letters are not space characters
assert_false(Char.ord("a").is_python_space())

Returns:

True if this character is one of the whitespace characters listed above, otherwise False.

`is_posix_space`

is_posix_space(self) -> Bool

Returns True if this Char is a space character according to the POSIX locale.

The POSIX locale is also known as the C locale.

This only respects the default "C" locale, i.e. returns True only if the character specified is one of " \t\n\v\f\r". For semantics similar to Python, use String.isspace().

Returns:

True iff the character is one of the whitespace characters listed above.
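The difference between the two whitespace predicates comes down to which character set each one tests against. The Python sketch below copies both sets from the lists on this page; the constant and function names are hypothetical and the code is illustrative, not the Mojo implementation.

```python
# Character sets copied from the lists given for each method.
PYTHON_SPACE = frozenset(" \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029")
POSIX_SPACE = frozenset(" \t\n\v\f\r")

def is_python_space(ch: str) -> bool:
    """True if ch is one of the Python-style separators listed above."""
    return ch in PYTHON_SPACE

def is_posix_space(ch: str) -> bool:
    """True if ch is whitespace in the POSIX ("C") locale."""
    return ch in POSIX_SPACE
```

Every POSIX space character is also in the Python set, but separators such as "\x85" and "\u2029" are only in the Python set.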

`to_u32`

to_u32(self) -> SIMD[uint32, 1]

Returns the numeric value of this scalar value as an unsigned 32-bit integer.

Returns:

The numeric value of this scalar value as an unsigned 32-bit integer.

`unsafe_write_utf8`

unsafe_write_utf8(self, ptr: UnsafePointer[SIMD[uint8, 1]]) -> UInt

Encode this character as UTF-8 and write the resulting bytes to ptr.

Safety: ptr MUST point to at least self.utf8_byte_length() allocated bytes or else an out-of-bounds write will occur, which is undefined behavior.

Unicode (represented as UInt32 BE) to UTF-8 conversion:

- 1: 00000000 00000000 00000000 0aaaaaaa -> 0aaaaaaa
  - a
- 2: 00000000 00000000 00000aaa aabbbbbb -> 110aaaaa 10bbbbbb
  - (a >> 6) | 0b11000000, b | 0b10000000
- 3: 00000000 00000000 aaaabbbb bbcccccc -> 1110aaaa 10bbbbbb 10cccccc
  - (a >> 12) | 0b11100000, (b >> 6) | 0b10000000, c | 0b10000000
- 4: 00000000 000aaabb bbbbcccc ccdddddd -> 11110aaa 10bbbbbb 10cccccc 10dddddd
  - (a >> 18) | 0b11110000, (b >> 12) | 0b10000000, (c >> 6) | 0b10000000, d | 0b10000000

Args:

ptr (UnsafePointer[SIMD[uint8, 1]]): Pointer to the location where the encoded UTF-8 bytes are written. Must validly point to a sufficient number of bytes (1-4) to hold the encoded data.

Returns:

The number of bytes written.
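The bit layout listed for this method can be sketched in Python as follows. This is an illustrative encoder only, not the Mojo implementation: the name `encode_utf8` is hypothetical, and it returns a `bytes` object instead of writing through a pointer. It assumes the input is a valid Unicode scalar value.

```python
def encode_utf8(codepoint: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (1 to 4 bytes)."""
    if codepoint < 0x80:
        # 1 byte: 0aaaaaaa
        return bytes([codepoint])
    if codepoint < 0x800:
        # 2 bytes: 110aaaaa 10bbbbbb
        return bytes([
            0b11000000 | (codepoint >> 6),
            0b10000000 | (codepoint & 0b111111),
        ])
    if codepoint < 0x10000:
        # 3 bytes: 1110aaaa 10bbbbbb 10cccccc
        return bytes([
            0b11100000 | (codepoint >> 12),
            0b10000000 | ((codepoint >> 6) & 0b111111),
            0b10000000 | (codepoint & 0b111111),
        ])
    # 4 bytes: 11110aaa 10bbbbbb 10cccccc 10dddddd
    return bytes([
        0b11110000 | (codepoint >> 18),
        0b10000000 | ((codepoint >> 12) & 0b111111),
        0b10000000 | ((codepoint >> 6) & 0b111111),
        0b10000000 | (codepoint & 0b111111),
    ])
```

The length of the returned value corresponds to the byte count that the documented method reports as written.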

`utf8_byte_length`

utf8_byte_length(self) -> UInt

Returns the number of UTF-8 bytes required to encode this character.

The returned value is always between 1 and 4, inclusive.

Returns:

The number of UTF-8 bytes required to encode this character.
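The byte count follows directly from the standard UTF-8 codepoint thresholds. A minimal Python sketch, assuming a valid Unicode scalar value as input (the free function here is illustrative and takes a plain integer rather than a Char):

```python
def utf8_byte_length(codepoint: int) -> int:
    """Return how many bytes UTF-8 needs for this scalar value."""
    if codepoint < 0x80:       # up to 7 bits of payload
        return 1
    if codepoint < 0x800:      # up to 11 bits
        return 2
    if codepoint < 0x10000:    # up to 16 bits
        return 3
    return 4                   # up to 21 bits
```

A buffer of this size is what the safety note on unsafe_write_utf8 requires before writing.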
