struct Codepoint
A Unicode codepoint, restricted to valid Unicode scalar values.
This type stores a single Unicode scalar value, which typically encodes a single user-recognizable character.
All valid Unicode scalar values fall in the ranges 0 to 0xD7FF and 0xE000 to 0x10FFFF, inclusive. This type guarantees that the stored integer value falls in one of these ranges.
Codepoints vs Scalar Values
Formally, Unicode defines a codespace of values in the range 0 to 0x10FFFF inclusive, and a Unicode codepoint is any integer falling within that range. However, for historical reasons, it became necessary to "carve out" a subset of the codespace, excluding codepoints in the range 0xD800 to 0xDFFF inclusive. The codepoints outside that excluded range are known as Unicode scalar values. The codepoints in the range 0xD800 to 0xDFFF are known as "surrogate" codepoints. The surrogate codepoints will never be assigned a semantic meaning, and can only validly appear in UTF-16 encoded text.
The difference between codepoints and scalar values is a technical distinction related to the backwards-compatible workaround chosen to enable UTF-16 to encode the full range of the Unicode codespace. For simplicity's sake, and to avoid a confusing clash with the Mojo Scalar type, this type is pragmatically named Codepoint, even though it is restricted to valid scalar values.
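To make the role of surrogates concrete, here is a minimal Python sketch (Python is used purely for illustration; this is not the Mojo API, and the helper names is_surrogate and utf16_units are hypothetical). It shows how UTF-16 represents a scalar value above 0xFFFF as a pair of surrogate code units, which is why the surrogate range had to be carved out of the codespace.

```python
# Illustration only: hypothetical helpers, not part of the Mojo Codepoint API.

def is_surrogate(cp: int) -> bool:
    """True if cp lies in the surrogate range 0xD800..0xDFFF."""
    return 0xD800 <= cp <= 0xDFFF


def utf16_units(cp: int) -> list:
    """Return the UTF-16 code units that encode the scalar value cp."""
    if cp < 0 or cp > 0x10FFFF or is_surrogate(cp):
        raise ValueError("not a Unicode scalar value")
    if cp < 0x10000:
        return [cp]  # fits in a single 16-bit code unit
    offset = cp - 0x10000  # 20 bits, split across two surrogates
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]
```

Under these definitions, utf16_units(0x1F600) yields the surrogate pair 0xD83D, 0xDE00, while a surrogate value such as 0xD800 is rejected because it is not a scalar value.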
Implemented traits
AnyType, CollectionElement, Copyable, EqualityComparable, EqualityComparableCollectionElement, ExplicitlyCopyable, Intable, Movable, Stringable, UnknownDestructibility
Methods
__init__
__init__(out self, *, unsafe_unchecked_codepoint: SIMD[uint32, 1])
Construct a Codepoint from a code point value without checking that it falls in the valid range.
Safety: The provided codepoint value MUST be a valid Unicode scalar value. Providing a value outside of the valid range could lead to undefined behavior in algorithms that depend on the validity guarantees of this type.
Args:
- unsafe_unchecked_codepoint (SIMD[uint32, 1]): A valid Unicode scalar value code point.
__init__(out self, codepoint: SIMD[uint8, 1])
Construct a Codepoint from a single byte value.
This constructor cannot fail because every 8-bit unsigned integer (0 to 255) is a valid Unicode scalar value.
Args:
- codepoint (SIMD[uint8, 1]): The 8-bit codepoint value to convert to a Codepoint.
__eq__
__eq__(self, other: Self) -> Bool
Return True if this character has the same codepoint value as other.
Args:
- other (Self): The codepoint value to compare against.
Returns:
True if this character and other have the same codepoint value; False otherwise.
__ne__
__ne__(self, other: Self) -> Bool
Return True if this character has a different codepoint value from other.
Args:
- other (Self): The codepoint value to compare against.
Returns:
True if this character and other have different codepoint values; False otherwise.
from_u32
static from_u32(codepoint: SIMD[uint32, 1]) -> Optional[Codepoint]
Construct a Codepoint from a code point value. Returns None if the provided codepoint is not in the valid range.
Args:
- codepoint (SIMD[uint32, 1]): An integer representing a Unicode scalar value.
Returns:
A Codepoint if codepoint falls in the valid range for Unicode scalar values, otherwise None.
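The validity check described here can be sketched in Python as follows. This is an illustration of the documented range rule only, not the Mojo implementation; the function models the Optional result by returning either the integer or None.

```python
from typing import Optional


def from_u32(codepoint: int) -> Optional[int]:
    """Return codepoint if it is a valid Unicode scalar value, else None.

    Illustrative stand-in for the documented behavior: valid values are
    0..0xD7FF and 0xE000..0x10FFFF, inclusive.
    """
    if 0 <= codepoint <= 0xD7FF or 0xE000 <= codepoint <= 0x10FFFF:
        return codepoint
    return None
```

With this sketch, surrogate values such as 0xD800 and values above 0x10FFFF produce None, while boundary values such as 0xD7FF and 0xE000 are accepted.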
ord
static ord(string: StringSlice[origin]) -> Self
Returns the Codepoint that represents the given single-character string.
Given a string containing one character, return a Codepoint representing the codepoint of that character. For example, Codepoint.ord("a") returns the codepoint 97. This is the inverse of the chr() function.
This function is similar to the ord() free function, except that it returns a Codepoint instead of an Int.
Args:
- string (StringSlice[origin]): The input string, which must contain only a single character.
Returns:
A Codepoint representing the codepoint of the given character.
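The inverse relationship between ord() and chr() can be seen with Python's built-ins of the same names. This is shown only as an analogy; the Mojo functions return Mojo types rather than plain Python values.

```python
# Python's built-in ord() maps a one-character string to its integer
# codepoint, and chr() maps the integer back to the string.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]
codepoints = [ord(ch) for ch in samples]
round_trip = [chr(cp) for cp in codepoints]
```

Here codepoints begins with 97 for "a", and round_trip equals samples, since each function undoes the other.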
unsafe_decode_utf8_codepoint
static unsafe_decode_utf8_codepoint(_ptr: UnsafePointer[SIMD[uint8, 1]]) -> Tuple[Codepoint, Int]
Decodes a single Codepoint from the given UTF-8 string pointer, and reports the number of bytes read.
Safety: _ptr MUST point to the first byte in a known-valid UTF-8 character sequence. This function MUST NOT be used on unvalidated input.
Args:
- _ptr (UnsafePointer[SIMD[uint8, 1]]): Pointer to UTF-8 encoded data containing at least one valid encoded codepoint.
Returns:
The decoded Codepoint, as well as the number of bytes read.
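As a rough picture of what decoding a single codepoint involves, the Python sketch below reads the lead byte to learn the sequence length and then folds in six bits from each continuation byte. The helper decode_first is hypothetical and is not how the Mojo method is necessarily implemented. Like the method documented here, it performs no validation and assumes well-formed input.

```python
def decode_first(data: bytes) -> tuple:
    """Decode the first codepoint of known-valid UTF-8 data.

    Returns (codepoint, number_of_bytes_read). No validation is done.
    """
    b0 = data[0]
    if b0 < 0x80:             # 0xxxxxxx: 1-byte sequence
        return b0, 1
    if b0 < 0xE0:             # 110xxxxx: 2-byte sequence
        n, cp = 2, b0 & 0x1F
    elif b0 < 0xF0:           # 1110xxxx: 3-byte sequence
        n, cp = 3, b0 & 0x0F
    else:                     # 11110xxx: 4-byte sequence
        n, cp = 4, b0 & 0x07
    for b in data[1:n]:
        cp = (cp << 6) | (b & 0x3F)  # each continuation byte adds 6 bits
    return cp, n
```

The returned byte count lets a caller advance through a buffer one codepoint at a time.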
__int__
__int__(self) -> Int
Returns the numeric value of this scalar value as an integer.
Returns:
The numeric value of this scalar value as an integer.
__str__
__str__(self) -> String
Formats this Codepoint as a single-character string.
Returns:
A string containing this single character.
is_ascii
is_ascii(self) -> Bool
Returns True if this Codepoint is an ASCII character.
All ASCII characters are less than or equal to codepoint value 127, and take exactly 1 byte to encode in UTF-8.
Returns:
A boolean indicating if this Codepoint is an ASCII character.
is_ascii_digit
is_ascii_digit(self) -> Bool
Determines whether the given character is a digit [0-9].
Returns:
True if the character is a digit.
is_ascii_upper
is_ascii_upper(self) -> Bool
Determines whether the given character is an uppercase character.
This currently only respects the default "C" locale, i.e. returns True iff the character specified is one of "ABCDEFGHIJKLMNOPQRSTUVWXYZ".
Returns:
True if the character is uppercase.
is_ascii_lower
is_ascii_lower(self) -> Bool
Determines whether the given character is a lowercase character.
This currently only respects the default "C" locale, i.e. returns True iff the character specified is one of "abcdefghijklmnopqrstuvwxyz".
Returns:
True if the character is lowercase.
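The ASCII checks documented so far amount to simple range comparisons on the codepoint value. The Python sketch below mirrors the documented rules for illustration; the functions take plain integers and are not the Mojo methods themselves.

```python
# Illustration only: integer-based stand-ins for the documented checks.

def is_ascii(cp: int) -> bool:
    return 0 <= cp <= 127  # ASCII is the range 0..127


def is_ascii_digit(cp: int) -> bool:
    return ord("0") <= cp <= ord("9")  # one of 0-9


def is_ascii_upper(cp: int) -> bool:
    return ord("A") <= cp <= ord("Z")  # one of A-Z


def is_ascii_lower(cp: int) -> bool:
    return ord("a") <= cp <= ord("z")  # one of a-z
```

Because only the ASCII letters count, a non-ASCII uppercase letter such as U+00C9 fails is_ascii_upper in this sketch, consistent with the "C" locale behavior described here.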
is_ascii_printable
is_ascii_printable(self) -> Bool
Determines whether the given character is a printable character.
Returns:
True if the character is a printable character, otherwise False.
is_python_space
is_python_space(self) -> Bool
Determines whether this character is a Python whitespace character.
This corresponds to Python's universal separators: " \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029".
Examples
Check whether individual characters are whitespace:
from testing import assert_true, assert_false
# ASCII space characters
assert_true(Codepoint.ord(" ").is_python_space())
assert_true(Codepoint.ord("\t").is_python_space())
# Unicode paragraph separator:
assert_true(Codepoint.from_u32(0x2029).value().is_python_space())
# Letters are not space characters
assert_false(Codepoint.ord("a").is_python_space())
Returns:
True if this character is one of the whitespace characters listed above, otherwise False.
is_posix_space
is_posix_space(self) -> Bool
Returns True if this Codepoint is a space character according to the POSIX locale.
The POSIX locale is also known as the C locale.
This only respects the default "C" locale, i.e. returns True only if the character specified is one of " \t\n\v\f\r". For semantics similar to Python, use String.isspace().
Returns:
True iff the character is one of the whitespace characters listed above.
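The difference between the two whitespace definitions is just set membership over the character lists quoted in this reference. The Python sketch below encodes both lists for comparison; the names PYTHON_SPACES and POSIX_SPACES are hypothetical and the functions are illustrative stand-ins.

```python
# Character sets copied from the lists quoted in this reference.
PYTHON_SPACES = frozenset(" \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029")
POSIX_SPACES = frozenset(" \t\n\v\f\r")


def is_python_space(ch: str) -> bool:
    return ch in PYTHON_SPACES


def is_posix_space(ch: str) -> bool:
    return ch in POSIX_SPACES
```

Every POSIX space is also in the Python list, but characters such as "\x85" and "\u2029" appear only in the Python list.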
to_u32
to_u32(self) -> SIMD[uint32, 1]
Returns the numeric value of this scalar value as an unsigned 32-bit integer.
Returns:
The numeric value of this scalar value as an unsigned 32-bit integer.
unsafe_write_utf8
unsafe_write_utf8(self, ptr: UnsafePointer[SIMD[uint8, 1]]) -> UInt
Encodes this Codepoint as UTF-8 and writes the resulting bytes to ptr.
Safety: ptr MUST point to at least self.utf8_byte_length() allocated bytes or else an out-of-bounds write will occur, which is undefined behavior.
Unicode (represented as UInt32 BE) to UTF-8 conversion:
- 1: 00000000 00000000 00000000 0aaaaaaa -> 0aaaaaaa
  - a
- 2: 00000000 00000000 00000aaa aabbbbbb -> 110aaaaa 10bbbbbb
  - (a >> 6) | 0b11000000, b | 0b10000000
- 3: 00000000 00000000 aaaabbbb bbcccccc -> 1110aaaa 10bbbbbb 10cccccc
  - (a >> 12) | 0b11100000, (b >> 6) | 0b10000000, c | 0b10000000
- 4: 00000000 000aaabb bbbbcccc ccdddddd -> 11110aaa 10bbbbbb 10cccccc 10dddddd
  - (a >> 18) | 0b11110000, (b >> 12) | 0b10000000, (c >> 6) | 0b10000000, d | 0b10000000
Args:
- ptr (UnsafePointer[SIMD[uint8, 1]]): Pointer value to write the encoded UTF-8 bytes. Must validly point to a sufficient number of bytes (1-4) to hold the encoded data.
Returns:
The number of bytes written.
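The bit layout for UTF-8 encoding can be checked with a short Python sketch. The function encode_utf8 is a hypothetical illustration of the shift-and-mask steps, not the Mojo implementation: it returns a bytes object rather than writing through a pointer, and it assumes its input is a valid scalar value.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode a Unicode scalar value as UTF-8 (input assumed valid)."""
    if cp < 0x80:        # 1 byte: 0aaaaaaa
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110aaaaa 10bbbbbb
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110aaaa 10bbbbbb 10cccccc
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110aaa 10bbbbbb 10cccccc 10dddddd
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

The length of the returned bytes object plays the role of the byte count that the documented method returns.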
utf8_byte_length
utf8_byte_length(self) -> UInt
Returns the number of UTF-8 bytes required to encode this character.
The returned value is always between 1 and 4, inclusive.
Returns:
The number of UTF-8 bytes required to encode this character.
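The byte count follows directly from the standard UTF-8 value thresholds, as in this Python sketch. It is an illustration under the assumption that the input is a valid scalar value, not the Mojo implementation.

```python
def utf8_byte_length(cp: int) -> int:
    """Number of UTF-8 bytes needed for the scalar value cp."""
    if cp <= 0x7F:       # up to 7 bits of payload
        return 1
    if cp <= 0x7FF:      # up to 11 bits
        return 2
    if cp <= 0xFFFF:     # up to 16 bits
        return 3
    return 4             # up to 21 bits (max 0x10FFFF)
```

A caller could use such a count to size a buffer before encoding, in line with the safety requirement stated for unsafe_write_utf8.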