What string type should we use for short UTF-8 strings? #113
Comments
Thanks; found it on GitLab. Added it above.
I think we might want to make a custom type that is ArrayString on no_std and ArrayString plus heap spillover on std. SmolStr would otherwise be nice, but it is not mutable.
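To make the "ArrayString plus spillover" idea concrete, here is a minimal std-only sketch. `SpillString` and all of its methods are hypothetical names invented for illustration; this is not the API of smallstr, arrayvec, or any ICU4X type, and a no_std build would presumably have to drop the `Heap` variant.

```rust
/// Hypothetical type for illustration: stores up to N bytes inline and
/// moves the content into a heap-allocated String once it no longer fits.
#[derive(Debug, Clone)]
pub enum SpillString<const N: usize> {
    Inline { buf: [u8; N], len: usize },
    Heap(String),
}

impl<const N: usize> SpillString<N> {
    /// Creates an empty string that starts out in inline storage.
    pub fn new() -> Self {
        SpillString::Inline { buf: [0; N], len: 0 }
    }

    /// Returns true while the content still lives in the inline buffer.
    pub fn is_inline(&self) -> bool {
        matches!(self, SpillString::Inline { .. })
    }

    pub fn as_str(&self) -> &str {
        match self {
            // Only whole &str values are ever copied into `buf`, so the
            // first `len` bytes are always valid UTF-8.
            SpillString::Inline { buf, len } => {
                std::str::from_utf8(&buf[..*len]).expect("inline bytes are valid UTF-8")
            }
            SpillString::Heap(s) => s.as_str(),
        }
    }

    pub fn push_str(&mut self, s: &str) {
        match self {
            SpillString::Inline { buf, len } => {
                let new_len = *len + s.len();
                if new_len <= N {
                    // Still fits: append in place, no allocation.
                    buf[*len..new_len].copy_from_slice(s.as_bytes());
                    *len = new_len;
                    return;
                }
            }
            SpillString::Heap(heap) => {
                heap.push_str(s);
                return;
            }
        }
        // Inline capacity exceeded: copy everything to the heap.
        let mut heap = String::with_capacity(self.as_str().len() + s.len());
        heap.push_str(self.as_str());
        heap.push_str(s);
        *self = SpillString::Heap(heap);
    }
}
```

With `N = 16`, pushing "Hello, " and "world!" stays inline (13 bytes); pushing more text after that switches the value to the `Heap` variant.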
Mutability is a nice-to-have, but I think not essential. When we want to build up strings from pieces, we probably want something more full-featured than a mutable string, something that can keep track of field positions, capitalization contexts, etc. (ICU calls this FormattedStringBuilder.)
Maintaining our own custom type sounds complicated. I think I'm leaning toward We can always change before v1.
CC @emilio @hsivonen for feedback.
It would be great to document our decision in https://github.com/unicode-org/icu4x/blob/master/docs/string-representation.md
The string representation doc is scoped to the outer API. Do we expect this type to be visible in the API that a Rust application developer using ICU4X sees?
My understanding was that this is an internal type, but Shane's:
make me question that. If it is indeed internal, I believe we can also switch it post-v1.
The type is mostly internal, but I was thinking that we could use it in the data provider structs, which are public API since you can build your own data provider.
Agreement: use smallstr.
I looked at the binary sizes. I selectively compiled the following functions to WASM:
// #[no_mangle]
pub fn init_smallstr() {
    let message: SmallString<[u8; 16]> = SmallString::from("Hello, world!");
    unsafe {
        alert(&message);
    }
}
// #[no_mangle]
pub fn greet_smallstr(input: &str) {
    let mut message: SmallString<[u8; 16]> = SmallString::new();
    message.push_str("Hello, ");
    message.push_str(input);
    message.push_str("!");
    unsafe {
        alert(&message);
    }
}
// #[no_mangle]
pub fn init_arraystring() {
    let message: ArrayString<[_; 16]> = ArrayString::from("Hello, world!").unwrap();
    unsafe {
        alert(&message);
    }
}
// #[no_mangle]
pub fn greet_arraystring(input: &str) {
    let mut message: ArrayString<[_; 16]> = ArrayString::new();
    message.push_str("Hello, ");
    message.push_str(input);
    message.push_str("!");
    unsafe {
        alert(&message);
    }
}
// #[no_mangle]
pub fn init_str() {
    let message = String::from("Hello, world!");
    unsafe {
        alert(&message);
    }
}
// #[no_mangle]
pub fn greet_str(input: &str) {
    let mut message = String::new();
    message.push_str("Hello, ");
    message.push_str(input);
    message.push_str("!");
    unsafe {
        alert(&message);
    }
}
Results:
See: sffc/rust-wasm-i18n@1583344
Main takeaways:
I expect that the high-runner use case for these types is as a data type constructed from a fixed string, and both SmallString and ArrayString are slimmer than standard library String in that case. However, ArrayString lacks the ability to heap-allocate on overflow.
Conclusion: stick with SmallString.
A recent email in the rust-users Google group brought up this interesting blog post on
We commonly need to handle short UTF-8 strings that are longer than one code point. For example, a lot of CLDR data may be one grapheme cluster long, or may be one code point plus a bidi control character.
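As a rough illustration of how small these strings are, the sketch below counts code points and UTF-8 bytes. The helper name `code_points_and_bytes` and the sample strings are made up for this example and are not taken from CLDR.

```rust
/// Hypothetical helper: returns (number of code points, number of UTF-8 bytes).
pub fn code_points_and_bytes(s: &str) -> (usize, usize) {
    (s.chars().count(), s.len())
}

/// Illustrative samples, not real CLDR data.
pub fn samples() -> [(&'static str, (usize, usize)); 2] {
    // "e" followed by U+0301 COMBINING ACUTE ACCENT: one grapheme cluster,
    // two code points, 1 + 2 bytes.
    let cluster = "e\u{0301}";
    // "%" followed by U+200E LEFT-TO-RIGHT MARK: one visible character plus
    // a bidi control character, 1 + 3 bytes.
    let with_bidi = "%\u{200E}";
    [
        (cluster, code_points_and_bytes(cluster)),
        (with_bidi, code_points_and_bytes(with_bidi)),
    ]
}
```

Both samples are longer than one code point yet take only a few bytes, so they would fit comfortably in a 16-byte inline buffer.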
@zbraniecki compiled a great list of some options:
https://github.com/zbraniecki/tinystr/wiki/Performance
Here are the options in that doc, along with my initial thoughts on their pros and cons.
size_of::<SmolStr>() == size_of::<String>()
#![no_std]
#![no_std] (uses alloc)
#![no_std] (no dependence on alloc)
#![no_std] (uses alloc)

Thoughts? @Manishearth @hsivonen
The doc also lists "istring", but I can't find it on GitHub. Pointers?
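For the size_of comparison in the option list above, a std-only sketch like the following can print the baseline sizes to compare against. The function name `string_type_sizes` is hypothetical. In current Rust a `String` occupies three pointer-sized words (24 bytes on a 64-bit target), but that layout is an implementation detail and not a documented guarantee.

```rust
use std::mem::size_of;

/// Hypothetical helper: sizes in bytes of
/// (String, &str, Box<str>, [u8; 16]) on the current target.
pub fn string_type_sizes() -> (usize, usize, usize, usize) {
    (
        size_of::<String>(),   // pointer + capacity + length in practice
        size_of::<&str>(),     // fat pointer: data pointer + length
        size_of::<Box<str>>(), // owned fat pointer, no capacity field
        size_of::<[u8; 16]>(), // raw inline buffer used in the WASM experiment
    )
}
```

A small-string type would be measured the same way, for example with `size_of::<SmolStr>()` once that crate is a dependency.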