Skip to content

What string type should we use for short UTF-8 strings? #113

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
sffc opened this issue Jun 5, 2020 · 11 comments · Fixed by #159
Closed

What string type should we use for short UTF-8 strings? #113

sffc opened this issue Jun 5, 2020 · 11 comments · Fixed by #159
Assignees
Labels
A-design Area: Architecture or design C-meta Component: Relating to ICU4X as a whole T-core Type: Required functionality
Milestone

Comments

@sffc
Copy link
Member

sffc commented Jun 5, 2020

We commonly need to handle short UTF-8 strings that are longer than one code point. For example, a lot of CLDR data may be one grapheme cluster long, or may be one code point plus a bidi control character.

@zbraniecki compiled a great list of some options:

https://github.com/zbraniecki/tinystr/wiki/Performance

Here are the options in that doc, along with my initial thoughts on their pros and cons.

  • SmolStr
    • 29 stars on GitHub
    • Pro: size_of::<SmolStr>() == size_of::<String>()
    • Pro: Serde integration
    • Pro: No dependencies except for the Serde traits
    • Con: Doesn't support #![no_std]
    • Con: No mutable functions
  • smallstr
    • 6 stars on GitHub
    • Pro: FFI integration
    • Pro: Serde integration
    • Pro: Good documentation
    • Pro: #![no_std] (uses alloc)
    • Con: Only 6 stars; owner's affiliation unclear
  • ArrayString
    • 255 stars on GitHub
    • Pro: Serde integration
    • Pro: Good documentation
    • Pro: #![no_std] (no dependence on alloc)
    • Pro: Lots of stars and contributors
    • Pro: Comes with other non-String stack vectors that we could leverage
    • Pro: No dependencies except for the Serde traits
    • Con: Does not support overflow onto heap, which may be necessary since grapheme clusters are not bounded by length
  • istring
    • No stars on GitLab
    • Pro: #![no_std] (uses alloc)
    • Con: Only one contributor, and no activity since 2018
    • Con: Not much documentation

Thoughts? @Manishearth @hsivonen

The doc also lists "istring", but I can't find it on GitHub. Pointers?

@zbraniecki
Copy link
Member

@sffc
Copy link
Member Author

sffc commented Jun 5, 2020

https://crates.io/crates/istring

Thanks; found it on GitLab. Added it above.

@Manishearth
Copy link
Member

I think we might want to make a custom type that is ArrayString on no_std and ArrayString+spillover on std.

SmolStr is also not mutable, otherwise that would be nice. smallstr is okay.

@sffc
Copy link
Member Author

sffc commented Jun 6, 2020

Mutability is a nice-to-have, but I think not essential. When we want to build up strings from pieces, we probably want something more full-featured than a mutable string, something that can keep track of field positions, capitalization contexts, etc. (ICU calls this FormattedStringBuilder.)

Maintaining our own custom type sounds complicated.

I think I'm leaning toward smallstr because it has all the features we want, and it's built on a solid foundation (smallvec).

We can always change before v1.

@sffc sffc added the discuss Discuss at a future ICU4X-SC meeting label Jun 12, 2020
@sffc sffc self-assigned this Jun 12, 2020
@zbraniecki
Copy link
Member

CC @emilio @hsivonen for feedback

It would be great to document our decision in https://github.com/unicode-org/icu4x/blob/master/docs/string-representation.md

@hsivonen
Copy link
Member

The string representation doc is scoped to the outer API.

Do we expect this type to be visible in the API that a Rust application developer using ICU4X sees?

@zbraniecki
Copy link
Member

My understanding was that this is an internal type, but Shane's:

We can always change before v1.

make me question that.

If it is indeed internal, we can also switch it post v1 I believe.
I mostly wanted to get your feedback to verify that what we're choosing will work well between public API and internal types.

@sffc
Copy link
Member Author

sffc commented Jun 16, 2020

The type is mostly internal, but I was thinking that we could use it in the data provider structs, which are public API since you can build your own data provider.

@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Jun 18, 2020
@sffc
Copy link
Member Author

sffc commented Jun 18, 2020

Agreement: use smallstr.

@sffc sffc added this to the 2020 Q2 milestone Jun 18, 2020
@sffc sffc added T-core Type: Required functionality C-meta Component: Relating to ICU4X as a whole A-design Area: Architecture or design labels Jun 22, 2020
@sffc
Copy link
Member Author

sffc commented Jun 26, 2020

I looked at the binary sizes. I selectively compiled the following functions to WASM:

// #[no_mangle]
pub fn init_smallstr() {
	let message: SmallString<[u8; 16]> = SmallString::from("Hello, world!");
	unsafe {
		alert(&message);
	}
}

// #[no_mangle]
pub fn greet_smallstr(input: &str) {
	let mut message: SmallString<[u8; 16]> = SmallString::new();
	message.push_str("Hello, ");
	message.push_str(input);
	message.push_str("!");
	unsafe {
		alert(&message);
	}
}

// #[no_mangle]
pub fn init_arraystring() {
	let message: ArrayString<[_; 16]> = ArrayString::from("Hello, world!").unwrap();
	unsafe {
		alert(&message);
	}
}

// #[no_mangle]
pub fn greet_arraystring(input: &str) {
	let mut message: ArrayString<[_; 16]> = ArrayString::new();
	message.push_str("Hello, ");
	message.push_str(input);
	message.push_str("!");
	unsafe {
		alert(&message);
	}
}

// #[no_mangle]
pub fn init_str() {
	let message = String::from("Hello, world!");
	unsafe {
		alert(&message);
	}
}

// #[no_mangle]
pub fn greet_str(input: &str) {
	let mut message = String::new();
	message.push_str("Hello, ");
	message.push_str(input);
	message.push_str("!");
	unsafe {
		alert(&message);
	}
}

Results:

Function Bytes of .wat file
(control) 5578
init_smallstr 29084
greet_smallstr 43388
init_arraystring 6305
greet_arraystring 23897
init_str 31295
greet_str 37333

See: sffc/rust-wasm-i18n@1583344

Main takeaways:

  • SmallString is smaller than the standard library String when using FromStr, but a bit bloated when using mutation methods (the greet input/output function).
  • Most of the size of both SmallString and standard library String is in WeeAlloc
  • ArrayString is very small in the FromStr, because it doesn't need WeeAlloc, but gets substantially larger in the greet function. It appears that most of the assembly code for the greet function is core::fmt::Display.
  • SmallString gets a bit smaller, down to 28917 on init_smallstr, when the nightly union feature is used.

I expect that the high-runner use case for these types is as a data type constructed from a fixed string, and both SmallString and ArrayString are slimmer than standard library String in that case. However, ArrayString lacks the ability to heap-allocate on overflow.

Conclusion: stick with SmallString.

@EvanJP
Copy link
Contributor

EvanJP commented Jul 6, 2020

A recent email in the rust-users google group brought up this interesting blog post on SmolStr vs SmartString (no SmallStr). https://fasterthanli.me/articles/small-strings-in-rust. There are benchmarks regarding heap allocation (regardless of mutation/immutability) that are if nothing else an interesting read.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-design Area: Architecture or design C-meta Component: Relating to ICU4X as a whole T-core Type: Required functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants