Skip to content

Avoid redundant WTF-8 checks in PathBuf #140159

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 2, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 16 additions & 6 deletions library/std/src/ffi/os_str.rs
Original file line number Diff line number Diff line change
Expand Up @@ -582,15 +582,25 @@ impl OsString {
#[unstable(feature = "os_string_truncate", issue = "133262")]
pub fn truncate(&mut self, len: usize) {
self.as_os_str().inner.check_public_boundary(len);
self.inner.truncate(len);
// SAFETY: The length was just checked to be at a valid boundary.
unsafe { self.inner.truncate_unchecked(len) };
}

/// Provides plumbing to core `Vec::extend_from_slice`.
/// More well behaving alternative to allowing outer types
/// full mutable access to the core `Vec`.
/// Provides plumbing to `Vec::extend_from_slice` without giving full
/// mutable access to the `Vec`.
///
/// # Safety
///
/// The slice must be valid for the platform encoding (as described in
/// [`OsStr::from_encoded_bytes_unchecked`]).
///
/// This bypasses the encoding-dependent surrogate joining, so `self` must
/// not end with a leading surrogate half and `other` must not start with
/// with a trailing surrogate half.
#[inline]
Comment on lines +597 to 600
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you to @thaliaarchi for helping explain to/remind me that the reason this is sufficient as a safety condition is that joined surrogate pairs get distinct encodings that are not identical to "a lone surrogate followed by a lone surrogate" because that's why UTF-8 is, er, UTF-8 instead of... yeah.

To reiterate for my own benefit, because I have an annoying "I understand this from the bottom up or it doesn't click" problem, this required understanding:

  • UTF-16 evolved from UCS-2, an encoding for the "basic multilingual plane"
  • UTF-16 encodes "astral" code points beyond the BMP using "surrogate pairs", unassigned values from UCS-2 that served as a "niche" in the encoding, to fit in each single u16 the de facto enum:
enum Utf16 {
    Single(Ucs2),
    Paired(Surrogate)
}
  • Nominally, "UTF-16" encoded data should reject the Utf16::Paired variant when it isn't actually paired, but a lot of "WTF-16" encodings actually allow lone surrogates in because everyone was still tacitly assuming handling things code-unit-by-code-unit would work fine and that's not how any multi-code-unit encoding works.
  • WTF-8 always encodes values in UTF-8 when values can be validly encoded in UTF-8.
  • WTF-8 has an encoding for lone surrogates so it can handle invalid "WTF-16", taken out of UTF-8's own encoding "niche": it is not valid UTF-8.
  • This means we have to handle "this surrogate should not exist" and "this surrogate is paired with another surrogate so they're a real boy Unicode code point". Thus, putting two surrogates "one after the other" is not always "simple" appending for WTF-8, when it can actually mean decoding the pair of code units into a Unicode code point and then encoding that into UTF-8.

( There's a detail for "trailing and leading" surrogates about the actual encoding for surrogates but I am deliberately eliding it because everyone spends five hours of explanation on exactly how surrogates are encoded which doesn't matter when thinking about the design at a higher level, such that everything that actually matters oozes out of my brain before we get to the conclusion. )

( I guess that's me not being completely bottom-up but whatever. )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...this is almost completely irrelevant to this actual review obviously I'm just trying to write this down again to try to get it to stick to my brain this time, as I have tried to learn/understand this at least 3 times.

pub(crate) fn extend_from_slice(&mut self, other: &[u8]) {
self.inner.extend_from_slice(other);
pub(crate) unsafe fn extend_from_slice_unchecked(&mut self, other: &[u8]) {
// SAFETY: Guaranteed by caller.
unsafe { self.inner.extend_from_slice_unchecked(other) };
}
}

Expand Down
19 changes: 12 additions & 7 deletions library/std/src/path.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1526,11 +1526,13 @@ impl PathBuf {
self.inner.truncate(end_file_stem.wrapping_sub(start));

// add the new extension, if any
let new = extension;
let new = extension.as_encoded_bytes();
if !new.is_empty() {
self.inner.reserve_exact(new.len() + 1);
self.inner.push(OsStr::new("."));
self.inner.push(new);
self.inner.push(".");
// SAFETY: Since a UTF-8 string was just pushed, it is not possible
// for the buffer to end with a surrogate half.
unsafe { self.inner.extend_from_slice_unchecked(new) };
}

true
Expand Down Expand Up @@ -1587,7 +1589,7 @@ impl PathBuf {
Some(f) => f.as_encoded_bytes(),
};

let new = extension;
let new = extension.as_encoded_bytes();
if !new.is_empty() {
// truncate until right after the file name
// this is necessary for trimming the trailing slash
Expand All @@ -1597,8 +1599,10 @@ impl PathBuf {

// append the new extension
self.inner.reserve_exact(new.len() + 1);
self.inner.push(OsStr::new("."));
self.inner.push(new);
self.inner.push(".");
// SAFETY: Since a UTF-8 string was just pushed, it is not possible
// for the buffer to end with a surrogate half.
unsafe { self.inner.extend_from_slice_unchecked(new) };
}

true
Expand Down Expand Up @@ -2759,7 +2763,8 @@ impl Path {
};

let mut new_path = PathBuf::with_capacity(new_capacity);
new_path.inner.extend_from_slice(slice_to_copy);
// SAFETY: The path is empty, so cannot have surrogate halves.
unsafe { new_path.inner.extend_from_slice_unchecked(slice_to_copy) };
new_path.set_extension(extension);
new_path
}
Expand Down
25 changes: 16 additions & 9 deletions library/std/src/sys/os_str/bytes.rs
Original file line number Diff line number Diff line change
Expand Up @@ -216,19 +216,26 @@ impl Buf {
self.as_slice().into_rc()
}

/// Provides plumbing to core `Vec::truncate`.
/// More well behaving alternative to allowing outer types
/// full mutable access to the core `Vec`.
#[inline]
pub(crate) fn truncate(&mut self, len: usize) {
/// Provides plumbing to `Vec::truncate` without giving full mutable access
/// to the `Vec`.
///
/// # Safety
///
/// The length must be at an `OsStr` boundary, according to
/// `Slice::check_public_boundary`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am having increasingly negative opinions about having an internal type named Slice but it is not your responsibility for that, so

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then what would a better name be? I might submit a PR for that. But maybe not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, honestly! it just is starting to bug me because this looked like a typo at first.

#[inline]
pub unsafe fn truncate_unchecked(&mut self, len: usize) {
self.inner.truncate(len);
}

/// Provides plumbing to core `Vec::extend_from_slice`.
/// More well behaving alternative to allowing outer types
/// full mutable access to the core `Vec`.
/// Provides plumbing to `Vec::extend_from_slice` without giving full
/// mutable access to the `Vec`.
///
/// # Safety
///
/// This encoding has no safety requirements.
#[inline]
pub(crate) fn extend_from_slice(&mut self, other: &[u8]) {
pub unsafe fn extend_from_slice_unchecked(&mut self, other: &[u8]) {
self.inner.extend_from_slice(other);
}
}
Expand Down
30 changes: 21 additions & 9 deletions library/std/src/sys/os_str/wtf8.rs
Original file line number Diff line number Diff line change
Expand Up @@ -195,19 +195,31 @@ impl Buf {
self.as_slice().into_rc()
}

/// Provides plumbing to core `Vec::truncate`.
/// More well behaving alternative to allowing outer types
/// full mutable access to the core `Vec`.
#[inline]
pub(crate) fn truncate(&mut self, len: usize) {
/// Provides plumbing to `Vec::truncate` without giving full mutable access
/// to the `Vec`.
///
/// # Safety
///
/// The length must be at an `OsStr` boundary, according to
/// `Slice::check_public_boundary`.
#[inline]
pub unsafe fn truncate_unchecked(&mut self, len: usize) {
self.inner.truncate(len);
}

/// Provides plumbing to core `Vec::extend_from_slice`.
/// More well behaving alternative to allowing outer types
/// full mutable access to the core `Vec`.
/// Provides plumbing to `Vec::extend_from_slice` without giving full
/// mutable access to the `Vec`.
///
/// # Safety
///
/// The slice must be valid for the platform encoding (as described in
/// [`Slice::from_encoded_bytes_unchecked`]).
///
/// This bypasses the WTF-8 surrogate joining, so `self` must not end with a
/// leading surrogate half and `other` must not start with with a trailing
/// surrogate half.
#[inline]
pub(crate) fn extend_from_slice(&mut self, other: &[u8]) {
pub unsafe fn extend_from_slice_unchecked(&mut self, other: &[u8]) {
self.inner.extend_from_slice(other);
}
}
Expand Down
Loading