Skip to content

improve performance and reduce allocations of UUID methods #96

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 6, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
199 changes: 111 additions & 88 deletions codec.go
Original file line number Diff line number Diff line change
Expand Up @@ -22,8 +22,7 @@
package uuid

import (
"bytes"
"encoding/hex"
"errors"
"fmt"
)

Expand All @@ -45,11 +44,77 @@ func FromBytesOrNil(input []byte) UUID {
return uuid
}

var errInvalidFormat = errors.New("uuid: invalid UUID format")

func fromHexChar(c byte) byte {
switch {
case '0' <= c && c <= '9':
return c - '0'
case 'a' <= c && c <= 'f':
return c - 'a' + 10
case 'A' <= c && c <= 'F':
return c - 'A' + 10
}
return 255
}

// Parse parses the UUID stored in the string text. Parsing and supported
// formats are the same as UnmarshalText.
func (u *UUID) Parse(s string) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not make Parse() delegate to UnmarshalText() or vice versa? The conversion from string to []byte or []byte to string is almost always optimized out by the compiler and this is non-trivial code to be duplicating.

Copy link
Contributor Author

@charlievieth charlievieth Dec 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The conversion from string to []byte or []byte to string is almost always optimized out by the compiler and this is non-trivial code to be duplicating.

In some cases this is true and the runtime can stack allocate or use a temp buffer for the string to []byte conversion (see: stringtoslicebyte()), but that is not true here since ([]byte)(s) escapes to heap (you can check this by making the suggested change and running go build -i -gcflags='-m').

TLDR: delegating to UnmarshalText would require an allocation and a copy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, delegating to UnmarshalText could be DOS vector if this was used to parse untrusted input (a malicious actor could pass very large "UUID" causing the application to allocate large amounts of memory).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is that this is a fairly complex block of code that's identical to UnmarshalText() other than one processes string and the other []byte. It would be better in the long run to not have the duplication. What's the result if UnmarshalText() delegates to Parse() instead?

Also, since the logic is the same, any DOS vector in UnmarshalText() exists here as well. Looking at the code, though, neither path allocates memory based on the input. Each loops over 32 characters of the input data after cleaning up "decorations" like braces, etc. and writes to corresponding locations in the already-allocated [16]byte that is the underlying value of the UUID type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is that this is a fairly complex block of code that's identical to UnmarshalText() other than one processes string and the other []byte. It would be better in the long run to not have the duplication.

I don't like the duplication either, but this is the only way to do this without allocating. The fact that we have 100% test coverage considerably alleviates these concerns IMO. In the long run this could be replaced by a generic function that takes both strings and byte slices, but that should only be done once Go 1.17 is no longer in wide use.

What's the result if UnmarshalText() delegates to Parse() instead?

Same result and same issues around creating a copy of untrusted input.

Also, since the logic is the same, any DOS vector in UnmarshalText() exists here as well.

The DOS vector only exists if we delegate the call to UnmarshalText/Parse from Parse/UnmarshalText and have to allocate to create a copy of untrusted input.

switch len(s) {
case 32: // hash
case 36: // canonical
case 34, 38:
if s[0] != '{' || s[len(s)-1] != '}' {
return fmt.Errorf("uuid: incorrect UUID format in string %q", s)
}
s = s[1 : len(s)-1]
case 41, 45:
if s[:9] != "urn:uuid:" {
return fmt.Errorf("uuid: incorrect UUID format in string %q", s[:9])
}
s = s[9:]
default:
return fmt.Errorf("uuid: incorrect UUID length %d in string %q", len(s), s)
}
// canonical
if len(s) == 36 {
if s[8] != '-' || s[13] != '-' || s[18] != '-' || s[23] != '-' {
return fmt.Errorf("uuid: incorrect UUID format in string %q", s)
}
for i, x := range [16]byte{
0, 2, 4, 6,
9, 11,
14, 16,
19, 21,
24, 26, 28, 30, 32, 34,
} {
v1 := fromHexChar(s[x])
v2 := fromHexChar(s[x+1])
if v1|v2 == 255 {
return errInvalidFormat
}
u[i] = (v1 << 4) | v2
}
return nil
}
// hash like
for i := 0; i < 32; i += 2 {
v1 := fromHexChar(s[i])
v2 := fromHexChar(s[i+1])
if v1|v2 == 255 {
return errInvalidFormat
}
u[i/2] = (v1 << 4) | v2
}
return nil
}

// FromString returns a UUID parsed from the input string.
// Input is expected in a form accepted by UnmarshalText.
func FromString(input string) (UUID, error) {
u := UUID{}
err := u.UnmarshalText([]byte(input))
func FromString(text string) (UUID, error) {
var u UUID
err := u.Parse(text)
return u, err
}

Expand All @@ -66,7 +131,9 @@ func FromStringOrNil(input string) UUID {
// MarshalText implements the encoding.TextMarshaler interface.
// The encoding is the same as returned by the String() method.
func (u UUID) MarshalText() ([]byte, error) {
return []byte(u.String()), nil
var buf [36]byte
encodeCanonical(buf[:], u)
return buf[:], nil
}

// UnmarshalText implements the encoding.TextUnmarshaler interface.
Expand Down Expand Up @@ -103,96 +170,52 @@ func (u UUID) MarshalText() ([]byte, error) {
// braced := '{' plain '}' | '{' hashlike '}'
// urn := URN ':' UUID-NID ':' plain
//
func (u *UUID) UnmarshalText(text []byte) error {
switch len(text) {
case 32:
return u.decodeHashLike(text)
func (u *UUID) UnmarshalText(b []byte) error {
switch len(b) {
case 32: // hash
case 36: // canonical
case 34, 38:
return u.decodeBraced(text)
case 36:
return u.decodeCanonical(text)
if b[0] != '{' || b[len(b)-1] != '}' {
return fmt.Errorf("uuid: incorrect UUID format in string %q", b)
}
b = b[1 : len(b)-1]
case 41, 45:
return u.decodeURN(text)
if string(b[:9]) != "urn:uuid:" {
return fmt.Errorf("uuid: incorrect UUID format in string %q", b[:9])
}
b = b[9:]
default:
return fmt.Errorf("uuid: incorrect UUID length %d in string %q", len(text), text)
}
}

// decodeCanonical decodes UUID strings that are formatted as defined in RFC-4122 (section 3):
// "6ba7b810-9dad-11d1-80b4-00c04fd430c8".
func (u *UUID) decodeCanonical(t []byte) error {
if t[8] != '-' || t[13] != '-' || t[18] != '-' || t[23] != '-' {
return fmt.Errorf("uuid: incorrect UUID format in string %q", t)
return fmt.Errorf("uuid: incorrect UUID length %d in string %q", len(b), b)
}

src := t
dst := u[:]

for i, byteGroup := range byteGroups {
if i > 0 {
src = src[1:] // skip dash
if len(b) == 36 {
if b[8] != '-' || b[13] != '-' || b[18] != '-' || b[23] != '-' {
return fmt.Errorf("uuid: incorrect UUID format in string %q", b)
}
_, err := hex.Decode(dst[:byteGroup/2], src[:byteGroup])
if err != nil {
return err
for i, x := range [16]byte{
0, 2, 4, 6,
9, 11,
14, 16,
19, 21,
24, 26, 28, 30, 32, 34,
} {
v1 := fromHexChar(b[x])
v2 := fromHexChar(b[x+1])
if v1|v2 == 255 {
return errInvalidFormat
}
u[i] = (v1 << 4) | v2
}
src = src[byteGroup:]
dst = dst[byteGroup/2:]
}

return nil
}

// decodeHashLike decodes UUID strings that are using the following format:
// "6ba7b8109dad11d180b400c04fd430c8".
func (u *UUID) decodeHashLike(t []byte) error {
src := t[:]
dst := u[:]

_, err := hex.Decode(dst, src)
return err
}

// decodeBraced decodes UUID strings that are using the following formats:
// "{6ba7b810-9dad-11d1-80b4-00c04fd430c8}"
// "{6ba7b8109dad11d180b400c04fd430c8}".
func (u *UUID) decodeBraced(t []byte) error {
l := len(t)

if t[0] != '{' || t[l-1] != '}' {
return fmt.Errorf("uuid: incorrect UUID format in string %q", t)
return nil
}

return u.decodePlain(t[1 : l-1])
}

// decodeURN decodes UUID strings that are using the following formats:
// "urn:uuid:6ba7b810-9dad-11d1-80b4-00c04fd430c8"
// "urn:uuid:6ba7b8109dad11d180b400c04fd430c8".
func (u *UUID) decodeURN(t []byte) error {
total := len(t)

urnUUIDPrefix := t[:9]

if !bytes.Equal(urnUUIDPrefix, urnPrefix) {
return fmt.Errorf("uuid: incorrect UUID format in string %q", t)
}

return u.decodePlain(t[9:total])
}

// decodePlain decodes UUID strings that are using the following formats:
// "6ba7b810-9dad-11d1-80b4-00c04fd430c8" or in hash-like format
// "6ba7b8109dad11d180b400c04fd430c8".
func (u *UUID) decodePlain(t []byte) error {
switch len(t) {
case 32:
return u.decodeHashLike(t)
case 36:
return u.decodeCanonical(t)
default:
return fmt.Errorf("uuid: incorrect UUID length %d in string %q", len(t), t)
for i := 0; i < 32; i += 2 {
v1 := fromHexChar(b[i])
v2 := fromHexChar(b[i+1])
if v1|v2 == 255 {
return errInvalidFormat
}
u[i/2] = (v1 << 4) | v2
}
return nil
}

// MarshalBinary implements the encoding.BinaryMarshaler interface.
Expand Down
Loading