Skip to content

Prevent re-encoding of path when following a redirect #2184

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

JoeMylinn
Copy link

The current implementation re-encodes the path and query when following a redirect, unless path_encode_ is set to false.
However, the query is never decoded. This causes issues if the query contains a space ' ' (encoded as '+'). Since it's never decoded, but later re-encoded, the space ' ' (encoded as '+') gets changed into a '+' (encoded as '%2B'). Thus, the value of the parameter gets changed.

This fixes it by disabling path encoding on the redirect client.

@yhirose
Copy link
Owner

yhirose commented Jul 19, 2025

@JoeMylinn thanks for the pull request. Could you please add enough unit tests for this in test/test.cc? Then, I'll take a look at it. Thanks!

@JoeMylinn
Copy link
Author

@yhirose I've added a test for this, let me know if it's enough.

@yhirose
Copy link
Owner

yhirose commented Jul 20, 2025

@ungive, do you have any thought of this pull request? Thanks!
#2082

@yhirose
Copy link
Owner

yhirose commented Jul 20, 2025

@JoeMylinn could you please fix the style formatting error?

@JoeMylinn
Copy link
Author

@yhirose done

@yhirose
Copy link
Owner

yhirose commented Jul 20, 2025

'test/style-check' still gives errors...

@JoeMylinn
Copy link
Author

my bad, shouldn't give any errors now

httplib.h Outdated
Comment on lines 8668 to 8669
this->set_path_encode(
false); // Disable path encoding, redirects should already be encoded
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this set path_encode_ for all subsequent requests, overriding what the user configured?

Client cli(HOST, PORT);
cli.set_follow_location(true);
cli.set_path_encode(true); // encode paths
auto res = cli.Get("/a b"); // encoded as "/a%20b", redirects
// path_encode_ is now false
auto res = cli.Get("/c d"); // not encoded anymore, sent as "/c d"

Just a thought, I haven't tested this code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I believe you are right. That shouldn't be the case of course. Maybe we can just always create a new redirect client, even if it's the same host and never reuse the original one?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or you go back to your original suggestion (we talked about on discord) and decode the host+query so that it can be re-encoded in the client as normal. @JoeMylinn

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah maybe the best option is to always decode both the path and query, and re-encode it before passing the new path to the original or redirect client.

Copy link
Author

@JoeMylinn JoeMylinn Jul 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't like the decode and re-encode thing. Especially since we'd always have to encode if we decode it first, or it could break URLs with encoded characters if path_encode_ is set to false.

We could just parse the query parameters and add them to the request, something like this:

 auto path = next_path;

 if (!next_query.empty())
 {
   detail::parse_query_text(next_query, req.params);
 }

@yhirose any thoughts on what solution you'd prefer?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JoeMylinn I still don't fully understand the whole situation, yet. Could you please try with v0.22.0 to see if the same issue happens?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yhirose Yes, v0.22.0 has the same issue.

I'll try to explain the issue a bit better. We need several things for it to happen:

  1. follow_location_ set to true
  2. path_encode_ set to true
  3. Get redirected, with the query part of the Location containing a +.

For example, let's say the redirect location is https://example.com/test?say=Hello+World. The value of the parameter say should be Hello World, since '+' encodes space ' ' in a query. However, the current implementation just appends the query to the path, and then (if path_encode_ is set to true) encodes the whole path, including the query. This encodes it as Hello%2BWorld, instead of Hello+World or Hello%20World, changing the value of the of the parameter to Hello+World.

I've noticed this issue because it causes downloads from GitHub to fail, same as in #2185.

I originally fixed this by just setting path_encode_ to false when following a redirect.
However this is flawed, since it might override the user configured settings for the client, when the location is on the same host.

I think there are mainly two ways to fix this:

  1. Always use a new redirect client, even if the location is on the same host, with path_encode_ set to false.
  2. Parse the query properly, and pass the parameters to the new request.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is why is only the host decoded and not the entire query?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, but decoding just the path can already cause issues even if there isn't a query. If path_encode_ is false, and the path contains an encoded character, it will get decoded but not encoded again and the request will fail. So we'd have to always encode the path, regardless of the settings on the original client.

I just feel like it's simpler to never decode the path in the first place, and just parse the query. Unless there is a reason for it that I am missing.

@yhirose
Copy link
Owner

yhirose commented Jul 25, 2025

I am still trying to understand the real cause of this problem and learn how other HTTP libraries handle this issue. Thanks for your patience.

Here is my investigation with Claude Code:

⏺ How Other HTTP Libraries Handle This Problem

  Based on research findings, here's how major HTTP libraries address this issue:

  1. libcurl Solution Approach

  Design Characteristics:
  - CURLU_URLENCODE flag for explicit control
  - Space→+ conversion executed at set time for query parts
  - At retrieval time, CURLU_URLDECODE performs +→space conversion

  Redirect Handling:
  Set time: "Hello World" → "Hello+World" (CURLU_URLENCODE)
  Redirect time: Preserves encoding state
  Get time: "Hello+World" → "Hello World" (CURLU_URLDECODE)

  2. Python requests Solution Approach

  Design Characteristics:
  - Automatic encoding via params argument
  - Space→+ conversion with urllib.parse.quote_plus()
  - Preserves encoding state during redirects

  Implementation Example:
  # Automatic handling (recommended)
  params = {'query': 'Hello World'}
  requests.get(url, params=params)  # Automatically encodes properly

  # Manual control
  from urllib.parse import quote_plus
  encoded = quote_plus("Hello World")  # "Hello+World"

  3. Go net/http Solution Approach

  Design Characteristics:
  - Context-specific encoding
    - url.QueryEscape(): For queries (preserves + character)
    - url.PathEscape(): For paths (encodes + character)
  - Structured processing via url.Values

  Implementation Example:
  // For queries
  query := url.QueryEscape("Hello World")  // "Hello+World"

  // For paths
  path := url.PathEscape("math+symbol")    // "math%2Bsymbol"

  // Structured approach
  params := url.Values{}
  params.Add("query", "Hello World")
  queryStr := params.Encode()  // Properly encoded

  4. cpp-httplib Issues and Improvement Direction

  Current Problems:
  - Single encoding function handles everything
  - Context-agnostic (same processing for paths and queries)
  - Double encoding during redirects

  Lessons from Other Libraries:

  A. Context Separation Approach (Go style)

  // Proposed implementation
  std::string encode_query_component(const std::string& s);
  std::string encode_path_component(const std::string& s);

  B. Configuration Flag Control (libcurl style)

  client.set_query_encoding_mode(PRESERVE_PLUS | RFC3986_STRICT);

  C. Structured Parameter Processing (Python requests style)

  // Decompose URL into components for proper handling
  struct URLComponents {
      std::string scheme, host, path;
      std::map<std::string, std::string> query_params;
  };

  5. Recommended Solution

  Phased Approach:
  1. Immediate fix: Disable path_encode_ during redirects
  2. Medium-term improvement: Separate encoding for queries and paths
  3. Long-term improvement: Flexible configuration options like other libraries

  This analysis reveals that cpp-httplib has room for improvement compared to other major
  libraries in terms of context awareness and configuration flexibility for URL encoding.

@DeltaGW2
Copy link

I think it's safe to assume that the Location header already correctly encodes the full query.

I do not know how httplib handles it internally, specifically, whether redirect clients are disposed after use.

  1. If they are disposed after use, encoding could be disabled.
  2. If they are not disposed - meaning modifying the client directly impacts future usage of the same host - then the safe route would be to treat just the Location header separate to normal URLs and fully decode host + query, based on the assumption that the Location header is correctly encoded to begin with.

@JoeMylinn
Copy link
Author

It depends on if the redirect is on the same host or not. When on the same host, the original client is used, and modifying it would affect future usage of it. That's the issue with my original fix.

Always decoding the path and query has it's own issues, since it won't be encoded again if path_encode_ is false.

Since as you've said, the Location header should always be encoded already anyway, I think the best option would be to skip decoding the path entirely and just parse the query like so:

if (!next_query.empty()) { detail::parse_query_text(next_query, req.params); }

This will ensure the query is always properly parsed, we don't have to worry about doing any encoding or decoding in the redirect function, and this could be further adjusted in the future to allow config options to change how a + should be treated in a query (although I really don't see why you'd want to, as that wouldn't follow the standard).

Add proper parsing of query parameters when following a redirect
@JoeMylinn
Copy link
Author

I went ahead and committed my suggestion, I think this is the best solution for now.

yhirose added a commit that referenced this pull request Jul 29, 2025
@yhirose
Copy link
Owner

yhirose commented Jul 29, 2025

@JoeMylinn I implemented the 2nd recommendation that the Claude Code suggested in #2190. This PR fixes not only this issue, but also #2185.

2. Medium-term improvement: Separate encoding for queries and paths

@yhirose yhirose closed this in c0c36f0 Jul 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants