Skip to content

gh-117174: Add a new route in linecache to fetch interactive source code #117500

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Mar 10, 2025

Conversation

pablogsal
Copy link
Member

@pablogsal pablogsal commented Apr 3, 2024

Using leads to confusion when linecache it's populated with
entries as the inspect module heavily relies on it. An example:

./python -c "import inspect;x=eval('lambda x: x');print(inspect.getsource(x))"

The code above should trigger an exception because we should not be able
to get the source of x (it's dynamically generated) but if we use
as name for interactive code it will return gives the full string.

@pablogsal pablogsal changed the title bpo-117174: Rename <string> to <__interactive_string__> for interactive code objects gh-117174: Rename <string> to <__interactive_string__> for interactive code objects Apr 3, 2024
@pablogsal pablogsal force-pushed the bpo-117174 branch 2 times, most recently from f9c7ec5 to 90de6a6 Compare April 3, 2024 10:53
@gaogaotiantian
Copy link
Member

This is a breaking change right? Maybe we should do the same thing as the interactive input - to secretly convert the file name back to <string> in the traceback. We should also be more careful about the name because that would be used in the future and we need to keep it for a long time. Unlike the REPL commands, this is something the users are more likely to see. This also introduces some backward-incompatibility in pdb. Basically the file name of the code object changed and that's public.

@pablogsal
Copy link
Member Author

pablogsal commented Apr 3, 2024

This is a breaking change right?

Breaking how? This doesn't affect any visible thing that's guaranteed unless I am missing something.

We should also be more careful about the name because that would be used in the future and we need to keep it for a long time.

Not sure I follow, there is no stable API that guarantees this name anywhere in a way that you can use it. This is just the name of the code object that's get created when you use Python with -c. That's absolutely not a public API and it's not stable.

Basically the file name of the code object changed and that's public.

Again, not sure how you are arriving to. this conclusion. There is absolutely no guarantee on how the interactive code objects are named

@pablogsal
Copy link
Member Author

I have added a translation layer similarly as we do for the REPL so it looks a bit better in tracebacks

@gaogaotiantian
Copy link
Member

Breaking how? This doesn't affect any visible thing that's guaranteed unless I am missing something.

First of all, the tests are failing.

I don't think "backward-compatibility" is only about the things we "guarantee", it's also about the behavior that has been there for a long time, even without documentation.

This change breaks the CPython test suite and I bet it will break many other library tests.

Making sure the traceback behaves as before is a very helpful step, but there might be code in the market that relies on the name of the code object. pdb will show a different stack I believe.

I'm not saying that this is the wrong path, simply suggesting that this might break other people's code because it changes a behavior that has been there for a long time. Yes I'm aware that this basically applies to any code change we do for CPython, but maybe we could give it some extra thought and compensate as we can (and like to). The translation in traceback for example, is very useful in my opinion.

@pablogsal
Copy link
Member Author

pablogsal commented Apr 3, 2024

I don't think "backward-compatibility" is only about the things we "guarantee", it's also about the behavior that has been there for a long time, even without documentation.

I understand, but we cannot just lift every observable behaviour to something we must preserve. Just take into account when we balance our options.

In general I agree with you, and you have an excellent point that if the test suite is breaking in more than one place, that's a reason for concern.

But in any case our approach here is quite limited: we can basically rename the string and fix whatever we can (but there will be places like pdb that we cannot "fix") but we won't be able to change the code object, so thing may break in other ways. Another way it's to close the issue as won't fix or to deactivate the feature for -c (which is in any case not trivial as we don't distinguish down the line between -c and REPL).

@pablogsal
Copy link
Member Author

@lysnikolaou thoughts?

@gaogaotiantian
Copy link
Member

gaogaotiantian commented Apr 3, 2024

I agree that we need to balance the new changes and the backward compatibility issue, otherwise we won't be able to move forward.

First of all, I think the current solution, where we change the name of the code name string + a translator in traceback is acceptable. That's simple and straightfoward.

I do have an alternative proposal which could solve this issue in a larger scope.

We can somehow expand the capability for linecache, so that when the file name starts with <, it requires a code object to confirm the corresponding text lines. This will not break any existing code using linecache, and only adds some extra logic for filename starting with < - it will basically do another layer of mapping for each code object, which is hashable and can be the key of the dict.

When we register a code object with linecache._register_code, it iterates through all the children code objects (using co_consts) and register them each. When we need to extract the source code of a dynamically generated code object, we need to explicity pass the code object itself - which is easily accessible in traceback.

So, instead of hacking the existing "filename" based linecache, we create a competely new path for dynamically generated code, which will introduce 0 backward compatibility issue.

And, we unlocked the potential to give source to all dynamically generated code, not only the ones in REPL and -c.

This will add some overhead when we register the code with linecache, but the majority of dynamically generated code is small and I don't think it'll make a big difference to the users.

@lysnikolaou
Copy link
Member

This is a tricky one. In principle, I agree that not every behavior can be promoted to something we must preserve. This appears to be one of those cases though, where this is something we should not be preserving, but enough test suites will fail that we'll get complaints about it. And, since this is the case, it might make sense for us to put in the effort to preserve the filename.

I don't understand enough about the linecache module to be able to immediately see if @gaogaotiantian's approach will work, and what the trade-offs will be, but I'd be happy to review a PR where that approach is implemented and I can dive in a bit deeper to figure it out.

@gaogaotiantian
Copy link
Member

gaogaotiantian commented Apr 4, 2024

In gist, we ask linecache for the source code when we are printing traceback, and the argument is the filename and the line number. Because we have fully control over the traceback and linecache, we can make traceback to pass filename, line number and the code object if the filename starts with <. In linecache, when the filename starts with <, a code object is required otherwise it will return '', which is the existing behavior for all files starting with <.

In this way, we are fully backward-compatible, all the existing user code should work because the behavior does not change at all. We get source code in the traceback for anything we'd like (REPL, -c and even more in the future), because we can explicitly pass the code object. In the future, we can make the registration public and the user can register the source code by themselves.

If @pablogsal is convinced that this approach has a chance, I can do the draft implementation.

@pablogsal
Copy link
Member Author

The problem is that it's not just the traceback: it's the code object. Most test here fail because either pdb or tracemalloc are getting the wrong name because of the different name and those use different traceback machinery or directly check the code object. You approach will still break people that don't use the standard one or that inspect the code objects and expect a different thing, so it's not fully backwards compatible.

@pablogsal
Copy link
Member Author

In linecache, when the filename starts with <, a code object is required otherwise it will return ''

Also, just for me to understand correctly, isn't this a breaking change? If someone is registering by hand something starting with < now suddenly we will do something different, no? What am I missing?

@pablogsal
Copy link
Member Author

This will add some overhead when we register the code with linecache, but the majority of dynamically generated code is small and I don't think it'll make a big difference to the users.

I agree that most of the cases are like this but we don't know if this is always true to the point that it doesn't matter. This is python we are talking about and users generate a lot of edge cases

@gaogaotiantian
Copy link
Member

The problem is that it's not just the traceback: it's the code object. Most test here fail because either pdb or tracemalloc are getting the wrong name because of the different name and those use different traceback machinery or directly check the code object. You approach will still break people that don't use the standard one or that inspect the code objects and expect a different thing, so it's not fully backwards compatible.

With my proposal, we do not change the name of the code object - which is exactly why the method is more backward compatible compared to the current one. The filename <string> is preserved for -c (it could even be preserved for REPL inputs if that matters). What I'm proposing is to use the actual code object as a key to search for source files, and that's unique.

Also, just for me to understand correctly, isn't this a breaking change? If someone is registering by hand something starting with < now suddenly we will do something different, no? What am I missing?

In my mind, a breaking change is something that breaks the existing code. There's no public way for the users to "register" anything now. If the user opt-in the new feature, then the behavior will be different - that's not breaking change, that's just new. If the user programs as before, everything will be the same.

I agree that most of the cases are like this but we don't know if this is always true to the point that it doesn't matter. This is python we are talking about and users generate a lot of edge cases

Yes, but the time to search through the code objects will always be significantly shorter than compiling them. If they chose to compile a huge file and create a lot of the code objects dynamically, the time cost is already significant. The worst case complexity for searching code objects is not worse than compiling them.

@pablogsal
Copy link
Member Author

Ok, I think I misunderstood your proposal because I clearly had the wrong idea of how it would work.

But then, what's the difference between this and having an "alternate" storage for line cache that it's only internal and only the interpreter can use and it's only retrieved from traceback machinery? Let's say the traceback machinery first checks in the normal line cache and if that fails it checks in the "secret storage" and with -c and interactive mode we put it there. In this way we don't change code objects, this doesn't interfere with regular users and we get what we need.

@gaogaotiantian
Copy link
Member

Yes this is basically having an alternate storage in linecache. We don't need to change the code objects in any case, they are only used as keys.

If we have a reliable source for dynamically generated code, why keep it a secret? pdb definitely wants it, other debuggers might as well. This is a useful feature for the users. I even think we should make the source code registeration public, but I'm okay for only making it internally available for now.

What's the advantage to combine it with existing interface? Well, first of all, they are doing the exact same thing from high level - to get source code. They can share all the existing code like cache mechanism. More importantly, the funcion call can simply be modifed from

linecache.getline(filename, lineno)

to

linecache.getline(filename, lineno, code=code)

When filename does not start with <, we ignore the code argument. The change on the calling side is minimal, they don't need to do something like

line = linecache.getline(filename, lineno)
if not line:
    line = linecache.secret_getline(filename, lineno, code=code)

The existing user code won't be interfered as they will behave as before. The user can opt-in the feature to get dynamically generated code - I believe that was asked for before.

@pablogsal
Copy link
Member Author

pablogsal commented Apr 5, 2024

If we have a reliable source for dynamically generated code, why keep it a secret?

Because that's a new feature and we are considering a bugfix. (I'm on my phone now, I can answer in more detail later but this is basically why my intent is to keep this as simple as possible)

@gaogaotiantian
Copy link
Member

Because that's a new feature and we are considering a bugfix. (I'm on my phone now, I can answer in more detail later but this is basically why my intent is to keep this as simple as possible)

Okay that's reasonable, but can we also make it extensible for the future? I work on pdb a lot and it would be nice to have the source code for dymanically generated objects.

So strictly for the bug fix:

  1. We keep the file name <string> for -c to avoid hassles for test cases (and that's our target for the future as well)
  2. Keep the current code register interface so we do not need to touch C code at all. Also for now make it private and internal only.
  3. When we register code, put it in a different container than linecache.cache with the id of the code object as the key(different source code can build to the code objects that equal to each other).
  4. Do one of the following:
    • In traceback, try to get from that container if source is not available (or code starts with <?). this requires a new private interface for linecache
    • Change linecache.getline so that linecache handles this internally - it checks the file name, then look for it in the alternate container if applicable

I prefer the latter.

With this change, pdb can also use it. We may or may not promote this to public feature in the future, but with this pdb is already happy.

I don't think this solution is much more complicated than changing the code file name and translate it back in traceback, and I think this solution is helpful in long term.

@pablogsal
Copy link
Member Author

pablogsal commented Apr 10, 2024

3. When we register code, put it in a different container than linecache.cache with the id of the code object as the key(different source code can build to the code objects that equal to each other).

How do you retrieve this? The public APIs to the line cache module work on filename, not on code object. Specially if you want linecache.getline to do this internally, we cannot change the interface.

Also, the traceback module in most places where we call line cache we don't have the code object at hand and I don't think we want to push down the code object itself or some secret ID everywhere.

pablogsal added a commit to pablogsal/cpython that referenced this pull request Apr 10, 2024
@pablogsal pablogsal changed the title gh-117174: Rename <string> to <__interactive_string__> for interactive code objects gh-117174: Store interactive source code in an alternative location in linecache Apr 10, 2024
pablogsal added a commit to pablogsal/cpython that referenced this pull request Apr 10, 2024
@gaogaotiantian
Copy link
Member

Tests are still failing, is it related?

@pablogsal
Copy link
Member Author

HummMm something happened in the rebase with the new code added since the last update . Will look into it

Copy link
Member

@gaogaotiantian gaogaotiantian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve for the mechanism of the linecache part. I don't know if the id and _idfunc replacements are proper.

@pablogsal
Copy link
Member Author

I don't know if the id and _idfunc replacements are proper.

Can you elaborate?

@pablogsal
Copy link
Member Author

Ah you mean on the tests? Then yes, because those tests were flaky and they were relying on id to be called at a specific point but our id() is called before

@gaogaotiantian
Copy link
Member

Ah, okay, that's some interesting testing mechanism. But yeah, I don't know about the repl part and if anyone other than me should give it another review, but the overall logic to make this happen looks good to me. It's been a long journey :) Thanks for the effort!

@pablogsal pablogsal merged commit a931a8b into python:main Mar 10, 2025
53 checks passed
@pablogsal pablogsal deleted the bpo-117174 branch March 10, 2025 21:54
@pablogsal pablogsal added the needs backport to 3.13 bugs and security fixes label Mar 10, 2025
@miss-islington-app
Copy link

Thanks @pablogsal for the PR 🌮🎉.. I'm working now to backport this PR to: 3.13.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Mar 10, 2025
…urce code (pythonGH-117500)

(cherry picked from commit a931a8b)

Co-authored-by: Pablo Galindo Salgado <[email protected]>
@bedevere-app
Copy link

bedevere-app bot commented Mar 10, 2025

GH-131060 is a backport of this pull request to the 3.13 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.13 bugs and security fixes label Mar 10, 2025
@StanFromIreland
Copy link
Contributor

StanFromIreland commented Mar 10, 2025

How did this even happen though? It was fine before?

@pablogsal
Copy link
Member Author

I think the problem is that GH actions was happy when the PR was created but the merge broke it. merge queue would have watched this :S

pablogsal added a commit that referenced this pull request Mar 10, 2025
…ource code (GH-117500) (#131060)

gh-117174: Add a new route in linecache to fetch interactive source code (GH-117500)
(cherry picked from commit a931a8b)

Co-authored-by: Pablo Galindo Salgado <[email protected]>
@markshannon
Copy link
Member

markshannon commented Mar 11, 2025

@pablogsal

I think this introduces a refleak. The leak is in pyrepl, not linecache.
If I revert the changes to Lib/_pyrepl/console.py the leak goes away and, oddly, the tests still pass

Edit: the test didn't pass, I'd commented some of out them out when modifying Lib/_pyrepl/console.py

@pablogsal
Copy link
Member Author

Hummm, I will take a look today

@pablogsal
Copy link
Member Author

Fix for the leak here: #131095

seehwan pushed a commit to seehwan/cpython that referenced this pull request Apr 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
skip news topic-repl Related to the interactive shell
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants