Normalize symbols #332

Open
gustaphe opened this issue Apr 10, 2025 · 4 comments

@gustaphe
Collaborator

#331 highlights an issue in the way we currently deal with variable and function names when it comes to subscripting. Substituting in strings leads to some strange edge cases, and the code is pretty difficult to parse.

I suggest a new model: any symbol (that could represent a variable name or a function name) is only normalized once. It follows something like this schedule:

  1. Unicode substitution (:a₁ --> :a_1)
  2. The string is split into "sub-symbols" (:abc_x_y --> (:abc, :x, :y))
  3. Each sub-symbol is normalized separately
  4. If snakecase, all sub-symbols are joined with \_; otherwise the first sub-symbol is the base and the remaining ones are joined with \_ inside its subscript ("abc_{x\_y}")

Sub-symbol normalization:

  1. If it matches an entry in the constant list (e.g. inf, atan), get the normalized form from a dict
  2. If it has more than one (alphabetical) character, wrap it in \mathrm (configurable?)
  3. Else return sub-symbol as is
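
The schedule above could be sketched roughly as follows. This is only an illustration of the proposal, not code from Latexify.jl: every name here (`UNICODE_SUBSCRIPTS`, `CONSTANTS`, `substitute_unicode`, `normalize_subsymbol`, `normalize_symbol`, and the `long_symbol_font` keyword) is hypothetical, the constant table holds two example entries only, and only subscript digits are handled in the Unicode step.

```julia
# Hypothetical sketch of the proposed schedule; none of these names come from Latexify.jl.

# Step 1 lookup: Unicode subscript digits and their plain counterparts (digits only here).
const UNICODE_SUBSCRIPTS = Dict('₀' => '0', '₁' => '1', '₂' => '2', '₃' => '3', '₄' => '4',
                                '₅' => '5', '₆' => '6', '₇' => '7', '₈' => '8', '₉' => '9')

# Illustrative constant list; the real list and its normalized forms are assumptions.
const CONSTANTS = Dict("inf" => "\\infty", "atan" => "\\arctan")

# Step 1: "a₁" becomes "a_1"; a run of subscript characters gets a single underscore.
function substitute_unicode(s::AbstractString)
    out = IOBuffer()
    in_subscript = false
    for c in s
        if haskey(UNICODE_SUBSCRIPTS, c)
            in_subscript || print(out, '_')
            print(out, UNICODE_SUBSCRIPTS[c])
            in_subscript = true
        else
            print(out, c)
            in_subscript = false
        end
    end
    return String(take!(out))
end

# Step 3: normalize one sub-symbol (constant lookup, then optional font wrapping).
function normalize_subsymbol(s::AbstractString; long_symbol_font = "mathrm")
    key = String(s)
    haskey(CONSTANTS, key) && return CONSTANTS[key]
    if count(isletter, key) > 1 && long_symbol_font !== nothing
        return string("\\", long_symbol_font, "{", key, "}")
    end
    return key
end

# Steps 2 and 4: split on underscores, normalize each part, then join.
function normalize_symbol(sym::Symbol; snakecase = false, long_symbol_font = "mathrm")
    parts = split(substitute_unicode(string(sym)), '_')
    normalized = [normalize_subsymbol(p; long_symbol_font = long_symbol_font) for p in parts]
    if snakecase || length(normalized) == 1
        return join(normalized, "\\_")
    end
    return string(normalized[1], "_{", join(normalized[2:end], "\\_"), "}")
end
```

With these assumptions, `normalize_symbol(:abc_x_y; long_symbol_font = nothing)` gives `abc_{x\_y}`, and the default font setting gives `\mathrm{abc}_{x\_y}`. Note that with `snakecase = true` this sketch renders `:a₁` as `a\_1`, because the Unicode step runs before the split.
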

This leaves an uncertainty in how to sort indexing (a_1[3]), and breaks the current behavior of latexify(:abc) --> "$abc$". It will however be more consistent with mathematical notation.

The indexing uncertainty is the biggest blocker for me. We might have to consider using a placeholder struct, more or less saving :a_b as a special type of :a[:b] and delaying the stringification, but that will require a bit of an overhaul.
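
One possible shape for such a placeholder is sketched below. The type `Subscripted` and the helpers `parse_subscripted`, `add_index`, and `render` are all hypothetical, and the comma-separated rendering is just one arbitrary choice, since how `a_1[3]` should be ordered is exactly the open question.

```julia
# Hypothetical placeholder type; not part of Latexify.jl.
struct Subscripted
    base::Symbol
    subscripts::Vector{Any}   # Symbols taken from :a_b, or index values from a[3]
end

# Store :a_b as base :a with subscript :b, mirroring :a[:b].
function parse_subscripted(sym::Symbol)
    parts = split(string(sym), '_')
    return Subscripted(Symbol(parts[1]), Any[Symbol(p) for p in parts[2:end]])
end

# Indexing appends to the same list instead of producing a string right away.
add_index(s::Subscripted, idx...) = Subscripted(s.base, Any[s.subscripts..., idx...])

# Stringification is delayed until here; the ordering and separator are assumptions.
function render(s::Subscripted)
    isempty(s.subscripts) && return string(s.base)
    return string(s.base, "_{", join(s.subscripts, ","), "}")
end
```

Under these assumptions, `render(add_index(parse_subscripted(:a_1), 3))` gives `a_{1,3}`. Because the subscripts stay structured until `render`, a different ordering or separator would only need a change in that one function.
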

@isaacsas
Contributor

I'm not sure I fully understand the proposal, but if a user writes :a₁ or :(a[1]) with snakecase = true, I'd still want to get the LaTeX string a_1 and not a\_1. My understanding is that snakecase = true applies when a user has explicitly put an underscore in a symbol, like :a_1, which should then become a\_1: a snakecase variable, which is not the same as an array variable or a variable with a Unicode subscript. Isn't it meant to handle the common case of a variable written in snakecase in code, like a_long_variable, where the name doesn't map to subscripts in math notation (in contrast to the Unicode and array reference cases)?

So I guess I'm suggesting a different substitution order: Unicode substitution should come after determining snakecase, so that Unicode subscripts are not treated as snakecase.
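
A minimal sketch of that ordering, under stated assumptions: the names `SUBSCRIPT_DIGITS`, `close_subscript`, `unicode_to_subscript`, and `normalize_snake_first` are hypothetical, only subscript digits are handled, and array references like `:(a[1])` are left out.

```julia
# Hypothetical sketch; none of these names come from Latexify.jl.
const SUBSCRIPT_DIGITS = Dict(zip("₀₁₂₃₄₅₆₇₈₉", "0123456789"))

# Emit a pending run of subscript digits as a real LaTeX subscript.
function close_subscript(pending::AbstractString)
    if isempty(pending)
        return ""
    elseif length(pending) == 1
        return "_" * pending
    else
        return "_{" * pending * "}"
    end
end

# Convert Unicode subscript digits within one part; never produces an escaped underscore.
function unicode_to_subscript(part::AbstractString)
    out = ""
    pending = ""
    for c in part
        if haskey(SUBSCRIPT_DIGITS, c)
            pending *= SUBSCRIPT_DIGITS[c]
        else
            out *= close_subscript(pending) * c
            pending = ""
        end
    end
    return out * close_subscript(pending)
end

# Split on explicit underscores first, so snakecase only ever sees those.
function normalize_snake_first(sym::Symbol; snakecase = false)
    parts = [unicode_to_subscript(p) for p in split(string(sym), '_')]
    if snakecase || length(parts) == 1
        return join(parts, "\\_")
    end
    return string(parts[1], "_{", join(parts[2:end], "\\_"), "}")
end
```

With `snakecase = true`, this gives `a_1` for `:a₁` and `a\_1` for `:a_1`, which is the distinction described above.
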

Adding \mathrm as an option sounds good, but part of the Symbolics PR effort by myself and others was essentially to provide the flexibility to opt out of such wrapping there (they now wrap anything that is more than one character in \mathtt), so it would be nice if that option were turned off by default.

@isaacsas
Contributor

(One place where it is common to have multicharacter math variables but not do anything special is in chemical reaction models in biology, which often have multi-character chemical species names.)

@gustaphe
Collaborator Author

Thank you for your comments!

A variable that is both snakecased and has utf8 subscripts sounds pretty cursed, but you're right.

I think there will be a long_symbol_font kwarg that you can set to "mathtt" or nothing in the Symbolics recipe.

Chemical symbols should of course be surrounded in \ce{} and not printed as a product of single character variables. If mhchem is not available I would still do \mathrm.

@isaacsas
Contributor

True, but if you are using utf subscripts and snakecasing I'm not sure what you can even reasonably expect (and that seems like a pretty exceptional case, usually the two aren't mixed in code I've read).
