Improve inference for numeric columns #4406

Merged 22 commits into develop from numeric_infer on Apr 15, 2025

Contributor

@mathemancer mathemancer commented Apr 10, 2025

Related to #4370

This greatly improves the accuracy, safety, and performance of type inference for numeric columns.

Technical details

Previously, we simply tried to cast a sample of the column to numeric using our msar.cast_to_numeric(text) function. This regularly produced false negatives, and could also produce false positives. Now, we:

  • Test a (quite small) sample to try to find consistent group and decimal separators
  • Try to cast the column to numeric, assuming those group and decimal separators for every row
  • Return the group and decimal separator info, as well as the chosen type (numeric).
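The steps above can be sketched in Python as follows. This is only an illustration, not the PR's implementation, which lives in the database layer. The candidate separator list, the sample size, and the helper names `parse_with`, `infer_separators`, and `infer_numeric` are all assumptions; only the `group_sep` and `decimal_p` detail keys come from the description.

```python
import re
from decimal import Decimal

# Hypothetical (group_sep, decimal_p) candidates, tried in order of preference.
CANDIDATES = [("", "."), (",", "."), ("", ","), (".", ","),
              (" ", "."), (" ", ","), ("'", ".")]


def build_pattern(group_sep, decimal_p):
    """Regex for a number written with the given separators."""
    if group_sep:
        # Either properly grouped digits (1.234.567) or a plain digit run.
        integer = r"(?:\d{1,3}(?:" + re.escape(group_sep) + r"\d{3})+|\d+)"
    else:
        integer = r"\d+"
    return re.compile(r"[+-]?" + integer + r"(?:" + re.escape(decimal_p) + r"\d+)?")


def parse_with(text, group_sep, decimal_p):
    """Return a Decimal if text fits the separators exactly, else None."""
    text = text.strip()
    if not build_pattern(group_sep, decimal_p).fullmatch(text):
        return None
    if group_sep:
        text = text.replace(group_sep, "")
    return Decimal(text.replace(decimal_p, "."))


def infer_separators(sample):
    """Find the first candidate pair under which every sampled value parses."""
    for group_sep, decimal_p in CANDIDATES:
        if all(parse_with(v, group_sep, decimal_p) is not None for v in sample):
            return group_sep, decimal_p
    return None


def infer_numeric(column, sample_size=5):
    """Infer separators from a small sample, then check the whole column."""
    values = [v for v in column if v is not None]
    if not values:
        return None
    separators = infer_separators(values[:sample_size])
    if separators is None:
        return None
    group_sep, decimal_p = separators
    if any(parse_with(v, group_sep, decimal_p) is None for v in values):
        return None  # Some row disagrees with the sampled separators.
    return {"type": "numeric",
            "details": {"group_sep": group_sep, "decimal_p": decimal_p}}
```

In this sketch a value such as `1,234` is ambiguous on its own, so the candidate order decides how it is read. Mixed conventions in one column, for example `1,234.5` next to `1.234,5`, are rejected rather than guessed at.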

This PR does not make the front end changes needed to realize all of the performance benefits of the back end changes. To get those benefits, the front end should submit the returned details.decimal_p and details.group_sep values back to the API, which enables the faster casting behavior when actually confirming and saving the table.
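As a rough illustration of why resubmitting the separators could help, the Python sketch below casts values directly once `group_sep` and `decimal_p` are already known, with no detection step. The function name `cast_with_known_separators` and the overall response shape are assumptions made for this example; only the two `details` keys appear in the description above.

```python
from decimal import Decimal


def cast_with_known_separators(text, group_sep, decimal_p):
    """Cast text to Decimal using separators already known from inference.

    No detection or grouping validation happens here; that is the point of
    reusing the inferred details. Decimal raises InvalidOperation on bad input.
    """
    cleaned = text.strip()
    if group_sep:
        cleaned = cleaned.replace(group_sep, "")
    return Decimal(cleaned.replace(decimal_p, "."))


# Assumed shape of an inference response; hypothetical apart from the detail keys.
inferred = {"type": "numeric", "details": {"group_sep": ".", "decimal_p": ","}}
details = inferred["details"]

values = [
    cast_with_known_separators(v, details["group_sep"], details["decimal_p"])
    for v in ["1.234,50", "12,5"]
]
```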

TODO

In follow-up PRs, we should:

  • Clean up column altering code a bit.
  • Modify table previewing and column altering code to accept parameters for casting.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the develop branch of the repository
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@mathemancer mathemancer requested review from Anish9901 and pavish April 11, 2025 07:09
@mathemancer mathemancer added the pr-status: review A PR awaiting review label Apr 11, 2025
@mathemancer mathemancer added this to the v0.2.3 milestone Apr 11, 2025
pavish

This comment was marked as outdated.

@pavish pavish removed their assignment Apr 14, 2025
Member

@Anish9901 Anish9901 left a comment

LGTM!

@@ -17,7 +17,7 @@ if [[ $EXIT_CODE -eq 0 ]]; then
   for i in {1..50}; do
     pg_isready -U mathesar -d mathesar_testing && break || sleep 0.5
   done
-  pg_prove --runtests -U mathesar -d mathesar_testing "$@"
+  pg_prove --runtests -U mathesar -d mathesar_testing -v "$@"
Member

I like this change. Are you aware of a way to make only the failing tests verbose?

Contributor Author

I asked the same question, but couldn't find a way.
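One possible workaround, which is not part of this PR, is to keep the verbose run and filter its output afterwards. The Python sketch below assumes the verbose output is plain TAP, where failing tests start with `not ok` and their diagnostics follow on lines starting with `#`. The function name `failing_tap_lines` is hypothetical.

```python
def failing_tap_lines(tap_output):
    """Keep 'not ok' lines and the '#' diagnostic lines directly after them."""
    kept = []
    in_failure = False
    for line in tap_output.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("not ok"):
            in_failure = True
            kept.append(line)
        elif stripped.startswith("#") and in_failure:
            # Diagnostic belonging to the failure just seen.
            kept.append(line)
        else:
            # A passing test or any other line ends the failure block.
            in_failure = False
    return kept
```

A filter like this would also keep `not ok` lines that are marked as TODO, so it may need adjusting for suites that use expected failures.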

@Anish9901 Anish9901 assigned mathemancer and unassigned Anish9901 Apr 14, 2025
Co-authored-by: Pavish Kumar Ramani Gopal <[email protected]>
@mathemancer mathemancer requested a review from pavish April 15, 2025 05:47
@mathemancer
Contributor Author

@pavish I accepted your suggestion as per our call.

Member

@pavish pavish left a comment

Looks good

@pavish pavish added this pull request to the merge queue Apr 15, 2025
Merged via the queue into develop with commit e1191e3 Apr 15, 2025
98 checks passed
@pavish pavish deleted the numeric_infer branch April 15, 2025 10:07
Labels
pr-status: review A PR awaiting review