
[CRITICAL] Not receiving realtime updates to desktop/web session #52437

Closed
1 of 8 tasks
m-natarajan opened this issue Nov 12, 2024 · 77 comments
Assignees
Labels: AutoAssignerNewDotQuality (Used to assign quality issues to engineers); Bug (Something is broken. Auto assigns a BugZero manager.); Internal (Requires API changes or must be handled by Expensify staff); Monthly KSv2

Comments

@m-natarajan

m-natarajan commented Nov 12, 2024

If you haven’t already, check out our contributing guidelines for onboarding and email [email protected] to request to join our Slack channel!


Version Number:
Reproducible in staging?: Needs Reproduction
Reproducible in production?: Needs Reproduction
If this was caught on HybridApp, is this reproducible on New Expensify Standalone?:
If this was caught during regression testing, add the test name, ID and link from TestRail:
Email or phone of affected tester (no customers):
Logs: https://stackoverflow.com/c/expensify/questions/4856
Expensify/Expensify Issue URL:
Issue reported by: @quinthar
Slack conversation (hyperlinked to channel name): ts_external_expensify_quality

Action Performed:

  1. Log in to staging.new.expensify.com as user A
  2. As user B send messages to user A

Expected Result:

User A receives message in real time

Actual Result:

User A sees the typing indicator but does not receive realtime updates in the desktop/web session, even though push notifications for the same messages arrive on mobile

Workaround:

Can the user still use Expensify without this being fixed? Have you informed them of the workaround?

Platforms:

Which of our officially supported platforms is this issue occurring on?

  • Android: Standalone
  • Android: HybridApp
  • Android: mWeb Chrome
  • iOS: Standalone
  • iOS: HybridApp
  • iOS: mWeb Safari
  • MacOS: Chrome / Safari
  • MacOS: Desktop

Screenshots/Videos

Add any screenshot/video evidence

image (18)

image (19)

image (20)

image (21)

Screen.Recording.2024-11-10.at.8.37.50.PM.mov


Current Issue Owner: @tgolen
@m-natarajan added the labels Daily KSv2; Needs Reproduction (Reproducible steps needed); Bug (Something is broken. Auto assigns a BugZero manager.); AutoAssignerNewDotQuality (Used to assign quality issues to engineers) on Nov 12, 2024

melvin-bot bot commented Nov 12, 2024

Triggered auto assignment to @deetergp (AutoAssignerNewDotQuality)

melvin-bot bot commented Nov 12, 2024

Triggered auto assignment to @trjExpensify (Bug), see https://stackoverflow.com/c/expensify/questions/14418 for more details. Please add this bug to a GH project, as outlined in the SO.

@melvin-bot melvin-bot bot added Weekly KSv2 and removed Weekly KSv2 labels Nov 12, 2024
@MelvinBot

This has been labelled "Needs Reproduction". Follow the steps here: https://stackoverflowteams.com/c/expensify/questions/16989

@trjExpensify
Contributor

@deetergp I'm assuming notification issues like this need to remain internal, but let me know if you don't think so and we can ask a C+ to get involved as a next step to try and reproduce.

I can't seem to repro this myself. The question from the thread is: "Why isn't the ping/pong detecting and fixing this?"

@melvin-bot melvin-bot bot added the Overdue label Nov 15, 2024
@quinthar
Contributor

This happened again; I can't figure out how to reproduce reliably though.

melvin-bot bot commented Nov 18, 2024

@deetergp, @trjExpensify Huh... This is 4 days overdue. Who can take care of this?

@trjExpensify trjExpensify moved this to CRITICAL in [#whatsnext] #quality Nov 18, 2024
@trjExpensify
Contributor

@deetergp thoughts on the above, will you be able to look at this today?

CC: @muttmuure I think this one is in the CRITICAL category for #quality, so I've moved it there.

@muttmuure
Contributor

muttmuure commented Nov 18, 2024 via email

@deetergp
Contributor

@trjExpensify I've spent a bit of time with this today and I also cannot seem to reproduce it. I've been having a protracted conversation between the ExpensiScotts (-fy.com & -fail.com) in split-screen browser windows and the messages come through fine on both sides. I'm looking at the DM chat between DB & Kadie to see if there's anything "off" about what's in Auth and in the logs.

@melvin-bot melvin-bot bot removed the Overdue label Nov 18, 2024
@trjExpensify
Contributor

Gotcha. I'm sure DB would be happy to live debug or something, if you want to take it to the thread: https://expensify.slack.com/archives/C05LX9D6E07/p1731449676200089?thread_ts=1731299637.345689&cid=C05LX9D6E07

@melvin-bot melvin-bot bot added the Overdue label Nov 21, 2024

melvin-bot bot commented Nov 22, 2024

@deetergp, @trjExpensify Whoops! This issue is 2 days overdue. Let's get this updated quick!

@deetergp
Contributor

Spent a bit of time looking into this today and it is interesting. A log search for blob:"PusherError" returns tens of thousands of results for just the last 24 hours. They all have the 1006 error code, which Pusher's documentation describes as follows:

When a WebSocket connection is closed without a "close frame", the pusher-js library emits an error with code 1006. Usually this is caused by WebSocket-incompatible proxies, which can't close the connection in the correct way.

Looking specifically into @quinthar's logs, I see an interesting 1006 log line that pops up: Software caused connection abort. Between my own searching and ChatGPT, it sounds like poor network connectivity can be a culprit, as can a "Version or Library Mismatch". I found some GH issues from 2021 that talk about needing to be on the latest (at the time) version of 9.x. Looking in our package.json file, it looks like we are on v8.3.0. Maybe we need to update the version of the Pusher client we are using?

I'm not sure how involved updating to a newer version might be, maybe @mountiny or @AndrewGable might have some insight?
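
To make the 1006 discussion above concrete, here is a minimal sketch of how logged close codes could be bucketed before searching for patterns. The helper names (classifyCloseCode, countByKind) are hypothetical and are not part of pusher-js or the Expensify codebase; the only facts relied on are the WebSocket close-code ranges from RFC 6455.

```typescript
// Hypothetical helpers for triaging logged WebSocket close codes.
// 1000 = normal closure, 1006 = closed without a close frame (RFC 6455).
// 4000-4999 is the private-use range; Pusher's protocol error codes fall inside it.
type CloseKind = 'normal' | 'abnormal' | 'application' | 'other';

function classifyCloseCode(code: number): CloseKind {
    if (code === 1000) {
        return 'normal';
    }
    if (code === 1006) {
        return 'abnormal';
    }
    if (code >= 4000 && code <= 4999) {
        return 'application';
    }
    return 'other';
}

// Tally how many logged codes fall into each bucket.
function countByKind(codes: number[]): Record<CloseKind, number> {
    const counts: Record<CloseKind, number> = {normal: 0, abnormal: 0, application: 0, other: 0};
    for (const code of codes) {
        counts[classifyCloseCode(code)] += 1;
    }
    return counts;
}
```

If nearly everything lands in the abnormal bucket, that would point at the network path (proxies, dropped connections) rather than at Pusher-level protocol errors, but that is an inference to check against the logs, not a conclusion.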

@melvin-bot melvin-bot bot removed the Overdue label Nov 25, 2024

melvin-bot bot commented Nov 26, 2024

@deetergp @trjExpensify this issue was created 2 weeks ago. Are we close to a solution? Let's make sure we're treating this as a top priority. Don't hesitate to create a thread in #expensify-open-source to align faster in real time. Thanks!

@mountiny
Contributor

@deetergp I don't know the specifics of what updating Pusher would involve, but here is a PR from the last time we did it, and it seems to have gone fine without any specific testing. So I would check whether there are any breaking changes that should worry us and then try to update it. However, we are already on the latest officially stable version, 8.3.0 (https://www.npmjs.com/package/pusher-js?activeTab=versions); the next version, 8.4.0, is still a release candidate.

@deetergp
Contributor

Hmm… Maybe I'm confusing versions of other things. @quinthar Does this happen when you're using a poor connectivity setting in Dev Tools? Just trying to narrow down possible causes…

@melvin-bot melvin-bot bot added the Overdue label Nov 29, 2024
@tgolen
Contributor

tgolen commented Feb 14, 2025

Weekly Update

  • There has been steady progress on the E/App PR to implement the RN client

Next Steps

  • Wait for that PR to be deployed and merged

ETA

  • Friday, Feb. 21

@tgolen
Contributor

tgolen commented Feb 21, 2025

Weekly Update

  • The E/App PR to implement the RN client has been merged

Next Steps

  • Once the PR goes to production, @tgolen monitor logs to see if there are improvements with the connection errors
  • I'll report back here in a week!

ETA

  • Friday, Feb. 27

@mallenexpensify changed the title from "Not receiving realtime updates to desktop/web session" to "[CRITICAL] Not receiving realtime updates to desktop/web session" Feb 27, 2025
@tgolen
Contributor

tgolen commented Mar 7, 2025

Weekly Update

  • Now that the RN client has been deployed for a couple of weeks, I took a look at the logs. The number of errors hasn't really decreased, but the types of errors are not as random now.

Next Steps

  • @tgolen Create a new graph in Grafana that shows the number of occurrences of each type of error, and look for trends

ETA

  • I'll create the graphs today, and then look for an update next week

@tgolen
Contributor

tgolen commented Mar 7, 2025

Analysis

New graphs are on this dashboard.

First, let's look at the pusher errors by platform:

Image

Web is a large portion of those. In order to break down the web platform in a little more detail, look at it by OS:

Image

Conclusion

The majority of Pusher errors occur on the web platform on OSx. It's difficult to know at this point if this correlates to general usage trends or not. Like, maybe it's just a majority of our traffic is on this platform. I'm not sure, but I do know that if we can begin to figure out and decrease these figures, it would have the biggest impact on our users.
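
One way to separate "most errors" from "most traffic" is to normalize the error counts by usage. The sketch below assumes a per-platform session count is available from analytics, which the thread does not confirm; PlatformStats and errorsPerThousandSessions are hypothetical names.

```typescript
// Hypothetical shape: raw Pusher error counts plus a usage measure per platform.
interface PlatformStats {
    platform: string;
    errors: number;
    sessions: number;
}

// Errors per 1,000 sessions, so platforms with very different traffic can be compared.
function errorsPerThousandSessions(stats: PlatformStats[]): Map<string, number> {
    const rates = new Map<string, number>();
    for (const {platform, errors, sessions} of stats) {
        // Guard against division by zero for platforms with no recorded sessions.
        rates.set(platform, sessions > 0 ? (errors * 1000) / sessions : 0);
    }
    return rates;
}
```

With made-up inputs, a platform with 830 errors over 8,300 sessions scores 100 per thousand, while one with 170 errors over 850 sessions scores 200. The platform with fewer raw errors would then be the less healthy one, which is exactly the ambiguity the raw graphs cannot resolve.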

@mallenexpensify
Contributor

The majority of Pusher errors occur on the web platform on OSx. It's difficult to know at this point if this correlates to general usage trends or not. Like, maybe it's just a majority of our traffic is on this platform.

I'm also curious. Also a lil surprised that 83% is OSx, I would have thought we had more Windows users.

@tgolen
Contributor

tgolen commented Mar 14, 2025

Weekly Update

  • Pusher said that delivery rate is low, and asked us to look and see if there are any server errors happening when we try to send the events
  • I've looked into a lot of the server errors this week to try and understand them better (happening in this GH)
  • There hasn't been any progress made for the delivery rate, but I am starting to understand the server errors better now

Next Steps

  • Re-evaluate the delivery rate once more server errors have been mitigated

ETA

  • Friday, Mar. 28

@tgolen
Contributor

tgolen commented Mar 21, 2025

Weekly Update

  • Delivery rate remains low at 79.5%
  • I've been going back and forth with Pusher a bit to try to figure out why their service is responding with 502 Bad Gateway errors when we batch events to them

Next Steps

  • Keep trying to mitigate more errors and see if the delivery rate can improve

ETA

  • Friday, Mar. 28
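
For the 502 responses on batched events mentioned above, one common mitigation is to retry with exponential backoff. This is only a sketch of that general technique, not the retry logic actually used on the server: SendBatch is a hypothetical stand-in for whatever function posts a batch and reports the HTTP status, and the delays and attempt limit are arbitrary assumptions.

```typescript
// Assumption: only gateway-style failures are treated as transient.
function isRetryable(status: number): boolean {
    return status === 502 || status === 503 || status === 504;
}

// Exponential backoff: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 200, maxMs = 5000): number {
    return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Hypothetical sender: posts a batch of events and resolves to the HTTP status code.
type SendBatch = (events: string[]) => Promise<number>;

const defaultSleep = (ms: number): Promise<void> => new Promise((resolve) => setTimeout(resolve, ms));

// Sends the batch, retrying transient failures, and returns the last status seen.
async function sendWithRetry(send: SendBatch, events: string[], maxAttempts = 4, sleep = defaultSleep): Promise<number> {
    let status = 0;
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        status = await send(events);
        if (!isRetryable(status)) {
            return status;
        }
        // Wait before the next attempt, but not after the final one.
        if (attempt < maxAttempts - 1) {
            await sleep(backoffDelayMs(attempt));
        }
    }
    return status;
}
```

Logging the final status returned by sendWithRetry would also show how many batches still fail after all attempts, which feeds directly into the delivery-rate question.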

@yuwenmemon
Contributor

@tgolen are we moving forward with what was discussed today in the resync thread?

@tgolen tgolen removed the Reviewing Has a PR in review label Apr 1, 2025
@melvin-bot melvin-bot bot added the Overdue label Apr 1, 2025
@tgolen
Contributor

tgolen commented Apr 1, 2025

Yeah, we are. I'm doing all of that as part of #58024. I see I missed a weekly update on this issue due to the reviewing label, so let me get a new update added.

@melvin-bot melvin-bot bot removed the Overdue label Apr 1, 2025
@tgolen
Contributor

tgolen commented Apr 1, 2025

Weekly Update

Next Step

  • @tgolen wait to see if additional logs and retry logic increase our deliverability rate

ETA

  • Monday, Apr. 7

@melvin-bot melvin-bot bot added the Overdue label Apr 10, 2025
@trjExpensify
Contributor

Morning Melv, I believe this is the latest on the related issue. Waiting for Tim to return to look at some logs.

@tgolen
Contributor

tgolen commented Apr 16, 2025

Daily Update

Next Steps

  • Add more error logging for the PONG events to see if any of those are specifically failing to send to Pusher
  • If there are none (or a very small amount), then I think that means the problem is either on Pusher's side, or the problem is on the frontend client side

ETA

  • I'll implement that in a PR either today or tomorrow
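
The reasoning in the next steps above can be written as a small check: if the share of PONG events that fail to send from the server is far smaller than the share that never arrives, server-side send failures cannot explain the gap. The function name and the tolerance are hypothetical, and the 79.5% used in the note below is the delivery rate reported earlier in this thread, used purely as an illustration.

```typescript
// Returns true when server-side send failures are large enough to account for
// the undelivered share of events, within a tolerance (an arbitrary assumption).
function serverFailuresExplainGap(sent: number, failedToSend: number, deliveryRate: number, tolerance = 0.02): boolean {
    if (sent <= 0) {
        return false;
    }
    const failedShare = failedToSend / sent;
    const undeliveredShare = 1 - deliveryRate;
    return failedShare >= undeliveredShare - tolerance;
}
```

For example, with a 79.5% delivery rate, 5 failed sends out of 1,000 gives false (the loss must be happening on Pusher's side or in the client), while 200 failed sends out of 1,000 gives true.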

@melvin-bot melvin-bot bot removed the Overdue label Apr 16, 2025
@tgolen
Contributor

tgolen commented Apr 16, 2025

Here is the Web-E PR to add more logging

@tgolen
Contributor

tgolen commented Apr 18, 2025

Weekly Update

  • I'm making progress on getting a better picture of Pusher errors that happen on the server
  • I found several more places in the code where we were sending Pusher events without catching any errors, so I added logging to those

Next Steps

  • @tgolen analyze logs next week to see if the new logs revealed anything that I wasn't already aware of

ETA

  • Friday, Apr. 25
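
As an illustration of the "catch and log" change described in this update, here is a generic wrapper that logs any error thrown by a send call instead of letting it go unnoticed. It is a sketch only: the real change was made in server code that is not shown in this thread, and withErrorLogging and Logger are hypothetical names.

```typescript
type Logger = (message: string) => void;

// Wraps a function so that thrown errors are logged with a label.
// The wrapped function returns undefined when the original call throws.
function withErrorLogging<A extends unknown[], R>(name: string, fn: (...args: A) => R, log: Logger): (...args: A) => R | undefined {
    return (...args: A) => {
        try {
            return fn(...args);
        } catch (error) {
            const message = error instanceof Error ? error.message : String(error);
            log(`[${name}] failed: ${message}`);
            return undefined;
        }
    };
}
```

Swallowing the error after logging is a design choice made here for brevity; rethrowing after the log call is just as reasonable when callers need to react to the failure.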

@tgolen
Contributor

tgolen commented Apr 25, 2025

Weekly Update

@mallenexpensify
Contributor

Per convo here, let's close this for now and focus attention on

@github-project-automation github-project-automation bot moved this from CRITICAL to Done in [#whatsnext] #quality May 6, 2025
Projects
Status: Done
Development

No branches or pull requests

10 participants