GC No longer Executes #21824

Open
mdavid01 opened this issue Apr 3, 2025 · 5 comments

mdavid01 commented Apr 3, 2025

Hi Team: we raised this issue at the 4/2 community meeting.
On March 6, GC ran as expected. On March 7, it stopped executing, in the same fashion as our other environment.
The GC log shows only:
{"errors":[{"code":"NOT_FOUND","message":"{"code":10010,"message":"object is not found","details":"log entity: b768f80f8781bf9ef30708f0"}"}]}

We get the same result on scheduled, manual, and dry runs of GC; only the log entity in the error message changes. Since we deploy with Helm, there's no easy way for us to trace the code.

We initially opened this issue as #21655 but received no actionable response. We've googled and tried multiple fixes. We don't know how to find the log entity -- or is it the log entity itself that's not found?

LM (LMT) leadership is quite frustrated with the lack of attention to this issue. We're happy to answer any questions about our environment, pod logs, etc.

Sorry, guys -- we really need help with this.
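
For reference, the GC history and the per-run log can also be pulled straight from the Harbor v2 API, which may show whether the log entity is missing server-side or only through the UI. A minimal sketch; the hostname, credentials, and <gc_id> are placeholders:

    # list recent GC runs (id, job status, start/end time)
    curl -s -u admin:<password> "https://<harbor-host>/api/v2.0/system/gc?page=1&page_size=5"

    # fetch the log of one run, using an id from the list above;
    # this appears to be the call behind the NOT_FOUND / "log entity" error
    curl -s -u admin:<password> "https://<harbor-host>/api/v2.0/system/gc/<gc_id>/log"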

@wy65701436 (Contributor)

  1. What was the status of the GC execution on March 6?
  2. Can you go to the Job Service dashboard and share the pending count of the GARBAGE_COLLECTION job queue?
  3. What version of Harbor are you running?
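
If it helps, the dashboard numbers can also be read over the API. A minimal sketch, assuming Harbor 2.9 or later (which, as far as I know, is when the Job Service dashboard endpoints under /api/v2.0/jobservice were added); host and credentials are placeholders:

    # per-queue pending count and latency (GARBAGE_COLLECTION, EXECUTION_SWEEP, ...)
    curl -s -u admin:<password> "https://<harbor-host>/api/v2.0/jobservice/queues"

    # worker pools and which workers are currently occupied
    curl -s -u admin:<password> "https://<harbor-host>/api/v2.0/jobservice/pools"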

wy65701436 commented Apr 7, 2025

Hi @mdavid01

To resolve this issue, follow these steps:

  1. Stop all running jobs and disable all scheduled tasks, such as GC, replication, tag retention, and scanning.
    
  2. Clear all job queues in the Job Service dashboard.
    
  3. Ensure that all workers in the Job Service dashboard are unoccupied.
    
  4. Flush the Job Service database in Redis (see the sketch after this list for selecting the correct database index).
     >kubectl exec -it <redis-pod> -- bash
     >redis-cli
     >flushdb
    
  5. Restart all Job Service pods.
    
  6. Manually execute GC and change the worker setting to 5 on the cleanup page, but avoid setting a schedule. Monitor the running GC closely: there is a dedicated log file for this job in the Job Service pod's /var/log/jobs/ directory. Remember the name of that file, because even if the GC shows an Error status, the GC goroutine is still running in the background as long as the log file keeps updating, and you can check the file to track the progress of the GC.
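
For step 4, a minimal sketch of flushing only the Job Service data. It assumes the chart's internal Redis, where the Job Service uses database index 1 by default (check jobserviceDatabaseIndex or the jobservice config if you have overridden it); namespace and pod name are examples:

    kubectl -n harbor exec -it <redis-pod> -- redis-cli

    # inside redis-cli: switch to the Job Service database before flushing,
    # so database 0 (used by core) is left untouched
    SELECT 1
    FLUSHDB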
    

mdavid01 commented Apr 8, 2025

  1. What was the status of the GC execution on March 6?
  2. Can you go to the Job Service dashboard and share the pending count of the GARBAGE_COLLECTION job queue?
  3. What version of Harbor are you running?

Thanks, Wang:

  1. AWS Gov: SUCCESS, 1718 blob(s) and 210 manifest(s) deleted, 25.06GiB space freed up, Mar 5, 2025, 7:00:00 PM to Mar 5, 2025, 10:49:59 PM. Note, however, that the log output for this and every other run looks like "{"errors":[{"code":"NOT_FOUND","message":"{"code":10010,"message":"object is not found","details":"log entity: 27f6668dedbf4fe4a5fbfcb9"}"}]}"
  2. Pending counts (GC is run daily @ 8pm ET):
    AWS Gov: EXECUTION_SWEEP, 1036 pending, 1164hrs 23min 16sec
    AWS Gov: GARBAGE_COLLECTION, 32 pending, 779hrs 23min 16sec
    AWS Commercial: GARBAGE_COLLECTION, 5 pending, 121hrs 43min 26sec
    AWS Commercial: no other pending queues (e.g., no EXECUTION_SWEEP)
  3. Both AWS Gov and Commercial: version v2.11.1-6b7ecba1
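
For reference, backlogs like these can also be stopped without the UI. A sketch against the Job Service dashboard API; the action payload is an assumption based on recent Harbor v2 swagger definitions (stop/pause/resume), so verify it against the 2.11.1 API docs first:

    # stop everything pending in the GARBAGE_COLLECTION queue
    curl -s -u admin:<password> -X PUT \
      -H "Content-Type: application/json" \
      -d '{"action":"stop"}' \
      "https://<harbor-host>/api/v2.0/jobservice/queues/GARBAGE_COLLECTION"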

mdavid01 commented Apr 8, 2025


Thanks, Wang. We will attempt this over the weekend, as these are critical production systems. At this time, none of the 5 Job Service pods shows any files under /var/log/jobs. I assume we need root access to view these files?
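
On the empty /var/log/jobs: a quick sketch for checking whether the file-based job logger is enabled at all. Namespace and pod name are examples, and the config path is an assumption based on where the Helm chart mounts the Job Service config:

    # list per-job log files (root should not be required just to read the directory)
    kubectl -n harbor exec <jobservice-pod> -- ls -l /var/log/jobs

    # inspect the job_loggers section; if only STD_OUTPUT is configured,
    # job logs go to the pod's stdout and no files will ever appear under /var/log/jobs
    kubectl -n harbor exec <jobservice-pod> -- cat /etc/jobservice/config.yml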

mdavid01 commented Apr 14, 2025

Hello Wang: we did not execute the steps recommended above, because I found what I believe to be confirmation that GC is running, but far too slowly to ever complete. Tracking the Artifact_Trash database table row count every 10 seconds, I found the record count decreasing at roughly 1 entry per minute during prime time. Our artifact trash table started at 57,000+ entries this morning. Below is a summary of GC performance based on the attached performance tracking file:

  • Estimated total hours to complete garbage collection: 844
  • Total hours for the test (2 run segments): 13.4
  • Maximum artifacts deleted during any 1-minute period: 3.5
  • Minimum artifacts deleted during any 1-minute period: -1.83
  • Average artifacts deleted during any 1-minute period: 1.1

At the current rate, GC will run for 844 hours if no new deletions are added; put another way, we can only process about 1400 GC artifacts per day.

  • Is there any other script you can provide to safely and quickly remove the "trash" blobs and manifests?
  • Will copying the active repos and artifacts to another database eliminate the trash (our least preferred option)?

Attached is the Excel file that captured the speed at which GC is removing Artifact_Trash records from the database. I assume that tracking Artifact_Trash activity serves as a valid proxy for GC performance.

Harbor Garbage Collection Timings.xlsx
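
For anyone who wants to reproduce the sampling behind the spreadsheet, a minimal sketch assuming direct access to the Harbor Postgres pod and the default registry database; namespace, pod name, and user are placeholders:

    # sample the artifact_trash row count every 10 seconds
    while true; do
      kubectl -n harbor exec <harbor-database-pod> -- \
        psql -U postgres -d registry -t -c "SELECT now(), count(*) FROM artifact_trash;"
      sleep 10
    done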

Thanks.
Michael
