Crash after disk is full #1801

davies opened this issue Aug 29, 2022 · 9 comments

@davies

davies commented Aug 29, 2022

When the disk is full, the process that opened the Badger database crashes. When it starts again, it crashes again:

unexpected fault address 0x7f25fb57a000
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x7f25fb57a000 pc=0x46c58e]

goroutine 333 [running]:
runtime.throw({0x2ac25dd, 0xc0008ffdd0})
	/usr/local/go/src/runtime/panic.go:1198 +0x71 fp=0xc0001b6b70 sp=0xc0001b6b40 pc=0x437031
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:732 +0x125 fp=0xc0001b6bc0 sp=0xc0001b6b70 pc=0x44d565
runtime.memmove()
	/usr/local/go/src/runtime/memmove_amd64.s:383 +0x42e fp=0xc0001b6bc8 sp=0xc0001b6bc0 pc=0x46c58e
github.com/dgraph-io/badger/v3/table.(*buildData).Copy(0xc0001b6cc0, {0x7f25fb4e0000, 0x10c76b, 0x10c76b})
	/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/builder.go:411 +0xb4 fp=0xc0001b6c20 sp=0xc0001b6bc8 pc=0xbb9dd4
github.com/dgraph-io/badger/v3/table.CreateTable({0xc0008ffdd0, 0x10}, 0xc007e6e090)
	/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:268 +0x1f2 fp=0xc0001b6d88 sp=0xc0001b6c20 pc=0xbbecf2
github.com/dgraph-io/badger/v3.(*DB).handleFlushTask(0xc000564480, {0xc0003c8000, {0x0, 0x0, 0x0}})
	/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/db.go:1062 +0x232 fp=0xc0001b6ef8 sp=0xc0001b6d88 pc=0xbdbd12
github.com/dgraph-io/badger/v3.(*DB).flushMemtable(0xc000564480, 0x0)
	/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/db.go:1084 +0x21c fp=0xc0001b6fc0 sp=0xc0001b6ef8 pc=0xbdc15c
github.com/dgraph-io/badger/v3.Open.func5()
	/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/db.go:357 +0x25 fp=0xc0001b6fe0 sp=0xc0001b6fc0 pc=0xbd7565
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0001b6fe8 sp=0xc0001b6fe0 pc=0x46b2e1
created by github.com/dgraph-io/badger/v3.Open
	/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/db.go:356 +0x10c5

The application is JuiceFS, which uses badger as the metadata engine.

@CristianCurteanu

CristianCurteanu commented Oct 6, 2022

Hello!

In this case, the error is caused by writing to a memory-mapped file (see mmap(2)). The kernel raises SIGBUS because there is not enough free space on disk to back the mapped pages, which leaves the file's in-memory content inconsistent with its on-disk content.

This, however, should not crash the application. I would suggest recovering when this fault is raised and returning an error instead. The recovery should be applied to Ristretto's z.MmapFile.Data, as a write abstraction for z.MmapFile, which would then replace the direct copies into mmap'ed files throughout Badger.

To reproduce it, I have so far been creating a file system with a limited amount of space (about 2 MB):
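To make the suggestion concrete, here is a minimal sketch of such a write abstraction. It relies on runtime/debug.SetPanicOnFault, which turns a fault at a non-nil address into a recoverable panic for the current goroutine only. The names safeCopy and ErrMmapFault are hypothetical and are not part of Badger or Ristretto. The demo in main copies into ordinary memory, so it only exercises the success path.

```go
package main

import (
	"errors"
	"fmt"
	"runtime/debug"
)

// ErrMmapFault is a hypothetical sentinel error; neither Badger nor
// Ristretto defines it.
var ErrMmapFault = errors.New("fault while writing to memory-mapped region")

// safeCopy copies src into dst. If the copy faults (for example SIGBUS
// because the disk cannot back the mapped page), the fault is converted
// into an error instead of crashing the process.
func safeCopy(dst, src []byte) (n int, err error) {
	// SetPanicOnFault only affects the calling goroutine, so it has to
	// be set around every write that may touch the mapping.
	old := debug.SetPanicOnFault(true)
	defer debug.SetPanicOnFault(old)

	defer func() {
		if r := recover(); r != nil {
			n = 0
			err = fmt.Errorf("%w: %v", ErrMmapFault, r)
		}
	}()

	return copy(dst, src), nil
}

func main() {
	// Ordinary heap memory is used here, so no fault is expected.
	dst := make([]byte, 8)
	n, err := safeCopy(dst, []byte("badger"))
	fmt.Println(n, err)
}
```

Note that a recovered fault means the data was not written, so the caller still has to treat the write as failed.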

dd if=/dev/zero of=rawfile bs=1K count=2000

mkfs.ext4 rawfile
mkdir ~/.bfs
sudo mount -o loop rawfile ~/.bfs
sudo chmod -R 777 ~/.bfs

I then used this directory as the path for badger.Options:

package main

import (
	"crypto/rand"
	"flag"
	"fmt"

	"github.com/dgraph-io/badger/v3"
)

const (
	MB = 1024 * 1024
)

var rounds *int

func init() {
	rounds = flag.Int("megs", 20, "Number of MBs of data storage")
}

func main() {
	flag.Parse()
	path := "/home/admin02/.bfs"

	opts := badger.DefaultOptions(path)
	// With* methods return a modified copy, so the result must be assigned.
	opts = opts.WithInMemory(false)

	bdb, err := badger.Open(opts)
	if err != nil {
		panic(err)
	}

	defer bdb.Close()

	for i := 0; i <= *rounds; i++ {
		func() {
			tx := bdb.NewTransaction(true)
			defer tx.Discard()

			var key []byte = make([]byte, 10)
			rand.Read(key)

			var data []byte = make([]byte, 1*MB)
			rand.Read(data)

			fmt.Println(">>> entry:", i, len(data))
			err = tx.Set(key, data)
			if err != nil {
				panic(err)
			}

			err = tx.Commit()
			if err != nil {
				panic(err)
			}
		}()
	}
}

which resulted in the following crash when hitting the limit:

badger 2022/10/06 08:42:50 INFO: All 0 tables opened in 0s
badger 2022/10/06 08:42:50 INFO: Discard stats nextEmptySlot: 0
badger 2022/10/06 08:42:50 INFO: Set nextTxnTs to 0
>>> entry: 0 1048576
>>> entry: 1 1048576
unexpected fault address 0x7efb8c1ba000
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x7efb8c1ba000 pc=0x46686e]

goroutine 50 [running]:
runtime.throw({0x90efb0?, 0xc0002ec000?})
        /home/admin02/sdk/go1.18.5/src/runtime/panic.go:992 +0x71 fp=0xc000545cd0 sp=0xc000545ca0 pc=0x436211
runtime.sigpanic()
        /home/admin02/sdk/go1.18.5/src/runtime/signal_unix.go:815 +0x125 fp=0xc000545d20 sp=0xc000545cd0 pc=0x44b405
runtime.memmove()
        /home/admin02/sdk/go1.18.5/src/runtime/memmove_amd64.s:431 +0x50e fp=0xc000545d28 sp=0xc000545d20 pc=0x46686e
github.com/dgraph-io/badger/v3.(*valueLog).write.func2(0xc000000960?)
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/value.go:826 +0xf2 fp=0xc000545d70 sp=0xc000545d28 pc=0x830ef2
github.com/dgraph-io/badger/v3.(*valueLog).write(0xc000129cf8, {0xc0002e60f0?, 0x1, 0x0?})
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/value.go:884 +0x682 fp=0xc000545ec0 sp=0xc000545d70 pc=0x830ba2
github.com/dgraph-io/badger/v3.(*DB).writeRequests(0xc000129b00, {0xc0002e60f0?, 0x1, 0xa})
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/db.go:816 +0xb5 fp=0xc000545f58 sp=0xc000545ec0 pc=0x7f0cf5
github.com/dgraph-io/badger/v3.(*DB).doWrites.func1({0xc0002e60f0?, 0x0?, 0x0?})
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/db.go:887 +0x45 fp=0xc000545fb8 sp=0xc000545f58 pc=0x7f1b05
github.com/dgraph-io/badger/v3.(*DB).doWrites.func3()
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/db.go:940 +0x32 fp=0xc000545fe0 sp=0xc000545fb8 pc=0x7f1a92
runtime.goexit()
        /home/admin02/sdk/go1.18.5/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc000545fe8 sp=0xc000545fe0 pc=0x465521
created by github.com/dgraph-io/badger/v3.(*DB).doWrites
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/db.go:940 +0x16c

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0xc0000a8000?)
        /home/admin02/sdk/go1.18.5/src/runtime/sema.go:56 +0x25
sync.(*WaitGroup).Wait(0xc0059739a0?)
        /home/admin02/sdk/go1.18.5/src/sync/waitgroup.go:136 +0x52
github.com/dgraph-io/badger/v3.(*request).Wait(0xc0000767e0)
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/value.go:702 +0x27
github.com/dgraph-io/badger/v3.(*Txn).commitAndSend.func3()
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/txn.go:609 +0x33
github.com/dgraph-io/badger/v3.(*Txn).Commit(0xc000170400)
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/txn.go:679 +0xc6
main.main.func1(0x91745f?, 0xc005973c78, 0xc005973c88)
        /home/admin02/projects/vortex/cmd/badgersigbus/main.go:52 +0x21f
main.main()
        /home/admin02/projects/vortex/cmd/badgersigbus/main.go:56 +0x1a5

goroutine 6 [chan receive]:
github.com/golang/glog.(*loggingT).flushDaemon(0x0?)
        /home/admin02/go/pkg/mod/github.com/golang/[email protected]/glog.go:882 +0x6a
created by github.com/golang/glog.init.0
        /home/admin02/go/pkg/mod/github.com/golang/[email protected]/glog.go:410 +0x1bf

goroutine 7 [select]:
github.com/dgraph-io/badger/v3/y.(*WaterMark).process(0xc00023e330, 0xc00023e300)
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/y/watermark.go:214 +0x285
created by github.com/dgraph-io/badger/v3/y.(*WaterMark).Init
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/y/watermark.go:72 +0xaa

goroutine 8 [select]:
github.com/dgraph-io/badger/v3/y.(*WaterMark).process(0xc00023e360, 0xc00023e300)
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/y/watermark.go:214 +0x285
created by github.com/dgraph-io/badger/v3/y.(*WaterMark).Init
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/y/watermark.go:72 +0xaa

goroutine 9 [select]:
github.com/dgraph-io/ristretto/z.(*AllocatorPool).freeupAllocators(0xc00000ecf0)
        /home/admin02/go/pkg/mod/github.com/dgraph-io/[email protected]/z/allocator.go:385 +0x150
created by github.com/dgraph-io/ristretto/z.NewAllocatorPool
        /home/admin02/go/pkg/mod/github.com/dgraph-io/[email protected]/z/allocator.go:324 +0xc5

goroutine 10 [select]:
github.com/dgraph-io/ristretto.(*defaultPolicy).processItems(0xc000074a80)
        /home/admin02/go/pkg/mod/github.com/dgraph-io/[email protected]/policy.go:102 +0x91
created by github.com/dgraph-io/ristretto.newDefaultPolicy
        /home/admin02/go/pkg/mod/github.com/dgraph-io/[email protected]/policy.go:86 +0x156

goroutine 11 [select]:
github.com/dgraph-io/ristretto.(*Cache).processItems(0xc000170380)
        /home/admin02/go/pkg/mod/github.com/dgraph-io/[email protected]/cache.go:452 +0x15e
created by github.com/dgraph-io/ristretto.NewCache
        /home/admin02/go/pkg/mod/github.com/dgraph-io/[email protected]/cache.go:207 +0x696

goroutine 12 [select]:
github.com/dgraph-io/badger/v3.(*DB).monitorCache(0xc000129b00, 0xc0002525d0)
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/db.go:469 +0x18a
created by github.com/dgraph-io/badger/v3.Open
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/db.go:311 +0xc8b

goroutine 13 [select]:
github.com/dgraph-io/badger/v3.(*DB).updateSize(0xc000129b00, 0xc000252720)
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/db.go:1171 +0x158
created by github.com/dgraph-io/badger/v3.Open
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/db.go:331 +0xe8c

goroutine 34 [select]:
github.com/dgraph-io/badger/v3.(*levelsController).runCompactor(0xc000204000, 0x0, 0xc0000a2120)
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/levels.go:438 +0x125
created by github.com/dgraph-io/badger/v3.(*levelsController).startCompact
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/levels.go:354 +0x4e

goroutine 35 [select]:
github.com/dgraph-io/badger/v3.(*levelsController).runCompactor(0xc000204000, 0x1, 0xc0000a2120)
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/levels.go:438 +0x125
created by github.com/dgraph-io/badger/v3.(*levelsController).startCompact
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/levels.go:354 +0x4e

goroutine 36 [select]:
github.com/dgraph-io/badger/v3.(*levelsController).runCompactor(0xc000204000, 0x2, 0xc0000a2120)
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/levels.go:438 +0x125
created by github.com/dgraph-io/badger/v3.(*levelsController).startCompact
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/levels.go:354 +0x4e

goroutine 37 [select]:
github.com/dgraph-io/badger/v3.(*levelsController).runCompactor(0xc000204000, 0x3, 0xc0000a2120)
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/levels.go:438 +0x125
created by github.com/dgraph-io/badger/v3.(*levelsController).startCompact
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/levels.go:354 +0x4e

goroutine 38 [chan receive]:
github.com/dgraph-io/badger/v3.(*DB).flushMemtable(0xc000129b00, 0xc00000ecf0?)
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/db.go:1078 +0xb2
github.com/dgraph-io/badger/v3.Open.func5()
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/db.go:357 +0x25
created by github.com/dgraph-io/badger/v3.Open
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/db.go:356 +0x107c

goroutine 39 [select]:
github.com/dgraph-io/badger/v3.(*vlogThreshold).listenForValueThresholdUpdate(0xc000074a00)
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/value.go:1172 +0x11a
created by github.com/dgraph-io/badger/v3.Open
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/db.go:380 +0x170a

goroutine 40 [select]:
github.com/dgraph-io/badger/v3.(*DB).doWrites(0xc000129b00, 0xc0000a21e0)
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/db.go:900 +0x236
created by github.com/dgraph-io/badger/v3.Open
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/db.go:387 +0x17cf

goroutine 41 [chan receive]:
github.com/dgraph-io/badger/v3.(*valueLog).waitOnGC(0xc000129cf8, 0x0?)
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/value.go:1079 +0x7d
created by github.com/dgraph-io/badger/v3.Open
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/db.go:391 +0x188c

goroutine 42 [select]:
github.com/dgraph-io/badger/v3.(*publisher).listenForUpdates(0xc00023e420, 0xc0000a2240)
        /home/admin02/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/publisher.go:73 +0x150
created by github.com/dgraph-io/badger/v3.Open

@fatelei
Contributor

fatelei commented Aug 12, 2023

How about adding a "disk is full" error and letting the client handle it?

@SOF3

SOF3 commented Aug 14, 2023

Reproduced with jaeger-remote-storage using the Badger memTable backend with v3.2103.5 on Linux 5.4.56 in a container:

unexpected fault address 0x7f3eb7a1c000
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x7f3eb7a1c000 pc=0x46f32f]

goroutine 26177775 [running]:
runtime.throw({0x139d61b?, 0xc07f11a348?})
        runtime/panic.go:1047 +0x5d fp=0xc001afc9d8 sp=0xc001afc9a8 pc=0x43a09d
runtime.sigpanic()
        runtime/signal_unix.go:832 +0x125 fp=0xc001afca28 sp=0xc001afc9d8 pc=0x450725
runtime.memmove()
        runtime/memmove_amd64.s:195 +0x16f fp=0xc001afca30 sp=0xc001afca28 pc=0x46f32f
github.com/dgraph-io/badger/v3.(*logFile).writeEntry(_, _, _, {{0xc000046045, 0xf}, {0xc000046017, 0x10}, 0x0, 0x1, 0x0, ...})
        github.com/dgraph-io/badger/[email protected]/memtable.go:344 +0xdb fp=0xc001afca78 sp=0xc001afca30 pc=0xc777fb
github.com/dgraph-io/badger/v3.(*memTable).Put(0xc0022d2000, {0xc0239ffa40, 0x29, 0xc00250d590?}, {0x40, 0x0, 0x64db26da, {0x0, 0x0, 0x0}, ...})

@fatelei
Contributor

fatelei commented Aug 15, 2023

I will submit an MR soon.

This issue has been stale for 60 days and will be closed automatically in 7 days. Comment to keep it open.

@github-actions github-actions bot added the Stale label Jul 19, 2024
@github-actions github-actions bot closed this as not planned Jul 27, 2024
@gaurav-vio

@fatelei any update on this?

@astraw38

Hi, I've been taking a stab at fixing this. My first attempt used debug.SetPanicOnFault(true) and panic-recover functions so we could return errors correctly. However, I found this rather unwieldy, as it has to be done on every goroutine that might be interrupted by a SIGBUS. It also meant that we could lose data on a write, since we couldn't actually write the data to disk.

As an alternative, I switched the mmap file creation to use fallocate instead of truncate. This thick-provisions any mmapped file, so we cannot trigger a SIGBUS at all. It does mean that we now have to track the used size of the mmap file manually: you cannot rely on the file descriptor to accurately describe how much we have written. I accomplished this by modifying the vlog header to include the used size. Any write to the vlog now also updates that header value.

The relevant changes can be found here: https://github.com/astraw38/ristretto/tree/use-fallocate and https://github.com/astraw38/badger/tree/use-fallocate. Feedback is welcome on the strategy for when to enable fallocate usage. In this case I used build tags, but it could probably be done via options.

Testing

Testing was done using a custom script that would test the following scenarios:

Disk is full on open

db.Open() returns error
db.Open(ReadOnly=True) works, and can read existing data.

Disk is full during operation

Operation that triggered the disk full returns an error.
DB is still usable to Get/List/Read.
Additional attempts to write will return errBlockedWrites
DB can be closed and re-opened as read-only.
DB does NOT lose data on close/re-open.

Additional thoughts are welcome on how, if, and when we should unblock writes after hitting disk-full, once disk space has been freed up.

@astraw38

astraw38 commented Mar 4, 2025

@ryanfoxtyler (or any maintainer?) could we get this re-opened please?

@ryanfoxtyler ryanfoxtyler reopened this Mar 5, 2025

linear bot commented Mar 5, 2025
