Debugging a Deadlock Bug in a Go Application

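I spent almost all of last week debugging one painful deadlock bug. Deadlocks are hard to reproduce, which also makes them hard to debug. This post is here to remember the lesson.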

Analyzing the Cause of the Deadlock

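After analysis, the main cause of this deadlock turned out to be mixing a Mutex with a Channel. In Golang both Mutexes and Channels can be used for synchronization: they keep multiple goroutines from reading and writing shared data at the same time, so that no data race occurs. Channels are more flexible than Mutexes; for example, they can be used to schedule goroutines or to collect the results of several goroutines.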

An example of a Channel used for synchronization:

See here: https://medium.com/stupid-gopher-tricks/more-powerful-synchronization-in-go-using-channels-f4a1c3242ed0

go func() {
  for {
    select {
    case value = <-h.setValCh: // set the current value.
    case h.getValCh <- value: // send the current value.
    }
  }
}()

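In the example above, the goroutine uses a single select-case to decide which task to run. Reads and writes both enter through this one goroutine, which guarantees that a read and a write never happen at the same time. This usage is very powerful, but it can also be misused in ways that lead to deadlock.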
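To make the pattern concrete, here is a minimal runnable sketch of it. Only the setValCh and getValCh names come from the snippet above; the holder type, the int value, and the Set and Get helpers are assumptions made up for illustration.

```go
package main

import "fmt"

// holder is a hypothetical type; only the setValCh and getValCh
// field names come from the snippet above.
type holder struct {
	setValCh chan int
	getValCh chan int
}

// newHolder starts the single goroutine that owns value.
func newHolder() *holder {
	h := &holder{
		setValCh: make(chan int),
		getValCh: make(chan int),
	}
	go func() {
		var value int // touched only by this goroutine
		for {
			select {
			case value = <-h.setValCh: // set the current value
			case h.getValCh <- value: // send the current value
			}
		}
	}()
	return h
}

// Set hands a new value to the owning goroutine.
func (h *holder) Set(v int) { h.setValCh <- v }

// Get asks the owning goroutine for the current value.
func (h *holder) Get() int { return <-h.getValCh }

func main() {
	h := newHolder()
	h.Set(42)
	fmt.Println(h.Get())
}
```

Because both channels are unbuffered, Set returns only after the owning goroutine has taken the value, so a Get that follows it sees the new value.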

Take the problem below, for example. It is fairly simple, but it really is a mistake a beginner can easily make:

func foo() {
	a := make(chan bool)
	b := make(chan bool)
	done := make(chan bool)
	go func() {
		for {
			select {
			case <-a:
				fmt.Println("case A")
				<-b // blocks here until b fires; a and done go unserved meanwhile
			case <-b:
				fmt.Println("case B")
			case <-done:
				fmt.Println("case done")
				break // leaves only the select, not the enclosing for loop
			}
		}
	}()
}
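One possible reading of the problem is that case A blocks on b, so a second send on a can never be received, and that break does not end the loop. The sketch below is my own variant under that assumption, not necessarily the fix the example was meant to suggest: no case body waits on another channel, and the done case uses return. The run function and the out channel are hypothetical names.

```go
package main

import "fmt"

// run serves a, b and done until done fires, then reports which cases
// it handled. No case body blocks on another channel, and the done case
// uses return, since a bare break would only leave the select.
func run(a, b, done <-chan bool, out chan<- []string) {
	var seen []string
	for {
		select {
		case <-a:
			seen = append(seen, "case A")
		case <-b:
			seen = append(seen, "case B")
		case <-done:
			seen = append(seen, "case done")
			out <- seen
			return
		}
	}
}

func main() {
	a, b, done := make(chan bool), make(chan bool), make(chan bool)
	out := make(chan []string, 1)
	go run(a, b, done, out)

	a <- true
	a <- true // would block forever if case A were still waiting on b
	b <- true
	done <- true
	fmt.Println(<-out)
}
```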

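If a program synchronizes with only Mutexes or only Channels, it stays simple, easy to follow, and easier to debug. The thing to watch for is different parts of the program using a Mutex and a Channel and then interacting with each other.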

Below is an extremely simplified version of this bug. Can you spot the problem?

type A struct {
    mtx *sync.Mutex
    // other data structures
}

type B struct {
    action chan bool
    clear  chan bool
    // other channels and data structures
}

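var (
    a = NewA()
    b = NewB()
)

// NewA is not shown in the original; a minimal constructor is assumed here.
func NewA() *A {
    return &A{mtx: &sync.Mutex{}}
}

func NewB() *B {
    b := &B{
        action: make(chan bool),
        clear:  make(chan bool),
    }
    go func() {
        for {
            select {
            case <-b.clear:
                // clear records
            case <-b.action:
                a.Action()
                // ... other cases
            }
        }
    }()
    // other initializations
    return b
}

func (a *A) Action() {
    a.mtx.Lock()
    defer a.mtx.Unlock()

    // do action
}

func (a *A) Foo() {
    a.mtx.Lock()
    defer a.mtx.Unlock()

    // do some other actions
    b.clear <- true // channel send while the mutex is still held
}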

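When Action() and Foo() are called by different goroutines at the same time, both may try to take the Mutex at once. Without a Channel involved, that is usually not a problem. In this program, however, the Channel also acts as synchronization: it guarantees that only one case runs at a time, so clear and Action() never run concurrently. When Action() and Foo() contend for the lock, Action() may end up waiting for Foo() to release the Mutex, while the b.clear <- true statement in Foo() blocks until Action() finishes, because the goroutine in B cannot receive from clear while it is still inside the action case. Each side waits for the other, and the program deadlocks!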
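The interleaving can be forced in a small program. The sketch below is an illustration under assumptions, not the original code: the types are cut down, the send on b.action stands in for another goroutine triggering an action while Foo() holds the lock, and a timeout replaces the real hang so the program can finish. The function names fooHoldingLock, fooReleasingLock and sendClear are made up.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type A struct{ mtx sync.Mutex }

type B struct {
	action chan bool
	clear  chan bool
}

// newB starts the goroutine that serves one case at a time.
func newB(a *A) *B {
	b := &B{action: make(chan bool), clear: make(chan bool)}
	go func() {
		for {
			select {
			case <-b.clear:
			case <-b.action:
				a.Action()
			}
		}
	}()
	return b
}

func (a *A) Action() {
	a.mtx.Lock()
	defer a.mtx.Unlock()
}

// sendClear reports whether b accepted a clear request within the timeout.
func sendClear(b *B, timeout time.Duration) bool {
	select {
	case b.clear <- true:
		return true
	case <-time.After(timeout):
		return false
	}
}

// fooHoldingLock mimics Foo(): the channel send happens under the mutex.
func fooHoldingLock(a *A, b *B) bool {
	a.mtx.Lock()
	defer a.mtx.Unlock()
	b.action <- true // b's goroutine now blocks inside a.Action()
	return sendClear(b, 500*time.Millisecond)
}

// fooReleasingLock does the same work but unlocks before the send.
func fooReleasingLock(a *A, b *B) bool {
	a.mtx.Lock()
	b.action <- true
	a.mtx.Unlock() // a.Action() can finish and b returns to its select
	return sendClear(b, 500*time.Millisecond)
}

func main() {
	a := &A{}
	fmt.Println("send under lock succeeded:", fooHoldingLock(a, newB(a)))
	fmt.Println("send after unlock succeeded:", fooReleasingLock(a, newB(a)))
}
```

The first call is expected to time out, because the receiver is stuck waiting for the very lock the sender holds; the second should go through once the lock is released before the send.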

Using Debugging Tools from the Go Ecosystem

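So far I have not found a really good debugging tool for this kind of problem. The Golang runtime detects global deadlocks, but deadlocks are usually local, and at the moment there does not seem to be any tool that can directly and precisely pinpoint a deadlock like this one.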

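This time gdb and Golang's pprof tooling were a great help, pprof especially. For a Go server program that already runs an HTTP service, using pprof is very simple: import pprof and it registers new debugging paths on the default HTTP mux (http.DefaultServeMux):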

import (
    ...
    _ "net/http/pprof"
)

Once the HTTP service is up, the debug output can be viewed in a browser or with curl. The following prints the backtraces of all goroutines in the program:

curl 'localhost:10000/debug/pprof/goroutine?debug=1'

When reading the pprof output to debug a deadlock, pay particular attention to the following:

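– Which Mutexes are still being waited on. The goroutine that holds a lock is not itself waiting for it, but whatever is waiting right now may well be the culprit;
– Which Channels are still being waited on. This possibility is easy to overlook, because a Go program may have many waiting Channels. Start checking from the Channels that are used for synchronization as described above.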
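The same check can be scripted from inside a program. The sketch below is one possible approach, not the workflow I actually used: it takes the goroutine dump through runtime/pprof and counts how many goroutines sit in a given wait state. The helper names and the three deliberately stuck senders are assumptions for the demo.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"strings"
	"time"
)

// goroutineDump returns every goroutine's stack in the debug=2 text form.
func goroutineDump() string {
	var buf bytes.Buffer
	pprof.Lookup("goroutine").WriteTo(&buf, 2)
	return buf.String()
}

// countState counts goroutine header lines whose bracketed state
// starts with the given text, for example "chan send".
func countState(dump, state string) int {
	n := 0
	for _, line := range strings.Split(dump, "\n") {
		if strings.HasPrefix(line, "goroutine ") && strings.Contains(line, "["+state) {
			n++
		}
	}
	return n
}

// stuckSenders starts n goroutines that block forever on an unbuffered
// send, then polls the dump until it shows at least n blocked senders
// or two seconds have passed.
func stuckSenders(n int) int {
	ch := make(chan int) // nothing ever receives from ch
	for i := 0; i < n; i++ {
		go func() { ch <- 1 }()
	}
	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) {
		if c := countState(goroutineDump(), "chan send"); c >= n {
			return c
		}
		time.Sleep(10 * time.Millisecond)
	}
	return countState(goroutineDump(), "chan send")
}

func main() {
	fmt.Println("goroutines blocked in chan send:", stuckSenders(3))
}
```

A count that keeps growing between two dumps taken some time apart is a reasonable hint that senders are piling up on a Channel nobody reads.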

Beyond that, pprof is very useful as a general program-analysis library. This time I even used it to find a resource leak. Further reading:

– https://golang.org/pkg/net/http/pprof/

– Profiling Go programs with pprof

– https://blog.minio.io/debugging-go-routine-leaks-a1220142d32c

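Sample pprof output, taken from the blog post (https://blog.minio.io/debugging-go-routine-leaks-a1220142d32c). This per-goroutine form, with the state in brackets, is what debug=2 prints: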

goroutine 149 [chan send]:
main.sum(0xc420122e58, 0x3, 0x3, 0xc420112240)
        /home/karthic/gophercon/count-instrument.go:39 +0x6c
created by main.sumConcurrent
        /home/karthic/gophercon/count-instrument.go:51 +0x12b

goroutine 243 [chan send]:
main.sum(0xc42021a0d8, 0x3, 0x3, 0xc4202760c0)
        /home/karthic/gophercon/count-instrument.go:39 +0x6c
created by main.sumConcurrent
        /home/karthic/gophercon/count-instrument.go:51 +0x12b

goroutine 259 [chan send]:
main.sum(0xc4202700d8, 0x3, 0x3, 0xc42029c0c0)
        /home/karthic/gophercon/count-instrument.go:39 +0x6c
created by main.sumConcurrent
        /home/karthic/gophercon/count-instrument.go:51 +0x12b

Lessons Learned

Probably few people pay as much attention to the synchronizing role of Channels as they do to Mutexes, yet it is a classic usage in Golang, and it needs care.

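– Try not to mix Mutexes and Channels. In particular, do not put a Channel operation inside a region protected by a Mutex, or a deadlock becomes quite likely.
– Keep the locked region as small as possible; where you can, lock only around the data that needs protecting. The protection can even be placed directly around the data itself, for example with Getter and Setter functions. This also makes it less likely that a Channel operation ends up inside a locked region.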
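The two points above can be sketched together as follows. This is only an illustration under assumptions: Records, Add, Snapshot and publish are made-up names, and the data is a plain string slice. The lock lives inside the accessors, and the Channel send happens only after the lock has been released.

```go
package main

import (
	"fmt"
	"sync"
)

// Records is a hypothetical stand-in for the data guarded by the mutex.
type Records struct {
	mtx   sync.Mutex
	items []string
}

// Add locks only around the slice append.
func (r *Records) Add(item string) {
	r.mtx.Lock()
	r.items = append(r.items, item)
	r.mtx.Unlock()
}

// Snapshot returns a copy so callers never touch items without the lock.
func (r *Records) Snapshot() []string {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	return append([]string(nil), r.items...)
}

// publish copies under the lock, then sends with the lock already released.
func publish(r *Records, out chan<- []string) {
	snap := r.Snapshot() // lock taken and released inside
	out <- snap          // may block, but no mutex is held here
}

func main() {
	r := &Records{}
	r.Add("first")
	r.Add("second")

	out := make(chan []string)
	go publish(r, out)
	fmt.Println(<-out)
}
```

Since the mutex is never exposed outside the accessors, a caller cannot accidentally hold it across a Channel operation.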

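This should reduce the chance of deadlocks to some extent. Of course, avoiding deadlocks still comes down to programmers designing and planning their code carefully.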

A Hungry Fool
