Oct 15 2007

Reverse Mapping in Linux kernel

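Going from a vm_area to the struct page it maps is called mapping; going the other way, from a struct page to every vm_area that maps it, is called reverse mapping.

For anonymous mappings, the picture looks like this:

[figure]

For non-anonymous (file-backed) mappings, it should look like this:

[figure]

The trick in the former is the added anon_vma list, while the latter adds a layer: address_space. What a name! If this is the first time you see it, it is guaranteed to make your head spin!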

The figures above also make one conclusion very clear: C programming is really just Find a Needle in the Haystack! Well? Do you believe it this time?

Aug 26 2007

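A Playful Look at the Intel f00f Bug

f00f is a notorious bug in Intel Pentium CPUs; even if you do not know its exact cause, you have probably heard the name. It affected the Intel Pentium, Pentium MMX and Pentium OverDrive series and was a design flaw. It first surfaced in November 1997, and judging from mail archives of the time it caused quite a stir. Intel is no slouch, of course (especially at "errata fixing" and "compatibility", as anyone who knows a little about the i386 can tell), and quickly offered a workaround. Later Intel CPUs no longer have the bug.

Now let's look at how the f00f bug works, what damage it does, how to work around it, and what the actual fix looks like in the Linux kernel source.

We have to start with the name. The bug is called f00f because of the hexadecimal byte sequence f0 0f c7 c8, which encodes one i386 instruction. In AT&T syntax it is: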


 lock cmpxchg8b %eax

In Intel syntax it is:


 LOCK CMPXCHG8B EAX

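This instruction is clearly invalid, because:

1. cmpxchg8b compares two 64-bit values. One is the implicit edx:eax pair; the other must be in memory, at the address given by the operand. The instruction above breaks this rule by naming a register as the operand.

2. The lock prefix may only be used with memory-based read-modify-write instructions, and the instruction above does not meet that requirement either.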

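By rights, either of these errors alone should raise an invalid opcode exception. But something unexpected happens: when the two errors are combined in exactly this way, the CPU locks itself up, and the only remedy is a reboot! In other words, a C program as short as this is enough to trigger the bug: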

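/* The four bytes encode "lock cmpxchg8b %eax". */
unsigned char x[5] = { 0xf0, 0x0f, 0xc7, 0xc8 };

int main(void)
{
	/* Jump into the byte array as if it were a function. */
	void (*f)(void) = (void (*)(void))x;

	f();
	return 0;
}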

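And it needs no privileges at all; any user can run it! This was the first hardware bug to "make a name for itself", and the operating system was helpless against it. Anyone could easily bring down a server whose CPU had this bug. That shows how serious it was.

So what is going on? Why does an instruction that should merely be doubly wrong cause a disaster? When the CPU finds that cmpxchg8b %eax is invalid, it raises an invalid opcode exception and looks for the corresponding handler. While reading the handler's address, the CPU wrongly asserts the LOCK# signal and locks the bus, then waits for a locked write to that address. But the CPU never performs any write in between, so it hangs itself. This is plainly a design error.

Intel promptly offered two workarounds. Both work, but the first is not very clever and has even been called "as bad as the problem itself"; the other is rather smart. Let's look at the second one, which is the one the Linux kernel uses.

Intel's suggestion: make the page containing the first 7 IDT entries read-only, and set the WP bit in CR0 to 1. Now, when the bug is triggered, the CPU first looks up the invalid opcode handler, and that lookup raises a page fault because the CPU tries to write to a read-only page. The page fault is not locked, so control returns to the operating system. The page fault handler then has to be modified to check: if the fault happened in kernel mode and the faulting address falls exactly on the IDT entry of the invalid opcode exception (vector 6), this is the f00f bug, and the operating system should go on to handle an invalid opcode exception.

Here is how the Linux kernel source actually handles it:

This is the code that places the IDT in the fixed-mapping area; clearly, it is mapped read-only.
(arch/i386/kernel/traps.c)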

void __init trap_init_f00f_bug(void)
{
	__set_fixmap(FIX_F00F_IDT, __pa(&idt_table), PAGE_KERNEL_RO);

	/*
	 * Update the IDT descriptor and reload the IDT so that
	 * it uses the read-only mapped virtual address.
	 */
	idt_descr.address = fix_to_virt(FIX_F00F_IDT);
	load_idt(&idt_descr);
}

And here is part of do_page_fault(), the page fault handler:
(arch/i386/mm/fault.c)

// …
if (boot_cpu_data.f00f_bug) {
	unsigned long nr;

	nr = (address - idt_descr.address) >> 3;

	if (nr == 6) {
		do_invalid_op(regs, 0);
		return;
	}
}
// …

This matches Intel's workaround above exactly!

References:

[1] Wikipedia, http://en.wikipedia.org/wiki/F00f
[2] x86.org, http://www.x86.org/errata/dec97/f00fbug.htm
[3] Understanding the Linux Kernel
[4] Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A

Aug 17 2007

Slab, Slob, Slub

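The Linux kernel really changes too fast, and memory management is a good example.

The kernel used to have only Slab. Now Slab has two siblings: Slob and Slub. See? That is the kernel's naming style: the names alone are enough to confuse you! This is also my strongest impression from reading kernel source these past two days. With cache, cache_cache, free_area and so on, it would be a wonder if your head did not spin!

I never understood how these three relate to each other or why all three exist. Today I searched around and things became a bit clearer. In short: Slab is the foundation and was originally brought over from SunOS; Slub is an improvement on Slab that performs well on large machines (I do not know how it does on ordinary PCs) and is reportedly the default on IA-64; Slob is designed for small systems, mainly embedded ones. Related articles: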

Anatomy of the Linux slab allocator
The SLUB allocator
The SLOB allocator

This also reflects a long-standing idea in Linux kernel development: provide mechanism, not policy. Isn't the same true of other software development?

May 3 2007

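Second Patch Accepted into the -mm Kernel Tree

Hi!

I am happy to announce that another patch of mine has been accepted into Andrew's -mm tree. If you are interested, see here:

http://marc.info/?l=linux-mm-commits&m=117817128028211&w=2

(I admire Andrew's efficiency: the patch was acknowledged and merged just a few minutes after I sent it.)

Here, belatedly, is the link I could not find last time:
http://marc.info/?l=linux-mm-commits&m=117624330000536&w=2

There is also a summary at the bottom: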

Patches currently in -mm which might be from xiyou.wangcong AT gmail.com are

partitions-check-the-return-value-of-kobject_add-etc.patch
vt-add-color-support-to-the-underline-and-italic-attributes-fix-2.patch

I do not know when these two patches will make it into the -stable tree. Let's wait and see.

Apr 11 2007

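My First Officially Accepted Patch

I am happy to announce that a patch of mine has been accepted into the Linux kernel -mm tree for the first time. I had submitted three or four patches before this one, and none of them seemed to be officially accepted. ;-(

Thanks to Andrew Morton and to Cornelia Huck of IBM Germany; this patch would not exist without their help. Special thanks to Cornelia Huck, who patiently coached me through three revisions of the patch.

Andrew has already merged the patch into the -mm tree. Unfortunately I cannot find a web link for the mm-commits mailing list, so here is only a related link on lkml: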

http://lkml.org/lkml/2007/4/10/335

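And a piece of advice for anyone interested in the Linux kernel: keep at it! That is the best advice any kernel developer can give you. There is still plenty to do in 2.6, so get started!

Heh, I am really happy! ;-D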

Apr 9 2007

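Flowcharts of schedule() and fork() in the linux-2.6.21-rc6 Kernel

I have just about managed to draw flowcharts for schedule() and fork(), and I now understand the 2.6 process scheduling algorithm much better. It was not easy~

schedule() is here:
http://files.myopera.com/congwang/files/schedule.jpeg

fork() is here:
http://files.myopera.com/congwang/files/fork.jpeg

BTW: the class diagram of the xylftp client I designed is here: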

http://files.myopera.com/congwang/files/java_uml.jpeg

Mar 12 2007

A Question I Asked on LKML

From: Cong WANG
To: linux-kernel@vger.kernel.org
Date: 2007-3-11, 10:15 PM
Subject: Style Question
Mailed-by: gmail.com
Hi, list!

I have a question about coding style in linux kernel. In
Documention/CodingStyle, it is said that “Linux style for comments is
the C89 “/* … */” style. Don’t use C99-style “// …” comments.”
But I see a lot of ‘//‘ style comments in current kernel code.

Which is wrong? The documentions or the code, or neither? And why?

Another question is about NULL. AFAIK, in user space, using NULL is
better than directly using 0 in C. In kernel, I know it used its own
NULL, which may be defined as ((void*)0), but it’s still different
from raw zero. So can I say using NULL is better than 0 in kernel?

Any reply is welcome. Thanks and have a nice day!

Bernd Petrovitsch was the first to answer:

On Sun, 2007-03-11 at 22:15 +0800, Cong WANG wrote:
[…]
> Another question is about NULL. AFAIK, in user space, using NULL is
> better than directly using 0 in C. In kernel, I know it used its own
> NULL, which may be defined as ((void*)0),

Userspace has the usually same definition.

> but it’s still different
> from raw zero.

It is different that “0” as such has the type “int”. But this int is
automatically promoted to a “0 pointer”.

> So can I say using NULL is better than 0 in kernel?

Yes, because it is immediately clear that a pointer is (or should be)
there (and not an int).
And the same holds for userspace since this is a pure C question.

Bernd

Jan Engelhardt replied:

On Mar 11 2007 22:15, Cong WANG wrote:
>
> I have a question about coding style in linux kernel. In
> Documention/CodingStyle, it is said that “Linux style for comments is
> the C89 “/* … */” style. Don’t use C99-style “// …” comments.”
> But I see a lot of ‘//‘ style comments in current kernel code.
>
> Which is wrong? The documentions or the code, or neither? And why?

The code. And because it’s not always reviewed but silently pushed.

> Another question is about NULL. AFAIK, in user space, using NULL is
> better than directly using 0 in C. In kernel, I know it used its own
> NULL, which may be defined as ((void*)0), but it’s still different
> from raw zero.

In what way?

>So can I say using NULL is better than 0 in kernel?

On what basis? Do you even know what NULL is defined as in
(C, not C++) userspace? Think about it.

Jan

After reading that, I replied:

2007/3/12, Jan Engelhardt:
>
> On Mar 11 2007 22:15, Cong WANG wrote:
> >
> > I have a question about coding style in linux kernel. In
> > Documention/CodingStyle, it is said that “Linux style for comments is
> > the C89 “/* … */” style. Don’t use C99-style “// …” comments.”
> > But I see a lot of ‘//‘ style comments in current kernel code.
> >
> > Which is wrong? The documentions or the code, or neither? And why?
>
> The code. And because it’s not always reviewed but silently pushed.
>
> > Another question is about NULL. AFAIK, in user space, using NULL is
> > better than directly using 0 in C. In kernel, I know it used its own
> > NULL, which may be defined as ((void*)0), but it’s still different
> > from raw zero.
>
> In what way?

The following code is picked from drivers/kvm/kvm_main.c:

static struct kvm_vcpu *vcpu_load(struct kvm *kvm, int vcpu_slot)
{
	struct kvm_vcpu *vcpu = &kvm->vcpus[vcpu_slot];

	mutex_lock(&vcpu->mutex);
	if (unlikely(!vcpu->vmcs)) {
		mutex_unlock(&vcpu->mutex);
		return 0;
	}
	return kvm_arch_ops->vcpu_load(vcpu);
}

Obviously, it used 0 rather than NULL when returning a pointer to
indicate an error. Should we fix such issue?

>
> >So can I say using NULL is better than 0 in kernel?
>
> On what basis? Do you even know what NULL is defined as in
> (C, not C++) userspace? Think about it.
>

I think it’s more clear to indicate we are using a pointer rather than
an integer when we use NULL in kernel. But in userspace, using NULL is
for portbility of the program, although most (just most, NOT all) of
NULL’s defination is ((void*)0).

Some of the other replies follow.

Robert Hancock wrote:

Cong WANG wrote:
> Hi, list!
>
> I have a question about coding style in linux kernel. In
> Documention/CodingStyle, it is said that “Linux style for comments is
> the C89 “/* … */” style. Don’t use C99-style “// …” comments.”
> But I see a lot of ‘//‘ style comments in current kernel code.
>
> Which is wrong? The documentions or the code, or neither? And why?

The code.. As with a lot of coding style issues, it’s likely just that
nobody saw it and bothered to complain when it went in.

> Another question is about NULL. AFAIK, in user space, using NULL is
> better than directly using 0 in C. In kernel, I know it used its own
> NULL, which may be defined as ((void*)0), but it’s still different
> from raw zero. So can I say using NULL is better than 0 in kernel?

It’s the preferred style, Sparse will complain about using 0 for a null
pointer for example..

Kyle Moffett (of Mac) put it this way:

On Mar 11, 2007, at 16:41:51, Daniel Hazelton wrote:
> On Sunday 11 March 2007 16:35:50 Jan Engelhardt wrote:
>> On Mar 11 2007 22:15, Cong WANG wrote:
>>> So can I say using NULL is better than 0 in kernel?
>>
>> On what basis? Do you even know what NULL is defined as in (C, not
>> C++) userspace? Think about it.
>
> IIRC, the glibc and GCC headers define NULL as (void*)0 :)

On the other hand when __cplusplus is defined they define it to the
“__null” builtin, which GCC uses to give type conversion errors for
“int foo = NULL” but not “char *foo = NULL”. A “((void *)0)”
definition gives C++ type errors for both due to the broken C++ void
pointer conversion problems.

Cheers,

Nicholas Miell said:

On Mon, 2007-03-12 at 06:40 +0100, Jan Engelhardt wrote:
> On Mar 12 2007 13:37, Cong WANG wrote:
> >
> > The following code is picked from drivers/kvm/kvm_main.c:
> >
> > static struct kvm_vcpu *vcpu_load(struct kvm *kvm, int vcpu_slot)
> > {
> > struct kvm_vcpu *vcpu = &kvm->vcpus[vcpu_slot];
> >
> > mutex_lock(&vcpu->mutex);
> > if (unlikely(!vcpu->vmcs)) {
> > mutex_unlock(&vcpu->mutex);
> > return 0;
> > }
> > return kvm_arch_ops->vcpu_load(vcpu);
> > }
> >
> > Obviously, it used 0 rather than NULL when returning a pointer to
> > indicate an error. Should we fix such issue?
>
> Indeed. If it was for me, something like that should throw a compile error.
>
> >>[…]
> > I think it’s more clear to indicate we are using a pointer rather than
> > an integer when we use NULL in kernel. But in userspace, using NULL is
> > for portbility of the program, although most (just most, NOT all) of
> > NULL’s defination is ((void*)0).

>
> NULL has the same bit pattern as the number zero. (I’m not saying the bit
> pattern is all zeroes. And I am not even sure if NULL ought to have the same
> pattern as zero.) So C++ could use (void *)0, if it would let itself :p

Not necessarily. You can use 0 at the source level, but the compiler has
to convert it to the actual NULL pointer bit pattern, whatever it may
be.

In C++, NULL is typically defined to 0 (with no void* cast) by most
compilers because 0 (and only 0) can be implicitly converted to to null
pointer of any ponter type without a cast.

GCC introduced the __null extension so that NULL still works correctly
in C++ when passed to a varargs function on 64-bit platforms.

(This just works in C because C makes NULL ((void*)0) is thus is the
right size. In C++, the 0 ends up being an int instead of a pointer when
passed to a varargs function, and things tend to blow up when they read
the garbage high bits. Of course, nobody else does this, so you still
have to use (void*)NULL to be portable.)

Randy Dunlap gave an affirmative answer:

On Mon, 12 Mar 2007, Jan Engelhardt wrote:

>
> On Mar 12 2007 13:37, Cong WANG wrote:
> >
> > The following code is picked from drivers/kvm/kvm_main.c:
> >
> > static struct kvm_vcpu *vcpu_load(struct kvm *kvm, int vcpu_slot)
> > {
> > struct kvm_vcpu *vcpu = &kvm->vcpus[vcpu_slot];
> >
> > mutex_lock(&vcpu->mutex);
> > if (unlikely(!vcpu->vmcs)) {
> > mutex_unlock(&vcpu->mutex);
> > return 0;
> > }
> > return kvm_arch_ops->vcpu_load(vcpu);
> > }
> >
> > Obviously, it used 0 rather than NULL when returning a pointer to
> > indicate an error. Should we fix such issue?
>
> Indeed. If it was for me, something like that should throw a compile error.

At least it does throw a sparse warning, and yes, it should
be fixed.

In the end I decided to submit a patch:

[PATCH]Replace 0 with NULL when returning a pointer

Use NULL to indicate we are returning a pointer rather than an integer
and to eliminate some sparse warnings.

Signed-off-by: Cong WANG <xiyou.wangcong@gmail.com>

--- drivers/kvm/kvm_main.c.orig 2007-03-11 21:41:23.000000000 +0800
+++ drivers/kvm/kvm_main.c 2007-03-12 14:26:17.000000000 +0800
@@ -205,7 +205,7 @@ static struct kvm_vcpu *vcpu_load(struct
 	mutex_lock(&vcpu->mutex);
 	if (unlikely(!vcpu->vmcs)) {
 		mutex_unlock(&vcpu->mutex);
-		return 0;
+		return NULL;
 	}
 	return kvm_arch_ops->vcpu_load(vcpu);
 }
@@ -799,7 +799,7 @@ struct kvm_memory_slot *gfn_to_memslot(s
 		    && gfn < memslot->base_gfn + memslot->npages)
 			return memslot;
 	}
-	return 0;
+	return NULL;
 }
 EXPORT_SYMBOL_GPL(gfn_to_memslot);

Mar 9 2007

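A Few Concepts in the Linux Kernel

Sunday, 25. February 2007, 07:19:10

WANG Cong @ XIPT

A softirq is a mechanism the kernel uses to defer work. Some interrupt handling has to finish within a very short time, so the kernel must postpone the relatively less important work, and softirqs exist to run exactly that postponed work. In some ways a softirq resembles a hardware interrupt: it can arrive "anytime, anywhere", and it does not run in process context. Never confuse it with a "software interrupt", which is raised by an interrupt instruction in a program (for example int 0x80) and is delivered by the CPU through its real interrupt mechanism; a softirq has nothing to do with the hardware at all.

A tasklet is a deferral mechanism built on top of softirqs. Softirqs are fixed at compile time, while tasklets can be allocated and initialized at run time. Softirqs can run on several CPUs at the same time, while tasklets of the same type run only on the CPU that scheduled them. Softirq functions must therefore be reentrant, while tasklet functions need not be.

A work queue is another deferral mechanism, but it differs a great deal from a tasklet. Tasklets run in interrupt context, whereas work queues run in process context, so a work queue function may sleep and does not have to be atomic. The threads that run the default work queue are events/n (n is the CPU number; on UP it is events/0), not ksoftirqd/n, which runs softirqs. They are kernel threads, so they cannot access user memory either.

Bottom half is the umbrella term for deferred work; its job is to finish what the top half left undone. In general, an interrupt handler (the top half) should do as little as possible, and the less urgent work can be pushed to a bottom half, which may be a tasklet or a work queue.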

Mar 9 2007

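A Few Lessons from Kernel Development

Wednesday, 14. February 2007, 12:20:41

Author: WANG Cong @ XIPT

1. Never use sleep_on or its variants. They are unsafe and are about to be removed.
2. The old /proc interface is harmful. If you do not understand it thoroughly, do not use it! Try not to put anything more into /proc, because it is messy enough already.
3. One variant of wait_event, wait_event_interruptible_exclusive, can perform an exclusive wait. LDD gets this wrong, or rather, it is out of date.
4. Do not use lock_kernel(). The lock is far too coarse and hurts system performance.
5. A semaphore is itself a high-level lock and can express richer semantics than an ordinary lock such as a spinlock. Just as allocated memory must be freed, a process that has done down on a semaphore must do up before it exits. Remember: never sleep while holding a spinlock!
6. Unix errno values carry rich meaning. Handle them carefully in user space, and even more so in kernel space, because an error status from the kernel directly affects the behavior of existing user programs.
7. An oops is a serious error. A kernel that has oopsed cannot be trusted.
8. About patches:
a) A stable-kernel patch applies only to its base version. For example, a 2.6.17.* patch applies only to 2.6.17.
b) A base-version patch applies only to the previous base version. For example, the 2.6.18 patch applies only to 2.6.17.
c) An incremental patch takes one specific version to another specific version. For example, patch-2.6.17.10-11.bz2 applies only to 2.6.17.10 and upgrades it to 2.6.17.11.
d) If the versions are far apart, say 2.6.17.1 to 2.6.17.11, you can first revert from 2.6.17.1 to 2.6.17 and then upgrade to 2.6.17.11 in one step.
e) Stable-kernel patches and base-version patches live in the source directory, while incremental patches live in a separate directory, /pub/linux/kernel/v2.6/incr.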

Mar 9 2007

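Hunting a Kernel Bug: An Account

Thursday, 8. February 2007, 12:38:09

04043196 WANG Cong, Department of Computer Science, Xi'an Institute of Posts and Telecommunications

As we know, when an unsigned number overflows it quietly wraps around, so wraparound has to be considered whenever two unsigned numbers are compared. Overflow may not occur in the vast majority of cases, but once it does and is handled badly, the system ends up in an unexpected state. Unfortunately, kfifo in the Linux kernel does not handle this properly.

struct kfifo is defined in include/linux/kfifo.h, with the following members: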

struct kfifo {
        unsigned char *buffer;
        unsigned int size;
        unsigned int in;
        unsigned int out;
        spinlock_t *lock;
};

Clearly, both in and out are unsigned integers, mainly to make the AND operation below convenient. __kfifo_put and __kfifo_get are the two lock-free interfaces; they put data into and take data out of the circular buffer, and are defined as follows:

unsigned int __kfifo_put(struct kfifo *fifo,
			 unsigned char *buffer, unsigned int len)
{
	unsigned int l;

	len = min(len, fifo->size - fifo->in + fifo->out);
	…
	smp_mb();

	/* first put the data starting from fifo->in to buffer end */
	l = min(len, fifo->size - (fifo->in & (fifo->size - 1)));
	memcpy(fifo->buffer + (fifo->in & (fifo->size - 1)), buffer, l);

	/* then put the rest (if any) at the beginning of the buffer */
	memcpy(fifo->buffer, buffer + l, len - l);
	…
	smp_wmb();

	fifo->in += len;

	return len;
}
…
unsigned int __kfifo_get(struct kfifo *fifo,
			 unsigned char *buffer, unsigned int len)
{
	unsigned int l;

	len = min(len, fifo->in - fifo->out);
	…
	smp_rmb();

	/* first get the data from fifo->out until the end of the buffer */
	l = min(len, fifo->size - (fifo->out & (fifo->size - 1)));
	memcpy(buffer, fifo->buffer + (fifo->out & (fifo->size - 1)), l);

	/* then get the rest (if any) from the beginning of the buffer */
	memcpy(buffer + l, fifo->buffer, len - l);
	…
	smp_mb();

	fifo->out += len;

	return len;
}
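Under normal conditions the two functions above guarantee that in is always greater than or equal to out and that their difference never exceeds size. But when in overflows and out happens not to have overflowed yet, misfortune strikes: in becomes smaller than out! This seems to matter little to __kfifo_get, but for __kfifo_put the effect is fatal. After wrapping, in becomes a small positive number while out is still a large positive number, so (fifo->size - fifo->in + fifo->out) also becomes a large positive number. If a kernel programmer then carelessly passes a large len to __kfifo_put (the same goes for kfifo_put), a pointer runs out of bounds, and in the worse case the kernel oopses painfully!

The rough kernel module and user program below demonstrate the bug. The kernel module: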
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/fs.h>
#include <asm/uaccess.h>
#include <linux/err.h>
#include <linux/gfp.h>
#include <linux/spinlock.h>
#include <linux/kfifo.h>
#include <linux/string.h>

#define LFS_MAGIC 0x19860913
#define NFILES 2
#define TEST_BUF_LEN 64

static struct kfifo *fifo;
static spinlock_t lock;
static char *buf;

static int lfs_open_file(struct inode *inode, struct file *filp)
{
	if (inode->i_ino > NFILES)
		return -ENODEV;
	return 0;
}

static ssize_t lfs_read_file(struct file *filp, char *buffer,
			     size_t count, loff_t *offset)
{
	int len;

	len = kfifo_get(fifo, buf, count);
	if (*offset > len)
		return 0;
	if (count > len - *offset)
		count = len - *offset;

	if (copy_to_user(buffer, buf + *offset, count))
		return -EFAULT;
	*offset += count;
	return count;
}

static ssize_t lfs_write_file(struct file *filp, const char *buffer,
			      size_t count, loff_t *offset)
{
	if (*offset != 0)
		*offset = 0;

	if (count >= TEST_BUF_LEN)
		count = TEST_BUF_LEN;

	if (copy_from_user(buf, buffer, count))
		return -EFAULT;

	return (ssize_t) kfifo_put(fifo, (char *)buffer, count);
}

static int my_atoi(const char *name)
{
	int val = 0;

	for (;; name++) {
		switch (*name) {
		case '0' ... '9':
			val = 10 * val + (*name - '0');
			break;
		default:
			return val;
		}
	}
}

static int lfs_open_file2(struct inode *inode, struct file *filp)
{
	if (inode->i_ino > NFILES)
		return -ENODEV;
	filp->private_data = fifo;
	return 0;
}

static ssize_t lfs_read_file2(struct file *filp, char *buffer,
			      size_t count, loff_t *offset)
{
	int len;
	struct kfifo *myfifo = (struct kfifo *)filp->private_data;

	len =
	    snprintf(buf, TEST_BUF_LEN, "in=%u out=%u\n", myfifo->in,
		     myfifo->out);
	if (*offset > len)
		return 0;
	if (count > len - *offset)
		count = len - *offset;

	if (copy_to_user(buffer, buf + *offset, count))
		return -EFAULT;
	*offset += count;
	return count;
}

static ssize_t lfs_write_file2(struct file *filp, const char *buffer,
			       size_t count, loff_t *offset)
{
	char *p = buf;
	struct kfifo *myfifo = (struct kfifo *)filp->private_data;

	if (*offset != 0)
		return -EINVAL;

	if (count >= TEST_BUF_LEN)
		return -EINVAL;
	memset(buf, 0, TEST_BUF_LEN);
	if (copy_from_user(buf, buffer, count))
		return -EFAULT;
	p = strchr(buf, ' ');
	if (!p)
		return -EINVAL;
	*p++ = '\0';
	myfifo->in = my_atoi(buf);
	myfifo->out = my_atoi(p);
	return count;
}

static struct file_operations lfs_file_ops = {
	.open = lfs_open_file,
	.read = lfs_read_file,
	.write = lfs_write_file,
};

static struct file_operations lfs_file2_ops = {
	.open = lfs_open_file2,
	.read = lfs_read_file2,
	.write = lfs_write_file2,
};

struct tree_descr myfiles[] = {
	{NULL, NULL, 0},
	{.name = "kfifo",
	 .ops = &lfs_file_ops,
	 .mode = S_IWUSR | S_IRUGO},
	{.name = "debug",
	 .ops = &lfs_file2_ops,
	 .mode = S_IWUSR | S_IRUGO},
	{"", NULL, 0}
};

static int lfs_fill_super(struct super_block *sb, void *data, int silent)
{
	return simple_fill_super(sb, LFS_MAGIC, myfiles);
}

static int lfs_get_super(struct file_system_type *fst,
			 int flags, const char *devname, void *data,
			 struct vfsmount *mnt)
{
	return get_sb_single(fst, flags, data, lfs_fill_super, mnt);
}

static struct file_system_type lfs_type = {
	.owner = THIS_MODULE,
	.name = "demofs",
	.get_sb = lfs_get_super,
	.kill_sb = kill_litter_super,
};

static int __init lfs_init(void)
{
	spin_lock_init(&lock);
	fifo = kfifo_alloc(TEST_BUF_LEN, GFP_KERNEL, &lock);
	if (IS_ERR(fifo)) {
		kfifo_free(fifo);
		return -ENOMEM;
	}
	/*
	 * We just want the overflow comes soon.
	 * You can, of course, let fifo->in and fifo->out
	 * to be 0. And we can let them increase by 'fifo->size'
	 * in the user space quietly. Sooner or later, they will
	 * overflow again like this.
	 */
	fifo->in = fifo->…
…
…<xiyou.wangcong@gmail.com>");
MODULE_DESCRIPTION("Show the bug of unsigned integer overflow in kfifo.");
MODULE_SUPPORTED_DEVICE("libfs filesystem");
The user-space program:

#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
…
	for (… i < 256; i++)
		buf[i] = '0';
	/*
	 * I won't check the return value of write.
	 * And that's the reason why I don't use 'echo'.
	 */
	write(fd, buf, 256);
	return 0;
}

—————————————————————————————————————

#! /bin/bash
#bugshow.sh
#Author: WANG Cong, XIPT. <xiyou.wangcong@gmail.com>
#Usage: ./bugshow.sh install your_module_name.ko
# OR ./bugshow.sh uninstall your_module_name

if [ $# != "2" ]; then
	echo "Usage: ./bugshow.sh install your_module_name.ko"
	echo "OR ./bugshow.sh uninstall your_module_name"
	exit -1
fi
action="$1"
if [ "$action" = "install" ]; then
	module=${!#}
	/sbin/insmod $module
	mkdir -p /mnt/libfs
	mount -t demofs none /mnt/libfs
	if find ./ -name bugshow.c
	then
		gcc -Wall -o bugshow bugshow.c
	else
		echo "Can't find bugshow.c!"
		exit -2
	fi
	./bugshow
	cat /mnt/libfs/debug
	./bugshow
	cat /mnt/libfs/debug
elif [ "$action" = "uninstall" ]; then
	module=${!#}
	umount none
	rmdir /mnt/libfs
	/sbin/rmmod $module
else
	echo "Bad usage!"
	exit -3
fi
exit 0

The module above was written with care (although races were not considered ;-p), so the bug does not cause anything very serious here; it only makes it impossible to write any more data into the kfifo. The bug affects every kernel version that has kfifo, from 2.6.10 to 2.6.20.

A simple patch follows:

--- kernel/kfifo.c.orig 2007-02-07 19:42:51.000000000 +0800
+++ kernel/kfifo.c      2007-02-07 19:43:31.000000000 +0800
@@ -24,6 +24,7 @@
 #include <linux/slab.h>
 #include <linux/err.h>
 #include <linux/kfifo.h>
+#include <linux/compiler.h>

 /**
  * kfifo_init - allocates a new FIFO using a preallocated buffer
@@ -120,6 +121,12 @@ unsigned int __kfifo_put(struct kfifo *f
 {
        unsigned int l;

+       /*If only fifo->in overflows, let both overflow!*/
+       if (unlikely(fifo->in < fifo->out)) {
+               fifo->out += fifo->size;
+               fifo->in  += fifo->size;
+       }
+
        len = min(len, fifo->size - fifo->in + fifo->out);

        /*
@@ -166,6 +173,12 @@ unsigned int __kfifo_get(struct kfifo *f
 {
        unsigned int l;

+       /*If only fifo->in overflows, let both overflow!*/
+       if (unlikely(fifo->in < fifo->out)) {
+               fifo->out += fifo->size;
+               fifo->in  += fifo->size;
+       }
+
        len = min(len, fifo->in - fifo->out);

        /*

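Later, with a pointer from Andrew, I found out that this is not a bug. I had been confused by the /proc interface at the start and drew the wrong conclusion.
Lesson learned: never use the old /proc interface!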