An In-Depth Analysis of CVE-2021-20226

Anonymous, 1500 days ago

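In this article, we take an in-depth technical look at the CVE-2021-20226 vulnerability.

If you run into any problems or spot any mistakes while reading, I would be grateful if you contacted me directly. Unless noted otherwise, the code quoted in this article is from the Linux kernel 5.6.19 source.

io_uring was one of the most frequently updated features in 2021, and the details change from version to version (they have changed several times since I found the bug). Please keep in mind that the information here is not guaranteed to be up to date, even at the time of writing.

This article does not explain general Linux kernel terminology or background. I will explain the PoC code I wrote, but I will not publish the actual exploit code.

Vulnerability overview

Prerequisites

The attacker can already execute arbitrary code (commands) on the system.

Impact

Privilege escalation to root.

What is io_uring

Roughly speaking, io_uring is a recent mechanism for asynchronous I/O (network and file system). Blogs and slides published elsewhere cover the specification and details from the user's point of view. What follows is only an outline, and it assumes you already know the basics.

With io_uring, a dedicated system call (io_uring_setup) first creates a file descriptor. Calling mmap() on that descriptor maps the submission queue (SQ) and the completion queue (CQ) into user-space memory, where they are shared with the kernel and used by both sides as ring buffers. Operations that correspond to system calls such as read/write/send/recv are registered by writing SQEs (submission queue entries) into the shared memory, and execution is started by calling io_uring_enter().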
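To make the "ring buffer shared by both sides" idea concrete, here is a minimal user-space sketch of a power-of-two ring with free-running head and tail indices. It is an illustration only: the names (toy_ring, toy_ring_push, toy_ring_pop) are hypothetical, the layout is not the kernel's, and the memory barriers that a real producer and consumer on different CPUs would need are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

#define RING_ENTRIES 8u                 /* must be a power of two */
#define RING_MASK    (RING_ENTRIES - 1u)

/* Toy single-producer/single-consumer ring; not the kernel's layout. */
struct toy_ring {
    uint32_t head;                      /* advanced by the consumer */
    uint32_t tail;                      /* advanced by the producer */
    int      slots[RING_ENTRIES];
};

/* Producer side: returns 0 on success, -1 if the ring is full. */
static int toy_ring_push(struct toy_ring *r, int value)
{
    if (r->tail - r->head == RING_ENTRIES)
        return -1;
    r->slots[r->tail & RING_MASK] = value;
    r->tail++;                          /* publish the new entry */
    return 0;
}

/* Consumer side: returns 0 on success, -1 if the ring is empty. */
static int toy_ring_pop(struct toy_ring *r, int *value)
{
    if (r->head == r->tail)
        return -1;
    *value = r->slots[r->head & RING_MASK];
    r->head++;                          /* release the slot */
    return 0;
}
```

In this picture, user space plays the producer for the SQ and the consumer for the CQ, while the kernel plays the opposite roles. The indices are never reset; only the masked value is used to address a slot.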

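Asynchronous execution

This vulnerability is closely tied to how asynchronous execution is implemented, so I will focus on that. Note first that io_uring does not always execute asynchronously; it does so only when needed.

First, look at the code below. (From here on, I use kernel v5.8 to explain the behavior. Note that it may differ slightly in your environment.)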

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/fcntl.h>
#include <err.h>
#include <unistd.h>
#include <sys/mman.h>
#include <linux/io_uring.h>

#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

static int uring_fd;

struct iovec *io;
#define SIZE 32
char _buf[SIZE];

int main(void) {
  // initialize uring
  struct io_uring_params params = { };
  uring_fd = SYSCHK(syscall(__NR_io_uring_setup, /*entries=*/10, &params));
  unsigned char *sq_ring = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE,
                                       MAP_SHARED, uring_fd,
                                       IORING_OFF_SQ_RING));
  unsigned char *cq_ring = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE,
                                       MAP_SHARED, uring_fd,
                                       IORING_OFF_CQ_RING));
  struct io_uring_sqe *sqes = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE,
                                          MAP_SHARED, uring_fd,
                                          IORING_OFF_SQES));

  io = malloc(sizeof(struct iovec)*1);
  io[0].iov_base = _buf;
  io[0].iov_len = SIZE;

  struct timespec ts = { .tv_sec = 1 };
  sqes[0] = (struct io_uring_sqe) {
    .opcode = IORING_OP_TIMEOUT,
    //.flags = IOSQE_IO_HARDLINK,
    .len = 1,
    .addr = (unsigned long)&ts
  };
  sqes[1] = (struct io_uring_sqe) {
    .opcode = IORING_OP_READV,
    .addr = (unsigned long)io,
    .flags = 0,
    .len = 1,
    .off = 0,
    .fd = SYSCHK(open("/etc/passwd", O_RDONLY))
  };
  // queue both SQEs and publish them by advancing the tail
  ((int*)(sq_ring + params.sq_off.array))[0] = 0;
  ((int*)(sq_ring + params.sq_off.array))[1] = 1;
  (*(int*)(sq_ring + params.sq_off.tail)) += 2;

  int submitted = SYSCHK(syscall(__NR_io_uring_enter, uring_fd,
                                 /*to_submit=*/2, /*min_complete=*/0,
                                 /*flags=*/0, /*sig=*/NULL, /*sigsz=*/0));
  // poll the buffer every 0.1 s until readv() has filled it
  while(1){
    usleep(100000);
    if(*_buf){
      puts("READV executed.");
      break;
    }
    puts("Waiting.");
  }
}

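The code above sets up an IORING_OP_TIMEOUT and an IORING_OP_READV operation, starts execution, and then checks every 0.1 seconds whether readv() has completed. If operations were executed strictly in ring-buffer order, readv() would be expected to complete after 1 second. When I actually ran it, however, the result was: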

$ ./sample
READV executed.

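In other words, readv() completed immediately. As mentioned earlier, execution is asynchronous only when needed, and in this case readv() can be executed right away (it is known that it will not block). The later operation therefore finishes first (ignore IORING_OP_TIMEOUT for the moment). As a test, the following systemtap [see note 1] script checks whether readv() is executed synchronously, that is, inside the system call handler.

[Note 1]: systemtap is a tool that runs scripts flexibly, for example to trace kernel functions (among other things) and print variables at trace points. I am very fond of it: kernel debugging is normally tedious, and this tool makes life much easier.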

#!/usr/bin/stap

probe kernel.function("io_read@/build/linux-b4NE0x/linux-5.8.0/fs/io_uring.c:2710"){
  printf("%s\n", task_execname(task_current()))
}

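Below is the output of the systemtap script while the previous program (file name sample) runs. With asynchronous execution the task would be handed to a worker, but since it runs synchronously here, the name of the executable that made the system call is printed.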

$ sudo stap -g ./sample.stp
sample

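So where did IORING_OP_TIMEOUT go? The answer: it was handed to a kernel thread because it was judged to need asynchronous execution. Certain conditions cause an operation to be queued for asynchronous execution. Some examples follow.

1. When the force-async flag is enabled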

} else if (req->flags & REQ_F_FORCE_ASYNC) {
    ......
    /*
     * Never try inline submit of IOSQE_ASYNC is set, go straight
     * to async execution.
     */
    req->work.flags |= IO_WQ_WORK_CONCURRENT;
    io_queue_async_work(req);
https://elixir.bootlin.com/linux/v5.6.19/source/fs/io_uring.c#L4825

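2. When the logic prepared for each operation decides so (for example, readv() is issued with the IOCB_NOWAIT flag, and EAGAIN is returned if the operation is expected to block).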

static int io_read(struct io_kiocb *req, struct io_kiocb **nxt,
                   bool force_nonblock)
{
    ......
    ret = rw_verify_area(READ, req->file, &kiocb->ki_pos, iov_count);
    if (!ret) {
        ssize_t ret2;

        if (req->file->f_op->read_iter)
            ret2 = call_read_iter(req->file, kiocb, &iter);
        else
            ret2 = loop_rw_iter(READ, req->file, kiocb, &iter);

        /* Catch -EAGAIN return for forced non-blocking submission */
        if (!force_nonblock || ret2 != -EAGAIN) {
            kiocb_done(kiocb, ret2, nxt, req->in_async);
        } else {
copy_iov:
            ret = io_setup_async_rw(req, io_size, iovec,
                                    inline_vecs, &iter);
            if (ret)
                goto out_free;
            return -EAGAIN;
        }
    }
    ......
}
https://elixir.bootlin.com/linux/v5.6.19/source/fs/io_uring.c#L2224

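When EAGAIN is returned, the request is queued for asynchronous execution. (If the operation type needs the file descriptor table, io_grab_files() picks up the reference to it at this point.)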

static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
    ......
    ret = io_issue_sqe(req, sqe, &nxt, true);

    /*
     * We async punt it if the file wasn't marked NOWAIT, or if the file
     * doesn't support non-blocking read/write attempts
     */
    if (ret == -EAGAIN && (!(req->flags & REQ_F_NOWAIT) ||
        (req->flags & REQ_F_MUST_PUNT))) {
punt:
        if (io_op_defs[req->opcode].file_table) {
            ret = io_grab_files(req);
            if (ret)
                goto err;
        }

        /*
         * Queued up for async execution, worker will release
         * submit reference when the iocb is actually submitted.
         */
        io_queue_async_work(req);
        goto done_req;
    }
    ......
}
https://elixir.bootlin.com/linux/v5.6.19/source/fs/io_uring.c#L4741

static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
                        struct io_kiocb **nxt, bool force_nonblock)
{
    struct io_ring_ctx *ctx = req->ctx;
    int ret;

    switch (req->opcode) {
    case IORING_OP_NOP:
        ret = io_nop(req);
        break;
    case IORING_OP_READV:
    case IORING_OP_READ_FIXED:
    case IORING_OP_READ:
        if (sqe) {
            ret = io_read_prep(req, sqe, force_nonblock);
            if (ret < 0)
                break;
        }
        ret = io_read(req, nxt, force_nonblock);
        break;
    ......
https://elixir.bootlin.com/linux/v5.6.19/source/fs/io_uring.c#L4314

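3. When the IOSQE_IO_LINK or IOSQE_IO_HARDLINK flag is used (to fix the execution order) and an operation earlier in that order is determined to need asynchronous execution.

(Requests are chained as shown in the code below and executed in order; if condition 2 is met partway through, the whole chain is queued for asynchronous execution.)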

static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
                          struct io_submit_state *state, struct io_kiocb **link)
{
    ......
    /*
     * If we already have a head request, queue this one for async
     * submittal once the head completes. If we don't have a head but
     * IOSQE_IO_LINK is set in the sqe, start a new head. This one will be
     * submitted sync once the chain is complete. If none of those
     * conditions are true (normal request), then just queue it.
     */
    if (*link) {
        ......
        list_add_tail(&req->link_list, &head->link_list);

        /* last request of a link, enqueue the link */
        if (!(sqe_flags & (IOSQE_IO_LINK|IOSQE_IO_HARDLINK))) {
            io_queue_link_head(head);
            *link = NULL;
        }
    } else {
        ......
        if (sqe_flags & (IOSQE_IO_LINK|IOSQE_IO_HARDLINK)) {
            req->flags |= REQ_F_LINK;
            INIT_LIST_HEAD(&req->link_list);

            if (io_alloc_async_ctx(req)) {
                ret = -EAGAIN;
                goto err_req;
            }
            ret = io_req_defer_prep(req, sqe);
            if (ret)
                req->flags |= REQ_F_FAIL_LINK;
            *link = req;
        } else {
            io_queue_sqe(req, sqe);
        }
    }

    return true;
}
https://elixir.bootlin.com/linux/v5.6.19/source/fs/io_uring.c#L4858
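The "try without blocking, defer on EAGAIN" pattern from condition 2 can be observed from user space with an ordinary non-blocking pipe. The sketch below is only an analogy, not io_uring code: the enum labels and function names are hypothetical, and "punt" here simply means "this is the point where the kernel would hand the request to a worker".

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical outcome labels, not kernel names. */
enum toy_outcome { TOY_DONE_INLINE, TOY_PUNT_TO_WORKER, TOY_FAILED };

/* Attempt the read without blocking and classify the result. */
static enum toy_outcome toy_try_read(int fd, char *buf, size_t len, ssize_t *nread)
{
    ssize_t n = read(fd, buf, len);

    if (n >= 0) {
        *nread = n;
        return TOY_DONE_INLINE;         /* completed in the caller's context */
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return TOY_PUNT_TO_WORKER;      /* would block: defer it */
    return TOY_FAILED;
}

/* Returns 0 if an empty pipe yields "punt" and a filled pipe yields "inline". */
static int toy_demo(void)
{
    int p[2];
    char buf[8];
    ssize_t n = 0;

    if (pipe(p) != 0)
        return -1;
    /* Make the read end non-blocking, similar in spirit to IOCB_NOWAIT. */
    if (fcntl(p[0], F_SETFL, O_NONBLOCK) != 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    enum toy_outcome first = toy_try_read(p[0], buf, sizeof(buf), &n);
    enum toy_outcome second = TOY_FAILED;
    if (write(p[1], "hi", 2) == 2)
        second = toy_try_read(p[0], buf, sizeof(buf), &n);

    close(p[0]);
    close(p[1]);
    return (first == TOY_PUNT_TO_WORKER &&
            second == TOY_DONE_INLINE && n == 2) ? 0 : -1;
}
```

A read from a regular file such as /etc/passwd falls into the "inline" case, which matches the immediate completion seen in the first run of the sample program.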

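Strictly speaking, IORING_OP_TIMEOUT is a slightly special case, because it does not return EAGAIN as in condition 2, but I use it as the example because (I think) it is easy to understand. By linking an operation that needs asynchronous execution (IORING_OP_TIMEOUT) to another operation, as in the accompanying figure, we can see that the linked IORING_OP_READV is executed only after the 1 second wait.

The sample code above is modified so that the IORING_OP_TIMEOUT operation carries the IOSQE_IO_HARDLINK flag, which links it to the operation that follows.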

48c48
<     //.flags = IOSQE_IO_HARDLINK,
---
>     .flags = IOSQE_IO_HARDLINK,

Execution result:

$ ./sample
Waiting.
Waiting.
Waiting.
Waiting.
Waiting.
Waiting.
Waiting.
Waiting.
Waiting.
READV executed.

If you now print the name of the process executing io_read() in the same way as before, you get the following output:

$ sudo stap -g ./sample.stp
io_wqe_worker-0

The process list shows that this is a kernel thread.

$ ps aux | grep -A 2 -m 1 sample
garyo    131388  0.0  0.0  2492  1412 pts/1    S+  19:03   0:00 ./sample
root     131389  0.0  0.0     0     0 ?        S   19:03   0:00 [io_wq_manager]
root     131390  0.0  0.0     0     0 ?        S   19:03   0:00 [io_wqe_worker-0]

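From here on, this kernel thread is called the "worker". The worker is created by the following code; it then dequeues asynchronous tasks from the queue and executes them.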

static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
{
    ......
    worker->task = kthread_create_on_node(io_wqe_worker, worker, wqe->node,
                                          "io_wqe_worker-%d/%d", index, wqe->node);
    ......
}
https://elixir.bootlin.com/linux/v5.6.19/source/fs/io-wq.c#L621

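As noted above, the actual behavior of IORING_OP_TIMEOUT differs slightly from the figure, but the difference is ignored here for simplicity. Strictly speaking, when io_timeout() is called, it installs io_timeout_fn() as the handler and starts a timer. When the timer expires, io_timeout_fn() is called and loads the linked operation into the asynchronous execution queue. In other words, IORING_OP_TIMEOUT itself is never put on the asynchronous execution queue. TIMEOUT is used in this explanation because it makes it easy to picture execution being held up.

Caveats when offloading I/O operations to the kernel

We have seen that asynchronous work is carried out by a worker running as a kernel thread. There is a catch, though. Because the worker is a kernel thread, its execution context differs from that of the thread that called the io_uring system calls. Here, "execution context" means the task_struct associated with the process and the information attached to it, such as mm (which manages the process's virtual address space), cred (which holds UID/GID/capabilities), and files_struct (which holds the file descriptor table: files_struct contains an array of file structures indexed by file descriptor).

Naturally, if the worker does not use the structures belonging to the thread that made the system call, it may reference the wrong virtual memory or file descriptor table, or perform I/O with the privileges of a kernel thread [note 2].

[Note 2]: Incidentally, this was a real vulnerability: because switching cred was forgotten, operations could be performed with root privileges. No equivalent of open() was implemented at the time, but the SCM_CREDENTIALS option of sendmsg could be used to announce the sender's credentials. That is a security problem for software such as D-Bus, which determines privileges from them. For details, see https://www.exploit-db.com/exploits/47779.

For this reason, io_uring passes these references to the worker, which switches its own context before execution so that the execution context is shared. For example, the following code shows references to mm and cred being passed to req->work.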

static inline void io_req_work_grab_env(struct io_kiocb *req,
                                        const struct io_op_def *def)
{
    if (!req->work.mm && def->needs_mm) {
        mmgrab(current->mm);
        req->work.mm = current->mm;
    }
    if (!req->work.creds)
        req->work.creds = get_current_cred();
    if (!req->work.fs && def->needs_fs) {
        spin_lock(&current->fs->lock);
        if (!current->fs->in_exec) {
            req->work.fs = current->fs;
            req->work.fs->users++;
        } else {
            req->work.flags |= IO_WQ_WORK_CANCEL;
        }
        spin_unlock(&current->fs->lock);
    }
    if (!req->work.task_pid)
        req->work.task_pid = task_pid_vnr(current);
}
https://elixir.bootlin.com/linux/v5.6.19/source/fs/io_uring.c#L910

In the following code, you can see the reference to files_struct being passed to req->work.

static int io_grab_files(struct io_kiocb *req)
{
    ......
    if (fcheck(ctx->ring_fd) == ctx->ring_file) {
        list_add(&req->inflight_entry, &ctx->inflight_list);
        req->flags |= REQ_F_INFLIGHT;
        req->work.files = current->files;
        ret = 0;
    }
    ......
}
https://elixir.bootlin.com/linux/v5.6.19/source/fs/io_uring.c#L4634

Then, before execution, these are installed into the worker's current (a macro that yields the task_struct of the currently running thread).

static void io_worker_handle_work(struct io_worker *worker)
    __releases(wqe->lock)
{
    struct io_wq_work *work, *old_work = NULL, *put_work = NULL;
    struct io_wqe *wqe = worker->wqe;
    struct io_wq *wq = wqe->wq;

    do {
        ......
        if (work->files && current->files != work->files) {
            task_lock(current);
            current->files = work->files;
            task_unlock(current);
        }
        if (work->fs && current->fs != work->fs)
            current->fs = work->fs;
        if (work->mm != worker->mm)
            io_wq_switch_mm(worker, work);
        if (worker->cur_creds != work->creds)
            io_wq_switch_creds(worker, work);
        ......
        work->func(&work);
        ......
    } while (1);
}
https://elixir.bootlin.com/linux/v5.6.19/source/fs/io-wq.c#L443

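Vulnerability details

The files_struct reference counter when sharing with the worker

Now for the vulnerability itself. In the following code (shown earlier), a reference to the files_struct of the thread executing the system call is stored in a structure that the worker will use later, but the reference counter is not incremented.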

static int io_grab_files(struct io_kiocb *req)
{
    ......
    if (fcheck(ctx->ring_fd) == ctx->ring_file) {
        list_add(&req->inflight_entry, &ctx->inflight_list);
        req->flags |= REQ_F_INFLIGHT;
        req->work.files = current->files;
        ret = 0;
    }
    ......
}
https://elixir.bootlin.com/linux/v5.6.19/source/fs/io_uring.c#L4634

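As briefly explained earlier, when a task is queued for asynchronous execution, a reference to the file structure is first looked up from the specified file descriptor and saved (in the io_kiocb structure).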

static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req,
                           const struct io_uring_sqe *sqe)
{
    struct io_ring_ctx *ctx = req->ctx;
    unsigned flags;
    int fd;

    flags = READ_ONCE(sqe->flags);
    fd = READ_ONCE(sqe->fd);

    if (!io_req_needs_file(req, fd))
        return 0;

    if (flags & IOSQE_FIXED_FILE) {
        if (unlikely(!ctx->file_data ||
            (unsigned) fd >= ctx->nr_user_files))
            return -EBADF;
        fd = array_index_nospec(fd, ctx->nr_user_files);
        req->file = io_file_from_index(ctx, fd);
        if (!req->file)
            return -EBADF;
        req->flags |= REQ_F_FIXED_FILE;
        percpu_ref_get(&ctx->file_data->refs);
    } else {
        if (req->needs_fixed_file)
            return -EBADF;
        trace_io_uring_file_get(ctx, fd);
        req->file = io_file_get(state, fd);
        if (unlikely(!req->file))
            return -EBADF;
    }

    return 0;
}
https://elixir.bootlin.com/linux/v5.6.19/source/fs/io_uring.c#L4599

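This way, the worker no longer needs to look the file up from the file descriptor, and has no need to touch files_struct at all. If so, not incrementing the files_struct reference counter seems harmless (it is never used). That assumption no longer holds from Linux kernel 5.5 onward, because system calls that affect the file descriptor table, such as open/close/accept, can now be issued through io_uring. These system calls obviously modify the file descriptor table, so this looks dangerous. However:

Even if open/close/accept and the like are called directly, nothing goes wrong as long as the files_struct is valid.

  - System calls are of course designed to cope with several threads working on the same file at once, so a race condition between the calling thread and the worker thread cannot normally be created.

If the files_struct is released and its reference counter drops to 0, a new process may reuse the memory as its own files_struct. After that reuse, the worker holds a reference to the new process's files_struct.

  - But the file structure has already been obtained from the file descriptor, so a reference to a file structure of the new process cannot be obtained. (This is a lie; I explain why later.)

  - By opening a file, a file structure can be inserted into the new process's file descriptor table. But it will not be referenced, because programs do not normally rely on fixed file descriptor numbers.

Next, I explain the mechanism around the file structure's reference counter that deals with several threads working on the same file. Spoiler: this mechanism can be abused.

The reference counter in the open/close system calls

To understand how the reference counter in the file structure works, we first need to understand what the open/close system calls actually do. The behavior varies with the file being opened, but the following always holds.

The open system call:

Creates a file structure and sets its reference counter to 1
Registers it in the file descriptor table

Creating a file structure and setting its reference counter to 1: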

static struct file *__alloc_file(int flags, const struct cred *cred)
{
    struct file *f;
    int error;

    f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
    ......
    atomic_long_set(&f->f_count, 1);
    ......
    return f;
}
https://elixir.bootlin.com/linux/v5.6.19/source/fs/file_table.c#L96

Registering it in the file descriptor table (fd_install):

static long do_sys_openat2(int dfd, const char __user *filename,
                           struct open_how *how)
{
    ......
    fd = get_unused_fd_flags(how->flags);
    if (fd >= 0) {
        struct file *f = do_filp_open(dfd, tmp, &op);
        if (IS_ERR(f)) {
            put_unused_fd(fd);
            fd = PTR_ERR(f);
        } else {
            fsnotify_open(f);
            fd_install(fd, f);
        }
    }
    putname(tmp);
    return fd;
}
https://elixir.bootlin.com/linux/v5.6.19/source/fs/open.c#L1130

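The close system call:

Removes the entry from the file descriptor table
Decrements the file structure's reference counter (fput)

Removing the entry from the file descriptor table: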

int __close_fd(struct files_struct *files, unsigned fd)
{
    struct file *file;
    struct fdtable *fdt;

    spin_lock(&files->file_lock);
    fdt = files_fdtable(files);
    if (fd >= fdt->max_fds)
        goto out_unlock;
    file = fdt->fd[fd];
    if (!file)
        goto out_unlock;
    rcu_assign_pointer(fdt->fd[fd], NULL);
    __put_unused_fd(files, fd);
    spin_unlock(&files->file_lock);
    return filp_close(file, files);

out_unlock:
    spin_unlock(&files->file_lock);
    return -EBADF;
}
https://elixir.bootlin.com/linux/v5.6.19/source/fs/file.c#L626

Decrementing the file structure's reference counter (fput):

int filp_close(struct file *filp, fl_owner_t id)
{
    ......
    fput(filp);
    return retval;
}
https://elixir.bootlin.com/linux/v5.6.19/source/fs/open.c#L1239
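The split between "table entry" and "reference count" is visible from user space. In the sketch below, dup() adds a second table entry that points at the same file object, so closing the first descriptor removes one entry and drops one reference while the object stays alive. A pipe is used only so the example does not depend on any file on disk; the function name is hypothetical.

```c
#include <assert.h>
#include <unistd.h>

/* Returns 1 if data can still be read through the duplicate after the
 * original descriptor has been closed, -1 on any failure. */
static int toy_read_through_dup(void)
{
    int p[2];
    char c = 0;
    ssize_t n = -1;

    if (pipe(p) != 0)
        return -1;

    int copy = dup(p[0]);           /* second descriptor, same file object */
    if (copy < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    close(p[0]);                    /* one table entry gone, one reference dropped */

    if (write(p[1], "x", 1) == 1)
        n = read(copy, &c, 1);      /* the file object is still alive */

    close(copy);                    /* last reference to the read end */
    close(p[1]);
    return (n == 1 && c == 'x') ? 1 : -1;
}
```

Only when the last reference goes away does the kernel release the file structure, which is exactly the property the following sections rely on.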

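The functions to focus on here are fget()/fput() (although fget() is not used in the open system call). Thanks to this mechanism, if a file structure is obtained with fget(), the reference counter does not reach 0 even if the file is closed before fput() is called: the counter is 1 once the file is open, 2 after fget(), and still 1 after a close at that point. A file can therefore be closed while it is in use without causing problems.

For example, when a file is mapped into memory with mmap, it would be a problem if the file structure were freed before munmap is called, even if the descriptor has already been closed. mmap therefore uses fget() to keep it from being freed.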

unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
                              unsigned long prot, unsigned long flags,
                              unsigned long fd, unsigned long pgoff)
{
    struct file *file = NULL;
    unsigned long retval;

    if (!(flags & MAP_ANONYMOUS)) {
        audit_mmap_fd(fd, flags);
        file = fget(fd);
        ......
}
https://elixir.bootlin.com/linux/v5.6.19/source/mm/mmap.c#L1551
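The effect of that fget() can be checked from user space: a mapping stays usable after its descriptor is closed, because the mapping holds its own reference to the file. The following is a small sketch under the assumption that /tmp is writable; the function name and the temporary file template are made up for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 if the mapping is still readable after close(), -1 otherwise. */
static int toy_mapping_outlives_fd(void)
{
    char path[] = "/tmp/toy_mmap_XXXXXX";   /* hypothetical scratch file */
    const char msg[] = "still mapped";

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                            /* no name left on disk */

    if (write(fd, msg, sizeof(msg)) != (ssize_t)sizeof(msg)) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, sizeof(msg), PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                               /* descriptor gone; mapping keeps the file */
    if (map == MAP_FAILED)
        return -1;

    int ok = memcmp(map, msg, sizeof(msg)) == 0;
    munmap(map, sizeof(msg));                /* the mapping's reference is dropped here */
    return ok ? 0 : -1;
}
```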

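fdget(), which does not change the reference counter

There is also a pair of functions called fdget()/fdput() that is commonly used to obtain a reference to a file structure, particularly inside system call handlers.

For example, in the read system call the file structure is used between fdget() (fdget_pos()) and fdput() (fdput_pos()), as follows: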

ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
{
    struct fd f = fdget_pos(fd);
    ssize_t ret = -EBADF;

    if (f.file) {
        loff_t pos, *ppos = file_ppos(f.file);
        if (ppos) {
            pos = *ppos;
            ppos = &pos;
        }
        ret = vfs_read(f.file, buf, count, ppos);
        if (ret >= 0 && ppos)
            f.file->f_pos = pos;
        fdput_pos(f);
    }
    return ret;
}

SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
    return ksys_read(fd, buf, count);
}
https://elixir.bootlin.com/linux/v5.6.19/source/fs/read_write.c#L576

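Apparently for cache-line reasons, the file structure's reference counter should not be incremented and decremented too often. fdget() therefore skips the increment under certain conditions. Tracing the function shows that fdget() ends up calling __fget_light(), whose implementation is shown below.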

/*
 * Lightweight file lookup - no refcnt increment if fd table isn't shared.
 *
 * You can use this instead of fget if you satisfy all of the following
 * conditions:
 * 1) You must call fput_light before exiting the syscall and returning control
 *    to userspace (i.e. you cannot remember the returned struct file * after
 *    returning to userspace).
 * 2) You must not call filp_close on the returned struct file * in between
 *    calls to fget_light and fput_light.
 * 3) You must not clone the current task in between the calls to fget_light
 *    and fput_light.
 *
 * The fput_needed flag returned by fget_light should be passed to the
 * corresponding fput_light.
 */
static unsigned long __fget_light(unsigned int fd, fmode_t mask)
{
    struct files_struct *files = current->files;
    struct file *file;

    if (atomic_read(&files->count) == 1) {
        file = __fcheck_files(files, fd);
        if (!file || unlikely(file->f_mode & mask))
            return 0;
        return (unsigned long)file;
    } else {
        file = __fget(fd, mask, 1);
        if (!file)
            return 0;
        return FDPUT_FPUT | (unsigned long)file;
    }
}
https://elixir.bootlin.com/linux/v5.6.19/source/fs/file.c#L807

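As the comment says, this function may be used only when its conditions are satisfied. But what does "no refcnt increment if fd table isn't shared" mean?

In a multithreaded program, the file descriptor table is generally shared (files->count >= 2), and the same file descriptor refers to the same file in every thread. This makes it possible, for example, for another thread to call close while a read system call is executing. In that case, fdget() in the read system call must increment the reference counter.

But what about an ordinary single-threaded program? In that case, no other system call can run while read is executing, so nothing goes wrong even if the reference counter is not incremented. This is why the file structure's reference counter is not incremented unless the file descriptor table is shared.

Combining the bug with the fdget() rules

The bug is that a reference to the files_struct is stored in another structure, which the worker then uses, without incrementing the reference counter. As you may have noticed, this means that if the original program is single-threaded, the value of files->count is still 1 even though the file descriptor table is in fact shared (with the worker).

If files->count is 1, fdget() does not increment the file structure's reference counter. But the worker may in fact close the file descriptor associated with that file structure, which means the memory of the file structure obtained by fdget() may already have been freed.

To summarize the vulnerability:

The asynchronous I/O worker shares the files_struct with the calling thread, and the files_struct reference counter is not incremented when it does so.
Because fdget() does not increment the file structure's reference counter while the files_struct reference counter is 1, a file obtained through fdget() may already have been closed and freed by the worker (or by the calling thread).
Because the file structure has been freed, any code that still handles it (for example, a file-related system call) runs into a use-after-free.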
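The three points above can be replayed in a few lines of ordinary C. This is a toy model, not kernel code: all names (toy_file, toy_files, toy_fdget and so on) are hypothetical, and "freed" is just a flag so that the example never touches released memory. The worker_counted switch models, in spirit, what happens when the worker's reference to the table is counted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_file  { long f_count; bool freed; };
struct toy_files { int count; struct toy_file *fd[4]; };

/* Drop one reference; the flag stands in for the real free. */
static void toy_fput(struct toy_file *f)
{
    if (--f->f_count == 0)
        f->freed = true;
}

/* Mimics the rule in __fget_light(): pin the file only if the table looks shared. */
static struct toy_file *toy_fdget(struct toy_files *t, int fd, bool *need_put)
{
    struct toy_file *f = t->fd[fd];

    *need_put = false;
    if (f && t->count > 1) {
        f->f_count++;
        *need_put = true;
    }
    return f;
}

/* close(): clear the table slot, then drop the table's reference. */
static void toy_close(struct toy_files *t, int fd)
{
    struct toy_file *f = t->fd[fd];

    t->fd[fd] = NULL;
    if (f)
        toy_fput(f);
}

/* Returns true if the pointer held by the "syscall thread" was freed
 * underneath it by a close issued from the "worker". */
static bool toy_race(bool worker_counted)
{
    struct toy_file file = { .f_count = 1, .freed = false };
    struct toy_files table = { .count = 1, .fd = { &file } };
    bool need_put;

    if (worker_counted)
        table.count++;                   /* the worker's reference is counted */

    struct toy_file *held = toy_fdget(&table, 0, &need_put);  /* syscall thread */
    toy_close(&table, 0);                                     /* worker closes the fd */

    bool dangling = held->freed;         /* safe here: nothing is really freed */
    if (need_put)
        toy_fput(held);
    return dangling;
}
```

With worker_counted set to false the table count stays at 1, toy_fdget() takes no reference, and the close drops the file's count to zero while the caller still holds the pointer. With it set to true the lookup pins the file and the close becomes harmless.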

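PoC overview

The next step is to build a kernel exploit. I give only a rough outline here, not a detailed explanation.

If a code block like the following exists, a close issued on the worker side can trigger a use-after-free on the file structure (even better if a userfaultfd can be placed in between).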

void func(){
  struct fd f;
  f = fdget(); // refcount is not incremented.
  /*
  Play with f.file :)
  */
  fdput(f);
}

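Alternatively, the memory of the private_data member associated with the file structure (where each file type keeps its own data structure; many kinds exist) can be used, because it is freed as well. I exploited the bug by overwriting the memory of the map structure used by eBPF, getting kmalloc to hand out an overlapping allocation of the same size as the map structure.

The vulnerability appears to have been fixed by the following commit, which makes the code increment the files_struct reference counter: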

https://github.com/torvalds/linux/commit/0f2122045b946241a9e549c2a76cea54fa58a7ff

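A side note

While writing this blog post, I noticed something important. After I submitted my report, the following security issue was raised and assigned a CVE: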

https://bugs.chromium.org/p/project-zero/issues/detail?id=2089

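Evidently someone else reported a similar issue. My report had not received a timely response because the problem could not be reproduced, and the other issue appears to have been fixed first.

I basically reported the problematic code by file name and line number, but the report text and the PoC apparently left room for improvement.

Also, after reading the report at the URL above, I realized there is a simpler and more interesting way to exploit the bug, so I will say a few words about it.

Even while a worker is running, the files_struct reference counter is not incremented, so current->files->count of the thread calling the io_uring system calls is always 1. That is the root cause of the vulnerability. Moreover, when execve replaces the executable, the files_struct is reused if current->files->count == 1, as the following code shows.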

load_elf_binary()->begin_new_exec()->unshare_files()->unshare_fd()

static int unshare_fd(unsigned long unshare_flags, struct files_struct **new_fdp)
{
    struct files_struct *fd = current->files;
    int error = 0;

    if ((unshare_flags & CLONE_FILES) &&
        (fd && atomic_read(&fd->count) > 1)) {
        *new_fdp = dup_fd(fd, &error);
        if (!*new_fdp)
            return error;
    }

    return 0;
}
https://elixir.bootlin.com/linux/v5.6.19/source/kernel/fork.c#L2883

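In other words, if execve is called while the worker is running, the worker always ends up referencing the files_struct of the post-execve process. (In fact, I do not think it would be hard to get the same address back even if kmem_cache_free and kmem_cache_alloc were involved.)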
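The consequence of the count > 1 test can be shown with a toy model. Everything below is hypothetical naming for illustration: toy_unshare() copies the table only when another user is counted, which is the decision unshare_fd() makes, and the helper reports whether an uncounted raw pointer kept by a worker ends up aliasing the table used after exec.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_files { int count; };

/* Mirrors the decision in unshare_fd(): give the exec'ing task a private
 * copy only when somebody else is counted as a user of the table. */
static struct toy_files *toy_unshare(struct toy_files *cur, struct toy_files *spare)
{
    if (cur->count > 1) {
        spare->count = 1;           /* fresh copy for the new program */
        cur->count--;               /* the old table loses this task */
        return spare;
    }
    return cur;                     /* sole user: keep using the same table */
}

/* Does a worker's saved pointer alias the post-exec table? */
static bool toy_worker_aliases_post_exec(int counted_users)
{
    struct toy_files table = { .count = counted_users };
    struct toy_files copy = { .count = 0 };
    struct toy_files *worker_view = &table;      /* raw pointer held by the worker */

    struct toy_files *after_exec = toy_unshare(&table, &copy);
    return worker_view == after_exec;
}
```

With a single counted user (the buggy situation) the worker and the post-exec program look at the same table; with two counted users the new program gets its own copy and the worker keeps the old one.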

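The process after execve does not always have the same privileges as before. For example, executing a setuid binary (sudo, su, and so on) yields a privileged process after execve. So by stalling the worker's execution and then running sudo or a similar command, the worker can reference the file descriptor table (inside files_struct) of a privileged process.

Because cred (the process's credentials) and the like are inherited from the state before execve() (they are saved on the worker side as needed when the task is queued), files cannot be reopened with the privileged process's authority. Files opened by the privileged process itself, however, can be referenced from the worker.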

static void io_wq_switch_creds(struct io_worker *worker,
                               struct io_wq_work *work)
{
    const struct cred *old_creds = override_creds(work->creds);

    worker->cur_creds = work->creds;
    if (worker->saved_creds)
        put_cred(old_creds); /* creds set by previous switch */
    else
        worker->saved_creds = old_creds;
}
https://elixir.bootlin.com/linux/v5.6.19/source/fs/io-wq.c#L431

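This means LPE may be achievable by reading or writing file descriptors opened by a privileged process. (For example, if a shell script that will later run as root is open for writing, it can be used to escalate privileges.)

Incidentally, the file structure to read or write is obtained from the file descriptor before offloading, so the privileged process's file structures cannot be used that way. However, io_uring has a feature that lets file descriptors be registered on the execution context side, and they can be updated dynamically with the IORING_OP_FILES_UPDATE operation. That lets us once again fetch a file descriptor through the files_struct held by the execution context, which opens a window for stealing the privileged process's file descriptors.

I do not yet know whether a convenient executable exists that could really be used to exploit this. At the very least, sudo temporarily opens /etc/shadow with O_RDONLY, so with the right timing it seems the contents could be obtained.

Also, depending on the version, the file structure is sometimes updated by referencing memory in the privileged process (meaning that the update has to specify an address in the privileged process as the address of the file descriptor table). I therefore suspect ASLR gets in the way: the files_struct memory reuse has to go through a suid binary, and su/sudo are built as PIE. I suppose I can use that as my excuse. :)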

References

https://www.zerodayinitiative.com/advisories/ZDI-21-001/
https://github.com/torvalds/linux/commit/0f2122045b946241a9e549c2a76cea54fa58a7ff
https://bugs.chromium.org/p/project-zero/issues/detail?id=2089

Original article: https://flattsecurity.medium.com/cve-2021-20226-a-reference-counting-bug-which-leads-to-local-privilege-escalation-in-io-uring-e946bd69177a
