Analysing the First Rust CVE in the Linux Kernel

Author: 堇姬Naup

Preface

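"The first Rust CVE in the kernel" sounds like a big headline, so let's take a look at what kind of bug it actually is.
I've only written a little Rust and I'm not very familiar with it, so if you spot any mistakes, feel free to message me and I'll correct them.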

Binder & death recipient

The bug lives in Binder, Android's IPC mechanism that lets two processes communicate with each other.
The following article explains how Binder works in detail:
https://hackmd.io/@AlienHackMd/S1qm0vmK5#Binder---Native-Binder-%E5%8E%9F%E7%90%86

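More specifically, the problem is in the part of Binder that handles death notifications (DeathRecipient).
Binder follows a server-client model: when a server dies unexpectedly, every client that depends on it receives a notification.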

/// https://elixir.bootlin.com/linux/v6.18-rc1/source/drivers/android/binder/node.rs#L168

    /// List of processes to deliver a notification to when this node is destroyed (usually due to
    /// the process dying).
    death_list: List<DTRWrap<NodeDeath>, 1>,

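death_list is a linked list holding NodeDeath entries. Each NodeDeath represents one notification, so this list stores every notification that has to be delivered when the node is destroyed.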

/// https://elixir.bootlin.com/linux/v6.18-rc1/source/drivers/android/binder/node.rs#L896

/// Used to deliver notifications when a process dies.
///
/// A process can request to be notified when a process dies using `BC_REQUEST_DEATH_NOTIFICATION`.
/// This will make the driver send a `BR_DEAD_BINDER` to userspace when the process dies (or
/// immediately if it is already dead). Userspace is supposed to respond with `BC_DEAD_BINDER_DONE`
/// once it has processed the notification.
///
/// Userspace can unregister from death notifications using the `BC_CLEAR_DEATH_NOTIFICATION`
/// command. In this case, the kernel will respond with `BR_CLEAR_DEATH_NOTIFICATION_DONE` once the
/// notification has been removed. Note that if the remote process dies before the kernel has
/// responded with `BR_CLEAR_DEATH_NOTIFICATION_DONE`, then the kernel will still send a
/// `BR_DEAD_BINDER`, which userspace must be able to process. In this case, the kernel will wait
/// for the `BC_DEAD_BINDER_DONE` command before it sends `BR_CLEAR_DEATH_NOTIFICATION_DONE`.
///
/// Note that even if the kernel sends a `BR_DEAD_BINDER`, this does not remove the death
/// notification. Userspace must still remove it manually using `BC_CLEAR_DEATH_NOTIFICATION`.
///
/// If a process uses `BC_RELEASE` to destroy its last refcount on a node that has an active death
/// registration, then the death registration is immediately deleted (we implement this using the
/// `aborted` field). However, userspace is not supposed to delete a `NodeRef` without first
/// deregistering death notifications, so this codepath is not executed under normal circumstances.
#[pin_data]
pub(crate) struct NodeDeath {
    node: DArc<Node>,
    process: Arc<Process>,
    pub(crate) cookie: u64,
    #[pin]
    links_track: AtomicTracker<0>,
    /// Used by the owner `Node` to store a list of registered death notifications.
    ///
    /// # Invariants
    ///
    /// Only ever used with the `death_list` list of `self.node`.
    #[pin]
    death_links: ListLinks<1>,
    /// Used by the process to keep track of the death notifications for which we have sent a
    /// `BR_DEAD_BINDER` but not yet received a `BC_DEAD_BINDER_DONE`.
    ///
    /// # Invariants
    ///
    /// Only ever used with the `delivered_deaths` list of `self.process`.
    #[pin]
    delivered_links: ListLinks<2>,
    #[pin]
    delivered_links_track: AtomicTracker<2>,
    #[pin]
    inner: SpinLock<NodeDeathInner>,
}
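The ListLinks<1> / ListLinks<2> fields are what make this an intrusive list: each NodeDeath carries its own prev/next links, one set per list it may join, and the number tells a list which set belongs to it. Below is a minimal userspace sketch of that idea under my own assumptions: links are indices into an arena instead of raw pointers, and Links, HasLinks, List and Death are hypothetical names, not the kernel's kernel::list API.

```rust
// Hypothetical model (not the kernel's `kernel::list` API): links are arena
// indices, and the const parameter `ID` says which list a links field serves.
#[derive(Clone, Copy, Default, Debug, PartialEq)]
pub struct Links<const ID: u64> {
    pub prev: Option<usize>,
    pub next: Option<usize>,
}

/// Stand-in for `NodeDeath`: one links field per list it can be on.
#[derive(Default, Debug)]
pub struct Death {
    pub cookie: u64,
    pub death_links: Links<1>,     // only used by the node's `death_list`
    pub delivered_links: Links<2>, // only used by the process's `delivered_deaths`
}

/// Lets a list with a given ID find "its" links field inside an item.
pub trait HasLinks<const ID: u64> {
    fn links(&mut self) -> &mut Links<ID>;
}

impl HasLinks<1> for Death {
    fn links(&mut self) -> &mut Links<1> {
        &mut self.death_links
    }
}

impl HasLinks<2> for Death {
    fn links(&mut self) -> &mut Links<2> {
        &mut self.delivered_links
    }
}

#[derive(Default, Debug)]
pub struct List<const ID: u64> {
    pub head: Option<usize>,
    pub tail: Option<usize>,
}

impl<const ID: u64> List<ID> {
    /// Append `arena[idx]`, touching only its `Links<ID>` field.
    pub fn push_back<T: HasLinks<ID>>(&mut self, arena: &mut [T], idx: usize) {
        *arena[idx].links() = Links { prev: self.tail, next: None };
        match self.tail {
            Some(t) => arena[t].links().next = Some(idx),
            None => self.head = Some(idx),
        }
        self.tail = Some(idx);
    }

    /// Collect the arena indices in list order.
    pub fn indices<T: HasLinks<ID>>(&self, arena: &mut [T]) -> Vec<usize> {
        let mut out = Vec::new();
        let mut cur = self.head;
        while let Some(i) = cur {
            out.push(i);
            cur = arena[i].links().next;
        }
        out
    }
}
```

The same Death can sit on a List<1> and a List<2> at once without the two lists stepping on each other's pointers, which is why the struct's doc comments insist each links field is "only ever used" with one specific list.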

A process registers a NodeDeath with BC_REQUEST_DEATH_NOTIFICATION, and the NodeDeath is then attached to death_list.
This can be seen here:

/// https://elixir.bootlin.com/linux/v6.18-rc1/source/drivers/android/binder/process.rs#L1152

    pub(crate) fn request_death(
        self: &Arc<Self>,
        reader: &mut UserSliceReader,
        thread: &Thread,
    ) -> Result {
        let handle: u32 = reader.read()?;
        let cookie: u64 = reader.read()?;

        // Queue BR_ERROR if we can't allocate memory for the death notification.
        let death = UniqueArc::new_uninit(GFP_KERNEL).inspect_err(|_| {
            thread.push_return_work(BR_ERROR);
        })?;
        let mut refs = self.node_refs.lock();
        let Some(info) = refs.by_handle.get_mut(&handle) else {
            pr_warn!("BC_REQUEST_DEATH_NOTIFICATION invalid ref {handle}\n");
            return Ok(());
        };

        // Nothing to do if there is already a death notification request for this handle.
        if info.death().is_some() {
            pr_warn!("BC_REQUEST_DEATH_NOTIFICATION death notification already set\n");
            return Ok(());
        }

        let death = {
            let death_init = NodeDeath::new(info.node_ref().node.clone(), self.clone(), cookie);
            match death.pin_init_with(death_init) {
                Ok(death) => death,
                // error is infallible
                Err(err) => match err {},
            }
        };

        // Register the death notification.
        {
            let owner = info.node_ref2().node.owner.clone();
            let mut owner_inner = owner.inner.lock();
            if owner_inner.is_dead {
                let death = Arc::from(death);
                *info.death() = Some(death.clone());
                drop(owner_inner);
                death.set_dead();
            } else {
                let death = ListArc::from(death);
                *info.death() = Some(death.clone_arc());
                info.node_ref().node.add_death(death, &mut owner_inner);
            }
        }
        Ok(())
    }
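Stripped of allocation and handle lookup, the end of request_death is a two-way branch taken under the owner process's lock: if the owner is already dead, the notification is delivered right away (set_dead); otherwise it is parked on death_list (add_death). A rough userspace sketch with std::sync::Mutex, where OwnerInner, Registered and the u64 cookies are hypothetical stand-ins for the kernel types:

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the owner's `ProcessInner` state.
#[derive(Default)]
pub struct OwnerInner {
    pub is_dead: bool,
    pub death_list: Vec<u64>, // cookies of registered death notifications
}

#[derive(Debug, PartialEq)]
pub enum Registered {
    /// Owner already dead: notify immediately (like `set_dead()`).
    DeliveredNow,
    /// Owner alive: queued on `death_list` (like `add_death()`).
    Queued,
}

/// Simplified model of the final branch of `request_death`.
pub fn request_death(owner: &Mutex<OwnerInner>, cookie: u64) -> Registered {
    let mut inner = owner.lock().unwrap();
    if inner.is_dead {
        // The real code also drops the owner lock before calling set_dead().
        drop(inner);
        Registered::DeliveredNow
    } else {
        inner.death_list.push(cookie);
        Registered::Queued
    }
}
```

The part worth noticing is that the is_dead check and the insertion into death_list happen while holding the same lock; the bug is about what happens to entries once they are in that list.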

add_death appends the NodeDeath to the node's death_list:

/// https://elixir.bootlin.com/linux/v6.18-rc1/source/drivers/android/binder/node.rs#L323

    pub(crate) fn add_death(
        &self,
        death: ListArc<DTRWrap<NodeDeath>, 1>,
        guard: &mut Guard<'_, ProcessInner, SpinLockBackend>,
    ) {
        self.inner.access_mut(guard).death_list.push_back(death);
    }

When a death notification is cleared or otherwise has to be removed, it is handled here: the NodeDeath passes itself (self) to death_list.remove:

/// https://elixir.bootlin.com/linux/v6.18-rc1/source/drivers/android/binder/node.rs#L976

    pub(crate) fn set_cleared(self: &DArc<Self>, abort: bool) -> bool {
	...
            // SAFETY: A `NodeDeath` is never inserted into the death list of any node other than
            // its owner, so it is either in this death list or in no death list.
            unsafe { node_inner.death_list.remove(self) };
        }
        needs_queueing
}

The vulnerability

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=3e0ae02ba831da2b707905f4e602e43f8507b8cc

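The Rust compiler performs a great many checks at compile time. If safe code compiles, it should in principle be free of problems such as buffer overflows, data races, and dangling pointers.
However, the compiler's static analysis has limits, and not every operation can be verified at compile time. Rust's rule is that when the compiler cannot check something, the developer must explicitly use unsafe to tell the compiler "I have verified that this is safe". In the kernel community you are also expected to write a SAFETY comment explaining why the code is safe and which preconditions its safety depends on.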
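To make the unsafe/SAFETY convention concrete, here is a small userspace example (not kernel code; first_unchecked and first_or_zero are made-up names). The `# Safety` doc section states the precondition for callers, and each unsafe block carries a `// SAFETY:` comment arguing why the precondition holds at that spot. slice::get_unchecked is a real std method that skips the bounds check.

```rust
/// Returns the first element of a slice without a bounds check.
///
/// # Safety
///
/// The caller must guarantee that `v` is not empty.
pub unsafe fn first_unchecked(v: &[u32]) -> u32 {
    // SAFETY: the caller promises `v.len() >= 1`, so index 0 is in bounds.
    unsafe { *v.get_unchecked(0) }
}

/// Safe wrapper: it checks the precondition itself, so callers need no `unsafe`.
pub fn first_or_zero(v: &[u32]) -> u32 {
    if v.is_empty() {
        0
    } else {
        // SAFETY: we just checked that `v` is not empty.
        unsafe { first_unchecked(v) }
    }
}
```

If the argument in a SAFETY comment is wrong, the compiler will not notice; the comment is where a human reviewer checks the reasoning, exactly as with the death_list comment in set_cleared.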

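In the code that went wrong, the safety precondition is that a NodeDeath is only ever inserted into the death_list of its own owner node, so a NodeDeath is either in that death_list or in no list at all.
The bug is precisely that this precondition no longer held.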


// SAFETY: A `NodeDeath` is never inserted into the death list of any node other than
// its owner, so it is either in this death list or in no death list.
unsafe { node_inner.death_list.remove(self) };

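The cause lies in the release function.
It first moves death_list onto the stack with core::mem::take,
then drops the lock and walks this temporary list on the stack, calling set_dead on each entry.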

/// https://elixir.bootlin.com/linux/v6.18-rc1/source/drivers/android/binder/node.rs#L536

    pub(crate) fn release(&self) {
        let mut guard = self.owner.inner.lock();
        while let Some(work) = self.inner.access_mut(&mut guard).oneway_todo.pop_front() {
            drop(guard);
            work.into_arc().cancel();
            guard = self.owner.inner.lock();
        }
		
        let death_list = core::mem::take(&mut self.inner.access_mut(&mut guard).death_list);
        drop(guard);
        for death in death_list {
            death.into_arc().set_dead();
        }
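core::mem::take (also available as std::mem::take) swaps the value out and leaves Default::default() in its place. A tiny userspace illustration, with a VecDeque standing in for death_list and take_all as a hypothetical helper:

```rust
use std::collections::VecDeque;
use std::mem;

/// Hypothetical helper mirroring the old `release()`: move the whole list out,
/// leaving an empty (default) list behind in the original place.
pub fn take_all(list: &mut VecDeque<u64>) -> VecDeque<u64> {
    mem::take(list)
}
```

With an owning container like VecDeque this is harmless: the original is simply empty afterwards. In the kernel's intrusive list, however, each NodeDeath keeps its own links, so after the take the entries are physically linked into the stack list, while set_cleared still assumes they can only be in the node's own death_list.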

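The problem is now obvious: the SAFETY comment claims a NodeDeath can only ever be in its own node's death_list, but after the take it sits in a temporary list on the stack.
Now consider two threads:
Thread 1 is walking the temporary list without holding the lock.
Thread 2 holds the lock and calls remove on the original death_list (for example through set_cleared), trusting that the NodeDeath is either there or in no list.
Since the lock only protects the original list, nothing serializes the two threads, and both modify the prev/next pointers of the same nodes at the same time.
This corrupts the whole list, and thread 1 ends up following an invalid prev/next pointer and crashes.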
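To see why breaking that premise is fatal, here is a deterministic, single-threaded userspace model of the same sequence. It is a simplified sketch under my own assumptions: items live in an arena and links are indices rather than raw pointers, in_list stands in for the kernel's "links are non-null" check, and remove only roughly follows the kernel's list (which is circular). Links, List and their methods are hypothetical names.

```rust
/// Hypothetical intrusive list (NOT the kernel's `kernel::list::List`):
/// each item owns its prev/next links, the list only stores head/tail.
#[derive(Clone, Copy, Default, Debug, PartialEq)]
pub struct Links {
    pub prev: Option<usize>,
    pub next: Option<usize>,
    /// Stand-in for "links are non-null", i.e. the item is in *some* list.
    pub in_list: bool,
}

#[derive(Default, Debug)]
pub struct List {
    pub head: Option<usize>,
    pub tail: Option<usize>,
}

impl List {
    pub fn push_back(&mut self, items: &mut [Links], idx: usize) {
        items[idx] = Links { prev: self.tail, next: None, in_list: true };
        match self.tail {
            Some(t) => items[t].next = Some(idx),
            None => self.head = Some(idx),
        }
        self.tail = Some(idx);
    }

    /// Trusts the caller: if `idx` is in some list, it is assumed to be THIS one.
    /// That assumption is what the SAFETY comment in `set_cleared` states.
    pub fn remove(&mut self, items: &mut [Links], idx: usize) -> bool {
        let Links { prev, next, in_list } = items[idx];
        if !in_list {
            return false; // in no list at all: nothing to do
        }
        // Unlink from the neighbours, whichever list they really belong to.
        if let Some(p) = prev {
            items[p].next = next;
        }
        if let Some(n) = next {
            items[n].prev = prev;
        }
        // Head/tail are only fixed up if they point at `idx` in THIS list.
        if self.head == Some(idx) {
            self.head = next;
        }
        if self.tail == Some(idx) {
            self.tail = prev;
        }
        items[idx] = Links::default();
        true
    }

    pub fn indices(&self, items: &[Links]) -> Vec<usize> {
        let mut out = Vec::new();
        let mut cur = self.head;
        while let Some(i) = cur {
            out.push(i);
            cur = items[i].next;
        }
        out
    }
}
```

After moving the list out and then calling remove on the original (now empty) list, the original still looks empty, the walker on the stack list reaches item 0, finds its links cleared and stops, and items 1 and 2 are still marked as linked but unreachable. In this index model that is just a wrong answer; in the kernel, with raw pointers and thread 1 walking at the same time without the lock, it means following a null or stale prev/next pointer.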

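The patch is just as simple.
Instead of moving the whole list onto the stack with mem::take in one go and walking it unlocked,
it now takes the lock before each pop_front and releases it after the pop.
As a result, each NodeDeath is either still in its own node's death_list or has already been popped out of every list, which is exactly what the SAFETY comment assumes, so no other thread can modify the list behind the walker's back.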

-        let death_list = core::mem::take(&mut self.inner.access_mut(&mut guard).death_list);
-        drop(guard);
-        for death in death_list {
+        while let Some(death) = self.inner.access_mut(&mut guard).death_list.pop_front() {
+            drop(guard);
             death.into_arc().set_dead();
+            guard = self.owner.inner.lock();

Afterword

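After this quick review, you can see that an operating system has to interact with hardware, drivers, and other parts written in C, so it often needs fine-grained, low-level operations that Rust cannot check at compile time. That is why unsafe is needed, with the developer's guarantee standing in for the compiler's to ensure the code is safe.
Recently I've seen quite a few people hyping Rust. It does provide a lot of safety guarantees for the Linux kernel and other projects, but wherever the compiler cannot guarantee safety, human review is still required.
Also, although this is the first Rust CVE in the Linux kernel, its context is remarkably clean compared with other CVEs I have reviewed. Every piece of unsafe code has a developer who must vouch for its safety and spell out the preconditions it relies on. This confines the damage of a bug like this to a small area and makes it quick to pinpoint the root cause later, which I think is one of Rust's real strengths.
I've recently tried writing some Rust myself, and my gut feeling is that bringing Rust into the Linux kernel will raise other issues, but I won't go into that here. In any case, this is a good example for observing and learning how Rust is used in the Linux kernel and how its approach to safety differs from C.
Good fun ~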

One last complaint:
https://elixir.bootlin.com/linux
I find this site very clear for reading code, but its Rust code has no xref (cross-reference) support XD