Compiling the Linux Kernel and Adding a System Call

Posted on 2024-12-28 Edited on 2025-01-08

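Network setup

Bridged mode: see the referenced blog post
Switching yum to a mirror in China: see the referenced blog post
If archives or folders cannot be copied and pasted into the virtual machine, a shared folder works around the problem: see the referenced article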

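Kernel versus distribution

The kernel is the system software that provides device drivers, file systems, process management, network communication, and similar services. It is not a complete operating system, only its core. Organizations and vendors package the Linux kernel together with assorted software and documentation, and add an installer plus tools for configuring and administering the system; the result is a Linux distribution.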

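Compiling the kernel

Downloading the kernel

The kernel shipped with a distribution is an already compiled binary that cannot be modified directly, and the system usually does not include the complete kernel source, so the first step is to download the source.
I used the 4.19 series (4.19.325): on the download page, click "tarball".
Copy the downloaded linux-4.19.325.tar.xz into /usr/src/kernels, then unpack it there with tar:

sudo cp linux-4.19.325.tar.xz /usr/src/kernels
cd /usr/src/kernels
sudo tar xvJf linux-4.19.325.tar.xz # note: the J is uppercase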

Installing dependencies

sudo yum install elfutils-libelf-devel -y
sudo yum install ncurses-devel -y
sudo yum install openssl-devel -y

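Compiling and installing the kernel

cd /usr/src/kernels/linux-4.19.325
make mrproper        # remove leftovers from earlier builds (this also deletes .config)
make menuconfig      # generate the build configuration (.config)
make -j8             # build with 8 parallel jobs
make modules         # build the modules (the plain make above already builds them, so this is normally a no-op)
make modules_install # install the modules (needs root)
make install         # install the new kernel (needs root)
reboot               # boot into the new kernel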
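
After the reboot it is worth checking that the machine really is running the new kernel. Here is a minimal sketch that does what `uname -r` does, using the POSIX uname() call; the helper name running_kernel_release is my own and not part of any library.

```c
#include <string.h>
#include <sys/utsname.h>

/* Hypothetical helper: copy the running kernel's release string
 * (for example "4.19.325") into buf.
 * Returns 0 on success, -1 on error. */
int running_kernel_release(char *buf, size_t len)
{
    struct utsname u;

    if (len == 0 || uname(&u) != 0)
        return -1;

    /* strncpy does not always terminate the string, so do it by hand. */
    strncpy(buf, u.release, len - 1);
    buf[len - 1] = '\0';
    return 0;
}
```

Calling it from main() and printing the buffer should show a release string starting with 4.19.325 if the new kernel was selected at boot.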

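Adding a system call

A simple analogy

Think of a system call as ordering food in a restaurant:

User program = customer
System call = placing an order with the waiter
Kernel = kitchen
System call number = the dish's number on the menu
Arguments = special requests (for example, how salty)

Step by step

Preparation (the customer decides what to order):
- The program decides which service it needs (for example, reading a file)
- It prepares the arguments (such as the file name and the number of bytes to read)
- It looks up the matching system call number (like finding the dish number)
Making the call (calling the waiter):
- The program executes a special instruction (ringing the service bell)
- The CPU switches to privileged mode (the waiter comes to the table)
- The current state is saved (the order is written down)
Kernel processing (the kitchen at work):
- The handler function is looked up by system call number (the chef picks ingredients by dish number)
- The arguments are validated (can the request be satisfied?)
- The operation is carried out (cooking)
- The result is prepared (plating)
Returning the result (serving the dish):
- The result is stored
- The CPU switches back to unprivileged mode
- The saved state is restored
- The result is handed back to the program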
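
The steps above can be tried from user space without going through libc's write() wrapper. This is only a sketch: raw_write is a name I made up, and it relies on glibc's generic syscall() function, which takes the system call number followed by the arguments.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE           /* makes sure unistd.h declares syscall() */
#endif
#include <stddef.h>
#include <sys/syscall.h>      /* SYS_write */
#include <unistd.h>           /* syscall() */

/* Hypothetical helper: request the write service by number.
 * Returns the number of bytes written, or -1 with errno set. */
long raw_write(int fd, const void *buf, size_t count)
{
    /* SYS_write is the "dish number"; fd, buf and count are the
     * arguments that travel with the order. */
    return syscall(SYS_write, fd, buf, count);
}
```

For example, raw_write(1, "hi\n", 3) writes three bytes to standard output and returns 3.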

A real example: sys_write

In /usr/src/kernels/linux-4.19.325, include/linux/syscalls.h contains this declaration:

asmlinkage long sys_write(unsigned int fd, const char __user *buf, size_t count);

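What each part means

asmlinkage: a linkage marker for functions that are called from the assembly entry code. On 32-bit x86 it expands to __attribute__((regparm(0))), which makes the compiler take every argument from the stack; on x86_64 it does not change the calling convention, and arguments are passed in registers as usual
long:
- the return type
- on success, the number of bytes written
- a negative value is an error code
sys_write:
- the name of the system call
- the sys_ prefix marks it as a system call
- the core function that implements the write operation
Parameter list:
- unsigned int fd:
  - first parameter: the file descriptor
  - an unsigned integer
  - identifies the target of the write
- const char __user *buf:
  - second parameter: pointer to the data buffer
  - const means the buffer contents are not modified
  - __user means the pointer refers to user-space memory
  - holds the data to be written
- size_t count:
  - third parameter: the number of bytes to write
  - an unsigned integer
  - sets the length of the write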
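
The "negative value is an error code" convention is only visible inside the kernel. In user space, glibc's write() wrapper turns a negative return such as -EBADF into a return value of -1 and stores the positive code in errno. A small sketch of that, with a helper name (write_error_code) that is purely my own:

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: try to write one byte to fd.
 * Returns 0 on success, otherwise the errno value reported by libc. */
int write_error_code(int fd)
{
    char byte = 'x';

    if (write(fd, &byte, 1) == 1)
        return 0;

    /* The kernel returned -errno; libc stored the positive value here. */
    return errno;
}
```

Passing an invalid descriptor such as -1 yields EBADF, which matches the -EBADF that ksys_write() starts out with in the kernel source.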

arch/x86/entry/syscalls/syscall_64.tbl contains this entry:

1	common	write			__x64_sys_write

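What each field means

1:
- the system call number
- the unique identifier of the write system call
- fixed at 1 on x86_64
- such a low number reflects that write is one of the most basic system calls
common:
- the ABI column
- common means the entry is shared by the 64-bit ABI and the x32 ABI (the other possible values are 64 and x32)
- it does not mean the call exists on every architecture; this table is specific to x86-64
write:
- the name of the system call
- used to generate the __NR_write macro that user space sees
- corresponds to the write() function in libc
__x64_sys_write:
- the entry point, that is, the kernel function stored in the system call table
- x64 marks the 64-bit implementation
- the sys_ prefix marks it as a system call
- this is the function the kernel actually calls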

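Macro expansion

The following code is in fs/read_write.c. Following vfs_write any further leads into a lot more code; curious readers can dig in themselves, since the implementation details are not the focus here (and I can't follow all of it either).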

ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
{
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;

	if (f.file) {
		loff_t pos = file_pos_read(f.file);
		ret = vfs_write(f.file, buf, count, &pos);
		if (ret >= 0)
			file_pos_write(f.file, pos);
		fdput_pos(f);
	}

	return ret;
}


SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
		size_t, count)
{
	return ksys_write(fd, buf, count);
}

SYSCALL_DEFINE3 is a macro for defining a system call that takes three arguments. Its definition is in include/linux/syscalls.h:

#define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)

#define SYSCALL_DEFINEx(x, sname, ...)				\
	SYSCALL_METADATA(sname, x, __VA_ARGS__)			\
	__SYSCALL_DEFINEx(x, sname, __VA_ARGS__)

__SYSCALL_DEFINEx, in turn, is defined in linux-4.19.325/arch/x86/include/asm/syscall_wrapper.h:

#define __SYSCALL_DEFINEx(x, name, ...)					\
	asmlinkage long __x64_sys##name(const struct pt_regs *regs);	\
	ALLOW_ERROR_INJECTION(__x64_sys##name, ERRNO);			\
	static long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__));	\
	static inline long __do_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__));\
	asmlinkage long __x64_sys##name(const struct pt_regs *regs)	\
	{								\
		return __se_sys##name(SC_X86_64_REGS_TO_ARGS(x,__VA_ARGS__));\
	}								\
	__IA32_SYS_STUBx(x, name, __VA_ARGS__)				\
	static long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__))	\
	{								\
		long ret = __do_sys##name(__MAP(x,__SC_CAST,__VA_ARGS__));\
		__MAP(x,__SC_TEST,__VA_ARGS__);				\
		__PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__));	\
		return ret;						\
	}								\
	static inline long __do_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))

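The key points are:

The macro expansion chain:

SYSCALL_DEFINE3(write, ...)
→ SYSCALL_DEFINEx(3, _write, ...)
→ __SYSCALL_DEFINEx(3, _write, ...)

On x86_64:
- __SYSCALL_DEFINEx is redefined so that it generates functions with the __x64_sys prefix
- it creates a wrapper function that takes a struct pt_regs pointer
- that wrapper calls the actual system call implementation

So when the code says:

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf, size_t, count)

it actually expands into several functions:

// pt_regs wrapper (the system call table entry)
asmlinkage long __x64_sys_write(const struct pt_regs *regs);

// wrapper that casts and sign-extends the arguments
static long __se_sys_write(...);

// the actual implementation; the braces after SYSCALL_DEFINE3 become its body
static inline long __do_sys_write(...);

The system call table uses __x64_sys_write, which:
- extracts the arguments from the pt_regs structure
- calls __se_sys_write
- which in turn calls __do_sys_write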
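
To make the wrapper trick easier to see, here is a user-space toy that imitates the same pattern with a single macro. It is not kernel code: struct toy_regs, TOY_SYSCALL_DEFINE1 and every toy_* function are names invented for this sketch, and the real macros do considerably more (argument sign extension, metadata, 32-bit stubs).

```c
/* User-space toy, NOT kernel code: every toy_* name is made up. */
struct toy_regs {
    unsigned long di;   /* first argument, like pt_regs->di */
};

/* Forward-declares the implementation, emits a wrapper that unpacks
 * the argument from the register image, then opens the definition
 * of the implementation so the caller can supply its body. */
#define TOY_SYSCALL_DEFINE1(name, type, arg)                  \
    static long toy_do_sys_##name(type arg);                  \
    long toy_x64_sys_##name(const struct toy_regs *regs)      \
    {                                                         \
        return toy_do_sys_##name((type)regs->di);             \
    }                                                         \
    static long toy_do_sys_##name(type arg)

/* The braces below become the body of toy_do_sys_cube(). */
TOY_SYSCALL_DEFINE1(cube, int, num)
{
    return (long)num * num * num;
}
```

With regs.di set to 3, toy_x64_sys_cube(&regs) unpacks the argument and returns 27, which is the same division of labour as __x64_sys_write and __do_sys_write.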

The pt_regs structure

struct pt_regs is defined in /usr/src/kernels/linux-4.19.325/arch/x86/include/asm/ptrace.h:

struct pt_regs {
/*
 * C ABI says these regs are callee-preserved. They aren't saved on kernel entry
 * unless syscall needs a complete, fully filled "struct pt_regs".
 */
	unsigned long r15;
	unsigned long r14;
	unsigned long r13;
	unsigned long r12;
	unsigned long bp;
	unsigned long bx;
/* These regs are callee-clobbered. Always saved on kernel entry. */
	unsigned long r11;
	unsigned long r10;
	unsigned long r9;
	unsigned long r8;
	unsigned long ax;
	unsigned long cx;
	unsigned long dx;
	unsigned long si;
	unsigned long di;
/*
 * On syscall entry, this is syscall#. On CPU exception, this is error code.
 * On hw interrupt, it's IRQ number:
 */
	unsigned long orig_ax;
/* Return frame for iretq */
	unsigned long ip;
	unsigned long cs;
	unsigned long flags;
	unsigned long sp;
	unsigned long ss;
/* top of stack page */
};

Registers

The code that handles arguments passed in registers is in /usr/src/kernels/linux-4.19.325/arch/x86/entry/entry_64.S.

The comment above the entry point explains the conventions:

/*
 * 64-bit SYSCALL instruction entry. Up to 6 arguments in registers.
 *
 * This is the only entry point used for 64-bit system calls.  The
 * hardware interface is reasonably well designed and the register to
 * argument mapping Linux uses fits well with the registers that are
 * available when SYSCALL is used.
 *
 * SYSCALL instructions can be found inlined in libc implementations as
 * well as some other programs and libraries.  There are also a handful
 * of SYSCALL instructions in the vDSO used, for example, as a
 * clock_gettimeofday fallback.
 *
 * 64-bit SYSCALL saves rip to rcx, clears rflags.RF, then saves rflags to r11,
 * then loads new ss, cs, and rip from previously programmed MSRs.
 * rflags gets masked by a value from another MSR (so CLD and CLAC
 * are not needed). SYSCALL does not save anything on the stack
 * and does not change rsp.
 *
 * Registers on entry:
 * rax  system call number
 * rcx  return address
 * r11  saved rflags (note: r11 is callee-clobbered register in C ABI)
 * rdi  arg0
 * rsi  arg1
 * rdx  arg2
 * r10  arg3 (needs to be moved to rcx to conform to C ABI)
 * r8   arg4
 * r9   arg5
 * (note: r12-r15, rbp, rbx are callee-preserved in C ABI)
 *
 * Only called from user space.
 *
 * When user can change pt_regs->foo always force IRET. That is because
 * it deals with uncanonical addresses better. SYSRET has trouble
 * with them due to bugs in both AMD and Intel CPUs.
 */

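In short:

This is the only entry point for 64-bit system calls, and up to six arguments are passed in registers. SYSCALL instructions are inlined in libc implementations and in some other programs and libraries, and a few appear in the vDSO (virtual dynamic shared object), for example as a clock_gettimeofday fallback.

The SYSCALL instruction saves rip in rcx, clears rflags.RF, saves rflags in r11, and then loads new ss, cs, and rip values from MSRs (model-specific registers) that were programmed in advance. rflags is masked with a value from another MSR, so CLD and CLAC are not needed. SYSCALL pushes nothing on the stack and leaves rsp unchanged.

On entry, rax holds the system call number, rcx the return address, and r11 the saved rflags. The arguments arg0 to arg5 are in rdi, rsi, rdx, r10, r8, and r9. r10 stands in for rcx, which the C ABI uses for the fourth argument but which SYSCALL overwrites, so the value has to be moved to rcx to conform to the C ABI. r12 to r15, rbp, and rbx are callee-preserved in the C ABI.

The entry point is only reached from user space. Whenever the user can change pt_regs->foo, the kernel always forces a return through IRET, because IRET copes better with non-canonical addresses; SYSRET has trouble with them due to bugs in both AMD and Intel CPUs.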
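
The register convention described in that comment can be exercised directly from user space. The following is a sketch, assuming GCC or Clang extended inline assembly on x86_64; asm_write is a hypothetical helper name, and on other architectures the code simply falls back to glibc's syscall().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: write(2) issued with the SYSCALL instruction. */
long asm_write(int fd, const void *buf, size_t count)
{
#if defined(__x86_64__) && defined(__GNUC__)
    long ret;

    __asm__ volatile(
        "syscall"
        : "=a"(ret)                /* rax: return value             */
        : "0"((long)SYS_write),    /* rax: system call number       */
          "D"((long)fd),           /* rdi: arg0                     */
          "S"(buf),                /* rsi: arg1                     */
          "d"(count)               /* rdx: arg2                     */
        : "rcx", "r11", "memory"); /* SYSCALL overwrites rcx, r11   */
    return ret;                    /* raw value: -errno on failure  */
#else
    /* Other architectures: libc's generic wrapper (-1 plus errno). */
    return syscall(SYS_write, fd, buf, count);
#endif
}
```

Note the difference from the libc wrapper: on the inline assembly path a failure comes back as the raw negative error code, because nothing translates it into errno.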

ENTRY(entry_SYSCALL_64)
	UNWIND_HINT_EMPTY
	/*
	 * Interrupts are off on entry.
	 * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
	 * it is too small to ever cause noticeable irq latency.
	 */

	swapgs
	/*
	 * This path is only taken when PAGE_TABLE_ISOLATION is disabled so it
	 * is not required to switch CR3.
	 */
	movq	%rsp, PER_CPU_VAR(rsp_scratch)
	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp

	/* Construct struct pt_regs on stack */
	pushq	$__USER_DS			/* pt_regs->ss */
	pushq	PER_CPU_VAR(rsp_scratch)	/* pt_regs->sp */
	pushq	%r11				/* pt_regs->flags */
	pushq	$__USER_CS			/* pt_regs->cs */
	pushq	%rcx				/* pt_regs->ip */
GLOBAL(entry_SYSCALL_64_after_hwframe)
	pushq	%rax				/* pt_regs->orig_ax */

	PUSH_AND_CLEAR_REGS rax=$-ENOSYS

	TRACE_IRQS_OFF

	/* IRQs are off. */
	movq	%rax, %rdi
	movq	%rsp, %rsi

	/* clobbers %rax, make sure it is after saving the syscall nr */
	IBRS_ENTER

	call	do_syscall_64		/* returns with IRQs disabled */

	TRACE_IRQS_IRETQ		/* we're about to change IF */

	/*
	 * Try to use SYSRET instead of IRET if we're returning to
	 * a completely clean 64-bit userspace context.  If we're not,
	 * go to the slow exit path.
	 */
	movq	RCX(%rsp), %rcx
	movq	RIP(%rsp), %r11

	cmpq	%rcx, %r11	/* SYSRET requires RCX == RIP */
	jne	swapgs_restore_regs_and_return_to_usermode

	/*
	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
	 * in kernel space.  This essentially lets the user take over
	 * the kernel, since userspace controls RSP.
	 *
	 * If width of "canonical tail" ever becomes variable, this will need
	 * to be updated to remain correct on both old and new CPUs.
	 *
	 * Change top bits to match most significant bit (47th or 56th bit
	 * depending on paging mode) in the address.
	 */
#ifdef CONFIG_X86_5LEVEL
	ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
		"shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
#else
	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
#endif

	/* If this changed %rcx, it was not canonical */
	cmpq	%rcx, %r11
	jne	swapgs_restore_regs_and_return_to_usermode

	cmpq	$__USER_CS, CS(%rsp)		/* CS must match SYSRET */
	jne	swapgs_restore_regs_and_return_to_usermode

	movq	R11(%rsp), %r11
	cmpq	%r11, EFLAGS(%rsp)		/* R11 == RFLAGS */
	jne	swapgs_restore_regs_and_return_to_usermode

	/*
	 * SYSCALL clears RF when it saves RFLAGS in R11 and SYSRET cannot
	 * restore RF properly. If the slowpath sets it for whatever reason, we
	 * need to restore it correctly.
	 *
	 * SYSRET can restore TF, but unlike IRET, restoring TF results in a
	 * trap from userspace immediately after SYSRET.  This would cause an
	 * infinite loop whenever #DB happens with register state that satisfies
	 * the opportunistic SYSRET conditions.  For example, single-stepping
	 * this user code:
	 *
	 *           movq	$stuck_here, %rcx
	 *           pushfq
	 *           popq %r11
	 *   stuck_here:
	 *
	 * would never get past 'stuck_here'.
	 */
	testq	$(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
	jnz	swapgs_restore_regs_and_return_to_usermode

	/* nothing to check for RSP */

	cmpq	$__USER_DS, SS(%rsp)		/* SS must match SYSRET */
	jne	swapgs_restore_regs_and_return_to_usermode

	/*
	 * We win! This label is here just for ease of understanding
	 * perf profiles. Nothing jumps here.
	 */
syscall_return_via_sysret:
	IBRS_EXIT
	POP_REGS pop_rdi=0

	/*
	 * Now all regs are restored except RSP and RDI.
	 * Save old stack pointer and switch to trampoline stack.
	 */
	movq	%rsp, %rdi
	movq	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
	UNWIND_HINT_EMPTY

	pushq	RSP-RDI(%rdi)	/* RSP */
	pushq	(%rdi)		/* RDI */

	/*
	 * We are on the trampoline stack.  All regs except RDI are live.
	 * We can do future final exit work right here.
	 */
	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi

	popq	%rdi
	popq	%rsp
	USERGS_SYSRET64
END(entry_SYSCALL_64)

After the registers have been saved, the entry code calls do_syscall_64, which is defined in /usr/src/kernels/linux-4.19.325/arch/x86/entry/common.c:

#ifdef CONFIG_X86_64
__visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
{
	struct thread_info *ti;

	enter_from_user_mode();
	local_irq_enable();
	ti = current_thread_info();
	if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY)
		nr = syscall_trace_enter(regs);

	/*
	 * NB: Native and x32 syscalls are dispatched from the same
	 * table.  The only functional difference is the x32 bit in
	 * regs->orig_ax, which changes the behavior of some syscalls.
	 */
	nr &= __SYSCALL_MASK;
	if (likely(nr < NR_syscalls)) {
		nr = array_index_nospec(nr, NR_syscalls);
		regs->ax = sys_call_table[nr](regs);
	}

	syscall_return_slowpath(regs);
}
#endif

/usr/src/kernels/linux-4.19.325/usr/include/asm/unistd_64.h contains:

#define __NR_write 1

/usr/src/kernels/linux-4.19.325/arch/x86/entry/syscall_64.c contains:

/* this is a lie, but it does not hurt as sys_ni_syscall just returns -EINVAL */
extern asmlinkage long sys_ni_syscall(const struct pt_regs *);
#define __SYSCALL_64(nr, sym, qual) extern asmlinkage long sym(const struct pt_regs *);
#include <asm/syscalls_64.h>
#undef __SYSCALL_64

#define __SYSCALL_64(nr, sym, qual) [nr] = sym,

asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
	/*
	 * Smells like a compiler bug -- it doesn't work
	 * when the & below is removed.
	 */
	[0 ... __NR_syscall_max] = &sys_ni_syscall,
#include <asm/syscalls_64.h>
};

/usr/src/kernels/linux-4.19.325/arch/x86/include/generated/asm/syscalls_64.h contains:

#ifdef CONFIG_X86
__SYSCALL_64(1, __x64_sys_write, )
#else /* CONFIG_UML */
__SYSCALL_64(1, sys_write, )
#endif

Putting these pieces together (pseudocode):

// In the system call table:
sys_call_table[1] = __x64_sys_write;

// The call in do_syscall_64:
regs->ax = sys_call_table[nr](regs);
// is, for nr == 1, equivalent to:
regs->ax = __x64_sys_write(regs);
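
The table lookup is easier to follow in a scaled-down model. Below is a user-space toy, not kernel code: the toy_* names are invented, the table has four slots, and empty slots are handled with a NULL check instead of being pre-filled with sys_ni_syscall as the kernel does.

```c
#include <errno.h>
#include <stddef.h>

/* User-space toy, NOT kernel code: all toy_* names are hypothetical. */
struct toy_regs {
    unsigned long di;   /* first argument */
    long ax;            /* return value   */
};

typedef long (*toy_call_ptr_t)(const struct toy_regs *);

#define TOY_NR_CALLS 4

static long toy_sys_double(const struct toy_regs *regs)
{
    return (long)regs->di * 2;
}

/* Slots left NULL play the role of sys_ni_syscall. */
static const toy_call_ptr_t toy_call_table[TOY_NR_CALLS] = {
    [1] = toy_sys_double,
};

/* Same shape as do_syscall_64(): bounds-check nr, index the table,
 * and store the result in the saved register image. */
void toy_do_syscall(unsigned long nr, struct toy_regs *regs)
{
    regs->ax = -ENOSYS;   /* default when no handler is found */
    if (nr < TOY_NR_CALLS && toy_call_table[nr] != NULL)
        regs->ax = toy_call_table[nr](regs);
}
```

Dispatching number 1 with di = 21 leaves 42 in ax, while an unassigned or out-of-range number leaves -ENOSYS there, which is what the real entry code arranges with PUSH_AND_CLEAR_REGS rax=$-ENOSYS.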

The complete call chain

user-space write()
    → syscall instruction
        → system call entry (saves the register state into pt_regs)
            → do_syscall_64()
                → sys_call_table[nr](regs)
                    → __x64_sys_write(regs)
                        → __se_sys_write()
                            → __do_sys_write()
                                → ksys_write()

Adding a system call that computes the cube of a number

Defining the system call function

In /usr/src/kernels/linux-4.19.325/kernel/sys.c, add the following after the final #endif:

SYSCALL_DEFINE1(cube, int, num)
{
	/* Note: num * num * num overflows int for |num| > 1290. */
	int res = num * num * num;

	printk(KERN_INFO "The result is %d\n", res);
	return res;
}
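
One thing the handler above does not deal with is overflow: 1290 cubed still fits in a 32-bit int, but 1291 cubed does not. As a sketch of how the arithmetic could be guarded, here is a user-space version using the GCC/Clang builtin __builtin_mul_overflow. The helper name checked_cube and the choice of EOVERFLOW as the error code are my own assumptions, not something the kernel prescribes.

```c
#include <errno.h>

/* Hypothetical helper: store num cubed in *out and return 0,
 * or return -EOVERFLOW if the result does not fit in an int
 * (in which case *out holds no meaningful value). */
int checked_cube(int num, int *out)
{
    int square;

    /* __builtin_mul_overflow returns true when the mathematically
     * exact product does not fit in the destination type. */
    if (__builtin_mul_overflow(num, num, &square) ||
        __builtin_mul_overflow(square, num, out))
        return -EOVERFLOW;

    return 0;
}
```

For example, checked_cube(3, &r) stores 27 and returns 0, while checked_cube(1291, &r) returns -EOVERFLOW.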

Registering the system call number

Add the following line to /usr/src/kernels/linux-4.19.325/arch/x86/entry/syscalls/syscall_64.tbl. The new function then has system call number 350:

350	64	cube			__x64_sys_cube

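Compiling and installing the new kernel

Same as before: repeat the build and install steps from the kernel compilation section and reboot into the new kernel. Skipping make mrproper and make menuconfig this time keeps the existing .config and lets make rebuild only what changed.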

Testing

#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    // Check the number of command-line arguments
    if (argc != 2) {
        printf("Usage: %s <number>\n", argv[0]);
        return 1;
    }

    // Convert the command-line argument to an integer
    int input = atoi(argv[1]);

    // Invoke system call 350; syscall() returns long
    long ans = syscall(350, input);

    // Print the result
    printf("ans = %ld\n", ans);

    return 0;
}

gcc test.c -o test
./test 3
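
If the test program is run on a kernel that does not have the new system call (for instance after booting the old kernel by mistake), syscall() returns -1 and sets errno to ENOSYS. A slightly more defensive sketch follows; NR_CUBE and call_cube are names I made up. One caveat I am sure of: glibc treats raw return values from -4095 to -1 as error codes, so cubes in that range (inputs -15 to -1) come back as -1 with a bogus errno.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

#define NR_CUBE 350   /* the number added to syscall_64.tbl */

/* Hypothetical helper: returns the cube computed by the kernel.
 * On failure it returns -1 and errno says why (ENOSYS when the
 * running kernel has no system call with this number). */
long call_cube(int num)
{
    errno = 0;
    return syscall(NR_CUBE, num);
}
```

A caller can then check for a return value of -1 together with errno == ENOSYS and print a hint that the wrong kernel is running.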

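Execution flow of the call chain

User-space call phase
- The user program starts the system call through the syscall() function
- It passes system call number 350 and the argument 3
- The CPU switches from user mode to kernel mode
Kernel-space processing phase
- The kernel uses number 350 to look up the handler in the system call table
- It finds __x64_sys_cube(), the wrapper generated by SYSCALL_DEFINE1(cube, ...)
- The computation runs:
  res = num * num * num // 3 * 3 * 3 = 27
- printk() writes the result to the kernel log (visible with dmesg)
- The result is returned to user space
Return-to-user-space phase
- The CPU switches from kernel mode back to user mode
- The result reaches the user program as the return value of syscall()
- The user program prints it with printf()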

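The hierarchy of system call header files

Header locations in the kernel source tree

linux/
├── include/
│   ├── uapi/                    # user-space API headers
│   │   └── asm-generic/
│   │       └── unistd.h        # generic system call definitions
│   │
│   └── asm-generic/            # architecture-independent generic definitions
│       └── unistd.h
│
└── arch/
    └── x86/
        └── include/
            ├── uapi/
            │   └── asm/
            │       └── unistd.h         # selects one of the generated headers below
            └── generated/
                └── uapi/
                    └── asm/
                        ├── unistd_32.h  # 32-bit system calls (generated at build time)
                        ├── unistd_64.h  # 64-bit system calls (generated at build time)
                        └── unistd_x32.h # x32 ABI system calls (generated at build time)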

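Header inclusion relationships

User-space programs

#include <unistd.h>      // user-space interface, declares the syscall() wrapper
#include <sys/syscall.h> // SYS_* names for the system call numbers
    ↓
#include <asm/unistd.h>  // architecture-specific system call numbers (__NR_*)
    ↓
the matching unistd_*.h is selected for the target architecture and ABI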
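
A quick way to see these headers at work is to compare the SYS_write name from <sys/syscall.h> with the __NR_write macro from <asm/unistd.h>. This is just a sketch; the two helper function names are hypothetical.

```c
#include <asm/unistd.h>    /* __NR_* macros from the kernel's UAPI headers */
#include <sys/syscall.h>   /* SYS_* names provided by the C library */

/* Hypothetical helper: 1 when the libc name and the kernel macro agree. */
int write_numbers_agree(void)
{
    return SYS_write == __NR_write;
}

/* Hypothetical helper: the number itself, so a caller can print it. */
long write_number(void)
{
    return __NR_write;
}
```

On an x86_64 build write_number() is 1, matching the entry in syscall_64.tbl; other architectures use different numbers, which is exactly why the headers are selected per architecture.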

Kernel space

#include <uapi/asm-generic/unistd.h>  // generic system call definitions
    ↓
the concrete variant is selected by conditions such as __BITS_PER_LONG

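Layers of system call definitions

Generic layer (asm-generic)
- defines architecture-independent system calls
- provides default definitions
Architecture-specific layer (arch/x86)
- defines the system calls of a particular architecture
- may override the generic definitions
- contains architecture-specific optimizations
User-space interface layer (uapi)
- the stable API exposed to user space
- preserves ABI compatibility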

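Lookup order in practice

When looking for a system call definition:

Check the architecture-specific definitions first
- arch/x86/include/generated/uapi/asm/unistd_*.h (generated during the build from arch/x86/entry/syscalls/*.tbl)
If nothing is found there, look at the generic definitions
- include/uapi/asm-generic/unistd.h
Finally, look at the user-space headers
- /usr/include/asm/unistd.h

The purpose of this layered structure is to:

allow flexible implementations
support the specific needs of different architectures
maintain good compatibility
keep the code easy to manage and maintain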