Dissecting Linux Kernel Internals - procfs (/proc) and sysfs (/sys)

9 minute read | By: Ayoma Wijethunga

In this series of posts, I’m looking at different aspects of the Linux Kernel, starting from the less complicated areas, to understand the Linux Kernel a little better. During this effort, I pay special attention to security and highlight areas that might be important for security research and testing.

In this post, we look at the history, usage, internals and implementation details of procfs, which is responsible for the /proc mount, and sysfs, which is responsible for /sys.

procfs (/proc)

The /proc mount is part of procfs, a special in-memory Virtual File System (VFS) used to present (and modify) process information, kernel processes, and other system information in a hierarchical file-like structure. This layer is expected to provide convenient access to the said information, acting as an interface to internal data structures in the kernel.

The proc filesystem is a pseudo-filesystem which provides an interface to kernel data structures. It is commonly mounted at /proc. Typically, it is mounted automatically by the system, but it can also be mounted manually using a command such as: mount -t proc proc /proc

Most of the files in the proc filesystem are read-only, but some files are writable, allowing kernel variables to be changed.

Source: https://man7.org/linux/man-pages/man5/procfs.5.html
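The read-only versus writable distinction mentioned above can be checked from userspace by looking at permission bits. The following is a minimal sketch, assuming a Linux system with procfs mounted at /proc; the helper names describe_mode and classify_dir are illustrative and not part of any library.

```python
import os
import stat


def describe_mode(mode: int) -> str:
    """Classify a file mode as 'writable' if any write bit is set, else 'read-only'."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    return "writable" if mode & write_bits else "read-only"


def classify_dir(path: str) -> dict:
    """Map each regular file directly under `path` to 'writable' or 'read-only'."""
    result = {}
    for entry in os.scandir(path):
        try:
            if entry.is_file(follow_symlinks=False):
                mode = entry.stat(follow_symlinks=False).st_mode
                result[entry.name] = describe_mode(mode)
        except OSError:
            # Entries can vanish or be inaccessible; skip them in this sketch.
            continue
    return result


if __name__ == "__main__":
    target = "/proc/sys/kernel"  # assumption: a directory that usually holds tunables
    if os.path.isdir(target):
        for name, kind in sorted(classify_dir(target).items()):
            print(f"{kind:10} {target}/{name}")
```

The mode bits only show what the file advertises. Writing to a tunable still requires sufficient privileges, and the kernel may reject a write regardless of the mode.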

I will not go into details about the contents and structure of /proc. This is well described in the following pages:

However, the following are worth mentioning, since I have found them highly useful during security research/testing:

  • /proc/PID/cmdline: which contains the command which originally started the process
    • Read confidential information passed in as arguments
  • /proc/PID/environ: a file containing the names and contents of the environment variables that affect the process
    • Read confidential information passed in as environment variables (especially useful in containerized environments, since this is a common practice)
  • /proc/PID/mem: a binary image representing the process’s virtual memory, can only be accessed by a ptrace’ing process
  • /proc/PID/maps: the memory map showing which addresses currently visible to that process are mapped to which regions in RAM or to files
    • Useful in binary exploitation (offset calculations, etc.)
  • /proc/cpuinfo: containing information about the CPU
    • Identifying the CPU architecture. Useful in binary exploitation to create matching payloads.
  • /proc/version: containing the Linux kernel version, distribution number, gcc version number used to build the kernel and any other pertinent information relating to the version of the kernel currently running
    • Identifying operating system and kernel version information. Useful in binary exploitation to create matching payloads.
  • /proc/net/: a directory containing useful information about the network stack, in particular /proc/net/nf_conntrack, which lists existing network connections
    • Get information about network stack and connections.
  • /proc/modules: containing a list of the kernel modules currently loaded. It gives some indication (not always entirely correct) of dependencies.
    • Useful in binary exploitation to create matching payloads.
  • /proc/mounts: a symlink to self/mounts which contains a list of the currently mounted devices and their mount points
    • Get information about the mounted devices (for example: through LFI)
  • /proc/kcore: represents the physical memory of the system (kernel virtual address space region of memory) and is stored in the ELF core file format. (examined by gdb, objdump)
    • /dev/kmem: (not relevant to /proc) gives access to the kernel’s virtual memory space
    • /dev/mem: (not relevant to /proc) gives access to physical memory.
  • /proc/kmsg: used to hold messages generated by the kernel (picked by /bin/dmesg).
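Several of the files listed above have simple formats that are easy to parse during testing. The following is a minimal sketch, assuming the usual layouts: NUL-separated strings in cmdline and environ, and "start-end perms offset dev inode [path]" lines in maps. The function names are illustrative only.

```python
from pathlib import Path


def split_nul(raw: bytes) -> list:
    """Split a NUL-separated buffer (cmdline, environ) into strings.

    Empty items are dropped, so an empty command-line argument is lost here.
    """
    return [p.decode("utf-8", "replace") for p in raw.split(b"\0") if p]


def parse_environ(raw: bytes) -> dict:
    """Turn the contents of /proc/PID/environ into a name -> value mapping."""
    env = {}
    for item in split_nul(raw):
        key, _, value = item.partition("=")
        env[key] = value
    return env


def parse_maps_line(line: str) -> dict:
    """Parse one line of /proc/PID/maps into its main fields."""
    fields = line.split(None, 5)
    start, end = (int(x, 16) for x in fields[0].split("-"))
    return {
        "start": start,
        "end": end,
        "perms": fields[1],
        "offset": int(fields[2], 16),
        # Anonymous mappings have no path column.
        "path": fields[5].strip() if len(fields) > 5 else "",
    }


if __name__ == "__main__":
    proc_self = Path("/proc/self")  # the current process, readable without root
    if proc_self.exists():
        print(split_nul((proc_self / "cmdline").read_bytes()))
        print(sorted(parse_environ((proc_self / "environ").read_bytes()))[:5])
        first = (proc_self / "maps").read_text().splitlines()[0]
        print(parse_maps_line(first))
```

Reading these files for another process is subject to the kernel's permission checks, so the sketch only targets /proc/self.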

I will keep this list updated if I come across any additional important security usages in the future.

If you are keen to learn more about how procfs works, you can refer to the following documents and the implementation details below:

Implementation Details

procfs is just another Virtual Filesystem (VFS) from the Kernel’s point of view. Therefore, the Kernel can treat it much like any other file system. It is not my intention to discuss how VFS works in this post. The article “Virtual filesystems in Linux: Why we need them and how they work” discusses how userspace access is provided for the various types of filesystems commonly mounted on Linux systems.

VFS

Source: https://opensource.com/article/19/3/virtual-filesystems-linux

As explained in Overview of the Linux Virtual File System, a filesystem must first be registered using register_filesystem, declared in linux/fs.h, before it can be exposed through the VFS. The function proc_root_init(void) of /fs/proc/root.c does this [ref]. Also, notice the name of the root inode of the /proc tree in the proc_root struct of /fs/proc/root.c [ref].

Now let’s briefly look at how /proc/cpuinfo is implemented:

  • fs/proc/cpuinfo.c contains the relevant source. The last line of this file is an invocation of fs_initcall.

  • The initcall mechanism of the Kernel is further described in 0xax’s - Linux Inside - GitBook. do_initcalls(void) of init/main.c is where each initcall level is executed [ref]. fs_initcall is part of this process [ref].

  • During the fs_initcall, proc_cpuinfo_init of fs/proc/cpuinfo.c is executed.
  • proc_create(-) of fs/proc/generic.c is called from proc_cpuinfo_init.
  • proc_ops defined at [ref] is set to proc_dir_entry at [ref].
  • The said proc_ops defines proc_open to point to cpuinfo_open [ref].

  • cpuinfo_op referenced in cpuinfo_open is defined in each CPU architecture supported by the Kernel. If we consider x86 as the example, /arch/x86/kernel/cpu/proc.c defines this [ref].

  • cpuinfo_op ultimately uses show_cpuinfo to show CPU information [ref].

This is not the complete end-to-end flow, since such a detailed explanation is not the focus of this post. However, if you are interested in further details, you should now have a good number of pointers on where to look.

Following is a list of interesting references that further discuss the internals of procfs:
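The registration and dispatch chain described in the steps above can be sketched as a toy userspace model. This is not kernel code: the Python classes below only mimic the roles of proc_create, cpuinfo_open, cpuinfo_op and show_cpuinfo, the registry dictionary and read_proc helper are invented for illustration, and the CPU names are made-up sample data.

```python
class SeqFile:
    """Toy stand-in for the kernel's struct seq_file: just an output buffer."""

    def __init__(self):
        self.buf = []

    def printf(self, text):
        self.buf.append(text)


class CpuinfoOp:
    """Toy counterpart of cpuinfo_op: start/next/stop/show over a list of CPUs."""

    def __init__(self, cpus):
        self.cpus = cpus

    def start(self, pos):
        # Return the position to show, or None when iteration is finished.
        return pos if pos < len(self.cpus) else None

    def next(self, pos):
        return self.start(pos + 1)

    def stop(self):
        pass

    def show(self, m, pos):
        # Plays the role of show_cpuinfo: format one record into the seq_file.
        m.printf(f"processor\t: {pos}\nmodel name\t: {self.cpus[pos]}\n\n")


PROC_ENTRIES = {}  # hypothetical registry standing in for the proc_dir_entry tree


def proc_create(name, open_fn):
    """Register an open handler under a name, loosely like proc_create."""
    PROC_ENTRIES[name] = open_fn


def cpuinfo_open():
    """Loosely like cpuinfo_open: hand back the iterator operations."""
    return CpuinfoOp(["Example CPU A", "Example CPU B"])  # made-up data


def read_proc(name):
    """Drive the iterator the way a read on the proc entry would."""
    ops = PROC_ENTRIES[name]()
    m = SeqFile()
    pos = ops.start(0)
    while pos is not None:
        ops.show(m, pos)
        pos = ops.next(pos)
    ops.stop()
    return "".join(m.buf)


# Loosely what proc_cpuinfo_init does when its initcall runs.
proc_create("cpuinfo", cpuinfo_open)

if __name__ == "__main__":
    print(read_proc("cpuinfo"), end="")
```

The point of the model is the indirection: the entry only stores a handler, and the per-architecture operations decide what each record looks like when a read arrives.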

procfs in action

We can use strace to observe how procfs is used by different process-related commands. sudo strace ps would result in the following trace:

$ sudo strace ps

...

openat(AT_FDCWD, "/proc", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 5
fstat(5, {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f920df3e000
mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f920df1d000
getdents64(5, /* 169 entries */, 32768) = 4432
stat("/proc/1", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
openat(AT_FDCWD, "/proc/1/stat", O_RDONLY) = 6
read(6, "1 (systemd) S 0 1 1 0 -1 4194560"..., 1024) = 322
close(6)                                = 0
openat(AT_FDCWD, "/proc/1/status", O_RDONLY) = 6
read(6, "Name:\tsystemd\nUmask:\t0000\nState:"..., 1024) = 1024
read(6, "0000000,00000000,00000000,000000"..., 1024) = 263
close(6)                                = 0

...

stat("/proc/1676", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
openat(AT_FDCWD, "/proc/1676/stat", O_RDONLY) = 6
read(6, "1676 (sudo) S 1665 1676 1665 348"..., 2048) = 311
close(6)                                = 0
openat(AT_FDCWD, "/proc/1676/status", O_RDONLY) = 6
read(6, "Name:\tsudo\nUmask:\t0022\nState:\tS "..., 2048) = 1301
close(6)                                = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}) = 0
write(1, "    PID TTY          TIME CMD\n", 30    PID TTY          TIME CMD
) = 30
openat(AT_FDCWD, "/proc/tty/drivers", O_RDONLY) = 6
read(6, "/dev/tty             /dev/tty   "..., 9999) = 576
close(6)                                = 0
stat("/dev/pts0", 0x7ffcc4460d60)       = -1 ENOENT (No such file or directory)
stat("/dev/pts/0", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}) = 0
write(1, "   1676 pts/0    00:00:00 sudo\n", 31   1676 pts/0    00:00:00 sudo
) = 31

...

stat("/proc/1678", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
openat(AT_FDCWD, "/proc/1678/stat", O_RDONLY) = 6
read(6, "1678 (strace) R 1676 1676 1665 3"..., 2048) = 320
close(6)                                = 0
openat(AT_FDCWD, "/proc/1678/status", O_RDONLY) = 6
read(6, "Name:\tstrace\nUmask:\t0022\nState:\t"..., 2048) = 1305
close(6)                                = 0
stat("/dev/pts0", 0x7ffcc4460d60)       = -1 ENOENT (No such file or directory)
stat("/dev/pts/0", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}) = 0
write(1, "   1678 pts/0    00:00:00 strace"..., 33   1678 pts/0    00:00:00 strace
) = 33

stat("/proc/1681", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
openat(AT_FDCWD, "/proc/1681/stat", O_RDONLY) = 6
read(6, "1681 (ps) R 1678 1676 1665 34816"..., 2048) = 312
close(6)                                = 0
openat(AT_FDCWD, "/proc/1681/status", O_RDONLY) = 6
read(6, "Name:\tps\nUmask:\t0022\nState:\tR (r"..., 2048) = 1303
close(6)                                = 0
stat("/dev/pts0", 0x7ffcc4460d60)       = -1 ENOENT (No such file or directory)
stat("/dev/pts/0", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}) = 0
write(1, "   1681 pts/0    00:00:00 ps\n", 29   1681 pts/0    00:00:00 ps
) = 29

getdents64(5, /* 0 entries */, 32768)   = 0
close(5)                                = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?

Following are the highlights from the output:

  • openat is used to open the directory /proc
  • getdents64 is used to read the contents of the directory (which is then used to enumerate the directories found)
  • openat and read are used to open and read the contents of “/proc/PID/stat” and “/proc/PID/status”.
  • write is used to print the extracted data to the screen (/dev/pts/0 - which is the pseudo-terminal I’m connected to)
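The same enumerate-then-read pattern can be reproduced in a few lines. The following is a minimal sketch of a ps-like listing, assuming procfs is mounted at /proc and that the first fields of /proc/PID/stat are pid, comm (in parentheses), state and ppid; parse_stat and list_processes are illustrative names.

```python
import os


def parse_stat(text: str) -> dict:
    """Parse the leading fields of /proc/PID/stat.

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so the last ')' is used as the delimiter.
    """
    lparen = text.index("(")
    rparen = text.rindex(")")
    rest = text[rparen + 2:].split()
    return {
        "pid": int(text[:lparen]),
        "comm": text[lparen + 1:rparen],
        "state": rest[0],
        "ppid": int(rest[1]),
    }


def list_processes(proc_root: str = "/proc"):
    """Yield parsed stat records for every numeric directory under proc_root."""
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-process entries such as cpuinfo or net
        try:
            path = os.path.join(proc_root, name, "stat")
            with open(path, errors="replace") as f:
                yield parse_stat(f.read())
        except (OSError, ValueError):
            # The process may have exited between listing and reading.
            continue


if __name__ == "__main__":
    if os.path.isdir("/proc"):
        print(f"{'PID':>7} {'PPID':>7} S CMD")
        for p in sorted(list_processes(), key=lambda r: r["pid"]):
            print(f"{p['pid']:>7} {p['ppid']:>7} {p['state']} {p['comm']}")
```

Running this script under strace should show a similar sequence of directory reads followed by per-PID openat and read calls, although the exact syscalls depend on the Python build and C library.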

The use of VFS in procfs can also be observed with trace-bpfcc. Install BCC as instructed in INSTALL.md and run the following command to trace any invocations of show_cpuinfo. In a separate terminal, run the command cat /proc/cpuinfo:

$ sudo trace-bpfcc -K -U -I /usr/src/linux-headers-5.4.0-42/include/linux/proc_fs.h 'p::show_cpuinfo(struct seq_file *m, void *v) "CPU-Info Called"'

PID     TID     COMM            FUNC             -
3142    3142    cat             show_cpuinfo     CPU-Info Called
        b'show_cpuinfo+0x1 [kernel]'
        b'proc_reg_read+0x43 [kernel]'
        b'__vfs_read+0x1b [kernel]'
        b'vfs_read+0xab [kernel]'
        b'ksys_read+0x67 [kernel]'
        b'__x64_sys_read+0x1a [kernel]'
        b'do_syscall_64+0x57 [kernel]'
        b'entry_SYSCALL_64_after_hwframe+0x44 [kernel]'
        b'[unknown]'

In the meantime, if you leave vfsstat-bpfcc running you will observe high VFS usage at the time of /proc reads.

$ sudo vfsstat-bpfcc 
TIME         READ/s  WRITE/s  FSYNC/s   OPEN/s CREATE/s
03:57:04:        27       11        0        2        0
03:57:05:        18       11        0        0        0
03:57:06:        25       16        0        0        0
# Run: cat /proc/cpuinfo
03:57:07:        39       17        0       21        0
03:57:08:        18       11        0        0        0
03:57:11:        17       10        0        0        0
# Run: cat /proc/cpuinfo
03:57:12:        73       49        0       21        0
03:57:13:        17       10        0        0        0
03:57:16:        17       10        0        0        0
# Run: ps 
03:57:17:       287       33        0      254        0
03:57:18:        17       10        0        0        0
03:57:19:        17       10        0        0        0

Virtual filesystems in Linux: Why we need them and how they work discusses how you can use trace-bpfcc to observe sysfs (which we discuss next) and identify USB stick insertions.

sysfs (/sys)

Eventually, /proc got cluttered with non-process-related information, especially device and driver information. To make sure every device driver adheres to a common structure, and to make /proc less cluttered, /sys (sysfs) was introduced in Kernel 2.6.

sysfs is actually intended to simplify procfs by moving most of the hardware information out of procfs. For example, sysfs is used by udev to access device and device driver information.

If you are interested, please read the following Unix Stack Exchange answer to understand the difference between these filesystems and the historical reasons for having them.

sysfs exposes the readable and writable properties of what the kernel calls kobjects to userspace.

When you see a sysfs directory full of other directories, generally each of those directories corresponds to a kobject in the same kset.

Source: https://www.kernel.org/doc/Documentation/kobject.txt
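The directory-per-kobject, file-per-attribute layout described above makes sysfs easy to walk from userspace. The following is a minimal sketch, assuming sysfs is mounted at /sys and that /sys/class/net exists on the machine; read_attributes and list_children are illustrative helper names.

```python
import os


def read_attributes(directory: str, max_bytes: int = 4096) -> dict:
    """Read every regular file in a directory as a text attribute."""
    attrs = {}
    for entry in os.scandir(directory):
        if not entry.is_file(follow_symlinks=False):
            continue
        try:
            with open(entry.path, "rb") as f:
                raw = f.read(max_bytes)
            attrs[entry.name] = raw.decode("utf-8", "replace").strip()
        except OSError:
            # Some attributes are write-only or fail on read; skip them.
            continue
    return attrs


def list_children(directory: str) -> list:
    """List subdirectories (following symlinks), which usually map to kobjects."""
    return sorted(e.name for e in os.scandir(directory) if e.is_dir())


if __name__ == "__main__":
    net = "/sys/class/net"  # assumption: present on most Linux systems
    if os.path.isdir(net):
        for iface in list_children(net):
            attrs = read_attributes(os.path.join(net, iface))
            print(iface, attrs.get("address"), attrs.get("mtu"))
```

Only reads are performed here. Writing to a sysfs attribute changes kernel or device state and typically requires root.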

The following references will help you understand further details:

Era before procfs (and sysfs)

The following links point to the papers “The Process File System and Process Model in UNIX System V” and “Processes as Files”, which are the origins of procfs:

Process as Files

Source: http://lucasvr.gobolinux.org/etc/Killian84-Procfs-USENIX.pdf

Before procfs, debugging a process was done by going through /dev/kmem and/or /dev/mem with the help of the ptrace syscall. This required extensive knowledge of Kernel data structures, since the person debugging had to calculate the exact locations to read in order to capture certain information. Even though tools like ps use /proc now [ref], they used to go through /dev/mem before procfs existed (such operations also required root access).

What about MacOS?

MacOS doesn’t have procfs (/proc) or sysfs (/sys).

All such access to process or kernel information is done through other kernel interfaces, such as sysctl (exposed through the sysctl command). An excellent post about an attempt to port procfs to MacOS is available at http://web.archive.org/web/20200103161748/http://osxbook.com/book/bonus/ancient/procfs/.
