Architecture
Linux is a monolithic kernel with a modular design (e.g., it can insert and remove loadable kernel modules at runtime), supporting most features once only available in closed source kernels of non-free operating systems:
- concurrent computing and (with the availability of enough CPU cores for tasks that are ready to run) even true parallel execution of many processes at once (each of them having one or more threads of execution) on SMP and NUMA architectures;
- selection and configuration of hundreds of kernel features and drivers (using one of the "make *config" family of commands, before running compilation), modification of kernel parameters before booting (usually by inserting instructions into the lines of the GRUB2 menu), and fine tuning of kernel behavior at run-time (using the sysctl(8) interface to /proc/sys/);
- configuration (again using the make *config commands) and run-time modifications of the policies (via nice(2), setpriority(2) and the family of sched_*(2) syscalls) of the task schedulers that allow preemptive multitasking (both in user mode and, since the 2.6 series, in kernel mode); the Completely Fair Scheduler (CFS) has been the default scheduler of Linux since 2007 and it uses a red-black tree which can search, insert and delete process information (task_struct) with O(log n) time complexity, where n is the number of runnable tasks;
- advanced memory management with paged virtual memory;
- inter-process communication and synchronization mechanisms;
- a virtual filesystem on top of several concrete filesystems (ext4, Btrfs, XFS, JFS, FAT32, and many more);
- configurable I/O schedulers, the ioctl(2) syscall that manipulates the underlying device parameters of special files (it is a non-standard system call, since its arguments, return values, and semantics depend on the device driver in question), and support for POSIX asynchronous I/O (however, because it scales poorly with multithreaded applications, a family of Linux-specific I/O system calls (io_*(2)) had to be created for managing asynchronous I/O contexts suitable for concurrent processing);
- OS-level virtualization (with Linux-VServer), paravirtualization and hardware-assisted virtualization (with KVM or Xen, and using QEMU for hardware emulation); on the Xen hypervisor, the Linux kernel provides support for building Linux distributions (such as openSUSE Leap and many others) that work as Dom0, that is, virtual machine host servers that provide the management environment for the user's virtual machines (DomU);
- security mechanisms for discretionary and mandatory access control (SELinux, AppArmor, POSIX ACLs, and others);
- several types of layered communication protocols (including the Internet protocol suite).
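The run-time tuning interface mentioned in the list above can be explored by reading files under /proc/sys/. The following is a minimal sketch in Python, assuming a Linux system with procfs mounted. The helper name `read_sysctl` and its `root` parameter are illustrative and not part of any kernel or library API; the dotted-name convention mirrors the one accepted by sysctl(8), and names whose components themselves contain dots are not handled.

```python
import os

def read_sysctl(name, root="/proc/sys"):
    """Return the value of a sysctl parameter as a stripped string.

    `name` uses the dotted form, for example "kernel.ostype"; dots are
    translated to path separators. `root` is a hypothetical parameter
    that exists only so the helper can be pointed at another directory.
    """
    path = os.path.join(root, *name.split("."))
    with open(path, "r", encoding="ascii", errors="replace") as f:
        return f.read().strip()

if __name__ == "__main__":
    for key in ("kernel.ostype", "kernel.osrelease"):
        try:
            print(key, "=", read_sysctl(key))
        except OSError as exc:  # procfs may be absent or restricted
            print(key, "unavailable:", exc)
```

Writing to the same files changes the parameter, but that normally requires root privileges and is deliberately left out of this sketch.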
Device drivers and kernel extensions run in kernel space (ring 0 in many CPU architectures), with full access to the hardware, although some exceptions run in user space, for example, filesystems based on FUSE/CUSE, and parts of UIO. The graphics system most people use with Linux does not run within the kernel. Unlike in standard monolithic kernels, device drivers are easily configured as modules and loaded or unloaded while the system is running; they can also be pre-empted under certain conditions in order to handle hardware interrupts correctly and to better support symmetric multiprocessing. By choice, Linux has no stable device driver application binary interface.
Linux typically makes use of memory protection and virtual memory and can also handle non-uniform memory access; however, the project has absorbed μClinux, which also makes it possible to run Linux on microcontrollers without virtual memory.
The hardware is represented in the file hierarchy. User applications interact with device drivers via entries in the /dev or /sys directories. Process information is likewise mapped to the file system through the /proc directory.
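As a rough illustration of the file-based view described above, the sketch below inspects a device node and reads process information from procfs. It assumes a Linux system where /dev/null and /proc are available; the function names `describe_device` and `proc_status` are hypothetical helpers introduced here for illustration.

```python
import os
import stat

def describe_device(path):
    """Return (kind, major, minor) for a special file such as /dev/null."""
    st = os.stat(path)
    if stat.S_ISCHR(st.st_mode):
        kind = "char"
    elif stat.S_ISBLK(st.st_mode):
        kind = "block"
    else:
        kind = "other"
    # The major number selects the driver, the minor number the device.
    return kind, os.major(st.st_rdev), os.minor(st.st_rdev)

def proc_status(pid="self"):
    """Parse /proc/<pid>/status into a dict of field name -> value."""
    fields = {}
    with open(f"/proc/{pid}/status", encoding="ascii", errors="replace") as f:
        for line in f:
            key, _, value = line.partition(":")
            fields[key] = value.strip()
    return fields

if __name__ == "__main__":
    print(describe_device("/dev/null"))
    status = proc_status()
    print(status.get("Name"), status.get("Pid"), status.get("State"))
```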
| Mode | Layer | Components |
|---|---|---|
| User mode | User applications | bash, LibreOffice, GIMP, Blender, 0 A.D., Mozilla Firefox, ... |
| User mode | System components | Daemons: systemd, runit, logind, networkd, PulseAudio, ... Window manager: X11, Wayland, SurfaceFlinger (Android). Graphics: Mesa, AMD Catalyst, ... Other libraries: GTK+, Qt, EFL, SDL, SFML, FLTK, GNUstep, ... |
| User mode | C standard library | open(), exec(), sbrk(), socket(), fopen(), calloc(), ... (up to 2000 subroutines). glibc aims to be fast, musl and uClibc target embedded systems, bionic is written for Android, etc. All aim to be POSIX/SUS-compatible. |
| Kernel mode | Linux kernel: System Call Interface | stat, splice, dup, read, open, ioctl, write, mmap, close, exit, etc. (about 380 system calls). The Linux kernel System Call Interface (SCI) aims to be POSIX/SUS-compatible. |
| Kernel mode | Linux kernel: subsystems | Process scheduling subsystem, IPC subsystem, memory management subsystem, virtual files subsystem, network subsystem |
| Kernel mode | Linux kernel: other components | ALSA, DRI, evdev, LVM, device mapper, Linux Network Scheduler, Netfilter. Linux Security Modules: SELinux, TOMOYO, AppArmor, Smack |
| Hardware | | CPU, main memory, data storage devices, etc. |
Interfaces
Linux is a clone of UNIX, and aims towards POSIX and Single UNIX Specification compliance. The kernel also provides system calls and other interfaces that are Linux-specific. In order to be included in the official kernel, the code must comply with a set of licensing rules.
The Linux application binary interface (ABI) between the kernel and the user space has four degrees of stability (stable, testing, obsolete, removed); however, the system calls are expected to never change, so as not to break the userspace programs that rely on them.
Loadable kernel modules (LKMs), by design, cannot rely on a stable ABI. Therefore they must always be recompiled whenever a new kernel executable is installed in a system, otherwise they will not be loaded. In-tree drivers that are configured to become an integral part of the kernel executable (vmlinux) are statically linked by the building process.
There is also no guarantee of stability of the source-level in-kernel API and, because of this, device driver code, as well as the code of any other kernel subsystem, must be kept updated with kernel evolution. Any developer who makes an API change is required to fix any code that breaks as a result of their change.
Kernel-to-userspace API
The part of the Linux kernel API exposed to user applications is fundamentally composed of UNIX and Linux-specific system calls. A system call is an entry point into the Linux kernel. For example, among the Linux-specific ones there is the family of clone(2) system calls. Most extensions must be enabled by defining the _GNU_SOURCE macro in a header file or when the user-land code is being compiled.
System calls can only be invoked by using assembly instructions that enable the transition from unprivileged user space to privileged kernel space in ring 0. For this reason, the C standard library (libc) acts as a wrapper to most Linux system calls, by exposing C functions that, only when needed, transparently enter the kernel, which then executes on behalf of the calling process. For those system calls not exposed by libc, e.g. the fast userspace mutex (futex), the library provides a function called syscall(2) which can be used to explicitly invoke them.
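The generic syscall(2) wrapper can be reached from a high-level language as well. The sketch below calls it through Python's ctypes module to invoke getpid by number instead of by name. It is an illustration under stated assumptions: the table `SYS_GETPID` lists only the getpid numbers for 64-bit x86_64 and arm64 userspace, the function name `raw_getpid` is hypothetical, and the process is assumed to be linked against a libc that exports `syscall`.

```python
import ctypes
import os
import platform

# System call numbers are architecture-specific. Only two architectures
# are covered in this sketch; everything else is reported as unsupported.
SYS_GETPID = {"x86_64": 39, "aarch64": 172}

def raw_getpid():
    """Invoke getpid through libc's generic syscall() wrapper.

    Returns None when the architecture is not listed in SYS_GETPID or
    when the interpreter is not a 64-bit process.
    """
    number = SYS_GETPID.get(platform.machine())
    if number is None or ctypes.sizeof(ctypes.c_void_p) != 8:
        return None
    libc = ctypes.CDLL(None, use_errno=True)  # symbols of the running process
    libc.syscall.restype = ctypes.c_long
    return libc.syscall(ctypes.c_long(number))

if __name__ == "__main__":
    print("syscall(getpid):", raw_getpid(), "os.getpid():", os.getpid())
```

In ordinary programs the named wrapper (here `os.getpid()`) is preferable; the numeric form is mainly useful for system calls that the C library does not wrap.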
Pseudo filesystems (e.g., the sysfs and procfs filesystems) and special files (e.g., /dev/random, /dev/sda, /dev/tty, and many others) constitute another layer of interface to kernel data structures representing hardware or logical (software) devices.
Kernel-to-userspace ABI
Because of the differences existing between the hundreds of various implementations of the Linux OS, executable objects, even though they are compiled, assembled, and linked for running on a specific hardware architecture (that is, they use the ISA of the target hardware), often cannot run on different Linux distributions. This issue is mainly due to distribution-specific configurations and sets of patches applied to the code of the Linux kernel, and to differences in system libraries, services (daemons), filesystem hierarchies, and environment variables.
The main standard concerning application and binary compatibility of Linux distributions is the Linux Standard Base (LSB). However, the LSB goes beyond what concerns the Linux kernel, because it also defines the desktop specifications, the X libraries and Qt, which have little to do with it. The LSB version 5 is built upon several standards and drafts (POSIX, SUS, X/Open, Filesystem Hierarchy Standard (FHS), and others).
The parts of the LSB largely relevant to the kernel are the General ABI (gABI), especially the System V ABI and the Executable and Linking Format (ELF), and the Processor Specific ABI (psABI), for example the Core Specification for X86-64.
The standard ABI for how x86_64 user programs invoke system calls is to load the syscall number into the rax register, and the other parameters into rdi, rsi, rdx, r10, r8, and r9, and finally to put the syscall assembly instruction in the code.
In-kernel API
There are several kernel internal APIs utilized between the different subsystems. Some are available only within the kernel subsystems, while a somewhat limited set of in-kernel symbols (i.e., variables, data structures and functions) is also exposed to dynamically loadable modules (e.g., device drivers loaded on demand), provided they are exported with the EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL() macros (the latter reserved to modules released under a GPL-compatible license).
Linux provides in-kernel APIs that manipulate data structures (e.g., linked lists, radix trees, red-black trees, queues) or perform common routines (e.g., copy data from and to user space, allocate memory, print lines to the system log, and so on) that have remained stable at least since Linux version 2.6.
In-kernel APIs include libraries of low-level common services used by device drivers:
- SCSI Interfaces and libATA – respectively, a peer-to-peer packet based communication protocol for storage devices attached to USB, SATA, SAS, Fibre Channel, FireWire, and ATAPI devices, and an in-kernel library to support SATA host controllers and devices.
- Direct Rendering Manager (DRM) and Kernel Mode Setting (KMS) – for interfacing with GPUs and supporting the needs of modern 3D-accelerated video hardware, and for setting screen resolution, color depth and refresh rate
- DMA buffers (dma_buf) – for sharing buffers for hardware direct memory access across multiple device drivers and subsystems
- Video4Linux – for video capture hardware
- Advanced Linux Sound Architecture (ALSA) – for sound cards
- New API – for network interface controllers
- mac80211 – for wireless network interface controllers
In-kernel ABI
The Linux developers choose not to maintain a stable in-kernel ABI. Modules compiled for a specific version of the kernel cannot be loaded into another version without being recompiled, and even that assumes the source-level in-kernel API has remained the same; otherwise the module code must also be modified accordingly.
Technical features
Processes and threads
Linux creates processes by means of the clone(2) system call or the newer clone3(2). Depending on the given parameters, the new entity can share most or none of the resources of the caller. These syscalls can create new entities ranging from new independent processes (each having a special identifier called TGID within the task_struct data structure in kernel space, although that same identifier is called PID in userspace), to new threads of execution within the calling process (by using the CLONE_THREAD parameter). In this latter case the new entity has the same TGID as the calling process and consequently also the same PID in userspace.
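The shared identifier can be observed from userspace. In the following sketch, which assumes Linux and Python 3.8 or later, a worker thread reports the same PID as its creator while its native thread ID differs. The function name `thread_ids` is a hypothetical helper; `threading.get_native_id()` returns the thread ID assigned by the kernel.

```python
import os
import threading

def thread_ids():
    """Return (pid_seen_by_worker, caller_tid, worker_tid)."""
    result = {}

    def worker():
        # Runs in a separate thread of the same process.
        result["pid"] = os.getpid()                # shared by all threads
        result["tid"] = threading.get_native_id()  # unique per thread

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result["pid"], threading.get_native_id(), result["tid"]

if __name__ == "__main__":
    pid, caller_tid, worker_tid = thread_ids()
    print("PID:", pid, "caller TID:", caller_tid, "worker TID:", worker_tid)
```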
If the executable is dynamically linked to shared libraries, an interpreter (for ELF objects it is typically /lib/ld-linux.so.2) is used to find and load the needed objects, prepare the program to run and then run it.
The Native POSIX Thread Library, simply known as the NPTL, provides the standard POSIX threads interface (pthreads) to userspace. Whenever a new thread is created using the pthread_create(3) POSIX interface, the clone(2) family of system calls must also be given the address of the function that the new thread must jump to. The Linux kernel provides the futex(7) (acronym for "Fast user-space mutexes") mechanisms for fast user-space locking and synchronization; the majority of the operations are performed in userspace but it may be necessary to communicate with the kernel using the futex(2) system call.
A very special category of threads is the so-called kernel threads. They must not be confused with the above-mentioned threads of execution of the user's processes. Kernel threads exist only in kernel space and their only purpose is to concurrently run kernel tasks.
In contrast, whenever an independent process is created, the syscalls return exactly to the next instruction of the same program, concurrently in the parent process and in the child (i.e., one program, two processes). Different return values (one per process) enable the program to know in which of the two processes it is currently executing. Programs need this information because the child process, a few steps after process duplication, usually invokes the execve(2) system call (possibly via the family of exec(3) wrapper functions in glibc) and replaces the program that is currently being run by the calling process with a new program, with newly initialized stack, heap, and (initialized and uninitialized) data segments. When that is done, the result is two processes that run two different programs.
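The duplicate-then-replace pattern can be sketched with Python's os module, whose `fork` and `execvp` functions are thin wrappers over the corresponding C library calls. This is a minimal illustration assuming a Linux (or other POSIX) system; the function name `run_child` is hypothetical.

```python
import os
import sys

def run_child(argv):
    """Fork, replace the child's program with `argv`, return its exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: fork returned 0. Replace the current program image.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Reached only if exec failed; leave without parent cleanup.
            os._exit(127)
    # Parent: fork returned the child's PID. Wait for it to terminate.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)  # negative value: killed by a signal

if __name__ == "__main__":
    code = run_child([sys.executable, "-c", "print('hello from the child')"])
    print("child exited with", code)
```

The two branches after `os.fork()` correspond to the two return values described above: zero in the child and the child's PID in the parent.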
Depending on the effective user id (euid) and on the effective group id (egid), a process running with user zero privileges (root, the system administrator, owns the identifier 0) can do anything (e.g., kill all the other processes or recursively wipe out whole filesystems), whereas non-zero user processes cannot. Capabilities(7) divides the privileges traditionally associated with superuser into distinct units, which can be independently enabled and disabled by the parent process or dropped by the child itself.
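The capability sets of a process are visible as hexadecimal bit masks in /proc/&lt;pid&gt;/status. The sketch below decodes such a mask into names. It is deliberately partial: `CAP_BITS` lists only a handful of the bit numbers defined in the kernel's capability header, and the helper names `decode_caps` and `effective_caps` are hypothetical.

```python
# A few capability bit numbers; capabilities(7) documents the full list.
CAP_BITS = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    5: "CAP_KILL",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    21: "CAP_SYS_ADMIN",
}

def decode_caps(hex_mask):
    """Return the sorted names of the known capabilities set in `hex_mask`."""
    mask = int(hex_mask, 16)
    return sorted(name for bit, name in CAP_BITS.items() if mask & (1 << bit))

def effective_caps(pid="self"):
    """Decode the CapEff line of /proc/<pid>/status (Linux with procfs)."""
    with open(f"/proc/{pid}/status", encoding="ascii", errors="replace") as f:
        for line in f:
            if line.startswith("CapEff:"):
                return decode_caps(line.split()[1])
    return []

if __name__ == "__main__":
    try:
        print("effective capabilities (subset):", effective_caps())
    except OSError as exc:
        print("procfs unavailable:", exc)
```

An unprivileged process typically shows an all-zero effective mask, while a root process outside a restricted container shows most bits set.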
Scheduling and preemption
Linux enables different scheduling classes and policies. By default the kernel uses a scheduler mechanism called the Completely Fair Scheduler, introduced in the 2.6.23 version of the kernel. Internally this default scheduler class is also known as SCHED_OTHER, but the kernel also contains two POSIX-compliant real-time scheduling classes named SCHED_FIFO (realtime first-in-first-out) and SCHED_RR (realtime round-robin), both of which take precedence over the default class. An additional scheduling policy known as SCHED_DEADLINE, implementing the earliest deadline first algorithm (EDF), was added in kernel version 3.14, released on 30 March 2014. SCHED_DEADLINE takes precedence over all the other scheduling classes.
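A process can query its own policy and the priority ranges of the POSIX classes through the sched_*(2) family, which Python exposes in the os module on Linux. The following is a small sketch under that assumption; `describe_scheduling` is a hypothetical helper, and SCHED_DEADLINE is omitted because the os module does not define a constant for it.

```python
import os

# Policies named in the text that Python's os module exposes on Linux.
POLICIES = {
    os.SCHED_OTHER: "SCHED_OTHER",
    os.SCHED_FIFO: "SCHED_FIFO",
    os.SCHED_RR: "SCHED_RR",
}

def describe_scheduling(pid=0):
    """Report policy, nice value and priority ranges (pid 0 = caller)."""
    policy = os.sched_getscheduler(pid)
    ranges = {
        name: (os.sched_get_priority_min(p), os.sched_get_priority_max(p))
        for p, name in POLICIES.items()
    }
    return {
        # Unlisted policies (e.g. batch or idle) fall back to the raw number.
        "policy": POLICIES.get(policy, f"policy {policy}"),
        "nice": os.getpriority(os.PRIO_PROCESS, pid),
        "priority_ranges": ranges,
    }

if __name__ == "__main__":
    for key, value in describe_scheduling().items():
        print(key, "=", value)
```

Switching a process to SCHED_FIFO or SCHED_RR with `os.sched_setscheduler` normally requires elevated privileges, so the sketch only reads the current settings.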
Linux provides both user preemption as well as full kernel preemption. Preemption reduces latency, increases responsiveness, and makes Linux more suitable for desktop and real-time applications.
With user preemption, the kernel scheduler can replace the current process, by executing a context switch, with a different one that thereby acquires the computing resources for running (CPU, memory, and more). It does so according to the CFS algorithm (in particular, it uses a variable called vruntime for sorting processes), to the active scheduler policy and to the processes' relative priorities. With kernel preemption, the kernel can preempt itself when an interrupt handler returns, when kernel tasks block, and whenever a subsystem explicitly calls the schedule() function.
The Linux kernel patch PREEMPT_RT enables full preemption of critical sections, interrupt handlers, and "interrupt disable" code sequences. Partial integration of the real-time Linux patches brought the above-mentioned functionality to the kernel mainline.
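The role of vruntime can be illustrated with a toy model. The sketch below is not the kernel's implementation: it swaps the red-black tree for a binary heap (Python's heapq), ignores sleeping tasks, multiple CPUs and latency targets, and uses made-up task names and weights. It keeps only the core idea that the task with the smallest virtual runtime runs next and that virtual time advances more slowly for tasks with a larger weight.

```python
import heapq

NICE_0_WEIGHT = 1024  # reference weight used to scale virtual time

def simulate(weights, ticks, slice_ns=1_000_000):
    """Toy model: always run the task with the smallest vruntime.

    `weights` maps a task name to its load weight (illustrative values).
    Returns how many time slices each task received.
    """
    heap = [(0.0, name) for name in sorted(weights)]
    heapq.heapify(heap)
    counts = {name: 0 for name in weights}
    for _ in range(ticks):
        # Smallest vruntime first; the real scheduler takes the leftmost
        # node of its tree instead of popping a heap.
        vruntime, name = heapq.heappop(heap)
        counts[name] += 1
        # Heavier tasks accumulate virtual time more slowly.
        vruntime += slice_ns * NICE_0_WEIGHT / weights[name]
        heapq.heappush(heap, (vruntime, name))
    return counts

if __name__ == "__main__":
    print(simulate({"A": 2048, "B": 1024}, ticks=300))
```

With these hypothetical weights, task A receives roughly twice as many slices as task B, which is the proportional sharing the weighting is meant to produce.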
Memory management
Memory management in Linux is a complex topic. First of all, the kernel is not pageable (i.e., it is always resident in physical memory and cannot be swapped to the disk). In the kernel there is no memory protection (no SIGSEGV signals, unlike in userspace), therefore memory violations lead to instability and system crashes.
Supported architectures
While not originally designed to be portable, Linux is now one of the most widely ported operating system kernels, running on a diverse range of systems from the ARM architecture to IBM z/Architecture mainframe computers. The first port was performed on the Motorola 68000 platform. The modifications to the kernel were so fundamental that Torvalds viewed the Motorola version as a fork and a "Linux-like operating system". However, that moved Torvalds to lead a major restructure of the code to facilitate porting to more computing architectures. The first Linux that, in a single source tree, had code for more than i386 alone, supported the DEC Alpha AXP 64-bit platform.
Linux runs as the main operating system on IBM's Summit; as of October 2019, all of the world's 500 fastest supercomputers run some operating system based on the Linux kernel, a big change from 1998 when the first Linux supercomputer got added to the list.
Linux has also been ported to various handheld devices such as Apple's iPhone 3G and iPod.
Live patching
Rebootless updates can even be applied to the kernel by using live patching technologies such as Ksplice, kpatch and kGraft. Minimalistic foundations for live kernel patching were merged into the Linux kernel mainline in kernel version 4.0, which was released on 12 April 2015. Those foundations, known as livepatch and based primarily on the kernel's ftrace functionality, form a common core capable of supporting hot patching by both kGraft and kpatch, by providing an application programming interface (API) for kernel modules that contain hot patches and an application binary interface (ABI) for the userspace management utilities. However, the common core included in Linux kernel 4.0 supports only the x86 architecture and does not provide any mechanisms for ensuring function-level consistency while the hot patches are applied. As of April 2015, there is ongoing work on porting kpatch and kGraft to the common live patching core provided by the Linux kernel mainline.
Security
Kernel bugs present potential security issues. For example, they may allow for privilege escalation or create denial-of-service attack vectors. Over the years, numerous bugs affecting system security were found and fixed. New features are frequently implemented to improve the kernel's security.
Capabilities(7) have already been introduced in the section about processes and threads. Android makes use of them and systemd gives administrators detailed control over the capabilities of processes.
Linux offers a wealth of mechanisms to reduce kernel attack surface and improve security which are collectively known as the Linux Security Modules (LSM). They comprise the Security-Enhanced Linux (SELinux) module, whose code was originally developed and then released to the public by the NSA, and AppArmor among others. SELinux is now actively developed and maintained on GitHub. SELinux and AppArmor provide support for access control security policies, including mandatory access control (MAC), though they profoundly differ in complexity and scope.
Another security feature is the Seccomp BPF (SECure COMPuting with Berkeley Packet Filters) which works by filtering parameters and reducing the set of system calls available to user-land applications.
Critics have accused kernel developers of covering up security flaws or at least not announcing them; in 2008, Linus Torvalds responded to this with the following:
I personally consider security bugs to be just "normal bugs". I don't cover them up, but I also don't have any reason what-so-ever to think it's a good idea to track them and announce them as something special...one reason I refuse to bother with the whole security circus is that I think it glorifies—and thus encourages—the wrong behavior. It makes "heroes" out of security people, as if the people who don't just fix normal bugs aren't as important. In fact, all the boring normal bugs are way more important, just because there's [sic] a lot more of them. I don't think some spectacular security hole should be glorified or cared about as being any more "special" than a random spectacular crash due to bad locking.
Linux distributions typically release security updates to fix vulnerabilities in the Linux kernel. Many offer long-term support releases that receive security updates for a certain Linux kernel version for an extended period of time.