Kernel Source Study - 2

SMP

    Symmetric Multi-Processing, abbreviated SMP, refers to a symmetric multiprocessor architecture.

    SMP is also called UMA, short for "Uniform Memory Access".

In an SMP system all CPU processors are equal peers; on a typical multi-CPU server there is no master/slave relationship among the processors.

The processors share all device resources, and every resource (disk, memory, bus and so on) is equally accessible to each of them; all CPUs share the same physical memory, and accessing any given physical address takes every CPU the same amount of time.

SMP advantages:
1. They are a cost-effective way to increase throughput.
2. Because the operating system is shared by all processors, they present a single system image (easy to manage).
3. They can apply multiple processors to a single problem (parallel programming).
4. Load balancing is implemented by the operating system.
5. The uniprocessor (UP) programming model can be used on an SMP.
6. They scale well for shared data.
7. All data is addressable by all processors, and coherence is maintained by hardware snooping logic.
8. Because communication goes through global shared memory, no message-passing library is needed for inter-processor communication.

SMP limitations:
1. Scalability is limited by cache coherence, locking, shared objects and other issues.
2. New techniques are needed to exploit multiple processors, e.g. thread programming and device-driver programming.

Periodic load balancing:

The run queue of each CPU records the time of the next periodic load balance; once that time has passed, the SCHED_SOFTIRQ softirq is raised to perform load balancing.
scheduler_tick() -> trigger_load_balance().
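A simplified sketch of that tick-time check, loosely based on 3.x-era kernels (the real function also handles NOHZ idle balancing and null scheduling domains), looks like this:

//simplified sketch, not the exact kernel source: on every scheduler tick,
//compare jiffies against the run queue's next_balance timestamp and raise
//SCHED_SOFTIRQ when a periodic balance is due
void trigger_load_balance(struct rq *rq)
{
    if (time_after_eq(jiffies, rq->next_balance))
        raise_softirq(SCHED_SOFTIRQ);   //run_rebalance_domains() then runs from the softirq
}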

When the SMP load-balancing model is used:
While the kernel is running, there are also several situations in which the SMP load-balancing model is used to pick the best CPU to run on:

1. When process A wakes up process B, try_to_wake_up() decides which CPU process B should run on.

2. When a process calls the execve() system call.

3. When a forked child process is scheduled to run for the first time.

[Linux runtime tuning]: Linux exposes several important sysctls for tuning the scheduler at run time (units are nanoseconds, ns):
sched_child_runs_first: the child is scheduled before the parent after fork(); this is the default setting. If set to 0, the parent is scheduled first.

sched_min_granularity_ns: minimum preemption granularity for CPU-bound tasks.
sched_latency_ns: targeted preemption latency for CPU-bound tasks.
sched_stat_granularity_ns: granularity at which scheduler statistics are gathered.

struct sched_domain_topology_level {
sched_domain_mask_f mask;//function pointer that returns the cpumask of this SDTL level
sched_domain_flags_f sd_flags;//function pointer that returns the flags of this SDTL level
int flags;
int numa_level;
struct sd_data data;
#ifdef CONFIG_SCHED_DEBUG
char *name;
#endif
};
//how many cores could ever be present in the system
const struct cpumask *const cpu_possible_mask = to_cpumask(cpu_possible_bits);
EXPORT_SYMBOL(cpu_possible_mask);

static DECLARE_BITMAP(cpu_online_bits, CONFIG_NR_CPUS) __read_mostly;
//how many cores are currently online (running)
const struct cpumask *const cpu_online_mask = to_cpumask(cpu_online_bits);
EXPORT_SYMBOL(cpu_online_mask);

static DECLARE_BITMAP(cpu_present_bits, CONFIG_NR_CPUS) __read_mostly;
//how many cores are present and eligible to be brought online (some cores are hot-pluggable)
const struct cpumask *const cpu_present_mask = to_cpumask(cpu_present_bits);
EXPORT_SYMBOL(cpu_present_mask);

static DECLARE_BITMAP(cpu_active_bits, CONFIG_NR_CPUS) __read_mostly;
//how many cores are active (available to the scheduler)
const struct cpumask *const cpu_active_mask = to_cpumask(cpu_active_bits);
EXPORT_SYMBOL(cpu_active_mask);
//all four variables are bitmaps
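For reference, the per-level topology table built from this structure looks roughly like the following in common kernel versions (a hedged sketch; the exact entries depend on the kernel configuration):

//sketch of a default SDTL table: one entry per level (SMT, MC, DIE),
//each supplying the cpumask and flags functions described above
static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
    { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_MC
    { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
    { cpu_cpu_mask, SD_INIT_NAME(DIE) },
    { NULL, },
};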

RCU

    RCU stands for Read-Copy-Update, literally "read, copy, update"; it is an important synchronization mechanism in the kernel.
RCU principle
    RCU keeps track of all users of pointers to shared data. To modify the shared data, a writer first creates a copy and makes the change in the copy. Once all readers have left the critical section, the pointer is switched to the new, modified copy and the old data is deleted. A writer that deletes an object must wait until every reader that was accessing it has finished before the object can actually be destroyed. The key problem RCU solves is how to decide that all readers have finished; the time spent waiting for all readers to finish is called the grace period. Readers do not need to synchronize with writers directly, so readers and writers can run concurrently. RCU aims to minimize reader-side overhead as far as possible, which is why it is often used where reader performance matters most.
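A minimal sketch of this read/copy/update pattern (struct foo, the global pointer gp and the writer-side lock gp_lock are made-up names for illustration):

//hedged sketch of the basic RCU pattern; names are illustrative
struct foo {
    int a;
};

static struct foo __rcu *gp;        //RCU-protected pointer
static DEFINE_SPINLOCK(gp_lock);    //serializes writers only

int reader(void)
{
    struct foo *p;
    int val = 0;

    rcu_read_lock();                //enter read-side critical section (no lock, no atomic op)
    p = rcu_dereference(gp);        //fetch the protected pointer
    if (p)
        val = p->a;
    rcu_read_unlock();              //leave read-side critical section
    return val;
}

void writer(int new_a)
{
    struct foo *new_fp, *old_fp;

    new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
    if (!new_fp)
        return;
    new_fp->a = new_a;                              //modify the copy
    spin_lock(&gp_lock);
    old_fp = rcu_dereference_protected(gp, lockdep_is_held(&gp_lock));
    rcu_assign_pointer(gp, new_fp);                 //publish the new version
    spin_unlock(&gp_lock);
    synchronize_rcu();                              //wait for the grace period
    kfree(old_fp);                                  //reclaim the old version
}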
RCU advantages:

    Readers are cheap: they take no locks and execute no atomic instructions or memory barriers; there are no deadlocks, no priority inversion, no danger of memory leaks, and real-time latency is good.
RCU disadvantages:

    Writer-side overhead is relatively high: writers must be mutually excluded from one another, and combining RCU with other synchronization mechanisms is more complex.

RCU usage scenarios
    Every kind of lock has its own suitable scenario: a plain spinlock makes no reader/writer distinction and is a poor fit when the read/write mix is strongly skewed; the RW spinlock and seqlock address this, with the seqlock favoring writers and the RW spinlock favoring readers. RCU is appropriate when:
a. the protected data structure is dynamically allocated and accessed through a pointer;
b. the RCU-protected critical section must not sleep;
c. reads and writes are asymmetric: writer performance hardly matters, but reader performance must be extremely high;
d. readers are not sensitive to whether they see the old or the new data.
RCU suits workloads that read data frequently and modify it rarely, for example directory lookups in a file system, where directories are searched far more often than they are modified.

List operations
    RCU can protect more than plain pointers. The kernel also provides standard helpers so that doubly linked lists can be protected by the RCU mechanism; this is the most important use of RCU inside the kernel.
The good news about RCU-protected lists is that the ordinary list elements can still be used; only when traversing the list or modifying and deleting elements must the RCU variants of the standard functions be called. The names are easy to remember: append the _rcu suffix to the standard function name.

//RCU list API: kernel source /include/linux/rculist.h
static inline void list_add_rcu(struct list_head *new, struct list_head *head)
static inline void list_add_tail_rcu(struct list_head *new, struct list_head *head)
static inline void list_del_rcu(struct list_head *entry)
static inline void list_replace_rcu(struct list_head *old, struct list_head *new)

For a writer, the RCU operations are (see the sketch after this list):
1. rcu_assign_pointer: used by the writer in the removal/update phase; after the writer has allocated and updated the new version of the data, calling this function makes the RCU-protected pointer point to the new RCU-protected data.
2. synchronize_rcu: the writer side may be synchronous; after finishing the update it calls this function to wait until all reader threads on the old version have left their critical sections. Once the function returns, there are no references left to the old shared data and reclamation can proceed directly.
3. call_rcu: if the writer cannot block, it can call call_rcu instead, which registers a callback and returns immediately; the callback is invoked at an appropriate time to perform the reclamation.
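A hedged sketch tying the list helpers and call_rcu() together (struct item and the writer-side spinlock are illustrative names):

//hedged sketch of an RCU-protected list; struct item is illustrative
struct item {
    int key;
    int value;
    struct list_head link;
    struct rcu_head rcu;
};

static LIST_HEAD(item_list);
static DEFINE_SPINLOCK(item_lock);      //serializes writers only

int lookup_value(int key, int *out)     //reader
{
    struct item *it;
    int found = 0;

    rcu_read_lock();
    list_for_each_entry_rcu(it, &item_list, link) {
        if (it->key == key) {
            *out = it->value;           //use the object inside the read-side section
            found = 1;
            break;
        }
    }
    rcu_read_unlock();
    return found;
}

static void free_item(struct rcu_head *head)
{
    kfree(container_of(head, struct item, rcu));
}

void remove_item(struct item *it)       //writer
{
    spin_lock(&item_lock);
    list_del_rcu(&it->link);            //unlink; existing readers may still see it
    spin_unlock(&item_lock);
    call_rcu(&it->rcu, free_item);      //free after the grace period, without blocking
}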

removal:

    The writer allocates a new version of the shared data and updates it; when the update is done it points the RCU-protected pointer at the new version, which pushes the new version to the foreground. From that moment the foreground references of readers such as reader0 and reader1 are removed, and those readers continue to access the old version of the RCU-protected data.
reclamation:

    The shared data cannot keep two versions around, so the old version must be reclaimed at an appropriate time.

CPU hierarchy of a 32-core processor:
    On a 32-core processor the hierarchy has two levels. Level 0 contains two struct rcu_node instances, each managing 16 struct rcu_data structures, which are the per-CPU struct rcu_data instances of 16 CPUs. At level 1, a single struct rcu_node manages the two level-0 rcu_node instances; the level-1 rcu_node is called the root node, and the two level-0 rcu_node instances are leaf nodes.

Memory optimization barriers

    At run time, instructions are generally not executed in source-program order: both the compiler and the CPU reorder them to improve performance, so there are two kinds of optimization to consider: compiler optimization and CPU execution optimization.
An optimization barrier prevents the compiler's reordering, guaranteeing that instructions placed before the barrier are not moved to execute after it.

Compiler barrier:

Linux implements the optimization barrier with the barrier() macro; for gcc it is defined as follows;

Kernel source: include/linux/compiler-gcc.h.

#define barrier() __asm__ __volatile__("": : :"memory")
//__asm__ inserts an assembly fragment; __volatile__ keeps the compiler from optimizing the statement away or moving it, so the code uses exactly the location the programmer specified rather than some cached alias; "memory" tells the compiler that memory has been modified (clobbered).
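A small illustration (not kernel code) of what the compiler barrier buys: without barrier() the compiler is free to reorder or merge the two stores, since it sees no dependency between them; barrier() forbids moving the first store past it. Note that barrier() only constrains the compiler, not the CPU; ordering across CPUs needs the memory barriers of the next section.

//illustrative only: prevent the compiler from reordering the two stores
int data;
int flag;

void publish(void)
{
    data = 42;      //prepare the payload
    barrier();      //the compiler may not move the store to data below this point
    flag = 1;       //signal that data is ready
}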

CPU barriers:

CPU MEMORY BARRIERS
-------------------
The Linux kernel has eight basic CPU memory barriers:

	TYPE		MANDATORY	SMP CONDITIONAL
	===============	===============	===============
	GENERAL		mb()		smp_mb()
	WRITE		wmb()		smp_wmb()
	READ		rmb()		smp_rmb()
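A hedged sketch of the classic pairing of smp_wmb() on the writer CPU with smp_rmb() on the reader CPU (consume() is a hypothetical helper; real kernel code would also use READ_ONCE()/WRITE_ONCE()):

//illustrative producer/consumer pairing of smp_wmb()/smp_rmb()
static int payload;
static int ready;

void producer(void)     //runs on CPU 0
{
    payload = 42;
    smp_wmb();          //order the payload store before the ready store
    ready = 1;
}

void consumer(void)     //runs on CPU 1
{
    if (ready) {
        smp_rmb();              //order the ready load before the payload load
        consume(payload);       //consume() is a hypothetical helper
    }
}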

Memory layout

    The ARM64 architecture uses 48-bit physical addressing and can therefore address at most 256 TB of physical address space, which is plenty for current applications, so there is no need to extend to full 64-bit physical addressing. Virtual addresses also support at most 48 bits, so the architecture splits the virtual address space into two halves of up to 256 TB each; on most architectures Linux divides them into user space and kernel space.

User space: 0x0000_0000_0000_0000 - 0x0000_FFFF_FFFF_FFFF (only 48 bits are used)

Kernel space: 0xFFFF_0000_0000_0000 - 0xFFFF_FFFF_FFFF_FFFF

Address space regions:
linear mapping region
.data
.init
.text
modules
PCI I/O
vmemmap
non-canonical region
user space

.data:

data segment (kernel initialized global variables);
.init:

initialization data for most modules; this memory is freed once initialization finishes;
.text:

code segment (_text is the start address of the code segment, _etext the end address);
modules:

a 128 MB region of virtual address space used by kernel modules;
PCI I/O:

I/O address space for PCI devices;
vmemmap:

if physical memory is not contiguous there are holes in it; vmemmap is the virtual address region that holds the page structs describing sparse physical memory.
vmalloc:

virtual address space used by the vmalloc function.

1. User space
    An application allocates memory with malloc() and frees it with free(). malloc()/free() are interfaces provided by ptmalloc, the memory allocator in glibc;
ptmalloc requests memory from the kernel in page units via the brk or mmap system call and then splits it into small blocks that it hands out to the application.
2. Kernel space
    The virtual memory manager allocates virtual pages from the process's virtual address space: sys_brk grows or shrinks the heap, sys_mmap allocates virtual pages in the memory-mapping region,
and sys_munmap releases them. The page allocator hands out physical pages; the allocator used is the buddy allocator.
As an extension, the non-contiguous page allocator provides vmalloc to allocate memory and vfree to release it. When memory is fragmented, the chance of getting contiguous physical pages is low, so
non-contiguous physical pages can be allocated and mapped to contiguous virtual pages: the virtual addresses are contiguous while the physical addresses are not (see the sketch after this list).
The memory control group (cgroup) limits the memory that processes may use. When memory is so fragmented that no contiguous physical pages can be found, memory compaction obtains contiguous
physical pages by migrating pages. When memory runs low, page reclaim frees physical pages.
3. Hardware
    The MMU contains a page-table cache (TLB) that holds recently used translations, so that not every virtual-to-physical translation has to walk the page tables in memory. To bridge the gap between processor speed and memory speed, caches sit in between: the L1 cache is split into a data cache and an instruction cache, and the L2 cache coordinates between the L1 caches and memory.
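A minimal kernel-module sketch of vmalloc()/vfree(), assuming a hypothetical module name; the buffer is virtually contiguous but may be physically scattered:

//hedged sketch: allocate and free a virtually contiguous buffer with vmalloc()/vfree()
#include <linux/init.h>
#include <linux/module.h>
#include <linux/vmalloc.h>

static void *buf;

static int __init vmalloc_demo_init(void)
{
    buf = vmalloc(4 * 1024 * 1024);     //4 MiB, contiguous in virtual address space only
    if (!buf)
        return -ENOMEM;
    pr_info("vmalloc buffer at %p\n", buf);
    return 0;
}

static void __exit vmalloc_demo_exit(void)
{
    vfree(buf);
}

module_init(vmalloc_demo_init);
module_exit(vmalloc_demo_exit);
MODULE_LICENSE("GPL");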

Memory management

    The heap is the region of a process used mainly for dynamically allocated variables and data; its management is not directly visible to the programmer, because it relies on helper functions of the standard library (the most important being malloc) to allocate memory regions of arbitrary length. The classic interface between malloc and the kernel is the brk system call, which grows or shrinks the heap.
    The heap is a contiguous memory region that grows upward (bottom to top) when extended. The mm_struct structure (include/linux/mm_types.h) holds the start and current end addresses of the heap in the virtual address space (start_brk and brk).
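A user-space sketch of the heap growing through the program break (glibc may also satisfy allocations from existing arenas or via mmap, so the observed movement is only illustrative):

//hedged sketch: observe the program break (brk) before and after a heap allocation
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *before = sbrk(0);         //current end of the heap
    void *p = malloc(64 * 1024);    //small enough that ptmalloc usually extends the heap
    void *after = sbrk(0);

    printf("brk before: %p, after: %p\n", before, after);
    free(p);
    return 0;
}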

    A process's virtual address space is described mainly by two data structures: the top-level mm_struct and, below it, vm_area_struct. The top-level mm_struct describes a process's entire virtual address space; the lower-level structure describes one interval of that space (a virtual memory area). Each process has exactly one mm_struct, and each process's task_struct contains a member pointing to it; mm_struct is the description of the whole user address space.

    When a memory mapping is created, a virtual memory area is allocated in the process's user virtual address space. The kernel defers allocating physical memory: the first time the process touches a virtual page, a page fault is raised. For a file mapping, a physical page is allocated, the data of the specified file range is read into it, and the virtual page is then mapped to the physical page in the page table. For an anonymous mapping, a physical page is allocated and the virtual page is mapped to it in the page table.

    Two processes can share memory through a shared file mapping. Anonymous mappings are usually private; a shared anonymous mapping can only exist between a parent process and its children. In a process's virtual address space the code and data segments are private file mappings, while the uninitialized data segment, heap and stack are private anonymous mappings.
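A user-space sketch of the parent/child case mentioned above: a shared anonymous mapping created before fork() is the memory both processes keep sharing:

//hedged sketch: parent and child share one page through a shared anonymous mapping
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int *counter = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    if (counter == MAP_FAILED)
        return 1;

    *counter = 0;
    if (fork() == 0) {      //child writes through the shared page
        *counter = 42;
        _exit(0);
    }
    wait(NULL);
    printf("parent sees %d\n", *counter);   //prints 42
    munmap(counter, sizeof(int));
    return 0;
}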

struct mm_struct {
struct vm_area_struct *mmap;//linked list of virtual memory areas /* list of VMAs */
struct rb_root mm_rb;//red-black tree of virtual memory areas
u32 vmacache_seqnum; /* per-thread vmacache */
#ifdef CONFIG_MMU//find an unmapped area in the memory mapping region
unsigned long (*get_unmapped_area) (struct file *filp,
unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags);
#endif
unsigned long mmap_base; //start address of the memory mapping region /* base of mmap area */
unsigned long mmap_legacy_base; /* base of mmap area in bottom-up allocations */
unsigned long task_size; //length of the user virtual address space /* size of task vm space */
unsigned long highest_vm_end; /* highest vma end address */
pgd_t * pgd;//points to the page global directory, i.e. the first-level page table
atomic_t mm_users; //number of processes sharing this user virtual address space /* How many users with user space? */
atomic_t mm_count; //reference count of the memory descriptor /* How many references to "struct mm_struct" (users count as 1) */
atomic_long_t nr_ptes; /* PTE page table pages */
atomic_long_t nr_pmds; /* PMD page table pages */
int map_count; /* number of VMAs */

spinlock_t page_table_lock; /* Protects page tables and some counters */
struct rw_semaphore mmap_sem;

struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
* together off init_mm.mmlist, and are protected
* by mmlist_lock
*/


unsigned long hiwater_rss;//maximum number of page frames owned by the process /* High-watermark of RSS usage */
unsigned long hiwater_vm;//maximum number of pages in the process's VMAs /* High-water virtual memory usage */

unsigned long total_vm; //size of the process address space in pages /* Total pages mapped */
unsigned long locked_vm; //number of locked pages /* Pages that have PG_mlocked set */
unsigned long pinned_vm; /* Refcount permanently increased */
unsigned long shared_vm; /* Shared pages (files) */
unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE */
unsigned long stack_vm; /* VM_GROWSUP/DOWN */
unsigned long def_flags;
//start and end addresses of the code segment; start and end addresses of the data segment
unsigned long start_code, end_code, start_data, end_data;
//start and end addresses of the heap; start address of the stack
unsigned long start_brk, brk, start_stack;
//start and end addresses of the argument strings; start and end addresses of the environment variables
unsigned long arg_start, arg_end, env_start, env_end;

unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */

/*
* Special counters, in some configurations protected by the
* page_table_lock, in other configurations by being atomic.
*/
struct mm_rss_stat rss_stat;

struct linux_binfmt *binfmt;

cpumask_var_t cpu_vm_mask_var;

/* Architecture-specific MM context */
mm_context_t context;//architecture-specific memory management context

unsigned long flags; /* Must use atomic bitops to access the bits */

struct core_state *core_state; /* coredumping support */
#ifdef CONFIG_AIO
spinlock_t ioctx_lock;
struct kioctx_table __rcu *ioctx_table;
#endif
#ifdef CONFIG_MEMCG
/*
* "owner" points to a task that is regarded as the canonical
* user/owner of this mm. All of the following must be true in
* order for it to be changed:
*
* current == mm->owner
* current->mm != mm
* new_owner->mm == mm
* new_owner->alloc_lock is held
*/
struct task_struct __rcu *owner;
#endif

/* store ref to file /proc/<pid>/exe symlink points to */
struct file *exe_file;
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_mm *mmu_notifier_mm;
#endif
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
#ifdef CONFIG_CPUMASK_OFFSTACK
struct cpumask cpumask_allocation;
#endif
#ifdef CONFIG_NUMA_BALANCING
/*
* numa_next_scan is the next time that the PTEs will be marked
* pte_numa. NUMA hinting faults will gather statistics and migrate
* pages to new nodes if necessary.
*/
unsigned long numa_next_scan;

/* Restart point for scanning and setting pte_numa */
unsigned long numa_scan_offset;

/* numa_scan_seq prevents two threads setting pte_numa */
int numa_scan_seq;
#endif
#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
/*
* An operation with batched TLB flushing is going on. Anything that
* can move process memory needs to flush the TLB when moving a
* PROT_NONE or PROT_NUMA mapped page.
*/
bool tlb_flush_pending;
#endif
struct uprobes_state uprobes_state;
#ifdef CONFIG_X86_INTEL_MPX
/* address of the bounds directory */
void __user *bd_addr;
#endif
};
struct vm_area_struct {
/* The first cache line has the info for VMA tree walking. */
//start address of this area and the address of the first byte after its end
unsigned long vm_start; /* Our start address within vm_mm. */
unsigned long vm_end; /* The first byte after our end address
within vm_mm. */

/* linked list of VM areas per task, sorted by address */
struct vm_area_struct *vm_next, *vm_prev;//doubly linked list of the task's VMAs
//a linked list alone would make searching slow, so a red-black tree is used as well; each VMA is inserted into the tree as a node.
struct rb_node vm_rb;

/*
* Largest free memory gap in bytes to the left of this VMA.
* Either between this VMA and vma->vm_prev, or between one of the
* VMAs below us in the VMA rbtree and its ->vm_prev. This helps
* get_unmapped_area find a free area of the right size.
*/
unsigned long rb_subtree_gap;

/* Second cache line starts here. */
//points to the user virtual address space (mm_struct) this area belongs to
struct mm_struct *vm_mm; /* The address space we belong to. */
//protection bits: access permissions
pgprot_t vm_page_prot; /* Access permissions of this VMA. */
//flag bits, including rwx permissions
unsigned long vm_flags; /* Flags, see mm.h. */

/*
* For areas with an address space and backing store,
* linkage into the address_space->i_mmap interval tree.
*/
//supports looking up which virtual memory areas a given file interval is mapped into
struct {
struct rb_node rb;
unsigned long rb_subtree_last;
} shared;

/*
* A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
* list, after a COW of one of the file pages. A MAP_SHARED vma
* can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
* or brk vma (with NULL file) can only be in an anon_vma list.
*/
//links together all anon_vma instances associated with this virtual memory area;
//a virtual memory area may be linked to the parent process's anon_vma instance and to its own anon_vma instance
struct list_head anon_vma_chain; /* Serialized by mmap_sem &
* page_table_lock */
//points to an anon_vma instance; struct anon_vma organizes all virtual address spaces that an anonymous page is mapped into
struct anon_vma *anon_vma; /* Serialized by page_table_lock */

/* Function pointers to deal with this struct. */
/*
Virtual memory operations collection:
struct vm_operations_struct {
void (*open)(struct vm_area_struct * area);
//called when the virtual memory area is created
void (*close)(struct vm_area_struct * area);
//called when the virtual memory area is deleted
int (*mremap)(struct vm_area_struct * area);
//called when the area is moved with the mremap system call
int (*fault) (struct vm_fault *vmf);
//when a virtual page of a file mapping is accessed but not yet mapped to a physical page,
//a page fault is generated and the handler calls fault to read the file data into the page cache

//like fault, except that huge_fault is for file mappings that use transparent huge pages
int (*huge_fault) (struct vm_fault *vmf, enum page_entry_size pe_size);
//when reading a virtual page of a file mapping that is not yet mapped to a physical page, the fault
//handler reads not only the page being accessed but also reads ahead the following file pages, and
//calls map_pages to allocate physical pages in the file's page cache
void (*map_pages) (struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff);
//on the first write to a private file mapping a page fault is generated and the handler performs
//copy-on-write; page_mkwrite notifies the file system that the page is about to become writable,
//so it can check whether the write is allowed or wait for the page to reach a suitable state
int (*page_mkwrite) (struct vm_fault *vmf);
};
*/

const struct vm_operations_struct *vm_ops;

/* Information about our backing store: */
unsigned long vm_pgoff; //offset within the file, in pages

struct file * vm_file; //mapped file; NULL for an anonymous mapping
void * vm_private_data; //private data of the memory area
#ifndef CONFIG_MMU
struct vm_region *vm_region; /* NOMMU mapping region */
#endif
#ifdef CONFIG_NUMA
struct mempolicy *vm_policy; /* NUMA policy for the VMA */
#endif
};

    The brk system call sets a new end address for the heap in the virtual address space (which may of course be lower than the current value if the heap is about to shrink). brk grows the dynamically allocated region through do_brk (mm/mmap.c):

static unsigned long do_brk(unsigned long addr, unsigned long len)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma, *prev;
unsigned long flags;
struct rb_node **rb_link, *rb_parent;
pgoff_t pgoff = addr >> PAGE_SHIFT;
int error;

len = PAGE_ALIGN(len);//page-align len
if (!len)
return addr;

flags = VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
//check that there is enough virtual address space
error = get_unmapped_area(NULL, addr, len, 0, MAP_FIXED);
if (error & ~PAGE_MASK)
return error;

error = mlock_future_check(mm, mm->def_flags, len);
if (error)
return error;

/*
* mm->mmap_sem is required to protect against another thread
* changing the mappings in case we sleep.
*/
verify_mm_writelocked(mm);

/*
* Clear old maps. this also does some error checking for us
*/
//walk the VMAs in the process's red-black tree and find a suitable insertion point for addr
munmap_back:
if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent)) {
if (do_munmap(mm, addr, len))
return -ENOMEM;
goto munmap_back;
}

/* Check against address space limits *after* clearing old maps... */
//check whether this virtual interval may be expanded
if (!may_expand_vm(mm, len >> PAGE_SHIFT))
return -ENOMEM;

if (mm->map_count > sysctl_max_map_count)
return -ENOMEM;
//check whether the system has enough memory
if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
return -ENOMEM;

/* Can we just expand an old private anonymous mapping? */
//check whether the region can be merged with an existing one
vma = vma_merge(mm, prev, addr, addr + len, flags,
NULL, NULL, pgoff, NULL);
if (vma)
goto out;//if it can, merge

/*
* create a vma struct for an anonymous mapping
*/
//cannot merge: create a new VMA covering [addr, addr+len]
vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
if (!vma) {
vm_unacct_memory(len >> PAGE_SHIFT);
return -ENOMEM;
}

INIT_LIST_HEAD(&vma->anon_vma_chain);
vma->vm_mm = mm;//the struct mm_struct of the process owning this VMA
vma->vm_start = addr;//start address
vma->vm_end = addr + len;//end address
vma->vm_pgoff = pgoff;//offset of the file mapping, in pages
vma->vm_flags = flags;
/*
VMA flag bits:
VM_EXEC: executable
VM_IO: I/O address space
VM_SHM: IPC shared memory
VM_SHARED: shared between multiple processes
*/
vma->vm_page_prot = vm_get_page_prot(flags);//VMA access permissions
vma_link(mm, vma, prev, rb_link, rb_parent);
out:
perf_event_mmap(vma);
mm->total_vm += len >> PAGE_SHIFT;
if (flags & VM_LOCKED)
mm->locked_vm += (len >> PAGE_SHIFT);
vma->vm_flags |= VM_SOFTDIRTY;
return addr;
}

Commonly used functions:
    1. mmap() -- create a memory mapping

    #include <sys/mman.h>
    void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
The mmap() system call:

    A process can create an anonymous memory mapping, which maps physical pages into its virtual address space. A process can also map a file into its virtual address space and access it like ordinary memory, without calling read()/write(), avoiding switches between user mode and kernel mode and therefore speeding up file access. Two processes that create shared mappings of the same file obtain shared memory.

    2. munmap() -- remove a memory mapping

    #include <sys/mman.h>
    int munmap(void *addr, size_t len);
    3. mprotect() -- change the access permissions of a virtual memory region (a combined usage sketch follows this list)

    #include <sys/mman.h>
    int mprotect(void *addr, size_t len, int prot);
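A combined user-space sketch of the three calls (error handling abbreviated; /etc/hostname is just an example of a small readable file):

//hedged sketch: map a file, read it like memory, change protections, unmap
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    struct stat st;
    char *p;
    int fd = open("/etc/hostname", O_RDONLY);   //any readable, non-empty file

    if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
        return 1;

    p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    printf("first bytes: %.16s\n", p);      //no read() needed: the file is accessed as memory

    mprotect(p, st.st_size, PROT_NONE);     //any further access would now fault
    munmap(p, st.st_size);
    close(fd);
    return 0;
}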

The memory management subsystem describes physical memory with a three-level structure of node, zone and page:

node:

    The node is the top-level structure of memory management. Under NUMA the CPUs are divided among several nodes, and each node has its own memory controller and memory slots. A CPU accesses memory on its own node quickly and memory attached to other CPUs' nodes more slowly. UMA is treated as a NUMA system with a single node.
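A hedged kernel-side sketch that walks the online nodes and prints the pglist_data fields shown in the structure below:

//hedged sketch: iterate over online NUMA nodes and print a few pglist_data fields
#include <linux/mmzone.h>
#include <linux/nodemask.h>
#include <linux/printk.h>

static void dump_nodes(void)
{
    int nid;

    for_each_online_node(nid) {
        pg_data_t *pgdat = NODE_DATA(nid);

        pr_info("node %d: start pfn %lu, %lu present pages\n",
                nid, pgdat->node_start_pfn, pgdat->node_present_pages);
    }
}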

//mmzone.h
typedef struct pglist_data {
struct zone node_zones[MAX_NR_ZONES];//array of memory zones
struct zonelist node_zonelists[MAX_ZONELISTS];//fallback zone lists
int nr_zones;//number of zones in this node
#ifdef CONFIG_FLAT_NODE_MEM_MAP //all memory models except the sparse one
struct page *node_mem_map;//array of page descriptors, one per physical page
#ifdef CONFIG_PAGE_EXTENSION
struct page_ext *node_page_ext;//extended page attributes
#endif
#endif
#ifndef CONFIG_NO_BOOTMEM
struct bootmem_data *bdata;//bootmem boot-time allocator
#endif
#endif
#ifdef CONFIG_MEMORY_HOTPLUG
/*
* Must be held any time you expect node_start_pfn, node_present_pages
* or node_spanned_pages stay constant. Holding this will also
* guarantee that any pfn_valid() stays that way.
*
* pgdat_resize_lock() and pgdat_resize_unlock() are provided to
* manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG.
*
* Nests above zone->lock and zone->span_seqlock
*/
spinlock_t node_size_lock;
#endif
unsigned long node_start_pfn;//first physical page number of this node
unsigned long node_present_pages; //total number of physical pages present
unsigned long node_spanned_pages; //total length of the physical page range, including holes
int node_id;//node identifier
wait_queue_head_t kswapd_wait;
wait_queue_head_t pfmemalloc_wait;
struct task_struct *kswapd; /* Protected by
mem_hotplug_begin/end() */
int kswapd_max_order;
enum zone_type classzone_idx;
#ifdef CONFIG_NUMA_BALANCING
/* Lock serializing the migrate rate limiting window */
spinlock_t numabalancing_migrate_lock;

/* Rate limiting time interval */
unsigned long numabalancing_migrate_next_window;

/* Number of pages migrated during the rate limiting time interval */
unsigned long numabalancing_migrate_nr_pages;
#endif
} pg_data_t;

zone:

//mmzone.h
enum zone_type {
#ifdef CONFIG_ZONE_DMA
/*
* ZONE_DMA is used when there are devices that are not able
* to do DMA to all of addressable memory (ZONE_NORMAL). Then we
* carve out the portion of memory that is needed for these devices.
* The range is arch specific.
*
* Some examples
*
* Architecture Limit
* ---------------------------
* parisc, ia64, sparc <4G
* s390 <2G
* arm Various
* alpha Unlimited or 0-16MB.
*
* i386, x86_64 and multiple other arches
* <16M.
*/
/*DMA zone
DMA stands for Direct Memory Access.
If some devices (e.g. ISA devices) cannot DMA to all addressable memory, this DMA zone is carved out for them.*/
ZONE_DMA,
#endif
#ifdef CONFIG_ZONE_DMA32
/*
* x86_64 needs two ZONE_DMAs because it supports devices that are
* only able to do DMA to the lower 16M but also 32 bit devices that
* can only do DMA areas below 4G.
*/
/*DMA32 zone: on a 64-bit system that must support devices that can only
DMA below 16MB as well as 32-bit devices that can only
DMA below 4GB, this DMA32 zone is required*/
ZONE_DMA32,
#endif
/*
* Normal addressable memory is in ZONE_NORMAL. DMA operations can be
* performed on pages in ZONE_NORMAL if the DMA devices support
* transfers to all addressable memory.
*/
/*Normal zone
memory that is directly mapped into the kernel virtual address space;
also called the direct-mapping or linear-mapping region*/

ZONE_NORMAL,
#ifdef CONFIG_HIGHMEM
/*
* A memory area that is only addressable by the kernel through
* mapping portions into its own address space. This is for example
* used by i386 to allow the kernel to address the memory beyond
* 900MB. The kernel will set up special mappings (page
* table entries on i386) for each page that the kernel needs to
* access.
*/
/*Highmem zone
a leftover of the 32-bit era: with the 3:1 user/kernel split the kernel address space is only 1GB,
so memory above roughly 1GB cannot be mapped directly into the kernel address space.
*/
ZONE_HIGHMEM,
#endif
/*Movable zone
a pseudo zone used to prevent memory fragmentation*/
ZONE_MOVABLE,
__MAX_NR_ZONES
};


struct zone {
/* Read-mostly fields */

/* zone watermarks, access with *_wmark_pages(zone) macros */
unsigned long watermark[NR_WMARK];//watermarks used by the page allocator

/*
* We don't know if the memory that we're going to allocate will be freeable
* or/and it will be released eventually, so to avoid totally wasting several
* GB of ram we must reserve some of the lower zone memory (otherwise we risk
* to run OOM on the lower zones despite there's tons of freeable ram
* on the higher zones). This array is recalculated at runtime if the
* sysctl_lowmem_reserve_ratio sysctl changes.
*/
long lowmem_reserve[MAX_NR_ZONES];//used by the page allocator: pages this zone reserves and will not lend to higher zone types

#ifdef CONFIG_NUMA
int node;
#endif

/*
* The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
* this zone's LRU. Maintained by the pageout code.
*/
unsigned int inactive_ratio;

struct pglist_data *zone_pgdat;//points to the pglist_data instance of the owning memory node
struct per_cpu_pageset __percpu *pageset;//per-CPU page sets

/*
* This is a per-zone reserve of pages that should not be
* considered dirtyable memory.
*/
unsigned long dirty_balance_reserve;

#ifndef CONFIG_SPARSEMEM
/*
* Flags for a pageblock_nr_pages block. See pageblock-flags.h.
* In SPARSEMEM, this map is stored in struct mem_section
*/
unsigned long *pageblock_flags;
#endif /* CONFIG_SPARSEMEM */

#ifdef CONFIG_NUMA
/*
* zone reclaim becomes active if more unmapped pages exist.
*/
unsigned long min_unmapped_pages;
unsigned long min_slab_pages;
#endif /* CONFIG_NUMA */

/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
unsigned long zone_start_pfn;//first physical page number of this zone

/*
* spanned_pages is the total pages spanned by the zone, including
* holes, which is calculated as:
* spanned_pages = zone_end_pfn - zone_start_pfn;
*
* present_pages is physical pages existing within the zone, which
* is calculated as:
* present_pages = spanned_pages - absent_pages(pages in holes);
*
* managed_pages is present pages managed by the buddy system, which
* is calculated as (reserved_pages includes pages allocated by the
* bootmem allocator):
* managed_pages = present_pages - reserved_pages;
*
* So present_pages may be used by memory hotplug or memory power
* management logic to figure out unmanaged pages by checking
* (present_pages - managed_pages). And managed_pages should be used
* by page allocator and vm scanner to calculate all kinds of watermarks
* and thresholds.
*
* Locking rules:
*
* zone_start_pfn and spanned_pages are protected by span_seqlock.
* It is a seqlock because it has to be read outside of zone->lock,
* and it is done in the main allocator path. But, it is written
* quite infrequently.
*
* The span_seq lock is declared along with zone->lock because it is
* frequently read in proximity to zone->lock. It's good to
* give them a chance of being in the same cacheline.
*
* Write access to present_pages at runtime should be protected by
* mem_hotplug_begin/end(). Any reader who can't tolerant drift of
* present_pages should get_online_mems() to get a stable value.
*
* Read access to managed_pages should be safe because it's unsigned
* long. Write access to zone->managed_pages and totalram_pages are
* protected by managed_page_count_lock at runtime. Idealy only
* adjust_managed_page_count() should be used instead of directly
* touching zone->managed_pages and totalram_pages.
*/
unsigned long managed_pages;//physical pages managed by the buddy allocator
unsigned long spanned_pages;//total pages spanned by this zone, including holes
unsigned long present_pages;//physical pages present in this zone, excluding holes

const char *name;//zone name

/*
* Number of MIGRATE_RESERVE page block. To maintain for just
* optimization. Protected by zone->lock.
*/
int nr_migrate_reserve_block;

#ifdef CONFIG_MEMORY_ISOLATION
/*
* Number of isolated pageblock. It is used to solve incorrect
* freepage counting problem due to racy retrieving migratetype
* of pageblock. Protected by zone->lock.
*/
unsigned long nr_isolate_pageblock;
#endif

#ifdef CONFIG_MEMORY_HOTPLUG
/* see spanned/present_pages for more description */
seqlock_t span_seqlock;
#endif

/*
* wait_table -- the array holding the hash table
* wait_table_hash_nr_entries -- the size of the hash table array
* wait_table_bits -- wait_table_size == (1 << wait_table_bits)
*
* The purpose of all these is to keep track of the people
* waiting for a page to become available and make them
* runnable again when possible. The trouble is that this
* consumes a lot of space, especially when so few things
* wait on pages at a given time. So instead of using
* per-page waitqueues, we use a waitqueue hash table.
*
* The bucket discipline is to sleep on the same queue when
* colliding and wake all in that wait queue when removing.
* When something wakes, it must check to be sure its page is
* truly available, a la thundering herd. The cost of a
* collision is great, but given the expected load of the
* table, they should be so rare as to be outweighed by the
* benefits from the saved space.
*
* __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
* primary users of these fields, and in mm/page_alloc.c
* free_area_init_core() performs the initialization of them.
*/
wait_queue_head_t *wait_table;
unsigned long wait_table_hash_nr_entries;
unsigned long wait_table_bits;

ZONE_PADDING(_pad1_)
/* free areas of different sizes */
struct free_area free_area[MAX_ORDER];//free areas of different orders (sizes)

/* zone flags, see below */
unsigned long flags;

/* Write-intensive fields used from the page allocator */
spinlock_t lock;

ZONE_PADDING(_pad2_)

/* Write-intensive fields used by page reclaim */

/* Fields commonly accessed by the page reclaim scanner */
spinlock_t lru_lock;
struct lruvec lruvec;

/* Evictions & activations on the inactive file list */
atomic_long_t inactive_age;

/*
* When free pages are below this point, additional steps are taken
* when reading the number of free pages to avoid per-cpu counter
* drift allowing watermarks to be breached
*/
unsigned long percpu_drift_mark;

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
/* pfn where compaction free scanner should start */
unsigned long compact_cached_free_pfn;
/* pfn where async and sync compaction migration scanner should start */
unsigned long compact_cached_migrate_pfn[2];
#endif

#ifdef CONFIG_COMPACTION
/*
* On compaction failure, 1<<compact_defer_shift compactions
* are skipped before trying again. The number attempted since
* last failure is tracked with compact_considered.
*/
unsigned int compact_considered;
unsigned int compact_defer_shift;
int compact_order_failed;
#endif

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
/* Set to true when the PG_migrate_skip bits should be cleared */
bool compact_blockskip_flush;
#endif

ZONE_PADDING(_pad3_)
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
} ____cacheline_internodealigned_in_smp;

page:

    Each physical page has a corresponding page structure, called the page descriptor; the node_mem_map member of a memory node's pglist_data instance points to the array of page descriptors of all physical pages in that node.

struct page {
/* First double word block */
unsigned long flags; /* Atomic flags, some possibly
* updated asynchronously */
union {
struct address_space *mapping; /* If low bit clear, points to
* inode address_space, or NULL.
* If page mapped as anonymous
* memory, low bit is set, and
* it points to anon_vma object:
* see PAGE_MAPPING_ANON below.
*/
void *s_mem; /* slab first object */
};

/* Second double word */
struct {
union {
pgoff_t index; /* Our offset within mapping. */
void *freelist; /* sl[aou]b first free object */
bool pfmemalloc; /* If set by the page allocator,
* ALLOC_NO_WATERMARKS was set
* and the low watermark was not
* met implying that the system
* is under some pressure. The
* caller should try ensure
* this page is only used to
* free other pages.
*/
};

union {
#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
/* Used for cmpxchg_double in slub */
unsigned long counters;
#else
/*
* Keep _count separate from slub cmpxchg_double data.
* As the rest of the double word is protected by
* slab_lock but _count is not.
*/
unsigned counters;
#endif

struct {

union {
/*
* Count of ptes mapped in
* mms, to show when page is
* mapped & limit reverse map
* searches.
*
* Used also for tail pages
* refcounting instead of
* _count. Tail pages cannot
* be mapped and keeping the
* tail page _count zero at
* all times guarantees
* get_page_unless_zero() will
* never succeed on tail
* pages.
*/
atomic_t _mapcount;

struct { /* SLUB */
unsigned inuse:16;
unsigned objects:15;
unsigned frozen:1;
};
int units; /* SLOB */
};
atomic_t _count; /* Usage count, see below. */
};
unsigned int active; /* SLAB */
};
};

/* Third double word block */
union {
struct list_head lru; /* Pageout list, eg. active_list
* protected by zone->lru_lock !
* Can be used as a generic list
* by the page owner.
*/
struct { /* slub per cpu partial pages */
struct page *next; /* Next partial slab */
#ifdef CONFIG_64BIT
int pages; /* Nr of partial slabs left */
int pobjects; /* Approximate # of objects */
#else
short int pages;
short int pobjects;
#endif
};

struct slab *slab_page; /* slab fields */
struct rcu_head rcu_head; /* Used by SLAB
* when destroying via RCU
*/
/* First tail page of compound page */
struct {
compound_page_dtor *compound_dtor;
unsigned long compound_order;
};

#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && USE_SPLIT_PMD_PTLOCKS
pgtable_t pmd_huge_pte; /* protected by page->ptl */
#endif
};

/* Remainder is not double word aligned */
union {
unsigned long private; /* Mapping-private opaque data:
* usually used for buffer_heads
* if PagePrivate set; used for
* swp_entry_t if PageSwapCache;
* indicates order in the buddy
* system if PG_buddy is set.
*/
#if USE_SPLIT_PTE_PTLOCKS
#if ALLOC_SPLIT_PTLOCKS
spinlock_t *ptl;
#else
spinlock_t ptl;
#endif
#endif
struct kmem_cache *slab_cache; /* SL[AU]B: Pointer to slab */
struct page *first_page; /* Compound tail pages */
};

#ifdef CONFIG_MEMCG
struct mem_cgroup *mem_cgroup;
#endif

/*
* On machines where all RAM is mapped into kernel address space,
* we can simply calculate the virtual address. On machines with
* highmem some memory is mapped into kernel virtual memory
* dynamically, so we need a place to store that address.
* Note that this field could be 16 bits on x86 ... ;)
*
* Architectures with slow multiplication can define
* WANT_PAGE_VIRTUAL in asm/page.h
*/
#if defined(WANT_PAGE_VIRTUAL)
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */

#ifdef CONFIG_KMEMCHECK
/*
* kmemcheck wants to track the status of each byte in a page; this
* is a pointer to such a status block. NULL if not tracked.
*/
void *shadow;
#endif

#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
int _last_cpupid;
#endif
};

Big Kernel Lock (BKL)

    The Big Kernel Lock (BKL) is a lock in the Linux kernel that works much like an ordinary lock, except that once a process acquires the BKL and enters a critical section protected by it, not only that critical section but every BKL-protected critical section is locked at the same time. The BKL was mainly used in drivers, file systems and similar code, and was removed as of Linux 2.6.39.

per-CPU counters

    For multi-CPU systems, especially SMP and NUMA machines, Linux provides the per_cpu mechanism to describe data that is private to each CPU.
    With per_cpu, every CPU has its own private data area, which makes the data easy to protect and access, since taking a lock would cost performance.

//include/linux/percpu_counter.h
struct percpu_counter {
raw_spinlock_t lock;//spinlock, taken when the accurate value is needed
s64 count;//accurate value of the counter
#ifdef CONFIG_HOTPLUG_CPU
struct list_head list; /* All percpu_counters are on a list */
#endif
s32 __percpu *counters;//per-CPU array that caches updates to the counter
};
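A hedged sketch of how this counter is typically used (percpu_counter_init() takes an extra gfp_t argument in newer kernels; the names below are illustrative):

//hedged sketch of the percpu_counter API
#include <linux/percpu_counter.h>

static struct percpu_counter nr_things;

int things_init(void)
{
    return percpu_counter_init(&nr_things, 0, GFP_KERNEL);  //older kernels omit GFP_KERNEL
}

void thing_created(void)
{
    percpu_counter_inc(&nr_things);         //cheap: usually only touches this CPU's slot
}

s64 things_total(void)
{
    return percpu_counter_sum(&nr_things);  //accurate: folds in every CPU's slot under the lock
}

void things_exit(void)
{
    percpu_counter_destroy(&nr_things);
}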