21 Commits

Author SHA1 Message Date
Justin C. Miller
cf22ed57a2 [docs] Update the README with roadmap info 2021-02-17 00:47:12 -08:00
Justin C. Miller
b6772ac2ea [kernel] Fix #DF when building with -O3
I had failed to specify in inline asm that an input variable was the
same as the output variable.
2021-02-17 00:22:22 -08:00
Justin C. Miller
f0025dbc47 [kernel] Schedule threads on other CPUs
Now that the other CPUs have been brought up, add support for scheduling
tasks on them. The scheduler now maintains separate ready/blocked lists
per CPU, and CPUs will attempt to balance load via periodic work
stealing.

Other changes as a result of this:
- The device manager no longer creates a local APIC object, but instead
  just gathers relevant info from the ACPI tables. Each CPU creates its
  own local APIC object. This also spurred the APIC timer calibration to
  become a static value, as all APICs are assumed to be symmetrical.
- Fixed a bug where the scheduler was popping the current task off of
  its ready list, however the current task is never on the ready list
  (except the idle task was first set up as both current and ready).
  This was causing the lists to get into bad states. Now a task can only
  ever be current or in a ready or blocked list.
- Got rid of the unused static process::s_processes list of all
  processes, rather than trying to synchronize it via locks.
- Added spinlocks for synchronization to the scheduler and logger
  objects.
2021-02-15 12:56:22 -08:00
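The per-CPU ready lists and work stealing described above can be sketched in miniature. This is a hypothetical standalone model, not jsix's actual scheduler types (the real one also keeps blocked lists, priorities, and runs lock-free where it can):

```cpp
#include <array>
#include <deque>
#include <mutex>
#include <optional>

struct task { int id; };

struct cpu_queue {
    std::mutex lock;
    std::deque<task> ready;
};

template <size_t N>
struct scheduler {
    std::array<cpu_queue, N> cpus;

    // Pop from this CPU's own queue; if it is empty, steal from the
    // first other CPU that has work, taking from the opposite end of
    // the victim's deque to reduce contention with its owner.
    std::optional<task> next(size_t cpu) {
        {
            std::scoped_lock g(cpus[cpu].lock);
            if (!cpus[cpu].ready.empty()) {
                task t = cpus[cpu].ready.front();
                cpus[cpu].ready.pop_front();
                return t;
            }
        }
        for (size_t v = 0; v < N; ++v) {
            if (v == cpu) continue;
            std::scoped_lock g(cpus[v].lock);
            if (!cpus[v].ready.empty()) {
                task t = cpus[v].ready.back();  // steal from the far end
                cpus[v].ready.pop_back();
                return t;
            }
        }
        return std::nullopt;  // nothing runnable anywhere: idle
    }
};
```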
Justin C. Miller
2a347942bc [kernel] Fix SMP boot on KVM
KVM didn't like setting all the CR4 bits we wanted at once. I suspect
that means real hardware won't either. Delay the setting of the rest of
CR4 until after the CPU is in long mode - only set PAE and PGE from real
mode.
2021-02-13 01:45:17 -08:00
Justin C. Miller
36da65e15b [kernel] Add index to cpu_data
Because the firmware can set the APIC ids to whatever it wants, add a
sequential index to each cpu_data structure that jsix will use for its
main identifier, or for indexing into arrays, etc.
2021-02-11 00:00:34 -08:00
Justin C. Miller
214ff3eff0 Update gitignore
Adding files that have been hanging around my development environment
that should not get checked in
2021-02-10 23:59:05 -08:00
Justin C. Miller
8c0d52d0fe [kernel] Add spinlocks to vm_space, frame_allocator
Also updated the spinlock interface to be a class, and added a scoped
lock object that uses it.
2021-02-10 23:57:51 -08:00
Justin C. Miller
793bba95b5 [boot] Do address virtualization in the bootloader
More and more places in the kernel init code are taking addresses from
the bootloader and translating them to offset-mapped addresses. The
bootloader can do this, so it should.
2021-02-10 01:23:50 -08:00
Justin C. Miller
2d4a65c654 [kernel] Pre-allocate cpu_data and pass to APs
In order to avoid cyclic dependencies in the case of page faults while
bringing up an AP, pre-allocate the cpu_data structure and related CPU
control structures, and pass them to the AP startup code.

This also changes the following:
- cpu_early_init() was split out of cpu_init() to allow early
  usage of current_cpu() on the BSP before we're ready for the rest of
  cpu_init(). (These functions were also renamed to follow the preferred
  area_action naming style.)
- isr_handler now zeroes out the IST entry for its vector instead of
  trying to increment the IST stack pointer
- the IST stacks are allocated outside of cpu_init, to also help reduce
  stack pressure and chance of page faults before APs are ready
- share stack areas between AP idle threads so we only waste 1K per
  additional AP for the unused idle stack
2021-02-10 15:44:07 -08:00
Justin C. Miller
872f178d94 [kernel] Update syscall MSRs for all CPUs
Since SYSCALL/SYSRET rely on MSRs to control their function, split out
syscall_enable() into syscall_initialize() and syscall_enable(), the
latter being called on all CPUs. This affects not just syscalls but also
the kernel_to_user_trampoline.

Additionally, do away with the max syscalls, and just make a single page
of syscall pointers and name pointers. Max syscalls was fragile and
needed to be kept in sync in multiple places.
2021-02-10 15:25:17 -08:00
Justin C. Miller
70d6094f46 [kernel] Add fake preludes to isr handler to trick GDB
By adding more debug information to the symbols and adding function
frame preludes to the isr handler assembly functions, GDB sees them as
valid locations for stack frames, and can display backtraces through
interrupts.
2021-02-10 01:10:26 -08:00
Justin C. Miller
31289436f5 [kernel] Use PAUSE in spinwait
Using PAUSE in a tight loop allows other logical cores on the same
physical core to make use of more of the core's resources.
2021-02-07 23:52:06 -08:00
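The shape of such a loop, as a hedged standalone sketch (x86 only; `__builtin_ia32_pause()` is the GCC/Clang builtin that emits the PAUSE instruction). PAUSE marks the loop as a spin-wait, freeing execution resources for a sibling hyperthread and avoiding the memory-order mis-speculation penalty when the wait ends:

```cpp
#include <atomic>
#include <thread>

// Spin until another CPU/thread sets the flag, yielding core resources
// on each iteration via the PAUSE instruction.
inline void spin_until(const std::atomic<bool> &flag) {
    while (!flag.load(std::memory_order_acquire))
        __builtin_ia32_pause();  // emits PAUSE
}
```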
Justin C. Miller
5e7792c11f [scripts] Add GDB j6tw page table walker
Added the command "j6tw <pml4> <addr>" which takes any arguments that
evaluate to addresses or integers. It displays the full breakdown of the
page table walk for the given address, with flags.
2021-02-07 23:50:53 -08:00
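The index math j6tw performs is the standard x86-64 4-level split: a 48-bit virtual address breaks into four 9-bit table indices (PML4, PDP, PD, PT) plus a 12-bit page offset. A minimal C++ equivalent of the script's breakdown:

```cpp
#include <array>
#include <cstdint>

// Split a canonical virtual address into its four page-table indices.
std::array<unsigned, 4> pt_indices(uint64_t addr) {
    return {
        unsigned((addr >> 39) & 0x1ff),  // PML4 index
        unsigned((addr >> 30) & 0x1ff),  // PDP index
        unsigned((addr >> 21) & 0x1ff),  // PD index
        unsigned((addr >> 12) & 0x1ff),  // PT index
    };
}
```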
Justin C. Miller
e73064a438 [kutil] Update spinlock to an MCS-style lock
Update the existing but unused spinlock class to an MCS-style queue
spinlock. This is probably still a WIP but I expect it to see more use
with SMP getting further integrated.
2021-02-07 23:50:00 -08:00
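The core idea of an MCS-style queue lock is that each waiter spins on its own queue node rather than on one shared flag, so contended CPUs don't all hammer the same cache line. A hedged user-space sketch using `std::atomic` (the kernel version differs in allocation and memory-order details):

```cpp
#include <atomic>
#include <thread>

struct mcs_node {
    std::atomic<mcs_node*> next{nullptr};
    std::atomic<bool> locked{false};
};

struct mcs_lock {
    std::atomic<mcs_node*> tail{nullptr};

    void acquire(mcs_node *me) {
        me->next.store(nullptr, std::memory_order_relaxed);
        me->locked.store(true, std::memory_order_relaxed);
        mcs_node *prev = tail.exchange(me, std::memory_order_acq_rel);
        if (prev) {
            prev->next.store(me, std::memory_order_release);
            while (me->locked.load(std::memory_order_acquire))
                ;  // spin only on our own node
        }
    }

    void release(mcs_node *me) {
        mcs_node *succ = me->next.load(std::memory_order_acquire);
        if (!succ) {
            mcs_node *expected = me;
            if (tail.compare_exchange_strong(expected, nullptr,
                    std::memory_order_acq_rel))
                return;  // no waiter arrived; lock is now free
            while (!(succ = me->next.load(std::memory_order_acquire)))
                ;  // a waiter swapped tail but hasn't linked in yet
        }
        succ->locked.store(false, std::memory_order_release);
    }
};

// Demo: two threads contending on the lock around a plain counter.
long mcs_demo() {
    mcs_lock l;
    long counter = 0;
    auto worker = [&] {
        for (int i = 0; i < 10000; ++i) {
            mcs_node n;
            l.acquire(&n);
            ++counter;
            l.release(&n);
        }
    };
    std::thread a(worker), b(worker);
    a.join(); b.join();
    return counter;
}
```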
Justin C. Miller
72787c0652 [kernel] Make sure all vma types have (virtual) dtors 2021-02-07 23:45:07 -08:00
Justin C. Miller
c88170f6e0 [kernel] Start all other processors in the system
This very large commit is mainly focused on getting the APs started and
to a state where they're waiting to have work scheduled. (Actually
scheduling on them is for another commit.)

To do this, a bunch of major changes were needed:

- Moving a lot of the CPU initialization (including for the BSP) to
  init_cpu(). This includes setting up IST stacks, writing MSRs, and
  creating the cpu_data structure. For the APs, this also creates and
  installs the GDT and TSS, and installs the global IDT.

- Creating the AP startup code, which tries to be as position
  independent as possible. It's copied from its location to 0x8000 for
  AP startup, and some of it is fixed at that address. The AP startup
  code jumps from real mode to long mode with paging in one swell foop.

- Adding limited IPI capability to the lapic class. This will need to
  improve.

- Renaming cpu/cpu.* to cpu/cpu_id.* because it was just annoying in GDB
  and really isn't anything but cpu_id anymore.

- Moved all the GDT, TSS, and IDT code into their own files and made
  them classes instead of a mess of free functions.

- Got rid of bsp_cpu_data everywhere. Now always call the new
  current_cpu() to get the current CPU's cpu_data.

- Device manager keeps a list of APIC ids now. This should go somewhere
  else eventually, device_manager needs to be refactored away.

- Moved some more things (notably the g_kernel_stacks vma) to the
  pre-constructor setup in memory_bootstrap. That whole file is in bad
  need of a refactor.
2021-02-07 23:44:28 -08:00
Justin C. Miller
a65ecb157d [fb] Fix fb log scrolling
While working on the double-buffering issue, I ripped out this feature
from scrollback and didn't put it back in. Also have main allocate
extra space for the message buffer, since calling malloc/free over and
over was causing malloc to panic. (Which should also separately
be fixed.)
2021-02-06 00:33:45 -08:00
Justin C. Miller
eb8a3c0e09 [kernel] Fix frame allocator next-block bug
The frame allocator was causing page faults when exhausting the first
(well, last, because it starts from the end) block of free pages. Turns
out it was just incrementing instead of decrementing and thus running
off the end.
2021-02-06 00:06:29 -08:00
Justin C. Miller
572fade7ff [fb] Use rep stosl for screen fill
This is mostly cleanup after fighting the double buffering bug - bring
back rep stosl for screen fill, and move draw_pixel to be an inline
function.
2021-02-05 23:55:42 -08:00
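A hedged sketch of such a fill (x86-64, GCC/Clang extended asm; `fill32` is a hypothetical helper, not the driver's API). Note that `rep stosl` modifies both RDI and RCX as it runs, so they must be read-write (`+`) operands; listing them as inputs only is exactly the kind of constraint bug that surfaces at higher optimization levels:

```cpp
#include <cstddef>
#include <cstdint>

// Fill `count` dwords at `dst` with `value` using rep stosl.
inline void fill32(uint32_t *dst, uint32_t value, size_t count) {
    asm volatile ("rep stosl"
                  : "+D"(dst), "+c"(count)  // both clobbered by the insn
                  : "a"(value)
                  : "memory");
}
```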
Justin C. Miller
b5885ae35f [fb] Dynamically allocate log entry buffer
Since the kernel will tell us what size of buffer we need for
j6_system_get_log(), malloc the buffer for it from the heap instead of a
fixed array on the stack.
2021-02-05 23:51:34 -08:00
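The calling convention here is "report insufficient, return the size needed, let the caller grow and retry." A self-contained mock of that pattern (the `j6_*` names are jsix's real API; the `get_log`/`read_log` pair below is purely illustrative):

```cpp
#include <cstring>
#include <string>
#include <vector>

enum status { ok, insufficient };

static const std::string g_log = "hello from the kernel log";

// Mock syscall: if the buffer is too small, report the needed size.
status get_log(void *buf, size_t *size) {
    if (*size < g_log.size() + 1) {
        *size = g_log.size() + 1;
        return insufficient;
    }
    std::memcpy(buf, g_log.c_str(), g_log.size() + 1);
    return ok;
}

// Caller side: start with no buffer and grow on demand, overallocating
// so repeated slightly-larger messages don't thrash the allocator.
std::string read_log() {
    std::vector<char> buffer;
    for (;;) {
        size_t size = buffer.size();
        status s = get_log(buffer.data(), &size);
        if (s == insufficient) {
            buffer.resize(size * 2);
            continue;
        }
        return std::string(buffer.data());
    }
}
```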
Justin C. Miller
335bc01185 [kernel] Fix page_tree growth bug
The logic was inverted in contains(), meaning that new parents were
never being created, and the same level-0 block was just getting reused.
2021-02-05 23:47:29 -08:00
56 changed files with 1748 additions and 737 deletions

2
.gitignore vendored
View File

@@ -3,8 +3,10 @@
*.bak
tags
jsix.log
*.out
*.o
*.a
sysroot
.gdb_history
.peru
__pycache__

View File

@@ -1,10 +1,10 @@
![jsix](assets/jsix.svg)
# jsix: A hobby operating system
# The jsix operating system
**jsix** is the hobby operating system that I am currently building. It's far
from finished, or even being usable. Instead, it's a sandbox for me to play
with kernel-level code and explore architectures.
**jsix** is a custom multi-core x64 operating system that I am building from
scratch. It's far from finished, or even being usable - see the *Status and
Roadmap* section, below.
The design goals of the project are:
@@ -23,9 +23,8 @@ The design goals of the project are:
by the traditional microkernel problems.
* Exploration - I'm really mostly doing this to have fun learning and exploring
modern OS development. Modular design may be tossed out (hopefully
temporarily) in some places to allow me to play around with the related
hardware.
modern OS development. Initial feature implementations may temporarily throw
away modular design to allow for exploration of the related hardware.
A note on the name: This kernel was originally named Popcorn, but I have since
discovered that the Popcorn Linux project is also developing a kernel with that
@@ -35,6 +34,67 @@ and my wonderful wife.
[cpu_features]: https://github.com/justinian/jsix/blob/master/src/libraries/cpu/include/cpu/features.inc
## Status and Roadmap
The following major feature areas are targets for jsix development:
#### UEFI boot loader
_Done._ The bootloader loads the kernel and initial userspace programs, and
sets up necessary kernel arguments about the memory map and EFI GOP
framebuffer. Possible future ideas:
- take over more init-time functions from the kernel
- rewrite it in Zig
#### Memory
_Virtual memory: Sufficient._ The kernel manages virtual memory with a number
of kinds of `vm_area` objects representing mapped areas, which can belong to
one or more `vm_space` objects which represent a whole virtual memory space.
(Each process has a `vm_space`, and so does the kernel itself.)
Remaining to do:
- TLB shootdowns
- Page swapping
_Physical page allocation: Sufficient._ The current physical page allocator
implementation uses a group of blocks, each representing an up-to-1GiB area of
usable memory as defined by the bootloader. Each block has a three-level
bitmap denoting free/used pages.
#### Multitasking
_Sufficient._ The global scheduler object keeps separate ready/blocked lists
per core. Cores periodically attempt to balance load via work stealing.
User-space tasks are able to create threads as well as other processes.
Several kernel-only tasks exist, though I'm trying to reduce that. Eventually
only the timekeeping task should be a separate kernel-only thread.
#### API
_In progress._ User-space tasks are able to make syscalls to the kernel via
fast SYSCALL/SYSRET instructions.
Major tasks still to do:
- The process initialization protocol needs to be rebuilt entirely.
- Processes' handles to kernel objects need the ability to check capabilities.
#### Hardware Support
* Framebuffer driver: _In progress._ Currently on machines with a video
device accessible by UEFI, jsix starts a user-space framebuffer driver that
only prints out kernel logs.
* Serial driver: _To do._ Machines without a video device should have a
user-space log output task like the framebuffer driver, but currently this
is done inside the kernel.
* USB driver: _To do_
* AHCI (SATA) driver: _To do_
## Building
jsix uses the [Ninja][] build tool, and generates the build files for it with a

View File

@@ -56,8 +56,77 @@ class PrintBacktraceCommand(gdb.Command):
return
class TableWalkCommand(gdb.Command):
def __init__(self):
super().__init__("j6tw", gdb.COMMAND_DATA)
def invoke(self, arg, from_tty):
args = gdb.string_to_argv(arg)
if len(args) < 2:
raise Exception("Must be: j6tw <pml4> <addr>")
pml4 = int(gdb.parse_and_eval(args[0]))
addr = int(gdb.parse_and_eval(args[1]))
indices = [
(addr >> 39) & 0x1ff,
(addr >> 30) & 0x1ff,
(addr >> 21) & 0x1ff,
(addr >> 12) & 0x1ff,
]
names = ["PML4", "PDP", "PD", "PT"]
table_flags = [
(0x0001, "present"),
(0x0002, "write"),
(0x0004, "user"),
(0x0008, "pwt"),
(0x0010, "pcd"),
(0x0020, "accessed"),
(0x0040, "dirty"),
(0x0080, "largepage"),
(0x0100, "global"),
(0x1080, "pat"),
((1<<63), "xd"),
]
page_flags = [
(0x0001, "present"),
(0x0002, "write"),
(0x0004, "user"),
(0x0008, "pwt"),
(0x0010, "pcd"),
(0x0020, "accessed"),
(0x0040, "dirty"),
(0x0080, "pat"),
(0x0100, "global"),
((1<<63), "xd"),
]
flagsets = [table_flags, table_flags, table_flags, page_flags]
table = pml4
entry = 0
for i in range(len(indices)):
entry = int(gdb.parse_and_eval(f'((uint64_t*){table})[{indices[i]}]'))
flagset = flagsets[i]
flag_names = " | ".join([f[1] for f in flagset if (entry & f[0]) == f[0]])
print(f"{names[i]:>4}: {table:016x}")
print(f" index: {indices[i]:3} {entry:016x}")
print(f" flags: {flag_names}")
if (entry & 1) == 0 or (i < 3 and (entry & 0x80)):
break
table = (entry & 0x7ffffffffffffe00) | 0xffffc00000000000
PrintStackCommand()
PrintBacktraceCommand()
TableWalkCommand()
gdb.execute("target remote :1234")
gdb.execute("display/i $rip")

View File

@@ -12,6 +12,7 @@ modules:
- src/kernel
source:
- src/kernel/apic.cpp
- src/kernel/ap_startup.s
- src/kernel/assert.cpp
- src/kernel/boot.s
- src/kernel/clock.cpp
@@ -24,8 +25,9 @@ modules:
- src/kernel/frame_allocator.cpp
- src/kernel/fs/gpt.cpp
- src/kernel/gdt.cpp
- src/kernel/gdt.s
- src/kernel/gdtidt.s
- src/kernel/hpet.cpp
- src/kernel/idt.cpp
- src/kernel/interrupts.cpp
- src/kernel/interrupts.s
- src/kernel/io.cpp
@@ -56,6 +58,7 @@ modules:
- src/kernel/syscalls/thread.cpp
- src/kernel/syscalls/vm_area.cpp
- src/kernel/task.s
- src/kernel/tss.cpp
- src/kernel/vm_space.cpp
boot:
@@ -111,6 +114,7 @@ modules:
- src/libraries/kutil/logger.cpp
- src/libraries/kutil/memory.cpp
- src/libraries/kutil/printf.c
- src/libraries/kutil/spinlock.cpp
cpu:
kind: lib
@@ -118,7 +122,7 @@ modules:
includes:
- src/libraries/cpu/include
source:
- src/libraries/cpu/cpu.cpp
- src/libraries/cpu/cpu_id.cpp
j6:
kind: lib

View File

@@ -8,7 +8,7 @@
#include <stdint.h>
#include "console.h"
#include "cpu/cpu.h"
#include "cpu/cpu_id.h"
#include "error.h"
#include "fs.h"
#include "hardware.h"
@@ -93,6 +93,8 @@ add_module(args::header *args, args::mod_type type, buffer &data)
m.type = type;
m.location = data.data;
m.size = data.size;
change_pointer(m.location);
}
/// Check that all required cpu features are supported
@@ -198,12 +200,15 @@ efi_main(uefi::handle image, uefi::system_table *st)
reinterpret_cast<kernel::entrypoint>(kernel.entrypoint);
status.next();
hw::setup_control_regs();
memory::virtualize(args->pml4, map, st->runtime_services);
status.next();
change_pointer(args);
change_pointer(args->pml4);
change_pointer(args->modules);
change_pointer(args->programs);
status.next();
kentry(args);

View File

@@ -92,14 +92,28 @@ main(int argc, const char **argv)
scrollback scroll(rows, cols);
int pending = 0;
constexpr int pending_threshold = 10;
constexpr int pending_threshold = 5;
j6_handle_t sys = __handle_sys;
size_t buffer_size = 0;
void *message_buffer = nullptr;
char message_buffer[256];
while (true) {
size_t size = sizeof(message_buffer);
j6_system_get_log(__handle_sys, message_buffer, &size);
if (size != 0) {
entry *e = reinterpret_cast<entry*>(&message_buffer);
size_t size = buffer_size;
j6_status_t s = j6_system_get_log(sys, message_buffer, &size);
if (s == j6_err_insufficient) {
free(message_buffer);
message_buffer = malloc(size * 2);
buffer_size = size;
continue;
} else if (s != j6_status_ok) {
j6_system_log("fb driver got error from get_log, quitting");
return s;
}
if (size > 0) {
entry *e = reinterpret_cast<entry*>(message_buffer);
size_t eom = e->bytes - sizeof(entry);
e->message[eom] = 0;
@@ -119,7 +133,6 @@ main(int argc, const char **argv)
}
}
j6_system_log("fb driver done, exiting");
return 0;
}

View File

@@ -9,7 +9,8 @@ screen::screen(volatile void *addr, unsigned hres, unsigned vres, unsigned scanl
m_resx(hres),
m_resy(vres)
{
m_back = reinterpret_cast<pixel_t*>(malloc(scanline*vres*sizeof(pixel_t)));
const size_t size = scanline * vres;
m_back = reinterpret_cast<pixel_t*>(malloc(size * sizeof(pixel_t)));
}
screen::pixel_t
@@ -33,15 +34,9 @@ screen::color(uint8_t r, uint8_t g, uint8_t b) const
void
screen::fill(pixel_t color)
{
const size_t len = m_resx * m_resy;
for (size_t i = 0; i < len; ++i)
m_back[i] = color;
}
void
screen::draw_pixel(unsigned x, unsigned y, pixel_t color)
{
m_back[x + y * m_resx] = color;
const size_t len = m_scanline * m_resy;
size_t count = len;    // rep stosl modifies rcx and rdi,
pixel_t *dst = m_back; // so give it writable copies
asm volatile ( "rep stosl" : "+c"(count), "+D"(dst) :
"a"(color) : "memory" );
}
void

View File

@@ -17,7 +17,11 @@ public:
pixel_t color(uint8_t r, uint8_t g, uint8_t b) const;
void fill(pixel_t color);
void draw_pixel(unsigned x, unsigned y, pixel_t color);
inline void draw_pixel(unsigned x, unsigned y, pixel_t color) {
const size_t index = x + y * m_scanline;
m_back[index] = color;
}
void update();

View File

@@ -45,8 +45,12 @@ scrollback::render(screen &scr, font &fnt)
const unsigned xstride = (m_margin + fnt.width());
const unsigned ystride = (m_margin + fnt.height());
unsigned start = m_count <= m_rows ? 0 :
m_count % m_rows;
for (unsigned y = 0; y < m_rows; ++y) {
char *line = &m_data[y*m_cols];
unsigned i = (start + y) % m_rows;
char *line = &m_data[i*m_cols];
for (unsigned x = 0; x < m_cols; ++x) {
fnt.draw_glyph(scr, line[x], fg, bg, m_margin+x*xstride, m_margin+y*ystride);
}
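The start-index computation in the hunk above renders a ring buffer of `m_rows` lines oldest-first instead of in physical order. The same logic as a standalone sketch (hypothetical helper, not part of the fb driver):

```cpp
#include <string>
#include <vector>

// Given the physical ring slots and the total number of lines ever
// written, return the lines in display (oldest-first) order. Once more
// than `rows` lines exist, the oldest line lives at slot count % rows.
std::vector<std::string> render_order(const std::vector<std::string> &slots,
                                      unsigned rows, unsigned count) {
    unsigned start = count <= rows ? 0 : count % rows;
    std::vector<std::string> out;
    for (unsigned y = 0; y < rows; ++y)
        out.push_back(slots[(start + y) % rows]);
    return out;
}
```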

149
src/kernel/ap_startup.s Normal file
View File

@@ -0,0 +1,149 @@
%include "tasking.inc"
section .ap_startup
BASE equ 0x8000 ; Where the kernel will map this at runtime
CR0_PE equ (1 << 0)
CR0_MP equ (1 << 1)
CR0_ET equ (1 << 4)
CR0_NE equ (1 << 5)
CR0_WP equ (1 << 16)
CR0_PG equ (1 << 31)
CR0_VAL equ CR0_PE|CR0_MP|CR0_ET|CR0_NE|CR0_WP|CR0_PG
CR4_DE equ (1 << 3)
CR4_PAE equ (1 << 5)
CR4_MCE equ (1 << 6)
CR4_PGE equ (1 << 7)
CR4_OSFXSR equ (1 << 9)
CR4_OSCMMEXCPT equ (1 << 10)
CR4_FSGSBASE equ (1 << 16)
CR4_PCIDE equ (1 << 17)
CR4_INIT equ CR4_PAE|CR4_PGE
CR4_VAL equ CR4_DE|CR4_PAE|CR4_MCE|CR4_PGE|CR4_OSFXSR|CR4_OSCMMEXCPT|CR4_FSGSBASE|CR4_PCIDE
EFER_MSR equ 0xC0000080
EFER_SCE equ (1 << 0)
EFER_LME equ (1 << 8)
EFER_NXE equ (1 << 11)
EFER_VAL equ EFER_SCE|EFER_LME|EFER_NXE
bits 16
default rel
align 8
global ap_startup
ap_startup:
jmp .start_real
align 8
.pml4: dq 0
.cpu: dq 0
.ret: dq 0
align 16
.gdt:
dq 0x0 ; Null GDT entry
dq 0x00209A0000000000 ; Code
dq 0x0000920000000000 ; Data
align 4
.gdtd:
dw ($ - .gdt) - 1 ; GDT limit is size - 1
dd BASE + (.gdt - ap_startup)
align 4
.idtd:
dw 0 ; zero-length IDT descriptor
dd 0
.start_real:
cli
cld
xor ax, ax
mov ds, ax
; set the temporary null IDT
lidt [BASE + (.idtd - ap_startup)]
; Enter long mode
mov eax, cr4
or eax, CR4_INIT
mov cr4, eax
mov eax, [BASE + (.pml4 - ap_startup)]
mov cr3, eax
mov ecx, EFER_MSR
rdmsr
or eax, EFER_VAL
wrmsr
mov eax, CR0_VAL
mov cr0, eax
; Set the temporary minimal GDT
lgdt [BASE + (.gdtd - ap_startup)]
jmp (1 << 3):(BASE + (.start_long - ap_startup))
bits 64
default abs
align 8
.start_long:
; set data segments
mov ax, (2 << 3)
mov ds, ax
mov es, ax
mov fs, ax
mov gs, ax
mov ss, ax
mov eax, CR4_VAL
mov rdi, [BASE + (.cpu - ap_startup)]
mov rax, [rdi + CPU_DATA.rsp0]
mov rsp, rax
mov rax, [BASE + (.ret - ap_startup)]
jmp rax
global ap_startup_code_size
ap_startup_code_size:
dq ($ - ap_startup)
section .text
global init_ap_trampoline
init_ap_trampoline:
push rbp
mov rbp, rsp
; rdi is the kernel pml4
mov [BASE + (ap_startup.pml4 - ap_startup)], rdi
; rsi is the cpu data for this AP
mov [BASE + (ap_startup.cpu - ap_startup)], rsi
; rdx is the address to jump to
mov [BASE + (ap_startup.ret - ap_startup)], rdx
; rcx is the processor id
mov rdi, rdx
pop rbp
ret
extern long_ap_startup
global ap_idle
ap_idle:
call long_ap_startup
sti
.hang:
hlt
jmp .hang

View File

@@ -6,11 +6,18 @@
#include "kernel_memory.h"
#include "log.h"
uint64_t lapic::s_ticks_per_us = 0;
static constexpr uint16_t lapic_id = 0x0020;
static constexpr uint16_t lapic_spurious = 0x00f0;
static constexpr uint16_t lapic_icr_low = 0x0300;
static constexpr uint16_t lapic_icr_high = 0x0310;
static constexpr uint16_t lapic_lvt_timer = 0x0320;
static constexpr uint16_t lapic_lvt_lint0 = 0x0350;
static constexpr uint16_t lapic_lvt_lint1 = 0x0360;
static constexpr uint16_t lapic_lvt_error = 0x0370;
static constexpr uint16_t lapic_timer_init = 0x0380;
static constexpr uint16_t lapic_timer_cur = 0x0390;
@@ -25,6 +32,7 @@ apic_read(uint32_t volatile *apic, uint16_t offset)
static void
apic_write(uint32_t volatile *apic, uint16_t offset, uint32_t value)
{
log::debug(logs::apic, "LAPIC write: %x = %08lx", offset, value);
*(apic + offset/sizeof(uint32_t)) = value;
}
@@ -48,14 +56,58 @@ apic::apic(uintptr_t base) :
}
lapic::lapic(uintptr_t base, isr spurious) :
lapic::lapic(uintptr_t base) :
apic(base),
m_divisor(0)
{
apic_write(m_base, lapic_spurious, static_cast<uint32_t>(spurious));
apic_write(m_base, lapic_lvt_error, static_cast<uint32_t>(isr::isrAPICError));
apic_write(m_base, lapic_spurious, static_cast<uint32_t>(isr::isrSpurious));
log::info(logs::apic, "LAPIC created, base %lx", m_base);
}
uint8_t
lapic::get_id()
{
return static_cast<uint8_t>(apic_read(m_base, lapic_id) >> 24);
}
void
lapic::send_ipi(ipi mode, uint8_t vector, uint8_t dest)
{
// Wait until the APIC is ready to send
ipi_wait();
uint32_t command =
static_cast<uint32_t>(vector) |
static_cast<uint32_t>(mode);
apic_write(m_base, lapic_icr_high, static_cast<uint32_t>(dest) << 24);
apic_write(m_base, lapic_icr_low, command);
}
void
lapic::send_ipi_broadcast(ipi mode, bool self, uint8_t vector)
{
// Wait until the APIC is ready to send
ipi_wait();
uint32_t command =
static_cast<uint32_t>(vector) |
static_cast<uint32_t>(mode) |
(self ? 0 : (1 << 18)) |
(1 << 19);
apic_write(m_base, lapic_icr_high, 0);
apic_write(m_base, lapic_icr_low, command);
}
void
lapic::ipi_wait()
{
while (apic_read(m_base, lapic_icr_low) & (1<<12))
asm volatile ("pause" : : : "memory");
}
void
lapic::calibrate_timer()
{
@@ -72,10 +124,10 @@ lapic::calibrate_timer()
clock::get().spinwait(us);
uint32_t remaining = apic_read(m_base, lapic_timer_cur);
uint32_t ticks_total = initial - remaining;
m_ticks_per_us = ticks_total / us;
uint64_t ticks_total = initial - remaining;
s_ticks_per_us = ticks_total / us;
log::info(logs::apic, "APIC timer ticks %d times per microsecond.", m_ticks_per_us);
log::info(logs::apic, "APIC timer ticks %d times per microsecond.", s_ticks_per_us);
interrupts_enable();
}
@@ -95,7 +147,7 @@ lapic::set_divisor(uint8_t divisor)
case 64: divbits = 0x9; break;
case 128: divbits = 0xa; break;
default:
kassert(0, "Invalid divisor passed to lapic::enable_timer");
kassert(0, "Invalid divisor passed to lapic::set_divisor");
}
apic_write(m_base, lapic_timer_div, divbits);

View File

@@ -3,6 +3,7 @@
/// Classes to control both local and I/O APICs.
#include <stdint.h>
#include "kutil/enum_bitfields.h"
enum class isr : uint8_t;
@@ -18,6 +19,22 @@ protected:
uint32_t *m_base;
};
enum class ipi : uint32_t
{
// Delivery modes
fixed = 0x0000,
smi = 0x0200,
nmi = 0x0400,
init = 0x0500,
startup = 0x0600,
// Flags
deassert = 0x0000,
assert = 0x4000,
edge = 0x0000, ///< edge-triggered
level = 0x8000, ///< level-triggered
};
IS_BITFIELD(ipi);
/// Controller for processor-local APICs
class lapic :
@@ -26,8 +43,26 @@ class lapic :
public:
/// Constructor
/// \arg base Physical base address of the APIC's MMIO registers
/// \arg spurious Vector of the spurious interrupt handler
lapic(uintptr_t base, isr spurious);
lapic(uintptr_t base);
/// Get the local APIC's ID
uint8_t get_id();
/// Send an inter-processor interrupt.
/// \arg mode The sending mode
/// \arg vector The interrupt vector
/// \arg dest The APIC ID of the destination
void send_ipi(ipi mode, uint8_t vector, uint8_t dest);
/// Send an inter-processor broadcast interrupt to all other CPUs
/// \arg mode The sending mode
/// \arg self If true, include this CPU in the broadcast
/// \arg vector The interrupt vector
void send_ipi_broadcast(ipi mode, bool self, uint8_t vector);
/// Wait for an IPI to finish sending. This is done automatically
/// before sending another IPI with send_ipi().
void ipi_wait();
/// Enable interrupts for the LAPIC timer.
/// \arg vector Interrupt vector the timer should use
@@ -57,19 +92,14 @@ public:
void calibrate_timer();
private:
inline uint64_t ticks_to_us(uint32_t ticks) const {
return static_cast<uint64_t>(ticks) / m_ticks_per_us;
}
inline uint64_t us_to_ticks(uint64_t interval) const {
return interval * m_ticks_per_us;
}
inline static uint64_t ticks_to_us(uint64_t ticks) { return ticks / s_ticks_per_us; }
inline static uint64_t us_to_ticks(uint64_t interval) { return interval * s_ticks_per_us; }
void set_divisor(uint8_t divisor);
void set_repeat(bool repeat);
uint32_t m_divisor;
uint32_t m_ticks_per_us;
static uint64_t s_ticks_per_us;
};

View File

@@ -17,6 +17,6 @@ void
clock::spinwait(uint64_t us) const
{
uint64_t when = value() + us;
while (value() < when);
while (value() < when) asm ("pause");
}

View File

@@ -2,10 +2,18 @@
#include "kutil/assert.h"
#include "kutil/memory.h"
#include "cpu.h"
#include "cpu/cpu.h"
#include "cpu/cpu_id.h"
#include "device_manager.h"
#include "gdt.h"
#include "idt.h"
#include "kernel_memory.h"
#include "log.h"
#include "msr.h"
#include "objects/vm_area.h"
#include "syscall.h"
#include "tss.h"
cpu_data bsp_cpu_data;
cpu_data g_bsp_cpu_data;
void
cpu_validate()
@@ -29,3 +37,30 @@ cpu_validate()
#undef CPU_FEATURE_OPT
#undef CPU_FEATURE_REQ
}
void
cpu_early_init(cpu_data *cpu)
{
IDT::get().install();
cpu->gdt->install();
// Install the GS base pointing to the cpu_data
wrmsr(msr::ia32_gs_base, reinterpret_cast<uintptr_t>(cpu));
}
void
cpu_init(cpu_data *cpu, bool bsp)
{
if (!bsp) {
// The BSP already called cpu_early_init
cpu_early_init(cpu);
}
// Set up the syscall MSRs
syscall_enable();
// Set up the page attributes table
uint64_t pat = rdmsr(msr::ia32_pat);
pat = (pat & 0x00ffffffffffffffull) | (0x01ull << 56); // set PAT 7 to WC
wrmsr(msr::ia32_pat, pat);
}

View File

@@ -2,9 +2,12 @@
#include <stdint.h>
class GDT;
class lapic;
class process;
struct TCB;
class thread;
class process;
class TSS;
struct cpu_state
{
@@ -18,15 +21,39 @@ struct cpu_state
/// version in 'tasking.inc'
struct cpu_data
{
cpu_data *self;
uint16_t id;
uint16_t index;
uint32_t reserved;
uintptr_t rsp0;
uintptr_t rsp3;
TCB *tcb;
thread *t;
process *p;
thread *thread;
process *process;
TSS *tss;
GDT *gdt;
// Members beyond this point do not appear in
// the assembly version
lapic *apic;
};
extern cpu_data bsp_cpu_data;
extern "C" cpu_data * _current_gsbase();
// We already validated the required options in the bootloader,
// but iterate the options and log about them.
/// Set up the running CPU. This sets GDT, IDT, and necessary MSRs as well as creating
/// the cpu_data structure for this processor.
/// \arg cpu The cpu_data structure for this CPU
/// \arg bsp True if this CPU is the BSP
void cpu_init(cpu_data *cpu, bool bsp);
/// Do early (before cpu_init) initialization work. Only needs to be called manually for
/// the BSP, otherwise cpu_init will call it.
/// \arg cpu The cpu_data structure for this CPU
void cpu_early_init(cpu_data *cpu);
/// Get the cpu_data struct for the current executing CPU
inline cpu_data & current_cpu() { return *_current_gsbase(); }
/// Validate the required CPU features are present. Really, the bootloader already
/// validated the required features, but still iterate the options and log about them.
void cpu_validate();

View File

@@ -13,6 +13,7 @@ void
print_regs(const cpu_state &regs)
{
console *cons = console::get();
cpu_data &cpu = current_cpu();
uint64_t cr2 = 0;
__asm__ __volatile__ ("mov %%cr2, %0" : "=r"(cr2));
@@ -20,8 +21,8 @@ print_regs(const cpu_state &regs)
uintptr_t cr3 = 0;
__asm__ __volatile__ ( "mov %%cr3, %0" : "=r" (cr3) );
cons->printf(" process: %llx", bsp_cpu_data.p->koid());
cons->printf(" thread: %llx\n", bsp_cpu_data.t->koid());
cons->printf(" process: %llx", cpu.process->koid());
cons->printf(" thread: %llx\n", cpu.thread->koid());
print_regL("rax", regs.rax);
print_regM("rbx", regs.rbx);
@@ -43,7 +44,7 @@ print_regs(const cpu_state &regs)
cons->puts("\n\n");
print_regL("rbp", regs.rbp);
print_regM("rsp", regs.user_rsp);
print_regR("sp0", bsp_cpu_data.rsp0);
print_regR("sp0", cpu.rsp0);
print_regL("rip", regs.rip);
print_regM("cr3", cr3);

View File

@@ -4,6 +4,8 @@
#include <stdint.h>
struct cpu_state;
extern "C" {
uintptr_t get_rsp();
uintptr_t get_rip();

View File

@@ -63,7 +63,7 @@ void irq4_callback(void *)
device_manager::device_manager() :
m_lapic(nullptr)
m_lapic_base(0)
{
m_irqs.ensure_capacity(32);
m_irqs.set_size(16);
@@ -106,6 +106,26 @@ device_manager::parse_acpi(const void *root_table)
load_xsdt(memory::to_virtual(acpi2->xsdt_address));
}
const device_manager::apic_nmi *
device_manager::get_lapic_nmi(uint8_t id) const
{
for (const auto &nmi : m_nmis) {
if (nmi.cpu == 0xff || nmi.cpu == id)
return &nmi;
}
return nullptr;
}
const device_manager::irq_override *
device_manager::get_irq_override(uint8_t irq) const
{
for (const auto &o : m_overrides)
if (o.source == irq) return &o;
return nullptr;
}
ioapic *
device_manager::get_ioapic(int i)
{
@@ -163,38 +183,38 @@ device_manager::load_apic(const acpi_table_header *header)
{
const auto *apic = check_get_table<acpi_apic>(header);
uintptr_t local = apic->local_address;
m_lapic = new lapic(local, isr::isrSpurious);
m_lapic_base = apic->local_address;
size_t count = acpi_table_entries(apic, 1);
uint8_t const *p = apic->controller_data;
uint8_t const *end = p + count;
// Pass one: count IOAPIC objects
int num_ioapics = 0;
// Pass one: count objects
unsigned num_lapics = 0;
unsigned num_ioapics = 0;
unsigned num_overrides = 0;
unsigned num_nmis = 0;
while (p < end) {
const uint8_t type = p[0];
const uint8_t length = p[1];
if (type == 1) num_ioapics++;
switch (type) {
case 0: ++num_lapics; break;
case 1: ++num_ioapics; break;
case 2: ++num_overrides; break;
case 4: ++num_nmis; break;
default: break;
}
p += length;
}
m_apic_ids.set_capacity(num_lapics);
m_ioapics.set_capacity(num_ioapics);
m_overrides.set_capacity(num_overrides);
m_nmis.set_capacity(num_nmis);
// Pass two: set up IOAPIC objects
p = apic->controller_data;
while (p < end) {
const uint8_t type = p[0];
const uint8_t length = p[1];
if (type == 1) {
uintptr_t base = kutil::read_from<uint32_t>(p+4);
uint32_t base_gsr = kutil::read_from<uint32_t>(p+8);
m_ioapics.emplace(base, base_gsr);
}
p += length;
}
// Pass three: configure APIC objects
// Pass two: configure objects
p = apic->controller_data;
while (p < end) {
const uint8_t type = p[0];
@@ -204,38 +224,42 @@ device_manager::load_apic(const acpi_table_header *header)
case 0: { // Local APIC
uint8_t uid = kutil::read_from<uint8_t>(p+2);
uint8_t id = kutil::read_from<uint8_t>(p+3);
log::debug(logs::device, " Local APIC uid %x id %x", id);
m_apic_ids.append(id);
log::debug(logs::device, " Local APIC uid %x id %x", uid, id);
}
break;
case 1: // I/O APIC
case 1: { // I/O APIC
uintptr_t base = kutil::read_from<uint32_t>(p+4);
uint32_t base_gsi = kutil::read_from<uint32_t>(p+8);
m_ioapics.emplace(base, base_gsi);
log::debug(logs::device, " IO APIC gsi %x base %x", base_gsi, base);
}
break;
case 2: { // Interrupt source override
uint8_t source = kutil::read_from<uint8_t>(p+3);
isr gsi = isr::irq00 + kutil::read_from<uint32_t>(p+4);
uint16_t flags = kutil::read_from<uint16_t>(p+8);
irq_override o;
o.source = kutil::read_from<uint8_t>(p+3);
o.gsi = kutil::read_from<uint32_t>(p+4);
o.flags = kutil::read_from<uint16_t>(p+8);
m_overrides.append(o);
log::debug(logs::device, " Intr source override IRQ %d -> %d Pol %d Tri %d",
o.source, o.gsi, (o.flags & 0x3), ((o.flags >> 2) & 0x3));
}
break;
case 4: { // LAPIC NMI
apic_nmi nmi;
nmi.cpu = kutil::read_from<uint8_t>(p + 2);
nmi.lint = kutil::read_from<uint8_t>(p + 5);
nmi.flags = kutil::read_from<uint16_t>(p + 3);
m_nmis.append(nmi);
log::debug(logs::device, " LAPIC NMI Proc %02x LINT%d Pol %d Tri %d",
nmi.cpu, nmi.lint, nmi.flags & 0x3, (nmi.flags >> 2) & 0x3);
}
break;
@@ -245,17 +269,6 @@ device_manager::load_apic(const acpi_table_header *header)
p += length;
}
}
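The MADT controller entries walked above are variable-length records that each begin with a type byte and a length byte, which is what makes the two-pass (count, then parse) approach work. Below is a minimal, self-contained sketch of that walk; `madt_counts` and `count_entries` are illustrative names, not part of the kernel, and the entry types mirror the ACPI MADT values used above (0 = local APIC, 1 = I/O APIC, 2 = source override, 4 = NMI).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Counts gathered by pass one, used to size the containers before pass
// two actually parses the entries.
struct madt_counts { unsigned lapics = 0, ioapics = 0, overrides = 0, nmis = 0; };

madt_counts count_entries(const uint8_t *p, size_t size)
{
    madt_counts c;
    const uint8_t *end = p + size;
    while (p < end) {
        const uint8_t type = p[0];
        const uint8_t length = p[1];
        if (length == 0) break; // malformed entry: avoid an infinite loop
        switch (type) {
            case 0: ++c.lapics; break;
            case 1: ++c.ioapics; break;
            case 2: ++c.overrides; break;
            case 4: ++c.nmis; break;
            default: break;
        }
        p += length; // each record advances by its own length byte
    }
    return c;
}
```

The zero-length guard is an addition over the kernel code above; firmware tables are untrusted input, so a corrupt length byte should not hang the parse.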
void


@@ -24,10 +24,6 @@ public:
/// \returns A reference to the system device manager
static device_manager & get() { return s_instance; }
/// Get an IOAPIC
/// \arg i Index of the requested IOAPIC
/// \returns An object representing the given IOAPIC if it exists,
@@ -68,6 +64,39 @@ public:
/// \returns True if the interrupt was handled
bool dispatch_irq(unsigned irq);
struct apic_nmi
{
uint8_t cpu;
uint8_t lint;
uint16_t flags;
};
struct irq_override
{
uint8_t source;
uint16_t flags;
uint32_t gsi;
};
/// Get the list of APIC ids for other CPUs
inline const kutil::vector<uint8_t> & get_apic_ids() const { return m_apic_ids; }
/// Get the LAPIC base address
/// \returns The physical base address of the local apic registers
uintptr_t get_lapic_base() const { return m_lapic_base; }
/// Get the NMI mapping for the given local APIC
/// \arg id ID of the local APIC
/// \returns apic_nmi structure describing the NMI configuration,
/// or null if no configuration was provided
const apic_nmi * get_lapic_nmi(uint8_t id) const;
/// Get the IRQ source override for the given IRQ
/// \arg irq IRQ number (not isr vector)
/// \returns irq_override structure describing that IRQ's
/// configuration, or null if no configuration was provided
const irq_override * get_irq_override(uint8_t irq) const;
/// Register the existence of a block device.
/// \arg blockdev Pointer to the block device
void register_block_device(block_device *blockdev);
@@ -119,9 +148,13 @@ private:
/// that has no callback.
void bad_irq(uint8_t irq);
uintptr_t m_lapic_base;
kutil::vector<ioapic> m_ioapics;
kutil::vector<hpet> m_hpets;
kutil::vector<uint8_t> m_apic_ids;
kutil::vector<apic_nmi> m_nmis;
kutil::vector<irq_override> m_overrides;
kutil::vector<pci_group> m_pci;
kutil::vector<pci_device> m_devices;


@@ -1,6 +1,6 @@
#include "kutil/assert.h"
#include "kutil/memory.h"
#include "frame_allocator.h"
#include "kernel_args.h"
#include "kernel_memory.h"
@@ -17,22 +17,24 @@ frame_allocator::get()
}
frame_allocator::frame_allocator(kernel::args::frame_block *frames, size_t count) :
m_blocks {frames},
m_count {count}
{
}
inline unsigned
bsf(uint64_t v)
{
asm ("tzcntq %q0, %q1" : "=r"(v) : "0"(v) : "cc");
return v;
}
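The one-character constraint change above is the whole #DF fix from the commit log: with `"r"(v)` the compiler is free to place the input in a *different* register than output operand 0, so under -O3 the `tzcnt` could read a register that was never written. The matching constraint `"0"(v)` ties the input to operand 0's register. A hedged, standalone sketch of the corrected pattern (x86-64 assumed; a portable builtin fallback is added so this sketch compiles elsewhere, and `bit_scan_forward` is an illustrative name):

```cpp
#include <cassert>
#include <cstdint>

// Find the index of the lowest set bit. The "0"(v) matching constraint
// guarantees the asm input shares output operand 0's register.
inline unsigned bit_scan_forward(uint64_t v)
{
#if defined(__x86_64__)
    asm ("tzcntq %q0, %q1" : "=r"(v) : "0"(v) : "cc");
    return v;
#else
    return __builtin_ctzll(v); // portable fallback for this sketch only
#endif
}
```

Note the result is undefined for `v == 0` (tzcnt would return 64, `__builtin_ctzll` is undefined), which matches the allocator's use: it only scans nonzero bitmaps.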
size_t
frame_allocator::allocate(size_t count, uintptr_t *address)
{
kutil::scoped_lock lock {m_lock};
for (long i = m_count - 1; i >= 0; --i) {
frame_block &block = m_blocks[i];
if (!block.map1)
@@ -80,6 +82,8 @@ frame_allocator::allocate(size_t count, uintptr_t *address)
void
frame_allocator::free(uintptr_t address, size_t count)
{
kutil::scoped_lock lock {m_lock};
kassert(address % frame_size == 0, "Trying to free a non page-aligned frame!");
if (!count)
@@ -116,6 +120,8 @@ frame_allocator::free(uintptr_t address, size_t count)
void
frame_allocator::used(uintptr_t address, size_t count)
{
kutil::scoped_lock lock {m_lock};
kassert(address % frame_size == 0, "Trying to mark a non page-aligned frame!");
if (!count)


@@ -3,6 +3,7 @@
/// Allocator for physical memory frames
#include <stdint.h>
#include "kutil/spinlock.h"
namespace kernel {
namespace args {
@@ -43,7 +44,9 @@ public:
private:
frame_block *m_blocks;
size_t m_count;
kutil::spinlock m_lock;
frame_allocator() = delete;
frame_allocator(const frame_allocator &) = delete;


@@ -1,36 +1,80 @@
#include <stdint.h>
#include "kutil/assert.h"
#include "kutil/enum_bitfields.h"
#include "kutil/memory.h"
#include "kutil/no_construct.h"
#include "console.h"
#include "kernel_memory.h"
#include "cpu.h"
#include "gdt.h"
#include "log.h"
#include "tss.h"
extern "C" void gdt_write(const void *gdt_ptr, uint16_t cs, uint16_t ds, uint16_t tr);
static constexpr uint8_t kern_cs_index = 1;
static constexpr uint8_t kern_ss_index = 2;
static constexpr uint8_t user_cs32_index = 3;
static constexpr uint8_t user_ss_index = 4;
static constexpr uint8_t user_cs64_index = 5;
static constexpr uint8_t tss_index = 6; // Note that this takes TWO GDT entries
// The BSP's GDT is initialized _before_ global constructors are called,
// so we don't want it to have a global constructor, lest it overwrite
// the previous initialization.
static kutil::no_construct<GDT> __g_bsp_gdt_storage;
GDT &g_bsp_gdt = __g_bsp_gdt_storage.value;
GDT::GDT(TSS *tss) :
m_tss(tss)
{
kutil::memset(this, 0, sizeof(GDT));
m_ptr.limit = sizeof(m_entries) - 1;
m_ptr.base = &m_entries[0];
// Kernel CS/SS - always 64bit
set(kern_cs_index, 0, 0xfffff, true, gdt_type::read_write | gdt_type::execute);
set(kern_ss_index, 0, 0xfffff, true, gdt_type::read_write);
// User CS32/SS/CS64 - layout expected by SYSRET
set(user_cs32_index, 0, 0xfffff, false, gdt_type::ring3 | gdt_type::read_write | gdt_type::execute);
set(user_ss_index, 0, 0xfffff, true, gdt_type::ring3 | gdt_type::read_write);
set(user_cs64_index, 0, 0xfffff, true, gdt_type::ring3 | gdt_type::read_write | gdt_type::execute);
set_tss(tss);
}
GDT &
GDT::current()
{
cpu_data &cpu = current_cpu();
return *cpu.gdt;
}
void
GDT::install() const
{
gdt_write(
static_cast<const void*>(&m_ptr),
kern_cs_index << 3,
kern_ss_index << 3,
tss_index << 3);
}
void
GDT::set(uint8_t i, uint32_t base, uint64_t limit, bool is64, gdt_type type)
{
m_entries[i].limit_low = limit & 0xffff;
m_entries[i].size = (limit >> 16) & 0xf;
m_entries[i].size |= (is64 ? 0xa0 : 0xc0);
m_entries[i].base_low = base & 0xffff;
m_entries[i].base_mid = (base >> 16) & 0xff;
m_entries[i].base_high = (base >> 24) & 0xff;
m_entries[i].type = type | gdt_type::system | gdt_type::present;
}
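`GDT::set` above scatters a 20-bit limit and a 32-bit base across the descriptor's split fields; the `0xa0`/`0xc0` bytes OR'd into `size` set granularity plus the L (64-bit) or D/B (32-bit) flag. A standalone sketch of that packing, useful for sanity-checking the shifts (the `fields` struct and `encode` helper are illustrative, mirroring the private `descriptor` layout):

```cpp
#include <cassert>
#include <cstdint>

// Split fields of a legacy segment descriptor, in declaration order
// matching GDT::set above: limit split 16/4, base split 16/8/8.
struct fields {
    uint16_t limit_low;
    uint16_t base_low;
    uint8_t base_mid;
    uint8_t size;      // limit[19:16] in low nibble, flags in high nibble
    uint8_t base_high;
};

fields encode(uint32_t base, uint32_t limit, bool is64)
{
    fields d {};
    d.limit_low = limit & 0xffff;
    d.size = ((limit >> 16) & 0xf) | (is64 ? 0xa0 : 0xc0);
    d.base_low = base & 0xffff;
    d.base_mid = (base >> 16) & 0xff;
    d.base_high = (base >> 24) & 0xff;
    return d;
}
```

In long mode the CPU ignores base and limit for code/data segments, which is why every entry above can use base 0 and limit 0xfffff; only the type, DPL, and L/D bits matter.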
struct tss_descriptor
{
@@ -44,72 +88,16 @@ struct tss_descriptor
uint32_t reserved;
} __attribute__ ((packed));
void
GDT::set_tss(TSS *tss)
{
tss_descriptor tssd;
size_t limit = sizeof(TSS);
tssd.limit_low = limit & 0xffff;
tssd.size = (limit >> 16) & 0xf;
uintptr_t base = reinterpret_cast<uintptr_t>(tss);
tssd.base_00 = base & 0xffff;
tssd.base_16 = (base >> 16) & 0xff;
tssd.base_24 = (base >> 24) & 0xff;
@@ -121,123 +109,26 @@ tss_set_entry(uint8_t i, uint64_t base, uint64_t limit)
gdt_type::execute |
gdt_type::ring3 |
gdt_type::present;
kutil::memcpy(&m_entries[tss_index], &tssd, sizeof(tss_descriptor));
}
void
GDT::dump(unsigned index) const
{
console *cons = console::get();
unsigned start = 0;
unsigned count = (m_ptr.limit + 1) / sizeof(descriptor);
if (index != -1) {
start = index;
count = 1;
} else {
cons->printf(" GDT: loc:%lx size:%d\n", m_ptr.base, m_ptr.limit+1);
}
const descriptor *gdt =
reinterpret_cast<const descriptor *>(m_ptr.base);
for (int i = start; i < start+count; ++i) {
uint32_t base =
@@ -275,51 +166,3 @@ gdt_dump(unsigned index)
(gdt[i].size & 0x60) == 0x40 ? "32" : "16");
}
}


@@ -1,58 +1,66 @@
#pragma once
/// \file gdt.h
/// Definitions relating to a CPU's GDT table
#include <stdint.h>
#include "kutil/enum_bitfields.h"
class TSS;
enum class gdt_type : uint8_t
{
accessed = 0x01,
read_write = 0x02,
conforming = 0x04,
execute = 0x08,
system = 0x10,
ring1 = 0x20,
ring2 = 0x40,
ring3 = 0x60,
present = 0x80
};
IS_BITFIELD(gdt_type);
class GDT
{
public:
GDT(TSS *tss);
/// Get the currently running CPU's GDT
static GDT & current();
/// Install this GDT to the current CPU
void install() const;
/// Get the address of the GDT pointer structure
inline const void * pointer() const { return static_cast<const void*>(&m_ptr); }
/// Dump debug information about the GDT to the console.
/// \arg index Which entry to print, or -1 for all entries
void dump(unsigned index = -1) const;
private:
void set(uint8_t i, uint32_t base, uint64_t limit, bool is64, gdt_type type);
void set_tss(TSS *tss);
struct descriptor
{
uint16_t limit_low;
uint16_t base_low;
uint8_t base_mid;
gdt_type type;
uint8_t size;
uint8_t base_high;
} __attribute__ ((packed, aligned(8)));
struct ptr
{
uint16_t limit;
descriptor *base;
} __attribute__ ((packed, aligned(4)));
descriptor m_entries[8];
TSS *m_tss;
ptr m_ptr;
};


@@ -1,35 +0,0 @@
extern g_idtr
extern g_gdtr
global idt_write
idt_write:
lidt [rel g_idtr]
ret
global idt_load
idt_load:
sidt [rel g_idtr]
ret
global gdt_write
gdt_write:
lgdt [rel g_gdtr]
mov ax, si ; second arg is data segment
mov ds, ax
mov es, ax
mov fs, ax
mov gs, ax
mov ss, ax
push qword rdi ; first arg is code segment
lea rax, [rel .next]
push rax
o64 retf
.next:
ltr dx ; third arg is the TSS
ret
global gdt_load
gdt_load:
sgdt [rel g_gdtr]
ret

src/kernel/gdtidt.s Normal file

@@ -0,0 +1,35 @@
global idt_write
idt_write:
lidt [rdi] ; first arg is the IDT pointer location
ret
global idt_load
idt_load:
sidt [rdi] ; first arg is where to write the idtr value
ret
global gdt_write
gdt_write:
lgdt [rdi] ; first arg is the GDT pointer location
mov ax, dx ; third arg is data segment
mov ds, ax
mov es, ax
mov fs, ax
mov gs, ax
mov ss, ax
push qword rsi ; second arg is code segment
lea rax, [rel .next]
push rax
o64 retf
.next:
ltr cx ; fourth arg is the TSS
ret
global gdt_load
gdt_load:
sgdt [rdi] ; first arg is where to write the gdtr value
ret

src/kernel/idt.cpp Normal file

@@ -0,0 +1,137 @@
#include "kutil/memory.h"
#include "kutil/no_construct.h"
#include "idt.h"
#include "log.h"
extern "C" {
void idt_write(const void *idt_ptr);
#define ISR(i, s, name) extern void name ();
#define EISR(i, s, name) extern void name ();
#define IRQ(i, q, name) extern void name ();
#include "interrupt_isrs.inc"
#undef IRQ
#undef EISR
#undef ISR
}
// The IDT is initialized _before_ global constructors are called,
// so we don't want it to have a global constructor, lest it overwrite
// the previous initialization.
static kutil::no_construct<IDT> __g_idt_storage;
IDT &g_idt = __g_idt_storage.value;
IDT::IDT()
{
kutil::memset(this, 0, sizeof(IDT));
m_ptr.limit = sizeof(m_entries) - 1;
m_ptr.base = &m_entries[0];
#define ISR(i, s, name) set(i, & name, 0x08, 0x8e);
#define EISR(i, s, name) set(i, & name, 0x08, 0x8e);
#define IRQ(i, q, name) set(i, & name, 0x08, 0x8e);
#include "interrupt_isrs.inc"
#undef IRQ
#undef EISR
#undef ISR
}
IDT &
IDT::get()
{
return g_idt;
}
void
IDT::install() const
{
idt_write(static_cast<const void*>(&m_ptr));
}
void
IDT::add_ist_entries()
{
#define ISR(i, s, name) if (s) { set_ist(i, s); }
#define EISR(i, s, name) if (s) { set_ist(i, s); }
#define IRQ(i, q, name)
#include "interrupt_isrs.inc"
#undef IRQ
#undef EISR
#undef ISR
}
uint8_t
IDT::used_ist_entries() const
{
uint8_t entries = 0;
#define ISR(i, s, name) if (s) { entries |= (1 << s); }
#define EISR(i, s, name) if (s) { entries |= (1 << s); }
#define IRQ(i, q, name)
#include "interrupt_isrs.inc"
#undef IRQ
#undef EISR
#undef ISR
return entries;
}
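`used_ist_entries` above hides its accumulation inside the X-macro expansion of `interrupt_isrs.inc`. The same logic written out plainly (the `ist_bitmap` helper is hypothetical): every nonzero IST index 1-7 sets one bit, so the TSS code can later create exactly the stacks the ISR table references and no more.

```cpp
#include <cassert>
#include <cstdint>

// Collapse a table of per-entry IST indices (0 = none, 1-7 = IST slot)
// into a single bitmap byte, mirroring IDT::used_ist_entries.
uint8_t ist_bitmap(const uint8_t *ist_indices, unsigned n)
{
    uint8_t entries = 0;
    for (unsigned i = 0; i < n; ++i)
        if (ist_indices[i])
            entries |= (1 << ist_indices[i]);
    return entries;
}
```

Duplicate indices are harmless here: OR-ing the same bit twice still yields one stack allocation per distinct IST slot.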
void
IDT::set(uint8_t i, void (*handler)(), uint16_t selector, uint8_t flags)
{
uintptr_t addr = reinterpret_cast<uintptr_t>(handler);
m_entries[i].base_low = addr & 0xffff;
m_entries[i].base_mid = (addr >> 16) & 0xffff;
m_entries[i].base_high = (addr >> 32) & 0xffffffff;
m_entries[i].selector = selector;
m_entries[i].flags = flags;
m_entries[i].ist = 0;
m_entries[i].reserved = 0;
}
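`IDT::set` splits the 64-bit handler address into 16/16/32-bit gate fields, and `IDT::dump` reassembles it; the round trip has to be exact or interrupts land at the wrong address. A minimal sketch of that split/join (hypothetical `split`/`join` helpers over a bare struct with the same shifts):

```cpp
#include <cassert>
#include <cstdint>

// The three address fields of a 64-bit IDT gate, as in IDT::set above.
struct gate { uint16_t base_low; uint16_t base_mid; uint32_t base_high; };

gate split(uint64_t addr)
{
    return { static_cast<uint16_t>(addr & 0xffff),
             static_cast<uint16_t>((addr >> 16) & 0xffff),
             static_cast<uint32_t>((addr >> 32) & 0xffffffff) };
}

uint64_t join(const gate &g)
{
    return (static_cast<uint64_t>(g.base_high) << 32)
         | (static_cast<uint64_t>(g.base_mid) << 16)
         | g.base_low;
}
```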
void
IDT::dump(unsigned index) const
{
unsigned start = 0;
unsigned count = (m_ptr.limit + 1) / sizeof(descriptor);
if (index != -1) {
start = index;
count = 1;
log::info(logs::boot, "IDT FOR INDEX %02x", index);
} else {
log::info(logs::boot, "Loaded IDT at: %lx size: %d bytes", m_ptr.base, m_ptr.limit+1);
}
const descriptor *idt =
reinterpret_cast<const descriptor *>(m_ptr.base);
for (int i = start; i < start+count; ++i) {
uint64_t base =
(static_cast<uint64_t>(idt[i].base_high) << 32) |
(static_cast<uint64_t>(idt[i].base_mid) << 16) |
idt[i].base_low;
char const *type;
switch (idt[i].flags & 0xf) {
case 0x5: type = " 32tsk "; break;
case 0x6: type = " 16int "; break;
case 0x7: type = " 16trp "; break;
case 0xe: type = " 32int "; break;
case 0xf: type = " 32trp "; break;
default: type = " ????? "; break;
}
if (idt[i].flags & 0x80) {
log::debug(logs::boot,
" Entry %3d: Base:%lx Sel(rpl %d, ti %d, %3d) IST:%d %s DPL:%d", i, base,
(idt[i].selector & 0x3),
((idt[i].selector & 0x4) >> 2),
(idt[i].selector >> 3),
idt[i].ist,
type,
((idt[i].flags >> 5) & 0x3));
}
}
}

src/kernel/idt.h Normal file

@@ -0,0 +1,63 @@
#pragma once
/// \file idt.h
/// Definitions relating to a CPU's IDT table
#include <stdint.h>
class IDT
{
public:
IDT();
/// Install this IDT to the current CPU
void install() const;
/// Add the IST entries listed in the ISR table into the IDT.
/// This can't be done until after memory is set up so the
/// stacks can be created.
void add_ist_entries();
/// Get the IST entry used by an entry.
/// \arg i Which IDT entry to look in
/// \returns The IST index used by entry i, or 0 for none
inline uint8_t get_ist(uint8_t i) const {
return m_entries[i].ist;
}
/// Set the IST entry used by an entry.
/// \arg i Which IDT entry to set
/// \arg ist The IST index for entry i, or 0 for none
void set_ist(uint8_t i, uint8_t ist) { m_entries[i].ist = ist; }
/// Get the IST entries that are used by this table, as a bitmap
uint8_t used_ist_entries() const;
/// Dump debug information about the IDT to the console.
/// \arg index Which entry to print, or -1 for all entries
void dump(unsigned index = -1) const;
/// Get the global IDT
static IDT & get();
private:
void set(uint8_t i, void (*handler)(), uint16_t selector, uint8_t flags);
struct descriptor
{
uint16_t base_low;
uint16_t selector;
uint8_t ist;
uint8_t flags;
uint16_t base_mid;
uint32_t base_high;
uint32_t reserved; // must be zero
} __attribute__ ((packed, aligned(16)));
struct ptr
{
uint16_t limit;
descriptor *base;
} __attribute__ ((packed, aligned(4)));
descriptor m_entries[256];
ptr m_ptr;
};


@@ -240,6 +240,7 @@ IRQ (0xdf, 0xbf, irqBF)
ISR (0xe0, 0, isrTimer)
ISR (0xe1, 0, isrLINT0)
ISR (0xe2, 0, isrLINT1)
ISR (0xe3, 0, isrAPICError)
ISR (0xe4, 0, isrAssert)
ISR (0xef, 0, isrSpurious)


@@ -8,6 +8,7 @@
#include "debug.h"
#include "device_manager.h"
#include "gdt.h"
#include "idt.h"
#include "interrupts.h"
#include "io.h"
#include "kernel_memory.h"
@@ -15,6 +16,7 @@
#include "objects/process.h"
#include "scheduler.h"
#include "syscall.h"
#include "tss.h"
#include "vm_space.h"
static const uint16_t PIC1 = 0x20;
@@ -22,19 +24,14 @@ static const uint16_t PIC2 = 0xa0;
constexpr uintptr_t apic_eoi_addr = 0xfee000b0 + ::memory::page_offset;
extern "C" {
void _halt();
void isr_handler(cpu_state*);
void irq_handler(cpu_state*);
}
isr
@@ -60,7 +57,7 @@ get_irq(unsigned vector)
}
}
void
disable_legacy_pic()
{
// Mask all interrupts
@@ -80,28 +77,17 @@ disable_legacy_pic()
outb(PIC2+1, 0x02); io_wait();
}
void
isr_handler(cpu_state *regs)
{
console *cons = console::get();
uint8_t vector = regs->interrupt & 0xff;
// Clear out the IST for this vector so we just keep using
// this stack
uint8_t old_ist = IDT::get().get_ist(vector);
if (old_ist)
IDT::get().set_ist(vector, 0);
switch (static_cast<isr>(vector)) {
@@ -137,6 +123,16 @@ isr_handler(cpu_state *regs)
}
break;
case isr::isrDoubleFault:
cons->set_color(9);
cons->printf("\nDouble Fault:\n");
cons->set_color();
print_regs(*regs);
print_stacktrace(2);
_halt();
break;
case isr::isrGPFault: {
cons->set_color(9);
cons->puts("\nGeneral Protection Fault:\n");
@@ -150,13 +146,13 @@ isr_handler(cpu_state *regs)
switch ((regs->errorcode & 0x07) >> 1) {
case 0:
cons->printf(" GDT[%x]\n", index);
GDT::current().dump(index);
break;
case 1:
case 3:
cons->printf(" IDT[%x]\n", index);
IDT::get().dump(index);
break;
default:
@@ -275,7 +271,10 @@ isr_handler(cpu_state *regs)
print_stacktrace(2);
_halt();
}
// Return the IST for this vector to what it was
if (old_ist)
IDT::get().set_ist(vector, old_ist);
*reinterpret_cast<uint32_t *>(apic_eoi_addr) = 0;
}


@@ -29,6 +29,5 @@ extern "C" {
void interrupts_disable();
}
/// Disable the legacy PIC
void disable_legacy_pic();


@@ -1,8 +1,14 @@
%include "push_all.inc"
section .text
extern isr_handler
global isr_handler_prelude:function (isr_handler_prelude.end - isr_handler_prelude)
isr_handler_prelude:
push rbp ; Never executed, fake function prelude
mov rbp, rsp ; to calm down gdb
.real:
push_all
check_swap_gs
@@ -10,10 +16,15 @@ isr_handler_prelude:
mov rsi, rsp
call isr_handler
jmp isr_handler_return
.end:
extern irq_handler
global irq_handler_prelude:function (irq_handler_prelude.end - irq_handler_prelude)
irq_handler_prelude:
push rbp ; Never executed, fake function prelude
mov rbp, rsp ; to calm down gdb
.real:
push_all
check_swap_gs
@@ -21,36 +32,41 @@ irq_handler_prelude:
mov rsi, rsp
call irq_handler
; fall through to isr_handler_return
.end:
global isr_handler_return:function (isr_handler_return.end - isr_handler_return)
isr_handler_return:
check_swap_gs
pop_all
add rsp, 16 ; because the ISRs added err/num
iretq
.end:
%macro EMIT_ISR 2
global %1:function (%1.end - %1)
%1:
push 0
push %2
jmp isr_handler_prelude.real
.end:
%endmacro
%macro EMIT_EISR 2
global %1:function (%1.end - %1)
%1:
push %2
jmp isr_handler_prelude.real
.end:
%endmacro
%macro EMIT_IRQ 2
global %1:function (%1.end - %1)
%1:
push 0
push %2
jmp irq_handler_prelude.real
.end:
%endmacro
%define EISR(i, s, name) EMIT_EISR name, i ; ISR with error code


@@ -6,22 +6,28 @@
#include "kutil/assert.h"
#include "apic.h"
#include "block_device.h"
#include "clock.h"
#include "console.h"
#include "cpu.h"
#include "device_manager.h"
#include "gdt.h"
#include "idt.h"
#include "interrupts.h"
#include "io.h"
#include "kernel_args.h"
#include "kernel_memory.h"
#include "log.h"
#include "msr.h"
#include "objects/channel.h"
#include "objects/event.h"
#include "objects/thread.h"
#include "objects/vm_area.h"
#include "scheduler.h"
#include "serial.h"
#include "symbol_table.h"
#include "syscall.h"
#include "tss.h"
#include "vm_space.h"
#ifndef GIT_VERSION
#define GIT_VERSION
@@ -31,18 +37,26 @@ extern "C" {
void kernel_main(kernel::args::header *header);
void (*__ctors)(void);
void (*__ctors_end)(void);
void long_ap_startup(cpu_data *cpu);
void ap_startup();
void ap_idle();
void init_ap_trampoline(void*, cpu_data *, void (*)());
}
extern void __kernel_assert(const char *, unsigned, const char *);
using namespace kernel;
volatile size_t ap_startup_count;
static bool scheduler_ready = false;
/// Bootstrap the memory managers.
void setup_pat();
void memory_initialize_pre_ctors(args::header &kargs);
void memory_initialize_post_ctors(args::header &kargs);
process * load_simple_process(args::program &program);
unsigned start_aps(lapic &apic, const kutil::vector<uint8_t> &ids, void *kpml4);
/// TODO: not this. this is awful.
args::framebuffer *fb = nullptr;
@@ -77,12 +91,23 @@ kernel_main(args::header *header)
logger_init();
cpu_validate();
setup_pat();
log::debug(logs::boot, " jsix header is at: %016lx", header);
log::debug(logs::boot, " Memory map is at: %016lx", header->mem_map);
log::debug(logs::boot, "ACPI root table is at: %016lx", header->acpi_table);
log::debug(logs::boot, "Runtime service is at: %016lx", header->runtime_services);
log::debug(logs::boot, " Kernel PML4 is at: %016lx", header->pml4);
uint64_t cr0, cr4;
asm ("mov %%cr0, %0" : "=r"(cr0));
asm ("mov %%cr4, %0" : "=r"(cr4));
uint64_t efer = rdmsr(msr::ia32_efer);
log::debug(logs::boot, "Control regs: cr0:%lx cr4:%lx efer:%lx", cr0, cr4, efer);
bool has_video = false;
if (header->video.size > 0) {
has_video = true;
fb = &header->video;
const args::framebuffer &video = header->video;
log::debug(logs::boot, "Framebuffer: %dx%d[%d] type %d @ %llx size %llx",
@@ -95,20 +120,37 @@ kernel_main(args::header *header)
logger_clear_immediate();
}
extern IDT &g_idt;
extern TSS &g_bsp_tss;
extern GDT &g_bsp_gdt;
extern cpu_data g_bsp_cpu_data;
extern uintptr_t idle_stack_end;
IDT *idt = new (&g_idt) IDT;
cpu_data *cpu = &g_bsp_cpu_data;
kutil::memset(cpu, 0, sizeof(cpu_data));
cpu->self = cpu;
cpu->tss = new (&g_bsp_tss) TSS;
cpu->gdt = new (&g_bsp_gdt) GDT {cpu->tss};
cpu->rsp0 = idle_stack_end;
cpu_early_init(cpu);
disable_legacy_pic();
memory_initialize_pre_ctors(*header);
run_constructors();
memory_initialize_post_ctors(*header);
cpu->tss->create_ist_stacks(idt->used_ist_entries());
for (size_t i = 0; i < header->num_modules; ++i) {
args::module &mod = header->modules[i];
void *virt = memory::to_virtual<void>(mod.location);
switch (mod.type) {
case args::mod_type::symbol_table:
new symbol_table {mod.location, mod.size};
break;
default:
@@ -116,16 +158,29 @@ kernel_main(args::header *header)
}
}
syscall_initialize();
device_manager &devices = device_manager::get();
devices.parse_acpi(header->acpi_table);
// Need the local APIC to get the BSP's id
uintptr_t apic_base = devices.get_lapic_base();
lapic *apic = new lapic(apic_base);
apic->enable();
cpu->id = apic->get_id();
cpu->apic = apic;
cpu_init(cpu, true);
devices.init_drivers();
apic->calibrate_timer();
const auto &apic_ids = devices.get_apic_ids();
unsigned num_cpus = start_aps(*apic, apic_ids, header->pml4);
idt->add_ist_entries();
interrupts_enable();
/*
@@ -152,8 +207,8 @@ kernel_main(args::header *header)
}
*/
syscall_enable();
scheduler *sched = new scheduler {num_cpus};
scheduler_ready = true;
// Skip program 0, which is the kernel itself
for (unsigned i = 1; i < header->num_programs; ++i)
@@ -164,3 +219,126 @@ kernel_main(args::header *header)
sched->start();
}
unsigned
start_aps(lapic &apic, const kutil::vector<uint8_t> &ids, void *kpml4)
{
using memory::frame_size;
using memory::kernel_stack_pages;
extern size_t ap_startup_code_size;
extern process &g_kernel_process;
extern vm_area_guarded &g_kernel_stacks;
clock &clk = clock::get();
ap_startup_count = 1; // BSP processor
log::info(logs::boot, "Starting %d other CPUs", ids.count() - 1);
// Since we're using address space outside kernel space, make sure
// the kernel's vm_space is used
cpu_data &bsp = current_cpu();
bsp.process = &g_kernel_process;
uint16_t index = bsp.index;
// Copy the startup code somewhere the real mode trampoline can run
uintptr_t addr = 0x8000; // TODO: find a valid address, rewrite addresses
uint8_t vector = addr >> 12;
vm_area *vma = new vm_area_fixed(addr, 0x1000, vm_flags::write);
vm_space::kernel_space().add(addr, vma);
kutil::memcpy(
reinterpret_cast<void*>(addr),
reinterpret_cast<void*>(&ap_startup),
ap_startup_code_size);
// AP idle stacks need less room than normal stacks, so pack multiple
// into a normal stack area
static constexpr size_t idle_stack_bytes = 2048; // 2KiB is generous
static constexpr size_t full_stack_bytes = kernel_stack_pages * frame_size;
static constexpr size_t idle_stacks_per = full_stack_bytes / idle_stack_bytes;
uint8_t ist_entries = IDT::get().used_ist_entries();
size_t free_stack_count = 0;
uintptr_t stack_area_start = 0;
ipi mode = ipi::init | ipi::level | ipi::assert;
apic.send_ipi_broadcast(mode, false, 0);
for (uint8_t id : ids) {
if (id == bsp.id) continue;
// Set up the CPU data structures
TSS *tss = new TSS;
GDT *gdt = new GDT {tss};
cpu_data *cpu = new cpu_data;
kutil::memset(cpu, 0, sizeof(cpu_data));
cpu->self = cpu;
cpu->id = id;
cpu->index = ++index;
cpu->gdt = gdt;
cpu->tss = tss;
tss->create_ist_stacks(ist_entries);
// Set up the CPU's idle task stack
if (free_stack_count == 0) {
stack_area_start = g_kernel_stacks.get_section();
free_stack_count = idle_stacks_per;
}
uintptr_t stack_end = stack_area_start + free_stack_count-- * idle_stack_bytes;
stack_end -= 2 * sizeof(void*); // Null frame
*reinterpret_cast<uint64_t*>(stack_end) = 0; // pre-fault the page
cpu->rsp0 = stack_end;
// Set up the trampoline with this CPU's data
init_ap_trampoline(kpml4, cpu, ap_idle);
// Kick it off!
size_t current_count = ap_startup_count;
log::debug(logs::boot, "Starting AP %d: stack %llx", cpu->index, stack_end);
ipi startup = ipi::startup | ipi::assert;
apic.send_ipi(startup, vector, id);
for (unsigned i = 0; i < 20; ++i) {
if (ap_startup_count > current_count) break;
clk.spinwait(20);
}
// If the CPU already incremented ap_startup_count, it's done
if (ap_startup_count > current_count)
continue;
// Send the second SIPI (intel recommends this)
apic.send_ipi(startup, vector, id);
for (unsigned i = 0; i < 100; ++i) {
if (ap_startup_count > current_count) break;
clk.spinwait(100);
}
if (ap_startup_count == current_count)
log::warn(logs::boot, "No response from AP %d within timeout", id);
}
log::info(logs::boot, "%d CPUs running", ap_startup_count);
vm_space::kernel_space().remove(vma);
return ap_startup_count;
}
void
long_ap_startup(cpu_data *cpu)
{
cpu_init(cpu, false);
++ap_startup_count;
while (!scheduler_ready) asm ("pause");
uintptr_t apic_base =
device_manager::get().get_lapic_base();
cpu->apic = new lapic(apic_base);
cpu->apic->enable();
scheduler::get().start();
}


@@ -39,11 +39,8 @@ frame_allocator &g_frame_allocator = __g_frame_allocator_storage.value;
static kutil::no_construct<vm_area_untracked> __g_kernel_heap_area_storage;
vm_area_untracked &g_kernel_heap_area = __g_kernel_heap_area_storage.value;
vm_area_guarded g_kernel_stacks {
memory::stacks_start,
memory::kernel_stack_pages,
memory::kernel_max_stacks,
vm_flags::write};
static kutil::no_construct<vm_area_guarded> __g_kernel_stacks_storage;
vm_area_guarded &g_kernel_stacks = __g_kernel_stacks_storage.value;
vm_area_guarded g_kernel_buffers {
memory::buffers_start,
@@ -61,11 +58,19 @@ namespace kutil {
void kfree(void *p) { return g_kernel_heap.free(p); }
}
template <typename T>
uintptr_t
get_physical_page(T *p) {
return memory::page_align_down(reinterpret_cast<uintptr_t>(p));
}
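The `used()` calls above all round their argument down to a frame boundary first. A minimal standalone sketch of that align-down step, assuming 4 KiB frames as elsewhere in the kernel (the name `page_align_down` mirrors the call above; this is an illustration, not jsix's implementation):

```cpp
#include <cstdint>

// Round an address down to the base of its 4 KiB frame,
// assuming memory::frame_size == 0x1000.
constexpr uintptr_t page_align_down(uintptr_t addr) {
    return addr & ~static_cast<uintptr_t>(0xfff);
}
```

Any address within a frame maps to that frame's base, so passing a pointer into the middle of `kargs` still marks the correct physical page as used.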
void
memory_initialize_pre_ctors(args::header &kargs)
{
using kernel::args::frame_block;
page_table *kpml4 = static_cast<page_table*>(kargs.pml4);
new (&g_kernel_heap) kutil::heap_allocator {heap_start, kernel_max_heap};
frame_block *blocks = reinterpret_cast<frame_block*>(memory::bitmap_start);
@@ -73,17 +78,21 @@ memory_initialize_pre_ctors(args::header &kargs)
// Mark all the things the bootloader allocated for us as used
g_frame_allocator.used(
reinterpret_cast<uintptr_t>(kargs.frame_blocks),
get_physical_page(&kargs),
memory::page_count(sizeof(kargs)));
g_frame_allocator.used(
get_physical_page(kargs.frame_blocks),
kargs.frame_block_pages);
g_frame_allocator.used(
reinterpret_cast<uintptr_t>(kargs.pml4),
get_physical_page(kargs.pml4),
kargs.table_pages);
for (unsigned i = 0; i < kargs.num_modules; ++i) {
const kernel::args::module &mod = kargs.modules[i];
g_frame_allocator.used(
reinterpret_cast<uintptr_t>(mod.location),
get_physical_page(mod.location),
memory::page_count(mod.size));
}
@@ -97,7 +106,6 @@ memory_initialize_pre_ctors(args::header &kargs)
}
}
page_table *kpml4 = reinterpret_cast<page_table*>(kargs.pml4);
process *kp = process::create_kernel_process(kpml4);
vm_space &vm = kp->space();
@@ -105,42 +113,28 @@ memory_initialize_pre_ctors(args::header &kargs)
vm_area_untracked(kernel_max_heap, vm_flags::write);
vm.add(heap_start, heap);
vm_area *stacks = new (&g_kernel_stacks) vm_area_guarded {
memory::stacks_start,
memory::kernel_stack_pages,
memory::kernel_max_stacks,
vm_flags::write};
vm.add(memory::stacks_start, &g_kernel_stacks);
// Clean out any remaining bootloader page table entries
for (unsigned i = 0; i < memory::pml4e_kernel; ++i)
kpml4->entries[i] = 0;
}
void
memory_initialize_post_ctors(args::header &kargs)
{
vm_space &vm = vm_space::kernel_space();
vm.add(memory::stacks_start, &g_kernel_stacks);
vm.add(memory::buffers_start, &g_kernel_buffers);
g_frame_allocator.free(
reinterpret_cast<uintptr_t>(kargs.page_tables),
get_physical_page(kargs.page_tables),
kargs.table_count);
using memory::frame_size;
using memory::kernel_stack_pages;
constexpr size_t stack_size = kernel_stack_pages * frame_size;
for (int ist = 1; ist <= 3; ++ist) {
uintptr_t bottom = g_kernel_stacks.get_section();
log::debug(logs::boot, "Installing IST%d stack at %llx", ist, bottom);
// Pre-realize and zero these stacks, they're no good
// if they page fault
kutil::memset(reinterpret_cast<void*>(bottom), 0, stack_size);
// Skip two entries to be the null frame
tss_set_ist(ist, bottom + stack_size - 2 * sizeof(uintptr_t));
}
#define ISR(i, s, name) if (s) { idt_set_ist(i, s); }
#define EISR(i, s, name) if (s) { idt_set_ist(i, s); }
#define IRQ(i, q, name)
#include "interrupt_isrs.inc"
#undef IRQ
#undef EISR
#undef ISR
}
static void
@@ -198,15 +192,6 @@ log_mtrrs()
pat_names[(pat >> (6*8)) & 7], pat_names[(pat >> (7*8)) & 7]);
}
void
setup_pat()
{
uint64_t pat = rdmsr(msr::ia32_pat);
pat = (pat & 0x00ffffffffffffffull) | (0x01ull << 56); // set PAT 7 to WC
wrmsr(msr::ia32_pat, pat);
log_mtrrs();
}
process *
load_simple_process(args::program &program)


@@ -13,15 +13,11 @@ static kutil::no_construct<process> __g_kernel_process_storage;
process &g_kernel_process = __g_kernel_process_storage.value;
kutil::vector<process*> process::s_processes;
process::process() :
kobject {kobject::type::process},
m_next_handle {1},
m_state {state::running}
{
s_processes.append(this);
j6_handle_t self = add_handle(this);
kassert(self == self_handle(), "Process self-handle is not 1");
}
@@ -39,10 +35,9 @@ process::~process()
{
for (auto &it : m_handles)
if (it.val) it.val->handle_release();
s_processes.remove_swap(this);
}
process & process::current() { return *bsp_cpu_data.p; }
process & process::current() { return *current_cpu().process; }
process & process::kernel_process() { return g_kernel_process; }
process *
@@ -63,7 +58,7 @@ process::exit(int32_t code)
thread->exit(code);
}
if (this == bsp_cpu_data.p)
if (this == current_cpu().process)
scheduler::get().schedule();
}


@@ -94,6 +94,4 @@ private:
enum class state : uint8_t { running, exited };
state m_state;
static kutil::vector<process*> s_processes;
};


@@ -9,7 +9,7 @@
extern "C" void kernel_to_user_trampoline();
static constexpr j6_signal_t thread_default_signals = 0;
extern vm_area_guarded g_kernel_stacks;
extern vm_area_guarded &g_kernel_stacks;
thread::thread(process &parent, uint8_t pri, uintptr_t rsp0) :
kobject(kobject::type::thread, thread_default_signals),
@@ -43,13 +43,9 @@ thread::from_tcb(TCB *tcb)
return reinterpret_cast<thread*>(kutil::offset_pointer(tcb, offset));
}
thread &
thread::current()
{
return *bsp_cpu_data.t;
}
thread & thread::current() { return *current_cpu().thread; }
inline void schedule_if_current(thread *t) { if (t == bsp_cpu_data.t) scheduler::get().schedule(); }
inline void schedule_if_current(thread *t) { if (t == current_cpu().thread) scheduler::get().schedule(); }
void
thread::wait_on_signals(kobject *obj, j6_signal_t signals)
@@ -225,7 +221,5 @@ thread::create_idle_thread(process &kernel, uint8_t pri, uintptr_t rsp0)
thread *idle = new thread(kernel, pri, rsp0);
idle->set_state(state::constant);
idle->set_state(state::ready);
log::info(logs::task, "Created idle thread as koid %llx", idle->koid());
return idle;
}


@@ -66,9 +66,7 @@ vm_area_fixed::vm_area_fixed(uintptr_t start, size_t size, vm_flags flags) :
{
}
vm_area_fixed::~vm_area_fixed()
{
}
vm_area_fixed::~vm_area_fixed() {}
size_t vm_area_fixed::resize(size_t size)
{
@@ -91,9 +89,7 @@ vm_area_untracked::vm_area_untracked(size_t size, vm_flags flags) :
{
}
vm_area_untracked::~vm_area_untracked()
{
}
vm_area_untracked::~vm_area_untracked() {}
bool
vm_area_untracked::get_page(uintptr_t offset, uintptr_t &phys)
@@ -119,6 +115,8 @@ vm_area_open::vm_area_open(size_t size, vm_flags flags) :
{
}
vm_area_open::~vm_area_open() {}
bool
vm_area_open::get_page(uintptr_t offset, uintptr_t &phys)
{
@@ -134,6 +132,8 @@ vm_area_guarded::vm_area_guarded(uintptr_t start, size_t buf_pages, size_t size,
{
}
vm_area_guarded::~vm_area_guarded() {}
uintptr_t
vm_area_guarded::get_section()
{


@@ -114,6 +114,7 @@ public:
/// \arg size Initial virtual size of the memory area
/// \arg flags Flags for this memory area
vm_area_open(size_t size, vm_flags flags);
virtual ~vm_area_open();
virtual bool get_page(uintptr_t offset, uintptr_t &phys) override;
@@ -155,6 +156,8 @@ public:
size_t size,
vm_flags flags);
virtual ~vm_area_guarded();
/// Get an available section in this area
uintptr_t get_section();


@@ -44,7 +44,7 @@ inline bool contains(uint64_t page_off, uint64_t word, uint8_t &index) {
uint64_t base = to_base(word);
uint64_t bits = to_level(word) * bits_per_level;
index = (page_off >> bits) & 0x3f;
return (page_off & (~0x3full << bits)) != base;
return (page_off & (~0x3full << bits)) == base;
}
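The fix above flips the comparison so a word matches when the page offset's high bits equal the stored base. A self-contained sketch of that containment test — here the `to_base`/`to_level` word encoding is abstracted into plain parameters, as an assumption for illustration:

```cpp
#include <cstdint>

constexpr uint64_t bits_per_level = 6; // 64 entries per word, as above

// A word at `level` covers a 64-entry block starting at `base`;
// page_off is inside it iff the bits above the 6-bit index match base.
inline bool contains(uint64_t page_off, uint64_t base, uint64_t level,
                     uint8_t &index) {
    uint64_t bits = level * bits_per_level;
    index = (page_off >> bits) & 0x3f;
    return (page_off & (~0x3full << bits)) == base;
}
```

With the original `!=`, every in-range offset reported "not contained" and vice versa, which is why the one-character change matters.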
inline uint64_t index_for(uint64_t page_off, uint8_t level) {


@@ -17,6 +17,7 @@
#include "objects/channel.h"
#include "objects/process.h"
#include "objects/system.h"
#include "objects/thread.h"
#include "objects/vm_area.h"
#include "scheduler.h"
@@ -25,40 +26,37 @@
#include "kutil/assert.h"
extern "C" void task_switch(TCB *tcb);
scheduler *scheduler::s_instance = nullptr;
const uint64_t rflags_noint = 0x002;
const uint64_t rflags_int = 0x202;
extern uint64_t idle_stack_end;
scheduler::scheduler(lapic *apic) :
m_apic(apic),
m_next_pid(1),
m_clock(0),
m_last_promotion(0)
struct run_queue
{
kassert(!s_instance, "Multiple schedulers created!");
tcb_node *current = nullptr;
tcb_list ready[scheduler::num_priorities];
tcb_list blocked;
uint64_t last_promotion = 0;
uint64_t last_steal = 0;
kutil::spinlock lock;
};
scheduler::scheduler(unsigned cpus) :
m_next_pid {1},
m_clock {0}
{
kassert(!s_instance, "Created multiple schedulers!");
if (!s_instance)
s_instance = this;
process *kp = &process::kernel_process();
m_run_queues.set_size(cpus);
}
log::debug(logs::task, "Kernel process koid %llx", kp->koid());
thread *idle = thread::create_idle_thread(*kp, max_priority,
reinterpret_cast<uintptr_t>(&idle_stack_end));
log::debug(logs::task, "Idle thread koid %llx", idle->koid());
auto *tcb = idle->tcb();
m_runlists[max_priority].push_back(tcb);
m_current = tcb;
bsp_cpu_data.rsp0 = tcb->rsp0;
bsp_cpu_data.tcb = tcb;
bsp_cpu_data.p = kp;
bsp_cpu_data.t = idle;
scheduler::~scheduler()
{
// Not truly necessary - if the scheduler is going away, the whole
// system is probably going down. But let's be clean.
if (s_instance == this)
s_instance = nullptr;
}
template <typename T>
@@ -69,20 +67,6 @@ inline T * push(uintptr_t &rsp, size_t size = sizeof(T)) {
return p;
}
thread *
scheduler::create_process(bool user)
{
process *p = new process;
thread *th = p->create_thread(default_priority, user);
TCB *tcb = th->tcb();
log::debug(logs::task, "Creating thread %llx, priority %d, time slice %d",
th->koid(), tcb->priority, tcb->time_left);
th->set_state(thread::state::ready);
return th;
}
void
scheduler::create_kernel_task(void (*task)(), uint8_t priority, bool constant)
{
@@ -112,25 +96,42 @@ scheduler::quantum(int priority)
void
scheduler::start()
{
log::info(logs::sched, "Starting scheduler.");
wrmsr(msr::ia32_gs_base, reinterpret_cast<uintptr_t>(&bsp_cpu_data));
m_apic->enable_timer(isr::isrTimer, false);
m_apic->reset_timer(10);
cpu_data &cpu = current_cpu();
run_queue &queue = m_run_queues[cpu.index];
kutil::scoped_lock lock {queue.lock};
process *kp = &process::kernel_process();
thread *idle = thread::create_idle_thread(*kp, max_priority, cpu.rsp0);
log::debug(logs::task, "CPU%02x idle thread koid %llx", cpu.index, idle->koid());
auto *tcb = idle->tcb();
cpu.process = kp;
cpu.thread = idle;
cpu.tcb = tcb;
queue.current = tcb;
log::info(logs::sched, "CPU%02x starting scheduler", cpu.index);
cpu.apic->enable_timer(isr::isrTimer, false);
cpu.apic->reset_timer(10);
}
void
scheduler::add_thread(TCB *t)
{
m_blocked.push_back(static_cast<tcb_node*>(t));
t->time_left = quantum(t->priority);
cpu_data &cpu = current_cpu();
run_queue &queue = m_run_queues[cpu.index];
kutil::scoped_lock lock {queue.lock};
queue.blocked.push_back(static_cast<tcb_node*>(t));
t->time_left = quantum(t->priority);
}
void scheduler::prune(uint64_t now)
void scheduler::prune(run_queue &queue, uint64_t now)
{
// Find processes that are ready or have exited and
// move them to the appropriate lists.
auto *tcb = m_blocked.front();
auto *tcb = queue.blocked.front();
while (tcb) {
thread *th = thread::from_tcb(tcb);
uint8_t priority = tcb->priority;
@@ -138,7 +139,7 @@ void scheduler::prune(uint64_t now)
bool ready = th->has_state(thread::state::ready);
bool exited = th->has_state(thread::state::exited);
bool constant = th->has_state(thread::state::constant);
bool current = tcb == m_current;
bool current = tcb == queue.current;
ready |= th->wake_on_time(now);
@@ -153,7 +154,7 @@ void scheduler::prune(uint64_t now)
// page tables
if (current) continue;
m_blocked.remove(remove);
queue.blocked.remove(remove);
process &p = th->parent();
// thread_exited deletes the thread, and returns true if the process
@@ -161,19 +162,19 @@ void scheduler::prune(uint64_t now)
if(!current && p.thread_exited(th))
delete &p;
} else {
m_blocked.remove(remove);
queue.blocked.remove(remove);
log::debug(logs::sched, "Prune: readying unblocked thread %llx", th->koid());
m_runlists[remove->priority].push_back(remove);
queue.ready[remove->priority].push_back(remove);
}
}
}
void
scheduler::check_promotions(uint64_t now)
scheduler::check_promotions(run_queue &queue, uint64_t now)
{
for (auto &pri_list : m_runlists) {
for (auto &pri_list : queue.ready) {
for (auto *tcb : pri_list) {
const thread *th = thread::from_tcb(m_current);
const thread *th = thread::from_tcb(queue.current);
const bool constant = th->has_state(thread::state::constant);
if (constant)
continue;
@@ -188,80 +189,145 @@ scheduler::check_promotions(uint64_t now)
if (stale) {
// If the thread is stale, promote it
m_runlists[priority].remove(tcb);
queue.ready[priority].remove(tcb);
tcb->priority -= 1;
tcb->time_left = quantum(tcb->priority);
m_runlists[tcb->priority].push_back(tcb);
queue.ready[tcb->priority].push_back(tcb);
log::info(logs::sched, "Scheduler promoting thread %llx, priority %d",
th->koid(), tcb->priority);
}
}
}
m_last_promotion = now;
queue.last_promotion = now;
}
static size_t
balance_lists(tcb_list &to, tcb_list &from)
{
size_t to_len = to.length();
size_t from_len = from.length();
// Only steal from the rich, don't be Dennis Moore
if (from_len <= to_len)
return 0;
size_t steal = (from_len - to_len) / 2;
for (size_t i = 0; i < steal; ++i)
to.push_front(from.pop_front());
return steal;
}
void
scheduler::steal_work(cpu_data &cpu)
{
// First grab a scheduler-wide lock to avoid deadlock
kutil::scoped_lock steal_lock {m_steal_lock};
// Lock this cpu's queue for the whole time while we modify it
run_queue &my_queue = m_run_queues[cpu.index];
kutil::scoped_lock my_queue_lock {my_queue.lock};
const unsigned count = m_run_queues.count();
for (unsigned i = 0; i < count; ++i) {
if (i == cpu.index) continue;
run_queue &other_queue = m_run_queues[i];
kutil::scoped_lock other_queue_lock {other_queue.lock};
size_t stolen = 0;
// Don't steal from max_priority, that's the idle thread
for (unsigned pri = 0; pri < max_priority; ++pri)
stolen += balance_lists(my_queue.ready[pri], other_queue.ready[pri]);
stolen += balance_lists(my_queue.blocked, other_queue.blocked);
if (stolen)
log::debug(logs::sched, "CPU%02x stole %2d tasks from CPU%02x",
cpu.index, stolen, i);
}
}
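The `balance_lists` rule used by `steal_work` moves half of the surplus from the richer list. A sketch of that rule with `std::deque` standing in for `tcb_list` (an assumption for illustration; the real list type is intrusive):

```cpp
#include <cstddef>
#include <deque>

// Move half the surplus from `from` to `to`; returns how many moved.
size_t balance(std::deque<int> &to, std::deque<int> &from) {
    if (from.size() <= to.size())
        return 0; // only steal from the rich
    size_t steal = (from.size() - to.size()) / 2;
    for (size_t i = 0; i < steal; ++i) {
        to.push_front(from.front());
        from.pop_front();
    }
    return steal;
}
```

Halving the difference converges the two queues without ping-ponging tasks back on the next steal pass.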
void
scheduler::schedule()
{
uint8_t priority = m_current->priority;
uint32_t remaining = m_apic->stop_timer();
m_current->time_left = remaining;
thread *th = thread::from_tcb(m_current);
cpu_data &cpu = current_cpu();
run_queue &queue = m_run_queues[cpu.index];
lapic &apic = *cpu.apic;
uint32_t remaining = apic.stop_timer();
if (m_clock - queue.last_steal > steal_frequency) {
steal_work(cpu);
queue.last_steal = m_clock;
}
// We need to explicitly lock/unlock here instead of
// using a scoped lock, because the scope doesn't "end"
// for the current thread until it gets scheduled again
kutil::spinlock::waiter waiter;
queue.lock.acquire(&waiter);
queue.current->time_left = remaining;
thread *th = thread::from_tcb(queue.current);
uint8_t priority = queue.current->priority;
const bool constant = th->has_state(thread::state::constant);
if (remaining == 0) {
if (priority < max_priority && !constant) {
// Process used its whole timeslice, demote it
++m_current->priority;
log::info(logs::sched, "Scheduler demoting thread %llx, priority %d",
th->koid(), m_current->priority);
++queue.current->priority;
log::debug(logs::sched, "Scheduler demoting thread %llx, priority %d",
th->koid(), queue.current->priority);
}
m_current->time_left = quantum(m_current->priority);
queue.current->time_left = quantum(queue.current->priority);
} else if (remaining > 0) {
// Process gave up CPU, give it a small bonus to its
// remaining timeslice.
uint32_t bonus = quantum(priority) >> 4;
m_current->time_left += bonus;
queue.current->time_left += bonus;
}
m_runlists[priority].remove(m_current);
if (th->has_state(thread::state::ready)) {
m_runlists[m_current->priority].push_back(m_current);
queue.ready[queue.current->priority].push_back(queue.current);
} else {
m_blocked.push_back(m_current);
queue.blocked.push_back(queue.current);
}
clock::get().update();
prune(++m_clock);
if (m_clock - m_last_promotion > promote_frequency)
check_promotions(m_clock);
prune(queue, ++m_clock);
if (m_clock - queue.last_promotion > promote_frequency)
check_promotions(queue, m_clock);
priority = 0;
while (m_runlists[priority].empty()) {
while (queue.ready[priority].empty()) {
++priority;
kassert(priority < num_priorities, "All runlists are empty");
}
m_current->last_ran = m_clock;
queue.current->last_ran = m_clock;
auto *next = m_runlists[priority].pop_front();
auto *next = queue.ready[priority].pop_front();
next->last_ran = m_clock;
m_apic->reset_timer(next->time_left);
apic.reset_timer(next->time_left);
if (next == queue.current) {
queue.lock.release(&waiter);
return;
}
if (next != m_current) {
thread *next_thread = thread::from_tcb(next);
bsp_cpu_data.t = next_thread;
bsp_cpu_data.p = &next_thread->parent();
m_current = next;
cpu.thread = next_thread;
cpu.process = &next_thread->parent();
queue.current = next;
log::debug(logs::sched, "Scheduler switching threads %llx->%llx",
th->koid(), next_thread->koid());
log::debug(logs::sched, "CPU%02x switching threads %llx->%llx",
cpu.index, th->koid(), next_thread->koid());
log::debug(logs::sched, " priority %d time left %d @ %lld.",
m_current->priority, m_current->time_left, m_clock);
log::debug(logs::sched, " PML4 %llx", m_current->pml4);
next->priority, next->time_left, m_clock);
log::debug(logs::sched, " PML4 %llx", next->pml4);
task_switch(m_current);
}
queue.lock.release(&waiter);
task_switch(queue.current);
}
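The timeslice accounting in `schedule()` reduces to a small rule: a thread that exhausted its quantum restarts with a fresh (possibly demoted) slice, while one that yielded early keeps its remainder plus a 1/16 bonus. A sketch with `quantum` passed in as a parameter, since its formula isn't shown in this hunk:

```cpp
#include <cstdint>

// remaining == 0: the thread used its whole slice; restart from the
// quantum for its (possibly just-demoted) priority.
// remaining > 0: the thread yielded; keep the remainder plus a bonus.
uint32_t next_time_left(uint32_t remaining, uint32_t quantum) {
    if (remaining == 0)
        return quantum;
    return remaining + (quantum >> 4);
}
```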


@@ -3,20 +3,19 @@
/// The task scheduler and related definitions
#include <stdint.h>
#include "objects/thread.h"
#include "kutil/spinlock.h"
#include "kutil/vector.h"
namespace kernel {
namespace args {
struct program;
}}
struct cpu_data;
class lapic;
class process;
struct page_table;
struct cpu_state;
extern "C" void isr_handler(cpu_state*);
extern "C" void task_switch(TCB *next);
struct run_queue;
/// The task scheduler
@@ -42,8 +41,9 @@ public:
static const uint16_t process_quanta = 10;
/// Constructor.
/// \arg apic Pointer to the local APIC object
scheduler(lapic *apic);
/// \arg cpus The number of CPUs to schedule for
scheduler(unsigned cpus);
~scheduler();
/// Create a new process from a program image in memory.
/// \arg program The descriptor of the program in memory
@@ -69,47 +69,35 @@ public:
/// Run the scheduler, possibly switching to a new task
void schedule();
/// Get the current TCB.
/// \returns A pointer to the current thread's TCB
inline TCB * current() { return m_current; }
/// Start scheduling a new thread.
/// \arg t The new thread's TCB
void add_thread(TCB *t);
/// Get a reference to the system scheduler
/// Get a reference to the scheduler
/// \returns A reference to the global system scheduler
static scheduler & get() { return *s_instance; }
private:
friend uintptr_t syscall_dispatch(uintptr_t, cpu_state &);
friend class process;
static constexpr uint64_t promote_frequency = 10;
static constexpr uint64_t steal_frequency = 10;
/// Create a new process object. This process will have its pid
/// set but nothing else.
/// \arg user True if this thread will enter userspace
/// \returns The new process' main thread
thread * create_process(bool user);
void prune(uint64_t now);
void check_promotions(uint64_t now);
lapic *m_apic;
void prune(run_queue &queue, uint64_t now);
void check_promotions(run_queue &queue, uint64_t now);
void steal_work(cpu_data &cpu);
uint32_t m_next_pid;
uint32_t m_tick_count;
process *m_kernel_process;
tcb_node *m_current;
tcb_list m_runlists[num_priorities];
tcb_list m_blocked;
kutil::vector<run_queue> m_run_queues;
// TODO: lol a real clock
uint64_t m_clock = 0;
uint64_t m_last_promotion;
kutil::spinlock m_steal_lock;
static scheduler *s_instance;
};


@@ -1,20 +1,19 @@
#include <stddef.h>
#include "kutil/memory.h"
#include "console.h"
#include "cpu.h"
#include "debug.h"
#include "log.h"
#include "msr.h"
#include "scheduler.h"
#include "syscall.h"
extern "C" {
void syscall_invalid(uint64_t call);
void syscall_handler_prelude();
}
uintptr_t syscall_registry[static_cast<unsigned>(syscall::MAX)];
const char * syscall_names[static_cast<unsigned>(syscall::MAX)];
uintptr_t syscall_registry[256] __attribute__((section(".syscall_registry")));
const char * syscall_names[256] __attribute__((section(".syscall_registry")));
static constexpr size_t num_syscalls = sizeof(syscall_registry) / sizeof(syscall_registry[0]);
void
syscall_invalid(uint64_t call)
@@ -23,13 +22,10 @@ syscall_invalid(uint64_t call)
cons->set_color(9);
cons->printf("\nReceived unknown syscall: %02x\n", call);
const unsigned num_calls =
static_cast<unsigned>(syscall::MAX);
cons->printf(" Known syscalls:\n");
cons->printf(" invalid %016lx\n", syscall_invalid);
for (unsigned i = 0; i < num_calls; ++i) {
for (unsigned i = 0; i < num_syscalls; ++i) {
const char *name = syscall_names[i];
uintptr_t handler = syscall_registry[i];
if (name)
@@ -41,33 +37,14 @@ syscall_invalid(uint64_t call)
}
void
syscall_enable()
syscall_initialize()
{
// IA32_STAR - high 32 bits contain k+u CS
// Kernel CS: GDT[1] ring 0 bits[47:32]
// User CS: GDT[3] ring 3 bits[63:48]
uint64_t star =
(((1ull << 3) | 0) << 32) |
(((3ull << 3) | 3) << 48);
wrmsr(msr::ia32_star, star);
// IA32_LSTAR - RIP for syscall
wrmsr(msr::ia32_lstar,
reinterpret_cast<uintptr_t>(&syscall_handler_prelude));
// IA32_FMASK - FLAGS mask inside syscall
wrmsr(msr::ia32_fmask, 0x200);
static constexpr unsigned num_calls =
static_cast<unsigned>(syscall::MAX);
kutil::memset(&syscall_registry, 0, sizeof(syscall_registry));
kutil::memset(&syscall_names, 0, sizeof(syscall_names));
#define SYSCALL(id, name, result, ...) \
syscall_registry[id] = reinterpret_cast<uintptr_t>(syscalls::name); \
syscall_names[id] = #name; \
static_assert( id <= num_calls, "Syscall " #name " has id > syscall::MAX" ); \
log::debug(logs::syscall, "Enabling syscall 0x%02x as " #name , id);
#include "j6/tables/syscalls.inc"
#undef SYSCALL


@@ -10,13 +10,10 @@ enum class syscall : uint64_t
#define SYSCALL(id, name, ...) name = id,
#include "j6/tables/syscalls.inc"
#undef SYSCALL
// Maximum syscall id. If you change this, also change
// MAX_SYSCALLS in syscall.s
MAX = 0x40
};
void syscall_enable();
void syscall_initialize();
extern "C" void syscall_enable();
namespace syscalls
{


@@ -1,17 +1,32 @@
%include "tasking.inc"
; Make sure to keep MAX_SYSCALLS in sync with
; syscall::MAX in syscall.h
MAX_SYSCALLS equ 0x40
; SYSCALL/SYSRET control MSRs
MSR_STAR equ 0xc0000081
MSR_LSTAR equ 0xc0000082
MSR_FMASK equ 0xc0000084
; IA32_STAR - high 32 bits contain k+u CS
; Kernel CS: GDT[1] ring 0 bits[47:32]
; User CS: GDT[3] ring 3 bits[63:48]
STAR_HIGH equ \
(((1 << 3) | 0)) | \
(((3 << 3) | 3) << 16)
; IA32_FMASK - Mask off interrupts in syscalls
FMASK_VAL equ 0x200
extern __counter_syscall_enter
extern __counter_syscall_sysret
extern syscall_registry
extern syscall_invalid
global syscall_handler_prelude
global syscall_handler_prelude:function (syscall_handler_prelude.end - syscall_handler_prelude)
syscall_handler_prelude:
push rbp ; Never executed, fake function prelude
mov rbp, rsp ; to calm down gdb
.real:
swapgs
mov [gs:CPU_DATA.rsp3], rsp
mov rsp, [gs:CPU_DATA.rsp0]
@@ -36,14 +51,7 @@ syscall_handler_prelude:
inc qword [rel __counter_syscall_enter]
cmp rax, MAX_SYSCALLS
jle .ok_syscall
.bad_syscall:
mov rdi, rax
call syscall_invalid
.ok_syscall:
and rax, 0xff ; Only 256 possible syscall values
lea r11, [rel syscall_registry]
mov r11, [r11 + rax * 8]
cmp r11, 0
@@ -52,8 +60,14 @@ syscall_handler_prelude:
call r11
inc qword [rel __counter_syscall_sysret]
jmp kernel_to_user_trampoline
global kernel_to_user_trampoline
.bad_syscall:
mov rdi, rax
call syscall_invalid
.end:
global kernel_to_user_trampoline:function (kernel_to_user_trampoline.end - kernel_to_user_trampoline)
kernel_to_user_trampoline:
pop r15
pop r14
@@ -70,3 +84,28 @@ kernel_to_user_trampoline:
swapgs
o64 sysret
.end:
global syscall_enable:function (syscall_enable.end - syscall_enable)
syscall_enable:
push rbp
mov rbp, rsp
mov rcx, MSR_STAR
mov rax, 0
mov rdx, STAR_HIGH
wrmsr
mov rcx, MSR_LSTAR
mov rax, syscall_handler_prelude.real
mov rdx, rax
shr rdx, 32
wrmsr
mov rcx, MSR_FMASK
mov rax, FMASK_VAL
wrmsr
pop rbp
ret
.end:
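The `STAR_HIGH` packing above can be checked numerically: the kernel CS selector (GDT entry 1, RPL 0) lands in the low 16 bits of the value written to the high half of IA32_STAR, and the user base selector (GDT entry 3, RPL 3) in the next 16. A sketch mirroring that arithmetic:

```cpp
#include <cstdint>

// Mirrors STAR_HIGH from the assembly above.
constexpr uint32_t star_high =
    ((1u << 3) | 0) |        // kernel CS selector: 0x08
    (((3u << 3) | 3) << 16); // user base selector: 0x1b
```

On a 64-bit SYSRET the CPU loads user CS as base+16 and SS as base+8, which is why GDT[3] is used as the base rather than the actual user CS selector.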


@@ -1,6 +1,5 @@
%include "tasking.inc"
extern g_tss
global task_switch
task_switch:
push rbp
@@ -18,7 +17,7 @@ task_switch:
mov [rax + TCB.rsp], rsp
; Copy off saved user rsp
mov rcx, [gs:CPU_DATA.rsp3] ; rcx: curretn task's saved user rsp
mov rcx, [gs:CPU_DATA.rsp3] ; rcx: current task's saved user rsp
mov [rax + TCB.rsp3], rcx
; Install next task's TCB
@@ -31,7 +30,7 @@ task_switch:
mov rcx, [rdi + TCB.rsp0] ; rcx: top of next task's kernel stack
mov [gs:CPU_DATA.rsp0], rcx
lea rdx, [rel g_tss] ; rdx: address of TSS
mov rdx, [gs:CPU_DATA.tss] ; rdx: address of TSS
mov [rdx + TSS.rsp0], rcx
; Update saved user rsp
@@ -67,3 +66,8 @@ initialize_main_thread:
; the entrypoint should already be on the stack
jmp kernel_to_user_trampoline
global _current_gsbase
_current_gsbase:
mov rax, [gs:CPU_DATA.self]
ret


@@ -6,9 +6,17 @@ struc TCB
endstruc
struc CPU_DATA
.self: resq 1
.id: resw 1
.index: resw 1
.reserved resd 1
.rsp0: resq 1
.rsp3: resq 1
.tcb: resq 1
.thread: resq 1
.process: resq 1
.tss: resq 1
.gdt: resq 1
endstruc
struc TSS

src/kernel/tss.cpp Normal file

@@ -0,0 +1,68 @@
#include "kutil/assert.h"
#include "kutil/memory.h"
#include "kutil/no_construct.h"
#include "cpu.h"
#include "kernel_memory.h"
#include "log.h"
#include "objects/vm_area.h"
#include "tss.h"
// The BSP's TSS is initialized _before_ global constructors are called,
// so we don't want it to have a global constructor, lest it overwrite
// the previous initialization.
static kutil::no_construct<TSS> __g_bsp_tss_storage;
TSS &g_bsp_tss = __g_bsp_tss_storage.value;
TSS::TSS()
{
kutil::memset(this, 0, sizeof(TSS));
m_iomap_offset = sizeof(TSS);
}
TSS &
TSS::current()
{
return *current_cpu().tss;
}
uintptr_t &
TSS::ring_stack(unsigned ring)
{
kassert(ring < 3, "Bad ring passed to TSS::ring_stack.");
return m_rsp[ring];
}
uintptr_t &
TSS::ist_stack(unsigned ist)
{
kassert(ist > 0 && ist < 8, "Bad ist passed to TSS::ist_stack.");
return m_ist[ist];
}
void
TSS::create_ist_stacks(uint8_t ist_entries)
{
extern vm_area_guarded &g_kernel_stacks;
using memory::frame_size;
using memory::kernel_stack_pages;
constexpr size_t stack_bytes = kernel_stack_pages * frame_size;
for (unsigned ist = 1; ist < 8; ++ist) {
if (!(ist_entries & (1 << ist))) continue;
// Two zero entries at the top for the null frame
uintptr_t stack_bottom = g_kernel_stacks.get_section();
uintptr_t stack_top = stack_bottom + stack_bytes - 2 * sizeof(uintptr_t);
log::debug(logs::memory, "Created IST stack at %016lx size 0x%lx",
stack_bottom, stack_bytes);
// Pre-realize these stacks, they're no good if they page fault
for (unsigned i = 0; i < kernel_stack_pages; ++i)
*reinterpret_cast<uint64_t*>(stack_bottom + i * frame_size) = 0;
ist_stack(ist) = stack_top;
}
}
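The stack-top computation in `create_ist_stacks` leaves two pointer-sized slots at the top as a null frame, so stack walks terminate cleanly. A sketch of that arithmetic (the sizes here are assumptions for illustration; jsix's real `kernel_stack_pages` may differ):

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t frame_size = 0x1000;    // assumed 4 KiB frames
constexpr size_t kernel_stack_pages = 4; // assumption for illustration

// Usable stack top: end of the region minus a two-slot null frame.
constexpr uintptr_t ist_stack_top(uintptr_t bottom) {
    return bottom + kernel_stack_pages * frame_size - 2 * sizeof(uintptr_t);
}
```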

src/kernel/tss.h Normal file

@@ -0,0 +1,39 @@
#pragma once
/// \file tss.h
/// Definitions relating to the TSS
#include <stdint.h>
/// The 64bit TSS table
class TSS
{
public:
TSS();
/// Get the currently running CPU's TSS.
static TSS & current();
/// Ring stack accessor. Returns a mutable reference.
/// \arg ring Which ring (0-2) to get the stack for
/// \returns A mutable reference to the stack pointer
uintptr_t & ring_stack(unsigned ring);
/// IST stack accessor. Returns a mutable reference.
/// \arg ist Which IST entry (1-7) to get the stack for
/// \returns A mutable reference to the stack pointer
uintptr_t & ist_stack(unsigned ist);
/// Allocate new stacks for the given IST entries.
/// \arg ist_entries A bitmap of used IST entries
void create_ist_stacks(uint8_t ist_entries);
private:
uint32_t m_reserved0;
uintptr_t m_rsp[3]; // stack pointers for CPL 0-2
uintptr_t m_ist[8]; // ist[0] is reserved
uint64_t m_reserved1;
uint16_t m_reserved2;
uint16_t m_iomap_offset;
} __attribute__ ((packed));


@@ -33,7 +33,7 @@ vm_space::vm_space(page_table *p) :
{}
vm_space::vm_space() :
m_kernel(false)
m_kernel {false}
{
m_pml4 = page_table::get_table_page();
page_table *kpml4 = kernel_space().m_pml4;
@@ -163,6 +163,7 @@ void
vm_space::page_in(const vm_area &vma, uintptr_t offset, uintptr_t phys, size_t count)
{
using memory::frame_size;
kutil::scoped_lock lock {m_lock};
uintptr_t base = 0;
if (!find_vma(vma, base))
@@ -190,6 +191,7 @@ void
vm_space::clear(const vm_area &vma, uintptr_t offset, size_t count, bool free)
{
using memory::frame_size;
kutil::scoped_lock lock {m_lock};
uintptr_t base = 0;
if (!find_vma(vma, base))


@@ -4,6 +4,7 @@
#include <stdint.h>
#include "kutil/enum_bitfields.h"
#include "kutil/spinlock.h"
#include "kutil/vector.h"
#include "page_table.h"
@@ -127,6 +128,8 @@ private:
bool operator==(const struct area &o) const;
};
kutil::vector<area> m_areas;
kutil::spinlock m_lock;
};
IS_BITFIELD(vm_space::fault_type);

View File

@@ -1,5 +1,5 @@
#include <stdint.h>
#include "cpu/cpu.h"
#include "cpu/cpu_id.h"
namespace cpu {
@@ -94,4 +94,13 @@ cpu_id::has_feature(feature feat)
return (m_features & (1 << static_cast<uint64_t>(feat))) != 0;
}
uint8_t
cpu_id::local_apic_id() const
{
uint32_t eax_unused;
uint32_t ebx;
__cpuid(1, 0, &eax_unused, &ebx);
return static_cast<uint8_t>(ebx >> 24);
}
}

View File

@@ -1,5 +1,5 @@
#pragma once
/// \file cpu.h Definition of required cpu features for jsix
/// \file cpu_id.h Definition of required cpu features for jsix
#include <stdint.h>
@@ -48,6 +48,9 @@ public:
/// \returns A |regs| struct of the values returned
regs get(uint32_t leaf, uint32_t sub = 0) const;
/// Get the local APIC ID of the current CPU
uint8_t local_apic_id() const;
/// Get the name of the cpu vendor (eg, "GenuineIntel")
inline const char * vendor_id() const { return m_vendor_id; }

View File

@@ -6,6 +6,7 @@
#include <stdint.h>
#include "kutil/bip_buffer.h"
#include "kutil/spinlock.h"
namespace kutil {
namespace log {
@@ -111,6 +112,7 @@ private:
uint8_t m_sequence;
kutil::bip_buffer m_buffer;
kutil::spinlock m_lock;
static logger *s_log;
static const char *s_level_names[static_cast<unsigned>(level::max)];

View File

@@ -1,19 +1,46 @@
/// \file spinlock.h
/// Spinlock types and related definitions
#pragma once
#include <atomic>
namespace kutil {
/// An MCS based spinlock
class spinlock
{
public:
spinlock() : m_lock(false) {}
spinlock();
~spinlock();
inline void enter() { while (!m_lock.exchange(true)); }
inline void leave() { m_lock.store(false); }
/// A node in the wait queue.
struct waiter
{
bool locked;
waiter *next;
};
void acquire(waiter *w);
void release(waiter *w);
private:
std::atomic<bool> m_lock;
waiter *m_lock;
};
/// Scoped lock that owns a spinlock::waiter
class scoped_lock
{
public:
inline scoped_lock(spinlock &lock) : m_lock(lock) {
m_lock.acquire(&m_waiter);
}
inline ~scoped_lock() {
m_lock.release(&m_waiter);
}
private:
spinlock &m_lock;
spinlock::waiter m_waiter;
};
} // namespace kutil

View File

@@ -91,6 +91,8 @@ logger::output(level severity, area_t area, const char *fmt, va_list args)
header->bytes +=
vsnprintf(header->message, sizeof(buffer) - sizeof(entry), fmt, args);
kutil::scoped_lock lock {m_lock};
if (m_immediate) {
buffer[header->bytes] = 0;
m_immediate(area, severity, header->message);
@@ -117,6 +119,8 @@ logger::output(level severity, area_t area, const char *fmt, va_list args)
size_t
logger::get_entry(void *buffer, size_t size)
{
kutil::scoped_lock lock {m_lock};
void *out;
size_t out_size = m_buffer.get_block(&out);
if (out_size == 0 || out == 0)

View File

@@ -0,0 +1,49 @@
#include "kutil/spinlock.h"
namespace kutil {
static constexpr int memorder = __ATOMIC_SEQ_CST;
spinlock::spinlock() : m_lock {nullptr} {}
spinlock::~spinlock() {}
void
spinlock::acquire(waiter *w)
{
w->next = nullptr;
w->locked = true;
// Point the lock at this waiter
waiter *prev = __atomic_exchange_n(&m_lock, w, memorder);
if (prev) {
// If there was a previous waiter, wait for them to
// unblock us
prev->next = w;
while (w->locked) {
asm ("pause");
}
} else {
w->locked = false;
}
}
void
spinlock::release(waiter *w)
{
if (!w->next) {
// If we're still the last waiter, we're done. Copy w first:
// on failure the CAS writes the value it observed back into
// its 'expected' argument, which must not clobber w itself.
waiter *expected = w;
if (__atomic_compare_exchange_n(&m_lock, &expected, nullptr, false, memorder, memorder))
return;
}
// Wait for the subsequent waiter to tell us who they are
while (!w->next) {
asm ("pause");
}
// Unblock the subsequent waiter
w->next->locked = false;
}
} // namespace kutil