Now that the other CPUs have been brought up, add support for scheduling
tasks on them. The scheduler now maintains separate ready/blocked lists
per CPU, and CPUs will attempt to balance load via periodic work
stealing.
Other changes as a result of this:
- The device manager no longer creates a local APIC object, but instead
just gathers relevant info from the APCI tables. Each CPU creates its
own local APIC object. This also spurred the APIC timer calibration to
become a static value, as all APICs are assumed to be symmetrical.
- Fixed a bug where the scheduler was popping the current task off of
its ready list, however the current task is never on the ready list
(except the idle task was first set up as both current and ready).
This was causing the lists to get into bad states. Now a task can only
ever be current or in a ready or blocked list.
- Got rid of the unused static process::s_processes list of all
processes, instead of trying to synchronize it via locks.
- Added spinlocks for synchronization to the scheduler and logger
objects.
KVM didn't like setting all the CR4 bits we wanted at once. I suspect
that means real hardware won't either. Delay the setting of the rest of
CR4 until after the CPU is in long mode - only set PAE and PGE from real
mode.
Because the firmware can set the APIC ids to whatever it wants, add a
sequential index to each cpu_data structure that jsix will use for its
main identifier, or for indexing into arrays, etc.
More and more places in the kernel init code are taking addresses from
the bootloader and translating them to offset-mapped addresses. The
bootloader can do this, so it should.
In order to avoid cyclic dependencies in the case of page faults while
bringing up an AP, pre-allocate the cpu_data structure and related CPU
control structures, and pass them to the AP startup code.
This also changes the following:
- cpu_early_init() was split out of cpu_early_init() to allow early
usage of current_cpu() on the BSP before we're ready for the rest of
cpu_init(). (These functions were also renamed to follow the preferred
area_action naming style.)
- isr_handler now zeroes out the IST entry for its vector instead of
trying to increment the IST stack pointer
- the IST stacks are allocated outside of cpu_init, to also help reduce
stack pressue and chance of page faults before APs are ready
- share stack areas between AP idle threads so we only waste 1K per
additional AP for the unused idle stack
Since SYSCALL/SYSRET rely on MSRs to control their function, split out
syscall_enable() into syscall_initialize() and syscall_enable(), the
latter being called on all CPUs. This affects not just syscalls but also
the kernel_to_user_trampoline.
Additionally, do away with the max syscalls, and just make a single page
of syscall pointers and name pointers. Max syscalls was fragile and
needed to be kept in sync in multiple places.
By adding more debug information to the symbols and adding function
frame preludes to the isr handler assembly functions, GDB sees them as
valid locations for stack frames, and can display backtraces through
interrupts.
This very large commit is mainly focused on getting the APs started and
to a state where they're waiting to have work scheduled. (Actually
scheduling on them is for another commit.)
To do this, a bunch of major changes were needed:
- Moving a lot of the CPU initialization (including for the BSP) to
init_cpu(). This includes setting up IST stacks, writing MSRs, and
creating the cpu_data structure. For the APs, this also creates and
installs the GDT and TSS, and installs the global IDT.
- Creating the AP startup code, which tries to be as position
independent as possible. It's copied from its location to 0x8000 for
AP startup, and some of it is fixed at that address. The AP startup
code jumps from real mode to long mode with paging in one swell foop.
- Adding limited IPI capability to the lapic class. This will need to
improve.
- Renaming cpu/cpu.* to cpu/cpu_id.* because it was just annoying in GDB
and really isn't anything but cpu_id anymore.
- Moved all the GDT, TSS, and IDT code into their own files and made
them classes instead of a mess of free functions.
- Got rid of bsp_cpu_data everywhere. Now always call the new
current_cpu() to get the current CPU's cpu_data.
- Device manager keeps a list of APIC ids now. This should go somewhere
else eventually, device_manager needs to be refactored away.
- Moved some more things (notably the g_kernel_stacks vma) to the
pre-constructor setup in memory_bootstrap. That whole file is in bad
need of a refactor.
The frame allocator was causing page faults when exhausting the first
(well, last, because it starts from the end) block of free pages. Turns
out it was just incrementing instead of decrementing and thus running
off the end.
Instead of always mapping the framebuffer at an arbitrary location, and
so reporting that to userspace, send the physical address so drivers can
call system_map_mmio().
Probably due to old UEFI page tables going away, some systems failed to
load ACPI tables at their physical location. Make sure to translate them
to kernel offset-mapped addresses.
Create a syscall for drivers to be able to ask the kernel for a VMA that
maps a MMIO area. Also expose vm_flags via j6 table style include file
and new flags.h header.
The 0 index was still sometimes not handled properly in the hash table.
Also 0 is sometimes indicative of an error. Let's just start handles
from 1 to avoid those issues.
We started actually running up against the page boundary for kernel
stacks and thus double-faulting on page faults from kernel space. So I
finally added IST stacks. Note that we currently just
increment/decrement the IST entry by a page when we enter the handler to
avoid clobbering on re-entry, but this means:
* these handlers need to be able to operate with only a page of stack
* kernel stacks always have to be >1 pages
* the amount of nesting possible is tied to the kernel stack size.
These seem fine for now, but we should maybe find a way to use something
besides g_kernel_stacks to set up the IST stacks if/when this becomes an
issue.
The previous method of VMA page tracking relied on the VMA always being
mapped at least into one space and just kept track of pages in the
spaces' page tables. This had a number of drawbacks, and the mapper
system was too complex without much benefit.
Now make VMAs themselves keep track of spaces that they're a part of,
and make them responsible for knowing what page goes where. This
simplifies most types of VMA greatly. The new vm_area_open (nee
vm_area_shared, but there is now no reason for most VMAs to be
explicitly shareable) adds a 64-ary radix tree for tracking allocated
pages.
The page_tree cannot yet handle taking pages away, but this isn't
something jsix can do yet anyway.
ktuil::vector can take a static area of memory as its initial memory,
but the case was never handled where it outgrew that memory and had to
reallocate. Steal the high bit from the capacity value to indicate the
current memory should not be kfree()'d. Also added checks in the heap
allocator to make sure pointers look valid.
In preparation for moving things to the init process, move process
loading out of the scheduler. memory_bootstrap now has a
load_simple_process function for mapping an args::program into memory,
and the stack setup has been simplified (though all the initv values are
still being added by the kernel - this needs rework) and normalized to
use the thread::add_thunk_user code path.
Refactored out vm_space::handle_fault's allocation code into a separate
vm_space::allocate function, and reimplemented handle_fault in terms of
the new function.
This is a minor refactor including:
- Removing old commented-out syscall_dispatch function
- Removing IA32_EFER syscall-enable flag setting (this is done by the
bootloader now)
- Moving much logging from inside process/thread syscalls to the 'task'
log area, allowing for turning the 'syscall' area down to info by
default.
This also prompted a change of the process initialization protocol to
allow handles to get typed, and changing to marking them as just
self/other handls. This also means exposing the object type enum to
userspace.
This makes the job of the kernel easier when marking module pages as
used in the frame allocator. This will also help when sending modules
over to the init process.
The previous frame allocator involved a lot of splitting and merging
linked lists and lost all information about frames while they were
allocated. The new allocator is based on an array of descriptor
structures and a bitmap. Each memory map region of allocatable memory
becomes one or more descriptors, each mapping up to 1GiB of physical
memory. The descriptors implement two levels of a bitmap tree, and have
a pointer into the large contiguous bitmap to track individual pages.
To enable setting sections as NX or read-only, the boot program loader
now loads programs as lists of sections, and the kernel args are updated
accordingly. The kernel's loader now just takes a program pointer to
iterate the sections. Also enable NX in IA32_EFER in the bootloader.
There was previously no good way to block log-display tasks, either the
fb driver or the kernel log task. Now the system object has a signal
(j6_signal_system_has_log) that gets asserted when the log is written
to.
Now that the spinwait bug is fixed, the raised time for APIC calibration
can be put back to a lower value. It was previously raised thinking more
time would get a more accurate result -- but accuracy was not the issue.
The endpoint syscalls endpoint_recv and endpoint_sendrecv gained new
local stack variables for calling into possibly blocking endpoint
functions, but the len variable was being initialized to 0 instead of
the incoming buffer size.
The amount of spurious IRQ activity on real hardware severely slows down
the system (minutes per frame instead of frames per second). There's no
reason to unmask all of them from the get-go before they're set up to be
handled.
In order to allow the bootloader to do preliminary CPUID validation
while UEFI is still handling displaying information to the user, split
most of the kernel's CPUID handling into a library to be used by both
kernel and boot.
Several changes were needed to make this work:
- Update the page_table::flags to understand memory caching types
- Set up the PAT MSR to add the WC option
- Make page-offset area mapped as WT
- Add all the MTRR and PAT MSRs, and log the MTRRs for verification
- Add a vm_area flag for write_combining
The UEFI spec specifically calls out memory types with the high bit set
as being available for OS loaders' custom use. However, it seems many
UEFI firmware implementations don't handle this well. (Virtualbox, and
the firmware on my Intel NUC and Dell XPS laptop to name a few.)
So sadly since we can't rely on this feature of UEFI in all cases, we
can't use it at all. Instead, treat _all_ memory tagged as EfiLoaderData
as possibly containing data that's been passed to the OS by the
bootloader and don't free it yet.
This will need to be followed up with a change that copies anything we
need to save and frees this memory.
See: https://github.com/kiznit/rainbow-os/blob/master/boot/machine/efi/README.md