With the move to dynamic executables, crt0's _start was only ever
calling libc's __init_libc, which only ran libc's init_array list. Now
make crt0 itself (which is statically linked into every executable) call
its own init_array list and have ld.so call every other image's ctor
lists.
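
As a rough illustration of what walking a ctor list involves, here is a
minimal sketch. It is not the code from init.cpp: the signature of
run_ctor_list, the ctor_fn alias, and the demo functions are assumptions made
for the example. A real crt0 would walk the range the linker provides (usually
between the __init_array_start and __init_array_end symbols) rather than a
local array.

```cpp
#include <cassert>
#include <cstddef>

// Assumed type for an init_array entry: a function taking no arguments.
using ctor_fn = void (*)();

// Hypothetical helper: call every entry in [begin, end), in order.
// The loop condition is the end-of-list check; without it the walk would
// run off the end of the array.
inline void run_ctor_list(ctor_fn* begin, ctor_fn* end) {
    for (ctor_fn* fn = begin; fn != end; ++fn) {
        (*fn)();
    }
}

// Demo state, standing in for the side effects of real constructors.
static int g_order[2];
static int g_count = 0;
static void demo_ctor_a() { g_order[g_count++] = 1; }
static void demo_ctor_b() { g_order[g_count++] = 2; }

// Runs a two-entry list and returns how many ctors were called.
inline int demo_run_ctor_list() {
    g_count = 0;
    ctor_fn list[] = {demo_ctor_a, demo_ctor_b};
    run_ctor_list(list, list + 2);
    return g_count;
}
```

In this scheme crt0 would call such a helper on the executable's own range,
and ld.so would do the same for each other loaded image.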
Added a release config and fixed a few spots where optimizations broke things:
- Clang was generating incorrect code for run_ctor_list in libc's init.cpp (it
ignored a check for the end of the list)
- My rep movsb memcpy implementation used incorrect inline asm constraints, so
it returned a pointer to the end of the copied range instead of the start.
Since this function was just inline asm anyway, I rewrote it in asm by hand in
a new memutils.s file.
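
For reference, the constraint problem can also be avoided while staying in
inline asm. The sketch below is not the code from memutils.s; the name
memcpy_movsb is hypothetical, and it assumes GCC/Clang extended asm on x86-64
with the direction flag clear (as the System V ABI requires at function
entry). The key points are that rep movsb modifies rdi, rsi and rcx, so all
three must be declared read-write, and that the return value has to be saved
before the instruction runs.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical name; copies n bytes from src to dest and returns dest.
inline void* memcpy_movsb(void* dest, const void* src, std::size_t n) {
    void* ret = dest;  // save the start: rep movsb advances the pointer in rdi
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    // "+D", "+S" and "+c" mark rdi, rsi and rcx as both read and written,
    // so the compiler knows their values are clobbered by the copy.
    // The "memory" clobber tells it that memory is read and written.
    asm volatile("rep movsb"
                 : "+D"(dest), "+S"(src), "+c"(n)
                 :
                 : "memory");
#else
    // Portable fallback so the sketch still builds on other targets.
    unsigned char* d = static_cast<unsigned char*>(dest);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    while (n--) {
        *d++ = *s++;
    }
#endif
    return ret;
}
```

Declaring the pointers as input-only operands and then returning dest is one
way to end up with the "pointer to the end of the range" behaviour described
above, since the compiler has no idea the register was changed.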