An Intro to Kernel Development - Debugging Kernel - Part 1


Let's see some common issues...

In the previous blog we wrote a simple scheduler that switched between two user processes. This week I started on the next part, the MMU, since separating memory between processes is a crucial security feature. Before that, though, we need a good baseline: as our code grows it will be hard to keep everything in our head, so we need some tools to debug issues.

I personally need a good printf helper to debug the old-fashioned way. It's quick if you know where to look, and it is almost always good enough for high-level logic debugging. I have written a print helper that supports the %d, %s and %c format specifiers. This should be enough for now; maybe we can add %x in the future.

But our kernel involves a lot of external devices, and they like to be enabled and used in a certain way. Without a debugger it is impossible to confirm that we have passed everything properly to the device in question. This is where GDB is going to help us. It is really good; I only know a few commands, and so far I haven't felt the need to learn any more.

Let's start with some example cases and how to debug them.

CASE 1:

When my simple scheduler kernel was running, it was supposed to switch between the two tasks. But when I first ran the code, I kept getting log messages from Task1 only. It seems obvious that the task is not getting switched, but we need to confirm that, because it is possible that something else is messed up too. So I started by adding a print statement right after the dequeue to see which task was being scheduled next. This confirmed that the next task was also A.
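The check described above can be sketched roughly like this. It is a minimal, hosted-C illustration and not the kernel's actual code: `struct task`, `struct run_queue`, `enqueue`, `dequeue` and `schedule` are hypothetical names, and the standard `printf` stands in for the kernel's own print helper.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical task and run-queue types; the real kernel's names may differ. */
struct task {
    const char *name;
    struct task *next;
};

struct run_queue {
    struct task *head;  /* next task to run */
    struct task *tail;  /* last task in line */
};

/* Append a task at the tail. head is only touched when the queue is empty. */
static void enqueue(struct run_queue *q, struct task *t)
{
    t->next = NULL;
    if (q->tail != NULL)
        q->tail->next = t;
    else
        q->head = t;
    q->tail = t;
}

/* Remove and return the task at the head, or NULL if the queue is empty. */
static struct task *dequeue(struct run_queue *q)
{
    struct task *t = q->head;
    if (t == NULL)
        return NULL;
    q->head = t->next;
    if (q->head == NULL)
        q->tail = NULL;
    t->next = NULL;
    return t;
}

/* Round-robin step: current goes to the back, the next task comes off the front. */
static struct task *schedule(struct run_queue *q, struct task *current)
{
    if (current != NULL)
        enqueue(q, current);
    struct task *next = dequeue(q);
    /* The debugging print: which task did the dequeue actually hand back? */
    printf("sched: next=%s head=%s\n",
           next ? next->name : "(none)",
           q->head ? q->head->name : "(empty)");
    return next;
}
```

If a print like this shows the same task on every call, the queue handling, not the context switch itself, is the first place to look.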
From there I added a bunch of prints throughout the file and figured out that Head was being set to 0 after adding a new task to the end. You can find all the prints inside the file.

CASE 2:

The print method might seem easy, but it took me an entire day to figure it out that way. With GDB I believe it could have gone better. Still, GDB is not really for finding bugs in logic, and the previous example was just me writing dumb linked-list code. GDB shines when you need to comply with certain rules, like parameter passing, or when checking whether a struct is laid out in memory as we expected. The key point is that adding an extra print changes the existing layout: even if you figure out a fix with prints, it may stop working after you remove them, because by removing a function call you have just changed the memory layout again.

I wrote a syscall, write, which can be used from a user process, but no matter what I did, I was not able to get the parameters in the correct order in irq_sync_handler. I was able to confirm that I was passing the variables correctly in the syscall file, but within the dispatcher code the arguments were messed up.

I stopped my pointless printfs and attached GDB, then tried to set a breakpoint at the function. Here came another issue: I didn't see the function at all. Even worse, if I added a print statement and attached GDB, I saw the function; if I removed the print and attached GDB, boom, the function was no longer there. This was new for me. It turns out my high-level-language brain, which doesn't normally care about compiler optimizations, forgot that the compiler can decide to inline certain functions and that a lot of things can be cached to make things fast. This is exactly what happened in my case:
            
                → 0x40001968  adrp x0, 0x40001000
                  0x4000196c  add x0, x0, #0x9b4
                  0x40001970  mov x29, sp
                  0x40001974  bl 0x400012b4
                  0x40001978  .inst 0x73550042 ; undefined
                  0x4000197c  .inst 0x4c207265 ; undefined
                ──────────────────────────────── threads ────
                [#0] Id 1, stopped 0x40001968 in syscall_dispatcher (), reason: BREAKPOINT
                ──────────────────────────────── trace ────
                [#0] 0x40001968 → syscall_dispatcher()
                [#1] 0x400010e0 → el0_sync_entry()
                ────────────────────────────────────────────
                (remote) gef➤ info registers
                x0 0x1 0x1
                x1 0x40001a20 0x40001a20
                x2 0xa 0xa
                x3 0x0 0x0
                x4 0x0 0x0
                x5 0x0 0x0
                x6 0x0 0x0
                x7 0x0 0x0
                x8 0x1 0x1
                x9 0x9000018 0x9000018
            
        
Breaking before the dispatcher code showed that my arguments were in the correct registers, x0, x1 and x2, but the syscall dispatcher starts with a prologue stub that alters the registers and also changes the stack layout. To get around this issue, I had to write a helper that contains only asm, mark it with the naked attribute, and have it call the actual method with the proper arguments. This solved the issue and everything works now.

There is a bigger issue somewhere that I haven't figured out, though: if I remove `-O2` from my compiler flags, everything breaks again. This will be an exercise for you if you want to try and find which structure is broken, or maybe there is some other stuff left on the stack that I haven't accounted for.

Small sample of the working parameter layout in GDB

HELPFUL GDB commands:

b function name -> break at the start of a function (tab completion works here)
ni -> next instruction
disassemble function name -> show the instructions of a function
c -> continue
info r -> show current registers

I am also using the GEF GDB plugin to make the output a little nicer at every breakpoint.
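A note on `b function name`: as CASE 2 showed, the symbol can disappear when the compiler inlines the function. One way to keep it breakable, assuming GCC or Clang, might be to mark the function `noinline`. The sketch below is only an illustration: the signature, the `SYS_WRITE` value and the `handle_sync` caller are hypothetical stand-ins, not the real kernel code.

```c
#include <stddef.h>

/* Hypothetical syscall number, chosen only for this sketch. */
#define SYS_WRITE 1

/*
 * noinline asks GCC/Clang not to inline this function into its callers,
 * so an out-of-line copy should remain for a breakpoint to land on.
 * The name mirrors the dispatcher in the trace; the signature is assumed.
 */
__attribute__((noinline))
static long syscall_dispatcher(long nr, long arg0, long arg1, long arg2)
{
    (void)arg0;
    (void)arg1;
    switch (nr) {
    case SYS_WRITE:
        return arg2;   /* pretend all arg2 bytes were written */
    default:
        return -1;     /* unknown syscall */
    }
}

/* Hypothetical caller standing in for the exception entry path. */
long handle_sync(long nr, long a0, long a1, long a2)
{
    return syscall_dispatcher(nr, a0, a1, a2);
}
```

This does not stop the optimizer from rearranging things inside the function, and GCC may still emit specialized clones of a static function under a suffixed name, so treat it as a debugging aid rather than a fix.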