I read through the code after your tweet and Karpathy's reply. I think I see two things that might be messing with your results, and I had a thought on the overall picture.
The bugs:
There's a rank mismatch between chat.py and sleep.py. chat.py sets up LoRA with rank=8, but sleep.py trains at rank=16. When chat.py does model.load_weights(adapter_file, strict=False), I'm pretty sure the shape mismatch means MLX silently skips loading those weights. So after a sleep cycle, the model is running with freshly initialized, untrained LoRA layers - it's not actually using anything it learned. I think that would be enough to explain "not working so well."
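A quick way to confirm this would be to compare the shapes in the saved adapter against the shapes chat.py builds. Here's a rough sketch of that check using plain dictionaries of shapes rather than real MLX calls; find_shape_mismatches and the parameter names and dimensions are made up for illustration, so treat it as the idea, not the fix.

```python
def find_shape_mismatches(model_shapes, adapter_shapes):
    """Return {name: (model_shape, adapter_shape)} for shared names whose shapes differ."""
    mismatches = {}
    for name, adapter_shape in adapter_shapes.items():
        model_shape = model_shapes.get(name)
        if model_shape is not None and tuple(model_shape) != tuple(adapter_shape):
            mismatches[name] = (tuple(model_shape), tuple(adapter_shape))
    return mismatches


# Hypothetical shapes: chat.py builds rank-8 layers, sleep.py saved rank-16 weights.
chat_shapes = {
    "layers.0.q_proj.lora_a": (4096, 8),
    "layers.0.q_proj.lora_b": (8, 4096),
}
saved_shapes = {
    "layers.0.q_proj.lora_a": (4096, 16),
    "layers.0.q_proj.lora_b": (16, 4096),
}

mismatches = find_shape_mismatches(chat_shapes, saved_shapes)
for name, (expected, found) in mismatches.items():
    print(f"{name}: model expects {expected}, adapter file has {found}")
```

If a check like this reports anything, the simplest fix is probably to have both scripts read the rank from one shared constant.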
The other thing is that sleep.py loads a fresh base model and fresh LoRA layers every time (it doesn't load the existing adapter before training). So each sleep throws away the previous adapter's weights and retrains from scratch on the full qa.jsonl. With iters fixed at 100 and batch_size=1, the model sees 100 samples per sleep no matter how many Q&A pairs are in the file. Fine at first, but once qa.jsonl grows past 100 pairs a single sleep can't even cover every pair once, so the model gets more and more undertrained as the dataset grows.
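To put numbers on that, here's a small sketch of the arithmetic. The helper names are mine, not anything in sleep.py, and the "aim for a fixed number of passes" rule is just one possible way to scale iters.

```python
import math


def epochs_per_sleep(iters, batch_size, num_pairs):
    """How many passes over the dataset one sleep cycle actually makes."""
    return iters * batch_size / num_pairs


def iters_for_epochs(target_epochs, batch_size, num_pairs):
    """Iterations needed to make target_epochs passes over the dataset."""
    return math.ceil(target_epochs * num_pairs / batch_size)


# With iters=100 and batch_size=1, coverage shrinks as qa.jsonl grows.
for num_pairs in (50, 100, 1000):
    coverage = epochs_per_sleep(100, 1, num_pairs)
    print(f"{num_pairs} pairs -> {coverage} passes per sleep")
```

Something like iters_for_epochs(2, 1, num_pairs) would keep the number of passes constant instead of the number of steps.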
The bigger question:
I think the brain damage problem might be baked into the approach of using LoRA for factual recall specifically. The Q&A extraction prompt in sleep.py pushes toward precise fact pairs ("What is my name?" / "Your name is Awni"), and you're essentially asking a low-rank adapter to act as a key-value store. LoRA isn't great at this: a low-rank update tends to shift how the model behaves broadly rather than memorize specific input-output mappings.
This connects to what Karpathy was saying about memory ops as tools. What if the factual stuff (names, places, things I told you) lived in token space as a structured memory file that gets pulled into the system prompt, and LoRA sleep was just for the behavioral side? Personality, tone, communication style, the kind of stuff that doesn't have a single right answer but that you want baked into how the model responds.
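Here's roughly what I mean by the token-space side, as a sketch only. The JSON layout, the build_system_prompt helper, and the example facts other than the name are all hypothetical; nothing like this exists in the repo as far as I can tell.

```python
import json


def build_system_prompt(base_prompt, facts):
    """Append stored facts to the system prompt as a plain bullet list."""
    if not facts:
        return base_prompt
    lines = [f"- {key}: {value}" for key, value in facts.items()]
    return base_prompt + "\n\nKnown facts about the user:\n" + "\n".join(lines)


# Hypothetical memory file contents; in practice this would be read from disk.
memory_json = '{"name": "Awni", "favorite_editor": "vim"}'
facts = json.loads(memory_json)

prompt = build_system_prompt("You are a helpful assistant.", facts)
print(prompt)
```

The facts stay exact because they're just text in context, and the adapter never has to reproduce them from weights.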
That split might make the weight update problem a lot easier because you're no longer asking LoRA to do precise recall. You're just nudging the model's distribution, which is what it does well and where small errors don't show up as obviously wrong answers.
No idea if this is useful, but I figured I'd share what I was seeing in the code. I realize this split might end up too far from true learning, since the weights would no longer be trained on the hard facts themselves.