Results
I’ve spent the week working on logs: I wrote scripts to streamline parsing them for NotebookLM and added conversion tools for specific log types.
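For illustration, here is a minimal sketch of what one of those conversion scripts could look like. It assumes the logs are JSONL transcripts with `role` and `content` fields, which is an assumption on my part; the real schema and filenames may differ.

```python
import json
import sys
from pathlib import Path

def jsonl_to_text(src: Path, dst: Path) -> None:
    """Flatten a JSONL transcript into plain text that NotebookLM can ingest."""
    with src.open() as fin, dst.open("w") as fout:
        for line in fin:
            record = json.loads(line)  # assumed schema: {"role": ..., "content": ...}
            fout.write(f'{record["role"].upper()}: {record["content"]}\n\n')

if __name__ == "__main__":
    # Usage: python jsonl_to_text.py input.jsonl output.txt
    jsonl_to_text(Path(sys.argv[1]), Path(sys.argv[2]))
```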
I scaled the experiments up to 150 samples and upgraded the judge and auditor to Claude 3.5 Sonnet. We’re seeing significantly more correlations now; however, our statistical power turns out to be extremely low, suggesting we might need around 300 samples in total for a robust analysis.
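For context, here is a quick sketch of the kind of power calculation behind that ~300 figure, using the standard Fisher z-transform approximation for detecting a Pearson correlation. The effect size r = 0.16 is purely illustrative, not a value measured in our runs.

```python
from math import atanh, ceil
from scipy.stats import norm

def n_for_correlation(r: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size needed to detect a Pearson correlation of r,
    via the Fisher z-transform: n = ((z_alpha + z_beta) / atanh(r))^2 + 3."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided test
    z_beta = norm.ppf(power)
    return ceil(((z_alpha + z_beta) / atanh(r)) ** 2) + 3

# Illustrative only: an effect around r = 0.16 would need ~305 samples,
# in the ballpark of the 300 mentioned above.
print(n_for_correlation(0.16))
```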
I don’t think we’ll actually run that many; it’s just food for thought for now. But we can already say with confidence that model prompts directly influence their resting-state tendencies. I speculate this is due to an implicit bias instilled by fine-tuning, which pushes the model strongly toward systematic production.