I wanna start off by saying: "talk is often a substitute for action". I decided never to put ideas out before I take action on them. But I'm making an exception here, and I'll tell you why at the end.
Inspiration for this idea: (Anthropic Paper)
So why should one care about this paper? They've identified specific features, or patterns, that correspond to particular behaviors, like being overly agreeable or sycophantic. By dialing those features up or down in the model's activations, they can control how much or how little the model exhibits these behaviors. Essentially, they've found the "knobs" to turn certain behaviors up or down.
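To make that concrete for myself: as I understand it, a "knob" here is a direction in the model's activation space, and turning it means adding (or subtracting) that direction during the forward pass. Below is a toy sketch of that idea. I'm using GPT-2 only because it's small, the "feature direction" is a random vector standing in for a real SAE feature, and the layer and coefficient are arbitrary choices of mine, so treat this as a mental model rather than the actual Anthropic recipe.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Hypothetical "feature direction": in the real thing this would be a decoder
# column of a trained sparse autoencoder, not torch.randn.
hidden_size = model.config.n_embd  # 768 for gpt2
feature_direction = torch.randn(hidden_size)
feature_direction = feature_direction / feature_direction.norm()
steering_coeff = 5.0  # how hard to turn the "knob" (made-up value)

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; output[0] is the residual-stream hidden states.
    steered = output[0] + steering_coeff * feature_direction
    return (steered,) + output[1:]

# Attach the hook to one middle block (layer 6 of 12, arbitrary choice).
handle = model.transformer.h[6].register_forward_hook(steering_hook)

prompt = "Tell me a joke about"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=30, do_sample=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # turn the knob back off
```

With a real, learned feature direction in place of the random one, those few lines are (as far as I can tell) the whole "turn sycophancy down" trick.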
After reading their paper, I went to the LLM providers: ChatGPT, Claude.ai, and others. I asked for dark, offensive, racist jokes, and they replied that they can't do that.
So these big LLM providers are gonna be politically correct, overly safe, and so on. An analogy came to mind: we have news that's true, and we have fake-news outlets like The Onion, which people love because it's fake and funny.
So can one build The Onion for LLMs?
So the idea, basically, is to take Llama 3 70B, the best open-source model, and tweak it with interpretability research techniques. The product idea is basically Rogue AI: it's where people come when they want an AI that is unhinged and unconstrained by the political correctness and guardrails these big companies put on their models.
I am very new to interpretability research. I studied linear algebra and these deep learning architectures in college, and I have trained a transformer architecture from scratch. But training an autoencoder to learn features from model activations is all new to me, and I don't have any intuition for it yet.
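Just to pin down what I think the training loop even looks like, here's a toy sparse autoencoder sketch. The random tensors stand in for residual-stream activations you'd actually collect from the model, the L1 term is what makes it "sparse", and all the sizes and coefficients are made up; the real recipes in Anthropic's paper and Karvonen's post are more involved.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: activation vector -> overcomplete sparse features -> reconstruction."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))  # sparse feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

d_model, d_features = 768, 8 * 768  # expansion factor of 8, picked arbitrarily
sae = SparseAutoencoder(d_model, d_features)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # strength of the sparsity penalty (made-up value)

# Stand-in for a batch of activations collected from the model's residual stream.
activations = torch.randn(4096, d_model)

for step in range(100):
    recon, feats = sae(activations)
    recon_loss = (recon - activations).pow(2).mean()  # reconstruct the activation
    sparsity_loss = feats.abs().mean()                # push most features toward zero
    loss = recon_loss + l1_coeff * sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Each column of the decoder weight is a candidate "feature direction",
# i.e. the kind of vector you'd steer with in the sketch above.
feature_directions = sae.decoder.weight.T  # shape: (d_features, d_model)
```

The part I have no intuition for yet is everything around this: which layer to read activations from, how big the feature dictionary should be, and how to tell which learned feature corresponds to which behavior.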
So the pursuit of just figuring this out is pretty fruitful on its own. But if I wear my entrepreneurial hat for a second, a couple of things stand out:
- Rogue AI is just terrible branding (or maybe not). It attaches bad intent to what you are doing.
- THIS IS NOT SOLVING ANYBODY'S problem. I have to figure out who wants this, who actually gets annoyed when ChatGPT doesn't answer what they want it to answer. Just being the offensive uncle you go talk to isn't gonna cut it.
The reason I am thinking about this idea out loud is that it would be a good link to attach when I reach out to anyone for direction on these interpretability research techniques.
I strongly urge you to read Anthropic's blog post. I didn't understand half of it, but I got the gist of how they are approaching this.
I have also found this blog post by Adam Karvonen explaining SAEs helpful. Maybe I'll reach out to him for help, but first I gotta get this thing moving.
I mean, if I do do this, it could be a path for me onto the Interpretability team at Anthropic. I know I'm stretching at this point, but who knows, just putting it out there.