
[TIL] LLM as reward models/evaluators (#RLHF, #Self-improvement)

당니이 2024. 8. 30. 05:00

I should study other areas too .. time to focus ..

☑️ RewardBench (8 Jun 2024)

  • Evaluating Reward Models for Language Modeling
  • A benchmark for evaluating reward models.
  • RLHF: the process of training a reward model on human-created preference data (a toy sketch of this step follows below).
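To make the RLHF bullet concrete, here is a minimal toy sketch (my own example, not from the RewardBench paper): a reward model with a scalar head is trained on (chosen, rejected) preference pairs with a Bradley-Terry loss, and a RewardBench-style check then measures how often the chosen response is scored above the rejected one. `TinyRewardModel`, the toy data, and all sizes are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Stand-in for an LLM backbone with a scalar reward head (hypothetical)."""
    def __init__(self, vocab_size: int = 100, hidden_size: int = 16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.scalar_head = nn.Linear(hidden_size, 1)   # pooled hidden state -> scalar reward

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        pooled = self.embed(token_ids).mean(dim=1)     # mean-pool over the sequence
        return self.scalar_head(pooled).squeeze(-1)    # one reward per sequence

def bradley_terry_loss(r_chosen, r_rejected):
    # The chosen response should receive a higher reward than the rejected one.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

model = TinyRewardModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Toy token ids standing in for "prompt + chosen" and "prompt + rejected".
chosen = torch.randint(0, 100, (4, 12))
rejected = torch.randint(0, 100, (4, 12))

loss = bradley_terry_loss(model(chosen), model(rejected))
loss.backward()
optimizer.step()

# RewardBench-style evaluation: how often does the model rank chosen above rejected?
with torch.no_grad():
    accuracy = (model(chosen) > model(rejected)).float().mean()
print(f"loss={loss.item():.4f}  pairwise accuracy={accuracy.item():.2f}")
```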


☑️ Self-Taught Evaluators (8 Aug 2024)

  • Reward modeling requires human judgment annotations, but these are too costly.
  • Builds a self-improvement framework that needs no human annotation (see the iteration sketch further below).
    • Without any labeled preference data, our Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench
  • The LLM-as-a-Judge (judgment generator) takes the following inputs (combined into a single prompt in the sketch after this list):
    • an input (user instruction) x;
    • two possible assistant responses y(A) and y(B) to the user instruction x;
    • an evaluation prompt containing the rubric, asking the model to evaluate the responses and choose the winning answer.
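A rough illustration of how these three inputs could be assembled into one judge prompt. The rubric wording, `build_judge_prompt`, `parse_verdict`, and the commented-out `call_llm` are my own placeholders, not the paper's actual prompt.

```python
# Hypothetical LLM-as-a-Judge prompt template (not the paper's exact wording).
JUDGE_TEMPLATE = """You are an impartial judge. Evaluate the two assistant
responses to the user instruction below for correctness, helpfulness, and
clarity. Reason step by step, then give your verdict on the last line as
either [[A]] or [[B]].

[User Instruction]
{instruction}

[Response A]
{response_a}

[Response B]
{response_b}
"""

def build_judge_prompt(x: str, y_a: str, y_b: str) -> str:
    """Combine the input x, the responses y(A)/y(B), and the evaluation rubric."""
    return JUDGE_TEMPLATE.format(instruction=x, response_a=y_a, response_b=y_b)

def parse_verdict(judgment: str) -> str:
    """Extract the winning answer from the judge's reasoning chain + verdict."""
    return "A" if "[[A]]" in judgment else "B"

# Usage with some judge model behind a hypothetical call_llm() wrapper:
# judgment = call_llm(build_judge_prompt(x, y_a, y_b))
# winner = parse_verdict(judgment)
```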

Even without any human annotation, it achieves performance comparable to training on human-annotated data.
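How the label-free self-improvement loop might look, going by the paper's high-level description: build a synthetic preference pair by pairing a response to the original instruction (chosen) with a response to a deliberately degraded version of that instruction (rejected), sample judgments from the current model, keep only those whose verdict matches the known synthetic label, and fine-tune on the retained traces. All helper functions below are hypothetical stubs, not the paper's implementation.

```python
import random

# --- hypothetical stand-ins for the real model calls -------------------------
def corrupt_instruction(model, x):        # produce a related but "worse" instruction
    return x + " (answer a slightly different question)"

def respond(model, x):                    # generate a response to an instruction
    return f"answer to {x!r} #{random.randint(0, 999)}"

def sample_judgment(model, x, y_a, y_b):  # returns (reasoning trace, verdict in {"A", "B"})
    return "step-by-step reasoning ...", random.choice(["A", "B"])

def finetune(model, examples):            # fine-tune the judge on retained traces
    return model
# -----------------------------------------------------------------------------

def self_taught_iteration(model, instructions, n_samples=4):
    retained = []
    for x in instructions:
        # Synthetic pair: the response to x is "chosen" (label A), the response
        # to the corrupted instruction is "rejected" (label B).
        y_chosen = respond(model, x)
        y_rejected = respond(model, corrupt_instruction(model, x))

        # Sample judgments; keep one whose verdict agrees with the synthetic label.
        for _ in range(n_samples):
            reasoning, verdict = sample_judgment(model, x, y_chosen, y_rejected)
            if verdict == "A":
                retained.append((x, y_chosen, y_rejected, reasoning, verdict))
                break

    return finetune(model, retained)

# model = self_taught_iteration(model, unlabeled_instructions)   # repeat for several iterations
```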


☑️ Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge (30 Jul 2024)

  • Background: Self-Rewarding Language Models
    • LLMs can improve by judging their own responses instead of relying on human labelers (AI feedback training) 
    • Key insight: repeatedly iterating this training improves instruction following, but improvement of the judge itself is not guaranteed, which is a drawback.
  • LLM acts as an actor, judge, and meta-judge 
    • Self-improvement process in which the model judges its own judgments and uses that feedback to refine its judging skill (see the iteration sketch after these bullets).

  • Surprisingly, this unsupervised approach improves the model’s ability to judge and follow instructions, as demonstrated by a win rate improvement of Llama-3-8B-Instruct from 22.9% to 39.4%
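A rough sketch of one Meta-Rewarding iteration under this reading of the setup: the same model plays actor (samples responses), judge (scores them, producing preference pairs for the actor), and meta-judge (compares two judgments of the same response, producing preference pairs for the judging skill itself); both kinds of pairs then feed a preference-optimization update (DPO in the paper). Every helper below is a hypothetical stub, not the paper's implementation.

```python
import random

# --- hypothetical stand-ins for role-specific prompting of the same LLM ------
def generate(model, x):                   # actor: sample a candidate response
    return f"response to {x!r} #{random.randint(0, 999)}"

def judge(model, x, y):                   # judge: return (reasoning, score 1-5)
    return f"judgment of {y!r}", random.randint(1, 5)

def meta_judge(model, x, y, j_a, j_b):    # meta-judge: pick the better judgment
    return (j_a, j_b) if random.random() < 0.5 else (j_b, j_a)

def dpo_update(model, actor_pairs, judge_pairs):  # preference-optimization step
    return model
# -----------------------------------------------------------------------------

def meta_rewarding_iteration(model, prompts, n_responses=4):
    actor_pairs, judge_pairs = [], []
    for x in prompts:
        # Actor role: sample several candidate responses.
        responses = [generate(model, x) for _ in range(n_responses)]

        # Judge role: score each response; best vs. worst becomes an actor pair.
        judgments = [judge(model, x, y) for y in responses]
        scores = [s for _, s in judgments]
        best = responses[scores.index(max(scores))]
        worst = responses[scores.index(min(scores))]
        actor_pairs.append((x, best, worst))

        # Meta-judge role: compare two independent judgments of the same response;
        # the preferred judgment becomes training signal for the judging skill.
        y0 = responses[0]
        j_a, _ = judge(model, x, y0)
        j_b, _ = judge(model, x, y0)
        winner, loser = meta_judge(model, x, y0, j_a, j_b)
        judge_pairs.append((x, winner, loser))

    return dpo_update(model, actor_pairs, judge_pairs)

# model = meta_rewarding_iteration(model, unlabeled_prompts)   # one self-improvement round
```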