
[TIL] LLM as reward models/evaluators (#RLHF, #Self-improvement)

당니이 2024. 8. 30. 05:00

I should study other areas too .. time to focus ..

☑️ RewardBench (8 Jun 2024)

  • Evaluating Reward Models for Language Modeling
  • A benchmark for evaluating reward models.
  • RLHF: the process of training a reward model on human-created preference data (a toy sketch of this step follows below).
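To make the RLHF bullet concrete, here is a minimal toy sketch (my own example, not from the RewardBench paper): a reward model with a scalar head is trained on (chosen, rejected) preference pairs with a Bradley-Terry loss, and a RewardBench-style check then measures how often the chosen response is scored above the rejected one. `TinyRewardModel`, the toy data, and all sizes are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Stand-in for an LLM backbone with a scalar reward head (hypothetical)."""
    def __init__(self, vocab_size: int = 100, hidden_size: int = 16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.scalar_head = nn.Linear(hidden_size, 1)   # pooled hidden state -> scalar reward

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        pooled = self.embed(token_ids).mean(dim=1)     # mean-pool over the sequence
        return self.scalar_head(pooled).squeeze(-1)    # one reward per sequence

def bradley_terry_loss(r_chosen, r_rejected):
    # The chosen response should receive a higher reward than the rejected one.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

model = TinyRewardModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Toy token ids standing in for "prompt + chosen" and "prompt + rejected".
chosen = torch.randint(0, 100, (4, 12))
rejected = torch.randint(0, 100, (4, 12))

loss = bradley_terry_loss(model(chosen), model(rejected))
loss.backward()
optimizer.step()

# RewardBench-style evaluation: how often does the model rank chosen above rejected?
with torch.no_grad():
    accuracy = (model(chosen) > model(rejected)).float().mean()
print(f"loss={loss.item():.4f}  pairwise accuracy={accuracy.item():.2f}")
```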


☑️ Self-Taught Evaluators (8 Aug 2024)

  • Reward modeling requires human judgment annotations, but these are too costly.
  • Builds a self-improvement framework that needs no human annotation (see the iteration sketch further below).
    • Without any labeled preference data, our Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench
  • The LLM-as-a-Judge (judgment generator) takes the following inputs (combined into a single prompt in the sketch after this list):
    • an input (user instruction) x;
    • two possible assistant responses y(A) and y(B) to the user instruction x;
    • an evaluation prompt containing the rubric, asking the model to evaluate the responses and choose the winning answer.
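A rough illustration of how these three inputs could be assembled into one judge prompt. The rubric wording, `build_judge_prompt`, `parse_verdict`, and the commented-out `call_llm` are my own placeholders, not the paper's actual prompt.

```python
# Hypothetical LLM-as-a-Judge prompt template (not the paper's exact wording).
JUDGE_TEMPLATE = """You are an impartial judge. Evaluate the two assistant
responses to the user instruction below for correctness, helpfulness, and
clarity. Reason step by step, then give your verdict on the last line as
either [[A]] or [[B]].

[User Instruction]
{instruction}

[Response A]
{response_a}

[Response B]
{response_b}
"""

def build_judge_prompt(x: str, y_a: str, y_b: str) -> str:
    """Combine the input x, the responses y(A)/y(B), and the evaluation rubric."""
    return JUDGE_TEMPLATE.format(instruction=x, response_a=y_a, response_b=y_b)

def parse_verdict(judgment: str) -> str:
    """Extract the winning answer from the judge's reasoning chain + verdict."""
    return "A" if "[[A]]" in judgment else "B"

# Usage with some judge model behind a hypothetical call_llm() wrapper:
# judgment = call_llm(build_judge_prompt(x, y_a, y_b))
# winner = parse_verdict(judgment)
```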

Even without any human annotation, it achieves performance comparable to training on human-annotated data.
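How the label-free self-improvement loop might look, going by the paper's high-level description: build a synthetic preference pair by pairing a response to the original instruction (chosen) with a response to a deliberately degraded version of that instruction (rejected), sample judgments from the current model, keep only those whose verdict matches the known synthetic label, and fine-tune on the retained traces. All helper functions below are hypothetical stubs, not the paper's implementation.

```python
import random

# --- hypothetical stand-ins for the real model calls -------------------------
def corrupt_instruction(model, x):        # produce a related but "worse" instruction
    return x + " (answer a slightly different question)"

def respond(model, x):                    # generate a response to an instruction
    return f"answer to {x!r} #{random.randint(0, 999)}"

def sample_judgment(model, x, y_a, y_b):  # returns (reasoning trace, verdict in {"A", "B"})
    return "step-by-step reasoning ...", random.choice(["A", "B"])

def finetune(model, examples):            # fine-tune the judge on retained traces
    return model
# -----------------------------------------------------------------------------

def self_taught_iteration(model, instructions, n_samples=4):
    retained = []
    for x in instructions:
        # Synthetic pair: the response to x is "chosen" (label A), the response
        # to the corrupted instruction is "rejected" (label B).
        y_chosen = respond(model, x)
        y_rejected = respond(model, corrupt_instruction(model, x))

        # Sample judgments; keep one whose verdict agrees with the synthetic label.
        for _ in range(n_samples):
            reasoning, verdict = sample_judgment(model, x, y_chosen, y_rejected)
            if verdict == "A":
                retained.append((x, y_chosen, y_rejected, reasoning, verdict))
                break

    return finetune(model, retained)

# model = self_taught_iteration(model, unlabeled_instructions)   # repeat for several iterations
```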


☑️ Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge (30 Jul 2024)

  • Background: Self-Rewarding Language Models
    • LLMs can improve by judging their own responses instead of relying on human labelers (AI feedback training) 
    • Key insight: repeatedly iterating this training improves instruction following, but improvement of the judge itself is not guaranteed, which is a drawback.
  • LLM acts as an actor, judge, and meta-judge 
    • Self-improvement process in which the model judges its own judgments and uses that feedback to refine its judging skill (see the iteration sketch after these bullets).

  • Surprisingly, this unsupervised approach improves the model’s ability to judge and follow instructions, as demonstrated by a win rate improvement of Llama-3-8B-Instruct from 22.9% to 39.4%
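A rough sketch of one Meta-Rewarding iteration under this reading of the setup: the same model plays actor (samples responses), judge (scores them, producing preference pairs for the actor), and meta-judge (compares two judgments of the same response, producing preference pairs for the judging skill itself); both kinds of pairs then feed a preference-optimization update (DPO in the paper). Every helper below is a hypothetical stub, not the paper's implementation.

```python
import random

# --- hypothetical stand-ins for role-specific prompting of the same LLM ------
def generate(model, x):                   # actor: sample a candidate response
    return f"response to {x!r} #{random.randint(0, 999)}"

def judge(model, x, y):                   # judge: return (reasoning, score 1-5)
    return f"judgment of {y!r}", random.randint(1, 5)

def meta_judge(model, x, y, j_a, j_b):    # meta-judge: pick the better judgment
    return (j_a, j_b) if random.random() < 0.5 else (j_b, j_a)

def dpo_update(model, actor_pairs, judge_pairs):  # preference-optimization step
    return model
# -----------------------------------------------------------------------------

def meta_rewarding_iteration(model, prompts, n_responses=4):
    actor_pairs, judge_pairs = [], []
    for x in prompts:
        # Actor role: sample several candidate responses.
        responses = [generate(model, x) for _ in range(n_responses)]

        # Judge role: score each response; best vs. worst becomes an actor pair.
        judgments = [judge(model, x, y) for y in responses]
        scores = [s for _, s in judgments]
        best = responses[scores.index(max(scores))]
        worst = responses[scores.index(min(scores))]
        actor_pairs.append((x, best, worst))

        # Meta-judge role: compare two independent judgments of the same response;
        # the preferred judgment becomes training signal for the judging skill.
        y0 = responses[0]
        j_a, _ = judge(model, x, y0)
        j_b, _ = judge(model, x, y0)
        winner, loser = meta_judge(model, x, y0, j_a, j_b)
        judge_pairs.append((x, winner, loser))

    return dpo_update(model, actor_pairs, judge_pairs)

# model = meta_rewarding_iteration(model, unlabeled_prompts)   # one self-improvement round
```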