1. LLAMA 2
2. Foundation Model
- 2.1 Foundation Model - pretraining methodology
3. Fine-Tuned Chat Model
- 3.1 fine-tuning methodology
4. Results
Reference

1. LLAMA 2

상업적 용도로 사용할 수 있는 무료LLM
Pretrained(Foundation) model과 Fine-Tuned For Chat use cases model 공개
7B, 13B, 70B version

2. Foundation Model

auto-regressive transformer architectur수정하여 사용
robust data cleaning
Total Training Token 40% 증가 (1.4T tokens → 2T tokens)
Context Length 2배 증가 (2K → 4K)
추론 성능 향상을 위한 GQA (Grouped-query attention) 적용

2.1 Foundation Model - pretraining methodology

1) Pretraining Data

Meta의 제품 또는 서비스의 데이터를 포함하지 않은 공개적으로 사용 가능한 소스를 학습 corpus 추가
개인 정보가 포함된 특정 사이트들의 데이터 제거
2 trillion tokens 학습
최대한 사실적인 출처를 up-sampling함으로서 hallucinations을 줄임

2) Training Details

Standard transformer architecture의 부분을 수정하여 사용
- Pre-normalization using RMSNorm
- SwiGLU activation function
- Rotary positional embeddings
LLAMA1과의 주된 차이점:
- Context length 증가 (2048 tokens → 4096 tokens)
- 추론 성능 향상을 위한 GQA (Grouped-Query Attention) 적용
Tokenizer
- bytepair encoding (BPE) algorithm 사용
- total vocabulary size는 32k tokens

2-1) Training Details : Context length 조정 결과

더 긴 context window는 더 많은 정보 처리를 가능하게 한다:
- longer histories in chat applications
- various summarization tasks
- understanding longer documents
아래 두 표는 각각 context length만 다른 모델이 long-context benchmarks와 general tasks를 수행한 결과이다.

2-2) Training Details : GQA (Grouped-Query Attention) 적용 결과

MHA:
- 일반적인 autoregressive decoding 방식
- sequence 내 이전 token의 K와 V쌍을 caching.
- 이 방식은 attention 연산 속도 증가에 기여한다.
- context window나 batch size가 증가할 경우 K, V와 관련된 메모리 비용이 크게 증가한다는 단점이 있다. (메모리 문제 해결을 위해 제안된 것이 MQA이다.)
MQA: 성능 저하와 학습이 불안정하다는 한계를 지니는데 이를 해결하기 위해 제안된 방법론이 GQA이다.
GQA: H개 존재하던 K, V head를 G개의 그룹으로 줄이는 것이다. (즉, G=1일 경우 MQA가 되는 것이기 때문에 MQA와 MHA를 적절히 섞은 방법론이라고 볼 수 있다.)
아래 표는 동일한 모델에서 Attention 방식만 바꾸어 실험한 MHA, MQA, GQA 비교 결과이다.

3. Fine-Tuned Chat Model

대화 use-cases에 최적화된 fine-tuned version
self-supervised learning으로 학습된 LLAMA2를 fine-tuning하여 chat version을 만듦

3.1 fine-tuning methodology

Foundation Model LLAMA2를 SFT(Supervised Fine-Tuning)하여 LLAMA2-Chat을 생성
SFT를 통해 만들어진 LLAMA2-Chat에 RLHF(Reinforcement Learning from Human Feedback)를 수행하여 반복적으로 튜닝

1) Supervised Fine-Tuning(SFT)

LLAMA2-Chat 개발진들은 처음에 publicly available instruction tuning data를 사용하였는데 데이터 품질 문제로 인해 직접 FT data를 수집했다.
27,540개의 high-quality SFT data를 수집하여 학습에 적용한 결과, 비교적 적은 양의 고품질 데이터가 다량의 저품질 데이터보다 모델 성능 향상에 기여함을 확인하였다.
수집한 SFT data는 아래 샘플과 같이 Prompt와 Response로 구성된다.

2) Reinforcement Learning with Human Feedback(RLHF) : Human Preference Data Collection

Reward Modeling을 위해 인간이 선호하는 데이터를 수집한다.(100만개의 Meta reward modeling data 수집)
Annotation을 위해 hyperparameter중 temperature만 변경하여 만든 두 모델이 사용되었다.
annotator는 선호도를 significantly better, better, slightly better, or negligibly better/unsure로 구분하여 labeling한다.
annotation 작업 시 helpfulness(유용성)와 safety(안정성)에 초점을 맞추어 각각 따로 labeling을 수행한다.
- helpfulness: 모델의 응답이 사용자의 요청을 얼마나 잘 만족시키는가
- safety: 연구진들이 설정한 safety guideline을 잘 지키는가
Human annotation은 매주 batch 단위로 수집되었다. 수집 데이터가 쌓일수록 Reward Modeling의 성능이 개선되어 점점 더 나은 LLAMA2-Chat이 더 나은 방향으로 학습할 수 있게 된다.

3) Reinforcement Learning with Human Feedback(RLHF) : Reward Modeling

유용성과 안정성은 Bai et al., 2022a연구 결과, 서로 trade-off 관계에 있다는 사실이 알려져 단일 보상 모델을 유용성 보상모델(Helpfulness RM)과 안전성 보상모델(Safety RM)로 분리하여 학습과 최적화를 진행
Reward Model은 Aligned Model의 생성문의 Safety, Helpfulness에 대해 각각 binary classification task를 수행하도록 학습
보상 모델의 정확도는 LLAMA2-Chat의 최종 성능으로 이어지는 중요한 지표이다. 따라서 보상 모델의 개선이 곧 LLAMA2-Chat의 개선이다.

4) Reinforcement Learning with Human Feedback(RLHF) : Iterative Fine-Tuning

기존의 RLHF 훈련 파이프라인은 선형적 진행 방식을 띄고 있어 SFT와 Reward Model간 상호 보완이 안 되는 문제점이 있는데 이를 보완하기 위해 RM과 SFT 훈련을 반복적으로 수행하는 방법론을 제안함.
LLAMA2-Chat 개발 시 매주 수집되는 reward modeling data로 더 나은 Reward Model을 학습하여 RLHF-V1, . . . , RLHF-V5와 같이 여러 버전을 생성
Iterateve FT 도입으로 helpfulness와 safety가 좋아졌다.
RLHF fine-tuning에는 두 가지 주요 알고리즘에 사용되었다.
- Rejection Sampling : Aligned Model이 다수의 response를 생성하고 그 중에서 reward score가 가장 높은 response를 채택하는 과정
- PPO(Proximal Policy Optimization): Rejection Sampling과 Reward Model을 이용한 최종 모델 RLHF 훈련
Rejection Sampling:
- Aligned Model에서 K개의 sample을 생성한 후 Reward Model을 사용하여 최적의 후보를 선정하는 방법.
- 계속 동일한 promp를 활용하여 response를 생성하고 k개의 response 중 가장 높은 reward score를 가진 sample을 새로운 Gold Response로 채택
- LLAMA2-Chat(70B)에만 Rejection Sampling 수행
- 7B, 13B, 34B은 70B의 Rejection Sampling 데이터를 fine-tuning하여 큰 모델의 능력을 작은 모델로 distillation
PPO(Proximal Policy Optimization):
- RL에서 가장 많이 사용되는 알고리즘인 표준 PPO 사용
- Aligned Model 과 Reward Model을 이용하여 PPO 강화학습을 진행
- PPO 훈련시 Reward Model score는 Safety와 Helpfulness 중 하나로 학습되도록 설계

4) System Message for Multi-Turn Consistency : Gatt(ghost attention) Method

chatbot으로 사용하기 위한 추가적인 SFT 방법.
chatbot user의 초기 instruction을 지속적으로 반영하여 답변을 생성하기 위한 방법으로서 해당 방법론으로 multi-turn이 가능한 chatbot을 생성한다.
초기 instructin을 매 입력마다 강제적으로 삽입하는 방법은 입력 문장의 길이가 길어지게 되어 inference 비용이 커지는 문제점이 있다. 이를 해결하기 위해 LLAMA2에서는 Ghost Attention을 제안한다.
system message와 마지막 turn message를 제외한 모든 message의 loss를 0으로 설정해 초기 system message가 turn이 오래 지속되도 계속 유지되도록 함
중간 message들이 학습에는 반영되지 않지만 마지막 message를 학습할 때 참고는 되어서 ghost attention이라고 부름

4. Results

4.1 Pretrained Model

모델 성능 비교는 open-source model과 closed-source model에 대해 각각 수행.
성능 평가에는 standard academic benchmarks 사용.

대부분의 task에서 가장 좋은 성능을 보인다.
가장 많은 데이터로 pretrain이 진행되었기 때문으로 추측된다.

Llama1보다 성능 좋음
Code benchmark외 모든 benchmar에 대해 MPT보다 성능 좋음
모든 benchmark에 대해 Falcon보다 성능 좋음

LLAMA2–70B는 MMLU과 GSM8K에서 GPT-3.5에 근접한 성능을 보임
LLAMA2–70B는 모든 benchmark에서 PaLM(540B)과 비슷하거나 더 좋은 성능을 보임
LLAMA2–70B는 GPT-4와 PaLM-2-L에 비해 낮은 성능을 보임

4.2 Reward Model

safety, helpfulness 모두 좋은 성능을 보인다.
Reward Model 평가에는 StreamSHP-XL(FLAN-T5-xl기반), Open Assistant(DeBERTa V3 Large기반 보상모델), GPT-4(OpenAI’s API)가 사용됨
평가 결과, Meta RM이 전반적으로 높은 성능을 보이는 것으로 확인됨.

4.2 Chat Model

helpfulness와 safety에 대해 chatgpt보다도 좋은 성능을 보인다.
특히 safety가 높다.(34b의 safet가 안 좋아서 공개하지 않은 거라고 한다)

Reference

Hugo Touvron et al.”Llama 2: Open Foundation and Fine-Tuned Chat Models,” . 2023.

https://ai.meta.com/llama/

LLAMA 2: Open Foundation and Fine-Tuned Chat Models

10 Oct 2023 | NLP

1. LLAMA 2

2. Foundation Model

2.1 Foundation Model - pretraining methodology

1) Pretraining Data

2) Training Details

2-1) Training Details : Context length 조정 결과

2-2) Training Details : GQA (Grouped-Query Attention) 적용 결과

3. Fine-Tuned Chat Model

3.1 fine-tuning methodology

1) Supervised Fine-Tuning(SFT)

2) Reinforcement Learning with Human Feedback(RLHF) : Human Preference Data Collection

3) Reinforcement Learning with Human Feedback(RLHF) : Reward Modeling

4) Reinforcement Learning with Human Feedback(RLHF) : Iterative Fine-Tuning

4) System Message for Multi-Turn Consistency : Gatt(ghost attention) Method

4. Results

4.1 Pretrained Model

4.2 Reward Model

4.2 Chat Model

Reference