[NLP] Attention Mechanism

1. Overview (Limitations of Seq2Seq)

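- Compressing all previous information into a single fixed-size Context Vector causes information loss
→ When the Sequence Length grows much longer than a fixed-size Context Vector can represent,
information from earlier States gets squeezed out (the bottleneck of a fixed-size Context Vector)
so the model is likely to fail at translating the words that appeared first!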

- The Gradient Vanishing problem, a chronic weakness of RNNs, also occurs

- Is there a way to look back at past information once more during prediction?

2. Structure

1) Idea

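At every time step where the decoder predicts an output word, look at the encoder's input sentence once more!
However, instead of treating the whole sentence equally, focus (attention) more on the parts of the encoder input most relevant to the word being predicted at that step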

Keep in mind that the Seq2seq structure is not thrown away. The basic encoder + decoder form, the RNN layers, and so on are kept, and Attention is added on top.

2) Key-Query-Value

The attention function can be expressed in terms of Key, Query, and Value

Attention(Q, K, V) = Attention Value

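Query : the hidden state of the decoder cell at time step t
Keys : the hidden states of the encoder cells at all time steps
Values : the hidden states of the encoder cells at all time steps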

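The attention mechanism basically computes the similarity between the Query and each Key, turns those similarities into weights, and applies the weights to the Values, so that the encoder elements worth focusing on are reflected in the decoder!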

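A brief description of the attention mechanism is as follows

1. The attention function computes the similarity between a given Query and every Key
cf) Different similarity (score) functions give different kinds of attention

2. The similarities are passed through softmax to become probabilities, which serve as the weights for the Value mapped to each Key

3. The Weighted Sum of the Values with these weights gives the final Attention Value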
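
The three steps above can be sketched in a few lines of NumPy. This is only a minimal illustration with a dot product as the similarity; the function names (`softmax`, `attention`) and the toy numbers are made up for this example and do not come from any library.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def attention(query, keys, values):
    """query: (d,), keys: (n, d), values: (n, d_v).

    Returns the attention value (d_v,) and the attention weights (n,).
    """
    scores = keys @ query        # step 1: similarity of the Query with every Key
    weights = softmax(scores)    # step 2: softmax turns similarities into weights
    value = weights @ values     # step 3: weighted sum of the Values
    return value, weights

# Toy example (made-up numbers): 3 encoder hidden states of size 2.
# In seq2seq attention the Keys and the Values are the same encoder states.
encoder_states = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])
decoder_state = np.array([2.0, 0.0])

attention_value, weights = attention(decoder_state, encoder_states, encoder_states)
```

Here the first and third encoder states get the same, larger weight because their dot product with the Query is the same, while the second state (orthogonal to the Query) gets the smallest weight.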

3) Dot-Product Attention

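Attention comes in many variants depending on the attention function. Here I describe dot-product attention, which uses the dot product for scoring!
(The attention variants used with Seq2seq have almost the same form; only the score function changes)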
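
To make "only the score function changes" concrete, here is a rough sketch of a few commonly cited score functions (dot, scaled dot, general, concat). The parameter names `W_a`, `W_c`, `v_a` are hypothetical stand-ins for learned weights and are just randomly initialized here; the rest of this post only uses the plain dot version.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
s_t = rng.normal(size=d)   # decoder hidden state (Query)
h_i = rng.normal(size=d)   # one encoder hidden state (Key)

# Hypothetical learned parameters, randomly initialized for illustration only
W_a = rng.normal(size=(d, d))
W_c = rng.normal(size=(d, 2 * d))
v_a = rng.normal(size=d)

def score_dot(s, h):
    """Plain dot product."""
    return s @ h

def score_scaled_dot(s, h):
    """Dot product divided by sqrt(dimension)."""
    return (s @ h) / np.sqrt(len(s))

def score_general(s, h, W):
    """Bilinear form s^T W h."""
    return s @ W @ h

def score_concat(s, h, W, v):
    """Small feed-forward net on the concatenation [s; h]."""
    return v @ np.tanh(W @ np.concatenate([s, h]))

scores = {
    "dot": score_dot(s_t, h_i),
    "scaled_dot": score_scaled_dot(s_t, h_i),
    "general": score_general(s_t, h_i, W_a),
    "concat": score_concat(s_t, h_i, W_c, v_a),
}
```

Whichever function is picked, each call returns a single scalar score for one (Query, Key) pair, and everything after the scoring step (softmax, weighted sum) stays the same.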

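Consider the situation where the decoder's third LSTM cell wants to look back at all of the encoder's input information in order to predict its output word.
In a plain seq2seq decoder, the inputs needed at the current time step t are, by the nature of recurrent networks, the hidden state from time step t-1 and the output word from time step t-1!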

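With the attention mechanism, think of the Attention Value as one more thing that is needed here.
To compute the Attention Value we need Attention scores, and the function used to compute a score is the attention function. Dot-product Attention uses the simplest form for this function: a dot product!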

Each score expresses how similar one encoder hidden state is to the decoder's hidden state at the current time step, St

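Then the Attention Scores computed against all encoder hidden states are collected into a vector e, and applying the Softmax function to it yields a probability distribution called the attention distribution (its elements sum to 1)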

Each element of this distribution acts as a weight indicating how relevant the corresponding encoder hidden state is to the current decoder step

The final Attention Value is the weighted sum of the encoder hidden states using these weights!
Because it contains information from all of the encoder's time steps, it is also called the Context Vector
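
Written out as equations, the score, attention distribution, and Attention Value described above look roughly like this. The symbols h_i (encoder hidden state at step i), N (number of encoder steps), \alpha^t (attention distribution), and a_t (Attention Value) are notation assumed here for illustration; only St (written s_t) comes from the text.

```latex
\begin{aligned}
e^{t} &= \left[\, s_t^{\top} h_1,\; s_t^{\top} h_2,\; \dots,\; s_t^{\top} h_N \,\right] \\
\alpha^{t} &= \mathrm{softmax}\!\left(e^{t}\right),
\qquad \alpha^{t}_i = \frac{\exp\!\left(e^{t}_i\right)}{\sum_{j=1}^{N} \exp\!\left(e^{t}_j\right)} \\
a_t &= \sum_{i=1}^{N} \alpha^{t}_i \, h_i
\end{aligned}
```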

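The Attention Value is then concatenated with the decoder's current hidden state into a single vector, and this vector is passed through a dense layer whose result is used as the input to the decoder's output layer! Since the prediction can now make good use of the earlier information, it gets more accurate ~_~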

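Feeding this into the output layer and applying softmax again gives the probability of each candidate next word, and the word with the highest probability is predicted as the next word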
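
The concat → dense layer → softmax → pick-the-best-word step can be sketched for a single decoder time step as follows. The sizes and the parameters `W_c`, `b_c`, `W_y`, `b_y` are hypothetical and randomly initialized, so the predicted word id is meaningless; the point is only the order of operations.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, vocab = 4, 6                 # made-up sizes

a_t = rng.normal(size=hidden)        # Attention Value (Context Vector) at step t
s_t = rng.normal(size=hidden)        # decoder hidden state at step t

# Hypothetical trainable parameters, randomly initialized for illustration
W_c = rng.normal(size=(hidden, 2 * hidden))
b_c = np.zeros(hidden)
W_y = rng.normal(size=(vocab, hidden))
b_y = np.zeros(vocab)

v_t = np.concatenate([a_t, s_t])     # concat Attention Value and hidden state
s_tilde = np.tanh(W_c @ v_t + b_c)   # dense layer with tanh

logits = W_y @ s_tilde + b_y         # output layer
probs = np.exp(logits - logits.max())
probs /= probs.sum()                 # softmax: probability of each word

next_word_id = int(np.argmax(probs)) # most probable word is the prediction
```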

4) Code

from tensorflow.keras.models import Model
from tensorflow.keras.layers import (Input, LSTM, Dense, Embedding, Dropout,
                                     Activation, concatenate, dot)

# eng_max_length, fr_max_length, num_encoder_tokens and num_decoder_tokens
# are assumed to be defined by the preprocessing step (not shown here)


################################ Encoder #################################
encoder_inputs = Input(shape=(eng_max_length,))
en_x = Embedding(num_encoder_tokens + 1, 256)(encoder_inputs)  # larger embedding dimension
en_x = Dropout(0.5)(en_x)   # dropout added
encoder_outputs, state_h, state_c = LSTM(256, return_sequences=True, return_state=True)(en_x)
encoder_states = [state_h, state_c]

################################ Decoder #################################
decoder_inputs = Input(shape=(fr_max_length,))
dex = Embedding(num_decoder_tokens + 1, 256)(decoder_inputs)
decoder = Dropout(0.5)(dex)
decoder = LSTM(256, return_sequences=True)(decoder, initial_state=encoder_states)   # initialized with the encoder's final states!!


################################ Attention Layer #################################
# Project the encoder outputs (Keys) and decoder outputs (Queries) before scoring
t = Dense(5000, activation='tanh')(encoder_outputs)
t1 = Dense(5000, activation='tanh')(decoder)

# dot product attention (how the similarity is computed)
# > each score expresses how similar one encoder hidden state is to the decoder hidden state St
attention = dot([t1, t], axes=[2, 2])    # attention scores (Query-Key similarity): (batch, fr_max_length, eng_max_length)

attention = Dense(eng_max_length, activation='tanh')(attention)  # extra learned layer on the scores (not part of plain dot-product attention)
attention = Activation('softmax')(attention)  # softmax over encoder positions turns the scores into attention weights

context = dot([attention, encoder_outputs], axes=[2, 1])  # attention value: weighted sum of the encoder outputs (Values)

decoder_combined_context = concatenate([context, decoder])   # concat the attention value with the decoder hidden state into one vector
decoder_combined_context = Dense(2000, activation='tanh')(decoder_combined_context)

output = Dense(num_decoder_tokens + 1, activation="softmax")(decoder_combined_context)    # output layer (probability of each word)

model3 = Model(inputs=[encoder_inputs, decoder_inputs], outputs=[output])
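
To see what shapes the two `dot` calls in the attention layer produce, the same contractions can be mimicked in NumPy with `einsum`. This is my reading of how `dot(..., axes=[2, 2])` and `dot(..., axes=[2, 1])` behave on batched 3-D tensors, not something verified against Keras in this post; the sizes are made up, and the extra `Dense` layers of the model are left out.

```python
import numpy as np

batch, fr_len, eng_len, units = 2, 5, 7, 3   # made-up sizes
rng = np.random.default_rng(0)

queries = rng.normal(size=(batch, fr_len, units))   # stands in for t1 (decoder side)
keys = rng.normal(size=(batch, eng_len, units))     # stands in for t (encoder side)
values = rng.normal(size=(batch, eng_len, units))   # stands in for encoder_outputs

# dot([t1, t], axes=[2, 2]): contract the feature axis of both tensors
scores = np.einsum('bqd,bkd->bqk', queries, keys)   # (batch, fr_len, eng_len)

# softmax over the last axis, i.e. over encoder positions
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# dot([attention, encoder_outputs], axes=[2, 1]): weighted sum of the Values
context = np.einsum('bqk,bkd->bqd', weights, values)  # (batch, fr_len, units)
```

Each decoder position ends up with its own distribution over the encoder positions and its own context vector, which is why `context` can be concatenated with the decoder outputs along the last axis.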
