xlnet, transformer xl attention score function problem #295

wonjunchoi-arc opened this issue Nov 7, 2023 · 0 comments

The function tf.einsum("ibnd,jbnd->ijbn", head_q, head_k), used to obtain the attention score in XLNet / Transformer-XL, does not seem to compute the correlation between all pairs of words. Can anyone explain how this calculation works? For example, if you have a 2x2 tensor like [[i, am], [a, boy]], it looks to me as if "i" is only scored against "i" and "a", "am" only against "am" and "boy", "a" only against "i" and "a", and "boy" only against "am" and "boy". Please help me.
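To make the question concrete, here is a minimal sketch (toy shapes I picked myself, not taken from the library) of what that einsum produces when the tensors use the sequence-first layout [qlen, bsz, n_head, d_head]. The attention code I am asking about is quoted below it.

import tensorflow as tf

# Toy shapes, sequence-first as in Transformer-XL:
#   head_q: [qlen, bsz, n_head, d_head], head_k: [klen, bsz, n_head, d_head]
qlen, klen, bsz, n_head, d_head = 2, 3, 2, 1, 4
head_q = tf.random.normal((qlen, bsz, n_head, d_head))
head_k = tf.random.normal((klen, bsz, n_head, d_head))

# Score of every query position i against every key position j,
# within the same batch element b and head n.
scores = tf.einsum("ibnd,jbnd->ijbn", head_q, head_k)
print(scores.shape)  # (2, 3, 2, 1) -> [qlen, klen, bsz, n_head]

# Manual check for batch element 0, head 0: all qlen * klen pairs appear.
manual = [[float(tf.reduce_sum(head_q[i, 0, 0] * head_k[j, 0, 0]))
           for j in range(klen)] for i in range(qlen)]
print(tf.reduce_max(tf.abs(tf.constant(manual) - scores[:, :, 0, 0])).numpy())  # ~0.0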

def call(self, w, r, attn_mask, mems, head_mask, output_attentions, training=False):
        qlen, rlen, bsz = shape_list(w)[0], shape_list(r)[0], shape_list(w)[1]

        if mems is not None:
            cat = tf.concat([mems, w], 0)
            if self.pre_lnorm:
                w_heads = self.qkv_net(self.layer_norm(cat))
            else:
                w_heads = self.qkv_net(cat)
            r_head_k = self.r_net(r)

            w_head_q, w_head_k, w_head_v = tf.split(w_heads, 3, axis=-1)
            w_head_q = w_head_q[-qlen:]
        else:
            if self.pre_lnorm:
                w_heads = self.qkv_net(self.layer_norm(w))
            else:
                w_heads = self.qkv_net(w)
            r_head_k = self.r_net(r)

            w_head_q, w_head_k, w_head_v = tf.split(w_heads, 3, axis=-1)

        klen = shape_list(w_head_k)[0]

        w_head_q = tf.reshape(w_head_q, (qlen, bsz, self.n_head, self.d_head))  # qlen x bsz x n_head x d_head
        w_head_k = tf.reshape(w_head_k, (klen, bsz, self.n_head, self.d_head))  # klen x bsz x n_head x d_head
        w_head_v = tf.reshape(w_head_v, (klen, bsz, self.n_head, self.d_head))  # klen x bsz x n_head x d_head

        r_head_k = tf.reshape(r_head_k, (rlen, self.n_head, self.d_head))  # rlen x n_head x d_head

        # compute attention score
        rw_head_q = w_head_q + self.r_w_bias  # qlen x bsz x n_head x d_head
        AC = tf.einsum("ibnd,jbnd->ijbn", rw_head_q, w_head_k)  # qlen x klen x bsz x n_head

        rr_head_q = w_head_q + self.r_r_bias
        BD = tf.einsum("ibnd,jnd->ijbn", rr_head_q, r_head_k)  # qlen x klen x bsz x n_head
        BD = self._rel_shift(BD)

attention_score = np.einsum("ijkl,ijml->ijkm", Q, K) / np.sqrt(hidden_size)  # [batch_size, num_head, sequence_length, sequence_length]

As above, the standard formula computes the attention score between every word and every other word. But tf.einsum("ibnd,jbnd->ijbn", rw_head_q, w_head_k) does not seem to compute the attention score between all the words in the way described above, so I would like to understand this part.
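For comparison, here is a small self-contained check (toy shapes chosen by me, not taken from the library code) suggesting that the two einsum patterns compute the same all-pairs scores and differ only in axis order:

import numpy as np

bsz, n_head, seq, d_head = 2, 3, 4, 8
rng = np.random.default_rng(0)

# Batch-first layout, as in the "standard" formula:
#   Q, K: [batch_size, num_head, sequence_length, d_head]
Q = rng.normal(size=(bsz, n_head, seq, d_head))
K = rng.normal(size=(bsz, n_head, seq, d_head))
scores_batch_first = np.einsum("ijkl,ijml->ijkm", Q, K)  # [bsz, n_head, qlen, klen]

# Sequence-first layout, as in Transformer-XL:
#   q, k: [seq, bsz, n_head, d_head]
q = Q.transpose(2, 0, 1, 3)
k = K.transpose(2, 0, 1, 3)
scores_seq_first = np.einsum("ibnd,jbnd->ijbn", q, k)  # [qlen, klen, bsz, n_head]

# Same numbers, only the axis order differs.
print(np.allclose(scores_batch_first, scores_seq_first.transpose(2, 3, 0, 1)))  # True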
