Iterative DPO #6

Sanqiang · 2024-01-25T21:56:28Z

I think the authors of the self-rewarding LLM paper didn't use standard DPO but Interactive DPO, which is from another of their papers: https://arxiv.org/pdf/2312.16682.pdf.

lucidrains · 2024-01-25T23:25:36Z

Maintainer

is this the section you are looking at? they say it is similar to the Cringe paper, but then go on to outline what they actually did

3 replies

lucidrains Jan 25, 2024
Maintainer

does the following code look in line with what is described in the screenshot above?

lucidrains Jan 25, 2024
Maintainer

@Sanqiang you mean "Iterative", not "Interactive"?

Minami-su Jan 26, 2024

Your code is correct.

In fact, the authors are suggesting that this self-rewarding method is similar to PCO (Pairwise Cringe Optimization), as described in this paper: https://arxiv.org/pdf/2312.16682.pdf.
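
For anyone reading later, here is a rough sketch of what "iterative" means in this thread, as I understand the paper: the loss is ordinary DPO, but the preference pairs are regenerated by the current model each round, and the next model is trained starting from it. This is not the repo's code. `iterative_dpo`, `generate_pairs`, and `train_dpo` are hypothetical names made up for illustration, and the loss takes plain floats (summed sequence log-probs) rather than tensors.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Inputs are sequence log-probabilities under the policy and the frozen
    reference model. beta=0.1 is an illustrative default, not a recommendation.
    """
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(logits)), computed stably as softplus(-logits)
    return math.log1p(math.exp(-abs(logits))) + max(-logits, 0.0)


def iterative_dpo(model, prompts, num_iterations, generate_pairs, train_dpo):
    """Hypothetical outer loop for iterative DPO.

    generate_pairs(model, prompts) -> list of (prompt, chosen, rejected)
    train_dpo(policy_init, reference, pairs) -> newly trained model
    Both callables are assumptions supplied by the caller.
    """
    for _ in range(num_iterations):
        reference = model  # the previous iterate acts as the reference
        pairs = generate_pairs(reference, prompts)  # fresh pairs every round
        model = train_dpo(policy_init=reference, reference=reference, pairs=pairs)
    return model
```

With `num_iterations=1` and a fixed, externally supplied set of pairs this reduces to standard DPO, which is the distinction the original question was getting at.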
