Update 2024.09.impact.md
okhat authored Sep 4, 2024
1 parent 75fb701 commit 102c582
Showing 1 changed file with 2 additions and 2 deletions.
2024.09.impact.md
@@ -48,7 +48,7 @@ Instead, think at least two steps ahead. Identify the path most people are likely

What might this look like in practice? Let's revisit the ColBERT case study. The obvious way to build efficient retrievers with BERT is to encode each document into a single vector. Interestingly, only limited IR work had done that by late 2019. For example, the best-cited work in this category (DPR) only had its first preprint released in April 2020. Given this, you might think that the right thing to do in 2019 was to build a great single-vector IR model via BERT. In contrast, thinking just two steps ahead would be to ask: everyone will be building single-vector methods sooner or later, so where will this single-vector approach get fundamentally stuck? And indeed, that question led to the [late interaction](https://x.com/lateinteraction/status/1736804963760976092) paradigm and [widely-used models](https://huggingface.co/colbert-ir/colbertv2.0).
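
To make that bottleneck concrete, here is a toy sketch of the scoring difference between the two paradigms (an illustration, not the actual ColBERT code; it assumes an encoder has already produced unit-normalized embedding matrices):

```python
import numpy as np

def single_vector_score(q_vec: np.ndarray, d_vec: np.ndarray) -> float:
    # Single-vector retrieval: the whole document is pooled into ONE vector,
    # so every detail that might matter later must survive this compression.
    return float(q_vec @ d_vec)

def late_interaction_score(q_toks: np.ndarray, d_toks: np.ndarray) -> float:
    # Late interaction (ColBERT-style MaxSim): keep one embedding per token.
    # Each query token matches its best document token; sum those maxima.
    sim = q_toks @ d_toks.T  # shape: (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())
```

The "fundamentally stuck" point is the single vector's information bottleneck; late interaction sidesteps it while staying efficient, since document token embeddings can still be precomputed and indexed offline.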

As another example, we could use [DSPy](https://github.com/stanfordnlp/dspy). In February 2022, as prompting was becoming decently powerful, it was clear that people would want to do retrieval-based QA with prompting, not with fine-tuning as they used to. A natural thing to do would have been to build a method for just that. Thinking two steps ahead would be to ask: where will such approaches get stuck? Ultimately, retrieve-then-generate (or "RAG") approaches are perhaps the simplest possible pipeline involving LMs. For the same reasons people would be interested in RAG, it was clear that they would increasingly be interested in (i) expressing more complex modular compositions and (ii) figuring out how the resulting sophisticated pipelines should be supervised or optimized, via automated prompting or finetuning of the underlying LMs. That's DSPy.
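
To give a flavor of what that means, here is a minimal sketch along the lines of DSPy's documented retrieve-then-generate example (exact APIs vary across DSPy versions, and a language model and retriever must first be configured via `dspy.settings.configure`):

```python
import dspy

class RAG(dspy.Module):
    """A retrieve-then-generate pipeline declared as a composable module."""

    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)
```

Because the pipeline is declared in terms of modules and signatures rather than hand-written prompts, an optimizer (e.g., `dspy.BootstrapFewShot`) can later tune the prompts or the underlying LM weights against a metric, which is exactly the "supervised or optimized" part above.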

The second half of this guideline is "iterate fast". This was perhaps the very first research advice I received from my advisor Matei Zaharia, in week one of my PhD: by identifying a version of the problem you can iterate on quickly and receive feedback on (e.g., latency or validation scores), you greatly improve your chances of solving hard problems. This is especially important if you will be thinking two steps ahead, which is already hard and uncertain enough.

@@ -59,7 +59,7 @@ At this point, you’ve identified a good problem and then iterated until you di

A common first step is to release the paper as a preprint on arXiv and then release a “thread” (or similar) announcing the paper’s release. When you do this, make sure your thread begins with a concrete, substantial, and accessible claim. The goal isn’t to tell people that you released a paper — that doesn't carry inherent value. The goal is to communicate your key argument in a direct and vulnerable but engaging way in the form of a specific statement that people can agree or disagree with. (Yes, I know this is hard but it is necessary.)

Perhaps more importantly, this whole process does not end after the first "release". It starts with the release. Given that you're now investing in projects, not just papers, your ideas _and your scientific communication_ persist year-round, well beyond isolated paper releases. Let me illustrate why this matters. When I help grad students “tweet” about their work, it's not uncommon that their initial post doesn’t get as much traction as hoped. Students typically assume this validates their fear of posting about their research and take it as yet another sign that they should just move on to the next paper. Obviously, this is not correct.

A lot of personal experience, second-hand experience, and observation suggests that this is a place where persistence is massively helpful (and, by the way, exceedingly rare). With few exceptions, gaining traction for good ideas requires telling people the key things many times in different contexts — and evolving your thoughts and how you communicate them — either until the community can absorb these ideas or until the field reaches the right stage of development to appreciate them.
