Commit

rebuild pubs
wwcohen committed Mar 18, 2024
1 parent 9ef214c commit ac02f99
Showing 7 changed files with 39 additions and 214 deletions.
218 changes: 20 additions & 198 deletions index.html
@@ -39,9 +39,11 @@ <h3 class="title">Principal Scientist, <a href="http://ai.google.com">Google AI<

<h3 class="sec"><a name="bio"></a>Biography</h3 class="sec">

William Cohen is a Principal Scientist at Google, and is based in
Google's Pittsburgh office. He received his bachelor's degree in
Computer Science from
William Cohen is a Visiting Professor at Carnegie Mellon University in
the <a href="http://www.ml.cmu.edu">Machine Learning Department</a>.
He also holds a position as a Principal Scientist at Google, where he
worked full-time between May 2018 and March 2024. He received his
bachelor's degree in Computer Science from
<a href="http://www.duke.edu">Duke University</a> in 1984, and a PhD
in Computer Science from <a href="http://www.rutgers.edu">Rutgers
University</a> in 1990. From 1990 to 2000 Dr. Cohen worked at
@@ -54,10 +56,7 @@ <h3 class="sec"><a name="bio"></a>Biography</h3 class="sec">
the <a href="http://www.ml.cmu.edu">Machine Learning Department</a>,
with a joint appointment in
the <a href="http://www.lti.cs.cmu.edu">Language Technology
Institute</a>, as an Associate Research Professor, a Research
Professor, and a Professor. Dr. Cohen also was the Director of the
Undergraduate Minor in Machine Learning at CMU and co-Director of the
Master of Science in ML Program.
Institute</a>.

<p>
Dr. Cohen is a past president of
@@ -102,17 +101,13 @@ <h3 class="sec"><a name="bio"></a>Biography</h3 class="sec">
Award</a> for the most influential paper of the ISWC-2013 conference.

<p>

Dr. Cohen's research interests include question answering, machine
learning for NLP tasks, and neuro-symbolic reasoning. He has a
long-standing interest in statistical relational learning and learning
models, or learning from data, that display non-trivial structure. He
Dr. Cohen's research interests include question answering,
machine learning for NLP tasks, and neuro-symbolic reasoning, and he
has a long-standing interest in statistical relational learning. He
holds seven patents related to learning, discovery, information
retrieval, and data integration, and is the author of more than 200
retrieval, and data integration, and is the author of more than 300
publications.

<p>Dr. Cohen is also a Consulting Professor at the School of Computer
Science at Carnegie Mellon University.
<!-- <h3 class="sec"><a name="cv">Curriculum vita</cv></h3 class="sec">
<ul>
@@ -125,16 +120,23 @@ <h3 class="sec"><a name="announce"></a>Announcements and FAQs</h3 class="sec">

<ul>

<li>March 2024: As you can see from my updated bio above, I have
returned to CMU's ML department full-time (although I still have a
20% involvement at Google, so that email will work!). I'm really
looking forward to re-engaging with my friends and colleagues at CMU.

<li>Nov 2023: I'm honored to report that the paper <a href="https://link.springer.com/chapter/10.1007/978-3-642-41335-3_34">Knowledge
Graph Identification</a>written by Jay Pujara, Hui Miao, Lise Getoor and myself,
Graph Identification</a>, written by Jay Pujara, Hui Miao, Lise Getoor and myself,
won a <a href="https://iswc2023.semanticweb.org/awards/">10 year best paper award</a> at
the International Semantic Web Conference, 2023.
<li>Oct 2023: I will be visiting CMU's ML department on Tuesdays in Fall 2023.
<li>May 2023: I'm very honored to report that one of

<li>May 2023: I'm very honored to report that one of
the <a href="https://arxiv.org/abs/2209.12153">papers</a> I
co-authored at EACL 2023 (with Julian Eisenschlos, Jeremy Cole, and
Fangyu Liu) won an Outstanding Paper Award.

</ul>

<!--
<h3 class="sec"><a name="proj">Projects</a></h3 class="sec">
@@ -182,196 +184,16 @@ <h3 class="sec"><a name="teach"></a>Teaching</h3 class="teach">

<h3 class="sec"><a name="sw">Software and demos</a></h3 class="sec">

<!--
<b>Demos:</b>
<ul>
<li>
Measure twice, cut once - <a
href="http://www.cs.cmu.edu/~vitor/">Vitor</a> and <a
href="http://www.cs.cmu.edu/rbalasub">Ramnath</a> have developed a <a
href="http://www.cs.cmu.edu/~vitor/cutonce/cutOnce.html">Thunderbird
plugin</a> that implements <a
href="http://www.cs.cmu.edu/~wcohen/postscript/ecir2008.pdf">recipient
recommendation</a> and <a
href="http://www.cs.cmu.edu/~wcohen/postscript/sdm-2007-leak.pdf">leak
detection</a> for email. It modifies Thunderbird by adding an
additional pane that pops up after you send a message, giving you one
final chance to fix any errors in your recipient list. There's a
brief <a href="cutonce.pdf">writeup on how to use it,</a> but it's
pretty self-explanatory: just download it, open Thunderbird, and go to
the tools->addon menu to install. After you've installed it, you
train by opening your folder of "Sent" mail and pressing the "train"
button. (This took about an hour for my 9000+ old messages.)
<li>
<a href="http://www.cs.cmu.edu/~nmramesh/">Ramesh
Nallapati</a> has put together two nice demos of his <a
href="http://www.cs.cmu.edu/~wcohen/postscript/topic-tomography-submitted.pdf">multiscale topic tomography</a> topic-modeling technique, one
for articles from <a
href="http://www.cs.cmu.edu/~nmramesh/science_demo/multiscale_home.html">Science</a>,
and one with <a
href="http://www.cs.cmu.edu/~nmramesh/cancer_demo/multiscale_home.html">cancer-related
articles from PubMed</a>.
<li>
Here are two movies that demo SimStudent, a programming-by-demonstration
system for constructing cognitive tutors, built by <a href="http://www.cs.cmu.edu/~mazda/">Noboru Matsuda</a>.
<ul>
<li><a href="http://www.cs.cmu.edu/~mazda/CTAT/Video/Interactive/2x+3_5.mov">Interactive mode</a> (solves problems proactively, as way of posing queries)</li>
<li><a href="http://www.cs.cmu.edu/~mazda/CTAT/Video/Non-interactive/3x_9.mov">Non-interactive mode</a></li>
</ul>
</ul>
-->


<ul>
<li><a href="https://github.com/TeamCohen/TensorLog/wiki">TensorLog</a> is
a probabilistic first-order logic which is fully differentiable.
<li><a href="https://github.com/TeamCohen/ProPPR/wiki">ProPPR</a> is an older
"locally groundable" probabilistic first-order
logic.
<li><a href="https://github.com/TeamCohen/GuineaPig">Guinea Pig</a> is
a pure Python workflow language for Hadoop.

<p>


<li>Bhuwan Dhingra is
distributing <a href="https://github.com/bdhingra/ga-reader">an
updated version of the Gated Attention Reader</a> via Github. As of
Dec 2016 the GA Reader is obtaining state-of-the-art results on
several of the standard benchmarks for answering cloze questions.

<li>Here is <a href="http://www.cs.cmu.edu/afs/cs/Web/People/dmovshov/software.html">a comment-completion Plugin for Eclipse</a>, from Dana Movshovitz-Attias.
<li><a href="https://github.com/rbalasub/jigsaw.git">Ramnath Balasubramanyan's BlockLDA</a> code, as well as some of the other algorithms from his thesis, is available on GitHub.
<li>Code for <a href="http://www.cs.cmu.edu/~nlao/code/2010.pra.gz">Ni
Lao's PRA method</a> (described in
our <a href="http://www.cs.cmu.edu/~wcohen/postscript/ecml-2010-ni.pdf">ECML
paper</a>) is available.
<li>
<a href="http://www.cs.cmu.edu/~frank/">Frank Lin</a>'s home page contains
<ul>
<li>the <a href="http://www.cs.cmu.edu/~frank/code/icml2010-code.zip">code</a>
for power iteration clustering (the algorithm described in our
ICML-2010 paper) as well as
the <a href="http://www.cs.cmu.edu/~frank/data/icml2010-data.zip">datasets</a>
we used in the experiments.
<li>the <a href="http://www.cs.cmu.edu/~frank/code/asonam2010-code.zip">code</a>
for MultiRandomWalk (the semi-supervised learning algorithm described in our
ASONAM-2010 paper) as well as
the <a href="http://www.cs.cmu.edu/~frank/data/anonam2010-data.zip">datasets</a>
we used in those experiments.
</ul>

<p>


<li>
<a href="http://secondstring.sourceforge.net">SecondString</a> is
another open-source Java package, of approximate string matching
techniques.
<ul><li>SecondString includes a jar for part of an ancient version
of Minorthird. For those that are interested in <a href="radar.tgz">the source behind
the mysterious cls.jar</a>, here it is.
</ul>

<!---
<li><a href="slipper/">SLIPPER</a> and <a href="whirl/">WHIRL</a> are
now being distributed via Rutgers University. They are free for research
purposes.
--->

<li><a href="slipper-linux.tgz/">SLIPPER</a> is an old old
rule-learning system Yoram Singer and I developed. This code is
provided with absolutely no warranty, promise of support, or really,
any expectation that it will keep working. You are totally on your
own with this one, friend.

<h3 class="sec"><a name="data">Datasets</a></h3 class="sec">

The following datasets are available for anyone to use for research
purposes:
<ul>

<li>Zhilin Yang is
distributing <a href="http://kimi.ml.cmu.edu/qa_ssl/">the data from our
ACL-2017 paper on semi-supervised QA</a>.

<li>Lidong Bing has
distributed <a href="http://www.cs.cmu.edu/~lbing/#Datasets">two
datasets from our joint work</a>: the data used in our EMNLP 2015
paper, Improving Distant Supervision for Information Extraction Using
Label Propagation Through List, and also the dataset used in our AAAI
2016 paper, Distant IE by Bootstrapping Using Lists and Document
Structure. The <a href="http://curtis.ml.cmu.edu/gnat/biomed">data
extracted by this system can also be browsed</a>.


<li>Ni Lao has distributed the labeled data from our EMNLP 2010 paper,
Random Walk Inference and Learning in A Large Scale Knowledge Base,
both <a href="http://www.cs.cmu.edu/~nlao/data/publish.amt.labels.tar.gz">Turker-labeled
data</a>
and <a href="http://www.cs.cmu.edu/~nlao/data/publish.distant.supervision.tar.gz">NELL
pseudo labels</a>.

<li><a href="http://rtw.ml.cmu.edu/wk/coordterm/syntactic/">Coordinate
terms extracted from a MALT-parsed corpus with 230B sentences</a>,
produced by Malcolm Greaves. (Corpus is ClueWeb 2009, Wikipedia from
November 2011, Project Gutenberg, and Citeseer.)

<li><a href="CrowdComp_MTurkData.tar.gz">Data sets</a> for my paper
"Crowdsourced Comprehension: Predicting Prerequisite Structure in
Wikipedia" with Partha Talukdar from BEA-2012.

<li><a href="http://rtw.ml.cmu.edu/wk/WebSets/wsdm_2012_online/index.html">Collections
of HTML Tables, hyponyms, as well as extracted entity clusters and MLT
evaluations</a>, all associated with
<a href="http://www.cs.cmu.edu/afs/cs/Web/People/bbd/">Bhavana
Dalvi</a>'s paper
on <a href="postscript/wsdm-2012-bdd.pdf">WebSets</a> from WSDM-2012.

<li>The <a href="http://www.cs.cmu.edu/~frank/data/icml2010-data.zip">network
datasets</a> used in the experiments of our ICML-2010 paper
are on <a href="http://www.cs.cmu.edu/~frank/">Frank Lin</a>'s home page.

<li>
<a href="all-bibdata.tgz">100,000+ bibliography entries</a>, in the original BibTeX format, converted to an EndNote-like format, and in a featurized format, for experiments with matching (60M).

<li><a href="http://www.cs.cmu.edu/~vitor/codeAndData.html">617
messages from 20 Newsgroups, annotated for reply bodies and
signatures</a>, prepared by my former student <a
href="http://www.cs.cmu.edu/~vitor">Vitor Carvalho</a>

<li><a href="http://www.cs.cmu.edu/~einat/datasets.html">
Two subsets of the Enron data, annotated with person names</a>,
prepared by my student <a href="http://www.cs.cmu.edu/~einat">Einat
Minkov</a>.

<li><a href="http://www.cs.cmu.edu/~enron">Enron email dataset</a>
(400Mb, once you get there) contains 800,000+ emails from 150 users+
organized into 4700+ folders.


<li><a href="repository.tgz">A collection of various extraction datasets
in Minorthird format</a> (6Mb), including about 1000 Enron emails tagged
for person names and temporal expressions.

<li><a href="classify.tar.gz">classify.tar.gz</a> (0.4Mb) contains
nine problems in which the goal is to classify short entity names.
This data was used in <i>Joins that Generalize: Text Classification
Using WHIRL</i> (KDD-98).

<li><a href="ranking-data.tar.gz">ranking.tar.gz</a> (8Mb) contains the
data used for the meta-search experiments in my JAIR paper <a
href="http://www.jair.org/abstracts/cohen99a.html">Learning to Order
Things</a> (with Rob Schapire and Yoram Singer).

<li><a href="match.tar.gz">match.tar.gz</a> (0.7Mb) contains a suite of
<i>labeled</i> entity-name matching and clustering problems
(i.e. problems for which the correct matches/clusters are provided),
4 changes: 2 additions & 2 deletions pubgen/pubs.json
@@ -15,7 +15,7 @@
"title": "SEMQA: Semi-Extractive Multi-Source Question Answering",
"authors": "Tal Schuster, Adam D. Lelkes, Haitian Sun, Jai Gupta, Jonathan Berant, William W. Cohen, Donald Metzler",
"venues": "",
"year": "2023",
"year": "2024",
"topics": "nxR",
"url": "https://arxiv.org/abs/2311.04886",
"cite": "NAACL-2024",
@@ -28,7 +28,7 @@
"title": "MEMORY-VQ: Compression for Tractable Internet-Scale Memory",
"authors": "Yury Zemlyanskiy, Michiel de Jong, Luke Vilnis, Santiago Ontañón, William W. Cohen, Sumit Sanghai, Joshua Ainslie",
"venues": "",
"year": "2023",
"year": "2024",
"topics": "nxR",
"url": "https://arxiv.org/abs/2308.14903",
"cite": "NAACL-2024",
4 changes: 2 additions & 2 deletions pubs-R.html
@@ -3,9 +3,9 @@
</head>
<body><h3>William W. Cohen's Papers: Retrieval Augmented LMs</h3>
<ol>
<li>Tal Schuster, Adam D. Lelkes, Haitian Sun, Jai Gupta, Jonathan Berant, William W. Cohen, Donald Metzler (2023): <a href="https://arxiv.org/abs/2311.04886">SEMQA: Semi-Extractive Multi-Source Question Answering</a> in NAACL-2024.
<li>Tal Schuster, Adam D. Lelkes, Haitian Sun, Jai Gupta, Jonathan Berant, William W. Cohen, Donald Metzler (2024): <a href="https://arxiv.org/abs/2311.04886">SEMQA: Semi-Extractive Multi-Source Question Answering</a> in NAACL-2024.
</li>
<li>Yury Zemlyanskiy, Michiel de Jong, Luke Vilnis, Santiago Ontañón, William W. Cohen, Sumit Sanghai, Joshua Ainslie (2023): <a href="https://arxiv.org/abs/2308.14903">MEMORY-VQ: Compression for Tractable Internet-Scale Memory</a> in NAACL-2024.
<li>Yury Zemlyanskiy, Michiel de Jong, Luke Vilnis, Santiago Ontañón, William W. Cohen, Sumit Sanghai, Joshua Ainslie (2024): <a href="https://arxiv.org/abs/2308.14903">MEMORY-VQ: Compression for Tractable Internet-Scale Memory</a> in NAACL-2024.
</li>
<li>Haitian Sun, William W. Cohen, Ruslan Salakhutdinov (2023): <a href="https://arxiv.org/abs/2308.08661">Answering Ambiguous Questions with a Database of Questions, Answers, and Revisions</a> in progress.<br><ul><li><font size=-1>Following up the 'QA is the new KR' paper, we present a new collection of question-answer pairs automatically generated from Wikipedia which are more specific and ambiguous than generated questions used in prior work, and show that this can be used to answer ambiguous questions. On the challenging ASQA benchmark, which requires generating long-form answers that summarize the multiple answers to an ambiguous question, our method improves performance by 10-15%. The new question DB can also be used to improve diverse passage retrieval.</font></ul>
</li>
4 changes: 2 additions & 2 deletions pubs-n.html
@@ -5,9 +5,9 @@
<ol>
<li>Chung-Ching Chang, William W. Cohen, Yun-Hsuan Sung (2023): <a href="https://arxiv.org/abs/2311.10083">Characterizing Tradeoffs in Language Model Decoding with Informational Interpretations</a> in progress.
</li>
<li>Tal Schuster, Adam D. Lelkes, Haitian Sun, Jai Gupta, Jonathan Berant, William W. Cohen, Donald Metzler (2023): <a href="https://arxiv.org/abs/2311.04886">SEMQA: Semi-Extractive Multi-Source Question Answering</a> in NAACL-2024.
<li>Tal Schuster, Adam D. Lelkes, Haitian Sun, Jai Gupta, Jonathan Berant, William W. Cohen, Donald Metzler (2024): <a href="https://arxiv.org/abs/2311.04886">SEMQA: Semi-Extractive Multi-Source Question Answering</a> in NAACL-2024.
</li>
<li>Yury Zemlyanskiy, Michiel de Jong, Luke Vilnis, Santiago Ontañón, William W. Cohen, Sumit Sanghai, Joshua Ainslie (2023): <a href="https://arxiv.org/abs/2308.14903">MEMORY-VQ: Compression for Tractable Internet-Scale Memory</a> in NAACL-2024.
<li>Yury Zemlyanskiy, Michiel de Jong, Luke Vilnis, Santiago Ontañón, William W. Cohen, Sumit Sanghai, Joshua Ainslie (2024): <a href="https://arxiv.org/abs/2308.14903">MEMORY-VQ: Compression for Tractable Internet-Scale Memory</a> in NAACL-2024.
</li>
<li>Haitian Sun, William W. Cohen, Ruslan Salakhutdinov (2023): <a href="https://arxiv.org/abs/2308.08661">Answering Ambiguous Questions with a Database of Questions, Answers, and Revisions</a> in progress.<br><ul><li><font size=-1>Following up the 'QA is the new KR' paper, we present a new collection of question-answer pairs automatically generated from Wikipedia which are more specific and ambiguous than generated questions used in prior work, and show that this can be used to answer ambiguous questions. On the challenging ASQA benchmark, which requires generating long-form answers that summarize the multiple answers to an ambiguous question, our method improves performance by 10-15%. The new question DB can also be used to improve diverse passage retrieval.</font></ul>
</li>
8 changes: 4 additions & 4 deletions pubs-s.html
@@ -4,15 +4,15 @@
<body><h3>Selected and/or recent papers by William W. Cohen</h3>
<h3>Recent papers: 2024</h3>
<ol>
<li>Tal Schuster, Adam D. Lelkes, Haitian Sun, Jai Gupta, Jonathan Berant, William W. Cohen, Donald Metzler (2024): <a href="https://arxiv.org/abs/2311.04886">SEMQA: Semi-Extractive Multi-Source Question Answering</a> in NAACL-2024.
</li>
<li>Yury Zemlyanskiy, Michiel de Jong, Luke Vilnis, Santiago Ontañón, William W. Cohen, Sumit Sanghai, Joshua Ainslie (2024): <a href="https://arxiv.org/abs/2308.14903">MEMORY-VQ: Compression for Tractable Internet-Scale Memory</a> in NAACL-2024.
</li>
</ol>
<h3>Recent papers: 2023</h3>
<ol>
<li>Chung-Ching Chang, William W. Cohen, Yun-Hsuan Sung (2023): <a href="https://arxiv.org/abs/2311.10083">Characterizing Tradeoffs in Language Model Decoding with Informational Interpretations</a> in progress.
</li>
<li>Tal Schuster, Adam D. Lelkes, Haitian Sun, Jai Gupta, Jonathan Berant, William W. Cohen, Donald Metzler (2023): <a href="https://arxiv.org/abs/2311.04886">SEMQA: Semi-Extractive Multi-Source Question Answering</a> in NAACL-2024.
</li>
<li>Yury Zemlyanskiy, Michiel de Jong, Luke Vilnis, Santiago Ontañón, William W. Cohen, Sumit Sanghai, Joshua Ainslie (2023): <a href="https://arxiv.org/abs/2308.14903">MEMORY-VQ: Compression for Tractable Internet-Scale Memory</a> in NAACL-2024.
</li>
<li>Haitian Sun, William W. Cohen, Ruslan Salakhutdinov (2023): <a href="https://arxiv.org/abs/2308.08661">Answering Ambiguous Questions with a Database of Questions, Answers, and Revisions</a> in progress.<br><ul><li><font size=-1>Following up the 'QA is the new KR' paper, we present a new collection of question-answer pairs automatically generated from Wikipedia which are more specific and ambiguous than generated questions used in prior work, and show that this can be used to answer ambiguous questions. On the challenging ASQA benchmark, which requires generating long-form answers that summarize the multiple answers to an ambiguous question, our method improves performance by 10-15%. The new question DB can also be used to improve diverse passage retrieval.</font></ul>
</li>
<li>Michiel de Jong, Yury Zemlyanskiy, Nicholas FitzGerald, Sumit Sanghai, William W. Cohen, Joshua Ainslie (2023): <a href="https://arxiv.org/abs/2306.10231">GLIMMER: generalized late-interaction memory reranker</a> in progress.