diff --git a/CrowdComp_MTurkData.tar.gz b/CrowdComp_MTurkData.tar.gz new file mode 100644 index 0000000..21e2083 Binary files /dev/null and b/CrowdComp_MTurkData.tar.gz differ diff --git a/CutOncev0.0.2a.xpi b/CutOncev0.0.2a.xpi new file mode 100755 index 0000000..6fdcfec Binary files /dev/null and b/CutOncev0.0.2a.xpi differ diff --git a/FastEffectiveClustering-v2.pdf b/FastEffectiveClustering-v2.pdf new file mode 100755 index 0000000..15518d8 Binary files /dev/null and b/FastEffectiveClustering-v2.pdf differ diff --git a/FastEffectiveClustering-v2.ppt b/FastEffectiveClustering-v2.ppt new file mode 100755 index 0000000..c24b75a Binary files /dev/null and b/FastEffectiveClustering-v2.ppt differ diff --git a/FastEffectiveClustering.pdf b/FastEffectiveClustering.pdf new file mode 100755 index 0000000..d3ec82b Binary files /dev/null and b/FastEffectiveClustering.pdf differ diff --git a/FastEffectiveClustering.ppt b/FastEffectiveClustering.ppt new file mode 100755 index 0000000..3471570 Binary files /dev/null and b/FastEffectiveClustering.ppt differ diff --git a/GuideToBiology-pictures-color-release1.5.pdf b/GuideToBiology-pictures-color-release1.5.pdf new file mode 100755 index 0000000..f3a9b3b Binary files /dev/null and b/GuideToBiology-pictures-color-release1.5.pdf differ diff --git a/GuideToBiology-pictures-color-release1.5.ppt b/GuideToBiology-pictures-color-release1.5.ppt new file mode 100755 index 0000000..6ca93a8 Binary files /dev/null and b/GuideToBiology-pictures-color-release1.5.ppt differ diff --git a/GuideToBiology-sampleChapter-release1.4.pdf b/GuideToBiology-sampleChapter-release1.4.pdf new file mode 100755 index 0000000..75d5ac6 Binary files /dev/null and b/GuideToBiology-sampleChapter-release1.4.pdf differ diff --git a/IIWeb.ppt b/IIWeb.ppt new file mode 100755 index 0000000..3ccdf68 Binary files /dev/null and b/IIWeb.ppt differ diff --git a/MSM-2009.ppt b/MSM-2009.ppt new file mode 100755 index 0000000..61794f7 Binary files /dev/null and 
b/MSM-2009.ppt differ diff --git a/Matching-1.ppt b/Matching-1.ppt new file mode 100644 index 0000000..3c5271c Binary files /dev/null and b/Matching-1.ppt differ diff --git a/Matching-2.ppt b/Matching-2.ppt new file mode 100644 index 0000000..278d335 Binary files /dev/null and b/Matching-2.ppt differ diff --git a/Matching-3.ppt b/Matching-3.ppt new file mode 100755 index 0000000..ca6ecef Binary files /dev/null and b/Matching-3.ppt differ diff --git a/Shortcut to 10-802.lnk b/Shortcut to 10-802.lnk new file mode 100755 index 0000000..5a5ff0a Binary files /dev/null and b/Shortcut to 10-802.lnk differ diff --git a/SimStudent-wc.ppt b/SimStudent-wc.ppt new file mode 100755 index 0000000..3d52f55 Binary files /dev/null and b/SimStudent-wc.ppt differ diff --git a/SlifTextComponent.html b/SlifTextComponent.html new file mode 100755 index 0000000..6f5ced0 --- /dev/null +++ b/SlifTextComponent.html @@ -0,0 +1,200 @@ + + +How to use the SLIF Text Components + + + +

How to use the SLIF Text Components

+ +

Invocation and basic options

+ +

+The SLIF text components are distributed as a single large JAR file. To +run it, you will need a copy of Java. A typical invocation would be + +

+ +% java -cp slifTextComponents.jar -Xmx500M SlifTextComponent -labels DIR -saveAs FILE -use COMPONENT1,COMPONENT2,.... [OPTIONS] + + +

+where -Xmx500M allocates additional memory for the Java heap, and the remaining arguments are as follows: +

+ +The components available are: + + +

The Minorthird format for stand-off annotation

+ +The format for output is the one used by Minorthird. Specifically, the +output (in the default format) is a series of lines in one of these +formats: + +

+ +addToType FILE START LENGTH SPANTYPE
+setSpanProp FILE START LENGTH semantics LETTERS +
+ +

+where +

+ +
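The stand-off lines above can be consumed mechanically. The following is a minimal sketch of a parser; the function name, dictionary keys, and the sample file name "doc1.txt" are illustrative assumptions, not part of the SLIF distribution:

```python
# Hedged sketch: parse Minorthird stand-off annotation lines of the two
# forms shown above. Names and the sample input are illustrative only.

def parse_annotation(line):
    """Parse one stand-off annotation line into a dict."""
    parts = line.split()
    op = parts[0]
    if op == "addToType":
        # addToType FILE START LENGTH SPANTYPE
        return {"op": op, "file": parts[1], "start": int(parts[2]),
                "length": int(parts[3]), "type": parts[4]}
    if op == "setSpanProp":
        # setSpanProp FILE START LENGTH semantics LETTERS
        return {"op": op, "file": parts[1], "start": int(parts[2]),
                "length": int(parts[3]), "prop": parts[4], "value": parts[5]}
    raise ValueError("unrecognized annotation line: " + line)

ann = parse_annotation("addToType doc1.txt 1293 10 cellLine")
end = ann["start"] + ann["length"]  # end offset of the span
```

Note that START and LENGTH are byte offsets into FILE, so the span's end offset is simply their sum.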

Other options

+ + + + + + + + + + + + + + + + + + + + + + +
Option: Explanation
-help: Gives brief command-line help
-gui: Pops up a window that allows you to interactively fill in the other arguments, monitor the execution of the annotation process, etc.
-showLabels: Pops up a window that displays the set of documents being labeled. + (This is not recommended for a large document collection, due to + memory usage.) +
-showResult: Pops up a window that displays the result of the annotation. + (Again, not recommended for a large document collection.) +
-format strings: Outputs results as a + tab-separated table, instead of Minorthird format. The first + column summarizes the type of the span, the file the span was + taken from, and the start and end byte positions, in a + colon-separated format. (E.g., + "cellLine:p11029059-fig_4_1:1293:1303".) The remaining column(s) + are the text that is contained in the span (e.g., "HeLa cells", + for the span above) almost exactly as it appears in the document; the + only change is that newlines are replaced with spaces. +
+ +
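As a hedged sketch, a row of the "-format strings" output described above could be split like this; the example row is constructed from the description, and real output may differ in detail:

```python
# Illustrative sketch: split one "-format strings" row into its parts.
# The summary column is colon-separated (type:file:start:end) and the
# remainder of the row is the span text.

def parse_strings_row(row):
    """Return (span_type, fname, start, end, text) for one output row."""
    summary, text = row.rstrip("\n").split("\t", 1)
    # rsplit guards against a span type that itself contains a colon
    span_type, fname, start, end = summary.rsplit(":", 3)
    return span_type, fname, int(start), int(end), text

row = "cellLine:p11029059-fig_4_1:1293:1303\tHeLa cells"
parsed = parse_strings_row(row)
```

Using `split("\t", 1)` keeps any additional tab-separated text columns together as a single string.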

References

+ + + +

Acknowledgements

+ +A number of people have contributed to these tools, including William +Cohen, Zhenzhen Kou, Quinten Mercer, Robert Murphy, Richard Wang, and +other members of the SLIF team. + +The initial development of these tools was supported by grant 017396 +from the Commonwealth of Pennsylvania Tobacco Settlement Fund. Further +development is supported by National Institutes of Health grant R01 +GM078622. + + + + diff --git a/Thumbs.db b/Thumbs.db new file mode 100755 index 0000000..6d4990b Binary files /dev/null and b/Thumbs.db differ diff --git a/aaai-fs-2012.ppt b/aaai-fs-2012.ppt new file mode 100644 index 0000000..b970895 Binary files /dev/null and b/aaai-fs-2012.ppt differ diff --git a/aaai-ss-2015.ppt b/aaai-ss-2015.ppt new file mode 100644 index 0000000..0a4c9cf Binary files /dev/null and b/aaai-ss-2015.ppt differ diff --git a/advice.html b/advice.html new file mode 100755 index 0000000..17f7b83 --- /dev/null +++ b/advice.html @@ -0,0 +1,181 @@ + + + + +Advice for Technical Speaking + + + + + + + +
+

Advice for technical speaking

+ +
Shamelessly pilfered from Geoff +Gordon's advice page
+ +When you hear a talk - a good one or a bad one! - think about the +presentation as well as the content. Copy what works, and avoid what +doesn't. If you see a great talk, examine it and try to figure out +what makes it great. If you see a poor talk, examine it and ask +yourself if you might make the same mistakes. Some of the most common +mistakes are below, but people are really quite creative in coming up +with new mistakes, so don't assume this list is complete!

+ +Don't use too many slides. If you have more than one slide per minute, +you are definitely using too many. One slide per two minutes is a much +more reasonable pace.

+ +Don't read your slides. You do not need to put everything you are +going to say up on a slide; that's what speaker notes are for. Save +your slides for things that don't work as well with just speech: +figures, diagrams, movies, animations, extra emphasis on important +concepts. If your slides are just lists of bullets, they should +probably be speaker notes and not slides.

+ +Don't put too many pixels on your slides. Use big fonts and +contrasting colors. Projectors are notoriously bad at making ochre and +mauve actually look different on the screen. If you copy a figure from +a paper, ask yourself whether the text labels are big enough to read +from the back of the room, and redo them if not.

+ +Don't try to say too much. You can't explain everything you've been +doing in a semester project in 30 minutes - part of communicating is +deciding what to leave out. If you feel like you have to rush to say +what you need to say, you're going too fast: the presentation should +be relaxed enough that people have a chance to reflect on what you +say, and ask questions if they need to.

+ +Talk concretely about your work. People are great at abstracting from +examples, but it's hard work for them to think through high-level +abstractions. (This is the opposite case from when you're programming +a computer - then you always program the most general case possible, +and let the computer instantiate it as needed.) When you're talking +to a person, start with a concrete problem you want to solve, and then +help the person understand how to generalize that concrete problem to +the general case.

+ +View your talk as an advertisement for your paper(s). Your goal is +to convince your audience to use your ideas for their own work, so +that they cite you and make you famous. Your goal is not to +make them understand Equation 43 on page 17 (unless that convinces +them to cite you and make you famous). Instead, say what your +techniques are good for, why they're important, what the alternatives +are, and how to choose when your techniques are appropriate instead of +the alternatives. Then and only then, use the rest of the time on +technical stuff, with the goal of giving listeners the tools to read +your paper. (If you're talking about someone else's work, imagine +instead that you're trying to get the audience to trust your +evaluation of that work.)

+ +Be honest and diligent. Don't try to cover up flaws or overstate the +applicability of your techniques; instead, try to discover flaws and +limitations and expose them. + +Think concretely about your audience. Will they be able to understand +each slide as it comes up? Will they understand why each slide is +important? As a heuristic, I often find it best to prepare the first +version of a talk with one specific person I know well in +mind - and think about what I would say to engage and inform him or +her specifically.

+ +Talk at the right level for your audience. Remember that, almost by +definition, you understand the material and they don't, and fight the +inclination to go too fast. Be aware of people's cognitive +limitations. Don't make your audience figure something out if you +don't have to; that will save more processing power for what you want +them to focus on. In particular: +

    +
  • Don't ask people to listen to one thing and read another at the +same time.
  • Don't ask people to remember an equation or +definition 5 slides later: just put up a copy when you refer to it. +
  • Use direct, simple language. For example, if there are three ways +to refer to something, pick one and use it consistently throughout +your talk: don't call something a “model” on one slide and +a “parameter vector” on another. +
  • Label every graph clearly and in large fonts: both axes, every line, and even the sign of any comparison you want to make (“higher is better”). +
  • If a fact is important, emphasize it.  The audience doesn't necessarily manage to process every word you say.  Help them process your talk by telling them what is important, and by repeating things they might have forgotten. +
+

+ +Synthesize. The audience should get something out of your talk that +they can't get as quickly or easily out of the paper(s). This means: +pull together concepts from multiple papers if necessary; compare to +related work; communicate your judgement about benefits and +limitations of each technique.

+ +Be careful with equations.  You can use a limited number of +equations if you want to, but make sure that you spend enough time +explaining them that the audience truly understands them. +

    +
  • Often, it's a good idea to leave the slide blank and hand-write the equations on it during the actual talk; this trick will keep you from going too fast. Of course, this trick only works if you have a tablet or (gasp) an analog device like a whiteboard or overhead transparencies. +
  • If you use this trick, make sure you practice writing out the equations ahead of time at the same level of detail that you plan to use during the talk. Don't just assume they're simple enough that you can't possibly get them wrong; that assumption is usually false. +
+

+ +Organize well: + +

    +
  • Introduce one new concept at a time. Make sure you know, for every part of every slide, which concept it is intended to convey. Make sure you can describe each concept with a clear, short phrase -- else it's probably more than one concept. +
  • Introduce concepts in the right order. If concept B depends on concept A, make sure to introduce A first. +
  • Sometimes it helps to make a directed graph: nodes are the short phrases for concepts, and arrows represent prerequisites. You can then check that your talk is consistent with the graph (i.e., doesn't try to reverse any arrows). +
  • If there are directed cycles in your graph, you have a problem. Try to refactor your concepts and pull out something that you can introduce before any of the nodes in a cycle, then re-evaluate the dependencies, and repeat until you get a DAG. + +
+

+ +Start and end your talk well: +

    +
  • If possible, put up your title slide while you're being introduced. Then you don't need to read it. +
  • Make sure the audience knows who you are, especially if you're talking about a paper with multiple authors. You may want to put your name at the bottom of every slide, for people who come in late. +
  • Make sure you know the first few sentences of your talk by heart. Exact memorization is usually a bad idea for the body of the talk (it sounds stilted), but I find that knowing the first sentence or two helps me get started. (And once I get started I can almost always keep going.) + +
  • Make sure you have an obvious end to your talk, and don't just trail off into silence. Always end with a statement (e.g., “thank you”), not a question (e.g., “any questions?”).  If you end with a question, the audience doesn't know whether to answer it or applaud, which can be awkward. +
+ +Audiences hate to have their time wasted.  So: +
    +
  • Whenever you can do a little work to save your audience a little +work, you should.  E.g., make a better visualization or a better +figure, if you think it will improve your audience's ability to +understand.  Or, take that huge table of timing results from your +paper and translate it into a bar chart that highlights the +comparisons you're trying to make. + +
  • View an agreement to give a talk as a commitment.  Don't +cancel unless you really, really need to.  If you do have to +cancel, give as much notice as you can. + +
  • Plan to show up early.  That way if something goes wrong +(miss a bus, projector doesn't work, etc.), you have time to fix +it.  Snafus like the above are part of the normal order of the +world, and somehow seem to be even more common when you're about to +give a talk.  Speakers should therefore expect and plan for them. + +
  • Know your tools.  Make sure you know how to hook your laptop +up to a projector, how to operate your presentation software quickly +and unobtrusively, how to avoid having instant messages pop up on top +of your slides, etc. + +
+

+ +Don't waste your own time either.  Don't spend lots of time +designing pretty animations, flying text, etc., unless they will +actually help audience comprehension and not distract from your +talk.  Every second spent animating is a second you don't have +for explaining your ideas.

+ +

+ +
+
+ + +Last modified: Mon Jan 03 16:09:28 Eastern Standard Time 2011 + + + diff --git a/all-bibdata.tgz b/all-bibdata.tgz new file mode 100644 index 0000000..197c812 Binary files /dev/null and b/all-bibdata.tgz differ diff --git a/all-nell-triples.txt.gz b/all-nell-triples.txt.gz new file mode 100644 index 0000000..86137de Binary files /dev/null and b/all-nell-triples.txt.gz differ diff --git a/balloon.zip b/balloon.zip new file mode 100755 index 0000000..5d00d55 Binary files /dev/null and b/balloon.zip differ diff --git a/block-lda-icml-ws-2010.ppt b/block-lda-icml-ws-2010.ppt new file mode 100755 index 0000000..8a1d1aa Binary files /dev/null and b/block-lda-icml-ws-2010.ppt differ diff --git a/bottom.html b/bottom.html new file mode 100755 index 0000000..929f23f --- /dev/null +++ b/bottom.html @@ -0,0 +1,302 @@ + + +William W. Cohen + + + +

Areas of expertise

+ +I have extensive experience in machine learning and discovery, +information retrieval, information extraction, and data integration. + +

Biography

+ +William Cohen received his bachelor's degree in Computer Science from +Duke University in 1984, and a PhD +in Computer Science from Rutgers +University in 1990. From 1990 to 2000 Dr. Cohen worked at AT&T Bell Labs and later AT&T Labs-Research, and from +April 2000 to May 2002 Dr. Cohen worked at Whizbang Labs, a company +specializing in extracting information from the web. Dr. Cohen is +currently an action editor for the Journal of Machine Learning +Research, has served as an editor for the journal Machine +Learning and the Journal of +Artificial Intelligence Research, co-organized the 1994 +International Machine Learning Conference, and has served on more than +20 program committees or advisory committees. In addition to +his position at CMU, Dr. Cohen also serves on the advisory board of +Intelliseek. + +

+Dr. Cohen's research +interests include information integration and machine learning, +particularly text categorization and learning from large datasets. He +holds six patents related to learning, discovery, information +retrieval, and data integration, and is the author of more than 60 +refereed publications. + + + +

Software systems

+ + + +

Datasets

+ +The following datasets are available for anyone to use for research +purposes: + + +

Recent talks and presentations

+ + +

+

+ + +

Teaching

+ +June 21,23,25: A mini-course on Minorthird. +

+ +Materials: +

+ +

+ +From Spring 2004: "Learning to Turn Words into Data: +Machine Learning Approaches to Information Extraction and Information Integration", CALD 10-707 and LTI 11-748. + + +

Publications

+ + +Recent papers are kept in HTML or PDF (which requires Adobe +Acrobat Reader to view). Older papers are mostly in Postscript. +For Windows, I use the GSView reader for +Postscript. Most of these papers are viewable in several formats in +ResearchIndex. + +

Students

+ + + + +

Contact Info

+ +

+William Cohen
+Associate Research Professor
+Center for Automated Learning & Discovery
+Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213
+Wean Hall 5317 / 412-268-7664 (voice) / 412-268-3431 (fax)
+ +

Official CMU Contact Info + +

My preferred email address is: wcohen AT cs DOT cmu DOT edu + + +

Other Stuff

+ +

For those many friends whose research I have built on, be warned. +My full name, "William Weston Cohen", is an anagram of the phrase "I now +cite shallow men". + +

I am often praised for my highly artistic and functional web site designs. +An example is the site for SC Indexing, +a professional book indexer. However, I accept few clients - this +one happens to be my wife. + +

Through my advisor, Alex Borgida, I can trace my "academic lineage" back to luminaries like +Leibniz and Alfred Whitehead. + +

Poetry anyone? +


+ + + + diff --git a/captions.tgz b/captions.tgz new file mode 100755 index 0000000..5141598 Binary files /dev/null and b/captions.tgz differ diff --git a/cikm-2012.ppt b/cikm-2012.ppt new file mode 100644 index 0000000..54e0da5 Binary files /dev/null and b/cikm-2012.ppt differ diff --git a/classify.tar.gz b/classify.tar.gz new file mode 100644 index 0000000..d79b56d Binary files /dev/null and b/classify.tar.gz differ diff --git a/cloud/Notes.html b/cloud/Notes.html new file mode 100644 index 0000000..6f023fe --- /dev/null +++ b/cloud/Notes.html @@ -0,0 +1,132 @@ +Tag Cloud + + +

What I did

+
+
+ - I used Yahoo Site Explorer and wget to download 1000 pages from
+ dailykos and redstate.com.  I believe this is the top 1000 pages on
+ the site by #inlinks.  I filtered these to get blog entries, including
+ comments.
+
+ - I extracted the words from dkos & redstate blog entries, and the
+ corresponding comments, using a perl script (that uses an extendable
+ perl HTML parser, and site-specific "class" tags on the comment and
+ entry DOM nodes).  The redstate comments are a little messier, since
+ I could not easily strip out signatures.
+
+ - I tokenized, stoplisted, counted a bunch of word frequencies, and
+ saved all the words that appear >= 5 times in dkos entries, redstate
+ entries, dkos comments, redstate comments, etc.
+
+ - I estimated a bunch of relative-frequency/MI sort of statistics.
+ What seemed most reasonable was to look for "non-general English"
+ words that are "more common in context X than context Y", which
+ I express with this score
+
+ log[ P(w|X) / (P(w|Y)*P(w|GE)) ]
+
+ Stats for "general English" were from the brown corpus.  I smoothed
+ with a Dirichlet, and probably more importantly, by replacing zero
+ counts for P(w) with counts of 4 (since I only stored counts>=5).
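 The score above can be sketched in code. In this hedged sketch the
 vocabulary size and Dirichlet parameter are invented (the notes only
 say that smoothing was Dirichlet and that zero counts were floored at
 4), and each context is just a (word-count dict, total words) pair:

```python
# Illustrative sketch of the score log[ P(w|X) / (P(w|Y)*P(w|GE)) ]:
# high for words common in context X but rare in context Y and in
# general English. VOCAB, ALPHA, and all counts below are assumptions.
import math

VOCAB = 50000   # assumed vocabulary size for smoothing
ALPHA = 1.0     # assumed Dirichlet parameter
FLOOR = 4       # stand-in count for unstored words (only counts >= 5 kept)

def smoothed_p(word, counts, total):
    """Dirichlet-smoothed P(word | context), with the zero-count floor."""
    c = counts.get(word, FLOOR)
    return (c + ALPHA) / (total + ALPHA * VOCAB)

def score(word, x, y, ge):
    """Each of x, y, ge is a (counts_dict, total_words) pair."""
    px = smoothed_p(word, *x)
    py = smoothed_p(word, *y)
    pg = smoothed_p(word, *ge)
    return math.log(px / (py * pg))
```

 A word with a large count in X but only floor counts in Y and in the
 general-English stats gets the highest score, which matches the
 intent described above.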
+
+ Then for each X,Y I looked at, I took the top 200 scoring words,
+ broke them into 10 equal-frequency bins, and built a "tagcloud"
+ visualization of them.  The top 200 ignored a handful of stuff that I
+ decided was noise: signature line tokens, like ----; words like
+ "pstin", which seem to be poorly-tokenized dkos words; date, time and
+ number words; and words like kos, dailykos, entry, diary, fold, and
+ email.
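 The top-200/10-bin step above could look roughly like the sketch
 below; the font sizes, and reading "equal-frequency bins" as bins of
 equal word counts, are assumptions for illustration:

```python
# Hedged sketch of the tag-cloud binning step: keep the top 200 scoring
# words, split them into 10 equal-sized bins, and map each bin to a
# font size. Sizes and cutoffs are invented, not from the notes.

def tag_cloud_sizes(scored_words, n_bins=10, top_n=200):
    """scored_words: iterable of (word, score); returns [(word, font_pt)]."""
    top = sorted(scored_words, key=lambda ws: ws[1], reverse=True)[:top_n]
    bin_size = max(1, len(top) // n_bins)
    sized = []
    for i, (word, _) in enumerate(top):
        b = min(i // bin_size, n_bins - 1)   # bin 0 holds the top scorers
        sized.append((word, 28 - 2 * b))     # 28pt down to 10pt
    return sized
```

 Rendering is then just emitting each word in its bin's font size,
 with the noise filtering described above applied first.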
+
+===================================================================
+X		  Y			  file name               
+===================================================================
+dkos entries	  redstate blog entries	  blue-red-entry.html
+dkos comments	  redstate blog comments  blue-red-comment.html
+dkos anything	  redstate blog anything  blue-red-all.html       
+redstate entries  dkos entries 		  red-blue-entry.html     
+redstate comments dkos comments		  red-blue-comment.html   
+redstate anything dkos anything		  red-blue-all.html       
+redstate comment  redstate entry	  redComment-redEntry.html     
+dkos comment	  dkos entry		  blueComment-blueEntry.html   
+===================================================================
+
+ For a few other contexts I used the score
+
+ log [ P(w|X)*P(w|Y) / P(w|GE) ]
+
+ i.e., "non-general English" words that are "common in both context X and
+ context Y"
+
+===================================================================
+X,Y			  file name               
+===================================================================
+dkos,redstate comments	  blue+red-comment.html        
+dkos,redstate entries	  blue+red-entry.html          
+dkos,redstate anything	  blue+red-all.html
+===================================================================
+
+ - I also wrote code to pick up subject-matter 'tags' from dailykos
+ (like the delicious tagging scheme), which turned out to be pretty
+ noisy (eg, "republican" and "repulican party" are both tags, as are
+ "iraq" and "iraq war".)  I set up some additional contexts X = "dkos
+ comments for entries tagged with something that contains the word T"
+ and compared them to Y="all dkos comments"
+
+===================================================================
+T		file name		
+===================================================================
+elections	blueElections-blue-comment.html
+iraq		blueIraq-blue-comment.html
+media		blueMedia-blue-comment.html
+===================================================================
+
+ - Sizes of all of this, in words:
+
+==============================
+brown                   480098  
+
+dkos-all               3351061 
+dkos-comment           3311702 
+dkos-entry               39359   
+
+redstate-all           1152883
+redstate-comment        940241
+redstate-entry          212642  
+
+dkos-iraq-comment       341238  
+dkos-elections-comment  256129  
+dkos-media-comment      160413  
+==============================
+
+
+ +

Observations

+ +