
Better README for the project #2

Open · wants to merge 2 commits into base: master
47 changes: 31 additions & 16 deletions README.md
@@ -1,28 +1,43 @@
linguo
======
# linguo

![linguo robot](http://4.bp.blogspot.com/_W2b7qR0wkBs/SdPz9CyWMOI/AAAAAAAAANo/M0znxgvTURI/s320/linguo1.jpg)
<img src="./misc/LinguoInfo.png" width="150">
<br>
[Image source](http://simpsons.wikia.com/wiki/File:LinguoInfo.png)
<br>
## Motivation
This work was done as part of an internship at The New York Times R&D Lab ([nytlabs](http://nytlabs.com/)) in July 2014.
The result was [<i>Editor</i>](http://nytlabs.com/projects/editor.html).
<br>
><i>Editor is an experimental text editing interface that explores how collaboration between machine learning systems and journalists could afford fine-grained annotation and tagging of news articles. Our approach applies machine learning techniques interactively, as part of the writing process, rather than retroactively. This approach can offload the burden of work to the computational processes, and can create affordances for journalists to augment, edit and correct those processes with their knowledge.</i>

linguo is a set of services that provide NLP facilities
You can view it in use [here](http://nytlabs.com/projects/editor.html):
<br>
<a href="http://nytlabs.com/projects/editor.html"><img src="./misc/editor.png" align="middle"></a>

This repository contains various tools that one might encounter while dealing with text data. These are also provided in shape of services.
## Repository Description
<b><i>linguo</i></b> is a set of [microservices](https://en.wikipedia.org/wiki/Microservices) that provide Natural Language Processing capabilities for text editors.

There are 6 services two of which are not implemented as a service but can be done via python's Flask Library.
<b>Note</b>: Most of these problems have been researched extensively since 2014, so better libraries are now available for the same tasks; [goose](https://github.com/grangier/python-goose) and [reporter](https://pypi.python.org/pypi/reporter/0.1.2) are two examples.

1. times_tagger : This folder contains scripts to tag articles with times tags. More information is inside the folder.
<i>Editor</i> used the following microservices:

2. sentence_segmentation : Scripts in this folder are aimed at providing web service to identify sentences from a body of text. It is not an easy task to identify sentences as there are instances where periods are not used as end of a sentence.
1. [times_tagger](./times_tagger/) : This folder contains scripts to tag articles with tags from the Times. More information is inside the folder.

3. keyword_extraction : Scripts in this folder are aimed at extraction of keywords from urls.
2. [sentence_segmentation](./sentence_segmentation/) : Scripts in this folder implement a web service to identify sentences in a body of text (a hypothetical sketch of such a service appears at the end of this README).

4. topic_tracker : It is an attempt to summarize topics that are being read by an individual or a group of people.

5. text_classifier : It is an attempt to classify text into labels as give in Times taxonomy. Multi-Label Classification is done by using Google's Word2Vec representation of word as 100 dimensional vectors.
3. [keyword_extraction](./Keyword_extraction/) : Scripts in this folder extract keywords from URLs.

6. html_text_extractor : Given an url, it contains massive amount of text but not all text contains core information (about which the page is). This is an attempt to build a classifier on p tags in html to classify if its good or bad.

After talking to people, I found that there are libraries like goose(https://github.com/grangier/python-goose) and reporter(https://pypi.python.org/pypi/reporter/0.1.2) that does exact same thing so I recommend use of those libraries.
4. [text_classifier](./text_classifier/) : An attempt to classify articles into the labels given in the Times taxonomy. Multi-label classification is done using Google's [Word2Vec](https://en.wikipedia.org/wiki/Word2vec) representation of words as 100-dimensional vectors (a rough sketch follows).
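
As a rough illustration of the approach this item describes (hypothetical code, not the repository's; the model path and the `articles`/`labels` inputs are assumptions), multi-label classification over averaged word vectors might look like:

```python
# Hypothetical sketch -- not the repository's code. Assumes a trained
# 100-dimensional Word2Vec model and a labeled article set.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

model = Word2Vec.load("word2vec_100d.model")  # path is an assumption

def article_vector(tokens):
    """Average the 100-d vectors of the in-vocabulary tokens."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

# articles: list of token lists; labels: list of taxonomy-tag lists
X = np.array([article_vector(a) for a in articles])
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# One binary classifier per taxonomy label yields multi-label output.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)
predicted = mlb.inverse_transform(clf.predict(X[:1]))
```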

7. usefulScripts : It contains all the scripts which were helpful in preprocessing articles, querying mongoDB or experimenting with LDA.
As part of another experiment to find the topics of discussion in the lab, I implemented <i>Topic Tracker</i>, which performs LDA on the content extracted from the URLs being visited.
1. [topic_tracker](./Topic_Tracker/) : An attempt to summarize the topics being read by an individual or a group of people.

*app.py script combines all the services. It might not work because of relocation of other scripts.*
2. [html_text_extractor](./html_text_extractor) : A given URL contains a large amount of text, but not all of it carries the page's core information (what the page is about). This is an attempt to build a classifier over the `<p>` tags in the HTML that labels each one as good or bad.



- [usefulScripts](./usefulScripts/) : Contains all the scripts that were helpful for preprocessing articles, querying MongoDB, or experimenting with LDA.


The [app.py](./app.py) script combines all the services.
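
As referenced in item 2 above, here is a hypothetical sketch of what one of these services could look like as a Flask endpoint (not the code in `sentence_segmentation/`; the route and port are assumptions), using NLTK's punkt tokenizer to split posted text into sentences:

```python
# Hypothetical sketch of a sentence-segmentation service -- not the
# repository's implementation. Assumes Flask and NLTK are installed.
import nltk
from flask import Flask, jsonify, request

nltk.download("punkt", quiet=True)  # punkt handles abbreviations like "Dr."
app = Flask(__name__)

@app.route("/sentences", methods=["POST"])  # route name is an assumption
def sentences():
    text = request.get_json(force=True).get("text", "")
    return jsonify({"sentences": nltk.sent_tokenize(text)})

if __name__ == "__main__":
    app.run(port=5000)  # port is an assumption
```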
9 changes: 5 additions & 4 deletions Topic_Tracker/README.md
Original file line number Diff line number Diff line change
@@ -1,9 +1,10 @@
***TOPIC TRACKER***
<b>Note:</b> The links here will not work.

**What is it?**

Topic Tracker is aimed at keeping track of what is being read or talked about most in the lab.
It accepts URLs as JSON with "Content-Type: application/json".
A typical JSON payload looks like this:

{"url":"http://www.zephoria.org/thoughts/archives/2014/07/09/alice-goffman-on-the-run.html"}
@@ -12,15 +13,15 @@ Typical JSON file will look like this:

**How does it work?**

It uses LDA Model by David Blei implemented in Gensim Library of Python. It is coded in Python 2.7.
It uses [LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) implemented in [gensim](https://radimrehurek.com/gensim/). A hypothetical sketch of this loop appears after the steps below.

0. The user sends in a URL through (A) below
1. beanstalk is used to handle the queue of URLs
2. text is extracted from each URL
3. a dictionary is built from the text as URLs are added
4. after the first 25 URLs, the LDA model is trained and stats about the number of topics are sent to Redis
5. after every 25 URLs thereafter, stats are computed on the accumulated corpus (the corpus is just the accumulation of all text) and sent to Redis
6. after every 100 URLs, the LDA model is retrained and the stats recomputed
7. on the other end, stats can be viewed through (B) below
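
As mentioned above, a hypothetical sketch of steps 3–6 (not the repository's code; URL fetching and the beanstalk queue are omitted, and the topic count and Redis key are assumptions):

```python
# Hypothetical sketch of steps 3-6 -- not the repository's code.
# Texts arrive as token lists; queue handling and URL fetching are omitted.
import json
import redis
from gensim import corpora, models

r = redis.StrictRedis()
dictionary = corpora.Dictionary()
corpus = []
lda = None

def on_new_text(tokens):
    global lda
    dictionary.add_documents([tokens])           # step 3: grow the dictionary
    corpus.append(dictionary.doc2bow(tokens))
    n = len(corpus)
    if n == 25 or (lda is not None and n % 100 == 0):
        # steps 4 and 6: (re)train LDA on the accumulated corpus
        lda = models.LdaModel(corpus, id2word=dictionary, num_topics=10)
    if lda is not None and n % 25 == 0:
        # step 5: compute per-topic word stats and send them to Redis
        stats = [[(w, float(p)) for w, p in lda.show_topic(i)]
                 for i in range(lda.num_topics)]
        r.set("topic_stats", json.dumps(stats))  # key name is an assumption
```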


Binary file added misc/LinguoInfo.png
Binary file added misc/editor.png