README.txt
 _____   _____  _______ __     __           _____  
|_   _| / ____||__   __|\ \   / /    /\    |  __ \ 
  | |  | (___     | |    \ \_/ /    /  \   | |__) |
  | |   \___ \    | |     \   /    / /\ \  |  _  / 
 _| |_  ____) |   | |      | |    / ____ \ | | \ \ 
|_____||_____/    |_|      |_|   /_/    \_\|_|  \_\
==> Scholar
https://www.elfdict.com/w/istyar
==> Alpha versions
https://istyar.app/grouped_articles_with_bars.html
https://istyar.app/grouped_articles%20(1).html
- Visualizes the timeline of Katz Lab publications and makes it easier
  to explore relationships between lab papers and the extended literature
Features:
- Network graph visualization of lab articles over time
- Edges between publications determined by:
  1) Manual input
  2) Automatic inference using:
     a) Hierarchical Dirichlet Process
  3) Orphans (articles with no manual edges) default to inferred edges
- Summary statistics for each node (article):
  1) Citations (timeline)
  2) Title, authors, abstract, link
- Suggested further reading:
  1) From within-lab articles
  2) References of the article + articles citing it
  3) Extended literature
- Highlight by author
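The edge-selection rule above (manual input first, orphans falling back to inferred edges) can be sketched as follows. This is a hypothetical helper, not code from the project; `resolve_edges` and all names are illustrative, and the inferred edges stand in for HDP output.

```python
def resolve_edges(articles, manual_edges, inferred_edges):
    """Return one edge list per article: manual edges if present,
    otherwise fall back to inferred edges (orphans default to inferred)."""
    resolved = {}
    for article in articles:
        manual = manual_edges.get(article, [])
        # Orphans (no manually entered edges) default to inferred edges.
        resolved[article] = manual if manual else inferred_edges.get(article, [])
    return resolved

articles = ["paper_a", "paper_b", "paper_c"]
manual = {"paper_a": ["paper_b"]}
inferred = {"paper_b": ["paper_c"], "paper_c": ["paper_a"]}

edges = resolve_edges(articles, manual, inferred)
# paper_a keeps its manual edge; paper_b and paper_c fall back to inferred.
```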
Backend:
- Interactive plotting done in:
  - Plotly
- Scraping using:
  - Biopython via Entrez
- Inference using:
  - Hierarchical Dirichlet Process (package unknown)
- Deployment:
  - Heroku
- Storage (for lab and extended-lit scraped articles):
  - Amazon S3
- Periodic updates:
  - Airflow
Layout:
- Front end:
  - Easiest thing would be to allow people to enter a query (restricted
    to authors) and simply receive a link to a figure.
- Data:
  - Notes:
    - If the database for text is centralized, this will allow easier
      development of more complex models (Hierarchical Dirichlet Process
      or Hierarchical NMF).
- NLP:
  - Notes:
    - Avoiding recalculation of embeddings/features with each addition
      to the database will be challenging.
    - Using TF or TFIDF and keeping counts of <words per document>
      allows updates without recalculation, i.e. we can store a
      sparse array of word counts per document. As more documents
      are added, both rows (documents) and word counts (columns) can
      be appended. If the dictionary size (total words) gets too large,
      a maximum can be set and words can be replaced using some
      relevance criterion (highest TFIDF?).
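The incremental count store described above can be sketched with a dict-of-dicts "sparse array": new documents append rows, new words append columns, and when the vocabulary exceeds a cap the least-relevant word is evicted. This is hypothetical illustration code, not from the project; plain document frequency stands in here for the TFIDF-based relevance criterion mentioned above.

```python
class CountStore:
    """Sparse per-document word counts with a capped vocabulary."""

    def __init__(self, max_vocab=1000):
        self.max_vocab = max_vocab
        self.counts = {}    # doc_id -> {word: count} (a sparse row)
        self.doc_freq = {}  # word -> number of documents containing it

    def add_document(self, doc_id, words):
        # Append a new row; new words implicitly append columns.
        row = {}
        for w in words:
            row[w] = row.get(w, 0) + 1
        self.counts[doc_id] = row
        for w in row:
            self.doc_freq[w] = self.doc_freq.get(w, 0) + 1
        # If the dictionary grows past the cap, evict the word with the
        # lowest relevance (document frequency here; TFIDF would also work).
        while len(self.doc_freq) > self.max_vocab:
            worst = min(self.doc_freq, key=self.doc_freq.get)
            del self.doc_freq[worst]
            for r in self.counts.values():
                r.pop(worst, None)

store = CountStore(max_vocab=5)
store.add_document("doc1", ["lab", "paper", "citation", "lab"])
store.add_document("doc2", ["paper", "network", "graph", "topic"])
```

Adding `doc2` pushes the vocabulary to six words, so one low-frequency word is evicted while shared words like "paper" survive.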
- Organization:
  - Feature extraction:
    - TFIDF
    - TF
    ...
    - PCA of features?
  - Clustering/topic modelling:
    - LDA
    - NMF
    - PCA of weights?
  - Distance calculation (either on raw features or on topics):
    - Cosine similarity
    - K-means
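The distance-calculation step listed above can be sketched as cosine similarity over two sparse feature rows (dicts of word weights, as TF/TFIDF extraction would produce). A hypothetical helper, not from the project:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {word: weight}."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)

sim = cosine_similarity({"lab": 1.0, "paper": 2.0}, {"paper": 2.0, "graph": 1.0})
# Only "paper" overlaps, so sim = 4 / (sqrt(5) * sqrt(5)) = 0.8
```

Working on sparse dicts keeps this consistent with the incremental count storage discussed in the NLP notes, since rows never need to be densified.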