Home
Welcome to the epidey-search wiki!
“The current problem with search engines is that they personalize the results based on your past history,”
“As researchers, you can fall into basically this rabbit hole that puts you in an echo chamber where you're essentially confirming your previous search history.”
“To overcome that, we need to not treat researchers and clinicians like consumers, but instead, the search must treat them as what they are: researchers looking for new things.”
The inspiration for the project came from my significant other, who is about to begin her Master of Public Health. Information overload is one of the problems she and I have faced together while looking through research papers to gain a deeper understanding of a particular topic.
Even for a simple online tutorial today, there is so much information to look through that it is hard to tell whether a given tutorial will actually cater to our needs.
Considering this problem at scale, for those at the forefront of cutting-edge research, made me realize that many human-hours could be saved if there were some way of providing a sort of “filter” or representation for the data that researchers are looking for.
I stumbled upon word2vec and node2vec; the resources I used to arrive at the idea and direction are listed further down this page. While there is a lot more reading to be done to find the best approach for the next steps, the general idea is to present search results as a graph-based visual representation. The data (in this case, research papers) would be grouped with k-means clustering, forming clusters with respect to each other and to the query performed by the user. Edges between papers would be tighter when the papers share similar opinions or topics, or when one is cited by the other. This should further reduce the bias normally faced when searching for papers on a service like PubMed, where keyword-based results can make a clinician miss relevant articles.
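The clustering and edge-weighting idea can be sketched in plain Python. This is only an illustration under stated assumptions: the paper names, the 2-D vectors, the citation pair, the `citation_bonus`, and the `threshold` are all made up for the example, and real embeddings would come from a text encoder rather than being written by hand.

```python
import math

# Hypothetical 2-D paper embeddings (assumed values, not real data).
PAPERS = {
    "flu-vaccine-trial": (0.9, 0.1),
    "flu-vaccine-safety": (0.8, 0.2),
    "flu-transmission": (0.85, 0.3),
    "diet-and-diabetes": (0.1, 0.9),
    "diabetes-screening": (0.2, 0.8),
}
# Hypothetical citation links, stored as (citing, cited) pairs.
CITATIONS = {("flu-vaccine-safety", "flu-vaccine-trial")}


def mean(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(component) / len(points) for component in zip(*points))


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's k-means with a deterministic farthest-point seeding."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(math.dist(v, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda i: math.dist(v, centroids[i]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        updated = []
        for i in range(k):
            members = [v for v, label in zip(vectors, labels) if label == i]
            updated.append(mean(members) if members else centroids[i])
        if updated == centroids:
            break
        centroids = updated
    return labels


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0


def build_edges(papers, citations, citation_bonus=0.5, threshold=0.9):
    """Keep an edge when similarity (plus any citation bonus) passes the threshold."""
    names = list(papers)
    edges = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            weight = cosine(papers[a], papers[b])
            if (a, b) in citations or (b, a) in citations:
                weight += citation_bonus  # a citation tightens the edge
            if weight >= threshold:
                edges[(a, b)] = round(weight, 3)
    return edges


clusters = dict(zip(PAPERS, kmeans(list(PAPERS.values()), k=2)))
edges = build_edges(PAPERS, CITATIONS)
```

With these toy vectors the three flu papers land in one cluster and the two diabetes papers in the other, and the cited pair gets the heaviest edge. A user query could be embedded the same way and added as one more node.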
In Jina, a Document is anything that you want to search for: a text document, a short tweet, a code snippet, an image, a video/audio clip, GPS traces of a day. A Document is also the input query when searching.
Therefore, for our purposes, a Document would be a text-based research paper, while the search query itself would be a text-based topic.
Q: How do you differentiate between a Document that is an entire paper and a query that is just some text to search for within the paper?
A Chunk is a small semantic unit of a Document. It could be a sentence, a 64x64 image patch, a 3 second video shot, a pair of coordinate and address.
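For a text paper, one natural choice of chunk is the sentence. The sketch below illustrates the Document/Chunk relationship for both a paper and a query. It does not use Jina's API: the `Document` and `Chunk` classes, `make_document`, and the naive regex splitter are hypothetical stand-ins written only to show the shape of the data.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    parent_id: str  # id of the Document this chunk came from


@dataclass
class Document:
    doc_id: str
    text: str
    chunks: list = field(default_factory=list)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def make_document(doc_id, text):
    """Build a Document whose chunks are its sentences."""
    doc = Document(doc_id, text)
    doc.chunks = [Chunk(sentence, doc_id) for sentence in split_sentences(text)]
    return doc


# A paper and a query are both Documents; they differ only in size.
paper = make_document(
    "paper-1",
    "Vaccination reduces transmission. Coverage varies by region. More data is needed.",
)
query = make_document("query-1", "vaccination coverage")
```

Here the paper yields three chunks and the short query yields one, so matching can happen chunk against chunk while each chunk still points back to its parent Document.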
So chunks would be sentences within the research paper. Queries would be based on sentiment analysis of the chunks and on the overall sentiment of the document.
- https://www.curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/
- https://towardsdatascience.com/use-cases-of-googles-universal-sentence-encoder-in-production-dd5aaab4fc15#:~:text=The%20Universal%20Sentence%20Encoder%20encodes,publicly%20available%20in%20Tensorflow%2Dhub.
- https://huggingface.co/transformers/pretrained_models.html
- https://www.nltk.org/index.html
- https://www.youtube.com/watch?v=AUOf9ZTViAY&t=381s&ab_channel=AngelHack
- https://github.com/jina-ai/jina-hub
- https://docs.jina.ai/index.html
- https://www.freecodecamp.org/news/how-to-think-about-your-data-in-a-different-way-b84306fc2e1d/
- https://towardsdatascience.com/sentimental-analysis-using-vader-a3415fef7664
In word2vec, words are linked together by vector-based reasoning (vector offsets). For example, King - Man = Royalty and Royalty + Woman = Queen; therefore, King - Man + Woman = Queen.
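The offset arithmetic can be tried on a toy example. The vectors below are hand-made assumptions (axis 0 loosely stands for "royalty", axis 1 for "gender"), not trained word2vec embeddings, and `analogy` is a hypothetical helper that returns the nearest remaining word by cosine similarity.

```python
import math

# Hand-made toy vectors, assumed for illustration only.
EMBEDDINGS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "prince": (0.8, 1.0),
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0


def analogy(positive, negative, embeddings=EMBEDDINGS):
    """Add the positive vectors, subtract the negative ones, return the closest other word."""
    dims = len(next(iter(embeddings.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, embeddings[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, embeddings[word])]
    # The input words themselves are excluded from the candidates.
    used = set(positive) | set(negative)
    candidates = [word for word in embeddings if word not in used]
    return max(candidates, key=lambda word: cosine(target, embeddings[word]))


# King - Man + Woman
result = analogy(positive=["king", "woman"], negative=["man"])
print(result)
```

With these toy vectors the offset lands exactly on the vector for "queen". Trained embeddings are noisier, so the nearest neighbour is only approximately the expected word.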