text-mining

Unstructured Data Analysis (Graduate) @Korea University

Notice

Syllabus (download)
Term project groups
- Group 1: 박성훈, 이수빈(2018021120), 이준걸, 박혜준
- Group 2: 이정호, 천우진, 유초롱, 조규원
- Group 3: 백승호, 목충협, 변준형, 이영재
- Group 4: 박건빈, 이수빈(2018020530), 변윤선, 권순찬
- Group 5: 최종현, 이정훈, 박중민, 노영빈
- Group 6: 백인성, 김은비, 신욱수, 강현규
- Group 7: 전성찬, 박현지, 문관영
- Group 8: 조용원, 정승섭, 민다빈, 최민서
- Group 9: 박명현, 장은아, 유건령

Recommended courses

CS224d @Stanford: Deep Learning for Natural Language Processing
- Course Homepage: http://cs224d.stanford.edu/
- YouTube Video: https://www.youtube.com/playlist?list=PLlJy-eBtNFt4CSVWYqscHDdP58M3zFHIG
CS224n @Stanford: Natural Language Processing with Deep Learning
- Course Homepage: http://web.stanford.edu/class/cs224n/syllabus.html
- YouTube Video: https://www.youtube.com/playlist?list=PL3FW7Lu3i5Jsnh1rnUwq_TcylNr7EkRe6
Deep Natural Language Processing @Oxford
- Course Homepage: https://github.com/oxford-cs-deepnlp-2017/lectures

Schedule

Topic 1: Introduction to Text Analytics

The usefulness of large amounts of text data and the associated challenges
Overview of text analytics methods

Topic 2: From Texts to Data

Text data collection: Web scraping

Topic 3: Text Preprocessing

Introduction to Natural Language Processing (NLP)
Lexical analysis
Syntax analysis
Other topics in NLP
Reading materials
- Cambria, E., & White, B. (2014). Jumping NLP curves: A review of natural language processing research. IEEE Computational intelligence magazine, 9(2), 48-57. (PDF)
- Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug), 2493-2537. (PDF)
- Young, T., Hazarika, D., Poria, S., & Cambria, E. (2017). Recent trends in deep learning based natural language processing. arXiv preprint arXiv:1708.02709. (PDF)

Topic 4: Neural Networks Basics

Perceptron, Multi-layer Perceptron
Convolutional Neural Networks (CNN)
Recurrent Neural Networks (RNN)
Practical Techniques

Topic 5-1: Document Representation I: Classic Methods

Bag of words
Word weighting
N-grams

Topic 5-2: Document Representation II: Distributed Representation

Word2Vec
GloVe
FastText
Doc2Vec
Reading materials
- Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of machine learning research, 3(Feb), 1137-1155. (PDF)
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. (PDF)
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119). (PDF)
- Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543). (PDF)
- Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606. (PDF)

Topic 6: Dimensionality Reduction

Dimensionality Reduction
Supervised Feature Selection
Unsupervised Feature Extraction: Latent Semantic Analysis (LSA) and t-SNE
R Example
Reading materials
- Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American society for information science, 41(6), 391. (PDF)
- Dumais, S. T. (2004). Latent semantic analysis. Annual review of information science and technology, 38(1), 188-230.
- Maaten, L. V. D., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9(Nov), 2579-2605. (PDF) (Homepage)

Topic 7: Document Similarity & Clustering

Document similarity metrics
Clustering overview
K-Means clustering
Hierarchical clustering
Density-based clustering
Reading materials
- Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM computing surveys (CSUR), 31(3), 264-323. (PDF)

Topic 8-1: Topic Modeling I

Topic modeling overview
Probabilistic Latent Semantic Analysis: pLSA
LDA: Document Generation Process
Reading materials
- Hofmann, T. (1999, July). Probabilistic latent semantic analysis. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence (pp. 289-296). Morgan Kaufmann Publishers Inc. (PDF)
- Hofmann, T. (2017, August). Probabilistic latent semantic indexing. In ACM SIGIR Forum (Vol. 51, No. 2, pp. 211-218). ACM.
- Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84. (PDF)
- Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of machine Learning research, 3(Jan), 993-1022. (PDF)

Topic 8-2: Topic Modeling II

LDA Inference: Gibbs Sampling
LDA Evaluation
Recommended video lectures
- LDA by D. Blei (Lecture Video)
- Variational Inference for LDA by D. Blei (Lecture Video)

Topic 9: Document Classification

Document classification overview
Naive Bayes classifier
RNN-based document classification
CNN-based document classification
Reading materials
- Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. (PDF)
- Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level convolutional networks for text classification. In Advances in neural information processing systems (pp. 649-657) (PDF)
- Lee, G., Jeong, J., Seo, S., Kim, C, & Kang, P. (2018). Sentiment classification with word localization based on weakly supervised learning with a convolutional neural network. Knowledge-Based Systems, 152, 70-82. (PDF)
- Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., & Hovy, E. (2016). Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1480-1489). (PDF)
- Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. (PDF)
- Luong, M. T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025. (PDF)

