text-mining

Unstructured Data Analysis (Graduate) @Korea University

Notice

  • Syllabus (download)
  • Term project groups
    • Group 1: 박성훈, 이수빈 (2018021120), 이준걸, 박혜준
    • Group 2: 이정호, 천우진, 유초롱, 조규원
    • Group 3: 백승호, 목충협, 변준형, 이영재
    • Group 4: 박건빈, 이수빈 (2018020530), 변윤선, 권순찬
    • Group 5: 최종현, 이정훈, 박중민, 노영빈
    • Group 6: 백인성, 김은비, 신욱수, 강현규
    • Group 7: 전성찬, 박현지, 문관영
    • Group 8: 조용원, 정승섭, 민다빈, 최민서
    • Group 9: 박명현, 장은아, 유건령

Recommended courses

Schedule

Topic 1: Introduction to Text Analytics

  • The usefulness of large amounts of text data and the associated challenges
  • Overview of text analytics methods

Topic 2: From Texts to Data

  • Text data collection: Web scraping

Topic 3: Text Preprocessing

  • Introduction to Natural Language Processing (NLP)
  • Lexical analysis
  • Syntax analysis
  • Other topics in NLP
  • Reading materials
    • Cambria, E., & White, B. (2014). Jumping NLP curves: A review of natural language processing research. IEEE Computational intelligence magazine, 9(2), 48-57. (PDF)
    • Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug), 2493-2537. (PDF)
    • Young, T., Hazarika, D., Poria, S., & Cambria, E. (2017). Recent trends in deep learning based natural language processing. arXiv preprint arXiv:1708.02709. (PDF)

Topic 4: Neural Networks Basics

  • Perceptron, Multi-layer Perceptron
  • Convolutional Neural Networks (CNN)
  • Recurrent Neural Networks (RNN)
  • Practical Techniques

Topic 5-1: Document Representation I: Classic Methods

  • Bag of words
  • Word weighting
  • N-grams
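
As a rough illustration of the three ideas listed above, the following sketch builds a bag of words, extracts n-grams, and applies one common word-weighting scheme (TF-IDF with raw term counts and `idf = log(N / df)`). It is not taken from the course materials: the function names and the toy documents are made up for this example, and other TF-IDF variants (smoothed or normalized) are equally valid.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(tokens):
    """Count how often each token occurs, ignoring word order."""
    return Counter(tokens)


def tf_idf(docs):
    """Weight each term by (raw count in the document) * log(N / document frequency).

    docs is a list of token lists; the result is one dict per document.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        term_freq = Counter(tokens)
        weights.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in term_freq.items()}
        )
    return weights


# Toy corpus (hypothetical), tokenized by whitespace only.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]
bow = bag_of_words(docs[0])
bigrams = ngrams(docs[0], 2)
weights = tf_idf(docs)
```

Terms that occur in every document, such as "the" here, receive a weight of zero under this scheme, while terms unique to one document are weighted up.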

Topic 5-2: Document Representation II: Distributed Representation

  • Word2Vec
  • GloVe
  • FastText
  • Doc2Vec
  • Reading materials
    • Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of machine learning research, 3(Feb), 1137-1155. (PDF)
    • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. (PDF)
    • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119). (PDF)
    • Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543). (PDF)
    • Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606. (PDF)

Topic 6: Dimensionality Reduction

  • Dimensionality Reduction
  • Supervised Feature Selection
  • Unsupervised Feature Extraction: Latent Semantic Analysis (LSA) and t-SNE
  • R Example
  • Reading materials
    • Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American society for information science, 41(6), 391-407. (PDF)
    • Dumais, S. T. (2004). Latent semantic analysis. Annual review of information science and technology, 38(1), 188-230.
    • Maaten, L. V. D., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9(Nov), 2579-2605. (PDF) (Homepage)

Topic 7: Document Similarity & Clustering

  • Document similarity metrics
  • Clustering overview
  • K-Means clustering
  • Hierarchical clustering
  • Density-based clustering
  • Reading materials
    • Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM computing surveys (CSUR), 31(3), 264-323. (PDF)
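
One widely used document similarity metric is cosine similarity between term-count vectors. The sketch below is a minimal illustration, not the course's own code: `cosine_similarity` and the sample documents are hypothetical, and the vectors are plain bag-of-words counts with no weighting.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors (dict-like)."""
    shared = a.keys() & b.keys()
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)


# Hypothetical documents, tokenized by whitespace only.
doc_a = Counter("text mining finds patterns in text".split())
doc_b = Counter("text mining finds topics in text".split())
doc_c = Counter("clustering groups similar documents".split())

sim_ab = cosine_similarity(doc_a, doc_b)  # many shared terms
sim_ac = cosine_similarity(doc_a, doc_c)  # no shared terms
```

Because cosine similarity depends only on the direction of the vectors, it is insensitive to document length, which is one reason it is a common choice before clustering.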

Topic 8-1: Topic Modeling I

  • Topic modeling overview
  • Probabilistic Latent Semantic Analysis: pLSA
  • LDA: Document Generation Process
  • Reading materials
    • Hofmann, T. (1999, July). Probabilistic latent semantic analysis. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence (pp. 289-296). Morgan Kaufmann Publishers Inc. (PDF)
    • Hofmann, T. (2017, August). Probabilistic latent semantic indexing. In ACM SIGIR Forum (Vol. 51, No. 2, pp. 211-218). ACM.
    • Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84. (PDF)
    • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of machine Learning research, 3(Jan), 993-1022. (PDF)

Topic 8-2: Topic Modeling II

  • LDA Inference: Gibbs Sampling
  • LDA Evaluation
  • Recommended video lectures

Topic 9: Document Classification

  • Document classification overview
  • Naive Bayesian classifier
  • RNN-based document classification
  • CNN-based document classification
  • Reading materials
    • Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. (PDF)
    • Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level convolutional networks for text classification. In Advances in neural information processing systems (pp. 649-657). (PDF)
    • Lee, G., Jeong, J., Seo, S., Kim, C., & Kang, P. (2018). Sentiment classification with word localization based on weakly supervised learning with a convolutional neural network. Knowledge-Based Systems, 152, 70-82. (PDF)
    • Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., & Hovy, E. (2016). Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1480-1489). (PDF)
    • Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. (PDF)
    • Luong, M. T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025. (PDF)
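
For the Naive Bayes item in this topic, a minimal sketch of a multinomial Naive Bayes document classifier with add-one (Laplace) smoothing might look like the following. It is an illustration under stated assumptions rather than the course's implementation: the function names, the decision to skip words unseen in training, and the tiny training set are all invented for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, tokens):
    """Return the class with the highest log posterior for a token list."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / n_docs)  # log prior
        total = sum(word_counts[label].values())
        for t in tokens:
            if t not in vocab:
                continue  # assumption: ignore words never seen in training
            # Add-one smoothing keeps unseen (word, class) pairs from zeroing the score.
            score += math.log((word_counts[label][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: (tokens, label) pairs.
train = [
    ("great fun movie".split(), "pos"),
    ("great acting".split(), "pos"),
    ("boring slow movie".split(), "neg"),
    ("slow dull plot".split(), "neg"),
]
model = train_naive_bayes(train)
label = predict(model, "great fun".split())
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.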
