Choosing the dataset.
To identify emerging and successful technologies of the last few years, we analyze text documents related to filings for an Initial Public Offering (IPO) on a US stock exchange. For the analysis we choose the dataset that consists only of nouns and adjectives, because it describes technologies and industries better and more precisely than the full dataset. Even though the “full” documents give access to specific sections of the underlying S-1 or S-1/A filings (e.g. the text between the “Summary” and “Use of Proceeds” sections), training the model on those results in a smaller corpus and rather imprecise topics, which do not align with the expected emerging technologies (see Appendix B).
Data preprocessing.
The next important task is to transform the raw data into an analyzable format. First, all documents are converted to lowercase to create consistency within the dataset. Second, punctuation and numbers are removed, since they do not contribute to the text analysis. Third, stopwords are removed so that the focus lies only on the words relevant to each industry. Stemming is not applied, because we want to keep the technology-specific context of the keywords and we work with a vectorized text dataset that contains no pronouns or verbs. Finally, the corpus is created. From the corpus we construct the document-term matrix (DTM) for the development of our topic model, and we set global lower and upper bounds on the number of documents a term may appear in, so that terms related to emerging topics fall within a reasonable threshold.
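To make the pipeline concrete, the following is a minimal sketch of these preprocessing steps in Python with scikit-learn; the toy documents and the min_df/max_df thresholds are illustrative assumptions, not the settings used in the study.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the IPO filings; in the study these are the S-1 texts.
documents = [
    "Our cloud platform delivers secure data analytics to enterprise customers.",
    "The company develops blockchain software for secure payment processing.",
    "We provide cloud infrastructure and data storage for software companies.",
    "Our drug candidates target rare diseases through genetic sequencing.",
]

def clean(text):
    text = text.lower()                    # lowercase for consistency
    return re.sub(r"[^a-z\s]", " ", text)  # strip punctuation and numbers

# min_df / max_df act as the global term-appearance bounds: here a term must
# occur in at least 2 documents but in no more than 90% of them (illustrative
# thresholds, not the ones used in the study).
vectorizer = CountVectorizer(
    preprocessor=clean,
    stop_words="english",
    min_df=2,
    max_df=0.9,
)
dtm = vectorizer.fit_transform(documents)  # documents x terms sparse matrix
terms = vectorizer.get_feature_names_out()
```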
Main method.
To compare topics in recent and older IPO filings, we use one of the most common topic-modelling algorithms, Latent Dirichlet Allocation (LDA). LDA uncovers hidden semantic structures in a set of documents. These structures, theta and beta, represent a probability distribution over topics (within a document) and one over terms (within a topic), respectively (Blei, 2003). The main idea is that each document contains a mixture of latent topics, each topic is characterized by a distribution over words, and the relative importance of the topics varies from document to document. LDA works as follows: it goes through each word in the text, randomly assigns it to one of the topics, calculates a score based on the probability of finding this word in this particular topic across the set of documents, then reassigns the word to another topic and calculates the same score. After many iterations, this yields a list of words per topic with their probabilities.
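A minimal sketch of this fitting step, continuing from the preprocessing sketch above; the topic count is a toy setting, and theta and beta follow the notation used in the text.

```python
from sklearn.decomposition import LatentDirichletAllocation

# Fit LDA on the DTM from the sketch above; 2 topics is a toy setting, the
# study would use far more.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(dtm)  # (n_docs, n_topics): topic mixture per document
beta = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
# beta: (n_topics, n_terms) probability distribution over terms per topic
```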
Approaches.
From here on, two different approaches are applied: Approach A focuses on the log-likelihood function and on a trend analysis over time, while Approach B focuses on the cosine similarity between the topics of newer IPO filings and those of a decade ago.
In Approach B, we compare the IPO filings from a decade ago (2008 - 2010) with more recent ones. Based on how similar the topics are (i.e. how strongly they correlate), we identify the emerging industries. In this approach, the training set consists of the filings received before 2011, 961 files in total. After the preprocessing, we fit an LDA model, from which we extract the most frequent terms per topic (based on the beta parameter), the frequencies of the keywords (used later in the calculations and in the word clouds), and the way individual companies relate to the topics (based on the theta parameter).
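The extraction step could look as follows, continuing from the LDA sketch above; the variable names are our assumptions, not the study's code.

```python
import numpy as np

# Most probable terms per topic, read from beta.
n_top = 3
for k in range(beta.shape[0]):
    top = np.argsort(beta[k])[::-1][:n_top]
    print(f"topic {k}:", ", ".join(terms[i] for i in top))

# Dominant topic per filing, read from theta: relates companies to topics.
dominant_topic = theta.argmax(axis=1)
```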
Afterwards, we process the data for the recent years (filings after 2016) in the same way, with minor adjustments. As a result, we obtain topics defined by the LDA model both for the recent years and for a decade ago. Our aim here is to identify the least correlated topics; correlation is calculated as the cosine similarity between the topic keywords of the “old” corpus and those of the recent years (Hoberg and Phillips, 2016). All results are stored in a correlation matrix, where the rows represent topics from the recent filings and the columns represent topics from a decade ago.
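A hedged sketch of this comparison, with toy stand-ins for the two fitted models; in the study, beta_new and beta_old would come from the recent and the decade-old LDA models, each with its own vocabulary.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the two fitted models' topic-term distributions.
rng = np.random.default_rng(0)
terms_new = np.array(["blockchain", "cloud", "drug", "platform"])
terms_old = np.array(["cloud", "drug", "oil", "platform"])
beta_new = rng.dirichlet(np.ones(4), size=3)  # 3 recent topics over 4 terms
beta_old = rng.dirichlet(np.ones(4), size=3)  # 3 decade-old topics

# Compare topics over the shared vocabulary only.
shared, idx_new, idx_old = np.intersect1d(terms_new, terms_old,
                                          return_indices=True)
corr = cosine_similarity(beta_new[:, idx_new], beta_old[:, idx_old])
# corr[i, j]: similarity of recent topic i (row) to old topic j (column)
```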
Based on the correlation matrix, we first determine the highest correlation for every new topic and then compare these maxima across the new topics. Since we are interested in emerging industries, our key focus is on the least correlated topics of the recent topic model. We manually set 10% as the maximum correlation an emerging industry may have with the old topics. In this way, 12 topics are identified as emerging, based purely on the cosine similarity. However, we need to take into account that the comparison only used data for the years 2008-2011.
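Continuing the sketch, the 10% cutoff can be applied as a simple row-wise maximum over the correlation matrix:

```python
# A recent topic counts as emerging when even its best match among the old
# topics stays below 0.10 (the manually chosen cutoff described above).
best_match = corr.max(axis=1)              # highest correlation per new topic
emerging = np.where(best_match < 0.10)[0]  # indices of the emerging topics
```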
At this stage we can determine which industries the companies belong to and identify the companies that belong to the emerging industries. Afterwards, selected keywords can be analyzed more thoroughly with a keyword-in-context (KWIC) analysis to understand the concepts behind tech-related keywords such as “sequencing”, “blockchain” and “cybersecurity”. Since the topics for both the old and the recent filings are generated automatically during the topic-modelling process, it is important to optimize the final solution and examine the final outcome. For example, an in-depth analysis of the generated topics shows that a lot of SEC-related information on stocks was picked up by the topic models.
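The KWIC analysis can be illustrated with a small helper that prints each occurrence of a keyword together with a window of surrounding words; the regex tokenizer here is our stand-in for whatever tooling was actually used.

```python
import re

def kwic(text, keyword, window=3):
    # Show each hit of `keyword` with `window` words of context on each side.
    tokens = re.findall(r"\w+", text.lower())
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left} [{tok}] {right}")

kwic("Our platform applies blockchain technology to securities settlement.",
     "blockchain")
```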