\section{Introduction}
\label{sec:intro}
Named entity recognition (NER) is one of the first and most important steps in Information Extraction pipelines.
Its goal is to identify mentions of entities (persons, locations, organizations, etc.)
in unstructured text.
However, the diverse and noisy nature of user-generated content, together with emerging entities that have novel surface forms, makes NER in social media messages particularly challenging.
The first challenge brought by user-generated content
%such as Twitter, Reddit and YouTube,
is its unique characteristics: short, noisy and informal.
For instance, tweets are typically short since they are restricted to 140 characters, and people tend to post short messages even on platforms without such restrictions, such as YouTube comments and Reddit posts.\footnote{The average length of the sentences in this shared task is about 20 tokens per sentence.}
Hence, the contextual information in a sentence is very limited.
Apart from that,
the use of colloquial language makes it difficult to reuse existing NER approaches, most of which target the general domain and formal text~\cite{baldwin2015shared, derczynski2015analysis}.
%State-of-the-art NER softwares (e.g. Standford Corenlp) are less effective on such social media messages~\cite{derczynski2015analysis}.
%Due to the informal and contemporary nature of these micro-posts, performance still lags far behind that on formal text genres such as newswire.
Another challenge of NER in noisy text is the large number of emerging named entities and rare surface forms in user-generated text, which tend to be tougher to detect~\cite{augenstein2017generalisation}, making recall a significant problem~\cite{derczynski2015analysis}.
By way of example, the surface form ``\textit{kktny}'', in the tweet ``so.. \textit{kktny} in 30 mins?'',
actually refers to a new TV series called ``\textit{Kourtney and Kim Take New York}'', which even human experts find hard to recognize.
Additionally, netizens quite often mention entities using rare morphs as surface forms.
For example, ``\textit{black mamba}'', the name of a venomous snake, is actually a morph that
Kobe Bryant created for himself to reflect his aggressive style of play in basketball
games~\cite{DBLP:conf/acl/ZhangHPLLJKWSHY15}.
Such morphs and rare surface forms are also very difficult to detect and classify.
%This task will evaluate the ability to detect and classify novel, emerging, singleton named entities in noisy text.
%
%Detecting commonly-mentioned entities tends to be easier than the rarer, more unusual surface forms. Similarly, entities with unusual surface forms, or that are simply rare, tend to be tougher to detect~\cite{augenstein2017generalisation}, with recall being a significant problem in rapidly-changing text types~\cite{derczynski2015analysis}. However, the entities that are common in newly-emerging texts such as newswire or social media are often new, not having been mentioned in prior datasets. This poses a challenge to NER systems, where in many deployments, unusual, previously-unseen entities need to be detected reliably and with high recall. In the shared task, we are provided with turbulent data containing few repeated entities, drawn from rapidly-changing text types or sources of non-mainstream entities.
The goal of this paper is to present our system participating in the \textit{Novel and Emerging Named Entity Recognition} shared task at the EMNLP 2017 Workshop on Noisy User-generated Text (W-NUT 2017), which aims for NER in such noisy user-generated text.
%\footnote{\url{http://noisy-text.github.io/2017/emerging-rare-entities.html}}
We investigate a multi-channel BiLSTM-CRF neural network model in our participating system, which is described in~\secref{sec:approach}.
The details of our implementation are presented in \secref{sec:eval}, where we also draw some conclusions from our experiments.