release version 2.4
huseinzol05 committed Jun 1, 2019
1 parent c3a77bc commit 8a090a8
Showing 104 changed files with 22,223 additions and 17,876 deletions.
11 changes: 7 additions & 4 deletions README.rst
@@ -42,7 +42,7 @@ Features

- **Emotion Analysis**

From BERT, Fast-Text, Dynamic-Memory Network, Sparse Tensorflow, Attention Neural Network to build deep emotion analysis models.
From Attention-Recurrent model, Sparse Tensorflow, Self-Attention to build deep emotion analysis models.
- **Entities Recognition**

Latest state-of-the-art CRF deep learning models for Named Entity Recognition.
@@ -66,16 +66,19 @@ Features
- **ELMO (biLM)**

Provide pretrained bahasa wikipedia and bahasa news ELMO, with easy interface and visualization.
- **Relevancy Analysis**

From Dilated Convolutional Neural Network and Self-Attention to build deep relevancy analysis models.
- **Sentiment Analysis**

From BERT, Fast-Text, Dynamic-Memory Network, Sparse Tensorflow, Attention Neural Network to build deep sentiment analysis models.
From Attention-Recurrent model, Sparse Tensorflow and Self-Attention to build deep sentiment analysis models.
- **Spell Correction**

Using local Malaysian NLP research to auto-correct any bahasa words.
- Stemmer
- **Subjectivity Analysis**

From BERT, Fast-Text, Dynamic-Memory Network, Sparse Tensorflow, Attention Neural Network to build deep subjectivity analysis models.
From Attention-Recurrent model, Sparse Tensorflow and Self-Attention to build deep subjectivity analysis models.
- **Summarization**

Using state-of-the-art skip-thought with attention to give precise unsupervised summarization.
@@ -84,7 +87,7 @@ Features
Provide LDA2Vec, LDA, NMF and LSA interface for easy topic modelling with topics visualization.
- **Toxicity Analysis**

From BERT, Fast-Text, Dynamic-Memory Network, Attention Neural Network to build deep toxicity analysis models.
From Attention-Recurrent model, Self-Attention to build deep toxicity analysis models.
- **Word2Vec**

Provide pretrained bahasa wikipedia and bahasa news Word2Vec, with easy interface and visualization.
9 changes: 6 additions & 3 deletions docs/Api.rst
@@ -69,9 +69,6 @@ malaya.normalize
.. automodule:: malaya.normalize
:members:

.. autoclass:: malaya.normalize._DEEP_NORMALIZER()
:members:

.. autoclass:: malaya.normalize._SPELL_NORMALIZE()
:members:

@@ -96,6 +93,12 @@ malaya.preprocessing
.. automodule:: malaya.preprocessing
:members:

malaya.relevancy
------------------

.. automodule:: malaya.relevancy
:members:

malaya.sentiment
-----------------

254 changes: 8 additions & 246 deletions docs/Dataset.rst
@@ -1,259 +1,21 @@
Dataset
=======

.. raw:: html

<p align="center">
<a href="#readme">
<img alt="logo" width="70%" src="https://raw.githubusercontent.com/huseinzol05/Malaya-Dataset/master/karangan-sekolah/wordcloud.png">
</a>
</p>

We want to open-source not just the code but also the datasets, so
everyone can validate our results.

You can check in
`Malaya-Dataset <https://github.com/huseinzol05/Malaya-Dataset>`__ for
our open dataset.

`Article <https://github.com/huseinzol05/Malaya-Dataset/blob/master/articles>`__
--------------------------------------------------------------------------------

Total size: 3.1 MB

1. Filem
2. Kerajaan
3. Pembelajaran
4. Pendidikan
5. Sekolah

`Dependency <https://github.com/huseinzol05/Malaya-Dataset/blob/master/dependency>`__
-------------------------------------------------------------------------------------

`Dictionary, 24550 unique words <https://github.com/huseinzol05/Malaya-Dataset/blob/master/dictionary>`__
---------------------------------------------------------------------------------------------------------

`Emotion <https://github.com/huseinzol05/Malaya-Dataset/blob/master/emotion>`__
-------------------------------------------------------------------------------

Total size: 8.5 MB

1. Anger
2. Fear
3. Joy
4. Love
5. Sadness
6. Surprise

`Gender <https://github.com/huseinzol05/Malaya-Dataset/blob/master/gender>`__
-----------------------------------------------------------------------------

Total size: 2.2 MB

1. Unknown
2. Male
3. Female
4. Brand

`Irony <https://github.com/huseinzol05/Malaya-Dataset/blob/master/irony>`__
---------------------------------------------------------------------------

Total size: 100 KB

1. Positive
2. Negative

`Entities, JSON <https://github.com/huseinzol05/Malaya-Dataset/blob/master/entities>`__
---------------------------------------------------------------------------------------

Total size: 1.1 MB

1. OTHER - Other
2. law - law, regulation, related law documents, documents, etc
3. location - location, place
4. organization - organization, company, government, facilities, etc
5. person - person, group of people, believes, etc
6. quantity - numbers, quantity
7. time - date, day, time, etc
8. event - unique event happened, etc

`Karangan sekolah <https://github.com/huseinzol05/Malaya-Dataset/blob/master/karangan-sekolah>`__
-------------------------------------------------------------------------------------------------

Total size: 221 KB

`Language-detection, Wikipedia <https://github.com/huseinzol05/Malaya-Dataset/blob/master/language-detection>`__
----------------------------------------------------------------------------------------------------------------

`News, crawled <https://github.com/huseinzol05/Malaya-Dataset/blob/master/news>`__
----------------------------------------------------------------------------------

Total size: 28.9 MB

.. raw:: html

<details>

Complete list (51 news)

1. Cuti sekolah
2. isu 1MDB
3. isu agama
4. isu agong
5. isu agrikultur
6. isu air
7. isu anwar ibrahim
8. isu artis
9. isu astro
10. isu bahasa melayu
11. isu barisan nasional
12. isu cikgu
13. isu cukai
14. isu cyberjaya
15. isu dunia
16. isu ekonomi
17. isu gst
18. isu harakah
19. isu harga
20. isu icerd
21. isu imigren
22. isu kapitalis
23. isu kerajaan
24. isu kesihatan
25. isu kuala lumpur
26. isu lgbt
27. isu mahathir
28. isu makanan
29. isu malaysia airlines
30. isu malaysia
31. isu minyak
32. isu isu najib razak
33. isu pelajar
34. isu pelakon
35. isu pembangkang
36. isu perkauman
37. isu permainan
38. isu pertanian
39. isu politik
40. isu rosmah
41. isu sabah
42. isu sarawak
43. isu sosial media
44. isu sultan melayu
45. isu teknologi
46. isu TM
47. isu ubat
48. isu universiti
49. isu wan azizah
50. peluang pekerjaan
51. perkahwinan

.. raw:: html

</details>

`Sentiment News <https://github.com/huseinzol05/Malaya-Dataset/blob/master/news-sentiment>`__
---------------------------------------------------------------------------------------------

Total size: 496 KB

1. Positive
2. Negative

`Sentiment Twitter <https://github.com/huseinzol05/Malaya-Dataset/blob/master/twitter-sentiment>`__
---------------------------------------------------------------------------------------------------

Total size: 50.6 MB

1. Positive
2. Negative

`Sentiment Multidomain <https://github.com/huseinzol05/Malaya-Dataset/blob/master/multidomain-sentiment>`__
-----------------------------------------------------------------------------------------------------------

159 KB

1. Amazon review, Positive and Negative
2. IMDB review, Positive and Negative
3. Yelp review, Positive and Negative

`Part-of-Speech <https://github.com/huseinzol05/Malaya-Dataset/blob/master/part-of-speech>`__
---------------------------------------------------------------------------------------------

Total size: 3.1 MB

1. ADJ - Adjective, kata sifat
2. ADP - Adposition
3. ADV - Adverb, kata keterangan
4. ADX - Auxiliary verb, kata kerja tambahan
5. CCONJ - Coordinating conjunction, kata hubung
6. DET - Determiner, kata penentu
7. NOUN - Noun, kata nama
8. NUM - Number, nombor
9. PART - Particle
10. PRON - Pronoun, kata ganti
11. PROPN - Proper noun, kata ganti nama khas
12. SCONJ - Subordinating conjunction
13. SYM - Symbol
14. VERB - Verb, kata kerja
15. X - Other

`Polarity <https://github.com/huseinzol05/Malaya-Dataset/blob/master/polarity>`__
---------------------------------------------------------------------------------

Total size: 1.3 MB

1. Positive
2. Negative

`Political landscape <https://github.com/huseinzol05/Malaya-Dataset/blob/master/political-landscape>`__
-------------------------------------------------------------------------------------------------------

Total size: 2 MB

1. Kerajaan
2. Pembangkang

`Sarcastic news-headline <https://github.com/huseinzol05/Malaya-Dataset/blob/master/sarcastic-news-headline>`__
---------------------------------------------------------------------------------------------------------------

1. Positive
2. Negative

`Stemmer <https://github.com/huseinzol05/Malaya-Dataset/blob/master/stemmer>`__
-------------------------------------------------------------------------------

Total size: 6.5 MB

1. News stemming
2. Wikipedia stemming

`Subjectivity <https://github.com/huseinzol05/Malaya-Dataset/blob/master/subjectivity>`__
-----------------------------------------------------------------------------------------

Total size: 1.4 MB

1. Positive
2. Negative

`Toxicity <https://github.com/huseinzol05/Malaya-Dataset/blob/master/toxicity>`__
-----------------------------------------------------------------------------------------

Total size: 70 MB

Toxicity is multilabel, so prefer sigmoid-based models.

1. toxic
2. severe toxic
3. obscene
4. threat
5. insult
6. identity hate

`Subtitle <https://github.com/huseinzol05/Malaya-Dataset/blob/master/subtitle>`__
---------------------------------------------------------------------------------

Total size: 1.5 MB

Suggestion
----------

1. Always apply text augmentation, such as swapping words using
synonyms or a thesaurus. I am still waiting for a response from a third party to
open-source a Bahasa thesaurus.

Citation
--------

9 changes: 9 additions & 0 deletions docs/Relevancy.rst
@@ -0,0 +1,9 @@
Relevancy Analysis
===================

.. note::

This tutorial is available as an IPython notebook
`here <https://github.com/huseinzol05/Malaya/tree/master/example/relevancy>`_.

.. include:: load-relevancy.rst
1 change: 1 addition & 0 deletions docs/index.rst
@@ -34,6 +34,7 @@ Contents:
Num2word
Pos
Preprocessing
Relevancy
Sentiment
Similarity
Spell