update solomon islands analysis
ccxzhang committed Sep 12, 2023
1 parent 6b2831d commit de9c824
Showing 6 changed files with 8,305 additions and 7,558 deletions.
41 changes: 41 additions & 0 deletions data/text/solomon_islands/st_ldavis.html

Large diffs are not rendered by default.

182 changes: 162 additions & 20 deletions scripts/notebooks/text/PACNEWS.ipynb
@@ -3,7 +3,7 @@
{
"cell_type": "code",
"execution_count": 1,
"id": "7f762fdb",
"id": "0bc993d4",
"metadata": {},
"outputs": [],
"source": [
@@ -20,16 +20,16 @@
},
{
"cell_type": "markdown",
"id": "958d3ebf",
"id": "b3dbfca0",
"metadata": {},
"source": [
"## FACTIVA"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "2aa29868",
"execution_count": 3,
"id": "21ad0297",
"metadata": {},
"outputs": [],
"source": [
@@ -50,7 +50,7 @@
},
{
"cell_type": "markdown",
"id": "9adfc80f",
"id": "2f1d617c",
"metadata": {},
"source": [
"Converting RTF file to TXT by following commands on Mac:\n",
@@ -59,8 +59,8 @@
},
{
"cell_type": "code",
"execution_count": 3,
"id": "39e6dc0e",
"execution_count": 5,
"id": "ad7c9a95",
"metadata": {
"scrolled": false
},
@@ -75,7 +75,6 @@
" entry_lst = entry.strip().split(\"\\n\\n\")\n",
" title = entry_lst[0]\n",
" date = entry_lst[1].split(\"\\n\")[1]\n",
" if date \n",
" entry_length = len(entry_lst)\n",
" if idx == len(entries) - 1:\n",
" content = \"\".join((entry_lst[i]) for i in range(entry_length)\n",
@@ -90,8 +89,8 @@
},
{
"cell_type": "code",
"execution_count": 12,
"id": "7442868f",
"execution_count": 6,
"id": "f94aeee4",
"metadata": {
"scrolled": false
},
@@ -124,7 +123,7 @@
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2160</th>\n",
" <th>3560</th>\n",
" <td>Another group claims legitimacy over represent...</td>\n",
" <td>461 words</td>\n",
" <td>NOUMEA, June 20 -- Another group has emerged c...</td>\n",
@@ -135,13 +134,13 @@
],
"text/plain": [
" title date \\\n",
"2160 Another group claims legitimacy over represent... 461 words \n",
"3560 Another group claims legitimacy over represent... 461 words \n",
"\n",
" news \n",
"2160 NOUMEA, June 20 -- Another group has emerged c... "
"3560 NOUMEA, June 20 -- Another group has emerged c... "
]
},
"execution_count": 12,
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
@@ -155,9 +154,152 @@
"news_df[news_df.date == \"461 words\"]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "c0d0baed",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>title</th>\n",
" <th>date</th>\n",
" <th>news</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Visa issue put to Gillard</td>\n",
" <td>10 May 2013</td>\n",
" <td>PORT MORESBY, May 10 -- Papua New Guinea Prime...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>ECP could be revived - Australia looks to stre...</td>\n",
" <td>10 May 2013</td>\n",
" <td>PORT MORESBY, May 10 -- More Australian police...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Vanuatu National Provident Fund team visits SINPF</td>\n",
" <td>10 May 2013</td>\n",
" <td>HONIARA, May 10 -- A delegation from the Vanua...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>CNMI eyes tourism office in Russia</td>\n",
" <td>10 May 2013</td>\n",
" <td>SAIPAN, May 10 -- Due to the overwhelming grow...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Solomons' Gizo airport to close for major upgr...</td>\n",
" <td>10 May 2013</td>\n",
" <td>HONIARA, May 10 -- The Solomon Islands' Gizo A...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9895</th>\n",
" <td>Fiji eyes deals with PNG provinces</td>\n",
" <td>14 October 2013</td>\n",
" <td>PORT MORESBY, Oct. 14 -- Fiji is establishing ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9896</th>\n",
" <td>0.8pc blind in Fiji</td>\n",
" <td>14 October 2013</td>\n",
" <td>SUVA, Oct. 14 -- The prevalence of blindness i...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9897</th>\n",
" <td>Fiji calls for greater support from EU</td>\n",
" <td>14 October 2013</td>\n",
" <td>SIGATOKA, Oct. 14 -- Fiji wants the European U...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9898</th>\n",
" <td>Intervention by Vanuatu's Minister for Tourism...</td>\n",
" <td>14 October 2013</td>\n",
" <td>BRUSSELS, Oct. 14 -- On Economic Partnership A...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9899</th>\n",
" <td>UN health agency launches initiative to phase ...</td>\n",
" <td>14 October 2013</td>\n",
" <td>GENEVA, Oct. 14 -- The United Nations World He...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>9201 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" title date \\\n",
"0 Visa issue put to Gillard 10 May 2013 \n",
"1 ECP could be revived - Australia looks to stre... 10 May 2013 \n",
"2 Vanuatu National Provident Fund team visits SINPF 10 May 2013 \n",
"3 CNMI eyes tourism office in Russia 10 May 2013 \n",
"4 Solomons' Gizo airport to close for major upgr... 10 May 2013 \n",
"... ... ... \n",
"9895 Fiji eyes deals with PNG provinces 14 October 2013 \n",
"9896 0.8pc blind in Fiji 14 October 2013 \n",
"9897 Fiji calls for greater support from EU 14 October 2013 \n",
"9898 Intervention by Vanuatu's Minister for Tourism... 14 October 2013 \n",
"9899 UN health agency launches initiative to phase ... 14 October 2013 \n",
"\n",
" news \n",
"0 PORT MORESBY, May 10 -- Papua New Guinea Prime... \n",
"1 PORT MORESBY, May 10 -- More Australian police... \n",
"2 HONIARA, May 10 -- A delegation from the Vanua... \n",
"3 SAIPAN, May 10 -- Due to the overwhelming grow... \n",
"4 HONIARA, May 10 -- The Solomon Islands' Gizo A... \n",
"... ... \n",
"9895 PORT MORESBY, Oct. 14 -- Fiji is establishing ... \n",
"9896 SUVA, Oct. 14 -- The prevalence of blindness i... \n",
"9897 SIGATOKA, Oct. 14 -- Fiji wants the European U... \n",
"9898 BRUSSELS, Oct. 14 -- On Economic Partnership A... \n",
"9899 GENEVA, Oct. 14 -- The United Nations World He... \n",
"\n",
"[9201 rows x 3 columns]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"news_df.drop_duplicates()"
]
},
{
"cell_type": "markdown",
"id": "3bfed0fa",
"id": "cac7ecca",
"metadata": {},
"source": [
"## Scraping"
@@ -166,7 +308,7 @@
{
"cell_type": "code",
"execution_count": 2,
"id": "c23311a9",
"id": "e7e94c3c",
"metadata": {
"scrolled": false
},
@@ -204,7 +346,7 @@
{
"cell_type": "code",
"execution_count": 6,
"id": "81e01610",
"id": "8bf6438d",
"metadata": {},
"outputs": [],
"source": [
@@ -220,7 +362,7 @@
{
"cell_type": "code",
"execution_count": 41,
"id": "ba3edfe8",
"id": "8a2aa7da",
"metadata": {},
"outputs": [],
"source": [
@@ -236,7 +378,7 @@
{
"cell_type": "code",
"execution_count": 42,
"id": "389ebd70",
"id": "be7c08f9",
"metadata": {},
"outputs": [
{
@@ -275,7 +417,7 @@
{
"cell_type": "code",
"execution_count": 50,
"id": "6839e297",
"id": "1f40f2a0",
"metadata": {},
"outputs": [],
"source": [
2 changes: 1 addition & 1 deletion scripts/notebooks/text/solomon.ipynb
@@ -111,7 +111,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
"version": "3.9.0"
},
"toc": {
"base_numbering": 1,