This tool, "Many-knowledge," takes in messages in various formats and filters and scores them with my library "sona toki". The results are then inserted into EdgeDB. From there, you can count the frequency of n-grams up to length 6, then export that info to a SQLite database... for reasons.
- Discord, via the DiscordChatExporter by Tyrrrz.
- Telegram, via its desktop-native export functionality.
- Reddit, via the work of /u/Watchful1.
Assuming you have the above data, install your dependencies:
pdm install
Run the main script:
pdm run python ./src/sonamute/__main__.py
Then follow the prompts:
Specify the format of your data, then where it is on your filesystem.
If you've already fetched data, or have already chosen to fetch data, this will calculate frequencies from that data.
Provide a filename for your SQLite file, then a path for it.
This, or answering "no" to "Do you want to perform another action?", will run the specified actions.
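For a rough idea of what "count the frequency of n-grams up to length 6, then export to SQLite" involves, here is a minimal sketch. It is not this tool's actual code: the function names, the `frequency` table, and its columns are all made up for illustration, and it assumes sentences have already been tokenized into lists of words.

```python
import sqlite3
from collections import Counter

MAX_NGRAM_LEN = 6  # the longest n-gram length mentioned above


def count_ngrams(sentences, max_len=MAX_NGRAM_LEN):
    """Count every n-gram of length 1..max_len in pre-tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_len + 1):
            # Slide a window of width n across the sentence.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i : i + n])] += 1
    return counts


def export_counts(counts, db_path):
    """Write the counts to a SQLite file (hypothetical schema)."""
    con = sqlite3.connect(db_path)
    with con:  # commits on success, rolls back on error
        con.execute(
            "CREATE TABLE IF NOT EXISTS frequency ("
            "term TEXT PRIMARY KEY, "
            "length INTEGER NOT NULL, "
            "occurrences INTEGER NOT NULL)"
        )
        con.executemany(
            "INSERT OR REPLACE INTO frequency VALUES (?, ?, ?)",
            [(" ".join(gram), len(gram), n) for gram, n in counts.items()],
        )
    con.close()
```

Calling `export_counts(count_ngrams([["mi", "pona"], ["mi", "pona", "a"]]), "freq.sqlite")` would leave one row per distinct n-gram in `freq.sqlite`.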
Most of what I want to do is in the issues on this repo. The rest is either meta, or not directly related to code in this library.
"hey why doesn't this code have tests" i had more excitement than foresight when i wrote this code
This tool doesn't do any of its own work detecting Toki Pona sentences. Instead, that's up to sona toki, my library to help you detect Toki Pona being spoken!
If you'd like to contribute, see the issues on that library!
This is in order of how important the platform is to include!
To an outsider, it may seem odd to want to include one specific forum in this data, but this forum is important because it was active from 1 Oct 2009 to, mostly, mid-2020. This period is virtually unrepresented in the data I currently have, and this space is one of only a handful that were in use during that time. As such, this forum is highly important to the history of Toki Pona.
I do already have a backup of this data, but actually adding it to the database is difficult. I lack user IDs, post IDs, and properly formatted quotes. That's because I used this backup tool, and frankly, it's not very good. I have thus far not needed to make my own scrapers for any of the data I've collected, but this one may be different. If you have a better phpBB scraper, or otherwise a cleaner capture of this data, please reach out!
From some time in 2002 until 1 Oct 2009, the Toki Pona Yahoo group was one of very few spaces where Toki Pona was regularly spoken, and it appears to have been the most popular. The IRC channel could have been more popular, but as far as I'm aware it wasn't preserved. Fortunately, the entire Yahoo group is backed up on the forum above. Unfortunately, its formatting is badly mangled because its newlines are missing. If that weren't enough, its formatting was already highly inconsistent, since email formatting varies from provider to provider. Including it in the database as-is would be messy and uninformative, or even misleading; it needs some pre-processing effort.
There are several Toki Pona communities on Facebook, here, here, here, and here. The majority of their activity is in a period similar to that of Discord (that is, from 2020 onward), but they have much more pre-2020 activity than most other communities that existed around that time. Unfortunately, scraping data from Facebook is extremely difficult, and I have not fully explored doing so as a result.
There are at least two LiveJournal blogs that focused on Toki Pona, here and here, which were active in a similar time period to the forum or Yahoo group.
kulupu.pona.la was a forum hosted by mazziechai which closed abruptly in November 2023 due to challenging circumstances. The forum was archived fully by Mazzie before the shutdown, but the format is pure HTML, making it a bit obnoxious to get the necessary data out of it.
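As a starting point for getting data out of a pure-HTML archive, here is a minimal sketch using only Python's standard-library `html.parser`. I am assuming, purely for illustration, that each post body sits inside an element with the class `post-content`; the real archive's markup may look nothing like this, and the sketch does not attempt to recover authors, timestamps, or IDs.

```python
from html.parser import HTMLParser

POST_CLASS = "post-content"  # hypothetical class name, not taken from the archive
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # tags with no closing tag


class PostExtractor(HTMLParser):
    """Collect the text inside every element carrying POST_CLASS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.posts = []
        self._depth = 0  # nesting depth inside the current post; 0 = outside
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            # Keep line breaks inside a post; ignore other void tags.
            if self._depth and tag == "br":
                self._chunks.append("\n")
            return
        if self._depth:
            self._depth += 1
        elif POST_CLASS in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # The post's outermost element just closed.
            self.posts.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_posts(html):
    """Return the text of each post found in one HTML page."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return parser.posts
```

The depth counter is there so that nested tags inside a post (bold text, links, quotes) don't end the post early. It relies on the markup being reasonably well-formed, which an archived forum page may not be.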
While /u/Watchful1 has done an admirable job of scraping data from Reddit after the death of Reddit's API, they have understandably stopped short of capturing literally all the data on the platform. They only look at the top 40,000 subreddits, which means only /r/tokipona is included. I would love to include /r/mi_lon, /r/tokiponataso, and any others I can, but scraping this myself doesn't seem realistic. The API is gone, and the user API only lets you scroll through 1000 posts total. I'm unsure what to do about this.