Spanish data from Spanish Wiktionary #360
Comments
I'd recommend finding the common page structure of the Spanish Wiktionary; they may have a help page explaining how the sections are arranged. That should give you a general idea of how a computer could parse the structure. The new Spanish extractor also needs some language and subtitles JSON data; you can find examples in the "data" directory. The language JSON files are created by the code in the "languages" folder.
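To give a concrete idea of what that config data encodes — the actual keys and values live in the JSON files under "data/", and the Spanish headings and English keys below are invented for illustration:

```python
# Illustrative only: the kind of section-title mappings the extractor
# config carries. The real values live in JSON files under "data/";
# these Spanish headings and English keys are assumptions.
POS_SUBTITLES = {
    "Sustantivo masculino": "noun",
    "Sustantivo femenino": "noun",
    "Verbo transitivo": "verb",
    "Adjetivo": "adj",
}

OTHER_SUBTITLES = {
    "Etimología": "etymology",
    "Traducciones": "translations",
    "Pronunciación": "pronunciation",
}
```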
Thanks for the quick reply! Yes, I have found a page that explains the structure of entries in es.wiktionary, and it seems all entries are supposed to be built from POS-specific templates, so they have a template for each POS. Seems pretty standard, but I am not sure how difficult it would be to implement. I'll try to take a look. I'm guessing these are what you refer to with the languages and subtitles data, right? Since my main interest is the Spanish dictionary data about Spanish itself, is it necessary to do the languages one? Or is it used for etymologies, etc. as well? I tried to read parse_page() from fr, but I don't understand much; I guess I could use ChatGPT to explain it to me. In any case, it looks a bit daunting for me at this point, to be honest. I might need to learn some Python before I dare to give it a try. I am leaving all these links here mostly for easy retrieval in case I can come back to this at some point, or in case someone else comes around with the skills and will to tackle it in the meantime, if that is ok with you guys. :) Thanks again!
Leaving links to useful data is perfectly fine. It would be possible to modify wiktextract to only parse Spanish entries, but that would take about as much effort as implementing the extractor's config data fully. Good luck with your efforts; Python is a great starting point if you're new to programming, and easy to pick up if you're not.
Hi, I'm also interested in contributing to the Spanish data extraction, and also Russian, btw, for a pet project of mine (a multilingual word coach). I'm pretty proficient at programming and I've scraped web data before, using Scrapy. I'm going to look at the code, elsewhere in this repo, and around the internet to see how this can be done, as long as my motivation is up and I have the time. I'd like to join forces with anyone already working on this. There's an unmerged pull request for Spanish under discussion. Can you accept this PR under a new "Spanish development/WIP" branch or something? I want to work based on that PR, but I'm new to collaborating in git/GitHub using pull requests and I'm not sure how to go about this.
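(For what it's worth, a common way to build on an unmerged PR without any special branch on the upstream repo: GitHub exposes every open pull request as a read-only ref, so `git fetch origin pull/<number>/head:es-wip` followed by `git checkout es-wip` gives you a local branch containing the PR's commits to work on top of.)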
@999pingGG, that's great to hear! More help is always welcome. 🤗 I am the author of PR #392, and the reason it is taking some time to get merged is that I am also proposing the introduction of pydantic classes to deal with data validation and schema creation, which means some things need to be properly discussed. In fact, I have already done more work on the Spanish extractor than the current PR covers. You can see what else I did here: https://github.com/empiriker/wiktextract/tree/es-ahead The plan is to submit these additions once the current PR is accepted. You can contribute to an extractor for a particular Wiktionary edition in two ways: a) going deep, i.e. improving the extraction quality and covering more special cases, or b) going wide, i.e. finding sections in Wiktionary that our current scripts don't extract yet. In the case of my Spanish extractor, the etymology and morphology sections aren't parsed yet, and I don't plan to implement them either. The benefit of tackling a separate section is that your code would be more or less independent. Otherwise, a good place to start are the debug messages the extractor logs. I don't know the best way to contribute to the WIP of the Spanish extractor either. You could work based on my es-ahead branch, but I am pretty sure my code will have changed at least a bit by the time it gets merged. FYI, I am planning to add the same scope of support to the Russian Wiktionary as I did for the German and Spanish ones, i.e. extracting glosses, examples, translations, pronunciation and linkages, but I haven't gotten very far yet. I look forward to your first PR! 🔥
@empiriker I see that your PR has now been merged, congrats! I'd like to get my feet wet in the codebase, based on your Spanish branch, by fixing (the underlying cause of) some of those debug messages and implementing extraction for new sections, etymology and morphology, since you said that code would be more independent. But I need some help getting started: how can I set up a development environment? Yesterday I cloned the master branch and tried to modify the code and experiment, but I'm a bit lost, since this project has a setup I've never seen in my (small) Python experience. When you install following the instructions, an executable binary is generated for the `wiktwords` command.
You will want to install wiktextract in editable mode (e.g. `python -m pip install -e .`) so that your local changes take effect. You can then run the test suite, and run `wiktwords` on the Spanish dump to build the page database. What I do then is make a copy of the db and delete all but a few thousand pages, so each run is fast. The Spanish etymology section should be reached here: `src/wiktextract/extractor/es/page.py`, lines 26 to 51 (at commit 06e2e0e).
You can add another section handler there. Good luck!
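As a rough illustration of what such a handler might look like — the names, signatures, and data layout below are assumptions for the sketch, not the actual wiktextract API; the real dispatch code lives in `src/wiktextract/extractor/es/page.py`:

```python
# Hypothetical sketch of an etymology section handler for the es extractor.
# Function names, signatures and data layout here are illustrative
# assumptions, not the real wiktextract API.
from wiktextract.page import clean_node  # location of clean_node may differ


def process_etymology_block(wxr, page_data, level_node):
    """Store the plain text of an "Etimología" section on the current entry."""
    # clean_node flattens wikitext nodes into text, expanding templates as
    # needed; a fuller handler would also pull out linked source words.
    etymology_text = clean_node(wxr, None, level_node.children)
    if etymology_text:
        page_data[-1]["etymology_text"] = etymology_text


# ...hooked into the section dispatch with something along these lines:
#
#     elif section_title == "Etimología":
#         process_etymology_block(wxr, page_data, level_node)
```

And regarding the db-sampling step mentioned above, a minimal sketch, assuming the page cache is a SQLite file with a `pages` table that has `title` and `namespace_id` columns (check the real schema with `.schema pages` first):

```python
# Shrink a copy of the page database for fast iteration. Pages outside
# namespace 0 (templates, modules, ...) are kept, since the extractor
# needs them to expand templates. Table/column names are assumptions.
import shutil
import sqlite3

shutil.copy("eswiktionary.db", "eswiktionary-sample.db")
conn = sqlite3.connect("eswiktionary-sample.db")
conn.execute(
    "CREATE TEMP TABLE keep AS "
    "SELECT title FROM pages WHERE namespace_id = 0 LIMIT 3000"
)
conn.execute(
    "DELETE FROM pages WHERE namespace_id = 0 "
    "AND title NOT IN (SELECT title FROM keep)"
)
conn.commit()
conn.execute("VACUUM")  # actually shrink the file on disk
conn.close()
```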
Thanks for your help. I have read most, but not all, of the README, and I didn't imagine iterating on development would be like running tests. I apologize if this is getting too off-topic, but I wanted to say that having all the data in a SQLite database is very neat and handy; I can quickly see for any word what is being parsed. Now let's write some code!
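For instance, something along these lines — assuming a `pages` table with `title` and `body` columns, which is worth double-checking against the actual schema:

```python
# Peek at the raw wikitext stored for one word in the page database.
# ASSUMPTION: "pages" has "title" and "body" columns; verify the schema.
import sqlite3

conn = sqlite3.connect("eswiktionary.db")
row = conn.execute(
    "SELECT body FROM pages WHERE title = ?", ("perro",)
).fetchone()
print(row[0] if row else "page not found")
conn.close()
```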
This is thanks to @xxyzz, who implemented everything about it. Before that, we just had a big cache file (and smaller page files used for debugging and reading the source wikitext).
Iterating on development for me covers a couple of steps: sampling the page database down, running the extractor on the sample, inspecting the output and the debug messages, adjusting the code, and re-running the tests.
It goes without saying: not necessarily all the steps, all the time, in that particular order. When sampling down, keep in mind that in some editions the page namespace (id=0) has a large percentage of redirect pages (ca. 40% for the Russian edition). So with 3000 words you would effectively be developing on only 1800 pages. Depending on your personal trade-off between speed and variety during development, that might be enough for you.
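The sampling sketch from earlier can account for this by filtering redirects out of the keep-set — assuming redirect pages are marked with a non-NULL `redirect_to` column, which should again be verified against the real schema:

```python
# Variant of the earlier sampling snippet: exclude redirects, which can
# be ~40% of namespace 0 in some editions. ASSUMPTION: redirect pages
# carry a non-NULL "redirect_to" value; ordinary pages have NULL there.
import sqlite3

conn = sqlite3.connect("eswiktionary-sample.db")
conn.execute(
    "CREATE TEMP TABLE keep AS "
    "SELECT title FROM pages "
    "WHERE namespace_id = 0 AND redirect_to IS NULL "
    "LIMIT 3000"
)
conn.execute(
    "DELETE FROM pages WHERE namespace_id = 0 "
    "AND title NOT IN (SELECT title FROM keep)"
)
conn.commit()
conn.execute("VACUUM")
conn.close()
```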
Hello everybody, sorry for the self-advertising, but maybe some of you can find this useful. First, a bit of context: almost a year ago I came across this project while searching for a way to get data from the Spanish Wiktionary in a usable format. But after being too lazy to understand the code and learn wikitext, I decided it would be faster to implement my own solution that extracts data from the rendered HTML. Spoiler: it wasn't. But in the end I created Wikjote, a Python package that parses the Spanish Wiktionary HTML into JSON. It doesn't work directly with the rendered HTML; instead, it uses the ZIM files created by Kiwix. It is still in early development and only parses the HTML into a simplified JSON structure, but that is enough to be functional. Don't get me wrong, I think wiktextract is superior because it works directly with the dumps, and probably once it parses the Spanish Wiktionary correctly my package will become obsolete, but in the meantime some of you may find it useful. You can find the entire Wiktionary as JSON here.
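For anyone curious about the ZIM route, reading one page's HTML out of a Kiwix ZIM file looks roughly like this — a sketch assuming the `python-libzim` bindings and an "A/<title>" entry-path layout, which can vary between ZIM builds:

```python
# Rough sketch of pulling one page's HTML out of a Kiwix ZIM file.
# ASSUMPTIONS: the python-libzim package (pip install libzim) and an
# "A/<title>" entry path convention, which may differ per ZIM build.
from libzim.reader import Archive

zim = Archive("wiktionary_es_all.zim")
entry = zim.get_entry_by_path("A/perro")
html = bytes(entry.get_item().content).decode("utf-8")
print(html[:500])
```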
Is the HTML in the ZIM file simplified from Wiktionary? Some data might be lost in the ZIM files. You could get better HTML files, created by Parsoid, that preserve template arguments: https://dumps.wikimedia.org/other/enterprise_html/runs Vuizur has a repo that parses this format for the Russian Wiktionary: https://github.com/Vuizur/ruwiktionary-htmldump-parser But these new HTML dump files are unstable; they often lack many pages or contain many duplicates (the file size varies by several GB between runs). If that issue gets fixed, parsing HTML should be faster, because the templates are already expanded. And we now have a Spanish extractor thanks to empiriker's contributions; you can find the code here: https://github.com/tatuylonen/wiktextract/tree/master/src/wiktextract/extractor/es
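To give an idea of that format: the enterprise dumps are tar archives of newline-delimited JSON, one object per page. The field names used below (`name`, `article_body.html`) are from memory and should be checked against an actual dump file:

```python
# Rough sketch of streaming pages out of a Wikimedia Enterprise HTML dump.
# ASSUMPTIONS: a .tar.gz of NDJSON where each line is a JSON object with
# "name" (page title) and "article_body"."html" (Parsoid HTML) -- verify
# the field names against a real dump before relying on this.
import json
import tarfile

with tarfile.open("eswiktionary-NS0-ENTERPRISE-HTML.json.tar.gz", "r:gz") as tar:
    for member in tar:
        f = tar.extractfile(member)
        if f is None:
            continue
        for line in f:
            page = json.loads(line)
            title = page.get("name")
            html = page.get("article_body", {}).get("html", "")
            # ...hand the already-expanded Parsoid HTML to your parser here...
            print(title, len(html))
```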
Thank you @xxyzz, I was not aware of those dumps; it is worth taking a look at them, since having more data sources is a good idea. And that is very good news, I need to try it out.
Hi there. This project seems fantastic, congratulations on the great work!
I have been reading around and I've seen that right now only the en, fr and zh wiktionaries are supported. I read somewhere that you were working on a setup where only a config file would be needed to parse other editions; I am guessing these are the files in wiktextract/extractor.
I would love to help with the Spanish Wiktionary extractor; unfortunately, I am not a programmer, so even if I can try to read some code, I get lost very, very fast.
I saw @xxyzz was working on an "es" branch at some point, but couldn't find any more info about it. Is there something that I could do to help advance this?
My main interest is extracting Spanish-language data (POS, etymologies, pronunciations, definitions, linkages, inflection, etc.) from es.wiktionary, since for other languages I think en.wiktionary is already enough in most cases. For this use case, do you think I could use wiktextract as-is, or could I modify it myself? Or do you recommend using something else (like DBnary, which seems harder to understand)?
Thanks in any case and kudos again!