Port to Python3, GTK3, port mwlib to *-python alts #1

Open
wants to merge 50 commits into
base: master

@srevinsaju srevinsaju commented Dec 19, 2019

Port to Python3 - Wikipedia Activity

This activity is in BETA testing. All testers are welcome.

Known Issues:

  • Tables not working
  • References not working (will be removed)

Installation

Installing the Wikipedia activity is easy for the end user, since they only need to install the .xo file.
The minimal steps to build that bundle are given below. Wikipedia Activity uses compressed .bz2 XML dumps to create articles from Wikipedia.

  • Go to dumps.wikipedia.org and download the latest dump in your preferred language.

  • For my tests, I am downloading the simplewiki dump dated 20200101, which is the latest complete dump as of 2020-01-11.

  • Download the *.bz2 link.
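
The download step above can also be scripted. The following is only a sketch: the host name and directory layout are assumptions based on how the public dump server was organised at the time of writing, and `dump_url` is a hypothetical helper, not part of this repository.

```python
def dump_url(wiki: str, date: str) -> str:
    """Build the URL of a pages-articles multistream dump.

    Assumption: dumps live at <host>/<wiki>/<date>/<file>, which is how
    the public dump server was laid out when this was written.
    """
    filename = f"{wiki}-{date}-pages-articles-multistream.xml.bz2"
    return f"https://dumps.wikimedia.org/{wiki}/{date}/{filename}"


# Example: the Simple English dump used in this walkthrough.
print(dump_url("simplewiki", "20200101"))
```

Check the printed URL in a browser before relying on it, since the layout may change.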

Alternatively, if you are low on data, consider a smaller bz2 for testing. For example, the full English Wikipedia dump is just under 17 GB compressed, while the smallest dumps may be only 40-70 MB.

  • Let's get things into action: clone the GitHub repo and switch to the python3-ss branch
git clone https://github.com/srevinsaju/wikipedia-activity
cd wikipedia-activity
git checkout python3-ss

Pull new changes periodically, as the branch is updated quite frequently.

  • Install the languages
    Locate the directory containing the downloaded bz2 file.
    Change your directory to the language you wish to develop for.
    For example, I am developing en_simple, so
cd en_simple
cp ~/Downloads/*.bz2 .
  • Extract the bz2 file
bzip2 -d <your bzip filename here>

You might need to install bzip2 first if the command is not found; search for the instructions for your distribution.
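
If bzip2 is not available, the same extraction can be done with Python's standard library. This is a minimal sketch, not something the repository ships; `decompress_bz2` is a hypothetical helper name. Unlike `bzip2 -d`, it keeps the compressed file in place.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src: str, chunk_size: int = 1024 * 1024) -> Path:
    """Decompress src (ending in .bz2) next to itself and return the new path."""
    src_path = Path(src)
    if src_path.suffix != ".bz2":
        raise ValueError(f"expected a .bz2 file, got {src_path.name}")
    dst_path = src_path.with_suffix("")  # "dump.xml.bz2" -> "dump.xml"
    # Copy in chunks so a multi-gigabyte dump never has to fit in memory.
    with bz2.open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    return dst_path
```

Call it as `decompress_bz2("your-dump.xml.bz2")` from inside the language folder.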

  • If there are trailing numbers after the .xml extension, consider removing them for better presentation and to prevent errors while creating a bundle.
    For example, change ensimple....wiki.xml3116641313 to ensimple....wiki.xml
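
The rename can be expressed as a small rule: drop any digits that follow the final .xml. The sketch below uses a made-up file name for illustration, and `clean_dump_name` is a hypothetical helper.

```python
import re


def clean_dump_name(name: str) -> str:
    """Strip digits that trail the .xml extension; leave other names alone."""
    return re.sub(r"(\.xml)\d+$", r"\1", name)


# Made-up file name, for illustration only.
print(clean_dump_name("dump.xml3116641313"))
```

Pass the result to `os.rename` (or `mv`) once you have checked it is the name you expect.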

  • Now verify that you have two files, search.db and ensimple....*.xml, in your folder

  • Process the dump file: Execute within the language folder

../tools2/pages_parser.py en_simple   # replace en_simple with your language

NOTE: The process described on the wiki is a bit complicated. The code has been adjusted to make the work easier. This was done in cf6fa12.

It might take minutes (for 70 MB files), hours (for 300 MB files), or weeks (for 16.5 GB database files), depending entirely on your CPU speed. At the moment, pages_parser crawls through the dump to create an index, links, and so on, to make later access faster.
Quoting Gonzalo Odiard:

With the spanish file and a bit more than 2.3M pages, this process takes aprox 1:30 hours in my system.
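
To give a feel for why this step scales with dump size, here is a rough sketch of a single streaming pass over a MediaWiki XML export that collects page titles and redirect targets. It is not the code in pages_parser.py; `scan_pages` and the tiny sample document are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Drop the XML namespace prefix: '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def scan_pages(stream):
    """Yield (title, redirect_target_or_None) for every <page> element."""
    title, redirect = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "redirect":
            redirect = elem.get("title")
        elif name == "page":
            yield title, redirect
            title, redirect = None, None
            elem.clear()  # free the page's children before moving on


# Tiny stand-in for a real dump (assumed shape of the export format).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Sun</title><revision><text>The Sun is a star.</text></revision></page>
  <page><title>Sol</title><redirect title="Sun" /><revision><text>#REDIRECT [[Sun]]</text></revision></page>
</mediawiki>"""

for page_title, target in scan_pages(io.BytesIO(SAMPLE)):
    print(page_title, "->", target)
```

Every page in the dump is visited once, so the run time grows with the number of pages.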

  • You can now confirm that the *.links, *.templates and redirects files exist. If the above process crashes with an error, these files will still be created, but they are likely to be empty. Re-running the above command will not help; you will first have to clean up the unwanted empty files it generated.
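
A quick way to spot the empty leftovers from a crashed run is sketched below. The glob patterns are assumptions derived from the file names mentioned above (the exact name of the redirects file is not given here), and `find_empty_outputs` is a hypothetical helper. It only reports files; it deletes nothing.

```python
from pathlib import Path

# Assumed patterns; adjust them to the file names you actually see.
PATTERNS = ("*.links", "*.templates", "*redirects*")


def find_empty_outputs(folder, patterns=PATTERNS):
    """Return the zero-byte files in folder that match any pattern, sorted."""
    root = Path(folder)
    found = set()
    for pattern in patterns:
        for path in root.glob(pattern):
            if path.is_file() and path.stat().st_size == 0:
                found.add(path)
    return sorted(found)


if __name__ == "__main__":
    for path in find_empty_outputs("."):
        print(f"empty: {path}")
```

Run it from inside the language folder and remove the listed files by hand before re-running the parser.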

  • You can now add or remove entries in the blacklist and favorites txt files. Their use is assumed to be self-explanatory.

  • Perform the command

../tools2/make_selection.py

Alternatively, you can ignore the blacklist and favorites with

../tools2/make_selection.py --all
  • Create the index
../tools2/create_index.py en_simple
  • Test the Index
../tools2/test_index.py en_simple Sun

The above command looks up the en_simple configuration and extracts the wikitext of the article Sun.

  • Generate Activity.info (new)
./setup.py gen_act <path to .xml file>
  • Check the contents of activity/activity.info. Make necessary changes
  • Create the bundle, for example:
./setup.py dist_xo en_simple/simplewiki-20200101-pages-articles-multistream.xml
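
The command sequence above could be collected into one driver script, in the spirit of the "automated wiki-builder" item in the checklist. This is only a sketch: `build_steps` and `run_steps` are hypothetical names, and the working directories are inferred from the steps as written (tools run inside the language folder, setup.py from the repository root).

```python
import subprocess


def build_steps(lang, xml_path, article="Sun", select_all=False):
    """Return (working_directory, argv) pairs in the order described above."""
    tools = "../tools2"
    selection = [f"{tools}/make_selection.py"]
    if select_all:
        selection.append("--all")  # ignore the blacklist and favorites
    return [
        (lang, [f"{tools}/pages_parser.py", lang]),
        (lang, selection),
        (lang, [f"{tools}/create_index.py", lang]),
        (lang, [f"{tools}/test_index.py", lang, article]),
        (".", ["./setup.py", "gen_act", xml_path]),
        (".", ["./setup.py", "dist_xo", xml_path]),
    ]


def run_steps(steps):
    """Run each step in turn and stop at the first failure."""
    for cwd, argv in steps:
        print(f"[{cwd}] {' '.join(argv)}")
        subprocess.run(argv, cwd=cwd, check=True)
```

From the repository root, `run_steps(build_steps("en_simple", "en_simple/simplewiki-20200101-pages-articles-multistream.xml"))` would run the whole chain. Review activity/activity.info between the last two steps if it needs manual changes.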

Dependencies

server

Checklist

  • Port to Python3
  • Port to GTK3
  • Fix server.py
  • Remove search-toolbar dependencies on browse-activity : search-toolbar
  • Fix Search bar
  • Remove mwlib and other files
  • Add other python alternative
  • Fix new tab
  • Fix home page
  • make it more sweet
  • Final Fixes
  • Test on Sugar DE
  • Test on Sugar SOAS
  • Make an automated wiki-builder
  • Write documentation
  • Improve upstream documentation

Credits

@Hrishi1999 for testing this on a new system and helping me find a lot of bugs
@quozl for global info

Disclaimer

This activity has been one of the hardest tasks I have ever taken up. See the dates on this PR for how long it has been worked on.

@quozl

quozl commented Dec 19, 2019

mwlib is from https://github.com/godiard/mwlib

@quozl

quozl commented Dec 19, 2019

mwlib is likely upstream at https://github.com/pediapress/mwlib

@quozl

quozl commented Dec 20, 2019

acb8ff1 adds files at top level of activity, which would be misleading to new developers.

@srevinsaju
Author

@quozl, I will remove the unwanted files, but it's still not working :(
It needs a developer with C language knowledge.

@srevinsaju
Author

srevinsaju commented Dec 21, 2019

@quozl, according to pediapress, porting mwlib is currently not possible. It is not only the string error; the developers mention other errors too, and continuing the Python 3 port of mwlib requires manually compiling the Cython code. The only options are:
(i) Keep mwlib on Python 2 only and port the rest of the activity to Python 3
(ii) Replace mwlib with https://github.com/earwig/mwparserfromhell

@quozl

quozl commented Dec 22, 2019

Alright then, you've explained why our older mwlib has to be ported. Ensure that explanation is in the commit message of the port of our older mwlib.

@srevinsaju
Author

@quozl the creators of mwlib have commented that it is currently not possible to port mwlib to Python 3. For now they are using https://github.com/earwig/mwparserfromhell .
I will have to discard this pull request and start from scratch; there are too many imports, and what each one is needed for is not well commented, which has me going 😭

@quozl

quozl commented Dec 22, 2019

Where did they comment? Why not possible? Everything is possible with code, eventually.

@srevinsaju
Author

@quozl it seems they are working on Python 3.
There is a message from @ckepper about the Python 3 port:

pediapress/mwlib#66 (comment)

Should I try to port it, or wait for the official release?

@quozl

quozl commented Dec 23, 2019

If it were me, I'd offer to help. If you don't feel you can do that in the time available to you, then set it aside.

@srevinsaju
Author

So far, the commit e7b0e8e is the last functional version. The mwlib-simple is not a good alternative.

@quozl

quozl commented Jan 9, 2020

How is this pull request going?

538fde2 seems to add redundant parentheses, as if 2to3 is run twice.

@srevinsaju
Author

I have probably run 2to3 twice on some files, but I will fix it. The good news is that mwlib is fixed by removing the unneeded functions. Please review the .re files, because I don't know whether what I have done is right.
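
One possible way to hunt for the doubled parentheses is a simple text scan for `print((`. The sketch below is a hypothetical helper, not a tool from this repository, and it is deliberately crude: it also flags legitimate calls such as `print((a + b) * c)`, so every hit needs a manual look. The case that matters most is `print(("a", "b"))`, which in Python 3 prints a tuple rather than two values.

```python
import re

# Matches "print((" with optional whitespace between the tokens.
DOUBLE_PAREN = re.compile(r"\bprint\s*\(\s*\(")


def find_double_parens(source: str):
    """Return (line_number, stripped_line) for each suspicious print call."""
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), 1)
        if DOUBLE_PAREN.search(line)
    ]


example = 'print("ok")\nprint(("a", "b"))\nx = 1\n'
for number, line in find_double_parens(example):
    print(f"line {number}: {line}")
```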

@srevinsaju
Author

srevinsaju commented Jan 9, 2020

@quozl please review this: A raw HACK 5be2eea

@quozl

quozl commented Jan 9, 2020

You'd have to explain 5be2eea, as I don't understand it.

Please confirm you have tested creating a new activity bundle from a Wikipedia download, and that the bundle does work on Sugar with Python3?

@srevinsaju
Author

@quozl, when I try to compile mwlib after partially porting it, the compiler asks for a constructor or a destructor:
no constructor or destructor or type conversion found before init_mwscan
This led me to read several articles, but I could not make sense of them. So I just commented out the init_mwscan part, and the error no longer appears while building.
It still gave more errors, such as an import error for _mwscan (for obvious reasons), so I wrapped that import in a try/except. The activity depends on the sugar-web activity, and needs porting there. WIP. I wanted to know if what I did in 5be2eea is right.

PS:
I have changed PyUnicodeObject to PyASCIIObject because of the unicode -> string conversions, and this is valid. I don't know about the rest.
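
For reference, the try/except import described above usually looks something like the sketch below. `optional_import` is a hypothetical helper and not the code in 5be2eea; it only shows the pattern of treating a compiled module as optional, so that callers can check for `None` instead of crashing at import time.

```python
import importlib


def optional_import(name: str):
    """Return the imported module, or None if it cannot be imported."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# _mwscan is the compiled scanner module discussed above.
_mwscan = optional_import("_mwscan")
if _mwscan is None:
    print("_mwscan is not available; code that needs it must handle that case")
```

This hides the build failure rather than fixing it: anything that actually calls into the module still needs either a working extension or a pure-Python replacement.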

@srevinsaju srevinsaju requested a review from godiard January 15, 2020 21:17
@srevinsaju
Author

@walterbender @Hrishi1999 @chimosky Please review. Thanks

@quozl

quozl commented Jan 21, 2020

Thanks. Reviewed, not tested.

  • seems to be a lot of whitespace change to the .re files, such that the real changes are invisible,
  • the port of mwlib is incomplete; there's commented code, and mwlib is still referenced by server; please remove any code not needed,
  • some other files, like pylru.py, have been rewrapped, again can't see the real changes if any; or did it come from somewhere else?,
  • can you include pijnu as source rather than a tar.gz?

Overall I'm concerned that the removal of mwlib makes the creation of a new activity bundle a very expensive CPU operation. If it can be ported instead, it should be better. On the other hand, we've failed to keep the activity bundle up to date anyway, and a reasonable update rate may be once or twice every year.

@quozl quozl mentioned this pull request Feb 25, 2020
@srevinsaju
Author

@quozl I can remove all the mwlib resources. Should I proceed with that, then?

Regarding CPU usage: to get the best performance at the lowest CPU usage, it is important to port mwlib rather than use mwparser. The new mwparser has a lot of problems, for example with tables and images. However, there is no development in pediapress/mwlib, and I have no understanding of the C files either. Is there anything I can do for this PR?

@quozl

quozl commented Apr 14, 2020

Sure. Keep mwlib. Learn C. It's not hard. Python is based on C. Give yourself about ten hours to get started, and about a week to build the rest of the knowledge required. It is common in software engineering to learn a language only enough to fix something. In this situation, although I know C, and I know Python, what I don't know yet is how to write an extension for Python in C, nor how to do it for both Python 2 and Python 3. That's the skill set needed for mwlib porting. It is all empirical though; nothing needs to be a mystery. For myself I don't need to do this yet, because I've added Python 2 support to OLPC OS next release.

@quozl

quozl commented Apr 20, 2020

Chapter 8 of Supporting Python 3 covers Migrating C Extensions.

@srevinsaju
Author

Yes, I read that during GCI and tried it too, but maybe I didn't do it well. I had learnt C briefly, mostly the syntax, but pointers were way beyond my understanding. I guess I need a real-life program that uses C with Python libraries to work on, so that I can understand them properly. I may try again in the future, maybe in a few months.

2 participants