
Increasemental xmldump (--xmlrevisions) #24

Merged
merged 1 commit into v4-main from inc-xmlrevions on Jul 23, 2024
Conversation

yzqzss
Member

@yzqzss yzqzss commented Jul 9, 2024

Closes #23 (Incremental dump)

To generate an incremental xmldump:

  • --xmlrevisions must be used
python3 -m wikiteam3.tools.get_arvcontinue <xmlfile> # Get the arv_continue value of the previous --xmlrevisions xmldump
ARVCONTINUE="<arv_continue>" wikiteam3dumpgenerator --xml --xmlrevisions ... # Generate a new incremental xmldump
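
For example, the whole cycle in bash might look like this (a sketch; the wiki URL, dump directory, and ARVCONTINUE value are placeholders, not real values):

# Initial full dump with --xmlrevisions:
wikiteam3dumpgenerator https://wiki.example.org --xml --xmlrevisions
# Later, read the continuation point from the previous dump's history XML;
# this prints something like ARVCONTINUE="20240101000000|12345":
python3 -m wikiteam3.tools.get_arvcontinue ./wiki.example.org-20240101-wikidump/wiki.example.org-20240101-history.xml
# Feed that value back in to dump only the newer revisions:
ARVCONTINUE="20240101000000|12345" wikiteam3dumpgenerator https://wiki.example.org --xml --xmlrevisions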

Video: 2024-07-09.18-53-24.mp4

@yzqzss yzqzss mentioned this pull request Jul 9, 2024
@yzqzss yzqzss changed the title from "feat: increasemental xmldump (--xmlrevisions)" to "Increasemental xmldump (--xmlrevisions)" on Jul 9, 2024
@yzqzss yzqzss modified the milestones: 4.2.7, 4.3.0 Jul 9, 2024
@Superraptor

This may be a dumb question @yzqzss -- I'm trying to do more tests with this, but it's refusing to run the dump, saying "A dump of this wiki was uploaded to IA in the last 365 days. Aborting."; would you happen to know a way to bypass this message? Thank you so so much!

@Superraptor

Wait... ignore me! Just found the --force parameter!

@yzqzss yzqzss merged commit 1cdbd9c into v4-main Jul 23, 2024
12 checks passed
@yzqzss yzqzss deleted the inc-xmlrevions branch July 23, 2024 11:06
@Superraptor

@yzqzss so I'm running on a Windows laptop now using an Anaconda environment (Python 3.12), and I'm running into an error; first I run the following:

# Installs from the incremental XML revisions branch. (I did this prior to the deletion above)
pip3 install git+https://github.com/saveweb/wikiteam3.git@inc-xmlrevions

# Create a dump.
wikiteam3dumpgenerator https://lgbtdb.wikibase.cloud --xml --xmlrevisions --force

# Get the arv_continue value of the dump.
# This prints ARVCONTINUE="20240723134615|148752" 
python -m wikiteam3.tools.get_arvcontinue ./lgbtdb.wikibase.cloud_w-20240723-wikidump/lgbtdb.wikibase.cloud_w-20240723-history.xml

# Update the dump, using the ARVCONTINUE variable.
ARVCONTINUE="20240723134615|148752" wikiteam3dumpgenerator --xml --xmlrevisions https://lgbtdb.wikibase.cloud --force

The final command gives me the error:

ARVCONTINUE=20240723134615|148752 : The term 'ARVCONTINUE=20240723134615|148752' is not recognized as the name of a cmdlet, function, script file, or 
operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
At line:1 char:1
+ ARVCONTINUE="20240723134615|148752" wikiteam3dumpgenerator --xml --xm ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (ARVCONTINUE=20240723134615|148752:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException

Thanks so much!

@yzqzss
Member Author

yzqzss commented Jul 23, 2024

According to Set | Microsoft Learn:

The characters <, >, |, &, ^ are special command shell characters and must be either preceded by the escape character (^) or enclosed in quotation marks when used in a string (for example, "StringContaining&Symbol"). If you use quotation marks to enclose a string containing one of the special characters, the quotation marks are set as part of the environment variable value.

So, this should work for you (note that set with caret escaping is cmd.exe syntax):

set ARVCONTINUE=20240723134615^|148752
wikiteam3dumpgenerator ...options...

@Superraptor

Thanks so much for the quick response! Hmmm... So running set ARVCONTINUE=20240723134615^|148752 results in:

At line:1 char:33
+ set ARVCONTINUE=20240723134615^|148752
+                                 ~~~~~~
Expressions are only allowed as the first element of a pipeline.
    + CategoryInfo          : ParserError: (:) [], ParentContainsErrorRecordException
    + FullyQualifiedErrorId : ExpressionsMustBeFirstInPipeline

And set ARVCONTINUE=20240723134615|148752 provides a similar error:

At line:1 char:32
+ set ARVCONTINUE=20240723134615|148752
+                                ~~~~~~
Expressions are only allowed as the first element of a pipeline.
    + CategoryInfo          : ParserError: (:) [], ParentContainsErrorRecordException
    + FullyQualifiedErrorId : ExpressionsMustBeFirstInPipeline
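
Worth noting: these errors come from PowerShell, not cmd.exe. In PowerShell, set is an alias for Set-Variable and an unescaped | starts a pipeline, which is why both forms fail to parse. A quick way to confirm the alias:

Get-Alias set   # reports that set maps to Set-Variable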

Funnily enough, when just using quotes (set ARVCONTINUE="20240723134615|148752") the command appears to work, but then running wikiteam3dumpgenerator --xml --xmlrevisions https://lgbtdb.wikibase.cloud --force starts a completely new dump:

...
Retrieving the XML for every page from the beginning

Trying to export all revisions from namespace -1
Trying to get wikitext from the allrevisions API and to build the XML
[arvcontinue]:
...

Running ARVCONTINUE="20240723134615|148752" wikiteam3dumpgenerator --xml --xmlrevisions https://lgbtdb.wikibase.cloud --force results in the same error as before the set command:

ARVCONTINUE=20240723134615|148752 : The term 'ARVCONTINUE=20240723134615|148752' is not recognized as the name of a cmdlet, function, script file, or 
operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
At line:1 char:1
+ ARVCONTINUE="20240723134615|148752" wikiteam3dumpgenerator --xml --xm ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (ARVCONTINUE=20240723134615|148752:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException

I'm going to try a couple more potential solutions but just want to document this behavior!

@Superraptor

I tried using PowerShell's Set-Variable instead of set with a number of combinations, but I can't get anything to work.

@yzqzss
Member Author

yzqzss commented Jul 23, 2024

$Env:ARVCONTINUE = '20240723134615|148752'

How about this one?

@Superraptor

I ran it first as two commands:

$Env:ARVCONTINUE = '20240723134615|148752'
wikiteam3dumpgenerator --xml --xmlrevisions https://lgbtdb.wikibase.cloud --force

Then as these two commands:

$Env:ARVCONTINUE = '20240723134615|148752'
ARVCONTINUE = '20240723134615|148752' wikiteam3dumpgenerator --xml --xmlrevisions https://lgbtdb.wikibase.cloud --force

And then as one command:

$Env:ARVCONTINUE = '20240723134615|148752' wikiteam3dumpgenerator --xml --xmlrevisions https://lgbtdb.wikibase.cloud --force

Fails each time. The first version runs, but quits after seeing that the previous dump appears to be finished (despite config.json having "xmlrevisions": true). The second fails with "ARVCONTINUE is not recognized..."; the third fails with "Unexpected token 'wikiteam3dumpgenerator' in expression or statement". Very strange all around.

@yzqzss
Member Author

yzqzss commented Jul 23, 2024

The first version runs, but quits after seeing that the previous dump appears to be finished

Move the previous dump somewhere else, make sure the current working directory is clean, then rerun:

$Env:ARVCONTINUE = '20240723134615|148752'
wikiteam3dumpgenerator --xml --xmlrevisions https://lgbtdb.wikibase.cloud --force

Or use --path to specify another dump name.
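
For example (a sketch; the directory name is arbitrary, and --path is the same option mentioned in the tool's own log output above):

$Env:ARVCONTINUE = '20240723134615|148752'
wikiteam3dumpgenerator --xml --xmlrevisions https://lgbtdb.wikibase.cloud --force --path lgbtdb-incremental-wikidump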

@Superraptor

It worked!!!! Adding the --path parameter did it! So sorry to bother but thank you so so much!

@Superraptor

Not to return again (sorry), but I'm working on a bash implementation (using Git Bash). It ran once successfully with the following:

# Get most recent dump folder.
export LATESTMODIFIEDFOLDER=$(ls -td ./res/dumps/*/ | head -n1)

# Get the dump XML from the dump folder.
export HISTORYXML=$(find $LATESTMODIFIEDFOLDER -name "*-history.xml")

# Check if ARVCONTINUE can be found.
export ARVCONTINUEARG=$(python -m wikiteam3.tools.get_arvcontinue "$HISTORYXML")
ARVCONTINUEVAR=$(cut -d "=" -f2 <<< "$ARVCONTINUEARG")
ARVCONTINUEVAR=$(echo "$ARVCONTINUEVAR" | tr '"' "'")

# Update existing dump.
ARVCONTINUE=$ARVCONTINUEVAR wikiteam3dumpgenerator --xml --xmlrevisions https://lgbtdb.wikibase.cloud --force

However, when attempting to run a second time I get this error:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\[X]\anaconda3\envs\py312\Lib\site-packages\wikiteam3\tools\get_arvcontinue.py", line 21, in <module>
    main()
  File "C:\Users\[X]\anaconda3\envs\py312\Lib\site-packages\wikiteam3\tools\get_arvcontinue.py", line 17, in main
    lastArvcontinue = lastPage.attrib['arvcontinue']
                      ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
  File "src\\lxml\\etree.pyx", line 2545, in lxml.etree._Attrib.__getitem__
KeyError: 'arvcontinue'
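
The KeyError means the last <page> element in that history XML has no arvcontinue attribute, which fits a previous run that aborted before a complete page was written (see the log below). A defensive version of the lookup could look like this; a hypothetical sketch, not the shipped get_arvcontinue:

# Sketch: read the arvcontinue attribute of the last <page>, guarding
# against an incomplete dump instead of raising a bare KeyError.
from lxml import etree

tree = etree.parse("history.xml")  # path is illustrative
pages = tree.getroot().findall("{*}page")  # {*} matches any XML namespace
arvcontinue = pages[-1].attrib.get("arvcontinue") if pages else None
if arvcontinue is None:
    raise SystemExit("no arvcontinue on the last <page>; the previous dump may be incomplete")
print(f'ARVCONTINUE="{arvcontinue}"')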

I was confused by this, so I checked the log for the wikiteam3dumpgenerator run and it's surprisingly short:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\[X]\anaconda3\envs\py312\Scripts\wikiteam3dumpgenerator.exe\__main__.py", line 7, in <module>
  File "C:\Users\[X]\anaconda3\envs\py312\Lib\site-packages\wikiteam3\dumpgenerator\__init__.py", line 7, in main
    DumpGenerator()
  File "C:\Users\[X]\anaconda3\envs\py312\Lib\site-packages\wikiteam3\dumpgenerator\dump\generator.py", line 85, in __init__
    DumpGenerator.createNewDump(config=config, other=other)
  File "C:\Users\[X]\anaconda3\envs\py312\Lib\site-packages\wikiteam3\dumpgenerator\dump\generator.py", line 111, in createNewDump
    generate_XML_dump(config=config, session=other.session)
  File "C:\Users\[X]\anaconda3\envs\py312\Lib\site-packages\wikiteam3\dumpgenerator\dump\xmldump\xml_dump.py", line 142, in generate_XML_dump
    doXMLRevisionDump(config, session, xmlfile, lastPage, useAllrevisions=True)
  File "C:\Users\[X]\anaconda3\envs\py312\Lib\site-packages\wikiteam3\dumpgenerator\dump\xmldump\xml_dump.py", line 29, in doXMLRevisionDump
    for xml in getXMLRevisions(config=config, session=session, lastPage=lastPage, useAllrevision=useAllrevisions):
  File "C:\Users\[X]\anaconda3\envs\py312\Lib\site-packages\wikiteam3\dumpgenerator\dump\page\xmlrev\xml_revisions.py", line 85, in getXMLRevisionsByAllRevisions
    allrevs_response = site.api(
                       ^^^^^^^^^
  File "C:\Users\[X]\anaconda3\envs\py312\Lib\site-packages\mwclient\client.py", line 288, in api
    if self.handle_api_result(info, sleeper=sleeper):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\[X]\anaconda3\envs\py312\Lib\site-packages\mwclient\client.py", line 333, in handle_api_result
    raise errors.APIError(info['error']['code'],
mwclient.errors.APIError: ('internal_api_error_Wikimedia\\Timestamp\\TimestampException', '[ecbbee4431efb2f469b2d32a] Caught exception of type Wikimedia\\Timestamp\\TimestampException', None)
Checking API... https://lgbtdb.wikibase.cloud/w/api.php
API is OK:  https://lgbtdb.wikibase.cloud/w/api.php
Checking index.php... https://lgbtdb.wikibase.cloud/w/index.php
check_index(): Trying Special:Random...
POST https://lgbtdb.wikibase.cloud/w/index.php {'title': 'Special:Random'} 302
GET https://lgbtdb.wikibase.cloud/wiki/Item:Q14106 {'title': 'Special:Random'} 200
index.php available probability: 90% (0.9)
index.php is OK
No --path argument provided. Defaulting to:
  [working_directory]/[domain_prefix]-[date]-wikidump
Which expands to:
  ./lgbtdb.wikibase.cloud_w-20240725-wikidump
--delay is the default value of 1.5
There will be a 1.5 second delay between HTTP calls in order to keep the server from timing you out.
If you know that this is unnecessary, you can manually specify '--delay 0.0'.
Undo monkey patch...
#########################################################################
# Welcome to DumpGenerator 4.2.6 by WikiTeam3 (GPL v3)                  #
# More info at: <https://github.com/saveweb/wikiteam3>                  #
# Copyright (C) 2011-2024 WikiTeam developers                           #
#########################################################################

Analysing https://lgbtdb.wikibase.cloud/w/api.php
Trying generating a new dump into a new directory...
https://lgbtdb.wikibase.cloud/w/api.php
Getting the XML header from the API

Retrieving the XML for every page from the beginning

Using [env]ARVCONTINUE='20240723164712|148952'


[NOTE] DO NOT use wikiteam3uploader to upload incremental xmldump to Internet Archive, we haven't implemented it yet


Trying to export all revisions from namespace -1
Trying to get wikitext from the allrevisions API and to build the XML
[arvcontinue]: '20240723164712|148952'

I'm very confused by the error message TimestampException... Not well-versed enough in Wikibase to know if this is an issue with the Wikibase server (it appears to be working in the browser and with WikibaseIntegrator calls so I don't think it's an API issue?) or an issue with the wikiteam3dumpgenerator package (given the nature of the exception I'm not sure this is the case either?). Regardless, the resulting dump is really small, which is not surprising after seeing the logs:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.11/ http://www.mediawiki.org/xml/export-0.11.xsd" version="0.11" xml:lang="en">
  <siteinfo>
    <sitename>lgbtDB</sitename>
    <dbname>mwdb_9417beabc5</dbname>
    <base>https://lgbtdb.wikibase.cloud/wiki/Main_Page</base>
    <generator>MediaWiki 1.39.7</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Special</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Talk</namespace>
      <namespace key="2" case="first-letter">User</namespace>
      <namespace key="3" case="first-letter">User talk</namespace>
      <namespace key="4" case="first-letter">Project</namespace>
      <namespace key="5" case="first-letter">Project talk</namespace>
      <namespace key="6" case="first-letter">File</namespace>
      <namespace key="7" case="first-letter">File talk</namespace>
      <namespace key="8" case="first-letter">MediaWiki</namespace>
      <namespace key="9" case="first-letter">MediaWiki talk</namespace>
      <namespace key="10" case="first-letter">Template</namespace>
      <namespace key="11" case="first-letter">Template talk</namespace>
      <namespace key="12" case="first-letter">Help</namespace>
      <namespace key="13" case="first-letter">Help talk</namespace>
      <namespace key="14" case="first-letter">Category</namespace>
      <namespace key="15" case="first-letter">Category talk</namespace>
      <namespace key="120" case="first-letter">Item</namespace>
      <namespace key="121" case="first-letter">Item talk</namespace>
      <namespace key="122" case="first-letter">Property</namespace>
      <namespace key="123" case="first-letter">Property talk</namespace>
      <namespace key="146" case="first-letter">Lexeme</namespace>
      <namespace key="147" case="first-letter">Lexeme talk</namespace>
      <namespace key="640" case="first-letter">EntitySchema</namespace>
      <namespace key="641" case="first-letter">EntitySchema talk</namespace>
      <namespace key="828" case="first-letter">Module</namespace>
      <namespace key="829" case="first-letter">Module talk</namespace>
    </namespaces>
  </siteinfo>

The config.json and xmlrevisions_incremental_dump.mark appear normal, but the siteinfo.json, SpecialVersion.html, all_dumped.mark, and index.html were not generated as usual. I'm going to continue trying to self-diagnose, but just wanted to document!
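
One rough way to check whether the server itself rejects the continuation token (rather than something mangling it client-side) is to call the allrevisions API directly; a sketch, with the pipe percent-encoded as %7C:

curl 'https://lgbtdb.wikibase.cloud/w/api.php?action=query&list=allrevisions&arvlimit=1&arvcontinue=20240723164712%7C148952&format=json'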

@Superraptor

Interestingly, it works when running as normal on the command-line (using ARVCONTINUE='20240723164712|148952' wikiteam3dumpgenerator --xml --xmlrevisions https://lgbtdb.wikibase.cloud --force):

Checking API... https://lgbtdb.wikibase.cloud/w/api.php
API is OK:  https://lgbtdb.wikibase.cloud/w/api.php
Checking index.php... https://lgbtdb.wikibase.cloud/w/index.php
check_index(): Trying Special:Random...
POST https://lgbtdb.wikibase.cloud/w/index.php {'title': 'Special:Random'} 302
GET https://lgbtdb.wikibase.cloud/wiki/Item:Q22094 {'title': 'Special:Random'} 200
index.php available probability: 90% (0.9)
index.php is OK
No --path argument provided. Defaulting to:
  [working_directory]/[domain_prefix]-[date]-wikidump
Which expands to:
  ./lgbtdb.wikibase.cloud_w-20240725-wikidump
--delay is the default value of 1.5
There will be a 1.5 second delay between HTTP calls in order to keep the server from timing you out.
If you know that this is unnecessary, you can manually specify '--delay 0.0'.
Undo monkey patch...
#########################################################################
# Welcome to DumpGenerator 4.2.6 by WikiTeam3 (GPL v3)                  #
# More info at: <https://github.com/saveweb/wikiteam3>                  #
# Copyright (C) 2011-2024 WikiTeam developers                           #
#########################################################################

Analysing https://lgbtdb.wikibase.cloud/w/api.php
Trying generating a new dump into a new directory...
https://lgbtdb.wikibase.cloud/w/api.php
Getting the XML header from the API

Retrieving the XML for every page from the beginning

Using [env]ARVCONTINUE=20240723164712|148952


[NOTE] DO NOT use wikiteam3uploader to upload incremental xmldump to Internet Archive, we haven't implemented it yet


Trying to export all revisions from namespace -1
Trying to get wikitext from the allrevisions API and to build the XML
[arvcontinue]: 20240723164712|148952
Item:Q23225, 1 edits
Item:Q23224, 1 edits
Item:Q15840, 1 edits
Item:Q23226, 1 edits
Item:Q15858, 1 edits
Item:Q23227, 3 edits
Item:Q21999, 1 edits
...

Very strange all around... so it may be a bash issue.

@Superraptor

Figured it out! You have to remove the leading/trailing quotes; I did this by running ARVCONTINUEVAR="${ARVCONTINUEVAR//\"/}". Gotta love the intricacies of bash scripting!
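
For reference, the full working Git Bash sequence then becomes (a sketch assembled from the commands above; paths assume the same ./res/dumps layout):

# Most recent dump folder.
LATESTMODIFIEDFOLDER=$(ls -td ./res/dumps/*/ | head -n1)
# The history XML inside it.
HISTORYXML=$(find "$LATESTMODIFIEDFOLDER" -name "*-history.xml")
# get_arvcontinue prints ARVCONTINUE="<timestamp>|<id>"; keep only the value
# and strip the literal double quotes around it.
ARVCONTINUEARG=$(python -m wikiteam3.tools.get_arvcontinue "$HISTORYXML")
ARVCONTINUEVAR=$(cut -d "=" -f2 <<< "$ARVCONTINUEARG")
ARVCONTINUEVAR="${ARVCONTINUEVAR//\"/}"
# Incremental run.
ARVCONTINUE=$ARVCONTINUEVAR wikiteam3dumpgenerator --xml --xmlrevisions https://lgbtdb.wikibase.cloud --force

This also explains the earlier TimestampException: the failing run logged Using [env]ARVCONTINUE='20240723164712|148952' with literal quotes in the value, while the working run logged it bare, so the API was being asked to parse a timestamp that began with a quote character.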
