Skip to content

jkingdon/levitation-perl

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

97 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This is a Perl port of scy's levitation. It reads MediaWiki dump files
revision by revision and writes a data stream to stdout suitable for 
git fast-import.

The first 1000 pages of the german Wikipedia and all their revisions
(about 390000) can be dumped in about 15 min on relatively moderate
hardware.


INSTALL:

You need at least Perl 5.10. The Perl interpreter has to be compiled
with threads support.

You also need a working C compiler for the inline SHA1 C function.
Although gcc 4.3 was specified by old versions of levitation-perl
(circa May 2010), gcc 4.5.2 also appears to work (at least for some
small wiki dumps).

You need zlib, for example the following should work under Debian/Ubuntu:
  sudo apt-get install zlib1g-dev

You need the following modules and their dependencies from CPAN:

- Regexp::Common
- Inline
- JSON::XS
- Compress::Raw::Zlib
- Carp::Assert
- Devel::Size

- CDB_File
- XML::Bare      >= 0.44
- Deep::Hash::Utils
- File::Path     > 2.04 ? (2.08 is fine; it has to export make_path)

Some Linux distributions will already have the first set.
Under Debian / Ubuntu the following command should set you:

  sudo apt-get install libregexp-common-perl \
                       libinline-perl libjson-xs-perl \
                       libcompress-raw-zlib-perl libcarp-assert-perl \
                       libdevel-size-perl

For Fedora, the equivalent is
  sudo yum install perl-Regexp-Common \
                   perl-Inline perl-JSON-XS \
                   perl-Compress-Raw-Zlib perl-Carp-Assert


For the second set, you may be able to just run:
  sudo cpan -i CDB_File
  sudo cpan -i XML::Bare
  sudo cpan -i Deep::Hash::Utils


USAGE:

First, initialize a git repository:

  cd /tmp
  mkdir blawiki
  cd blawiki
  git init

(Alternately, go to the working directory of an existing git repository where you want to import the dump, such as one from a previous run of levitation-perl).

Then, "levitate". This is a three-step process:

  cat /path/to/blawiki-dump.xml | /path/to/levitation-perl/step1.pl
  LC_ALL=C sort rev-table.txt > rev-sorted.txt
  /path/to/levitation-perl/step2.pl | /path/to/levitation-perl/gfi.pl


Alternatively, you can just change to an empty directory and call the
"levitate" helper script with a path to a dump as parameter (may be 
7z, bz2, gz or xml):

  mkdir /tmp/blawiki
  cd /tmp/blawiki
  /path/to/levitation-perl/levitate /path/to/blawiki-dump...

Lots of progress information is printed to standard error, so it may be
best to redirect that to a file.

Have fun.

About

perl port of scy's levitation

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Perl 98.8%
  • Shell 1.2%