Wikipedia Offline Readers

Looks like all the Wikimedia Foundation had to do for decent offline reader software to be developed is continue to provide database dumps. 😉 There are now several implementations, some open source, that can be used to build Wikipedia DVDs – and I’m not referring to the neat offline reader hack that was just slashdotted. Look at these:

  • Moulin (open source) uses static HTML inside a XUL-based cross-platform reader application with Gecko as the rendering engine. It doesn’t seem to have full-text search (only titles), but seems to have a very active development team. The current downloadable version is still very simplistic, but future versions should be interesting. Current versions do not contain images, though there’s nothing technical that stands in the way of including them. I missed the Wikimania talk about this one. :-(
  • Kiwix (open source) is awesome and the slickest implementation I’ve seen so far. It was used for the Wikipedia 0.5 DVD (actually a CD, with only about 2000 articles, sadly). Has a nice full-text search, search autocompletion, and printing. Also uses static HTML as a source. Storage efficiency could be better, but this first selection does include image thumbnails, which take quite a bit of space.
  • Ksana For Wiki is still closed source. It was demonstrated at Wikimania to provide “Wikipedia on a USB stick”. Pretty nifty for looking things up without a net connection. The application actually parses the wikitext and does a fairly shoddy job at it, which makes many of the articles look rather raw. On the positive side, it does support accessing dumps in any language, has a fairly fast full-text search, and is cross-platform.
  • ZenoReader is a Windows-only closed source reader application developed for the German Wikipedia DVD. While the company which made the DVD, Directmedia, deserves credit for bringing the first WP DVD to the market, I don’t think this particular framework is likely to have much of a future. I’m not even going to bother to try to get it to run under WINE on Linux, as they suggest. From what I can gather, it’s based on the HTML of the de.wp articles which is served through a local webserver.
  • Wikipedia Offline Client seems to be a student project to create a nice graphical client. From what I can see quickly, it appears to be also based on rendering & indexing HTML pages, though they seem to have hacked the standard MediaWiki parser for the purpose. Not sure what the current status is and how likely it is to be developed further. It appears to be partially based on Knowledge, an earlier offline reader effort.
  • WikiFilter takes a similar approach to Ksana, using the wikitext as a source. Judging by the screenshot, the output is somewhat slicker, but the code hasn’t been updated in more than a year and is Windows-only. It runs as an Apache module so setup is definitely not for the meek.

UPDATE: A couple of other ones pointed out in the comments:

  • yawr is Magnus Manske’s effort to create an open source equivalent of ZenoReader.
  • WikiMiner is a Java-based search tool that can be used in conjunction with the static HTML dumps.

A few other methods to view Wikipedia without the Internet exist, such as a reader for the iPod or Erik Zachte’s TomeRaider edition. TomeRaider is a proprietary ebook reader format for PDAs. Erik explained to me how he spent countless hours trying to get every last detail to render correctly.

Perhaps the WMF should pick one of those platforms and support the developers, offer a DVD toolchain on download.wikimedia.org, etc. My long term wishlist for offline reading includes:

  • “Make your own dump” style scripts that generate input files for the reader application which include exactly the articles & images I want, so it becomes easy to customize it down to a megabyte-size selection, or to access many gigabytes of text and images (a rough sketch of what such a script could look like follows this list).
  • More than one-article-per-window display modes. It should be possible to scroll through an entire category, or even the entire encyclopedia, without ever opening a new window. Google Reader or Thoof style smart loading may help here.
  • Embedded Theora & Vorbis playback. If Grolier did it 15 years ago, we should be able to have a rich media DVD as well. :-)
  • Smarter parsing of the contents. Templates in particular typically mark up semantic blocks that you may want to filter out, match to an offline equivalent, render in a separate window, etc. (see the second sketch after this list). Of course if we want to really dream, think of the possibilities of DBpedia-style data extraction and queries: go beyond full-text search and offer limitless queries & dynamic lists of the data within Wikipedia.
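
For the “make your own dump” item, here is a minimal sketch of what such a script could look like, assuming the standard pages-articles XML dump and a plain text file listing the wanted titles. The file names, the namespace constant, and the stripped-down output container are all made up for illustration; this is not an existing WMF tool:

    # Hypothetical "make your own dump" filter: stream a pages-articles XML dump
    # and keep only the pages whose titles are listed in a selection file.
    import bz2
    import xml.etree.ElementTree as ET

    # Adjust to the schema version declared at the top of the dump you downloaded.
    EXPORT_NS = "{http://www.mediawiki.org/xml/export-0.3/}"

    def filter_dump(dump_path, selection_path, out_path):
        with open(selection_path, encoding="utf-8") as f:
            wanted = {line.strip() for line in f if line.strip()}

        with bz2.open(dump_path, "rb") as src, open(out_path, "w", encoding="utf-8") as out:
            # Note: this writes a bare <pages> container, not a full MediaWiki export.
            out.write("<pages>\n")
            for event, elem in ET.iterparse(src, events=("end",)):
                if elem.tag == EXPORT_NS + "page":
                    title = elem.findtext(EXPORT_NS + "title", default="")
                    if title in wanted:
                        out.write(ET.tostring(elem, encoding="unicode"))
                    elem.clear()  # keep memory use flat while streaming
            out.write("</pages>\n")

    if __name__ == "__main__":
        filter_dump("enwiki-pages-articles.xml.bz2", "my_selection.txt", "my_custom_dump.xml")

Pair that with a pass that copies the matching image files, and the megabyte-size selection and the multi-gigabyte DVD become two ends of the same toolchain.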
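
And for the template point, a toy illustration of the kind of parameter extraction that smarter parsing could start from. Real template syntax nests and hides pipes inside links and inner templates, so a serious implementation needs a proper parser; every name below is made up:

    # Naive sketch: pull the parameters out of the first {{Infobox ...}} block in
    # a piece of wikitext, so the block can be filtered out, matched to an offline
    # equivalent, or rendered in a separate window. Splitting on "|" breaks as soon
    # as a parameter value contains a piped link or a nested template.
    def extract_template(wikitext, name="Infobox"):
        start = wikitext.find("{{" + name)
        if start == -1:
            return {}
        # Walk forward counting braces so nested templates don't end the block early.
        depth, i = 0, start
        while i < len(wikitext) - 1:
            if wikitext[i:i + 2] == "{{":
                depth += 1
                i += 2
            elif wikitext[i:i + 2] == "}}":
                depth -= 1
                i += 2
                if depth == 0:
                    break
            else:
                i += 1
        body = wikitext[start + 2:i - 2]
        params = {}
        for part in body.split("|")[1:]:  # the first chunk is the template name itself
            if "=" in part:
                key, _, value = part.partition("=")
                params[key.strip()] = value.strip()
        return params

    sample = "{{Infobox country | name = Burundi | capital = Bujumbura }} Burundi is ..."
    print(extract_template(sample))  # {'name': 'Burundi', 'capital': 'Bujumbura'}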

Of course the real challenge in the long run will be offline editing with syncing to the live site once connectivity is available.
And I’d love to see decent enough voice recognition on mobile devices so that you can simply say the name of an article and it will immediately display it. 😉

Going back to the boring present, are you aware of other wiki reader & parser projects that are worth mentioning?

6 Comments

  1. Just how many partially-implemented parsers are there? Not to mention attempts at describing it in EBNF. Everyone gets some of the way, then THEIR BRAINS GET EATEN BY TENTACLES.

  2. WikiMiner uses a static HTML tree to be read with a standard browser and a little GPL’ed full-text search app written in Java. Seems simple, straightforward and clean. Developed for the Polish Wikipedia-on-DVD project.

  3. Stian Håklev

    August 17, 2007 at 8:30 am

    Hi Eric,

    nice update. I’ve actually been working quite a bit with this myself, but never really made anything all that release-worthy. When looking into the field I was surprised both by how much vapour-ware there was (tons of released specs etc, but show me the code!), and also by how much focus there was on developing your own browser. For me the web browser works great, and it lets me work with a tool I am intimately familiar with – so a (transparent) background server always seemed like the best solution to me. If they want to do a front-end, great, but not _before_ solving the storage problem! You need to be able to store all of English Wikipedia (and the others, but EN is the biggest) on a DVD, and randomly access articles… The recently mentioned hack, and my solution, manage this. Then you can build on it and add the features that you mention. I never understood how the community could be all excited about 0.5, which contains 2000 articles. I never read WP from A-Z; what I need offline WP for is to be able to pull out my laptop (or in the future hopefully my cell/iPod etc.) at a friend’s house and check if Steve Martin really did marry some weird person, or what the capital of Burundi is. The breadth of information is the selling point of WP.

    What also interests me is how incredibly uninterested a lot of the core crowd is in offline solutions. Having spent a year in Indonesia, where internet is extremely expensive and hard to come by, I can tell you that getting local hawkers, who sell pirate DVDs for 50 cents, to distribute Wikipedia DVDs would be a huge benefit (of course they should have both the English and Indonesian, and possibly Chinese/Javanese/Buginese/Acehnese versions – the last ones are so small you’ll barely notice them).

    Finally, there seems to be a real lack of coordination between all these projects – and the people doing the offline dumps seem to be all but unapproachable… Where do they make their decisions about what dumps to provide, what schedule etc? I haven’t found that forum. So I would love an official mailing list to coordinate work on offline Wikipedias and WP dumps…

    Stian

  4. Stian, the wiki-research-l mailing list is probably the most appropriate mailing list for now, at least until offline-Wikipedia-specific traffic gets too high. See http://lists.wikimedia.org/mailman/listinfo/wiki-research-l – I’m looking forward to hearing from you there.

  5. I would like to comment and share some impressions from the Taipei presentation about Moulin, by its developer Renaud Gaudin:

    > still very simplistic,
    Did you see the Arabic version? It looks much better than the previous build, with more features and nice, professionally designed icons as well.

    Full-text search was mentioned in Taipei: it was not included because of space constraints, according to Renaud, but it could be switched on.

    > [Moulin] “seems to have a very active development team.”
    Renaud said in Taipei that he is currently the only one doing the programming, with others doing localisation.

    I like several things about this initiative:

    1 Moulin uses the original PHP code to render the article
    (I know from experience how much work it is to reverse-engineer the MediaWiki parser. I would not even have thought about building a Perl version these days, but somehow 5 years ago it was not a big deal, and keeping up with the ever-evolving MediaWiki syntax was quite doable in the early years, before conditional templates aka parser functions arrived, which is of course, well… doable as well, sigh)

    2 Moulin has great potential for cross-platform compatibility: Gecko runs on a few platforms, even on PPC handhelds, though alas not on Palm or Symbian, it seems

    3 There are even provisions for offline editing through Moulin (and a procedural system to route updates back to the online version). This obviously won’t work for correcting spelling errors, but it may work for adding whole new articles about local affairs, because …

    4 .. last but not least, Moulin is really targeting underdeveloped countries, where online access may be hard to get. Hence the early focus on, among others, Arabic and Farsi, rather than on English and what have you. According to Renaud, 10,000+ CDs have already been distributed through Geekcorps to certain African communities. As far as reaching out to developing countries is concerned, Renaud is walking the walk, and I am glad he got in contact with Samuel Klein (OLPC) and others at Taipei, so that efforts can be combined or at least experiences can be shared.

    See also http://www.netsquared.org/projects/proposals/moulin-wiki-offline-wikipedia

  6. Addendum: Moulin is friendly to low-end machines. The PHP rendering is done once on a server (in France). Compressed rendered files are then shipped with a Gecko-based browser/editor.
