Looks like all the Wikimedia Foundation had to do for decent offline reader software to be developed is continue to provide database dumps. 😉 There are now several implementations, some open source, that can be used to build Wikipedia DVDs – and I’m not referring to the neat offline reader hack that was just slashdotted. Look at these:
- Moulin (open source) uses static HTML inside a XUL-based cross-platform reader application with Gecko as the rendering engine. Doesn’t seem to have full-text search (only titles), but seems to have a very active development team. Current downloadable version still very simplistic, future versions should be interesting. Current versions do not contain images but there’s nothing technical that stands in the way of including them. I missed the Wikimania talk about this one.
- Kiwix (open source) is awesome and the slickest implementation I’ve seen so far. It was used for the Wikipedia 0.5 DVD (actually a CD, with only about 2000 articles, sadly). Has a nice full-text search, search autocompletion, and printing. Also uses static HTML as a source. Storage efficiency could be better, but this first selection does include image thumbnails, which take quite a bit of space.
- Ksana For Wiki is still closed source. It was demonstrated at Wikimania to provide “Wikipedia on a USB stick”. Pretty nifty for looking things up without a net connection. The application actually parses the wikitext and does a fairly shoddy job at it, which makes many of the articles look rather raw. On the positive side, it does support accessing dumps in any language, has a fairly fast full-text search, and is cross-platform.
- ZenoReader is a Windows-only closed source reader application developed for the German Wikipedia DVD. While the company which made the DVD, Directmedia, deserves credit for bringing the first WP DVD to the market, I don’t think this particular framework is likely to have much of a future. I’m not even going to bother to try to get it to run under WINE on Linux, as they suggest. From what I can gather, it’s based on the HTML of the de.wp articles which is served through a local webserver.
- Wikipedia Offline Client seems to be a student project to create a nice graphical client. From what I can see quickly, it appears to be also based on rendering & indexing HTML pages, though they seem to have hacked the standard MediaWiki parser for the purpose. Not sure what the current status is and how likely it is to be developed further. It appears to be partially based on Knowledge, an earlier offline reader effort.
- WikiFilter takes a similar approach to Ksana, using the wikitext as a source. Judging by the screenshot, the output is somewhat slicker, but the code hasn’t been updated in more than a year and is Windows-only. It runs as an Apache module so setup is definitely not for the meek.
UPDATE: A couple of other ones pointed out in the comments:
- yawr is Magnus Manske’s effort to create an open source equivalent of ZenoReader.
- WikiMiner is a Java-based search tool that can be used in conjunction with the static HTML dumps.
A few other methods to view Wikipedia without the Internet exist, such as a reader for the iPod or Erik Zachte’s TomeRaider edition. TomeRaider is a proprietary ebook reader format for PDAs. Erik explained to me how he spent countless hours trying to get every last detail to render correctly.
Perhaps the WMF should pick one of those platforms and support the developers, offer a DVD toolchain on download.wikimedia.org, etc. My long term wishlist for offline reading includes:
- “Make your own dump” style scripts that generate input files for the reader application which include exactly the articles & images I want, so it becomes easy to customize it down to a megabyte-size selection, or to access many gigabytes of text and
- More than one-article-per-window display modes. It should be possible to scroll through an entire category, or even the entire encyclopedia, without ever opening a new window. Google Reader or Thoof style smart loading may help here.
- Embedded Theora & Vorbis playback. If Grolier did it 15 years ago, we should be able to have a rich media DVD as well.
- Smarter parsing of the contents. Templates in particular typically mark up semantic blocks that you may want to filter out, match to an offline equivalent, render in a separate window, etc. Of course if we want to really dream, think of the possibilities of DBpedia style data extraction and queries: go beyond full-text search and offer limitless queries & dynamic lists of the data within Wikipedia.
Of course the real challenge in the long run will be off-line editing with syncing to the live side once connectivity is available.
And I’d love to see decent enough voice recognition on mobile devices so that you can simply say the name of an article and it will immediately display it. 😉
Going back to the boring present, are you aware of other wiki reader & parser projects that are worth mentioning?