Managing wiki data

Volapük is one of those lovely constructed languages that have made it into the 100K list on our multilingual Wikipedia portal. Of course, this is due to some bot imports of census and map information. It’s all fair – the English Wikipedia also bot-generated tens of thousands of “articles” in its early history. It does raise the question of whether a free-text wiki is really a good way to maintain such data.

Over at OmegaWiki, we’re experimenting with adding structured data to the multilingual concept entities we call “DefinedMeanings”. A DefinedMeaning can be accessed using any synonym or translation associated with it. An example is the DefinedMeaning Denmark. If you expand the “Annotation” section, you’ll see information such as the countries Denmark borders on, its capital, or currency.

But there’s more to it than meets the eye: the fact that a country can border on another one is expressed by giving a DefinedMeaning a class membership. The class “state” defines what relationships an entity of this type can possess. In this way we define an ontology.

Note that if you log in and change your UI language, all the names will be automatically translated, if translations are available, as each relation type is a DefinedMeaning in its own right. OmegaWiki also supports free-text annotation and hyperlinks. We’ll add more attribute types as we need them.
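To make the idea of class-defined relations and translated labels a bit more concrete, here is a minimal sketch in Python. It is not OmegaWiki's actual data model or code; the class name `DefinedMeaning` is borrowed from the text, while the fields, the methods, and the German sample labels are assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare by identity: two concepts are equal only if they are the same object
class DefinedMeaning:
    """A language-independent concept (hypothetical model, not OmegaWiki's schema)."""
    expressions: dict                                        # language code -> label
    classes: list = field(default_factory=list)              # classes this concept belongs to
    allowed_relations: list = field(default_factory=list)    # relation types a class permits
    relations: list = field(default_factory=list)            # (relation type, target) pairs

    def label(self, lang):
        # Fall back to English when no translation is available.
        return self.expressions.get(lang, self.expressions["en"])

    def relate(self, relation_type, target):
        # A relation is only allowed if one of our classes defines it.
        allowed = any(relation_type in c.allowed_relations for c in self.classes)
        if not allowed:
            raise ValueError(f"{self.label('en')} may not use '{relation_type.label('en')}'")
        self.relations.append((relation_type, target))

    def annotation(self, lang):
        # Relation types are DefinedMeanings too, so their names translate as well.
        return [(rel.label(lang), tgt.label(lang)) for rel, tgt in self.relations]


borders_on = DefinedMeaning({"en": "borders on", "de": "grenzt an"})
capital = DefinedMeaning({"en": "capital"})
state = DefinedMeaning({"en": "state", "de": "Staat"}, allowed_relations=[borders_on])
denmark = DefinedMeaning({"en": "Denmark", "de": "Dänemark"}, classes=[state])
germany = DefinedMeaning({"en": "Germany", "de": "Deutschland"}, classes=[state])

denmark.relate(borders_on, germany)
print(denmark.annotation("de"))
```

Because the relation type is itself a concept with its own translations, switching the language changes both the attribute name and its value, which is the behaviour described above.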

What does that have to do with the Volapük Wikipedia? Well, OmegaWiki is the kind of technology you could build upon to maintain a central repository of the almanac data that was imported. Then, with some additional development, you could generate infoboxes with fully translated labels directly from OmegaWiki. And, within the free-text Wikipedia, you could focus on writing actual encyclopedia articles that reference this data.

Note that OmegaWiki is not alone in the semantic wiki space. Semantic MediaWiki adds data structure by enriching the wiki syntax itself. Freebase is a closed-source free-content “database of everything.” I’m not interested in closed source technology, but I like a lot of what I see in Semantic MediaWiki, and I expect that the technologies will converge in the end, if only conceptually.

Project Peach Pre-Orders Approach 1000

Project Peach is the latest “open source movie” that will be made by the good folks behind Blender, an open source 3D content creation suite. Unlike the first movie, Elephants Dream (which was codenamed “Project Orange”), Project Peach will have a cute story with cute furry animals. Like Elephants Dream, Peach will be released under the Creative Commons Attribution License, allowing anyone to use any part of the work for any purpose.

People who pre-order the Peach DVD before October 1 can have their name listed in the credits. They’ve already taken almost a thousand pre-orders. What this means, in simple terms, is that the Blender Institute now has enough reputation to fund free culture projects to the tune of EUR 30-50K through online donations. Not huge, but still pretty exciting.

LibriVox Dramatic Recordings

I’ve been fond of the LibriVox project for some time, where volunteers contribute spoken recordings of public domain texts (see the Wikinews interview I did last year). It’s a wonderful example of what becomes possible when a work is no longer trapped by copyright law. But I only today discovered the Dramatic Works section of their catalog. Here, multiple readers distribute the speaking of lines from dramatic works like Shakespeare’s “King Lear”, and the result is edited into a single recording. The entire process is coordinated through the LibriVox forums. I love it.

Granted, the results are of varying quality, and only a handful of works have been completed so far. But the technology that enables such collaborations to happen is also still in its infancy. The very existence of high quality open source audio editing software like Audacity has already driven a great many projects (including our own Spoken Wikipedia); imagine what kind of creativity an open source audio collaboration suite could unleash.

Improvements, of course, often come in very small steps. A nice example is the Shtooka software, an open source program specifically designed for the purpose of creating pronunciation files. It is not rocket science, but according to Gerard, who has recorded hundreds of such files, it makes the process so much simpler. I wouldn’t be surprised if the folks at LibriVox come up with their own “Shtooka” solution to distributing the workload of complex dramatic recordings.

Deb A Day is Back!

I hadn’t realized that Deb A Day is back online. Now run by multiple authors, this blog features cool open source packages with detailed descriptions. All the packages can be trivially installed using Debian’s brilliant package management system. But whether you’re a Debian/Ubuntu user or not, you’re likely to discover some new tools reading this blog. I already found three which I’m likely to keep around:

  • Qalculate – a desktop calculator with autocompletion, history, and plenty of built-in unit converters (including currency conversion using Internet data).
  • Zim – a desktop wiki and outliner. I previously used Tomboy. Zim appears to be a little bit cooler in that it supports namespaces (which can be used to build a document tree, and export selected branches) and comes with a calendar plug-in which makes it easier to manage daily to-dos in parallel with global pages. At least that’s the theory — I’ll see how it works out.
  • htop – just a neat replacement for your run-of-the-mill command line process manager. There are probably hundreds of these replacements for common Unix tools out there. I wish distributors would start making the cooler versions the default.

The value of Deb A Day also demonstrates that we need better open source knowledge bases. Wikipedia is pretty good (the free software portal is an excellent index to tools for various purposes), though the deletionists sometimes aggressively remove “non-notable” applications (I had to fight to rescue poor Pingus from deletion). Pakanto could become a good source for vendor-neutral freely licensed package descriptions. And is good to find highly rated or popular tools in a particular category. But what’s missing is a database of in-depth reviews and tips, one which (like Deb A Day) highlights interesting new projects or little known old ones. For now, this nice little blog will have to do.

Tagged Planet aggregators rock

Planet aggregators are pretty cool, but many of them are “polluted” by blog posts that have nothing to do with the subject. If I subscribe to a MySQL or Apache feed, I don’t want to read about what the MySQL or Apache devs are having for breakfast. Planet Maemo is an example of an aggregator that works well. It still has the occasional off-topic posts, but thanks to tag-specific feeds, many of the aggregated sources are filtered. It’s a truly powerful way to keep up to date with the devs, while still allowing them room for individual expression.
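The filtering itself is simple. As a rough sketch (not how Planet Maemo or any particular aggregator is implemented), assume each post arrives as a dictionary with a `tags` list; the function name `filter_by_tags` and the sample posts are invented for this example.

```python
def filter_by_tags(posts, wanted_tags):
    """Keep only posts carrying at least one wanted tag (case-insensitive)."""
    wanted = {tag.lower() for tag in wanted_tags}
    return [
        post for post in posts
        if wanted & {tag.lower() for tag in post["tags"]}
    ]


# Made-up sample data standing in for an aggregated feed.
posts = [
    {"title": "New Maemo SDK notes", "tags": ["maemo", "sdk"]},
    {"title": "What I had for breakfast", "tags": ["personal"]},
    {"title": "Porting to the tablet", "tags": ["Maemo"]},
]

on_topic = filter_by_tags(posts, ["maemo"])
print([post["title"] for post in on_topic])
```

The authors stay free to blog about breakfast; the aggregator just never sees those posts.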

I guess it’s time for me to start lobbying for a Wikimedia planet aggregator. The additional issue here is language: perhaps we need one for each language, unless there is aggregator software that supports multilinguality.

Blasphemy Challenge

The Blasphemy Challenge is one of many interesting uses of YouTube’s video reply functionality. A bit silly, it challenges viewers to upload their own videos where they “deny the Holy Spirit,” an unforgivable sin according to the New Testament. Up to 1,000 responders will get a free DVD of the documentary The God Who Wasn’t There.

This kind of decentralized collaboration will become really interesting when people can actually start to, well, collaborate: i.e., turn many small snippets of video into a documentary. Even now, it’s an interesting display of the new media culture that is arising on the Net.

RfC: A Free Content and Expression Definition

If you distribute this announcement, please make an addition to /Log so we can avoid duplicates.

The free culture movement is growing. Hackers have created a completely free operating system called GNU/Linux that can be used and shared by anyone for any purpose. A community of volunteers has built the largest encyclopedia in history, Wikipedia, which is used by more people every day than or Thousands of individuals have chosen to upload photos to under free licenses. But – just a minute. What exactly is a “free license”?

In the free software world, the two primary definitions – the Free Software Definition and the Open Source Definition – are both fairly clear about what uses must be allowed. Free software can be freely copied, modified, modified and copied, sold, taken apart and put back together. However, no similar standard exists in the sphere of free content and free expressions.

We believe that the highest standard of freedom should be sought for as many works as possible. And we seek to define this standard of freedom clearly. We call this definition the “Free Content and Expression Definition”, and we call works which are covered by this definition “free content” or “free expressions”.

Neither these names nor the text of the definition itself are final yet. In the spirit of free and open collaboration, we invite your feedback and changes. The definition is published in a wiki. You can find it at: or

Please use the URL <> (including the trailing slash) when submitting this link to high-traffic websites.

There is a stable and an unstable version of the definition. The stable version is protected, while the unstable one may be edited by anyone. Be bold and make changes to the unstable version, or make suggestions on the discussion page. Over time, we hope to reach a consensus. Four moderators will be assisting this process:

  • Erik Möller – co-initiator of the definition. Free software developer, author and long time Wikimedian, where he initiated two projects: Wikinews and the Wikimedia Commons.
  • Benjamin Mako Hill – co-initiator of the definition. Debian hacker and author of the Debian GNU/Linux 3.1 Bible, board member of Software in the Public Interest, Software Freedom International, and the Ubuntu Foundation.
  • Mia Garlick – General Counsel at Creative Commons, and an expert on IP law. Creative Commons is, of course, the project which offers many easy-to-use licenses to authors and artists, some of which are free content licenses and some of which are not.
  • Angela Beesley – one of the two elected trustees of the Wikimedia Foundation. Co-founder and Vice President of Wikia, Inc.

None of the moderators is acting here in an official capacity related to their affiliations. Please treat their comments as personal opinion unless otherwise noted. The Creative Commons project has welcomed the effort to clearly classify existing groups of licenses, and will work to supplement this definition with one which covers a larger class of licenses and works.

In addition to changes to the definition itself, we invite you to submit logos that can be attached to works or licenses which are free under this definition:

One note on the choice of name. Not all people will be happy to label their works “content”, as it is also a term that is heavily used in commerce. This is why the initiators of the definition compromised on the name “Free Content and Expression Definition” for the definition itself. We are suggesting “Free Expression” as an alternative term that may lend itself particularly to usage in the context of artistic works. However, we remain open to discussing the issue of naming, and invite your feedback in this regard.

We encourage you to join the open editing phase, to take part in the logo contest, or to provide feedback. We aim to release a 1.0 version of this definition fairly soon.

Please forward this announcement to other relevant message boards and mailing lists.

Thanks for your time,

Erik Möller and Benjamin Mako Hill

Losing fat for charity

Either webcomics artists are an unusually innovative bunch, or I’m reading too much of their work. 😉 In any case, on the heels of OhNoRobot, Biggest Webcomic Loser is another very interesting (and somewhat bizarre) project coming from the comics community. Overweight comics artists have decided to join forces to lose money and raise fat for charity. Er, sorry, the other way around. It works like this:

  • Each artist defines a personal weight goal.
  • The site is updated regularly with new comics from the artists who have signed up for the project.
  • Visitors can choose to pledge financial support to UNICEF on a per-lbs basis (apparently they’re not innovative enough to use the metric system).
  • The financial support per-lbs is indicated next to each comic.

The total amount of pledged money for all comics is already at over $5K, and given that it’s only been running for a few weeks, it seems like it could go much higher. Once any artist has reached their weight loss goal, those who have pledged to support “them” (i.e. UNICEF) will be reminded to make a donation.

The one thing I would change is to try to pick a general theme for the comics to follow — otherwise they tend to focus on the idea itself, which might not be the best way to attract readers who do not have the same, er, problem.

It’s a bit strange to tie this specifically to the webcomics community, but it has the advantage of providing interesting new content (comics) every day. I wonder if something similar might work for blogs. See Pledgebank for a very cool generic pledging system.

Firefox extension idea: Link killer

This one is a simple idea, and it may already exist. I’d like a way to maintain a personal blacklist of sites which I never wish to visit. When browsing the web, all links that (visibly or invisibly) point to these sites should then be somehow marked. Preferably, that should be done in a low-overhead way that doesn’t require rewriting every webpage. I’d be happy to hover over a link to see whether it is blacklisted or not, although color-coding or similar would of course be more user-friendly.
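The matching part of this is easy to sketch. The following is not a Firefox extension, just the core check in Python under a few assumptions of mine: the blacklist is a plain set of domain names, a link counts as blacklisted if its host is one of those domains or a subdomain of one, and the function name `is_blacklisted` is made up. Checking the actual link target rather than the link text is what catches links that point to a site “invisibly”.

```python
from urllib.parse import urlsplit


def is_blacklisted(url, blacklist):
    """Return True if the URL's host is a blacklisted domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in blacklist)


# Hypothetical personal blacklist, kept locally rather than with a search engine.
blacklist = {"annoying-lyrics.example", "scam.example"}

links = [
    "http://www.annoying-lyrics.example/song?id=42",
    "http://en.wikipedia.org/wiki/Pingus",
    "/relative/page.html",
]
for link in links:
    marker = "BLOCKED" if is_blacklisted(link, blacklist) else "ok"
    print(marker, link)
```

A set lookup per link is cheap, so marking links on hover or on page load should not need any rewriting of the page itself.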

I do realize that some search engines offer blacklists as part of their “personalized search”, but I’d rather host this list on my machine and not have it tied to a global identity, requiring me to let Google et al. set eternal cookies to use them. Besides, this would also show bad links on other websites such as Wikipedia, which might come in handy.

One personal use for this is searching for lyrics. There are too many lyrics sites that are full of floating, animated, annoying ads of different types, some of them making their way through Firefox popup and ad blocking. Incidentally, this is the primary result of the music industry’s campaign against lyrics websites: the big lyrics archives which are left are scammers. Thanks, guys! (I was in the process of downloading all the lyrics from when they shut it down. Unfortunately, I only ever got to the letter H ..)

The Wayback Wayback Machine

The Internet Archive is a great invention, providing a view through the history of a webpage. It is also a great tool for investigative journalism and academic research. Whether it’s about the history of a dubious company, or a page whose content has mysteriously changed, the Internet Archive adds wiki-like versioning to webpages that otherwise would not have it. To avoid massive copyright problems, the Archive has made two crucial compromises: It does not show pages less than 6 months old, and it retroactively deletes material when the site owners want it to (see FAQ). In fact, owners don’t even need to ask — they just have to put a special robots.txt file on their webservers, and the next time the crawlers see the site, it is removed from the Archive.
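To show how little it takes, here is a small sketch using Python's standard `urllib.robotparser` to evaluate such a file. The two-line robots.txt and the `ia_archiver` user-agent string are my assumption of what the exclusion looks like (check the Archive's FAQ for the current instructions), and `example.com` is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Assumed exclusion rule: tell the Archive's crawler to stay away from everything.
robots_txt = """\
User-agent: ia_archiver
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The Archive's crawler is shut out, other crawlers are unaffected.
archive_allowed = parser.can_fetch("ia_archiver", "http://example.com/page.html")
other_allowed = parser.can_fetch("SomeOtherBot", "http://example.com/page.html")
print(archive_allowed, other_allowed)
```

Two lines in a text file, with no verification of who put them there or why, are all that stands between a page's history and its disappearance.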

An archive where material can disappear from one day to the next without notice is quite a bizarre thing. Links to archived web pages become invalid. When a site is removed from the Archive, it appears as if it had never been there. It can provide entertainment value to see the history of popular webpages, but if anything controversial is immediately removed at the owner’s whim without so much as a verification process, then the tool loses much of its value for serious research. It’s the controversial stuff that needs to be archived the most.

If the Internet Archive doesn’t fix this flaw, it needs to be replaced with a solution that doesn’t have it, such as a decentralized storage network. As a temporary hack, it would be useful if someone set up a “Wayback Wayback Machine”, a site which, on request, crawls all revisions of a website in the Internet Archive, stores them, and makes them available to researchers who can provide credentials. This would only help to protect the record of websites where it seems likely that they might be deleted in the future (scams, phishing sites, etc.) and someone thinks of requesting a secure copy (in which case they could also manually download and save the revisions). A better long-term solution is needed, and it would be best if it was provided by the Internet Archive itself.
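The first step of such a hack, enumerating the revisions to copy, could look roughly like this. The sketch assumes the list of captures comes from the Archive's CDX query interface with JSON output, where the first row names the columns, and that a snapshot is addressed as `https://web.archive.org/web/<timestamp>/<original URL>`. The function name `snapshot_urls` and the sample response are made up, and no network access happens here; actually fetching and storing each URL is left out.

```python
import json

WAYBACK_PREFIX = "https://web.archive.org/web"


def snapshot_urls(cdx_json):
    """Turn a CDX-style JSON listing into one Wayback URL per captured revision."""
    rows = json.loads(cdx_json)
    if not rows:
        return []
    header, *records = rows            # first row holds the column names
    ts = header.index("timestamp")
    orig = header.index("original")
    return [f"{WAYBACK_PREFIX}/{rec[ts]}/{rec[orig]}" for rec in records]


# Fabricated sample response for illustration only, not real Archive data.
sample_response = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/", "20050101000000", "http://example.com/", "text/html", "200", "AAAA", "1234"],
    ["com,example)/", "20060101000000", "http://example.com/", "text/html", "200", "BBBB", "1300"],
])

for url in snapshot_urls(sample_response):
    print(url)
```

Saving the listing alongside the pages would also preserve the capture dates, which matter as much as the content when a page has “mysteriously changed”.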

Until then, whenever you see something unusual in the Wayback Machine, remember to make a copy. It might not be there tomorrow.