
Global Voices on arz.wikipedia.org

Global Voices has posted an interesting summary of discussions in the Arabic-speaking blogosphere regarding the recently launched Egyptian Arabic Wikipedia. In short, it’s a typical dispute between those who think it ought to be recognized as a separate language, and those who believe that doing so divides the community. The quoted voices seem to lean more towards the latter opinion, though one commenter expresses the interesting notion that Egyptian Arabic could function as a scratchpad for the Arabic Wikipedia among speakers who find it easier to write than formal Arabic. [via ethan]

The Power of Free Content

David Shankbone, who has contributed countless photos to Wikimedia Commons (including many very hard-to-obtain shots of celebrities), has written a very interesting blog post about how his photos get used throughout the universe of Wikimedia languages and projects: The global reach of just one photo. If you want to see where a photo you’ve uploaded is used, you can use the CheckUsage tool. This kind of global usage is a true testament to what’s possible when content is shared with few copyright restrictions.

LibregameWiki

Can open source games be any good? Most open source games are relatively small and simple, but the open source movement has produced some gems over the years, such as Battle for Wesnoth and Neverball. Blender’s Project Apricot is an example of a game developed using a new model of collaborative funding (DVD pre-orders), a model previously used successfully for open source 3D movie productions.

If you’re interested in keeping track of the progress of open source gaming, LibregameWiki is a great place to go. Sadly, many of the open source games it reports on aren’t considered notable enough for Wikipedia. In addition to the games themselves and the people who make them, LibregameWiki also writes about game development contests like RubyWeekend, which have recently become a source of lots of nice, open source mini-games.

Towards A True Wikimedia Commons

Wikimedia Foundation CTO Brion Vibber recently added a very neat feature to the development version of MediaWiki. In order to enable it, all you need is a snippet of code in your LocalSettings.php configuration file:

$wgForeignFileRepos[] = array(
   'class'            => 'ForeignAPIRepo',
   'name'             => 'shared',
   'apibase'          => 'http://commons.wikimedia.org/w/api.php',
   'fetchDescription' => true, // Optional
);

Your wiki installation will now have full access to Wikimedia Commons in the same way any Wikimedia wiki does. You can embed image thumbnails of any size, and they will be automatically generated and loaded from Commons. You can click images and see the file description (including the wiki description page, file history and EXIF metadata) loaded from Commons. I haven’t tried to make the embeddable video/audio player work yet, but any file type will be accessible.

This is wonderful, because it makes the nearly 3 million freely licensed files in Commons easily accessible to potentially thousands of wiki users, while retaining the critical licensing information. This implementation does not cache the data in the local wiki, so it is not yet suitable for large scale installations. Caching the data intelligently is a significant challenge, as it could be a vector for denial of service attacks, and it also raises questions such as how and when cached files should expire. I wrote a proposal called “InstantCommons” a couple of years ago which included some notes on the issue. After an incomplete first implementation of InstantCommons, I’m glad that we now have a working, simple mechanism for third party use of Commons media. Given that the foreign repository can be any MediaWiki installation, it will also be interesting to see what other wiki-to-wiki exchanges might result from it.
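
Since the foreign repository can be any MediaWiki installation, the same mechanism should also work against wikis other than Commons. A minimal sketch, assuming a hypothetical wiki at wiki.example.org with its API enabled:

$wgForeignFileRepos[] = array(
   // Hypothetical example: use another MediaWiki installation as the foreign
   // file repository instead of Commons. The hostname is a placeholder; the
   // remote wiki must have its api.php endpoint enabled.
   'class'            => 'ForeignAPIRepo',
   'name'             => 'examplewiki',                        // local name for this repository
   'apibase'          => 'http://wiki.example.org/w/api.php',  // the remote wiki's API endpoint
   'fetchDescription' => true,                                 // also fetch remote description pages
);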

The Rise of the Trash Blogs

Wikipedia founder Jimmy Wales is being dragged through the mud by a new type of trash blog. Tabloid journalism has always existed, but the new marketplace of ideas online allows for the rapid evolution of the most effective types of trashy pseudo-journalism: Those blogs that find the right tone — that combination of malicious insinuations, salaciousness, and half-truths — will develop a large audience; those that are just boring (like Kelly Martin rambling about Jimmy’s mortgage) will wither and die.

The recent attacks against Jimmy allege that he had some huge conflict of interest when it comes to editing the “Rachel Marsden” article. When people complain to him that an article is one-sided, Jimmy routinely asks people on our ticket system, OTRS, to look into it. He had a conflict of interest here which, for obvious reasons, he didn’t want to disclose, so he suggested a smaller conflict of interest in his e-mail to OTRS as a reason for recusing himself from editing the article. It would have been better to stay away from the matter entirely, obviously, but I don’t see it as a big deal. I’m more worried that the community will now massively push it in the other direction as a reaction against perceived bias.

As for his personal life, it is just that: his personal life.

There are two real stories here, IMHO:

* a destructive, trashy kind of pseudo-journalism that invades people’s personal lives under the pretense of a real story;

* the destructiveness and maliciousness on the fringes of our own community.

Jimmy not only created an extraordinary project — he decided to base it on the principles of the open source / free software movement, and turned it over to a non-profit organization. This was, by no means, the obvious thing to do: Had events played out a little differently, Wikipedia would today be a dot-com with ads, probably a subsidiary of Google, Microsoft or Yahoo.

As community leader, Jimmy has developed and emphasized the values that we cherish: the assumption of good faith, the importance of neutrality in open collaboration, and the belief in a shared purpose. When he talks about bringing education to those who cannot afford it, he’s not just trying to impress. Anyone who spends 5 minutes with him will understand that this is his personal life goal.

Moreover, Jimmy stepped back graciously as Chair of the Foundation when he could no longer dedicate as much time as the role needed. He’s helped us connect with philanthropists here in the Bay Area; donations like the recent 500,000-dollar donation during the last fundraiser were only possible because of his outreach efforts. His international network of contacts has helped us build our Advisory Board, a group of really smart people who have supported us on many occasions. In short, he’s been humble and helpful, and has always acted in the best interests of the organization.

When people try to create a malicious caricature of the man, please remind yourself that actions speak louder than words. And ask what the self-interest of those who make the attacks is. Whether it’s for personal reasons (15 minutes of fame, a vendetta, or simply an inherently destructive nature) or because of advertising revenue, spreading lies, insinuations and half-truths is in many people’s interest. The emergence of the new open media also means that we, as readers, have a greater responsibility to distinguish the bottom-of-the-barrel crap, designed to stir up trouble, from honest inquiry.

Not too long ago, Jimmy and Tim O’Reilly proposed a Blogger’s Code of Conduct that we bloggers would voluntarily adopt. I increasingly agree with them: If we want an honest, responsible, but dynamic and healthy new media sphere to emerge, we will need a new code of ethics and a set of principles to subscribe to. We need to show that we are different from the trash blogs. As such, I am hereby adopting the following modules from the proposed Code of Conduct:

1. Responsibility for our own words

2. Nothing we wouldn’t say in person

3. Connect privately first

4. Take action against attacks

6. Ignore the trolls

8. Keep our sources private

9. Discretion to delete comments

10. Do no harm

I encourage you to do the same.

Of Gutsy Gibbons

The upgrade of my desktop from Ubuntu Linux “Feisty Fawn” to “Gutsy Gibbon” was moderately sucky. Aside from some odd package dependency issues that I had to resolve forcefully, the upgrade messed up my X server configuration, resulting in an unusable desktop. After some fiddling and some searching, the culprit turned out to be the xserver-xgl package, which doesn’t work with my ATI graphics chipset. ATI’s poor Linux support is hardly an excuse for leaving the user’s system in a worse state than it was before. As long as basic upgrade procedures can lead to such results, we can’t seriously hope to make inroads on the desktop.

Open Source Mesh Networking

Looks like Meraki is getting some open source competition. An open source mesh networking platform could be exciting, especially when combined with a mini-webserver built into the platform to host content even without Internet access.

Stable version demo

Yesterday I set up a first unofficial demo of stable versions in the configuration which I would like to see used on the English Wikipedia. It’s based on the FlaggedRevs extension by Aaron Schulz & Jörg Baach. Unofficial, because Brion is currently reviewing the code for security and scalability. I’m still slightly worried that we might hit a snag, as the extension goes pretty deeply into the way MediaWiki serves pages, and of course Wikipedia can’t afford to run something that causes major slowdowns.

Still, if you want to get a feeling for the kind of configuration that I, Jimmy Wales and Florence Devouard have expressed support for, play with it. The main new thing you’ll notice is a top right corner icon which indicates the status of the page you’re looking at. On selected pages, the last revision shown to unregistered users is the most recently vandalism-patrolled one, while the rest of the wiki behaves as before. Essentially, in this configuration, it accomplishes four things:

  • Allows us to open up semi-protected pages to editing, since edits have to be reviewed and vandalism does not affect what the general public sees;
  • Allows us, as a consequence, to also use this kind of “quality protection” more widely, e.g. when an article has reached a very high quality and is unlikely to be substantially improved by massively collaborative editing;
  • Improves our vandalism patrolling abilities, since vandalism doesn’t have to be repeatedly checked: we record who has looked at which changes. Also, changes by trusted users don’t have to be reviewed. The tagging system ties into “recent changes patrolling”, an old feature that has never scaled well.
  • Allows us to make processes like “Featured article candidates” and “Good article candidates” revision-based rather than page-based. Thus we can more easily track when an article reached a certain quality stage, and can better examine whether changes past that point have increased or decreased its quality.

Some wikis would like to use more restrictive configurations than the one shown here. For example, a group in the German Wikipedia community is advocating that all edits by unregistered users be reviewed before they are applied, rather than doing so only on selected pages. I’ve argued against this at some length here. I think the current configuration strikes a good balance of openness, quality, and transparency to the reader. Of course there’s already a big wishlist, some of which should be addressed before taking this feature live.
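
For the curious, a FlaggedRevs setup along these lines boils down to a few lines in LocalSettings.php. The sketch below is an assumption-laden approximation based on the extension’s documentation rather than the actual demo configuration, and setting names have changed between FlaggedRevs versions:

// Assumption: setting names are taken from the FlaggedRevs documentation and
// may differ between versions; the 2008 demo also predates wfLoadExtension().
wfLoadExtension( 'FlaggedRevs' );

$wgFlaggedRevsNamespaces = array( NS_MAIN );  // only make article-namespace pages reviewable
$wgFlaggedRevsProtection = true;              // review applies only to pages explicitly placed
                                              // under this kind of "quality protection"; the
                                              // rest of the wiki behaves as before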

Managing wiki data

Volapük is one of those lovely constructed languages, and its Wikipedia has made it into the 100K list on our multilingual Wikipedia portal. Of course, this is due to some bot imports of census and map information. It’s all fair – the English Wikipedia also bot-generated tens of thousands of “articles” in its early history. But it does raise the question of whether a free-text wiki is really a good way to maintain such data.

Over at OmegaWiki, we’re experimenting with adding structured data to the multilingual concept entities we call “DefinedMeanings”. A DefinedMeaning can be accessed using any synonym or translation associated with it. An example is the DefinedMeaning Denmark. If you expand the “Annotation” section, you’ll see information such as the countries Denmark borders on, its capital, or currency.

But there’s more to it than meets the eye: the fact that a country can border on another one is expressed by giving a DefinedMeaning a class membership. The class “state” defines which relationships an entity of this type can possess. In this way, we define an ontology.

Note that if you log in and change your UI language, all the names will be automatically translated, if translations are available, as each relation type is a DefinedMeaning in its own right. OmegaWiki also supports free-text annotation and hyperlinks. We’ll add more attribute types as we need them.

What does that have to do with the Volapük Wikipedia? Well, OmegaWiki is the kind of technology you could build upon to maintain a central repository of the almanac data that was imported. Then, with some additional development, you could generate infoboxes with fully translated labels directly from OmegaWiki. And, within the free-text Wikipedia, you could focus on writing actual encyclopedia articles that reference this data.
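
To make that concrete, here is a purely hypothetical sketch of such infobox generation; OmegaWiki exposes no such API today, so the data lookup is stubbed out with the Denmark example from above:

// Purely hypothetical sketch: both function names are invented for illustration.
function fetchDefinedMeaningAnnotations( $definedMeaning ) {
    // Stand-in for a real lookup against a central OmegaWiki-like repository,
    // hardcoded here with the Denmark example.
    return array(
        'capital'  => 'Copenhagen',
        'currency' => 'Danish krone',
        'borders'  => 'Germany',
    );
}

function renderInfobox( $definedMeaning ) {
    $rows = '';
    foreach ( fetchDefinedMeaningAnnotations( $definedMeaning ) as $attribute => $value ) {
        // In a real system, attribute labels would arrive already translated into
        // the reader's UI language, since attribute names are DefinedMeanings themselves.
        $rows .= "|-\n! " . htmlspecialchars( $attribute ) . "\n| " . htmlspecialchars( $value ) . "\n";
    }
    // Return a wikitext table that a template or extension could embed as an infobox.
    return "{| class=\"infobox\"\n" . $rows . "|}";
}

echo renderInfobox( 'Denmark' );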

Note that OmegaWiki is not alone in the semantic wiki space. Semantic MediaWiki adds data structure by enriching the wiki syntax itself. Freebase is a closed-source free-content “database of everything.” I’m not interested in closed source technology, but I like a lot of what I see in Semantic MediaWiki, and I expect that the technologies will converge in the end, if only conceptually.

The Value of Open Source Software in Wikimedia

Florence Devouard (Chair of the Wikimedia Foundation) and I disagree a bit about the value of open source software in the Wikimedia Foundation projects. Lately Florence has been taking more of a “best tool for the job”, “don’t reinvent the wheel” approach, especially when it comes to tools we use internally or as web services (a recent discussion was about survey tools). I don’t consider myself an ideological person; I discard beliefs as quickly as I adopt them if they aren’t useful. Maximizing open source use internally and elsewhere simply strikes me as a best practice for a non-profit like the Wikimedia Foundation.

Let’s take the example of survey tools. For a user survey, you could use one of a number of web services, or you could build an open source extension to MediaWiki that collects the information. If you use the former, you might get a deal with a company that lets you use their service for free, in return for the exposure that being “advertised” through its use on Wikipedia will give them. But consider that you might want to run a similar or different survey again the next year, to check whether certain trends (like gender participation) have been affected by your actions.

If you go with the proprietary software vendor, there’s a good chance that they will downgrade you to regular customer status once they believe they’ve saturated the audience they can reach through you: no more free beer. If the company gets bought or goes bankrupt, you might not be able to work with them at all anymore. If you have specific usability complaints (say, because their survey uses JavaScript that only runs in Internet Explorer but not in Firefox), you’ll have to go through the usual support processes with third parties, and your request might not get processed at all. Depending on the nature of the deal, you also have to rely on their backup and privacy practices being sane.

As a vast online community dealing with all imaginable topics, Wikipedia has a huge number of detractors, including some deeply malicious or even mentally disturbed trolls. This means your software likely has to be more secure, because malicious hackers are more likely to try to pollute your survey with nonsense. With a proprietary survey vendor, there’s no way to let the community inspect the code for very common security vulnerabilities like SQL injection attacks. Given that they’d be running on an external server, it would also be harder to generate reliable (anonymized) user identifiers that can’t be easily hacked using a Perl script, to protect your survey against systematic data pollution. It’s not inconceivable that such an attack would even come from within the Wikipedia community itself, as a reaction to the use of proprietary software (believing in open source doesn’t mean that you’re not a dick).
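
As an aside, one standard way to get identifiers with those properties is to derive them from a server-side secret, so that the server can reproduce them but an outsider cannot forge them. A minimal sketch (the secret and survey name are placeholders, and this is not how any actual Wikimedia survey works):

// Minimal sketch: derive an anonymized, per-survey participant token from a
// server-side secret. The same user always maps to the same token for a given
// survey, so scripted or duplicate submissions are easy to spot, but the token
// cannot be forged without knowing the secret. All values are placeholders.
function surveyToken( $userId, $surveyName, $serverSecret ) {
    return hash_hmac( 'sha256', $surveyName . ':' . $userId, $serverSecret );
}

echo surveyToken( 12345, 'usability-survey', 'replace-with-a-long-random-secret' );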

Open source software is open for security auditing. Software which is committed to our own Subversion repository can also be fairly openly modified by a large number of committers, thanks to a liberal policy of granting access to the repository. In effect, the code is almost like its own little wiki world, with reverts and edit wars, but also a constant collaborative drive towards higher quality. People from all parts of the MediaWiki ecosystem contribute to it (I’ve often said that MediaWiki is almost like a Linux kernel of the free culture movement), and when they need improvements, they are likely to contribute them back, if only out of the self-interest of seeing them maintained in the official codebase.

If you need to retool your survey for, say, doing a usability inquiry into video use, an existing open source toolset makes it fairly easy to build upon what you have. And if you want to do a survey/poll that isn’t anonymized, hooking into MediaWiki will again make your life easier.

You might say: “Gee, Erik, you’re making this sound a lot more complicated than it is. A survey is just a bunch of questions and answers – what do you need complex software for? Can’t you just drop in a different piece of proprietary software whenever needed?” If you believe that, I recommend having a conversation with Erik Zachte, the creator of WikiStats. Erik knows a thing or two about analyzing data. He explained to me that one of the things you want to ensure is that the results you collect follow a standardized format. For example, if a user is asked to select a country they are from, you’ll want a list of countries to choose from, rather than asking them to type a string.

Moreover, you want this data to be translated into as many languages as possible. This is already being done in MediaWiki for the user interface, through the innovative “MediaWiki:” namespace, where users can edit user interface messages through the wiki itself. This is how we’ve managed to build a truly multilingual site even in minority languages: by making the users part of the translation process.
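
To illustrate how the two points fit together, here is a rough sketch of how a hypothetical survey extension running inside MediaWiki could build a translated country drop-down while storing standardized answers. The function and message keys are invented for illustration; wfMessage() is the standard way current MediaWiki code resolves interface text that translators can edit through the MediaWiki: namespace.

// Rough sketch inside a hypothetical MediaWiki survey extension: store a stable
// country code, but show a label translated via MediaWiki's message system
// (message keys like 'survey-country-dk' are invented for illustration).
function surveyCountryOptions( array $countryCodes ) {
    $html = '<select name="country">';
    foreach ( $countryCodes as $code ) {
        // The stored answer is the code itself; only the visible label is translated,
        // and translators can edit these labels on-wiki via the MediaWiki: namespace.
        $label = wfMessage( 'survey-country-' . $code )->text();
        $html .= '<option value="' . htmlspecialchars( $code ) . '">' .
                 htmlspecialchars( $label ) . '</option>';
    }
    return $html . '</select>';
}

// Example: surveyCountryOptions( array( 'dk', 'de', 'fr' ) );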

So, if you work with your proprietary survey vendor, you have to convince them to manage a truckload of translations for you, and you have to make damn sure that all the translated data is well-structured and re-usable should you ever decide to switch survey tools. Otherwise you’ll be spending weeks just porting the data from one toolset to another. You can try to have them work on the data with you, but you’ll be spending a lot of your time trying to push your proprietary vendor to behave in a semi-open manner, when you could have simply decided to follow best practices to begin with. Companies that aren’t committed to open standards will always be driven by their internal forces (investors, boards, lawyers, managers) towards a greater need to “control and protect our IP.”

Sure, you might have a higher upfront investment if there’s no existing toolset you can build on. But I find it quite funny that the same companies who go on and on about protecting their “intellectual property” are often so very quick to give up theirs: Open source software effectively belongs to you (and everyone else), with everything that entails. And it’s an ecosystem that gets richer every day. Instead of literally or metaphorically “buying into” someone else’s ideas, open source maximizes progress through cooperation. I cannot think of a better fit for our wiki world.

The reason to default to open source best practices is not ideological. It’s deeply pragmatic, but with a view to the long-term prospects of your organization. So while I agree with Florence that we should keep open (no pun intended) the option of using proprietary software in some areas of Wikimedia (particularly internal use), I would posit that any cost-benefit analysis has to take the many long-term benefits of the open source approach into account.

[UPDATE] LimeSurvey looks like a decent open source survey tool that we could use if we don’t care that much about deep integration.