Wiki Revisions History

I recently wanted some sample data for my NoSQL trials and decided to look for wiki metadata, specifically the history of page revisions. I soon realized that many people have already built applications on this data and that an extensive API is available for retrieving it.

I could build the data set by running through the API and fetching the revisions of every page. However, scraping isn’t a good idea, and MediaWiki limits the results for that very reason.

For reference, the API is documented at http://www.mediawiki.org/wiki/API

As an example, to find the revisions for a page named “Geography_of_Afghanistan”, we could use the following call:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Geography_of_Afghanistan&rvprop=ids|timestamp|user&rvlimit=5000

The following call would also return the edit comments:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Geography_of_Afghanistan&rvprop=ids|timestamp|user|comment&rvlimit=5000

Notice that although we set rvlimit to 5000, the results are limited to 500, and the response also carries the message:

“rvlimit may not be over 500 (set to 5000) for users” 
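
Because of this cap, a full revision history has to be requested in batches of at most 500. Below is a minimal Python sketch of how such a request URL might be built. The helper name `revisions_url`, the `format=json` parameter and the `extra` argument are my own assumptions and do not appear in the calls above. `extra` is only a hook for whatever continuation parameters a previous response returns, since their names depend on the API version.

```python
from urllib.parse import urlencode

API_ENDPOINT = "http://en.wikipedia.org/w/api.php"
MAX_RVLIMIT = 500  # cap the API reports for ordinary users


def revisions_url(title, props=("ids", "timestamp", "user"), limit=500, extra=None):
    """Build a revisions query URL, clamping rvlimit to the API cap.

    `extra` is a hypothetical hook for continuation parameters taken
    from a previous response.
    """
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "|".join(props),
        "rvlimit": min(limit, MAX_RVLIMIT),
        "format": "json",  # assumption: ask for JSON instead of the default output
    }
    if extra:
        params.update(extra)
    # Keep "|" unescaped so the URL reads like the hand-written calls.
    return API_ENDPOINT + "?" + urlencode(params, safe="|")


url = revisions_url("Geography_of_Afghanistan", limit=5000)
print(url)
```

Asking for 5000 here produces a URL with rvlimit=500, so the request matches what the server would return anyway.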

To get the complete data set, Wikimedia provides data dumps that can be downloaded. The dumps are available at http://dumps.wikimedia.org/enwiki/20110317/

What we need is the meta history. Once I had downloaded the dump, I realised that the matching XSD was not available for it. The latest XSD supplied by MediaWiki is at http://www.mediawiki.org/xml/export-0.4.xsd, but we need export-0.5.xsd to work with the downloaded dumps.

To solve this, I downloaded Trang, which can generate an XSD from XML. Here is a good write-up to get an idea.
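
For reference, Trang ships as a Java jar and infers the input and output formats from the file extensions. Here is a small Python sketch of how the invocation might be assembled. The jar path and file names are placeholders, not the ones I actually used, and the helper names are hypothetical.

```python
import subprocess


def trang_command(sample_xml, output_xsd, jar="trang.jar"):
    """Return the argv that asks Trang to infer an XSD from sample XML."""
    return ["java", "-jar", jar, sample_xml, output_xsd]


def run_trang(sample_xml, output_xsd, jar="trang.jar"):
    """Run Trang; requires Java and the Trang jar to be present on disk."""
    subprocess.run(trang_command(sample_xml, output_xsd, jar), check=True)


# Placeholder file names, for illustration only.
cmd = trang_command("sample-meta-history.xml", "export-0.5.xsd")
print(" ".join(cmd))
```

The inferred schema can only describe what the sample contains, so the input should probably include pages that exercise the optional elements, such as redirects and minor edits.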

I am adding the generated export-0.5.xsd to this post. I hope it helps others until the XSD is published by MediaWiki.

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" targetNamespace="http://www.mediawiki.org/xml/export-0.5/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:export-0.5="http://www.mediawiki.org/xml/export-0.5/">
  <xs:element name="mediawiki">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="export-0.5:siteinfo"/>
        <xs:element maxOccurs="unbounded" ref="export-0.5:page"/>
      </xs:sequence>
      <xs:attribute name="version" use="required" type="xs:decimal"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="siteinfo">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="export-0.5:sitename"/>
        <xs:element ref="export-0.5:base"/>
        <xs:element ref="export-0.5:generator"/>
        <xs:element ref="export-0.5:case"/>
        <xs:element ref="export-0.5:namespaces"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="sitename" type="xs:NCName"/>
  <xs:element name="base" type="xs:anyURI"/>
  <xs:element name="generator" type="xs:string"/>
  <xs:element name="case" type="xs:NCName"/>
  <xs:element name="namespaces">
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs="unbounded" ref="export-0.5:namespace"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="namespace">
    <xs:complexType mixed="true">
      <xs:attribute name="case" use="required" type="xs:NCName"/>
      <xs:attribute name="key" use="required" type="xs:integer"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="page">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="export-0.5:title"/>
        <xs:element ref="export-0.5:id"/>
        <xs:element minOccurs="0" ref="export-0.5:redirect"/>
        <xs:element minOccurs="0" ref="export-0.5:restrictions"/>
        <xs:element maxOccurs="unbounded" ref="export-0.5:revision"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="title" type="xs:string"/>
  <xs:element name="redirect">
    <xs:complexType/>
  </xs:element>
  <xs:element name="restrictions" type="xs:string"/>
  <xs:element name="revision">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="export-0.5:id"/>
        <xs:element ref="export-0.5:timestamp"/>
        <xs:element ref="export-0.5:contributor"/>
        <xs:element minOccurs="0" ref="export-0.5:minor"/>
        <xs:element minOccurs="0" ref="export-0.5:comment"/>
        <xs:element ref="export-0.5:text"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="timestamp" type="xs:NMTOKEN"/>
  <xs:element name="contributor">
    <xs:complexType>
      <xs:choice minOccurs="0">
        <xs:element ref="export-0.5:ip"/>
        <xs:sequence>
          <xs:element ref="export-0.5:username"/>
          <xs:element ref="export-0.5:id"/>
        </xs:sequence>
      </xs:choice>
      <xs:attribute name="deleted" type="xs:NCName"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="ip" type="xs:string"/>
  <xs:element name="username" type="xs:string"/>
  <xs:element name="minor">
    <xs:complexType/>
  </xs:element>
  <xs:element name="comment">
    <xs:complexType mixed="true">
      <xs:attribute name="deleted" type="xs:NCName"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="text">
    <xs:complexType>
      <xs:attribute name="bytes"/>
      <xs:attribute name="deleted" type="xs:NCName"/>
      <xs:attribute name="id" type="xs:integer"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="id" type="xs:integer"/>
</xs:schema>
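
With the structure above in hand, the revision metadata can be pulled out of a dump without loading the whole file. The following is a minimal Python sketch, assuming the export-0.5 namespace and the element names from the schema above. The function name `iter_revisions` and the sample fragment are made up for illustration, and the fragment is trimmed (it has no siteinfo), so it is not itself valid against the schema.

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.5/}"


def iter_revisions(source):
    """Yield (page_title, revision_id, timestamp, contributor) tuples.

    Parses incrementally and clears finished elements to keep memory low.
    """
    title = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "title":
            title = elem.text
        elif elem.tag == NS + "revision":
            contributor = elem.find(NS + "contributor")
            who = None
            if contributor is not None:
                # Registered users carry a username, anonymous edits an ip.
                who = (contributor.findtext(NS + "username")
                       or contributor.findtext(NS + "ip"))
            rev_id = int(elem.findtext(NS + "id"))  # direct child id only
            yield title, rev_id, elem.findtext(NS + "timestamp"), who
            elem.clear()  # drop the finished revision's children
        elif elem.tag == NS + "page":
            elem.clear()


# Made-up, trimmed fragment for illustration.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.5/" version="0.5">
  <page>
    <title>Example</title>
    <id>1</id>
    <revision>
      <id>10</id>
      <timestamp>2011-03-17T00:00:00Z</timestamp>
      <contributor><username>Alice</username><id>7</id></contributor>
      <text id="100"/>
    </revision>
    <revision>
      <id>11</id>
      <timestamp>2011-03-18T00:00:00Z</timestamp>
      <contributor><ip>127.0.0.1</ip></contributor>
      <text id="101"/>
    </revision>
  </page>
</mediawiki>"""

rows = list(iter_revisions(io.BytesIO(SAMPLE)))
for row in rows:
    print(row)
```

For a real dump, the same function could be pointed at an open file object instead of the in-memory sample.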