Wiki Revisions History

I recently wanted some sample data for my NoSQL trials and decided to look for wiki metadata, specifically the revision history of pages. I soon realized that many people have already built applications on this data and that an extensive API is available for retrieving it.

I could build the data set by running through the API and fetching the revisions of every page. However, scraping isn’t a good idea, and MediaWiki limits the size of the results for that very reason.

For reference, the API is documented at http://www.mediawiki.org/wiki/API

As an example, to find the revisions of a page named “Geography_of_Afghanistan”, we could use the following call:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Geography_of_Afghanistan&rvprop=ids|timestamp|user&rvlimit=5000

The following call also returns the edit comments:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Geography_of_Afghanistan&rvprop=ids|timestamp|user|comment&rvlimit=5000

Notice that although we set rvlimit to 5000, the results are limited to 500, and the response also carries this message:

“rvlimit may not be over 500 (set to 5000) for users” 
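
The query URLs above can also be assembled in code. The following is a minimal sketch, not part of any MediaWiki client library: the class name RevisionQuery and the method buildUrl are made up for illustration, and the cap of 500 is taken from the message quoted above. It clamps rvlimit before building the URL and URL-encodes the `|` separator in rvprop.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds the revision query URL used in this post.
class RevisionQuery {
    // Cap taken from the API message quoted above (applies to ordinary users).
    static final int MAX_RVLIMIT = 500;
    static final String ENDPOINT = "http://en.wikipedia.org/w/api.php";

    static String buildUrl(String title, List<String> props, int limit) {
        // Clamp the requested limit into the range the API accepts.
        int effective = Math.max(1, Math.min(limit, MAX_RVLIMIT));
        // rvprop values are separated by '|', which is percent-encoded below.
        String rvprop = String.join("|", props);
        return ENDPOINT
                + "?action=query&prop=revisions"
                + "&titles=" + encode(title)
                + "&rvprop=" + encode(rvprop)
                + "&rvlimit=" + effective;
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("Geography_of_Afghanistan",
                Arrays.asList("ids", "timestamp", "user", "comment"), 5000));
    }
}
```

Asking for 5000 revisions here yields a URL with rvlimit=500, which matches what the server would do anyway but avoids the warning in the response.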

To get the complete data set, Wikimedia provides data dumps that can be downloaded. The dumps are available at http://dumps.wikimedia.org/enwiki/20110317/

What we need is the meta history dump. Once I had downloaded it, I realised that the matching XSD was not available: the latest schema published by MediaWiki is http://www.mediawiki.org/xml/export-0.4.xsd, but the downloaded dumps need export-0.5.xsd.
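
Because a history dump is far too large to load into memory, a streaming parser is one reasonable way to read it. The sketch below is an illustration only: the class RevisionCounter is hypothetical, and the embedded sample is made-up data in the shape of an export-0.5 dump (it leaves out the siteinfo block for brevity). It uses the JDK's StAX reader to count the revision elements under each page title.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: counts <revision> elements per <page> while streaming.
class RevisionCounter {
    // Made-up sample shaped like an export-0.5 dump; not real Wikipedia data.
    static final String SAMPLE =
          "<mediawiki xmlns='http://www.mediawiki.org/xml/export-0.5/' version='0.5'>"
        + "<page><title>Page A</title><id>1</id>"
        + "<revision><id>10</id><timestamp>2011-03-17T00:00:00Z</timestamp>"
        + "<contributor><ip>127.0.0.1</ip></contributor><text id='100'/></revision>"
        + "<revision><id>11</id><timestamp>2011-03-17T01:00:00Z</timestamp>"
        + "<contributor><username>Example</username><id>7</id></contributor>"
        + "<text id='101'/></revision>"
        + "</page>"
        + "<page><title>Page B</title><id>2</id>"
        + "<revision><id>20</id><timestamp>2011-03-17T02:00:00Z</timestamp>"
        + "<contributor><ip>127.0.0.1</ip></contributor><text id='200'/></revision>"
        + "</page>"
        + "</mediawiki>";

    static Map<String, Integer> countRevisions(Reader xml) throws XMLStreamException {
        Map<String, Integer> counts = new LinkedHashMap<>();
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        String currentTitle = null;
        try {
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = reader.getLocalName();
                if ("title".equals(name)) {
                    // A <title> marks the start of a new page's data.
                    currentTitle = reader.getElementText();
                    counts.put(currentTitle, 0);
                } else if ("revision".equals(name) && currentTitle != null) {
                    counts.merge(currentTitle, 1, Integer::sum);
                }
            }
        } finally {
            reader.close();
        }
        return counts;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(countRevisions(new StringReader(SAMPLE)));
    }
}
```

For a real dump, the same method could be given a reader over the decompressed file instead of the sample string; only one element is held in memory at a time.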

To solve this, I downloaded trang, which can generate an XSD from XML. Here is a good write-up to get an idea.

I am adding the generated export-0.5.xsd to this post. I hope it helps others until MediaWiki publishes the XSD.

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" targetNamespace="http://www.mediawiki.org/xml/export-0.5/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:export-0.5="http://www.mediawiki.org/xml/export-0.5/">
  <xs:element name="mediawiki">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="export-0.5:siteinfo"/>
        <xs:element maxOccurs="unbounded" ref="export-0.5:page"/>
      </xs:sequence>
      <xs:attribute name="version" use="required" type="xs:decimal"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="siteinfo">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="export-0.5:sitename"/>
        <xs:element ref="export-0.5:base"/>
        <xs:element ref="export-0.5:generator"/>
        <xs:element ref="export-0.5:case"/>
        <xs:element ref="export-0.5:namespaces"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="sitename" type="xs:NCName"/>
  <xs:element name="base" type="xs:anyURI"/>
  <xs:element name="generator" type="xs:string"/>
  <xs:element name="case" type="xs:NCName"/>
  <xs:element name="namespaces">
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs="unbounded" ref="export-0.5:namespace"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="namespace">
    <xs:complexType mixed="true">
      <xs:attribute name="case" use="required" type="xs:NCName"/>
      <xs:attribute name="key" use="required" type="xs:integer"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="page">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="export-0.5:title"/>
        <xs:element ref="export-0.5:id"/>
        <xs:element minOccurs="0" ref="export-0.5:redirect"/>
        <xs:element minOccurs="0" ref="export-0.5:restrictions"/>
        <xs:element maxOccurs="unbounded" ref="export-0.5:revision"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="title" type="xs:string"/>
  <xs:element name="redirect">
    <xs:complexType/>
  </xs:element>
  <xs:element name="restrictions" type="xs:string"/>
  <xs:element name="revision">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="export-0.5:id"/>
        <xs:element ref="export-0.5:timestamp"/>
        <xs:element ref="export-0.5:contributor"/>
        <xs:element minOccurs="0" ref="export-0.5:minor"/>
        <xs:element minOccurs="0" ref="export-0.5:comment"/>
        <xs:element ref="export-0.5:text"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="timestamp" type="xs:NMTOKEN"/>
  <xs:element name="contributor">
    <xs:complexType>
      <xs:choice minOccurs="0">
        <xs:element ref="export-0.5:ip"/>
        <xs:sequence>
          <xs:element ref="export-0.5:username"/>
          <xs:element ref="export-0.5:id"/>
        </xs:sequence>
      </xs:choice>
      <xs:attribute name="deleted" type="xs:NCName"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="ip" type="xs:string"/>
  <xs:element name="username" type="xs:string"/>
  <xs:element name="minor">
    <xs:complexType/>
  </xs:element>
  <xs:element name="comment">
    <xs:complexType mixed="true">
      <xs:attribute name="deleted" type="xs:NCName"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="text">
    <xs:complexType>
      <xs:attribute name="bytes"/>
      <xs:attribute name="deleted" type="xs:NCName"/>
      <xs:attribute name="id" type="xs:integer"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="id" type="xs:integer"/>
</xs:schema>
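
One way to put a schema like this to use is to validate a document against it with the JDK's javax.xml.validation API. The following is a minimal sketch under stated assumptions: the class XsdCheck is hypothetical, and to stay self-contained it embeds a deliberately cut-down schema (only the mediawiki element and its required version attribute) rather than the full listing above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical helper: checks an XML document against an XSD.
class XsdCheck {
    static final String NS = "http://www.mediawiki.org/xml/export-0.5/";

    // Cut-down stand-in for the generated schema, for illustration only.
    static final String MINI_XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
        + " elementFormDefault='qualified' targetNamespace='" + NS + "'>"
        + "<xs:element name='mediawiki'><xs:complexType>"
        + "<xs:attribute name='version' use='required' type='xs:decimal'/>"
        + "</xs:complexType></xs:element>"
        + "</xs:schema>";

    static final String VALID_DOC = "<mediawiki xmlns='" + NS + "' version='0.5'/>";
    // Missing the required version attribute.
    static final String INVALID_DOC = "<mediawiki xmlns='" + NS + "'/>";

    static boolean isValid(String xsd, String xml) throws SAXException, IOException {
        // A broken schema is reported to the caller as a SAXException.
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
        Validator validator = schema.newValidator();
        try {
            // With no custom error handler, validation errors are thrown.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("valid document:   " + isValid(MINI_XSD, VALID_DOC));
        System.out.println("invalid document: " + isValid(MINI_XSD, INVALID_DOC));
    }
}
```

With the real files, the two StringReader sources would be replaced by StreamSource objects pointing at the saved schema and at the dump. Validating a full history dump this way is likely to be slow, so checking a small extract first is probably more practical.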

Accessing databases using datasource xml

March 15, 2010

We will use the previous Struts project to set up database access to MySQL running on localhost.
Run the following command to create the .project and .classpath files for Eclipse, then import the project into Eclipse:

mvn eclipse:m2eclipse

Download the database driver mysql-connector-java-5.1.11-bin.jar from here and save it to the lib folder of the server configuration, $JBOSS_HOME/server/web/lib

Create a jboss-web.xml file in the WEB-INF folder with the following content

<?xml version="1.0" encoding="UTF-8"?>
<jboss-web>
  <resource-ref>
    <res-ref-name>CodesiloDS</res-ref-name>
    <res-type>javax.sql.DataSource</res-type>
    <jndi-name>java:CodesiloDS</jndi-name>
  </resource-ref>
</jboss-web>

In the web.xml file, add the following:

<resource-ref>
  <res-ref-name>CodesiloDS</res-ref-name>
  <res-type>javax.sql.DataSource</res-type>
  <res-auth>Container</res-auth>
</resource-ref>

Create a datasource XML file and add it to the deploy directory of JBoss. JBoss picks up datasource descriptors whose file names end in -ds.xml.

Here is the content of my file:

<datasources>
  <local-tx-datasource>
    <jndi-name>CodesiloDS</jndi-name>
    <connection-url>jdbc:mysql://localhost:3306/database-name</connection-url>
    <driver-class>com.mysql.jdbc.Driver</driver-class>
    <user-name>username</user-name>
    <password>password</password>
    <min-pool-size>5</min-pool-size>
    <max-pool-size>20</max-pool-size>
    <!-- Typemapping for JBoss 4.0 -->
    <metadata>
      <type-mapping>mySQL</type-mapping>
    </metadata>
  </local-tx-datasource>
</datasources>

Replace database-name, username and password with the appropriate values.

To keep things simple to test, we will add an action in struts-config and have the Action class call the database.

In the Action class, add this:

private static final String DATABASE_JNDI_SOURCE = "java:CodesiloDS";

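and put this in the execute method. (In the real world most of this code would reside in your model and any helper classes.)

// Imports needed at the top of the Action class, besides the Struts and servlet ones:
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.sql.DataSource;

public ActionForward execute(ActionMapping mapping, ActionForm form,
        HttpServletRequest request, HttpServletResponse response)
        throws Exception {

    Connection conn = null;
    Statement stmt = null;
    ResultSet rs = null;
    String date = "";
    try {
        // Look up the container-managed DataSource bound in JNDI.
        Context ctx = new InitialContext();
        DataSource ds = (DataSource) ctx.lookup(DATABASE_JNDI_SOURCE);
        if (ds != null) {
            conn = ds.getConnection();
            stmt = conn.createStatement();
            rs = stmt.executeQuery("select curdate() from dual");
            if (rs.next()) {
                date = rs.getString(1);
            }
        }
    } catch (Exception e) {
        // Placeholder handling: report the failure; replace with real logging.
        e.printStackTrace();
    } finally {
        // Close in reverse order of creation. Any of these may still be null
        // if an earlier step failed, and one failed close must not skip the rest.
        if (rs != null) {
            try { rs.close(); } catch (SQLException ignored) { }
        }
        if (stmt != null) {
            try { stmt.close(); } catch (SQLException ignored) { }
        }
        if (conn != null) {
            try { conn.close(); } catch (SQLException ignored) { }
        }
    }
    System.out.println("This is a test: " + date);
    return mapping.findForward("success");
}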

Once we call the action, we should see the current date printed on the console.
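
On Java 7 or later, the cleanup in the finally block can be left to try-with-resources. The following is a sketch of that alternative, not the code used in this project: the class CurrentDateDao and its method are hypothetical names, and it assumes the DataSource has already been obtained from JNDI as shown above.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import javax.sql.DataSource;

// Hypothetical helper that keeps the JDBC calls out of the Action class.
class CurrentDateDao {

    // Returns the database's current date as a string, or "" if no row came back.
    static String fetchCurrentDate(DataSource ds) throws SQLException {
        // Resources declared here are closed automatically, in reverse order,
        // whether the body completes normally or throws.
        try (Connection conn = ds.getConnection();
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select curdate()")) {
            return rs.next() ? rs.getString(1) : "";
        }
    }
}
```

The Action would then call CurrentDateDao.fetchCurrentDate(ds) after the JNDI lookup and decide there how to handle an SQLException.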