27 September 2009

Extracting Scientists & SF writers from Wikipedia.


In a recent post on FriendFeed, Christopher Harris asked: do you know of any science fiction writer who is/was also a scientist? My first approach to retrieving those names automatically was to use Freebase. For example, the following MQL query retrieves the persons who are both Scientists and SF Writers.

[{
  "id": null,
  "name": null,
  "type": "/people/person",
  "a:profession": [{"name": "Scientist"}],
  "b:profession": [{"name": "Science-Fiction Writer"}],
  "limit": 100
}]
The MQL query Editor returned the following result:
{
  "code": "/api/status/ok",
  "result": [
    {
      "a:profession": [{
        "name": "Scientist"
      }],
      "b:profession": [{
        "name": "Science-fiction writer"
      }],
      "id": "/en/edward_llewellyn",
      "name": "Edward Llewellyn",
      "type": "/people/person"
    },
    {
      "a:profession": [{
        "name": "Scientist"
      }],
      "b:profession": [{
        "name": "Science-fiction writer"
      }],
      "id": "/en/konrad_fialkowski",
      "name": "Konrad Fiałkowski",
      "type": "/people/person"
    }
  ],
  "status": "200 OK",
  "transaction_id": "cache;cache01.p01.sjc1:8101;2009-09-27T15:58:11Z;0002"
}
Only two persons! That's not much. The reason is that the articles in Wikipedia, as well as in Freebase, are classified using hierarchical categories (sadly, the hierarchy is not an acyclic graph), and there is no tool to find the articles matching the sub-categories. So you'll have to repeat this query for the "British scientists", the "French Biologists", etc. (By the way, I think wikipedians should not have been allowed to mix two distinct kinds of categories, e.g. profession and nationality: it messes up the classification.) Do you know if this can be achieved using SPARQL and DBPedia?
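To make the sub-category problem concrete, here is a minimal sketch of a depth-limited walk over a category graph. It is only an illustration under my own assumptions: the class name CategoryWalk, the in-memory map and the toy category names are all made up, and nothing is fetched from Wikipedia. The 'seen' set is what keeps the walk from looping when the categories contain a cycle.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: collect a category and its sub-categories up to a maximum depth. */
class CategoryWalk {
    /**
     * Breadth-first walk over 'children' (category -> direct sub-categories).
     * The 'seen' set guarantees that each category is expanded at most once,
     * so a cycle in the category graph cannot cause an endless loop.
     */
    static Set<String> subCategories(Map<String, List<String>> children, String root, int maxDepth) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        Deque<Integer> depths = new ArrayDeque<>(); // depth of each queued category
        seen.add(root);
        queue.add(root);
        depths.add(0);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            int depth = depths.remove();
            if (depth >= maxDepth) continue; // do not expand beyond the depth limit
            for (String child : children.getOrDefault(current, List.of())) {
                if (seen.add(child)) { // add() returns false for an already-visited category
                    queue.add(child);
                    depths.add(depth + 1);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Toy graph with made-up categories; note the cycle Biologists -> Scientists.
        Map<String, List<String>> toy = Map.of(
            "Scientists", List.of("Biologists", "Physicists"),
            "Biologists", List.of("French biologists", "Scientists"),
            "French biologists", List.of("French geneticists"));
        System.out.println(subCategories(toy, "Scientists", 2));
    }
}
```

With a depth limit of 2, "French geneticists" (three levels below the root) is not reached, which is the same kind of cut-off that makes a deeply nested category invisible to a depth-limited search.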

Then I wrote a Java tool that extracts the pages having a given WP category using the Wikipedia API. This tool, "wpsubcat", is available here: http://code.google.com/p/lindenb/downloads/list and requires BerkeleyDB Java Edition in order to store the temporary results. The source code is available here: WPSubCat.

Usage

-debug-level <java.util.logging.Level> default:OFF
-base <url> default:http://en.wikipedia.org
-ns <int> restrict results to the given namespace default:14 (Category)
-db-home BerkeleyDB default directory:/tmp/bdb
-d <integer> max recursion depth default:3

-add <category> add a starting article
OR
(stdin|files) containing articles' titles

Examples


Retrieve all the subClasses of 'Category:Scientists'
java -cp je-3.3.75.jar:wpsubcat.jar org.lindenb.tinytools.WPSubCat \
-add "Category:Scientists" > catscientists.txt

Retrieve all the scientists.
java -cp je-3.3.75.jar:wpsubcat.jar org.lindenb.tinytools.WPSubCat \
-ns 0 -d 0 catscientists.txt > scientists.txt
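
For readers who want to query the API by hand, the sketch below builds the kind of request that lists the members of a category. It assumes the MediaWiki 'categorymembers' list of api.php; I am not claiming this is the exact request sent by wpsubcat, and the class name CategoryMembersUrl is hypothetical. The 'base' and 'namespace' arguments play the role of the -base and -ns options shown above (14 for sub-categories, 0 for articles).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: builds a MediaWiki 'categorymembers' query URL. */
class CategoryMembersUrl {
    /**
     * @param base      wiki base URL, e.g. http://en.wikipedia.org
     * @param category  full category title, e.g. Category:Scientists
     * @param namespace namespace of the members to return (14 = Category, 0 = articles)
     */
    static String build(String base, String category, int namespace) {
        return base + "/w/api.php?action=query&list=categorymembers&format=json"
            + "&cmlimit=500" // ask for up to 500 members per request
            + "&cmnamespace=" + namespace
            + "&cmtitle=" + URLEncoder.encode(category, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(build("http://en.wikipedia.org", "Category:Scientists", 14));
    }
}
```

A large category does not fit in one response: the API then returns a continuation value that has to be sent back (as 'cmcontinue') to get the next batch, which this sketch leaves out.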


Result


After a series of 'sort' and 'comm', the result is the following list (in fact, it is underestimated: I've slightly improved the way the sub-categories are retrieved):
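
The 'sort' and 'comm' step boils down to keeping the titles that appear in both files. As a rough equivalent, here is a small sketch that does the same thing in Java; the class name CommonTitles is made up, and the second input file (a list of SF writers built the same way as scientists.txt) is an assumption of mine.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch: titles present in two lists, like 'comm -12' on sorted files. */
class CommonTitles {
    static List<String> inBoth(List<String> a, List<String> b) {
        TreeSet<String> common = new TreeSet<>(a); // sorts and removes duplicates
        common.retainAll(b);                       // keep only the titles also found in b
        return List.copyOf(common);                // copy keeps the sorted order
    }

    public static void main(String[] args) throws IOException {
        // args[0] and args[1]: two text files with one article title per line
        List<String> first = Files.readAllLines(Path.of(args[0]));
        List<String> second = Files.readAllLines(Path.of(args[1]));
        inBoth(first, second).forEach(System.out::println);
    }
}
```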

That's it

Pierre

3 comments:

Egon Willighagen said...

The DBPedia SPARQL end point can be found at:

http://dbpedia.org/sparql

Noel O'Boyle said...

Cool, but you missed Alastair Reynolds who was working at the ESA until last year, and Joe Haldeman (only a BSc in science admittedly).

Pierre Lindenbaum said...

@baoilleach: those persons are missing because they haven't been 'categorized'. E.g. Haldeman was classified with the following categories (2009-09-28):

1943 births | Living people | People from Oklahoma City, Oklahoma | People from Gainesville, Florida | American science fiction writers | Military science fiction writers | American novelists | Writers from Oklahoma | Hugo Award winning authors | Nebula Award winning authors | University of Maryland, College Park alumni | American military personnel of the Vietnam War | Worldcon Guests of Honor | Iowa Writers' Workshop alumni.

None of those categories is a sub-class of 'Category:Scientist' (depth=3).

Same comment for Reynolds, he was categorized as "Alumni_of_Newcastle_University", but this category is not a sub-class of 'Scientist' (depth=3).