ScholarsBox/NsdlIntegration


  1. Why integrate Scholar's Box with NSDL?
  2. Technical Notes on How we might integrate
  3. Dean Krafft's response
  4. Demonstrating a search of NSDL via WebDAV
  5. Making sense of the Search results

Why integrate Scholar's Box with NSDL?

Well, the NSDL is supposed to be aggregating a huge amount of science, math, and engineering learning content, so it's likely that ScholarsBox users interested in science materials will want to be able to gather, create, and share those materials. From a development point of view, the NSDL seems to be a great testbed for our work on interoperability. It should be an ideal environment for building the ScholarsBox because a key goal of the NSDL is to provide a rich infrastructure that glues together collections and services. Finally, there's a substantial amount of funding for NSDL, for example in the NationalScienceDigitalLibrary/ProgramSolicitation2004.

Technical Notes on How we might integrate

I've been studying the NationalScienceDigitalLibrary/TechnicalArchitecture to figure out how to integrate the ScholarsBox with NSDL. It's been a bit tricky to find the latest information. The session on NSDL Services Interoperability and Web Services from the NSDL Annual Meeting 2003 seems particularly helpful for getting up to date. Let's look at the individual talks.

Core Integration Web Services by Dean B. Krafft of Cornell. It's good to see agreement from the Core Integration people that the current CI infrastructure is heavy and that there is an interest in (commitment to) moving towards a more Web-friendly access system:

Dean Krafft gives an example of a RESTful approach to getting the OAI record of an NSDL record, given its OAI ID. For example, Why It's Essential, a lesson plan on seasons, is indexed by NSDL with an OAI ID of oai:nsdl.org:dlese.org:oai:dlese.org:DLESE-000-000-004-326. You can thus get the corresponding NSDL metadata record. The obvious immediate question is how to do a query to get the OAI ID in the first place, a question that seems to be in the process of being asked and answered ("What other queries should we support?"). Alternatives mentioned: search engine style, SQL, or XQuery. (Note that the ImsGlobal/DigitalRepositoriesSpec and the ECL implementation by EduSource use XQuery....)

Current conclusion: the NationalScienceDigitalLibrary/ProgramSolicitation2004 still looks promising. I just sent an email to Dean Krafft to ask him some questions:

Dean Krafft's response

Dean Krafft kindly replied with a thoughtful and detailed email. With his permission, I quote an excerpt:

Here's a very quick summary of where things stand. In addition to the REST
access to the OAI server, there is currently a (not very well documented)
WebDAV access to the NSDL standard search (using Lucene, being done as a
subcontract for Core Integration by the folks at UMass). I've included a
simple PHP example script that does a search lookup. That would let you
search the NSDL and have your Scholar's Box user be able to pull out
interesting bibliographic records to include in their personal collection. We
definitely have a SOAP/WSDL interface to search on the project plan -
hopefully within the next 6 months.

We also have an Archive project, which is archiving snapshots of NSDL
content. That keeps a permanent record of potentially ephemeral content
sites. Given the OAI ID (same as you would use for the REST lookup of the
metadata), you can get an HTML page of the archive. The URL is
http://srb.npaci.edu/cgi-bin/nsdl-find.cgi?identifier=oai:nsdl.org:internetscout:oai:scout.wisc.edu:ScoutArchives-10433
(replace the identifier argument with what you want to get).

Unfortunately, that's just a UI version. We are working on a SOAP/WSDL
interface that will let you select among the monthly snapshots (and other
stuff) - should be out in the next couple of months.

We're very actively working on a relationship store and architecture to
support stuff like annotations, augmented metadata, and user formed
collections. That stuff is at least 6 months off, but it might work very well
together with the Scholar's Box once we've got it.

In terms of your own repository, have you taken a look at the Fedora Digital
Repository work (http://www.fedora.info)? It might fit in to what you need
(bias alert - the developers are just a couple of cubicles over).

While we've focused on institutional portals for particular communities in
most of our descriptions, there is certainly no problem with a "personal
portal" or application like Scholar's Box interacting directly with NSDL CI
services (search, archive, eventually annotation and declaring
relationships).

You might want to take a look at a page that CI has put together with
information for NSDL proposal writers:
http://cinews.comm.nsdlib.org/cgi-bin/wiki.pl?For_Proposal_Writers

Demonstrating a search of NSDL via WebDAV

This led me to look into PythonLanguage/WebDavTechniques to see how we could use WebDAV to do a search. However, as Dean K. then pointed out to me in a subsequent email:

The PHP is pretty incidental to the WebDAV search - mostly you just need to
submit the right HTTP request, which you should be able to read out of the
code pretty easily. What comes back is a bunch of XML (the search result). I
can give you an XSLT that translates it into our own "search results list" on
the site, but it should be fairly self-explanatory.

and looking at an excerpt of the PHP code he sent me:

<?php
} else {

        $host = "search1.nsdl.org"; 
        $port = 8080; 
        $path = "/searchserver";

//      $clientSID = rand(2000, 10000);

        $search = htmlspecialchars(stripslashes($search));

        $searchstring = '<ns0:searchrequest xmlns:ns2="NSDL_1.0:"' .
        ' xmlns:ns1="http://interlib.org/SDLIP/1.0#" xmlns:ns0="DAV:">' .
        '<ns1:SearchRequest><ns1:numDocs>20</ns1:numDocs><ns1:query><ns2:request>' .
        '<ns2:query>rankBy(avg(' . $search . '))</ns2:query><ns2:fields/>' .
        '<ns2:numberToSkip>0</ns2:numberToSkip></ns2:request></ns1:query>' .
//      '<ns1:clientSID>' . $clientSID . '</ns1:clientSID><ns1:stateTimeoutReq>0</ns1:stateTimeoutReq>' .
        '<ns1:clientSID>1008</ns1:clientSID><ns1:stateTimeoutReq>0</ns1:stateTimeoutReq>' .
        '</ns1:SearchRequest></ns0:searchrequest>';


        $fp = fsockopen($host, $port, $errno, $errstr, $timeout = 30); 

        if(!$fp){ 
        //error tell us 
        echo "$errstr ($errno)\n"; 
          
        }else{ 

             //send the server request 
             fputs($fp, "SEARCH $path HTTP/1.1\r\n"); 
             fputs($fp, "Content-Encoding: utf-8\r\n"); 
             fputs($fp, "Content-Type: text/xml\r\n"); 
             fputs($fp, "Content-Length: ".strlen($searchstring)."\r\n\r\n"); 
             fputs($fp, $searchstring . "\r\n\r\n"); 

             //Echo the header on through
             fgets($fp); // Substitute for initial header
             header("HTTP/1.1 200 OK");
             header(fgets($fp));
             header(fgets($fp));

             //loop through the response from the server 
             while(!feof($fp)) { 
               echo fgets($fp, 4096); 
            } 
             //close fp - we are done with it 
             fclose($fp); 
        }
}
?>

it seems that we might not even have to look too deeply into WebDAV but can just mimic the PHP code in Python and get some XML coming out....
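As a first pass at that, here is a minimal Python sketch of the same request. The host, port, path, and XML body are copied from the PHP excerpt; everything else is an assumption on my part: the function names are made up for illustration, the query is escaped with xml.sax.saxutils.escape as a stand-in for htmlspecialchars, and the charset is sent in Content-Type rather than in the PHP's Content-Encoding header. It has not been run against the live server.

```python
import http.client
from xml.sax.saxutils import escape

# Values copied from the PHP excerpt.
HOST = "search1.nsdl.org"
PORT = 8080
PATH = "/searchserver"


def build_search_request(query, num_docs=20, skip=0, client_sid=1008):
    """Return the XML body the PHP script sends (hypothetical helper name)."""
    # Rough equivalent of PHP's htmlspecialchars: escape &, <, > and ".
    safe_query = escape(query, {'"': "&quot;"})
    return (
        '<ns0:searchrequest xmlns:ns2="NSDL_1.0:"'
        ' xmlns:ns1="http://interlib.org/SDLIP/1.0#" xmlns:ns0="DAV:">'
        f"<ns1:SearchRequest><ns1:numDocs>{num_docs}</ns1:numDocs>"
        "<ns1:query><ns2:request>"
        f"<ns2:query>rankBy(avg({safe_query}))</ns2:query><ns2:fields/>"
        f"<ns2:numberToSkip>{skip}</ns2:numberToSkip>"
        "</ns2:request></ns1:query>"
        f"<ns1:clientSID>{client_sid}</ns1:clientSID>"
        "<ns1:stateTimeoutReq>0</ns1:stateTimeoutReq>"
        "</ns1:SearchRequest></ns0:searchrequest>"
    )


def search_nsdl(query, timeout=30):
    """Send the SEARCH request and return (status, response text)."""
    body = build_search_request(query).encode("utf-8")
    conn = http.client.HTTPConnection(HOST, PORT, timeout=timeout)
    try:
        # http.client fills in the Host and Content-Length headers itself.
        conn.request(
            "SEARCH",
            PATH,
            body=body,
            headers={"Content-Type": "text/xml; charset=utf-8"},
        )
        response = conn.getresponse()
        return response.status, response.read().decode("utf-8", errors="replace")
    finally:
        conn.close()


if __name__ == "__main__":
    # Only prints the request body; call search_nsdl("season") to hit the server.
    print(build_search_request("season"))
```

Calling search_nsdl("season") should, if the server still behaves as in Dean's script, hand back the raw XML search result as a string.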

See Search & Discovery Services for the NSDL (Word doc) for documentation of SDS.

Making sense of the Search results

We're trying to understand the type of tags we get back from the NSDL metadata repository.

The NSDL Metadata Primer: NSDL standard metadata indicates:

Given that statement, we expect DC and some DC ED extensions. We get such tags with the sample record but not from the search results (from a search for "season"), where we get tags like:

                                <brandIconURL>http://content.nsdl.org/brands/dlese.org.gif</brandIconURL>
                                <subject..GEM>Earth science
 ~^ Geography
 ~^ Physical sciences</subject..GEM>
                                <format..IMT>text/html</format..IMT>
                                <brandWidth>54</brandWidth>
                                <language..RFC3066>en</language..RFC3066>
                                <rights..>Copyright 2001 National Geographic Society. All rights reserved.</rights..>
                                <title..>Why It's Essential</title..>
                                <relation.conformsTo.>Supports National Council for Geographic Education (NCGE) standard: Physical Systems:The physical processes that shape the patterns of Earth's surface</relation.conformsTo.>
                                <nsdlUniqueId>oai:nsdl.org:dlese.org:oai:dlese.org:DLESE-000-000-004-326</nsdlUniqueId>
                                <type..DCMIType>InteractiveResource</type..DCMIType>
                                <category>item</category>
                                <brandHeight>30</brandHeight>
                                <publisher..>National Geographic Society</publisher..>
                                <brandTitle>DLESE</brandTitle>
                                <description..>This lesson plan asks students to think about aspects of the changing seasons in their region such as temperature variations, seasonal household chores, changes in foods available at the market, and the length of the days. Students will discuss their experiences with and knowledge of the four seasons; look at pictures of the four seasons and compare those pictures to the seasons in their home region; plan and hold a party commemorating the four seasons; and write stories depicting themselves showing a visitor some of the things they like best about their favorite season.</description..>

There is no namespace qualification, and the tags look like DC but are not identical to it. Is this tag set a transitional one?

Jon Phipps (one of the developers and formerly the Metadata Repository tech lead at the NSDL):

What you're seeing is the NSDL Search Service's Search Response XML format which is optimized for the NSDL search interface and is a legacy of the limitations imposed by our original search engine. Right at the moment, the best way to get useful metadata is to grab the contents of the <nsdlUniqueId> tag and perform an OAI-PMH GetRecord request: http://services.nsdl.org:8080/nsdloai/OAI?verb=GetRecord&identifier=oai:nsdl.org:dlese.org:oai:dlese.org:DLESE-000-000-004-326&metadataPrefix=nsdl_dc as you indicated above. This will retrieve a Qualified Dublin Core record and this is what we do in our own search results to display the unqualified DC metadata in More Info. We at the NSDL agree that this is, umm, suboptimal and as Dean indicated, we hope to make a number of improvements.
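Jon's suggestion can be sketched in Python as follows. The GetRecord base URL and the nsdl_dc prefix come from his reply, and the archive URL from Dean's email. The helper names and the results/record wrapper in SAMPLE are hypothetical, since we only know the inner tags of the search response and not its outer structure.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Base URLs quoted on this page.
OAI_BASE = "http://services.nsdl.org:8080/nsdloai/OAI"
ARCHIVE_BASE = "http://srb.npaci.edu/cgi-bin/nsdl-find.cgi"


def get_record_url(oai_id, metadata_prefix="nsdl_dc"):
    """Build the OAI-PMH GetRecord URL for one OAI ID."""
    params = {
        "verb": "GetRecord",
        "identifier": oai_id,
        "metadataPrefix": metadata_prefix,
    }
    # safe=":" keeps the colons in the OAI ID readable, as in the quoted URL.
    return OAI_BASE + "?" + urlencode(params, safe=":")


def archive_url(oai_id):
    """Build the URL of the archive's HTML page for the same OAI ID."""
    return ARCHIVE_BASE + "?" + urlencode({"identifier": oai_id}, safe=":")


def unique_ids(search_xml):
    """Collect the text of every <nsdlUniqueId> element, ignoring namespaces."""
    root = ET.fromstring(search_xml)
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "nsdlUniqueId" and el.text
    ]


# Hypothetical wrapper elements around two tags taken from the sample above.
SAMPLE = """<results><record>
  <title..>Why It's Essential</title..>
  <nsdlUniqueId>oai:nsdl.org:dlese.org:oai:dlese.org:DLESE-000-000-004-326</nsdlUniqueId>
</record></results>"""

for oai_id in unique_ids(SAMPLE):
    print(get_record_url(oai_id))
    print(archive_url(oai_id))
```

Matching on the local tag name rather than a fixed path is a deliberate choice here, so the sketch keeps working whatever the real wrapper elements turn out to be.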