MarcXmlToOpenUrlCrosswalk


  1. David Walker's suggestions
  2. Trying out the OpenURL Referrer extension
    1. OpenURL Referrer problems
  3. Reading about OpenURL
  4. Specific example to compare 0.1 to 1.0
  5. Learning about MARC 21, MARCXML, OpenURL 1.0, 0.1 etc
  6. MARC examples from MetaLib
  7. MODS versions of the MARCXML examples from MetaLib
  8. Paths for generalization of this work
  9. use of 773 for journals
  10. OpenURL 0.1 vs 1.0
  11. Correspondence with Walt Crawford on MARCXML and OpenURL
  12. A start at mapping MARC to OpenURL
  13. Examples of stuff coming out of 773$g
  14. MARCXML, MODS, and representation of serials metadata
  15. Our best effort at a mapping
    1. author handling
    2. title handling
    3. volume, number, page handling
    4. other fields: ignore?
    5. other bibliographic info that might be useful for citation but not necessarily for OpenURL (listed in CDL)
  16. Conclusions with respect to MARC and OpenURL
  17. Current Next Steps

As part of integrating the Scholar's Box with MetaLib and part of understanding the interrelationships among bibliographic metadata, we are trying to figure out how to construct an OpenURL from the MARC XML coming from the MetaLib X-Server. Here's a query I sent out to a MetaLib X-Server development list:

David Walker's suggestions

David Walker from CSU San Marcos has done some work in this area. He has kindly shared some info, which I quote here with his permission:

Trying out the OpenURL Referrer extension

I need to remind myself of the intricacies of OpenURL(s) and the MARC XML spec, so some of what I write here is geared to bringing myself up to speed again.

To that end, I have tried installing the [WWW]Openly's OpenURL Referrer FireFoxBrowser extension:

This extension is interesting to me because of its implementation of both the 0.1 and 1.0 versions of OpenURL -- and because it does so in the context of GoogleScholar, the intriguing new kid on the block when it comes to MetaSearch.

OpenURL Referrer problems

Thomas P. Ventimiglia, the author of the extension, has been extremely helpful in tracking down the issues. We've not gotten to the bottom of the problem(s) yet (some of which may be Firefox problems or those of another extension; we don't know yet). There is behavior that brings up the matter of how Firefox extensions interact with each other, a topic little discussed in my limited view of things.

At any rate, the extension is working well enough for me to use to hook up to the UC E-links server and also look at the format of OpenURL 1.0.

Reading about OpenURL

In the meantime, I did download the extension and printed out some code to study it.

I've also printed out [WWW]Ex Libris - OpenURL Syntax to formally study the OpenURL 0.1 syntax. Searches for good, simple info on OpenURL 0.1 and the putatively much more complicated 1.0 led me to Walt Crawford's [WWW]OpenURL - Brief Bibliography, which, in turn, points to a [WWW]2-page description of OpenUrl. To help me understand OpenURL version 1.0, I plan to read [WWW]Z39.88-2004: The OpenURL Framework for Context-Sensitive Services The Key/Encoded-Value (KEV) Format Implementation Guidelines.

Specific example to compare 0.1 to 1.0

In the tradition of google vanity searches, I will use a [WWW]search for Yee and Beaubien on GoogleScholar as an example. With the [WWW]Openly's OpenURL Referrer extension installed, I get the following OpenURLs:

Now I break down the pieces of the respective OpenURLs to look at how the key/value pairs compare between the two versions of OpenURL. The first table contains elements that have analogs between the two versions of OpenURL:

0.1 key | 0.1 value | 1.0 key | 1.0 value
resolver | http://ucelinks.cdlib.org:8888/sfx_local | 1.0 resolver | http://sirsi-resolver.sirsi.net/
sid | openly:openurlref | rfr_id | info:sid/openly.com:openurlref
genre | article | rft.genre | article
title | Library%20Hi%20Tech | rft.jtitle | Library%20Hi%20Tech
date | 2004 | rft.date | 2004
atitle | A%20preliminary%20crosswalk%20from%20METS%20to%20IMS%20content%20packaging | rft.atitle | A%20preliminary%20crosswalk%20from%20METS%20to%20IMS%20content%20packaging
aulast | Yee | rft.aulast | Yee
auinit | R | rft.auinit | R

There are also extra terms in the OpenURL 1.0:

url_ver Z39.88-2004
rft_val_fmt info:ofi/fmt:kev:mtx:journal
rfe_id http%3A%2F%2Fscholar.google.com%2Fscholar%3Fhl%3Den%26lr%3D%26q%3Dyee%2Bbeaubien%26btnG%3DSearch
rft_id http%3A%2F%2Fwww.ingenta.com%2Fisis%2Fsearching%2FExpand%2Fingenta%3Fpub%3Dinfobike%3A%2F%2Fmcb%2F238%2F2004%2F00000022%2F00000001%2Fart00008
url_ctx_fmt info:ofi/fmt:kev:mtx:ctx
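
To make the comparison concrete for myself, here is a minimal Python sketch that builds both flavors of OpenURL from the same citation. The function names (openurl_01, openurl_10) are my own invention for illustration, and the 1.0 version only covers the journal-article case seen above; it is not a full implementation of either spec. Note that this sketch percent-encodes colons and slashes in values, which the tables above show unencoded.

```python
from urllib.parse import urlencode, quote

# Citation fields taken from the comparison table above (decoded).
citation = {
    "genre": "article",
    "title": "Library Hi Tech",
    "date": "2004",
    "atitle": "A preliminary crosswalk from METS to IMS content packaging",
    "aulast": "Yee",
    "auinit": "R",
}


def openurl_01(base_url, sid, fields):
    """Build an OpenURL 0.1 query: a sid plus plain, unprefixed keys."""
    params = [("sid", sid)] + list(fields.items())
    return base_url + "?" + urlencode(params, quote_via=quote)


def openurl_10(base_url, rfr_id, fields):
    """Build an OpenURL 1.0 KEV query for a journal article (sketch only)."""
    params = [
        ("url_ver", "Z39.88-2004"),
        ("url_ctx_fmt", "info:ofi/fmt:kev:mtx:ctx"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rfr_id", rfr_id),
    ]
    for key, value in fields.items():
        # The journal title is "title" in 0.1 but "rft.jtitle" in 1.0;
        # the other keys in the table just gain the "rft." prefix.
        new_key = "jtitle" if key == "title" else key
        params.append(("rft." + new_key, value))
    return base_url + "?" + urlencode(params, quote_via=quote)


print(openurl_01("http://ucelinks.cdlib.org:8888/sfx_local",
                 "openly:openurlref", citation))
print(openurl_10("http://sirsi-resolver.sirsi.net/",
                 "info:sid/openly.com:openurlref", citation))
```

The only real renaming among the shared keys is title to rft.jtitle; everything else is a prefix change plus the extra context keys.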

While we're at it, we should take a look at the type of URLs returned by GoogleScholar

Now that I have seen working OpenURLs up close, it's good to take a look at the full range of possibilities:

From [WWW]Ex Libris - OpenURL Syntax:

META-TAG  value                             description

genre     bundles:
          journal                           a journal, volume of a journal, issue of a journal
          book                              a book
          conference                        a publication bundling proceedings of a conference
          individual items:
          article                           a journal article
          preprint                          a preprint
          proceeding                        a conference proceeding
          bookitem                          an item that is part of a book
aulast                                      A string with the first author's last name
aufirst                                     A string with the first author's first name
auinit                                      A string with the first author's first and middle initials
auinit1                                     A string with the first author's first initial
auinitm                                     A string with the first author's middle initials

issn                                        An ISSN number
eissn                                       An electronic ISSN number
coden                                       A CODEN
isbn                                        An ISBN number
sici                                        A SICI of a journal article, volume or issue. Compliant with ANSI/NISO Z39.56-1996 Version 2 (see http://sunsite.berkeley.edu/SICI/)
bici                                        A BICI for a section of a book, to which an ISBN has been assigned. Compliant with http://www.niso.org/bici.html
title                                       The title of a bundle (journal, book, conference)
stitle                                      The abbreviated title of a bundle
atitle                                      The title of an individual item (article, preprint, conference proceeding, part of a book)

volume                                      The volume of a bundle
part                                        The part of a bundle
issue                                       The issue of a bundle
spage                                       The start page of an individual item in a bundle
epage                                       The end page of an individual item in a bundle
pages                                       Pages covered by an individual item in a bundle. The format of this field is 'spage-epage'
artnum                                      The number of an individual item, in cases where there are no pages available.
date      YYYY-MM-DD | YYYY-MM | YYYY       The publication date of the item or bundle encoded in the "Complete date" variant of ISO8601 (see http://www.w3.org/TR/NOTE-datetime). This format is YYYY-MM-DD where YYYY is the four-digit year, MM is the month of the year between 01 (January) and 12 (December), and DD is the day of the month between 01 and 28 or 29 or 30 or 31, depending on length of the month and whether it is a leap year.
ssn       winter | spring | summer | fall   The season of publication
quarter   1 | 2 | 3 | 4                     The quarter of publication

Learning about MARC 21, MARCXML, OpenURL 1.0, 0.1 etc

The next big goal of my crosswalking MARC XML to OpenURL work is to produce a table that maps out how we are going to construct OpenURLs from the MetaLib MARC XML.

MARC examples from MetaLib

Let me include three examples and then pull out the salient details for translating them into OpenURLs:

MODS versions of the MARCXML examples from MetaLib

Using the [WWW]MODS v3 to MARC21Slim transformation and applying it to the three examples:

MARCXML | MODS
[WWW]"ATE battles soaring IC device complexity" | [WWW]"ATE battles soaring IC device complexity"
[WWW]Web of Science reference | [WWW]Web of Science reference
[WWW]Milosz's ABC | [WWW]Milosz's ABC

Note from [WWW]"ATE battles soaring IC device complexity" the use of relatedItem->part->text in MODS to store the information in 773$g from MARC.

Paths for generalization of this work

Down the road, we want to figure out how to generalize our work in at least three directions:

Before I dive into documenting the MARC examples coming from MetaLib, I want to make sure I have a reasonably solid understanding of the MarcSpec, MarcXmlSpec, OpenURL 0.1 and OpenURL 1.0. I know that I could whip something together to translate among the various metadata formats, but since I'm interested in interoperability among bibliographic metadata, I'm taking time now to carefully explicate the various models.

Is there an XML representation of OpenURLs (for representing the data elements, or for embedding OpenURLs as bibliographic metadata)? I've seen DanChudnov pull one together. Is there an official representation?

What's the significance of the word "slim" in the various MARCXML schemas I see? Is the following an answer? [WWW]Cover Pages: Library of Congress Publishes MARC 21 XML Schema and Transformation Tools.: W

To make sense of the various datafields and subfields in MARC 21, I'm consulting [WWW]MARC 21 Concise Format for Bibliographic Data. For example, 856 $u is the URL/URN: [WWW]Holdings, Location, Alternate Graphics, etc. Fields (841- 88X):

Subfield $u may be repeated only if both a URN and a URL or more than one URN are recorded.

[WWW]Understanding MARC Bibliographic: Parts 7 to 10 includes [WWW]A Summary of Commonly Used MARC 21 Fields and [WWW]A List of Other Fields Often Seen in MARC Records.

use of 773 for journals

[WWW]Linking Entry Fields (76X-78X):

OpenURL 0.1 vs 1.0

[WWW]/usr/lib/info || Comments || OpenURL Standard goes to ballot is KarenCoyle's nice distillation of pros and cons:

Conceptual foundation for OpenURL 1.0: [WWW]Generalizing the OpenURL Framework beyond References to Scholarly Works: The Bison-Futé Model

Correspondence with Walt Crawford on MARCXML and OpenURL

I posed the question of whether there are [WWW]any MARCXML to OpenURL crosswalk? on the [WWW]OpenURL Mailing list, to which WaltCrawford [WWW]replied:

I followed up with Walt in private email with the following question:

Walt then answered in his email (which he kindly gave me permission to quote):

That's a tough question, and I'm not sure I would have a reasonable answer.

I think this is inherently a one-way mapping: There would be very little
point in translating OpenURL metadata back to MARC21, as far as I can see.
(I may be missing something: That frequently happens.)

As to general agreement on the mappings, well, we've put ours out there in
public, and it's based on my 25 years of experience with MARC (before doing
the mapping, that is). I've pointed at least one would-be OpenURL source to
that page. If I was new to the game, and I had a mapping available, I sure
wouldn't reinvent that particular wheel. I can't imagine that RLG would
object to having that portion of the page copied to a more general site
(although I'd have to ask!) and be offered as a general model, since I
don't believe I've seen any other general models for MARC=>OpenURL mapping.

My mapping is as complete as possible--and that may be because we didn't
partner with anybody in doing our OpenURL implementation (except, that is,
colleagues at California Digital Library and the University of Chicago to
look over my spec and see whether it made sense). Thus, we weren't aware of
any shortcuts we could take, so we didn't take any.

Most OpenURL sources except online catalogs--particularly article-level
databases--almost certainly don't store data in MARC21. For them, the
mapping is useless. I don't know whether OCLC has published their mapping
for FirstSearch (I couldn't readily find it). Online catalog vendors tend
to regard everything as proprietary information, although this might be an
exception.

I'm not sure that I see "pragmatic mapping" as a problem. For that matter,
the Eureka mapping was done with deep familiarity of the MARC formats, and
was based on where the data should reside, rather than an analysis of
actual databases: It's as much a theoretical mapping as a pragmatic
mapping.

The big and, I believe, somewhat insoluble problem in MARC=>OpenURL mapping
is the 773$g. Because the syntax for that field, which combines year,
volume, issue, and pagination, is either undefined or ill-defined (MARC21
rarely specifies internal syntax for a textual subfield!), all mappings are
inherently pragmatic. We've refined our algorithms somewhat as we discover
nuances of unusual databases, but I've accepted that we will never get the
mapping right in 100% of article-level records. What we can do and have
done is encourage database providers to follow data entry practices that
make extraction feasible. (Here, again, most database producers may not
have this problem: They probably store the data in separate elements, where
we store in MARC21.)

I then answered:

Thanks, Walt, for your very helpful answer.  I'm certainly coming at
this problem without  much experience with MARC and how it is actually
used -- so your long experience is what is needed, I think, to get at
the relevant issues.  All I've been working with so far is the MARCXML
coming from Ex-Libris' MetaLib product.  Hence I didn't know whether it
is common to use MARC at all to hold article level metadata.  Thanks,
also, for confirming what I had perceived to be the problematic nature
of 773$g.  (That makes me wonder:  is this problem shared by MODS too?)

In terms of going from OpenURL to MARC21 -- it might not be that
useful.  I am interested in the issue of how to pull together
bibliographic metadata from disparate sources (including places like
Amazon.com and Google Scholar). I have been wondering about the use of
MARCXML or MODS  as a hub format and therefore the pragmatics of
translating OpenURLs into MARCXML or MODS.

Can you find out whether it would be ok me to quote your mapping on my
wiki (with appropriate attribution, of course)?

It's interesting to me that "Online catalog vendors tend to regard
everything as proprietary information, although this might be an
exception."

Also, your email was so helpful to me -- and I think it might be helpful
to others -- can I quote your email on my blog?

Walt then responded:

You can certainly quote my email on your blog (although it's just my
*sense* that things like mappings tend to be regarded as proprietary, based
possibly on my attempts to find out what "relevance" means to various
vendors).

You can certainly link to the Eureka page with the mapping, without
permission: It's part of the open web. I'll send a quick note to the
parties that would be involved to see about copying the section--I can't
imagine it would be a difficulty, but it may take a while to get an answer.

A caveat here: I don't know much about MARCXML, I know even less about MODS
(but others here know more), and thus I may be in over my head at times.
But we're all learning in different ways, I guess.

Anyway, I'll ask about quoting the mapping on your wiki and get back to you
as soon as possible. If you haven't heard from me in a week or two, bug
me...

A start at mapping MARC to OpenURL

The following chart does not pull together everything we know about crosswalking MARC and OpenURL -- but is my own start at looking at the relationships. (I've drawn also from DavidWalker's XSLT.)

In the MetaLib MARC XML:

I've not yet attempted to reconcile this mapping with WaltCrawford's or, for that matter, with the [WWW]Rules for constructing a MARC record from an OpenURL written by Mary Heath (?) of the CDL. I will also be working with TomSchirmer as he implements our mappings to document that work.

Examples of stuff coming out of 773$g

TomSchirmer has pulled together the following list of sample 773$g entries that come from MetaLib in an effort to figure out a good parsing strategy:

Feb 2005, v143 i2, p84(2)
Feb 10, 2005, pNA
Jan 27, 2005, pNA
Jan 27, 2005, pNA
Jan 24, 2005, pNA
Jan 19, 2005, pNA
Jan 14, 2005, pNA
Jan 13, 2005, pNA
Jan 6, 2005, pNA
Feb 2005, v143 i2, p84(2)
Jan 11, 2005, pA18(L), col 06 (6 col in)
Jan 7, 2005, pA22(L), col 04 (4 col in)
Jan 4, 2005, pA19, col 02 (19 col in)
Dec 27, 2004, v76 i53, p19(1)
Winter 2004, v7 i4, p50(1)
Jan 20, 2005, pB2, col 03 (15 col in)
Jan 15, 2005, pB14, col 01 (11 col in)
Jan 15, 2005, pB14, col 01 (11 col in)
Jan 9, 2005, pBU26, col 01 (28 col in)
Jan 2, 2005, pAR30(L), col 03 (24 col in)
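
As a first stab at a parsing strategy for these samples, here is a rough Python sketch. The function name parse_773g and the regular expressions are my own guesses fitted to the entries listed above, not anything defined by MARC21 or MetaLib, and they will certainly miss variants from other databases. I deliberately ignore the parenthesized numbers and the column info, since I'm not sure what they mean, and I treat pNA as "no page available".

```python
import re

MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}
SEASONS = {"Winter": "winter", "Spring": "spring",
           "Summer": "summer", "Fall": "fall"}

# Patterns are assumptions based only on the sample entries above.
DATE_RE = re.compile(r"^(?P<word>[A-Za-z]+)(?: (?P<day>\d{1,2}),)? (?P<year>\d{4})")
VOLUME_RE = re.compile(r"\bv(\d+)")
ISSUE_RE = re.compile(r"\bi(\d+)")
PAGE_RE = re.compile(r"\bp([A-Z]*\d+)")  # "pNA" has no digits, so it is skipped


def parse_773g(text):
    """Pull OpenURL 0.1 date/ssn/volume/issue/spage out of a 773$g string."""
    fields = {}
    m = DATE_RE.match(text)
    if m:
        word, day, year = m.group("word", "day", "year")
        if word in SEASONS:
            fields["date"] = year
            fields["ssn"] = SEASONS[word]
        elif word[:3] in MONTHS:
            date = "%s-%02d" % (year, MONTHS[word[:3]])
            if day:
                date += "-%02d" % int(day)
            fields["date"] = date
        else:
            fields["date"] = year
    for key, regex in (("volume", VOLUME_RE), ("issue", ISSUE_RE),
                       ("spage", PAGE_RE)):
        m = regex.search(text)
        if m:
            fields[key] = m.group(1)
    return fields


for sample in ("Feb 2005, v143 i2, p84(2)",
               "Feb 10, 2005, pNA",
               "Jan 11, 2005, pA18(L), col 06 (6 col in)",
               "Winter 2004, v7 i4, p50(1)"):
    print(sample, "->", parse_773g(sample))
```

Anything the patterns don't recognize is simply left out of the result, which seems safer for a resolver than guessing.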

MARCXML, MODS, and representation of serials metadata

With MARCXML, we have the 773$g problem: metadata such as volume, issue number, and page range are glommed together in one subfield. How is the possible conflation of volume, number, and year handled in MODS?

I'm working to understand how well journal citations are handled in MODS, as opposed to MARC XML. In this case, MODS seems to be richer than MARC21/MARC XML.

It might be useful to just convert MARCXML to MODS first, making it easier to understand.

Presumably using human-friendly tags in MODS will make it easier for me to handle the bibliographic metadata. Moreover, there is presumably a lot of communal wisdom in the human-readable documentation: [WWW]MARC Mapping to MODS (Library of Congress) and [WWW]MODS to MARC Mapping (Library of Congress).

I was under the impression that the MARCXML model is richer than (a proper superset of) the MODS semantic model. That is, anything that is expressible in MODS is expressible in MARCXML, but not the other way around. Perhaps I'm confusing richness with granularity, since conceivably one can stick all sorts of stuff into fields. The following comment from [WWW]MODS to MARC Mapping (Library of Congress) is apropos:

To help me understand concretely the relationship between MODS and MARCXML, specifically with regard to the handling of serials metadata (and the 773 tags in MARC), I used the XSLT [WWW]MODS v3 to MARC21Slim transformation to transform the [WWW]sample MODS encoding for an "article in a serial" (Neil Brenner, "[WWW]The Urban Question: Reflections on Henri Lefebvre, Urban Theory and the Politics of scale", International Journal of Urban and Regional Research, June 2000, vol. 24, no. 2, pp. 361-378 (27 pages in length)) to get the [WWW]MARCXML version. Notice the loss of the end page info in the MARCXML version. Is there no explicit slot in MARCXML for the end page, so that the end page is dropped in the mapping? Or is there an oversight in the translation? (I would think that the writer of the XSLT could have chosen to mix the end page into the 773$g slot.)

The following extract from the [WWW]MODS User Guidelines seems relevant here:

Our best effort at a mapping

The following MarcXml to OpenUrl mapping represents my synthesis of the work of TomSchirmer, DavidWalker, WaltCrawford (as written in [WWW]Setting up OpenURL for Eureka), [WWW]Mary Heath (I believe) of the CaliforniaDigitalLibrary, and [WWW]Ex Libris - OpenURL Syntax.

To borrow from WaltCrawford, we form an OpenURL of the following form:

baseurl: hardwired (for the user's institution) or dynamically associated (see OpenUrl/RepurposabilityDemo)

sid: needs to be generated to give the resolver the appropriate sense of the generator of the OpenURL. (I'm not totally clear on how to use it yet)

genre: The possible values are one of: journal, book, article, bookitem, conference, proceeding, preprint

The RLG approach: '"article," "journal," "book," or "bookitem" if the record indicates one of those genres. Omitted otherwise. Based on MARC21 field 773 data and leader data.'
The CDL approach: " looks for the presence of a 773$t, not the genre, to determine how to handle an item. If a 773$t exists in the record, PIR treats the item as a piece of a larger work."
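
To see how the two approaches might combine, here is a small Python sketch that guesses a genre from a MARCXML record. The function name guess_genre and the specific decision rules are my own assumptions, not RLG's or CDL's actual logic: I check for 773$t first (the CDL idea) and then fall back on the bibliographic level in leader position 07 (a = monographic component part, b = serial component part, m = monograph, s = serial). The sample records are invented for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def guess_genre(record):
    """Guess an OpenURL 0.1 genre for one MARCXML <record> (heuristic sketch)."""
    host_title = record.find(
        "marc:datafield[@tag='773']/marc:subfield[@code='t']", NS)
    leader = record.findtext("marc:leader", default="", namespaces=NS)
    level = leader[7] if len(leader) > 7 else ""
    if host_title is not None:
        # A piece of a larger work: assume a book part only when the
        # leader says "monographic component part", else an article.
        return "bookitem" if level == "a" else "article"
    if level == "s":
        return "journal"
    if level == "m":
        return "book"
    return None  # omit genre, as RLG does when nothing fits


# Invented sample records, not real MetaLib output.
article = ET.fromstring(
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    '<leader>00000nab a2200000 a 4500</leader>'
    '<datafield tag="773" ind1="0" ind2=" ">'
    '<subfield code="t">Example Journal</subfield>'
    '</datafield></record>')
book = ET.fromstring(
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    '<leader>00000nam a2200000 a 4500</leader></record>')

print(guess_genre(article))
print(guess_genre(book))
```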

author handling

aulast The basic place to look is 100$a. Crawford mentions [WWW]700$a: "first author's last name, typically 100$a or 700$a up to but not including the first comma."

I guess that it's not too hard to look for the last name (look for the name before the comma) -- but it can be a real challenge to get at the first and middle names. There are many names that don't fit the last name/first name + middle initial model. For example, here are some examples of names from Melvyl:

For the given names, the possible fields are:

The first thing to note, as Crawford does, is the range of permitted combinations for the given names: "An OpenURL query may contain aufirst and auinitm; aufirst by itself; auinit, auinit1, or none of these. It will not contain any other combination."

One of the challenging issues with names is that the various name fields have to be parsed out. Crawford wrote that they are "parsed from 100$a or 700$a". The general proper parsing of names in all their varieties, both within a single linguistic context and across many linguistic/social contexts, is a hard problem.
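
Here is a naive Python sketch of the easy case only: a "Last, First M." style heading. The function name parse_author is mine, and the rules are an assumption that simply follows Crawford's "up to but not including the first comma" for aulast and his permitted combinations for the given names. It will get suffixes such as "Jr.", compound given names, and names without the last-name/first-name shape wrong, which is exactly the hard part described above.

```python
def parse_author(name):
    """Split a 'Last, First M.' heading into OpenURL 0.1 author keys."""
    # aulast: everything up to but not including the first comma.
    last, _, rest = name.partition(",")
    fields = {"aulast": last.strip()}

    # Split what follows the comma into words, dropping stray punctuation.
    parts = [p.strip(".") for p in rest.replace(",", " ").split()]
    parts = [p for p in parts if p]
    if not parts:
        return fields  # "none of these" is a permitted combination

    first, middle = parts[0], parts[1:]
    if len(first) > 1:
        # A spelled-out first name: aufirst, optionally with auinitm.
        fields["aufirst"] = first
        if middle:
            fields["auinitm"] = "".join(p[0] for p in middle)
    else:
        # Only initials are available: use auinit by itself.
        fields["auinit"] = first + "".join(p[0] for p in middle)
    return fields


print(parse_author("Yee, Raymond"))
print(parse_author("Yee, R."))
print(parse_author("Doe, Jane Q."))   # invented name for illustration
print(parse_author("Voltaire"))
```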

issn Crawford wrote: "does include hyphen. Taken from 773$x if present, 022$a otherwise."

isbn Crawford wrote: "does not include hyphens. Taken from 773$z if present or otherwise from 020$a." (if genre is book)

title handling

There are two major cases. For an article, the article title is in 245$a, and the journal title should be in 773$t. For books, the title is in 245$a.

title. Crawford wrote "taken from 773$t if present; 245$a otherwise".

atitle Crawford wrote "taken from 245$a if 773 is present."

stitle ("The abbreviated title of a bundle") Crawford wrote "abbreviated title, taken from 773$p if present, 210$a otherwise."
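
The title rules above, together with Crawford's issn and isbn rules, can be sketched in Python against a MARCXML record. The helper names (subfield, first_of, title_fields) and the sample record are invented for illustration; only the field and subfield choices come from Crawford's description, and I have not checked this against real MetaLib output.

```python
import xml.etree.ElementTree as ET

NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented sample record, not real MetaLib output.
SAMPLE = """
<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example article</subfield>
  </datafield>
  <datafield tag="773" ind1="0" ind2=" ">
    <subfield code="t">Example Journal</subfield>
    <subfield code="x">1234-5678</subfield>
    <subfield code="g">Feb 2005, v143 i2, p84(2)</subfield>
  </datafield>
</record>
"""


def subfield(record, tag, code):
    """Return the text of the first tag$code subfield, or None."""
    path = "marc:datafield[@tag='%s']/marc:subfield[@code='%s']" % (tag, code)
    node = record.find(path, NS)
    return node.text.strip() if node is not None and node.text else None


def first_of(record, *choices):
    """Return the first non-empty value among (tag, code) choices."""
    for tag, code in choices:
        value = subfield(record, tag, code)
        if value:
            return value
    return None


def title_fields(record):
    """Apply the title/atitle/stitle/issn/isbn rules quoted above."""
    has_773 = record.find("marc:datafield[@tag='773']", NS) is not None
    fields = {
        "title": first_of(record, ("773", "t"), ("245", "a")),
        "atitle": subfield(record, "245", "a") if has_773 else None,
        "stitle": first_of(record, ("773", "p"), ("210", "a")),
        "issn": first_of(record, ("773", "x"), ("022", "a")),
    }
    isbn = first_of(record, ("773", "z"), ("020", "a"))
    if isbn:
        fields["isbn"] = isbn.replace("-", "")  # isbn carries no hyphens
    # Leave out keys for which the record has no data.
    return {key: value for key, value in fields.items() if value}


record = ET.fromstring(SAMPLE.strip())
print(title_fields(record))
```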

volume, number, page handling

date

sici

other fields: ignore?

other bibliographic info that might be useful for citation but not necessarily for OpenURL (listed in CDL)

Conclusions with respect to MARC and OpenURL

I am working right now to finish up my documentation of my work on crosswalking MARC XML to OpenURL. Not that I've figured out everything there is to know or to discover with respect to translating MARC XML to OpenURLs. Rather, I've learned enough and want to move on to other problems.

Let me just start with some conclusions and then work backwards to justify them:

I still want to provide the details behind these conclusions.

Current Next Steps