Metadata Stores in Australia: NLA/Trove

Recent experiences with the NLA/Trove test environment have resulted in a working list of unexpected behaviours. N.B. This is a working list - please contact Simon Pockley about any amendments or additions you would like to make.

To report an issue to the NLA, submit the details through the contact form on the Trove website.

This records the issue in the NLA's job tracking system. For urgent issues, however, you can contact ANDS (Julie McCulloch, Simon Pockley or Amir Aryani) or the NLA (Virginia and Catriona) directly. Virginia and Catriona have kindly agreed to help ANDS partners get through the Trove integration as smoothly as possible.

Testing environment

TIM Beta is the system that NLA use to check the accuracy of records before they are loaded into production. It isn’t a development system. There are limitations to the testing environment.

a. The test NLA records cannot be deleted, so tests cannot be rerun and new people need to be created and used each time.
b. The NLA ID in TIM Beta resolves to Trove Production rather than Trove Test. This can prompt concern or confusion that valid NLA IDs might be incorrectly ascribed to real researchers and then published on the web.

Status - no action. The testing environment in TIM Beta does not impact on the production environment.

Access to TIM

a. Access to TIM is on an individual basis without a generic link to the institution.

Status - work-around. Establish a convention for managed usernames and passwords, for example 'ANDS.tim.matching01', 'ANDS.tim.matching02', etc. for multiple users. A single sign-on could be established, remembering that multiple users cannot log in to TIM with the same username and password simultaneously. Generic usernames and email addresses will allow usernames to be reassigned and passwords reset.

b. Having to create individual IDs for Trove and TIM (and making these the same for the test instances of each) is cumbersome compared to AAF access to Research Data Australia.

Status – No action. Signup to Trove is open to any user of Trove, not just those with Australian Access Federation (AAF) credentials. One username and password is used for both Trove and TIM (production and test).

Transparency of record status

There’s no indication whether records have been auto-matched or gone into TIM for manual matching.

Status - work-around. Checking which records are in TIM waiting for matching is a manual process.

Identifiers lose their type

Identifiers lose the type indicated in the contributor party record, e.g. 'scopus'. Any identifiers, and the key, supplied in a contributor party record to Trove become type="contributor’s ISIL code" in the Trove party record. This applies even to NLA-supplied party identifiers.

Examples of the two records in RDA Demo:

Dr Tung-Kai (Paul) Shyy - UQ record

Dr Tung-Kai (Paul) Shyy - Trove record

The types "URI" and "AU-QU-local" are changed to "AU-QU".

Status – low impact - no action

Single element names not harvested

The RIF-CS schema guidelines state that <namePartType> is optional. Names are presented differently because GeoNetwork uses ISO 19115 and/or ANZLIC as its native schema, which allows free-text entry of a name in a single element. RIF-CS party records of type person that originate in either ANZLIC or ISO 19115 will not be harvested by Trove because they lack <namePartType>. Note that this issue affects Party person records only; group records without <namePartType> will be ingested by Trove.
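
A contributor could screen a feed for affected records before a harvest. The following is a minimal sketch only. It assumes that the name-part type is carried as a `type` attribute on `namePart` elements inside `party type="person"`, and the function name, sample keys and sample names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any XML namespace from an element tag."""
    return tag.rsplit("}", 1)[-1]


def person_records_missing_name_part_type(rifcs_xml):
    """Return the keys of person party records with an untyped namePart."""
    root = ET.fromstring(rifcs_xml)
    flagged = []
    for obj in root.iter():
        if local(obj.tag) != "registryObject":
            continue
        key = next((c.text for c in obj if local(c.tag) == "key"), None)
        for party in obj:
            # Only person records are affected; groups are ingested regardless.
            if local(party.tag) != "party" or party.get("type") != "person":
                continue
            parts = [p for p in party.iter() if local(p.tag) == "namePart"]
            if any(not p.get("type") for p in parts):
                flagged.append(key)
    return flagged


# Hypothetical feed: one free-text person, one typed person, one group.
SAMPLE = """<registryObjects>
  <registryObject><key>example/person/1</key>
    <party type="person"><name type="primary">
      <namePart>Jane Smith</namePart>
    </name></party>
  </registryObject>
  <registryObject><key>example/person/2</key>
    <party type="person"><name type="primary">
      <namePart type="family">Smith</namePart>
      <namePart type="given">Jane</namePart>
    </name></party>
  </registryObject>
  <registryObject><key>example/group/1</key>
    <party type="group"><name type="primary">
      <namePart>Example Lab</namePart>
    </name></party>
  </registryObject>
</registryObjects>"""

print(person_records_missing_name_part_type(SAMPLE))
```

Only the first record is flagged here: the second has typed name parts and the third is a group.
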
Status – low impact - no action

6a. NLA converts keys to RDA URLs

Not all institutions maintain, or even create, public profile pages. These are used as source URLs in a Trove record.

Status - work-around. Where there are no source URLs for staff profiles, use an optional function in Trove. If the option is set to ‘on’, the entityID (source URL) for a university's records in Trove is transformed to an ANDS URL: Trove uses a predefined URL structure to convert keys to a link to RDA rather than to the researcher’s web page on the university web site. If it is set to ‘off’, Trove uses the source URL provided.

You can contact the NLA through the Trove 'contact us' form (above) and ask for this option to be activated. If you use this option (resolving the entityID to an ANDS URL), you need to provide the party records to RDA with the same RIF-CS keys as the EAC-CPF entityIDs.

A URL that resolves to RDA would need to be in the <entityId> field in EAC-CPF records in this format:

http://researchdata.ands.org.au/view/?key=<localRegistrykey>

where localRegistrykey is the key for the record in RDA, for example:

http://researchdata.ands.org.au/view/?key=deakin.edu.au/dro/author/2739
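
For institutions generating these URLs in bulk, the format above can be sketched as a small helper. This is an illustration only: the function name is hypothetical, and percent-encoding characters other than "/" is an assumption (the example above shows slashes left as they are).

```python
from urllib.parse import quote

RDA_VIEW_BASE = "http://researchdata.ands.org.au/view/?key="


def rda_url_for_key(local_registry_key):
    """Build the RDA view URL for a record key.

    Slashes are kept as-is, matching the example on this page; encoding
    any other unsafe characters (such as spaces) is an assumption.
    """
    return RDA_VIEW_BASE + quote(local_registry_key, safe="/")


print(rda_url_for_key("deakin.edu.au/dro/author/2739"))
```
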

6b. URL construction not supported by EAC-CPF

University of Adelaide (which uses EAC-CPF) has reported that their source URL resolves to Mint (requiring a log-in), as they do not have a public-facing researcher profile system. Institutions that send their party records in RIF-CS may have the URL to RDA constructed by an XSLT stylesheet developed by the NLA. University of Adelaide uses EAC-CPF; consequently, a URL is not created.
Status – work-around options
1. Do nothing. The NLA have indicated that they accept that the source URL will resolve to the Mint login screen. Not desirable.
2. Use RIF-CS instead of EAC-CPF for records contributed to Trove where it is not desirable to use a resolvable URL in entityID. The NLA will accept records in either format, but is loss of data in the crosswalk from RIF-CS to EAC-CPF a concern? Not desirable.
3. Retain the current recommended practice, which is to include the source URL in entityID. See Workflow Q&A 12a and 12b.
4. Propose to the NLA that they construct the source URL for EAC-CPF, as they already do for RIF-CS, rather than individual institutions making changes at a local level to accommodate their application of the identifier as a URL. This is something we (ANDS) will do ASAP.

Vanishing records

Reported by UQ: a feed of party records was harvested, but some of the records appeared in neither Trove nor TIM; they vanished without a trace. The harvest operation did not fail, since other party records from the same harvest of the feed appeared in TIM or were automatically matched and appeared in Trove.

Status - behaviour yet to be replicated

Broken links when Researchers leave an institution (tombstone records)

There is a need for policy development regarding researchers who move or even die. The NLA has no control over contributors’ source links. Some universities will use the link to the researcher's local profile as the source URL. However, these links may not be maintained after the researcher leaves the university, so in the long run there will be a lot of broken links in Trove.

NLA comment: There is no need for a University to delete the record for a researcher when they move away from that institution. The NLA party identity is a container record and so can contain records from different universities for the same researcher. This is where some descriptive information would be valuable in the record e.g. “Left University of ZX in 2011”.

Status – low impact - no action

Inconsistent matching algorithms

In testing, Trove's automatic matching algorithms are behaving unexpectedly:

Issue first raised by Hoylen Sue at UQ - as follows

Situation #1: party records are harvested for the first time. That is, Trove does not contain any party records with these RIF-CS keys.

Situation #2: these unchanged party records are re-harvested. That is, Trove already has existing party records with these same RIF-CS keys. In this situation UQ has found the two party records for Watson and Drebber pass "Matching Rules Part A" and replace the existing party records -- which is the correct behaviour. But the party record for Holmes does not match and goes into TIM Beta -- which is not correct. It should have been automatically matched just like the other two. Neither we nor the NLA have figured out why Holmes is treated differently.

NLA comment:

Paul has checked this problem as much as he can and when he sends the record to TIM manually it matches. Essentially the way he's doing it is exactly the way the harvester would send the records to TIM so we have no idea why the records aren't matching when we harvest them. Paul says we need more data to work with and he's reluctant to use your data as then you won't be able to use it. Perhaps if we had lots more test records to work with we might be able to see a pattern. Sorry I don't have a solution or a reason for the records not matching.

Status - behaviour yet to be replicated

Complex matching rules

There is an assumption that if local party records contain the NLA identifier as an identifier, then matching will occur automatically, but this is not necessarily the case. The matching rules are complex and deliberately conservative. Records which fail Part A of the matching rules then go to Part B where the NLA identifier is just one of the match points.

Rules for Matching party Records in Trove

From the matching rules:

The purpose of Part A: to decide if two records are actually the same record
An incoming record is checked to determine if it has the same contributor ISIL and record ID as a record in Trove.

If it does, a sanity check is applied. If the incoming record passes the sanity check, it is automatically matched and overlays (replaces) the matching record in Trove. If the incoming record fails the matching rules in Part A then matching rules in Part B are applied.

The purpose of Part B: to decide if a record should be part of an existing identity.

The rules specified in Part B are applied to an incoming record that has failed the matching rules specified in Part A.

An incoming record is checked to determine if it has an NLA persistent identifier. If it does, a sanity check is applied and must be passed for a match to occur.

Incoming records that fail the matching rules specified in Part A and B are passed by the identity service to the unmatched record queue for human review for matching or new record creation in TIM.
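
The flow quoted above can be sketched roughly as follows. This is a simplified illustration, not the NLA's implementation: the record model, the outcome labels and the identifiers in the usage lines are hypothetical, the sanity check is passed in as a function because its full rules are not given here, and Part B is reduced to the NLA identifier alone even though it is only one of the match points.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

OVERLAY = "overlay existing record"        # Part A match
JOIN_IDENTITY = "add to existing identity"  # Part B match
TIM_QUEUE = "unmatched queue in TIM"        # human review


@dataclass
class PartyRecord:
    isil: str                           # contributor ISIL code
    record_id: str                      # contributor's record ID
    nla_party_id: Optional[str] = None  # NLA persistent identifier, if any


def route(incoming: PartyRecord,
          trove_records: List[PartyRecord],
          sanity_check: Callable[[PartyRecord, PartyRecord], bool]) -> str:
    """Decide where an incoming record goes under Parts A and B."""
    # Part A: same contributor ISIL and record ID, then the sanity check.
    for existing in trove_records:
        same_record = (existing.isil, existing.record_id) == (
            incoming.isil, incoming.record_id)
        if same_record and sanity_check(incoming, existing):
            return OVERLAY
    # Part B: only reached when Part A fails. The NLA identifier is the
    # one match point modelled here.
    if incoming.nla_party_id is not None:
        for existing in trove_records:
            same_identity = existing.nla_party_id == incoming.nla_party_id
            if same_identity and sanity_check(incoming, existing):
                return JOIN_IDENTITY
    # Failed both parts: goes to TIM for manual matching.
    return TIM_QUEUE


# Hypothetical usage with a sanity check that always fails.
existing = PartyRecord("AU-ANL:PEAU", "r1", "nla.party-1")
incoming = PartyRecord("AU-QU", "x9", "nla.party-1")
print(route(incoming, [existing], lambda a, b: False))
```

With a failing sanity check the record lands in the TIM queue even though it carries the NLA identifier, which is the behaviour this section describes.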

Example:

A university creates a record for “Davies, Peter” containing the NLA party identifier for the existing Trove record for “Davies, Peter Eric” (the Trove record was contributed by Libraries Australia, ISIL code AU-ANL:PEAU).

Result: Part A matching rules apply, but neither the ISIL code nor the recordID matches, so Part B rules are applied.

In the Part B matching rules the system looks in the incoming record for the NLA’s ISIL (AU-ANL:PEAU) and the NLA party identifier, recorded in control/sources/source/objectXMLWrap.

If the university added the NLA party identifier for “Davies, Peter Eric” and the NLA agency code to the incoming record, the system would run the sanity check. It would match on Davies, on Peter, but it wouldn’t find Eric in the incoming record and so the name would fail the matching rules. Because there is always the risk that a mistake could be made when the NLA party identifier is manually entered into another organisation’s record, the system has to be certain it’s the same person so it will always check all name parts and all name parts have to match.
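
The name comparison in this example can be illustrated with a short sketch. It assumes that "all name parts have to match" means the two names must contain exactly the same parts, ignoring case, order and commas; the actual sanity check may differ, and the function names are hypothetical.

```python
import re


def name_parts(name):
    """Split a name such as 'Davies, Peter Eric' into lower-cased parts."""
    return {part.lower() for part in re.split(r"[,\s]+", name.strip()) if part}


def names_pass_sanity_check(incoming_name, existing_name):
    """Assumed rule: every name part must be present on both sides."""
    return name_parts(incoming_name) == name_parts(existing_name)


# 'Eric' is missing from the incoming record, so the check fails.
print(names_pass_sanity_check("Davies, Peter", "Davies, Peter Eric"))
print(names_pass_sanity_check("Davies, Peter Eric", "Davies, Peter Eric"))
```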

Result: In this example, if the names didn’t automatically match then the record failed the sanity checks at some point.
Status – investigate. The example above explains why records containing an NLA party identifier will not necessarily match even though they have the same NLA identifier.

RIF-CS 1.4 mapping to EAC-CPF (Encoded archival context for corporate bodies, persons and families)

There is currently no separate overarching application profile document for NLA use of EAC-CPF. The NLA maintain mappings from MARC and RIF, but these are both in flux at present as the NLA make changes to support RDA and RIF-CS 1.4. The mapping for RIF-CS 1.4 is finished, but NLA’s IT has yet to finish the style sheet that converts the records.

NLA is using EAC-CPF 2010. It upgraded from EAC in 2011:

See info about the upgrade here: EAC to RIF-CS mapping

Mapping is here: Excel table

Status – in progress

Name sequence or name order

The order in which names are entered is important because the Trove system has its own way of processing first and last names. It is best to enter names in natural order, i.e. first name: Jane, last name: Smith.

Status – work-around

Duplicate records appear in matching queues

Records that fail Part A and Part B of the Trove matching rules will pass to TIM for manual matching. When you see duplicate records appearing in TIM, this is because:

1. in an initial harvest, the record failed the matching rules and moved to TIM
2. a subsequent harvest occurred before the record from the initial harvest was matched and the system found a record with the same contributor ISIL and record ID (Part A of the matching rules). This record (a duplicate) also moves to TIM.
3. unmatched records in the queue can’t match themselves to incoming records, so duplicate records appear.

This system behaviour points to a very strong reason for synchronising harvests with the time set aside for manual matching. Ideally, after each harvest the records should all be matched before the next harvest occurs, otherwise duplicate records begin appearing.
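
A toy simulation makes the effect of harvest timing concrete. It is only a sketch under stated assumptions: the record key is hypothetical, the record is assumed to fail automatic matching whenever Trove holds no matched copy of it, and manual matching is modelled as clearing the whole queue at once.

```python
def harvest(feed, matched, queue):
    """Harvest a feed: records with no matched copy in Trove go to TIM."""
    for record_key in feed:
        if record_key in matched:
            continue  # Part A would overlay the existing matched record
        # Records already waiting in the queue are not match targets,
        # so the same record is queued again.
        queue.append(record_key)


def manual_match(queue, matched):
    """Model a manual matching session that clears the TIM queue."""
    matched.update(queue)
    queue.clear()


feed = ["holmes"]

# Two harvests with no manual matching in between: a duplicate appears.
matched, queue = set(), []
harvest(feed, matched, queue)
harvest(feed, matched, queue)
print(queue)

# Manual matching between harvests: nothing is re-queued.
matched, queue = set(), []
harvest(feed, matched, queue)
manual_match(queue, matched)
harvest(feed, matched, queue)
print(queue)
```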

Please note that short harvest intervals are therefore not recommended. The NLA will set up harvest schedules to suit each institution's timetable. The smallest harvest interval is every fifteen minutes; the largest is once per year.

Status – no action

Source URL from Trove to RDA: Page Not Found

Reported by QUT and Deakin: This could be an example of where re-harvests have to occur before all information is received in order to display and link to records correctly. The source URL for QUT records leads to the RDA record.

However, the problem remains in Trove where, under the Biographies heading, the message "View the full record at AU-QUT" is incorrect. When the source URL leads to RDA, as is the case with QUT records, it should read "View the full record at Queensland University of Technology"; that is, the text for QUT is currently "AU-QUT" and should be "Queensland University of Technology".

To see all QUT records: http://trove.nla.gov.au/people/result?q=QUT

QUT example: Rural Industries Research and Development Corporation (Australia)

URL in RDA: http://researchdata.ands.org.au/rural-industries-research-and-development-corporation-rirdc
Key in RDA: 10378.3/8085/1018.13432
NLA Party id: http://nla.gov.au/nla.party-577392

Source URL to AU-QUT (should be RDA)
http://trove.nla.gov.au/goto?i=people&w=577392&d=http%3A%2F%2Fservices.ands.org.au%2Fhome%2Forca%2Frda%2Fview.php%3Fkey%3D10378.3%2F8085%2F1018.13432

EAC-CPF
<entityId>http://services.ands.org.au/home/orca/rda/view.php?key=10378.3/8085/1018.13432</entityId>
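
The Trove link above carries its destination percent-encoded in the `d` query parameter, so it can be decoded and compared with the entityId when checking a record. A minimal sketch using the Python standard library follows; the function name is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def goto_destination(goto_url):
    """Return the decoded destination URL from a Trove 'goto' link."""
    query = parse_qs(urlsplit(goto_url).query)
    return query["d"][0]


goto_url = (
    "http://trove.nla.gov.au/goto?i=people&w=577392&d="
    "http%3A%2F%2Fservices.ands.org.au%2Fhome%2Forca%2Frda%2F"
    "view.php%3Fkey%3D10378.3%2F8085%2F1018.13432"
)
entity_id = (
    "http://services.ands.org.au/home/orca/rda/"
    "view.php?key=10378.3/8085/1018.13432"
)

# True when the link in Trove points at the URL supplied in entityId.
print(goto_destination(goto_url) == entity_id)
```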

QUT example: Professor Clive Bean

Source URL to AU-QUT (should be RDA)
http://trove.nla.gov.au/goto?i=people&w=477383&d=http%3A%2F%2Fservices.ands.org.au%2Fhome%2Forca%2Frda%2Fview.php%3Fkey%3D10378.3%2F8085%2F1018.13434

Status – in progress. Checked on 27 Feb: QUT is now displaying correctly.