Responses and Suggestions for Messy Metadata

Digital Library and Archives received some responses to one of our previous blog posts, "Automated Repository Deposit and Messy Metadata." These responses were sent via email because blog comments were not enabled at that time. Here are most of the comments. Please feel free to comment on the blog post if anything is missing.

Andi Ogier

Looking at the metadata record that’s missing an author, and after pulling up the PDF object, I’m wondering what kind of program Mendeley uses to pull citation information out of PDF files, and whether something similar could be used on the BioMed articles. As anyone who uses Mendeley knows, the generated citation doesn’t always work, but something like that might be able to pull out likely first/last names, which could then be checked against a list of VT faculty. Subjects are tricky. Wouldn’t it be nice if there were some kind of compendium of likely keywords that, if found in an article, could generate a very general subject heading or headings?
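
As a rough illustration of both ideas, here is a minimal Python sketch. It assumes the candidate names have already been pulled out of the PDF somehow, and that the faculty list and the keyword compendium exist as plain Python data. The function names, the 0.85 similarity cutoff, and the two-hit threshold are all made up for the example and are not anything Mendeley or the repository actually uses.

```python
import difflib
import re


def normalize_name(name):
    """Turn 'Last, First' or 'First Last' into a lowercase 'first last' string."""
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first} {last}"
    return " ".join(re.findall(r"\w+", name.lower()))


def match_faculty(candidates, faculty, cutoff=0.85):
    """Map each extracted name to its closest faculty-list entry, if close enough."""
    # Hypothetical helper: the cutoff is a guess and would need tuning on real data.
    index = {normalize_name(person): person for person in faculty}
    matches = {}
    for candidate in candidates:
        close = difflib.get_close_matches(
            normalize_name(candidate), list(index), n=1, cutoff=cutoff
        )
        if close:
            matches[candidate] = index[close[0]]
    return matches


def suggest_subjects(text, keyword_map, min_hits=2):
    """Return headings whose keywords occur at least min_hits times, most hits first."""
    lowered = text.lower()
    scored = []
    for heading, keywords in keyword_map.items():
        # Count whole-word occurrences only, so "gene" does not match "general".
        hits = sum(
            len(re.findall(r"\b" + re.escape(word.lower()) + r"\b", lowered))
            for word in keywords
        )
        if hits >= min_hits:
            scored.append((hits, heading))
    return [heading for hits, heading in sorted(scored, key=lambda pair: -pair[0])]
```

Calling `match_faculty(["Phillip Young"], ["Young, Philip"])` would pair the misspelled extracted name with the faculty entry, and anything left unmatched could go to a person for review. Fuzzy matching can also produce false positives, so the output is best treated as suggestions rather than final metadata.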

Having duplicate metadata is also a huge problem with e-books; we were talking about writing some kind of script that would identify duplicate records and flag them for deletion, or running the records through a script that would kick out the duplicates before they were ever even put in the system.  Something similar might work for you. 
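
A duplicate-flagging script along those lines might look something like the sketch below. It assumes each record is a dictionary with optional `identifier` and `title` fields, which is a guess at the record shape rather than the actual e-book or repository format. It treats two records as duplicates when their identifiers match, or, failing that, when their titles match after normalization.

```python
import re
import unicodedata


def normalize_title(text):
    """Lowercase, strip accents, and collapse punctuation and whitespace."""
    text = unicodedata.normalize("NFKD", text or "")
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def record_key(record):
    """Build a comparison key; return None when there is nothing to compare on."""
    identifier = (record.get("identifier") or "").strip().lower()
    if identifier:
        return ("id", identifier)
    title = normalize_title(record.get("title"))
    if title:
        return ("title", title)
    return None


def split_duplicates(records):
    """Return (unique, flagged); flagged pairs each duplicate with the record it repeats."""
    seen = {}
    unique, flagged = [], []
    for record in records:
        key = record_key(record)
        if key is not None and key in seen:
            flagged.append((record, seen[key]))
        else:
            if key is not None:
                seen[key] = record
            unique.append(record)
    return unique, flagged
```

Run before ingest, `unique` is what goes into the system and `flagged` is the list to review. Records with neither an identifier nor a title are passed through rather than lumped together as duplicates of each other.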

Anne Lawrence

Hi. The metadata, including the authors, is embedded in the PDF itself. We might be able to use XMP to extract the data from the PDF (and also to write metadata to PDFs, if desired). The XML file for this article is not tagged like the other, better files, so authors can’t be extracted from it.
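
To give a sense of what XMP extraction could involve, here is a minimal Python sketch that uses only the standard library. It assumes the PDF stores its XMP packet as uncompressed XML, which is common but not guaranteed, and that authors are recorded in the Dublin Core `dc:creator` field. The function name `xmp_creators` is made up for this example, and a dedicated XMP or PDF library would be the sturdier choice in practice.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# The RDF block sits inside the XMP packet; scan the raw bytes for it.
RDF_BLOCK = re.compile(rb"<rdf:RDF\b.*?</rdf:RDF>", re.DOTALL)


def xmp_creators(pdf_bytes):
    """Return the dc:creator names found in the first XMP RDF block, if any."""
    match = RDF_BLOCK.search(pdf_bytes)
    if match is None:
        return []  # no XMP found, or it is stored compressed
    try:
        root = ET.fromstring(match.group(0))
    except ET.ParseError:
        return []  # packet is present but not well-formed XML
    names = []
    for creator in root.iter(f"{{{DC_NS}}}creator"):
        for item in creator.iter(f"{{{RDF_NS}}}li"):
            if item.text and item.text.strip():
                names.append(item.text.strip())
    return names
```

With a file on disk, this would be called as `xmp_creators(open(path, "rb").read())`, where `path` is whatever PDF is being checked. An empty list means either that the PDF has no readable XMP or that the author field was never filled in, so those files would still need another approach.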

Philip Young

I’m interested in working on this with you, but I don’t have any immediate suggestions until I learn more about how this works. 

