Issue 58, 2023-12-04
Extra Editorial: On the Release of Patron Data in Issue 58 of Code4Lib Journal
We, the editors of the Code4Lib Journal, sincerely apologize for the recent incident in which Personally Identifiable Information (PII) was released through the publication of an article in issue 58.
Editorial
Issue 58 of the Code4Lib Journal is bursting at the seams with examples of how libraries are creating new technologies, leveraging existing technologies, and exploring the use of AI to benefit library work. We had an unprecedented number of submissions this quarter, and the resulting issue features 16 articles detailing some of the most distinctive and innovative technology projects libraries are working on today.
Enhancing Serials Holdings Data: A Pymarc-Powered Clean-Up Project
Following the recent transition from Inmagic to Ex Libris Alma, the Technical Services department at the University of Southern California (USC) in Los Angeles undertook a post-migration cleanup initiative. This article introduces methodologies aimed at improving irregular summary holdings data within serials records using Pymarc, regular expressions, and the Alma API in MarcEdit. The challenge identified was the confinement of serials’ holdings information exclusively to the 866 MARC tag for textual holdings.
To address this challenge, Pymarc and regular expressions were leveraged to parse and identify various patterns within the holdings data, offering a nuanced understanding of the intricacies embedded in the 866 field. Subsequently, the script generated a new 853 field for captions and patterns, along with multiple instances of the 863 field for coded enumeration and chronology data, derived from the existing data in the 866 field.
The final step involved utilizing the Alma API via MarcEdit, streamlining the restructuring of holdings data and updating nearly 5,000 records for serials. This article illustrates the application of Pymarc for both data analysis and creation, emphasizing its utility in generating data in the MARC format. Furthermore, it posits the potential application of Pymarc to enhance data within library and archive contexts.
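As a rough illustration of the kind of transformation described above, the sketch below parses a textual 866 holdings statement with a regular expression and derives an 853/863 pair with Pymarc. It assumes the pymarc 5.x API, and the regular expression, caption pattern, and subfield layout are illustrative assumptions rather than the authors’ production code.

```python
# Illustrative only: the regex, captions, and subfields are assumptions,
# not the USC script. Assumes pymarc 5.x (which provides the Subfield type).
import re
from pymarc import Field, Subfield

# A textual holdings statement as it might appear in an 866 $a
holdings_866 = "v.1(1995)-v.28(2022)"

match = re.match(r"v\.(\d+)\((\d{4})\)-v\.(\d+)\((\d{4})\)", holdings_866)
if match:
    start_vol, start_year, end_vol, end_year = match.groups()

    # 853: captions and patterns for the paired enumeration/chronology data
    field_853 = Field(
        tag="853",
        indicators=["2", "0"],
        subfields=[Subfield("8", "1"), Subfield("a", "v."), Subfield("i", "(year)")],
    )

    # 863: coded enumeration and chronology derived from the 866 text
    field_863 = Field(
        tag="863",
        indicators=["4", "0"],
        subfields=[
            Subfield("8", "1.1"),
            Subfield("a", f"{start_vol}-{end_vol}"),
            Subfield("i", f"{start_year}-{end_year}"),
        ],
    )

    print(field_853)
    print(field_863)
```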
The Use of Python to Support Technical Services Work in Academic Libraries
Technical services professionals in academic libraries are firmly committed to digital transformation and have embraced technologies and data practices that reshape their work to be more efficient, reliable, and scalable. Evolving systems, constantly changing workflows, and management of large-scale data are constants in the technical services landscape. Maintaining one’s ability to work effectively in this kind of environment involves embracing continuous learning cycles and incorporating new skills, which in effect means training people in a different way and re-conceptualizing how libraries provide support for technical services work. This article presents a micro lens into this space by examining the use of Python within a technical services environment. The authors conducted two surveys and eleven follow-up interviews to investigate how Python is used in academic libraries to support technical services work and to learn more about training and organizational support across the academic library community. The surveys and interviews conducted for this research indicate that understanding the larger context of culture and organizational support is highly important for illustrating the complications of this learning space for technical services. Consequently, this article will address themes that affect skills building in technical services at both a micro and macro level.
Pipeline or Pipe Dream: Building a Scaled Automated Metadata Creation and Ingest Workflow Using Web Scraping Tools
Since 2004, the FRASER Digital Library has provided free access to publications and archival collections related to the history of economics, finance, banking, and the Federal Reserve System. The agile web development team that supports FRASER’s digital asset management system embarked on an initiative to automate collecting documents and metadata from US governmental sources across the web. These sources present their content on web pages but do not serve the metadata and document links via an API or other semantic web technologies, making automation a unique challenge. Using a combination of third-party software, lightweight cloud services, and custom Python code, the FRASER Recurring Downloads project transformed what was previously a labor-intensive daily process into a metadata creation and ingest pipeline that requires minimal human intervention or quality control.
This article will provide an overview of the software and services used for the Recurring Downloads pipeline, some of the struggles that the team encountered during the design and build process, and the current use of the final product. In hindsight, the project required a more detailed plan than was initially designed and documented. The fully manual process was never intended to be automated when it was established, which introduced inherent complexity into creating the pipeline. A more comprehensive plan could have made the iterative development process easier by defining a data model up front and by documenting edge cases and strategies for handling them. Further initial analysis of the cloud services used would have revealed the limitations of those services, and workarounds could have been accounted for in the project plan. While the labor-intensive manual workflow has been reduced significantly, the skill sets required to efficiently maintain the automated workflow present a sustainability challenge of task distribution between librarians and developers. This article will detail the challenges and limitations of transitioning and standardizing recurring web scraping across more than 50 sources into a semi-automated workflow, as well as potential future improvements to the pipeline.
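For readers unfamiliar with the scraping step itself, a minimal sketch of the general pattern follows; the URL, CSS selector, and metadata fields are hypothetical and do not reflect FRASER’s actual sources or configuration.

```python
# Minimal sketch of the scraping step only; the URL, selector, and metadata
# fields are placeholders, not FRASER's actual configuration.
import requests
from bs4 import BeautifulSoup

SOURCE_URL = "https://example.gov/publications"  # placeholder source page

response = requests.get(SOURCE_URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

records = []
for link in soup.select("a[href$='.pdf']"):
    records.append({
        "title": link.get_text(strip=True),
        "url": requests.compat.urljoin(SOURCE_URL, link["href"]),
    })

# Each record would then be mapped to the repository's metadata model
# and queued for ingest into the digital asset management system.
print(records)
```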
A practical method for searching scholarly papers in the General Index without a high-performance computer
The General Index (GI) is a free database that offers unprecedented access to keywords and ngrams derived from the full text of over 107 million scholarly articles. Its simplest use is looking up articles that contain a term of interest, but the data set is large enough for text mining and corpus linguistics. Although it is positioned as a public utility, it has no user interface; users must download, query, and extract results from raw data tables. Not only is computing skill a barrier to use, but the file sizes are too large for most desktop computers to handle. This article will show a practical way to use the GI for researchers with moderate skills and resources. It will walk through building a bibliography of articles and a visualization of the yearly prevalence of a topic in the General Index, using simple R programming commands and a modestly equipped desktop computer (code is available at https://osf.io/s39n7/). It will briefly discuss what else can be done (and how) with more powerful computational resources.
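The article works in R, but the general approach (streaming a large keyword table in chunks, filtering for a term, and counting matches by year) translates to other environments. The sketch below is a rough Python analogue under assumed file and column names, not the code at the OSF link.

```python
# Rough Python analogue of the described workflow (the article itself uses R).
# The file name and column names are assumptions about a local GI table slice.
import pandas as pd

TERM = "bibliotherapy"  # example topic of interest

chunks = pd.read_csv(
    "keywords_table.tsv",      # hypothetical local copy of a GI keyword table
    sep="\t",
    usecols=["doi", "keyword", "year"],
    chunksize=1_000_000,       # stream the file so it fits in desktop memory
)

hits = pd.concat(chunk[chunk["keyword"] == TERM] for chunk in chunks)

# Bibliography: unique articles mentioning the term
bibliography = hits.drop_duplicates("doi")

# Yearly prevalence: number of matching articles per publication year
prevalence = bibliography.groupby("year")["doi"].count()
print(prevalence)
```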
Using Scalable Vector Graphics (SVG) and Google Sheets to Build a Visual Tool Location Web App
At the University Libraries at Virginia Tech, we recently built a visual kiosk web app to help patrons in our makerspace locate the tools they need and to assist our staff in returning and inventorying our large selection of tools, machines, and consumables. The app is built in Svelte and uses the Google Sheets “publish to web as csv” feature to pull data from a staff-maintained list of equipment in the space. All of this is tied to a Scalable Vector Graphics (SVG) file that is controlled by JavaScript and CSS to provide an interactive map of our shelving and storage locations, highlighting bins as patrons select specific equipment from a searchable list on the kiosk, complete with photos of each piece of equipment. In this article, you will learn why the app was made, the problems it has solved, why certain technologies were used and others weren’t, the challenges that arose during development, and where the project stands to go from here.
Bringing it All Together: Data from Everywhere to Build Dashboards
This article describes how Binghamton University approached building a data dashboard that brings together various datasets from MySQL, vendor emails, Alma Analytics, and other sources. Using Power BI, Power Automate, and a Microsoft gateway, we can see the power of easy access to data without needing to know each of the disparate source systems. We will discuss why we did it, some of how we did it, and the privacy concerns involved.
Real-Time Reporting Using the Alma API and Google Apps Script
When the University of Michigan Library migrated from the Aleph Integrated Library System (ILS) to the Alma Library Services Platform (LSP), many challenges arose in migrating our workflows from a multi-tier client/server ILS, with an in-house, locally hosted server accessed by staff through a dedicated client, to a cloud-based LSP accessed by staff through a browser. Among those challenges were deficiencies in timely reporting functionality in the new LSP and incompatibility with the locally popular macro software in use at the time. While the Alma LSP includes a comprehensive business intelligence tool, Alma Analytics, which offers a wide variety of out-of-the-box reports and on-demand reporting, it suffers from one big limitation: the data on which the reports are based are a copy of the Alma data extracted overnight. If you need a timely report of data from Alma, Analytics isn’t suitable. These issues necessitated the development of an application that brought together the utility of the Alma APIs and the convenience of the Google Apps Script platform. This article will discuss the resulting tool, which provides a real-time report on invoice data stored in Alma using the Google Apps Script platform.
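To give a sense of the real-time approach, the sketch below calls the Alma invoices API directly; the article’s actual tool is written in Google Apps Script, and the endpoint path, parameters, and response field names shown here are assumptions to verify against your own Alma instance and region.

```python
# Python sketch of a real-time invoice lookup against the Alma REST API.
# The article's tool uses Google Apps Script; the endpoint path, parameters,
# and response field names here are assumptions to verify for your region.
import requests

ALMA_BASE = "https://api-na.hosted.exlibrisgroup.com/almaws/v1"  # region-specific
API_KEY = "your-alma-api-key"  # placeholder

response = requests.get(
    f"{ALMA_BASE}/acq/invoices",
    params={"limit": 25, "format": "json", "apikey": API_KEY},
    timeout=30,
)
response.raise_for_status()

for invoice in response.json().get("invoice", []):
    # Live data from Alma, not the overnight Analytics extract
    print(invoice.get("number"), invoice.get("total_amount"))
```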
Using Airtable to download and parse Digital Humanities Data
Airtable is an increasingly popular cloud-based format for entering and storing research data, especially in the digital humanities. It combines the simplicity of spreadsheets like CSV or Excel with a relational database’s ability to model relationships and link records. The Center for Digital Research in the Humanities (CDRH) at Nebraska uses Airtable data for two projects, African Poetics (africanpoetics.unl.edu) and Petitioning for Freedom (petitioningforfreedom.unl.edu). In the first project, the data focuses on African poets and news coverage of them; in the second, the data focuses on habeas corpus petitions and the individuals involved in the cases. CDRH’s existing software stack (designed to facilitate display and discovery) can take in data in many formats, including CSV, parse it with Ruby scripts, and ingest it into an API based on the Elasticsearch search index. The first step in using Airtable data is to download and convert it into a usable data format. This article covers the command line tools that can download tables from Airtable, the formats that can be downloaded (JSON being the most convenient for automation), and access management for tables and authentication. Python scripts can then process this JSON data into a CSV format suitable for ingesting into other systems, and the article goes on to discuss how this data processing might work. It also discusses the process of exporting information from join tables, Airtable’s relational database-like functionality. Join data is not human-readable when exported, but it can be pre-processed in Airtable into parsable formats. After the data is processed into CSV format, the article touches on how CDRH API fields are populated from plain values and from more complicated structures, including Markdown-style links. Finally, this article discusses the advantages and disadvantages of Airtable for managing data from a developer’s perspective.
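As a minimal sketch of the download-and-flatten step (not CDRH’s actual scripts), the example below pages through one Airtable table via its REST API and writes the records to CSV; the base ID, table name, and token are placeholders.

```python
# Minimal sketch, not CDRH's scripts: download one Airtable table as JSON and
# flatten it to CSV. The base ID, table name, and token are placeholders.
import csv
import requests

BASE_ID = "appXXXXXXXXXXXXXX"
TABLE = "Petitions"
TOKEN = "your-airtable-token"

url = f"https://api.airtable.com/v0/{BASE_ID}/{TABLE}"
headers = {"Authorization": f"Bearer {TOKEN}"}

rows, offset = [], None
while True:
    params = {"offset": offset} if offset else {}
    data = requests.get(url, headers=headers, params=params, timeout=30).json()
    rows.extend(record["fields"] for record in data["records"])
    offset = data.get("offset")  # Airtable pages results; loop until exhausted
    if not offset:
        break

fieldnames = sorted({key for row in rows for key in row})
with open("petitions.csv", "w", newline="", encoding="utf-8") as handle:
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```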
Leveraging Aviary for Past and Future Audiovisual Collections
Now that audio and video recording hardware is easy to use, highly portable, affordable, and capable of producing high quality content, many universities are seeing a rise in demand for oral history projects and programs on their campuses. The burden of preserving and providing access to this complex format typically falls on the library, oftentimes with no prior involvement or consultation with library staff. This can be challenging when many library staff have no formal training in oral history and only a passing familiarity with the format. To address this issue, librarians at the College of Charleston have implemented AVPreserve’s audiovisual content platform, Aviary, to build out a successful oral history program.
The authors will share their experience building new oral history programs that coexist alongside migrated audiovisual materials from legacy systems. They will detail how they approached migrating legacy oral histories in batch form, and how they leveraged Aviary’s API and embed functionalities to present Aviary audiovisual materials seamlessly alongside other cultural heritage materials in a single, searchable catalog. This article will also discuss techniques for managing an influx of oral histories from campus stakeholders and details on how to make efficient use of time-coded transcripts and indices for the best user experience possible.
Standing Up Vendor-Provided Web Hosting Services at Florida State University Libraries: A Case Study
CreateFSU is Florida State University Libraries’ branded Reclaim Hosting, Domain of One’s Own web-hosting service. CreateFSU provides current FSU faculty, staff, and some students with web domains and over 150 popular open-source content management systems, including WordPress, Drupal, Scalar, and Omeka. Since the launch of the service in September 2021, the Libraries have negotiated the demands of providing such a service with various administrative stakeholders across campus, expanded the target audience, provided support, and refined our workflows and documentation to make the service fit campus needs. Using this service, members of the FSU community showcase the fruits of their research to a broad audience in ways that are highly accessible and engaging. More work needs to be done to promote CreateFSU to the FSU community and identify opportunities to integrate the service into existing research and learning workflows. To expand the service to meet new use cases and ensure its scalability, the Libraries hope to convince campus partners to consider its utility to their missions and contribute funding. This article lays out our experiences in launching and hosting this service over its first two years and proposes steps for future development and growth.
Islandora for archival access and discovery
This article is a case study describing the implementation of Islandora 2 to create a public online portal for the discovery, access, and use of archives and special collections materials at the University of Nevada, Las Vegas. The authors will explain how the goal of providing users with a unified point of access across diverse data (including finding aids, digital objects, and agents) led to the selection of Islandora 2 and they will discuss the benefits and challenges of using this open source software. They will describe the various steps of implementation, including custom development, migration from CONTENTdm, integration with ArchivesSpace, and developing new skills and workflows to use Islandora most effectively. As hindsight always provides additional perspective, the case study will also offer reflection on lessons learned since the launch, insights on open-source repository sustainability, and priorities for future development.
Developing a Multi-Portal Digital Library System: A Case Study of the new University of Florida Digital Collections
The University of Florida (UF) launched the UF Digital Collections in 2006. Since then, the system has grown to over 18 million pages of content. The locally developed digital library system consisted of an integrated public frontend interface and a production backend. As with other monoliths, adapting and making changes to the system became increasingly difficult as time went on and the size of the collections grew. As production processes changed, the system was modified to make improvements on the backend, but the public interface became dated and increasingly less mobile responsive. A decision was made to develop a new system, starting with decoupling the public interface from the production system. This article will examine our experience in rearchitecting our digital library system and deploying our new multi-portal, public-facing system. After an environmental scan of digital library technologies, we decided not to adopt an existing open-source digital library system. A relatively new programming team, whose members were new to the library ecosystem, allowed us to rethink many of our existing assumptions and provided new insights and development opportunities. Using technologies that include Python, APIs, Elasticsearch, ReactJS, PostgreSQL, and more has allowed us to build a flexible and adaptable system, one that lets us hire future developers who may not have experience building digital library systems.
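The decoupled pattern the article describes, a thin API layer between the search index and one or more public portals, can be sketched roughly as follows; this is not UF’s code, and the index and field names are hypothetical.

```python
# Illustrative sketch of a decoupled public API over Elasticsearch, not UF's
# implementation. Index and field names are hypothetical; assumes the 8.x
# elasticsearch Python client.
from elasticsearch import Elasticsearch
from flask import Flask, jsonify, request

app = Flask(__name__)
es = Elasticsearch("http://localhost:9200")

@app.route("/api/search")
def search():
    query = request.args.get("q", "")
    result = es.search(
        index="digital_objects",
        query={"multi_match": {"query": query, "fields": ["title", "description"]}},
        size=20,
    )
    hits = [hit["_source"] for hit in result["hits"]["hits"]]
    return jsonify(hits)

# One or more portal frontends (e.g., in ReactJS) would consume this endpoint,
# keeping the public interface independent of the production backend.
```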
Jupyter Notebooks and Institutional Repositories: A Landscape Analysis of Realities, Opportunities and Paths Forward
Jupyter Notebooks are important outputs of modern scholarship, though the longevity of these resources within the broader scholarly record is still unclear. Notebook creators and their communities have yet to holistically understand the creation, access, sharing, and preservation of computational notebooks, and such notebooks have yet to be given a proper place among institutional repositories or other preservation environments as first-class scholarly digital assets. Before this can happen, repository managers and curators need to have the appropriate tools, schemas, and best practices to maximize the benefit of notebooks within their repository landscape and environments.
This paper explores the landscape of Jupyter notebooks today, and focuses on the opportunities and challenges related to bringing Jupyter Notebooks into institutional repositories. We explore the extent to which Jupyter Notebooks are currently accessioned into institutional repositories, and how metadata schemas like CodeMeta might facilitate their adoption. We also discuss characteristics of Jupyter Notebooks created by researchers at the National Center for Atmospheric Research, to provide additional insight into how to assess and accession Jupyter Notebooks and related resources into an institutional repository.
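As one concrete possibility for the kind of tooling the paper calls for, the sketch below reads a notebook with nbformat and emits a minimal CodeMeta record; the property selection, values, and file names are placeholders rather than a recommended profile.

```python
# A minimal sketch of deriving a CodeMeta record for a notebook. The property
# selection, values, and file names are placeholders, not a recommended profile.
import json
import nbformat

nb = nbformat.read("analysis.ipynb", as_version=4)
kernel = nb.metadata.get("kernelspec", {})

codemeta = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": "analysis.ipynb",
    "description": "Computational notebook accompanying a repository deposit.",
    "programmingLanguage": kernel.get("language", "python"),
    "runtimePlatform": kernel.get("display_name", "Jupyter"),
    "author": [{"@type": "Person", "givenName": "Jane", "familyName": "Doe"}],
}

with open("codemeta.json", "w", encoding="utf-8") as handle:
    json.dump(codemeta, handle, indent=2)
```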
Beyond the Hype Cycle: Experiments with ChatGPT’s Advanced Data Analysis at the Palo Alto City Library
In June and July of 2023 the Palo Alto City Library’s Digital Services team embarked on an exploratory journey applying Large Language Models (LLMs) to library projects. This article, complete with chat transcripts and code samples, highlights the challenges, successes, and unexpected outcomes encountered while integrating ChatGPT Pro into our day-to-day work.
Our experiments utilized ChatGPT’s Advanced Data Analysis feature (formerly Code Interpreter). The first goal was to test the Search Engine Optimization (SEO) potential of ChatGPT plugins. The second was to enhance our web user experience by revising our BiblioCommons taxonomy to better match customer interests and to make the upcoming Personalized Promotions feature more relevant. ChatGPT helped us perform what would otherwise be a time-consuming analysis of customer catalog usage to determine a list of taxonomy terms better aligned with that usage.
In the end, both experiments demonstrated the utility of LLMs in the workplace and their potential for enhancing our librarians’ skills and efficiency. The thrill of this experiment was in ChatGPT’s unprecedented efficiency, adaptability, and capacity. We found it can solve a wide range of library problems and speed up project deliverables. The shortcomings of LLMs, however, were equally palpable. Each day of the experiment we grappled with the nuances of prompt engineering, contextual understanding, and occasional miscommunications with our new AI assistant. In short, a new class of skills for information professionals came into focus.
Comparative analysis of automated speech recognition technologies for enhanced audiovisual accessibility
The accessibility of digital audiovisual (AV) collections is a difficult legal and ethical area that nearly all academic libraries will need to navigate at some point. AV accessibility features like captions and transcripts enormously benefit users with disabilities while also adding value to the repository for all users. However, implementing these features has proven challenging for many reasons. Recent technological advancements in automatic speech recognition (ASR) and its underlying artificial intelligence (AI) technology offer librarians an avenue toward stewarding more accessible collections. This article will discuss these opportunities and present research from Florida State University Libraries evaluating the performance of different ASR tools. The authors will also present an overview of basic AV accessibility-related concepts, ethical issues in using AI technology, and a brief technical discussion of captioning formats.
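Evaluations of this kind commonly compare ASR output to a human reference transcript using word error rate (WER); the sketch below shows that calculation in its simplest form and is illustrative only, not a reproduction of the authors’ methodology.

```python
# Word error rate (WER): edit distance between hypothesis and reference words,
# normalized by reference length. Illustrative only, not the authors' method.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Standard dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```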
Using Event Notifications, Solid and Orchestration for Decentralizing and Decoupling Scholarly Communication
This paper presents the case for a decentralized and decoupled architecture for scholarly communication. It introduces the Event Notifications protocol as applied in projects such as the international COAR Notify Initiative and the NDE-Usable program run by memory institutions in the Netherlands, and it provides an implementation of Event Notifications using a Solid server. The processing of notifications can be automated using an orchestration service called Koreografeye, which is applied to a citation extraction and relay experiment to show how all of these tools fit together.
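For readers new to the protocol stack, the sketch below sends a simplified ActivityStreams 2.0 notification to a Linked Data Notifications inbox, the delivery mechanism Event Notifications builds on; all URLs and identifiers are placeholders, and the payload is a simplified illustration rather than a complete COAR Notify pattern.

```python
# Simplified illustration of delivering an Event Notification: an
# ActivityStreams 2.0 payload POSTed to an LDN inbox (for example, one exposed
# by a Solid server). URLs and identifiers are placeholders.
import json
import uuid
import requests

INBOX_URL = "https://example.org/inbox/"  # placeholder LDN inbox

notification = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "id": f"urn:uuid:{uuid.uuid4()}",
    "type": "Announce",
    "actor": {"id": "https://repository.example.org/", "type": "Service"},
    "object": {
        "id": "https://repository.example.org/article/123",
        "type": "Document",
    },
    "target": {"id": "https://service.example.org/", "type": "Service"},
}

response = requests.post(
    INBOX_URL,
    data=json.dumps(notification),
    headers={"Content-Type": "application/ld+json"},
    timeout=30,
)
print(response.status_code)  # LDN receivers typically respond with 201 Created
```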