by Corey Schmidt
Introduction
Make workflows easier. That was, and remains, the impetus for my journey down the Python application development rabbit hole. I was trained as an archivist, not as a software developer. I processed and digitized a number of collections through internships at various archives and libraries, but I was first introduced to Python in graduate school, where we learned the basics of coding and how Python can automate repeatable workflows. While I was skeptical that I would use coding outside of grad school, I discovered how useful it was in my first job at the University of Georgia Special Collections Libraries, where I managed the project to migrate out of Archivists’ Toolkit, our old archival management system for describing our collections, and into ArchivesSpace.
A critical piece of this migration was figuring out how we were going to rework our existing workflow to export EAD.xml files, run a series of cleanup/standardization operations on them, and upload them to our finding aids website. This led me to use Python to replace our existing workflow with something easier to distribute and simpler to use for our collections processing faculty and staff. I learned how to make a graphical user interface (GUI) using Python, create an executable (.exe) file to bundle Python and its dependencies into a single file, and use a free-to-use tool to make a Microsoft Windows installer. Unexpectedly, I found myself working around Microsoft’s own safeguards against unknown publishers, dealing with antivirus software that flagged my executables as malicious, and disentangling recent license changes to PySimpleGUI, the previously open-source and free-to-use package I was using for the application. These challenges forced me to confront what it means to build trust with users and to maintain homegrown, open-source desktop applications, and to recognize that making workflows easier applies not only to the users but to the developers too.
Background
To say our existing workflow for exporting, editing, and uploading EAD.xml files to our finding aids website was clunky would be underselling it. Getting our collection metadata out of our archival management system and onto our website required manually exporting each individual collection from Archivists’ Toolkit as an EAD2002 XML file and running a Perl script written a decade ago. Since the script had no documentation or error logging, users had to hope that the cleanup did not kick back any errors, because they had no idea how it worked or how to fix it. If the cleanup worked, users then needed to open a terminal or use WinSCP to upload the file to our finding aids website server. Archivists and processors then manually entered a command-line argument in a specific directory on the web server to index only the files just uploaded, making them visible and searchable on the website. If this argument was not entered exactly as written in our own documentation, it could cause the website to index every single file in the directory, totaling over 5000 EAD.xml files, and render the website essentially unusable for our researchers and patrons for the next 24 to 48 hours. Other issues could arise in the process as well, such as .lazy files, which help load elements on the collections page when a user requests them (XTF » Under the Hood), not being set with the correct permissions. When this happened, the collection page displayed a seemingly indecipherable error screen, despite the collection appearing in our search results with all the appropriate metadata. Looking at all the ways things could go wrong or simply be difficult for our users to manage, I thought to myself: “Surely we could simplify this process and make it more consistent.” That’s where Python came to mind.
Building a Python Desktop App
I learned that Python is great for many things, but it is especially good for automating repeatable, consistent processes. Since we were moving to ArchivesSpace, I knew we could use the ArchivesSpace API and Python to search for collections and export their EAD.xml files individually or in batches. For cleaning up the EAD.xml files, Python has a library for that: lxml (lxml). It can sleuth through XML files looking for specific tags, attributes, and content, and edit them as needed. For uploading and editing files on a file server, there are SCP and SFTP Python libraries, and a handy guide showed me how to use them for exactly this purpose (SSH & SCP in Python with Paramiko). I set to work and built separate Python scripts to handle the exporting, editing, and uploading of the files, but I needed a way to bring this process together for the user, preferably without requiring them to work in the terminal.
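To make the cleanup step more concrete, here is a minimal sketch of the kind of tag-level editing described above. It is not the application's actual code: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml (the two share a broadly similar element API), and the two cleanup rules, the function name clean_ead, and the sample record are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by schema-based EAD 2002 files.
EAD_NS = "urn:isbn:1-931666-22-9"
# Serialize EAD elements without a prefix, as they appear in the input.
ET.register_namespace("", EAD_NS)


def clean_ead(xml_text: str) -> str:
    """Apply two illustrative (hypothetical) cleanup rules to an EAD document."""
    root = ET.fromstring(xml_text)

    # Rule 1 (hypothetical): collapse runs of whitespace inside <unittitle>.
    for title in root.iter(f"{{{EAD_NS}}}unittitle"):
        if title.text:
            title.text = " ".join(title.text.split())

    # Rule 2 (hypothetical): drop <extent> elements that are completely empty.
    # Snapshot the elements first so removal does not disturb the iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            is_extent = child.tag == f"{{{EAD_NS}}}extent"
            is_empty = len(child) == 0 and not (child.text or "").strip()
            if is_extent and is_empty:
                parent.remove(child)

    return ET.tostring(root, encoding="unicode")


# Invented sample record, only to exercise the two rules.
sample = (
    '<ead xmlns="urn:isbn:1-931666-22-9"><archdesc level="collection"><did>'
    "<unittitle>  Example   family\n papers </unittitle>"
    "<physdesc><extent></extent><extent>3 boxes</extent></physdesc>"
    "</did></archdesc></ead>"
)
cleaned = clean_ead(sample)
print(cleaned)
```

Keeping each rule as its own small loop would make it straightforward to switch individual cleanup operations on or off from a settings screen, though that design is an assumption rather than something the workflow is known to do.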
Then I discovered PySimpleGUI. It is a Python library designed to make graphical user interfaces (GUIs) using Python, borrowing concepts from other Python libraries that do similar things and adding its own features to the mix. PySimpleGUI had everything from buttons, to user input boxes, pop-ups, default settings, drop down menus, checkboxes, radio buttons, almost anything you can associate with desktop application features. I began designing my application using simple wireframes, then testing different features PySimpleGUI offered as I figured out which ones would make each task more intuitive. After some work, I demoed the GUI to our faculty and staff and they were very enthusiastic. All a user had to do was enter the collection identifier(s) they wanted into an input box on screen and click the “Export” button. They could choose what specific cleanup processes they wanted to run and where to put the finished files. No longer was it necessary to enter terminal commands to upload an EAD.xml file. Now there was just a selection box for which collection you wanted to upload and an “Upload” button. There was no chance of re-indexing the whole website or weird file-permission errors, since the correct commands were now hard-coded, though you could change how those commands worked in a settings menu, if needed.
In addition, I added an output screen to tell the user what was happening in real time. Should something go wrong, it would display an error message and record it in a log file using the loguru Python library (GitHub – Delgan/loguru: Python logging made (stupidly) simple). The resulting workflow made exporting, cleaning, and uploading collections from ArchivesSpace to our finding aids website much easier and more understandable for our faculty and staff, but there was something that could make it even more approachable.
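The pattern of showing a message to the user while also recording it in a log file can be sketched as follows. This is an illustration under stated assumptions, not the application's code: it swaps in the standard library's logging module for loguru, a plain list stands in for the GUI output element, and the names make_logger, run_step, and failing_export are hypothetical.

```python
import logging
import tempfile
from pathlib import Path


def make_logger(log_path: Path) -> logging.Logger:
    """Create a logger that writes timestamped records to log_path."""
    logger = logging.getLogger("ead_workflow")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


def run_step(name, step, logger, output):
    """Run one workflow step, reporting the outcome on screen and in the log."""
    try:
        step()
    except Exception as exc:
        # The user sees a short message; the log keeps the full traceback.
        output.append(f"ERROR in {name}: {exc}")
        logger.exception("Step %r failed", name)
        return False
    output.append(f"Finished {name}")
    logger.info("Step %r finished", name)
    return True


def failing_export():
    """Stand-in for an export that cannot find the requested collection."""
    raise ValueError("collection identifier not found")


log_file = Path(tempfile.mkdtemp()) / "app.log"
logger = make_logger(log_file)
screen = []  # stand-in for the GUI's output element
run_step("export", failing_export, logger, screen)
run_step("cleanup", lambda: None, logger, screen)
print("\n".join(screen))
```

Separating the short on-screen message from the full traceback in the log keeps the interface readable for users while preserving the detail a maintainer needs to diagnose a failure.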
While the Python scripts and GUI simplified the workflow tremendously, I needed a way to distribute the code to our users that was already familiar to them and as straightforward as possible. PySimpleGUI’s documentation mentioned the ability to create an executable file (.exe) for your GUI app. Creating an executable file would bundle all the app’s code, its dependencies, and the Python interpreter into one file (this is also called “freezing”, see Freezing Your Code). This would allow users to run the app just like any other application they download and install on their PC, making the workflow even easier for them. It would also make it easier to deliver the code, since I would not need to teach our users how to update code from GitHub, which was our code-sharing platform. So, I created an .exe file using PyInstaller, and it did exactly that: it created a single file from which I could run our app without having to install Python, update code through GitHub, or run any commands in the terminal.
Going one step further, I discovered Inno Setup, a program for creating a Windows installer for your application. This would enable our users to install our .exe app just like any other Windows software, with the added benefit of having it run like any other app on their computer. It would install the app in their local user profile, so no administrator privileges were needed to install or run it, and it would be set up in its own application folder so any files, like exported EAD.xml files or log files, could be stored in a central, default location. With both a single executable file and a Windows installer, the process was familiar to our users, since it was just like installing and running any other kind of software on their PC. I did attempt to create a similar process for Macs, but I could not get it to work despite PyInstaller’s promise of being able to create executable files for Windows, Linux, and Mac. Even so, it felt as though I had found the end of the software development rabbit hole: we had an open-source Windows desktop application complete with a GUI, executable, and installer. Then, the rabbit hole got even deeper.
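For readers who have not frozen a Python app before, the following sketch shows one way to drive PyInstaller from Python. PyInstaller documents both the PyInstaller.__main__.run entry point and the --onefile, --windowed, and --name options, but the script name, application name, and helper functions here are hypothetical, and the exact options used for our app may have differed.

```python
def build_pyinstaller_args(script: str, app_name: str) -> list[str]:
    """Return PyInstaller options for a single-file, windowed executable."""
    return [
        "--onefile",       # bundle code, dependencies, and interpreter in one file
        "--windowed",      # do not open a console window alongside the GUI
        "--name", app_name,
        script,
    ]


def freeze(script: str, app_name: str) -> None:
    """Build the executable (requires PyInstaller to be installed)."""
    # Imported here so this module still loads where PyInstaller is absent.
    import PyInstaller.__main__

    PyInstaller.__main__.run(build_pyinstaller_args(script, app_name))


# Hypothetical names; calling freeze("gui.py", "EADExporter") would run the build.
print(build_pyinstaller_args("gui.py", "EADExporter"))
```

The same options can be passed to the pyinstaller command in a terminal; wrapping them in a function simply makes the build repeatable.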
Trouble with Windows
The first major obstacle involved Windows itself, which attempts to protect users from installing potentially harmful software. When you create a Windows installer without being a licensed software publisher, Windows will flash a warning message on the user’s screen, telling them that the app comes from an unknown publisher.
Looking at the warning message, there is no indication of how to run the application anyway. The only way to move forward is to select “More info” in the warning message, which then reveals a “Run anyway” button at the bottom of the window.
How do you avoid this warning in the first place? You can sign your code, which effectively signals to Microsoft that you are a legitimate software publisher and allows them to trace any issues back to you. Unlike all the other steps mentioned so far, which are free to use (with one exception we’ll talk about later), code signing requires paying a code-signing provider, with different levels of certificates (see A guide to code signing certificates for the Microsoft app store and a question for the experts : r/electronjs on reddit for a thorough breakdown of this). In addition, you’ll need to purchase a Hardware Security Module (HSM) to store the certificate either locally or in the cloud via an HSM provider. At this stage, I was unwilling to pony up hundreds of dollars to remove this warning message when the simplest solution was to show our users how to get past it.
This was the first indication that self-publishing software comes with its own risks, primarily security ones. It’s good practice for Microsoft to have these guardrails in place, as so many seemingly legitimate pieces of software come packaged with malware, viruses, and all the nefariousness bad actors can cram into a desktop application. As a result of this warning message, users are more cautious when downloading and running apps they find online, even when those apps are developed in-house for a very niche workflow. In our first few releases, I taught our faculty and staff how to get around this message, assuring them that nothing I coded should be harming their computers. Virus scanners, however, did not agree.
Trouble with Virus Scanners
While our process of creating .exe files and Windows installers had seemingly worked well, I had not anticipated that virus scanning software would catch our homegrown Python app and flag it as harmful. A user reported to me that they were having difficulty running the app on their computer. Microsoft had identified the app as a virus and forcibly quarantined and removed the Windows installer and application files. Not understanding why this was happening, I reached out to our system administrator in IT for help. He suggested running the installer file through VirusTotal, an online inspection tool that uses many antivirus scanners to check whether a file or URL is identified as malicious (How it works – VirusTotal). Upon uploading the installer and executable file, we discovered that Microsoft and a host of other antivirus scanners flagged the app as malicious. This is known as a “false-positive.”
A false-positive occurs when antivirus software flags another piece of software as a virus, even though it is not. This can happen when malicious actors use an open-source piece of software to distribute malware to users (see What’s a False Positive and How Can You Fix It? | All About Cookies for more details). Our system administrator and I realized that PyInstaller was the root of the issue: it was being flagged as a virus because others had used it to spread malicious code (False-positive search – PyInstaller Google Group). Following this guide, Pyinstaller EXE False-Positive Trojan Virus [RESOLVED], we first attempted to compile PyInstaller locally rather than pulling from PyInstaller’s code online. While this prevented the false-positive reports the first few times, our builds soon began to be flagged by other antivirus software with even more false-positives. We decided to try our luck with a different executable generator package and settled on cx_Freeze (GitHub – marcelotduarte/cx_Freeze). Though it required more setup than PyInstaller, our initial testing showed promise, as no antivirus software flagged the result as a virus.
In addition, we worked together to automate the whole process of generating an executable and bundling it with a Windows installer by using GitHub Actions, making the whole distribution process to our users immensely easier and faster (GitHub Actions documentation). Unfortunately, it didn’t take long before the false-positives returned and we resorted to the last method we had available – reporting the reports. If an antivirus software flags a file as malicious but you know it is not, you can submit a report to that specific antivirus software stating the file was flagged as a false-positive. VirusTotal has a list of all the contact methods for antivirus software it uses so you can report these false-positives. After submitting a false-positive report to one of the antivirus software providers, they responded within a few days and removed their flag. Another antivirus software never responded to our false-positive report and it still shows on VirusTotal’s results, however. While some software providers are more responsive than others, we at least figured out a way to fight back against the false-positives and now include a VirusTotal report on all our most recent releases, making sure to call out false-positives as they appear and report them as such to the appropriate virus scanners. We hope these small steps build trust with our users. We want to show we are doing our best to ensure the software we are distributing is vetted and safe to use, in addition to being straightforward and easy.
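VirusTotal identifies each scanned file by its cryptographic hash, and its search accepts a SHA-256 value. One possible complement to linking a report, which is an assumption on my part rather than a step described above, is to publish the installer's SHA-256 alongside each release so users can confirm that the file they downloaded is the one that was scanned. A minimal sketch using only the standard library, with a small demo file standing in for a real installer:

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large installers need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo file standing in for an installer (hypothetical name and content).
demo = Path(tempfile.mkdtemp()) / "installer-demo.bin"
demo.write_bytes(b"abc")
installer_hash = sha256_of_file(demo)
print(installer_hash)
```

A user can compute the same value locally and compare it with the published one before running the installer.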
Trouble with Licenses
With a process for managing false-positives going, there was one final issue we had to overcome, which involved changes to the license model for the Python GUI we were using. Beginning in 2024, PySimpleGUI announced it was changing its licensing model, classifying users as “Hobbyist” and “Commercial”. According to PySimpleGUI’s own definition, “developers at educational institutions who use PySimpleGUI for administrative or research purposes are considered Commercial Developers and register as Commercial Developers and pay the corresponding $99 license fee.” (FAQ – PySimpleGUI Documentation). Commercial users need to pay $99 per developer to maintain their developer and distribution keys for a perpetual license for the versions supported within one year of initial purchase. One of the primary reasons for this change was to make the continued development of the library possible with a more consistent revenue stream and to ensure the authenticity of the PySimpleGUI version you are using. This creates additional complications because we released the app with a CC-BY-SA-4.0 license, allowing anyone to use the code or download the distributables. This potentially conflicts with PySimpleGUI’s own license, so we’re not exactly sure how to approach this going forward. Thankfully, previous releases of our app remain online and will still be downloadable and usable, though if we update PySimpleGUI to version 5 or above, it will require us to purchase a license and include a distribution key in the app’s code to verify its legitimacy. For now, we decided to pay for a license for one year to give us time to think about alternative GUIs (tkinter or FreeSimpleGUI are our most viable alternatives) and how we will go about distributing the workflow moving forward. Understanding the landscape of software licenses goes beyond my training as an archivist, but it demonstrates another layer of complexity to consider when building and sharing your own desktop applications.
Conclusion
Diving down the software development rabbit hole came with unexpected challenges around building, distributing, and maintaining our workflow. It taught me the value of making this kind of desktop application easier to use, for our users and for myself, wherever possible. Creating a GUI to export EAD.xml files from ArchivesSpace, clean those files, and upload them to our website made the process much easier for users, but it had the unintended consequence of making us reconsider how we license our app moving forward. Generating executable files, so that nobody had to install Python or any of its dependencies, simplified using and sharing the code, but it also forced us to figure out how to push back against false-positive reports from antivirus scanners. Making a Windows installer made installation of the app straightforward and familiar, and wrapping the whole process together with GitHub Actions made generating new releases simpler. The experience also demonstrated that building trust with users takes intentional effort and that building trust with Microsoft takes money (if you’re willing to pay for it).
Becoming aware of the pitfalls of building your own software for users brought to life just how complex and safeguarded the process can be. There are a lot of people and processes out there actively protecting users from malicious code and bad actors, which is worth the extra hoops we have to jump through if we are making our own well-intentioned software. I have my own qualms about paying for things like code signing, but I understand the intention is to make the producer of the code accountable to the users. Tools like VirusTotal exist to help developers, even unintentional developers, understand how the tools we’re using are evaluated for nefarious purposes. I fell down the application developer rabbit hole and stumbled around the world of application security, distribution, and licensing, driven by a desire to make my fellow archivists’ jobs a little easier. Though I think it’s worth the effort to help create easy-to-use workflows for our colleagues, if you plan on going down the same rabbit hole as I did, just remember to watch your step.
Appendix
I was inspired to write this article by last year’s Code4Lib conference, where one presentation talked about creating a Python GUI and using PyInstaller for an internal workflow. I asked whether the presenters were planning on distributing it or making it publicly available, but they said they were keeping it for internal use only. I heard other people talk about making similar things (especially when PySimpleGUI changed its license model) in the Code4Lib #python Slack channel, so I hoped this article could bring to light some issues people can look out for if they decide to go as far as I did in making Python-generated desktop applications.
About the Author
Corey Schmidt is an IT Specialist at the Smithsonian Institution, with experience in ArchivesSpace, Python, and project management. He graduated from the University of Michigan School of Information with a Masters in Information in 2019 and from Truman State University with a Bachelor of Arts in History in 2016.