Note: this program may not work for every institution, so review what the code does before using it. PureToResearchfish is a Python program that streamlines the manipulation of data extracted from Elsevier Pure, making it easier to upload to Researchfish. The final Excel sheet may require some additional manual editing, but the program should save several hours of work.
- Python
- Excel
- pandas
- numpy
- fuzzywuzzy
- Excel file containing data from Pure
- Excel file from Researchfish containing Funder Reference IDs
PuretoResearchfish.py
- Delete Duplicates: Removes duplicate rows from the dataset.
- Filter by "Funder Project Reference":
- Removes rows with blank "Funder Project Reference" fields.
- Splits "specific" rows that contain more than one "Funder Project Reference" ID.
- Filter by DOIs and Additional Source IDs:
- Keeps rows with a DOI OR where Additional Source IDs start with "PubMed:".
- Removes the "PubMed:" prefix from Additional Source IDs.
- Clear Additional Source IDs if DOIs are Present: Clears the "Additional Source IDs" field if a DOI is present.
- Filter by "Funder Project Reference": Removes rows whose "Funder Project Reference" contains only dates or "via institution".
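The "Funder Project Reference" steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the code in PuretoResearchfish.py: the `;` separator, the date patterns, and the helper name `clean_funder_refs` are assumptions, and this sketch splits every multi-reference row rather than only "specific" ones.

```python
import re

import pandas as pd

REF = "Funder Project Reference"  # column name as given in this README

# Assumed date-only formats (e.g. 01/02/2020 or 2020-02-01); adjust to your export.
DATE_ONLY = re.compile(r"(?:\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}|\d{4}-\d{2}-\d{2})")


def clean_funder_refs(df, sep=";"):
    """Drop blank references, split multi-reference rows, drop date-only
    and 'via institution' values. `sep` is a hypothetical separator."""
    out = df.copy()
    # Normalise missing and whitespace-only cells to empty strings, then drop them.
    out[REF] = out[REF].fillna("").astype(str).str.strip()
    out = out[out[REF] != ""]
    # Produce one row per reference.
    out[REF] = out[REF].str.split(sep)
    out = out.explode(REF)
    out[REF] = out[REF].astype(str).str.strip()
    # Flag values that are only a date or only "via institution".
    is_date = out[REF].map(lambda s: bool(DATE_ONLY.fullmatch(s))).astype(bool)
    is_via = out[REF].str.lower() == "via institution"
    out = out[~(is_date | is_via) & (out[REF] != "")]
    return out.reset_index(drop=True)


example = pd.DataFrame(
    {
        "Title": ["A", "B", "C"],
        REF: ["EP/X1; MR/Y2", None, "via institution"],
    }
)
print(clean_funder_refs(example))
```

Check a handful of rows from your own export before relying on the date pattern, since reference formats differ between funders.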
PureAddFunder
- Match Name to Funder Organisation: Adds the Funder Organisation by matching it against Name.
- Add Funder ID: Adds the Funder ID that corresponds to each Funder Organisation.
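The matching idea can be illustrated with a short sketch. The program itself lists fuzzywuzzy as a dependency; the example below instead uses `difflib` from the Python standard library as a stand-in so it runs without extra installs. The function name `match_funder`, the cutoff value, and the funder IDs are hypothetical, and it assumes "Name" is a free-text funder name from the Pure export.

```python
from difflib import get_close_matches


def match_funder(name, funders, cutoff=0.6):
    """Return (organisation, funder_id) for the closest organisation name,
    or (None, None) if nothing is similar enough.

    `funders` maps organisation names to IDs, e.g. as read from the
    Researchfish comparison sheet.
    """
    # Compare case-insensitively while remembering the original spelling.
    lookup = {org.lower(): org for org in funders}
    hits = get_close_matches(name.lower(), list(lookup), n=1, cutoff=cutoff)
    if not hits:
        return None, None
    org = lookup[hits[0]]
    return org, funders[org]


# Made-up IDs, for illustration only.
funders = {"Medical Research Council": "MRC-001", "Wellcome Trust": "WT-002"}
print(match_funder("Medical Research Council (MRC)", funders))
```

A higher cutoff reduces false matches at the cost of leaving more rows unmatched, so unmatched rows are worth reviewing by hand.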
- Clone the repository:
git clone https://github.com/trulylostheaven/PureToResearchfish.git
cd PureToResearchfish
- Install the required Python packages:
pip install pandas numpy
pip install fuzzywuzzy
- Run the program:
python run_program.py
- Run program PuretoResearchfish.py (this will run both programs in order)
- Select input_file.xlsx, select comparison_xlsx, and select comparison_sheet
- Delete Duplicates: The program uses Excel to remove duplicate rows from the dataset. This step runs twice, once at the beginning and once at the end.
- Filter by "Funder Project Reference": Rows with blank "Funder Project Reference" fields are removed using pandas.
- Filter by DOIs and Additional Source IDs: The program keeps rows that have a DOI (Digital Object Identifier) or whose Additional Source IDs start with "PubMed:".
- Clear Additional Source IDs if DOIs are Present: If a row has a DOI, the program clears that row's "Additional Source IDs".
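The duplicate and identifier steps described above could look roughly like this in pandas. It is a sketch under assumptions rather than the program's actual code: the DOI column name `DOIs` and the helper name `filter_identifiers` are guesses, and `drop_duplicates` stands in for the Excel-based duplicate removal.

```python
import pandas as pd

DOI = "DOIs"  # assumed column name; check your Pure export
IDS = "Additional Source IDs"
PREFIX = "PubMed:"


def filter_identifiers(df):
    """Keep rows with a DOI or a PubMed ID, strip the prefix, and clear
    Additional Source IDs wherever a DOI is present."""
    # pandas stand-in for the first Excel duplicate-removal pass.
    out = df.drop_duplicates().copy()
    doi = out[DOI].fillna("").astype(str).str.strip()
    ids = out[IDS].fillna("").astype(str).str.strip()
    has_doi = doi != ""
    is_pubmed = ids.str.startswith(PREFIX)
    out[DOI] = doi
    # Remove the "PubMed:" prefix only where it starts the value.
    out[IDS] = ids.map(
        lambda s: s[len(PREFIX):].strip() if s.startswith(PREFIX) else s
    )
    # A DOI takes priority, so the other identifier is cleared.
    out.loc[has_doi, IDS] = ""
    out = out[has_doi | is_pubmed]
    # Second duplicate-removal pass, mirroring the run at the end.
    return out.drop_duplicates().reset_index(drop=True)


example = pd.DataFrame(
    {
        "Title": ["A", "B", "C"],
        DOI: ["10.1/a", None, None],
        IDS: ["PubMed: 111", "PubMed: 222", "Scopus: 333"],
    }
)
print(filter_identifiers(example))
```

In this example, row C is dropped because it has neither a DOI nor a PubMed ID, and row A keeps its DOI while its Additional Source IDs value is cleared.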