This repository contains a tool for calculating the dimensionality of MOFs from CIF files. Based on an earlier CSD version (created by S.B. Wiggin, CCDC, 2020-01-21), this version is updated to directly process CIF files and supports parallel execution for handling larger datasets more efficiently.
Note: Both versions of the code rely on the integrated CSD Python library, particularly the entry.crystal.polymer_expansion function. For larger MOFs, this function can consume significantly more RAM, causing memory spikes during execution. To prevent jobs from being killed by out-of-memory errors, it is strongly recommended to allocate a generous amount of RAM for both the basic and parallel versions of the code.
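To illustrate where the memory pressure comes from, here is a minimal sketch of loading a CIF file and expanding its polymeric unit with the CSD Python API. The repetitions keyword and the return type are assumptions about polymer_expansion, and the file path is hypothetical; the repository's scripts contain the actual logic.

```python
from ccdc.io import CrystalReader

# Read the first crystal from a CIF file (hypothetical path).
reader = CrystalReader('/path/to/structure.cif')
crystal = reader[0]

# Expand the polymeric bonds of the framework. Each extra repetition
# multiplies the number of atoms held in memory, which is why large
# MOFs can cause sharp RAM spikes.
expanded = crystal.polymer_expansion(repetitions=2)  # assumed signature
print(len(expanded.molecule.atoms))                  # assumed return type
```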
The MOF_Dimensions_CIF.py script is an updated version of the CCDC code written by S.B. Wiggin (https://github.com/ccdc-opensource/science-paper-mofs-2020.git), reworked to accept CIF files as input.
To run this version of the code, you need to specify two arguments:
- The absolute path of the dataset (directory containing CIF files)
- The absolute path to the output file (a .csv file)
```bash
python MOF_Dimensions_CIF.py -i /path/to/dataset/directory -o /path/to/csv/file.csv
```

This command will iterate over all of the CIF structures in the dataset directory sequentially (one structure at a time) and print the dimensionality results to both the command line and the CSV file, similar to how the original code printed them.
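The scripts presumably wire up these flags with Python's argparse; the sketch below shows the expected interface (the long option names and help text are assumptions, not the repository's exact code):

```python
import argparse

# Sketch of the expected command-line interface; the actual script may
# differ in long option names, help text, or defaults.
parser = argparse.ArgumentParser(
    description='Compute MOF dimensionality from CIF files.')
parser.add_argument('-i', '--input', required=True,
                    help='absolute path to the directory containing CIF files')
parser.add_argument('-o', '--output', required=True,
                    help='absolute path to the output .csv file')
args = parser.parse_args()
```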
The MOF_Dimensions_CIF_Parallel.py script was created to shorten the runtime by running the code on multiple CPU cores simultaneously, using Python's built-in ProcessPoolExecutor (from the concurrent.futures module in the standard library).
To run this version of the code, you need to specify three arguments:
- The absolute path of the dataset (directory containing CIF files)
- The absolute path to the output file (a .csv file)
- The number of cores to use (integer)
```bash
python MOF_Dimensions_CIF_Parallel.py -i /path/to/dataset/directory -o /path/to/csv/file.csv -n 4
```

This command will split the dataset into pools (based on the number of CPU cores provided). Each pool is computed in parallel, with the output printed to both the command line and the CSV file. Note that while this version is much faster and scales with the number of CPU cores provided, it requires more RAM, since several structures are expanded in memory at once.
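A minimal sketch of this pattern follows, assuming a hypothetical per-file worker function, process_cif, that returns a dimensionality label; the real worker in MOF_Dimensions_CIF_Parallel.py wraps the CSD Python library calls instead of this placeholder:

```python
import csv
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def process_cif(cif_path):
    """Hypothetical worker: compute dimensionality for one CIF file.
    The actual script's worker calls into the CSD Python library."""
    return cif_path.name, 'unknown'  # placeholder result

def main(dataset_dir, out_csv, n_cores):
    cif_files = sorted(Path(dataset_dir).glob('*.cif'))
    with ProcessPoolExecutor(max_workers=n_cores) as pool, \
         open(out_csv, 'w', newline='') as fh:
        writer = csv.writer(fh)
        writer.writerow(['identifier', 'dimensionality'])
        # map() distributes the CIF files across the worker processes
        # and yields results back in input order.
        for name, dim in pool.map(process_cif, cif_files):
            print(name, dim)
            writer.writerow([name, dim])

if __name__ == '__main__':
    main('/path/to/dataset/directory', '/path/to/csv/file.csv', 4)
```

Because each worker process holds its own expanded structure in memory, peak RAM usage grows roughly with the number of cores, which is the trade-off noted above.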
For reference, bash scripts are also provided in the repository to demonstrate further parallelization; for instance, the ARC-MOF dimensionality results were computed using these batch scripts. The SUBMIT.sh script splits the overall dataset into smaller batches of a user-specified size. Each batch can then be submitted independently with either MOF_Dimensions_CIF_Parallel.py or MOF_Dimensions_CIF.py, as specified in the SEND.sh script.