This software repository contains several Python scripts for computing
- a smallest string attractor,
- a smallest straight-line program (SLP), and
- a bidirectional macro scheme (BMS) with the fewest phrases, by using the SAT solver pySAT.
The algorithm is described in the following paper.
Hideo Bannai, Keisuke Goto, Masakazu Ishihata, Shunsuke Kanda, Dominik Köppl, Takaaki Nishimoto: Computing NP-hard Repetitiveness Measures via MAX-SAT. In Proc. 30th Annual European Symposium on Algorithms (ESA 2022), pp 12:1--12:16, 2022.
The implementation reported in the ESA 2022 paper is here. The full version of this paper can be found here.
- Install
uvandpython3.10on your OS. - Run the following command at the root of this repository.
uv syncThis installs package dependencies like python-sat locally to the repository, which are needed for running the Python scripts.
If the command fails, it is likely that you have a different minor version of Python3 installed.
Once the installation is done,
you can run the Python scripts using uv , such as:
uv run src/slp_solver.py --str "abracadabra"The script shell/example_run.sh evaluates our solutions on files of the Canterbury and Calgary corpus.
On success, it creates a JSON file shell/example_run/benchmark.json storing various statistics of the conducted benchmark.
Our executables are written in Python, and can be accessed from the src directory.
You can run the programs via the uv command as follows for the example string abracadabra.
uv run src/slp_solver.py --str "abracadabra"
uv run src/attractor_solver.py --algo min --str "abracadabra"
uv run src/bms_solver.py --str "abracadabra"Common to all Python scripts are the input parameters --str for giving a string as an input, or --file for reading an input file.
With parameter -h you can obtain the list of all available parameters:
uv run src/attractor_solver.py -husage: attractor_solver.py [-h] [--file FILE] [--str STR] [--output OUTPUT] [--contains CONTAINS [CONTAINS ...]] [--size SIZE] [--algo ALGO] [--log_level LOG_LEVEL]
Compute Minimum String Attractors.
optional arguments:
-h, --help show this help message and exit
--file FILE input file
--str STR input string
--output OUTPUT output file
--contains CONTAINS [CONTAINS ...]
list of text positions that must be included in the string attractor, starting with index 1
--size SIZE exact size or upper bound of attractor size to search
--algo ALGO [min: find a minimum string attractor, exact/atmost: find a string attractor whose size is exact/atmost SIZE]
--log_level LOG_LEVEL
log level, DEBUG/INFO/CRITICAL
The output consists of, omitting some logging information, a line of JSON, which stores the following attributes:
date: the date of executionstatus: empty in case of success; otherwise it can be used for error messagesalgo: the name of the used program likeslp-satfor our solution computing the smallest SLPfile_name: the name of the input file (empty if--stris used)file_len: the length of the input (given by either--stror--file)time_prep: the time needed for defining all hard and soft clausestime_total: the total running timesol_nvars: the number of defined Boolean variables for the solversol_nhard: the number of defined hard clausessol_nsoft: the number of defined soft clausessol_navgclause: the average number of literals a clause containssol_ntotalvars: the size of the CNF, i.e., the sum of all literals in each hard clausesol_nmaxclause: the number of literals in the largest clausefactor_size: the size of the output (i.e., the smallest number of factors of a bidirectional macro scheme (BMS), the smallest size of a string attractor, the smallest size of a grammar)factors: an instance of a valid output attaining the sizefactor_size. This is-
for string attractors a list of text positions
-
for a bidirectional macro scheme (BMS) a list of pairs [pos, len] for the direction to copy from
T[pos]a substring of lengthlen. Ifposis -1, then the factor is a ground phrase and it stores its character inlen. -
for SLP a pair (start symbol, production rules), where the production rules are stored in a list (actually, a dictionary. Furthermore, the encoding is a representation of the partial parse tree and differs, in representation, slightly from a grammar)
A non-terminal is given by the triplet (
from,to,char) such that its expansion is the substring T[from..to-1]. Each production rule has the shape- (
from,to, None): [(fromLeft,toLeft,charLeft), (fromRight,toRight,charRight)], or - (
from,to,char): []
The first case represents an internal node in the partial parse tree, and this node has two children representing non-terminals, each defined again by this triplet of
from,to, andcharvalue.The second case represents leaves of the partial parse tree.
- If T[
from..to-1] is a single character, we know that this non-terminal expands to a single characterchar. - If T[
from..to-1] is longer than 1, the leaf is a non-terminal equivalent to the non-terminal (char,char+to-from, None) guaranteed to exist.
- (
-
Please find below concrete examples in how the output looks like.
We compute a smallest string attractor of size factor_size = 497 for the dataset grammar.lsp having a length of 3721.
uv run src/attractor_solver.py --file data/cantrbry/grammar.lsp --algo min{"date": "2022-04-19 20:42:38.553750", "status": "", "algo": "attractor-sat", "file_name": "grammar.lsp", "file_len": 3721, "time_prep": 0.08829855918884277, "time_total": 0.38588380813598633, "sol_nvars": 3721, "sol_nhard": 1669, "sol_nsoft": 3721, "sol_navgclause": 17.860395446375076, "sol_ntotalvars": 29809, "sol_nmaxclause": 802, "factor_size": 497, "factors": [2, 5, 8, 10, 12, 14, 20, 22, 25, 28, 31, 34, 38, 44, 50, 51, 54, 57, 59, 62, 65, 69, 72, 77, 80, 85, 95, 99, 108, 114, 118, 121, 125, 133, 139, 143, 155, 161, 170, 181, 183, 186, 194, 201, 226, 239, 255, 261, 268, 276, 289, 294, 298, 307, 310, 313, 326, 341, 346, 356, 358, 378, 411, 429, 440, 449, 454, 472, 482, 489, 510, 523, 535, 546, 591, 596, 628, 650, 675, 677, 688, 707, 719, 725, 741, 759, 766, 785, 796, 805, 829, 845, 854, 872, 913, 917, 930, 941, 959, 981, 1005, 1011, 1018, 1029, 1031, 1051, 1053, 1064, 1082, 1103, 1108, 1118, 1129, 1149, 1167, 1176, 1186, 1189, 1216, 1224, 1234, 1253, 1272, 1275, 1280, 1286, 1292, 1294, 1300, 1305, 1316, 1321, 1327, 1335, 1346, 1354, 1369, 1377, 1400, 1413, 1430, 1432, 1435, 1482, 1498, 1512, 1552, 1573, 1584, 1585, 1587, 1591, 1613, 1629, 1631, 1644, 1670, 1717, 1732, 1738, 1742, 1744, 1746, 1749, 1752, 1754, 1762, 1770, 1776, 1782, 1786, 1788, 1792, 1795, 1798, 1804, 1806, 1809, 1811, 1815, 1819, 1821, 1825, 1831, 1841, 1851, 1857, 1862, 1868, 1874, 1880, 1885, 1891, 1898, 1901, 1904, 1913, 1915, 1918, 1925, 1928, 1933, 1942, 1952, 1956, 1962, 1964, 1966, 1972, 1978, 1983, 2000, 2005, 2013, 2017, 2022, 2028, 2031, 2040, 2044, 2056, 2062, 2074, 2076, 2085, 2089, 2101, 2106, 2109, 2112, 2119, 2123, 2124, 2133, 2143, 2147, 2150, 2153, 2157, 2159, 2165, 2168, 2175, 2177, 2178, 2183, 2190, 2197, 2206, 2212, 2214, 2216, 2220, 2224, 2226, 2228, 2230, 2239, 2243, 2248, 2252, 2256, 2261, 2266, 2277, 2281, 2290, 2295, 2299, 2303, 2309, 2312, 2316, 2321, 2335, 2340, 2344, 2360, 2362, 2366, 2379, 2383, 2386, 2401, 2404, 2418, 2428, 2448, 2457, 2470, 2472, 2491, 2496, 2500, 2516, 2524, 2536, 2540, 2558, 2566, 2569, 2572, 2584, 2587, 2601, 2613, 2622, 2627, 2631, 2639, 2649, 2657, 2664, 2680, 2690, 2693, 2696, 2705, 2718, 2721, 2741, 2745, 2749, 2762, 2766, 2777, 2793, 2797, 2808, 2815, 2825, 2836, 2840, 2844, 2859, 2868, 2871, 2894, 2900, 2902, 2916, 2918, 2921, 2927, 2943, 2964, 2968, 2974, 2990, 2993, 2998, 3004, 3017, 3023, 3026, 3037, 3051, 3056, 3065, 3068, 3070, 3074, 3085, 3092, 3099, 3103, 3115, 3119, 3122, 3131, 3139, 3142, 3147, 3150, 3152, 3159, 3164, 3165, 3177, 3184, 3197, 3202, 3205, 3209, 3216, 3220, 3222, 3230, 3235, 3241, 3249, 3252, 3257, 3259, 3264, 3268, 3271, 3278, 3281, 3288, 3295, 3298, 3301, 3305, 3314, 3319, 3324, 3328, 3335, 3336, 3348, 3360, 3366, 3369, 3374, 3377, 3385, 3391, 3394, 3398, 3405, 3414, 3420, 3422, 3428, 3431, 3440, 3443, 3449, 3454, 3458, 3465, 3467, 3471, 3474, 3477, 3480, 3483, 3486, 3490, 3493, 3500, 3504, 3515, 3519, 3521, 3526, 3530, 3534, 3544, 3550, 3555, 3558, 3560, 3564, 3570, 3573, 3579, 3581, 3585, 3590, 3592, 3594, 3598, 3603, 3606, 3609, 3612, 3624, 3626, 3627, 3631, 3634, 3636, 3639, 3642, 3644, 3645, 3647, 3649, 3651, 3653, 3655, 3662, 3668, 3671, 3676, 3677, 3681, 3685, 3689, 3692, 3694, 3697, 3701, 3705, 3712]}We compute a smallest bidirectional macro scheme (BMS) of size factor_size = 43 for cp.html-50 of length 50.
uv run src/bms_solver.py --file data/cantrbry_pref/cp.html-50{"date": "2022-04-19 20:51:17.510793", "status": "", "algo": "bms-sat", "file_name": "cp.html-50", "file_len": 50, "time_prep": 0.010410785675048828, "time_total": 0.014687776565551758, "sol_nvars": 1140, "sol_nhard": 3517, "sol_nsoft": 50, "sol_navgclause": 2.3810065396644866, "sol_ntotalvars": 8374, "sol_nmaxclause": 761, "factor_size": 43, "factors": [[-1, 60], [-1, 104], [-1, 101], [-1, 97], [-1, 100], [41, 3], [-1, 116], [-1, 105], [-1, 116], [-1, 108], [-1, 101], [-1, 62], [-1, 67], [-1, 111], [-1, 109], [-1, 112], [-1, 114], [-1, 101], [-1, 115], [-1, 115], [-1, 105], [-1, 111], [-1, 110], [-1, 32], [-1, 80], [-1, 111], [-1, 105], [-1, 110], [-1, 116], [-1, 101], [-1, 114], [-1, 115], [-1, 60], [-1, 47], [8, 6], [-1, 10], [-1, 60], [-1, 77], [-1, 69], [-1, 84], [-1, 65], [-1, 32], [-1, 72]]}We compute a smallest SLP grammar of size factor_size = 68 for cp.html-50 of length 50.
uv run src/slp_solver.py --file data/cantrbry_pref/cp.html-50{"date": "2022-04-22 12:57:13.385176", "status": "", "algo": "slp-sat", "file_name": "cp.html-50", "file_len": 50, "time_prep": 0.05773806571960449, "time_total": 0.0611567497253418, "sol_nvars": 2693, "sol_nhard": 28704, "sol_nsoft": 50, "sol_navgclause": 2.7269370122630994, "sol_ntotalvars": 78274, "sol_nmaxclause": 52, "factor_size": 68, "factors": "((0, 50, None), {(49, 50, 72): None, (48, 49, 32): None, (47, 48, 65): None, (46, 47, 84): None, (45, 46, 69): None, (44, 45, 77): None, (42, 44, 6): None, (36, 42, 8): None, (35, 36, 47): None, (34, 35, 60): None, (33, 34, 115): None, (32, 33, 114): None, (31, 32, 101): None, (30, 31, 116): None, (29, 30, 110): None, (28, 29, 105): None, (27, 28, 111): None, (26, 27, 80): None, (25, 26, 32): None, (24, 25, 110): None, (23, 24, 111): None, (22, 23, 105): None, (21, 22, 115): None, (20, 21, 115): None, (19, 20, 101): None, (18, 19, 114): None, (17, 18, 112): None, (16, 17, 109): None, (15, 16, 111): None, (14, 15, 67): None, (13, 14, 62): None, (12, 13, 101): None, (11, 12, 108): None, (10, 11, 116): None, (9, 10, 105): None, (8, 9, 116): None, (8, 14, None): ((8, 13, None), (13, 14, 62)), (7, 8, 60): None, (6, 7, 10): None, (6, 8, None): ((6, 7, 10), (7, 8, 60)), (5, 6, 62): None, (4, 5, 100): None, (3, 4, 97): None, (2, 3, 101): None, (1, 2, 104): None, (0, 1, 60): None, (0, 50, None): ((0, 49, None), (49, 50, 72)), (0, 2, None): ((0, 1, 60), (1, 2, 104)), (0, 3, None): ((0, 2, None), (2, 3, 101)), (0, 4, None): ((0, 3, None), (3, 4, 97)), (0, 5, None): ((0, 4, None), (4, 5, 100)), (0, 6, None): ((0, 5, None), (5, 6, 62)), (0, 8, None): ((0, 6, None), (6, 8, None)), (0, 14, None): ((0, 8, None), (8, 14, None)), (0, 15, None): ((0, 14, None), (14, 15, 67)), (0, 16, None): ((0, 15, None), (15, 16, 111)), (0, 17, None): ((0, 16, None), (16, 17, 109)), (0, 18, None): ((0, 17, None), (17, 18, 112)), (0, 19, None): ((0, 18, None), (18, 19, 114)), (0, 20, None): ((0, 19, None), (19, 20, 101)), (0, 21, None): ((0, 20, None), (20, 21, 115)), (0, 22, None): ((0, 21, None), (21, 22, 115)), (0, 23, None): ((0, 22, None), (22, 23, 105)), (0, 24, None): ((0, 23, None), (23, 24, 111)), (0, 25, None): ((0, 24, None), (24, 25, 110)), (0, 26, None): ((0, 25, None), (25, 26, 32)), (0, 27, None): ((0, 26, None), (26, 27, 80)), (0, 28, None): ((0, 27, None), (27, 28, 111)), (0, 29, None): ((0, 28, None), (28, 29, 105)), (0, 30, None): ((0, 29, None), (29, 30, 110)), (0, 31, None): ((0, 30, None), (30, 31, 116)), (0, 32, None): ((0, 31, None), (31, 32, 101)), (0, 33, None): ((0, 32, None), (32, 33, 114)), (0, 34, None): ((0, 33, None), (33, 34, 115)), (0, 35, None): ((0, 34, None), (34, 35, 60)), (0, 36, None): ((0, 35, None), (35, 36, 47)), (0, 42, None): ((0, 36, None), (36, 42, 8)), (0, 44, None): ((0, 42, None), (42, 44, 6)), (0, 45, None): ((0, 44, None), (44, 45, 77)), (0, 46, None): ((0, 45, None), (45, 46, 69)), (0, 47, None): ((0, 46, None), (46, 47, 84)), (0, 48, None): ((0, 47, None), (47, 48, 65)), (0, 49, None): ((0, 48, None), (48, 49, 32)), (8, 10, None): ((8, 9, 116), (9, 10, 105)), (8, 11, None): ((8, 10, None), (10, 11, 116)), (8, 12, None): ((8, 11, None), (11, 12, 108)), (8, 13, None): ((8, 12, None), (12, 13, 101))})"}We have a collection of test datasets in the folder data.
To measure the output sizes of these datasets, we provide the script shell/measure_datasets.sh.
It requires the program jq to be installed, and outputs the JSON file shell/measure/stats.json storing for each file the output size of each computed repetitiveness measure.
If you apply this script on a SLURM cluster, then it batches the experiments. Collecting the final data has then to be done manually (the last lines in the shell script).
We include implementations of straight-forward algorithms to compute the smallest string attractor, based on the code by Jeffrey Shallit.
uv run src/attractor_naive.py --str "abracadabra"For SLP, we have a code that enumerates all full binary trees, and considers them as derivation trees.
The size of the SLP is determined by minimizing this tree where nodes can reused.
Two versions: one implemented in Python, and the other in rust is included:
uv run src/slp_naive.py --str "abracadabra"or
cd rust
cargo run --release -- -t abracadabra(We have yet to properly format the output)
We include the code to generate Table 1 in the Appendix of the paper.
uv run src/min_substr.py --mode=expLicensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Please use the following bibtex when you refer this library from your papers.
@inproceedings{bannai2022computing,
title={{Computing NP-Hard Repetitiveness Measures via MAX-SAT}},
author={Bannai, Hideo and Goto, Keisuke and Ishihata, Masakazu and Kanda, Shunsuke and K{\"o}ppl, Dominik and Nishimoto, Takaaki},
booktitle={30th Annual European Symposium on Algorithms (ESA 2022)},
pages={12:1--12:16},
year={2022}
}
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
Before you submit your PR, please keep in mind the following practices:
- Use
tox -e fmtto format the code. - Use
tox -e lintto lint and typecheck - Use
toxto run all tests and lint