The CounQER system provides a pipeline for identifying set predicates in a KB. We use linguistic and co-occurrence alignment metrics to analyse the relationship between the predicates. The results of these alignments can be explored on the project demo page at https://counqer.mpi-inf.mpg.de. The project uses pgAdmin to access its backend PostgreSQL database.
The project runs in a Python 3 virtual environment. `requirements.txt` lists the necessary packages.
python3 -m venv /path/to/myenv
source /path/to/myenv/bin/activate
Once inside the environment, change to `counqer/` and install the required packages.
pip install -r requirements.txt
Location: ./datasetup
Create a local n-tuple DB from RDF dumps of KBs.
- create*<KB-name>*DB.py
  a. This file calls `createDB` if the table is to be hosted in a Postgres server.
  b. `createcsv` is called to create a csv file which can be imported into any database management system (like PostgreSQL, Hive) as a table.
- Query the SPO tables for a list of distinct predicates and their frequencies. Save the results as csv (`predfreq_p_all.csv`) in the corresponding DB subfolder.
- `generate_property_details_.py` uses `property_details_from_postgres` to create a table and a csv file with the table values. These values can then be copied to the table using psql commands.
  psql -h postgres2.d5.mpi-inf.mpg.de -d <database_name> -U <username>
  <database_name>=> \copy fb_pred_property FROM '<KB-name>_pred_property.csv' DELIMITER E'\t' CSV HEADER;
  <database_name>=> \q
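The predicate-frequency step above can be illustrated without a database. The following is a minimal sketch, assuming the triples are available as (subject, predicate, object) tuples; the function names `predicate_frequencies` and `write_predfreq_csv` and the two-column csv layout are hypothetical and not taken from the repository.

```python
import csv
import io
from collections import Counter


def predicate_frequencies(triples):
    """Count how often each predicate occurs in an iterable of (s, p, o) triples."""
    return Counter(pred for _sub, pred, _obj in triples)


def write_predfreq_csv(freqs, handle):
    """Write (predicate, frequency) rows, most frequent first (assumed layout)."""
    writer = csv.writer(handle)
    writer.writerow(["pred", "frequency"])
    for pred, freq in sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([pred, freq])


# Toy triples standing in for rows of an SPO table.
toy_triples = [
    ("Q1", "child", "Q2"),
    ("Q1", "child", "Q3"),
    ("Q1", "numberOfChildren", "2"),
]
toy_freqs = predicate_frequencies(toy_triples)

# An in-memory buffer stands in for predfreq_p_all.csv.
buffer = io.StringIO()
write_predfreq_csv(toy_freqs, buffer)
```

On real dumps the same counting is better left to the database, as described above; the sketch only shows what the resulting file is expected to contain.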
Location: ./classifier_crowd_annotations
Sample predicates from candidate KBs to present to the crowd annotators.
- `sql_query_for_set_predicates` has the SQL query used to sample data items for counting predicates in the first query.
  a. We filter out infrequent predicates (frequency < 50) and non-integer predicates (< 5% integer values and > 5% float values).
  b. The samples are saved in the `./counting` folder as csv files named after the corresponding KBs.
  c. Create an entity lookup list for Freebase using `sql_fb_entity_label`.
  d. `get_labelled_triples.py` reads all sampled predicates from `./counting` and creates a data file with labelled triples, `./counting/counting_labelled_triples.csv`.
  e. `clean_labelled_triples.R` unifies triples from multiple sources to create a csv file ready for upload to the crowd-sourcing platform.
  NOTE: Since Freebase returns empty subject labels, we create a larger sample (200 predicates) and select 100 samples with 5 complete example triples.
- `sql_query_for_set_predicates` has the SQL query used to sample data items for enumerating predicates in the second query.
  a. Sampled data from each KB is saved in the `./enumerating` folder.
Note: We create a test set containing honey-pot questions for the figure-eight task (in the `./test` folder). First we run `get_labelled_triples.py` on the selected test predicates and then manually edit the `test_rows_figure_eight.csv` file to add the annotation columns (_golden, *<question>*_gold, *<question>*_gold_reason).
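The filter applied when sampling counting predicates can be sketched as a small predicate function. This is one possible reading of the rule stated above (drop a predicate if its frequency is below 50, or if fewer than 5% of its values are integers while more than 5% are floats); the function name, its parameters and the way the shares are computed are assumptions, and the actual logic lives in the SQL query.

```python
def is_counting_candidate(frequency, n_int, n_float,
                          min_freq=50, min_int_share=0.05, max_float_share=0.05):
    """Return True if a predicate survives the sampling filter (assumed reading).

    frequency: total number of triples with this predicate
    n_int:     number of those triples whose object is an integer
    n_float:   number of those triples whose object is a float
    """
    if frequency < min_freq:
        return False  # infrequent predicate
    int_share = n_int / frequency
    float_share = n_float / frequency
    # "Non-integer" is read literally here: few integers AND many floats.
    non_integer = int_share < min_int_share and float_share > max_float_share
    return not non_integer


# Toy checks with made-up counts.
too_rare = is_counting_candidate(frequency=40, n_int=40, n_float=0)
mostly_int = is_counting_candidate(frequency=100, n_int=90, n_float=0)
mostly_float = is_counting_candidate(frequency=100, n_int=2, n_float=50)
```

If the intended rule is stricter (for example, dropping a predicate when either share condition holds), only the `non_integer` expression needs to change.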
Location: ./predicate_usage_features
- Download the POS tagger data for nltk.
$ python
>>> import nltk
>>> nltk.download('averaged_perceptron_tagger')
>>> nltk.pos_tag(nltk.word_tokenize('This is a sentence'))
- Run `get_estimated_matches.py` to get the predicate usage features from the Bing API for all frequent (>= 50) predicates. Data stored in
- Run `get_sub_obj_types.py`
`./pred_property_p_50` has the predicate property files of all KBs with predicate frequency >= 50. Next, we collect data from different sources to create a unified feature file of all predicates (`predicates_p_50.csv`) and of the labelled predicates (`labelled_data_counting.csv`, `labelled_data_enumerating.csv`) in the folder `./feature_file` using the script `./create_feature_file.R`.
| KB | All | Frequent |
|---|---|---|
| DBP-raw | 59,149 | 13,394 |
| inv | 14,085 | 3,241 |
| DBP-map | 1,355 | 1,127 |
| inv | 653 | 543 |
| WD-truthy | 5,032 | 3,346 |
| inv | 1,079 | 721 |
| Freebase | 784,936 | 8,289 |
| inv | 14,871 | 5,583 |
| YAGO | (79) | (79) |
Location: ./classifier
We have two classifiers, one for counting and one for enumerating, in .../*<type>*/*<type>*_classifier.R.
Classifier models used:
- Logistic regression
- Bayesian glm
- Lasso regression
- Neural network with single hidden layer
The predictions are saved in .../*<type>*/predictions.csv.
Random Classifier performance:
- Counting: 345 data points, 39 positive, 306 negative
| Actual \ Predicted | 0 | 1 | Total |
|---|---|---|---|
| 0 | 272 | 34 | 306 |
| 1 | 34 | 5 | 39 |
| Total | 306 | 39 | 345 |
Precision = Recall = F1 = 12.8%
- Enumerating: 328 data points, 133 positive, 195 negative
| Actual \ Predicted | 0 | 1 | Total |
|---|---|---|---|
| 0 | 116 | 79 | 195 |
| 1 | 79 | 54 | 133 |
| Total | 195 | 133 | 328 |
Precision = Recall = F1 = 40.6%
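The baseline figures above follow directly from the confusion matrices. A minimal sketch of the arithmetic, using the standard definitions of precision, recall and F1 (the function name is illustrative and not part of the repository's R scripts):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts read off the two random-classifier tables:
# tp = actual 1 / predicted 1, fp = actual 0 / predicted 1, fn = actual 1 / predicted 0.
counting = precision_recall_f1(tp=5, fp=34, fn=34)
enumerating = precision_recall_f1(tp=54, fp=79, fn=79)

counting_pct = [round(100 * value, 1) for value in counting]
enumerating_pct = [round(100 * value, 1) for value in enumerating]
```

Because the number of false positives equals the number of false negatives in both tables, precision and recall coincide, and so does F1.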
Precision and recall scores of all models
a. Counting
| Model | Recall | Precision | F1 |
|---|---|---|---|
| Random | 12.8 | 12.8 | 12.8 |
| Logistic | 51.2 | 19.0 | 27.7 |
| Bayesian | 48.7 | 20.2 | 28.5 |
| Lasso | 71.7 | 23.3 | 35.1 |
| Neural | 35.8 | 20.8 | 26.3 |
b. Enumerating
| Model | Recall | Precision | F1 |
|---|---|---|---|
| Random | 40.6 | 40.6 | 40.6 |
| Logistic | 55.6 | 51.7 | 53.5 |
| Bayesian | 55.6 | 51.0 | 53.5 |
| Lasso | 51.1 | 59.6 | 55.0 |
| Neural | 53.0 | 49.6 | 51.2 |
Predicted counting predicates
| KB | Input | Output | Filtered |
|---|---|---|---|
| DBP-raw | 13,394 | 5,853 | 5853 |
| DBP-map | 1,127 | 898 | 898 |
| WD-truthy | 3,346 | 1,922 | 1,067 |
| Freebase | 8,289 | 1,723 | 1,687 |
Predicted enumerating predicates
| KB | Input | Output | Filtered |
|---|---|---|---|
| DBP-raw | 16,635 | 2,894+1196 = 4090 | 2894+1196 = 4090 |
| DBP-map | 1,670 | 173+135 = 308 | 173+135 = 308 |
| WD-truthy | 4,067 | 99+117 = 216 | 86+ 117 = 203 |
| Freebase | 13,872 | 6311+1441 = 7752 | 6177+1437 = 7614 |
Location: ./alignment
- Create a csv file with entity names across different platforms.
  a. DBpedia entity: http://dbpedia.org/resource/
  b. Wikidata entity: http://www.wikidata.org/entity/
  `shorten_entity_names.py` - remove the URL prefix which identifies the KB.
  `get_sameAs_dbpedia.py` - for all unique entities collected from the KB and shortened, get the corresponding entity identities in other KBs (namely, Wikidata and Freebase).
- Get the number of entities per subject per predicate from a KB query using psql.
  a. Enumerating
  \copy (Select sub, pred, count(*) from *<kb-name>* where obj_type='named_entity' group by pred, sub order by pred) to 'filepath/named_entities_per_pred_per_sub_*<kb>*.csv' with CSV;
  Since Freebase has 700k predicates, modify the above query to keep only the most frequently occurring predicates.
  \copy (Select sub, pred, count(*) from freebase_spot where pred in (*<list from file fb_pred_names_p_50>*) and obj_type='named_entity' group by pred, sub order by pred) to 'filepath/named_entities_per_pred_per_sub_*<kb>*.csv' with CSV;
  Stored in the DB server as a table named *<kb-name>*_sub_pred_necount.
  b. Counting
  \copy (Select sub, pred, obj from freebase_spot where pred in (*<list from file fb_pred_names_p_50>*) and obj_type='int' order by pred, sub) to '/GW/D5data-11/existential-extraction/count_information/integer_per_pred_per_sub_fb.csv' with CSV;
  Stored in the DB server as a table named *<kb-name>*_sub_pred_intval.
  Note: Create indexes on the predicate column.
- Create a view of triples in each KB having p_50 predicates.
  create view *<kb_name>*_p_50 as select * from *<kb-name>*_spot where pred in (*<list from file kb_pred_names_p_50>*)
- Get co-occurrence statistics on the generated view. Store co-occurring pairs (predE, predC, #co-occurring subjects) in ./cooccurrence/*<kb-name>*_predicate_pairs.csv.
  ```
  select t1.pred as predE, t2.pred as predC, count(distinct t1.sub) from (select * from <kb_name>_p_50 where obj_type='named_entity') as t1 inner join (select * from <kb_name>_p_50 where obj_type='int') as t2 on t1.sub = t2.sub group by t1.pred, t2.pred
  ```
  Note: The query above is not time-efficient. Use instead
  ```
  select t1.pred as predE, t2.pred as predC, count(*) from *<kb-name>*_sub_pred_necount as t1 inner join *<kb-name>*_sub_pred_intval as t2 on t1.sub = t2.sub group by t1.pred, t2.pred
  ```
- Get predicate marginals (#subjects per predicate) in files labelled ./marginals/*<kb-name>*_int.csv for counting predicate marginals and ./marginals/*<kb-name>*_ne.csv for enumerating predicate marginals.
  select pred, count(*) from *<tablename>* group by pred
  where *<tablename>* is one of *kb-name*_sub_pred_intval, *kb-name*_sub_pred_necount, *kb-name*_obj_pred_necount
- Run `get_cooccurrence_scores.py` to get the alignment metrics.
- Run `get_linguistic_sim.py` to generate the linguistic alignment.
  Note: Get linguistic similarity scores online since reading existing files is time consuming.
- Get inverse predicates from the Postgres server
  select pred_inv from *<kb-name>*_inv_pred_property where frequency >= 50
  into a list in `p_50_prednames/`
- Get the number of entities per subject per inverse predicate from a KB query using psql.
  \copy (Select obj, pred, count(*) from *<kb-name>*_spot where pred in (*<list from file kb-name_pred_names_p_50>*) and obj_type='named_entity' group by pred, obj order by pred) to 'filepath/named_entities_per_pred_per_sub_*<kb>*.csv' with CSV;
- Get co-occurrence stats for inv predicates
- Label inverse predicates as enumerating using the enumerating classifier.
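The co-occurrence join and the predicate marginals described in this section can be mirrored in plain Python for small inputs. The sketch below assumes the two per-subject tables have been reduced to (sub, pred) pairs; the function names and the toy rows are illustrative only, and the alignment metrics computed by `get_cooccurrence_scores.py` are not reproduced here.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(ne_rows, int_rows):
    """Count join rows per (predE, predC) pair, joining the two inputs on subject.

    ne_rows:  (sub, pred) pairs for enumerating (named-entity) predicates
    int_rows: (sub, pred) pairs for counting (integer-valued) predicates
    """
    int_preds_by_sub = defaultdict(list)
    for sub, pred_c in int_rows:
        int_preds_by_sub[sub].append(pred_c)

    pairs = Counter()
    for sub, pred_e in ne_rows:
        for pred_c in int_preds_by_sub.get(sub, []):
            pairs[(pred_e, pred_c)] += 1
    return pairs


def marginals(rows):
    """Number of rows per predicate, like `select pred, count(*) ... group by pred`."""
    return Counter(pred for _sub, pred in rows)


# Toy rows standing in for the *_sub_pred_necount and *_sub_pred_intval tables.
ne_rows = [("Q1", "child"), ("Q2", "child"), ("Q2", "employer")]
int_rows = [("Q1", "numberOfChildren"), ("Q2", "numberOfChildren"),
            ("Q3", "numberOfChildren")]

pair_counts = cooccurrence_counts(ne_rows, int_rows)
ne_marginals = marginals(ne_rows)
int_marginals = marginals(int_rows)
```

When each (subject, predicate) combination appears once per table, the pair counts equal the number of co-occurring subjects, which matches the faster SQL variant.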
Location: ./alignment
`filter_prednames.py` - removes codes and IDs from the predicted predicates. The number of predicates (ID and code names) filtered before and after classification:
| Type | Pre-class | Post-class | # removed by classifier |
|---|---|---|---|
| Enumerating | 2158 (26156) | 147 (9477) | 2011 (93.1%) |
| Enum_inv | 9 (10091) | 4 (2890) | 5 (55.5%) |
| Counting | 2158 (26156) | 881 (10396) | 1277 (59.1%) |
Note: the number in brackets denotes the number of predicates input to the filter.
- Get the (filtered) predicate lists from `get_predicate_list.R`.
- Keep only the required metrics (predicate pairs which are in the predicted lists) in the `./metrics_req` folder by running `metrics_assembly.R`.
Number of alignments obtained = 4265
| KB name | Direct | Inverse |
|---|---|---|
| DBP map | 138 | 126 |
| DBP_raw | 1947 | 1756 |
| WD | 22 | 2 |
| FB | 120 | 154 |
| Total | 2227 | 2038 |
Location: ./alignment_crowd_annotations
- `clean_fig8_test_ques.R` - to re-use figure8 evaluation questions.
- `test_questions/edit_fig8_for_mturk.py` - create the test csv for MTurk.
- `clean_mturk_resp.R` - check responses to the test questions.
- `select_random_prop_for_eval.R` - create a list of 300 counting and 300 enumerating (ratio of inverse vs. direct) predicates for crowd evaluation.
- `eval_questions/create_eval_top3_pairs.py` - to get the list of top predicates from different metrics.
  #datapoints for enumerating = 460
  #datapoints for counting = 371
- `eval_questions/create_datafile.py` - create a csv with labelled triples for MTurk.
  #datapoints for enumerating = 169, which implies that 291 pairs do not co-occur.
  #datapoints for counting = 72, which implies that 299 pairs do not co-occur.
- Launch the MTurk task with the csv files in `eval_questions/data/` and run `eval_questions/notify_successful_workers.py` to notify selected workers to take the task.
- Download the MTurk results to `eval_annotations/` and run `eval_annotations/clean_mturk_repsonse.R` to get absolute scores for all pairs.
  0.5*(1/3)*(#complete*1 + #incomplete*0.5 + #unrelated*0)
  Note: 0.5 * (1/j) is the weight for topicality and enumeration scores (0.5) divided by the number of judges (j = 3); #x * w is the number of votes that option x received from the 3 judges times the weight w of x.
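The absolute score formula above can be written out as a short function. This is a sketch of the arithmetic only; the function name, the argument names and the vote-count check are assumptions, and the actual computation is done in the R script.

```python
def absolute_score(n_complete, n_incomplete, n_unrelated, n_judges=3):
    """Score one predicate pair from judge votes, following the formula above.

    Each 'complete' vote counts 1, each 'incomplete' vote 0.5 and each
    'unrelated' vote 0; the sum is averaged over the judges and scaled by 0.5.
    """
    if n_complete + n_incomplete + n_unrelated != n_judges:
        raise ValueError("vote counts must add up to the number of judges")
    weighted_votes = n_complete * 1.0 + n_incomplete * 0.5 + n_unrelated * 0.0
    return 0.5 * (1.0 / n_judges) * weighted_votes


# Three illustrative vote splits among 3 judges.
all_complete = absolute_score(3, 0, 0)
mixed = absolute_score(2, 1, 0)
all_unrelated = absolute_score(0, 0, 3)
```

With three judges the score ranges from 0 (all votes unrelated) to 0.5 (all votes complete).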
Location: ./evaluation
- `evaluate.py` - to generate DCG scores for all metrics.
- `aggregated_ndcg.R` - to get the mean nDCG of all metrics.
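For readers unfamiliar with the measures, here is a minimal sketch of DCG and nDCG over a ranked list of relevance scores. It uses the common log2 rank discount; whether `evaluate.py` uses exactly this variant is an assumption, and the function names are illustrative.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list (rank 1 first)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def mean_ndcg(ranked_lists):
    """Average nDCG over several ranked lists, e.g. one per query predicate."""
    if not ranked_lists:
        return 0.0
    return sum(ndcg(scores) for scores in ranked_lists) / len(ranked_lists)


# Toy rankings with made-up relevance scores.
perfect = ndcg([0.5, 0.25, 0.0])
swapped = ndcg([0.0, 1.0])
average = mean_ndcg([[0.5, 0.25, 0.0], [0.0, 1.0]])
```

A ranking that already lists its items in descending relevance gets an nDCG of 1, and any other ordering scores lower.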
The demo is developed in Python using the Flask web framework and runs on an Apache web server. The site is under construction and may not exhibit the full functionality of the system.
Location: ./flask_app
Location: ./predicate_list
Scripts for creating json files of KB set predicates to be displayed in the demo.
############### Notes
library(dplyr)  # provides inner_join
counting <- read.csv('alignment/counting_filtered.csv')
wd_labels <- read.csv('datasetup/WD/wd_property_label.csv')
wd_labels$id <- substr(as.character(wd_labels$Property), 32, nchar(as.character(wd_labels$Property)))
counting$id <- sapply(counting$pred, function(x) substr(x, start=tail(gregexpr('/', x)[[1]], 1)+1, stop=nchar(as.character(x))))
counting <- inner_join(counting, wd_labels, by='id')