Semantic Proximity
Overview
This tool aims to perform semantic proximity search between nodes on the graph based on anchored metagraphs or node embeddings. The anchored metagraph is introduced in our TKDE19 paper, an extension of metagraph in our ICDE16 paper.
Citation
Metagraph-based Learning on Heterogeneous Graphs.
Y. Fang, W. Lin, V. W. Zheng, M. Wu, J. Shi, K. C.-C. Chang and X. Li
TKDE 33(1), 2019, pp. 154--168.
Semantic Proximity Search on Graphs with Metagraph-based Learning.
Y. Fang, W. Lin, V. W. Zheng, M. Wu, K. C.-C. Chang and X. Li.
In ICDE 2016, pp. 277--288.
Code and Data
Compiled Java binary (cross platform): Download
Requirement: JRE 1.8 and Trove 3.0.3.
Sample datasets are included. The source of these datasets can be found in the above citations.
Usage (anchored metagraphs as features)
Command line
java -cp <String> -Dconfig=<String> -Dsplits=<Integer> -Dfdb=<String> -Dgt=<String> -Dtest=<String> -Dtrain=<String> -Dpred=<String> exec.Learn
Command line arguments
-cp <String>
Java class path for the JRE files.-Dconfig=<String>
Model configuration file, which stores the hyper-parameters.-Dsplits=<Integer>
Number of splits for train/test sets (see more details in -Dtest below)-Dfdb=<String>
A file in Metagraph Feature Format.-Dgt=<String>
The ground truth file.-Dtest=<String>
The prefix of test set filename. The program will automatically append numbers 1, 2, 3, ... (up to the number of splits) to the prefix as the actual filenames. For example, if -Dtest=test and -Dsplit=10, then the program will attempt to load test set files named test1, test2, ..., test10.-Dtrain=<String>
The prefix of train set filename. Similar to the -Dtest argument.-Dpred=<String>
The prefix of prediction filename. Similar to the -Dtest argument.
The format of configuration, ground truth, train, test and prediction files can be found in the File Format section below.
Sample command line
java -cp lib/*: -Dconfig=config -Dsplits=10 -Dfdb=data/dblp/feature.dblp -Dgt=data/dblp/advisor/ground_truth.txt -Dtest=data/dblp/advisor/test -Dtrain=data/dblp/advisor/train -Dpred=data/dblp/advisor/pred exec.Learn
Usage (node embeddings as features)
Command line
java -cp lib/*: -Dconfig=<String> -Dsplits=<Integer> -Demb=<String> -Dgt=<String> -Dtest=<String> -Dtrain=<String> -Dpred=<String> exec.LearnEmb
Command line arguments
-Demb=<String>
A file containing node embeddings in text format. The first line contains two integers, the number of nodes and number of dimensions. Each subsequent line contain the embedding of a node. All lines are space delimited.All other arguments are the same as using anchored metagraphs as features above.
File Format
Configuration file
Stores the following hyperparmeters of the learner. Default values are included in the sample file.
GD_STEP=1
Learning rate for gradient descent.GD_EPSILON=1E-5
Maximum relative different before gradient descent stops.GD_TRY=5
Number of trials to perform gradient descent using different seeds.GA_FW=1.01
Factor to adjust the learning rate when the loss is decreasing after each iteration.GA_BW=0.5
Factor to adjust the learning rate when the loss in not decreasing after each iteration.MU=5
Scaling factor in Eq. (4) of our TKDE19 paper.LOG_FEATURE=true
Whether to apply log1p transformation on the metagraph features.SILENT=false
Whether to suppress messages.
Ground truth file
Each line represents one query, delimited by tabs. The first column in each line represents the query node ID, and subsequent columns represent the ground-truth (positive) nodes ranked by proximity.
Train file
Each line contains a triple (q, a, b), where q denotes the query node, and a, b are two nodes such that node a should be ranked higher than node b w.r.t. q.
Test file
Each line represents a query, delimited by tabs. The first column in each line represents the query node ID, and subsequent columns represent the candidate nodes (both positive and negative) in randomized order.
Prediction file
Each line represents a query, delimited by tabs. The first column in each line represents the query node ID, and subsequent columns represent the candidate nodes (both positive and negative) in ranked order. Note that the query nodes may appear in an arbitrary order, different from the corresponding test file.
Disclaimer
We provide any code and/or data on an as-is basis. Use at your own risk.