
Semantic Proximity

Overview

This tool learns semantic proximity between nodes on a graph, using metagraphs as features.

Citation

Y. Fang, W. Lin, V. W. Zheng, M. Wu, K. C.-C. Chang and X. Li. Semantic Proximity Search on Graphs with Metagraph-based Learning. In ICDE 2016, pp. 277--288. [PDF] [BibTex]

Code Download

Module | Requirement | Comment | Link
Operating System | Windows | The main program is in Java and thus portable, but one required executable (included) requires Windows. | none
Runtime | MinGW | Only the following DLLs are needed: libgcc-4.5.2-1-mingw32-dll-1 and libstdc++-4.5.2-1-mingw32-dll-6. Newer versions may not work. Extract the two DLLs into the lib subdirectory or any directory in the PATH environment variable. | Download
Runtime | Java | JRE 1.8. | Download
Library | Trove | Trove 3.0.3, a high-performance collections library for Java. | Download
Main Program | SemanticProx | Sample data are also included in the download. | Download

Usage

Step 1: Match metagraph instances
java -cp lib\* -Dconfig=<String> prep.MineGraph

-cp lib\*
Java classpath, including path to the main program and Trove library JAR files.

-Dconfig=<String>
Path to the configuration file, a plain-text file that stores the configuration properties (see "Configuration properties" below).

Step 2: Build feature index from matched metagraph instances
java -cp lib\* -Dconfig=<String> prep.BuildFeature

Step 3: Learn semantic proximity on training set, and evaluate on test set
java -cp lib\* -Dconfig=<String> -DClass=<String> -DSize=<Integer> exec.Learn

-DClass=<String>
One of the semantic classes to be learnt, as given in the groundtruth file -- see details below.

-DSize=<Integer>
Number of training examples to be used (must not exceed the total number of examples in the training splits -- see details below).
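
The three steps can be scripted. The following is a minimal sketch that only assembles the command lines shown above as argument lists. The configuration file name, class name, and training size are hypothetical placeholders, not values from the download.

```python
def build_commands(config_path, semantic_class, train_size, classpath="lib\\*"):
    """Return the three Usage command lines as argument lists, in step order."""
    # Options shared by all three steps, exactly as written in the Usage section.
    base = ["java", "-cp", classpath, f"-Dconfig={config_path}"]
    return [
        base + ["prep.MineGraph"],      # Step 1: match metagraph instances
        base + ["prep.BuildFeature"],   # Step 2: build feature index
        base + [f"-DClass={semantic_class}", f"-DSize={train_size}", "exec.Learn"],  # Step 3
    ]


if __name__ == "__main__":
    # "config.txt", "family" and 100 are placeholders for illustration only.
    for cmd in build_commands("config.txt", "family", 100):
        print(" ".join(cmd))
```

Each list can be passed to subprocess.run on a Windows machine where the requirements above are installed. The steps must run in order, since each consumes the output of the previous one.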

Configuration properties

A sample configuration file is included in the download, which consists of key-value pairs.
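
For scripts that need to inspect the configuration, a minimal reader might look like the sketch below. It assumes one KEY=VALUE pair per line and treats lines starting with # as comments. The comment convention and the sample values are assumptions, so check them against the sample file in the download.

```python
def parse_config(text):
    """Parse KEY=VALUE lines into a dict, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key-value pair: {line!r}")
        props[key.strip()] = value.strip()
    return props


# Hypothetical values, for illustration only.
sample = """\
# paths below are placeholders
FILE_GRAPH=data/graph.txt
NUM_SPLITS=10
GD_EPSILON=1E-5
"""
props = parse_config(sample)
num_splits = int(props["NUM_SPLITS"])      # integer-valued property
gd_epsilon = float(props["GD_EPSILON"])    # scientific notation parses as float
```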

FILE_GRAPH=<String>
Path to the labeled graph file.

FILE_GROUNDTRUTH_DB=<String>
Path to the groundtruth file. Each line represents one query and its candidate nodes, in the following format delimited by tabs:
<Class> <Query> <Candidate>:<L> <Candidate>:<L> <Candidate>:<L> ...
where <Class> is the desired semantic class, <Query> is the query node ID, <Candidate> is the candidate node ID, and <L> is either 1 or 0 to indicate whether the preceding candidate node is a true answer for the query.
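
As an illustration of this format, the sketch below parses one groundtruth line into its parts. The class name and node IDs in the example are made up.

```python
def parse_groundtruth_line(line):
    """Split one tab-delimited groundtruth line into (class, query, candidates).

    candidates is a list of (node_id, is_true_answer) pairs.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least <Class> and <Query>")
    semantic_class, query = fields[0], fields[1]
    candidates = []
    for item in fields[2:]:
        # Each candidate is written as <Candidate>:<L>, with <L> equal to 1 or 0.
        node, _, label = item.rpartition(":")
        if not node or label not in ("0", "1"):
            raise ValueError(f"malformed candidate: {item!r}")
        candidates.append((node, label == "1"))
    return semantic_class, query, candidates


# Hypothetical line: class "family", query node 12, three candidates.
example = "family\t12\t34:1\t56:0\t78:1"
parsed = parse_groundtruth_line(example)
```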

FILE_METAGRAPH_QUERY=<String>
Path to the Metagraph Query file, which can be generated by the modified GRAMI. A pre-generated file is included for each dataset in the download.

FILE_METAGRAPH_DB=<String>
Path to the Metagraph Database file.

FILE_FEATURE_DB=<String>
Path to the feature index file, which is generated automatically by Step 2 above.

LIB_SUBMATCH=<String>
Path to the SubMatch program. Note that we are using an earlier version (included in the download) different from what we released here; using the latest released version will not work correctly with the main program.

DIR_OUT=<String>
Directory for result output.

DIR_SPLITS=<String>
Directory for training and testing data splits. 
The splits are stored in files named train_<Class>_<n> (training data) or test_<Class>_<n> (testing data), where <Class> is the semantic class and <n> is the split number. Training and testing queries in the same split do not overlap.
In training data, each line is a triplet of q, x, y such that for query q, node x should be ranked before node y.
In testing data, each line is a list of tab delimited node IDs, where the first ID is the query, and subsequent IDs are candidate nodes.
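
The split files can be read with a few lines of code. The sketch below builds the file names for a split and parses one line of each kind. The delimiter inside a training triplet is not stated above, so whitespace is assumed. The class name and IDs are hypothetical.

```python
def split_file_names(semantic_class, n):
    """Return (training file name, testing file name) for split n."""
    return f"train_{semantic_class}_{n}", f"test_{semantic_class}_{n}"


def parse_train_line(line):
    """Parse a triplet q, x, y: for query q, node x should rank before node y.

    Assumption: the three IDs are separated by whitespace (tabs or spaces).
    """
    parts = line.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 node IDs, got {len(parts)}")
    q, x, y = parts
    return q, x, y


def parse_test_line(line):
    """Parse a tab-delimited test line into (query, list of candidate IDs)."""
    ids = line.rstrip("\n").split("\t")
    return ids[0], ids[1:]


names = split_file_names("family", 1)          # hypothetical class and split
triplet = parse_train_line("12\t34\t56")       # 34 should rank before 56 for query 12
query, candidates = parse_test_line("12\t34\t56\t78")
```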

DIR_METAGRAPH_INSTANCE=<String>
Directory for metagraph instances generated by step 1 above.

NUM_SPLITS=<Integer>
Total number of splits to use.

MAX_THREADS=<Integer>
Maximum number of threads available.

CORE_TYPE=<Integer>
Type of the core nodes, i.e., the type of nodes used as queries. In our datasets, it is 0, which denotes the user nodes.

MAX_VERTICE=<Integer>
Maximum number of vertices in a metagraph. Only metagraphs up to this limit are considered.

MAX_FREQ=<Integer>
Maximum number of instances of a metagraph. Only metagraphs with up to this many instances are considered.

MU=<Integer>
A scaling parameter controlling the shape of the Sigmoid distribution in the likelihood function.
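
To give a feel for what such a scaling parameter does, the sketch below evaluates a generic logistic function of a scaled score difference. This functional form is an assumption for illustration. The exact likelihood used by the tool is defined in the paper cited above and may differ.

```python
import math


def preference_probability(score_x, score_y, mu):
    """Logistic function of the scaled difference between two scores.

    Assumed generic form, not taken from the tool: 1 / (1 + exp(-mu * (x - y))).
    """
    return 1.0 / (1.0 + math.exp(-mu * (score_x - score_y)))


# Hypothetical proximity scores for two candidates of the same query.
equal = preference_probability(0.5, 0.5, 5)      # no preference when scores tie
gentle = preference_probability(0.6, 0.4, 1)     # small mu: flatter curve
sharp = preference_probability(0.6, 0.4, 10)     # large mu: steeper curve
```

Under this assumed form, a larger value makes the curve steeper, so the same score difference yields a probability further from one half.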

GD_STEP=10
Learning rate for gradient descent. 

GD_EPSILON=1E-5
Maximum relative difference before gradient descent stops.

GD_TRY=5
Number of trials to perform gradient descent using different seeds.
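
The roles of the three GD_* properties can be illustrated on a toy problem. The sketch below is not the tool's optimizer. It minimizes a one-dimensional toy function, and it assumes that the stopping test compares the relative change of the objective between iterations and that each trial starts from a different random point. The toy step size is 0.1, since a step of 10 would diverge on this function.

```python
import random


def loss(x):
    """Toy objective with its minimum at x = 3."""
    return (x - 3.0) ** 2 + 1.0


def grad(x):
    """Derivative of the toy objective."""
    return 2.0 * (x - 3.0)


def gradient_descent(x0, step, epsilon, max_iter=10000):
    """Descend until the relative change of the loss is at most epsilon."""
    x, prev = x0, loss(x0)
    for _ in range(max_iter):
        x = x - step * grad(x)
        cur = loss(x)
        if abs(prev - cur) <= epsilon * max(abs(prev), 1e-12):
            return x, cur
        prev = cur
    return x, prev


def best_of_trials(tries, step, epsilon, seed=0):
    """Run several trials from random starting points and keep the best one."""
    rng = random.Random(seed)
    best = None
    for _ in range(tries):
        x, value = gradient_descent(rng.uniform(-10.0, 10.0), step, epsilon)
        if best is None or value < best[1]:
            best = (x, value)
    return best


# epsilon and tries mirror GD_EPSILON=1E-5 and GD_TRY=5; the step is a toy value.
best_x, best_value = best_of_trials(tries=5, step=0.1, epsilon=1e-5)
```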

Sample Data 

Sample data for two input graphs and their corresponding metagraph queries are included. These are the datasets used in the paper cited above, and they are derived from SNAP's Facebook data and Forward's LinkedIn data.

Disclaimer

We provide any code and/or data on an as-is basis. Use at your own risk.