Distance Calculator

The program calculates semantic distances between input texts. Run as a command-line script, it takes a file of tab-separated text pairs (one pair per line) and writes the semantic distance for each pair (one per line) to the output file. Each distance is the cosine distance (1 - cos) between the centroid vectors of the two input texts.

The command-line script call:

    java [VM options] -jar DistanceCalc.jar [embeddings_file] [input_pairs] [output_file]

Example:

    java -Xmx10g -jar DistanceCalc.jar /e:/SoBigData/wiki_test_vect.txt /e:/SoBigData/testPairs.tsv /e:/SoBigData/testOut2.tsv

For programmatic use, here is a Java example that calculates distances between the input pairs:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.Map;

    // Setting up options
    Map<String, String> options = new HashMap<>();
    options.put("stat", "NONE");      // no stat-based weighting
    options.put("stopwords", "NONE"); // no stopword elimination; comment this line out for standard English stopword removal
    options.put("l2norm", null);      // l2 normalization of each word vector
    WordEmbeddingsManager wem = new WordEmbeddingsManager(
            args[0], null, null, options, WeightedCentroidTextRepresentation.class);

    // Reading the input file (try-with-resources closes the reader automatically)
    ArrayList<String> list = new ArrayList<>();
    try (BufferedReader br = new BufferedReader(new FileReader(args[1]))) {
        for (String line; (line = br.readLine()) != null; ) {
            list.add(line);
        }
    }

    // Calculating the distances
    double[] results = new double[list.size()];
    int i = 0;
    for (String l : list) {
        String[] pair = l.split("\t"); // each line holds two tab-separated texts
        String s1 = pair[0];
        String s2 = pair[1];
        results[i] = wem.getDistance(s1, s2);
        i++;
    }

Feel free to write to me at maciek.rybinski[at]gmail.com if you need any more information on the available options, text representations, etc.
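To make the distance definition concrete (one minus the cosine of the two texts' centroid vectors), here is a minimal, self-contained sketch. It is not the DistanceCalc implementation. The class name CentroidDistanceSketch, the toy two-dimensional embeddings, the unweighted mean, the skipping of unknown tokens, and the value returned for all-zero vectors are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of DistanceCalc.jar.
public class CentroidDistanceSketch {

    // Unweighted mean of the vectors of all tokens found in the embeddings.
    // Assumption: tokens without an embedding are skipped.
    static double[] centroid(List<String> tokens, Map<String, double[]> embeddings, int dim) {
        double[] sum = new double[dim];
        int found = 0;
        for (String token : tokens) {
            double[] v = embeddings.get(token);
            if (v == null) {
                continue;
            }
            for (int i = 0; i < dim; i++) {
                sum[i] += v[i];
            }
            found++;
        }
        if (found > 0) {
            for (int i = 0; i < dim; i++) {
                sum[i] /= found;
            }
        }
        return sum;
    }

    // Cosine distance: 1 - cos(a, b).
    // Assumption: returns 1.0 when either vector is all zeros.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 1.0;
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy embeddings, invented for this example.
        Map<String, double[]> emb = new HashMap<>();
        emb.put("cat", new double[] {1.0, 0.0});
        emb.put("dog", new double[] {0.0, 1.0});
        emb.put("pet", new double[] {1.0, 1.0});

        double[] c1 = centroid(Arrays.asList("cat", "dog"), emb, 2); // (0.5, 0.5)
        double[] c2 = centroid(Arrays.asList("pet"), emb, 2);        // (1.0, 1.0)
        double[] c3 = centroid(Arrays.asList("cat"), emb, 2);        // (1.0, 0.0)

        System.out.println(cosineDistance(c1, c2)); // parallel centroids: close to 0
        System.out.println(cosineDistance(c3, c1)); // 45 degrees apart: about 0.29
    }
}
```

Because the cosine ignores vector length, the centroids (0.5, 0.5) and (1.0, 1.0) are treated as the same direction and their distance is close to zero, up to floating-point rounding.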

Data and Resources
  • Code (ZIP)

Additional Info
Field                 Value
Accessibility         Both
AccessibilityMode     OnLine Access
AccessibilityMode     Download
Area                  Societal Debates
Availability          On-Line
Basic rights          Download
Basic rights          Copying
CreationDate          2017-05-01
Creator               Rybinski, Maciej
Field/Scope of use    Research only
Group                 Societal Debates and Misinformation
Owner                 Rybinski, Maciej
ProgrammingLanguage   Java
Sublicense rights     No
Territory of use      World Wide
Thematic Cluster      Text and Social Media Mining [TSMM]
UsageMode             Download
system:type           Method
Management Info
Field           Value
Author          Gorrell Genevieve
Maintainer      Gorrell Genevieve
Version         1
Last Updated    11 September 2023, 15:26 (CEST)
Created         6 July 2018, 16:42 (CEST)