ubleipzig/stardust

 
 


stardust

Stardust, strdist. String distance measures for the command line.

Overview

$ stardust
NAME:
   stardust - String similarity measures for tab separated values.

USAGE:
   stardust [global options] command [command options] [arguments...]

VERSION:
   0.1.1

AUTHOR:
  Martin Czygan - <martin.czygan@gmail.com>

COMMANDS:
   adhoc         Adhoc distance
   cosine        Cosine word-wise
   coslev        Cosine word-wise and levenshtein combined
   dice          Sørensen–Dice coefficient
   hamming       Hamming distance
   jaro          Jaro distance
   jaro-winkler  Jaro-Winkler distance
   levenshtein   Levenshtein distance
   ngram         Ngram distance
   plain         Plain passthrough (for IO benchmarks)
   help, h       Shows a list of commands or help for one command

GLOBAL OPTIONS:
   -f '1,2'     c1,c2 the two columns to use for the comparison
   --delimiter, -d '    '   column delimiter (defaults to tab)
   --help, -h       show help
   --version, -v    print the version

For starters

$ stardust hamming "Hallo" "Hello"
Hallo   Hello   1
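
To make the number above concrete, here is a minimal Go sketch of Hamming distance. It is an illustration, not stardust's actual source: the function name `hamming` and the decision to return an error for strings of unequal length are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// hamming counts the positions at which two strings of equal length differ.
// It compares runes rather than bytes, so a multi-byte character counts once.
// Returning an error on unequal lengths is an assumption of this sketch.
func hamming(a, b string) (int, error) {
	ra, rb := []rune(a), []rune(b)
	if len(ra) != len(rb) {
		return 0, errors.New("hamming: strings must have equal length")
	}
	d := 0
	for i := range ra {
		if ra[i] != rb[i] {
			d++
		}
	}
	return d, nil
}

func main() {
	d, err := hamming("Hallo", "Hello")
	if err != nil {
		fmt.Println(err)
		return
	}
	// Only the second character differs, so the distance is 1.
	fmt.Printf("Hallo\tHello\t%d\n", d)
}
```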

$ stardust ngram "Hallo" "Hello"
Hallo   Hello   0.2

$ stardust ngram "Hallo Welt" "Hello World"
Hallo Welt  Hello World 0.21428571428571427

Are the man pages of cp and mv more similar than those of ls and mv, when measured with a trigram model?

$ stardust ngram "$(echo $(man ls))" "$(echo $(man mv))" | cut -f3
0.29057337220602525

$ stardust ngram "$(echo $(man cp))" "$(echo $(man mv))" | cut -f3
0.4792746113989637

They seem to be. And according to Jaro similarity?

$ stardust jaro "$(echo $(man ls))" "$(echo $(man mv))" | cut -f3
0.5597612762544908

$ stardust jaro "$(echo $(man cp))" "$(echo $(man mv))" | cut -f3
0.6376732132890776

Still the same ordering: cp and mv score higher.
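
For readers unfamiliar with the measure, the following is a sketch of the textbook Jaro similarity in Go. It is not taken from stardust, whose implementation may differ in details such as the handling of empty strings. The name `jaro` and the conventions for empty input are assumptions.

```go
package main

import "fmt"

// jaro returns the textbook Jaro similarity of a and b, a value in [0, 1].
// Conventions for empty input (1 if both are empty, 0 if only one is) are
// assumptions of this sketch.
func jaro(a, b string) float64 {
	ra, rb := []rune(a), []rune(b)
	la, lb := len(ra), len(rb)
	if la == 0 && lb == 0 {
		return 1
	}
	if la == 0 || lb == 0 {
		return 0
	}
	// Characters match only if they are equal and at most `window` apart.
	longest := la
	if lb > longest {
		longest = lb
	}
	window := longest/2 - 1
	if window < 0 {
		window = 0
	}
	ma, mb := make([]bool, la), make([]bool, lb)
	matches := 0
	for i := 0; i < la; i++ {
		lo, hi := i-window, i+window+1
		if lo < 0 {
			lo = 0
		}
		if hi > lb {
			hi = lb
		}
		for j := lo; j < hi; j++ {
			if !mb[j] && ra[i] == rb[j] {
				ma[i], mb[j] = true, true
				matches++
				break
			}
		}
	}
	if matches == 0 {
		return 0
	}
	// Count matched characters that appear in a different order.
	out, j := 0, 0
	for i := 0; i < la; i++ {
		if !ma[i] {
			continue
		}
		for !mb[j] {
			j++
		}
		if ra[i] != rb[j] {
			out++
		}
		j++
	}
	m := float64(matches)
	t := float64(out) / 2 // transpositions are half the out-of-order matches
	return (m/float64(la) + m/float64(lb) + (m-t)/m) / 3
}

func main() {
	// The classic example pair; the result is 17/18, roughly 0.944.
	fmt.Println(jaro("MARTHA", "MARHTA"))
}
```

Note that the quadratic matching loop is fine for short strings but becomes slow on inputs the size of whole man pages.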

Specific options

Some measures take additional options. For example, ngram accepts a size option, which sets the n in n-gram (the default is 3).

$ stardust ngram --help
NAME:
   ngram - Ngram similarity

USAGE:
   command ngram [command options] [arguments...]

DESCRIPTION:
   Compute Ngram similarity, which lies between 0 and 1.

OPTIONS:
   --size, -s '3'   value of n

$ stardust ngram --size 2 "Hello" "Hallo"
Hello   Hallo   0.3333333333333333

$ stardust ngram --size 1 "Hallo" "Hello"
Hallo   Hello   0.6
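
The values shown here are consistent with a Jaccard index over the sets of distinct character n-grams: "Hallo" and "Hello" share one trigram (llo) out of five distinct ones, giving 0.2. The Go sketch below works under that assumption. It is inferred from the example outputs, not copied from stardust's source, and the names `ngramSet` and `ngramSimilarity` are hypothetical.

```go
package main

import "fmt"

// ngramSet returns the set of distinct character n-grams of s.
func ngramSet(s string, n int) map[string]struct{} {
	set := make(map[string]struct{})
	if n < 1 {
		return set
	}
	r := []rune(s)
	for i := 0; i+n <= len(r); i++ {
		set[string(r[i:i+n])] = struct{}{}
	}
	return set
}

// ngramSimilarity is the Jaccard index |A ∩ B| / |A ∪ B| of the two n-gram
// sets. Returning 0 when neither string yields an n-gram is an assumption.
func ngramSimilarity(a, b string, n int) float64 {
	sa, sb := ngramSet(a, n), ngramSet(b, n)
	shared := 0
	for g := range sa {
		if _, ok := sb[g]; ok {
			shared++
		}
	}
	union := len(sa) + len(sb) - shared
	if union == 0 {
		return 0
	}
	return float64(shared) / float64(union)
}

func main() {
	// Sizes 3, 2 and 1 correspond to 1/5, 2/6 and 3/5 for this pair.
	for _, n := range []int{3, 2, 1} {
		fmt.Printf("size %d: %v\n", n, ngramSimilarity("Hallo", "Hello", n))
	}
}
```

Smaller n-grams are more forgiving of local edits, which is why the score rises as the size shrinks.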

Input from files

Using example.tsv:

$ stardust ngram example.tsv | sort -t$'\t' -k3,3 -nr | head -3
Deutsches Museum    Deutsches Museum    1
Deutsche Suchthilfestatistik    Deutsches Museum    0.17647058823529413
Deutsche+Guggenheim magazine /  Deutsches Museum    0.16666666666666666

Which is equivalent to:

$ cat example.tsv | stardust ngram | sort -t$'\t' -k3,3 -nr | head -3
Deutsches Museum    Deutsches Museum    1
Deutsche Suchthilfestatistik    Deutsches Museum    0.17647058823529413
Deutsche+Guggenheim magazine /  Deutsches Museum    0.16666666666666666
