A small set of command-line tools written in Go to help extract data from webdevdata.org.
You can download binaries for Linux, Windows and Mac from the GitHub releases page.
wdd_select searches for all tags matching the provided CSS selector in the given file and prints a CSV with file,tag_name to STDOUT.
Files to parse can be passed as arguments or via STDIN, with one file per line.
The -attrs option can be used to provide a comma-separated list of attributes to print in the CSV.
Examples:
Passing a file as an argument and specifying attributes to print:
$ wdd_select -attrs="class,id" "section, body > div" data_samples/forecast_io.html
data_samples/forecast_io.html,div,inner,""
data_samples/forecast_io.html,section,currently section,""
data_samples/forecast_io.html,section,next_hour section,""
data_samples/forecast_io.html,section,next_24_hours section,""
Passing files via STDIN:
$ find data_samples -name "*.html" | wdd_select "head, body"
data_samples/forecast_io.html,head
data_samples/forecast_io.html,body
data_samples/jimsmarketingblog_com.html,head
data_samples/jimsmarketingblog_com.html,body
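The input convention described above (files as arguments, or one file per line on STDIN) could be sketched in Go roughly as follows. This is an illustration only, not the tools' actual source: the function name inputFiles is hypothetical, and the sketch ignores flag parsing and the selector argument that wdd_select expects before the file names.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
)

// inputFiles is a hypothetical helper. It returns the command-line
// arguments when any are given; otherwise it reads stdin and returns
// one path per non-empty line, with surrounding whitespace trimmed.
func inputFiles(args []string, stdin io.Reader) ([]string, error) {
	if len(args) > 0 {
		return args, nil
	}
	var files []string
	scanner := bufio.NewScanner(stdin)
	for scanner.Scan() {
		if line := strings.TrimSpace(scanner.Text()); line != "" {
			files = append(files, line)
		}
	}
	return files, scanner.Err()
}

func main() {
	files, err := inputFiles(os.Args[1:], os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading file list:", err)
		os.Exit(1)
	}
	// A real tool would parse each file here; the sketch just echoes the paths.
	for _, f := range files {
		fmt.Println(f)
	}
}
```

Checking the arguments first means STDIN is only read when no file was named explicitly, which is what lets the same command work both directly and at the end of a find pipeline.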
wdd_meta_names checks the HTML meta tags in the given file and prints a CSV with file,meta_name to STDOUT.
Example:
$ wdd_meta_names data_samples/jimsmarketingblog_com.html
data_samples/jimsmarketingblog_com.html,description
data_samples/jimsmarketingblog_com.html,google-site-verification
data_samples/jimsmarketingblog_com.html,google-site-verification
data_samples/jimsmarketingblog_com.html,y_key
Generating a CSV with all meta tag names from a webdevdata.org crawl (using GNU Parallel instead of xargs to parallelize the work):
$ find webdevdata.org-2013-10-30-231036 -name "*tml.txt" | parallel "wdd_meta_names >> meta_names.csv"
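Once meta_names.csv exists, one possible next step is to count how many distinct files use each meta name. The Go sketch below is not part of the tools; fileCounter and its methods are hypothetical names, and it assumes every row has exactly the two columns file,meta_name. Repeated pairs, such as the two google-site-verification rows in the example above, are counted once per file.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"os"
	"sort"
)

// fileCounter (hypothetical) counts, for each meta name, how many
// distinct files contain it.
type fileCounter struct {
	seen   map[[2]string]bool // (file, meta_name) pairs already counted
	counts map[string]int     // meta_name -> number of distinct files
}

func newFileCounter() *fileCounter {
	return &fileCounter{seen: map[[2]string]bool{}, counts: map[string]int{}}
}

// add records one file,meta_name row and ignores repeated pairs.
func (c *fileCounter) add(file, name string) {
	key := [2]string{file, name}
	if c.seen[key] {
		return
	}
	c.seen[key] = true
	c.counts[name]++
}

func main() {
	r := csv.NewReader(os.Stdin)
	r.FieldsPerRecord = 2 // assumption: strictly two columns per row
	r.LazyQuotes = true   // tolerate stray quotes inside meta names
	c := newFileCounter()
	for {
		rec, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			fmt.Fprintln(os.Stderr, "reading CSV:", err)
			os.Exit(1)
		}
		c.add(rec[0], rec[1])
	}
	// Print meta_name,file_count rows in alphabetical order.
	names := make([]string, 0, len(c.counts))
	for name := range c.counts {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s,%d\n", name, c.counts[name])
	}
}
```

The program reads the CSV from STDIN, so it can be fed meta_names.csv through a shell redirect.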
wdd_html_manifest checks for an html tag with a manifest attribute in the given file and prints a CSV with file,manifest_value to STDOUT.
Example:
$ wdd_html_manifest data_samples/forecast_io.html
data_samples/forecast_io.html,cache.desktop.manifest
wdd_tag_count counts all HTML tags in the given file and prints a CSV with file,tag,count to STDOUT.
Example:
$ wdd_tag_count data_samples/jimsmarketingblog_com.html
data_samples/jimsmarketingblog_com.html,meta,13
data_samples/jimsmarketingblog_com.html,li,40
data_samples/jimsmarketingblog_com.html,footer,1
data_samples/jimsmarketingblog_com.html,script,15
data_samples/jimsmarketingblog_com.html,a,78
data_samples/jimsmarketingblog_com.html,option,64
data_samples/jimsmarketingblog_com.html,nav,1
data_samples/jimsmarketingblog_com.html,img,11
data_samples/jimsmarketingblog_com.html,input,4
data_samples/jimsmarketingblog_com.html,center,8
[...]
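When wdd_tag_count has been run over many files, the per-file rows can be summed into one total per tag. The following Go sketch shows one way to do that, assuming three-column rows of file,tag,count as in the example output above. The helper names addRow and rank are hypothetical and not part of the tools.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"os"
	"sort"
	"strconv"
)

// addRow (hypothetical) adds one file,tag,count row to the per-tag totals.
func addRow(totals map[string]int, rec []string) error {
	if len(rec) != 3 {
		return fmt.Errorf("expected 3 fields, got %d", len(rec))
	}
	n, err := strconv.Atoi(rec[2])
	if err != nil {
		return fmt.Errorf("bad count %q: %w", rec[2], err)
	}
	totals[rec[1]] += n
	return nil
}

// rank (hypothetical) returns the tags ordered by descending total,
// with ties broken alphabetically.
func rank(totals map[string]int) []string {
	tags := make([]string, 0, len(totals))
	for tag := range totals {
		tags = append(tags, tag)
	}
	sort.Slice(tags, func(i, j int) bool {
		if totals[tags[i]] != totals[tags[j]] {
			return totals[tags[i]] > totals[tags[j]]
		}
		return tags[i] < tags[j]
	})
	return tags
}

func main() {
	r := csv.NewReader(os.Stdin)
	r.FieldsPerRecord = -1 // let addRow report rows with the wrong shape
	totals := map[string]int{}
	for {
		rec, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			fmt.Fprintln(os.Stderr, "reading CSV:", err)
			os.Exit(1)
		}
		if err := addRow(totals, rec); err != nil {
			fmt.Fprintln(os.Stderr, "skipping input:", err)
			os.Exit(1)
		}
	}
	// Print tag,total rows, most frequent tag first.
	for _, tag := range rank(totals) {
		fmt.Printf("%s,%d\n", tag, totals[tag])
	}
}
```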
go get github.com/webdevdata/webdevdata-tools
cd $GOPATH/src/github.com/webdevdata/webdevdata-tools
make all
- Tools are in the build directory
You can use make release to generate cross-compiled binaries for Linux, Windows and Mac.