RickyS/Creep

Package to crawl the web. Used by the main program, Crawl.

To install:
$ go get github.com/RickyS/Crawl
$ go get github.com/RickyS/Creep

You'll need both packages; they depend on each other. The main program is crawl; the working package is creep. Note the capital letters in the names passed to 'go get'.

The easiest introduction might be to run
$ go test
This runs for 9 seconds on my system.

Package creep implements a web crawler. It reads web pages and follows links to the rest of the web, recursively, ad infinitum, within the limits provided. We use the term creep to avoid name clashes with other software called 'walk' and 'crawl'. I'm thinking of changing it to 'stroll'.

The goroutines in crawl.go listen on a request channel and scan the web page named in each message they receive. Every link to another web page that is found is then enqueued onto the request channel; eventually, this or another goroutine will read that request and process it.
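
For readers new to this pattern, here is a minimal, self-contained sketch of the same idea: several goroutines draining one request channel and feeding discovered links back into it. It is not the actual crawl.go code; the names (requests, enqueue, extractLinks) and the fake page map are illustrative.

package main

import (
	"fmt"
	"sync"
)

// fakeWeb stands in for real HTTP fetching and HTML parsing.
var fakeWeb = map[string][]string{
	"https://example.com/":  {"https://example.com/a", "https://example.com/b"},
	"https://example.com/a": {"https://example.com/b"},
	"https://example.com/b": {},
}

func extractLinks(url string) []string { return fakeWeb[url] }

func main() {
	requests := make(chan string, 100) // the request channel of URLs to scan
	var wg sync.WaitGroup              // counts requests not yet processed
	var mu sync.Mutex
	seen := make(map[string]bool)

	// enqueue puts a URL on the request channel exactly once.
	enqueue := func(url string) {
		mu.Lock()
		defer mu.Unlock()
		if seen[url] {
			return
		}
		seen[url] = true
		wg.Add(1)
		requests <- url
	}

	// Several goroutines all listen on the same request channel.
	for i := 0; i < 3; i++ {
		go func() {
			for url := range requests {
				fmt.Println("scanning", url)
				// Each link found goes back onto the channel; this
				// or another goroutine will eventually process it.
				for _, link := range extractLinks(url) {
					enqueue(link)
				}
				wg.Done()
			}
		}()
	}

	enqueue("https://example.com/")
	wg.Wait() // every enqueued request has been scanned
	close(requests)
}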

The code in samedomain.go uses the package "github.com/joeguo/tldextract" and its public-suffix database to decide whether two different URLs belong to the same domain. This turns out to be less simple than it might seem.
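
As a quick illustration of the difficulty, the standard-library-only sketch below shows where a naive host comparison goes wrong. This is not the code in samedomain.go, which delegates the job to tldextract's suffix data; naiveSameDomain is a deliberately broken strawman.

package main

import (
	"fmt"
	"net/url"
	"strings"
)

// naiveSameDomain keeps only the last two labels of each host. It is
// wrong for multi-label public suffixes such as "co.uk", which is
// exactly what the tldextract database exists to handle.
func naiveSameDomain(a, b string) bool {
	last2 := func(raw string) string {
		u, err := url.Parse(raw)
		if err != nil {
			return raw
		}
		labels := strings.Split(u.Hostname(), ".")
		if len(labels) < 2 {
			return u.Hostname()
		}
		return strings.Join(labels[len(labels)-2:], ".")
	}
	return last2(a) == last2(b)
}

func main() {
	// Same site, different subdomains: reported as one domain. Good.
	fmt.Println(naiveSameDomain("https://blog.example.com/x", "https://www.example.com/y")) // true

	// Two unrelated registrants under .co.uk: the naive rule says
	// "same domain" because it mistakes "co.uk" for a registered domain.
	fmt.Println(naiveSameDomain("https://alice.co.uk/", "https://bob.co.uk/")) // true (wrong!)
}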

In order to prevent infinite regress, the program confines its operation to the list of domains in the JSON file.

There are parameters in the JSON file that adjust these limits. TBD.
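
Purely as an illustration of how such a file might be loaded, here is a sketch; the file name creep.json and every field in the Config struct are hypothetical, not the repository's actual schema, which is still TBD.

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

// Config is a guess at the shape of the limits file; every field name
// here is hypothetical, since the real parameters are still TBD.
type Config struct {
	Domains  []string `json:"domains"`  // crawl stays within these domains
	MaxPages int      `json:"maxPages"` // hypothetical cap on pages fetched
	MaxDepth int      `json:"maxDepth"` // hypothetical cap on link depth
}

func main() {
	data, err := os.ReadFile("creep.json") // hypothetical file name
	if err != nil {
		log.Fatal(err)
	}
	var cfg Config
	if err := json.Unmarshal(data, &cfg); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("crawling %d domain(s), up to %d pages\n", len(cfg.Domains), cfg.MaxPages)
}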
