oyiptong/dmozscrape

dmozscraper

This collection of tools walks the Open Directory Project (DMOZ) content data dump and extracts data from it.

It includes:

  • a tool that extracts data from the content RDF dump and produces a CSV file
  • a tool that takes the CSV and loads jobs into a redis queue
  • a tool that pops a job off the redis queue, scrapes the URLs for content, and stores the result in a postgres database

Requirements

  • go programming language
  • redis
  • postgresql
  • git
  • mercurial

Libraries used:

  • go get github.com/vmihailenco/redis
  • go get github.com/bmizerany/pq
  • go get github.com/saintfish/chardet
  • go get code.google.com/p/go-charset/charset
  • go get code.google.com/p/go-charset/data
