oudommeas/swan

swan

An implementation of the Goose HTML Content / Article Extractor algorithm in Go.

Swan extracts cleaned-up text and HTML content from almost any webpage by removing the extra junk that surrounds the main content on so many pages.

See the Go documentation page for full usage and examples.


Features

  • Main content extraction from almost any source
  • Extract HTML content with images
  • Get article metadata, publish dates, and a lot more
  • Recognize different content types and apply specialized extraction (currently only comic sites and normal sites are recognized)

Planned

  • Inline videos into HTML content when found in an article
  • Recognize news sources and extract corresponding video / audio content
  • Recognize and extract more types of content
