dominiek/content_focus

Content Focus

ContentFocus is a small Ruby gem that takes raw HTML as input and extracts the most relevant piece of content. This is useful, for example, when doing semantic analysis on HTML pages.

Right now, ContentFocus only supports ‘permanent content extraction’. Permanent content is the part of a page that does not change over time, for example:

  • About section
  • Author information
  • Article body
  • Generic information block

The algorithm combines several heuristics to find this content and tries to ignore irrelevant parts of the page (navigation, styling, etc.).
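The README does not describe the heuristics themselves, so the following is only a minimal sketch of one common approach, not the gem's actual implementation. It splits the page on block-level tags and prefers the block with the most plain text, penalising text that sits inside links (as navigation usually does). The names `most_relevant_text`, `strip_tags`, `link_text_length` and `BLOCK_SPLIT` are hypothetical and are not part of ContentFocus.

```ruby
# Hypothetical sketch, NOT the ContentFocus algorithm: pick the block of
# a page with the most plain text, penalising text inside links.

# Block-level tags used as split points (an assumed, incomplete list).
BLOCK_SPLIT = %r{</?(?:div|p|td|ul|ol|table|section|article|nav|header|footer)\b[^>]*>}i

# Replace tags with spaces and collapse whitespace.
def strip_tags(fragment)
  fragment.gsub(/<[^>]+>/, ' ').gsub(/\s+/, ' ').strip
end

# Total length of the visible text found inside <a>...</a> elements.
def link_text_length(fragment)
  fragment.scan(%r{<a\b[^>]*>(.*?)</a>}im).flatten.sum { |t| strip_tags(t).length }
end

def most_relevant_text(html)
  # Script and style bodies are never content, so drop them first.
  cleaned = html.gsub(%r{<(script|style)\b[^>]*>.*?</\1>}im, ' ')
  blocks = cleaned.split(BLOCK_SPLIT)

  # Score = text length minus twice the link text length.
  best = blocks.max_by do |block|
    strip_tags(block).length - 2 * link_text_length(block)
  end
  strip_tags(best.to_s)
end
```

A regex-based splitter like this is fragile on real-world markup; it is meant only to illustrate the idea of scoring blocks by text density, which a real implementation would do on a parsed HTML tree.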

Example


  require 'rubygems'
  require 'content_focus'
  
  # html_data is a String holding the raw HTML of the page to analyse
  content_focus = ContentFocus::HTML.new(html_data)
  
  # Returns the most relevant content as plain text
  static_text = content_focus.static_text
  
  # Returns the most relevant block of content as an Hpricot HTML tree element
  static_fragment = content_focus.static_fragment

Author

Dominiek ter Heide
http://dominiek.com/
(Note: I wrote this a while back and thought it could be useful to some developers.)
