de.l3s.boilerpipe
Interface BoilerpipeExtractor

All Superinterfaces:
BoilerpipeFilter
All Known Implementing Classes:
ArticleExtractor, ArticleSentencesExtractor, DefaultExtractor, ExtractorBase, KeepEverythingExtractor, KeepEverythingWithMinKWordsExtractor, LargestContentExtractor, NumWordsRulesExtractor

public interface BoilerpipeExtractor
extends BoilerpipeFilter

Describes a complete filter pipeline that extracts text from HTML input (a String, Reader, or InputSource) or from a TextDocument.

Author:
Christian Kohlschütter

Method Summary
 java.lang.String getText(org.xml.sax.InputSource is)
          Extracts text from the HTML code available from the given InputSource.
 java.lang.String getText(java.io.Reader r)
          Extracts text from the HTML code available from the given Reader.
 java.lang.String getText(java.lang.String html)
          Extracts text from the HTML code given as a String.
 java.lang.String getText(TextDocument doc)
          Extracts text from the given TextDocument object.
 
Methods inherited from interface de.l3s.boilerpipe.BoilerpipeFilter
process
 

Method Detail

getText

java.lang.String getText(java.lang.String html)
                         throws BoilerpipeProcessingException
Extracts text from the HTML code given as a String.

Parameters:
html - The HTML code as a String.
Returns:
The extracted text.
Throws:
BoilerpipeProcessingException

getText

java.lang.String getText(org.xml.sax.InputSource is)
                         throws BoilerpipeProcessingException
Extracts text from the HTML code available from the given InputSource.

Parameters:
is - The InputSource containing the HTML code.
Returns:
The extracted text.
Throws:
BoilerpipeProcessingException

getText

java.lang.String getText(java.io.Reader r)
                         throws BoilerpipeProcessingException
Extracts text from the HTML code available from the given Reader.

Parameters:
r - The Reader containing the HTML code.
Returns:
The extracted text.
Throws:
BoilerpipeProcessingException

getText

java.lang.String getText(TextDocument doc)
                         throws BoilerpipeProcessingException
Extracts text from the given TextDocument object.

Parameters:
doc - The TextDocument.
Returns:
The extracted text.
Throws:
BoilerpipeProcessingException
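
Example (illustrative sketch only):
The three HTML-input overloads above differ only in how the markup is supplied. The following self-contained sketch shows one plausible way such overloads can relate, with the String and Reader variants funnelling into the InputSource variant. Everything in it is an assumption made for illustration: the Extractor interface, ProcessingException, and TagStrippingExtractor are hypothetical stand-ins (not boilerpipe types), the regex tag stripping is a toy and not boilerpipe's algorithm, and the TextDocument overload is left out. The page above does not say how real implementations connect the overloads.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.xml.sax.InputSource;

public class GetTextOverloadsSketch {

    /** Hypothetical stand-in for BoilerpipeProcessingException. */
    public static class ProcessingException extends Exception {
        public ProcessingException(Throwable cause) {
            super(cause);
        }
    }

    /** Hypothetical stand-in mirroring the three HTML-input overloads. */
    public interface Extractor {
        String getText(String html) throws ProcessingException;
        String getText(Reader r) throws ProcessingException;
        String getText(InputSource is) throws ProcessingException;
    }

    /** Toy implementation: strips tags with a regex. NOT boilerpipe's algorithm. */
    public static class TagStrippingExtractor implements Extractor {

        @Override
        public String getText(String html) throws ProcessingException {
            // Wrap the String in a Reader and delegate.
            return getText(new StringReader(html));
        }

        @Override
        public String getText(Reader r) throws ProcessingException {
            // Wrap the Reader in an InputSource and delegate.
            return getText(new InputSource(r));
        }

        @Override
        public String getText(InputSource is) throws ProcessingException {
            try {
                // This sketch only handles character-stream sources.
                Reader r = is.getCharacterStream();
                StringBuilder sb = new StringBuilder();
                char[] buf = new char[1024];
                int n;
                while ((n = r.read(buf)) != -1) {
                    sb.append(buf, 0, n);
                }
                // Replace tags with spaces, then collapse whitespace.
                return sb.toString()
                        .replaceAll("<[^>]*>", " ")
                        .replaceAll("\\s+", " ")
                        .trim();
            } catch (IOException e) {
                // Wrap low-level failures in the checked exception.
                throw new ProcessingException(e);
            }
        }
    }

    public static void main(String[] args) throws ProcessingException {
        Extractor extractor = new TagStrippingExtractor();
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";

        // The same markup supplied three ways yields the same text.
        System.out.println(extractor.getText(html));
        System.out.println(extractor.getText(new StringReader(html)));
        System.out.println(extractor.getText(new InputSource(new StringReader(html))));
    }
}
```

With the library on the classpath, the same call shapes apply to any of the implementing classes listed at the top of this page, with BoilerpipeProcessingException as the checked exception to catch or declare.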