On a recent SpellChecker project I had a module that built a spelling dictionary (a Lucene index based on Lucene's NGramAnalyzers) by reading thousands of .htm files from our repository. The HTML tags were of no use in a spelling dictionary, so I took advantage of Lucene's extension capabilities and wrote my own Analyzer to strip them out. An Analyzer is an encapsulation of the analysis process.
Creating a Lucene Analyzer is a simple and easy task, as we just have to extend the Analyzer class, which has only one method to override:
public TokenStream tokenStream(String fieldName, Reader reader)
A TokenStream comes in two flavours: Tokenizer and TokenFilter. The distinction is that a Tokenizer deals with individual characters, whereas a TokenFilter deals with words. One important aspect is that these token streams can be chained, so that the output of one object becomes the input of the next. A Tokenizer takes its input in the form of a Reader. So what I did was create an analyzer similar to StandardAnalyzer and supply it a new Reader object, which truncates any words between the '<' and '>' characters in the .htm files.
Following is the code of my MyStandardAnalyzer class:
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class MyStandardAnalyzer extends Analyzer {

    private final Set stopWords;

    public MyStandardAnalyzer() {
        stopWords = StopFilter.makeStopSet(StopAnalyzer.ENGLISH_STOP_WORDS);
    }

    public MyStandardAnalyzer(String[] stopWords) {
        this.stopWords = StopFilter.makeStopSet(StopAnalyzer.ENGLISH_STOP_WORDS);
        this.stopWords.addAll(StopFilter.makeStopSet(stopWords));
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Chain: tokenizer -> standard filter -> lowercase filter -> stop filter.
        // The tokenizer reads from the tag-stripping reader instead of the raw input.
        return new StopFilter(
                new LowerCaseFilter(
                        new StandardFilter(
                                new StandardTokenizer(getNewReader(reader)))),
                stopWords);
    }

    /**
     * Reads the whole input, drops everything between '<' and '>' (the HTML
     * tags), and replaces digits and punctuation that are useless in a
     * spelling dictionary with spaces, then returns the cleaned text as a
     * new Reader.
     */
    private Reader getNewReader(Reader reader) {
        StringBuilder buffer = new StringBuilder();
        try {
            boolean outsideTag = true;
            int character;
            while ((character = reader.read()) != -1) {
                char c = (char) character;
                if (c == '<') {
                    outsideTag = false;         // entering a tag: stop copying
                } else if (!outsideTag) {
                    if (c == '>') {
                        outsideTag = true;      // tag closed: resume copying
                        buffer.append(' ');
                    }
                } else if (c == '.' || Character.isDigit(c)
                        || c == '&' || c == ',' || c == '-'
                        || c == '_' || c == '@') {
                    buffer.append(' ');         // not useful for spelling terms
                } else {
                    buffer.append(c);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return new StringReader(buffer.toString());
    }
}
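To see what the custom Reader actually feeds into the tokenizer, here is a minimal, Lucene-free sketch of the same stripping logic. The class and method names (TagStripDemo, strip) are my own illustrative choices; the filtering rules mirror those in getNewReader above.

```java
public class TagStripDemo {

    // Same rules as getNewReader(): drop everything between '<' and '>',
    // replace '.', digits, and '&', ',', '-', '_', '@' with spaces,
    // and copy every other character through unchanged.
    static String strip(String input) {
        StringBuilder out = new StringBuilder();
        boolean outsideTag = true;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '<') {
                outsideTag = false;          // entering a tag: stop copying
            } else if (!outsideTag) {
                if (c == '>') {
                    outsideTag = true;       // tag closed: resume copying
                    out.append(' ');
                }
            } else if (c == '.' || Character.isDigit(c)
                    || c == '&' || c == ',' || c == '-'
                    || c == '_' || c == '@') {
                out.append(' ');             // noise for a spelling dictionary
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Tags and punctuation disappear; only the words survive.
        System.out.println(strip("<p>Hello, world.</p>"));
    }
}
```

Running this on `<p>Hello, world.</p>` leaves only the words "Hello" and "world" separated by spaces, which is exactly the kind of input the spelling dictionary wants.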