Saturday, February 23, 2008

Htm Truncator Lucene Analyzer


On my recent SpellChecker project I had a module that needed a spelling dictionary (i.e. a Lucene index) built with Lucene's NGramAnalyzers from the thousands of .htm files in our repository. For this I took advantage of Lucene's extension capabilities and wrote my own Analyzer that strips out the HTML tags, since there was no need to store them in the spelling dictionary. An Analyzer is an encapsulation of the analysis process.



Creating a Lucene Analyzer is a simple task: we just have to extend the Analyzer class, which has only one method to override, i.e.


public TokenStream tokenStream(String fieldName, Reader reader)



TokenStream comes in two flavours: Tokenizer and TokenFilter. The distinction is that a Tokenizer deals with individual characters, whereas a TokenFilter deals with words (tokens). One important aspect is that we can chain these token streams, so that the output of one object is used as the input of the next. A Tokenizer takes its input as a Reader. So what I did was create an analyzer similar to StandardAnalyzer and hand its tokenizer a new Reader object, which drops everything between the '<' and '>' characters in the .htm files. The same Reader also replaces digits and a few punctuation characters with spaces, since they are of no use in a spelling dictionary.

Following is the code of my tag-truncating analyzer (the class is named MyStandardAnalyzer in the listing):


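import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class MyStandardAnalyzer extends Analyzer {
    private Set stopWords;

    // Uses Lucene's default English stop words.
    public MyStandardAnalyzer() {
        stopWords = StopFilter.makeStopSet(StopAnalyzer.ENGLISH_STOP_WORDS);
    }

    // Adds the caller's stop words to the default English set.
    public MyStandardAnalyzer(String[] stopWords) {
        this.stopWords = StopFilter.makeStopSet(StopAnalyzer.ENGLISH_STOP_WORDS);
        this.stopWords.addAll(StopFilter.makeStopSet(stopWords));
    }

    // Same chain as StandardAnalyzer, but fed with the cleaned-up Reader.
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new StopFilter(new LowerCaseFilter(new StandardFilter(
                new StandardTokenizer(getNewReader(reader)))), stopWords);
    }

    // Reads the whole input, drops everything between '<' and '>', and
    // replaces digits and a few punctuation characters with spaces.
    private Reader getNewReader(Reader reader) {
        StringBuilder buffer = new StringBuilder();
        try {
            boolean outsideTag = true;
            int character;
            while ((character = reader.read()) != -1) {
                char c = (char) character;
                if (c == '<') {
                    outsideTag = false;
                }
                if (outsideTag) {
                    if ((c >= '0' && c <= '9') || ".&,-_@".indexOf(c) >= 0) {
                        buffer.append(' ');
                    } else {
                        buffer.append(c);
                    }
                } else if (c == '>') {
                    // End of tag: emit a space so neighbouring words stay separate.
                    outsideTag = true;
                    buffer.append(' ');
                }
            }
        } catch (IOException e) {
            // On a read error, whatever text was read so far is still indexed.
            e.printStackTrace();
        }
        return new StringReader(buffer.toString());
    }
}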
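
The analyzer above reads each file fully into memory before tokenizing it. For large files, the same cleanup could probably be done on the fly. The following is only a sketch of that idea, not part of my project: TagStrippingReader and readAll are hypothetical names, and the class uses nothing but java.io. It applies the same rules as getNewReader (skip everything between '<' and '>', turn digits and a few punctuation characters into spaces) one character at a time.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical streaming variant of getNewReader(): cleans text while it is read.
class TagStrippingReader extends FilterReader {
    private boolean insideTag = false;

    TagStrippingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        while (true) {
            int ch = in.read();
            if (ch == -1) {
                return -1;                 // end of input
            }
            char c = (char) ch;
            if (insideTag) {
                if (c == '>') {            // tag closed: emit one separating space
                    insideTag = false;
                    return ' ';
                }
                continue;                  // still inside the tag: skip the character
            }
            if (c == '<') {                // tag opened: start skipping
                insideTag = true;
                continue;
            }
            return isNoise(c) ? ' ' : c;
        }
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = 0;
        while (n < len) {
            int ch = read();
            if (ch == -1) {
                break;
            }
            buf[off + n++] = (char) ch;
        }
        return n == 0 ? -1 : n;
    }

    @Override
    public boolean markSupported() {
        return false;                      // mark/reset would bypass the tag state
    }

    // Digits and the punctuation characters the original code blanks out.
    private static boolean isNoise(char c) {
        return (c >= '0' && c <= '9') || ".&,-_@".indexOf(c) >= 0;
    }

    // Helper for trying the reader out: drains it into a String.
    static String readAll(Reader r) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        try {
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

Assuming it behaves like the buffered version, such a reader could be passed straight to the tokenizer in tokenStream, i.e. new StandardTokenizer(new TagStrippingReader(reader)), in place of getNewReader(reader). Note that skip() is not overridden here, so it would bypass the filtering.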

Solution for Hibernate's org.hibernate.hql.ast.QuerySyntaxException

It took me more than 2 hours to solve this exception. I am sharing it with you guys so that newcomers won't have to face it.

Suppose I have a persistent object named 'User' mapped to the database table 'USER_DETAILS' in my User.hbm.xml file. When I tried to fetch the object through an HQL query, i.e.
User user = (User) session.createQuery("from User_Details where User_Id=1").uniqueResult();

I got "org.hibernate.hql.ast.QuerySyntaxException: User_Details is not mapped [from User_Details where USER_ID=1]".

Solution: I found that in HQL queries we have to supply the persistent object's class name, not the DB table name, to retrieve the object, i.e.
User user = (User) session.createQuery("from User where User_Id=1").uniqueResult();

Regards

Mohit Agrawal