Everything will treat the following file types as plain text files: htm
Does this mean that when I search for "to be or not to be", Everything will also search all the HTML tags that may sit in the middle of the phrase, like
If this is the case, then when someone searches for "type javascript", almost every htm file on the face of the earth will be in the results.
I think HTML-to-text parsing can be easy, even more so with the help of external preprocessors like HTMLAsText from nirsoft.net.
Because of varying needs and industry conflicts, I think support for external preprocessors like pdftotext, and for calling COM classes (e.g. Word, Excel, or any format that shows up in the future), will be a must.
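To make the HTML-to-text idea concrete, here is a minimal sketch of such a preprocessor using Python's standard html.parser module. It is not how Everything or HTMLAsText works; the names TextExtractor and html_to_text are hypothetical, and skipping script and style bodies is an assumption about what a searcher would want.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and drops tags (hypothetical helper, not part of Everything)."""

    SKIPPED = {"script", "style"}  # assumption: their bodies are not wanted in search text

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so phrases can be matched with single spaces.
    return " ".join("".join(parser.parts).split())


print(html_to_text("to <b>be</b> or <i>not</i> to <b>be</b>"))  # to be or not to be
```

With text extracted this way, a phrase search for "to be or not to be" would match, and "type javascript" inside a script tag would no longer appear in the searchable text.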
to <b>be</b> or <i>not</i> to <b>be</b>
is treated as:
to <b>be</b> or <i>not</i> to <b>be</b>
content:"to <b>be</b> or <i>not</i> to <b>be</b>"
would match:
to <b>be</b> or <i>not</i> to <b>be</b>
content:"to be or not to be"
would *not* match:
to <b>be</b> or <i>not</i> to <b>be</b>
as the spaces are treated as literal.
content:<to be or not to be>
would match:
to <b>be</b> or <i>not</i> to <b>be</b>
as the search expression is expanded to: to AND be AND or AND not AND to AND be
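The difference between the quoted form and the < > form can be modelled with a short Python sketch. This only imitates the behaviour described above and is not Everything's implementation; the function names are hypothetical, and case-insensitive matching is an assumption.

```python
def matches_phrase(text: str, phrase: str) -> bool:
    """Model of content:"...": the phrase must appear verbatim, spaces included."""
    return phrase.lower() in text.lower()


def matches_all_terms(text: str, expression: str) -> bool:
    """Model of content:<...>: split on whitespace and require every term (AND)."""
    haystack = text.lower()
    return all(term in haystack for term in expression.lower().split())


html = "to <b>be</b> or <i>not</i> to <b>be</b>"

print(matches_phrase(html, "to <b>be</b> or <i>not</i> to <b>be</b>"))  # True
print(matches_phrase(html, "to be or not to be"))                       # False
print(matches_all_terms(html, "to be or not to be"))                    # True
```

The second call fails because "to be" never occurs as a contiguous string in the raw HTML, while the third succeeds because each individual word does occur somewhere in the file.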
content:<type javascript>
would match most htm files.
You could do something like:
content:javascript !regex:content:"<.*javascript.*>"
to ignore javascript inside < and >.
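As a rough illustration of what that combined filter does, here is a Python sketch that applies the same two conditions to a string. It uses Python's re module rather than Everything's regex engine, so details may differ; matches_filter is a hypothetical name, and case-insensitive matching is an assumption.

```python
import re

# Same pattern as in the search above: "javascript" anywhere between a '<' and a later '>'.
TAG_WITH_JS = re.compile(r"<.*javascript.*>", re.IGNORECASE)


def matches_filter(text: str) -> bool:
    """True when the text contains 'javascript' and the exclusion regex does not match."""
    has_term = "javascript" in text.lower()
    inside_tag = TAG_WITH_JS.search(text) is not None
    return has_term and not inside_tag


print(matches_filter('<script type="text/javascript">'))  # False
print(matches_filter("I write javascript for a living"))  # True
print(matches_filter("<p>javascript</p>"))                # False
```

Because .* is greedy and spans across tags on the same line, the third case is excluded even though the word sits between two tags rather than inside one, so the filter is coarse.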
content:"<script type=&quot:text/javascript&quot:"
would match:
<script type="text/javascript"